Analyzing Hot Code Paths Using Flame Graphs
- Application: SPECjbb2015® Benchmark. This benchmark is relevant to anyone who is interested in Java server performance including:
- JVM vendors
- Hardware developers
- Java application developers
- Researchers and members of the academic community
- Performance Analysis Tools: Hotspots Analysis in Intel® VTune™ Profiler (version 2021.7 or newer)
- Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
- Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
- Operating System: Ubuntu* 18.04.1 LTS
- CPU: Intel® Xeon® Gold 6252 processor architecture codenamed Cascade Lake
Create a Baseline
- For the purpose of this recipe, let us first shorten the runtime of SPECjbb2015. Change these properties in the config/specjbb2015.props file:
  specjbb.input.number_customers=1
  specjbb.input.number_products=1
- In accordance with popular optimization practices and guidance, start optimizing the application with the -XX:+UseParallelOldGC and -XX:-UseAdaptiveSizePolicy JVM options.
- Make sure to tune these parameters for optimal performance of your Java application:
- Garbage Collection (GC) Algorithm: When you enable the UseParallelOldGC option, both the young and old generations are collected in parallel. Garbage collection then works more efficiently because the overall full GC pause is shorter. If throughput is your goal, specify -XX:+UseParallelOldGC.
- Heap Tuning: By default, JVMs adapt their heap based on runtime heuristics. To achieve pause, throughput, and footprint goals, the GC can resize heap generations based on GC statistics. In some cases, to increase throughput, you may want to disable this option and set the heap size manually. Use the run below as a performance baseline for further optimizations:
  java -XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC -jar specjbb2015.jar -m COMPOSITE
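The baseline preparation described above can be sketched as a small shell script. This is a sketch under assumptions, not part of the recipe: it is run from the SPECjbb2015 installation directory, `java` is on the PATH, and `set_prop` is a hypothetical helper written for this example. `-XX:+PrintFlagsFinal` is a standard HotSpot option that prints the resolved flag values, which is one way to see the default heap sizes the baseline runs with. The UseParallelOldGC option was deprecated in JDK 14 and removed in later releases, so these flags assume an older JDK.

```shell
#!/bin/sh
# Sketch: prepare and start the shortened SPECjbb2015 baseline run.

# set_prop FILE KEY VALUE (hypothetical helper, not part of SPECjbb2015)
# Rewrites an existing "KEY=..." line, commented out or not; appends it if absent.
set_prop() {
  file=$1; key=$2; value=$3
  # Escape dots so the key is matched literally in the regular expressions.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  if grep -qE "^#?[[:space:]]*${pattern}[[:space:]]*=" "$file"; then
    sed -i.bak -E "s|^#?[[:space:]]*${pattern}[[:space:]]*=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

PROPS=config/specjbb2015.props
JVM_OPTS="-XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC"

# Shorten the benchmark run by shrinking the input data set.
if [ -f "$PROPS" ]; then
  set_prop "$PROPS" specjbb.input.number_customers 1
  set_prop "$PROPS" specjbb.input.number_products 1
fi

if command -v java >/dev/null 2>&1 && [ -f specjbb2015.jar ]; then
  # Show the GC flags and heap sizes the JVM resolves for these options.
  # $JVM_OPTS is intentionally unquoted so that it splits into separate arguments.
  java $JVM_OPTS -XX:+PrintFlagsFinal -version 2>/dev/null |
    grep -E 'UseParallelOldGC|UseAdaptiveSizePolicy|InitialHeapSize|MaxHeapSize'
  # Baseline run from the recipe.
  java $JVM_OPTS -jar specjbb2015.jar -m COMPOSITE
fi
```

The helper keeps a `.bak` copy of the properties file when it rewrites a line, so the original settings can be restored after the experiment.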
Run Hotspots Analysis
- Run VTune Profiler (version 2021.7 or newer).
- In the Welcome screen, click Configure Analysis.
- In the WHERE pane, select Local Host.
- In the WHAT pane, enter these values:
- Application: java
- Application parameters: -XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC -jar specjbb2015.jar -m COMPOSITE
- In the HOW pane, open the Analysis Tree and select Hotspots analysis in the Algorithm group.
- Select Hardware Event-Based Sampling mode and check the Collect stacks option.
- Click the Start button to run the analysis.
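The same collection can also be started from the command line. The sketch below shows a likely counterpart of the GUI settings above: the `hotspots` analysis type with the `sampling-mode=hw` and `enable-stack-collection=true` knobs. The result directory name is a hypothetical choice for this example, and knob names can differ between VTune Profiler versions, so check them with `vtune -help collect hotspots` before relying on this.

```shell
#!/bin/sh
# Sketch: command-line counterpart of the GUI configuration above.
RESULT_DIR=r_baseline_hs   # hypothetical result directory name
JVM_OPTS="-XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC"

# Hotspots analysis, hardware event-based sampling, with call stacks collected.
VTUNE_CMD="vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -result-dir $RESULT_DIR -- java $JVM_OPTS -jar specjbb2015.jar -m COMPOSITE"

# Print the command so it can be reviewed before it runs.
echo "$VTUNE_CMD"

# Run it only when VTune Profiler and the benchmark are available.
if command -v vtune >/dev/null 2>&1 && [ -f specjbb2015.jar ]; then
  # Unquoted on purpose: the string splits into the command and its arguments.
  $VTUNE_CMD
fi
```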
Analyze Hotspots Information
Identify Hot Code Paths in the Flame Graph
Frames in the flame graph belong to one of these function types:
- A function from the application module of the user
- A function from the System or Kernel module
- A synchronization function from the Threading Library (like OpenMP Barrier)
- An overhead function from the Threading Library (like OpenMP Fork or OpenMP Dispatcher)
- Start optimizing from the bottommost function and work your way up. Focus on hot functions that are wide on the flame graph.
- In this example, the flame graph displays stacks and frames that are only from the JVM. Therefore, almost all of the CPU time was spent in the JVM.
- Consequently, the CPU time spent in the application itself was very low. Application stacks or frames are not even visible in the flame graph.
- The hottest code path is clone --> start_thread --> thread_native_entry --> GCTaskThread::run --> StealMarkingTask::do_it --> and so on.
- Pay attention to the GCTaskThread::run function/frame, which runs Java Garbage Collector tasks.
- When you hover over the GCTaskThread::run function/frame, the details at the bottom show that 93.3% of CPU Time was spent in the function and its callees.
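To make the "function and its callees" figure concrete, the sketch below computes the same kind of inclusive percentage from stack samples. It assumes the samples are available in the common folded-stack text format (`frame1;frame2;... count`, frame names without spaces), which is not necessarily what VTune Profiler exports. The function name `inclusive_share` is hypothetical, and the sample counts and the `OtherGCTask::do_it` and `JavaThread::run` frames are made-up illustration data, not measurements from this recipe.

```shell
#!/bin/sh
# inclusive_share FRAME
# Reads folded stacks on stdin and prints the percentage of all samples
# whose stack contains FRAME, i.e. time in FRAME and its callees.
inclusive_share() {
  awk -v frame="$1" '
    {
      samples = $2
      total += samples
      n = split($1, frames, ";")
      for (i = 1; i <= n; i++) {
        # Count each stack once, even if FRAME recurses.
        if (frames[i] == frame) { hit += samples; break }
      }
    }
    END { if (total > 0) printf "%.1f\n", 100 * hit / total }
  '
}

# Illustration data only: 90 of 100 samples pass through GCTaskThread::run.
inclusive_share 'GCTaskThread::run' <<'EOF'
clone;start_thread;thread_native_entry;GCTaskThread::run;StealMarkingTask::do_it 70
clone;start_thread;thread_native_entry;GCTaskThread::run;OtherGCTask::do_it 20
clone;start_thread;thread_native_entry;JavaThread::run 10
EOF
# prints 90.0
```

A wide frame in the flame graph corresponds to a high value here: the width of a frame is proportional to the samples collected in that frame plus everything it called.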
Change JVM Options
- Click the Configure Analysis button in the Welcome screen of VTune Profiler.
- In the WHERE pane, select Local Host.
- In the WHAT pane, set Application to java.
- Change the application parameters to: -Xms2g -Xmx4g -XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC -jar specjbb2015.jar -m COMPOSITE
- Click Start to run the analysis.
- We can observe a nearly 7x reduction in Elapsed Time, from ~375 s to ~54 s.
- The Top Hotspots section also displays a new list of functions (including the GenerateReceipts task) with shorter CPU times.
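The rerun and the Elapsed Time comparison can be scripted along these lines. This is a sketch under assumptions: the result directory name and the `speedup` helper are hypothetical, and the knob names should be checked against your VTune Profiler version. `-Xms2g` sets the initial heap size to 2 GB and `-Xmx4g` caps the heap at 4 GB, so the JVM no longer starts from its default heap size.

```shell
#!/bin/sh
# Sketch: repeat the Hotspots collection with an explicit heap size,
# then print the summary report and the speedup over the baseline.
RESULT_DIR=r_heap_hs   # hypothetical result directory name
JVM_OPTS="-Xms2g -Xmx4g -XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC"

# speedup BASELINE_SECONDS NEW_SECONDS (hypothetical helper)
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

if command -v vtune >/dev/null 2>&1 && [ -f specjbb2015.jar ]; then
  vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true \
        -result-dir "$RESULT_DIR" -- java $JVM_OPTS -jar specjbb2015.jar -m COMPOSITE
  # The summary report includes the Elapsed Time of the run.
  vtune -report summary -result-dir "$RESULT_DIR"
fi

# Elapsed times quoted in this recipe: ~375 s before, ~54 s after.
speedup 375 54
# prints 6.9
```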
- You may want to focus on new hot code paths that proceed in this direction:
- JVM Compile::Compile --> ...
- JVM Interpreter --> org::spec::jbb::sm::ReceiptBuilder
- org::spec::jbb::sm::ReceiptBuilder --> ...
- Review your JVM options to identify more opportunities for optimization.
- If you want to optimize the JVM next, a good starting point is to focus on the Microarchitecture Usage metric and follow recommendations in the Insights section of the Summary window:
- Apply Threading to increase parallelism in your application.
- Run the Microarchitecture Exploration analysis to examine how efficiently the application runs on the hardware used.
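Both follow-up analyses can also be launched from the command line. The sketch below assumes the `threading` and `uarch-exploration` analysis type names of the `vtune` command-line tool. The `vtune_cmd` helper and the result directory naming are hypothetical, chosen for this example only.

```shell
#!/bin/sh
# Sketch: build the command lines for the two follow-up analyses.
APP_CMD="java -Xms2g -Xmx4g -XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC -jar specjbb2015.jar -m COMPOSITE"

# vtune_cmd ANALYSIS (hypothetical helper)
# Prints the vtune command line for one analysis type.
vtune_cmd() {
  printf 'vtune -collect %s -result-dir r_%s -- %s\n' "$1" "$1" "$APP_CMD"
}

for analysis in threading uarch-exploration; do
  cmd=$(vtune_cmd "$analysis")
  echo "$cmd"
  # Run only when VTune Profiler and the benchmark are available.
  if command -v vtune >/dev/null 2>&1 && [ -f specjbb2015.jar ]; then
    $cmd
  fi
done
```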