HPC Performance Characterization View
1. Define a Performance Baseline
2. Determine Optimization Opportunities
- All client platforms.
- Server platforms based on Intel® microarchitecture code name Skylake, with up to four sockets.
- Explore the Effective Physical Core Utilization metric as a measure of the parallel efficiency of the application. A value of 100% means that the application code execution uses all available physical cores. If the value is less than 100%, it is worth looking at the second-level metrics to discover reasons for parallel inefficiency.
- Learn about opportunities to use the logical cores. In some cases, using logical cores leads to application concurrency increases and overall performance improvements.
- For some Intel® processors, such as Intel® Xeon Phi™ or Intel Atom®, or systems where Intel Hyper-Threading Technology (Intel HT Technology) is OFF or absent, the metric breakdown between physical and logical core utilization is not available. In these cases, a single Effective CPU Utilization metric is displayed to show parallel execution efficiency.
- For applications that do not use OpenMP or MPI runtime libraries:
- Review the Effective CPU Utilization Histogram, which displays the Elapsed Time of your application, broken down by CPU utilization levels.
- Use the data in the Bottom-up and Top-down Tree windows to identify the most time-consuming functions in your application by CPU utilization. Focus on the functions with the largest CPU time and a low CPU utilization level as your candidates for optimization (for example, parallelization).
- For applications with Intel OpenMP*:
- Compare the serial time to the parallel region time. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible. Look at serial hotspots to define candidates for further parallelization.
- Review the OpenMP Potential Gain metric to estimate the efficiency of OpenMP parallelization in the parallel part of the code. The Potential Gain metric estimates the difference in elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving OpenMP parallelism. If the Potential Gain for a region is significant, you can go deeper: select the link on a region name to navigate to the Bottom-up window with the OpenMP Region dominant grouping and the region of interest selected.
- For MPI applications: Review the MPI Imbalance metric, which shows the CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. Issue detection for this metric is based on the minimal MPI Busy Wait time across ranks. If the minimal MPI Busy Wait time across ranks is not significant, the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
- For hybrid MPI + OpenMP applications: The MPI Rank on Critical Path sub-section shows OpenMP efficiency metrics such as Serial Time (outside of any OpenMP region), Parallel Region Time, and OpenMP Potential Gain. If the minimal MPI Busy Wait time is significant, it can be a result of a suboptimal communication schema between ranks or an imbalance triggered by another node. In this case, use Intel® Trace Analyzer and Collector for in-depth analysis of the communication schema.
- If your application makes use of a GPU, examine these metrics:
- The Time metric indicates whether the GPU was idle at any point during data collection. A value of 100% implies that your application offloaded work to the GPU throughout the duration of data collection. Anything lower presents an opportunity to improve GPU utilization.
- The IPC Rate metric indicates the average number of instructions per cycle processed by the two FPU pipelines of Intel® Integrated Graphics. To have your workload fully utilize the floating-point capability of the GPU, the IPC Rate should be close to 2.
- EU State breaks down the activity of GPU execution units. Check here to see if they were stalled or idle when processing your workload.
- Occupancy is a measure of GPU thread scheduling efficiency. A value below 100% suggests that you tune the sizes of the work items in your workload. Consider running the GPU Offload Analysis, which provides insight into computing tasks running on the GPU as well as additional GPU-related performance metrics.
- The Offload Time metric displays the total duration of the OpenMP offload regions in your workload. If Offload Time is below 100%, consider offloading more code to the GPU.
- The Compute, Data Transfer, and Overhead metrics help you understand what constitutes the Offload Time. Ideally, the Compute portion should be 100%. If the Data Transfer component is significant, try to transfer less data between the host and the GPU.
- func_name is the name of the source function where the OpenMP target directive is declared.
- device_number is the internal OpenMP device number where the offload was targeted.
- file_name and line_number constitute the source location of the OpenMP target directive.
Compiler Options to Enable
Linux*: -g -mllvm -parallel-source-info=2
Windows*: /Zi -mllvm -parallel-source-info=2
- Group by OpenMP Offload Region. In this grouping, the grid displays:
- OpenMP Offload Time metrics
- Instance Count
- The timeline view displays ruler markers that indicate the span of OpenMP Offload Regions and OpenMP Offload Operations within those regions.
- A high Memory Bound value might indicate that a significant portion of execution time was lost while fetching data. The section shows the fraction of cycles lost in stalls served by different cache hierarchy levels (L1, L2, L3) or by fetching data from DRAM. For last-level cache misses that go to DRAM, it is important to distinguish whether the stalls were caused by a memory bandwidth limit, since these can require different optimization techniques than latency-bound stalls. VTune Profiler shows a hint about identifying this issue in the DRAM Bound metric issue description. This section also offers the percentage of accesses to a remote socket compared to a local socket, to see if memory stalls can be connected with NUMA issues.
- For Intel® Xeon Phi™ processors formerly code named Knights Landing, there is no way to measure memory stalls to assess memory access efficiency in general. Therefore, Back-End Bound stalls, which include memory-related stalls, are shown instead as a high-level characterization metric. The second-level metrics focus particularly on memory access efficiency.
- A high L2 Hit Bound or L2 Miss Bound value indicates that a high ratio of cycles were spent handling L2 hits or misses.
- The L2 Miss Bound metric does not take into account data brought into the L2 cache by the hardware prefetcher. However, in some cases the hardware prefetcher can generate significant DRAM/MCDRAM traffic and saturate the bandwidth. The Demand Misses and HW Prefetcher metrics show the percentages of all L2 cache input requests that are caused by demand loads or the hardware prefetcher.
- A high DRAM Bandwidth Bound or MCDRAM Bandwidth Bound value indicates that a large percentage of the overall elapsed time was spent with high bandwidth utilization. A high DRAM Bandwidth Bound value is an opportunity to run the Memory Access analysis to identify data structures that can be allocated in high-bandwidth memory (MCDRAM), if it is available.
- The Bandwidth Utilization Histogram shows how much time the system bandwidth was utilized at a certain value (per Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low. The thresholds are calculated based on benchmarks that measure the maximum achievable bandwidth. You can also set the thresholds by moving the sliders at the bottom of the histogram. The modified values are applied to all subsequent results in the project.
- Switch to the Bottom-up window and review the Memory Bound columns in the grid to determine optimization opportunities.
- The Vectorization metric represents the percentage of packed (vectorized) floating-point operations: 0% means that the code is fully scalar, while 100% means the code is fully vectorized. The metric does not take into account the actual vector length used by the code for vector instructions. As a result, if the code is fully vectorized but uses a legacy instruction set that loads only half a vector length, the Vectorization metric still shows 100%. Low vectorization means that a significant fraction of floating-point operations are not vectorized. Use Intel® Advisor to understand possible reasons why the code was not vectorized. The second-level metrics let you roughly estimate the amount of floating-point work at each precision and see the actual vector length of vector instructions at each precision. A partial vector length can indicate legacy instruction set usage and show an opportunity to recompile the code with a modern instruction set, which can lead to additional performance improvement. Relevant metrics might include:
- Instruction Mix
- FP Arithmetic Instructions per Memory Read or Write
- The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain floating-point operations, sorted by CPU time, and allows for a quick estimate of the fraction of vectorized code, the vector instruction set used in the loop/function, and the loop type.
- For Intel® Xeon Phi™ processors (formerly code named Knights Landing), the following FPU metrics are available instead of FLOP counters:
- SIMD Instructions per Cycle
- Fraction of packed versus scalar SIMD instructions
- Vector instructions for loops set based on static analysis
- The Outgoing and Incoming Bandwidth Bound metrics show the percentage of elapsed time that the application spent in communication close to or at the interconnect bandwidth limit.
- The Bandwidth Utilization Histogram shows how much time the interconnect bandwidth was utilized at a certain value (per Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low.
- The Outgoing and Incoming Packet Rate metrics show the percentage of elapsed time that the application spent in communication close to or at the interconnect packet rate limit.
- The Packet Rate Histogram shows how much time the interconnect packet rate reached a certain value and provides thresholds to categorize the packet rate as High, Medium, and Low.
3. Analyze Source
4. Analyze Process/Thread Affinity
vtune -report affinity -format=html -r <result_dir>
5. Explore Other Analysis Types
- Use Intel® Advisor to analyze the application for vectorization optimization opportunities.