Window: Summary - HPC Performance Characterization
- OpenMP Analysis Collection Time: Displays metrics for the duration of serial (outside of any parallel region) and parallel portions of the program. If the Serial time is significant, review the Top Serial Hotspots section and consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible.
- Top OpenMP Regions by Potential Gain: Displays the efficiency of Intel OpenMP* parallelization in the parallel part of the code and checks for an MPI imbalance. The Potential Gain metric estimates the difference in elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving parallel execution. If Potential Gain for a region is significant, select the link on the region name to navigate to the Bottom-up window, which opens with an OpenMP Region dominant grouping and the region of interest selected.
- Effective CPU Utilization Histogram: Graphical representation of the percentage of wall time during which the application was running on a specific number of CPUs simultaneously. The CPU usage does not include spin and overhead time, which does not perform actual work. Hover over a vertical bar to identify the amount of Elapsed Time the application spent using the specified number of logical CPU cores. Use the Average Physical Core Utilization and Average Logical Core Utilization numbers as a baseline for your performance measurements. The CPU usage at any point cannot surpass the available number of logical CPU cores.
- A high L2 Hit Bound or L2 Miss Bound value indicates that a large share of cycles was spent handling L2 hits or misses.
- The L2 Miss Bound metric does not take into account data brought into the L2 cache by the hardware prefetcher. However, in some cases the hardware prefetcher can generate significant DRAM/MCDRAM traffic and saturate the bandwidth. The Demand Misses and HW Prefetcher metrics show the percentages of all L2 cache input requests that are caused by demand loads or the hardware prefetcher.
- A high DRAM Bandwidth Bound or MCDRAM Bandwidth Bound value indicates that a large percentage of the overall elapsed time was spent with high bandwidth utilization. A high DRAM Bandwidth Bound value is an opportunity to run the Memory Access analysis to identify data structures that can be allocated in high bandwidth memory (MCDRAM), if it is available.
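The serial-time and Potential Gain descriptions above can be illustrated with a small back-of-the-envelope calculation. This is a minimal sketch, not the way the profiler computes its metrics: `amdahl_speedup` applies Amdahl's Law to a measured serial fraction, and `potential_gain` assumes a simplified model in which the idealized region time equals the total useful work divided evenly among threads, with zero runtime overhead. The function names and sample numbers are hypothetical.

```python
from typing import Sequence


def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup from Amdahl's Law: 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


def potential_gain(region_elapsed: float, thread_work: Sequence[float]) -> float:
    """Assumed model: elapsed time minus the perfectly balanced, zero-overhead time.

    thread_work holds the useful work time of each thread inside one
    parallel region; the ideal time spreads that work evenly across threads.
    """
    if not thread_work:
        raise ValueError("thread_work must not be empty")
    ideal_time = sum(thread_work) / len(thread_work)
    return max(0.0, region_elapsed - ideal_time)


if __name__ == "__main__":
    # Even a 5% serial portion caps the achievable speedup well below 64x.
    print(f"Speedup bound on 64 threads: {amdahl_speedup(0.05, 64):.2f}x")

    # One overloaded thread keeps the region running while the others wait.
    gain = potential_gain(region_elapsed=10.0, thread_work=[10.0, 6.0, 4.0, 4.0])
    print(f"Estimated potential gain: {gain:.1f} s")
```

Under these assumptions, a region that takes 10 seconds because one thread carries most of the work could finish in about 6 seconds if the work were balanced, which is the kind of saving the Potential Gain metric is meant to surface.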
Intel® Omni-Path Fabric Usage
- Outgoing and Incoming Bandwidth Bound metrics show the percentage of elapsed time that an application spent in communication close to or reaching the interconnect bandwidth limit.
- Bandwidth Utilization Histogram shows how much time the interconnect bandwidth was utilized at a certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low.
- Outgoing and Incoming Packet Rate metrics show the percentage of elapsed time that an application spent in communication close to or reaching the interconnect packet rate limit.
- Packet Rate Histogram shows how much time the interconnect packet rate stayed at a certain value and provides thresholds to categorize packet rate as High, Medium, and Low.
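To make the histogram categories above more concrete, the following minimal sketch takes evenly spaced bandwidth samples and reports the share of elapsed time spent in each utilization category. The threshold fractions (0.4 and 0.8 of the peak), the peak value, and the function name are assumptions chosen for illustration; the actual thresholds are set by the tool and can differ.

```python
from collections import Counter
from typing import Dict, Sequence


def utilization_shares(
    samples: Sequence[float],
    peak: float,
    medium_from: float = 0.4,  # assumed lower edge of the Medium band
    high_from: float = 0.8,    # assumed lower edge of the High band
) -> Dict[str, float]:
    """Return the percentage of samples in the Low, Medium, and High bands.

    Samples are assumed to be taken at a fixed interval, so the share of
    samples in a band approximates the share of elapsed time in that band.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if peak <= 0:
        raise ValueError("peak must be positive")

    counts: Counter = Counter()
    for value in samples:
        ratio = value / peak
        if ratio >= high_from:
            counts["High"] += 1
        elif ratio >= medium_from:
            counts["Medium"] += 1
        else:
            counts["Low"] += 1

    total = len(samples)
    return {label: 100.0 * counts[label] / total for label in ("Low", "Medium", "High")}


if __name__ == "__main__":
    # Hypothetical outgoing bandwidth samples in GB/s against a 10 GB/s peak.
    shares = utilization_shares([1, 2, 5, 6, 9, 10, 10, 3], peak=10.0)
    for label, percent in shares.items():
        print(f"{label}: {percent:.1f}% of elapsed time")
```

In this model the High share plays the role of a Bandwidth Bound value: the larger it is, the more of the run was spent near the interconnect limit. The same calculation applies to packet rate samples measured against a packet rate limit.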
Collection and Platform Info
Application Command Line
Path to the target application.
Operating system used for the collection.
Name of the computer used for the collection.
Size of the result collected by the
Collection start time
Collection stop time
Stop time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Name of the processor used for the collection.
Frequency of the processor used for the collection.
Logical CPU Count
Logical CPU core count for the machine used for the collection.
Physical Core Count
Number of physical cores on the system.
User launching the data collection. This field is available if you enabled the per-user event-based sampling collection mode during the product installation.