Application Performance Snapshot User Guide for Linux* OS

ID 772048
Date 12/16/2022
Public

Metrics Reference

This section provides a complete list of metrics supported by Application Performance Snapshot with their descriptions. If data for a metric is available in the statistics files, it is displayed in the analysis summary on the command line and in the HTML report. Note that some metrics are platform-specific, and some are available only if the application uses MPI or OpenMP*.

Elapsed Time

Execution time of the specified application, in seconds.

SP GFLOPS

Number of single precision giga-floating point operations calculated per second. SP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.

DP GFLOPS

Number of double precision giga-floating point operations calculated per second. DP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.
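
As an illustration of how a GFLOPS value relates to raw counts, the following minimal sketch derives SP and DP GFLOPS from floating-point operation counts and elapsed time. The counter values are hypothetical, not APS output:

  # Minimal sketch: deriving GFLOPS from hypothetical operation counts.
  sp_flop_count = 1.2e12   # single precision floating point operations (assumed)
  dp_flop_count = 0.4e12   # double precision floating point operations (assumed)
  elapsed_seconds = 50.0   # application elapsed time in seconds (assumed)

  sp_gflops = sp_flop_count / 1e9 / elapsed_seconds
  dp_gflops = dp_flop_count / 1e9 / elapsed_seconds
  print(f"SP GFLOPS: {sp_gflops:.1f}, DP GFLOPS: {dp_gflops:.1f}")  # 24.0 and 8.0 here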

CPI (Cycles per Instruction Retired) Rate

The amount of time each executed instruction took, measured in cycles. A CPI of 1 is considered acceptable for high performance computing (HPC) applications, but different application domains have different expected values. The CPI value tends to be greater when there are long-latency memory, floating-point, or SIMD operations, non-retired instructions due to branch mispredictions, or instruction starvation at the front end.
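
A worked example of the ratio behind this metric, assuming cycle and retired-instruction counts are available (the numbers below are invented for illustration):

  # Minimal sketch: CPI from hypothetical counter values.
  clockticks = 8.0e11            # unhalted CPU cycles (assumed)
  instructions_retired = 6.4e11  # retired instructions (assumed)

  cpi = clockticks / instructions_retired
  print(f"CPI rate: {cpi:.2f}")  # 1.25 here; a value near 1 is typical for HPC codes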

CPU Utilization

This metric helps evaluate the parallel efficiency of your application. It estimates the utilization of all the logical CPU cores in the system by your application. 100% utilization means that your application keeps all the logical CPU cores busy for the entire time that it runs. Note that the metric does not distinguish between useful application work and the time that is spent in parallel runtimes.
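
One way to think about this metric: it is roughly the average number of busy logical cores divided by the total number of logical cores. A minimal sketch with assumed numbers, not the tool's actual computation:

  # Minimal sketch: CPU utilization as average busy cores over total logical cores.
  logical_cores = 64       # logical CPUs in the system (assumed)
  total_cpu_time = 2400.0  # CPU time summed over all threads, in seconds (assumed)
  elapsed_time = 50.0      # application elapsed time, in seconds (assumed)

  avg_busy_cores = total_cpu_time / elapsed_time
  cpu_utilization = 100.0 * avg_busy_cores / logical_cores
  print(f"CPU utilization: {cpu_utilization:.0f}%")  # 75% in this example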

MPI Time

Time spent inside the MPI library. Values greater than 15% might warrant additional exploration of MPI communication efficiency. High MPI time can be caused by high wait times inside the library, active communications, or non-optimal settings of the MPI library. See the MPI Imbalance metric to determine whether the application has a load balancing problem.
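
A rough illustration of the 15% guideline; the figures are hypothetical, not tool output:

  # Minimal sketch: checking the MPI time share against the 15% guideline.
  elapsed_time = 120.0  # application elapsed time in seconds (assumed)
  mpi_time = 24.0       # seconds spent inside the MPI library (assumed)

  mpi_share = 100.0 * mpi_time / elapsed_time
  if mpi_share > 15.0:
      print(f"MPI time is {mpi_share:.0f}% of elapsed time; investigate communication efficiency")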

MPI Imbalance

Mean unproductive wait time per process spent in the MPI library calls when a process is waiting for data.

Serial Time

Time spent by the application outside any OpenMP region in the master thread during collection. This directly impacts application Collection Time and scaling. High values might signal a performance problem to be solved via code parallelization or algorithm tuning.

OpenMP Imbalance

The metric indicates the percentage of elapsed time that your application wastes at OpenMP* synchronization barriers because of load imbalance.

Memory Stalls

This metric indicates how memory subsystem issues affect performance. It measures the fraction of execution pipeline slots that could be stalled due to demand memory load and store instructions. See the second-level metrics to determine whether the application is cache- or DRAM-bound and to check NUMA efficiency.

Cache Stalls

This metric indicates how often the machine was stalled on L1, L2, and L3 cache. While cache hits are serviced much more quickly than hits in DRAM, they can still incur a significant performance penalty. This metric also includes coherence penalties for shared data.

DRAM Stalls

This metric indicates how often the CPU was stalled on the main memory (DRAM) because of demand loads or stores.

DRAM Bandwidth

The metrics in this section indicate the extent of high DRAM bandwidth utilization by the system during elapsed time. They include:

  • Average Bandwidth - Average memory bandwidth used by the system during elapsed time.
  • Peak - Maximum memory bandwidth used by the system during elapsed time.
  • Bound - The portion of elapsed time during which memory bandwidth utilization was above 70% of the theoretical maximum memory bandwidth for the platform (see the sketch after this list).

Some applications execute in phases that use memory bandwidth in a non-uniform manner. For example, an application with an initialization phase may use more memory bandwidth initially. Use these metrics to identify how the application uses memory through the duration of execution.
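
The following minimal sketch illustrates how a Bound-style value could be derived from periodically sampled bandwidth readings. The sample values, sampling interval, and platform peak are assumptions, not APS internals:

  # Minimal sketch: fraction of elapsed time with DRAM bandwidth above 70% of peak.
  theoretical_peak_gbps = 200.0  # platform maximum DRAM bandwidth, GB/s (assumed)
  sample_interval = 1.0          # seconds between bandwidth samples (assumed)
  samples_gbps = [35.0, 150.0, 180.0, 175.0, 60.0, 40.0, 165.0, 30.0]  # invented readings

  threshold = 0.7 * theoretical_peak_gbps
  bound_time = sum(sample_interval for s in samples_gbps if s > threshold)
  elapsed = sample_interval * len(samples_gbps)
  print(f"DRAM Bandwidth Bound: {100.0 * bound_time / elapsed:.0f}% of elapsed time")  # 50% here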

NUMA: % of Remote Accesses

In non-uniform memory architecture (NUMA) machines, memory requests missing the last level cache may be serviced by either local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. It is recommended to keep as much frequently accessed data local as possible. This metric indicates the percentage of remote accesses; the lower the value, the better.
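
A minimal sketch of the ratio this metric expresses, using hypothetical access counts:

  # Minimal sketch: percentage of remote DRAM accesses on a NUMA system.
  local_dram_accesses = 9.0e8   # last level cache misses served by local DRAM (assumed)
  remote_dram_accesses = 1.0e8  # last level cache misses served by remote DRAM (assumed)

  remote_pct = 100.0 * remote_dram_accesses / (local_dram_accesses + remote_dram_accesses)
  print(f"NUMA remote accesses: {remote_pct:.0f}%")  # 10% in this example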

Vectorization

The percentage of packed (vectorized) floating point operations. The higher the value, the bigger the vectorized portion of the code. This metric does not account for the actual vector length used for executing vector instructions. As a result, if the code is fully vectorized but uses a legacy instruction set that utilizes only half of the vector length, the Vectorization metric is still equal to 100%.
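
A minimal sketch of the underlying ratio, with invented operation counts; it also shows why the vector width does not change the value:

  # Minimal sketch: Vectorization as the share of packed FP operations.
  packed_fp_ops = 7.5e11  # packed (vector) floating point operations (assumed)
  scalar_fp_ops = 2.5e11  # scalar floating point operations (assumed)

  vectorization = 100.0 * packed_fp_ops / (packed_fp_ops + scalar_fp_ops)
  print(f"Vectorization: {vectorization:.0f}%")  # 75% here, regardless of vector width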

Instruction Mix

This section contains the breakdown of micro-operations by single precision (SP FLOPs) and double precision (DP FLOPs) floating point and non-floating point (non-FP) operations. SP and DP FLOPs contain next-level metrics that enable you to estimate the fractions of packed and scalar operations. Packed operations can be analyzed by the vector length (128-, 256-, or 512-bit) used in the application.
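
A minimal sketch of this kind of breakdown, classifying hypothetical micro-operation counts by precision and vector width; the categories mirror the description above and the counts are invented:

  # Minimal sketch: instruction-mix breakdown by precision and vector width.
  uops = {                          # hypothetical micro-operation counts
      "SP FLOPs, packed 256-bit": 4.0e11,
      "SP FLOPs, scalar": 1.0e11,
      "DP FLOPs, packed 512-bit": 2.0e11,
      "non-FP": 3.0e11,
  }

  total = sum(uops.values())
  for category, count in uops.items():
      print(f"{category}: {100.0 * count / total:.0f}% of uops")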

FP Arith/Mem Rd Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory read instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.

FP Arith/Mem Wr Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory write instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.
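
For both ratios, a minimal sketch with hypothetical instruction counts:

  # Minimal sketch: FP arithmetic to memory read/write instruction ratios.
  fp_arith_instrs = 4.0e11   # arithmetic floating point instructions (assumed)
  mem_read_instrs = 5.0e11   # memory read instructions (assumed)
  mem_write_instrs = 2.0e11  # memory write instructions (assumed)

  rd_ratio = fp_arith_instrs / mem_read_instrs
  wr_ratio = fp_arith_instrs / mem_write_instrs
  print(f"FP Arith/Mem Rd Instr. Ratio: {rd_ratio:.2f}")  # 0.80
  print(f"FP Arith/Mem Wr Instr. Ratio: {wr_ratio:.2f}")  # 2.00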

Intel® Omni-Path Fabric Interconnect Bandwidth and Packet Rate

(Available for compute nodes equipped with Intel® Omni-Path Fabric (Intel® OP Fabric) and with Intel® VTune™ Profiler drivers installed)

Average interconnect bandwidth and packet rate per compute node, broken down by outgoing and incoming values. High values close to the interconnect limit might lead to higher-latency network communications.

GPU Metrics

This section contains metrics that enable you to analyze the efficiency of GPU utilization within your application.

GPU Accumulated Time

The sum of all time intervals during which each GPU stack had at least one thread scheduled.

GPU IPC (Instructions Per Cycle)

This is the average number of instructions per cycle processed by the two FPU pipelines of Intel® Integrated Graphics.
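
A worked example of the ratio behind this metric, with assumed counts summed over the two FPU pipelines:

  # Minimal sketch: GPU IPC from hypothetical instruction and cycle counts.
  fpu_instructions = 3.0e9  # instructions executed by the two FPU pipelines (assumed)
  gpu_cycles = 2.0e9        # GPU cycles over the same interval (assumed)

  gpu_ipc = fpu_instructions / gpu_cycles
  print(f"GPU IPC: {gpu_ipc:.2f}")  # 1.50 in this example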

GPU Stack Utilization

The average portion of time when at least one GPU XVE thread was scheduled on each GPU stack. This metric is a percentage of the GPU Accumulated Time and has a second-level breakdown by state (see the sketch after this list):

  • XVE Active: The normalized sum of all cycles on all cores spent actively executing instructions.
  • XVE Idle: The normalized sum of all cycles on all cores when no threads were scheduled on a core.
  • XVE Stalled: The normalized sum of all cycles on all cores spent stalled. At least one thread was loaded, but the core remained stalled.
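
As an illustration of how the state breakdown can be expressed as percentages, a minimal sketch over hypothetical normalized cycle counts; this is not the tool's internal accounting:

  # Minimal sketch: XVE state breakdown as percentages of accumulated GPU cycles.
  xve_active_cycles = 6.0e9   # cycles actively executing instructions (assumed)
  xve_stalled_cycles = 3.0e9  # cycles with a thread loaded but stalled (assumed)
  xve_idle_cycles = 1.0e9     # cycles with no thread scheduled (assumed)

  total = xve_active_cycles + xve_stalled_cycles + xve_idle_cycles
  for state, cycles in [("XVE Active", xve_active_cycles),
                        ("XVE Stalled", xve_stalled_cycles),
                        ("XVE Idle", xve_idle_cycles)]:
      print(f"{state}: {100.0 * cycles / total:.0f}%")  # 60%, 30%, 10% here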

PCIe Metrics

Average bandwidth of inbound read and write operations initiated by PCIe devices. The data is shown for GPU and network controller devices.