Analyze Platform Performance

Understand the platform-level metrics provided by the Input and Output analysis of Intel® VTune™ Profiler.
The Input and Output analysis type provides the following metrics for platform-level analysis:
  • PCIe and platform-level IO traffic
  • DRAM Bandwidth
  • Intel® Ultra Path Interconnect (Intel® UPI) Utilization
Use these metrics to identify platform-level performance issues. To analyze these kinds of issues, run the Input and Output analysis with the corresponding options enabled in the analysis configuration.
On FreeBSD* OS, the Input and Output analysis supports all platform-level metrics, except for MMIO accesses. For prerequisites and limitations, see the Input and Output Analysis topic.

Analyze Topology and Device Utilization

Once the data collection finishes, VTune Profiler opens the default Summary window.
Start your investigation with the Platform Diagram section of the Summary window. The Platform Diagram presents system topology and utilization metrics for PCIe and Intel® UPI links, DRAM, and physical cores.
The Platform Diagram is available starting with server platforms based on Intel® microarchitecture code named Skylake, with up to four sockets.
Physical PCIe devices are shown with short names that indicate the PCI bus and device numbers. Hover over a device image to see the full device name, link capabilities, and status in the device tooltip.
The Platform Diagram highlights device status issues that may be a reason for limited throughput. A common issue is a configured link speed or width that does not match the maximum speed or width of the device.
Each device link is attributed with the Effective Link Utilization metric, which represents the ratio of the bandwidth consumed by data transfers to the available physical bandwidth. The metric does not account for protocol overhead (TLP headers, DLLPs, physical encoding) and reflects link utilization in terms of payloads only, so it cannot reach 100%. However, it gives a good indication of how far the link is from saturation. The maximum theoretical bandwidth is calculated from the device link capabilities shown in the device tooltip.
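As a rough illustration of how to read this metric (not the exact formula used by VTune Profiler), the sketch below estimates the peak bandwidth of an assumed PCIe Gen3 x16 link (8 GT/s per lane, 128b/130b encoding) and expresses a hypothetical payload bandwidth reading as a utilization percentage:

    #include <stdio.h>

    /* Illustrative sketch only: estimate the peak bandwidth of an assumed
     * PCIe Gen3 x16 link and express a hypothetical payload bandwidth
     * reading as a utilization percentage. Real link capabilities come
     * from the device tooltip. */
    int main(void) {
        double gt_per_lane = 8.0;          /* Gen3 transfer rate, GT/s     */
        int lanes = 16;                    /* x16 link width               */
        double encoding = 128.0 / 130.0;   /* 128b/130b physical encoding  */
        double peak_gbs = gt_per_lane * lanes * encoding / 8.0;  /* ~15.75 GB/s */

        double payload_gbs = 9.2;          /* hypothetical payload reading, GB/s */
        printf("Peak link bandwidth: %.2f GB/s\n", peak_gbs);
        printf("Effective Link Utilization: %.1f%%\n", payload_gbs / peak_gbs * 100.0);
        return 0;
    }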
The Platform Diagram shows the Average DRAM Utilization when the Evaluate max DRAM bandwidth checkbox is selected in the analysis configuration. Otherwise, it shows the average DRAM bandwidth.
The Average UPI Utilization metric reveals UPI utilization in terms of transmit. The Platform Diagram shows a single cross-socket connection, regardless of how many UPI links connect a pair of packages. If there is more than one link, the maximum value is shown.
The Average Physical Core Utilization metric, displayed on top of each socket, indicates the utilization of physical cores by computations of the application being analyzed.
Once you examine topology and utilization, drill down into the details to investigate platform performance.

Analyze PCIe Traffic

To explore PCIe traffic processing on the platform, start your investigation with the PCIe Traffic Summary section of the Summary window. These top-level metrics reflect the total Inbound and Outbound PCIe traffic:
  • Inbound PCIe Bandwidth is induced by PCIe devices that write to and read from the system memory. These metrics are only available for server platforms based on the Intel® microarchitecture code named Sandy Bridge EP and later.
    • Inbound PCIe Read — the PCIe device reads from the platform memory.
    • Inbound PCIe Write — the PCIe device writes to the platform memory.
  • Outbound PCIe Bandwidth is induced by core transactions targeting the memory or registers of the PCIe device. Typically, the core accesses the device memory through the Memory-Mapped I/O (MMIO) address space. These metrics are only available for server platforms based on the Intel® microarchitecture code named Broadwell EP and later.
    • Outbound PCIe Read — the core reads from the registers of the device.
    • Outbound PCIe Write — the core writes to the registers of the device.
Starting with server platforms based on the Intel® microarchitecture code named Skylake, Inbound and Outbound PCIe Bandwidth metrics can be collected per device. To get per-device metric attribution, load the sampling driver, use Linux perf-based collection on kernel version 5.10 or newer, or run VTune Profiler as root.
You can analyze the Inbound and Outbound PCIe Bandwidth over time on a per-device basis using the timeline in the Bottom-up or Platform tabs.

Analyze Efficiency of Intel® Data Direct I/O Utilization

To understand whether your application utilizes Intel® DDIO efficiently, explore the second-level metrics in the PCIe Traffic Summary section.
The L3 Hit/Miss Ratios for Inbound PCIe requests reflect the proportions of requests made by IO devices to the system memory that hit or miss the L3 cache.
For a detailed explanation of Intel® DDIO utilization efficiency, see the Effective Utilization of Intel® Data Direct I/O Technology Cookbook recipe.
L3 Hit/Miss metrics are available for Intel® Xeon® processors code named Haswell and newer. The sampling driver must be loaded.
The Average Latency metric of the Inbound PCIe read/write groups shows the average amount of time the platform spends processing an inbound read/write request for a single cache line.
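For a back-of-envelope sense of what this latency implies about concurrency, you can combine it with the inbound bandwidth using Little's Law. The sketch below is illustrative only and uses hypothetical bandwidth and latency values; it is not a formula taken from VTune Profiler:

    #include <stdio.h>

    /* Illustrative Little's Law estimate: average in-flight cache-line
     * requests = request rate x average latency. Bandwidth and latency
     * values below are hypothetical sample readings. */
    int main(void) {
        double inbound_write_bw = 12e9;   /* bytes/s, hypothetical           */
        double avg_latency = 300e-9;      /* seconds per cache-line request  */
        double cache_line = 64.0;         /* bytes per request               */

        double requests_per_sec = inbound_write_bw / cache_line;
        double in_flight = requests_per_sec * avg_latency;
        printf("Average in-flight cache-line requests: %.1f\n", in_flight);
        return 0;
    }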
The Core/IO conflicts ratio shows the portion of inbound PCIe write requests that experienced contention for a cache line between the CPU core and the IO controller. These conflicts are caused by the core snooping on a cache line, which, under certain conditions, may cause the IO controller to lose ownership of the line and be forced to reacquire it. Such issues can occur in applications that use the polling communication model and can result in suboptimal throughput and latency. To resolve this, consider tuning the Snoop Response Hold Off option in the Integrated IO configuration of the UEFI/BIOS (the option name may vary depending on the platform manufacturer).
The Average Latency for inbound PCIe reads/writes and the Core/IO Conflicts metrics are available on Intel® Xeon® processors code named Skylake and newer. The sampling driver must be loaded.
You can get a per-device breakdown for Inbound and Outbound Traffic, Inbound request L3 hits and misses, Average latencies, and Core/IO Conflicts using the Bottom-up pane with the Package/M2PCIe grouping.

Analyze MMIO Access

Outbound PCIe traffic visible in the PCIe Traffic Summary section of the Summary tab is caused by cores writing to and reading from the memory and registers of PCIe devices.
Typically, cores access PCIe device memory through the Memory-Mapped I/O (MMIO) address space. Each load or store operation targeting the MMIO address space that a PCIe device is mapped to causes outbound PCIe read or write transactions respectively. Such loads and stores are quite expensive, since they are affected by the PCIe device access latency. Therefore, such accesses should be minimized to achieve high performance.
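To make this concrete, the sketch below shows what such accesses often look like in user-space Linux code: a PCIe device BAR mapped through sysfs and touched with volatile loads and stores, each of which becomes an outbound PCIe transaction. The device path and register offsets are hypothetical placeholders:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Minimal sketch: map a PCIe device BAR via sysfs and access it with
     * volatile loads/stores. Each load is an outbound PCIe read and each
     * store is an outbound PCIe write. The device path and register
     * offsets are hypothetical. */
    int main(void) {
        const char *bar = "/sys/bus/pci/devices/0000:3b:00.0/resource0";
        int fd = open(bar, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        regs[0x10 / 4] = 0x1;               /* MMIO write -> outbound PCIe write */
        uint32_t status = regs[0x14 / 4];   /* MMIO read  -> outbound PCIe read  */
        printf("status register: 0x%x\n", status);

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }

Code like this is where the MMIO Reads and MMIO Writes reported in the MMIO Access section typically originate.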
Enable the Locate MMIO accesses option during analysis configuration to detect the sources of outbound traffic. Use the MMIO Access section to locate functions performing MMIO Reads and MMIO Writes that target specific PCIe devices.
Use the Bottom-up pane to locate sources of memory-mapped PCIe device accesses. Explore the call stacks and drill down to the source and assembly view. Double-click a function name to open the source or assembly view and locate the code responsible for MMIO reads and writes at the source line level.
MMIO access data is collected when the Locate MMIO accesses checkbox is selected. However, there are some limitations:
  • This feature is only available starting with server platforms based on the Intel® microarchitecture code named Skylake.
  • Only the Attach to Process and Launch Application collection modes are supported.

Analyze Memory and Cross-Socket Bandwidth

Non-optimal application topology can result in induced DRAM and Intel® QuickPath Interconnect (Intel® QPI) or Intel® Ultra Path Interconnect (Intel® UPI) cross-socket traffic, which can limit performance.
Use the Platform tab to correlate Inbound PCIe Traffic with DRAM and cross-socket interconnect bandwidth consumption.
VTune Profiler provides a per-channel breakdown of DRAM bandwidth.
Two metrics are available for UPI traffic:
  • UPI Utilization Outgoing – a ratio metric that shows UPI utilization in terms of transmit (see the sketch after this list for one way to read such a ratio).
  • UPI Bandwidth – shows detailed bandwidth information with a breakdown by data and non-data traffic.
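The sketch below is a rough illustration of how a transmit bandwidth reading relates to a utilization ratio; it is not the formula VTune Profiler uses. The per-link peak of 20.8 GB/s per direction (commonly quoted for a 10.4 GT/s UPI link), the link count, and the measured value are all assumptions for this example; check your processor specification for the actual link speed and link count:

    #include <stdio.h>

    /* Illustrative only: compare a measured UPI transmit bandwidth with an
     * assumed aggregate per-direction peak. All values are assumptions for
     * this example, not defaults from VTune Profiler. */
    int main(void) {
        double peak_per_link = 20.8e9;   /* bytes/s per direction, assumed 10.4 GT/s link */
        int links = 2;                   /* UPI links between the socket pair, assumed    */
        double measured_tx = 14.5e9;     /* bytes/s, hypothetical transmit reading        */

        double utilization = measured_tx / (peak_per_link * links) * 100.0;
        printf("Outgoing UPI utilization: %.1f%%\n", utilization);
        return 0;
    }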
You can get a breakdown of UPI metrics by UPI links. See the specifications of your processor to determine the number of UPI links enabled on each socket.
UPI link names reveal the topology of your system by showing which sockets and UPI controllers they are connected to.
For example, in a result collected on a four-socket server powered by Intel® processors with microarchitecture code named Skylake, the UPI link data revealed a significant traffic imbalance, with bandwidth much higher on links connected to socket 3.
