11/09/2021

IO Issues: High Latency and Low PCIe* Bandwidth

This recipe uses the Disk IO analysis in Intel® VTune™ Amplifier to profile a sample IO-bound application, then changes the application affinity relative to a PCIe device to reduce latency and increase read access bandwidth.
Content experts: Roman Sudarikov
Disk IO analysis was renamed to Input and Output analysis starting with Intel VTune Amplifier 2019.


This section lists the hardware and software tools used for the performance analysis scenario.
  • Application: a sample application that performs sequential 128K read access for 3 seconds. The application is available at
  • Performance analysis tools
    • Intel VTune Amplifier 2018: Disk Input and Output analysis
      • Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
      • Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
      • Get the latest version of Intel® VTune™ Profiler.
  • Operating system: Red Hat* Enterprise Server 7.2
  • CPU: Intel microarchitecture code name Skylake
  • IO device specification: Intel® Solid State Drive Data Center Family for PCIe* P3500/P3600/P3700 Series
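
The workload described in the list above (sequential 128K reads for 3 seconds) can be approximated with a short script. This is a minimal sketch, not the recipe's original application: the function name `sequential_read` and its parameters are assumptions made for illustration, and because it reads through the OS page cache, its numbers will differ from a direct-IO benchmark.

```python
import time

BLOCK_SIZE = 128 * 1024  # 128K per read, as in the recipe
DURATION_S = 3.0         # run time, as in the recipe


def sequential_read(path, block_size=BLOCK_SIZE, duration_s=DURATION_S):
    """Read `path` sequentially in fixed-size blocks until the time is up.

    Returns (bytes_read, read_calls). Wraps to the start of the file at EOF.
    Hypothetical helper: not part of VTune or of the original sample.
    """
    bytes_read = 0
    read_calls = 0
    deadline = time.monotonic() + duration_s
    # buffering=0 opens an unbuffered raw file, so each read() is one syscall.
    with open(path, "rb", buffering=0) as f:
        while time.monotonic() < deadline:
            chunk = f.read(block_size)
            if not chunk:  # end of file: start over from the beginning
                f.seek(0)
                continue
            bytes_read += len(chunk)
            read_calls += 1
    return bytes_read, read_calls
```

Calling `sequential_read` with the path of a large file on the device under test and dividing the returned byte count by the duration gives a rough read bandwidth to compare with the device specification.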

Run Disk Input and Output Analysis

For IO-bound applications, start with the Disk Input and Output analysis:
  1. Click the New Project button on the toolbar and specify a name for the new project.
  2. In the Analysis Target window, select the local host target system type for the host-based analysis.
  3. Select the Launch Application target type and specify an application for analysis on the right pane.
  4. Click the Choose Analysis button on the right, select Platform Analysis > Disk Input and Output, and start the analysis.
VTune Amplifier launches the application, collects data, and finalizes the data collection result, resolving the symbol information required for successful source analysis.

Analyze Bandwidth and Latency Metrics

Start your analysis with the view that provides high-level statistics on the application execution. Focus on the I/O Wait Time metric, which is a primary indicator of I/O efficiency.
The I/O Wait Time metric shows that the application spent almost 30% of the Elapsed time waiting on I/O.
Select the Disk IO operation type on the histogram to analyze the read access time distribution.
Unsteady flow typically signals a performance degradation. This is also confirmed by the read access time, which is 3 orders of magnitude greater than the 20 usec that the device specification declares.
Switch to the window and apply the Storage Device/Partition grouping level. Focus on the Timeline data:
  • The I/O Operations and Data Transfers sections of the Timeline view show a high number of IO Waits and unsteady data flow.
  • The PCIe Bandwidth section shows that the read bandwidth of the device (reported on the package the device is local to) is only about 65% of what the device specification claims.
Change the Timeline grouping to Package / Core / H/W Context to explore your application affinity.
You see that the application is running on one package though the device is local to another. This could be the reason for the high latency and lower-than-expected bandwidth.
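
Outside VTune, the same locality question can be cross-checked on Linux through sysfs. The sketch below assumes that the NUMA node reported for the device corresponds to the package it is attached to, which holds on many but not all systems. The helper names and the `sysfs_root` parameter are hypothetical, and the PCI address must be supplied by you (for example, from `lspci`).

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def device_numa_node(pci_address, sysfs_root="/sys"):
    """Return the NUMA node of a PCI device; -1 means the kernel does not know."""
    path = Path(sysfs_root) / "bus/pci/devices" / pci_address / "numa_node"
    return int(path.read_text().strip())


def node_cpus(node, sysfs_root="/sys"):
    """Return the set of logical CPUs that belong to the given NUMA node."""
    path = Path(sysfs_root) / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())
```

If `device_numa_node` returns a non-negative node, `node_cpus` for that node lists the logical CPUs that are local to the device. An application running on CPUs outside that set reaches the device across the inter-package link.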

Change Application Affinity and Re-run the Analysis

To solve the detected IO issues but keep the workload itself and device placement intact, change the application affinity and rerun the Disk Input and Output analysis.
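
The recipe does not state how the affinity was changed. One possible way on Linux, shown here as a sketch under that assumption, is to launch the application from a small wrapper that restricts it to the CPUs local to the device. The wrapper name `run_pinned` is hypothetical; command-line tools such as `taskset` or `numactl` are common alternatives.

```python
import os
import subprocess


def run_pinned(cmd, cpus, **kwargs):
    """Run `cmd` restricted to the logical CPUs in `cpus` (Linux only).

    Hypothetical helper for illustration. Extra keyword arguments are
    passed through to subprocess.run.
    """
    def set_affinity():
        # Runs in the child between fork and exec; pid 0 means "this process".
        os.sched_setaffinity(0, cpus)

    return subprocess.run(cmd, preexec_fn=set_affinity, check=True, **kwargs)
```

For example, `run_pinned(["./my_io_app"], {4, 5, 6, 7})` would start a placeholder binary `./my_io_app` on CPUs 4 to 7. The CPU set should be the one local to the device on your system, not these illustrative values.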
The new result shows that the application is waiting on I/O operations only about 2% of the Elapsed time.
The histogram no longer shows a spread in read access time. All IO operations are executed in a sub-millisecond range.
The Timeline view now displays smooth data flows for IO operations and IO Data Transfers, which confirms that the affinity optimization reduced the latency.
The change also increased the PCIe bandwidth to about 93% of what the device specification claims.
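
To put the before and after figures side by side, the short calculation below uses the approximate percentages quoted in this recipe (65% and 93% of the specified bandwidth, 30% and 2% I/O Wait Time). The helper name `ratio` is an assumption for illustration, and the results are only as precise as those rounded inputs.

```python
def ratio(larger, smaller):
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


# Approximate figures quoted in this recipe.
bandwidth_gain = ratio(93, 65)    # share of spec bandwidth: after vs. before
io_wait_reduction = ratio(30, 2)  # share of Elapsed time in I/O wait: before vs. after

print(f"PCIe read bandwidth: about {bandwidth_gain:.2f}x higher")
print(f"I/O Wait Time: about {io_wait_reduction:.0f}x lower")
```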


Key takeaways from IO performance analysis for PCIe bandwidth-bound applications:
  • Determine IO Unit (IOU) Affinity for PCIe devices.
  • Distribute applications to IO Units appropriately.
  • Learn performance capabilities of your device.
  • Set reasonable performance targets.
  • Run Disk Input and Output analysis to debug IO solutions with lower than expected bandwidth.
