User Guide

What's New in Intel® VTune™ Profiler

Intel® VTune™ Profiler 2022.2.0

Download this version of Intel® VTune™ Profiler from the product download page. This version contains the following additions:
  • HPC Performance Characterization Analysis
    • Better Hardware Observability
      This release adds the Platform Diagram to the Summary tab of the HPC Performance Characterization analysis result. The Platform Diagram reveals system topology and utilization metrics for physical cores, DRAM, and Intel® Ultra Path Interconnect (Intel® UPI) links.
      The diagram is available for server platforms based on Intel® microarchitecture code named Skylake and newer.
  • VTune Profiler Server
    • New Command-Line Options for Convenience
      The vtune-backend binary that launches VTune Profiler Server features new command-line options to make setup in certain environments more convenient. You can now specify a base URL that VTune Profiler Server will use as the basis for URL generation. Additionally, new options were added to suppress automatic help tours on startup and to provide or decline consent to collect usage information right from the command line.
      These new options can be especially useful if you are running VTune Profiler Server inside a container.
  • More Information on Windows*
    • Support for Debug Information For Inline Functions
      VTune Profiler is now capable of reading debugging information for inline functions from PDB symbol files on Windows* OS. VTune Profiler can now display names and source code for inline functions in your workload.

Intel® VTune™ Profiler 2022.1.0

This version contains the following additions:
  • Hardware Support
    • Support for First Generation of Intel® Arc™ High-performance Discrete GPUs
      This release of Intel® VTune™ Profiler supports the first generation of Intel® Arc™ high-performance discrete GPUs code named Alchemist, previously known as DG2. The support includes:
      • Explicit support for DPC++, DirectX, Intel® Media SDK, OpenCL™, and OpenMP offload software technologies.
      • Support for multi-GPU systems. You can now profile all Intel GPU devices, including integrated and discrete GPUs.
      • Support for GPU Offload and GPU Hotspots analyses, including source level in-kernel profiling.
    Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.
  • Input and Output Analysis
    • Intel® VT-d Observability
      Intel® Virtualization Technology for Directed I/O (Intel® VT-d) observability is introduced in the Input and Output analysis for server platforms based on 3rd Gen Intel® Xeon® Scalable processors (code named Ice Lake), the Intel Atom® P5900 Processor Family (code named Snow Ridge), and newer. New performance metrics reveal the efficiency of hardware-driven DMA address remapping and the penalties for sub-optimal Intel VT-d utilization.
  • Managed Code Targets
    • .NET 6 Support
      This release introduces support for analyzing .NET 6 targets using User-Mode Sampling. You can analyze .NET 6 workloads in Launch Application and Attach to Process modes on both Windows* and Linux* hosts.
  • Application Performance Snapshot
    • Histograms in Metric Tooltips
      The metric tooltips in Application Performance Snapshot HTML reports were enhanced with histograms that clearly visualize the distribution of metric values observed during analysis.
  • Operating System Support
    • New Host Operating Systems
      This release introduces support for these OS hosts:
      • Microsoft Windows* 11
      • Ubuntu* 21.10

Intel® VTune™ Profiler 2022.0.0

This version contains the following additions:
  • Analyses
    • Algorithm Group
      • Flame Graph View in Hotspots Analysis
        This version of VTune Profiler introduces support for Flame Graphs in the Hotspots analysis type. The Hotspots by CPU Utilization viewpoint has been enhanced with a Flame Graph window that displays a graphical view of hot code paths. Use flame graphs to analyze the time spent on each function and its callee functions.
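To illustrate what "time spent on each function and its callee functions" means, here is a minimal sketch that aggregates call-stack samples into total time (function plus callees) and self time. It does not reflect how VTune Profiler builds its Flame Graph window; the function name `aggregate_stacks` and the sample data are hypothetical and exist only for this example.

```python
from collections import defaultdict


def aggregate_stacks(samples):
    """Aggregate (stack, cpu_time) samples into per-function times.

    Each stack is a tuple of function names ordered from root to leaf.
    Returns (total, self_time): total includes time spent in callees,
    self_time counts only samples where the function is the leaf.
    """
    total = defaultdict(float)
    self_time = defaultdict(float)
    for stack, cpu_time in samples:
        # Count each function once per sample, even if it recurses.
        for func in set(stack):
            total[func] += cpu_time
        self_time[stack[-1]] += cpu_time
    return dict(total), dict(self_time)


# Hypothetical sample data, not produced by VTune Profiler.
samples = [
    (("main", "solve", "multiply"), 6.0),
    (("main", "solve"), 1.0),
    (("main", "load_input"), 3.0),
]

total, self_time = aggregate_stacks(samples)
for func in sorted(total, key=total.get, reverse=True):
    print(f"{func}: total={total[func]} self={self_time.get(func, 0.0)}")
```

In this data, `solve` has a large total time but a small self time, which points to its callee `multiply` as the hot code path. That is the kind of relationship a flame graph makes visible at a glance.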
    • Input and Output
      • Platform Diagram
        This release introduces the Platform Diagram, a new starting point for the Input and Output analysis. It reveals system topology and high-level utilization metrics for hardware resources including PCIe devices, Intel® Ultra Path Interconnect, and memory. It enables you to examine the utilization of your hardware at a glance.
        This feature is enabled for 1st and 2nd Generation Intel® Xeon® Scalable Processors in up to four-socket configurations, excluding the Intel® Xeon® Platinum 9200 series processors code named Cascade Lake AP. This feature is also supported on Intel Atom® Processors P Series code named Snow Ridge.
      • Intel® Data Direct I/O Technology
        Intel® Data Direct I/O (Intel DDIO) utilization efficiency metrics are extended with average Inbound PCIe read/write latency and core/IO contention indicator.
      • Linux Perf* Capabilities
        It is now possible to perform Linux perf-based data collection without root access on 1st and 2nd Generation Intel Xeon® Scalable Processors on Linux kernel versions 5.10 and newer.
    • Platform Analyses
      • VTune Profiler - Platform Profiler as Analysis Type
        VTune Profiler - Platform Profiler has been completely integrated into VTune Profiler as an analysis type. Platform Profiler is now fully available as an analysis from the GUI or command line of VTune Profiler. For more information, see Platform Profiler Analysis.
      • CPU Throttling Data in System Overview Analysis
        The System Overview analysis now displays information about factors that can cause the CPU to throttle. Use this information to examine if your system is overheated or consumes significant power, both of which could result in frequency drops that affect system performance.
    • Microarchitecture Analyses
      • Platform Diagram in Memory Usage View
        This release introduces the Platform Diagram in the Memory Usage viewpoint of the Memory Access analysis type. Use this diagram to understand:
        • System topology
        • Utilization metrics for DRAM
        • Intel® UPI links
        • Physical cores
        The platform diagram is available for:
        • All client platforms
        • Server platforms based on Intel® microarchitecture code named Skylake, with up to four sockets.
  • Analysis Targets
    • .NET
      • .NET 5 Workloads
        This release introduces support for running the Hotspots analysis on .NET 5 targets in Launch Application mode when using hardware event-based sampling.
      • Extended Support for .NET 5 Workloads
        You can now analyze .NET 5 workloads in the Attach to Process mode when you use Hardware Event-Based Sampling.
    • FreeBSD* OS
    • Support for Unified Shared Memory Workloads
      Starting with the 2021.8 release, you can profile OpenCL, SYCL, and DPC++ applications that use Unified Shared Memory (USM) workloads. For OpenCL applications, this release also supports explicit data transfer of the buffer as Unified Shared Memory.
  • GPU Accelerators
    • Source-level analysis for DPC++ and OpenMP applications running on GPU over Level Zero
      The following modes in GPU Compute/Media Hotspots analysis are now available when profiling Level Zero applications:
      Support also includes full-scale analysis of the kernel source per code line, including Source/Assembly mapping.
    • Advanced Data Transfer Information in GPU Offload Analysis
      The following additions to the Graphics window better clarify the data transfer between the CPU host and the GPU device when you run GPU profiling analyses:
      • Allocation time information displays as part of total time by device operation.
      • The Data Transferred table has been renamed to the Transfer Size table. Columns under Transfer Size feature new names for data transferred between host and device.
      • Highlights and tooltips for workloads with sub-optimal offload schemes direct your attention to where the offload scheme needs improvement.
    • Improved Tooltips for Occupancy Metrics in GPU Analysis
      The GPU Compute/Media Hotspots Analysis has been enhanced to detect factors that limit peak achievable occupancy for the hottest computing tasks that make the EU array idle when waiting for the scheduler. Improved tooltips for occupancy metrics now provide information about peak occupancy and bounding reasons for existing computing task launch configuration.
    • GPU Analysis Coverage for Self-Check
      The self-check functionality in VTune Profiler now also covers GPU analyses. Run the vtune-self-checker.sh script on Windows and Linux systems to check the GPU Compute/Media Hotspots analysis in source analysis and characterization modes when you run DPC++ applications on an Intel GPU. You must install the Intel® oneAPI Base Toolkit for this purpose.
    • Occupancy Report in GPU Hotspots Analysis
      The GPU Compute/Media Hotspots analysis has been enhanced to display occupancy information in the Summary section. Use this data to understand the architectural limitations of the GPU that affect occupancy.
    • CPU Context for GPU Execution in GPU Offload Analysis
      The GPU Offload analysis now presents a richer set of information about execution on the GPU by including context from the CPU. This includes stack information on:
      • Execution
      • Data transfer from host to device
      • Data transfer from device to host
      The viewpoint for the GPU Offload Analysis now includes the Call Stack pane with a new grouping by GPU Computing Task/Host Call Stack. Navigate through transfer data contained in these panes to identify inefficient code paths in your application.
    • Analysis of Multiple GPUs
      When you have multiple GPUs connected to your system, you can now analyze all of the GPUs collectively with the GPU Offload and GPU Compute/Media Hotspots analyses. Previously, you could analyze only a single GPU at a time after VTune Profiler identified all the GPUs connected to the system. When you run these analyses on all connected GPUs, see analysis information about each GPU in the Summary window. The Full Compute set in Characterization mode is not available in multi-adapter and multi-tile analysis.
    • Hottest CPU Tasks in GPU Offload Analysis
      The Summary view in the GPU Offload analysis now includes the Hottest Host Tasks table, which displays the most active tasks running on the CPU. Use this table to examine the overhead on the host. Click on a performance-critical task to see more information in the Graphics window, where results are grouped by host Task Type.
    • Support for Affinity Mask
      If you use the ZE_AFFINITY_MASK variable to bind your workload to a single tile, VTune Profiler can then attribute kernels to the correct tile and also display relevant metrics per kernel.
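As a rough illustration of binding a workload to a single tile before profiling, the sketch below builds an environment with ZE_AFFINITY_MASK set and a command line for a GPU Offload collection. The helper name `build_profiling_run`, the application path `./my_app`, the result directory name, and the mask value `0.0` (assumed here to mean device 0, tile 0) are assumptions for this example; check the Level Zero and VTune Profiler documentation for the values that apply to your system.

```python
import os


def build_profiling_run(app_cmd, tile="0.0", result_dir="r_gpu_offload"):
    """Return (cmd, env) for a GPU Offload collection bound to one tile.

    The mask value and result directory are illustrative assumptions.
    """
    env = dict(os.environ)
    # Restrict the Level Zero runtime to a single tile of a single device.
    env["ZE_AFFINITY_MASK"] = tile
    cmd = [
        "vtune",
        "-collect", "gpu-offload",
        "-result-dir", result_dir,
        "--",
        *app_cmd,
    ]
    return cmd, env


# Hypothetical application path.
cmd, env = build_profiling_run(["./my_app"])
print("ZE_AFFINITY_MASK=" + env["ZE_AFFINITY_MASK"], " ".join(cmd))

# To launch the collection on a system with VTune Profiler installed:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Passing the variable through `env` rather than modifying `os.environ` keeps the binding local to the profiled process.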
    • Host-GPU Bandwidth Information in GPU Offload Analysis
      Previously, you checked the Analyze memory bandwidth option in the GPU Offload analysis to see data required for this computation. Starting with this release of VTune Profiler, you can use the Analyze host-GPU bandwidth option instead. Depending on your hardware configuration, this selection displays DRAM bandwidth, PCIe bandwidth, or both sets of data on the timeline.
    • PCIe Bandwidth Information in Custom and Command Line Runs of GPU Offload Analysis
      Use new options to collect information about PCIe bandwidth (between the host and GPU sides) when you run custom and command line runs of the GPU Offload analysis:
      • Use the switch for both custom and command line runs.
      • In the UI, check the Analyze host-GPU PCIe bandwidth option for custom analysis.
    • Improvements to Peak Occupancy Metric
      The GPU Peak Occupancy metric for a computing task now flags the factors that limit peak occupancy in the order of priority. Start tuning your application by addressing the most restricting factor. VTune Profiler customizes recommendations for potential improvements based on the launch parameters of the compute kernel (work size, SLM and barriers usage).
    • Enhancements to GPU Offload Summary
      The Summary window of the GPU Offload analysis contains these enhancements for an improved user experience:
      • Locate hotspots in your function when the GPU is not busy. See the new Top hotspots when GPU was idle table in the GPU Time, % of Elapsed Time (formerly GPU Utilization) section.
      • The Hottest Computing Functions section now includes occupancy information.
    • Data Collection of CPU Host Stacks
      When you collect information about host stacks in the GPU Offload and GPU Compute/Media Hotspots analyses, you can now filter the data by selecting a call stack mode from the filter bar.
    • Support to Trace DirectX* API on CPU Host
      This release of VTune Profiler introduces support to profile DirectX applications on the CPU host. These versions of the DirectX API can be traced:
      • DXGI
      • Direct3D 11
      • Direct3D 12
      • Direct3D 11 on 12 (D3D11On12)
  • Hardware Support
    • Analysis Support for Intel® Microarchitecture Code Named Alder Lake
      This version of VTune Profiler introduces support for Intel® microarchitecture code named Alder Lake in these analysis types:
    • Support for Intel® Atom® Processors
      Support for Intel Atom® Processor P Series code named Snow Ridge, including Hotspots, Microarchitecture Exploration, Memory Access, and Input and Output analyses.
    • Support for 3rd Gen Intel® Xeon® Scalable Processor Architecture
      This release supports the 3rd Gen Intel® Xeon® Scalable processor architecture (code named Ice Lake Server).
  • IDE Support
    • Support for Microsoft Visual Studio* 2022
      This release introduces support for the integration of VTune Profiler into Microsoft Visual Studio 2022.
  • VTune Profiler Server
  • Application Performance Snapshot
    • Metric tooltips in HTML reports
      Metric tooltips in APS HTML reports now present a more holistic view of metrics and their properties. The new tooltips present a compact yet comprehensive overview of a metric, which helps you to better understand the importance of metrics in performance analysis. This change includes a visual bar that indicates where the metric value stands in terms of current performance and tuning potential.
    • PCIe bandwidth info in CLI reports
      APS command line reports now include PCIe bandwidth metrics. This data is only available on server platforms when using the Sampling Driver.
    • New reports and filters
      APS now features the following new types of reports and filters:
      • Node topology report: view relations between ranks, nodes, and PCIe devices.
      • Metrics report: get a configurable table that displays any collected metric for each rank, node, or device.
      • Ability to filter data by node.
    • Outlier Detection
      This release introduces a mechanism for the detection of outliers, or individual metric values contributing to an average metric that differ significantly from the overall distribution or break a certain threshold. Outliers can cause imbalance and distort average metric values. You can now see outliers in both HTML and CLI reports, with attribution to specific rank or node where an outlier occurred.
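The idea of an outlier distorting an average can be sketched as follows. This is not the detection mechanism used by Application Performance Snapshot; the function `find_outliers`, its relative-deviation rule, its default tolerance, and the per-rank values are all assumptions chosen for illustration.

```python
from statistics import mean, median


def find_outliers(values_by_rank, rel_tolerance=0.5, threshold=None):
    """Return {rank: value} for values that look like outliers.

    A value is flagged if it deviates from the median by more than
    rel_tolerance (relative to the median), or if it exceeds the
    optional absolute threshold. Both rules are illustrative only.
    """
    center = median(values_by_rank.values())
    outliers = {}
    for rank, value in values_by_rank.items():
        deviates = center != 0 and abs(value - center) / abs(center) > rel_tolerance
        breaks_threshold = threshold is not None and value > threshold
        if deviates or breaks_threshold:
            outliers[rank] = value
    return outliers


# Hypothetical per-rank metric values (for example, MPI time in percent).
mpi_time_pct = {0: 12.0, 1: 11.0, 2: 13.0, 3: 48.0}

average = mean(mpi_time_pct.values())
outliers = find_outliers(mpi_time_pct)
print("average:", average)
print("outliers:", outliers)
```

Here a single rank pulls the average well above what the other three ranks report, which is why attributing the outlier to a specific rank or node is more useful than the average alone.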
    • Metric Tooltip Enhancements
      Metric tooltips now visualize ranges of average metrics, with their minimum, maximum, and average contributing values.
  • MPI Support
  • User Interface
    • Main Vertical Toolbar
      This release introduces a new main vertical toolbar. All controls previously located in the main horizontal toolbar are now located on this toolbar, which is designed to enhance your experience with clear, bright controls.
    • Enhanced Project Navigator User Experience
      The Project Navigator pane now features menu options to open a new or existing project to better facilitate your VTune Profiler experience.
    • Improvements to Vectorization Information
      The Vectorization sections of Performance Snapshot and HPC Performance Characterization analyses have been enriched to provide a clearer picture of the state of vectorization in your application. Quickly see if your code is not vectorized at all, if your code does not use the latest vector instruction set extension, or if your code has too many scalar instructions. This version of VTune Profiler also features improved recommendations to resolve vectorization issues.
    • Rich Metric Tooltips in Multiple Analyses
      This release introduces rich metric tooltips in Performance Snapshot, Hotspots, HPC Performance Characterization, and Microarchitecture Exploration analyses. The new tooltips aim to make metrics more intuitive by providing visualizations for thresholds, desired direction (more/less is better), and tuning potential. Hover over a metric to get this tooltip.
    • Detection of Compilation with Low Optimization Level in Hotspots Analysis
      When debug information is available, VTune Profiler now detects and flags modules that may have been compiled using non-optimal compiler optimization flags in the Top Hotspots section of the Hotspots analysis result. This can help detect underutilization of compiler optimization capabilities and correct the build system setup.
    • Platform Diagram Extended with Persistent Memory Block
      For Input and Output and Memory Access analyses, the Platform Diagram shown in Summary windows now features a dedicated block for Persistent Memory devices, together with average per-socket bandwidth.
      This data is available on server platforms based on Intel microarchitectures code named Cascade Lake and Ice Lake.
    • Changes to Viewpoint Selection
      The Viewpoint selection was adjusted with respect to each analysis type. Now, the viewpoint selection is disabled for certain analysis types, and only features a managed set of most helpful viewpoints for other analysis types. You can re-enable the display of all applicable viewpoints in the Options pane.
  • Code Annotations
    • New Instrumentation and Tracing Technology API Capabilities
      A new Histogram API was added to ITT API. This API enables you to collect arbitrary histogram data without extra overhead. The Summary tab of the Input and Output analysis automatically displays this data in the form of a histogram.
  • Debug Formats
    • Support for DWARF5 Debug Format
      VTune Profiler now supports version 5 of the DWARF debug format. You can now use debug information in DWARF 5 format to resolve function names and source locations for binaries.
  • Command Line Analysis
    • Perf Tool Parameters for All Analysis Types
      You can now use the target-system command to get parameters on the command line for the native perf tool for all CPU hardware-based analysis types, including custom analyses. Use the get-perf-cmd argument for this purpose. You can collect the perf trace on a target with the Linux Perf tool and then import the trace to the VTune Profiler UI.
  • Documentation

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.