What's New in Intel® VTune™ Profiler
- Analysis Targets
- Extended Support for .NET 5 Workloads: You can now analyze .NET 5 workloads in the Attach to Process mode when you use Hardware Event-Based Sampling.
- Support for Unified Shared Memory Workloads: Starting with the 2021.8 release, you can profile OpenCL, SYCL, and DPC++ applications that use Unified Shared Memory (USM). For OpenCL applications, this release also supports explicit data transfer of the buffer as Unified Shared Memory.
- Input and Output Analysis on FreeBSD* OS: You can now run the Input and Output analysis on remote FreeBSD targets. Analysis scope is limited to platform-level metrics.
- GPU Accelerators
- Hottest CPU Tasks in GPU Offload Analysis: The Summary view in the GPU Offload analysis now includes the Hottest Host Tasks table, which displays the most active tasks running on the CPU. Use this table to examine the overhead on the host. Click a performance-critical task to see more information in the Graphics window, where results are grouped by host Task Type.
- Support for Affinity Mask: If you use the ZE_AFFINITY_MASK variable to bind your workload to a single tile, VTune Profiler can then attribute kernels to the correct tile and also display relevant metrics per kernel.
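As a rough illustration of the affinity mask workflow above, the following Python sketch prepares an environment with ZE_AFFINITY_MASK set and a command line for the GPU Offload analysis. The helper name, the mask value (device 0, tile 0), the result directory, and the application path `./my_app` are assumptions made for this example only.

```python
import os
import shlex


def build_tile_bound_run(app_cmd, device=0, tile=0, result_dir="r_gpu_offload"):
    """Return (env, argv) for a GPU Offload collection bound to one tile.

    Hypothetical helper: nothing here is part of VTune Profiler itself.
    """
    env = dict(os.environ)
    # Level Zero interprets ZE_AFFINITY_MASK as "<device>.<tile>".
    env["ZE_AFFINITY_MASK"] = f"{device}.{tile}"
    argv = ["vtune", "-collect", "gpu-offload",
            "-result-dir", result_dir, "--", *app_cmd]
    return env, argv


if __name__ == "__main__":
    env, argv = build_tile_bound_run(["./my_app"])
    # Print the equivalent shell invocation instead of running it.
    print(f"ZE_AFFINITY_MASK={env['ZE_AFFINITY_MASK']} {shlex.join(argv)}")
```

To launch the collection for real, the pair could be passed to `subprocess.run(argv, env=env, check=True)` on a system where VTune Profiler is installed.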
- Code Annotations
- New Cookbook Recipe on Analyzing Hot Code Paths Using Flame Graphs: The VTune Profiler Performance Analysis Cookbook features a new recipe that demonstrates how you can analyze hot code paths in your application using flame graphs.
- Algorithm Group
- Flame Graph View in Hotspots Analysis: This version of VTune Profiler introduces support for Flame Graphs in the Hotspots analysis type. The Hotspots by CPU Utilization viewpoint has been enhanced with a Flame Graph window that displays a graphical view of hot code paths. Use flame graphs to analyze the time spent on each function and its callee functions.
- GPU Accelerators
- CPU Context for GPU Execution in GPU Offload Analysis: The GPU Offload analysis now presents a richer set of information about execution on the GPU by including context from the CPU. This includes stack information on:
- Data transfer from host to device
- Data transfer from device to host
- Analysis of Multiple GPUs: When you have multiple GPUs connected to your system, you can now analyze all of the GPUs collectively with the GPU Offload and GPU Compute/Media Hotspots analyses. Previously, you could analyze a single GPU at a time after VTune Profiler identified all the GPUs connected to the system. When you run these analyses on all connected GPUs, see analysis information about each GPU in the Summary window. The full compute set in Characterization mode is not available in multi-adapter and multi-tile analysis.
- Debug Formats
- Support for DWARF5 Debug Format: VTune Profiler now supports version 5 of the DWARF debug format. You can now use debug information in DWARF 5 format to resolve function names and source locations for binaries.
- Microarchitecture Analyses
- Platform Diagram in Memory Usage View
The platform diagram shows:
- System topology
- Utilization metrics for DRAM
- Intel® UPI links
- Physical cores
The platform diagram is available for:
- All client platforms
- Server platforms based on Intel® microarchitecture code name Skylake, with up to four sockets.
- Command Line Analysis
- Perf Tool Parameters for All Analysis Types: You can now use the target-system command to get parameters on the command line for the native perf tool for all CPU hardware-based analysis types, including custom analyses. Use the get-perf-cmd argument for this purpose. You can collect the perf trace on a target with the Linux Perf tool and then import the trace to the VTune Profiler UI.
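The request described above might be assembled as in the following Python sketch. The option spelling `-target-system=get-perf-cmd` is an assumption pieced together from the target-system and get-perf-cmd names mentioned in this entry; check it against the product's command-line help before relying on it. The helper name and the default analysis type are hypothetical.

```python
import shlex


def perf_cmd_query(analysis="hotspots", knobs=None):
    """Build a vtune command line that asks for native perf parameters.

    Hypothetical helper; the -target-system=get-perf-cmd spelling is assumed.
    """
    argv = ["vtune", "-collect", analysis]
    for name, value in (knobs or {}).items():
        # Knobs are passed through as "-knob name=value" pairs.
        argv += ["-knob", f"{name}={value}"]
    argv.append("-target-system=get-perf-cmd")
    return argv


if __name__ == "__main__":
    # Show the command without executing it.
    print(shlex.join(perf_cmd_query("hotspots", {"sampling-mode": "hw"})))
```

The printed perf parameters would then be used with the Linux Perf tool on the target, and the resulting trace imported into VTune Profiler.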
- Intel® VTune™ Profiler - Platform Profiler Integration
- VTune Profiler - Platform Profiler as Analysis Type: VTune Profiler - Platform Profiler has been completely integrated into VTune Profiler as an analysis type. Platform Profiler is now fully available as an analysis from the GUI or command line of VTune Profiler. For more information, see Platform Profiler Analysis.
- Application Performance Snapshot
- Outlier Detection: This release introduces a mechanism for the detection of outliers, that is, individual metric values contributing to an average metric that differ significantly from the overall distribution or break a certain threshold. Outliers can cause imbalance and distort average metric values. You can now see outliers in both HTML and CLI reports, with attribution to the specific rank or node where an outlier occurred.
- Metric Tooltip Enhancements: Metric tooltips now visualize ranges of average metrics, with their minimum, maximum, and average contributing values.
- New CLI Cheat Sheet for Quick Reference: Added a new downloadable document, the VTune Profiler CLI Cheat Sheet. You can use this print-friendly PDF for quick reference on the VTune Profiler command-line interface.
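To make the idea of outliers in the Application Performance Snapshot entries above more concrete, here is a minimal Python sketch that flags per-rank values that sit far from the rest of the distribution or exceed a fixed threshold. It is not the algorithm Application Performance Snapshot uses: the z-score rule, the limit of 2.0, the function name, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values_by_rank, z_limit=2.0, threshold=None):
    """Return (average, {rank: value}) for values that look like outliers.

    Illustrative z-score rule only; not the tool's actual detection logic.
    """
    values = list(values_by_rank.values())
    avg = mean(values)
    spread = pstdev(values)
    outliers = {}
    for rank, value in values_by_rank.items():
        # Far from the overall distribution (guard against zero spread).
        far_from_rest = spread > 0 and abs(value - avg) / spread > z_limit
        # Breaks an absolute threshold, if one was given.
        over_threshold = threshold is not None and value > threshold
        if far_from_rest or over_threshold:
            outliers[rank] = value
    return avg, outliers


if __name__ == "__main__":
    # Hypothetical per-rank metric values; rank 8 is the slow one.
    sample = {0: 10, 1: 11, 2: 9, 3: 10, 4: 10, 5: 10, 6: 11, 7: 9, 8: 40}
    avg, outliers = find_outliers(sample)
    print(f"average={avg:.2f} outliers={outliers}")
```

In this sample the single slow rank pulls the average well above what the other ranks report, which is the distortion the outlier report is meant to expose.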
- User Interface
- Enhanced Project Navigator User Experience: The Project Navigator pane now features menu options to open a new or existing project to better facilitate your VTune Profiler experience.
- Code Parallelism
- Improvements to Vectorization Information: The Vectorization sections of Performance Snapshot and HPC Performance Characterization analyses have been enriched to provide a clearer picture of the state of vectorization in your application. Quickly see if your code is not vectorized at all, if your code does not use the latest vector instruction set extension, or if your code has too many scalar instructions. This version of VTune Profiler also features improved recommendations to resolve vectorization issues.
- Hardware Support
- Support for the 3rd Gen Intel® Xeon® Scalable processors (code named Ice Lake Server): This release introduces full support for the Ice Lake Server architecture in the Input and Output analysis.
- GPU Accelerators
- Advanced Data Transfer Information in GPU Offload Analysis: The following additions to the Graphics window better clarify the data transfer between the CPU host and the GPU device when you run GPU profiling analyses:
- Allocation time information displays as part of total time by device operation.
- The Data Transferred table has been renamed to the Transfer Size table. Columns under Transfer Size feature new names for data transferred between host and device.
- Highlights and tooltips for workloads with sub-optimal offload schemes direct your attention to where the offload schema needs improvement.
- Improved Tooltips for Occupancy Metrics in GPU Analysis: The GPU Compute/Media Hotspots Analysis has been enhanced to detect factors that limit peak achievable occupancy for the hottest computing tasks that make the EU array idle when waiting for the scheduler. Improved tooltips for occupancy metrics now provide information about peak occupancy and bounding reasons for the existing computing task launch configuration.
- GPU Analysis Coverage for Self-Check: Coverage of checks by the self-check functionality in VTune Profiler now includes GPU analyses as well. Run the vtune-self-checker.sh script on Windows and Linux systems to check the GPU Compute/Media Hotspots Analysis in source analysis and characterization modes when you run DPC++ applications on an Intel GPU. You must install the Intel® oneAPI Base Toolkit for this purpose.
- Application Performance Snapshot
- Metric tooltips in HTML reports: Metric tooltips in APS HTML reports now present a more holistic view of metrics and their properties. The new tooltips present a compact yet comprehensive overview of a metric, which helps you to better understand the importance of metrics in performance analysis. This change includes a visual bar that indicates where the metric value stands in terms of current performance and tuning potential.
- PCIe bandwidth info in CLI reports: APS command line reports now include PCIe bandwidth metrics. This data is only available on server platforms when using the Sampling Driver.
- New reports and filters: APS now features the following new types of reports and filters:
- Node topology report: view relations between ranks, nodes, and PCIe devices.
- Metrics report: get a configurable table that displays any collected metric for each rank, node, or device.
- Ability to filter data by node.
- Managed Code Targets: This release introduces support for running the Hotspots analysis on .NET 5 targets in Launch Application mode when using hardware event-based sampling.
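A launch-mode collection of the kind described above could be scripted roughly as follows. This Python sketch only assembles the command line; the helper name, the result directory, and the assembly name `MyApp.dll` are hypothetical, and the sketch assumes the application is started through the `dotnet` host.

```python
import shlex


def dotnet_hotspots_cmd(app_dll, result_dir="r_dotnet_hs"):
    """Build a Hotspots command line using hardware event-based sampling.

    Hypothetical helper; the application is assumed to run via `dotnet`.
    """
    return [
        "vtune", "-collect", "hotspots",
        "-knob", "sampling-mode=hw",   # hardware event-based sampling
        "-result-dir", result_dir,
        "--", "dotnet", app_dll,       # Launch Application mode target
    ]


if __name__ == "__main__":
    # Print the command rather than executing it.
    print(shlex.join(dotnet_hotspots_cmd("MyApp.dll")))
```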
- Hardware Support
- Support for 3rd Gen Intel® Xeon® Scalable Processor Architecture: This release supports the 3rd Gen Intel® Xeon® Scalable processor architecture (code named Ice Lake Server).
- User Interface
- This release introduces a new main vertical toolbar to enhance your user experience. All controls previously located in the main horizontal toolbar are now located on this toolbar. The vertical toolbar is designed to enhance your experience with clear, bright controls.
- Hardware Support
- This version includes support for Intel Atom® Processor P Series code named Snow Ridge, including Hotspots, Microarchitecture Exploration, Memory Access, and Input and Output analyses.
- GPU Accelerators
- Source-level analysis for DPC++ and OpenMP applications running on GPU over Level Zero: Additional modes in the GPU Compute/Media Hotspots analysis are now available when profiling Level Zero applications. Support also includes full-scale analysis of the kernel source per code line, including Source/Assembly mapping.
- Input and Output Analysis
- New major features in Input and Output analysis
- This release introduces the Platform Diagram, a new starting point for the Input and Output analysis. It reveals system topology and high-level utilization metrics for hardware resources including PCIe devices, Intel® Ultra Path Interconnect, and memory. It enables you to examine the utilization of your hardware at a glance. This feature is enabled for 1st and 2nd Generation Intel® Xeon® Scalable Processors in up to four-socket configurations, excluding the Intel® Xeon® Platinum 9200 series processors code named Cascade Lake AP. This feature is also supported on Intel Atom® Processors P Series code named Snow Ridge.
- Intel® Data Direct I/O (Intel DDIO) utilization efficiency metrics are extended with average Inbound PCIe read/write latency and a core/IO contention indicator.
- It is now possible to perform Linux perf-based data collection without root access on 1st and 2nd Generation Intel Xeon® Scalable Processors on Linux kernel versions 5.10 and newer.
- Software Enhancement
- Fix for an issue where command line analysis based on User-Mode Sampling does not work when using a non-root account: If VTune Profiler was installed by a root/sudo user, some executable files required root permissions to run an analysis based on the User-Mode Sampling collector, such as the Hotspots analysis. This issue has been fixed in this release.
- Guidance resource on GPU-profiling features in Intel® VTune™ Profiler: A new article captures learning pathways to profile GPUs and illustrates techniques to Optimize Applications for Intel® GPUs with Intel® VTune™ Profiler. Use this article to understand the Intel® VTune™ Profiler workflow to profile and optimize GPUs. The article also points to several key resources, including procedural topics, cookbook recipes, and webinars that explain GPU compute profiling and graphics profiling with Intel software analyzer products.
- GPU Accelerators
- GPU Adapter Selection for Profiling Analyses in Multi-GPU Systems: When you have multiple Intel GPUs connected to your system, you can now select a specific GPU adapter directly in the user interface for your GPU Offload Analysis or GPU Compute/Media Hotspots Analysis. The Target GPU pulldown menu appears in the HOW pane of the analysis configuration when VTune Profiler detects multiple Intel GPUs on your system. The menu lists available GPU adapters with their Bus/Device/Function (BDF) values.
- Energy Consumption Metrics in GPU Compute/Media Hotspots Analysis: When you run the GPU Compute/Media Hotspots Analysis on an Intel® Iris® Xe MAX graphics discrete GPU in a Linux environment, you can now use the Analyze power usage option to collect information about energy consumed by the GPU. The analysis results display energy consumption metrics over time and per discrete GPU kernel. Use this data to monitor power usage alongside processing time and optimize for either purpose.
- Data Transfer Information in GPU Offload Analysis: Kernel information in the GPU Offload Analysis combines data transfer times to and from the GPU kernel with the execution time. In the Summary window, you can now see the total time for computing tasks along with the execution time. Previously, this display included only the execution time. In the Graphics window, the total time for a computing task by kernel now combines the data transfer time between device and host as well as the actual execution time. The Graphics window now also displays information about the size of data transfers between the host (CPU) and the device (GPU).
- Update to IP Architecture diagram: The IP Architecture Diagram of the GPU Compute/Media Hotspots analysis is renamed as the Memory Hierarchy Diagram. The diagram features a new design that can help make the understanding of metrics more intuitive. The diagram also displays the same markers to highlight metrics as the ones used to indicate performance or data issues in the Summary and Grid displays. This provides a consistent look and feel to the diagram and helps you correlate metrics between both displays.
- SIMD utilization metrics at kernel level: The GPU Compute/Media Hotspots analysis in the Dynamic Instruction Count mode now includes SIMD utilization metrics at the kernel and instruction level. These metrics help identify instructions in the OpenCL kernel that utilize SIMD poorly.
- GPU metrics in APS and HPC Analysis type: The GPU utilization analysis in Application Performance Snapshot (APS) and the HPC Performance Characterization analysis now includes these GPU computation metrics:
- GPU Time
- GPU IPC
- GPU Utilization
- Percentage of stalled and idle EUs
The GPU Compute metric set of Application Performance Snapshot has been enhanced with OpenMP Offload Efficiency metrics, including offload region overhead. These metrics are available for binaries compiled with the Intel® C/C++ Compiler included in several Intel® oneAPI Toolkits 2021.1-beta05 or newer.
- Simplified dependency on Intel® Metrics Discovery API library: There is now a simplified dependency on the Intel® Metrics Discovery API library to collect GPU hardware statistics on Linux* systems. Intel® VTune™ Profiler now automatically selects the latest libstdc++ available at runtime to satisfy the GPU analysis requirements. For older versions of the product, follow procedures to enable manual configuration.
- Extension to Command Line Analysis: The report generated when you run analysis from the command line now includes GPU analysis data. Apply the computing-task and computing-instance groupings to your collected data to focus on time-consuming computing tasks.
- Dynamic Instruction Count Collection in GPU Compute/Media Hotspots Analysis: The GPU Compute/Media Hotspots analysis has been improved to include Dynamic instruction count collection. The analysis results provide better accuracy for basic block Assembly analysis.
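The computing-task and computing-instance groupings mentioned in the command line entry above might be applied as in this Python sketch, which builds one report command per grouping. The helper name and the result directory `r000gh` are hypothetical, and using the `hotspots` report type for GPU results is an assumption.

```python
import shlex


def gpu_report_cmds(result_dir, groupings=("computing-task", "computing-instance")):
    """Build one vtune report command line per GPU grouping.

    Hypothetical helper; the report type is assumed to be `hotspots`.
    """
    return [
        ["vtune", "-report", "hotspots",
         "-result-dir", result_dir,
         "-group-by", grouping]
        for grouping in groupings
    ]


if __name__ == "__main__":
    # Print the commands rather than executing them.
    for argv in gpu_report_cmds("r000gh"):
        print(shlex.join(argv))
```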
- FPGA Accelerators
- Multiple enhancements to CPU/FPGA Interaction Analysis: The CPU/FPGA Interaction analysis type features several new additions to enhance your FPGA profiling experience.
- Analysis results now display Activity percentage and Idle percentage metrics to describe the proportion of cycles when a channel instruction was enabled or absent.
- The analysis type can now profile loops and display occupancy information for them.
- You can now adjust the depth of channels using the Average depth and Maximum depth information that displays in the analysis results.
- Performance Summary:
- Performance Snapshot Analysis Type for Quick Summary: Use Performance Snapshot as the starting point for your performance analysis. Get a quick overview of issues that affect your application performance. Performance Snapshot provides recommendations for next steps to help you select other analyses for deeper profiling. It also characterizes the workload on the system.
- Algorithm Group
- Anomaly Detection Analysis for Performance Anomalies: Use the Anomaly Detection analysis type in the Algorithm group to detect performance anomalies in frequently recurring code intervals, including loop iterations. Anomaly Detection uses Intel® Processor Trace (Intel® PT) technology to perform detailed analysis at the microsecond level. These are some metrics that get highlighted in analysis results when Intel® VTune™ Profiler identifies a performance anomaly:
- Instructions Retired
- Kernel CPU Time
- User CPU Time
- Inactive/Wait Time
- CPU Frequency
Anomaly Detection can also detect hypervisors that do not have support for processor trace virtualization through Intel® Processor Trace (Intel® PT).
- Support for OpenMP Offload in HPC Analysis: The HPC Performance Characterization analysis type supports the offload of OpenMP regions. The summary pane now includes a breakdown of OpenMP offload time by Compute, Data Transfer, and Overhead. The bottom-up pane now allows grouping by OpenMP Offload Region. With this grouping active, the grid displays several new columns. The timeline shows scale markers that indicate the span of OpenMP offload regions and OpenMP operations internal to those regions.
- I/O Analysis
- Improvements and Changes to Input and Output Analysis
- The Input and Output analysis type features a new methodology for locating sources of reads and writes targeting Memory-Mapped I/O (MMIO) address space regions to which I/O devices are mapped. Such MMIO reads and writes are expensive loads and stores resulting in Outbound PCIe traffic.
- The collection of source-level Memory Mapped I/O (MMIO) data in the Input and Output analysis supports InfiniBand* devices.
- Platform I/O metrics can now be attributed to individual devices managed by Intel® VMD technology.
- Per-device metrics are now available when running Input and Output analysis as a non-root user, as long as the sampling driver is loaded.
- Enhanced profiling for servers based on Intel® processor microarchitectures codenamed Skylake and Cascade Lake by highlighting code that potentially performs MMIO reads.
- This analysis type features Inbound PCIe Read/Write L3 Hit/Miss Ratio metrics that show the utilization efficiency of Intel® Data Direct I/O (Intel® DDIO) hardware technology. There are new metrics for Intel® Xeon® Scalable processors that allow data breakdown by PCIe devices. Input and Output analysis is deprecated in the Windows version of Intel® VTune™ Profiler.
- Energy Analysis
- Rootless Data Collection on Linux Systems: You do not require root privileges to run energy analysis using Intel® VTune™ Profiler in a Linux environment. You can run this analysis without root privileges once your system administrator installs sampling drivers for Intel® VTune™ Profiler and configures relevant permissions for the drivers. Administrator privileges are required to collect energy data on Windows machines.
- Processor Package Energy Consumption: Options for Energy analysis, based on the Intel SoC Watch data collector, have been extended to monitor processor package energy consumption over time and identify how it correlates with CPU throttling.
- Platform Analysis:
- Enhancements to System Overview Analysis: Use the System Overview analysis as an entry point to platform analysis. Assess your system (IO, accelerators and CPU) performance and get guidance for further analysis steps.
- The System Overview analysis can display energy consumption data. Enable the Analyze energy usage option to get energy consumption characterization on the Summary tab with the total energy consumed by CPU packages and DRAM, as well as energy consumption data over time on the Platform tab.
- The Hardware Tracing mode in the System Overview analysis enables application analysis at the microsecond level and helps you to identify causes for latency. These are some metrics you can collect:
- User/kernel metrics
- OS Kernel Activity
- OS Scheduling
- Thread/Hardware grouping
- Module entry points
- Improvements to Platform Profiler:
- Overview and Memory views are extended with new metrics to analyze Non-Uniform Memory Access (NUMA) behavior
- User authentication and authorization has been added to enable access control to your data
- There is a new option to choose or modify the location of Platform Profiler data files
- VTune Profiler Server for HPC Environment
- A quality-of-life improvement was added to VTune Profiler Server. If you use the VTune Profiler CLI to run data collection using a scheduler in an HPC cluster and put the results into a mounted shared location, you can now point VTune Profiler Server to an arbitrarily structured folder in this shared location. VTune Profiler now discovers all results in a directory and allows you to seamlessly navigate your arbitrary folder structure and open any result.
- HPC Analysis
- Application Performance Snapshot includes Max and Bound Bandwidth metrics to better estimate the efficiency of the DRAM, MCDRAM, Intel Persistent Memory and Intel® Omni-Path usage
- Cloud and Containerization
- This release extends container profiling capabilities to display the container name instead of its ID for ease of identification.
- You can profile applications running in Amazon Web Services* (AWS) EC2 Instances based on Intel microarchitecture code name Cascade Lake X.
- Connection Types
- New TCP/IP Communication AgentUse the TCP/IP communication agent as a connection type to profile embedded systems running real-time operating systems. You can profile the kernel of an arbitrary real-time operating system and the applications running on it. This requires the development of a custom agent (Analysis Communication Agent). A reference solution based on Linux OS is available through the Analysis Communication Agent GitHub* repository. Detailed information on developing an agent for a specific real-time operating system is available in the ACA documentation.
- Remote Linux (SSH) Connection Type: The Remote Linux (SSH) connection type has been improved to make automated target package deployment more transparent. VTune Profiler now checks for the presence of the target package on the remote system and offers to deploy the package automatically with a single click if the package is not found.
- Quality and Usability
- Symbol resolution for effective source-level analysis enabled for crossgen (Ahead-of-JIT compilation) functions on Linux* systems
- Interactive Help Tour available from the Welcome page, guiding you through the product interface using a sample project
- Third-party components have been updated to the most recent versions to include functional and security changes. We recommend that you update your product to the latest version.
- Profiling Support for OpenSHMEM Applications: Use the Fabric Profiler feature in VTune Profiler to identify detailed characteristics of the runtime behavior for an OpenSHMEM application.
- Profiling Support for Applications Annotated with ITT API
- Profiling Remote Amazon Web Services* Instances
- Remote profiling of applications running in Amazon Web Services* (AWS) EC2 instances is now supported.
- Support for DPC++ Applications
- Demangling of Lambda Functions: This release implements the demangling of DPC++ lambda function names, which are used as DPC++ kernel names.
- Analysis Configuration:
- Wrapper Script Option for Quick Profiling Environment Setup: Use the Wrapper script to run a custom set of commands to prepare the profiling environment before you start analysis in the environment. For example, you can create a script with a custom set of commands that sets environment variables. Include the custom set in the WHAT pane when you configure the analysis. The commands get executed on the target system before the analysis begins. You can also provide the wrapper script through the command-line interface by using the --wrapper-script-path option.
- PDF version of User Guide: The Intel® VTune™ Profiler User Guide is available in PDF format as well as HTML. If you are viewing this content online, click Download as PDF at the top of this page to use the PDF version.
- Support for Data Parallel C++ (DPC++) code profiling added across CPUs and multiple accelerator architectures, including GPUs and FPGAs
- GPU Offload and GPU Compute/Media Hotspots types extended to support profiling DPC++ code and OpenMP* code offloaded to the GPU
- GPU Time and Utilization metrics added to Application Performance Snapshot to help you triage your performance issues and identify whether your code is CPU or GPU bound