Profiling Hardware Without Intel Sampling Drivers
This collection of recipes helps you set up driverless Linux* Perf*-based performance profiling with Intel® VTune™ Profiler and understand its benefits and the workarounds for possible limitations.
Intel processors provide performance monitoring unit (PMU) events that can be used to analyze how effectively your code utilizes hardware resources.
VTune Profiler can collect and analyze PMU events for microarchitecture analysis types, such as HPC Performance Characterization, Memory Access, and Microarchitecture Exploration. If required, you can also configure the Hotspots and Threading analysis types to use PMU event-based sampling instead of the default user-mode sampling; for example, if your analysis requires a smaller sampling interval.
For PMU event-based analysis, VTune Profiler uses Intel sampling drivers, which require administrative privileges to install on a target system. If you do not have administrative privileges, or your environment does not allow third-party drivers to be installed on the system, VTune Profiler cannot access PMU events via Intel sampling drivers to determine hardware performance bottlenecks. For such cases, VTune Profiler has adopted hardware performance monitoring capabilities through the built-in Linux Perf performance monitoring system.
VTune Profiler enables the hardware event-based sampling analysis in the Perf driverless mode if:
- Intel sampling drivers cannot be installed (for example, if installed without root privileges).
- Collection with stacks is selected with a non-zero stack size and the requirements for driverless collection are satisfied.
- The option to use driverless collection is enabled and the requirements for driverless collection are satisfied.
The latest versions of VTune Profiler have extended the Linux Perf (driverless) support, providing profiling functionality, collection overhead, and trace size competitive with the Intel sampling driver-based solution. However, VTune Profiler capabilities in the driverless mode depend on your Linux OS configuration and might have limitations, which are described in the recipes below.
- To enable the Perf driverless collection to match all the hardware profiling functionality provided with Intel drivers, you will need administrative privileges to configure system options as described below.
- To check which collector type, Perf or Intel sampling driver (SEP), was used for your analysis, see the Collection and Platform Info section of the Summary window.
INGREDIENTS: Intel VTune Profiler (or its previous version, Intel VTune Amplifier 2019) can use the driverless mode if the following requirements are satisfied:
- Core and uncore events. All hardware event-based collections in VTune Profiler use core PMU events. Some of them, such as the Memory Access and IO analysis types, require access to uncore events that enable collecting metrics like DRAM bandwidth, QPI/UPI bandwidth, PCI bandwidth, and others.
- Perf for Linux kernel 2.6.32 and higher. PMU events are exposed by the Linux kernel through the /sys/bus/event_source/devices/cpu and /sys/bus/event_source/devices/uncore_* directories. Empty directory content may indicate that the system configuration does not support PMU event collection. In this case, either update the OS or install the Intel sampling driver.
- The /proc/sys/kernel/perf_event_paranoid value is equal to or less than 1.
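The requirements above can be checked by hand before starting a collection. The following is a minimal sketch, not part of VTune Profiler: the function names and the SYSFS_ROOT and PROC_ROOT overrides are hypothetical and exist only so the checks can be pointed at a test directory.

```shell
#!/bin/sh
# Sketch: check the driverless-mode requirements listed above.
# SYSFS_ROOT and PROC_ROOT are hypothetical overrides (not VTune settings).
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
PROC_ROOT="${PROC_ROOT:-/proc}"

# Succeeds if the directory exists and contains at least one entry.
dir_has_entries() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Prints one status line per requirement.
check_driverless_prereqs() {
    devices="$SYSFS_ROOT/bus/event_source/devices"

    if dir_has_entries "$devices/cpu"; then
        echo "core PMU: exposed"
    else
        echo "core PMU: missing or empty"
    fi

    uncore="missing or empty"
    for d in "$devices"/uncore_*; do
        if dir_has_entries "$d"; then uncore="exposed"; break; fi
    done
    echo "uncore PMU: $uncore"

    paranoid_file="$PROC_ROOT/sys/kernel/perf_event_paranoid"
    if [ -r "$paranoid_file" ]; then
        value=$(cat "$paranoid_file")
        if [ "$value" -le 1 ]; then
            echo "perf_event_paranoid: $value (ok)"
        else
            echo "perf_event_paranoid: $value (too restrictive)"
        fi
    else
        echo "perf_event_paranoid: unreadable"
    fi
}

check_driverless_prereqs
```

A "missing or empty" line for the core PMU corresponds to the case described above, where the OS must be updated or the Intel sampling driver installed.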
RECIPES FOR LIMITATIONS:
You can run the vtune-self-checker.sh script provided with VTune Profiler to validate product capabilities on your analysis system. The script runs a representative set of analysis types on a stable benchmark and reports the limitations VTune Profiler encountered on the system. The recommendations in these diagnostics may help you configure your system for the driverless Perf collection, or suggest installing the Intel sampling driver if the system configuration cannot help. To run the script, enter:
<vtune_install_dir>/bin64/vtune-self-checker.sh
Enable System Wide or User Process Profiling
Analysis types: all.
Concepts: System-wide analysis collects performance information about all processes running on the system, including system services.
Driverless mode limitations: Additional configuration is required to enable system-wide or user process profiling.
To enable system-wide analysis in the driverless mode:
- Configure a VTune Profiler project and, from the WHAT pane, select either the Profile System target or the Launch Application target with the Analyze system-wide option enabled.
- Check the /proc/sys/kernel/perf_event_paranoid file value with the following command:
  cat /proc/sys/kernel/perf_event_paranoid
  If the value is less than 1, VTune Profiler can proceed with the system-wide collection.
- If the perf_event_paranoid value is equal to 1 (which limits the collection to user processes only) or more than 1 (which prevents VTune Profiler from using the Perf driverless mode), set the perf_event_paranoid value to 0 for the system-wide collection:
  echo 0 > /proc/sys/kernel/perf_event_paranoid
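The three cases in the steps above can be summarized as a small lookup. This is an illustrative sketch only; the paranoid_scope function and its output labels are hypothetical names, not VTune Profiler output.

```shell
#!/bin/sh
# Sketch: map a perf_event_paranoid value to the collection scope
# described in the steps above.
paranoid_scope() {
    if [ "$1" -lt 1 ]; then
        echo "system-wide"
    elif [ "$1" -eq 1 ]; then
        echo "user-processes-only"
    else
        echo "driverless-unavailable"
    fi
}

# Report the scope for the current system, if the setting is readable.
paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    value=$(cat "$paranoid_file")
    echo "perf_event_paranoid=$value -> $(paranoid_scope "$value")"
fi
```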
In some environments, perf_event_paranoid is regulated by the security policy. For more information about Linux Perf security requirements, see https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html.
Intel sampling driver limitations: By default, the Intel sampling driver supports system-wide collection. However, if it is built and loaded with the --per-user option, the collection is limited to user processes only.
Enable Core and Uncore Event Collection
Analysis types: Memory Access, HPC Performance Characterization, and other analysis types based on uncore events.
Core events can be collected both system-wide and for user processes. To collect uncore events in the Perf driverless mode, enable system-wide analysis.
Driverless mode limitations:
- Memory Access analysis requires access to uncore events and will not run without ability to collect them. Other analysis types, like HPC Performance Characterization, will run but miss metrics based on uncore events such as DRAM Bandwidth, OPA Interconnect Bandwidth, and Packet Rate.
- Uncore collection in the driverless mode is not supported on Intel Atom® processors.
To collect uncore events in the driverless mode:
Set the perf_event_paranoid value to 0 to enable system-wide performance monitoring, which is a prerequisite for the uncore event collection.
Intel sampling driver limitations: none.
Enable Multi-Process Profiling
Analysis types: all.
By default, the Linux kernel limits the size of the memory available for capturing performance data to 516 KB. To check the current value, enter:
cat /proc/sys/kernel/perf_event_mlock_kb
Driverless mode limitations: For some parallel applications (for example, MPI applications with multiple ranks per node) in the user process collection mode on multi-core systems (more than 64 logical cores), the 516 KB limit tends to be reached, and data collection becomes unavailable.
To enable multi-process profiling on a multi-core system:
Set the perf_event_paranoid value to 0 to enable system-wide performance monitoring.
Intel sampling driver limitations: none. Any number of processes with default settings can be profiled.
Profile a Large Number of PMU Events on Multi-Core Systems
Analysis types: Microarchitecture Exploration.
Driverless mode limitations: Linux Perf allocates a file descriptor for every configured PMU event on each CPU. On a multi-core system with a long event list, as used by analyses such as Microarchitecture Exploration, the limit on open files is easily reached, which can prevent collection in the driverless mode.
To support profiling a large number of PMU events in the driverless mode:
- Check the limit of open files:
  ulimit -n
- If required, increase the limit in the /etc/security/limits.conf file. To do this, you must have administrator privileges. Increase the limit by adding or changing these lines (the particular numbers are chosen as examples):
  * soft nofile 65535
  * hard nofile 65535
- If you increased the limit in step 2, log out of the shell, or close it and reopen the secure shell connection, and then log back in. With administrator privileges, you can set the limit for a specific user. The change becomes visible when the user logs in again.
For more information on using the limits.conf file, see http://man7.org/linux/man-pages/man5/limits.conf.5.html.
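Since Perf needs a descriptor per event per CPU, a rough lower bound on the required limit is the number of events multiplied by the number of logical CPUs. The sketch below illustrates that arithmetic under this assumption; the function names and the event count of 200 are hypothetical, and the real number of descriptors a collection needs may be higher.

```shell
#!/bin/sh
# Sketch: estimate descriptors as events x CPUs
# (one descriptor per event per CPU, as described above).
estimate_perf_fds() {
    # $1 = number of PMU events, $2 = number of logical CPUs
    echo $(( $1 * $2 ))
}

# Succeeds if the estimate fits within the given open-file limit.
fds_fit_limit() {
    # $1 = events, $2 = CPUs, $3 = limit ("unlimited" always fits)
    [ "$3" = "unlimited" ] && return 0
    [ "$(estimate_perf_fds "$1" "$2")" -le "$3" ]
}

events=200   # hypothetical event count, for illustration only
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
limit=$(ulimit -n)
if fds_fit_limit "$events" "$cpus" "$limit"; then
    echo "$events events on $cpus CPUs fit within ulimit -n = $limit"
else
    echo "need about $(estimate_perf_fds "$events" "$cpus") descriptors; ulimit -n = $limit"
fi
```

For example, 100 events on 128 logical CPUs already exceed a limit of 1024 by a wide margin.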
Intel sampling driver limitations: none.
Enable Stack Sampling
Analysis types: Hotspots (Hardware Event-Based Sampling mode), Threading (Hardware Event-Based Sampling and Stack Stitching mode), HPC Performance Characterization (Collect stacks option enabled), GPU Compute/Media Hotspots (Collect stacks option enabled).
Driverless mode limitations:
- The default 1024-byte stack size may not be enough for full stack unwinding if a function allocates a lot of data on the stack. This may lead to [Skipped stack frame(s)] displayed in the collected data.
- Linux kernel versions older than 3.7 support only frame-pointer (FP) based stack unwinding. This means that VTune Profiler cannot provide stacks for binaries built without frame pointers (-fomit-frame-pointer compiler option), or Glibc stacks, since Glibc is built without frame pointers.
To avoid issues with stack unwinding in the driverless mode:
Increase the stack size. For example:
vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -knob stack-size=2048 <application>
Otherwise, switch to the Intel sampling driver, setting the Stack size option to 0 (unlimited value).
Since stack sampling collection with the Intel sampling driver depends on the kernel implementation, it usually requires an update for a new kernel version, which may bring additional product maintenance cost. To reduce the cost, VTune Profiler (and its predecessor, VTune Amplifier 2019 Update 4 and higher) uses the driverless mode for all analysis types with stack collection enabled, even when the Intel sampling driver is loaded. If you need to switch to the Intel sampling driver for stack sampling collection, create a custom analysis type and disable the Enable driverless collection option, or use the corresponding command line configuration:
vtune -collect-with runsa -knob enable-driverless-collection=false -knob event-config=<event-list> <application>
Intel sampling driver limitations: No limitation for the stack unwinding, since the Intel sampling driver uses a different algorithm of call stack collection. The driver may require an update if your kernel version is newer than the latest kernel version supported by VTune Profiler.
Collect Context Switches
Analysis types: Threading.
Concepts: Context switch collection helps expose metrics based on thread Inactive Wait time resulting from either synchronization or thread preemption.
Driverless mode limitations: Linux Perf collects context switches on kernel version 4.3 and higher. Identification of the context switch reason (synchronization or preemption) is available from kernel version 4.17. For older kernel versions, VTune Profiler switches the collection to the Intel sampling driver if it is available on the system.
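To see which of these cases applies to a given system, you can compare the kernel release against the two version thresholds named above. The following is a rough sketch; the function names and output labels are hypothetical, and it looks only at the major.minor part of the release string, so it does not account for distribution kernels with backported features.

```shell
#!/bin/sh
# Succeeds if release $1 (for example, 5.15.0-91-generic) is at least
# version $2, given as major.minor.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%[!0-9]*}
    need_major=${2%%.*}
    need_minor=${2#*.}
    [ "$have_major" -gt "$need_major" ] ||
        { [ "$have_major" -eq "$need_major" ] && [ "$have_minor" -ge "$need_minor" ]; }
}

# Sketch: report Perf context switch support for a kernel release,
# using the 4.3 and 4.17 thresholds from the text above.
perf_context_switch_support() {
    if version_at_least "$1" 4.17; then
        echo "context switches with reason"
    elif version_at_least "$1" 4.3; then
        echo "context switches without reason"
    else
        echo "not supported by Perf"
    fi
}

perf_context_switch_support "$(uname -r)"
```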
Intel sampling driver limitations: none.
Resolve Symbols for Kernel Functions
Analysis types: all.
Driverless mode limitations: Additional manual configuration of the kptr_restrict file is required.
To associate performance data with kernel function names:
Set the kptr_restrict configuration file value to 0 as a system administrator:
echo 0 > /proc/sys/kernel/kptr_restrict
Setting the value to 1 limits the file name resolution to user-level modules.
Intel sampling driver limitations: none. The Intel driver resolves kernel symbols if the /boot/System.map-<kernel_version> file is accessible for reading or /proc/sys/kernel/kptr_restrict is set to 0.
Avoid Resource Contention with the NMI Watchdog
Analysis types: all.
Driverless mode limitations: The NMI watchdog (a hard lockup detector) uses one CPU performance counter register, which becomes unavailable for Linux Perf. This can increase the number of multiplexing groups and, as a result, impact the accuracy of statistical sampling data.
To improve the accuracy of analysis runs with long events lists in the driverless mode:
Disable the NMI watchdog, using administrative privileges:
echo 0 > /proc/sys/kernel/nmi_watchdog
When the driverless Perf collection is complete, you can re-enable the NMI watchdog (using the administrative privileges):
echo 1 > /proc/sys/kernel/nmi_watchdog
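The disable, collect, and re-enable sequence above can be wrapped so that the previous watchdog setting is restored even when the collection fails. This is only a sketch: the function name and the NMI_WATCHDOG_FILE override are hypothetical, writing to the real file requires administrative privileges, and the wrapper does not restore the setting if it is interrupted by a signal.

```shell
#!/bin/sh
# Sketch: run a command with the NMI watchdog disabled, then restore
# the value that was set before.
# NMI_WATCHDOG_FILE is a hypothetical override for trying the logic
# on a scratch file instead of the real setting.
NMI_WATCHDOG_FILE="${NMI_WATCHDOG_FILE:-/proc/sys/kernel/nmi_watchdog}"

with_nmi_watchdog_disabled() {
    saved=$(cat "$NMI_WATCHDOG_FILE") || return 1
    echo 0 > "$NMI_WATCHDOG_FILE" || return 1
    "$@"
    status=$?
    # Restore the previous setting even if the command failed.
    echo "$saved" > "$NMI_WATCHDOG_FILE"
    return "$status"
}

# Usage (as root), with a collection command from this document:
#   with_nmi_watchdog_disabled vtune -collect hotspots -knob sampling-mode=hw <application>
```

Saving the old value rather than always writing 1 keeps the watchdog off on systems where it was already disabled.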
Intel sampling driver limitations: none. The Intel driver automatically stops the NMI watchdog for the collection time to avoid this problem with data accuracy.
Reduce Collection Overhead
Analysis types: all.
Driverless mode limitations: Linux Perf collection may incur overhead on CPU-intensive applications, since it fully loads all CPUs.
To reduce the collection overhead in the driverless mode:
- To reduce the trace size for stack sampling collections, VTune Profiler uses Linux Perf trace compression, which may introduce additional overhead. To avoid this, disable the trace compression with the -run-pass-thru option:
  vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -run-pass-thru=--perf-compression=0 <application>
  This can reduce collector overhead in rare cases, but the trace size increases dramatically.
- In some real-time and telecom applications, the default per-CPU trace collection mode can cause collection overhead. To overcome this, disable the per-CPU trace collection mode with the -run-pass-thru option:
  vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -run-pass-thru=--perf-threads=none <application>
- Set the limit of CPU time consumption by the Linux Perf collector. For example, for a 10% limit, use the following command (with administrative privileges):
  echo 10 > /proc/sys/kernel/perf_cpu_time_max_percent
  To stay within the limit, the sampling frequency may drop, which reduces statistical accuracy.
Intel sampling driver limitations: none.
Enable Using Driverless Mode When Required
If the Intel sampling driver is loaded, VTune Profiler uses it in all cases except stack sampling collection. To make VTune Profiler use the driverless Perf mode for sampling without stacks, create a custom analysis type and select the Enable driverless collection option in the GUI, or set the command line knob value to enable-driverless-collection=true as follows:
vtune -collect-with runsa -knob enable-driverless-collection=true -knob event-config=<event-list> <application>
The option is available starting with the VTune Amplifier 2019 Update 4.
Enable Profiling Capabilities for the Group
Driverless mode limitations: Setting the perf_event_paranoid option to a lower value could be inappropriate, as this option applies to all users. Instead, you could set Linux capabilities for a specific user group and binary.
To enable capabilities for the Perf tool:
The required capabilities cannot be assigned for a file system mounted with the nosuid option, or if the file system does not support extended file attributes.
- For Linux kernel versions older than 5.8, use CAP_SYS_ADMIN. To set up this configuration, run the vtune-set-perf-caps.sh script with this parameter:
  vtune-set-perf-caps.sh -v cap_sys_admin
- For Linux kernel versions 5.8 and newer, use CAP_PERFMON. To set up this configuration, run the vtune-set-perf-caps.sh script with this parameter:
  vtune-set-perf-caps.sh -v cap_perfmon
Alternatively, you can set up this configuration manually:
- Create a vtune group for privileged amplxe-perf users.
- Assign the vtune group to the Perf tool executable.
- Restrict access to the executable to only those users who are in the vtune group.
  # cp amplxe-perf amplxe-perf-priv
  # groupadd vtune
  # chgrp vtune amplxe-perf-priv
  # chmod o-rwx amplxe-perf-priv
- Assign the required capabilities to the Perf tool executable.
  # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" amplxe-perf-priv
  # getcap amplxe-perf-priv
  amplxe-perf-priv = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
  If the installed libcap does not support cap_perfmon, use 38 instead:
  # setcap "38,cap_sys_ptrace,cap_syslog=ep" amplxe-perf-priv
  # getcap amplxe-perf-priv
  amplxe-perf-priv = cap_sys_ptrace,cap_syslog,38+ep
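The choice between the two script parameters depends only on the kernel version, so it can be derived from the kernel release string. The sketch below assumes the 5.8 threshold stated above; the perf_capability_for_kernel function is a hypothetical helper, and the script only prints the suggested command rather than running it.

```shell
#!/bin/sh
# Sketch: choose the capability parameter for vtune-set-perf-caps.sh
# from a kernel release string such as 5.15.0-91-generic.
perf_capability_for_kernel() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%[!0-9]*}
    if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 8 ]; }; then
        echo cap_perfmon
    else
        echo cap_sys_admin
    fi
}

# Print the command for the running kernel without executing it.
cap=$(perf_capability_for_kernel "$(uname -r)")
echo "vtune-set-perf-caps.sh -v $cap"
```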
For more information on Linux capabilities, see https://man7.org/linux/man-pages/man7/capabilities.7.html.
For more information on perf_event access control, see https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html.