knob
Set configuration options for the specified analysis type or collector type.
GUI Equivalent
Configure Analysis window > HOW pane
Syntax
-knob | -k <knob-name>=<knob-value>
Arguments
- knob-name
- An analysis type or collector type may have one or more configuration options (knobs) that provide additional instructions for performing the specified type of analysis. To use a knob, you must specify the knob name and knob value. Multiple knob options are allowed and can be followed by additional action-options, as well as global-options, if needed.
- knob-value
- Each knob accepts its own set of values. In most cases the value is Boolean; for Boolean knobs, specify <knob-name>=true to enable the knob.
Knob behavior may vary depending on the analysis type
or collector type.
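As a minimal sketch of the syntax, the snippet below assembles a Hotspots command line that passes two knobs, one with -knob and one with the short -k form. The ./my_app path is a hypothetical placeholder, and the -- separator before the application is an assumption not shown on this page. The snippet only prints the command so that it can be reviewed before anything is launched.

```shell
# Hypothetical application path; replace with your own target.
app=./my_app

# One knob per option: -knob and its short form -k are interchangeable.
cmd="vtune -collect hotspots -knob sampling-mode=hw -k sampling-interval=5 -- $app"

# Print the command for review instead of launching the collection.
echo "$cmd"
```

Both knobs used here (sampling-mode and sampling-interval) are listed in the table below as supported by the hotspots analysis.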
<knob-name> | Description |
---|---|
accurate-cpu-time-detection=true | false
(Windows only)
Default:
true | Collect more accurate CPU time data. This option requires
additional disk space and post-processing time. Administrator privileges are
required.
Supported analysis:
runss |
analyze-loops=true |
false Default:
false | Extend loop analysis to collect advanced loops
information such as instruction set usage and display analysis results by loops
and functions.
Supported analysis:
runss ,
runsa |
analyze-mem-objects=true | false Default:
false | Enable the instrumentation of memory
allocation/de-allocation and map hardware events to memory objects. This option
is supported only for Linux targets running on Intel microarchitecture code
name Sandy Bridge or later.
Supported analysis:
memory-access |
analyze-openmp=true | false Default:
true for the HPC Performance Characterization
analysis;
false for other analysis types.
| Instrument the OpenMP* runtimes in your application to group
performance data by regions/work-sharing constructs and detect inefficiencies
such as imbalance, lock contention, or overhead on performing scheduling,
reduction, and atomic operations. Using this option may cause higher overhead
and increase the result size.
Supported analysis:
hotspots ,
threading ,
hpc-performance ,
memory-access ,
uarch-exploration ,
runsa |
analyze-persistent-memory=true |
false Default:
false | Collect performance information for Intel® Optane™ Persistent
Memory modules.
Supported analysis:
platform-profiler |
analyze-power-usage=true | false Default:
false | Collect information about energy consumed by CPU, DRAM, and
discrete GPU.
Supported analysis:
gpu-hotspots ,gpu-offload |
analyze-throttling-reasons=true |
false Default:
false | Collect information about factors that cause the CPU to throttle.
Supported analysis:
system-overview |
atrace-config=<event>
Available events are
gfx, input, view, webview, wm, am, audio, video, camera,
hal, res, dalvik.
| Collect Android framework events from
Systrace*.
Supported analysis:
runsa |
characterization-mode=overview | global-local-accesses |
compute-extended | full-compute | instruction-count Default:
overview | Monitor the Render and GPGPU engine usage (Intel Graphics only),
identify which parts of the engine are loaded, and correlate GPU and CPU data.
The Characterization mode uses platform-specific presets of the
GPU metrics. All presets except
instruction-count collect data about execution
unit (EU) activity: EU Array Active, EU Array Stalled, EU Array Idle,
Computing Threads Started, and Core Frequency. Each preset also introduces
additional metrics.
Supported analysis:
gpu-hotspots ,
graphics-rendering ,
runsa |
chipset-event-config="event1,event2,..." | Specify a comma-separated list of Android chipset events (up to 5
events) to monitor with the hardware event-based sampling collector.
Supported analysis:
runsa |
source-analysis=bb-latency | mem-latency Default value:
bb-latency | Collect data on performance-critical basic
blocks and issues caused by memory accesses in the GPU kernels. Choose one of
the following modes:
Supported analysis:
gpu-hotspots |
collect-bad-speculation=true |
false Default value:
true | Collect the minimum set of data required to compute top-level
metrics and all Bad Speculation sub-metrics.
Supported analysis:
uarch-exploration ,
runsa |
collect-core-bound=true |
false Default:
false | Collect the minimum set of data required to compute top-level
metrics and all Core Bound sub-metrics.
Supported analysis:
uarch-exploration ,
runsa |
collect-frontend-bound=true |
false Default value:
true | Collect the minimum set of data required to compute top-level
metrics and all Front-End Bound sub-metrics.
Supported analysis:
uarch-exploration ,
runsa |
collect-cpu-gpu-bandwidth=true |
false Default: false
| Collect DRAM bandwidth data for all hosts. Additionally, collect
PCIe bandwidth for supported server hosts (Intel® micro-architectures code
named Ice Lake and Sapphire Rapids). To view collected data in GUI, enable the
Analyze CPU host-GPU
bandwidth option.
Supported analysis: gpu-offload |
collect-cpu-gpu-pci-bandwidth=true | false Default: false
| Collect PCIe bandwidth for supported server hosts (Intel®
micro-architectures code named Ice Lake and Sapphire Rapids). This knob is
available for custom analyses only. To view collected data in GUI, enable the
Analyze CPU host-GPU
bandwidth option.
Supported analysis: runsa |
collect-io-waits=true | false Default:
false | Analyze the percentage of time each thread and
CPU spends in I/O wait state.
Supported analysis:
runsa |
collect-memory-bandwidth=true |
false Default: depends on analysis type
| Collect data to identify where your application is generating
significant bandwidth to DRAM. To view collected data in GUI, enable the
Analyze memory
bandwidth option.
Supported analysis:
performance-snapshot, uarch-exploration ,
hpc-performance ,
gpu-hotspots ,runsa |
collect-memory-bound=true |
false Default value:
true | Collect the minimum set of data required to compute top-level
metrics and all Memory Bound sub-metrics.
Supported analysis:
uarch-exploration ,
hpc-performance |
collect-programming-api=true | false Default for
gpu-hotspots: true; for runss: false.
| Analyze execution of SYCL apps, OpenCL™ kernels
and Intel® Media SDK programs on Intel HD Graphics and Intel® Iris® Graphics.
This option may affect the performance of your application on the CPU side.
Supported analysis:
gpu-hotspots ,
gpu-offload ,
runsa |
collect-retiring=true |
false Default value:
true | Collect the minimum set of data required to compute top-level
metrics and all Retiring sub-metrics.
Supported analysis:
uarch-exploration ,
runsa |
collecting-mode=hw-sampling |
hw-tracing Default value:
hw-sampling | Specify the system-wide collection mode to
either explore CPU, GPU, and I/O resources utilization with the default
event-based sampling mode, or enable the low-overhead hardware tracing and
identify a root cause of latency issues.
Supported analysis:
system-overview ,
runsa |
computing-task-of-interest=computing_task_name[#start_idx#step#stop_idx] | Specify a comma-separated list of GPU computing task names and
invocations.
Supported analysis:
gpu-hotspots ,
runsa |
counting-mode=true |
false Default:
false | Choose between collecting detailed context data for each PMU event
(such as code or hardware context) or the counts of events. Counting mode
introduces less overhead but gives less information.
Supported analysis:
runsa |
cpu-samples-mode=off |
stack | nostack Default:
false | Enable to periodically sample the application. Samples can be
collected with or without stacks.
Supported analysis:
runss |
dpdk=true | false Default:
false | Profile DPDK IO API.
Supported analysis:
io |
dram-bandwidth-limits=true | false Default:
true for the HPC Performance Characterization and
Microarchitecture Exploration analysis with
collect-memory-bandwidth knob enabled;
true for the Memory Access and Microarchitecture
Exploration analysis.
| Evaluate maximum achievable local DRAM bandwidth before the
collection starts. This data is used to scale bandwidth metrics on the timeline
and calculate thresholds.
Supported analysis:
performance-snapshot ,
memory-access ,
uarch-exploration ,
hpc-performance ,
runsa |
enable-characterization-insights=true | false | Get additional performance insights such as the efficiency of
hardware usage, and learn next steps.
Supported analysis:
gpu-offload |
enable-context-switches=true |
false Default:
false | Analyze detailed scheduling layout for all threads
in your application, explore time spent on a context switch,
and identify the nature of context switches for a thread (preemption or
synchronization).
Supported analysis:
runsa |
enable-driverless-collection=true | false Default:
false | Enable driverless Linux Perf collection when possible.
Supported analysis:
runsa |
enable-gpu-usage=true |
false Default:
false | Analyze frame rate and usage of Intel HD Graphics and Intel® Iris®
Graphics engines and identify whether your application is GPU or CPU bound.
Supported analysis:
runss ,
runsa |
enable-interrupt-collection=true | false Default:
false | Collect interrupt events that alter a normal
execution flow of a program. Such events can be generated by hardware devices
or by CPUs. Use this data to identify slow interrupts that affect your code
performance.
Supported analysis:
system-overview |
enable-parallel-fs-collection=true | false Default:
false | Analyze Lustre* file system performance
statistics, including Bandwidth, Package Rate, Average Packet Size, and others.
Supported analysis:
runsa |
enable-stack-collection=true |
false Default:
false | Supported analysis:
hotspots ,
hpc-performance ,
gpu-offload ,
runsa |
enable-system-cswitch=true | false Default:
false | Analyze detailed scheduling layout for all threads
on the system and identify the nature of context switches
for a thread (preemption or synchronization).
Supported analysis:
runsa |
enable-thread-affinity=true | false Default:
false | Analyze thread pinning to sockets, physical cores, and logical
cores. Identify incorrect affinity that utilizes logical cores instead of
physical cores and contributes to poor physical CPU utilization.
Affinity information is collected at the end of the thread
lifetime, so the resulting data may not show the whole issue for dynamic
affinity that is changed during the thread lifetime.
|
enable-user-sync=true | false Default:
false | Collect synchronization data via the User-Defined Synchronization API.
Supported analysis:
threading ,
runss |
enable-user-tasks=true |
false Default:
false | Analyze tasks, events, and counters specified in your application
via the Task API. This option causes higher overhead and increases result size.
Supported analysis:
hotspots ,
threading ,
uarch-exploration ,
runss ,
runsa |
event-config=<event_name1>,<event_name2>,... | Configure PMU events to collect with the hardware event-based
sampling collector. Multiple events can be specified as a comma-separated list
(no spaces).
To display a list of events available on the target PMU,
enter:
vtune <target>
The command returns names and short
descriptions of available events. For more information on the events, use the
Intel Processor Events Reference.
Supported analysis:
runsa |
event-mode=all | user |
os Default:
all | Limit event-based sampling collection to OS or
USER mode.
Supported analysis:
hotspots ,
runsa |
ftrace-config=<event_name>
Available events are
freq, idle, sched, disk, filesystem, irq, kvm, workq,
softirq, sync.
Default for Linux targets:
sched,freq,idle,workq,irq,softirq
Default for Android targets:
sched,freq,idle,workq,filesystem,irq,softirq,sync,disk | Collect Linux Ftrace* framework events.
Supported analysis:
runsa ,
runss |
gpu-sampling-interval=<number>
A number between 0.1 and 1000 (in milliseconds).
Default: 1
| Specify an interval between GPU samples (in milliseconds).
Supported analysis:
gpu-hotspots ,
graphics-rendering ,
runss ,
runsa |
io-mode=off | stack |
nostack Default:
off | Enable to identify where threads are waiting or compute thread
concurrency. The collector instruments APIs, which causes higher overhead and
increases result size.
Supported analysis:
runss ,
runsa |
ipt-regions-to-load=<number>
A number between 10 and 5000.
Default:
1000 | Specify the maximum number (10-5000) of code
regions to load for detailed analysis.
Supported analysis:
anomaly-detection |
kernel-stack=true |
false Default:
true | Profile system disk IO API.
Supported analysis:
io |
max-region-duration=<number>
A number between 0.001 and 1000 (in milliseconds).
Default:
100 | Specify the maximum duration (0.001-1000 ms) of
analysis per code region.
Supported analysis:
anomaly-detection |
mem-object-size-min-thres=<number>
Default: 1024 bytes
| Specify a minimal size of memory allocations
to analyze. This option helps reduce runtime overhead of the instrumentation.
This option is supported only for Linux targets
running on Intel microarchitecture code name Sandy Bridge or later.
Supported analysis:
memory-access |
mrte-type=java,dotnet |
java,dotnet,python | python
Default:
java,dotnet | Specify a type of managed runtime to analyze.
Available values: combined .NET* and Java* analysis, combined Java, .NET and
Python* analysis, and Python only.
Supported analysis:
runss ,
runsa |
no-altstack=true |
false Default:
false | Disable using alternative stacks for signal
handlers. Consider this option for profiling standard Python 3 code on Linux.
Supported analysis:
runss |
pmu-collection-mode=detailed |
summary Default:
detailed | Choose the
detailed sampling-based collection mode to view
data breakdown per function and other hotspots. Use the
summary counting-based mode for an overview of the
whole profiling run. This mode has a lower collection overhead and faster
post-processing time.
Supported analysis:
uarch-exploration |
profiling-mode=characterization |
code-level-analysis Default:
characterization | Select a profiling mode to either characterize GPU performance
issues based on GPU hardware metric presets or enable source analysis to
identify basic block latency due to algorithm inefficiencies, or memory
latency due to memory access issues.
Supported analysis:
gpu-hotspots ,
runsa |
sampling-interval=<number>
For user-mode sampling and tracing types: a number (in
milliseconds) between 1 and 1000. Default: 10.
For hardware event-based sampling types: a number (in
milliseconds) between 0.01 and 1000. Default: 1.
| Specify a sampling interval (in milliseconds) between CPU samples.
Supported analysis:
hotspots ,
runss ,
threading ,
runsa ,
system-overview ,
memory-access ,
hpc-performance |
sampling-mode=sw | hw Default:
sw | Specify a profiling mode.
Use
sw to identify CPU hotspots and explore a call
flow of your program. This mode does not require sampling drivers to be
installed but incurs more collection overhead.
Use
hw to identify application hotspots based on such
basic hardware events as Clockticks and Instructions Retired. This is a
low-overhead collection mode but it requires the sampling driver to be
installed on your system.
Supported analysis:
hotspots, threading |
signals-mode=off | objects
| stack | nostack Default:
off | Enable to view synchronization transitions in the timeline and
signalling call stacks for associated waits. The collector instruments
signalling APIs, which causes higher overhead and increases result size.
Supported analysis:
runss |
spdk=true | false Default:
false | Profile SPDK IO API.
Supported analysis:
io |
stack-size=<number>
A number between 0 and 2147483647. Default is 0
(unlimited stack size).
| Reduce the collection overhead and limit the
stack size (in bytes) processed by the VTune Profiler.
Supported analysis:
runsa |
stack-stitching=true |
false Default:
true | For Intel® oneAPI Threading Building Blocks (oneTBB)-based
applications, restructure the call flow to attach
stacks to a point introducing a parallel workload.
Supported analysis:
runss |
stack-type=software | lbr Default:
software | Choose between software stack and hardware
LBR-based stack types. Software stacks have no depth limitations and provide
more data while hardware stacks introduce less overhead. Typically, software
stack type is recommended unless the collection overhead becomes significant.
Note that hardware LBR stack type may not be available on all platforms.
Supported analysis:
runsa |
stackwalk-mode=online | offline Default:
offline | Choose between online (during collection) and offline (after
collection) modes to analyze stacks. Offline mode reduces analysis overhead and
is typically recommended.
Supported analysis:
runss |
target-gpu=<domain:bus:device.function>
Default: The newest GPU architecture that VTune
Profiler can detect
| Select a target GPU for profiling when you have multiple GPUs
connected to your system. If unset, VTune Profiler selects the newest GPU
architecture it can detect.
Example:
target-gpu=0:0:2.0
Supported analysis:
gpu-offload ,
gpu-hotspots |
uncore-sampling-interval=<number>
For hardware event-based sampling types: a number (in
milliseconds) between 1 and 1000. Default: 10.
| Specify an interval (in milliseconds) between uncore event
samples.
Supported analysis:
runsa |
waits-mode=off | stack |
nostack Default:
off | Enable to identify where threads are waiting or compute thread
concurrency. The collector instruments APIs, which causes higher overhead and
increases result size.
Supported analysis:
runss |
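Several knobs in the table accept a number within a documented range. A wrapper script could check the value before building the command line. The following is only a sketch under stated assumptions: the valid_ipt_regions helper is hypothetical (it is not part of VTune Profiler), ./my_app is a placeholder, and the -- separator before the application is assumed rather than taken from this page.

```shell
# Hypothetical application path; replace with your own target.
app=./my_app

# Hypothetical helper: succeed only if $1 is an integer inside the
# documented ipt-regions-to-load range (10-5000).
valid_ipt_regions() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 10 ] && [ "$1" -le 5000 ]
}

regions=2500
if valid_ipt_regions "$regions"; then
    knob_arg="-knob ipt-regions-to-load=$regions"
else
    knob_arg=""   # omit the knob so the documented default (1000) applies
fi

# Print the command for review instead of launching the collection.
echo "vtune -collect anomaly-detection $knob_arg -- $app"
```

The same pattern could be adapted to other numeric knobs, with the bounds taken from the corresponding table row.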
Actions Modified
collect, collect-with
Description
Use the knob action-option to configure knob settings for a
collect (predefined analysis types) or collect-with (custom analysis types)
action where the analysis type supports one or more knobs. Each analysis type
or collector type supports a specific set of knobs, and each knob requires a
value. In most cases the knob value is Boolean, so you would use
true
to enable the knob.
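To illustrate the difference between the two actions, the sketch below enables the same Boolean knob once with collect and once with collect-with. The table lists enable-user-sync as supported by both the threading analysis type and the runss collector. The ./my_app path is a hypothetical placeholder, and the -- separator before the application is an assumption not shown on this page. The commands are printed, not executed.

```shell
# Hypothetical application path; replace with your own target.
app=./my_app

# Predefined analysis type: -collect with the Threading analysis.
predefined="vtune -collect threading -knob enable-user-sync=true -- $app"

# Custom analysis type: -collect-with with the runss collector.
custom="vtune -collect-with runss -knob enable-user-sync=true -- $app"

# Print both commands for review.
printf '%s\n' "$predefined" "$custom"
```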
To see all knobs available for a predefined analysis
type:
vtune -help collect <analysis_type>
To see knobs for a custom analysis type:
vtune -help collect-with <analysis_type>
Example
This example returns a list of knobs for the Threading
analysis type:
vtune -help collect threading
This example runs a custom event-based sampling data
collection on an Android system, enabling collection of Android framework and
chipset events:
vtune -collect-with runss -target-system=android -knob sampling-interval=2 -knob cpu-samples-mode=stack -knob ftrace-config=gfx,dalvik -knob chipset-event-config="GMCH_PARTIAL_WR_DRAM.ANY,GMCH_CORE_CLKS" --target-process com.intel.tbb.example.tachyon
This example configures and runs a custom event-based sampling data
collection with the stack size limited to 8192 bytes:
vtune -collect-with runsa -knob enable-stack-collection=true -knob stack-size=8192 -knob enable-call-counts=true -knob event-config=CPU_CLK_UNHALTED.REF_TSC:sa=1800000,CPU_CLK_UNHALTED
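As a further sketch, a script can append one -knob option per setting when many knobs have to be set together. The example narrows a Microarchitecture Exploration run to the Memory Bound category by switching off the other collect-* knobs listed in the table. The ./my_app path is a hypothetical placeholder, and the -- separator before the application is an assumption not shown on this page. The command is printed, not executed.

```shell
# Hypothetical application path; replace with your own target.
app=./my_app

# Start from the analysis type, then append one -knob option per setting.
set -- -collect uarch-exploration
for category in bad-speculation core-bound frontend-bound retiring; do
    set -- "$@" -knob "collect-$category=false"
done
set -- "$@" -knob collect-memory-bound=true

# Join the arguments into a single line and print it for review.
cmd="vtune $* -- $app"
echo "$cmd"
```

Keeping the arguments in the positional parameters avoids quoting problems while the list is being built; to run the collection, the final line could call vtune with "$@" directly.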