Hardware Event-based Sampling Collection with Stacks
- The profiler gains control whenever a thread gets scheduled on and then off a processor (that is, at thread quantum borders). That enables the profiler to take exact measurements of any hardware performance events or timestamps, as well as collect a call stack to the point where the thread gets activated and inactivated.
- The profiler determines the reason for thread inactivation: either an explicit request for synchronization, or a so-called thread quantum expiration, when the operating system scheduler preempts the current thread to run another, higher-priority one instead.
- The time during which a thread remains inactive is also measured directly and differentiated based on the thread inactivation reason: inactivity caused by a request for synchronization is called Wait time, while inactivity caused by preemption is called Inactive time.
- call stack information
- branching information (if configured so)
- processor timestamps
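The split between Wait time and Inactive time described above can be illustrated with a minimal sketch. It is not how VTune Profiler is implemented; the `InactiveInterval` record, the `SwitchReason` enum, and the sample timestamps are hypothetical names and values introduced only to show the accounting rule: each inactive interval is attributed to one of the two metrics according to its inactivation reason.

```python
from dataclasses import dataclass
from enum import Enum


class SwitchReason(Enum):
    """Why a thread left the processor (hypothetical names for illustration)."""
    SYNCHRONIZATION = "synchronization"  # explicit request to wait
    PREEMPTION = "preemption"            # thread quantum expired


@dataclass
class InactiveInterval:
    switched_out: float   # timestamp when the thread left the processor
    switched_in: float    # timestamp when the thread was scheduled again
    reason: SwitchReason


def split_inactivity(intervals):
    """Return (wait_time, inactive_time) summed over all intervals."""
    wait_time = 0.0
    inactive_time = 0.0
    for interval in intervals:
        duration = interval.switched_in - interval.switched_out
        if interval.reason is SwitchReason.SYNCHRONIZATION:
            wait_time += duration      # inactivity caused by synchronization
        else:
            inactive_time += duration  # inactivity caused by preemption
    return wait_time, inactive_time


# Made-up timestamps in seconds, for illustration only.
intervals = [
    InactiveInterval(1.0, 3.0, SwitchReason.SYNCHRONIZATION),
    InactiveInterval(4.0, 4.5, SwitchReason.PREEMPTION),
    InactiveInterval(6.0, 7.0, SwitchReason.SYNCHRONIZATION),
]
wait_time, inactive_time = split_inactivity(intervals)
print(f"Wait time: {wait_time} s, Inactive time: {inactive_time} s")
```

Because the reason is recorded at each quantum border, both metrics come from direct measurement of the interval rather than from statistical estimation.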
Configure Stack Collection
- Click the Configure Analysis button on the VTune Profiler toolbar. The Configure Analysis window opens.
- Specify your analysis system in the WHERE pane and your analysis target in the WHAT pane.
- In the HOW pane, choose the required event-based sampling analysis type. A good starting point is typically the Hotspots analysis in the hardware event-based sampling mode.
- Configure collection options, if required. For call stack analysis, consider enabling the Collect stacks option.
- Click the Start button at the bottom to run the selected analysis type. VTune Profiler collects hardware event-based sampling data along with information on execution paths. You can view the collected results in the Hardware Events viewpoint, which provides performance, parallelism, and power consumption data on the detected call paths.
- The event-based stack sampling data collection cannot be configured for the entire system. You have to specify an application to launch or attach to.
- By default, on Linux* the VTune Profiler uses the driverless Perf*-based mode for hardware event-based collection with stacks. To use the driver-based mode, set the Stack size option to 0 (unlimited).
- Call stack analysis adds overhead to your data collection. To minimize the overhead that grows with the stack size, use the Stack size option in the custom hardware event-based sampling configuration, or the -stack-size knob from the CLI, to limit the size of a raw stack. By default, on Linux a stack size of 1024 bytes is collected. On Windows, by default, a full-size stack is collected (zero size value). If you disable this option, the overhead will also be reduced, but no stack data will be collected.
- For analyses using the Perf*-based driverless collection, the types of context switches (preemption or synchronization) may not be identified on kernels older than 4.17 and the following metrics may not be available: Wait time, Wait Rate, Inactive Time, Preemption and Synchronization Context Switch Count.
- The rate at which data is generated (proportional to the sampling frequency and the intensity of thread synchronization/contention) may exceed the rate at which data is saved to a trace file. In that case, the profiler tries to match the incoming data rate to the outgoing data rate by preventing threads of the profiled program from being scheduled for execution. This causes paused regions to appear on the timeline, even if no pause was explicitly requested. In extreme cases, when this procedure fails to limit the incoming data rate, the profiler begins losing sample records but still keeps the counts of hardware events. If this happens, the hardware event counts of the lost sample records are attributed to a special node: [Events Lost on Trace Overflow].
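The configuration steps and notes above can also be sketched as a small script that assembles a command line. Treat it as an illustration under stated assumptions rather than a reference: the knob names `sampling-mode`, `enable-stack-collection`, and `stack-size` should be verified against the output of `vtune -help collect hotspots` for your VTune Profiler version, and `./my_app` with its arguments is a hypothetical target. The helper that compares the kernel release against 4.17 only mirrors the note about context switch types; it does not query the profiler.

```python
import re
import shlex


def kernel_may_identify_switch_types(release):
    """Return True if a Linux kernel release string is 4.17 or newer.

    On older kernels, the Perf-based collection may not distinguish
    preemption from synchronization context switches.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (4, 17)


def build_vtune_command(target, stack_size=None):
    """Build a hardware event-based Hotspots command with stack collection.

    Knob names are assumptions; check them with `vtune -help collect hotspots`.
    """
    command = [
        "vtune", "-collect", "hotspots",
        "-knob", "sampling-mode=hw",              # hardware event-based sampling
        "-knob", "enable-stack-collection=true",  # collect call stacks
    ]
    if stack_size is not None:
        if stack_size < 0:
            raise ValueError("stack_size must be zero or positive")
        # Per the notes above, 0 means an unlimited stack size.
        command += ["-knob", f"stack-size={stack_size}"]
    # System-wide collection is not supported here, so a target is required.
    command += ["--", *target]
    return command


# Hypothetical target application and arguments.
command = build_vtune_command(["./my_app", "--input", "data.bin"], stack_size=4096)
print(shlex.join(command))

# Pass platform.release() here to check the machine you are profiling on.
print(kernel_may_identify_switch_types("4.15.0-112-generic"))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting concerns.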