Memory Access Analysis for Cache Misses and High Bandwidth Issues
How It Works
- Loads and Stores metrics that show the total number of loads and stores
- LLC Miss Count metric that shows the total number of last-level cache misses
- Local DRAM Access Count metric that shows the total number of LLC misses serviced by the local memory
- Remote DRAM Access Count metric that shows the number of accesses to the remote socket memory
- Remote Cache Access Count metric that shows the number of accesses to the remote socket cache
- Memory Bound metric that shows the fraction of cycles spent waiting due to demand load or store instructions
- L1 Bound metric that shows how often the machine was stalled without missing the L1 data cache
- L2 Bound metric that shows how often the machine was stalled on L2 cache
- L3 Bound metric that shows how often the CPU was stalled on L3 cache, or contended with a sibling core
- L3 Latency metric that shows the fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)
- NUMA: % of Remote Accesses metric that shows the percentage of memory requests to remote DRAM. The lower its value, the better.
- DRAM Bound metric that shows how often the CPU was stalled on the main memory (DRAM). This metric enables you to identify DRAM Bandwidth Bound and UPI Utilization Bound issues, as well as Memory Latency issues, with the following metrics:
  - Remote / Local DRAM Ratio metric that is defined by the ratio of remote DRAM loads to local DRAM loads
  - Local DRAM metric that shows how often the CPU was stalled on loads from the local memory
  - Remote DRAM metric that shows how often the CPU was stalled on loads from the remote memory
  - Remote Cache metric that shows how often the CPU was stalled on loads from the remote cache in other sockets
  - Average Latency metric that shows the average load latency in cycles
- The list of metrics may vary depending on your microarchitecture.
- The UPI Utilization metric replaced QPI Utilization starting with systems based on Intel microarchitecture code name Skylake.
Configure and Run Analysis
- Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (standalone GUI) or the equivalent button in the Visual Studio IDE. The Configure Analysis window opens.
- From the HOW pane, click the Browse button and select Memory Access.
- Configure the following options:
  - CPU sampling interval, ms field: Specify an interval (in milliseconds) between CPU samples. Possible values: 0.01-1000. The default value is 1 ms.
  - Analyze dynamic memory objects check box (Linux only): Enable the instrumentation of dynamic memory allocation/de-allocation and map hardware events to such memory objects. This option may cause additional runtime overhead due to the instrumentation of all system memory allocation/de-allocation API. The option is disabled by default.
  - Minimal dynamic memory object size to track, in bytes spin box (Linux only): Specify a minimal size of dynamic memory allocations to analyze. This option helps reduce the runtime overhead of the instrumentation. The default value is 1024.
  - Evaluate max DRAM bandwidth check box: Evaluate the maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The option is enabled by default.
  - Analyze OpenMP regions check box: Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. The option is disabled by default.
  - Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
- Memory objects analysis can be configured for Linux* targets only and only for processors based on Intel microarchitecture code name Sandy Bridge or later.
- Summary window displays statistics on the overall application execution, including the application-level bandwidth utilization histogram.
- Bottom-up window displays performance data per metric for each hotspot object. If you enable the Analyze memory objects option for data collection, the Bottom-up window also displays memory allocation call stacks in the grid and Call Stack pane. Use the Memory Object grouping level, preceded with the Function level, to view memory objects as the source location of an allocation call.
- Platformwindow provides details on tasks specified in your code with the Task API, Ftrace*/Systrace* event tasks, OpenCL™ API tasks, and so on. If corresponding platform metrics are collected, the Platform window displays over-time data as GPU usage on a software queue, CPU time usage, OpenCL™ kernels data, and GPU performance per the Overview group of GPU hardware metrics, Memory Bandwidth, and CPU Frequency.
- 2nd Generation Intel® Core™ processors
- Intel® Xeon® processor families, or later
- 3rd Generation Intel Atom® processor family, or later
- standard system memory allocation API: mmap, malloc/free, calloc, and others
- memkind - https://github.com/memkind/memkind
- jemalloc - https://github.com/memkind/jemalloc
- pmdk - https://github.com/pmem/pmdk