Tutorial

  • 2021.2
  • 10/26/2021
  • Public

Find Hotspots (L3 Cache)

The following steps show how to find hotspots at the L3 cache level.
As an example, the tutorial uses the sample application tcc_cache_allocation_sample. Although this sample is already tuned with the cache allocation library, it can simulate an untuned application when configured to allocate the buffer in DRAM.
First, you will run the sample using DRAM and observe the number of cache misses. Then you will run the sample again, using a buffer in L3 cache. You can expect to see fewer cache misses.
The following example is for illustration purposes only; your results may vary.

Run the Sample Using DRAM (Baseline)

  1. Make sure that you can ssh from your host system to the target system. Run ifconfig on the target to get the IP address.
  2. On the target system, open a terminal window and run the Linux* tool stress-ng as a noisy neighbor:
    taskset -c 3 stress-ng -C 10 --cache-level 3
  3. In the WHERE section, specify the target system as follows:
    1. Click the Browse button.
    2. Select Remote Linux (SSH).
    3. For SSH destination, specify the address of the target system: root@<IP address> or root@<hostname>.
    4. Click the Deploy button if required.
  4. In the WHAT section, specify the following information to run the sample:
    1. For Application, type tcc_cache_allocation_sample.
    2. For Application parameters, type --latency 300 --sleep 100000000.
  5. When the HOW section is visible, configure the analysis as follows:
    1. Click the Browse button.
    2. Under Microarchitecture, select Memory Access.
    3. Click the Copy button to customize the analysis.
    4. Optional: Select Collect stacks.
    5. Under Events configured for CPU, select MEM_LOAD_RETIRED.L3_MISS for 11th Gen Intel® Core™ processors or LONGEST_LAT_CACHE.MISS for Intel Atom® x6000E Series processors. Set Sample After to 2000 and deselect all other events, because the interrupts caused by extra VTune™ counters may affect the performance of the real-time system.
    6. Optional: Select Analyze loops to collect advanced information such as instruction set usage, and to display analysis results by loops and functions.
    7. Optional: Scroll down and select Analyze memory objects.
  6. Click the Start button to run the analysis.
  7. Go to the Event Count tab.
  8. Maximize the screen if it is smaller than the full size of your monitor.
  9. Set Grouping to Task Type/Function/Call Stack.
  10. At the top of the screen, find the MEM_LOAD_RETIRED.L3_MISS column (or the LONGEST_LAT_CACHE.MISS column if you selected that event for Intel Atom® x6000E Series processors). Click the column heading to sort the rows by number of cache misses. In this example, the function pointer_chase_read_workload_internal is at the top of the list with 42,000 misses. This means that the function is the hotspot for this type of event, and its buffer is a candidate for the cache allocation library.
  11. Now follow the instructions below to run the sample using the cache allocation library to allocate a buffer in L3 cache. Compare the results.
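
To put the 42,000 figure from step 10 in perspective, it can help to relate the event count to the Sample After value from step 5. The following is a minimal sketch, assuming that the event count shown by VTune™ Profiler is estimated as the number of collected samples multiplied by the Sample After value. The function name samples_from_count is hypothetical and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many samples lie behind an event count.
# Assumption: event count = number of samples x Sample After value.
#   $1 = event count reported for a function
#   $2 = Sample After value used for the event
samples_from_count() {
    echo $(( $1 / $2 ))
}

# Values from this tutorial: 42,000 misses, Sample After set to 2000.
samples_from_count 42000 2000
```

Under that assumption, the 42,000 misses correspond to about 21 samples, so a difference of a few thousand misses between two runs amounts to only one or two samples and should be read with caution.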

Run the Sample Using L3 Cache

  1. At the top left of the screen, click the Configure Analysis button.
  2. In the WHAT section, change the Application parameters to --latency 110 --sleep 100000000. These parameters allocate the buffer in L3 cache.
  3. Click the Start button to run the analysis.
  4. After the analysis is complete, go to the Event Count tab.
  5. Compare the results with the baseline. With the buffer in L3 cache, the number of cache misses for the function pointer_chase_read_workload_internal is lower, or the function does not appear in the list at all (as in the screenshot below) because no misses were recorded for it.
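
If you prefer to script the two runs, the GUI configuration above can be approximated with the vtune command-line tool. The following is a rough sketch under stated assumptions, not a verified equivalent of the GUI analysis: it assumes the -target-system=ssh:<destination> option, the runsa collector, and the event-config knob with the :sa= modifier behave as described in the VTune™ Profiler command-line documentation for your version, so check them with vtune -help before use. The helper name vtune_l3_cmd is hypothetical. The script only prints the command lines and does not start a collection.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a vtune command line that
# approximates the custom analysis configured in the GUI steps.
#   $1 = SSH destination of the target system
#   $2 = hardware event to sample
#   $3 = value for the sample's --latency parameter
vtune_l3_cmd() {
    printf 'vtune -target-system=ssh:%s -collect-with runsa -knob event-config=%s:sa=2000 -- tcc_cache_allocation_sample --latency %s --sleep 100000000\n' "$1" "$2" "$3"
}

TARGET='root@<IP address>'          # placeholder: replace with your target
EVENT='MEM_LOAD_RETIRED.L3_MISS'    # LONGEST_LAT_CACHE.MISS on Intel Atom x6000E

vtune_l3_cmd "$TARGET" "$EVENT" 300   # baseline run: buffer in DRAM
vtune_l3_cmd "$TARGET" "$EVENT" 110   # second run: buffer in L3 cache
```

Printing the commands first lets you review the event name and the target address before you run them against the real-time system.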

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.