Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Before Using the Cache Allocation Library

System Configuration for Using Cache Allocation Library

The cache allocation library depends on proper configuration of multiple system components, which are described in detail throughout the documentation. You need to consider the following configuration options:
  • Size and location in cache of software SRAM buffers
  • Size of the buffer allocated to each application
  • Affinity mask of your real-time application
First, you need to ensure that your application has access to a sufficient amount of software SRAM to allocate the requested buffers. Because determining optimal buffer allocations may take several iterations (see Using Cache Allocation Library in Real-Time Application), you may also need several iterations of system configuration. You can start with the initial configuration applied as part of the Get Started Guide, or use the cache configurator to create a new configuration based on your estimate of the required software SRAM size and latency. Existing software SRAM regions can also be checked with the cache configurator.
After reserving a sufficient amount of software SRAM, you need to ensure your real-time application has access to a portion of the reserved software SRAM via the .tcc.config configuration file. This file determines the size of software SRAM provided to each application that uses the cache allocation library.
The final parameter affecting memory allocation is the process affinity mask. Software SRAM allocation requires a match between the process affinity and the reserved buffer affinity. Depending on the cache topology of the system, different cores may have access to different regions of software SRAM. You can set an affinity mask by using standard Linux* functions or by using the cpuid parameter of the tcc_cache_init() function.
The affinity mask is checked each time the memory allocation functions (tcc_malloc* or tcc_calloc*) are called and is not checked during further use. Therefore, changing the affinity mask after memory allocation can degrade software SRAM performance if the new affinity mask does not match the affinity of the software SRAM buffer.
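Putting these pieces together, a minimal sketch (in C) of the initialization flow might look like the following. tcc_cache_init() and its cpuid parameter are named in this guide; the header path, the argument order of the tcc_malloc-style allocator, and the tcc_free() call are assumptions here, so verify them against the library's API reference.

    /* Minimal sketch: pin the process to a real-time core, initialize the
     * cache allocation library for that core, then allocate a low-latency
     * buffer. Items marked "assumed" are illustrative, not verified. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include "tcc/cache.h"           /* assumed header name */

    #define RT_CORE    3             /* core with access to the reserved software SRAM */
    #define BUF_SIZE   (64 * 1024)   /* buffer size in bytes */
    #define LATENCY_NS 88            /* value taken from a latency table below */

    int main(void)
    {
        /* Pin the process before allocating: the affinity mask is checked
         * only when the tcc_malloc* or tcc_calloc* functions run, not
         * afterwards. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(RT_CORE, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Alternatively, pass the core index through the cpuid parameter. */
        if (tcc_cache_init(RT_CORE) != 0) {
            fprintf(stderr, "tcc_cache_init failed\n");
            return 1;
        }

        /* Request a buffer whose per-access latency should not exceed
         * LATENCY_NS nanoseconds (argument order assumed). */
        void *buf = tcc_malloc(BUF_SIZE, LATENCY_NS);
        if (buf == NULL) {
            fprintf(stderr, "tcc_malloc failed: is enough software SRAM reserved?\n");
            return 1;
        }

        /* ... real-time work on buf ... */

        tcc_free(buf);               /* assumed free call */
        return 0;
    }

If allocation fails, re-check the reserved software SRAM regions and the per-application size configured in .tcc.config.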

Using Cache Allocation Library in Real-Time Application

To use the library in your real-time application, you will need to know the latency and size of the data set that your workload processes, as well as the hotspots in your workload’s code that are most latency sensitive.
To help you acquire this information and prepare your system, see the following workflow:
Identify Hotspots
In this context, hotspots are memory objects (such as arrays) in your real-time workload that have a high number of Level 2 (L2) or Level 3 (L3) cache misses.
Since cache is a limited and precious resource, it is important to choose carefully which memory objects to address with the cache allocation library, so as not to overuse the cache. Hotspots are the prime candidates.
The first step is to use analysis tools, such as VTune™ Profiler, to find the hotspots in your workload.
Determining Cache Allocation Library Inputs
The cache allocation library allocates buffers based on two parameters:
  • The size of the buffer
  • The maximum acceptable access latency
The size of the buffer, specified in bytes, is the standard input parameter to a malloc call.
The maximum acceptable access latency, specified in nanoseconds, is the longest amount of time that can be tolerated for accessing a single element in the buffer. This latency value will be unique to every buffer and depends on:
  • The timing requirements for executing the function that contains the buffer
  • The desired amount of time during execution of the function that can be spent on accesses to the buffer
  • The way in which the buffer is accessed, including its memory access pattern (MAP) and arithmetic intensity
You are not expected to know these characteristics for each buffer. While it is possible to perform such an analysis, it can be a complex and time-consuming process. To help you get started quickly, you can use predetermined latency values as a starting point.
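Conceptually, the latency value is a per-access time budget. A minimal formulation, consistent with the worked example later in this section (the symbols are illustrative, not part of the library's API):

\[
\text{latency} \le \frac{T_{\text{buffer}}}{N_{\text{accesses}}}
\]

where \(T_{\text{buffer}}\) is the time within the function's execution budget that can be spent on accesses to the buffer, and \(N_{\text{accesses}}\) is the number of accesses made in that time. For example, budgeting 8 microseconds across 100 accesses yields 80 nanoseconds per access.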
Select one of the following options to determine the latency value:
  • Option #1
    (recommended option for those who are new to the cache allocation library)
    Use the provided latency tables to select a latency value to pass to the library.
    11th Gen Intel® Core™ processors latency table:
      Latency to give to API:
      • 30 ns
      • 88 ns
      • 300 ns (equivalent to regular malloc)
    Intel Atom® x6000E Series processors latency table:
      Latency to give to API:
      • 62 ns
      • 145 ns
      • 250 ns (equivalent to regular malloc)
    Lower latency values come at a higher “cost” in terms of the resources needed to satisfy the buffer. The “cost” can come from reserving space in caches closer to the CPU, or from system-level tunings that may improve latency for one core at the cost of latency and bandwidth for another. Intel recommends starting with the least costly (numerically highest) latency value and checking whether performance needs are met. Experimenting with different latency values and measuring the worst-case execution time (WCET) of the workload is a good way to determine a latency value that works for your workload (a WCET measurement sketch appears at the end of this section).
  • Option #2
    (Advanced option)
    Profile the workload to understand workload timing requirements, buffer access time, and buffer access characteristics.
    Intel does not provide tools to assist in determining the memory access pattern or arithmetic intensity of a function operating on a buffer, and this exercise is left to you as a developer working with the cache allocation library. It is a manual process that can be done via inspection of the object code.
     The following example outlines a prescriptive process for calculating the latency value on an oversimplified function.
    Example: How to manually calculate the latency value to provide to the cache allocation library:
    1. Determine that there is a timing problem.
      If the workload in question already meets its timing requirement, there is generally no need to use the cache allocation library. However, if there is value in completing the workload in a shorter duration to free up CPU cycles for other value-added tasks, the cache allocation library is still an option.
      Measure timing violations by executing the workload multiple times while the system is heavily loaded, and check whether any instances take longer than expected to complete.
    2. Identify and state the timing problem. For example:
      Workload A should complete in 125 microseconds. When executing Workload A 1 billion times, it is observed that in some instances it took longer than 125 microseconds for Workload A to complete.
      1. Workload A completes in 120 microseconds on average.
      2. Workload A sees execution jitter with some instances taking 130 microseconds (tail outliers).
      3. The tail outliers of 130 microseconds need to be pulled in to 120 microseconds (for a net delta of 10 microseconds), leaving a 5-microsecond safety margin.
    3. Instrument the workload to identify hotspots. For example:
      Workload A is instrumented to determine which functions are experiencing a high number of cache misses. Assume foo() has a portion of code that, when broken down, results in the following memory accesses:
        Iterations = 100
        For i = 0 to Iterations - 1
            A = data_array[i]
            /* Work done on A */
        End For
      While this is an oversimplification, it represents a buffer, data_array[], where some amount of compute is performed between each access.
      Since the 100 iterations of the loop result in 100 accesses to the buffer data_array[], and the profiling tools indicate a high number of cache misses when accessing data_array[], the buffer becomes a candidate for optimization with the cache allocation library.
      Using the Instrumentation and Tracing Technology API (ITT API) functions __itt_task_begin and __itt_task_end to instrument the block of code that produces the 100 accesses to data_array[], the following is determined (see the instrumentation sketch after this example):
      1. On average, it takes 8 microseconds to access all 100 elements.
        8 microseconds / 100 accesses = 80 nanoseconds per element accessed
      2. Occasionally, it takes 12 microseconds to access all 100 elements.
      When the buffer accesses complete in 8 microseconds, Workload A closes timing in the desired 120 microseconds (with a 5-microsecond safety margin).
    4. Implement the cache allocation library:
      If each access to data_array[] were to complete in 80 nanoseconds, Workload A would be better able to close its timing. Function foo() is then updated to replace legacy calls to malloc() with the cache allocation library, specifying 80 nanoseconds in the latency field (see the allocation sketch after this example).
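The following minimal sketch (in C) illustrates step 3: instrumenting the loop in foo() with the ITT API. __itt_task_begin and __itt_task_end are the ITT functions named above; link against libittnotify and collect with a tool such as VTune™ Profiler. The domain and task names are arbitrary placeholders.

    #include <ittnotify.h>               /* ITT API; link with -littnotify */

    #define ITERATIONS 100

    static __itt_domain        *domain;
    static __itt_string_handle *task;

    static void foo(const double *data_array)
    {
        /* Mark the start of the region whose duration we want measured. */
        __itt_task_begin(domain, __itt_null, __itt_null, task);
        for (int i = 0; i < ITERATIONS; i++) {
            double a = data_array[i];
            /* Work done on a */
            (void)a;
        }
        /* Mark the end; the profiler reports elapsed time per task instance. */
        __itt_task_end(domain);
    }

    int main(void)
    {
        domain = __itt_domain_create("WorkloadA");
        task   = __itt_string_handle_create("data_array accesses");

        double data_array[ITERATIONS] = {0};
        foo(data_array);
        return 0;
    }

Dividing the measured task duration by the number of accesses (8 microseconds / 100) gives the per-element access time used in step 4.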
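And here is a sketch of step 4 combined with the WCET experiment recommended under Option #1: the legacy malloc() call is replaced with a cache allocation library call requesting 80 ns, and the workload is executed many times under load to record the worst case. As in the earlier sketch, the header name and the tcc_malloc()/tcc_free() signatures are assumptions; clock_gettime() with CLOCK_MONOTONIC is standard POSIX.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include "tcc/cache.h"              /* assumed header name */

    #define ITERATIONS 100
    #define RUNS       1000000
    #define LATENCY_NS 80               /* from step 3: 8 us / 100 accesses */
    #define RT_CORE    3                /* core used in the earlier sketch */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }

    static void foo(const volatile double *data_array)
    {
        for (int i = 0; i < ITERATIONS; i++) {
            double a = data_array[i];
            /* Work done on a */
            (void)a;
        }
    }

    int main(void)
    {
        if (tcc_cache_init(RT_CORE) != 0)
            return 1;

        /* Was: double *data_array = malloc(ITERATIONS * sizeof(double)); */
        double *data_array = tcc_malloc(ITERATIONS * sizeof(double), LATENCY_NS);
        if (data_array == NULL)
            return 1;

        uint64_t wcet = 0;
        for (long run = 0; run < RUNS; run++) {
            uint64_t t0 = now_ns();
            foo(data_array);
            uint64_t dt = now_ns() - t0;
            if (dt > wcet)
                wcet = dt;              /* track the worst observed run */
        }
        printf("WCET over %d runs: %llu ns\n", RUNS, (unsigned long long)wcet);

        tcc_free(data_array);           /* assumed free call */
        return 0;
    }

If the measured WCET still violates the timing requirement, try a numerically lower (more costly) latency value from the tables in Option #1 and repeat the measurement.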

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.