Using the Cache Allocation Library in a Real-Time Application
To use the library in your real-time application, you need to know the size of the data set that your workload processes and the access latency it can tolerate, as well as the hotspots in your workload’s code that are most latency sensitive.
To help you acquire this information and prepare your system, follow the workflow described below.
In this context, hotspots are memory objects (such as arrays) in your real-time workload that have a high number of Level 2 (L2) or Level 3 (L3) cache misses.
Since cache is a limited and precious resource, it is important to choose carefully which memory objects to address with the cache allocation library, so as not to overuse the cache. Hotspots are the prime candidates.
The first step is to use analysis tools, such as VTune™ Profiler, to find the hotspots in your workload.
Determining Cache Allocation Library Inputs
The cache allocation library allocates buffers based on two parameters:
The size of the buffer
The maximum acceptable access latency
The size of the buffer, specified in bytes, is the same parameter you would pass to a standard malloc call.
The maximum acceptable access latency, specified in nanoseconds, is the longest amount of time that can be tolerated for accessing a single element in the buffer. This latency value will be unique to every buffer and depends on:
The timing requirements for executing the function that contains the buffer
The desired amount of time during execution of the function that can be spent on accesses to the buffer
The way in which the buffer is accessed, including its memory access pattern (MAP) and arithmetic intensity.
You are not expected to know these characteristics for each buffer. While it is possible to perform such an analysis, it can be a complex and time-consuming process. To help you get started quickly, you can use predetermined latency values as a starting point.
Select one of the following options to determine the latency value:
Option 1 (recommended for those who are new to the cache allocation library): Use the provided latency tables to select a latency value to pass to the library.
Latency table for 11th Generation Intel® Core™ processors and Intel® Xeon® W-11000E Series processors:
300 ns (equivalent to regular malloc)
Latency table for Intel Atom® x6000E Series processors:
250 ns (equivalent to regular malloc)
Lower latency values come at a higher “cost” in terms of resources needed to satisfy the buffer. The “cost” can come from reserving space in caches closer to the CPU. Intel recommends starting with the least costly latency value (numerically highest) and seeing if performance needs are met. Experimenting with different latency values and measuring the worst-case execution time (WCET) of the workload is a good approach to determining a latency value that works for your workload.
Option 2: Profile the workload to understand its timing requirements, buffer access time, and buffer access characteristics.
Intel does not provide tools to assist in determining the memory access pattern or arithmetic intensity of a function operating on a buffer, and this exercise is left to you as a developer working with the cache allocation library. It is a manual process that can be done via inspection of the object code.
The following example outlines a prescriptive process for calculating the latency value on an oversimplified function.
Example: How to manually calculate the latency value to provide to the cache allocation library:
Determine that there is a timing problem.
If the workload in question can already meet the timing requirement, there is no need to use the cache allocation library. Alternatively, if the workload already meets timing requirements but there is value in completing the workload in a shorter duration to free up CPU cycles for other value-added tasks, using the cache allocation library is an option.
Measure timing violations by executing the workload multiple times while the system is heavily loaded, and checking whether any instances took longer than expected to complete.
Identify and state the timing problem. For example:
Workload A should complete in 125 microseconds. When Workload A is executed 1 billion times, some instances are observed to take longer than 125 microseconds to complete.
Workload A completes in 120 microseconds on average.
Workload A sees execution jitter with some instances taking 130 microseconds (tail outliers).
The tail outliers of 130 microseconds need to be pulled in to 120 microseconds (a net delta of 10 microseconds), leaving a 5-microsecond safety margin against the 125-microsecond requirement.
Instrument the workload to identify hotspots. For example:
Workload A is instrumented to determine which functions experience a high number of cache misses. Assume one such function has a portion of code that breaks down into the following memory accesses:
Iterations = 100
For i = 0 to Iterations
    A = data_array[i]
    /* Work done on A */
While this is an oversimplification, it represents a buffer, data_array, where some amount of compute is performed between accesses. Since the 100 iterations of the loop result in 100 accesses to data_array, and the profiling tools indicate a high number of cache misses when accessing it, data_array becomes a candidate for optimization with the cache allocation library.
On average, it takes 8 microseconds to access all 100 elements.
8 microseconds / 100 accesses = 80 nanoseconds per element accessed
Occasionally, it takes 12 microseconds to access all 100 elements.
When the buffer accesses complete in 8 microseconds, Workload A closes timing in the desired 120 microseconds (with a 5-microsecond safety margin).
Implement the cache allocation library:
If accesses to data_array were to complete in 80 nanoseconds each, this would improve the ability of Workload A to close timing. The function is then updated to replace its legacy malloc call with the cache allocation library allocation call, specifying 80 nanoseconds in the latency field.