Developer Guide

  • 2021.1
  • 11/03/2021
  • Public

Cache Configurator

The cache configurator (tcc_cache_configurator) is a command-line tool that enables you to discover and manage cache memory resources.
The tool is intended for system integrators or administrators who need to meet one or more of the following requirements:
  • Provide low-latency buffer access to real-time applications running on the system
  • Provide mechanisms to improve the worst-case execution time (WCET)
  • Minimize the impact the GPU has on real-time applications running on the CPU cores
  • Partition the shared cache resources among various components using the cache (such as CPU, GPU, or I/O), referred to in this guide as caching agents.
The tool simplifies techniques that address these requirements, namely software SRAM buffer management and cache partitioning. By using the tool’s interface, you can accomplish these complex tasks without the need to directly configure the low-level details of the cache architecture. You can:
  • Select from a variety of preset cache partitioning schemes. The presets provide varying levels of cache isolation and amounts of software SRAM to cover the most typical scenarios, so one of the available presets is likely to suit your use case.
  • Create a custom partitioning scheme. If you need a custom or more flexible setup, the tool offers an interactive interface that guides you through adding or deleting software SRAM regions and dividing the remaining cache among caching agents.

What Is a Cache Partitioning Scheme?

When a caching agent requests to allocate a new line into the cache and the target cache set is full, a victim cache line must be identified and evicted, with its data written back to memory if it has been modified, before the new line can be deposited. If activity from other caching agents keeps evicting an application's data, the application incurs additional cache misses and its performance degrades. This is known as the noisy neighbor effect.
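To make the eviction mechanism concrete, the following Python sketch models a single cache set with least-recently-used (LRU) replacement. It is a simplified illustration under stated assumptions: real Intel caches do not necessarily use pure LRU, write-back is not modeled, and the names CacheSet and realtime_hits are hypothetical, not part of the tool or the Cache Allocation Library.

```python
from collections import OrderedDict


class CacheSet:
    """One set of a set-associative cache with true-LRU replacement.
    Simplified, hypothetical model for illustration only."""

    def __init__(self, ways):
        self.ways = ways
        # Maps tag -> owning agent; iteration order runs from least to most recently used.
        self.lines = OrderedDict()

    def access(self, agent, tag):
        """Return True on a hit; on a miss, evict the LRU victim if the set is full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark the line most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the LRU victim (write-back not modeled)
        self.lines[tag] = agent  # deposit the new line
        return False


def realtime_hits(ways, rt_lines, noisy_lines_per_round, rounds):
    """Count hits for a real-time agent that re-reads the same rt_lines lines
    every round while a noisy agent streams fresh lines between rounds."""
    cache = CacheSet(ways)
    hits = 0
    next_noisy_tag = 0
    for _ in range(rounds):
        for t in range(rt_lines):
            hits += cache.access("rt", ("rt", t))
        for _ in range(noisy_lines_per_round):
            cache.access("noisy", ("noisy", next_noisy_tag))
            next_noisy_tag += 1
    return hits


print(realtime_hits(ways=8, rt_lines=4, noisy_lines_per_round=0, rounds=10))  # 36: quiet neighbor
print(realtime_hits(ways=8, rt_lines=4, noisy_lines_per_round=8, rounds=10))  # 0: noisy neighbor
```

With a quiet neighbor, the real-time agent misses only on its first round and hits on the remaining 36 of 40 accesses. When the noisy agent streams eight new lines per round into the same 8-way set, every round evicts the real-time lines and the real-time agent never hits.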
Creating partitions in the cache to isolate certain agents from others can help to minimize the noisy neighbor effect. By default, most caching agents are configured to use the entire cache, effectively sharing the cache among all caching agents without any partitions. This yields maximum peak performance for all of the caching agents. In many real-time designs, however, the GPU is considered a noisy neighbor, and full GPU performance is often not required. Limiting how much cache the GPU can use reduces its impact as a noisy neighbor.
A cache partitioning scheme controls which caching agents can allocate into the cache and, more specifically, where in the cache they can allocate.
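One common way to express "where" each agent may allocate is a per-agent bitmask over cache ways, as used by Intel® Cache Allocation Technology (CAT). The Python sketch below assumes that model and a hypothetical 12-way cache; the agent names, mask values, and the check_scheme helper are illustrative only and are not the tool's configuration format. Shared ways are reported rather than rejected, because sharing is legitimate (the default scheme shares the entire cache), but they mark where one agent can act as a noisy neighbor to another.

```python
from itertools import combinations

# Hypothetical 12-way cache: bit i of a mask means the agent may allocate into way i.
# Agent names and mask values are illustrative, not the tool's configuration format.
NUM_WAYS = 12

scheme = {
    "core1_rt": 0b0000_0000_1111,  # ways 0-3
    "core2":    0b0000_1111_0000,  # ways 4-7
    "gpu":      0b0011_0000_0000,  # ways 8-9
    "io":       0b1100_0000_0000,  # ways 10-11
}


def ways_in(mask, num_ways):
    """List the way indices enabled in a bitmask."""
    return [i for i in range(num_ways) if mask >> i & 1]


def check_scheme(scheme, num_ways):
    """Report empty masks, masks naming nonexistent ways, and ways shared by two agents."""
    findings = []
    all_ways = (1 << num_ways) - 1
    for agent, mask in scheme.items():
        if mask == 0:
            findings.append(f"{agent} cannot allocate into the cache")
        if mask & ~all_ways:
            findings.append(f"{agent} uses ways beyond way {num_ways - 1}")
    for (a, mask_a), (b, mask_b) in combinations(scheme.items(), 2):
        shared = mask_a & mask_b
        if shared:
            findings.append(f"{a} and {b} share ways {ways_in(shared, num_ways)}")
    return findings


print(check_scheme(scheme, NUM_WAYS))  # []
# Widening the GPU mask to ways 8-11 makes it overlap the I/O ways:
print(check_scheme({**scheme, "gpu": 0b1111_0000_0000}, NUM_WAYS))  # ['gpu and io share ways [10, 11]']
```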

Developer Workflow

If you have completed the steps in the Get Started Guide, you applied Preset 4 to your system, which has enough software SRAM to run the cache allocation sample and begin exploring the Cache Allocation Library.
Before trying different presets, adding software SRAM, or customizing a cache partitioning scheme, Intel recommends the following process:
  1. First, determine how much of the cache should be reserved for software SRAM regions. Once cache space is reserved for software SRAM, it is no longer available to the rest of the system and is only accessible via the Cache Allocation Library.
  2. Determine how to partition the remaining cache between CPU cores, GPU, and I/O. Considerations:
    • Sharing cache between multiple caching agents (CPU cores, GPU, and I/O) generally leads to increased jitter under loaded conditions.
    • Isolating cores, GPU, and I/O from one another will reduce the noisy neighbor effect.
    • If App1 and App2 are affinitized to Core 1 and Core 2, respectively, consider using Classes of Service to differentiate the cache space available to each core. Intel supports multiple Classes of Service, which enable Core 1 to have a cache region that is potentially separate from, and non-overlapping with, that of Core 2. If App1 is a real-time application, dedicated cache space may be desirable to minimize the impact App2 has on App1’s performance.
    • Starting with 11th Gen Intel® Core™ processors, real-time I/O traffic (designated via Traffic Class 1) can allocate directly into the L3 cache. If the I/O traffic is time-sensitive, the CPU can access the data faster when it resides in the cache rather than in DRAM. Consider allocating a small portion of the cache for I/O traffic.
    • If the integrated GPU is going to be used, consider minimizing the portion of the L3 cache available to the GPU. By default, Intel enables maximum GPU performance by providing access to the entire L3 cache. For real-time designs, maximum GPU performance is often not needed, and a smaller portion of the L3 cache can be used. Carefully selecting the cache available to the GPU, and ensuring that it does not overlap with regions dedicated to real-time applications, will reduce the GPU’s noisy neighbor effect.
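The tool handles these details through its presets and interactive interface; the Python sketch below only illustrates the arithmetic behind the two steps above. It assumes that capacity is granted in whole cache ways (so a software SRAM request is rounded up to whole ways), uses a hypothetical 12 MiB, 12-way cache, and gives every agent its own non-overlapping block of ways. The function plan_scheme and all agent names are hypothetical and not part of the tool.

```python
def plan_scheme(cache_bytes, num_ways, sram_bytes, agent_ways):
    """Hypothetical planner, assuming whole-way granularity.
    Step 1: reserve enough whole ways to cover sram_bytes of software SRAM.
    Step 2: give each agent, in order, its own contiguous block of the remaining ways.
    Returns {name: way bitmask}; raises ValueError if the requests do not fit."""
    way_bytes = cache_bytes // num_ways
    sram_ways = -(-sram_bytes // way_bytes)  # ceiling division: partial ways round up
    if sram_ways > num_ways:
        raise ValueError("software SRAM request exceeds the cache")
    masks = {}
    if sram_ways:
        # Lowest ways go to software SRAM and are no longer available to other agents.
        masks["software_sram"] = (1 << sram_ways) - 1
    next_way = sram_ways
    for agent, count in agent_ways.items():
        if next_way + count > num_ways:
            raise ValueError(f"not enough ways left for {agent}")
        masks[agent] = ((1 << count) - 1) << next_way
        next_way += count
    return masks


MiB = 1024 * 1024
plan = plan_scheme(
    cache_bytes=12 * MiB,         # hypothetical 12 MiB, 12-way L3 (1 MiB per way)
    num_ways=12,
    sram_bytes=MiB + MiB // 2,    # 1.5 MiB of software SRAM rounds up to 2 ways
    agent_ways={"core1_rt": 4, "core2": 4, "gpu": 1, "io": 1},
)
for name, mask in plan.items():
    print(f"{name:14} {mask:012b}")
```

Full isolation is the most conservative choice: it removes noisy neighbors entirely but caps each agent's peak performance at its own ways. In practice you might let non-real-time agents share ways while keeping the real-time core's ways exclusive.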
The tool is intended to be used during the development phase to arrive at an optimal cache partitioning scheme, as determined by the system integrator with feedback from application developers. If cache partitioning requirements change after a system has been deployed to production, you can specify a new cache partitioning scheme, including software SRAM regions, by rerunning the tool on the target system. Because software SRAM and cache partitioning requirements are communicated through firmware, a production system with security measures that lock the BIOS region may require additional steps before the updated configuration can be applied; those steps are outside the scope of the tool.

Cache Configurator and Cache Allocation Library

Intel does not attempt to place any limits on how companies choose to use software SRAM regions once they are created. Application developers can use the Cache Allocation Library to programmatically place user-space data into a software SRAM region. To accomplish this, application developers specify:
  • Size of the buffer required
  • Worst-case access latency for a single element in the buffer
For the Cache Allocation Library to place a buffer in cache, a software SRAM region that can satisfy the size and latency requirements of the request must already exist. System integrators and administrators need to determine the location and size of the required software SRAM regions based on where the applications using the Cache Allocation Library are intended to run and how much memory they may need. This can be a balancing act and may require multiple iterations between application developers and system integrators or administrators.
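The matching rule described above (a request can be served only if some existing region is reachable from the requesting core, has enough free space, and meets the latency bound) can be sketched in Python as follows. This is a hypothetical model, not the Cache Allocation Library's API or internal logic: SramRegion, reserve_buffer, the region names, and the latency numbers are all illustrative assumptions, and what the real library does when no region qualifies is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class SramRegion:
    """Hypothetical description of a software SRAM region (not the library's data structure)."""
    name: str
    cores: frozenset   # CPU cores assumed to reach this region within latency_ns
    free_bytes: int    # capacity not yet handed out to buffers
    latency_ns: int    # worst-case access latency for a single element


def reserve_buffer(regions, core, size, max_latency_ns):
    """Pick a region that satisfies a buffer request and deduct the space.
    Returns the region name, or None if no existing region can satisfy the request."""
    candidates = [
        r for r in regions
        if core in r.cores and r.free_bytes >= size and r.latency_ns <= max_latency_ns
    ]
    if not candidates:
        return None
    region = min(candidates, key=lambda r: r.free_bytes)  # best fit keeps large regions free
    region.free_bytes -= size
    return region.name


KiB = 1024
regions = [
    SramRegion("region_a", frozenset({1}), 128 * KiB, latency_ns=50),
    SramRegion("region_b", frozenset({1, 2, 3}), 1024 * KiB, latency_ns=150),
]
print(reserve_buffer(regions, core=1, size=64 * KiB, max_latency_ns=100))   # region_a
print(reserve_buffer(regions, core=2, size=64 * KiB, max_latency_ns=100))   # None: too slow for core 2
print(reserve_buffer(regions, core=1, size=100 * KiB, max_latency_ns=200))  # region_b: region_a has 64 KiB left
```

The second request fails even though region_b has plenty of space, which is the situation that calls for another iteration between application developers and system integrators: either relax the latency requirement or create a suitable region for core 2.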

Dependencies

The tool is dependent on the following underlying software components. They are available as part of the Intel® best known configuration or as otherwise noted.
  • Real-time configuration driver at the OS level.
  • Real-time configuration manager (RTCM) with the cache reservation library (CRL), or a hypervisor that supports CRL.
  • Real-time configuration data (RTCD) and real-time configuration table (RTCT) at the BIOS level.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.