Improve Performance of Latency-Sensitive Applications with Intel® TCC Tools
Intel® processors are employed across a variety of industries and use cases: from gaming to machine vision to process automation, autonomous mobile robots, and patient monitoring in healthcare settings. Some of these use cases rely on real-time systems with extremely strict time requirements where the hardware and software systems must respond to events within predictable and specific time constraints.
To help meet those real-time computing requirements, Intel introduced real-time hardware processor features and a system software stack optimized for real-time applications. Part of the enabling technology is Intel® TCC Tools, a new set of features that reduces jitter and improves performance for latency-sensitive applications. These features help maximize efficiency by aggregating time-critical and non-time-constrained applications onto a single board.
Intel® TCC Tools includes:
- Cache allocation
- Data streams optimizer
- Measurement library
- Real-time readiness checker
- Time-aware general-purpose input/output (I/O) sample application
- Ethernet timestamps sample application
- Real-time communication demo
[The] strength is that the [cache allocation] democratizes the [real-time tuning] for [software] engineers that don’t have a good background in [hardware] architecture of the cache and memory management and system interactions—otherwise they wouldn’t be able to touch it. —Intel® TCC Cache Library user
Intel® TCC Tools Cache Allocation Feature: Optimize Performance of Real-Time Applications
The Intel® TCC Tools cache allocation capability helps reduce hotspots (latency-sensitive areas in application code) in real-time applications that have a high number of cache misses.
Why Use Cache Allocation?
Cache misses (unsuccessful attempts to read or write data in the cache) increase the latency of real-time applications because the processor must fetch the data from slower memory locations. High latency in real-time applications can have dire consequences in certain industries. The fields of robotics, utilities, and healthcare all have use cases with stricter requirements for synchronization, timeliness, and worst-case execution time guarantees.
The Intel® TCC Tools cache allocation feature helps bound the time needed to access data from a memory buffer based on your specified latency requirements. Cache allocation allows you to reduce cache misses by allocating buffers that are less likely to be evicted from the processor cache.
To create these low-latency buffers, the system uses the software SRAM process. Software SRAM employs hardware capabilities to allocate a portion of the physical address space into the cache. These physical addresses are less likely to be evicted by other processes.
The cache allocation feature includes two additional tools:
- Cache configurator: A command-line tool that displays a visual representation of the cache on your system. It allocates cache for use by the cache allocation library and other compute and I/O resources.
- Cache allocation library: A set of C APIs that allocate low-latency buffers.
When to Use Cache Allocation with Real-Time Applications
Understanding when to use an Intel® TCC Tools feature such as cache allocation depends on your specific use case. Examine an example workflow and read through the steps involved in the graphic below. This workflow uses multiple Intel TCC features: the measurement library, the data streams optimizer, and cache allocation in step 6a.
After steps 1 through 5 are completed, cache allocation is used to further reduce the latency.
Before using the cache allocation capability, you need to know:
- The size of the dataset that your application processes
- The maximum acceptable latency for access
- Any known hotspots in your application’s code
Step 1: Set up your target system with the board support package (BSP), which provides a real-time kernel and optimized drivers. Run your real-time application along with other applications, per your expected use case, under worst-case conditions. Check whether deadlines and system requirements are met. If not, move to the next step.
Step 2: Enable Intel® TCC mode in the firmware. Diagnose whether you have additional real-time needs and where your performance bottlenecks are. Then proceed with Intel® TCC Tools, which provides advanced-level tuning by using features in the processor and BIOS.
Step 3: Install Intel® TCC Tools. Use the real-time readiness checker to verify the configuration.
Step 4: Run your real-time application along with other applications again to recheck deadlines.
Step 5: Instrument your code with measurement library APIs. Use VTune™ Profiler or other profiling tools to find hotspots and bottlenecks.
Step 6a: If you find that data access latency exceeds requirements, use the cache configurator to create software SRAM buffers. Add cache allocation library APIs in your real-time application to use the software SRAM buffers to improve data access timings.
Step 6b: If you find that data transfer latency exceeds requirements, use the data streams optimizer. The data streams optimizer can also balance real-time performance with system power consumption or computational resources available for other tasks.
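The deadline checks in steps 1, 4, and 5 come down to collecting per-iteration latencies and comparing the observed worst case against the requirement. The following is a minimal sketch of that idea in plain C. It does not use the Intel® TCC Tools measurement library; the names latency_stats, summarize, measure, and meets_deadline are hypothetical helpers invented for this illustration, and the POSIX clock_gettime call stands in for whatever timer your platform provides.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Summary of a set of latency samples, in nanoseconds (hypothetical type). */
typedef struct {
    uint64_t min_ns;
    uint64_t max_ns;    /* observed worst case */
    uint64_t avg_ns;
    uint64_t jitter_ns; /* max_ns - min_ns */
} latency_stats;

/* Read a monotonic clock and convert it to nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Reduce raw samples to minimum, maximum, average, and jitter. */
latency_stats summarize(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    latency_stats s = { samples[0], samples[0], 0, 0 };
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min_ns) s.min_ns = samples[i];
        if (samples[i] > s.max_ns) s.max_ns = samples[i];
        sum += samples[i];
    }
    s.avg_ns = sum / n;
    s.jitter_ns = s.max_ns - s.min_ns;
    return s;
}

/* Time `iterations` calls of `work` and store each call's latency. */
void measure(void (*work)(void *), void *arg,
             uint64_t *samples, size_t iterations)
{
    for (size_t i = 0; i < iterations; i++) {
        uint64_t start = now_ns();
        work(arg);
        samples[i] = now_ns() - start;
    }
}

/* Returns 1 if the observed worst case is within the deadline, else 0. */
int meets_deadline(const latency_stats *s, uint64_t deadline_ns)
{
    return s->max_ns <= deadline_ns;
}
```

A typical use would be to call measure around the suspected hotspot while the other applications run, pass the samples to summarize, and move on to step 6a or 6b only if meets_deadline reports a violation. An observed maximum is not a guaranteed worst case, so longer runs under realistic load give a more trustworthy result.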
What Is the Cache Configurator?
The cache configurator is a command-line tool that allows you to manage—or configure—cache resources. This tool is especially helpful when you or any administrator needs to:
- Provide low-latency buffer access to real-time applications running on the system
- Provide mechanisms to improve the worst-case execution time
- Minimize the impact the GPU has on real-time applications running on the CPU cores
- Partition the shared cache resources among the various components using the cache such as CPU, GPU, or I/O
The cache configurator features an interface that allows you to complete complex tasks, such as SRAM buffer management and cache partitioning, without needing to directly configure the low-level details of the cache architecture—ultimately saving you time and resources. Using the interface also allows you to choose a preset cache partitioning scheme or create a custom partitioning scheme.
The following images show the flow of using cache configurator within a system, starting with selecting an option from the preset list and confirming the selection.
Once the selection is confirmed, the tool shows a summary, including existing buffers and requested buffers. It may also indicate that it cannot configure the buffers exactly as requested and shows the allocation that it can complete.
If the edited configuration is acceptable and confirmed, the system will implement the configuration and reboot.
After the system reboots, the tool verifies that the configuration was applied successfully.
What Is a Cache Partitioning Scheme?
A cache partitioning scheme controls which compute and I/O resources (i.e., caching agents) can allocate into the cache and which regions of the cache they can use. When a caching agent requests to allocate a new cache line into the cache, a victim cache line is identified and evicted, and its data is written back to memory before the new cache line is deposited. If an application incurs too many cache misses as a result of other activity that uses the shared caches, the application's performance drops. This is known as the noisy neighbor effect.
You can minimize the noisy neighbor effect by creating partitions in the cache to isolate noisy cache agents. Intel provides you with several preset cache partitioning schemes that offer different levels of cache isolation and software SRAM to address most use cases. The presets will also partition the cache to:
- Establish isolated cache regions dedicated to real-time applications, also known as workloads
- Restrict the GPU from accessing the entire L3 cache and avoid overlap with cache partitions dedicated to real-time workloads
- Dedicate a small portion of the cache for low-latency I/O operations
- Configure L2 and L3 software SRAM buffers
See all the presets Intel offers for selected real-time hardware processors that are optimized for real-time applications.
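To make the noisy neighbor effect and the benefit of partitioning concrete, here is a toy simulation in C. It is only a sketch: it models one 8-way cache set with least-recently-used replacement, and the way masks (0xFF, 0x0F, 0xF0), the working-set sizes, and the names cache_model, cache_access, and run_scenario are all assumptions made for this illustration. It does not reflect how the Intel presets are implemented.

```c
#include <assert.h>
#include <stdint.h>

#define WAYS 8  /* illustrative associativity, not a real platform value */

/* One cache set with LRU replacement (toy model). */
typedef struct {
    uint64_t tag[WAYS];
    uint64_t last_used[WAYS]; /* 0 means the way is empty */
    uint64_t clock;
} cache_model;

/* Access `tag` for an agent that may only allocate into `way_mask`.
 * A hit can occur in any way; a miss evicts the LRU way inside the mask.
 * Returns 1 on hit, 0 on miss. */
int cache_access(cache_model *c, uint64_t tag, unsigned way_mask)
{
    assert((way_mask & ((1u << WAYS) - 1)) != 0);
    c->clock++;
    for (int w = 0; w < WAYS; w++) {
        if (c->last_used[w] != 0 && c->tag[w] == tag) {
            c->last_used[w] = c->clock;
            return 1;
        }
    }
    int victim = -1;
    for (int w = 0; w < WAYS; w++) {
        if (!(way_mask & (1u << w)))
            continue;
        if (victim < 0 || c->last_used[w] < c->last_used[victim])
            victim = w;
    }
    c->tag[victim] = tag;
    c->last_used[victim] = c->clock;
    return 0;
}

/* Alternate a real-time pass over 4 lines with a noisy stream of 16 new
 * lines, and return how many real-time accesses missed the cache. */
unsigned run_scenario(unsigned rt_mask, unsigned noisy_mask, int passes)
{
    cache_model c = {0};
    unsigned rt_misses = 0;
    uint64_t noisy_tag = 1000; /* never collides with real-time tags 0..3 */
    for (int p = 0; p < passes; p++) {
        for (uint64_t t = 0; t < 4; t++)
            rt_misses += !cache_access(&c, t, rt_mask);
        for (int i = 0; i < 16; i++)
            cache_access(&c, noisy_tag++, noisy_mask);
    }
    return rt_misses;
}
```

With both agents sharing all eight ways, run_scenario(0xFF, 0xFF, 10) reports 40 real-time misses, because the noisy stream evicts the real-time lines between every pass. With disjoint masks, run_scenario(0x0F, 0xF0, 10) reports only the 4 cold misses of the first pass.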
What Is the Cache Allocation Library?
The cache allocation library is a set of C language APIs that help reduce memory access latency by allocating buffers from software SRAM. Software SRAM buffers are better protected in cache and less likely to be evicted by other applications.
What Are the Benefits of Using the Cache Allocation Library?
The cache allocation library is particularly useful because it:
- Allows applications to move between Intel® platforms without code refactoring
- Provides the ability to reduce the worst-case execution time (WCET) of a function
- Offers low buffer access latencies, which can improve overall workload performance as measured by WCET
If you have a well-optimized operating environment, the cache allocation library can provide:
- Minimal performance improvement if the workload has a linear memory access pattern, with a significant amount of compute instructions executed between buffer accesses.
- More significant performance improvement if the workload has a random memory access pattern, with minimal amounts of compute instructions being executed between buffer accesses.
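One way to check which of these two cases your workload resembles is to walk a buffer in both orders and compare the timings. The sketch below builds the two index chains for such a pointer-chasing test. It is an illustration only: the function names are hypothetical, the small xorshift generator is there just to keep the example deterministic, and the random chain uses Sattolo's algorithm so that it visits every element exactly once per lap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift64), used only for reproducibility. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ull;

uint64_t next_random(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Linear pattern: element i points to element i + 1, wrapping at the end. */
void build_linear_chain(size_t *next, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = (i + 1) % n;
}

/* Random pattern: Sattolo's algorithm yields a permutation that forms a
 * single cycle, so the chase touches all n elements before repeating. */
void build_random_chain(size_t *next, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(next_random() % i); /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` hops; each hop depends on the previous
 * load, which is what exposes memory latency. Returns the final index. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}

/* Count hops until the chain first returns to index 0 (capped at n + 1). */
size_t cycle_length(const size_t *next, size_t n)
{
    size_t idx = next[0], len = 1;
    while (idx != 0 && len <= n) {
        idx = next[idx];
        len++;
    }
    return len;
}
```

Timing chase over the linear chain and over the random chain, on a buffer sized like your real dataset, gives a rough indication of how much of the workload's time is spent waiting on buffer accesses rather than computing.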
How to Use the Cache Allocation Library
Before you begin using the cache allocation library APIs to allocate buffers in your code, there are two important steps to take:
- Read "Before You Begin Using the Cache Allocation Library" to learn to configure the software SRAM size and affinity mask value for your system, identify your workload hotspots, and determine cache allocation library inputs.
- Use the default preset configuration of software SRAM buffers from the Get Started Process during Intel® TCC Tools install or further configure buffers using the cache configurator.
Once you’ve completed both steps, you can begin using the cache allocation library. To allocate a buffer, follow the six steps documented in the Development Guide.
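As a rough picture of what allocation code can look like, the sketch below wraps buffer allocation behind a small portable layer. Everything here is an assumption to be checked against the Development Guide: the wrapper names (rt_buffers_init, rt_buffer_alloc, rt_buffer_free, rt_buffers_finish) and the USE_TCC_CACHE build flag are hypothetical, and the header path and the tcc_cache_init, tcc_cache_malloc, tcc_cache_free, and tcc_cache_finish calls are written from recollection of the public library interface rather than taken from this article. Without the flag, the code falls back to the standard allocator so it builds on any system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_TCC_CACHE       /* hypothetical build flag for this sketch */
#include "tcc/cache.h"     /* assumed header path; verify in the guide */
#endif

/* Bind the allocator to the CPU core that runs the real-time workload.
 * Assumed to return 0 on success. */
int rt_buffers_init(int cpu_id)
{
#ifdef USE_TCC_CACHE
    return tcc_cache_init((int8_t)cpu_id);   /* assumed signature */
#else
    (void)cpu_id;
    return 0;
#endif
}

/* Request a buffer whose access latency should not exceed max_latency_ns.
 * The fallback path ignores the latency and uses ordinary heap memory. */
void *rt_buffer_alloc(size_t size, uint64_t max_latency_ns)
{
    assert(size > 0);
#ifdef USE_TCC_CACHE
    return tcc_cache_malloc(size, max_latency_ns);   /* assumed signature */
#else
    (void)max_latency_ns;
    return malloc(size);
#endif
}

/* Release a buffer obtained from rt_buffer_alloc. */
void rt_buffer_free(void *ptr)
{
#ifdef USE_TCC_CACHE
    tcc_cache_free(ptr);                     /* assumed signature */
#else
    free(ptr);
#endif
}

/* Release allocator resources at shutdown. Assumed to return 0 on success. */
int rt_buffers_finish(void)
{
#ifdef USE_TCC_CACHE
    return tcc_cache_finish();               /* assumed signature */
#else
    return 0;
#endif
}
```

The expected call order is initialize once, allocate the buffers used in the hotspot, run the workload, free the buffers, and finish. Keeping the library calls behind a thin wrapper like this also makes it easy to compare the same workload with and without software SRAM.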
See Intel® TCC Tools in Action
Watch this video to see a demonstration of Intel® TCC Tools on:
- Cache allocation optimization workflow
- Command line tools for cache configurator
- Examples of measuring cache latencies with an internal and external noisy neighbor
Run Cache Allocation Samples
Follow the Cache Allocation Sample in the developer guide to try out the cache allocation feature. The guide provides two samples that demonstrate the benefits of the cache allocation library:
- The internal noisy neighbor sample demonstrates the effect of cache line eviction within the same processor core and how the cache allocation library can help minimize the latency.
- The external noisy neighbor sample demonstrates a more common use case where two processor cores compete for resources in a shared cache. With the cache allocation library, you can lock the critical data and prevent its cache lines from being evicted by a processor core that is running a less critical workload.