Developer Guide

  • 2021.1
  • 11/03/2021
  • Public

Why Use the Data Streams Optimizer

Intel engineers continuously test real-time workloads in the lab and experiment with the tuning knobs that most affect real-time performance. Some of these knobs are documented for customers in the Intel® 64 and IA-32 Architectures Software Developer’s Manual or in platform-specific documents such as the External Design Specification (EDS) or BIOS Writer’s Guide. Other knobs are not published for multiple reasons, including, but not limited to, supportability and portability. Customers use the published tuning knobs to adjust platform performance to their requirements and, when those knobs are insufficient, may engage in a back-and-forth with Intel for help tuning the platform further based on Intel internal expertise. This process can be complex and time-consuming.
The data streams optimizer simplifies the tuning process. To accomplish this, it auto-generates custom tuning configurations that can meet workload-specific real-time performance requirements without overprovisioning the best-effort (non-real-time) capabilities and power management of the system. The automatic tuning process adjusts visible and hidden tuning knobs in the form of a series of register writes.
This fine-tuning is achieved through a three-level platform tuning strategy that systematically reduces worst-case execution time (WCET) using an iterative process of elimination (commonly referred to as “knocking down the long pole in the tent”). Working through these three levels, the tool eliminates the largest source of jitter, validates whether the optimizations applied so far are sufficient to meet the workload requirements, and repeats the process until either success or failure (where failure suggests that the hard limits of the silicon have been reached).
The tool’s tuning strategy assumes that known interference vectors have been identified and can be eliminated (or at least mitigated) through platform optimizations. When these optimizations are ordered and weighted by estimated jitter reduction, a pattern emerges, showing three levels of tuning stratification (from highest to lowest estimated jitter reduction): platform power management, Intel® Time Coordinated Computing (Intel® TCC) features, and fabric tuning.
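The iterative “knock down the long pole in the tent” strategy described above can be sketched as a simple loop. The level names and the validation callback below are hypothetical placeholders for illustration only, not the data streams optimizer’s actual interface.

```python
# Illustrative sketch of the three-level, process-of-elimination tuning loop.
# The level names and the validate callback are hypothetical placeholders,
# not the data streams optimizer's real interface.

def tune(levels, validate):
    """Apply tuning levels in order of estimated jitter reduction,
    stopping as soon as the workload requirement is met."""
    applied = []
    for level in levels:           # highest estimated jitter reduction first
        applied.append(level)      # eliminate the current largest jitter source
        if validate(applied):      # measure WCET against the requirement
            return applied, True   # success: requirement met
    return applied, False          # all levels applied; silicon limit reached

# Example run with a stubbed-out validator that passes after two levels.
levels = ["power_management", "tcc_features", "fabric"]
result, ok = tune(levels, lambda applied: len(applied) >= 2)
```

In the real tool, the validation step corresponds to running the workload and comparing measured worst-case latency against the stated requirement; the sketch only captures the control flow.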

Platform Power Management Tuning

The data streams optimizer addresses the conflict between power and performance. High-throughput, low-latency performance requires that the CPU run constantly at maximum frequency, while power management features reduce energy consumption by either placing the CPU into a low-power state or reducing its operating frequency.
For real-time applications that require consistent performance, power management features can negatively affect consistency by sporadically increasing latency when parts of the CPU either exit low-power states or lock phase-locked loops (PLLs) to increase frequency. However, for real-time use cases where low-power operation is also important, disabling all power management is counterproductive. The right balance between power management and performance consistency is necessary to meet both goals.
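As one concrete example of a published power-management knob, Linux exposes `/dev/cpu_dma_latency`: while a process holds the device open after writing a latency target, the kernel avoids idle states whose exit latency exceeds that target. The sketch below is illustrative only (the `path` parameter exists solely so the helper can be exercised without root privileges) and is not part of the data streams optimizer itself.

```python
import os
import struct

def request_exit_latency(max_us: int, path: str = "/dev/cpu_dma_latency"):
    """Ask the Linux kernel to avoid idle states with exit latency above max_us.

    The request stays in effect only while the returned file descriptor
    remains open, so the caller must keep it for the workload's lifetime.
    """
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, struct.pack("=i", max_us))  # native-endian 32-bit microseconds
    return fd

# A target of 0 effectively pins the CPU in its shallowest idle state,
# trading power consumption for the most consistent wakeup latency.
```

This illustrates the tradeoff in the text: a target of 0 maximizes consistency at a power cost, while a small nonzero target permits shallow low-power states whose exit latency the workload can tolerate.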

Intel® TCC Feature Tuning

In general, Intel® TCC features are SoC-level optimizations that entail major design impact across multiple subsystems on the SoC. Intel® TCC features often aim to improve specific workloads or data flows (for example, PCIe reads from memory and CPU writes to memory), but can have widespread negative side effects on best-effort performance.
This behavior is situational. High-impact, narrow-scope improvements with wide-scope side effects make these Intel® TCC features impractical to deploy in out-of-the-box, non-real-time applications, but targeted Intel® TCC feature tuning can significantly improve performance for real-time applications.

Fabric Tuning

Real-time performance is bounded by the worst-case execution or transaction transmission latency. One major factor that contributes to worst-case performance is contention for shared hardware resources such as the processor cores, data buses, memory, and processor fabric. Real-time data streams may be forced to wait while resources are used by best-effort data streams. Arbitration is the mechanism that manages the utilization of shared resources among the various requesters.
Some of this arbitration occurs between SoC subsystems (such as arbitration between the CPU cores and the uncore), but the majority occurs at the microarchitecture level, between small-scale subcomponents. In extremely precise real-time control applications, where a few microseconds or less of jitter may cause a deadline violation, fine-grained control of system arbitration may be required.
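Jitter at this scale is only visible through direct measurement. The sketch below shows the general idea by timing repeated iterations of a stand-in workload and reporting the best case, worst case, and their spread; the lambda workload is a placeholder, and a real measurement would run the actual real-time task on an isolated, tuned core.

```python
import time

def measure_jitter(workload, iterations=1000):
    """Time repeated runs of `workload` and report latency statistics in ns.

    Jitter here is the spread between the best and worst observed
    iteration, a rough proxy for worst-case variation; a rigorous WCET
    analysis requires far more samples and a controlled environment."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        workload()
        samples.append(time.perf_counter_ns() - start)
    return min(samples), max(samples), max(samples) - min(samples)

# Stand-in workload: a short computation loop (placeholder for a real task).
best, worst, jitter = measure_jitter(lambda: sum(range(100)))
```

Comparing such measurements before and after applying a tuning configuration is one way to observe whether contention-related jitter has actually been reduced.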
Tuning the platform for real-time performance can directly or indirectly affect other important system characteristics such as power, thermal behavior, and the system’s ability to enter low-power states. When a tuning configuration is selected, be sure to perform a full system analysis to determine its impact, if any, on performance metrics beyond real-time performance.
