Why Use the Data Streams Optimizer
Intel engineers continuously test real-time workloads in the lab
environment and experiment with tuning knobs that most affect real-time
performance. Some of these knobs are provided to customers in the Intel®
64 and IA-32 Architectures Software Developer’s Manual or
platform-specific documents such as the External Design Specification
(EDS) or BIOS Writer’s Guide. Other knobs are not published, for reasons
that include supportability and portability.
Customers use the published tuning knobs to adjust platform performance
to their requirements and may engage with Intel over several rounds for
help with further tuning based on Intel expertise. This process can be
complex and time-consuming.
The data streams optimizer simplifies this process by automating it. The tool applies tuning configurations, each expressed as a series of register writes, that adjust both visible (published) and hidden (unpublished) tuning knobs.
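As a rough illustration of what a series of register writes can look like, the following Python sketch models a tuning configuration as an ordered list of masked writes applied to a simulated register map. The `RegisterWrite` type, the `apply_config` helper, the masked read-modify-write semantics, and all addresses and values are hypothetical and chosen only for illustration. They do not describe the tool's actual configuration format or any real register.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RegisterWrite:
    """One hypothetical tuning step: update the masked bits of a register."""
    address: int
    mask: int
    value: int


def apply_config(registers: Dict[int, int],
                 writes: Iterable[RegisterWrite]) -> Dict[int, int]:
    """Return a copy of the simulated register map with the writes applied in order."""
    updated = dict(registers)
    for write in writes:
        current = updated.get(write.address, 0)
        # Read-modify-write: keep bits outside the mask, replace bits inside it.
        updated[write.address] = (current & ~write.mask) | (write.value & write.mask)
    return updated


# Made-up register contents and configuration, for illustration only.
simulated_registers = {0x100: 0xFF, 0x104: 0x00}
tuning_config = [
    RegisterWrite(address=0x100, mask=0x0F, value=0x05),
    RegisterWrite(address=0x104, mask=0xF0, value=0xA0),
]
tuned_registers = apply_config(simulated_registers, tuning_config)
```

Modeling each knob as an address, mask, and value keeps unrelated bits in the same register untouched, which is one plausible reason to express a configuration this way.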
Finding a Balance Between Real-Time Performance and Power Management
The data streams optimizer meets workload-specific real-time performance requirements without giving up more of the system's best-effort (non-real-time) capabilities and power management than the workload requires. This fine-tuning is achieved through a three-level platform tuning strategy that systematically reduces worst-case execution time (WCET) using an iterative process of elimination (commonly referred to as “knocking down the long pole in the tent”). Working through the three levels, the tool eliminates the largest remaining source of jitter, validates whether the optimizations applied so far are sufficient to meet workload requirements, and repeats the process until it either succeeds or fails. Failure suggests that the requirements exceed the hard limits of the processor.
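The iterative process described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool's implementation: `tune_until_met`, the `apply_level` and `measure_wcet_us` callbacks, the level names, and the toy numbers are all hypothetical.

```python
from typing import Callable, List, Optional


def tune_until_met(levels: List[str],
                   apply_level: Callable[[str], None],
                   measure_wcet_us: Callable[[], float],
                   requirement_us: float) -> Optional[List[str]]:
    """Apply tuning levels in order, validating after each one.

    Returns the levels that were applied when the measured WCET meets the
    requirement, or None if every level was applied without success.
    """
    applied: List[str] = []
    for level in levels:
        apply_level(level)            # eliminate the next-largest jitter source
        applied.append(level)
        if measure_wcet_us() <= requirement_us:
            return applied            # stop early: no further tuning is needed
    return None                       # requirement not met after all levels


# Toy platform model with made-up numbers, for illustration only.
state = {"wcet_us": 120.0}
reduction_us = {"power_management": 60.0, "tcc_features": 30.0, "fabric": 10.0}


def apply_level(level: str) -> None:
    state["wcet_us"] -= reduction_us[level]


def measure_wcet_us() -> float:
    return state["wcet_us"]


result = tune_until_met(list(reduction_us), apply_level, measure_wcet_us,
                        requirement_us=40.0)
```

In this toy run the loop stops after the second level because the requirement is already met, so the third level is never applied. Stopping early is what keeps the remaining best-effort performance and power management intact.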
The tool’s tuning strategy relies on known interference vectors that
have been identified and can be eliminated or mitigated through
platform optimizations. When these optimizations are ordered and
weighted by estimated jitter reduction, a pattern emerges, showing
three levels of tuning stratification (from highest to lowest
estimated jitter reduction): power management, Intel® Time
Coordinated Computing (Intel® TCC) features, and fabric tuning.
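To make the ordering step concrete, the sketch below ranks a handful of optimizations by estimated jitter reduction and reads off the order in which the levels appear. The optimization names and the microsecond estimates are invented for illustration and were chosen so that the three levels separate cleanly. They are not measurements and do not name real knobs.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Optimization:
    name: str                       # hypothetical optimization name
    level: str                      # tuning level it belongs to
    est_jitter_reduction_us: float  # made-up estimate, in microseconds


CANDIDATES = [
    Optimization("fabric_arbitration_weight", "fabric", 2.0),
    Optimization("disable_deep_idle_states", "power_management", 80.0),
    Optimization("tcc_feature_a", "tcc_features", 15.0),
    Optimization("fixed_core_frequency", "power_management", 50.0),
    Optimization("tcc_feature_b", "tcc_features", 10.0),
    Optimization("fabric_queue_setting", "fabric", 1.0),
]

# Order optimizations from highest to lowest estimated jitter reduction.
ranked = sorted(CANDIDATES, key=lambda o: o.est_jitter_reduction_us, reverse=True)

# Collapse consecutive entries of the same level to expose the stratification.
level_order = [level for level, _ in groupby(ranked, key=lambda o: o.level)]
```

With these illustrative estimates, `level_order` lists power management first, then Intel® TCC features, then fabric tuning, matching the stratification described above.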
Power Management Tuning
The data streams optimizer addresses the conflict between power and
performance. High-throughput, low-latency performance requires running
the CPU constantly at maximum frequency, whereas power management
features reduce energy consumption by either placing the CPU into a
low-power state or reducing the operating frequency.
For real-time applications that require consistent performance, power
management features can negatively affect consistency by sporadically
increasing latency when parts of the CPU either exit low-power states or
lock phase-locked loops (PLLs) to increase frequency. However, for
real-time use cases where low-power operation is also important,
disabling all power management is counterproductive. The right balance
of power management versus performance consistency is necessary to meet
both of these goals.
See the following diagram for a visual description of the balance between power and performance.

Intel® TCC Feature Tuning
In general, Intel® TCC features are processor-level optimizations whose
design affects multiple subsystems on the processor. Intel® TCC
features often aim to improve specific workloads or data flows (for
example, PCIe-from-memory reads and CPU-to-memory writes), but they can
have widespread negative side effects on best-effort performance.
Whether that trade-off is worthwhile is situational. High-impact,
narrow-scope improvements with wide-scope side effects make these
Intel® TCC features impractical to deploy in out-of-the-box,
non-real-time applications, but targeted Intel® TCC feature tuning can
significantly improve performance for real-time applications.
Fabric Tuning
Real-time performance is bounded by the worst-case execution or
transaction transmission latency. One major factor that contributes to
worst-case performance is contention for shared hardware resources, such
as processor cores, data buses, memory, and the processor fabric.
Real-time data streams may be forced to wait while resources are used by
best-effort data streams. Arbitration is the mechanism that manages the
use of shared resources among the various requesters.
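The effect of arbitration on worst-case waiting time can be illustrated with a toy simulation. The model below is an assumption made for illustration only: a single shared resource grants one request per time slot, the best-effort requesters always have a request pending, and two simple policies (round-robin and strict priority for the real-time requester) are compared. It does not represent the arbitration schemes used in any actual processor, and `worst_case_rt_wait` is a hypothetical helper.

```python
from typing import Iterable


def worst_case_rt_wait(num_best_effort: int, rt_arrivals: Iterable[int],
                       policy: str, horizon: int) -> int:
    """Return the longest wait, in slots, seen by the real-time requester.

    Requesters 0..num_best_effort-1 are best-effort and always have a request
    pending; requester num_best_effort is the real-time requester. The shared
    resource grants exactly one request per slot.
    """
    if num_best_effort < 1:
        raise ValueError("need at least one best-effort requester")
    if policy not in ("round_robin", "priority"):
        raise ValueError("unknown policy")
    rt_id = num_best_effort
    total = num_best_effort + 1
    arrivals = set(rt_arrivals)
    rt_pending_since = None
    pointer = 0
    worst = 0
    for slot in range(horizon):
        if slot in arrivals and rt_pending_since is None:
            rt_pending_since = slot
        rt_pending = rt_pending_since is not None
        if policy == "priority" and rt_pending:
            grant = rt_id  # the real-time request always wins arbitration
        else:
            # Round-robin: first requester at or after the pointer that has
            # a pending request (best-effort requesters are always pending).
            order = [(pointer + i) % total for i in range(total)]
            grant = next(r for r in order if r != rt_id or rt_pending)
        pointer = (grant + 1) % total
        if grant == rt_id:
            worst = max(worst, slot - rt_pending_since)
            rt_pending_since = None
    return worst


rr_wait = worst_case_rt_wait(3, [0, 5, 11], "round_robin", horizon=20)
prio_wait = worst_case_rt_wait(3, [0, 5, 11], "priority", horizon=20)
```

In this toy model the real-time request waits behind up to three best-effort grants under round-robin and is granted immediately under the priority policy. The priority policy buys that bound by delaying best-effort traffic, which mirrors the trade-off between real-time and best-effort performance.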
Some of this arbitration occurs between processor subsystems (such as
arbitrating between the CPU cores and the uncore), but the majority of
arbitration occurs at the microarchitecture level, between small-scale
subcomponents. In extremely precise real-time control applications,
where a couple of microseconds or less of jitter may cause a deadline
violation, fine-tuned control of system arbitration may be required.
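To show how little jitter it takes to miss a deadline, the following sketch computes jitter as the spread between the fastest and slowest observed latency and lists the samples that exceed a deadline. The helper names, the latency samples, and the 50-microsecond deadline are hypothetical. Values are kept in integer nanoseconds to avoid floating-point rounding.

```python
from typing import List, Sequence


def jitter_ns(latencies_ns: Sequence[int]) -> int:
    """Jitter taken as the spread between the slowest and fastest sample."""
    return max(latencies_ns) - min(latencies_ns)


def deadline_violations(latencies_ns: Sequence[int], deadline_ns: int) -> List[int]:
    """Return the samples that finished after the deadline."""
    return [latency for latency in latencies_ns if latency > deadline_ns]


# Made-up cycle latencies in nanoseconds, for illustration only.
samples_ns = [48_200, 48_900, 49_100, 48_500, 50_700]
deadline_ns = 50_000

observed_jitter_ns = jitter_ns(samples_ns)                  # 2_500 ns = 2.5 us
missed = deadline_violations(samples_ns, deadline_ns)       # [50_700]
```

Here a spread of only 2.5 microseconds is enough for one cycle to miss the deadline, even though the other samples leave about a microsecond or more of margin.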
Tuning the platform for real-time performance can affect other aspects of the system, such as power consumption, thermal behavior, and the ability of the system to enter low-power states. Perform a full system analysis to determine the impact of the configuration on other performance metrics.