|
Several recent architectures use a split last-level cache to shorten the time-to-market of a dual-core system.
This solution has clear downsides:
- Cache-coherence events that must be served over the front-side bus (FSB), such as read-for-ownership (RFO) or
invalidation signals, greatly impact performance and power.
- A single-threaded (ST) application cannot take advantage of the entire cache.
However, the hard-partitioned cache may have one significant benefit over the unified cache: it can prevent one
application from significantly reducing the amount of cache available to an application running on the other core.
In this section we therefore compare two systems: one with a split L2 cache and one with a unified L2 cache. To keep
the comparison fair, we present speedup numbers rather than absolute numbers.
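The isolation benefit of a hard-partitioned cache described above can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not a model of either system under study: the `LRUCache` class, the `app_a_misses` function, the capacities (8 lines shared versus 4 lines per core), and the access pattern are all hypothetical. Application A reuses a small working set while application B streams through fresh lines.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Return True on a hit; on a miss, insert the line and evict the LRU one."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        self.lines[line] = None
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)
        return False


def app_a_misses(shared, rounds=3):
    """Count misses seen by app A (small reused working set) while app B streams."""
    if shared:
        cache_a = cache_b = LRUCache(8)          # one unified cache for both cores
    else:
        cache_a, cache_b = LRUCache(4), LRUCache(4)  # hard partition, half each
    misses, next_b = 0, 0
    for _ in range(rounds):
        for line in range(4):                    # app A reuses the same 4 lines
            if not cache_a.access(("A", line)):
                misses += 1
        for _ in range(8):                       # app B touches 8 fresh lines
            cache_b.access(("B", next_b))
            next_b += 1
    return misses
```

In this toy setup, app A misses only on its first pass when the cache is partitioned, but misses on every pass when the cache is shared, because app B's streaming accesses evict A's lines between passes.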
We created a sample physics-engine game (using Microsoft DirectX*) for this study. The application is multithreaded (MT)
using data domain decomposition, and the threads are synchronized before the updates are rendered to the screen. Since
the dependency among the threads is minimal, we expected the MT version to achieve ~2.0x the performance of the ST
version.
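As a rough illustration of data domain decomposition with a synchronization point before rendering, consider the minimal sketch below. It is not the game's actual code: the particle update, the function names `step_particles` and `simulate_frame`, and the parameters are assumptions made for illustration. Note also that on standard CPython builds the global interpreter lock prevents these threads from running in parallel, so the sketch shows only the structure, not the speedup.

```python
import threading


def step_particles(positions, velocities, start, end, dt):
    """Advance one contiguous slice of the particle arrays by one time step."""
    for i in range(start, end):
        positions[i] += velocities[i] * dt


def simulate_frame(positions, velocities, dt, num_threads=2):
    """Split the particle range across threads, then wait for all of them."""
    n = len(positions)
    chunk = (n + num_threads - 1) // num_threads  # ceiling division
    workers = []
    for t in range(num_threads):
        start = min(t * chunk, n)
        end = min(start + chunk, n)
        worker = threading.Thread(
            target=step_particles,
            args=(positions, velocities, start, end, dt),
        )
        workers.append(worker)
        worker.start()
    # Synchronization point: every slice must be updated before rendering.
    for worker in workers:
        worker.join()
    return positions  # a real engine would hand this state to the renderer
```

Each thread owns a disjoint index range, so the threads never write the same element and the only coordination needed is the join before the frame is drawn.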
On the system with the split L2 cache, the MT version showed approximately a 1.68x performance improvement. Running the
same application on the Intel® Core™ Duo processor-based system showed ~1.90x scaling, in line with our expectations.
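To put the two scaling figures side by side, it can help to express them as parallel efficiency, that is, the achieved speedup divided by the number of cores. The helper names below are hypothetical; only the 1.68x and 1.90x values come from the measurements above. On two cores these work out to 84% and 95% of ideal linear scaling.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / cores


def scaling_ratio(speedup_a, speedup_b):
    """Relative difference in scaling between two systems.

    Each speedup is measured against that system's own ST baseline,
    so this compares scaling only, not absolute performance.
    """
    return speedup_b / speedup_a - 1.0


split = parallel_efficiency(1.68, 2)
unified = parallel_efficiency(1.90, 2)
gap = scaling_ratio(1.68, 1.90)
print(f"split L2: {split:.0%}, unified L2: {unified:.0%}, scaling gap: {gap:.1%}")
```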
The difference in scaling is due to the shared L2 cache on the Intel Core Duo system. The sample application is designed
so that both threads work on data from a shared data structure. On the system with the split L2 cache, the second
processor must therefore go to main memory to access data modified by the first processor, which results in many L2
cache misses. Because the Intel Core Duo system has a unified L2 cache, the penalty of a cache miss and a main-memory
access is avoided: data modified by one core is immediately available to the other core.
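The sharing effect described above can be sketched with a small miss-counting model. This is a deliberately simplified illustration under stated assumptions, not a model of the measured hardware: it assumes two cores, unlimited cache capacity, a simple write-invalidate scheme, and no L1 caches. The functions `count_l2_misses` and `producer_consumer_trace` are hypothetical names introduced here.

```python
def count_l2_misses(trace, unified):
    """Count L2 misses for a trace of (core, op, line) tuples.

    Toy model: two cores, unlimited capacity, write-invalidate coherence.
    """
    # One shared set of lines for a unified L2, one private set per core otherwise.
    shared = set()
    caches = [shared, shared] if unified else [set(), set()]
    misses = 0
    for core, op, line in trace:
        mine, other = caches[core], caches[1 - core]
        if line not in mine:
            misses += 1          # line must come over the bus or from memory
            mine.add(line)
        if op == "w" and other is not mine:
            other.discard(line)  # split L2: invalidate the other core's stale copy
    return misses


def producer_consumer_trace(lines, frames):
    """Core 0 writes every line, then core 1 reads every line, once per frame."""
    trace = []
    for _ in range(frames):
        trace += [(0, "w", line) for line in range(lines)]
        trace += [(1, "r", line) for line in range(lines)]
    return trace
```

With a trace in which one core writes a shared structure and the other reads it every frame, the split model takes a fresh miss on every read after each write, while the unified model misses only when a line is first touched.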
|