|
As in any parallel system, the performance and power of the Intel® Core™ Duo processor can be sensitive to memory access
patterns. This section reviews three optimizations that are important for getting the best out of the system.
Efficient Use of the Shared L2 Cache
Sharing data between two threads on the Intel Core Duo processor is fastest when done through the L2 cache. This section
examines several scenarios for sharing.
One scenario is when one thread brings the data from memory and the other thread later uses this data directly from the L2
cache. If the single-threaded workload needs to bring the same data from memory several times, but the multi-threaded (MT)
version is carefully designed so that the two threads use the same data simultaneously, the MT version gains performance by
bringing the data from memory into the cache hierarchy fewer times. Such a design can help applications whose data set is
larger than the L2 cache, and can even achieve a performance improvement of more than 2x.
Another scenario is when one thread generates the data and the other thread consumes it. A couple of variations of this
scenario are possible and are further explained in the "Producer Consumer Models," Section 5.3 of the Intel®
Core™ Duo Processor Optimization Guide[4]. Briefly, they are the "Delay" approach and "Symmetric"
approach. Below is an example of the expected speedup when the producer-consumer model is run on an Intel Core Duo
processor vs. a Dual Core Intel® Xeon® processor vs. an Intel® Pentium® 4 processor with Hyper-Threading Technology¹
(W = Write, R = Read, xxK = buffer size).
Not only do these data show the benefit of avoiding the bus/memory latency, they also demonstrate how different multi-processor
implementations behave in both code affinity (functional) decomposition and data affinity (data) decomposition threading
models. If the produced/consumed data set size is bigger than the L1 data cache size, yet smaller than the L2 cache size,
data decomposition and functional decomposition yield similar performance (assuming the functional decomposition
implementation is well balanced), which is also the best performance that can be achieved for data sharing.

Figure 9: Code vs. data affinity performance on various processors
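The producer-consumer scenario can be sketched as follows. This is one possible reading of the "Delay" approach, in which the producer fills one buffer while the consumer reads the buffer filled in the previous round; the authoritative description is in the Optimization Guide. The sketch uses POSIX threads and semaphores, and the names `pipeline`, `producer`, `consumer`, `run_pipeline`, and the buffer size `BUF_ELEMS` are hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>

#define BUF_ELEMS 4096   /* hypothetical: 16 KB per buffer, so both fit in L2 */
#define NUM_BUFS  2

typedef struct {
    int bufs[NUM_BUFS][BUF_ELEMS];
    sem_t filled;        /* buffers written and waiting for the consumer */
    sem_t empty;         /* buffers the producer may overwrite */
    int rounds;
    long long total;     /* written once by the consumer at the end */
} pipeline;

static void *producer(void *p)
{
    pipeline *pl = p;
    for (int r = 0; r < pl->rounds; r++) {
        sem_wait(&pl->empty);               /* wait for a free buffer */
        int *buf = pl->bufs[r % NUM_BUFS];
        for (int i = 0; i < BUF_ELEMS; i++)
            buf[i] = r + i;                 /* W: produce */
        sem_post(&pl->filled);              /* hand the buffer over */
    }
    return NULL;
}

static void *consumer(void *p)
{
    pipeline *pl = p;
    long long total = 0;
    for (int r = 0; r < pl->rounds; r++) {
        sem_wait(&pl->filled);              /* wait for a produced buffer */
        const int *buf = pl->bufs[r % NUM_BUFS];
        for (int i = 0; i < BUF_ELEMS; i++)
            total += buf[i];                /* R: consume */
        sem_post(&pl->empty);               /* give the buffer back */
    }
    pl->total = total;
    return NULL;
}

/* Runs `rounds` produce/consume rounds and returns the consumed sum. */
long long run_pipeline(int rounds)
{
    pipeline pl;
    pthread_t prod, cons;

    pl.rounds = rounds;
    pl.total = 0;
    sem_init(&pl.filled, 0, 0);
    sem_init(&pl.empty, 0, NUM_BUFS);
    /* Error handling is omitted in this sketch. */
    pthread_create(&prod, NULL, producer, &pl);
    pthread_create(&cons, NULL, consumer, &pl);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    sem_destroy(&pl.filled);
    sem_destroy(&pl.empty);
    return pl.total;
}
```

With two buffers the producer can run at most one buffer ahead of the consumer, so the data the consumer reads was written recently by the other core and is likely to still be in the shared L2 cache, provided the buffers are sized accordingly.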
False Sharing Can Reduce Performance
False sharing happens when two or more threads simultaneously access different address ranges on the same cache line. The
cache line then resides in the first-level caches of both cores.
False sharing causes a severe performance penalty if one or more of the threads writes to the shared cache line, because each
write invalidates the copy of the line in the first-level cache of the other core. As a result, the next time the other core
accesses the cache line in question, it has to transfer the line through the bus from the core that wrote it earlier,
thereby incurring a major latency penalty.
Below is an example of code that has false sharing when executed by several threads simultaneously.
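int counter[THREAD_NUM];    /* adjacent ints: several counters share one cache line */

/* my_tid is the index of the calling thread, 0 .. THREAD_NUM-1 */
int inc_counter(void)
{
    counter[my_tid]++;      /* the write invalidates the line in the other core's L1 cache */
    return counter[my_tid];
}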
Table 1 lists the penalties that an application can suffer if false sharing occurs frequently on an Intel Core Duo system. To
avoid this unnecessary overhead, the programmer needs to avoid false sharing and, in particular, make sure it does not
occur unintentionally in the following cases:
- Global data variables and static data variables that are placed in the same cache line but are written by different threads.
- Objects allocated dynamically by different threads can accidentally share cache lines.
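One common way to address both cases is to give each thread's data its own cache line. The sketch below is a padded variant of the `inc_counter` example, assuming a 64-byte cache line and a C11 compiler; `CACHE_LINE`, `padded_counter`, and `new_counter` are hypothetical names, and the thread index is passed as a parameter here instead of being read from a global.

```c
#include <stdlib.h>

#define CACHE_LINE 64    /* assumed cache-line size in bytes */
#define THREAD_NUM 2

/* Aligning the member to a full line also pads the struct to a full line,
 * so consecutive array elements never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) int value;
} padded_counter;

static padded_counter counter[THREAD_NUM];

/* Each thread increments only its own counter, on its own cache line. */
int inc_counter(int my_tid)
{
    return ++counter[my_tid].value;
}

/* Heap variant: a per-thread object that starts on a cache-line boundary
 * and occupies a whole line. Release it with free(). */
padded_counter *new_counter(void)
{
    padded_counter *c = aligned_alloc(CACHE_LINE, sizeof *c);
    if (c)
        c->value = 0;
    return c;
}
```

The trade-off is memory: each counter now occupies a full line instead of four bytes, which is usually acceptable for a small number of per-thread variables.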
Table 1: False sharing penalties
| Case | Data location | Latency (cycles/nsec) |
|---|---|---|
| L1 to L1 Cache | L1 Cache | 14 core cycles + 5.5 bus cycles |
| Through L2 Cache | L2 Cache | 14 core cycles |
| Through Memory | Main memory | 14 core cycles + 5.5 bus cycles + ~40-80 nsec depending on FSB and DDR freq. |
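Because the table mixes core cycles, bus cycles, and nanoseconds, it can help to convert a row into a single unit. The helper below is a small sketch of that arithmetic. It assumes that "bus cycles" means cycles of the front-side-bus clock, which the table does not state explicitly; `latency_ns` is a hypothetical name, and the clock frequencies are supplied by the caller.

```c
/* Converts a latency given as core cycles plus bus cycles, plus an optional
 * fixed term already in nanoseconds, into nanoseconds.
 * Frequencies are in MHz, so one cycle lasts 1000 / MHz nanoseconds. */
double latency_ns(double core_cycles, double core_mhz,
                  double bus_cycles, double bus_mhz, double extra_ns)
{
    return core_cycles * 1000.0 / core_mhz
         + bus_cycles * 1000.0 / bus_mhz
         + extra_ns;
}
```

For example, with an illustrative 2000 MHz core clock and a 166 MHz bus clock, `latency_ns(14, 2000, 5.5, 166, 0)` evaluates the "L1 to L1 Cache" row to roughly 40 ns, most of which comes from the bus term.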
Optimize Bus Access Between the Cores to Maximize the Bus Bandwidth
Be careful when parallelizing code sections whose data sets exceed the second-level cache size and/or whose memory traffic
exceeds the bus bandwidth. If only one of the threads uses the second-level cache and/or bus, it can be expected to achieve
the maximum possible speedup, as long as the thread running on the other core does not interfere with its progress. However,
if both threads use the second-level cache, performance may degrade if any of the following conditions is true:
- Their combined data set is greater than the second-level cache size.
- Their combined bus usage is greater than bus capacity.
- They both access the same set in the second-level cache extensively, and at least one of the threads writes to a cache
line in that set.
To avoid these problems, we recommend that you investigate parallelism schemes in which only one of the threads accesses the
second-level cache at a time, or in which the combined use of the second-level cache and the bus stays within their
limits. This concept is explained further in Section 5.3.5 of the Intel® Core™ Duo Processor Optimization Guide.
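The first two conditions can be checked with simple budgeting before choosing a parallelization scheme. The helpers below are a rough sketch and not a method from the guide: `working_sets_fit`, `bus_within_capacity`, and `block_elems` are hypothetical names, the cache size and bus capacity are supplied by the caller, and the choice to budget only half of the L2 cache for blocked data is an assumed safety margin.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the two threads' working sets together fit in a shared
 * second-level cache of l2_bytes. Written to avoid size_t overflow. */
bool working_sets_fit(size_t set_a, size_t set_b, size_t l2_bytes)
{
    return set_a <= l2_bytes && set_b <= l2_bytes - set_a;
}

/* True if the threads' estimated bus traffic stays within the bus
 * capacity. All three values must use the same unit, e.g. MB/s. */
bool bus_within_capacity(double traffic_a, double traffic_b, double capacity)
{
    return traffic_a + traffic_b <= capacity;
}

/* Largest block, in elements, that each of `threads` threads can work on
 * at once so that all blocks together occupy at most half of the cache. */
size_t block_elems(size_t l2_bytes, size_t elem_size, size_t threads)
{
    if (elem_size == 0 || threads == 0)
        return 0;
    return (l2_bytes / 2) / threads / elem_size;
}
```

For a 2 MB shared cache, two threads, and 4-byte elements, `block_elems(2 * 1024 * 1024, 4, 2)` returns 131072 elements (512 KB) per thread.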
|