Intel Technology Journal - Featuring Intel's Recent Research and Development
Intel® Centrino® Duo Mobile Technology
Volume 10    Issue 02    Published May 15, 2006
ISSN 1535-864X    DOI: 10.1535/itj.1002.02

  Section 7 of 10  
CMP Implementation in Systems Based on the Intel® Core™ Duo Processor
OPTIMIZATION OPPORTUNITIES FOR INTEL® CORE™ DUO PROCESSOR

Like any parallel system, the Intel® Core™ Duo processor may be sensitive to memory access patterns, in both performance and power. This section reviews three optimizations that are important for getting the best out of the system.

Efficient Use of the Shared L2 Cache

Sharing data between two threads on the Intel Core Duo processor is fastest when done through the L2 cache. This section examines several scenarios for sharing.

One scenario is when one thread brings the data from memory and the other thread later uses this data directly from the L2 cache. Suppose the single-threaded workload needs to bring the same data from memory several times. If the multi-threaded version is carefully designed so that the two threads use the same data at the same time, it gains performance by bringing the data from memory into the cache hierarchy fewer times. Such a design can help applications whose data set is larger than the L2 cache, and can even achieve a speedup of more than 2x.

Another scenario is when one thread generates the data and the other thread consumes it. A couple of variations of this scenario are possible and are further explained in the "Producer Consumer Models," Section 5.3 of the Intel® Core™ Duo Processor Optimization Guide[4]. Briefly, they are the "Delay" approach and the "Symmetric" approach. Figure 9 shows an example of the expected speedup when the producer-consumer model is run on an Intel Core Duo processor vs. a Dual Core Intel® Xeon® processor vs. an Intel® Pentium® 4 processor with Hyper-Threading Technology¹ (W = Write, R = Read, xxK = buffer size).

Not only do these data show the benefit of avoiding the bus/memory latency, they also demonstrate how different multi-processor implementations behave in both the code affinity (functional) decomposition and the data affinity (data) decomposition threading models. If the produced/consumed data set is larger than the L1 data cache but smaller than the L2 cache, data decomposition and functional decomposition yield similar performance (assuming the functional decomposition implementation is well balanced), and that performance is the best that can be achieved for data sharing.



Figure 9: Code vs. data affinity performance on various processors
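The producer-consumer scenario can be sketched with a generic double-buffered hand-off. This is an illustration under stated assumptions and is not necessarily either of the two approaches described in the optimization guide: the names `run_pipeline`, `channel`, and `BUF_ELEMS` are hypothetical, the 64 KB buffer size is picked to be larger than a 32 KB L1 data cache yet far smaller than the L2 cache, and synchronization uses POSIX semaphores.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdlib.h>

#define BUF_ELEMS ((size_t)16 * 1024)  /* assumed: 64 KB of ints per buffer */
#define NUM_BUFS  2                    /* producer fills one, consumer reads the other */

struct channel {
    int buf[NUM_BUFS][BUF_ELEMS];
    sem_t empty;             /* counts buffers free for the producer */
    sem_t full;              /* counts buffers ready for the consumer */
    size_t rounds;           /* number of buffers to pass through */
    long long consumed_sum;  /* written by the consumer when it finishes */
};

static void *producer(void *p)
{
    struct channel *c = p;
    for (size_t r = 0; r < c->rounds; r++) {
        sem_wait(&c->empty);
        int *b = c->buf[r % NUM_BUFS];
        for (size_t i = 0; i < BUF_ELEMS; i++)
            b[i] = (int)r;               /* W: produce the data */
        sem_post(&c->full);
    }
    return NULL;
}

static void *consumer(void *p)
{
    struct channel *c = p;
    long long sum = 0;
    for (size_t r = 0; r < c->rounds; r++) {
        sem_wait(&c->full);
        const int *b = c->buf[r % NUM_BUFS];
        for (size_t i = 0; i < BUF_ELEMS; i++)
            sum += b[i];                 /* R: consume it while it is still cached */
        sem_post(&c->empty);
    }
    c->consumed_sum = sum;
    return NULL;
}

/* Passes `rounds` buffers from producer to consumer; returns the consumer's sum. */
long long run_pipeline(size_t rounds)
{
    struct channel *c = malloc(sizeof *c);
    if (c == NULL)
        abort();
    c->rounds = rounds;
    c->consumed_sum = 0;
    if (sem_init(&c->empty, 0, NUM_BUFS) != 0 || sem_init(&c->full, 0, 0) != 0)
        abort();
    pthread_t prod, cons;
    if (pthread_create(&prod, NULL, producer, c) != 0)
        abort();
    if (pthread_create(&cons, NULL, consumer, c) != 0)
        abort();
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    long long sum = c->consumed_sum;
    sem_destroy(&c->empty);
    sem_destroy(&c->full);
    free(c);
    return sum;
}
```

Because both buffers together occupy 128 KB, the data written by the producer is likely to still be in the shared L2 cache when the consumer reads it, which is the condition the text identifies as giving the best data-sharing performance.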
 

False Sharing Can Reduce Performance

False sharing happens when two or more threads simultaneously access different address ranges on the same cache line. As a result, the cache line resides in the first-level cache of both cores.

False sharing causes a severe performance penalty if one or more of the threads writes to the shared cache line. Each write invalidates the cache line in the first-level cache of the other core. The next time the other core accesses that cache line, it has to transfer it through the bus from the core that wrote it, thereby incurring a major latency penalty.

Below is an example of code that has false sharing when executed by several threads simultaneously.

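#define THREAD_NUM 2    /* assumed: one thread per core */

/* Adjacent counters share one cache line, so every increment
   invalidates that line in the other core's first-level cache. */
int counter[THREAD_NUM];

/* my_tid: index of the calling thread (passed in here as an assumption) */
int inc_counter(int my_tid)
{
    counter[my_tid]++;
    return counter[my_tid];
}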

Table 1 lists the penalties that an application can suffer if false sharing occurs frequently on an Intel Core Duo system. To avoid this unnecessary overhead, the programmer needs to avoid false sharing and, in particular, make sure it does not occur unintentionally in the following cases:

  • Global data variables and static data variables that are placed in the same cache line but are written by different threads.
  • Objects that are allocated dynamically by different threads and accidentally share cache lines.

Table 1: False sharing penalties

Case             | Data location | Latency (cycles, nsec)
-----------------+---------------+-----------------------------------------------
L1 to L1 Cache   | L1 Cache      | 14 core cycles + 5.5 bus cycles
Through L2 Cache | L2 Cache      | 14 core cycles
Through Memory   | Main memory   | 14 core cycles + 5.5 bus cycles + ~40-80 nsec, depending on FSB and DDR frequency
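One common way to remove the false sharing in the counter example is to give each thread's counter its own cache line. The following is a minimal sketch, not the article's code: it assumes a 64-byte cache line, two threads, and a thread index `my_tid` passed in by the caller. The names `padded_counter` and `CACHE_LINE` are hypothetical.

```c
#include <assert.h>

#define THREAD_NUM 2    /* assumed: one thread per core */
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each counter is padded to occupy a full cache line by itself. */
struct padded_counter {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
};

/* C11 _Alignas places element 0 on a cache-line boundary, so no two
   counters can ever fall in the same line. */
static _Alignas(CACHE_LINE) struct padded_counter counter[THREAD_NUM];

/* my_tid: index of the calling thread, 0 .. THREAD_NUM-1 (assumption) */
int inc_counter(int my_tid)
{
    assert(my_tid >= 0 && my_tid < THREAD_NUM);
    counter[my_tid].value++;
    return counter[my_tid].value;
}
```

The padding trades memory for isolation: each counter now costs 64 bytes instead of 4, which is usually acceptable for a small number of per-thread variables. The same idea applies to the dynamically allocated objects mentioned above, for example by rounding allocation sizes up to a multiple of the cache-line size and aligning them accordingly.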

Optimize Bus Access Between the Cores to Maximize the Bus Bandwidth

Be careful when parallelizing code sections that use data sets exceeding the second-level cache and/or bus bandwidth. If only one of the threads is using the second-level cache and/or bus, it can be expected to reach the maximum possible speedup as long as the thread running on the other core does not interfere with its progress. However, if both threads use the second-level cache, performance may degrade if one of the following conditions is true:

  • Their combined data set is greater than the second-level cache size.
  • Their combined bus usage is greater than bus capacity.
  • They both have extensive access to the same set in the second-level cache, and at least one of the threads writes to this cache line.

To avoid these pitfalls, we recommend investigating parallelism schemes in which only one of the threads accesses the second-level cache at a time, or in which the combined use of the second-level cache and the bus stays within their limits. This concept is explained further in Section 5.3.5 of the Intel® Core™ Duo Processor Optimization Guide.
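The first condition in the list above (combined data set larger than the second-level cache) can be checked with simple arithmetic when choosing a per-thread block size. The sketch below is illustrative only: the function names, the 64-byte line size, and the idea of reserving only a percentage of the cache for data are assumptions, and the cache size is a parameter rather than a measured value.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Returns 1 if the two threads' working sets together fit in the L2 cache. */
int combined_fits(size_t bytes_a, size_t bytes_b, size_t l2_bytes)
{
    return bytes_a <= l2_bytes && bytes_b <= l2_bytes - bytes_a;
}

/* Largest per-thread block, in bytes, such that `threads` blocks together
   use at most `budget_pct` percent of the L2 cache. The result is rounded
   down to a whole number of cache lines. */
size_t block_bytes_per_thread(size_t l2_bytes, unsigned threads,
                              unsigned budget_pct)
{
    assert(threads > 0 && budget_pct <= 100);
    size_t budget = l2_bytes * budget_pct / 100;
    size_t per_thread = budget / threads;
    return per_thread - per_thread % CACHE_LINE;
}
```

For example, with a 2 MB second-level cache, two threads, and a 50 percent budget, `block_bytes_per_thread(2 * 1024 * 1024, 2, 50)` yields 512 KB per thread. A budget below 100 percent leaves room for code, stacks, and other data that also occupy the cache; the right figure depends on the application and would need to be tuned by measurement.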

¹ Hyper-Threading Technology requires a computer system with an Intel Pentium® 4 processor supporting HT Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/products/ht/hyperthreading_more.htm for additional information.

 

