|
The Intel® Core™ Duo processor is a new member of the Pentium® M processor family. Before discussing how CMP is implemented, let
us describe the implementation of current processors in the Pentium M family.
Background: The Structure of the Pentium® M Processor
All the Intel® processors in the mobility family that preceded the Intel Core Duo processor were uniprocessors and therefore
efficiently supported only single-threaded (ST) applications. They all had the same basic structure, presented in Figure 2.

Figure 2: Structure of the memory cluster in the Intel Pentium M processor
Here, all the accesses to the L2 cache, as well as the accesses to the main memory and IO space, were under the supervision of
a single control unit, shown in Figure 2 as Memory/L2 access control units (also called super-queue). Using this structure,
cacheable requests from the core first looked for the data in the L2 cache and only if not found there (L2 miss), were they
forwarded to the main memory via the front side bus (FSB). Uncacheable accesses could be directly sent to the main memory.
The Memory/L2 access control unit also served as a central point for maintaining coherency within the core and with the
external world. Pentium M processors support the MESI [3] coherence protocol that marks each cache line as Modified,
Exclusive, Shared, or Invalid.
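The request path through the Memory/L2 access control unit described above can be sketched as follows. This is only an illustrative model, not the hardware logic: the names `Source` and `route_request` and the set `l2_lines` are hypothetical, and filling the line into the L2 after a miss is an assumption that the text does not state explicitly.

```python
from enum import Enum


class Source(Enum):
    L2 = "L2 cache"
    MEMORY = "main memory via the FSB"


def route_request(address, cacheable, l2_lines):
    """Return where a core request is served from in the single-core structure.

    l2_lines is a hypothetical set of line addresses currently held in the L2.
    """
    if not cacheable:
        # Uncacheable accesses go directly to main memory.
        return Source.MEMORY
    if address in l2_lines:
        # L2 hit: no bus traffic is needed.
        return Source.L2
    # L2 miss: the request is forwarded to main memory over the FSB.
    # Assumption: the returned line is then installed in the L2.
    l2_lines.add(address)
    return Source.MEMORY
```

For example, two consecutive cacheable requests to the same address are served first from memory and then from the L2, while an uncacheable request never changes the L2 contents.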
In a nutshell, the MESI protocol attaches to each cache line a state that can be M (modified), E (exclusive), S (shared), or I (invalid).
A line that is fetched receives the E or S state, depending on whether it also exists in other processors in the system. A
cache line gets the M state when a processor writes to it; if the line is not in the E or M state prior to the write, the
cache sends a Read-For-Ownership (RFO) request that ensures that the line exists in the L1 cache and is in the I state in
all other processors on the bus (if any).
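The state transitions just described can be illustrated with a toy model that tracks a single cache line across several caches. This is a minimal sketch under stated assumptions, not the Pentium M implementation: the `Bus` class and its methods are hypothetical, and the rule that existing holders drop to S when another cache reads the line is standard MESI behavior that the text above does not spell out.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Bus:
    """Toy model of one cache line as seen by several caches on a bus."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches
        self.rfo_count = 0  # number of Read-For-Ownership requests issued

    def read(self, cpu):
        if self.states[cpu] is not State.I:
            return  # hit: the state does not change
        others = [i for i, s in enumerate(self.states)
                  if i != cpu and s is not State.I]
        if others:
            # Assumption: current holders downgrade to S on a remote read.
            for i in others:
                self.states[i] = State.S
            self.states[cpu] = State.S
        else:
            # No other cache holds the line, so it is fetched as Exclusive.
            self.states[cpu] = State.E

    def write(self, cpu):
        if self.states[cpu] not in (State.E, State.M):
            # RFO: invalidate every other copy before the write proceeds.
            self.rfo_count += 1
            for i in range(len(self.states)):
                if i != cpu:
                    self.states[i] = State.I
        self.states[cpu] = State.M
```

In this model a write to a line held in the E or M state proceeds silently, whereas a write to a line in the S or I state first issues one RFO.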
The Memory/L2 access control unit maintains the coherency of each cache level independently. It contains a snoop
control unit that receives snoop requests from the bus and performs the required operations on each cache (and on the
internal buffers) in parallel. It also handles RFO requests and lets the operation continue only after it guarantees that no
other version of the cache line exists in any other cache in the system.
The CMP Implementation
At the early stages of the project, we considered three alternatives for the CMP implementation, built on the two structural
alternatives shown in Figure 3. The first option (Figure 3a) was to put two single-core Pentium M processors side by side, split
the L2 cache between them, and communicate between the cores via the FSB or another fast interconnect.
The other two options both called for a shared L2 (Figure 3b) and differed in the implementation of the coherence protocol: one
kept the same basic MESI table as in a single core, "adjusting it" to the new structures, while the
other used a simple version of a directory-based protocol to improve the performance of the proposed
structure.
The "simple" shared L2 implementation takes advantage of the fact that the latency of an access to
the L2 cache is significantly longer than the L1 access latency. This difference in latency enables us to check and update the
status of the cache line in the first-level caches in parallel with the L2 access. As a result, this option increases the active
power consumption (with respect to a single core) because of the extra snoop activity, but keeps the static power (leakage) the
same as in the single core, since no additional tables are used.

Figure 3: Implementation alternatives
The directory-based solution calls for extending the MESI information, as part of the L2 structure, and keeping information
regarding the ownership of L2 cache lines. Here we assume that snoops are sent to the other core by the L2 controller, and
only when needed. Thus, when a core accesses the line in the L2 cache, the cache controller knows if the line is shared
with the other cache, and based on this information the cache control unit can optimize the number of snoops sent to the
other L1. This technique reduces the active power due to reduced snoop activity, but increases the design complexity and
the static power due to larger tag arrays.
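The difference in snoop traffic between the two shared-L2 options can be illustrated with a small counting model for a two-core chip. This is a rough sketch under simplifying assumptions, not a description of either design: the function `count_snoops` and the `presence` table are hypothetical, L1 evictions and invalidations are ignored, and the simple scheme is modeled as snooping the other L1 on every L2 access.

```python
def count_snoops(accesses, use_directory):
    """Count snoops sent to the other core's L1 on a two-core chip.

    accesses: list of (core_id, line_address) pairs, with core_id 0 or 1.
    use_directory: if True, snoop only when the directory says the other
    core may hold the line; if False, snoop on every L2 access.
    """
    # Hypothetical directory: line address -> set of cores that may hold it.
    # It is only consulted when use_directory is True.
    presence = {}
    snoops = 0
    for core, line in accesses:
        other = 1 - core
        holders = presence.setdefault(line, set())
        if not use_directory or other in holders:
            snoops += 1
        holders.add(core)
    return snoops
```

For an access stream in which only one line is touched by both cores, the directory variant sends snoops only for that line, which is the active-power saving described above; the `presence` table stands in for the larger tag arrays that raise static power.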
Using the three criteria described in the introduction, we analyzed performance and power and first eliminated the split-cache
option (3a). The reason for this was that it would reduce the performance of ST applications, since it provides only half
of the cache size for each core. We also observed that the use of a split L2 cache could cause performance degradation when
running multi-threaded (MT) applications with shared data, preventing effective data sharing between the threads and
requiring long latencies when moving data from one core to another. On top of that, it may reduce the performance of MT and
parallel applications, since it cannot dynamically partition the L2 cache between the cores.
Deciding between the two implementations of the shared L2 cache was a tough task. The performance of the two options was very
close, so we had to make our decision based on power efficiency. We decided to implement the simple solution rather than the
directory-based architecture, both because of the latter's complexity and because battery life depends mainly on static
power consumption and less on dynamic power, and the directory's larger tag arrays increase static power.
The general structure of the Intel Core Duo CMP implementation is given in Figure 3b. Comparing it with Figure 2 shows a few
structural changes: (1) the core and first-level cache structure is duplicated; (2) the traditional memory and L2 control
unit (super-queue) is partitioned into two logical units: the L2 controller, which handles all the requests to the L2 from
the cores and from the external bus (snoop requests), and a bus control unit, which handles all the data and IO requests to and
from the external bus; (3) in order to balance the requests to the L2 and memory, we added a new logical unit (represented
by the hexagon) that aims to guarantee fairness between the requests coming from the different cores; and (4) we extended
the prefetching unit to handle hardware prefetching separately for each core.
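The text does not say which policy the fairness unit in item (3) uses. Purely as an illustration of what guaranteeing fairness between cores can mean, the sketch below assumes a round-robin policy; the function `arbitrate` and its per-core queues are hypothetical and are not taken from the Intel Core Duo design.

```python
from collections import deque


def arbitrate(queues):
    """Grant requests from per-core queues in round-robin order.

    queues: one list of pending requests per core.
    Returns the grant order as a list of (core_id, request) pairs.
    Round-robin is an assumed policy, chosen only for illustration.
    """
    pending = [deque(q) for q in queues]
    order = []
    turn = 0
    while any(pending):  # an empty deque is falsy
        if pending[turn]:
            order.append((turn, pending[turn].popleft()))
        # Move to the next core whether or not this one had a request.
        turn = (turn + 1) % len(pending)
    return order
```

With this policy a core that issues a long burst of requests cannot starve the other: a request from the second core is granted after at most one request from the first.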
The new structure of the shared area allows us to enhance performance while reducing power consumption. Partitioning the
super-queue allows us to optimize each part separately for power and performance. The L2 control unit was
designed to be relatively small, simple, and fast in order to reduce the latency to the L2 cache without increasing the
power consumption. The bus control unit was designed to be larger and more complicated, but since its timing was found to be
more relaxed, we could design it to have less leakage and even reduce its active power.
The power and performance results were measured on Intel Core Duo silicon and justified the CMP architecture we chose. We
discuss this later in the paper.
|