|
Power and thermal management of the Intel® Core™ Duo processor raise a new challenge to the ACPI-based
architecture. From a software point of view, there is no difference between running on a dual-processor (DP) system and
running on a dual-core (CMP) architecture. From a hardware point of view, however, even though most of the logic is duplicated,
major parts of the logic, such as the power supply and the L2 cache, are still shared. For example, in a dual-processor
system, when the OS decides to reduce the frequency of a single core, the other core can still run at full speed. In the
Intel Core Duo system, however, lowering the frequency of one core slows down the other core as well.
C-state architecture
Since the OS views the Intel® Core™ Duo processor as two independent execution units, and the platform views the
whole processor as a single entity for all power-management related activities (C2 state and beyond), we chose to
separate the core C-state control mechanism from that of the full CPU and platform C-state control.
This was achieved by making the power and thermal control unit part of the core logic and not part of the chipset as
before. Migration of the power and thermal management flow into the processor allows us to use a hardware coordination
mechanism in which each core can request any C-state it wishes, thus allowing for individual core savings to be
maximized. The CPU C-state is determined and entered based on the lowest common denominator of both cores’ requests (that is, the shallower of the two requested states),
portraying a single CPU entity to the chipset power management hardware and flows. Thus, software can manage each core
independently, while the actual power management adheres to the platform and CPU shared resource restrictions.
As can be seen from Figure 2, the Intel Core Duo processor is partitioned into three domains. Each core, together with its
Level-1 caches and local thermal management logic, operates as its own power management domain. The
shared resources, including the Level-2 cache, bus interface, and interrupt controllers (APICs), form the third power
management domain. All domains share a single power plane and a single core PLL, thus operating at the same frequency
and voltage levels. However, each of the domains has an independent clock distribution (spine). The core spines can be
gated independently, allowing the most basic per core power savings (C1-Halt state). The shared-resource spine is
gated only when both cores are idle and no shared operations (bus transactions, cache accesses) are taking place. If
needed, the shared-resource clock can be kept active even when both cores’ clocks are halted, serving L2 snoops and
APIC message analysis.
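The gating rules above can be summarized as a small truth function. The sketch below only illustrates the stated conditions and is not the actual clock-control logic; the function name `spine_clock_enables` and its boolean inputs are hypothetical.

```python
def spine_clock_enables(core0_idle: bool, core1_idle: bool,
                        shared_ops_pending: bool) -> dict:
    """Return which clock spines keep running (True) or are gated (False).

    Hypothetical model of the rules in the text: a core spine is gated
    as soon as that core is idle; the shared-resource spine is gated only
    when both cores are idle and no shared operations (bus transactions,
    cache accesses, snoops) are in flight.
    """
    both_idle = core0_idle and core1_idle
    return {
        "core0": not core0_idle,
        "core1": not core1_idle,
        "shared": not (both_idle and not shared_ops_pending),
    }
```

With both cores halted and a snoop pending, only the shared spine stays active in this model, which matches the case described in the last sentence above.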
The coordination mechanism serves as a transparent layer between the individually controlled cores and the shared
resource on die and on the platform. It determines the required CPU C-state, based on both cores’ individual requests,
controls the state of the shared resources, such as the shared clock distribution, and also implements the C-state
entry protocol with the chipset, emulating a legacy single-CPU mobile processor. When detecting that both core states
are deeper than C1, the coordination mechanism issues the proper indication to the chipset, triggering the platform
C2-4 entry sequence.
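As a rough illustration of the coordination rule, the sketch below models the CPU-level state as the shallowest state requested by either core. It is a simplified model written for this article, not the hardware implementation; `CState`, `package_c_state`, and `chipset_entry_signaled` are hypothetical names.

```python
from enum import IntEnum


class CState(IntEnum):
    """Idle states ordered from shallowest (C0, running) to deepest (C4)."""
    C0 = 0
    C1 = 1
    C2 = 2
    C3 = 3
    C4 = 4


def package_c_state(core0: CState, core1: CState) -> CState:
    """Assumed rule: the shared CPU state can be no deeper than the
    shallowest state requested by either core."""
    return min(core0, core1)


def chipset_entry_signaled(core0: CState, core1: CState) -> bool:
    """The platform C2-4 entry sequence is triggered only when both
    cores are in a state deeper than C1."""
    return package_c_state(core0, core1) > CState.C1
```

For example, if one core requests C4 while the other is only halted in C1, this model keeps the CPU as a whole in C1 and sends no indication to the chipset, even though the first core still gets its own per-core savings.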
When the platform detects a break event (such as an interrupt request), it negates the proper sideband signals (such as
the STPCLK, SLP, DPSLP and DPRSTP). The coordination logic then analyzes the platform signaling, and initiates the proper
C-state exit sequences for the shared resources, and if needed, the cores.
Thus, the required goal of coordinating across the two cores is achieved in an efficient and transparent manner.
Software operates as if it is managing two independent cores, while the platform and the shared resources are controlled as
one by the coordination logic, reflecting a single platform-level C-state in a backwards-compatible manner.
As part of the per-core power savings, each core’s L1 cache is also flushed when that core enters the C3 or C4 state.
Due to the ratio between L1 and L2 cache size, it is assumed that nearly all of the L1 is included in the L2. Therefore,
the flushing should not incur a high overhead of a write-back into the system memory, and it will not incur a high
warm-up penalty when restarting execution afterwards, since the data already reside nearby in the L2 cache. By
flushing the caches, the cores can be kept asleep even when the L2 cache is accessed heavily by the other core or by a
system device, thus improving power savings even further.
Dynamic cache sizing and Deep C4 state
Now that the processor is able to enter C4, we face the challenge of lowering the C4-state leakage power
even further. Since leakage power scales with the operating voltage, the most efficient way to reduce
static leakage power is to lower the C4 operating point. Unfortunately, lowering the voltage impacts data retention,
and the first cells to be affected are the small-transistor data arrays such as those in the L2 cache. Therefore, the first step
in achieving a lower voltage idle state is to implement a mechanism that can dynamically shut down the L2 cache in
preparation for the Deep C4 state.
Cache sizing
When defining the L2 cache dynamic sizing algorithm, the following considerations need to be addressed:
- L2 is a large array; therefore flushing it will incur some power and potentially C-state latencies, especially
if done all at once or too frequently.
- Many applications will suffer performance degradation if running for a long period of time with little or no L2
cache. However, it has been proven that the short interrupt handling tasks, occurring at periodic timer ticks, do not have
much use for the L2 cache and are not visibly impacted by running with the cache closed.
To accommodate the above restrictions, the dynamic sizing mechanism is implemented as an adaptive algorithm, with
various built-in filtering properties and heuristics.
At the algorithm’s base is an assumption that during long periods of very low utilization and idle residency (mapping
to the C4 state), shutting down the cache will not result in a perceivable performance impact. This condition is detected
by a state machine, described in Figure 3. In order to start shrinking the cache, the Finite State Machine (FSM) checks
that the CPU frequency, controlled by the OS, is below a programmable threshold. It also checks that the CPU as a whole
has not stayed too long in C0, which may indicate that a streaming task is being executed (e.g., DVD playback). This is tracked by a
pre-programmed count-down timer that is reloaded whenever the whole CPU enters one of the C2-4 (package) states. Finally, the FSM
checks the second core’s idle state, requiring it to be in C4 as well, before allowing a shrink operation to begin.

Figure 3: Shrink/Expand heuristics
The FSM freezes the shrink operation if a pending interrupt is detected, or if either core is in the active C0 state.
Cache expansion is requested once the activity indicators show that performance may be required. This is inferred in
one of three ways: the frequency is raised above the programmable threshold, the CPU stays in C0 for a
period exceeding the pre-programmed timer, or one of the cores enters a non-C4 idle state (this is the OS’s way
of signaling that the core was not idle enough).
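The shrink, freeze, and expand conditions can be written down as two predicates. This is a minimal sketch of the heuristics as described in the text, not the actual FSM; the `CpuSnapshot` fields, the function names, and the use of MHz for the threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuSnapshot:
    """Hypothetical view of the inputs the shrink/expand FSM looks at."""
    freq_mhz: int               # OS-controlled CPU frequency
    c0_timer_expired: bool      # CPU stayed in C0 longer than the timer allows
    other_core_in_c4: bool      # the second core is idle in C4
    pending_interrupt: bool     # an interrupt is waiting to be serviced
    any_core_in_c0: bool        # at least one core is actively executing
    entered_non_c4_idle: bool   # a core entered an idle state other than C4


def may_shrink(s: CpuSnapshot, freq_threshold_mhz: int) -> bool:
    """All shrink preconditions hold and no freeze condition is present."""
    return (s.freq_mhz < freq_threshold_mhz
            and not s.c0_timer_expired
            and s.other_core_in_c4
            and not s.pending_interrupt
            and not s.any_core_in_c0)


def must_expand(s: CpuSnapshot, freq_threshold_mhz: int) -> bool:
    """Any one of the three activity indicators requests expansion."""
    return (s.freq_mhz > freq_threshold_mhz
            or s.c0_timer_expired
            or s.entered_non_c4_idle)
```

The text does not say how the FSM treats a frequency exactly equal to the threshold; in this sketch that case neither shrinks nor expands.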
The actual cache flushing flow is performed by the microcode of the last core entering the C4 state. In order to
minimize the power impact and to filter out C4 periods that are too short, the microcode flushes only part of the L2 (1/8 through 1/2 of
the total cache size) during each consecutive C4 entry. The cache is flushed in chunks of lines (between 4 and 256 lines),
with a check for interrupts between chunks. Once a whole way is flushed, it is power-gated with sleep transistors, further
reducing its leakage.
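The chunked, interruptible flush can be pictured with the loop below. It is a toy model rather than the microcode flow: `flush_portion` and its parameters are hypothetical, and polling for an interrupt before the first chunk as well as between chunks is an assumption of this sketch.

```python
from typing import Callable


def flush_portion(lines_to_flush: int, chunk_lines: int,
                  interrupt_pending: Callable[[], bool]) -> int:
    """Flush up to `lines_to_flush` cache lines in chunks of `chunk_lines`.

    The interrupt check runs before every chunk, so a break event stops
    the flush early. Returns the number of lines actually flushed.
    """
    flushed = 0
    while flushed < lines_to_flush:
        if interrupt_pending():
            break  # abandon the rest; a later C4 entry can continue
        flushed += min(chunk_lines, lines_to_flush - flushed)
    return flushed
```

A smaller chunk size shortens the worst-case delay before an interrupt is noticed, at the cost of more frequent polling, which is presumably why the chunk size is configurable over such a wide range.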
Microcode typically expands the cache automatically to a minimum of two ways upon every DeepC4 exit. Once the CPU
enters C4 again, microcode will shrink the cache back down to 0. At the initial expansion it is assumed that the CPU has
just exited from DeepC4. As such, the contents of the L2 valid array may not be reliable. Therefore, as part of the DeepC4 exit flow, the
microcode also clears all of the L2 valid bits, ensuring the cache is indeed perceived as empty by the snoop logic. The
same initialization flow can also be applied to other sensitive arrays should testing detect them as unstable.
DeepC4 (DC4) entry
After the L2 cache has been shrunk to 0 and the CPU enters C4, the CPU voltage may be further reduced. Moreover, since
no data are cached, the cache does not need to be awakened for snoops. Snoop handling is instead performed by the chipset
during the DC4 state: once the chipset detects this state, it starts diverting snoopable traffic directly to memory. During the
DC4 state, the chipset scans the incoming traffic for interrupts and APIC messages, and once they are detected, queues them
separately, while initiating a break sequence for the processor. Once the processor is fully awake, the interrupts are
delivered to the processor and the memory traffic is diverted back to the CPU for snooping.
Handling P-states in a CMP environment
The goal of the ACPI P-state (performance state) control algorithm is to optimize the runtime power consumption without
significantly impacting performance. The algorithm dynamically adjusts the processor frequency such that it is just high
enough to service the SW execution load. Operating point selection is done by the OS power management algorithms (OSPM)
based on the CPU load observed over a window of time. Once the target point is set, the CPU is expected to modify its
operating voltage and frequency to match the OSPM's request.
Figure 4 shows one example of the relationship between different working points (P-state points) in the Intel Core
Duo processor and their relative power consumption. As can be seen, the benefit of going to a lower working point can be
divided into an "exponential" part and a "linear" part. The exponential part represents a range of operating points where
both frequency and voltage can be scaled to meet the new working point, while the linear part represents a range of
operating points where the system has already reached the minimum voltage allowed and so only the frequency can be scaled.
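The two regions can be reproduced with the usual first-order CMOS model, in which dynamic power is proportional to frequency times voltage squared. That model and the operating points below are not taken from the article; they are assumptions chosen only to show the shape of the curve, and `relative_dynamic_power` is a hypothetical helper.

```python
def relative_dynamic_power(freq_ghz: float, volts: float,
                           ref_freq_ghz: float, ref_volts: float) -> float:
    """Dynamic power relative to a reference point, assuming P ~ f * V^2."""
    return (freq_ghz / ref_freq_ghz) * (volts / ref_volts) ** 2


# Made-up operating points (GHz, V); 1.00 V stands in for the minimum voltage.
POINTS = [(2.0, 1.30), (1.5, 1.15), (1.0, 1.00), (0.8, 1.00), (0.6, 1.00)]

for freq, volt in POINTS:
    rel = relative_dynamic_power(freq, volt, 2.0, 1.30)
    print(f"{freq:.1f} GHz @ {volt:.2f} V -> {rel:.2f} of peak dynamic power")
```

Above the minimum voltage, each step down saves power through both the frequency and the squared-voltage term, so the savings are steeper than linear. Once the voltage is pinned at its minimum, only the frequency term changes and power falls in direct proportion to frequency.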

Figure 4: P-state CPU frequencies
In a multi-processor system, each processor can be controlled independently by the OS, and each
processor's state is set accordingly. The Intel Core Duo CMP implementation presents new challenges since some parts of the
system, such as the PLLs and power planes, are shared and so both cores and the shared area need to operate at the same
effective frequency/voltage point (i.e., the same P-state).
Initial SW versions that ran on Hyper-Threading-enabled, Intel NetBurst® microarchitecture-based CPUs used
System Management Mode (SMM) code to coordinate the target P-state between the threads. This solution was designed to
optimize performance, which suited the desktops and servers at which most Hyper-Threading CPUs were targeted. However, it
incurred inherent overhead for every P-state request made by the OSPM, resulting in an average power impact that is
prohibitive for the mobile-focused Intel Core Duo architecture.
Consequently, a HW coordination approach was adopted for the P-state architecture as well. As with the C-state
mechanism, each core's OS Power Management component can request a P-state separately via the standard
IA32_PERF_CONTROL MSR. The HW coordination logic in the shared region tracks these P-state requests from both cores
and determines the required CPU level target operating point based on the current execution state of the CPU:
- No thermal control state: In this case, the coordination logic will prefer performance over power, so as not to
starve the SW threads. As a result, the higher operating point will be selected for the total CPU operating point, causing
the whole CPU to execute the known advanced Intel SpeedStep® technology-based architecture transitions between
operating points.
- Thermal_Controlled_State: In this case, the HW has detected an overheating condition and is trying to drive the
operating point down for cooling. Here, the coordination HW will select the lower of the two cores’ requests, allowing a
quicker cooldown than if the higher of the two requests were used. Initial Intel Core Duo CPUs will use the same operating point
for both cores; however, future implementations may choose to select a different operating point per core, thus taking
advantage of this capability.
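The two selection policies above reduce to a maximum or a minimum over the per-core requests. The following sketch expresses requests as frequencies in MHz purely for readability; the function name and that representation are assumptions, not the hardware interface.

```python
def coordinated_operating_point(core0_request_mhz: int,
                                core1_request_mhz: int,
                                thermally_controlled: bool) -> int:
    """Pick the single CPU-level operating point from two core requests.

    Without thermal control the higher request wins (performance first);
    under thermal control the lower request wins (faster cooldown).
    """
    if thermally_controlled:
        return min(core0_request_mhz, core1_request_mhz)
    return max(core0_request_mhz, core1_request_mhz)
```

One consequence of the first policy is that a core may run faster than its own OSPM instance asked for, which is the visibility problem discussed next.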
Due to the hardware coordination nature of the P-state architecture, OSPM may have a hard time tracking the exact
frequency each core has run at, since the actual runtime frequency may be higher than the OSPM requested, and may vary
without the knowledge of that core's OSPM. This may confuse the OSPM when it tries to use the core's activity factor
(the percentage of time spent in the C0 state) and the expected frequency to determine the actual code load. Intel Core Duo technology augments the
previous frequency visibility mechanism by supplying two 64-bit counters, providing the OSPM information on the actual
execution frequency of each core. ACNT (ActualCount) counts executed (non-idle) clocks at the currently coordinated CPU
frequency. MCNT (MaxCount) counts the maximum number of non-idle clocks that the core could have run during the
same period of time, had it run at the nominal frequency defined for the part. By dividing ACNT by MCNT and scaling by the nominal frequency, the OSPM can
determine the actual frequency each core has run at over a window of time, which helps the OSPM assess the correct
execution load on that core and, in turn, determine the next operating point.
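A short sketch of how an OSPM might turn two samples of the counters into an average frequency follows. The function name, the sampling scheme, and the handling of 64-bit wraparound are assumptions for illustration; only the ACNT/MCNT ratio itself comes from the text.

```python
COUNTER_MASK = (1 << 64) - 1  # both counters are 64 bits wide


def effective_frequency_mhz(acnt_start: int, mcnt_start: int,
                            acnt_end: int, mcnt_end: int,
                            nominal_mhz: float):
    """Average non-idle frequency of one core over a sampling window.

    Deltas are taken modulo 2**64 so that a counter wrap between the two
    samples does not corrupt the result. Returns None if the core never
    left idle during the window (MCNT did not advance).
    """
    acnt_delta = (acnt_end - acnt_start) & COUNTER_MASK
    mcnt_delta = (mcnt_end - mcnt_start) & COUNTER_MASK
    if mcnt_delta == 0:
        return None
    return nominal_mhz * acnt_delta / mcnt_delta
```

Because both counters advance only while the core is not idle, the ratio reflects the frequency the core actually ran at, independent of how much of the window it spent in idle states.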
Thermal control in a CMP environment
The Intel Core Duo system was designed with power efficiency in mind and is aimed at different form factors, some of
which are power and thermally constrained. Thermal management is a fundamental capability of all mobile platforms. Managing
platform thermals enables us to achieve the maximum CPU and platform performance within thermal constraints. Thermal
management features also improve ergonomics with cooler systems and lower fan acoustic noise.
In order to better control the thermal conditions of the system, Intel Core Duo technology presents two new concepts:
the use of digital sensors for high accuracy die temperature measurements and dual-core multiple-level thermal
control.
The general structure of the digital thermometer is described in Figure 5. It supports the legacy Intel®
Centrino® mobile technology thermal sensor signals, PROCHOT and THERMTRIP, and adds a temperature-reading capability
to the fixed-threshold sensor. Control logic built around the thermal sensor performs a periodic scan and generates an
output that represents the current temperature reading. The reading is loaded into an MSR that is accessible to software. In order
to improve the temperature reading, multiple sense points monitor different hot spots on the die and report the maximum
temperature of the die. An independent temperature reading from each core is available, with optional reporting of the
maximum temperature of the entire die, i.e., the higher of the two core temperatures.

Figure 5: Digital thermometer block diagram
Interrupt generation capability is provided in addition to the temperature reading. Two programmable thresholds are
loaded by S/W and a thermal event is generated upon threshold crossing. This thermal event generates an interrupt to a
single core or to both cores simultaneously through the APIC settings.
The digital thermometer is intended to be used as an input to software-based thermal control such as ACPI.
A pair of interrupt thresholds defines the upper and lower temperature limits. An example of digital thermometer
usage is illustrated in Figure 6.

Figure 6: Digital thermometer and ACPI
In the above example, the die temperature is at 60°C and the thresholds are set to 50°C and 65°C. If the temperature
rises above 65°C, a low-to-high interrupt is generated. The control software identifies the new temperature and
initiates action, such as activating fans or initiating some passive cooling policy. The activation thresholds and policies
are defined using BIOS and ACPI. New thresholds are then loaded around the new temperature to track further changes.
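The threshold-crossing and re-arm sequence in this example can be sketched as follows. The function names, the "high-to-low" label for the downward crossing, and the margins used when re-arming are assumptions; in a real system the policy comes from BIOS and ACPI tables.

```python
def threshold_event(temp_c: float, low_c: float, high_c: float):
    """Return the kind of thermal event, if any, for the current reading."""
    if temp_c > high_c:
        return "low-to-high"   # temperature rose above the upper threshold
    if temp_c < low_c:
        return "high-to-low"   # temperature fell below the lower threshold
    return None


def rearm_thresholds(temp_c: float, below_margin_c: float,
                     above_margin_c: float):
    """Place a new (low, high) threshold pair around the new temperature."""
    return (temp_c - below_margin_c, temp_c + above_margin_c)


# Walk through the example: 60 C with thresholds at 50 C and 65 C.
low, high = 50.0, 65.0
assert threshold_event(60.0, low, high) is None
if threshold_event(66.0, low, high) == "low-to-high":
    # Software would start fans or passive cooling here, then re-arm.
    low, high = rearm_thresholds(66.0, below_margin_c=10.0, above_margin_c=5.0)
```

Re-arming around each new reading means software is interrupted only when the temperature moves meaningfully, instead of having to poll the sensor.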
In previous Intel® Pentium® M processor-based systems, a single analog thermal diode was used to measure die
temperature. A thermal diode cannot be located at the hottest spot of the die and therefore some offset was applied to keep
the CPU within specifications. For these systems it was sufficient, since the die had a single hot spot. In the Intel Core
Duo system, there is more than one hot spot, and the hot spots move as a function of the combined workload of both cores. Figure 7
shows the differences between the usage of the traditional analog sensor and the new digital sensors.

Figure 7: Analog vs. digital sensors in Intel Core Duo systems
As we can see, the use of multiple sensing points provides high accuracy and close proximity to the hot spot at any
time. An analog thermal diode is still available on the Intel Core Duo processor. An example of the difference between the
diode and the DTS is presented in Figure 8. The information presented in this graph was taken from simulations and not from
a real system, but was correlated with data from real systems and found to be close enough.

Figure 8: Diode to DTS temperature difference
It can be seen that significant temperature gradients exist on the die. The use of a digital thermometer provides
improved temperature readings, enables higher CPU performance within thermal limitations, and improves reliability.
The Intel Core Duo system also implements hardware-based thermal control. Hardware-based thermal management
is intended to handle abnormal thermal conditions and to protect the die from transient effects. Hardware-based
thermal control ensures that the CPU will always operate within specified conditions. This improves reliability and allows
higher performance with tighter control parameters.
Legacy thermal control features implement two externally visible signals:
- THERMTRIP: a fixed temperature sensor to detect catastrophic thermal conditions and to shut down the system if
thermal runaway occurs.
- PROCHOT: a fixed temperature threshold that provides the DVS with a self-control mechanism that drops
frequency and voltage to a new working point (a more detailed description of this mechanism can be found in [1]).
Intel Core Duo technology implements a new multiple-core thermal control algorithm. Both cores synchronize action
requests and activity. A programmable policy can select DVS on both cores and linear control on each core either
independently or as a locked operation. PROCHOT was also made available as an input, to allow thermal protection of
platform components such as the voltage regulator (VR).
A finer grain overheat detection is performed by monitoring the thermal activity. If an extreme thermal condition
occurs (fan malfunction, operation within a bag, etc.), the processor’s own thermal management may not be sufficient and the
temperature may not drop below the threshold point. The internal control algorithm tracks the control behavior and further
reduces power as needed. Continuous extreme conditions will eventually initiate an "Out Of Spec" thermal interrupt and
register the warning in an internal status bit. The "Out Of Spec warning" signal is intended to initiate a managed system
shutdown before a THERMTRIP shutdown occurs.
Platform power optimization
Intel Core Duo technology implemented power and thermal management features as described above to control the CPU power
and thermals. Other components on the platform also contribute to the total platform power and work closely with the CPU.
The CPU VR losses can get as high as 5W while running high workloads. A 3-phase VR is built to deliver high current,
but it operates at low efficiency when working at low utilization. Turning off phases or switching to asynchronous mode can save
a significant amount of power. Efficiency, as a function of the number of phases, is described in Figure 9.

Figure 9: Efficiency as a function of number of phases
Note that the improved power-efficiency modes cannot handle high currents, so the VR needs feedback on the
current requirements. The Intel Core Duo processor keeps track of the power consumption and gives feedback to the VR using
the PSI-2 and VID signals, as described in Figure 10.

Figure 10: PSI-2 overview
The CPU tracks the activity requirements based on P- and C-states. When the OS initiates a request to go to a
higher P-state, there is sufficient time to signal the VR to change to high current mode before the actual power
consumption increases.
Another optimization at low workloads is done on the load line. The voltage delivery has serial resistance, and a
voltage drop between the VR and the CPU is a function of the current consumption. At lower power consumption, the voltage
drop is lower and the CPU voltage increases, driving higher power. The same mechanism on the CPU detects low power
consumption and drives a lower VID code to the VR. The voltage is raised again ahead of time, before the actual power consumption
increases.
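A small worked example illustrates the load-line effect. The resistance, currents, and voltages below are made-up numbers chosen only for illustration, and both helpers are hypothetical; they model the delivery path as a single series resistance between the VR and the CPU.

```python
def cpu_voltage(vid_volts: float, current_amps: float,
                series_ohms: float) -> float:
    """Voltage seen at the CPU after the drop across the delivery path."""
    return vid_volts - current_amps * series_ohms


def vid_for_target(target_volts: float, current_amps: float,
                   series_ohms: float) -> float:
    """VR setpoint needed so the CPU sees `target_volts` at this current."""
    return target_volts + current_amps * series_ohms


SERIES_OHMS = 0.002  # assumed 2 milliohm delivery path

for amps in (30.0, 5.0):
    at_cpu = cpu_voltage(1.20, amps, SERIES_OHMS)
    needed = vid_for_target(1.14, amps, SERIES_OHMS)
    print(f"{amps:4.1f} A: CPU sees {at_cpu:.3f} V at a fixed 1.20 V setpoint; "
          f"setpoint for 1.14 V at the CPU is {needed:.3f} V")
```

With a fixed setpoint, the CPU sees a higher voltage at light load than it needs; lowering the VID code at low current removes that excess, which is the saving described above.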
Communicating the CPU requirements to the rest of the platform enables additional power savings on the platform,
increasing the battery life.
|