In this section, we present measurements taken from an Intel® Core™ Duo processor-based system to
examine its power and performance efficiency. We focus mainly on the relation between the use of MT applications and
the utilization of the Intel Core Duo system.
Distribution of P-states
The OS along with the hardware optimizes the platform power by decreasing the processor speed when there is no demand
and increasing the frequency as demand increases. Figure 11 shows a typical snapshot of the system power consumption during
some period of time. As can be seen, the OS sets both processors to a single power state (P-state), but the total
power consumption depends on the number of tasks run in parallel (if one task is run, we assume it is run in an ST
environment) and the computational time of these tasks. We can clearly observe the phases of execution: the length of each
phase reflects the total time it takes while the area below the curve reflects the total energy for this phase. We use this
energy calculation methodology in the following sections to compare ST and MT applications.

Figure 11: Energy calculation methodology
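The area-under-the-curve calculation described above can be sketched as a trapezoidal integration over sampled power readings. This is a minimal illustration, not the measurement tooling used for the study; the function name `energy_joules` and the sample trace are hypothetical.

```python
def energy_joules(times_s, power_w):
    """Approximate the area under a sampled power curve (trapezoidal rule).

    times_s: sample timestamps in seconds, in increasing order
    power_w: power readings in watts, one per timestamp
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Average of the two neighboring readings times the interval length.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical trace (not measured data): an active phase followed by idle.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [10.0, 10.0, 10.0, 1.0, 1.0]
print(energy_joules(times, power))
```

The length of the trace corresponds to the phase duration, and the returned value corresponds to the energy of that phase.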
Power consumption of different power states
In order to evaluate the benefit of using MT applications, we measure the power consumption of the system while in
different working points when running an ST application vs. an MT application. Figure 12 shows the experimental data with
both ST and MT code under different P-states on an Intel Core Duo system. As can be seen, the power consumed by MT code
is less than twice the power consumed by ST code in the corresponding P-state. This implies that applications that are
MT and demonstrate excellent scaling will not only reduce the total execution time, but will also consume less total
energy for the same task.

Figure 12: P/C-state average power
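The reasoning behind this conclusion can be written out as a short derivation. It assumes ideal scaling on two cores, so that the MT run time is half of the ST run time T, and it ignores idle power after the task completes. The symbols P_ST, P_MT, E_ST, and E_MT are introduced here for illustration only.

```latex
E_{ST} = P_{ST}\,T, \qquad
E_{MT} = P_{MT}\,\frac{T}{2} \;<\; 2P_{ST}\,\frac{T}{2} \;=\; E_{ST}
\quad \text{whenever } P_{MT} < 2P_{ST}.
```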
Impact of clock rate on power consumption
The OS uses a periodic clock interrupt to keep track of time, trigger timer objects, and schedule application threads.
While the default interrupt rate is set by the OS, applications can raise the interrupt rate by requesting a shorter
interrupt period (as short as 1 ms).
While some multimedia applications with high-definition content may require a high interrupt rate for timely
scheduling of timer-based threads, others with enough processing headroom may not require it. In either case,
increasing the clock interrupt rate can have a significant impact on overall power (and therefore on battery life).
Figure 13 shows the processor power impact of a high interrupt rate on an Intel Core Duo system. Average power
increases from 0.5 W to 1.05 W when the interrupt period is reduced to 1 ms.

Figure 13: High interrupt rate impact
As can be seen, the "baseline" of the 1 ms curve is higher than that of the normal interrupt rate curve. The reason is that
every interrupt forces the CPU to return from C3 to C0 (the active state), and when CPU utilization is high enough, the
clock service routine also executes in a high P-state.
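To give a sense of how often the CPU is pulled out of C3, the sketch below converts an interrupt period into wakeups per second. The default period used for comparison is an assumption (the actual value depends on the OS and hardware), and the names are hypothetical.

```python
def interrupts_per_second(period_ms):
    """Return how many clock interrupts occur per second for a given period."""
    if period_ms <= 0:
        raise ValueError("period_ms must be positive")
    return 1000.0 / period_ms


# Assumed default period for illustration only; it varies by OS and platform.
ASSUMED_DEFAULT_PERIOD_MS = 15.6

default_rate = interrupts_per_second(ASSUMED_DEFAULT_PERIOD_MS)
fast_rate = interrupts_per_second(1.0)
print(f"default: ~{default_rate:.0f}/s, 1 ms period: {fast_rate:.0f}/s")
```

Under this assumption, a 1 ms period produces roughly fifteen times as many C3-to-C0 transitions as the default, which is consistent with the raised baseline in Figure 13.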
Balanced and imbalanced threading environments
This section includes a comparative study of the total energy consumed on the Intel Core Duo platform (CPU plus
platform) for a given task with both MT and ST methodologies. We shall also analyze applications featuring various
threading scenarios, such as "balanced threading" vs. "imbalanced threading."
Applications with a balanced threading model
The real-world applications studied here are CPU-bound workloads that were run in both ST and MT modes. In ST mode,
only one core is fully utilized; in MT mode, both cores are fully utilized and consume equal processing resources.
Adaptive mode here refers to the power-saving scheme in which the OS optimizes overall power consumption by dynamically
changing the CPU frequency on demand using Intel SpeedStep® technology (the GV3 technology).
Figures 14 and 15 below show CPU and platform energy for adaptive mode. For each application run (ST
and MT), power data gathering is normalized to the longest run time. For example, a cryptography workload runs for ~50
seconds in ST mode and ~25 seconds in MT mode, so the power data is measured for 50 seconds in both ST and MT modes. The
intent is to identify whether total energy can be saved by finishing the task faster (as in the MT case) and entering
deep sleep states after task completion.
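The normalization can be sketched as follows: energy is accumulated over a fixed window equal to the longest run time, with the remainder of the window spent idle. The run times come from the cryptography example above; the power values are hypothetical placeholders, not measurements, and the function name is illustrative.

```python
def window_energy_j(active_power_w, run_time_s, idle_power_w, window_s):
    """Energy over a fixed measurement window: active phase plus idle remainder."""
    if run_time_s > window_s:
        raise ValueError("run_time_s must not exceed window_s")
    idle_time_s = window_s - run_time_s
    return active_power_w * run_time_s + idle_power_w * idle_time_s


WINDOW_S = 50.0  # longest run time (the ST run in the example above)

# Hypothetical power values, chosen only to illustrate the calculation.
st_energy = window_energy_j(active_power_w=20.0, run_time_s=50.0,
                            idle_power_w=2.0, window_s=WINDOW_S)
mt_energy = window_energy_j(active_power_w=30.0, run_time_s=25.0,
                            idle_power_w=2.0, window_s=WINDOW_S)
saving = 1.0 - mt_energy / st_energy
print(f"ST: {st_energy} J, MT: {mt_energy} J, saving: {saving:.0%}")
```

With these placeholder values, the MT run draws more power while active but finishes sooner, so the idle remainder of the window lowers its total energy.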

Figure 14: Balanced threading–CPU power
As indicated in Figure 14, the energy for a given task is reduced by completing the task sooner with efficient MT and
letting the processor enter idle/sleep state. The MT applications above show linear performance scaling with the number of
cores and demonstrate a ~9%-27% savings in CPU power.
The CPU power is a major component of the overall power, and the graph in Figure 15 indicates about an 8%-17%
savings in the platform power due to MT.

Figure 15: Balanced threading–platform power
Applications with an imbalanced threading model
In this section, we examine power/performance implications on an application with imbalanced threading where the load
on the threads is asymmetrical. The analysis here demonstrates how the imbalance in the threads may cause certain OSs to
incorrectly choose a sub-optimal P-state.
We used an in-house gaming prototype, composed primarily of a physics computation (collision detection and
resolution of graphics objects) and a rendering engine, for the analysis. A balanced threading model was achieved by
dividing the graphical objects between the two cores, with each thread handling collision detection and resolution for the
objects it owned. For the imbalanced case, one thread performed collision detection and resolution
for the colliding objects while the other thread calculated the updated positions. The imbalance arose
because the first thread was more CPU intensive than the second.
The MaxPerf mode refers to the power scheme where the processor is always running at the highest clock speed. The
following shows the total energy under different modes with a normalization similar to the one used before. The first data
set in Figure 16 represents energy data with a MaxPerf power scheme. The second data set indicates energy consumption with
an Adaptive scheme. We focus on these two data sets for now.

Figure 16: Imbalanced threading–platform power

Figure 17: Imbalanced threading–performance data
As indicated in Figure 17 (circled data), the Imbalanced-MT implementation demonstrates 2x performance degradation
when running in the Adaptive power scheme as compared to MaxPerf.
Since we use a normalization technique for energy measurements, the platform energy consumption increased for the
Imbalanced-MT because the run time is now doubled with the Adaptive-Default scheme (indicated with the circle in
Figure 17).
What caused a 2x performance degradation with the Adaptive power scheme?
The application profile of the Imbalanced-MT version shows that, because the first
thread does more of the work than the second, the first thread keeps migrating between the cores, making the
effective CPU utilization on each core ~50%. This is a natural artifact of the way the Windows* scheduler
works. However, on a system running in "Adaptive" (portable/laptop) power mode, this thread migration causes the Windows
kernel power manager to incorrectly calculate the optimal target performance state for the processor, as the individual
cores may appear less busy than the whole package. As a result, the Windows OS tends to reduce the processor frequency to
save power in Adaptive mode, which degrades performance. The longer run time in turn increases energy
consumption.
To address this issue of incorrectly calculating the optimal frequency, Microsoft provided a hotfix (KB896256) to
change the kernel power manager to track CPU utilization across the entire package, rather than the individual cores, and
hence calculate the optimum frequency for applications.
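The difference between per-core and package-level tracking can be illustrated with a simplified model. This is only a sketch of the idea, not the algorithm used by the Windows kernel power manager; the schedule representation and function names are hypothetical, and only the heavier thread is modeled.

```python
def per_core_utilization(schedule, num_cores=2):
    """Fraction of time slices each core was busy.

    schedule[i] is the index of the core running the thread in slice i,
    or None if no core was busy in that slice.
    """
    if not schedule:
        raise ValueError("schedule must not be empty")
    busy = [0] * num_cores
    for core in schedule:
        if core is not None:
            busy[core] += 1
    return [count / len(schedule) for count in busy]


def package_utilization(schedule):
    """Fraction of time slices in which any core in the package was busy."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    busy_slices = sum(1 for core in schedule if core is not None)
    return busy_slices / len(schedule)


# One CPU-bound thread that migrates between the two cores on every slice.
schedule = [0, 1, 0, 1, 0, 1, 0, 1]
print(per_core_utilization(schedule))  # [0.5, 0.5]
print(package_utilization(schedule))   # 1.0
```

Viewed per core, the processor looks half idle and a lower frequency seems justified; viewed across the package, the workload is fully CPU bound.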
The third data set in Figure 16 (Adaptive–with GV3 Fix) shows data with the kernel hotfix applied. In this case, the
Imbalanced-MT implementation in the Adaptive scheme shows power/performance data similar to that of MaxPerf mode. With
this fix, the processors run at the optimum frequency and produce the expected results.
This study shows that an imbalanced threading model or an under-utilized CPU may degrade performance and
thereby increase power consumption. It is therefore recommended to use a balanced threading model when running MT
applications or to apply the appropriate operating system fixes.
Platform power savings measurements
The following measurement shows the benefit of platform power savings while running low workloads. The workloads tested
in this experiment are MobileMark and DVD playback. The power losses of the VR and of power delivery to the CPU are measured
with PSI enabled and disabled. The results are shown in Figure 18.

Figure 18: Power losses on VR with and without PSI-2
The figure above shows the overall power losses broken into their components. The results show that
activating PSI-2 reduces the VR losses by 14% during DVD playback and by 21% in the MobileMark workload.