|
This section describes the measurements we took on an Intel® Core™ Duo processor-based system. We start with basic
measurements and then discuss the impact of programming models and optimizations on the overall power and performance of
the system.
Basic Measurements
Two of the basic requirements we set for the system were (1) to maintain or improve the performance of single-threaded (ST)
applications at the same frequency and L2 cache size, and (2) to take full advantage of parallel execution when parallelism is
available.
Figure 4 compares the performance of the Pentium® M and Intel Core Duo processors on the same platform, running at the
same frequency and executing all the programs in the SpecINT benchmark suite. As can be seen, on average, the
performance of the Intel Core Duo and the Pentium M is the same.
Figure 5 makes the same comparison for all the programs in the SpecFP benchmark suite. Here, some of the programs show a
significant improvement over the Pentium M execution on the same platform. The main reasons are the compiler's use of the new
SSE3 instructions and a few other performance improvements described in [2].

Figure 4: Single-threaded performance, SpecINT
Core Duo vs. Pentium M (same cache, same platform)

Figure 5: Single-threaded performance, SpecFP
Core Duo vs. Pentium M (same cache, same platform)
Having met the first goal of keeping the performance of an ST application at the same level as (or better than) on a
Pentium M processor, we turn to the second. Figure 6 shows the speedups that various multi-threaded (MT) applications achieve.

Figure 6: MT speedups
The speedups presented here range from 1.2 to 2, where 2 is the theoretical maximum that two cores can achieve. A
closer look at the applications that show relatively low scalability reveals that the main reason is a lack of
parallelism within the application. A few applications, such as SpecFP rate, suffer from high utilization of the bus. In
these cases, repeating the experiment with a faster bus yields better scalability.
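One way to relate a measured speedup to the amount of parallelism in an application is Amdahl's law. The source does not use this model, so the sketch below is only an illustrative assumption: it treats a fixed fraction of the run time as perfectly parallel and ignores bus contention and other shared-resource effects. The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the run time scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def parallel_fraction_for(speedup: float, cores: int) -> float:
    """Invert Amdahl's law: the parallel fraction implied by a measured speedup."""
    if cores < 2:
        raise ValueError("cores must be at least 2 to infer a parallel fraction")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * cores / (cores - 1)


if __name__ == "__main__":
    # Under this simplified model, a speedup of 1.2 on two cores implies
    # that roughly one third of the run time is parallel.
    print(round(parallel_fraction_for(1.2, 2), 3))
    print(amdahl_speedup(1.0, 2))
```

Under this model, the low end of the reported range (1.2) corresponds to an application in which about a third of the run time is parallel, while a speedup of 2 requires the whole run to be parallel.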
Threading Models
When multi-threading an application, the choice of a threading model plays a key role in achieving maximum performance scaling.
In this section, we discuss the effect of "data domain decomposition" and "functional domain
decomposition" on the performance of an application.
Data domain decomposition usually results in a balanced threading model and is likely to scale better when the application
runs on platforms with a higher number of processors. Functional domain decomposition is
susceptible to imbalanced threads because each thread has its own performance characteristics, and hence load-balancing issues
need to be considered. A functionally decomposed model is also likely to limit scalability as the number of processors grows.
One very important consideration with imbalanced threading behavior in applications is the operating system (OS) scheduling
of threads on a CMP system (we illustrate this with an example in the sections below).
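The two decomposition styles can be sketched as follows. This is a structural illustration only, not the code of the applications measured here: `stage_a` and `stage_b` are hypothetical stand-ins for two processing steps, and because of CPython's global interpreter lock these threads will not actually speed up CPU-bound work.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def stage_a(x: int) -> int:
    """Hypothetical first processing step."""
    return x * 2


def stage_b(x: int) -> int:
    """Hypothetical second processing step."""
    return x + 1


def run_data_decomposition(items: list, workers: int = 2) -> list:
    """Data domain decomposition: every worker runs both steps on its own slice."""
    size = max(1, (len(items) + workers - 1) // workers)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]

    def process(chunk: list) -> list:
        return [stage_b(stage_a(x)) for x in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(process, chunks)  # map preserves chunk order
    return [y for part in parts for y in part]


def run_functional_decomposition(items: list) -> list:
    """Functional domain decomposition: one thread per step, joined by a queue."""
    handoff: queue.Queue = queue.Queue()
    results = []
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        for x in items:
            handoff.put(stage_a(x))
        handoff.put(done)

    def consumer() -> None:
        while True:
            value = handoff.get()
            if value is done:
                break
            results.append(stage_b(value))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In the data-decomposed version the `workers` parameter can grow with the processor count and each worker gets a similar share of the items. In the functional version the thread count is fixed by the number of steps, and the two threads are balanced only if `stage_a` and `stage_b` happen to cost about the same.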
Applications With Balanced Threading Models
Applications studied here are CPU-intensive, consuming 95-100% of the CPU with the threads performing equal work and consuming
equal processing resources. Here, we discuss the performance of these applications when they run in ST and MT modes. The
performance data are measured in seconds.
The graph in Figure 7 shows performance data for the ST and MT versions of the applications. The Cryptography and Video
Encoding applications have two MT implementations, and hence their results are labeled MT1 and MT2. MT1 is implemented
using a data domain decomposition methodology, and MT2 is implemented using functional domain decomposition.

Figure 7: Balanced threading performance
As indicated in Figure 7, the MT applications demonstrate a significant performance improvement over the ST applications. For
example, both the MT1 and MT2 versions of the Cryptography workload demonstrate a 2x performance improvement compared to the
ST version.
Applications With Imbalanced Threading Models
In this section, we examine the performance implications of an imbalanced threading model for an application. For this study, a
sample game physics engine was created (using Microsoft DirectX*). The sample application has two parts: 1) Physics
Computation (collision detection and resolution for graphics objects), and 2) Rendering (updated positions are drawn onto
the screen). The application was deliberately designed so that both balanced and imbalanced threading could be studied on a
CMP processor:
- Balanced: For this implementation, graphical objects (and background imagery) were divided into two parts and each thread
took care of the collision detection and resolution of its own set of objects.
- Imbalanced: In this implementation, one thread was tasked with performing collision detection and resolution for the
colliding objects while the other thread calculated the updated positions. As intended, the first
thread was more CPU intensive than the second thread.
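The effect of the work split on achievable speedup can be estimated with a simple bound: if each thread runs on its own core, the run finishes when the busiest thread finishes. The sketch below assumes this idealized model, and the 80/20 split is a hypothetical example, not a measurement from the sample application.

```python
def ideal_speedup(work_per_thread: list) -> float:
    """Upper bound on speedup over a single thread doing all of the work.

    Assumes one core per thread and no synchronization or migration cost,
    so elapsed time is set by the thread with the largest share of work.
    """
    if not work_per_thread:
        raise ValueError("at least one thread is required")
    if min(work_per_thread) < 0 or max(work_per_thread) <= 0:
        raise ValueError("work must be non-negative and not all zero")
    return sum(work_per_thread) / max(work_per_thread)


balanced = ideal_speedup([50.0, 50.0])    # 2.0
imbalanced = ideal_speedup([80.0, 20.0])  # 1.25
```

With an even split the bound reaches the two-core maximum of 2, while a hypothetical 80/20 split caps it at 1.25 regardless of how the threads are scheduled.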
Figure 8 shows performance data for the two implementations under two power schemes, MaxPerf and Adaptive. Adaptive
mode here refers to the power-saving scheme in which the OS optimizes overall power consumption by dynamically changing the CPU
frequency on demand, using Intel SpeedStep® technology (the GV3 technology). MaxPerf mode refers to the power
scheme in which the processor always runs at the highest clock speed.
Let us discuss the first two data sets in Figure 8 for now.

Figure 8: Imbalanced threading performance
The imbalanced MT implementation (Imbalanced-MT) demonstrates a 2x performance degradation (0.6 scaling) when running in the
Adaptive power scheme as compared to MaxPerf (indicated with the circle in Figure 8). In the Imbalanced-MT case, one thread
does far more work than the other, and the busier thread keeps migrating between the cores, so effective CPU utilization on
each core is only ~50%. On systems running in "Adaptive" (portable/laptop) power mode, this thread migration causes the
Windows* kernel power manager to incorrectly calculate the optimal target performance state for the processor. The power
manager then reduces the operating frequency of both cores even when the workload would keep one core fully utilized, which
degrades performance for the Imbalanced-MT case. Note that this issue may also occur when running a single-threaded workload.
To address this issue, Microsoft provided a hot-fix (KB896256) that changes the kernel power manager to track CPU utilization
across the entire package, rather than on the individual cores, and hence to calculate the optimum frequency for applications.
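The difference between per-core and package-wide tracking can be illustrated with a toy model. This is not the Windows kernel's algorithm: the threshold, the two frequencies, and the way package utilization is aggregated (the fraction of time slices in which any core is busy) are all assumptions made for illustration, and every function name is hypothetical.

```python
from typing import Optional, Sequence


def per_core_utilization(schedule: Sequence[Optional[int]], cores: int = 2) -> list:
    """Fraction of time slices each core spent running the busy thread.

    `schedule[i]` is the index of the core that ran the thread in slice i,
    or None if the thread was idle in that slice.
    """
    if not schedule:
        return [0.0] * cores
    return [sum(1 for c in schedule if c == core) / len(schedule) for core in range(cores)]


def package_utilization(schedule: Sequence[Optional[int]]) -> float:
    """Fraction of time slices in which any core in the package was busy (assumed model)."""
    if not schedule:
        return 0.0
    return sum(1 for c in schedule if c is not None) / len(schedule)


def target_mhz(utilization: float, low_mhz: int = 1000, high_mhz: int = 2000,
               threshold: float = 0.8) -> int:
    """Hypothetical two-state policy: run fast only when utilization is high."""
    return high_mhz if utilization >= threshold else low_mhz


# A fully busy thread that migrates to the other core on every slice.
migrating = [0, 1] * 50

core_view = per_core_utilization(migrating)   # each core looks half idle
package_view = package_utilization(migrating) # the package is never idle

per_core_choice = target_mhz(max(core_view))
package_choice = target_mhz(package_view)
```

In this model each core appears only half utilized, so a per-core policy selects the low frequency even though the thread never idles, whereas the package-wide view sees full utilization and selects the high frequency.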
The third data set in Figure 8 shows results with the kernel hot-fix applied. In this case, the Imbalanced-MT implementation
in Adaptive mode shows the expected performance scaling, matching MaxPerf mode. With this fix, both cores run at the optimum
frequency, and there is no degradation in Adaptive (PL) mode.
|