- Home›
- Technology and Research›
- Intel Technology Journal›
- Tera-scale Computing
Tera-scale Computing
High-Performance Physical Simulations on Next-Generation Architecture with Many Cores
PARALLEL SCALABILITY RESULTS
Figure 10 shows the parallel scalability for our applications for up to 64 cores. Since no large-scale CMP is available for us to experiment with, we use cycle-accurate simulation to measure performance and characterize the parallelized workloads. Details of our simulator can be found in [5]. We assume a very high main memory bandwidth so that we do not artificially limit scalability. The speedups are obtained against the one thread version of the parallelized code. On 64 cores, we achieve 30x to 56x speedup for production physics and 36x to 61x speedup for game physics.

Figure 9: Speedup of FSM relative to FMM
click image for larger view

Figure 10: Parallel scalability of production physics and game physics
click image for larger view
Next, we discuss important issues regarding scalability. Amdahl's law determines the theoretical maximum scalability. Load balancing and synchronization overheads impact how close we can be to the theoretical limit.
Serial Sections: Amdahl's law dictates that the parallel scalability is limited by the size of the serial sections. In most of the production and game physics modules, the serial section accounts for much less than 1% of execution time for one core. As a result, it does not significantly impact the parallel scalability in our study of up to 64 cores. However, as the number of cores increases, more aggressive parallelization will be needed to keep serial code from limiting parallel scalability.
Load Imbalance: The load imbalance is a function of the variability of task size as well as the number of tasks. The lower the variability, the fewer tasks are needed to obtain good load balance. Unfortunately, some modules exhibit high variability, which requires many tasks for good load balance, resulting in high parallelization overhead. Therefore, we should make a tradeoff between good load balance and low parallelization overhead. Under certain instances (e.g., Figure 7), we are forced to minimize the number of tasks to keep the amount of replication and redundant computation at an acceptable level. However, this comes at a cost of significant load imbalance that may limit parallel scaling.
Task Queuing: We implement a task queue to distribute parallel tasks across the cores. For some modules, the task queue overhead becomes a bottleneck for the scalability. In our implementation of task queues, all tasks are enqueued before we enter the parallel section. Therefore, if the number of tasks is large and/or the parallel section is small, the enqueue overhead becomes significant. Note that an alternative implementation of task queues might solve the problem, one of which is discussed in [8].
Locking: Grabbing and releasing locks incurs synchronization overhead. However, we observe that the locking overhead does not increase with the number of threads. Since there is little contention on the locks, locking does not significantly limit scalability. Nevertheless, accessing an uncontended lock still incurs parallelization overhead as it is extra work that is not required in a serial code.
In addition to the reasons listed above, parallel scaling is also affected by the memory behavior, which is covered in detail in the next section.
