Tera-scale Computing
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures
INTRODUCTION
Multi-core Architectures (MCAs) provide applications with an opportunity to achieve much higher performance than uniprocessor systems. Furthermore, the number of cores on MCAs is likely to continue growing, increasing the performance potential of MCAs. However, realizing this performance potential in an application requires the application to expose a significant amount of thread-level parallelism.
A common approach to exploiting thread-level parallelism is to decompose each parallel section into a set of tasks. At runtime, an underlying library or run-time environment distributes (schedules) these tasks to the software threads [2, 3, 4]. To achieve maximum performance, especially in systems with many cores, it is desirable to create many more tasks than cores and to dynamically schedule the tasks. This allows for much better load balancing across the cores.
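As an illustration of this approach, the following minimal sketch distributes a list of tasks to a fixed set of worker threads through a shared queue. It is not the scheduler of any library cited here; the function name `run_tasks_dynamically` and its parameters are hypothetical, and it assumes that all tasks are known up front and that tasks do not create further tasks.

```python
import queue
import threading


def run_tasks_dynamically(tasks, num_workers):
    """Run each callable in `tasks` on one of `num_workers` threads.

    Workers pull the next task only when they become idle, so a worker
    that receives short tasks simply ends up executing more of them.
    """
    task_queue = queue.SimpleQueue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))

    results = [None] * len(tasks)
    executed_by = [None] * len(tasks)

    def worker(worker_id):
        while True:
            try:
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return  # all tasks were enqueued before start, so empty means done
            results[index] = task()
            executed_by[index] = worker_id

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, executed_by


# Many more tasks than workers, with deliberately uneven task sizes.
tasks = [lambda n=n: sum(range(n * 1000)) for n in range(32)]
results, executed_by = run_tasks_dynamically(tasks, num_workers=4)
```

Every queue operation in this sketch is work that the task itself does not need, which is the scheduling overhead at issue when tasks become very small.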
We examine a set of benchmarks from an important emerging application domain: Recognition, Mining, and Synthesis (RMS) [1]. Many RMS applications have very high compute demands and can therefore benefit from a large amount of acceleration. Further, they often have abundant thread-level parallelism. Thus, they are excellent targets for running on MCAs with many cores.
For previously studied applications and architectures, the overhead of software dynamic task schedulers is small compared to the size of the tasks and therefore does not prevent sufficient scalability. However, we find that a significant number of RMS applications are dominated by parallel sections with small tasks. These tasks can complete execution in as few as 50 processor clock cycles. For such tasks, the overhead of software dynamic task scheduling is large enough to limit parallel speedups.
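To make the effect of task size concrete, here is a simple back-of-the-envelope model. It is an assumption for illustration, not a model from this article: each task costs `task_cycles` of useful work plus `overhead_cycles` of scheduling work on the core that runs it, and `serialized_cycles` of that overhead is spent inside a critical section (for example, a lock on a shared queue) that only one core can hold at a time. Only the 50-cycle task size comes from the text; the overhead values are hypothetical.

```python
def parallel_speedup_bound(task_cycles, overhead_cycles, serialized_cycles, cores):
    """Upper bound on speedup over a serial run that pays no scheduling overhead.

    Two limits apply in this toy model:
      * each core spends task_cycles + overhead_cycles per task, so at best
        cores * task_cycles / (task_cycles + overhead_cycles);
      * the serialized part admits one task every serialized_cycles, so at
        best task_cycles / serialized_cycles.
    """
    per_core_bound = cores * task_cycles / (task_cycles + overhead_cycles)
    if serialized_cycles > 0:
        queue_bound = task_cycles / serialized_cycles
    else:
        queue_bound = float("inf")
    return min(per_core_bound, queue_bound)


# Hypothetical overheads: 100 cycles per task, 20 of them serialized.
small = parallel_speedup_bound(task_cycles=50, overhead_cycles=100,
                               serialized_cycles=20, cores=64)
large = parallel_speedup_bound(task_cycles=5000, overhead_cycles=100,
                               serialized_cycles=20, cores=64)
```

Under these assumed numbers, the 50-cycle tasks are capped at a speedup of 2.5 on 64 cores, while the 5000-cycle tasks can approach the core count. The same scheduler is harmless for coarse tasks and limiting for fine ones.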
We therefore propose a hardware technique to accelerate dynamic task scheduling on scalable MCAs. It consists of two components: (1) a set of hardware queues that cache tasks and implement task scheduling policies, and (2) per-core task prefetchers that hide the latency of accessing these hardware queues. This hardware is relatively simple and scalable, and it delivers performance close to optimal.
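The role of the prefetcher can be sketched with a toy timing model. This is an illustrative assumption and not the design evaluated in this article: the function `core_cycles` is hypothetical, each access to a hardware queue is assumed to take a fixed `queue_latency`, and the prefetcher is assumed to hold one task, requesting the next task at the moment the current one starts executing.

```python
def core_cycles(task_cycles, queue_latency, prefetch):
    """Total cycles one core needs to fetch and run its tasks (toy model).

    Without prefetching, every task pays the full queue latency before it
    can start. With a one-entry prefetcher, the request for the next task
    overlaps with the current task, so only the part of the latency that
    exceeds the current task's run time is exposed as a stall.
    """
    total = 0
    for i, work in enumerate(task_cycles):
        if i == 0 or not prefetch:
            stall = queue_latency
        else:
            stall = max(0, queue_latency - task_cycles[i - 1])
        total += stall + work
    return total


tasks = [50] * 100  # one hundred 50-cycle tasks
without_prefetch = core_cycles(tasks, queue_latency=30, prefetch=False)
with_prefetch = core_cycles(tasks, queue_latency=30, prefetch=True)
```

In this model, a queue latency shorter than a task is hidden entirely after the first fetch, which is one way to see why a scheme with prefetching could tolerate slower access to the queues.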
We compare our hardware proposal to highly tuned software task schedulers, and also to an idealized hardware implementation of a dynamic task scheduler (i.e., one whose operations are instantaneous). On a set of RMS benchmarks with small tasks, our proposal provides large performance benefits over the software schedulers and gives performance very similar to the idealized implementation.

Figure 1: Impact of multiprogramming
Our contributions are as follows:
- We make the case for efficient support for fine-grained parallelism on MCAs. Parallel tasks can be as fine as 50 processor clock cycles.
- We propose a hardware scheme that provides architectural support for fine-grained parallelism. Our proposed solution has low hardware complexity and is fairly insensitive to the access latency of the hardware queues.
- We demonstrate that the proposed architectural support has significant performance benefits. First, it delivers much better performance than optimized software implementations: 88% and 98% faster on average for 64 cores on a set of loop-parallel and task-parallel RMS benchmarks, respectively. In addition, it delivers performance close to that of an idealized hardware implementation of a dynamic task scheduler, i.e., one whose operations are instantaneous (a difference of about 3% on average).
