Tera-scale Computing
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures
A CASE FOR FINE-GRAINED PARALLELISM
Previous work on dynamic load balancing targeted coarse-grained parallelism, i.e., parallel sections with either large tasks, a large number of tasks, or both. The target was primarily scientific applications for which this assumption is valid. For these applications, an optimized software implementation delivers good load balancing with an acceptable performance overhead.
The widespread trend towards more cores on mainstream computers, both in homes and in server farms, motivates efficient support for fine-grained parallelism. Parallel applications for the mainstream differ fundamentally from parallel scientific applications that run on supercomputers and clusters in a number of respects. We discuss these differences in detail in this section.
Architecture
Reduced communication overhead: Multi-core architectures (MCAs) dramatically reduce communication latency and increase bandwidth between cores. This allows parallelization of modules that could not previously be profitably parallelized.
Usage scenarios: These architectures are designed to be used with virtualization technologies as well as multiprogramming. In both these instances, the number of cores assigned to an application can change during the course of its execution. Maximizing the available parallelism under these conditions requires exploiting fine-grained parallelism.
Consider the example shown in Figure 1, which illustrates this using an 8-core MCA. It presents two scenarios in which the parallel section is broken down into 8 and 32 equal-sized tasks (represented by green boxes). In a parallel section, a core that finishes its tasks before the other cores have finished theirs has to wait. This results in wasted compute resources (shown in red). For each of the two scenarios, the figure shows the performance when a varying number of cores is assigned to the parallel section. In both scenarios, with 4 and 8 cores, all the assigned cores are fully utilized. However, when 6 cores are assigned to the application, the first scenario wastes significant compute resources. In fact, it achieves the same speedup as when it was assigned 4 cores. In the second scenario, far fewer compute resources are wasted because the parallel section was broken into finer-grained tasks.
This problem worsens as the number of cores increases. Figure 2 shows the maximum potential speedup on a 64-core MCA for a varying number of tasks. Ideally, the graph would be linear, implying that each additional core delivers additional performance. When only 64 tasks are used, the application sees no performance improvement even when the number of cores assigned to it is increased from 32 to 63, because at least one core must still execute two tasks. To approach the ideal situation, one needs a much larger number of tasks (say 1024).
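The bound behind Figures 1 and 2 can be sketched in a few lines. This is a minimal illustration, assuming equal-sized independent tasks and no scheduling overhead; the function name `max_speedup` is ours, not the article's. The busiest core executes ceil(tasks / cores) tasks, so the speedup cannot exceed tasks divided by that count.

```python
def max_speedup(num_tasks: int, num_cores: int) -> float:
    """Upper bound on speedup for equal-sized, independent tasks.

    Assumes zero scheduling overhead (an idealization for illustration).
    """
    # The busiest core runs ceil(num_tasks / num_cores) tasks back to back.
    rounds = (num_tasks + num_cores - 1) // num_cores
    return num_tasks / rounds


if __name__ == "__main__":
    # The 8-core example: 8 coarse tasks versus 32 finer tasks.
    for tasks in (8, 32):
        for cores in (4, 6, 8):
            print(f"{tasks:4d} tasks, {cores:2d} cores: "
                  f"{max_speedup(tasks, cores):.2f}x")

    # The 64-core example: 64 tasks versus 1024 tasks on 32 and 63 cores.
    for tasks in (64, 1024):
        for cores in (32, 63):
            print(f"{tasks:4d} tasks, {cores:2d} cores: "
                  f"{max_speedup(tasks, cores):.2f}x")
```

With 8 tasks, 6 cores give the same bound as 4 cores, and with 64 tasks, 63 cores give the same bound as 32 cores, matching the plateaus described above.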
Performance portability across platforms: Parallel scientific computing applications are often optimized for a specific supercomputer to achieve the best possible performance. However, for mainstream parallel programs, it is much more important for the application to get good performance on a variety of platforms and configurations. This has a number of implications that require exposing parallelism at a finer granularity.

Figure 2: Theoretical scalability
First, the number of cores varies from platform to platform. For reasons similar to those for virtualization and multiprogramming, finer-granularity tasks are necessary.
Second, MCAs are likely to be asymmetric for a number of reasons, including heterogeneous cores, Hyper-Threaded (HT) cores, Non-Uniform Cache Architecture (NUCA), and Non-Uniform Memory Architecture (NUMA). This means that different threads might progress at different rates. For instance, two threads sharing a core run at a different rate than two threads running on two different cores.
Figure 3 illustrates the impact of asymmetry with an example. Consider an application that breaks its parallel section into tasks that represent equal amounts of work (shown in green). However, asymmetry in architecture results in each task taking a different amount of time to complete. The result is wasted compute cycles (shown in red). This example shows that to ensure good performance in the presence of hardware asymmetry, it is best to expose parallelism at a fine grain.
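The effect of asymmetry can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not the article's method: each core has a hypothetical relative speed, the parallel section is split into equal-work tasks, and an idle core simply takes the next task. With one task per core this is the same as a static assignment. The name `makespan` and the speed values are ours.

```python
import heapq


def makespan(total_work: float, num_tasks: int, core_speeds: list[float]) -> float:
    """Completion time of a parallel section on cores of unequal speed.

    Toy model: tasks carry equal work, and whichever core becomes idle
    first takes the next task. Scheduling overhead is ignored.
    """
    task_work = total_work / num_tasks
    # Min-heap of (time at which the core becomes free, core index).
    free_at = [(0.0, core) for core in range(len(core_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        start, core = heapq.heappop(free_at)
        done = start + task_work / core_speeds[core]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish


if __name__ == "__main__":
    # Hypothetical machine: two full-speed cores and two half-speed cores.
    speeds = [1.0, 1.0, 0.5, 0.5]
    work = 12.0
    ideal = work / sum(speeds)  # perfect balance, no idle time
    for tasks in (4, 12, 48):
        t = makespan(work, tasks, speeds)
        print(f"{tasks:3d} tasks: finish at {t:.2f} (ideal {ideal:.2f})")
```

In this model the coarse split is limited by the slowest core, while the finer splits let faster cores absorb more of the work and move the finish time toward the ideal.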
Workloads
To understand emerging applications for multi-core architectures, we have parallelized and analyzed a set of them (referred to as RMS [1]) from a wide range of areas, including physical simulation for computer games and movies, raytracing, computer vision, financial analytics, and image processing. These applications exhibit diverse characteristics. On the one hand, a number of modules in these applications have coarse-grained parallelism and are insensitive to task-queuing overhead. On the other hand, a significant number of modules have to be parallelized at a fine granularity to achieve reasonable performance scaling.
Recall that Amdahl's law dictates that the parallel scaling of an application is bounded by the serial portion. For instance, if 99% of an application is parallelized, the remaining 1% that is executed serially will limit the maximum scaling to around 39X on 64 threads.
This means that even small modules need to be parallelized to ensure good overall application scaling.
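The Amdahl's law figure quoted above can be checked directly. This is a minimal sketch of the standard formula, speedup = 1 / (s + (1 - s) / n) for serial fraction s and n threads; the function name `amdahl_speedup` is ours.

```python
def amdahl_speedup(parallel_fraction: float, num_threads: int) -> float:
    """Maximum speedup predicted by Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_threads)


if __name__ == "__main__":
    # 99% parallel on 64 threads, as in the text.
    print(f"{amdahl_speedup(0.99, 64):.1f}x")
    # Parallelizing part of the remaining serial 1% raises the bound.
    print(f"{amdahl_speedup(0.999, 64):.1f}x")
```

The second call shows why small modules matter: shrinking the serial portion from 1% to 0.1% lifts the bound on 64 threads considerably.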

Figure 3: Impact of asymmetry in architecture
Ease of Programming
The use of modularity will continue to be very important for mainstream applications for several reasons. Modularity is essential to developing and maintaining complex software. In addition, applications are increasingly composed of software components from multiple vendors. These include middleware as well as libraries optimized for specific platforms.
Modular programs require the writers of individual modules to decide how best to parallelize each module. Consider a simple example where an application is composed of two modules: the main program and an optimized math library. Suppose that parallelizing either the library or the main program is sufficient to exploit all the parallel computing resources on the machine. However, modularity dictates that one module does not make assumptions about another module. For the best performance on a variety of platforms, each module must therefore be parallelized in case the other is not. The net result will be a finer granularity of parallelism in the application.
