- Home›
- Technology and Research›
- Intel Technology Journal›
- Tera-scale Computing
Tera-scale Computing
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures
ARCHITECTURAL SUPPORT FOR FINE-GRAINED PARALLELISM
Software implementations of task queues incur an overhead (e.g., for enqueues and dequeues). This overhead grows with an increasing number of threads due to increased contention on shared data structures in the software implementation. Thus, if the tasks are small, the overhead can be a significant fraction of application execution time. This limits how fine-grained the tasks can be made and still achieve performance benefits with a large number of cores.
Therefore, we investigate adding hardware for MCAs that accelerates task queues. This hardware operates under the covers (i.e., is not visible to application writers) to accelerate the task queue operations that are key to high performance on many-core architectures. In particular, it provides very fast access to the storage for tasks. This includes performing fast task scheduling (i.e., determining which task a core should execute next). Its task scheduling is based on work stealinga well-known scheduling algorithm.
Using the Proposed Hardware
Applications interface to the proposed hardware via a software library. This allows programmers to use the same intuitive API that they use for software implementations of task queues. Since the software library hides the proposed hardware from applications, only the library needs to directly interact with this hardware. Besides initialization and termination, the only operations that the library needs to perform are task enqueues and dequeues.
In current software task queue implementations, each task is represented as a tuple, a set of associated items, as shown in Figure 5. Typically, the tuple entries will be function pointers, jump labels, pointers to shared data, pointers to task-specific data, and iteration bounds, but they could be anything. An enqueue places a tuple into a software data structure for storage, and a dequeue retrieves a tuple from the data structure.
Our proposed hardware is primarily intended to accelerate enqueue and dequeue operations. Thus, the hardware stores tuples on enqueue operations and delivers tuples on dequeue operations. It does not interpret the contents of the tuples. This provides flexibility to the software library in that the library determines the meaning of each entry in a tuple. This flexibility allows the library writer to optimize for applications with different needs.
Our Proposed Hardware
We consider an MCA chip where the cores are connected to a cache hierarchy by an on-die network. The proposed hardware consists of two separate hardware components: a Local Task Unit (LTU) per core, and a single Global Task Unit (GTU). This is illustrated in Figure 4.

Figure 4: Our proposed hardware in a MCA chip with cores (Ci) and parts of the last-level shared cache ($i)
click image for larger view
Global Task Unit (GTU)
The GTU holds enqueued tasks in a set of hardware queues. There is one hardware queue per logical core in the chip. This allows the use of the distributed task scheduling algorithm. The GTU also includes logic for implementing this algorithm. Since the hardware queues are physically close to each other, the proposed scheme can quickly determine which queues are empty and which are not. This makes stealing tasks much faster than for software implementations of distributed task scheduling. It also allows the hardware to quickly detect when all tasks are complete. This is important so that the main thread can start executing the serial code following the parallel section as quickly as possible.

Figure 5: An example task tuple format
click image for larger view
The GTU is physically centralized on the chip. Communication between the GTU and the cores is via the same on-die interconnect as the cache subsystem. The downside of a physically centralized GTU is that as the number of cores on a chip increases, the average communication latency between a core and the GTU also increases. This latency, if not hidden, could impact performance. Therefore, we address this with task prefetchers at each core, as described below.
The size of the queues in the GTU is bounded. When the queues are full, the hardware generates an exception. The exception handler can move some of the tasks from the hardware queues into memory creating room for future task enqueues. An underflow mechanism is used to move the overflown tasks back into hardware queues at a later point [2].
Multiprogramming is also supported by the hardware by using the same overflow and underflow mechanism to move tasks from hardware into memory, and vice versa, on context switches [2].
Local Task Unit (LTU)
Each core has a small piece of hardware to interface with the GTU, called the LTU. In addition to hardware for interfacing with the GTU, the LTU also contains a task prefetcher and small buffer to hide the latency of accessing the GTU. Hiding this latency can significantly improve performance. While a typical enqueue operation from a thread can be almost entirely overlapped with useful work, a dequeue operation is on a thread's critical path. If a logical core were to wait to contact the GTU until the thread running on it finished its current task, the thread would have to stall for the entire GTU access latency. If the latency is significant compared to the size of a task, this can take up a significant fraction of an application's execution time. Therefore, the LTU tries to hide the dequeue latency. It does this by trying to keep at least one task in the LTU's buffer at all times. A dequeue is able to grab such a task very quickly. The LTU operates as follows.
- On a dequeue, if there is a task in the LTU's buffer, that task is returned to the thread and a prefetch for the next available task is sent to the GTU. When the GTU receives a prefetch request, it treats it as a regular dequeue, and will steal a task from another logical core's queue if necessary.
- On an enqueue, the task is placed in the LTU's buffer. Since the proposed hardware uses a LIFO ordering of tasks for a given thread, if the buffer is already full, the oldest task in the buffer is sent to the GTU.
In our experience, the LTU's buffer only needs to hold a single task to hide the GTU access latency. If this latency grows in the future, the buffer could be made larger. However, there is a cost: tasks in an LTU's buffer cannot be stolen since they are not visible to the GTU. This could hurt performance if there are only a few tasks available at a time.
Table 1: Loop-level benchmarks and their inputs
| Benchmark | Data set |
|---|---|
| Gauss-Seidel | 128x128, 256x256, 512x512 |
| Dense Matrix-Matrix Multiply (MMM) | 64x64, 128x128, 256x256 |
| Dense Matrix-Vector Multiply (MVM) | 64x64, 128x128, 256x256, 512x512 |
| Sparse Matrix-Vector Multiply (MVM) | 4 data sets |
| Scaled Vector Add | 512, 1024, 4096, 16384 elements |
Table 2: Task-level benchmarks and their inputs
| Benchmark | Data set |
|---|---|
| Game Physics Constraint Solver | 4 Models |
| Binomial Tree | 512, 1024, 2048, 4096 |
| Canny Edge Detection | cars, costumes, camera2, camera4 |
| Cholesky Factorization | 4 data sets |
| Forward Solve | 4 data sets |
| Backward Solve | 4 data sets |
