- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
The Foundations for Scalable Multi-Core Software in Intel® Threading Building Blocks
THE TBB TASK SCHEDULER
The Intel® TBB task scheduler is a work-stealing scheduler. The design of the TBB scheduler is inspired by the early Cilk scheduler, which Blumofe and Leiserson [2, 3] proved has optimal space, time, and communication bounds for well-structured ("fully strict") programs.
In a system that uses work-stealing, each thread maintains a local pool of tasks that are ready to run. Using local pools avoids the contention that may arise with the use of a global task queue. When executed, a task performs work and also may create additional tasks that are placed in the local pool. If a thread's pool becomes empty, it attempts to steal a task from another random thread's pool. This approach is in contrast to static scheduling methods where threads are assigned work up-front and from other dynamic scheduling methods where a central pool of tasks (or iterations) is maintained.
Blumofe and Leiserson [2, 3] showed that the expected parallel runtime of applications scheduled by the Cilk scheduler is E[Tp = O(T1/P+T∞), where T1 is the "work" or sequential time of the application, and T∞ is the critical path length. This optimal bound shows that as P -> ∞, the expected time is only limited by the critical path length (the sequential part) of the application.
To achieve these same optimal bounds, the TBB task scheduler also uses a randomized work-stealing algorithm. An overview of its implementation is provided in the following section.
An Overview of the Task Scheduler Design
The TBB task scheduler evaluates task graphs. A task graph is a directed graph where nodes are tasks, and each node points to its parent, which is another task that is waiting on it to complete, or NULL. Each task has a refcount that counts the number of tasks that have it as their parent. Each task also has a depth, which is usually one more than the depth of its parent. The work of the task is performed by a user-defined function execute that is encapsulated within the task object.
To assist in providing an overview of the Intel TBB task scheduler, we use calculation of the nth Fibonacci number as a running example. A serial implementation of our Fibonacci example is shown below:
The function ParallelFib, shown below, uses the TBB task API to construct the root node of a task graph, an object of type FibTask. When this task's function execute is called, it will create two child tasks, also of type FibTask. Child a will calculate fibonacci(n-1) and child b will calculate fibonacci(n-2). When each of these tasks is executed, they will in turn recursively spawn child tasks as follows:
For performance reasons, TBB requires users to set task refcounts explicitly with the set_ref_count call, instead of atomically incrementing it in allocate_child. The refcount should be set for a task before spawning any of its children.
Each task that spawns children waits at the spawn_and_wait_for_all call until all of its children complete. An additional guard reference is required for this, as shown in the above example by using the refcount of 3, while there are only 2 child tasks. A thread that enters a spawn_and_wait_for_all is free to execute other ready tasks while it waits.
In ParallelFib, after completing the wait call, the results of the child tasks are summed and returned. When n<CutOff, no additional child tasks are created and instead the leaf task will directly call SerialFib.
Figure 1 shows a snapshot of a task graph that might be created by an execution of ParallelFib. Tasks with non-zero reference counts (A, B, and C) must wait for their child tasks to complete before proceeding. The leaf tasks are ready to run.
As mentioned previously, the TBB library maintains a pool of threads, each of which has its own pool of ready tasks. Each per-thread task pool is implemented as an array of lists of tasks. A task goes into a pool only when it is deemed ready to run, i.e., it has been spawned and has a refcount of 0. Figure 2 shows a snapshot of a pool that corresponds to the task graph in Figure 1. Tasks A, B, and C do not appear in the pool because they have non-zero refcounts and therefore are not ready to run.

Figure 1: Intermediate task graph for the Fibonacci example
click image for larger view

Figure 2: A pool of ready tasks that corresponds to the graph in Figure 1
click image for larger view
Breadth-First Theft and Depth-First Work
The TBB task scheduler's fundamental strategy is "breadth-first theft and depth-first work." The breadth-first theft rule raises parallelism sufficiently to keep threads busy. The depth-first work rule keeps each thread operating efficiently once it has sufficient work to do.
A depth-first execution of a graph is the most efficient when performing a sequential execution because it provides better temporal locality and limits the space required for storing tasks. The deepest tasks are the most recently created tasks, and therefore are hottest in cache. When they complete, their parents can then execute, and although the parents are not hot in the cache, they're warmer than the tasks above them. A depth-first execution also limits the space required for storing tasks. When executing a node, only the nodes that lie along the path from the root to that node need to exist in memory.
Depth-first execution of a graph, however, limits parallelism. In contrast, always executing the shallowest tasks first leads to a breadth-first unfolding of the tree. This creates an exponential number of nodes that coexist simultaneously, providing ample tasks to steal but also excessively consuming memory.
To balance efficient execution and parallelism, the TBB scheduler therefore uses the "breadth-first theft and depth-first work" rule.
Each thread in the TBB thread pool executes a worker routine that actively looks for ready tasks to execute. A thread will first take the task at the front of the deepest list of its own pool1. If there are no ready tasks in its own pool and there is at least one non-empty task pool, it will then steal from the front of the shallowest list of another randomly chosen pool. If the chosen pool is empty, the thread tries to steal from another randomly selected thread until it succeeds.
Scheduling Trade-offs and Optimizations
The Intel TBB task scheduler was inspired by the Cilk scheduler. Cilk is a parallel extension of the C programming language that defines additional keywords and constructs. Since Cilk requires a modified C compiler, it can rely on the compiler to perform Cilk-specific transformations and optimizations.
TBB on the other hand is a C++ template library and can be compiled using any standard-compliant C++ compiler. While this makes TBB more portable, it also means that correctness and performance cannot depend on any TBB-specific compiler passes. The TBB task API has therefore been designed to allow users to perform certain scheduling optimizations "manually" to achieve increased performance when necessary. The most important of these optimization opportunities are discussed below and their impact is evaluated in the Experimental Results section.
Minimizing Stack Use with Continuation Tasks
As mentioned before, TBB uses a "breadth-first theft and depth-first work" approach. However, this approach can sometimes cause the processor stack to overflow.
For example, consider the case when a task enters a spawn_and_wait_for_all. The task cannot continue until all of its children complete. On entering the wait, the calling thread is released to execute or steal other tasks. If it steals the shallowest task from another thread, it then begins a depth-first execution of this stolen tree.
However, the initial task that entered the spawn_and_wait_for_all is kept on the processor stack to maintain its local storage and instruction pointer. The newly stolen tree then begins to unfold on top of the waiting subtree on the processor stack. This situation could occur repeatedly, causing the stack to overflow.
To avoid this situation, the TBB task scheduler forces a thread to only steal tasks that are deeper than any waiting task. While this limits stack growth, it also limits the choice of tasks to steal and therefore might limit parallelism.
To avoid restricting the choice of tasks to steal while at the same time limiting stack space growth, the TBB task interface allows developers to specify continuation tasks. A task can replace itself in the graph with a continuation task and then return, freeing up its stack space. When the children complete, the continuation task is spawned to finish the work delegated to it by the parent.
To use a continuation task c, the children of a task are allocated as children of c and not the task itself. Like other tasks, c becomes ready to run when its children complete and will only then be spawned. The code for spawning children using "continuation-passing" for our Fibonacci example is shown below:
The implementation of class FibContinuation (not shown) inherits from class task, and sums c.x and c.y into sum in its execute function. The benefit of this approach is that after spawning children tasks in FibTask, the execute function returns, removing itself from the stack. Only the tasks that are actively executing are on the processor stack.
While there are benefits to the use of continuation tasks, there are also downsides. When using continuation tasks, all live state passed from parent to child cannot reside in the parent or its execute stack frame, since the parent may be destroyed before the child completes. Therefore additional care may be needed to properly encapsulate the live state of the computation. Also continuation passing requires the creation of an additional task object. In fine-grain tasks, this additional runtime overhead of task creation might be noticeable.
Reducing Overheads: Scheduler Bypass and Task Recycling
Luckily once an algorithm is using continuation tasks, it can also make use of two other overhead reducing techniques: scheduler bypass and task recycling.
With scheduler bypass, a task's execute function explicitly returns the next task to execute. Since the next task is known, the more complex logic to select a task is avoided in the scheduler's code. To use scheduler bypass in our Fibonacci example, the child task a is not spawned but is instead returned as shown below:
c.spawn( a );
return &a;
Once continuation passing and scheduler bypass are in use, it also becomes possible to recycle task objects. Normally when a task returns from its execute function, the task object is automatically deallocated. However, a user can choose to recycle a task object, making it live beyond the return and avoiding the repeated allocation and deallocation of task objects. Recycling a task as one of its own children is shown below for Fibonacci:
As shown in the Experimental Results section, scheduler bypass and task recycling often more than make up for the extra overhead added from the allocation of a continuation task.
[1] Optimizations will be discussed later that allow tasks to directly return a next task to execute, bypassing the task scheduler.
