1.2. Pipelining Framework for High Throughput Design
A given algorithm is partitioned into tasks that process data completely before the next task starts. Each task is modeled as a producer and/or a consumer that consumes data produced by the previous task and produces data for the next task. Each task waits for data from the previous task, in a loop on a different thread that runs on separate cores. With continuous streaming of data, the framework forms a pipelined architecture. After the pipeline is filled-up at all stages, output data is generated at the speed of the slowest task in the pipeline. In general, if there are n tasks in the algorithm or pipeline and all tasks are taking about ts time to produce their output, the process time for each input reduces to ts instead of n x ts. This improvement is presented in Performance of the Original Algorithm for Multiple Input Data and Performance of the Pipelined Algorithm for Multiple Input Data.
Queuing is the best way to transfer work data between these threads or tasks (steps of the pipeline). Main thread might enqueue the incoming inputs or job-items in the input queue for the next task or thread. This thread dequeues and consumes job-items from the front of the input queue as soon as it has the ability to process a new job. Then, it generates a new job-item and pushes it into the input queue for the next task or thread, as illustrated in the following figure.
- Synchronizing process to access the same queue by a producer and a consumer.
- Designing threads properly to avoid starving or overflowing queues and to maintain efficient performance.
Since each queue is accessed by both producer and consumer tasks, you must protect the queue as a critical area. Mutex lock (pthread_mutex_t) is used to control access to the queue (both for enqueuing and dequeuing data from the queue). The enqueue or push function must ensure that the queue is not full and obtain the lock before writing into it. The dequeue or pop function must also lock the queue before reading the front element from the queue, in case the queue is not empty. To efficiently check the status of the queue, a conditional (pthread_condition_t) variable is used. If a queue is empty, instead of busy looping, wait for the pthread_condition_t variable to complete. The variable releases the lock on the queue allowing another process (producer) to access the queue and push into it. Eventually, the producer sends signal to the consumer after pushing the new item and the consumer thread (waiting on pthread_condition_t variable) return to lock the queue, pop an item from it, release the lock, and continue its process.
This implementation minimizes the synchronization process because all the required synchronization is implemented as part of the push and pop functions to or from the queue.
Avoiding Overflow and Starvation
It is important to ensure that the throughput of all pipeline steps (all threads) are in the same range, and the producing and consuming rates are similar. This avoids overflow and starvation of the queues. Hence, the throughput of all pipeline steps must be calculated, and each thread or task is optimized in case its throughput is not in the expected range.
It is possible to have more than one consumer thread in case the process of that stage is too time consuming, as illustrated in Producer-Consumer Pipelining Framework.