Partitioning the Work
- Coarse-grain partitioning of the work between CPU and GPU devices:
  - Use inter-frame load balancing when the data consists of naturally independent pieces, such as video frames or multiple image files, and distribute whole pieces between the devices for processing. This approach minimizes scheduling overhead. However, it requires a sufficiently large number of frames, and it might increase the load on shared resources, such as the shared last-level cache and memory bandwidth.
  - Use intra-frame load balancing to split the data that is currently being processed between the devices. For example, for an input image, the CPU processes the first half and the GPU processes the rest. Adjust the actual splitting ratio dynamically, based on how fast each device completes its tasks. One specific approach is to keep a performance history for the previous frames. Refer to the dedicated “HDR Tone Mapping for Post Processing using OpenCL - Multi-Device Version” SDK sample for an example.
- Fine-grain partitioning: split the work into smaller parts that the devices request from a pool of remaining work. This partitioning method simulates a “shared queue”. Faster devices request new input more often, resulting in automatic load balancing. The grain size must be large enough to amortize the overhead of the additional scheduling and kernel submission.
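
The dynamic splitting ratio described for intra-frame load balancing can be sketched as follows. This is a minimal illustration in plain Python, not the method used by the SDK sample: the function names, the `smoothing` parameter, and the 5% clamp are assumptions made for the example, and the measured times are assumed to be positive. It estimates each device's throughput from the previous frame and moves the ratio toward the split at which both devices would finish at the same time.

```python
def update_split_ratio(ratio, cpu_time, gpu_time, smoothing=0.5):
    """Return the fraction of the next frame to assign to the CPU.

    ratio     -- fraction of the previous frame the CPU processed (0..1)
    cpu_time  -- seconds the CPU needed for its share (assumed > 0)
    gpu_time  -- seconds the GPU needed for its share (assumed > 0)
    smoothing -- weight of the new estimate versus the history (hypothetical knob)
    """
    # Throughput of each device, in "fraction of a frame per second".
    cpu_speed = ratio / cpu_time
    gpu_speed = (1.0 - ratio) / gpu_time
    # Split at which both devices would finish at the same moment.
    target = cpu_speed / (cpu_speed + gpu_speed)
    # Blend with the previous ratio so one noisy frame does not swing the split.
    new_ratio = (1.0 - smoothing) * ratio + smoothing * target
    # Never hand a device an empty share, so its speed can still be measured.
    return min(max(new_ratio, 0.05), 0.95)


def split_rows(height, ratio):
    """Return the (start, end) row ranges for the CPU and the GPU."""
    cpu_rows = int(round(height * ratio))
    return (0, cpu_rows), (cpu_rows, height)
```

In a real application, the two row ranges would become the work ranges of the kernels enqueued on each device's command queue, and the times would come from profiling events or host timers.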
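
The “shared queue” behavior of fine-grain partitioning can be simulated without any OpenCL code. In the sketch below, two host threads stand in for the devices and `time.sleep` stands in for kernel execution; the function name, the `grain` parameter, and the per-item delays are hypothetical values chosen for illustration. Each simulated device takes the next chunk as soon as it finishes the previous one, so the faster device ends up processing more of the work.

```python
import queue
import threading
import time


def run_shared_queue(total_items, grain, device_delays):
    """Simulate devices pulling fixed-size chunks from a shared pool of work.

    device_delays maps a device name to the simulated seconds needed per item.
    Returns a dict mapping each device name to the number of items it processed.
    """
    pool = queue.Queue()
    for start in range(0, total_items, grain):
        pool.put((start, min(start + grain, total_items)))

    processed = {name: 0 for name in device_delays}

    def worker(name, delay):
        while True:
            try:
                start, end = pool.get_nowait()  # request the next chunk
            except queue.Empty:
                return  # no work left
            time.sleep(delay * (end - start))  # stand-in for kernel execution
            processed[name] += end - start  # each worker updates only its own key

    threads = [
        threading.Thread(target=worker, args=(name, delay))
        for name, delay in device_delays.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


if __name__ == "__main__":
    # The simulated GPU is four times faster per item than the simulated CPU.
    print(run_shared_queue(100, 10, {"cpu": 0.004, "gpu": 0.001}))
```

A smaller `grain` balances the load more evenly but increases the number of requests; in a real application, each request also carries the kernel submission cost that the grain size has to amortize.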