Work-Group Level Parallelism
Because work-groups are independent, they can execute concurrently on different hardware threads. The number of work-groups should therefore be at least the number of logical cores. A larger number of work-groups gives the runtime more flexibility in scheduling, at the cost of task-switching overhead.
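The sizing rule above can be sketched as a small host-side helper. This is a minimal illustration, not part of the OpenCL API: the function name `choose_local_size` and its parameters are hypothetical, and it assumes a one-dimensional range whose global size must be a multiple of the work-group size. In a real host program the compute unit count would typically come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): returns the largest
 * work-group size that
 *   (a) does not exceed max_local_size,
 *   (b) divides global_size evenly, and
 *   (c) leaves at least compute_units work-groups.
 * Falls back to 1 when no larger size satisfies all three. */
size_t choose_local_size(size_t global_size,
                         size_t compute_units,
                         size_t max_local_size)
{
    assert(global_size > 0 && compute_units > 0 && max_local_size > 0);

    size_t limit = max_local_size < global_size ? max_local_size : global_size;

    for (size_t local = limit; local > 1; --local) {
        if (global_size % local != 0)
            continue;                     /* must divide the range evenly */
        if (global_size / local >= compute_units)
            return local;                 /* enough groups for every core */
    }
    return 1;
}
```

For example, with 1024 work-items, 8 logical cores, and a maximum work-group size of 256, the helper returns 128, which yields exactly 8 work-groups. Choosing a smaller size than the helper returns trades larger groups for more scheduling flexibility.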
Note that multiple cores of a CPU, as well as multiple CPUs in a multi-socket machine, constitute a single OpenCL device. The individual cores are its compute units. The Device Fission extension enables you to control compute unit utilization within a compute device. You can find more information on Device Fission in the Intel® Code Builder for OpenCL™ API - User Manual.
For the best performance and parallelism between work-groups, ensure that execution of a work-group takes at least 100,000 clock cycles. A shorter execution time increases the proportion of switching overhead relative to actual work.
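As a rough worked example of this guideline, the sketch below estimates how many work-items a work-group needs to reach the 100,000-clock target. The helper name `min_items_per_group` and the constant `TARGET_CLOCKS_PER_GROUP` are hypothetical. The sketch assumes that you have measured or estimated the cost of one work-item in clocks and that the cost of a work-group grows linearly with its size, which ignores vectorization and other effects.

```c
#include <assert.h>
#include <stddef.h>

/* Target taken from the guideline above: at least 100,000 clocks
 * of work per work-group. */
#define TARGET_CLOCKS_PER_GROUP 100000u

/* Hypothetical helper: smallest number of work-items per work-group
 * whose combined cost reaches the target, assuming every work-item
 * costs about clocks_per_item clocks (a user-supplied estimate). */
size_t min_items_per_group(size_t clocks_per_item)
{
    assert(clocks_per_item > 0);
    /* Integer ceiling of TARGET_CLOCKS_PER_GROUP / clocks_per_item. */
    return (TARGET_CLOCKS_PER_GROUP + clocks_per_item - 1) / clocks_per_item;
}
```

Under these assumptions, a kernel that spends about 400 clocks per work-item needs at least 250 work-items per work-group. If that exceeds the practical work-group size, the alternative is to give each work-item more work.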