When you tune your programs for the Intel® Graphics device, be aware of how your kernels execute on the hardware:
Optimize the number of work-groups
Optimize the work-group size
Use barriers in kernels wisely
Optimize thread utilization
The primary goal of every throughput computing machine is to keep a sufficient number of work-groups active, so that when one stalls, another can run on the same hardware resources.
The main things to consider:
Launch enough work-items to keep the EU threads busy. Keep in mind that the compiler may pack up to 32 work-items per thread (with SIMD-32).
In short, lightweight kernels, use short vector data types and compute multiple pixels per work-item to better amortize the thread launch cost.
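The arithmetic behind these two points can be sketched in a few lines of C. This is a minimal illustration, not part of any Intel API: the helper names `min_work_items` and `threads_launched` are hypothetical, and the EU count and threads-per-EU values are assumptions you would replace with the figures for your own device. The SIMD width is chosen by the compiler, so 32 is the upper bound mentioned above rather than a guaranteed value.

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling division for non-negative a and positive b. */
static size_t ceil_div(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Smallest global size that gives every hardware thread a full SIMD batch.
 * eu_count and threads_per_eu are device-specific (assumed inputs);
 * simd_width is the number of work-items the compiler packs per thread. */
static size_t min_work_items(size_t eu_count, size_t threads_per_eu,
                             size_t simd_width)
{
    return eu_count * threads_per_eu * simd_width;
}

/* Number of EU threads launched to cover `pixels` pixels when each
 * work-item processes `pixels_per_item` pixels (for example, 4 when the
 * kernel works on a 4-wide vector type). */
static size_t threads_launched(size_t pixels, size_t pixels_per_item,
                               size_t simd_width)
{
    size_t work_items = ceil_div(pixels, pixels_per_item);
    return ceil_div(work_items, simd_width);
}
```

For a hypothetical device with 24 EUs and 7 threads per EU, `min_work_items(24, 7, 32)` gives 5376, so a launch with fewer work-items than that cannot occupy every thread at SIMD-32. For a 1920x1080 image, `threads_launched(1920 * 1080, 1, 32)` gives 64800 threads, while processing four pixels per work-item reduces that to 16200, so the same fixed launch cost per thread is spread over four times as much work.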