Using Specialization in Branching
You can improve the performance of both CPU and Intel® Graphics devices by converting uniform conditions, which are equal across all work-items, into compile-time branches, a technique known as specialization.
The approach, sometimes referred to as an Uber-Shader in the pixel shader context, is to have a single kernel that implements all needed behaviors and to let the host logic disable the paths that are not currently required. However, branching on constants wastes device resources, because the data is still calculated before it is thrown away. Consider a preprocessor approach instead, using #ifdef blocks.
Original kernel that uses constants to branch:
__kernel void foo(__constant int* src,
                  __global int* dst,
                  unsigned char bFullFrame, unsigned char bAlpha)
{
        …
        if(bFullFrame) //uniform condition (equal for all work-items)
        {                       
        …
                if(bAlpha) //uniform condition
                {       
                …               
                }
                else
                {
                …
                }
        }
        else
        {
        …
        }
} 
The same kernel with compile-time branches:
__kernel void foo(__constant int* src,                                          
                 __global int* dst)
{
        …
        #ifdef bFullFrame
        {                       
        …
                #ifdef bAlpha
                {       
                …               
                }
                #else
                {
                …
                }
                #endif
        #else
        {
        …
        }
        #endif
} 
Consider a similar optimization for other constants.
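The symbols tested by the #ifdef blocks have to be defined when the program is built, which the host can do through the options string of clBuildProgram using the standard -D option. The following is a minimal host-side sketch, not a definitive implementation: the helper name make_build_options and its parameters are hypothetical and only illustrate how the former kernel arguments could be turned into build options.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of the OpenCL API): writes the build
 * options that define bFullFrame and/or bAlpha for the specialized kernel.
 * Returns 0 on success, -1 if the buffer is too small. */
static int make_build_options(char *buf, size_t size, int fullFrame, int alpha)
{
    int n = snprintf(buf, size, "%s%s%s",
                     fullFrame ? "-D bFullFrame" : "",
                     (fullFrame && alpha) ? " " : "",
                     alpha ? "-D bAlpha" : "");
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Assumed usage on the host, with program and device created elsewhere:
 *
 *     char options[64];
 *     if (make_build_options(options, sizeof options, 1, 0) == 0)
 *         clBuildProgram(program, 1, &device, options, NULL, NULL);
 */
```

Each combination of symbols produces a different binary, so a host that switches between paths at run time would need to build, and probably cache, one program per combination it uses.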
Finally, avoid or minimize branching in short computations by using the min, max, clamp, or select built-ins instead of “if and else”.
Also, when optimizing specifically for the OpenCL™ Intel Graphics device, ensure that all conditionals are evaluated outside of code branches (for the CPU device, this makes no difference).
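As an illustration of replacing “if and else” in a short computation, the sketch below contrasts a branching range limit with its min/max form. It is written as plain C so that it can be compiled on the host: the helpers min_i and max_i and both function names are hypothetical stand-ins, and in OpenCL C the second function would reduce to a single clamp(v, lo, hi) call.

```c
/* Stand-ins so the sketch compiles as plain C. In OpenCL C, use the
 * built-ins min, max and clamp directly instead of these helpers. */
static int min_i(int a, int b) { return a < b ? a : b; }
static int max_i(int a, int b) { return a > b ? a : b; }

/* Branching form: "if and else" in a short computation. */
static int limit_branching(int v, int lo, int hi)
{
    if (v < lo) {
        return lo;
    } else if (v > hi) {
        return hi;
    } else {
        return v;
    }
}

/* Built-in style form: matches clamp(v, lo, hi) in OpenCL C, which is
 * defined as min(max(v, lo), hi) and assumes lo <= hi. */
static int limit_builtin(int v, int lo, int hi)
{
    return min_i(max_i(v, lo), hi);
}
```

Both functions return the same value for any input with lo <= hi, so the substitution changes only how the result is computed.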
For example, the following code demonstrates conditional evaluation in the conditional blocks:
if(x && y || (z && functionCall(x, y, z)))
{
    // do something
}
else
{
    // do something else
}
The following code demonstrates the conditional evaluation moved outside of the conditional blocks:
//improves compilation time for Intel® Graphics device
bool comparison = x && y || (z && functionCall(x, y, z));
if(comparison)
{
    // do something
}
else
{
    // do something else
}