Loading and Storing Data in Greatest Chunks
“Saturating” the available memory bandwidth is very important. Byte data types such as uchar4 and char4 are actually loaded as integer data types (DWORDs), and they also trigger extra instructions to pack and unpack the data. Using (u)int4 or float4 for buffers saves a lot of compute, even if you unpack the data manually afterward. In other words, avoid using uchar4 or char4 for buffers. See the example below:
__kernel void amp (__constant uchar4* src, __global uchar4* dst)
{
        …
        uint4 tempSrc = convert_uint4(src[offset]);//Load one RGBA8 pixel
        …
        //some processing
        …
        dst[offset] = convert_uchar4(tempDst);
} 
Consider accessing the data through the uint4 data type instead:
__kernel void amp (__constant uint4* src, __global uint4* dst)
{
        …
        uint4 tempSrc = src[offset]; // Load 4 RGBA8 pixels 
        …
        //some processing in uint4
        uint r0 = (tempSrc.x & 0xff);//Red component of 1st pixel
        uint r1 = (tempSrc.y & 0xff);//Red component of 2nd pixel
        …
        tempSrc.x >>= 8;
        tempSrc.y >>= 8;
        …
        tempSrc.x >>= 8;
        tempSrc.y >>= 8;
        …
        uint a0 = (tempSrc.x & 0xff);// Alpha component of 1st pixel
        uint a1 = (tempSrc.y & 0xff);// Alpha component of 2nd pixel
        //any calculations on the individual components
        …
        uint4 final = 0; // repack them:
        final.x = (r0) | ((g0) << 8) | ((b0) << 16) | ((a0) << 24);//first pixel (alpha goes in bits 24-31)
        final.y = (r1) | ((g1) << 8) | ((b1) << 16) | ((a1) << 24);//second pixel (alpha goes in bits 24-31)
        …
        dst[offset] = final; 
} 
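The manual unpack and repack steps in the kernel above can be checked in isolation. The following is a minimal sketch in plain C rather than OpenCL C, assuming the byte layout the kernel implies (red in bits 0-7, green in bits 8-15, blue in bits 16-23, alpha in bits 24-31). The type rgba_t and the helpers unpack_rgba8 and pack_rgba8 are hypothetical names introduced here for illustration; they are not part of any OpenCL API.

```c
#include <stdint.h>

/* Hypothetical helper type: one RGBA8 pixel with each component widened
   to 32 bits, mirroring the uint temporaries used in the kernel. */
typedef struct { uint32_t r, g, b, a; } rgba_t;

/* Unpack one pixel from a 32-bit word.
   Assumed layout: R = bits 0-7, G = 8-15, B = 16-23, A = 24-31. */
static rgba_t unpack_rgba8(uint32_t p)
{
    rgba_t c;
    c.r = p & 0xffu;
    c.g = (p >> 8) & 0xffu;
    c.b = (p >> 16) & 0xffu;
    c.a = (p >> 24) & 0xffu;
    return c;
}

/* Repack the components into one 32-bit word. Each component is masked
   to 8 bits so an out-of-range value cannot spill into its neighbor. */
static uint32_t pack_rgba8(rgba_t c)
{
    return (c.r & 0xffu)
         | ((c.g & 0xffu) << 8)
         | ((c.b & 0xffu) << 16)
         | ((c.a & 0xffu) << 24);
}
```

Each component must be shifted back to the bit position it was read from, so that packing an unmodified pixel reproduces the original word. In a kernel, the same expressions would be applied to each of the four lanes of the uint4 value.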
If your kernel operates on floating-point data, consider using the float4 data type, which fetches four times as much data in one load as a scalar float. It also helps to ensure that each work-item has enough work to do, amortizing the work-item scheduling overheads.
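To illustrate the effect on the amount of work per work-item, the following is a rough sketch in plain C that emulates a float4 kernel on the host. The struct float4_t stands in for the built-in OpenCL float4 type, and scale_item, scale_buffer, and the gain parameter are hypothetical names chosen for this example. The sketch assumes the number of floats is a multiple of 4.

```c
#include <stddef.h>

/* Plain-C stand-in for the OpenCL built-in float4 (assumption of this sketch). */
typedef struct { float x, y, z, w; } float4_t;

/* Work done by one hypothetical work-item: a single 16-byte load,
   four multiplies, and a single 16-byte store. */
static void scale_item(const float4_t *src, float4_t *dst, size_t gid, float gain)
{
    float4_t v = src[gid];
    v.x *= gain;
    v.y *= gain;
    v.z *= gain;
    v.w *= gain;
    dst[gid] = v;
}

/* Host-side loop emulating the NDRange. Returns the number of work-items
   used, which is a quarter of what a scalar float kernel would need.
   n_floats is assumed to be a multiple of 4. */
static size_t scale_buffer(const float4_t *src, float4_t *dst,
                           size_t n_floats, float gain)
{
    size_t n_items = n_floats / 4;
    for (size_t gid = 0; gid < n_items; ++gid)
        scale_item(src, dst, gid, gain);
    return n_items;
}
```

With eight floats, the emulated range contains two work-items instead of eight, and each one performs four times the arithmetic per load and store.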
For the CPU device, this optimization is equivalent to explicit (manual) vectorization. See the “Using Vector Data Types” section for more information.
Accessing data in greater chunks can improve the Intel® Graphics device data throughput, but it might slightly reduce the CPU device performance as also explained in the “Using Vector Data Types” section.