Loading and Storing Data in Greatest Chunks

OpenCL™ Developer Guide for Intel® Processor Graphics

ID 773088

Date 3/20/2019

Version 2019.4

Saturating the available memory bandwidth is very important. Byte data types are actually loaded as integer data types (DWORDs), but they also trigger extra instructions to pack and unpack the data. Using (u)int4 or float4 for buffers saves a lot of compute, even if you unpack the data manually afterward. In other words, avoid using uchar4 or char4 for buffers. The following example shows the pattern to avoid:

__kernel void amp (__constant uchar4* src, __global uchar4* dst)
{
        …
        uint4 tempSrc = convert_uint4(src[offset]);//Load one RGBA8 pixel
        …
        //some processing
        …
        dst[offset] = convert_uchar4(tempDst);
}

Consider instead accessing the data through the uint4 data type, so that each load fetches four RGBA8 pixels:

__kernel void amp (__constant uint4* src, __global uint4* dst)
{
        …
        uint4 tempSrc = src[offset]; // Load 4 RGBA8 pixels 
        …
        //some processing in uint4
        uint r0 = (tempSrc.x & 0xff);//Red component of 1st pixel
        uint r1 = (tempSrc.y & 0xff);//Red component of 2nd pixel
        …
        tempSrc.x >>= 8;
        tempSrc.y >>= 8;
        …
        tempSrc.x >>= 8;
        tempSrc.y >>= 8;
        …
        uint a0 = (tempSrc.x & 0xff);// Alpha component of 1st pixel
        uint a1 = (tempSrc.y & 0xff);// Alpha component of 2nd pixel
        //any calculations on the individual components
        …
        uint4 final = 0; // repack them:
        final.x = (r0) | ((g0) << 8) | ((b0) << 16) | ((a0) << 24);//first pixel
        final.y = (r1) | ((g1) << 8) | ((b1) << 16) | ((a1) << 24);//second pixel
        …
        dst[offset] = final; 
}
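
The unpack and repack steps are easy to get wrong by one shift, so it can help to check them on the host first. The following is a minimal sketch in plain C, not part of the original guide. The names `rgba_t`, `unpack_rgba8`, and `pack_rgba8` are hypothetical, and the layout (red in the lowest byte, alpha in the highest) is an assumption taken from the masks used in the kernel above.

```c
#include <stdint.h>

/* One RGBA8 pixel with each channel widened to 32 bits (hypothetical helper type). */
typedef struct {
    uint32_t r, g, b, a;
} rgba_t;

/* Split a packed pixel. Assumed layout: R in bits 0-7, G in 8-15,
 * B in 16-23, A in 24-31. */
rgba_t unpack_rgba8(uint32_t packed)
{
    rgba_t p;
    p.r = packed & 0xffu;
    p.g = (packed >> 8) & 0xffu;
    p.b = (packed >> 16) & 0xffu;
    p.a = (packed >> 24) & 0xffu;
    return p;
}

/* Reassemble the pixel. The masks guard against channel values that
 * grew past 8 bits during processing. */
uint32_t pack_rgba8(rgba_t p)
{
    return (p.r & 0xffu)
         | ((p.g & 0xffu) << 8)
         | ((p.b & 0xffu) << 16)
         | ((p.a & 0xffu) << 24);
}
```

Each of the four components of a uint4 value would go through the same pair of steps, one packed pixel per component.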

NOTE:

The global size in the second example above is one quarter of the original size, because each work item now processes four pixels.
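
As a rough illustration of this note, the host-side global size could be computed as shown here. This is a sketch under stated assumptions: `uint4_global_size` and `PIXELS_PER_WORK_ITEM` are hypothetical names, and rounding up is a choice made for this example, since the guide does not say how to handle pixel counts that are not a multiple of four.

```c
#include <stddef.h>

/* Pixels consumed by one work item when the buffers are typed as uint4. */
#define PIXELS_PER_WORK_ITEM 4

/* Number of work items needed to cover pixel_count RGBA8 pixels,
 * rounded up (hypothetical helper, not an OpenCL API). */
size_t uint4_global_size(size_t pixel_count)
{
    return (pixel_count + PIXELS_PER_WORK_ITEM - 1) / PIXELS_PER_WORK_ITEM;
}
```

The result would be passed as the global work size to `clEnqueueNDRangeKernel`. When the pixel count is not a multiple of four, the buffers need padding or the kernel needs a guard for the last work item.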

If your kernel operates on floating-point data, consider using the float4 data type, which fetches four times as much data in one load as a scalar float. It also helps ensure that each work item has enough work to do, which amortizes the work-item scheduling overhead.

For the CPU device, this optimization is equivalent to explicit (manual) vectorization. See the “Using Vector Data Types” section for more information.

Accessing data in greater chunks can improve data throughput on the Intel® Graphics device, but it might slightly reduce CPU device performance, as also explained in the “Using Vector Data Types” section.
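
The float4 advice above can be sketched in plain C to show how the work is divided. This is a host-side model only, not device code. `float4_t` stands in for the OpenCL float4 type, and `scale_work_item`, `scale_all`, and the `gain` parameter are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Stand-in for the OpenCL float4 type; plain C has no built-in equivalent. */
typedef struct {
    float x, y, z, w;
} float4_t;

/* Body of one hypothetical work item: read four floats as one element,
 * process them, and write four floats back as one element. */
void scale_work_item(const float4_t *src, float4_t *dst, size_t gid, float gain)
{
    float4_t v = src[gid];
    v.x *= gain;
    v.y *= gain;
    v.z *= gain;
    v.w *= gain;
    dst[gid] = v;
}

/* Host loop standing in for the NDRange: n_vec4 is the number of
 * float4 elements, that is, one quarter of the scalar float count. */
void scale_all(const float4_t *src, float4_t *dst, size_t n_vec4, float gain)
{
    for (size_t gid = 0; gid < n_vec4; ++gid)
        scale_work_item(src, dst, gid, gain);
}
```

Each work item touches four values, so the same buffer is covered with a quarter of the work items that a scalar float kernel would need.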
