Get Started with Intel® SDK for OpenCL™ Applications 2020 on Linux* OS with Training Sample

Published: 03/03/2020  

Last Updated: 03/02/2020

By Michael R Carroll


This article demonstrates getting started with OpenCL™ Tools developer components in Intel® System Studio 2020 initial release and Update 1 release on Linux* OS. This article also applies to the Intel® SDK for OpenCL Applications 2020 as a standalone. The walkthrough assumes developer targets of both Intel® CPU and Intel® Graphics Technology. Training sample source code is attached for download via upper right orange button and at the article footer.

The article walks through:

  • Part 1:
    • Prerequisites and Installation
    • Check the Deployment
    • Build
    • Program execution
  • Part 2:
    • Debugger
    • Disk Provisions
    • Developer Tips
  • Part 3:
    • Explanation of training sample applications
    • Offline Compilation
    • More resources

The specific platform used in this example is Ubuntu* OS 18.04.3 on an Intel® Core™ i7-6770HQ processor (system model NUC6i7KYK). The Intel® Core™ i7-6770HQ processor has Intel® Iris™ Pro Graphics 580. More information on this platform and other platforms can be viewed at The article walkthrough is *not* exclusive to this specific hardware model.

New developers are encouraged to develop on a system with all intended OpenCL™ implementations and hardware available. However, in the general case, target hardware is not absolutely required for building OpenCL™ applications. Intel® CPU and Intel® Graphics Technology hardware are required to use all the developer tool features. Their implementations are required to support compiling device kernels for each target respectively.

For Intel® oneAPI DPC++/SYCL developers, if you wish to develop SYCL and OpenCL interoperable programs review the article on interoperability.

Part 1


Prerequisites & Installation

  • Download either the standalone or bundled OpenCL™ Tools for Linux* OS:
    • In the bundle see the OpenCL™ Tools for Intel® System Studio Eclipse IDE add-in.
    • This tutorial uses the bundle in
  • Review the Intel® System Studio 2020 release notes.
  • Review the release notes for the OpenCL™ developer tools. Observe supported platforms.
  • Download and install Intel® Graphics Runtime for OpenCL™ Applications if applicable. The CPU Runtime (version 18.1) is already included in the SDK/tools installer:
    • Review the Graphics runtime README/release notes as needed. Observe supported platforms.
    • Review CPU Runtime release notes as needed. Observe supported platforms.
    • This CPU runtime does not require any Intel® Graphics Technology hardware. It serves as a production OpenCL™ implementation useful for:
      • Backing production applications using OpenCL™.
      • Reference development for targeting other types of devices, such as Intel® Graphics Technology or Intel® FPGA
      • Alternative: The Experimental CPU Runtime for OpenCL Applications with SYCL support is maintained and updated to back Intel® oneAPI DPC++/SYCL platforms. Latest features, maintenance, and updates are going to this implementation.
        • For simplicity, this tutorial will use the included CPU Runtime version 18.1. 
  • Example Prerequisites Setup:
    • Set the proxy as needed. Install runtime prerequisites per the release notes.
      #edit apt properties to configure proxy as needed
      # Acquire::http::Proxy “”;
      # Acquire::https::Proxy “”;
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo gedit /etc/apt/apt.conf.d/proxy.conf
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo -E apt update
      #prerequisites noted in CPU runtime release notes, SDK release notes, and Graphics Runtime release notes.
      #consider sudo -E apt-get --simulate install <packages> to observe system changes first!
      tutorial@2020OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ sudo -E apt-get install libnuma1 zlib1g libxml2 lsb-core
    • On capable systems, install the Intel® Graphics Runtime for OpenCL™ Driver "NEO". Various package management repository instructions are maintained on the NEO portal on Github*.
      • This runtime enables OpenCL™ kernels to target Intel® Iris™ Pro Graphics 580 on the Intel® Core™ i7-6770HQ processor in this example, or supported Intel® Graphics Technology available on other Intel® platforms.
      • Review the release notes on the GitHub* portal. The README, FAQ, LIMITATIONS, and DISTRIBUTIONS documents are particularly useful. Observe supported platforms. 
      • If necessary see the OpenCL™ runtimes overview article
      • Install the appropriate runtime package(s) and ensure the user is added to video group.
        tutorial@2020OCL-Dev-Host$ sudo -E add-apt-repository ppa:intel-opencl/intel-opencl
        tutorial@2020OCL-Dev-Host$ sudo -E apt-get update
        tutorial@2020OCL-Dev-Host$ sudo -E apt-get install intel-opencl-icd
      • This graphics runtime is also available as a prebuilt install package(s) for the distribution used in this example from the GitHub* portal releases page.
      • Optional installation for cl_intel_media opencl extensions:
        tutorial@2020OCL-Dev-Host:~/Downloads$ sudo apt-get install libva1 libva-drm1 libdrm2
        Reading package lists... Done
        Building dependency tree       
        Reading state information... Done
        The following additional packages will be installed:
          i965-va-driver libdrm-amdgpu1 libdrm-common libllvm6.0 mesa-va-drivers
  • Install OpenCL™ Tools prerequisites:
    • Make sure to see the install notes section of the release notes for the OpenCL™ tools. That section is the official source for validated package prerequisites.
    • libicu55 prerequisite install example: 
      #Install the character internationalization library mentioned in the release notes
      #set the proxy for wget as necessary
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo gedit /etc/wgetrc
      tutorial@2020OCL-Dev-Host:~/Downloads$ wget
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo apt-get install ./libicu55_55.1-7ubuntu0.4_amd64.deb
    • For mono use the latest guidance from Here is an example used on the test system:
      sudo -E apt-key adv --keyserver-options http-proxy=http://<proxy-server>:<proxy-port> --keyserver hkp:// --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
      sudo apt install apt-transport-https
      echo "deb stable-xenial main" | sudo tee /etc/apt/sources.list.d/mono-official-stable.list
      sudo -E apt update
      sudo -E apt-get install mono-devel ca-certificates-mono
    • dkms dependency example install:
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo -E apt-get install dkms
    • libwebkitgtk IDE rendering dependency:
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo -E apt-get install libwebkitgtk-3.0-0
    • cpio installation for the installer
      sudo -E apt update
      sudo -E apt install cpio
    • Add the user to the video group so user programs are privileged for the Intel® Graphics Runtime for OpenCL™ Driver
      tutorial@2020OCL-Dev-Host:~/Downloads$ sudo usermod -a -G video tutorial

Install Intel® System Studio 2020 (or Intel® SDK for OpenCL™ Applications 2020 standalone)

  • Click the Configure installation from the Intel® System Studio 2020 portal downloads page.
  • Click the Add button for the OpenCL™ Tools for Intel® System Studio Eclipse IDE.
  • Select 'Linux* host Linux* target’ Download.
    • Note any extraction instructions. Configured downloads may come with a .json file to be referenced by the installer.
  • Extract the installation package. Execute the installer with sudo ./ as sudoer user: 
    tutorial@2020OCL-Dev-Host:/Downloads# unzip
    extracting: intel-sw-tools-installer.tar.gz
    extracting: intel-sw-tools-license.lic
    extracting: intel-sw-tools-config-custom.json
    tutorial@2020OCL-Dev-Host:/Downloads# ls
    intel-sw-tools-config-custom.json  intel-sw-tools-installer.tar.gz  intel-sw-tools-license.lic  intel-sw-tools-installer.tar.gz
    tutorial@2020OCL-Dev-Host:/Downloads# tar -xvf intel-sw-tools-installer.tar.gz
    tutorial@2020OCL-Dev-Host:/Downloads# cd intel-sw-tools-installer
    tutorial@2020OCL-Dev-Host:/Downloads/intel-sw-tools-installer# ls
    README.txt  package_id.txt  pset  silent.cfg
    tutorial@2020OCL-Dev-Host:/Downloads/intel-sw-tools-installer# sudo -E ./
  • Set proxy as necessary if presented with the proxy dialog.
  • Install to 'this computer'… optionally set the deployment folder.
  • Review and accept license terms
  • Choose your preferred software improvement collection option. Consent greatly helps Intel® improve this product.
  • View the summary installation page:

  • Allow the installer to continue.
  • After the components are installed, the 'Launch Intel(R) System Studio checkbox' can be deselected. The Build section of this walkthrough demonstrates building from the command line.

Check the Deployment

#Check installation by inspecting icd file residency. Some of these references may be symbolic links.
tutorial@2020OCL-Dev-Host$ ls -l /etc/OpenCL/vendors/
tutorial@2020OCL-Dev-Host$ cat /etc/OpenCL/vendors/intel64.icd
tutorial@2020OCL-Dev-Host$ cat /etc/OpenCL/vendors/intel.icd
  • Confirm that the installation setup ICE loader reference for OpenCL™ implementations:
  • The CPU Runtime sets up symbolic links to the ICD loader library Advanced users: if the development system prefers a different, you may wish to ensure the alternate is first in the search order for your dynamic linker. Symbolic link setup is limited within Docker container environments.
    #Check system references... ldconfig -v and ls -l can show references are symlinked and which files will be seen by the dynamic linker.
    tutorial@2020OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ sudo updatedb
    tutorial@2020OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ locate **
    tutorial@2020OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ ls -l /usr/lib/x86_64-linux-gnu/
    lrwxrwxrwx 1 root root 37 Dec 17 17:27 /usr/lib/x86_64-linux-gnu/ -> /etc/alternatives/
    tutorial@2020OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ ls -l /etc/alternatives/
    lrwxrwxrwx 1 root root 97 Dec 17 17:27 /etc/alternatives/ -> /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/
  • Quickly confirming visible OpenCL™ implementations can ensure the system is setup properly. Examples of typical utilities used to examine the system are clinfo (Linux* OS/Windows* OS), gpu_caps_viewer (Windows* OS), and GPU-Z (Windows* OS).
  • clinfo is a typical program used in the OpenCL™ developer ecosystem to check system capabilities. On Ubuntu* OS, it’s typically available via: sudo -E apt install clinfo
  • Such tools may report the detected OpenCL™ implementations, as well as device parameters associated with those implementations.
  • Use these parameters to better:
    • Dynamically allocate OpenCL™ memory objects 
    • Schedule OpenCL™ command queue items
    • Size global NDRanges and local workgroup sizes
    • Use OpenCL™ extensions available for the device, referencing the extension specifications on the Khronos OpenCL™ registry.
  • Assuming the SDK, CPU implementation, and Intel® Graphics Technology implementation have been installed... here’s what is displayed from a platform and device perspective:
    • The Intel(R) CPU Runtime for OpenCL™ Applications and Intel(R) Gen9 HD Graphics platforms are exposed:
      • The former has a CPU OpenCL™ target device implementation suitable for production usages.
      • The later has a Intel® Graphics Technology OpenCL™ target device implementation suitable for production usages.
    • Users who have only Intel® CPU Runtime for OpenCL Applications installed will only see the CPU OpenCL™ target device.
  • This configuration could change (i.e. different platforms for different devices) given future updates or different runtime configurations. Please ensure your application is sufficiently flexible and tested on supported hardware.
    Device Version reporting a NEO implementation like ‘OpenCL 2.1 NEO’ maps to the validation platform for Intel® SDK for OpenCL™ Applications 2020. Implementations without the NEO branch designation may not function with all Intel® SDK for OpenCL™ Applications 2020 features. These properties match the various property bitflags discernable with clGetDeviceInfo(...) OpenCL™ API call. This field shows the output of the OpenCL™ API call clGetDeviceInfo(…) looking for CL_DEVICE_VERSION information.
  • Production OpenCL™ programs will similarly interrogate available OpenCL™ implementations.



Two example source sets are in the .tar.gz archive attached to this article. They are entitled GPUOpenCLProjectForLinux and CPUOpenCLProjectForLinux. These sources match the two implementations installed earlier.

These example build commands demonstrate the two main build requirements: inclusion of OpenCL™ headers, and linking the (ICD loader library) runtime. The first program executes it's kernel on Intel® Graphics Technology. The second second program executes it's kernel on Intel® CPU and not for Intel® Graphics Technology. Here are command line examples to build the host side applications with Intel® System Studio 2020 redistributed headers and libraries:

g++ -L/opt/intel/sw_dev_tools/opencl/SDK/lib64 -I/opt/intel/sw_dev_tools/opencl/SDK/include GPUOpenCLProjectforLinux.cpp  utils.cpp -Wno-deprecated-declarations -lOpenCL -o GPUOpenCLProjectforLinux


g++ -L/opt/intel/sw_dev_tools/opencl/SDK/lib64 -I/opt/intel/sw_dev_tools/opencl/SDK/include CPUOpenCLProjectforLinux.cpp  utils.cpp -Wno-deprecated-declarations -lOpenCL -o CPUOpenCLProjectforLinux

The filesystem paths used in this example build reflect default Intel® System Studio 2020 initial release install locations. These may vary for the standalone SDK or custom tool installations.

Notice -Wno-deprecated-declarations is used in the build. This allows this build to proceed using deprecated function calls that can allow for OpenCL™ 1.2 API implementations in addition to the version 2.1 API implementations from Intel®. Developers may wish to maintain portability in a different manner than this example, however the example source regions relevant for this toggle are discussed in the code walkthrough section of this article.

Compiling the device side OpenCL-C program offline is a helpful practice, although not the only option. In this article we use offline compilation only for developer feedback:

/opt/intel/sw_dev_tools/opencl/bin/ioc64 –device=gpu – –output=TemplateGFX.log

The ioc64 (or ioc32) offline compiler picks an OpenCL™ implementation with which to compile with the -device=<device> toggle. The ‘gpu’ device switch sends the kernel source through Intel® Graphics Compute Runtime for OpenCL™ Driver. The kernel is compiled for Intel® Graphics Technology.

/opt/intel/sw_dev_tools/opencl/bin/ioc64 –device=cpu – –output=TemplateCPU.log


The ‘cpu’ switch sends the kernel through Intel® CPU Runtime for OpenCL™ Applications compiling the kernel for Intel® x86_64. The output log file shows build feedback. Note that the contents of cpu/ and gfx/ kernel sources are not the same. 

In this article, both CPU and Graphics kernels will also ultimately compile and execute at runtime.

See /opt/intel/sw_dev_tools/opencl/bin/ioc64 -help for more information about offline compiler capabilities. They are also pasted in the source walkthrough section of the article for convenience.


Execute the test application on Intel® Graphics Technology application and check the output:



tutorial@2020OCL-Dev-Host:~/Downloads/ocl-tools-walkthrough-20200305/gfx$ ./GPUOpenCLProjectforLinux
Selected device type is CL_DEVICE_TYPE_GPU
Number of available platforms: 2
Selected platform: Intel(R) OpenCL HD Graphics
NDRange performance counter time 1.744000 ms.


Execute the test application on Intel® CPU and check the output:



tutorial@2020OCL-Dev-Host:~/Downloads/ocl-tools-walkthrough-20200305/cpu$ ./CPUOpenCLProjectforLinux
Selected device type is CL_DEVICE_TYPE_CPU
Number of available platforms: 2
Selected platform: Intel(R) OpenCL HD Graphics
clGetDeviceIDs() returned CL_DEVICE_NOT_FOUND for platform Intel(R) OpenCL HD Graphics.
Selected platform: Intel(R) CPU Runtime for OpenCL(TM) Applications
NDRange performance counter time 0.904000 ms.

Device selection error feedback is presented in the output here to show no CPU device is available underneath the 'Intel(R) OpenCL HD Graphics' platform as expected.

Developer tips and a sample explanation are later in the walkthrough. Newer developers in particular may gain insight into how OpenCL™ applications function.


Part 2:



See the Developer Guide for debugger usage instructions. Note that debugger availability depends on specific SDK tools in use. See the Intel® oneAPI initiative for upcoming heterogeneous debugger capabilities.


Disk Provisions

Intel® System Studio 2020: OpenCL™ Tools come with standard headers and libraries from Khronos*. Key examples are the cl.h header and the OpenCL™ ICD interrogator loader library, This header and library are needed to build and execute OpenCL™ programs. The library interrogates and routes OpenCL™ API function calls to a chosen OpenCL™ implementation. In our example programs, the routines create contexts and execute work on OpenCL™ devices. The main getting started guide diagrams the modular relationship of OpenCL™ implementations and the OpenCL™ ICD interrogator library.

Alternate versions of the OpenCL™ ICD interrogator library are available from third parties. One such example is the ocl-icd package available from various system package managers. The key to using any ICD library effectively is to ensure it’s OpenCL™ capability can support desired OpenCL™ implementation capabilities. For example:

  • Intel® Graphics Compute Runtime for OpenCL™ Driver offers an OpenCL™ 2.1 implementation for Intel® Iris™ Pro Graphics 580 on the Intel® Core™ i7-6770HQ Processor. The OpenCL™ ICD interrogator library included with Intel® System Studio 2020: OpenCL™ Tools is 2.1 capable.
  • Current Intel® Atom™ Processors’ Graphics hardware may support only up to OpenCL™ version 1.2, but the OpenCL™ 2.1 ICD interrogator library should still be able to resolve the legacy specification’s features.

The OpenCL™ implementation installers put their ICD loader library references in /etc/OpenCL/vendors. At runtime the ICD loader library uses these ICD files to route OpenCL™ API calls through the intended OpenCL™ implementation. The contents of the /opt/OpenCL/vendors folder are useful for understanding the deployments available on a system. Developers may observe ICD reference text files from multiple OpenCL™ device vendors in this folder. Example:

tutorial@2020OCL-Dev-Host:~$ ls -l /etc/OpenCL/vendors/
lrwxrwxrwx 1 root root 42 Nov 30 12:28 intel64.icd -> /etc/alternatives/opencl-intel-runtime-icd
-rw-r--r-- 1 1020 1024 28 Oct 12 06:17 intel.icd
tutorial@2020OCL-Dev-Host:~$ ls -l /etc/alternatives/opencl-intel-runtime-icd
lrwxrwxrwx 1 root root 75 Nov 30 12:28 /etc/alternatives/opencl-intel-runtime-icd -> /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/etc/intel64.icd
tutorial@2020OCL-Dev-Host:~$ cat /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/etc/intel64.icd
tutorial@2020OCL-Dev-Host:~$ cat /etc/OpenCL/vendors/intel.icd

In the above filesystem:

  • intel64.icd maps to Intel® CPU Runtime for OpenCL™ Applications.
  • intel.icd maps to Intel® Graphics Compute Runtime for OpenCL™ Driver. Your deployment directory may vary.
  • For Intel® oneAPI: DPC++/SYCL support the Experimental Intel® CPU Runtime for OpenCL™ Applications with SYCL support may be referenced here.

Note that the ICD Loader library may have environment variable controls to affect implementation discovery. Refer to the release notes for the ICD Loader library implementation: see the official example.

Developer Tips

Intel® recommends keeping the two main bottlenecks in mind for heterogeneous development:

  1. Minimizing offload transfer.
    • OpenCL™ devices may operate within a different memory domains. Minimizing transfers in these cases is crucial for performance.
    • OpenCL™ devices may share a memory domain with the host processor. The API typically exposes ways to avoid copying data all together in these cases. Such a shared configuration is typical of current Intel® Graphics Technology capable processors.
  2. Target device topology. Devices differ in number of compute units and other device characteristics. So, how to best schedule work to devices may change depending on device.

For information on the using the Intel® System Studio 2020: OpenCL™ tools or Intel SDK for OpenCL™ Applications 2020 standalone… see the Developer Guide.

For more on getting the most out of OpenCL™ programming for the CPU only implementation see the OpenCL™ Developer Guide for Intel® Core™ and Intel® Xeon processors.

For a a video instructional guide on general programming considerations for Intel® Graphics Technology check the video, "What Intel® Processor Graphics GEN9 Unlocks in OpenCL*", located at Searching for 'opencl' yields other related content.

For a comprehensive look at Intel® Graphics Technology hardware, see this compute architecture overview document for Gen 9.

Note that a manual attempt to walk through how OpenCL™ kernels and generated assembly is abstracted by the scheduling employed by the OpenCL™ runtime implementation.

Intel® FPGA products guidance can be accessed through Download Center for FPGAs.


Part 3


Explanation of sample source

The test application adds two input values together in an OpenCL™ kernel for each work-item, the result is available on the host after the kernel execution.

The Kernel - Intel® Graphics Technology

The texture sampler facility is used to read two texture sampler elements. The elements are added and the result is written to the output image. The kernel uses the read_imageui(…) and write_imageui(...) function calls available from the OpenCL-C language to use the device’s texture sampler facility. In the Intel® Graphics Technology implementation case there are multiple hardware texture samplers per device. This source file is stored on disk as for consumption at run time by the OpenCL™ host program.


__kernel void Add(read_only image2d_t imageA, read_only image2d_t imageB, write_only image2d_t imageC)
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    uint A = read_imageui(imageA, sampler, (int2)(x, y)).x;
    uint B = read_imageui(imageB, sampler, (int2)(x, y)).x;

    write_imageui(imageC, (int2)(x, y), A + B);

Note: There are more compute elements than texture samplers on current Intel® Graphics Technology hardware. Texture samplers are preferable for interpolation tasks. For simple reads or writes, operating on straight forward OpenCL™ buffers may be preferable. Memory staged for image2d_t objects may be reorganized by the OpenCL™ implementation for interpolation prior to kernel launch. The reorganization cost may be amortized by hardware interpolation operations.

The Host Side Application - Intel® Graphics Technology

The host side application has all the OpenCL™ host API calls to set up a context, pick a target device, compile and execute the device kernel, and stage kernel input and output memory.

This example is not the only way to create an OpenCL™ host program, as production applications may wish to use different interrogation procedures, OpenCL™ API event facilities, or even different command queue methodology.  C++ developers may wish to consider the OpenCL™ C++ Wrapper API for use in C++ programs. This example is written in C++ and uses the OpenCL™ API and some C99 library routines to focus the walkthrough on base API usage.

Starting in main() function, notice the application picks a 1024 x 1024 array size for the image data. The sample doesn't read in and operate on a real image. Random data will be generated to fill the image data. However, the sizing for the input is critical later on to control the number of OpenCL™ work-items that are launched for the kernel.

cl_uint arrayWidth  = 1024;
cl_uint arrayHeight = 1024;

The FindOpenCLPlatform(...) routine finds an Intel® platform that has a device type that we specify, in this case we're looking for a platform with a GPU device. The program uses the clGetPlatformIDs(...) API call to get the number of OpenCL™ platforms and an array of OpenCL™ platform id's.

    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS != err)
        LogError("Error: clGetplatform_ids() to get num platforms returned %s.\n", TranslateOpenCLError(err));
        return NULL;
    LogInfo("Number of available platforms: %u\n", numPlatforms);

    if (0 == numPlatforms)
        LogError("Error: No platforms found!\n");
        return NULL;

    std::vector<cl_platform_id> platforms(numPlatforms);

    // Now, obtains a list of numPlatforms OpenCL platforms available
    // The list of platforms available will be returned in platforms
    err = clGetPlatformIDs(numPlatforms, &platforms[0], NULL);
    if (CL_SUCCESS != err)
        LogError("Error: clGetplatform_ids() to get platforms returned %s.\n", TranslateOpenCLError(err));
        return NULL;

Each platform's devices are checked to see if the platform contains our preferred device type using clGetDeviceIDs(...) API call: 

    // Check if one of the available platform matches the preferred requirements
    for (cl_uint i = 0; i < numPlatforms; i++)
        bool match = true;
        cl_uint numDevices = 0;

        // If the preferredPlatform is not NULL then check if platforms[i] is the required one
        // Otherwise, continue the check with platforms[i]
        if ((NULL != preferredPlatform) && (strlen(preferredPlatform) > 0))
            // In case we're looking for a specific platform
            match = CheckPreferredPlatformMatch(platforms[i], preferredPlatform, platformStr);

        // match is true if the platform's name is the required one or don't care (NULL)
        if (match)
            // Obtains the number of deviceType devices available on platform
            // When the function failed we expect numDevices to be zero.
            // We ignore the function return value since a non-zero error code
            // could happen if this platform doesn't support the specified device type.
            err = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &numDevices);
            if (CL_SUCCESS != err)
                LogError("clGetDeviceIDs() returned %s for platform %s.\n", TranslateOpenCLError(err), platformStr.c_str());

            if (0 != numDevices)
                // There is at list one device that answer the requirements
                return platforms[i];

A new OpenCL™ context is created on the matching platform. The clCreateContextFromType(...) OpenCL™ API call creates the context associated with a device of our selected type. The clGetContextInfo(...) OpenCL™ API call allows the program to recover the selected device's id:

    // Create context with device of specified type.
    // Required device type is passed as function argument deviceType.
    // So you may use this function to create context for any CPU or GPU OpenCL device.
    // The creation is synchronized (pfn_notify is NULL) and NULL user_data
    cl_context_properties contextProperties[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)platformId, 0};
    ocl->context = clCreateContextFromType(contextProperties, deviceType, NULL, NULL, &err);
    if ((CL_SUCCESS != err) || (NULL == ocl->context))
        LogError("Couldn't create a context, clCreateContextFromType() returned '%s'.\n", TranslateOpenCLError(err));
        return err;

    // Query for OpenCL device which was used for context creation
    err = clGetContextInfo(ocl->context, CL_CONTEXT_DEVICES, sizeof(cl_device_id), &ocl->device, NULL);
    if (CL_SUCCESS != err)
        LogError("Error: clGetContextInfo() to get list of devices returned %s.\n", TranslateOpenCLError(err));
        return err;

The GetPlatformAndDeviceVersion(...) function interrogates the given platform and device to understand OpenCL™ feature capability. The function allows for some flexibility in using OpenCL™ 1.2, OpenCL™ 2.0, or OpenCL™ 2.1 API calls as appropriate for target devices. The function also informs the program if OpenCL-C kernel language 2.0 features are supported.

    // Read the platform's version string
    // The read value returned in platformVersion
    err = clGetPlatformInfo(platformId, CL_PLATFORM_VERSION, stringLength, &platformVersion[0], NULL);
    if (CL_SUCCESS != err)
        LogError("Error: clGetplatform_ids() to get CL_PLATFORM_VERSION returned %s.\n", TranslateOpenCLError(err));
        return err;

    if (strstr(&platformVersion[0], "2.0") != NULL)
        ocl->platformVersion = OPENCL_VERSION_2_0;

    if (strstr(&platformVersion[0], "2.1") != NULL)
        ocl->platformVersion = OPENCL_VERSION_2_1;


    // Read the device's version string
    // The read value returned in deviceVersion
    err = clGetDeviceInfo(ocl->device, CL_DEVICE_VERSION, stringLength, &deviceVersion[0], NULL);
    if (CL_SUCCESS != err)
        LogError("Error: clGetDeviceInfo() to get CL_DEVICE_VERSION returned %s.\n", TranslateOpenCLError(err));
        return err;

    if (strstr(&deviceVersion[0], "2.0") != NULL)
        ocl->deviceVersion = OPENCL_VERSION_2_0;

    if (strstr(&deviceVersion[0], "2.1") != NULL)
        ocl->deviceVersion = OPENCL_VERSION_2_1;


    // Read the device's OpenCL C version string
    // The read value returned in compilerVersion
    err = clGetDeviceInfo(ocl->device, CL_DEVICE_OPENCL_C_VERSION, stringLength, &compilerVersion[0], NULL);
    if (CL_SUCCESS != err)
        LogError("Error: clGetDeviceInfo() to get CL_DEVICE_OPENCL_C_VERSION returned %s.\n", TranslateOpenCLError(err));
        return err;


Back in the SetupOpenCL(...) funtion, the program creates a command queue with the device context. Command queue creation changed between OpenCL™ 1.2 API and OpenCL™ 2.0 API, so there are helper macros to pick the correct API calls. A mismatch in command queue creation is a frequent build breaker for new developers. OpenCL™ API code developers review in the wild may not be explicitly labeled for it's OpenCL™ revision. The build example in the article uses -Wno-deprecated-declarations to easily leverage the older style command queue OpenCL™ API call:

    // Create command queue.
    // OpenCL kernels are enqueued for execution to a particular device through special objects called command queues.
    // Command queue guarantees some ordering between calls and other OpenCL commands.
    // Here you create a simple in-order OpenCL command queue that doesn't allow execution of two kernels in parallel on a target device.
#if defined(CL_VERSION_2_0) || defined(CL_VERSION_2_1)
    if (OPENCL_VERSION_2_0 == ocl->deviceVersion || OPENCL_VERSION_2_1 == ocl->deviceVersion )
        const cl_command_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
        ocl->commandQueue = clCreateCommandQueueWithProperties(ocl->context, ocl->device, properties, &err);
    else {
        // default behavior: OpenCL 1.2
        cl_command_queue_properties properties = CL_QUEUE_PROFILING_ENABLE;
        ocl->commandQueue = clCreateCommandQueue(ocl->context, ocl->device, properties, &err);
    // default behavior: OpenCL 1.2
    cl_command_queue_properties properties = CL_QUEUE_PROFILING_ENABLE;
    ocl->commandQueue = clCreateCommandQueue(ocl->context, ocl->device, properties, &err);

Note: const cl_command_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0}; The CL_QUEUE_PROFILING_ENABLE flag is useful for debugging. It can help provide useful information in concert with OpenCL™ events. Refer to the Khronos* documentation on using this feature for more information: clCreateCommandQueueWithProperties(...) clGetEventProfilingInfo(...). In production circumstances, enabling command queue profiling is often undesirable due to queue serialization.

Back in our main(...) function our input and output buffers are created. The aligned allocations assume 4K page size. This enables us to get the zero copy behavior to eliminate unnecessary copying. We take advantage of the unified host and device memory domain. See the Intel® article on zero copy for OpenCL for more information:

    // allocate working buffers. 
    // the buffer should be aligned with 4K page and size should fit 64-byte cached line
    cl_uint optimizedSize = ((sizeof(cl_int) * arrayWidth * arrayHeight - 1)/64 + 1) * 64;
    cl_int* inputA;
    cl_int* inputB;
    cl_int* outputC;
    if(posix_memalign((void**)&inputA, 4096, optimizedSize) != 0)
        LogError("Error: posix_memalign failed to allocate buffer.\n");
        return -1;
    if(posix_memalign((void**)&inputB, 4096, optimizedSize) != 0)
        LogError("Error: posix_memalign failed to allocate buffer.\n");
        return -1;
    if(posix_memalign((void**)&outputC, 4096, optimizedSize) != 0)
        LogError("Error: posix_memalign failed to allocate buffer.\n");
        return -1;


In int CreateBufferArguments(ocl_args_d_t *ocl, cl_int* inputA, cl_int* inputB, cl_int* outputC, cl_uint arrayWidth, cl_uint arrayHeight) image objects are created. A description data structure is configured to denote the image object parameters. An image object is created for the 2 inputs and 1 output. The results are 1-channel images with bitdepth of 32-bits. Data is stored with unsigned integer values. CL_MEM_USE_HOST_PTR allows buffers to exhibit zero copy behavior:

    // Define the image data-type and order -
    // one channel (R) with unit values
    format.image_channel_data_type = CL_UNSIGNED_INT32;
    format.image_channel_order     = CL_R;

    // Define the image properties (descriptor)
    desc.image_type        = CL_MEM_OBJECT_IMAGE2D;
    desc.image_width       = arrayWidth;
    desc.image_height      = arrayHeight;
    desc.image_depth       = 0;
    desc.image_array_size  = 1;
    desc.image_row_pitch   = 0;
    desc.image_slice_pitch = 0;
    desc.num_mip_levels    = 0;
    desc.num_samples       = 0;
#if defined(CL_VERSION_2_0) || defined(CL_VERSION_2_1)
    desc.mem_object        = NULL;
    desc.buffer            = NULL;

    // Create first image based on host memory inputA
    ocl->srcA = clCreateImage(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, &format, &desc, inputA, &err);
    if (CL_SUCCESS != err)
        LogError("Error: clCreateImage for srcA returned %s\n", TranslateOpenCLError(err));
        return err;

    // Create second image based on host memory inputB
    ocl->srcB = clCreateImage(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, &format, &desc, inputB, &err);
    if (CL_SUCCESS != err)


Next, the program creates the kernel program object. The kernel program "" is read in from source on disk. The source is then associated with the program object and context.

    // Upload the OpenCL C source code from the input file to source
    // The size of the C program is returned in sourceSize
    char* source = NULL;
    size_t src_size = 0;
    err = ReadSourceFromFile("", &source, &src_size);
    if (CL_SUCCESS != err)
        LogError("Error: ReadSourceFromFile returned %s.\n", TranslateOpenCLError(err));
        goto Finish;

    // And now after you obtained a regular C string call clCreateProgramWithSource to create OpenCL program object.
    ocl->program = clCreateProgramWithSource(ocl->context, 1, (const char**)&source, &src_size, &err);
    if (CL_SUCCESS != err)
        LogError("Error: clCreateProgramWithSource returned %s.\n", TranslateOpenCLError(err));
        goto Finish;

The program is built for the target context and device with clBuildProgram(...). Build feedback is recorded and is often very useful for scheduling and sizing kernel execution. Build feedback is also useful for debugging. The empty string is the argument where the OpenCL-C revision for the kernel program source or other preprocessor variables would be specified. Example "-cl-std=CL2.0". The OpenCL-C revision may or may not be the same as the OpenCL™ API revision. See the Khronos* official reference for more information:

    // Build the program
    // During creation a program is not built. You need to explicitly call build function.
    // Here you just use create-build sequence,
    // but there are also other possibilities when program consist of several parts,
    // some of which are libraries, and you may want to consider using clCompileProgram and clLinkProgram as
    // alternatives.
    err = clBuildProgram(ocl->program, 1, &ocl->device, "", NULL, NULL);
    if (CL_SUCCESS != err)
        LogError("Error: clBuildProgram() for source program returned %s.\n", TranslateOpenCLError(err));

        // In case of error print the build log to the standard output
        // First check the size of the log
        // Then allocate the memory and obtain the log from the program
        if (err == CL_BUILD_PROGRAM_FAILURE)
            size_t log_size = 0;
            clGetProgramBuildInfo(ocl->program, ocl->device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);

            std::vector<char> build_log(log_size);
            clGetProgramBuildInfo(ocl->program, ocl->device, CL_PROGRAM_BUILD_LOG, log_size, &build_log[0], NULL);

            LogError("Error happened during the build of OpenCL program.\nBuild log:%s", &build_log[0]);

A kernel object is created. This kernel object is associated with the desired function, Add(...), in the kernel source file via character string "Add" in the second argument:

    // Program consists of kernels.
    // Each kernel can be called (enqueued) from the host part of OpenCL application.
    // To call the kernel, you need to create it from existing program.
    ocl.kernel = clCreateKernel(ocl.program, "Add", &err);
    if (CL_SUCCESS != err)
        LogError("Error: clCreateKernel returned %s\n", TranslateOpenCLError(err));
        return -1;

Kernel arguments are bound. This binding maps image objects to parameters for the Add(...) kernel function. Now, the program can feed the proper host side data to the intended kernel parameters:

cl_uint SetKernelArguments(ocl_args_d_t *ocl)
    cl_int err = CL_SUCCESS;

    err  =  clSetKernelArg(ocl->kernel, 0, sizeof(cl_mem), (void *)&ocl->srcA);
    if (CL_SUCCESS != err)
        LogError("error: Failed to set argument srcA, returned %s\n", TranslateOpenCLError(err));
        return err;

    err  = clSetKernelArg(ocl->kernel, 1, sizeof(cl_mem), (void *)&ocl->srcB);
    if (CL_SUCCESS != err)
        LogError("Error: Failed to set argument srcB, returned %s\n", TranslateOpenCLError(err));
        return err;

    err  = clSetKernelArg(ocl->kernel, 2, sizeof(cl_mem), (void *)&ocl->dstMem);
    if (CL_SUCCESS != err)
        LogError("Error: Failed to set argument dstMem, returned %s\n", TranslateOpenCLError(err));
        return err;

    return err;

The built kernel program is now associated with the context for our Intel® Graphics Technology device and memory arguments are bound. The kernel program is enqueued on the command queue for execution. The enqueue specifies the 2 dimensional hard coded size assigned at the beginning of the example. clFinish(...) OpenCL™ API call is used to block host program execution until the enqueued kernel has finished executing. The main performance timer measuring kernel performance is set around these operations:

    // Define global iteration space for clEnqueueNDRangeKernel.
    size_t globalWorkSize[2] = {width, height};

    // execute kernel
    err = clEnqueueNDRangeKernel(ocl->commandQueue, ocl->kernel, 2, NULL, globalWorkSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        LogError("Error: Failed to run kernel, return %s\n", TranslateOpenCLError(err));
        return err;

    // Wait until the queued kernel is completed by the device
    err = clFinish(ocl->commandQueue);
    if (CL_SUCCESS != err)
        LogError("Error: clFinish return %s\n", TranslateOpenCLError(err));
        return err;

Next, the output is mapped back to a host pointer for result verification. This mapping operation avoids a copy and exploits the shared memory domain between the host and Intel® Graphics Technology hardware on this platform.

    cl_int *resultPtr = (cl_int *)clEnqueueMapImage(ocl->commandQueue, ocl->dstMem, true, CL_MAP_READ, origin, region, &image_row_pitch, &image_slice_pitch, 0, NULL, NULL, &err);

    if (CL_SUCCESS != err)
        LogError("Error: clEnqueueMapBuffer returned %s\n", TranslateOpenCLError(err));
        return false;

    // Call clFinish to guarantee that output region is updated
    err = clFinish(ocl->commandQueue);
    if (CL_SUCCESS != err)
        LogError("Error: clFinish returned %s\n", TranslateOpenCLError(err));

The result is also computed on the host and compared to the result from the OpenCL™ device for verification. The source code that follows in GPUProjectforLinux.cpp is for tear down and clean-up purposes.

Intel® CPU

The Intel® CPU version of the source is mostly similar. Below we look at the key differences.

The Kernel - Intel® CPU

The CPU version does not show texture sampler usage in the sample. Hardware texture sampler access is only offered through the Intel® Graphics Technology implementation. The CPU runtime implements texture sampling functionality in software. The CPU the kernel in this example for the CPU target accesses kernel data through raw pointers. The data index desired for this work-item is calculated by a basic 2-dimensional to 1-dimensional transformation.

__kernel void Add(__global int* pA, __global int* pB, __global int* pC)
    const int x     = get_global_id(0);
    const int y     = get_global_id(1);
    const int width = get_global_size(0);

    const int id = y * width + x;

    pC[id] = pA[id] + pB[id];

The Host Side Application - Intel® CPU

The sample uses OpenCL™ basic buffer objects and not OpenCL™ images for preparing kernel data:

    // Create new OpenCL buffer objects
    // As these buffer are used only for read by the kernel, you are recommended to create it with flag CL_MEM_READ_ONLY.
    // Always set minimal read/write flags for buffers, it may lead to better performance because it allows runtime
    // to better organize data copying.
    // You use CL_MEM_COPY_HOST_PTR here, because the buffers should be populated with bytes at inputA and inputB.

    ocl->srcA = clCreateBuffer(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, inputA, &err);
    if (CL_SUCCESS != err)
        LogError("Error: clCreateBuffer for srcA returned %s\n", TranslateOpenCLError(err));
        return err;

    ocl->srcB = clCreateBuffer(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, inputB, &err);
    if (CL_SUCCESS != err)
        LogError("Error: clCreateBuffer for srcB returned %s\n", TranslateOpenCLError(err));
        return err;

    // If the output buffer is created directly on top of output buffer using CL_MEM_USE_HOST_PTR,
    // then, depending on the OpenCL runtime implementation and hardware capabilities, 
    // it may save you not necessary data copying.
    // As it is known that output buffer will be write only, you explicitly declare it using CL_MEM_WRITE_ONLY.
    ocl->dstMem = clCreateBuffer(ocl->context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, outputC, &err);
    if (CL_SUCCESS != err)
        LogError("Error: clCreateBuffer for dstMem returned %s\n", TranslateOpenCLError(err));
        return err;

The validation step maps the OpenCL™ result buffer to a host program pointer at the end of the program. clEnqueueMapBuffer(...) OpenCL™ API call is used:

    // Enqueue a command to map the buffer object (ocl->dstMem) into the host address space and returns a pointer to it
    // The map operation is blocking
    cl_int *resultPtr = (cl_int *)clEnqueueMapBuffer(ocl->commandQueue, ocl->dstMem, true, CL_MAP_READ, 0, sizeof(cl_uint) * width * height, 0, NULL, NULL, &err);

    if (CL_SUCCESS != err)
        LogError("Error: clEnqueueMapBuffer returned %s\n", TranslateOpenCLError(err));
        return false;

    // Call clFinish to guarantee that output region is updated
    err = clFinish(ocl->commandQueue);
    if (CL_SUCCESS != err)
        LogError("Error: clFinish returned %s\n", TranslateOpenCLError(err));


On error handing

Error handling, even with sandbox OpenCL™ development, triages many issues before they start. See the source examples for the TranslateOpenCLError(...) function definition to observe the various detailed error codes. These error codes are set by the OpenCL™ standard and are defined in the Khronos* standard headers. Always handle OpenCL™ API return values. Even for sandbox programs basic error handling is advised.


Offline compilation

The ioc64 offline compiler front end has a few different output format options for compiled kernel intermediates. These can be stored on disk and consumed by the OpenCL™ API later. In some cases linked to other compiled kernel objects for future builds. SPIR and SPIR-V provide useful targeting for OpenCL™ as well as non OpenCL™ kernels. SPIR provides a necessary level of kernel obfuscation for developers who wish not to distribute kernels in source text format. This avoids distributions as a text file (.cl) or constant string within the host binary. For developer reference, the -help usage output of ioc64 from 2020 initial release is provided here:

tutorial@2020OCL-Dev-Host:~$ ioc64 -help
Intel(R) SDK for OpenCL(TM) - Offline Compiler, version
Copyright (C) 2019 Intel Corporation.  All rights reserved.

Usage: ioc64 <ARGUMENTS>

  -cmd=<command>               Command to be performed:
                               'build' - create executable ELF
                                  intermediate representation (IR) from source
                                  code (default, if none specified)
                               'compile' - create compiled object ELF IR from
                                  source code
                               'link' - create executable ELF IR or library
                                  from object ELF IR and libraries
                               NOTE: use -ir option to save the binary file to
                                  the system
  -input=<input_file_path>     Build the OpenCL(TM) code given in
                               <input_file_path> (use with 'build' or 'compile' 
  -binary=<binary_files_paths> Build/link binary SPIR-V or ELF IR files, comma
                               separated if more than one (use with 'build' or
                               'link' commands)
                               NOTE: use -ir option to save the binary file to
                                  the system
  -version                     Show compiler version information
  -help                        Show available commands
  -device=<device_type>        Set target device type:
                               'cpu' for Intel(R) CPU device
                               'gpu' for Intel(R) Graphics device (default)
                               'fpga_fast_emu' for Intel(R) FPGA Emulation
                                  Platform device
  -simd=<instruction_set_arch> Set target instruction set architecture (use
                               with 'cpu' or 'fpga_fast_emu' devices only):
                               'sse42' for Intel(R) Streaming SIMD
                                  Extensions 4.2
                               'avx' for Intel(R) Advanced Vector Extensions
                               'avx2' for Intel(R) Advanced Vector Extensions 2
                               'skx' for Intel(R) Advanced Vector Extensions 512
  -output[=<output_file_path>] Write the build log to <output_file_path>
  -asm[=<file_path>]           Generate assembly code (use with 'gpu' device
  -ir[=<file_path>]            Generate ELF IR binary file
  -llvm[=<file_path>]          Generate LLVM IR binary file (use with 'gpu' device only)
  -spirv32[=<file_path>]       Generate SPIR-V (32-bit) binary file
  -spirv64[=<file_path>]       Generate SPIR-V (64-bit) binary file
  -txt-spirv32[=<file_path>]   Generate SPIR-V (32-bit) code file
  -txt-spirv64[=<file_path>]   Generate SPIR-V (64-bit) code file
  -bo="<build_options>"        Add OpenCL(TM) build options

A build log may report the x86_64 instruction set architecture extensions used for the kernel. As of December 2018 and Intel® CPU Runtime for OpenCL™ Applications 18.1, avx512 generated kernels can be generated and executed on capable Intel® processors. See the -simd=<instruction_set_arch> toggle for more information.


For discussion on OpenCL™ developer components in Intel® System Studio 2020 and the Intel® SDK for OpenCL™ Applications 2020, join the community at the Intel® forums.

For Intel® oneAPI DPC++/SYCL developers, if you wish to develop SYCL and OpenCL interoperable programs review the article on interoperability.


For more on CPU and Intel® Graphics Technology performance analysis for heterogeneous OpenCL™ programs, see Intel® VTune™ Profiler, available as a standalone or as an Intel® System Studio 2020 component, and a component within the Intel® oneAPI toolkits.

The OpenCL™ intercept layer instrumentation tool is available via Github* repository. It can track API calls and provide performance metrics.

*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Attachment Size
ocl-tools-walkthrough-20200305.tar.gz 33.2 KB

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at