Intel® CPU Runtime for OpenCL™ Applications Guide

ID 859509
Updated 7/2/2025
Version Latest
Public

Intel® CPU Runtime for OpenCL™ Applications is an OpenCL™ runtime and compiler that runs on the Intel® Core™ processor family or the Intel® Xeon® processor family. It is compatible with the OpenCL 3.0 standard specification and most extensions.

Hardware Requirements

The Intel® CPU Runtime for OpenCL™ Applications provides CPU device support on the following processors:

  • Intel® Core™ processor family with Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) support or higher
  • Intel® Xeon® E3, E5, and E7 processor family with Intel® SSE4.2 support or higher
  • Intel® Xeon® Scalable processor family with Intel® Xeon® Bronze, Silver, Gold, or Platinum with Intel® SSE4.2 support or higher

The Intel® CPU Runtime for OpenCL™ Applications provides optimizations for processors that support the following instruction sets:

  • Intel® Advanced Vector Extensions (Intel® AVX)
  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)
  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Intel® Advanced Vector Extensions 512 Byte and Word instructions (AVX-512BW)
  • Intel® Advanced Vector Extensions 512 Conflict Detection instructions (AVX-512CD)
  • Intel® Advanced Vector Extensions 512 Doubleword and Quadword instructions (AVX-512DQ)
  • Intel® Advanced Vector Extensions 512 Foundation instructions (Intel® AVX-512F)
  • Intel® Advanced Vector Extensions 512 Vector Length Extensions (AVX-512VL)
  • Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX-512 VNNI)
  • Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2)

Operating System Requirements

  • Fedora* Linux* 39 or Fedora Linux 40
  • Microsoft Windows* 10 (IA-32 or Intel® 64)
  • Microsoft Windows 11 (IA-32 or Intel® 64)
  • Microsoft Windows Server 2022 (IA-32 or Intel® 64)
  • Microsoft Windows Server 2019 (IA-32 or Intel® 64)
  • Red Hat* Enterprise Linux 8.x, or 9.x
  • SUSE Linux Enterprise Server* 15 SP4, SP5, or SP6
  • Ubuntu* 22.04 LTS or 24.04

Configure Intel® CPU Runtime for OpenCL™ Applications

Linux

On Linux, the device driver is the libintelocl.so shared library in the installation folder. For the installable client driver (ICD) loader to load it, the library must be registered through either the OCL_ICD_FILENAMES environment variable or an .icd file.

The installation package performs this configuration.

Windows

On Windows, the device driver is the intelocl64.dll shared library in the installation folder. For the ICD loader to load it, the library must be registered through either the OCL_ICD_FILENAMES environment variable or the Windows registry key “Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors”.

NOTE: In the Khronos OpenCL ICD Loader* implementation, the OCL_ICD_FILENAMES environment variable is ignored when the process runs with elevated privileges, for example, as an administrator user.
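
As a rough illustration of the environment-variable route, the following Python sketch builds a child-process environment that points the ICD loader at the runtime library. The helper name env_with_icd and the install path are hypothetical, and whether the loader honors the variable depends on the loader build and on the privilege restriction in the note above.

```python
import os


def env_with_icd(library_path, base_env=None):
    """Return a copy of the environment with OCL_ICD_FILENAMES set.

    library_path is the full path to the device driver library
    (libintelocl.so on Linux, intelocl64.dll on Windows).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OCL_ICD_FILENAMES"] = library_path
    return env


# Hypothetical install location; substitute the real path on your system.
example = env_with_icd("/opt/intel/opencl/libintelocl.so", base_env={})
print(example)
```

The returned dictionary can be passed as the env argument of subprocess.run when launching an OpenCL application, which leaves the parent process environment unchanged.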

Supported OpenCL™ Extensions

The following extensions are supported by Intel® CPU Runtime for OpenCL™ Applications:

  • cl_ext_float_atomics
  • cl_intel_concurrent_dispatch
  • cl_intel_device_attribute_query
  • cl_intel_device_partition_by_names
  • cl_intel_devicelib_assert
  • cl_intel_exec_by_local_thread
  • cl_intel_required_subgroup_size
  • cl_intel_spirv_subgroups
  • cl_intel_subgroup_local_block_io
  • cl_intel_subgroups
  • cl_intel_subgroups_char
  • cl_intel_subgroups_long
  • cl_intel_subgroups_short
  • cl_intel_unified_shared_memory
  • cl_intel_vec_len_hint
  • cl_khr_3d_image_writes
  • cl_khr_byte_addressable_store
  • cl_khr_command_buffer
  • cl_khr_depth_images
  • cl_khr_device_uuid
  • cl_khr_extended_bit_ops
  • cl_khr_fp16
  • cl_khr_fp64
  • cl_khr_global_int32_base_atomics
  • cl_khr_global_int32_extended_atomics
  • cl_khr_icd
  • cl_khr_il_program
  • cl_khr_image2d_from_buffer
  • cl_khr_int64_base_atomics
  • cl_khr_int64_extended_atomics
  • cl_khr_integer_dot_product
  • cl_khr_local_int32_base_atomics
  • cl_khr_local_int32_extended_atomics
  • cl_khr_spir
  • cl_khr_spirv_linkonce_odr
  • cl_khr_subgroup_ballot
  • cl_khr_subgroup_clustered_reduce
  • cl_khr_subgroup_extended_types
  • cl_khr_subgroup_non_uniform_arithmetic
  • cl_khr_subgroup_non_uniform_vote
  • cl_khr_subgroup_shuffle
  • cl_khr_subgroup_shuffle_relative
  • cl_khr_suggested_local_work_size

Configurable Runtime Environment Variables

Intel® CPU Runtime for OpenCL™ Applications provides a set of configurable environment variables that you can use to configure your device compiler and runtime options. The following lists are grouped into three categories:

  • Compiler dump and debug options
  • Compiler optimization options
  • Device property options

Compiler Dump and Debug Options

  • CL_CONFIG_DUMP_BIN
    • Description: Controls if the object binary of an OpenCL™ program is dumped. The destination is the current directory.
    • Value Type: Bool
    • Optional Values: 0 (default) or 1
  • CL_CONFIG_DUMP_DISASSEMBLY
    • Description: Controls if the disassembled assembly of the OpenCL™ program is dumped. The destination is the current directory.
    • Value Type: Bool
    • Optional Values: 0 (default) or 1
  • CL_CONFIG_DUMP_ASM
    • Description: Controls if there is a dump of ASM code. The destination is the current directory.
    • Value Type: Bool
    • Optional Values: 0 (default) or 1
  • CL_CONFIG_DUMP_FILE_NAME_PREFIX
    • Description: Resets the file name of the dump file if CL_CONFIG_DUMP_BIN or CL_CONFIG_DUMP_DISASSEMBLY is enabled.
    • Value Type: String
    • Optional Values: String. The default is empty.
  • CL_CONFIG_USER_LOGGER
    • Description: Dumps all detailed information for each OpenCL™ API execution, including thread_id, start timeslot, duration, parameter information, return value, etc. The pattern is CL_CONFIG_USER_LOGGER=type,destination.
      • Available types:
        • E: Error information
        • I: API execution information
      • Available destinations:
        • stderr
        • filename
    • Value Type: n/a
    • Optional Values: n/a
  • CL_CONFIG_LLVM_OPTIONS
    • Description: A space-separated list of LLVM options that are passed to the compiler backend. These options must be defined with llvm::cl::opt in the LLVM component libraries. Example: export CL_CONFIG_LLVM_OPTIONS="-verify-each-pass"
    • Value Type: n/a
    • Optional Values: n/a
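
To make the dump and logger settings above concrete, here is a minimal Python sketch that assembles them into a dictionary of environment variables. The helper name debug_env and its parameters are illustrative only. The logger value follows the type,destination pattern described for CL_CONFIG_USER_LOGGER, and the sketch assumes a single type letter per value.

```python
def debug_env(dump_prefix="", logger_type="I", logger_dest="stderr"):
    """Build the debug-related variables as a dict of strings."""
    if logger_type not in ("E", "I"):
        raise ValueError("logger type must be 'E' (errors) or 'I' (API info)")
    env = {
        "CL_CONFIG_DUMP_BIN": "1",          # dump the object binary
        "CL_CONFIG_DUMP_DISASSEMBLY": "1",  # dump the disassembly
        # Pattern from the description: type,destination
        "CL_CONFIG_USER_LOGGER": f"{logger_type},{logger_dest}",
    }
    if dump_prefix:
        # Only meaningful when one of the dump options is enabled.
        env["CL_CONFIG_DUMP_FILE_NAME_PREFIX"] = dump_prefix
    return env


print(debug_env(dump_prefix="mykernel", logger_dest="api.log"))
```

Merging the result into a copy of os.environ before starting the application keeps the debug settings scoped to that one run.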

Compiler Optimization Options

  • CL_CONFIG_USE_VECTORIZER
    • Description: Determines if the auto-vectorization module is active or not.
    • Value Type: Bool
    • Optional Values: 1 (default) or 0
  • CL_CONFIG_CPU_VECTORIZER_MODE
    • Description: Controls the vectorization mode of the Intel® CPU Runtime for OpenCL™ Applications compiler. This only applies when CL_CONFIG_USE_VECTORIZER = True. Some kernels do not support vectorization from a functional aspect. Such kernels cannot be vectorized in any mode. The possible values are:
      • 0: (default) Autonomous vectorization. The compiler makes heuristic decisions if it should vectorize each kernel, and if so, to what vector length.
      • 1: No compiler vectorization. Explicit kernel vector data types are left intact. It is the same as if CL_CONFIG_USE_VECTORIZER = False.
      • [4, 8, 16, 32, 64]: Disable heuristic and force vectorization to the specified length.
    • Value Type: Integer
    • Optional Values: 0 (default), 1, 4, 8, 16, 32, 64
  • CL_CONFIG_USE_FAST_RELAXED_MATH
    • Description: Determines if kernels are built with the -cl-fast-relaxed-math option.
    • Value Type: Bool
    • Optional Values: 1 (default) or 0
  • CL_CONFIG_CPU_RT_LOOP_UNROLL_FACTOR
    • Description: Controls loop unrolling of loops with a non-constant trip count. Out-of-bounds values are clamped to [1, 16]. Examples: 1, 2, 3, 16.
      • 1: No runtime unrolling (default)
      • [2, 16]: Unrolling factor
    • Value Type: Integer
    • Optional Values: From 1 (default) to 16
  • CL_CONFIG_CPU_TBB_NUM_WORKERS
    • Description: Controls the number of Intel® oneAPI Threading Building Blocks (oneTBB) worker threads. Work-groups can be executed in parallel by CL_CONFIG_CPU_TBB_NUM_WORKERS threads. Out-of-bounds values are clamped to [1, MAX_NUM], where MAX_NUM is the number of logical CPU cores. The default value is MAX_NUM. If CL_CONFIG_CPU_TBB_NUM_WORKERS is 1, work-groups are executed sequentially by a single thread.
    • Value Type: Integer
    • Optional Values: From 1 to MAX_NUM (default)
  • CL_CONFIG_ENABLE_PARALLEL_COPY
    • Description: Controls whether parallel memory copy/fill is enabled in clEnqueueReadBuffer, clEnqueueWriteBuffer, clEnqueueCopyBuffer, clEnqueueSVMMemcpy, clEnqueueMemcpyINTEL, clEnqueueSVMMemFill, clEnqueueMemsetINTEL, and clEnqueueMemFillINTEL operations.
    • Value Type: Bool
    • Optional Values: 1 (default) or 0
  • CL_CONFIG_CPU_EXPENSIVE_MEM_OPT
    • Description: A bitmap indicating which expensive memory optimizations are enabled. These optimizations may increase just-in-time (JIT) compilation time, but give some performance benefit. Examples: 0 (default), 0x0001. Available bits: 1: OpenCL™ address space alias analysis.
    • Value Type: Bitmap
    • Optional Values: 0
  • CL_CONFIG_CPU_STREAMING_ALWAYS
    • Description: Determines if non-temporal instructions are used or not.
    • Value Type: Bool
    • Optional Values: 0
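
The clamping rules described for CL_CONFIG_CPU_RT_LOOP_UNROLL_FACTOR and CL_CONFIG_CPU_TBB_NUM_WORKERS can be sketched in a few lines of Python. This only models the documented ranges and is not the runtime's own code. The function names are illustrative, and using os.cpu_count() as a stand-in for MAX_NUM is an assumption.

```python
import os


def clamp(value, low, high):
    """Clamp value into the closed range [low, high]."""
    return max(low, min(high, value))


def effective_unroll_factor(requested):
    # Documented range for CL_CONFIG_CPU_RT_LOOP_UNROLL_FACTOR is [1, 16];
    # 1 means no runtime unrolling.
    return clamp(requested, 1, 16)


def effective_tbb_workers(requested, max_num=None):
    # Documented range for CL_CONFIG_CPU_TBB_NUM_WORKERS is [1, MAX_NUM].
    # Assumption: os.cpu_count() approximates the logical core count.
    if max_num is None:
        max_num = os.cpu_count() or 1
    return clamp(requested, 1, max_num)


print(effective_unroll_factor(32))       # request above the range
print(effective_tbb_workers(0, max_num=8))  # request below the range
```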

Device Property Options

  • CL_CONFIG_CPU_SUB_GROUP_CONSTRUCTION
    • Description: Controls how sub-groups are constructed in the NDRange space. Available values:
      • 0 (default): Sub-groups are constructed in the 0th dimension.
        • Example: If local_sizes = (8, 4, 2) and sub_group_size = 16, you get (local_sizes[1] * local_sizes[2]) = 8 sub-groups and the size of each sub-group is 16 with only 8 items active.
      • 1: Sub-groups are constructed in the 1st dimension.
        • Example: If local_sizes = (8, 4, 2) and sub_group_size = 16, you get (local_sizes[0] * local_sizes[2]) = 16 sub-groups and the size of each sub-group is 16 with only 4 items active.
      • 2: Sub-groups are constructed in the 2nd dimension.
        • Example: If local_sizes = (8, 4, 2) and sub_group_size = 16, you get (local_sizes[0] * local_sizes[1]) = 32 sub-groups and the size of each sub-group is 16 with only 2 items active.
      • -1: Sub-groups are constructed in the linearized space.
        • Example: If local_sizes = (8, 4, 2) and sub_group_size = 16, there are 64 items in the linearized space, so you get (local_sizes[0] * local_sizes[1] * local_sizes[2] / sub_group_size) = 4 sub-groups and the size of each sub-group is 16 with all items active.
    • Value Type: Integer
    • Optional Values: 0 (default), 1, 2, -1
  • CL_CONFIG_CPU_FORCE_MAX_WORK_GROUP_SIZE
    • Description: Forces the CPU to work with the specified maximum work-group size. Out-of-bounds values are clamped to [8192, 67108864].
    • Value Type: Integer
    • Optional Values: [8192 (default), 67108864]
  • CL_CONFIG_CPU_FORCE_WORK_GROUP_SIZE
    • Description: Forces the CPU to work with the specified work-group size. For example: export CL_CONFIG_CPU_FORCE_WORK_GROUP_SIZE=128 or export CL_CONFIG_CPU_FORCE_WORK_GROUP_SIZE=128,1,1. If this environment variable is set, the local_work_size of clEnqueueNDRangeKernel is ignored. If the array size is larger than the work_dim of clEnqueueNDRangeKernel, only the first work_dim values of the array are used. If the array size is smaller than work_dim, the work-group size of the higher dimensions is set to 1. clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE if:
      • Any of the first work_dim values is negative.
      • Any of the first work_dim values is larger than global_work_size.
      • The maximum of the first work_dim values is larger than the query result of CL_DEVICE_MAX_WORK_GROUP_SIZE.
    • Value Type: Integer or array of integers
    • Optional Values: n/a
  • CL_CONFIG_CPU_FORCE_GLOBAL_MEM_SIZE
    • Description: Forces the CPU to work with the specified global memory size; this affects the max memory allocation size if CL_CONFIG_CPU_FORCE_MAX_MEM_ALLOC_SIZE is configured. Examples: 2GB, 512MB, 65536KB, 16777216B.
    • Value Type: Size with units
    • Optional Values: From 1 to total physical size (default)
  • CL_CONFIG_CPU_FORCE_MAX_MEM_ALLOC_SIZE
    • Description: Forces the CPU to work with the specified max memory allocation size. Examples: 2GB, 512MB, 65536KB, 16777216B.   
    • Value Type: Size with units
    • Optional Values: The default is the larger of 128MB and half of the CL_CONFIG_CPU_FORCE_GLOBAL_MEM_SIZE value.
  • CL_CONFIG_CPU_FORCE_LOCAL_MEM_SIZE
    • Description: Forces CL_DEVICE_LOCAL_MEM_SIZE for the CPU device to be the given value. Examples: 8MB, 256KB (default), 8388608B.  
    • Value Type: Size with units
    • Optional Values: A reasonable value. The default is 256KB.
  • CL_CONFIG_CPU_TARGET_ARCH
    • Description: Generates code exclusively for a given target CPU architecture. Examples: graniterapids, sapphirerapids, icelake-client, icelake-server, cascadelake, skx, core-avx2, corei7-avx, corei7.
    • Value Type: A CPU architecture name
    • Optional Values: Auto detect
  • CL_CONFIG_STACK_DEFAULT_SIZE
    • Description: Sets the value as the default stack size for kernel execution. If the actual stack size is greater than this value, stack reallocation occurs. Examples: 4MB, 8MB (default), 16MB.
    • Value Type: Size with units
    • Optional Values: A reasonable value. The default is 8MB.
  • CL_CONFIG_STACK_EXTRA_SIZE
    • Description: Sets a value as the extra stack size for the execution of built-in and third-party functions. Examples: 512KB, 1MB (default), 2MB.
    • Value Type: Size with units
    • Optional Values: A reasonable value. Default 1MB.
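
The sub-group arithmetic in the CL_CONFIG_CPU_SUB_GROUP_CONSTRUCTION examples can be reproduced with a short Python sketch. The function name sub_group_layout is illustrative. The ceiling division for a dimension larger than sub_group_size is an assumption that goes beyond the documented examples, in which the chosen dimension always fits within one sub-group.

```python
from math import prod


def sub_group_layout(local_sizes, sub_group_size, mode):
    """Return (number of sub-groups, active items in a full sub-group).

    mode is the CL_CONFIG_CPU_SUB_GROUP_CONSTRUCTION value:
    0, 1, or 2 selects a dimension; -1 selects the linearized space.
    """
    if mode == -1:
        total = prod(local_sizes)
        count = -(-total // sub_group_size)  # ceiling division
        return count, min(total, sub_group_size)
    if mode not in range(len(local_sizes)):
        raise ValueError("mode must be -1 or a valid dimension index")
    extent = local_sizes[mode]
    # Every combination of the other dimensions gets its own sub-group(s).
    others = prod(s for i, s in enumerate(local_sizes) if i != mode)
    # Assumption: a dimension longer than sub_group_size is split by
    # ceiling division; the documented examples do not cover this case.
    per_row = -(-extent // sub_group_size)
    return others * per_row, min(extent, sub_group_size)


for mode in (0, 1, 2, -1):
    print(mode, sub_group_layout((8, 4, 2), 16, mode))
```

For local_sizes = (8, 4, 2) and sub_group_size = 16, the loop yields the same counts and active-item numbers as the four examples listed under that variable.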

Note: Size values accept the units GB, G, MB, M, KB, K, and B, or their lowercase equivalents.
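
As an aid for checking size strings before exporting them, the following Python sketch converts a value such as 256KB into bytes. It is not the runtime's parser. The name parse_size is hypothetical, and the use of binary multiples (1KB = 1024 bytes) is an assumption suggested by the examples 8MB and 8388608B given for CL_CONFIG_CPU_FORCE_LOCAL_MEM_SIZE.

```python
import re

# Assumption: units are binary multiples (1KB = 1024 bytes).
_UNITS = {
    "B": 1,
    "K": 1024, "KB": 1024,
    "M": 1024 ** 2, "MB": 1024 ** 2,
    "G": 1024 ** 3, "GB": 1024 ** 3,
}


def parse_size(text):
    """Convert a size string such as '512MB' or '2g' into bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]+)", text.strip())
    if match is None:
        raise ValueError(f"not a size with units: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()  # lowercase equivalents are accepted
    if unit not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(number) * _UNITS[unit]


print(parse_size("8MB"), parse_size("256kb"))
```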

Additional Configurable Environment Variables

The following configurable environment variables were originally implemented for SYCL*. They are available for CPU affinity settings and also work for OpenCL™.

  • SYCL_CPU_CU_AFFINITY
    • Sets the thread affinity to the CPU. The optional value is {close | spread | master}. By default, the SYCL_CPU_CU_AFFINITY variable is not set. Possible values:
      • close: Threads are pinned to CPU cores successively through available spaces.
      • spread: Threads are spread to available places.
      • master: Threads are put in the same places as master.
  • SYCL_CPU_SCHEDULE
    • Specifies the algorithm for scheduling work groups. The value is {dynamic | affinity | static}. The default value is dynamic. Possible values:
      • dynamic: A TBB auto_partitioner that performs sufficient splitting to balance the load.
      • affinity: A TBB affinity_partitioner that improves cache affinity over auto_partitioner through its choice of mapping subranges to worker threads.
      • static: A TBB static_partitioner that distributes range iterations among worker threads as uniformly as possible.
  • SYCL_CPU_NUM_CUS
    • See CL_CONFIG_CPU_TBB_NUM_WORKERS.       
  • SYCL_CPU_PLACES
    • Specifies the places where affinities are set. The optional value is {sockets | numa_domains | cores | threads}. By default, the SYCL_CPU_PLACES variable is not set.
      • sockets: Each place is a single socket that consists of one or multiple cores.
      • numa_domains: Each place is a single NUMA node.
      • cores: Each place is a single CPU core that has one or multiple hardware threads.
      • threads: Each place is a single hardware thread.
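
The value sets listed above for the affinity variables lend themselves to a small validation helper. The Python sketch below is illustrative only: the name affinity_env is hypothetical, and it checks just the values documented here for SYCL_CPU_CU_AFFINITY, SYCL_CPU_SCHEDULE, and SYCL_CPU_PLACES.

```python
ALLOWED = {
    "SYCL_CPU_CU_AFFINITY": {"close", "spread", "master"},
    "SYCL_CPU_SCHEDULE": {"dynamic", "affinity", "static"},
    "SYCL_CPU_PLACES": {"sockets", "numa_domains", "cores", "threads"},
}


def affinity_env(**settings):
    """Validate settings and return them as environment variables."""
    env = {}
    for name, value in settings.items():
        if name not in ALLOWED:
            raise KeyError(f"unsupported variable: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"{name} does not accept {value!r}")
        env[name] = value
    return env


# Pin threads to successive cores and keep the default dynamic scheduler.
print(affinity_env(SYCL_CPU_CU_AFFINITY="close", SYCL_CPU_PLACES="cores"))
```

Rejecting unknown values up front avoids relying on how the runtime treats a misspelled setting, which this guide does not specify.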

The Intel® CPU Runtime for OpenCL™ Applications installation includes the latest OpenCL™ ICD Loader from Khronos. You can use all the debug environment variables for the OpenCL™ ICD Loader defined in the Khronos Table of Debug Environment Variables. One useful environment variable is OCL_ICD_ENABLE_TRACE. You can enable this environment variable to trace ICD loader execution and analyze ICD loader-related issues.

  • OCL_ICD_ENABLE_TRACE
    • Description: Traces the ICD loader execution information.
    • Value Type: Bool
    • Optional Values: True, true, 1, False, false, 0
