OpenCL™ Developer Guide for Intel® Core™ and Intel® Xeon® Processors
ID: 773005
Date: 10/30/2018
Public
Use Built-In Functions
OpenCL™ offers a library of built-in functions, including vector variants. For details, see the OpenCL specification.
Calling a built-in function is typically more efficient than implementing the same operation manually in OpenCL code. Consider the following kernel, which computes a reciprocal square root by hand:
    __kernel void Foo(const __global float* a, const __global float* b, __global float* c)
    {
        int tid = get_global_id(0);
        c[tid] = 1/sqrt(a[tid] + b[tid]);
    }
The following code uses the rsqrt built-in to implement the same example:
    __kernel void Foo(const __global float* a, const __global float* b, __global float* c)
    {
        int tid = get_global_id(0);
        c[tid] = rsqrt(a[tid] + b[tid]);
    }
Consider the following simple expressions and their built-in equivalents:
    dx * fCos + dy * fSin      == dot((float2)(dx, dy), (float2)(fCos, fSin))
    x * a - b                  == mad(x, a, -b)
    sqrt(dot(x - y, x - y))    == distance(x, y)
Where possible, use specialized versions of the math, integer, and geometric built-ins, because they typically run faster than their general or manually computed counterparts. For example, when the base x in x^y is known to be ≥ 0, use powr instead of pow.
See Also
The OpenCL™ 1.2 Specification at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf