OpenCL™ Developer Guide for Intel® Core™ and Intel® Xeon® Processors
ID: 773005
Date: 10/30/2018
Legal Information
Getting Help and Support
Introduction
Check-list for OpenCL™ Optimizations
Tips and Tricks for Kernel Development
Application-Level Optimizations
Debugging OpenCL™ Kernels on Linux* OS
Performance Debugging with Intel® SDK for OpenCL™ Applications
Coding for the Intel® Architecture Processors
Why Is Optimizing Kernels Important?
Avoid Spurious Operations in Kernels
Avoid Handling Edge Conditions in Kernels
Use the Preprocessor for Constants
Prefer (32-bit) Signed Integer Data Types
Prefer Row-Wise Data Accesses
Use Built-In Functions
Avoid Extracting Vector Components
Task-Parallel Programming Model Hints
Common Mistakes in OpenCL™ Applications
Introduction for OpenCL™ Coding on Intel® Architecture Processors
Vectorization Basics for Intel® Architecture Processors
Vectorization: SIMD Processing Within a Work Group
Benefitting from Implicit Vectorization
Vectorizer Knobs
Targeting a Different CPU Architecture
Using Vector Data Types
Writing Kernels to Directly Target the Intel® Architecture Processors
Work-Group Size Considerations
Threading: Achieving Work-Group Level Parallelism
Efficient Data Layout
Using the Blocking Technique
Intel® Turbo Boost Technology Support
Global Memory Size
Avoid Needless Synchronization
For best results, avoid explicit command synchronization primitives, such as clEnqueueMarker or clEnqueueBarrier. Explicit synchronization commands and event tracking result in cross-module round trips, which decrease performance. The fewer explicit synchronization commands you use, the better the performance.
Use the following techniques to reduce explicit synchronization:
- Merge kernels whenever possible; this also improves data locality (see the kernel-merging sketch after this list).
- If you have to wait for a kernel to finish before reading its output buffer, defer that wait: keep submitting work and read the buffer only when you actually need the first results.
- If an in-order queue expresses the dependency chain correctly, use it to define a chain of dependent kernels. In the in-order execution model, the commands in a command queue execute in the order of submission, with each command running to completion before the next one begins. This is the typical case for a straightforward processing pipeline. Consider the following:
  - Blocking OpenCL™ API calls are more effective than explicit synchronization schemes based on OS synchronization primitives.
  - If you are optimizing the kernel pipeline, first measure the kernels separately to find the most time-consuming one. In the final pipeline version, avoid frequent calls to clFinish or clWaitForEvents, for example after each kernel invocation. Prefer submitting the whole sequence to the in-order queue and issuing clFinish once, or waiting on a single OpenCL event object, which reduces host-device round trips (see the host-side sketch below).
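A minimal sketch of the kernel-merging technique: the scale/offset kernels below are hypothetical stand-ins for any pair of dependent element-wise kernels, not taken from this guide. Fusing them removes one kernel dispatch, the intermediate global buffer, and the synchronization point between the two launches.

```c
// Before merging: two kernels with an intermediate global buffer "tmp" and
// an extra dispatch (and potential synchronization) between them.
__kernel void scale(__global const float *in, __global float *tmp, float a)
{
    size_t i = get_global_id(0);
    tmp[i] = in[i] * a;
}

__kernel void add_offset(__global const float *tmp, __global float *out, float b)
{
    size_t i = get_global_id(0);
    out[i] = tmp[i] + b;
}

// After merging: one kernel, one dispatch; the intermediate value stays in a
// private variable, which also improves data locality.
__kernel void scale_add_offset(__global const float *in, __global float *out,
                               float a, float b)
{
    size_t i = get_global_id(0);
    float t = in[i] * a;   // intermediate result never touches global memory
    out[i] = t + b;
}
```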
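A minimal host-side sketch of the submission pattern described above, assuming hypothetical kernels kernelA and kernelB, a result buffer, and a run_pipeline helper (none of these names come from the guide), with error handling reduced to a simple macro: the dependent kernels go into an in-order queue with no intermediate clFinish or clWaitForEvents, and the single synchronization point is a blocking read issued only when the host needs the data.

```c
/* Sketch only: kernelA, kernelB, "result", and the sizes are placeholders. */
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(err) do { if ((err) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d\n", (err)); exit(EXIT_FAILURE); } } while (0)

static void run_pipeline(cl_context ctx, cl_device_id dev,
                         cl_kernel kernelA, cl_kernel kernelB,
                         cl_mem result, size_t global_size,
                         size_t result_bytes, void *host_output)
{
    cl_int err;

    /* NULL properties create a default, in-order queue: commands execute in
     * submission order, so no markers, barriers, or event waits are needed
     * between the dependent kernels. */
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
    CHECK(err);

    /* Submit the whole sequence without intermediate clFinish calls. */
    err = clEnqueueNDRangeKernel(queue, kernelA, 1, NULL,
                                 &global_size, NULL, 0, NULL, NULL);
    CHECK(err);
    err = clEnqueueNDRangeKernel(queue, kernelB, 1, NULL,
                                 &global_size, NULL, 0, NULL, NULL);
    CHECK(err);

    /* Synchronize exactly once, where the host actually needs the data:
     * the blocking read is the single host-device round trip. */
    err = clEnqueueReadBuffer(queue, result, CL_TRUE /* blocking */, 0,
                              result_bytes, host_output, 0, NULL, NULL);
    CHECK(err);

    clReleaseCommandQueue(queue);
}
```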