Kernel Memory Access Optimization Summary

OpenCL™ Developer Guide for Intel® Processor Graphics


ID 773088

Date 3/20/2019

Version 2019.4

Public

Kernel Memory Access Optimization Summary

A kernel should access at least 32 bits of data at a time, from addresses that are aligned to 32-bit boundaries. A char4, short2, int, or float counts as 32 bits of data. Where possible, load two, three, or four 32-bit quantities at a time, which may improve performance. Loading more than four 32-bit quantities at a time may reduce performance.
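The size and alignment guidance above can be written down as a small host-side check. The following is a minimal sketch in plain C, not part of any OpenCL API: the `access_class` enum and `classify_access` function are hypothetical names introduced for illustration, and treating any access that is not a whole number of 32-bit quantities as below the minimum is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of one memory access, following the guidance
   in the text. The names are illustrative only. */
typedef enum {
    ACCESS_BELOW_MINIMUM, /* narrower than 32 bits, or not 32-bit aligned */
    ACCESS_MINIMUM,       /* exactly one aligned 32-bit quantity */
    ACCESS_PREFERRED,     /* two to four aligned 32-bit quantities */
    ACCESS_TOO_WIDE       /* more than four 32-bit quantities */
} access_class;

/* Classify an access of `bytes` bytes starting at `address`.
   Assumption: sizes that are not a multiple of 4 bytes are treated as
   falling below the 32-bit minimum. */
access_class classify_access(uintptr_t address, size_t bytes)
{
    assert(bytes > 0);

    if (address % 4 != 0 || bytes % 4 != 0)
        return ACCESS_BELOW_MINIMUM;

    size_t dwords = bytes / 4; /* number of 32-bit quantities */
    if (dwords == 1)
        return ACCESS_MINIMUM;
    if (dwords <= 4)
        return ACCESS_PREFERRED;
    return ACCESS_TOO_WIDE;
}
```

Under these assumptions, a 16-byte access at an aligned address (the size of a float4) falls in the preferred range, while a single-byte access or a 4-byte access at an address such as 0x1002 falls below the minimum.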

Optimize __global memory and __constant memory accesses to minimize the number of cache lines read from the L3 cache. This typically involves carefully choosing your work-group dimensions and the way your array indices are computed from the work-item local or global ID.
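To see how the index computation affects the number of cache lines read, the access pattern can be modeled on the host. The sketch below is plain C, not OpenCL code, and `count_cache_lines` is a hypothetical helper: it assumes that work-item i reads `elem_bytes` bytes starting at byte offset `i * stride_bytes` from a line-aligned base, and the cache line size is a parameter (64 bytes is used in the usage note as an assumption about the device).

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache lines touched when work-item i (0 <= i < num_items)
   reads elem_bytes bytes at byte offset i * stride_bytes.
   Offsets never decrease with i, so it is enough to remember the lowest
   line index that has not been counted yet. */
size_t count_cache_lines(size_t num_items, size_t stride_bytes,
                         size_t elem_bytes, size_t line_bytes)
{
    assert(elem_bytes > 0 && line_bytes > 0);

    size_t lines = 0;
    size_t next_uncounted = 0; /* lowest line index not yet counted */

    for (size_t i = 0; i < num_items; ++i) {
        size_t first = (i * stride_bytes) / line_bytes;
        size_t last = (i * stride_bytes + elem_bytes - 1) / line_bytes;

        if (first < next_uncounted)
            first = next_uncounted; /* skip lines already counted */
        if (last >= first) {
            lines += last - first + 1;
            next_uncounted = last + 1;
        }
    }
    return lines;
}
```

With an assumed 64-byte line, 16 work-items reading consecutive 4-byte values (stride 4) touch a single line, while the same 16 work-items reading with a 64-byte stride touch 16 lines for the same amount of useful data.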

If you cannot access __global memory or __constant memory in an optimal manner, consider moving part of your data to __local memory, where more access patterns can execute with full performance.

__local memory is most beneficial when the access pattern favors the banked nature of the shared local memory (SLM) hardware.

Optimize __local memory accesses to minimize the number of bank conflicts. Reading the same address from the same bank does not cause a conflict, but reading different addresses from the same bank does. Writes to the same bank always result in a bank conflict, even if the writes go to the same address. Consider adding a padding column to two-dimensional local memory arrays if it avoids bank conflicts when accessing columns of data.
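The effect of adding a column can be checked with a small host-side model of a column read. This is a sketch in plain C under stated assumptions, not a description of the hardware: `worst_bank_load` is a hypothetical helper, banks are assumed to be selected by the 32-bit word index modulo the bank count, and the bank count is a parameter (16 is used in the usage note as an assumption).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 64 /* upper bound for this sketch only */

/* Work-item i reads the 32-bit word at index i * row_words + column, that is,
   one column of a two-dimensional array whose rows are row_words words long.
   Returns the largest number of distinct addresses that map to one bank;
   a result of 1 means the column read is free of bank conflicts. */
size_t worst_bank_load(size_t num_items, size_t row_words,
                       size_t column, size_t num_banks)
{
    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    assert(row_words > 0); /* guarantees every work-item reads a distinct word */

    size_t hits[MAX_BANKS] = {0};
    size_t worst = 0;

    for (size_t i = 0; i < num_items; ++i) {
        size_t word = i * row_words + column;
        size_t bank = word % num_banks; /* assumed bank mapping */

        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under the assumed 16-bank mapping, 16 work-items reading a column of an array with 16-word rows all land in the same bank, so the result is 16. Padding each row to 17 words spreads the same column across 16 different banks, and the result drops to 1.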

Avoid dynamically indexed __private arrays if possible.
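One way to remove a dynamic index is to replace a small lookup array with selections between constants. The sketch below shows the idea in plain C rather than OpenCL C; the function names and the weight values are hypothetical, and whether the rewrite helps on a given device is an assumption that should be confirmed by measurement.

```c
#include <assert.h>

/* Version with a small function-local array indexed by a run-time value.
   In a kernel, this would be a dynamically indexed __private array. */
float weight_indexed(int i)
{
    assert(i >= 0 && i < 4);
    const float weights[4] = {0.125f, 0.375f, 0.375f, 0.125f};
    return weights[i]; /* index known only at run time */
}

/* Equivalent version with no array: the value is chosen by comparisons,
   so only scalar constants are involved. */
float weight_selected(int i)
{
    assert(i >= 0 && i < 4);
    return (i == 0 || i == 3) ? 0.125f : 0.375f;
}
```

Both functions return the same value for every valid index. Another option, when the index comes from a loop with a fixed trip count, is to unroll the loop so that each index becomes a compile-time constant.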
