Sharing Resources Efficiently

OpenCL™ Developer Guide for Intel® Processor Graphics

ID 773088

Date 3/20/2019

Version 2019.4

Public

Objects allocated at the context level are shared between all devices in the context. For example, buffers and images are shared by default. Program and kernel objects are also shared automatically across all devices in the context.

NOTE:

Shared memory objects must not be written concurrently by different command queues. Synchronize write access explicitly with OpenCL™ synchronization objects, such as events. Consider using sub-buffers, which let you write to non-overlapping regions simultaneously.
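As a rough illustration of the sub-buffer approach, the sketch below computes two non-overlapping regions of a parent buffer, one per command queue. It deliberately avoids the OpenCL headers so that it compiles on its own: `region_t` and `split_buffer` are hypothetical names introduced here, with `region_t` mirroring the `origin` and `size` fields of `cl_buffer_region`. In a real application each region would be passed to `clCreateSubBuffer` with `CL_BUFFER_CREATE_TYPE_REGION`, and the alignment would be derived from `CL_DEVICE_MEM_BASE_ADDR_ALIGN` (which is reported in bits) for every device that uses the sub-buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in that mirrors the origin/size fields of
   cl_buffer_region, so the sketch builds without the OpenCL headers. */
typedef struct {
    size_t origin; /* byte offset into the parent buffer */
    size_t size;   /* length of the region in bytes */
} region_t;

/* Split a parent buffer of total_bytes into two non-overlapping regions.
   The split point is rounded down to a multiple of align_bytes, which
   must be a power of two, so that the second origin stays aligned.
   Callers should reject the result if either size comes back as zero. */
void split_buffer(size_t total_bytes, size_t align_bytes,
                  region_t *first, region_t *second)
{
    assert(first != NULL && second != NULL);
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);

    size_t split = (total_bytes / 2) & ~(align_bytes - 1);

    first->origin = 0;
    first->size = split;
    second->origin = split;
    second->size = total_bytes - split;
}
```

Because the two regions never overlap, each queue can write to its own sub-buffer at the same time. Any later access to the parent buffer as a whole still needs to be ordered after both writers, for example with events.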

You can also avoid implicit copying when you share data with the host, as explained in the “Mapping Memory Objects” section.

NOTE:

To avoid potential inefficiencies, especially those caused by improper alignment, align host pointers to 4 KB (4096 bytes) whenever the Intel® Graphics device is involved. Also make allocation sizes a multiple of the cache line size (64 bytes). Refer to the “Mapping Memory Objects” section for more details.
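The alignment advice above might look as follows on the host side. This is a minimal sketch, assuming a POSIX system where `posix_memalign` is available (on Windows, `_aligned_malloc` and `_aligned_free` play a similar role). The helper names `round_up_to_cache_line` and `alloc_host_buffer` are hypothetical and are not part of OpenCL or of this guide. A pointer obtained this way would typically be handed to `clCreateBuffer` together with `CL_MEM_USE_HOST_PTR`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { PAGE_ALIGN = 4096, CACHE_LINE = 64 };

/* Round a byte count up to the next multiple of the 64-byte cache line. */
size_t round_up_to_cache_line(size_t bytes)
{
    return (bytes + (size_t)CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Allocate host memory whose address is 4096-byte aligned and whose size
   is a multiple of 64 bytes. The padded size is stored in *padded_bytes.
   Returns NULL on failure, on a zero-byte request, or if the size
   overflows while rounding. Release the memory with free(). */
void *alloc_host_buffer(size_t bytes, size_t *padded_bytes)
{
    void *ptr = NULL;
    size_t padded = round_up_to_cache_line(bytes);

    assert(padded_bytes != NULL);
    if (padded == 0 || padded < bytes)
        return NULL;
    if (posix_memalign(&ptr, PAGE_ALIGN, padded) != 0)
        return NULL;

    *padded_bytes = padded;
    return ptr;
}
```

Use the padded size, not the requested size, when creating the buffer, so that both the start address and the length meet the recommendation.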
