Considering native_ and half_ Versions of Math Built-Ins

OpenCL™ Developer Guide for Intel® Processor Graphics

ID 773088

Date 3/20/2019

Version 2019.4

Public

Considering native_ and half_ Versions of Math Built-Ins

The OpenCL™ API offers two basic ways to trade precision for speed:

native_* and half_* math built-ins, which have lower precision but are faster than their unprefixed variants.
Compiler options that enable floating-point optimizations for the whole OpenCL program (for example, the -cl-fast-relaxed-math flag).

For the list of other compiler options and their descriptions, refer to the Intel® Code Builder for OpenCL™ API - User Manual. In general, the -cl-fast-relaxed-math flag is a quick way to get potentially large performance gains for kernels with many math operations, but because it applies to the whole program, it does not permit fine control of numeric accuracy. For finer control, consider experimenting with the native_* equivalents separately for each specific case, keeping track of the resulting accuracy.
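One way to keep track of the resulting accuracy is to compare, on the host, the output of a kernel that uses a native_* built-in against the output of the same kernel using the unprefixed built-in. The following is a minimal sketch in plain C, assuming both result arrays have already been read back from the device. The helper names (ordered_bits, ulp_distance, max_ulp_error) are hypothetical and not part of any OpenCL API. The sketch measures the distance between two float results, not the error against an infinitely precise value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto an ordered integer line so that
   adjacent representable floats differ by exactly 1 (+0 and -0 map to 0). */
static int64_t ordered_bits(float x)
{
    int32_t i;
    memcpy(&i, &x, sizeof i);   /* reinterpret bits without aliasing issues */
    return (i < 0) ? -(int64_t)(i & 0x7fffffff) : (int64_t)i;
}

/* Distance between two non-NaN floats in units in the last place (ULP). */
static int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Largest ULP distance between a "fast" result array and a reference array,
   for example the outputs of a native_sin kernel versus a sin kernel.
   Returns -1 if any element is NaN. */
static int64_t max_ulp_error(const float *fast, const float *ref, size_t n)
{
    int64_t worst = 0;
    assert(fast != NULL && ref != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* x != x is true only for NaN */
        if (fast[i] != fast[i] || ref[i] != ref[i])
            return -1;
        int64_t d = ulp_distance(fast[i], ref[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Calling max_ulp_error(fast_results, reference_results, count) after each experiment gives a single number to record per kernel, which makes it easier to decide case by case whether the faster variant is acceptable.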

The native_ versions of math built-ins are generally supported in hardware and run substantially faster, while offering lower accuracy. Use the native_ versions of trigonometric and transcendental functions, such as native_sin, native_cos, native_exp, or native_log, when performance is more important than precision.
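As an illustration of choosing the native_ version for one specific call rather than for the whole program, the sketch below keeps an OpenCL C kernel as a host-side string and selects between sin and native_sin with a preprocessor macro. The macro USE_NATIVE_SIN, the kernel apply_sin, and the helper make_build_options are hypothetical names introduced for this example. The resulting string is the kind of options argument that would be passed to clBuildProgram; the actual OpenCL host calls are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* OpenCL C kernel source held as a host-side string. The hypothetical
   USE_NATIVE_SIN macro picks the built-in when the program is built. */
static const char *const apply_sin_src =
    "#ifdef USE_NATIVE_SIN\n"
    "#define SIN(x) native_sin(x)\n"
    "#else\n"
    "#define SIN(x) sin(x)\n"
    "#endif\n"
    "__kernel void apply_sin(__global const float *in,\n"
    "                        __global float *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = SIN(in[i]);\n"
    "}\n";

/* Compose a build-options string.
   use_native_sin:    switch only this kernel's sin call to native_sin.
   fast_relaxed_math: relax floating-point rules for the whole program.
   Returns 0 on success, -1 if buf is too small. */
static int make_build_options(int use_native_sin, int fast_relaxed_math,
                              char *buf, size_t size)
{
    assert(buf != NULL);
    int n = snprintf(buf, size, "%s%s%s",
                     use_native_sin ? "-D USE_NATIVE_SIN" : "",
                     (use_native_sin && fast_relaxed_math) ? " " : "",
                     fast_relaxed_math ? "-cl-fast-relaxed-math" : "");
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Building the same source twice, once with and once without the macro, allows the two variants to be timed and compared without maintaining two copies of the kernel.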

The list of functions that have optimized versions is provided in the "Working with cl-fast-relaxed-math Flag" section of the OpenCL Code Builder - User’s Guide.
