Identify Kernels to Offload

Intel® oneAPI Programming Guide

ID 771723

Date 12/17/2022

Version

Public

A newer version of this document is available.

Identify Kernels to Offload

To make the best use of the compute cycles available on the devices of a heterogeneous platform, identify the tasks that are compute intensive and that can benefit from parallel execution. For example, an application that currently executes solely on a CPU may contain tasks that are suitable to execute on a GPU. You can find these tasks using the Offload Modeling perspective of Intel® Advisor.

Intel Advisor estimates how the workload would perform if it executed on an accelerator. It consumes the information collected by profiling the workload and provides performance estimates, projected speedup, bottleneck characterization, and offload data transfer estimates and recommendations.

Typically, kernels that do a large amount of computation over a large dataset, with limited memory transfers between host and device, are best suited for offload to a device.
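The trade-off described above can be illustrated with a back-of-the-envelope estimate. The following is a minimal sketch only, and it is not the model that Intel Advisor uses: the `KernelProfile` structure, the function names, and the assumption that transfer time is simply bytes divided by a fixed link bandwidth are all hypothetical and chosen for illustration.

```cpp
#include <cassert>

// Hypothetical per-kernel measurements (illustrative only, not an Intel Advisor data format).
struct KernelProfile {
    const char* name;          // label for the candidate kernel
    double cpu_seconds;        // measured time of the kernel on the CPU
    double device_seconds;     // assumed compute time on the accelerator
    double bytes_transferred;  // data moved between host and device
};

// Estimated offloaded time = device compute time + data transfer time.
// Assumes a constant link bandwidth and no overlap of transfer with compute.
double estimated_offload_seconds(const KernelProfile& k, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return k.device_seconds + k.bytes_transferred / bytes_per_second;
}

// Ratio of CPU time to estimated offloaded time; values above 1.0 suggest a gain.
double estimated_speedup(const KernelProfile& k, double bytes_per_second) {
    return k.cpu_seconds / estimated_offload_seconds(k, bytes_per_second);
}

// A kernel is a candidate for offload only if the estimate beats the CPU.
bool worth_offloading(const KernelProfile& k, double bytes_per_second) {
    return estimated_speedup(k, bytes_per_second) > 1.0;
}
```

Under these assumptions, a kernel that takes 10 s on the CPU, 1 s on the device, and moves 8 GB over a 16 GB/s link is estimated at 1.5 s when offloaded, so it is a good candidate. A kernel that takes 2 s on the CPU, 0.5 s on the device, and moves 64 GB over the same link is estimated at 4.5 s, so the data transfer outweighs the compute gain.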

See Get Started: Identify High-impact Opportunities to Offload to GPU for quick steps to ramp up with the Offload Modeling perspective. For more resources about modeling performance of your application on GPU platforms, see Offload Modeling Resources for Intel® Advisor Users.
