Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022
Public
Document Table of Contents

11.2. Utilizing Hardware Kernel Invocation Queue

OpenCL kernels are built with invocation queue to enable immediate launch of next invocation.
Tip: If you are looking for Intel® oneAPI DPC++/C++ Compiler-specific details, refer to Utilizing Hardware Kernel Invocation Queue topic in the FPGA Optimization Guide for Intel® oneAPI Toolkits.

As illustrated in the following figure, when the invocation queue is used, system and OpenCL runtime environment overheads (from responding to the finish and sending in the next set of invocation arguments) are overlapped with the kernel executions. This allows kernels to execute continuously, maximizing the system level throughput.

Figure 92. Kernel Execution With and Without Invocation Queue

Kernel invocations are queued in hardware when another enqueued kernel with the same kernel function name and program is already running on the device, and the following are true:

  • OpenCL kernel is not compiled with hardware kernel invocation buffer disabled (-no-hardware-kernel-invocation-buffer).
  • OpenCL kernel is not compiled with performance counters (-profile)
  • Enqueued OpenCL kernel does not have printf.
  • All event objects queued earlier in the command queue have execution status equal to CL_COMPLETE.

    If the status is CL_SUBMITTED or CL_RUNNING, then that status is for the enqueue kernel with the same kernel function name in the same program.

  • All event objects in the event wait list have execution status equal to CL_COMPLETE.

    If the status is CL_SUBMITTED or CL_RUNNING, then that status is for the enqueue kernel on the same device with the same kernel function name in the same program.

  • If the OpenCL kernel uses heterogeneous memory, kernel currently running on the device and the one getting enqueued did not set the same memory object on different memory types.

Consider the following two example host code snippets where the compiler can queue kernel invocation on hardware kernel invocation queue:

Example 1

int main()
{	…
  clEnqueueNDRangeKernel(queue, kernel, …, NULL);
  clEnqueueNDRangeKernel(queue, kernel, …, NULL);
  …
}

As soon as the first enqueue kernel is running, the compiler can queue the second enqueue kernel on hardware.

Example 2

int main()
{	…
  clEnqueueNDRangeKernel(queue0, kernel0, …, NULL);
  clEnqueueNDRangeKernel(queue1, kernel1, …, NULL);
  clEnqueueNDRangeKernel(queue0, kernel0, …, NULL);
  clEnqueueNDRangeKernel(queue1, kernel1, …, NULL);
  …
}

As soon as the first enqueue of kernel0 is running, the compiler can queue the second enqueue of kernel0 on the hardware irrespective of the kernel1 status. Similarly, as soon as the first enqueue of kernel1 is running, the compiler can queue the second enqueue of kernel1 on the hardware irrespective of the kernel0 status.

Now, consider the following two examples where the compiler cannot queue kernel invocation on hardware:

Example 1

int main()
{	…
  clEnqueueNDRangeKernel(queue, kernel0, …, NULL);
  clEnqueueNDRangeKernel(queue, kernel1, …, NULL);
  clEnqueueNDRangeKernel(queue, kernel0, …, NULL);
  …
}

Since the queue is in-order, enqueue kernel1 prevents the second enqueue of kernel0 from being queued on the hardware invocation queue.

Example 2

int main()
{	…
  clEnqueueNDRangeKernel(queue0, kernel0, …, NULL);
  clEnqueueNDRangeKernel(queue1, kernel1, …, &event);
  clFlush(queue1);
  clEnqueueNDRangeKernel(queue0, kernel0, …, 1, &event, NULL);
  …
}

Since the second enqueue of kernel0 is waiting on enqueue of kernel1 to complete, it only gets queued on the hardware kernel invocation queue if kernel1 finishes execution before first enqueue of kernel0 finishes.

Attention: If the difference in clGetEventProfilingInfo() time between CL_PROFILING_COMMAND_END and CL_PROFILING_COMMAND_START flags is used to calculate execution time of enqueue kernel commands, it is possible that the execution time is zero if it is queued on the invocation queue. Use the following formula to calculate average execution time of the kernel across multiple enqueues instead: