Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 10/04/2021
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

7.1. Specifying a Maximum Work-group Size or a Required Work-Group Size

Specify the max_work_group_size or reqd_work_group_size attribute for your kernels whenever possible. These attributes allow the Intel® FPGA SDK for OpenCL™ Offline Compiler to perform aggressive optimizations to match the kernel to hardware resources without any excess logic.
Tip: For oneAPI DPC++-specific details, refer to Specify a Work-group Size topic in the Intel® oneAPI DPC++ FPGA Optimization Guide.

The offline compiler assumes a default work-group size for your kernel depending on certain constraints imposed during compilation time and runtime .

The offline compiler imposes the following constraints at compilation time:

  • If you specify a value for the reqd_work_group_size attribute, the work-group size must match this value.
  • If you specify a value for the max_work_group_size attribute, the work-group size must not exceed this value.
  • If you do not specify values for reqd_work_group_size and max_work_group_size, and the kernel contains a barrier, the offline compiler defaults to a maximum work-group size of 256 work-items.
  • If you do not specify values for both attributes and the kernel does not contain any barrier, the offline compiler does not impose any constraint on the work-group size at compilation time.
Tip: Use the CL_KERNEL_WORK_GROUP_SIZE and CL_KERNEL_COMPILE_WORK_GROUP_SIZE queries to the clGetKernelWorkGroupInfo API call to determine the work-group size constraints that the offline compiler imposes on a particular kernel at compilation time.

The OpenCL™ standard imposes the following constraints at runtime:

  • The work-group size in each dimension must divide evenly into the requested NDRange size in each dimension.
  • The work-group size must not exceed the device constraints specified by the CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_ITEM_SIZES queries to the clGetDeviceInfo API call.
CAUTION:
If the work-group size you specify for a requested NDRange kernel execution does not satisfy all of the constraints listed above, the clEnqueueNDRangeKernel API call fails with the error CL_INVALID_WORK_GROUP_SIZE.

If you do not specify values for both the reqd_work_group_size and max_work_group_size attributes, the runtime determines a default work-group size as follows:

  • If the kernel contains a barrier or refers to the local work-item ID, or if you use the clGetKernelWorkGroupInfo and clGetDeviceInfo API calls in your host code to query the work-group size, the runtime defaults the work-group size to one work-item.
  • If the kernel does not contain a barrier or refer to the local work-item ID, or if your host code does not query the work-group size, the default work-group size is the global NDRange size.

When queuing an NDRange kernel (that is, not a single work-item kernel), specify an explicit work-group size under the following conditions:

  • If your kernel uses memory barriers, local memory, or local work-item IDs.
  • If your host program queries the work-group size.

If your kernel uses memory barriers, perform one of the following tasks to minimize hardware resources:

  • Specify a value for the reqd_work_group_size attribute.
  • Assign to the max_work_group_size attribute the smallest work-group size that accommodates all your runtime work-group size requests.
CAUTION:
Including a memory barrier at the end of your NDRange kernel causes compilation to fail.

Specifying a smaller work-group size than the default at runtime might lead to excessive hardware consumption. Therefore, if you require a work-group size other than the default, specify the max_work_group_size attribute to set a maximum work-group size. If the work-group size remains constant through all kernel invocations, specify a required work-group size by including the reqd_work_group_size attribute. The reqd_work_group_size attribute instructs the offline compiler to allocate exactly the correct amount of hardware to manage the number of work-items per work-group you specify. This allocation results in hardware resource savings and improved efficiency in the implementation of kernel compute units. By specifying the reqd_work_group_size attribute, you also prevent the offline compiler from implementing additional hardware to support work-groups of unknown sizes.

For example, the code fragment below assigns a fixed work-group size of 64 work-items to a kernel:

__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}