To get the best performance from the vectorization module (see the "Benefitting from Implicit Vectorization" section), make the work-group size 8 or larger, and preferably a multiple of 8.
To reduce the overhead of maintaining a work-group, create work-groups that are as large as possible, which means 64 or more work-items. One upper bound is the size of the accessed data set: it is better for the data accessed by a single work-group not to exceed the size of the L1 cache. There should also be a sufficient number of work-groups; see the "Work-Group Level Parallelism" section for more information.
To accommodate multiple architectures, query the device for the
parameter by calling to
, and set the work-group size accordingly.
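As a rough illustration of how queried values might be turned into a work-group size, consider the sketch below. The helper names and parameters are hypothetical and do not come from this guide. The sketch assumes you have already obtained a preferred size multiple and a maximum work-group size from the device (for example, with clGetKernelWorkGroupInfo and the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE queries, which is one plausible reading of the query described above), and that you can estimate how many bytes each work-item accesses.

```c
#include <stddef.h>

/* Hypothetical helper, not part of any OpenCL API.
 * Returns the largest multiple of `multiple` that exceeds neither
 * `max_wg_size` nor the number of work-items whose data fits in
 * `l1_bytes`. Returns 0 if an argument is zero or if not even
 * `multiple` work-items fit. */
size_t pick_work_group_size(size_t multiple, size_t max_wg_size,
                            size_t l1_bytes, size_t bytes_per_item)
{
    if (multiple == 0 || max_wg_size == 0 || bytes_per_item == 0)
        return 0;

    /* Number of work-items whose combined data still fits in L1. */
    size_t cache_limit = l1_bytes / bytes_per_item;
    size_t limit = cache_limit < max_wg_size ? cache_limit : max_wg_size;

    /* Round down to the nearest multiple of the preferred multiple. */
    return (limit / multiple) * multiple;
}

/* Rounds the global size up to a multiple of the local size, for
 * runtimes that require the global size to be divisible by the
 * work-group size. The kernel must then ignore the padding items. */
size_t round_up_global_size(size_t n_items, size_t local_size)
{
    if (local_size == 0)
        return 0;
    return ((n_items + local_size - 1) / local_size) * local_size;
}
```

A result of 0 from pick_work_group_size means the assumed cache budget is too small for even one vector-width group of work-items; in that case a caller would likely fall back to the preferred multiple itself and accept some cache misses.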
If your kernel code contains the barrier instruction, choosing the work-group size becomes a tradeoff. The more local and private memory each work-item in the work-group requires, the smaller the optimal work-group size is. The reason is that a barrier also issues copy instructions for the total amount of private and local memory used by all work-items in the work-group, since the state of each work-item that arrives at the barrier is saved before proceeding with another work-item.
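To make the barrier tradeoff concrete, the following back-of-the-envelope sketch estimates how much state a barrier may need to copy for one work-group. The function names and the cost model (per-work-item private plus local bytes, multiplied by the work-group size) are illustrative assumptions based on the description above, not the runtime's actual accounting.

```c
#include <stddef.h>

/* Hypothetical estimate of the bytes saved at a barrier for one
 * work-group: every work-item contributes its private and local
 * memory footprint. */
size_t barrier_state_bytes(size_t wg_size,
                           size_t private_bytes_per_item,
                           size_t local_bytes_per_item)
{
    return wg_size * (private_bytes_per_item + local_bytes_per_item);
}

/* Largest work-group size (a multiple of `multiple`) whose estimated
 * barrier state stays within `budget_bytes`. Returns 0 if the inputs
 * are degenerate or the budget is too small. */
size_t max_wg_size_for_budget(size_t budget_bytes,
                              size_t private_bytes_per_item,
                              size_t local_bytes_per_item,
                              size_t multiple)
{
    size_t per_item = private_bytes_per_item + local_bytes_per_item;
    if (per_item == 0 || multiple == 0)
        return 0;
    return (budget_bytes / per_item / multiple) * multiple;
}
```

Under this model, doubling the per-work-item memory roughly halves the work-group size that fits a fixed budget, which matches the guidance that memory-heavy kernels with barriers favor smaller work-groups. The budget itself is an assumption you would tune by measurement.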