Making Better Use of OpenMP Constructs
Reduce Synchronizations Using nowait
A nowait clause on a target construct helps eliminate unnecessary synchronizations between host and device code. By default, a target region includes an implicit barrier at the end, meaning that the host thread waits for the device region to complete before proceeding. When nowait is added to a target directive, this barrier is removed. The host thread is free to continue execution immediately, while the offloaded computation proceeds asynchronously on the device.
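As a minimal sketch of this pattern (the array sizes, contents, and the host-side summation below are illustrative only, not part of the example that follows):

#include <stdio.h>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], c[1 << 20];
    for (int i = 0; i < n; ++i)
        a[i] = i * 0.001f;

    // nowait removes the implicit wait at the end of the target region:
    // the host thread continues immediately while the device computes.
    #pragma omp target teams distribute parallel for nowait map(to: a[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = 2.0f * a[i];

    // Independent host work overlaps with the offloaded computation.
    double host_sum = 0.0;
    for (int i = 0; i < n; ++i)
        host_sum += a[i];

    // Synchronize with the deferred target task before reading c on the host.
    #pragma omp taskwait

    printf("c[0] = %f, host_sum = %f\n", c[0], host_sum);
    return 0;
}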
Example
To illustrate the benefit of using nowait on a target region, consider a scenario with two computational kernels:
Kernel A is a GPU-friendly computation. It has two nested loops where the inner loop performs 20 iterations of transcendental operations like sinf, expf, and fused arithmetic. These operations are compute-bound, high-throughput, and highly vectorizable, making them ideal for GPU execution.
#pragma omp parallel for
for (int i = 0; i < N_A; ++i) {
    float acc = a[i];
    for (int k = 0; k < 20; ++k)
        acc = sinf(acc) * expf(acc) + acc * 1.01f;
    res[i] = acc;
}
Kernel B, in contrast, performs lightweight filtering and accumulation over a smaller array. It involves a conditional and a simple reduction. This computation is memory-bound, branchy, and low in arithmetic intensity, making it better suited to execution on the CPU, which handles branching and irregular control flow more efficiently.
for (int i = 0; i < N_B; ++i) {
    float val = b[i];
    if (val > 0.5f) count++;
    sum += val * 0.1f;
}
Given that the two kernels operate on disjoint data structures, they can be executed independently and in parallel. This motivates a hybrid implementation where Kernel A is offloaded to the GPU and Kernel B is executed on the CPU. Leveraging the nowait clause in this context enables both kernels to run concurrently, rather than sequentially, improving resource utilization.
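A sketch of this non-blocking structure is shown below. The function signature is illustrative and the directives in hybrid_non_blocking.cpp may differ in detail, but the shape is the same: offload Kernel A with nowait, run Kernel B on the host, then synchronize with taskwait.

#include <math.h>

void run_hybrid(const float *a, const float *b, float *res,
                int N_A, int N_B, float &sum, int &count) {
    // Kernel A: offloaded with nowait, so the host thread does not block here.
    #pragma omp target teams distribute parallel for nowait map(to: a[0:N_A]) map(from: res[0:N_A])
    for (int i = 0; i < N_A; ++i) {
        float acc = a[i];
        for (int k = 0; k < 20; ++k)
            acc = sinf(acc) * expf(acc) + acc * 1.01f;
        res[i] = acc;
    }

    // Kernel B: runs on the CPU while Kernel A executes on the device.
    for (int i = 0; i < N_B; ++i) {
        float val = b[i];
        if (val > 0.5f) count++;
        sum += val * 0.1f;
    }

    // Wait for the deferred target task before res is used on the host.
    #pragma omp taskwait
}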
To demonstrate the impact of nowait, the following four versions of the code were compared:
CPU-only: Both Kernel A and Kernel B run sequentially on the host.
GPU-only: Both Kernel A and Kernel B are offloaded and executed sequentially on the device.
Hybrid Blocking: Kernel A runs on the GPU and is followed by Kernel B on the CPU. The CPU waits for the GPU to finish before continuing to execute Kernel B.
Hybrid Non-blocking: Kernel A runs asynchronously on the GPU using nowait, while Kernel B runs simultaneously on the CPU.
Experiments were conducted with two versions of Kernel B, one using #pragma omp parallel for reduction(+ : sum, count) and one without it. When executing on the CPU, the parallel version was observed to be slightly slower, likely because the loop's lightweight workload does not offer enough computational intensity to offset the overhead of spawning threads and performing the reduction. As a result, the serial version was used when running Kernel B on the CPU. Conversely, when executing on the GPU, as in gpu_only.cpp, the parallel version led to improved performance. This suggests that the GPU's parallel execution model benefits from explicit parallelization even for kernels that may not see a benefit on the host. Therefore, all four configurations use the version of Kernel B that yields the best performance for that platform, making this a fair, performance-optimized comparison.
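For reference, the parallel variant of Kernel B is sketched below; it reuses b, N_B, sum, and count from the snippet above, and in gpu_only.cpp the same reduction clause would be combined with the target offload directives.

// Parallel Kernel B with an explicit reduction over sum and count.
// On the CPU this was slightly slower than the serial loop; on the GPU
// the added parallelism improved performance.
#pragma omp parallel for reduction(+ : sum, count)
for (int i = 0; i < N_B; ++i) {
    float val = b[i];
    if (val > 0.5f) count++;
    sum += val * 0.1f;
}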
Compilation command:
icpx -fiopenmp -fopenmp-targets=spir64 cpu_only.cpp -o cpu.out
icpx -fiopenmp -fopenmp-targets=spir64 gpu_only.cpp -o gpu.out
icpx -fiopenmp -fopenmp-targets=spir64 hybrid_blocking.cpp -o hybrid_blocking.out
icpx -fiopenmp -fopenmp-targets=spir64 hybrid_non_blocking.cpp -o hybrid_non_blocking.out
Example run command:
./cpu.out
./gpu.out
./hybrid_blocking.out
./hybrid_non_blocking.out
Results
The performance of the four versions running on the same system is as follows:
Version             | Time (seconds)
--------------------|---------------
CPU-only            | 36.861047
GPU-only            | 9.073009
Hybrid-blocking     | 8.683631
Hybrid-non-blocking | 6.531662
As expected, by using nowait, the Hybrid-non-blocking version outperforms the other three versions. It is:
5.64× faster than the CPU-only version
1.39× faster than the GPU-only version
1.32× faster than the Hybrid-blocking version
To validate this analysis, the performance of each kernel was also measured separately on the GPU and on the CPU. The results are as follows:
Kernel   | CPU Time (sec) | GPU Time (sec)
---------|----------------|---------------
Kernel A | 36.658759      | 5.098709
Kernel B | 0.342394       | 3.674300
Based on the per-kernel, per-device performance, it is evident that offloading Kernel A is a wise choice, while running Kernel B on the CPU is better for overall performance. By using nowait, you are able to exploit task-level parallelism and achieve performance gains by overlapping independent computations across heterogeneous resources.
Nowait with Optional Argument
OpenMP 6.0 added support for an optional Boolean argument on the nowait clause. This feature provides the flexibility to decide dynamically at run time, based on some condition, whether to execute the kernel on the device synchronously or asynchronously.
If the argument to nowait is omitted, the behavior defaults to asynchronous execution (equivalent to the classic nowait semantics).
If the argument is present, the region executes asynchronously if the expression evaluates to .true., and synchronously if it evaluates to .false., enabling runtime control over synchronization.
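For example, a minimal C/C++ sketch (this assumes compiler support for the OpenMP 6.0 nowait argument; the function name and the size threshold are placeholders):

#include <math.h>

void kernel_a_conditional(const float *a, float *res, int N_A) {
    // Offload asynchronously only when the problem is large enough to
    // make overlapping with host work worthwhile.
    int use_async = (N_A > 100000);

    // Asynchronous when use_async evaluates to true, synchronous otherwise.
    #pragma omp target teams distribute parallel for nowait(use_async) map(to: a[0:N_A]) map(from: res[0:N_A])
    for (int i = 0; i < N_A; ++i)
        res[i] = sinf(a[i]) * expf(a[i]);

    if (use_async) {
        // Independent host work could overlap with the device here.
        #pragma omp taskwait   // synchronize before res is read on the host
    }
}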
Fortran
The same nowait example shown above may be written in Fortran as follows.
program hybrid_non_blocking
   use omp_lib
   implicit none

   integer, parameter :: N_A = 1000000, N_B = 500000, NUM_ITERATIONS = 10000
   real(4), allocatable :: a(:), b(:), res(:)
   real(4) :: sum
   integer :: count, i, j
   real(8) :: start, mid, end

   allocate(a(N_A), b(N_B), res(N_A))
   sum = 0.0
   count = 0

   do i = 1, N_A
      a(i) = i * 0.0001
   end do

   do i = 1, N_B
      b(i) = mod(i, 1000) * 0.001
   end do

   ! Dummy target region to warm up GPU
   !$omp target
   !$omp end target

   call omp_set_default_device(0)
   call cpu_time(start)

   do j = 1, NUM_ITERATIONS
      ! Kernel A: offloaded asynchronously to the GPU
      !$omp target teams distribute parallel do nowait map(to: a(1:N_A)) map(from: res(1:N_A))
      do i = 1, N_A
         res(i) = a(i)
         res(i) = res(i) + sin(res(i)) * exp(res(i))
         res(i) = res(i) * 1.01
      end do

      ! Kernel B: runs on the CPU while Kernel A executes on the device
      do i = 1, N_B
         if (b(i) > 0.5) then
            count = count + 1
         end if
         sum = sum + b(i) * 0.1
      end do

      ! Wait for the deferred target task before the next iteration
      !$omp taskwait
   end do

   call cpu_time(end)
   print *, "Hybrid Non-blocking: ", end - start, " seconds"

   deallocate(a, b, res)
end program hybrid_non_blocking
Compilation command:
ifx -fiopenmp -fopenmp-targets=spir64 hybrid_non_blocking_f.f90 -o hybrid_non_blocking_f.out
Example run command:
./hybrid_non_blocking_f.out