Making Better Use of OpenMP Constructs
Reduce Synchronizations Using nowait
A nowait clause on a target construct helps eliminate unnecessary synchronizations between host and device code. By default, a target region includes an implicit barrier at the end, meaning that the host thread waits for the device region to complete before proceeding. When nowait is added to a target directive, this barrier is removed. The host thread is free to continue execution immediately, while the offloaded computation proceeds asynchronously on the device.
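As a minimal sketch of this pattern (the array sizes, contents, and the host-side summation below are illustrative only, not part of the example that follows):

#include <stdio.h>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], c[1 << 20];
    for (int i = 0; i < n; ++i)
        a[i] = i * 0.001f;

    // nowait removes the implicit wait at the end of the target region:
    // the host thread continues immediately while the device computes.
    #pragma omp target teams distribute parallel for nowait map(to: a[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = 2.0f * a[i];

    // Independent host work overlaps with the offloaded computation.
    double host_sum = 0.0;
    for (int i = 0; i < n; ++i)
        host_sum += a[i];

    // Synchronize with the deferred target task before reading c on the host.
    #pragma omp taskwait

    printf("c[0] = %f, host_sum = %f\n", c[0], host_sum);
    return 0;
}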
Example
To illustrate the benefit of using nowait on a target region, consider a scenario with two computational kernels:
Kernel A is a GPU-friendly computation. It has two nested loops where the inner loop performs 20 iterations of transcendental operations like sinf, expf, and fused arithmetic. These operations are compute-bound, high-throughput, and highly vectorizable, making them ideal for GPU execution.
#pragma omp parallel for
for (int i = 0; i < N_A; ++i) {
    float acc = a[i];
    for (int k = 0; k < 20; ++k)
        acc = sinf(acc) * expf(acc) + acc * 1.01f;
    res[i] = acc;
}
Kernel B, in contrast, performs lightweight filtering and accumulation over a smaller array. It involves a conditional and a simple reduction. This computation is memory-bound, branchy, and low in arithmetic intensity, making it better suited to execution on the CPU, which handles branching and irregular control flow more efficiently.
for (int i = 0; i < N_B; ++i) {
    float val = b[i];
    if (val > 0.5f) count++;
    sum += val * 0.1f;
}
Given that the two kernels operate on disjoint data structures, they can be executed independently and in parallel. This motivates a hybrid implementation where Kernel A is offloaded to the GPU and Kernel B is executed on the CPU. Leveraging the nowait clause in this context enables both kernels to run concurrently, rather than sequentially, improving resource utilization.
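A sketch of this non-blocking structure is shown below. The function signature is illustrative and the directives in hybrid_non_blocking.cpp may differ in detail, but the shape is the same: offload Kernel A with nowait, run Kernel B on the host, then synchronize with taskwait.

#include <math.h>

void run_hybrid(const float *a, const float *b, float *res,
                int N_A, int N_B, float &sum, int &count) {
    // Kernel A: offloaded with nowait, so the host thread does not block here.
    #pragma omp target teams distribute parallel for nowait map(to: a[0:N_A]) map(from: res[0:N_A])
    for (int i = 0; i < N_A; ++i) {
        float acc = a[i];
        for (int k = 0; k < 20; ++k)
            acc = sinf(acc) * expf(acc) + acc * 1.01f;
        res[i] = acc;
    }

    // Kernel B: runs on the CPU while Kernel A executes on the device.
    for (int i = 0; i < N_B; ++i) {
        float val = b[i];
        if (val > 0.5f) count++;
        sum += val * 0.1f;
    }

    // Wait for the deferred target task before res is used on the host.
    #pragma omp taskwait
}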
To demonstrate the impact of nowait, the following four versions of the code were compared:
CPU-only: Both Kernel A and Kernel B run sequentially on the host.
GPU-only: Both Kernel A and Kernel B are offloaded and executed sequentially on the device.
Hybrid Blocking: Kernel A runs on the GPU and is followed by Kernel B on the CPU. The CPU waits for the GPU to finish before continuing to execute Kernel B.
Hybrid Non-blocking: Kernel A runs asynchronously on the GPU using nowait, while Kernel B runs simultaneously on the CPU.
Experiments were conducted with two versions of Kernel B, one using #pragma omp parallel for reduction(+ : sum, count) and one without it. When executing on the CPU, the parallel version was observed to be slightly slower, likely because the loop's lightweight workload does not offer enough computational intensity to offset the overhead of spawning threads and performing the reduction. As a result, the serial version was used when running Kernel B on the CPU. Conversely, when executing on the GPU, as in gpu_only.cpp, the parallel version led to improved performance. This suggests that the GPU's parallel execution model benefits from explicit parallelization even for kernels that may not see a benefit on the host. Therefore, all four configurations use the version of Kernel B that yields the best performance for that platform, making this a fair, performance-optimized comparison.
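For reference, the parallel variant of Kernel B is sketched below; it reuses b, N_B, sum, and count from the snippet above, and in gpu_only.cpp the same reduction clause would be combined with the target offload directives.

// Parallel Kernel B with an explicit reduction over sum and count.
// On the CPU this was slightly slower than the serial loop; on the GPU
// the added parallelism improved performance.
#pragma omp parallel for reduction(+ : sum, count)
for (int i = 0; i < N_B; ++i) {
    float val = b[i];
    if (val > 0.5f) count++;
    sum += val * 0.1f;
}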
Compilation command:
icpx -fiopenmp -fopenmp-targets=spir64 cpu_only.cpp -o cpu.out
icpx -fiopenmp -fopenmp-targets=spir64 gpu_only.cpp -o gpu.out
icpx -fiopenmp -fopenmp-targets=spir64 hybrid_blocking.cpp -o hybrid_blocking.out
icpx -fiopenmp -fopenmp-targets=spir64 hybrid_non_blocking.cpp -o hybrid_non_blocking.out
Example run command:
./cpu.out
./gpu.out
./hybrid_blocking.out
./hybrid_non_blocking.out
Results
The performance of the four versions running on the same system is as follows:
Version             | Time (seconds)
--------------------|---------------
CPU-only            | 36.861047
GPU-only            | 9.073009
Hybrid-blocking     | 8.683631
Hybrid-non-blocking | 6.531662
As expected, by using nowait, the Hybrid-non-blocking version outperforms the other three versions. It is:
5.64× faster than the CPU-only version
1.39× faster than the GPU-only version
1.32× faster than the Hybrid-blocking version
To validate this analysis, the performance of each kernel was also measured separately on the GPU and on the CPU. The results are as follows:
Kernel   | CPU Time (sec) | GPU Time (sec)
---------|----------------|---------------
Kernel A | 36.658759      | 5.098709
Kernel B | 0.342394       | 3.674300
Based on the per-kernel, per-device performance, it is evident that offloading Kernel A is a wise choice, while running Kernel B on the CPU is better for overall performance. By using nowait, you are able to exploit task-level parallelism and achieve performance gains by overlapping independent computations across heterogeneous resources.
Nowait with Optional Argument
OpenMP 6.0 added support for an optional Boolean argument on the nowait clause. This feature provides the flexibility to decide dynamically at run time, based on some condition, whether to execute the kernel on the device synchronously or asynchronously.
If the argument to nowait is omitted, the behavior defaults to asynchronous execution (equivalent to the classic nowait semantics).
If the argument is present, the region executes asynchronously if the expression evaluates to .true., and synchronously if it evaluates to .false., enabling runtime control over synchronization.
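For example, a minimal C/C++ sketch (this assumes compiler support for the OpenMP 6.0 nowait argument; the function name and the size threshold are placeholders):

#include <math.h>

void kernel_a_conditional(const float *a, float *res, int N_A) {
    // Offload asynchronously only when the problem is large enough to
    // make overlapping with host work worthwhile.
    int use_async = (N_A > 100000);

    // Asynchronous when use_async evaluates to true, synchronous otherwise.
    #pragma omp target teams distribute parallel for nowait(use_async) map(to: a[0:N_A]) map(from: res[0:N_A])
    for (int i = 0; i < N_A; ++i)
        res[i] = sinf(a[i]) * expf(a[i]);

    if (use_async) {
        // Independent host work could overlap with the device here.
        #pragma omp taskwait   // synchronize before res is read on the host
    }
}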
Fortran
The same nowait example shown above may be written in Fortran as follows.
program hybrid_non_blocking
   use omp_lib
   implicit none

   integer, parameter :: N_A = 1000000, N_B = 500000, NUM_ITERATIONS = 10000
   real(4), allocatable :: a(:), b(:), res(:)
   real(4) :: sum
   integer :: count, i, j
   real(8) :: start, mid, end

   allocate(a(N_A), b(N_B), res(N_A))
   sum = 0.0
   count = 0

   do i = 1, N_A
      a(i) = i * 0.0001
   end do

   do i = 1, N_B
      b(i) = mod(i, 1000) * 0.001
   end do

   ! Dummy target region to warm up GPU
   !$omp target
   !$omp end target

   call omp_set_default_device(0)
   call cpu_time(start)

   do j = 1, NUM_ITERATIONS
      ! Kernel A: offloaded asynchronously to the GPU
      !$omp target teams distribute parallel do nowait map(to: a(1:N_A)) map(from: res(1:N_A))
      do i = 1, N_A
         res(i) = a(i)
         res(i) = res(i) + sin(res(i)) * exp(res(i))
         res(i) = res(i) * 1.01
      end do

      ! Kernel B: runs on the CPU while Kernel A executes on the device
      do i = 1, N_B
         if (b(i) > 0.5) then
            count = count + 1
         end if
         sum = sum + b(i) * 0.1
      end do

      ! Wait for the deferred target task before the next iteration
      !$omp taskwait
   end do

   call cpu_time(end)
   print *, "Hybrid Non-blocking: ", end - start, " seconds"

   deallocate(a, b, res)
end program hybrid_non_blocking
Compilation command:
ifx -fiopenmp -fopenmp-targets=spir64 hybrid_non_blocking_f.f90 -o hybrid_non_blocking_f.out
Example run command:
./hybrid_non_blocking_f.out