5.8.4. No Stalls, Low Occupancy Percentage, and Low Bandwidth
In this example, the access to dst executes once every 20 iterations of the FACTOR2 loop and once every four iterations of the FACTOR1 loop. Therefore, the FACTOR2 loop is the source of the bottleneck.
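The kernel that this paragraph describes is not reproduced here, so the following plain C model is only a sketch of the loop shape those ratios imply. It assumes FACTOR1 = 4 and FACTOR2 = 5 (inferred from 20 / 4 = 5), a FACTOR2 loop nested inside a FACTOR1 loop, and a single write to dst after both loops finish. The names `store_counts` and `count_store_ratio` are hypothetical.

```c
/* Assumed values, inferred from the ratios in the text (20 / 4 = 5). */
#define FACTOR1 4
#define FACTOR2 5

/* Hypothetical tally of how often each loop level runs. */
struct store_counts {
    long factor1_iters;  /* iterations of the FACTOR1 loop */
    long factor2_iters;  /* iterations of the FACTOR2 loop */
    long dst_stores;     /* times the dst write executes */
};

/* Walk the assumed loop nest `outer` times and count iterations per level. */
struct store_counts count_store_ratio(int outer)
{
    struct store_counts c = {0, 0, 0};
    for (int i = 0; i < outer; i++) {
        for (int j = 0; j < FACTOR1; j++) {
            c.factor1_iters++;
            for (int k = 0; k < FACTOR2; k++) {
                c.factor2_iters++;  /* accumulation work would go here */
            }
        }
        c.dst_stores++;  /* assumption: dst is written only after both loops finish */
    }
    return c;
}
```

Under these assumptions, dividing `factor2_iters` and `factor1_iters` by `dst_stores` gives 20 and 4, which matches the ratios quoted above.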
Solutions for resolving loop bottlenecks:
- Unroll the FACTOR1 and FACTOR2 loops evenly. Simply unrolling the FACTOR2 loop further does not resolve the bottleneck.
- Vectorize your kernel to allow multiple work-items to execute during each loop iteration.
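To see why the first solution asks for both loops to be unrolled, the sketch below counts how many sequential loop trips remain between consecutive dst accesses for a given pair of unroll factors. It is a rough count in plain C, not a model of the hardware the compiler generates. It assumes trip counts of FACTOR1 = 4 and FACTOR2 = 5 (inferred from the 20:4 ratio in the text), and the function names are hypothetical. It does not cover the vectorization option.

```c
/* Assumed trip counts, inferred from the ratios in the text. */
#define FACTOR1 4
#define FACTOR2 5

/* Loop trips left after unrolling by `unroll`; a partial last trip counts as one. */
static int trips_after_unroll(int trip_count, int unroll)
{
    return (trip_count + unroll - 1) / unroll;
}

/* Sequential loop trips between consecutive dst accesses for the given unroll factors. */
int trips_per_store(int unroll1, int unroll2)
{
    return trips_after_unroll(FACTOR1, unroll1) *
           trips_after_unroll(FACTOR2, unroll2);
}
```

With no unrolling, `trips_per_store(1, 1)` is 20. Fully unrolling only the FACTOR2 loop, `trips_per_store(1, 5)`, still leaves 4 trips per dst access, whereas unrolling both loops, `trips_per_store(4, 5)`, brings the count down to 1.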