5.8.5. No Stalls, High Occupancy Percentage, and Low Bandwidth
In this example, the accelerator board can provide a bandwidth of 25600 megabytes per second (MB/s). However, the vector_add kernel is requesting (2 reads + 1 write) x 4 bytes x 294 MHz = 12 bytes/cycle x 294 MHz = 3528 GB/s, which is 14% of the available bandwidth. To increase the bandwidth, increase the number of tasks performed in each clock cycle.
Solutions for low bandwidth:
- Automatically or manually vectorize the kernel to make wider requests
- Unroll the innermost loop to make more requests per clock cycle
- Delegate some of the tasks to another kernel
Did you find the information on this page useful?