If your OpenCL™ kernel fails to generate hardware even though the estimated resources are low, the failure may be due to excessive unrolling of loops that access global memory.
Do not unroll a loop that accesses global memory so far that the combined read or write to global memory becomes wider than the memory interface in the BSP. Exceeding that width causes contention and routing congestion, and may result in compilation failure.
The width of the external memory interfaces can be found in the board_spec.xml file in the OpenCL™ BSP. Here is an example from the board_spec.xml of the Arria 10 GX development kit BSP (a10_ref):
<!-- DDR4-2400 -->
<global_mem name="DDR" max_bandwidth="19200" interleaved_bytes="1024" config_addr="0x018">
<interface name="board" port="kernel_mem0" type="slave" width="512" maxburst="16" address="0x00000000" size="0x80000000" latency="240"/>
</global_mem>
The external memory interface on this BSP is 512 bits wide (width="512"). Therefore, if a loop accesses global memory as 32-bit integers, the loop should not be unrolled by a factor of more than 16 (512 / 32 = 16).
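The same arithmetic applies to other element sizes. The following is a minimal sketch in plain C, not part of the BSP or the SDK; the function name max_unroll_factor is hypothetical and is used only to illustrate the division.

```c
#include <assert.h>

/* Hypothetical helper: the largest unroll factor for which one unrolled
 * iteration group still fits in a single memory-interface word.
 * interface_bits is the width attribute from board_spec.xml;
 * element_bits is the size of the type accessed in global memory. */
unsigned max_unroll_factor(unsigned interface_bits, unsigned element_bits)
{
    assert(element_bits > 0 && element_bits <= interface_bits);
    return interface_bits / element_bits;
}
```

For example, with the 512-bit interface above, 32-bit integers give a factor of 16 and 64-bit values give a factor of 8.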
If the original loop count is not a multiple of 16:
1. Round the loop count up to the next multiple of 16.
2. Make any on-chip memories used in the loop large enough to accommodate the new loop count.
3. Use conditionals to prevent reads or writes to global memory for iterations beyond the original loop count.
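The three steps above can be sketched as follows. This is an illustration in plain C under assumed values, not code from the BSP: the names N, N_PADDED, UNROLL and process_padded are hypothetical, the original loop count is assumed to be 100, and the local array stands in for an on-chip memory. In a kernel, the unroll pragma shown in the comments would be applied to the loops.

```c
#include <assert.h>
#include <stddef.h>

enum {
    UNROLL   = 16,                                   /* 512-bit interface / 32-bit int */
    N        = 100,                                  /* assumed original loop count */
    N_PADDED = (N + UNROLL - 1) / UNROLL * UNROLL    /* step 1: rounds up to 112 */
};

/* Doubles N integers. Both loops run N_PADDED times so that the trip
 * count is a multiple of the unroll factor. */
void process_padded(const int *in, int *out)
{
    assert(in != NULL && out != NULL);

    int local_buf[N_PADDED];   /* step 2: sized for the padded loop count */

    /* In a kernel: #pragma unroll 16 */
    for (int i = 0; i < N_PADDED; i++) {
        /* step 3: no read from global memory past the original count */
        local_buf[i] = (i < N) ? in[i] : 0;
    }

    /* In a kernel: #pragma unroll 16 */
    for (int i = 0; i < N_PADDED; i++) {
        /* step 3: no write to global memory past the original count */
        if (i < N) {
            out[i] = local_buf[i] * 2;
        }
    }
}
```

The guards keep the padded iterations from touching memory outside the original buffers, while the padded trip count lets every group of 16 iterations map onto one full-width access.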