In version 16.1, the following code behaved as expected: the outer loop was serialized because of its dependencies, and the #pragma ivdep removed the dependencies of the inner loop only.
// This loop gets serialized due to true dependencies with the inner loop
for (unsigned char x = 0; x < 4; x++) {
    // Inner loop has no inter-iteration dependencies, but depends on the outer loop
    #pragma ivdep
    for (unsigned char y = 0; y < 64; y++) {
        ...
    }
}
In version 17.0, the #pragma ivdep is applied to both the inner and the outer loop, so the compiler no longer accounts for the dependencies in the outer loop. As a result, similar code may produce incorrect results in hardware even though it works in emulation.
Workaround:
1. Add an extra argument "dummy" to the kernel. On the host side, always pass 1 for this dummy argument.
BEFORE
__kernel void my_kernel(
__global cpx_t* restrict input,
__global cpx_t* restrict result)
AFTER
__kernel void my_kernel(
__global cpx_t* restrict input,
__global cpx_t* restrict result,
int dummy)
2. In the loop nest, wrap the inner loop in "if (dummy)":
// This loop gets serialized due to true dependencies
for (unsigned char x = 0; x < 4; x++) {
    if (dummy) {
        // No dependencies within each set of 64 iterations
        #pragma ivdep
        for (unsigned char y = 0; y < 64; y++) {
            ...
        }
    }
}
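The sketch below is a plain C emulation of the two loop nests, not the original kernel. It is meant only to illustrate that wrapping the inner loop in "if (dummy)" leaves the results unchanged when the host passes 1. The loop body (buf[y] = 2 * buf[y] + 1), the buffer, and the function names nest_original, nest_workaround and nests_match are hypothetical; the body was chosen so that each outer pass reads what the previous pass wrote (a true outer-loop dependency) while the 64 inner iterations touch distinct elements. The #pragma ivdep appears only as a comment, since it is meaningful to the FPGA compiler and not to a regular C compiler.

```c
#include <string.h>

#define OUTER 4
#define INNER 64

/* Original loop nest: the outer loop carries a true dependency because
   each pass reads the values the previous pass wrote to buf. */
static void nest_original(int buf[INNER])
{
    for (unsigned char x = 0; x < OUTER; x++) {
        /* In the kernel, #pragma ivdep would be placed here. */
        for (unsigned char y = 0; y < INNER; y++) {
            buf[y] = 2 * buf[y] + 1; /* hypothetical loop body */
        }
    }
}

/* Workaround nest: identical, except the inner loop is wrapped in
   "if (dummy)". The host is expected to always pass dummy == 1. */
static void nest_workaround(int buf[INNER], int dummy)
{
    for (unsigned char x = 0; x < OUTER; x++) {
        if (dummy) {
            /* In the kernel, #pragma ivdep would be placed here. */
            for (unsigned char y = 0; y < INNER; y++) {
                buf[y] = 2 * buf[y] + 1; /* same hypothetical body */
            }
        }
    }
}

/* Returns 1 when both nests give identical results for dummy == 1. */
static int nests_match(void)
{
    int a[INNER] = {0};
    int b[INNER] = {0};

    nest_original(a);
    nest_workaround(b, 1);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling nests_match() from a test harness is one way to confirm in emulation that the rewritten nest is functionally equivalent before rebuilding for hardware. Note that passing 0 for dummy would skip the inner loop entirely, which is why the host must always pass 1.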
This issue is scheduled to be fixed in a future version of the Intel® FPGA SDK for OpenCL™.