ID 683846
Date 12/19/2022
Public

## 8.7. Discrepancies in Hardware and Emulator Results

When you emulate a kernel, your OpenCL system might produce results different from that of the kernel compiled for hardware. You can further debug your kernel before you compile for hardware by running your kernel through simulation.

Warning: These discrepancies usually occur when the Intel FPGA SDK for OpenCL Emulator is unable to model some aspects of the hardware computation accurately, or when your program relies on an undefined behavior.

The most common reasons for differences in emulator and hardware results are as follows:

• Your OpenCL kernel code is using the #pragma ivdep directive. The Emulator does not model your OpenCL system when a true dependence is broken by a pragma ivdep directive. During a full hardware compilation, you observe this as an incorrect result.
• Your OpenCL kernel code is relying on uninitialized data. Examples of uninitialized data include uninitialized variables and uninitialized or partially initialized global buffers, local arrays, and private arrays.
• Your OpenCL kernel code behavior depends on the precise results of floating point operations. The Emulator uses floating point computation hardware of the CPU whereas the hardware run uses floating point cores implemented as FPGA cores. The use of -ffp-reassociate aoc option in your OpenCL kernel code might change the order of operations leading to further divergence in the floating point results.
Note: The OpenCL standard allows one or more least significant bits of floating point computations to differ between platforms, while still being considered correct on both such platforms.
• Your OpenCL kernel code behavior depends on the order of channel accesses in different kernels. The emulation of channel behavior has limitations, especially for conditional channel operations where the kernel does not call the channel operation in every loop iteration. In such cases, the Emulator might execute channel operations in an order different from that on the hardware.
• Your OpenCL kernel or host code is accessing global memory buffers out-of-bounds.
Attention:
• Uninitialized memory read and write behaviors are platform-dependent. Verify sizes of your global memory buffers when using all addresses within kernels, allocating clCreateBuffer function call, and transferring clEnqueueReadBuffer and clEnqueueWriteBuffer function calls.
• You can use software memory leak detection tools, such as Valgrind, on the emulated version of your OpenCL system to analyze memory related problems. Absence of warnings from such tools does not mean the absence of problems. It only means that the tool could not detect any problem. In such a scenario, Intel recommends manual verification of your OpenCL kernel or host code.
• Your OpenCL kernel code is accessing local or private variables out-of-bounds. For example, accessing a local or private array out-of-bounds or accessing a private variable after it has gone out of scope.
Attention: In software terms, these issues are referred to as stack corruption issues because accessing variables out-of-bounds usually affects unrelated variables located close to the variable being accessed on a software stack. Emulated OpenCL kernels are implemented as regular CPU functions, and have an actual stack that can be corrupted. When targeting hardware, no stack exists and hence, the stack corruption issues are guaranteed to manifest differently. You may use memory leak analyzer tools, such as Valgrind, when a stack corruption is suspected. However, stack related issues are usually difficult to identify. Intel recommends manual verification of your OpenCL kernel code to debug a stack related issue.
• Your OpenCL kernel code is using shifts that are larger than the type being shifted. For example, shifting a 64-bit integer by 65 bits. According to the OpenCL specification version 1.0, the behavior of such shifts is undefined.
• When you compile your OpenCL kernel for emulation, the default channel depth is different from the default channel depth generated when your kernel is compiled for hardware. This difference in channel depths might lead to scenarios where execution on the hardware hangs while kernel emulation works without any issue. Refer to Emulating Channel Depth for information on how to fix the channel depth difference.
• In terms of ordering the printed lines, the output of the printf function might be ordered differently on the Emulator and hardware. This is because, in the hardware, printf data is stored in a global memory buffer and flushed from the buffer only when the kernel execution is complete, or when the buffer is full. In the Emulator, the printf function uses the x86 stdout.
• If you perform an unaligned load/store through upcasting of types, the FPGA and emulator might produce different results. A load/store of this type is undefined in the C99 specification.
For example, the following operation might produce unexpected results:
int tmp = *((int *) (my_ptr + 5));