Comparing OpenCL™ Kernel Performance with Performance of Native Code
- Wrap exactly the same set of operations in the timed region of both the native and the OpenCL versions.
- Do not include program build time in the kernel execution time. You can amortize this step by precompiling the program (refer to clCreateProgramWithBinary).
- Track data transfer costs separately. Also, use data mapping when possible, since this is closer to the way data is passed in native code (by pointers). Refer to the “Mapping Memory Objects” section for more information.
- Ensure the working set is identical for native and OpenCL code. Similarly, for a correct performance comparison, access patterns should be the same (for example, both versions traverse the data by rows, rather than one by rows and the other by columns).
- Demand the same accuracy. For example, rsqrt(x) is inherently more accurate than the _mm_rsqrt_ps SSE intrinsic. To use the same accuracy in native code and OpenCL code, do one of the following:
  - Follow _mm_rsqrt_ps in your native code with a couple of additional Newton-Raphson iterations to match the precision of OpenCL rsqrt.
  - Use native_rsqrt in your OpenCL kernel, which maps to the rsqrtps instruction in the final assembly code.
  - Use the relaxed-math compilation flag (-cl-fast-relaxed-math) to enable similar accuracy for the whole program. As with rsqrt, there are relaxed versions for rcp, sqrt, and other functions. Refer to the User Manual - OpenCL™ Code Builder for the full list.
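
The first option under the accuracy point above can be sketched in C with SSE intrinsics. This is a minimal sketch, assuming an x86 target with SSE. The names rsqrt_nr_step and rsqrt_refined are hypothetical helpers, not part of any library. The number of iterations needed to match OpenCL rsqrt on a given platform is left as a parameter, since it should be confirmed by measurement rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ps, _mm_mul_ps, ... */

/* One Newton-Raphson step for y ~ 1/sqrt(x):
 *   y_next = y * (1.5 - 0.5 * x * y * y)                     */
static inline __m128 rsqrt_nr_step(__m128 x, __m128 y)
{
    const __m128 half         = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}

/* Hypothetical helper: reciprocal square root of n positive floats.
 * Starts from the low-precision rsqrtps estimate and refines it
 * `iterations` times. n must be a multiple of 4.               */
void rsqrt_refined(const float *in, float *out, size_t n, int iterations)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        __m128 y = _mm_rsqrt_ps(x);          /* approximate estimate */
        for (int k = 0; k < iterations; ++k)
            y = rsqrt_nr_step(x, y);         /* each step roughly doubles the accurate bits */
        _mm_storeu_ps(out + i, y);
    }
}
```

Calling rsqrt_refined(in, out, n, 0) gives the raw intrinsic result, which is the fair counterpart of native_rsqrt. A nonzero iteration count is the counterpart of the full-precision OpenCL rsqrt.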
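
The access-pattern point above can be illustrated with two native loops over the same buffer. This is only a sketch with hypothetical function names: both functions read the same working set, but one walks it by rows and the other by columns. Timing one of them against an OpenCL kernel that uses the other traversal would mostly measure memory behavior rather than the difference between the implementations.

```c
#include <stddef.h>

/* Row-wise traversal of a row-major buffer:
 * consecutive iterations touch adjacent addresses. */
double sum_by_rows(const float *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += a[r * cols + c];
    return s;
}

/* Column-wise traversal of the same row-major buffer:
 * consecutive iterations are `cols` floats apart. */
double sum_by_columns(const float *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += a[r * cols + c];
    return s;
}
```

For a fair comparison, pick the native variant whose traversal order matches the one the OpenCL kernel performs across its work-items.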