Avoiding Needless Synchronization
For best results, avoid explicit command synchronization primitives (such as clEnqueueMarker or clEnqueueBarrier). Explicit synchronization commands and event tracking result in cross-module round trips, which decrease performance. The less you use explicit synchronization commands, the better the performance.
Use the following techniques to reduce explicit synchronization:
- Continue executing kernels until you really need to read the results; this idiom is best expressed with an in-order queue and a blocking call to clEnqueueMapXXX or clEnqueueReadXXX.
- If an in-order queue expresses the dependency chain correctly, exploit the in-order queue rather than defining an event-driven string of dependent kernels. In the in-order execution model, the commands in a queue are automatically executed back-to-back, in the order of submission. This suits the typical case of a processing pipeline well. Consider the following recommendations:
  - Avoid any host intervention in the in-order queue (such as blocking calls) and the additional synchronization costs it brings.
  - When you have to block, use the blocking OpenCL™ API calls, which are more effective than explicit synchronization schemes based on OS synchronization primitives.
- If you are optimizing the kernel pipeline, first measure kernels separately to find the most time-consuming one. Avoid calling clFinish or clWaitForEvents frequently (for example, after each kernel invocation) in the final pipeline version. Submit the whole sequence (to the in-order queue) and issue clFinish (or wait on the event) once. This reduces host-device round trips.
- Consider the OpenCL 2.0 enqueue_kernel feature, which allows a kernel to enqueue further kernels to the same device without host interaction. Note that this approach is useful not just for recursive kernels, but also for regular non-recursive chains of lightweight kernels. Refer to the See Also section below.
See Also