It is important to schedule the command queue for each device asynchronously. Enqueue multiple kernels from the host first, then flush the queues so that the kernels begin executing on the devices, and finally wait for the results. Refer to the "Synchronization Caveats" section for more information.
Another approach is to use a separate host thread for the GPU command queue. Specifically, you can dedicate a physical CPU core to scheduling GPU tasks. To reserve a core, you can use the device fission extension, which can prevent GPU starvation in some cases. Refer to the User Manual - OpenCL™ Code Builder for more information on the device fission extension.
Consider experimenting, as various trade-offs are possible.