As discussed in the Striving for Performance section, drawing performance conclusions from the execution time of individual kernels might be misleading. In most cases, the larger the subgraph you assign to an accelerator, the better the communication costs are amortized.
Generally, GPU performance is better on large images. So if the amount of work is too small (less than 1 ms of execution time), run the graph on the CPU device instead, or fuse kernels.
Notice that using the GPU target introduces a one-time overhead (on the order of a few seconds) for compiling the OpenCL™ kernels. The compilation happens upon OpenVX* context creation and does not affect graph execution.
A typical strategy to start with is to test the CPU-only and GPU-only scenarios first (section 9.2). Beware of situations where some nodes are not supported by a particular target (refer to the Kernel Extensions document for the kernel support matrix). In this case, the only option is to schedule nodes individually and search for an optimal split by scheduling subgraphs (or independent branches of the graph) to different targets.
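Scheduling nodes individually can be sketched with the standard OpenVX `vxSetNodeTarget` call. The target strings below ("intel.cpu", "intel.gpu") are illustrative assumptions; consult your implementation's documentation for the exact names it accepts.

```c
/* Sketch: pinning individual OpenVX nodes to specific targets.
 * Requires an OpenVX 1.1+ implementation. The target strings are
 * implementation-specific assumptions, not guaranteed names. */
#include <VX/vx.h>

vx_status assign_targets(vx_graph graph, vx_node cpu_node, vx_node gpu_node)
{
    /* Run one node on the CPU ... */
    vxSetNodeTarget(cpu_node, VX_TARGET_STRING, "intel.cpu");
    /* ... and another on the GPU. */
    vxSetNodeTarget(gpu_node, VX_TARGET_STRING, "intel.gpu");

    /* Re-verify the graph after changing node targets. */
    return vxVerifyGraph(graph);
}
```

If a node's kernel is not supported on the requested target, `vxSetNodeTarget` or the subsequent verification reports an error, which is a quick way to discover the support matrix empirically.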
For scenarios where CPU and GPU targets are mixed in the graph, it is recommended to try enabling GPU tiling (which is set to OFF by default). That might unleash additional (data) parallelism between the two devices:
$ export VX_CL_TILED_MODE=1
For GPU-only scenarios, the option should be reverted to OFF.
It is advised to do performance analysis (see the next chapter) to determine the “hotspot” nodes, which should be the first candidates for offloading to additional targets. At the same time, it is often more efficient to offload a reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other run-time overheads.
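Offloading a contiguous run of nodes, rather than a single hotspot, can be sketched as below. The helper and the target string "intel.gpu" are hypothetical; the point is that the whole sequence moves together, so intermediate data stays on the device between kernels.

```c
/* Sketch: offloading a contiguous sequence of hotspot nodes to the GPU,
 * leaving the rest of the graph on the default target. Requires an
 * OpenVX 1.1+ implementation; the target string is an assumption. */
#include <VX/vx.h>
#include <stddef.h>

vx_status offload_sequence(vx_graph graph, vx_node *nodes,
                           size_t first, size_t last)
{
    /* Move the whole [first, last] run of nodes together, so data
     * remains on the GPU between them instead of bouncing to the host. */
    for (size_t i = first; i <= last; ++i)
        vxSetNodeTarget(nodes[i], VX_TARGET_STRING, "intel.gpu");

    /* The graph must be verified again after retargeting. */
    return vxVerifyGraph(graph);
}
```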
Notice that the GPU can be busy with other tasks (like rendering); similarly, the CPU can be in charge of general OS routines.
Device performance can also be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices stopping use of the Intel® Turbo Boost Technology. This might result in an overall performance decrease even in comparison to the single-device scenario.
Similarly, even in the GPU-only scenario, a high interrupt rate and frequent synchronization with the host can raise the frequency of the CPU and drag the frequency of the GPU down.