Intel OpenVX* Performance Promise
The OpenVX* approach to performance extends conventional one-off function acceleration with the notion of graphs. Graphs expose optimization opportunities that may not be available or obvious with traditional approaches. For example, in the Intel OpenVX implementation, kernels that share data are not forced to exchange it through slow global memory; instead, automatic tiling fits the data to cache. Similarly, while parallelism itself is not directly expressed in the graph, independent data flows are extracted from its structure.
OpenVX also creates a logical model where IP blocks of the Intel SoC fully share system resources such as memory; the code can be scheduled seamlessly on the block that is best able to execute it.
In addition to the global graph-level optimizations, performance of the OpenVX vision functions is also resolved via use of optimized implementation with a strong focus on a particular platform. For the CPU, this is leveraged through Intel® Integrated Performance Primitives (Intel® IPP), which has code branches for different architectures. For the GPU, the matured stack of OpenCL* Just-In-Time compilation to the particular architecture is used.
To achieve good performance, the trade-offs implicit in the OpenVX model of computation must be well understood. This section describes general considerations for OpenVX with respect to performance.
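As a minimal sketch of the graph model described above, two standard nodes can share an intermediate virtual image. Because a virtual image is not accessible to the application, the runtime is free to tile it into cache rather than materialize it in global memory. The function name build_blur_gradient_graph is illustrative; error checking is omitted for brevity:

```c
/* Sketch: a two-node graph with a virtual intermediate image.
 * Assumes an OpenVX 1.1 implementation; error checks omitted. */
#include <VX/vx.h>

vx_graph build_blur_gradient_graph(vx_context context, vx_image input,
                                   vx_image output_x, vx_image output_y)
{
    vx_graph graph = vxCreateGraph(context);

    /* Virtual image: no user access, so the runtime may keep it in cache. */
    vx_image blurred = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);

    vxGaussian3x3Node(graph, input, blurred);          /* producer */
    vxSobel3x3Node(graph, blurred, output_x, output_y); /* consumer */

    vxReleaseImage(&blurred); /* the graph keeps its own reference */
    return graph;
}
```

Note that the intermediate image is released immediately after node creation; the graph holds its own reference, which is the idiomatic OpenVX pattern.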
Use OpenVX* Graph Mode to Estimate Performance
OpenVX supports a single-function execution model called immediate mode.
NOTE: Notice that the immediate mode flavors of the vision functions (prefixed with vxu, for example, vxuGaussian3x3) still use graphs behind the scenes, each comprising just a single function. Thus, graph verification and other costs like memory allocation are included in the timing and are not amortized over multiple nodes/iterations.
Still, the immediate mode can be useful as an intermediate step, for example, when porting an application from OpenCV to OpenVX (see the Example Interoperability with OpenCV section).
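The contrast can be sketched as follows. The immediate-mode call below pays the internal single-node graph costs (verification, allocation) on every invocation, which is why it is convenient for porting but not representative of steady-state graph performance:

```c
/* Sketch: the same Gaussian blur via the immediate-mode (vxu) API.
 * Internally this builds, verifies, and runs a single-node graph. */
#include <VX/vx.h>
#include <VX/vxu.h>

void blur_immediate(vx_context context, vx_image src, vx_image dst)
{
    /* Verification and allocation costs are paid on every call. */
    vxuGaussian3x3(context, src, dst);
}
```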
Beware of Graph Verification Overheads
The graph verification step is a heavy-weight operation and should be avoided during “tight-loop” execution time. Notice that changing the meta-data (for example, size or type) of the graph inputs might invalidate the graph. Refer to the Map/Unmap for OpenVX* Images
section for some tips on updating the data.
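One way to keep verification out of the hot path, sketched here with error handling elided, is to verify the graph once and then execute it repeatedly; vxProcessGraph re-verifies only if the graph has been invalidated since:

```c
/* Sketch: verify once, execute many times. Error handling elided. */
#include <VX/vx.h>

vx_status run_graph_loop(vx_graph graph, int iterations)
{
    /* Heavy-weight one-off step: keep it out of the tight loop. */
    vx_status status = vxVerifyGraph(graph);

    for (int i = 0; i < iterations && status == VX_SUCCESS; ++i)
        status = vxProcessGraph(graph); /* steady-state execution */

    return status;
}
```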
Comparing OpenVX Performance to Native Code
When comparing OpenVX performance with native code, for example, in C/C++ or OpenCL, make sure that both versions are as similar as possible:
Wrap exactly the same set of operations.
Do not include graph verification when estimating the execution time. Graph verification is intended to be amortized over multiple iterations of graph execution.
Track data transfer costs (reading/writing images, arrays, and so on) separately. Also, use data mapping when possible, since this is closer to the way data is passed in regular code (by pointers).
Demand the same accuracy.
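The timing methodology above can be sketched as follows. The now_ms helper is a hypothetical wall-clock timer (use whatever high-resolution timer your platform provides); the point is that one-off verification is excluded and the execution cost is averaged over many iterations:

```c
/* Sketch: timing that excludes one-time verification.
 * now_ms() is a hypothetical helper returning wall-clock milliseconds. */
#include <stdio.h>
#include <VX/vx.h>

void measure_graph(vx_graph graph, int iterations)
{
    vxVerifyGraph(graph);            /* one-off cost: not timed */

    double start = now_ms();
    for (int i = 0; i < iterations; ++i)
        vxProcessGraph(graph);       /* steady-state execution: timed */
    double elapsed = now_ms() - start;

    printf("Average graph time: %.3f ms\n", elapsed / iterations);
}
```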
Enabling Performance Profiling per Node
So far, we discussed overall performance of the graph. To get per-node performance data, the OpenVX* 1.1 specification explicitly requires the application to enable performance profiling. There is a dedicated directive for that:
vx_status res = vxDirective(context, VX_DIRECTIVE_ENABLE_PERFORMANCE);
NOTE: Per-node performance profiling is enabled on the per-context basis. As it might introduce certain overheads, disable it in the production code and/or when measuring overall graph performance.
When the profiling is enabled, you can get performance information for a node:
vxQueryNode(node, VX_NODE_ATTRIBUTE_PERFORMANCE, &perf, sizeof(perf));
printf("Average exec time for node %p: %0.3lf ms\n", (void*)node, (double)perf.avg / 1000000.0);
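Putting the query together, a small helper might look like the sketch below. Per the OpenVX specification, the vx_perf_t fields are in nanoseconds; the function name print_node_perf is illustrative:

```c
/* Sketch: report per-node timing, assuming profiling was enabled
 * earlier via VX_DIRECTIVE_ENABLE_PERFORMANCE on the context. */
#include <stdio.h>
#include <VX/vx.h>

void print_node_perf(vx_node node)
{
    vx_perf_t perf;
    vxQueryNode(node, VX_NODE_ATTRIBUTE_PERFORMANCE, &perf, sizeof(perf));

    /* vx_perf_t counters are nanoseconds; convert to milliseconds. */
    printf("avg: %.3f ms, min: %.3f ms over %llu runs\n",
           (double)perf.avg / 1e6, (double)perf.min / 1e6,
           (unsigned long long)perf.num);
}
```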
NOTE: Notice that to get the performance data for nodes running on the GPU, you need to set a dedicated environment variable.
Do not deduce final performance conclusions from individual kernels. Graphs allow the runtime and pipeline manager to do certain system-level optimizations that are simply not possible under a single-function paradigm.