Performance measurements are done on a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, the minimum execution time is usually used for final projections. Projections can also be made using other measures, such as the average or the geometric mean of the execution time.
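The measures named above can be sketched as follows. This is a minimal illustration, not code from the SDK: run_kernel is a hypothetical callable that is assumed to launch the kernel and block until it completes, and the clock parameter is an assumption added so that the timer can be swapped out.

```python
import math
import time


def time_invocations(run_kernel, count, clock=time.perf_counter):
    """Call run_kernel `count` times and return each call's duration in seconds."""
    durations = []
    for _ in range(count):
        start = clock()
        run_kernel()  # hypothetical: launch the kernel and wait for it to finish
        durations.append(clock() - start)
    return durations


def summarize(durations):
    """Return the minimum, arithmetic mean, and geometric mean of the durations."""
    if not durations or min(durations) <= 0:
        raise ValueError("need at least one strictly positive duration")
    mean = sum(durations) / len(durations)
    # Geometric mean computed through logarithms to avoid overflow or underflow.
    geomean = math.exp(sum(math.log(d) for d in durations) / len(durations))
    return {"min": min(durations), "mean": mean, "geomean": geomean}
```

If the first invocation is much slower than the rest, the minimum ignores it entirely, while the average is pulled up by it and the geometric mean is pulled up less.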
An alternative to calling the kernel many times is to use a single “warm-up” run.
The warm-up run might be helpful for small or “lightweight” kernels, for example, kernels with an execution time of less than 10 milliseconds. Specifically, it helps to amortize the following potential one-time costs:
Bringing data to the cache
“Lazy” object creation
Other costs incurred by the OpenCL™ runtime.
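A warm-up run can be sketched as follows. As before, run_kernel is a hypothetical callable assumed to launch the kernel and block until it completes, and warmup_runs and clock are illustrative parameters rather than part of any OpenCL API.

```python
import time


def measure_after_warmup(run_kernel, warmup_runs=1, clock=time.perf_counter):
    """Run untimed warm-up calls, then time a single call; returns seconds."""
    for _ in range(warmup_runs):
        run_kernel()  # absorbs one-time costs; this call is deliberately not timed
    start = clock()
    run_kernel()  # the measured run, after caches and lazy objects are set up
    return clock() - start
```

Only the run after the warm-up is timed, so one-time costs such as those listed above do not appear in the reported value.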
You need to build your performance conclusions on reproducible data. If the warm-up run does not help or execution time still varies, you can try running a large number of iterations and then average the results. For time values that range too much use
For bandwidth-limited kernels that operate on data that does not fit in the last-level cache, the “warm-up” run has less impact on the measurement.
For a kernel with a small number of instructions executed over a small data set, make sure there is a sufficient number of iterations, so the kernel runs for at least 20 milliseconds.
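One way to pick the iteration count is sketched below, assuming you already have a rough single-run estimate in whole microseconds. The function names, the MIN_TOTAL_US constant, and the hypothetical run_kernel callable are illustrative assumptions; only the 20-millisecond floor comes from the text.

```python
import time

MIN_TOTAL_US = 20_000  # 20 milliseconds, the floor suggested in the text


def iterations_needed(single_run_us, min_total_us=MIN_TOTAL_US):
    """Smallest iteration count whose total run time reaches min_total_us.

    Both arguments are integers in microseconds.
    """
    if single_run_us <= 0:
        raise ValueError("single_run_us must be positive")
    return max(1, -(-min_total_us // single_run_us))  # integer ceiling division


def time_per_iteration(run_kernel, iterations, clock=time.perf_counter):
    """Time `iterations` back-to-back calls as one batch; return seconds per call."""
    start = clock()
    for _ in range(iterations):
        run_kernel()  # hypothetical: launch the kernel and wait for it to finish
    return (clock() - start) / iterations
```

For example, a kernel estimated at 150 microseconds per run needs 134 iterations to reach 20 milliseconds in total. Timing the whole batch and dividing by the iteration count also keeps the timer overhead small relative to the measured interval.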
Kernels that are very lightweight do not give reliable data, so making them artificially heavier could give you important insights into the hotspots. For example, you can add a loop in the kernel, or replicate its heavy pieces.
Refer to the “OpenCL Optimizations Tutorial” SDK sample for code examples of performing the warm-up activities before starting performance measurement. You can download the sample from the Intel® SDK for OpenCL Applications website at intel.com/software/opencl/.