Loading a model's Intermediate Representation (IR) to the GPU takes longer than loading the same model to the CPU.
Manually create a cl_cache directory in the working directory of your application.
The driver uses this directory to store binary representations of the compiled kernels. This mechanism works on all supported operating systems.
Alternatively, set the environment variable:
export INTEL_OPENCL_CACHE=1
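As a sketch, the directory-based approach looks like this (my_gpu_app is a hypothetical application binary standing in for your own; exact caching behavior depends on your OpenCL* driver):

```shell
# Run these from your application's working directory.
mkdir -p cl_cache            # the driver stores compiled kernel binaries here
export INTEL_OPENCL_CACHE=1  # alternative: enable caching via the environment
# ./my_gpu_app               # hypothetical binary; first run compiles kernels
# ./my_gpu_app               # subsequent runs reuse the cached binaries
ls -d cl_cache               # confirm the cache directory is in place
```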
Refer to the Model Caching Overview article for ways to optimize for latency.
Loading a model in Intermediate Representation (IR) format to the GPU takes longer than loading the same model to the CPU because the GPU plugin is based on OpenCL*. The load time depends on how long the OpenCL* kernels take to compile.
When you enable the cl_cache, the first load of the model still takes a long time because the OpenCL* kernels must be compiled. Each subsequent load of the same model is much faster, because the driver reuses the cached kernel binaries.
For programmatic cache configuration with the OpenVINO™ API (available in releases 2022.1 and later):
core = openvino.Core()
core.set_property("GPU", {"CACHE_DIR": "./cl_cache"})