Loading a model's Intermediate Representation (IR) to the GPU takes longer than loading the same model to the CPU.
Manually create a cl_cache directory in the working directory of your application.
The driver uses this directory to store binary representations of the compiled kernels. This mechanism works on all supported operating systems.
Alternatively, set the environment variable:
export INTEL_OPENCL_CACHE=1
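As a sketch, the directory-based approach looks like this (my_gpu_app is a hypothetical application binary standing in for your own; exact caching behavior depends on your OpenCL* driver):

```shell
# Run these from your application's working directory.
mkdir -p cl_cache            # the driver stores compiled kernel binaries here
export INTEL_OPENCL_CACHE=1  # alternative: enable caching via the environment
# ./my_gpu_app               # hypothetical binary; first run compiles kernels
# ./my_gpu_app               # subsequent runs reuse the cached binaries
ls -d cl_cache               # confirm the cache directory is in place
```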
Refer to the Model Caching Overview article for ways to optimize for latency.
Loading a model in Intermediate Representation (IR) format to the GPU takes longer than loading the same model to the CPU because the GPU plugin is based on OpenCL*. The load time depends on how long the OpenCL* kernels take to compile.
When you enable the cl_cache, the first load of the model still takes a long time because the OpenCL* kernels must be compiled. Each subsequent load of the same model is much faster, because the driver reuses the cached kernel binaries.
For programmatic cache configuration with the OpenVINO™ API (available in releases 2022.1 and later):
core = openvino.Core()
core.set_property("GPU", {"CACHE_DIR": "./cl_cache"})