What’s new
- More Gen AI coverage and framework integrations to minimize code changes
  - New models supported: Phi-4-mini-reasoning, AFM-4.5B, Gemma-3-1B-it, Gemma-3-4B-it, and Gemma-3-12B.
  - NPU support added for: Qwen3-1.7B, Qwen3-4B, and Qwen3-8B.
  - LLMs optimized for NPU are now available in the OpenVINO Hugging Face collection.
  - Preview: Intel® Core™ Ultra Processor and Windows-based AI PCs can now leverage the OpenVINO™ Execution Provider for Windows* ML for a high-performance, off-the-shelf starting experience on Windows*.
- Broader LLM model support and more model compression optimization techniques
  - The NPU plug-in adds support for longer contexts of up to 8K tokens, dynamic prompts, and dynamic LoRA for improved LLM performance.
  - The NPU plug-in now supports dynamic batch sizes by reshaping the model to a batch size of 1 and concurrently managing multiple inference requests, enhancing performance and optimizing memory utilization.
  - Accuracy improvements for GenAI models on both built-in and discrete graphics, achieved through per-channel key cache compression, in addition to the existing per-token KV cache compression method.
  - OpenVINO™ GenAI introduces TextRerankPipeline for improved retrieval relevance and RAG pipeline accuracy, plus Structured Output for enhanced response reliability and function calling while ensuring adherence to predefined formats.
- More portability and performance to run AI at the edge, in the cloud, or locally
  - Announcing support for Intel® Arc™ Pro B-Series (B50 and B60).
  - Preview: Hugging Face models that are GGUF-enabled for OpenVINO GenAI are now supported by the OpenVINO™ Model Server for popular LLM architectures such as DeepSeek Distill, Qwen2, Qwen2.5, and Llama 3. This functionality reduces memory footprint and simplifies integration for GenAI workloads.
  - With improved reliability and tool call accuracy, the OpenVINO™ Model Server boosts support for agentic AI use cases on AI PCs, while enhancing performance on Intel CPUs, built-in GPUs, and NPUs.
  - INT4 data-aware weights compression, now supported in the Neural Network Compression Framework (NNCF) for ONNX models, reduces memory footprint while maintaining accuracy and enables efficient deployment in resource-constrained environments.
OpenVINO™ Runtime
Common
- Public API has been added to set and reset the log message handling callback. It allows injecting an external log handler to read OpenVINO messages in the user's infrastructure, rather than from log files left by OpenVINO.
- Build-time optimizations have been introduced to improve developer experience in project compilation.
- Ability to import a precompiled model from an ov::Tensor has been added. Using ov::Tensor, which also supports memory-mapped files, to store precompiled models benefits both the OpenVINO caching mechanism and applications using core.import_model(). A minimal sketch follows this list.
- Several fixes for conversion between different precisions, such as u2 and f32->f4e2m1, have been implemented to improve compatibility with quantized models.
- Support for negative indices in GatherElements and GatherND operators has been added to ensure compliance with ONNX standards.
- vLLM-OpenVINO integration now supports vLLM API v1.
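A minimal sketch of the blob export/import flow from Python. The bytes-based export path is pre-existing behavior; passing an ov.Tensor to core.import_model() is assumed here to mirror the new C++ ov::Tensor overload described above.

```python
import numpy as np
import openvino as ov

core = ov.Core()

# Compile once and export the precompiled blob to disk.
compiled = core.compile_model("model.xml", "CPU")
with open("model.blob", "wb") as f:
    f.write(compiled.export_model())

# Later, or in another process: load the blob and import it.
# Wrapping the data in an ov.Tensor is the assumed new overload; combined
# with memory-mapped file support it avoids an extra copy of the blob.
blob = np.fromfile("model.blob", dtype=np.uint8)
imported = core.import_model(ov.Tensor(blob), "CPU")
request = imported.create_infer_request()
```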
CPU Device Plugin
- Sage Attention is now supported. This feature is turned on with the ENABLE_SAGE_ATTN property, providing a performance boost for 1st token generation in LLMs with long prompts, while maintaining accuracy. A hedged configuration sketch follows this list.
- FP16 model performance on 6th generation Intel® Xeon® processors has been enhanced by improving utilization of the underlying AMX FP16 capabilities and graph-level optimizations.
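A hedged configuration sketch: the ENABLE_SAGE_ATTN property name comes from the note above, but the exact way it is passed at compile time (key/value form below) is an assumption to be verified against the CPU plugin documentation.

```python
import openvino as ov

core = ov.Core()
model = core.read_model("llm.xml")  # placeholder LLM IR

# Assumed usage: pass the property named in the release note as a plugin
# configuration entry when compiling for CPU.
compiled = core.compile_model(model, "CPU", {"ENABLE_SAGE_ATTN": True})
```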
GPU Device Plugin
- LLM accuracy has been improved with by-channel key cache compression. Default KV-cache compression has also been switched from by-token to by-channel compression.
- Gemma3-4b and Qwen-VL VLM performance has been improved on XMX-supporting platforms.
- Basic functionalities for dynamic shape custom operations in GPU extension have been enabled.
- LoRA performance has been improved for systolic platforms.
NPU Device Plugin
- Models compressed as NF4-FP16 are now enabled on NPU. This is the recommended precision for the following models: deepseek-r1-distill-qwen-7b, deepseek-r1-distill-qwen-14b, and qwen2-7b-instruct, providing a reasonable balance between accuracy and performance. This quantization is not supported on Intel® Core™ Ultra Series 1, where only symmetrically quantized channel-wise or group-wise INT4-FP16 models are supported.
- Peak memory consumption of LLMs on NPU has been significantly reduced when using ahead-of-time compilation.
- Optimizations for LLM vocabularies (LM Heads) compressed in INT8 asymmetric have been introduced, available with NPU driver 32.0.100.4181 or later.
- Accuracy of LLMs with RoPE on longer contexts has been improved.
- The NPU plug-in now supports dynamic batch sizes by reshaping the model to a batch size of 1 and concurrently managing multiple inference requests, enhancing performance and optimizing memory utilization. This requires driver 32.0.202.298 or later.
- The remote tensor interface has been extended to support tensor creation from files; recent NPU drivers now support memory-mapped inputs/outputs.
OpenVINO Python API
- TensorVector binding has been enabled to avoid extra copies and speed up PostponedConstant usage.
- Support for building the experimental free-threaded 3.13t Python API has been added; prebuilt wheels are not distributed yet.
- Free-threaded Python performance has been improved.
- set_rt_info() method has been added to Node, Output, and Input to align with Model.set_rt_info(). A short sketch follows this list.
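A short sketch of the runtime-info API. Model.set_rt_info() is existing behavior; the per-node and per-port calls are assumed to mirror it, as the note above suggests. The key names are placeholders.

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")

# Existing behavior: attach custom runtime info to the model itself.
model.set_rt_info("v1.2", ["my_app", "model_revision"])

# Assumed mirror of the new API: the same call on a node and one of its ports.
node = model.get_ordered_ops()[-1]
node.set_rt_info("fused", "my_app_tag")
node.output(0).set_rt_info("float16_ok", "my_app_precision_hint")

print(model.get_rt_info())
```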
OpenVINO Node.js API
- AsyncInferQueue class has been added to support easier implementation of asynchronous inference. The change comes with a benchmark tool to evaluate performance.
- Model.reshape method has been exposed, including type conversion ability and type validation helpers, useful for reshaping LLMs.
- Support for ov-node types in the TypeScript part of the bindings has been extended, enabling direct integration with the JavaScript API.
- Wrapping of the compileModel() method has been fixed to allow checking the type of returned objects.
- The version of LLMPipeline.generate() that returns strings is now deprecated. Starting with 2026.0.0, LLMPipeline.generate() will return DecodedResults by default. To opt into the new behavior in the current release, set return_decoded_results to true in GenerationConfig.
PyTorch Framework Support
- Tensor concatenation inside loops is now supported, enabling the Paraformer model family.
OpenVINO™ Model Server
- Major new features:
  - Tool-guided generation has been implemented with the enable_tool_guided_generation parameter and --tool_parser option to enable model-specific XGrammar configuration so responses follow the expected syntax. It uses dynamic rules based on the generated sequence, increasing model accuracy and minimizing invalid response formats for tools.
  - Tool parser has been added for Mistral-7B-Instruct-v0.3, extending the list of supported models with tool handling.
  - Stream response has been implemented for Qwen3, Hermes3, and Llama3 models, enabling more interactive use with tools.
  - BREAKING CHANGE: Separation of tool parser and reasoning parser has been implemented. Instead of the response_parser parameter, use the separate tool_parser and reasoning_parser parameters, allowing more flexible implementation and configuration on the server. Parsers can now be shared independently between models. Currently, Qwen3 is the only reasoning parser implemented.
  - Reading of the chat template has been changed from template.jinja to chat_template.jinja if the chat template is not included in tokenizer_config.json.
  - Structured output is now supported with the addition of JSON schema-guided generation using the OpenAI response_format field. This parameter enables generation of JSON responses for automation purposes and improves response accuracy. See the Structured response in LLM models article for more details. A script testing the accuracy gain is also included; a client-side sketch follows this list.
  - Enforcement of tool call generation has been implemented using the tool_call=required field in chat/completions requests. This feature forces the model to generate at least one tool response, increasing response reliability while not guaranteeing response validity.
  - MCP server demo has been updated to include available features.
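A client-side sketch of the JSON schema-guided generation described above, using the OpenAI Python client against a locally running Model Server. The base URL, port, and model name are assumptions for illustration; only the standard OpenAI response_format field is relied on.

```python
from openai import OpenAI

# Assumed local OVMS endpoint exposing the OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# JSON schema-guided generation via the standard response_format field.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Return the largest city in Poland as JSON."}],
    response_format={"type": "json_schema", "json_schema": {"name": "city_info", "schema": schema}},
)
print(resp.choices[0].message.content)
```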
- New models and use cases supported:
  - Qwen3-embedding and cross-encoder embedding models
  - Qwen3-reranker
  - Gemma3 VLM models
- Deployment improvements:
  - Progress bar display has been implemented for model downloads from Hugging Face. For models from the OpenVINO organization, download status is now shown in the logs.
  - Documentation on how to build a docker image with optimum-cli is now available, enabling the image to pull any model from Hugging Face and convert it to IR online in one step.
  - Models endpoint for OpenAI has been implemented, returning a list of available models in the expected OpenAI JSON schema for easier integration with existing applications. A short usage sketch follows this list.
  - The package size has been reduced by removing git and git-lfs dependencies, reducing the image by ~15MB. Model files are now pulled from Hugging Face using the libgit2 and curl libraries.
  - UTF-8 chat templates are now supported out of the box; no additional installation steps are required on Windows.
  - Preview functionality for GGUF models has been added for LLM architectures including Qwen2, Qwen2.5, and Llama3. Models can now be deployed directly from the Hugging Face Hub by passing the model_id and file name. Note that accuracy and performance may be lower than with IR format models.
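A quick way to query the new models endpoint from Python, again assuming a local Model Server exposing the OpenAI-compatible API on port 8000.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# Lists the models served by OVMS in the OpenAI JSON schema, so existing
# OpenAI-based tooling can discover them without changes.
for m in client.models.list():
    print(m.id)
```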
- Bug fixes:
  - Truncation of prompts exceeding the model length in embeddings has been implemented.
Neural Network Compression Framework
- 4-bit data-aware Scale Estimation and AWQ compression methods have been introduced for ONNX models, providing more accurate compression results. A sketch follows this list.
- NF4 data type is now supported as an FP8 look-up table for faster inference.
- A new parameter has been added to support a fallback group size in 4-bit weight compression methods. This helps when the specified group size cannot be applied, for example, in models with an unusual number of channels in matrix multiplications (matmuls). When enabled with nncf.AdvancedCompressionParameters(group_size_fallback_mode=ADJUST), NNCF automatically adjusts the group size. By default, nncf.AdvancedCompressionParameters(group_size_fallback_mode=IGNORE) is used, meaning that NNCF will not compress nodes when the specified group size cannot be applied.
- Initialization for 4-bit QAT with absorbable LoRA has been enhanced using advanced compression methods (AWQ + Scale Estimation). This replaces the previous basic data-free compression approach, enabling QAT to start with a more accurate model baseline and achieve better final accuracy.
- External quantizers in the quantize_pt2e API have been enabled, including XNNPACKQuantizer and CoreMLQuantizer.
- PyTorch 2.8 is now supported.
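A sketch of data-aware 4-bit compression with AWQ and Scale Estimation, assuming the ONNX path accepts the same nncf.compress_weights() arguments as the OpenVINO path; the model file and calibration samples are placeholders.

```python
import nncf
import onnx

# Placeholder model and calibration samples for illustration.
model = onnx.load("model.onnx")
calibration_items = [...]  # a few representative model inputs

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    dataset=nncf.Dataset(calibration_items),  # makes the compression data-aware
    awq=True,                                 # data-aware AWQ
    scale_estimation=True,                    # data-aware Scale Estimation
)
onnx.save(compressed, "model_int4.onnx")
```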
OpenVINO Tokenizers
- OpenVINO GenAI integration:
  - Padding side can now be set dynamically during runtime.
  - Tokenizer loading now supports a second input for relevant GenAI pipelines, for example TextRerankPipeline.
- Two inputs are now supported to accommodate a wider range of tokenizer types.
OpenVINO GenAI
- New OpenVINO GenAI docs homepage: https://openvinotoolkit.github.io/openvino.genai/
- Transitioned from Jinja2Cpp to Minja, improving chat_template coverage.
- Cache eviction algorithms added:
  - KVCrush algorithm
  - Sparse attention prefill
- Support for Structured Output for flexible and efficient structured generation with XGrammar (a hedged sketch follows this list):
  - C++ and Python samples
  - Constraint sampling with Regex, JSONSchema, EBNF Grammar, and Structural tags
  - Compound grammar to combine multiple grammar types (Regex, JSONSchema, EBNF) using Union (|) and Concat (+) operations for more flexible and complex output constraints.
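A hedged Python sketch of structured generation with a JSON schema. The structured-output configuration surface shown (a StructuredOutputConfig attached to GenerationConfig) is assumed from the feature description; check the GenAI docs linked above for the exact names. The model directory is a placeholder.

```python
import json
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("model_dir", "CPU")  # placeholder model directory

schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["name", "year"],
})

# Assumed configuration surface for JSONSchema-constrained sampling.
config = ov_genai.GenerationConfig()
config.max_new_tokens = 100
config.structured_output_config = ov_genai.StructuredOutputConfig(json_schema=schema)

print(pipe.generate("Describe one Intel product as JSON.", config))
```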
- GGUF:
  - Qwen3 architecture is now supported.
  - enable_save_ov_model property to serialize the generated ov::Model as IR for faster LLMPipeline construction next time.
- LoRA:
  - Dynamic LoRA for NPU has been enabled.
  - Model weights can now be overridden from .safetensors.
- Tokenizer:
  - padding_side property has been added to specify the padding direction (left or right).
  - add_second_input property to transform the Tokenizer from one input to two inputs, used for TextRerankPipeline.
- JavaScript bindings:
  - New pipeline: TextEmbeddingPipeline
  - PerfMetrics for the LLMPipeline
  - getTokenizer has been implemented for LLMPipeline
- Other changes:
  - C API for WhisperPipeline has been added.
  - gemma3-4b-it model is now supported in the VLM Pipeline.
  - Performance metrics for speculative decoding have been extended.
  - Qwen2-VL and Qwen2.5-VL have been optimized for GPU.
  - Exporting stateful Whisper models is now supported on NPU out of the box; using --disable-stateful is no longer required.
- Dynamic prompts are now enabled by default on NPU:
  - Longer contexts are available as a preview feature on 32GB Intel® Core™ Ultra Series 2 (with prompt sizes up to 8K to 12K tokens).
  - The default chunk size is 1024 and can be controlled via the NPUW_LLM_PREFILL_CHUNK_SIZE property. For example, set it to 256 to see the effect on shorter prompts (see the sketch after this list).
  - PREFILL_HINT can be set to STATIC to bring back the old behavior.
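A hedged sketch of tuning the prefill chunk size from the Python GenAI API, using the property named above; passing it as an LLMPipeline property keyword is an assumption, and the model directory is a placeholder.

```python
import openvino_genai as ov_genai

# Chunked prefill is on by default on NPU; a smaller chunk size can help
# shorter prompts, and PREFILL_HINT=STATIC restores the previous behavior.
pipe = ov_genai.LLMPipeline(
    "llm_model_dir",                   # placeholder model directory
    "NPU",
    NPUW_LLM_PREFILL_CHUNK_SIZE=256,   # property named in the release notes
)

print(pipe.generate("What is OpenVINO?", max_new_tokens=64))
```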
Other Changes and Known Issues
Jupyter Notebooks
Known Issues
Component: NPU compiler
ID: 169077
Description:
miniCPM3-4B model is inaccurate with OV 2025.3 and NPU driver 32.0.100.4239 (latest one available as of OV 2025.3 release). The accuracy will be improved with the next driver release.
Component: NPU plugin
ID: 171934
Description:
Whisper model is not functional with transformers v4.53. The recommended workaround is to use
transformers 4.52 and optimum-intel 1.25.2 for Whisper model conversion.
Component: NPU plugin
ID: 169074
Description:
phi-4-multimodal-instruct is not functional on NPU. Planned to be fixed in future releases.
Component: NPU plugin
ID: 173053
Description:
Transformers v4.53 introduces a performance regression for prompts smaller than 1K tokens. For optimal performance, it is recommended to use v4.51.
Component: CPU, GPU plugins
ID: 171208
Description:
ChatGLM-3-6B is inaccurate on CPU and GPU.
Component: GPU plugin
ID: 172726
Description:
Flux.1-schnell or Flux.1-dev model can functionally fail on Intel® Core™ Ultra Series 1. As a workaround, the model can be converted using OpenVINO 2025.2.
Component: GPU plugin
ID: 171017
Description:
If OV_GPU_DYNAMIC_QUANTIZATION_THRESHOLD config is explicitly set to less than 64 on XMX-supporting platforms, functional failure can be observed with several GenAI models. 64 is the default value, and setting it to less than 64 is not normally recommended due to performance degradation.
Component: CPU plugin
ID: 172548
Description:
Performance regression has been observed on Atom x7835RE with Ubuntu 22.04 OS. This is planned to be fixed in the next release.
Component: CPU plugin
ID: 172518
Description:
Qwen2-VL-7b-instruct is inaccurate on 4th and 6th Gen Intel® Xeon® Scalable processors. As a workaround, the model can be converted using OpenVINO 2025.2.
Deprecation and Support
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. For more details, refer to: OpenVINO Legacy Features and Components.
Discontinued in 2025
- Runtime components:
  - The OpenVINO Affinity API property is no longer available. It has been replaced with CPU binding configurations (ov::hint::enable_cpu_pinning).
  - The openvino-nightly PyPI module has been discontinued. End-users should proceed with the Simple PyPI nightly repo instead. More information in Release Policy.
  - Binary operations Node API has been removed from the Python API after previous deprecation.
- Tools:
  - The OpenVINO™ Development Tools package (pip install openvino-dev) is no longer available for OpenVINO releases in 2025.
  - Model Optimizer is no longer available. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.
  - Intel® Streaming SIMD Extensions (Intel® SSE) are currently not enabled in the binary package by default. They are still supported in the source code form.
  - Legacy prefixes l_, w_, and m_ have been removed from OpenVINO archive names.
- OpenVINO GenAI:
  - StreamerBase::put(int64_t token)
  - The Bool value for the Callback streamer is no longer accepted. It must now return one of the three values of the StreamingStatus enum.
  - ChunkStreamerBase is deprecated. Use StreamerBase instead.
- NNCF create_compressed_model() method is now deprecated. The nncf.quantize() method is recommended for Quantization-Aware Training of PyTorch and TensorFlow models.
- Deprecated OpenVINO Model Server (OVMS) benchmark client in C++ using TensorFlow Serving API.
Deprecated and to be removed in the future
- Python 3.9 is now deprecated and will be unavailable after OpenVINO version 2025.4.
- openvino.Type.undefined is now deprecated and will be removed with version 2026.0. openvino.Type.dynamic should be used instead.
- APT & YUM Repositories Restructure: starting with release 2025.1, users can switch to the new repository structure for APT and YUM, which no longer uses year-based subdirectories (like "2025"). The old (legacy) structure will still be available until 2026, when the change will be finalized. Detailed instructions are available on the relevant documentation pages.
- OpenCV binaries will be removed from Docker images in 2026.
- The openvino namespace of the OpenVINO Python API has been redesigned, removing the nested openvino.runtime module. The old namespace is now considered deprecated and will be discontinued in 2026.0. A new namespace structure is available for immediate migration. Details will be provided through warnings and documentation.
- Starting with the next release, manylinux2014 will be upgraded to manylinux_2_28. This aligns with modern toolchain requirements but also means that CentOS 7 will no longer be supported due to glibc incompatibility.
- With the release of Node.js v22, updated Node.js bindings are now available and compatible with the latest LTS version. These bindings do not support CentOS 7, as they rely on newer system libraries unavailable on legacy systems.
Legal Information
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Copyright © 2025, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.