What’s new
-
More Gen AI coverage and frameworks integrations to minimize code changes
-
New models supported:
-
On CPUs & GPUs: Qwen3-Embedding-0.6B, Qwen3-Reranker-0.6B, Mistral-Small-24B-Instruct-2501.
-
On NPUs: Gemma-3-4b-it and Qwen2.5-VL-3B-Instruct.
-
-
Preview: Mixture of Experts (MoE) models optimized for CPUs and GPUs, validated for Qwen3-30B-A3B.
-
GenAI pipeline integrations: Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B for enhanced retrieval/ranking, and Qwen2.5-VL-7B for the video pipeline.
-
-
Broader LLM model support and more model compression techniques
-
Gold support for Windows ML* enables developers to deploy AI models and applications effortlessly across CPUs, GPUs, and NPUs on Intel® Core™ Ultra processor-powered AI PCs.
-
The Neural Network Compression Framework (NNCF) ONNX backend now supports INT8 static post-training quantization (PTQ) and INT8/INT4 weight-only compression to ensure accuracy parity with OpenVINO IR format models. SmoothQuant algorithm support added for INT8 quantization.
-
Accelerated multi-token generation for GenAI, leveraging optimized GPU kernels to deliver faster inference, smarter KV-cache reuse, and scalable LLM performance.
-
GPU plugin updates include improved performance with prefix caching for chat history scenarios and enhanced LLM accuracy with dynamic quantization support for INT8.
-
-
More portability and performance to run AI at the edge, in the cloud or locally
-
Announcing support for Intel® Core™ Ultra Processor Series 3.
-
Encrypted blob format support added for secure model deployment with OpenVINO™ GenAI. Model weights and artifacts are stored and transmitted in an encrypted format, reducing risks of IP theft during deployment. Developers can deploy with minimal code changes using OpenVINO GenAI pipelines.
-
OpenVINO™ Model Server and OpenVINO™ GenAI now extend support for Agentic AI scenarios with new features such as output parsing and improved chat templates for reliable multi-turn interactions, and preview functionality for the Qwen3-30B-A3B model. OVMS also introduces a preview for audio endpoints.
-
NPU deployment is simplified with batch support, enabling seamless model execution across Intel® Core™ Ultra processors while eliminating driver dependencies. Models are reshaped to batch_size=1 before compilation.
-
The improved NVIDIA Triton Server* integration with OpenVINO backend now enables developers to utilize Intel GPUs or NPUs for deployment.
-
OpenVINO™ Runtime
CPU Device Plugin
-
Qwen3-MoE is now supported, with improved performance for Mixture-of-Experts subgraphs.
-
Model inference on Intel® Core™ Ultra Series 3 processors has been optimized with AI workload scheduling among P-cores, E-cores and LP E-cores.
-
The BitNet model is now supported and optimized on both Intel® Xeon® and client processors, with 2-bit weight compression.
-
Qwen2/2.5-VL performance and memory footprint have been optimized with 3D Rotary Position Embedding fusion support.
-
FP16 model performance on Intel® Xeon® 6 processors with P-cores has been enhanced by improving utilization of the underlying AMX FP16 capabilities and graph-level optimizations.
-
Inference support for AI workloads is now available on Intel® Xeon® 6 processors with E-cores for Windows 11 and Windows Server.
GPU Device Plugin
-
Intel® Core™ Ultra Series 3 is fully supported with optimized performance.
-
Initial optimization for MoE (Mixture of Experts) has been introduced on Intel® XMX based platforms. Qwen3-30B-A3B model has been enabled.
-
Prefix caching performance has been significantly improved on Intel® XMX based platforms.
-
Per-group dynamic quantization is now supported and configurable, providing an alternative when accuracy is insufficient with the default per-token dynamic quantization.
-
Performance of Qwen3-Embedding and Qwen3-Reranker models has been optimized on Intel® XMX based platforms.
-
Multiple primitives have been optimized for non-Gen AI models, improving performance of vision embedding models, RNN-based models, and models in the GeekBench AI benchmark tool.
-
Runtime memory footprint has been reduced for dynamic shape models.
-
The 4.2 GB memory allocation limit has been removed; large allocations are now allowed using the GPU_ENABLE_LARGE_ALLOCATIONS property (see the sketch after this list).
-
Querying of discrete and integrated GPUs upon plugin creation has been optimized, extending battery life and reducing power consumption.
-
Qwen2/2.5-VL performance and memory footprint have been optimized by accelerating image processing on GPU and supporting 3D Rotary Position Embedding fusion.
-
XAttention (Block Sparse Attention with Antidiagonal Scoring) is now initially supported on Xe2 architecture to improve time-to-first-token.
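A minimal sketch of opting in to large GPU allocations from Python, assuming the property can be passed as a plain config entry to compile_model (the model path and the "YES" value are illustrative):

    import openvino as ov

    core = ov.Core()
    # Allow allocations beyond the previous 4.2 GB limit on the GPU plugin.
    # Property name is taken from the note above; the value format is an assumption.
    config = {"GPU_ENABLE_LARGE_ALLOCATIONS": "YES"}
    compiled_model = core.compile_model("large_model.xml", "GPU", config)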
NPU Device Plugin
-
Gemma-3-4B-it and Qwen2.5-VL-3B-Instruct models are now enabled.
-
Sliding window mask support for Phi-3 on NPU has been fixed.
-
Asynchronous weight processing has been introduced to provide a slight speed-up in importing pre-compiled LLMs.
-
LLM prefix caching is introduced to reduce TTFT in long chat scenarios, enabled via the NPUW_LLM_ENABLE_PREFIX_CACHING:YES property (see the sketch after this list).
-
OpenVINO™ cached models are now memory-mapped and imported within the current Level Zero context, reducing peak memory consumption during imports by eliminating an additional in-memory copy of the compiled model.
-
Implicit I/O tensor import is now supported. A shadow-copy tensor is created only when the user-provided tensor address or size is not aligned to the 4K page size.
-
NPU deployment is simplified through batch support, which automatically reshapes models to batch size = 1 for compatibility with older driver versions. This enables seamless model execution across all Intel® Core™ Ultra processors regardless of driver version.
-
I/O layout information is now preserved after an export/import cycle. Information is stored inside the blob metadata.
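A minimal sketch of enabling the new NPU prefix caching from OpenVINO GenAI, assuming the property can be forwarded as a pipeline config entry (the model directory is illustrative):

    import openvino_genai as ov_genai

    # Forward the NPUW property named above to the NPU LLM pipeline;
    # passing it as a keyword config entry is an assumption.
    pipe = ov_genai.LLMPipeline(
        "llm_model_dir",                     # hypothetical exported model directory
        "NPU",
        NPUW_LLM_ENABLE_PREFIX_CACHING="YES",
    )
    print(pipe.generate("Hello!", max_new_tokens=32))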
OpenVINO Python API
-
Python* 3.14 is now supported, including experimental free-threaded mode (3.14t) on Linux and macOS for improved parallel processing performance.
-
The issue with the PERFORMANCE_HINT property not being properly applied in benchmark_app when using custom configurations was fixed. Benchmark results now accurately reflect intended performance settings.
-
Precompiled model import from tensor: precompiled models can now be imported directly from ov.Tensor objects in Python, matching C++ API capabilities and enabling more flexible model deployment workflows (see the sketch after this list).
-
The issue that prevented import_model() from working with large models on Windows was fixed. Models of any size can now be imported on all supported platforms.
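A rough sketch of the new tensor-based import path, assuming the exported blob (bytes) can be wrapped in an ov.Tensor and handed straight to import_model (the model path is illustrative):

    import numpy as np
    import openvino as ov

    core = ov.Core()

    # Compile once and export the precompiled blob (bytes in Python).
    compiled = core.compile_model("model.xml", "CPU")
    blob = compiled.export_model()

    # Wrap the blob in an ov.Tensor and import it directly (new in this release).
    blob_tensor = ov.Tensor(np.frombuffer(blob, dtype=np.uint8).copy())
    restored = core.import_model(blob_tensor, "CPU")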
OpenVINO C API
-
Log callback setup has been added to the C API: ov_util_set_log_callback and ov_util_reset_log_callback.
OpenVINO Node.js API
-
Tensor.setShape has been added to the Node.js API, enabling in-place tensor shape updates from JavaScript/TypeScript without recreating the tensor object. This is useful for dynamic input sizes and batched inference workflows.
-
Full 64‑bit integer (BigInt64/Uint64) support is added for tensors and inference I/O, enabling models and pipelines that require 64‑bit index/ID types or high‑range counters.
PyTorch Framework Support
-
Support for the padding operations family has been improved by adding new operations (aten::reflection_padnd and aten::replication_padnd) and resolving issues in existing implementations (see the sketch after this list).
-
Complex data type support has been added for the aten::unsqueeze and aten::cat operations.
-
An issue in the aten::index operation with applying boolean masks on specified axes has been resolved.
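A small sketch showing a module that uses reflection padding converted with ov.convert_model; the toy module and shapes are illustrative:

    import torch
    import openvino as ov

    class PadNet(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.pad = torch.nn.ReflectionPad2d(2)   # lowers to the reflection padding op family

        def forward(self, x):
            return self.pad(x)

    example = torch.randn(1, 3, 32, 32)
    ov_model = ov.convert_model(PadNet(), example_input=example)
    compiled = ov.compile_model(ov_model, "CPU")
    print(compiled(example.numpy())[0].shape)        # (1, 3, 36, 36)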
ONNX Framework Support
-
Initial support for sequence data types has been added, beginning with the implementation of the SequenceAt and SplitToSequence operations.
-
The incorrect output shape calculation in the ConvTranspose operation when using automatic padding (auto_pad) has been fixed.
-
The LayerNormalization operation has been corrected to properly handle scale and bias inputs with flattened shapes by automatically reshaping them to match the input tensor dimensions.
-
A regression causing FP16 model conversion failures due to node not found errors has been fixed.
OpenVINO™ Model Server
-
Agentic use case improvements:
-
Tool parsers for the new models Qwen3-Coder-30B and Qwen3-30B-A3B-Instruct have been enabled. These models are supported in OpenVINO Runtime as a preview feature and can be evaluated with “tool calling” capabilities.
-
Streaming with “tool calling” for phi-4-mini-instruct and mistral-7B-v0.4 models is now supported.
-
Tool parsers for mistral and hermes have been improved, resolving multiple issues related to complex generated JSON objects and increasing overall response reliability.
-
Guided generation now supports all rules from the XGrammar integration. The response_format parameter can now accept the XGrammar structural tags format (not part of the OpenAI API). Example: { "type": "const_string", "value": "Hello World!" }.
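A hedged sketch of sending the structural-tags response_format shown above to an OVMS chat completions endpoint; the host, port, /v3 path prefix, and served model name are assumptions:

    import requests

    payload = {
        "model": "OpenVINO/Qwen3-8B-int4",   # hypothetical served model name
        "messages": [{"role": "user", "content": "Say hello."}],
        # XGrammar structural tags format (not part of the OpenAI API), per the note above.
        "response_format": {"type": "const_string", "value": "Hello World!"},
    }
    resp = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
    print(resp.json()["choices"][0]["message"]["content"])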
-
-
New capabilities and demos:
-
Integration with OpenWebUI
-
Integration with Visual Studio Code using the Continue extension
-
Agentic client demo
-
Windows service usage
-
GGUF model pulling
-
-
Deployment improvements:
-
Models in GGUF format can now be deployed directly from the Hugging Face Hub for several LLM architectures. Architectures such as Qwen2, Qwen2.5, Qwen3, and Llama3 can be deployed with a single command. See the Loading GGUF models in OVMS demo for details.
-
OpenVINO Model Server can be deployed as a service on the Windows operating system. It can be managed through the Windows service configuration, shared by all running applications, and controlled using a simplified CLI to pull, configure, enable, and disable models. Check the OVMS documentation for more details.
-
Pulling models in IR format has been extended beyond the OpenVINO™ organization in the Hugging Face* Hub. While OpenVINO org models are validated by Intel, a rapidly growing ecosystem of IR-format models from other publishers can now also be pulled and deployed via the OVMS CLI. Note: the repository needs to be populated by the optimum-cli export openvino command and must include the tokenizer model in IR format to be successfully loaded by OpenVINO Model Server.
-
-
CLI simplifications for easier deployment:
-
--plugin_config parameter can now be applied not only to classic models but also to generative pipelines.
-
cache_dir now enables compilation caching for both classic models and generative pipelines.
-
enable_prefix_caching can be used the same way for all target devices.
-
-
--add_to_config and --remove_from_config, like --list_models, are now OVMS CLI directives and no longer expect a value. Configuration values should be passed through the following parameters: --config_path, --model_repository_path, --model_name, or --model_path.
-
When a service is deployed, the CLI can be simplified by setting the environment variable OVMS_MODEL_REPOSITORY_PATH to point to the models folder. This automatically applies the default parameters for model management, ensuring that config_path and model_repository_path are set correctly.

    ovms -pull -task text_generation OpenVINO/Qwen3-8B-int4
    ovms -list_models
    ovms -add_to_models -model_name OpenVINO/Qwen3-8B-int4
    ovms -remove_from_models -model_name OpenVINO/Qwen3-8B-int4
-
The --api_key parameter is now available, enabling client authorization using an API key.
-
Binding parameters are added for both IPv6 and IPv4 addresses for gRPC and REST interfaces.
-
The metrics endpoint is now compatible with Prometheus v3. The output header type has been updated from JSON to plain text.
Performance improvements:
-
First-token generation performance has been significantly improved for LLM models with GPU acceleration and prefix caching. This is particularly beneficial for agentic use cases, where repeated chat history creates very long contexts that can now be processed much faster. Prefix caching can be enabled with the OVMS CLI parameter --enable_prefix_caching true.
-
A new parameter is introduced to increase the allowed prompt length for LLM and VLM models deployed on NPU. The context length can now be extended by adding the CLI parameter --max_prompt_length. The default is 1024 tokens and can be increased up to 10k tokens; set it to the required value to avoid unnecessary memory usage. For VLM models running on both NPU and CPU, use a device-specific configuration, for example --plugin_config '{device:NPU,{...}}', to apply the setting only to the NPU device.
-
Model loading time has been reduced through compilation cache, with significant improvements on GPU and NPU devices. Enable caching using the --cache_dir parameter.
Improved guided generation performance, including support for tool-call guiding.
Audio endpoints added:
-
text to speech endpoint compatible with the OpenAI API - /audio/speech
-
speech to text endpoints compatible with the OpenAI API - /audio/speech_to_text
-
/audio/translation - converts provided audio content to English text
-
/audio/transcription - converts provided audio content to text in the original language.
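A hedged example of calling the new audio endpoints over REST; the host, port, /v3 path prefix, served model names, and response handling are assumptions:

    import requests

    base = "http://localhost:8000/v3"

    # Text to speech: synthesize audio from text (endpoint path from the list above).
    tts = requests.post(
        f"{base}/audio/speech",
        json={"model": "tts-model", "input": "OpenVINO Model Server now speaks."},  # hypothetical model name
    )
    with open("speech.wav", "wb") as f:
        f.write(tts.content)

    # Transcription: convert the audio back to text in its original language.
    with open("speech.wav", "rb") as f:
        stt = requests.post(
            f"{base}/audio/transcription",
            files={"file": f},
            data={"model": "whisper-model"},  # hypothetical model name
        )
    print(stt.json())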
Embeddings endpoints improvements:
-
A tokenize endpoint has been added to get tokens before sending the input text to embeddings calculation. This helps assess input length to avoid exceeding the model context.
-
Embeddings Model now supports three pooling options: CLS, LAST, and MEAN, improving model compatibility. See Text Embeddings Models list for details.
Breaking changes:
-
The old embeddings and reranking calculators were removed and replaced by embeddings_ov and reranking_ov. These new calculators follow the optimum-cli / Hugging Face model structure and support more features. If you use the old calculators, re-export your models and pull the updated versions from Hugging Face.
Bug fixes:
-
Fixed the model phi-4-mini-instruct generating incorrect responses when context exceeded 4k tokens.
Neural Network Compression Framework
-
SmoothQuant algorithm support has been added to the int8 post-training quantization method, nncf.quantize() for the ONNX backend in NNCF, improving the accuracy of transformer-based int8 ONNX models (see the sketch after this list).
-
Saving ONNX models with int8 after int8 post-training quantization is now enabled, significantly reducing the model size.
-
Histogram Observer support has been added to the int8 post-training quantization method nncf.quantize() for more accurate quantization results.
-
MXFP8 precision support has been included in the weight-only compression method nncf.compress_weights().
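A minimal sketch of int8 post-training quantization of an ONNX model with the NNCF ONNX backend; the model path, calibration items, and input name are placeholders, and using the TRANSFORMER model type to enable SmoothQuant-style handling is an assumption:

    import onnx
    import nncf

    model = onnx.load("model.onnx")

    calibration_items = [...]                 # placeholder: a few representative inputs

    def transform_fn(item):
        return {"input": item}                # placeholder input name

    calibration_dataset = nncf.Dataset(calibration_items, transform_fn)

    # Static int8 PTQ; the TRANSFORMER model type enables transformer-friendly
    # handling such as SmoothQuant (per the note above).
    quantized_model = nncf.quantize(
        model,
        calibration_dataset,
        model_type=nncf.ModelType.TRANSFORMER,
    )
    onnx.save(quantized_model, "model_int8.onnx")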
OpenVINO Tokenizers
-
BPE and SplitSpecialTokens operations have been optimized, resulting in faster processing of large input strings.
-
The Metaspace operation has been improved to support the LLaVA-NeXT-Video model.
-
Pair-input support has been added for the Qwen3 tokenizer.
OpenVINO GenAI
-
Tokenizers:
-
JsonContainer has been added to represent arbitrary string containers.
-
The apply_chat_template() function has been extended with tools and arbitrary values wrapped with JsonContainer to be used by the chat template.
-
get_original_chat_template() has been added.
-
The TextStreamer constructor is extended with detokenization_params to pass to the detokenizer.
-
-
LLM Pipeline enhancements:
-
Preview: Mixture of Experts (MoE) models optimized for CPUs and GPUs, validated for Qwen3-30B-A3B.
-
Parsers have been added for C++, Python, and JavaScript. They structure the generated output by splitting it into arbitrary sections, for example thinking and tool calling.
-
Structured Output grammar compilation time has been improved, and structural tags have been reworked, providing new grammar building blocks for imposing complex output constraints.
-
The ChatHistory API (C++, Python, and JavaScript) has been added; it stores conversation messages and optional metadata for chat templates. This is the recommended way to manage history for LLMs instead of start/finish_chat(). See the updated C++ and Python chat_sample, and the sketch after this list.
Automatic memory allocation for ContinuousBatching has been improved: it now allocates a fixed number of extra tokens instead of exponential growth, aligning with the GPU plugin.
-
SDPA-based Speculative Decoding has been implemented (used for NPU).
-
Gibberish output from GGUF Q4_1 models has been fixed.
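A rough sketch of the new ChatHistory-based flow, assuming ChatHistory holds role/content messages and can be passed to generate(); exact method names may differ, so treat this as illustrative and see the updated chat_sample:

    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline("llm_model_dir", "CPU")      # hypothetical model directory

    # Keep the conversation in ChatHistory instead of start/finish_chat().
    history = ov_genai.ChatHistory()                         # usage shown here is an assumption
    history.append({"role": "user", "content": "What is OpenVINO?"})

    reply = pipe.generate(history, max_new_tokens=128)
    history.append({"role": "assistant", "content": str(reply)})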
-
-
VLM Pipeline enhancements:
-
LLaVA-NeXT-Video, Qwen2-VL, and Qwen2.5-VL now support video input alongside images (samples TBA).
-
nanoLLaVA and MiniCPM-o-2.6 are now supported, and optimizations for Qwen-VL have been added.
-
On NPU, absent images and start/finish_chat() are now supported.
-
A C API covering the VLMPipeline class has been added.
-
-
Image generation improvements:
-
RAG:
-
pad_to_max_length and batch_size config fields have been added, along with the LAST_TOKEN PoolingType for EmbeddingPipeline.
-
Qwen3 support is added for EmbeddingPipeline and TextRerankPipeline.
-
-
The -DENABLE_GIL_PYTHON_API=OFF CMake option is added to build GenAI with free-threaded Python.
-
Node.js bindings:
-
Issues with launching on NodeJS version 22 and later on Linux have been resolved.
-
StructuredOutputConfig class enhanced to support structured output generation, including concatenation and tagging features.
-
ChatHistory is now implemented to manage conversational context, improving tools usage and providing extra context for more accurate prompts.
-
SchedulerConfig entity is introduced, enabling advanced scheduling configurations for pipelines.
-
Getters for PerfMetrics grammar have been added, allowing developers to retrieve detailed performance data for analysis.
-
Configuration parameter issues in the TextEmbeddingPipeline have been resolved, expanding the list of supported parameters.
-
Sample list for text generation using JavaScript has been expanded, including new examples for structured output and benchmarking.
-
Other Changes and Known Issues
Jupyter Notebooks
New models and use cases:
Deleted notebooks (still available in the 2025.3 branch):
-
One Step Sketch to Image translation with pix2pix-turbo and OpenVINO
-
Visual-language assistant with LLaVA and Optimum Intel OpenVINO integration
-
Knowledge graphs model optimization using the Intel OpenVINO toolkit
-
Text-to-Image Generation with LCM LoRA and ControlNet Conditioning
-
Image Generation with Stable Diffusion using OpenVINO TorchDynamo backend
-
Generate creative QR codes with ControlNet QR Code Monster and OpenVINO™
Known Issues
Component: OpenVINO Tokenizers
ID: 174531
Description: Accuracy regression of Mistral-7b-instruct-v0.2 and Mistral-7b-instruct-v0.3 on all devices when executed with OpenVINO GenAI. As a workaround, use the IR converted with OpenVINO 2025.3. The accuracy will be improved with the next release.
Component: OpenVINO GenAI
ID: 176777
Description: Using the callback parameter with the Python API call generate() in Text2ImagePipeline, Image2ImagePipeline, and InpaintingPipeline may cause the process to hang. As a workaround, do not use the callback parameter. The issue will be resolved in the next release. C++ implementations are not affected.
Deprecation And Support
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. For more details, refer to: OpenVINO Legacy Features and Components.
Discontinued in 2025
-
Runtime components:
-
The OpenVINO property of Affinity API is no longer available. It has been replaced with CPU binding configurations (ov::hint::enable_cpu_pinning).
The runtime namespace for the Python API has been marked as deprecated and is designated for removal in 2026.0. The new namespace structure has been delivered, and migration is possible immediately. Details will be communicated through warnings and via documentation.
-
Binary operations Node API has been removed from Python API after previous deprecation.
-
PostponedConstant Python API update: the PostponedConstant constructor signature is changing for better usability. Update the maker callback from Callable[[Tensor], None] to Callable[[], Tensor]. The old signature will be removed in version 2026.0.
-
-
Tools:
-
The OpenVINO™ Development Tools package (pip install openvino-dev) is no longer available for OpenVINO releases in 2025.
-
Model Optimizer is no longer available. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.
-
Intel® Streaming SIMD Extensions (Intel® SSE) are currently not enabled in the binary package by default. They are still supported in the source code form.
-
Legacy prefixes l_, w_, and m_ have been removed from OpenVINO archive names.
-
-
OpenVINO GenAI:
-
StreamerBase::put(int64_t token)
-
The Bool value for Callback streamer is no longer accepted. It must now return one of three values of the StreamingStatus enum.
-
ChunkStreamerBase is deprecated. Use StreamerBase instead.
-
-
Deprecated OpenVINO Model Server (OVMS) benchmark client in C++ using TensorFlow Serving API.
-
NPU Device Plugin:
-
Removed logic to detect and handle Intel® Core™ Ultra Processors (Series 1) drivers older than v1688. Since v1688 is the earliest officially supported driver, older versions (e.g., v1477) are no longer recommended or supported.
-
-
Python 3.9 support will be discontinued starting with OpenVINO 2025.4 and Neural Network Compression Framework (NNCF) 2.19.0.
Deprecated and to be removed in the future
-
openvino.Type.undefined is now deprecated and will be removed with version 2026.0. openvino.Type.dynamic should be used instead.
-
Ubuntu 20.04 support has been deprecated due to the end of standard support.
-
The openvino-nightly PyPI module will soon be discontinued. End-users should proceed with the Simple PyPI nightly repo instead. Find more information in the Release policy.
-
auto shape and auto batch size (reshaping a model in runtime) will be removed in the future. OpenVINO’s dynamic shape models are recommended instead.
-
MacOS x86 is no longer recommended for use due to the discontinuation of validation. Full support will be removed later in 2025.
-
The openvino namespace of the OpenVINO Python API has been redesigned, removing the nested openvino.runtime module. The old namespace is now considered deprecated and will be discontinued in 2026.0.
-
Starting with OpenVINO release 2026.0, the CPU plugin will require support for the AVX2 instruction set as a minimum system requirement. The SSE instruction set will no longer be supported.
-
APT & YUM Repositories Restructure: Starting with release 2025.1, users can switch to the new repository structure for APT and YUM, which no longer uses year-based subdirectories (like “2025”). The old (legacy) structure will still be available until 2026, when the change will be finalized. Detailed instructions are available on the relevant documentation pages:
-
OpenCV binaries will be removed from Docker images in 2026.
-
Starting with the 2026.0 release, OpenVINO will migrate builds based on RHEL 8 to RHEL 9.
-
NNCF create_compressed_model() method is now deprecated and will be removed in 2026. nncf.quantize() method is recommended for Quantization-Aware Training of PyTorch models.
-
NNCF optimization methods for TensorFlow models and TensorFlow backend in NNCF are deprecated and will be removed in 2026. It is recommended to use PyTorch analogous models for training-aware optimization methods and OpenVINO IR, PyTorch, and ONNX models for post-training optimization methods from NNCF.
-
The following experimental NNCF methods are deprecated and will be removed in 2026: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, Movement Sparsity.
-
Starting with the 2026.0 release, manylinux2014 will be upgraded to manylinux_2_28. This aligns with modern toolchain requirements but also means that CentOS 7 will no longer be supported due to glibc incompatibility.
-
With the release of Node.js v22, updated Node.js bindings are now available and compatible with the latest LTS version. These bindings do not support CentOS 7, as they rely on newer system libraries unavailable on legacy systems.
-
OpenVINO Model Server:
-
The dedicated OpenVINO operator for Kubernetes and OpenShift is now deprecated in favor of the recommended KServe operator. The OpenVINO operator will remain functional in upcoming OpenVINO Model Server releases but will no longer be actively developed. Since KServe provides broader capabilities, no loss of functionality is expected. On the contrary, more functionalities will be accessible and migration between other serving solutions and OpenVINO Model Server will be much easier.
-
TensorFlow Serving (TFS) API support is planned for deprecation. With increasing adoption of the KServe API for classic models and the OpenAI API for generative workloads, usage of the TFS API has significantly declined. The dropping date will be determined based on feedback, with a tentative target of mid-2026.
-
Support for Stateful models will be deprecated. These capabilities were originally introduced for Kaldi audio models, which are no longer relevant. Current audio model support relies on the OpenAI API and pipelines implemented via the OpenVINO GenAI library.
-
The Directed Acyclic Graph Scheduler will be deprecated in favor of pipelines managed by the MediaPipe scheduler and will be removed in 2026.3. That approach gives more flexibility, includes a wider range of calculators, and supports processing accelerators.
-
Legal Information
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Copyright © 2025, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.