
Why Does Applying Different Weights to a Model Affect the Inference Performance?

Content Type: Troubleshooting   |   Article ID: 000088030   |   Last Reviewed: 03/09/2026

Description

Different inference throughput is observed when running the same model architecture with different weight files. Although the model structure is identical, inference performance varies significantly depending on the weight precision and representation used.

Resolution

Model weight precision (FP32, FP16, or INT8) affects inference performance.

FP32, known as single-precision floating point, stores each weight at full precision and preserves the full distribution of the weight values.

FP16 and INT8, by contrast, are compressed formats that represent each weight in fewer bits. The trade-off for this compression is model accuracy; the resulting loss is known as quantization error.
The more bits allocated to represent the data, the wider the range of values that can be expressed and, potentially, the better the model's accuracy. However, larger data types require more memory for storage, higher memory bandwidth to move the data around, and more compute resources and time.
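The precision/accuracy trade-off above can be illustrated with a minimal sketch of symmetric per-tensor INT8 quantization and FP16 rounding, using only the Python standard library. The weight values and the scale formula here are illustrative assumptions, not taken from any particular model or from OpenVINO's own quantization pipeline.

```python
import struct

def to_fp16(x: float) -> float:
    # Round a Python float through IEEE-754 half precision (struct code "e").
    # The result differs from x whenever x is not exactly representable in 16 bits.
    return struct.unpack("e", struct.pack("e", x))[0]

def quantize_int8(weights, scale):
    # Symmetric INT8 quantization: round(w / scale), clamped to [-127, 127].
    return [max(-127, min(127, round(w / scale))) for w in weights]

def dequantize(q, scale):
    # Map the integer codes back to real values; the gap to the original
    # weight is the quantization error.
    return [v * scale for v in q]

# Illustrative weight tensor (placeholder values).
weights = [0.1, -0.25, 0.7321, -0.9987]
scale = max(abs(w) for w in weights) / 127  # per-tensor scale
q = quantize_int8(weights, scale)
deq = dequantize(q, scale)
errors = [abs(w - d) for w, d in zip(weights, deq)]
# Each INT8 weight occupies 1 byte vs 4 bytes for FP32: a 4x size reduction,
# at the cost of a per-weight error bounded by scale / 2.
```

Rounding to the nearest representable value keeps the per-weight error within half a quantization step, which is why allocating more bits (a smaller step) improves accuracy while increasing storage and bandwidth cost.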

The Intel® Distribution of OpenVINO™ toolkit Benchmark Results show clear performance differences between the different weight formats (precisions).
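To observe these differences on your own hardware, OpenVINO's benchmark_app can run the same network exported at each precision. The model file names below are placeholders; `-m` selects the model and `-d` the target device.

```shell
# Compare throughput of the same model saved at different precisions
# (file names are placeholders for your own IR files).
benchmark_app -m model_fp32.xml -d CPU
benchmark_app -m model_fp16.xml -d CPU
benchmark_app -m model_int8.xml -d CPU
```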

Related Products

This article applies to 1 product.