The expectation is that a model in FP16 format performs faster inference than the same model in FP32 format. However, when running inference with benchmark_app using the application's default settings for both formats, there is no performance improvement (higher FPS) for the FP16 model compared to the FP32 model.
To force the FP32 model to execute in f32 precision with benchmark_app, add -infer_precision f32 for the chosen device.
For example:
$ benchmark_app -m intel/bert-large-uncased-whole-word-masking-squad-0001/FP32/bert-large-uncased-whole-word-masking-squad-0001.xml -d GPU -t 5 -api async -hint throughput -infer_precision f32
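Conversely, to run the FP16 model explicitly in f16 precision for a like-for-like comparison, pass -infer_precision f16. A minimal sketch, assuming the FP16 IR is stored under the analogous FP16/ directory of the same model package:

$ benchmark_app -m intel/bert-large-uncased-whole-word-masking-squad-0001/FP16/bert-large-uncased-whole-word-masking-squad-0001.xml -d GPU -t 5 -api async -hint throughput -infer_precision f16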
For the GPU plugin, the floating-point precision of a GPU primitive is selected based on the operation precision in the OpenVINO IR, except for the compressed f16 OpenVINO IR form, which is executed in f16 precision.
For the CPU plugin, the default floating-point precision of a CPU primitive is f32. To support f16 OpenVINO™ IR, the plugin internally converts all f16 values to f32, and all calculations are performed in the native f32 precision. On platforms that natively support bfloat16 calculations (with the AVX512_BF16 or AMX extension), the bf16 type is automatically used instead of f32 to achieve better performance (see the Execution Mode Hint).
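To rule out this automatic bf16 selection when comparing FP32 and FP16 IRs on such CPUs, the same -infer_precision option can be used to pin the execution precision to f32. A minimal sketch, assuming the same model package as in the GPU example above:

$ benchmark_app -m intel/bert-large-uncased-whole-word-masking-squad-0001/FP32/bert-large-uncased-whole-word-masking-squad-0001.xml -d CPU -t 5 -api async -hint throughput -infer_precision f32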
For additional information on data types for the CPU and GPU plugins, refer to: