Increase PyTorch Inference Throughput by 4x


In the video below, Intel Cloud Software Engineer Ben Olson demonstrates how to accelerate PyTorch-based inferencing by applying optimizations from the Intel® Extension for PyTorch* and quantizing to INT8.

To accelerate the compute-intensive tasks of deep-learning training and inferencing using PyTorch, Intel engineers have been contributing optimizations for Intel® hardware to the PyTorch open-source community. Although these optimizations are eventually included in stock PyTorch, the Intel Extension for PyTorch supports the newest optimizations and features, such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with Vector Neural Network Instructions (Intel® AVX-512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).

This extension automatically optimizes the PyTorch operators, graph, and runtime based on the hardware it detects. This means you can take advantage of this library with minimal changes to your code.

In this demo video, which shows an inferencing use case with a ResNet-50 model and a synthetic dataset, the key changes are:

  • Import the library:
    import intel_extension_for_pytorch as ipex
  • Apply the optimizations to the model for its datatype:
    fp32_model = ipex.optimize(fp32_model, dtype=torch.float32, inplace=True)

The demo ran on an AWS m6i.32xlarge instance with an Intel® Xeon® Platinum 8375C CPU @ 2.90 GHz and 512 GiB of memory. The baseline inference throughput with stock PyTorch was 382.8 images per second. Applying these optimizations with the simple code change above raised inference throughput to 467.7 images per second (about 1.2x) with a batch size of 116.
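The demo's benchmarking script is not shown here, so the following is only a minimal sketch of how an images-per-second figure like the ones above could be measured. The function name `images_per_second`, its parameters, and the stand-in workload are all hypothetical; in a real run, `run_batch` would call the model on one batch inside `torch.no_grad()`.

```python
import time


def images_per_second(run_batch, batch_size, iterations, warmup=2,
                      timer=time.perf_counter):
    """Return throughput in images/second for a callable that runs one batch.

    `run_batch` is a hypothetical zero-argument callable; `timer` is
    injectable so the measurement logic can be checked deterministically.
    """
    # Warm-up runs are excluded from timing so one-time setup costs
    # do not distort the result.
    for _ in range(warmup):
        run_batch()

    start = timer()
    for _ in range(iterations):
        run_batch()
    elapsed = timer() - start

    return batch_size * iterations / elapsed


if __name__ == "__main__":
    # Stand-in workload (assumption): replace with a real model call.
    def fake_batch():
        sum(i * i for i in range(10_000))

    rate = images_per_second(fake_batch, batch_size=116, iterations=20)
    print(f"{rate:.1f} images/s")
```

Comparing two configurations this way is only meaningful if both use the same batch size, iteration count, and hardware.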

The model in this run used 32-bit single-precision floating-point (FP32) data types. You can further accelerate deep learning inference by quantizing to lower-precision data types, such as 8-bit integer (INT8). Reducing the word length of the weights and mathematical operations enables more parallelism. For instance, the Intel® AVX-512 instruction set is single-instruction multiple-data (SIMD) with 512-bit registers. A 512-bit register holds 16 FP32 values but 64 INT8 values, so quantization lets each instruction process four times as many values. Intel® AVX-512 VNNI further accelerates inference by taking advantage of INT8 quantization to combine three instructions into one.
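The register arithmetic above, and the basic idea of mapping FP32 values to INT8, can be illustrated in plain Python. This is a sketch of a simple symmetric per-tensor scheme chosen for illustration; it is not the extension's implementation, and the helper names (`lanes`, `quantize_symmetric`, `dequantize`) are hypothetical.

```python
REGISTER_BITS = 512  # width of an AVX-512 register


def lanes(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    print(lanes(32), lanes(8))               # FP32 lanes, INT8 lanes
    q, scale = quantize_symmetric([-1.0, 0.0, 0.25, 1.0])
    print(q, dequantize(q, scale))
```

The round trip through `dequantize` shows the rounding error that quantization introduces, which is why accuracy is normally checked before deploying an INT8 model.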

Typically, you would assess the accuracy of your model at reduced precision and apply quantization only where the effects are within tolerance; a tool such as Intel® Neural Compressor can automate this. This demo focuses only on performance. In this case, the model was already calibrated and the results were saved to a configuration file. The demo then uses this configuration to convert the model to INT8:

# Load the calibration results that were saved earlier
calibration_config = "resnet50_configure_sym.json"
conf = ipex.quantization.QuantConf(calibration_config)
# x is an example input batch for the model
int8_model = ipex.quantization.convert(int8_model, conf, x)

Now the model produces an inference throughput of over 1544 images per second, which is a total improvement of about 4x over the baseline run with stock PyTorch.
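As a quick check of the figures quoted in this article, the speedups can be computed directly from the reported throughputs. The variable names are illustrative, and because the INT8 result is reported as "over 1544", the INT8 ratios are lower bounds.

```python
# Throughputs reported in the article, in images per second
baseline = 382.8         # stock PyTorch, FP32
fp32_optimized = 467.7   # after ipex.optimize, FP32
int8 = 1544.0            # after INT8 conversion ("over 1544")


def speedup(new, old):
    """Ratio of two throughputs."""
    return new / old


print(f"FP32 optimized vs. baseline: {speedup(fp32_optimized, baseline):.2f}x")  # 1.22x
print(f"INT8 vs. FP32 optimized:     {speedup(int8, fp32_optimized):.2f}x")      # 3.30x
print(f"INT8 vs. baseline:           {speedup(int8, baseline):.2f}x")            # 4.03x
```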

While this demo highlights only the acceleration of deep learning inference with PyTorch, you can speed up your entire end-to-end AI workflow running on Intel® hardware with Intel AI and machine learning development tools.

Note: Starting with Intel® Extension for PyTorch* v1.12, the API for INT8 conversion has been simplified and the calibrate command has been deprecated. Please see the INT8 Quantization section of the documentation for usage details in v1.12 and later.
