Optimize PyTorch Inference Performance on GPUs

PyTorch* is a deep learning framework that is frequently used in various applications of computer vision and natural language processing using CPUs and GPUs. This framework provides a high-level interface for designing and building deep neural networks. PyTorch defines computational graphs dynamically, which gives more flexibility to build complex deep learning models.

Intel® Extension for PyTorch* extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware. This extension gives users the ability to perform PyTorch model training and inference on discrete Intel GPUs and supports auto-mixed precision for improved performance.

This article walks through a code sample that trains a ResNet*-50 model on the CIFAR-10 dataset and then runs inference with it on a discrete Intel GPU using Intel Extension for PyTorch.

How to Optimize Performance for PyTorch Models

Intel Extension for PyTorch lets users apply the newest performance optimizations that are not yet in PyTorch with minimal code changes. Learn how to install it as a stand-alone product or get it as a part of the AI Tools. The extension can be loaded as a Python* module or linked as a C++ library. Python users can enable it dynamically by importing intel_extension_for_pytorch.

The CPU tutorial provides detailed information on Intel Extension for PyTorch for Intel CPUs. The source code is available at the main branch.
The GPU tutorial provides detailed information on Intel Extension for PyTorch for Intel GPUs. The source code is available at the xpu-main branch.

The extension supports lower-precision data formats and specialized computer instructions. It also allows auto-mixed precision training and inference with float32 (FP32) and bfloat16 (bf16).

What Is Auto-Mixed Precision?

Auto-mixed precision enables low-precision data types such as bf16 and float16 to accelerate the training and inference workloads by improving memory and computation efficiency. Bf16 is a floating-point format that occupies 16 bits of computer memory but represents the approximate dynamic range of 32-bit floating-point numbers. Auto-mixed precision automates the tuning of data type conversions over all operators and buffers to a lower-precision data type.
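The range-versus-precision trade-off described above can be illustrated without any deep learning framework. The following is a minimal sketch using only the Python standard library; the helper names to_bf16 and to_fp16 are hypothetical, and the bf16 conversion assumes finite inputs. It emulates bf16 by keeping the top 16 bits of a float32 encoding (with round-to-nearest-even) and compares it with IEEE float16.

```python
import struct

def to_bf16(x: float) -> float:
    """Round-trip x through bfloat16 by keeping the top 16 bits of its float32 encoding.

    Sketch for finite inputs only; NaN handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (float16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# bf16 keeps float32's 8-bit exponent but only 7 explicit mantissa bits,
# so precision drops to roughly 2-3 significant decimal digits.
print(to_bf16(3.14159265))

# A value near the top of the float32 range is still finite in bf16 ...
print(to_bf16(1e38))

# ... while float16, with its 5-bit exponent, cannot represent it at all.
try:
    to_fp16(1e38)
except OverflowError:
    print("1e38 does not fit in float16 (largest finite value: 65504)")
```

This is why bf16 can usually stand in for FP32 without loss scaling: values rarely overflow, and the cost is paid in mantissa precision instead.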

Support for the auto-mixed precision feature is enabled in Intel Extension for PyTorch on Intel CPUs and GPUs.

For GPUs, torch.xpu.amp provides convenience for auto data type conversion at runtime.
Training workloads using torch.xpu.amp support torch.bfloat16.

Inference workloads using torch.xpu.amp support torch.bfloat16 and torch.float16. When torch.xpu.amp is enabled, bfloat16 is the default lower-precision floating-point data type.

Code Implementation

The code sample shows how to train a ResNet-50 model on the CIFAR-10 dataset using Intel Extension for PyTorch. The model is trained in FP32 by default but can also be trained with auto-mixed precision bf16 by passing "bf16" as the dataType argument of the training function. The trained model is then used for inference with FP32 and with auto-mixed precision bf16. Because bf16 operations can use Intel® Xe Matrix Extensions (Intel® XMX), bf16 is expected to run faster than FP32. Intel XMX supports the bf16 and int8 data types on discrete Intel GPUs.

Import the required packages and define the hyperparameters and dataset location:

import os
from time import time
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import intel_extension_for_pytorch as ipex
from tqdm import tqdm

# Hyperparameters and constants
LR = 0.01
MOMENTUM = 0.9
DATA = 'datasets/cifar10/'
epochs=1
batch_size=128

Load the CIFAR-10 dataset by downloading it from built-in datasets available in the torchvision.datasets module and placing it in a proper location:

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=True,
)
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size
)
test_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=False,
    transform=transform,
    download=True,
)
test_loader = torch.utils.data.DataLoader(
    dataset=test_dataset,
    batch_size=batch_size
)
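To make the effect of the transform and the batch size concrete, here is a small arithmetic sketch in plain Python. The helper names are hypothetical. It assumes the standard CIFAR-10 split of 50,000 training and 10,000 test images and the DataLoader default of keeping the final partial batch.

```python
import math

def normalize(pixel: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Per-channel arithmetic applied by Normalize: (x - mean) / std."""
    return (pixel - mean) / std

def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches yielded per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

# ToTensor produces values in [0, 1]; mean 0.5 and std 0.5 map them to [-1, 1].
print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0

# With batch_size = 128:
print(batches_per_epoch(50_000, 128))  # 391 training batches
print(batches_per_epoch(10_000, 128))  # 79 test batches
```

The last test batch holds only 10,000 - 78 * 128 = 16 images, which matters later when per-image latency is computed from a fixed batch size.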

Define the trainModel function:

def trainModel(train_loader, modelName="myModel", device="cpu", dataType="fp32"):

The body of the function consists of the following steps. Each snippet belongs inside trainModel, indented one level:

a. Initialize the model, and then add a fully connected layer for fine-tuning the model with the chosen dataset:

model = torchvision.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(2048,10)
lin_layer = model.fc
new_layer = torch.nn.Sequential(
    lin_layer,
    torch.nn.Softmax(dim=1)
)
model.fc = new_layer

b. Define the loss function and optimization methodology:

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
model.train()

c. Export the model and criterion to the XPU:

if device == "GPU":
  model = model.to("xpu:0") 
  criterion = criterion.to("xpu:0")

d. Optimize the model according to the specified precision. In the function, we are creating support for FP32 and bf16:

if "bf16" == dataType:
    model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
else:
    model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)

e. Train the model for the defined number of epochs and apply auto-mixed precision:

num_batches = len(train_loader) * epochs
for i in range(epochs):
    running_loss = 0.0

    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        # Move the batch to the XPU device (GPU-specific code)
        if device == "GPU":
            data = data.to("xpu:0")
            target = target.to("xpu:0")

        # Apply auto-mixed precision (bf16)
        if "bf16" == dataType:
            with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
                output = model(data)
                loss = criterion(output, target)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
        else:
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # Show the average loss every 50 batches
        if 0 == (batch_idx+1) % 50:
            print("Batch %d/%d complete" %(batch_idx+1, num_batches))
            print(f' average loss: {running_loss / 50:.3f}')
            running_loss = 0.0

f. Save a checkpoint of the trained model:

torch.save({
   'model_state_dict': model.state_dict(),
   'optimizer_state_dict': optimizer.state_dict(),
   }, 'checkpoint_%s.pth' %modelName)
print(f'\n Training finished and model is saved as checkpoint_{modelName}.pth')
return None
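One detail may be worth checking when adapting this function: step a appends a Softmax layer to the model head, while step b uses torch.nn.CrossEntropyLoss, which already applies log-softmax to its input. The sketch below reproduces that arithmetic for a single sample in plain Python (the helper functions and the logit values are hypothetical, for illustration only). It suggests that when probabilities rather than raw logits are fed to the loss, the loss for a 10-class problem cannot drop below about 1.46, so a plateau near that value in the training log does not necessarily mean that the model stopped learning.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(inputs, target):
    """Single-sample mirror of cross-entropy loss: -log(softmax(inputs)[target])."""
    return -math.log(softmax(inputs)[target])

# Hypothetical logits for 10 classes; class 0 is predicted very confidently.
logits = [20.0] + [0.0] * 9

# Loss computed directly on the logits: close to zero for a confident, correct prediction.
loss_from_logits = cross_entropy(logits, target=0)

# Loss computed on softmax outputs, as happens when the head ends in Softmax.
loss_from_probs = cross_entropy(softmax(logits), target=0)

# Lowest value reachable when the inputs are probabilities (target prob = 1, rest = 0).
floor = math.log(math.e + 9) - 1

print(f"loss on logits:        {loss_from_logits:.6f}")
print(f"loss on probabilities: {loss_from_probs:.4f} (floor {floor:.4f})")
```

Keeping the Softmax layer, as the sample does, still yields valid class predictions at inference time, because softmax preserves the ordering of the logits.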

Train the model with the prepared function in FP32 and in bf16 so that inference can be compared. Set the device to "GPU" (the function compares this string exactly, so capitalization matters) and specify the data type. To train on a CPU, change the device parameter to "cpu". Note that both GPU runs use the same modelName, so the bf16 run overwrites the checkpoint written by the FP32 run.

trainModel(train_loader, modelName="gpu_rn50", device="GPU", dataType="fp32")
trainModel(train_loader, modelName="gpu_rn50", device="GPU", dataType="bf16")
trainModel(train_loader, modelName="cpu_rn50", device="cpu", dataType="fp32")
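The functions in this sample test the device argument with a literal device == "GPU" comparison, so any other spelling takes the CPU path without raising an error. A small helper can make that explicit. This is only a sketch: normalize_device is a hypothetical name that is not part of the sample or of Intel Extension for PyTorch, and accepting "xpu" as an alias is an assumption made here for convenience.

```python
def normalize_device(device: str) -> str:
    """Map a user-supplied device name to the "GPU"/"CPU" spelling the sample checks for."""
    name = device.strip().upper()
    if name in ("GPU", "XPU"):  # "XPU" alias is an assumption of this sketch
        return "GPU"
    if name == "CPU":
        return "CPU"
    raise ValueError(f"Unsupported device: {device!r}")

print(normalize_device("gpu"))   # GPU
print(normalize_device(" cpu ")) # CPU
```

Calling the helper once at the top of each function, for example device = normalize_device(device), would turn a silent fallback into either the intended path or a clear error.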

To load a trained model from a file, prepare a custom loading function to specify a trained model location and create a model based on the saved checkpoint.

def load_model(cp_file = 'checkpoint_rn50.pth'):
    # Rebuild the same architecture that was used for training
    model = torchvision.models.resnet50()
    model.fc = torch.nn.Linear(2048,10)
    lin_layer = model.fc
    new_layer = torch.nn.Sequential(
        lin_layer,
        torch.nn.Softmax(dim=1)
    )
    model.fc = new_layer

    # Restore the trained weights from the checkpoint
    checkpoint = torch.load(cp_file)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model

Apply Intel Extension for PyTorch optimizations:

def ipex_jit_optimize(model, dataType = "fp32" , device="CPU"):
    model.eval()
    if device=="GPU":
        model = model.to("xpu:0")
    if dataType=="bf16":
        model = ipex.optimize(model, dtype=torch.bfloat16)
    else:
        model = ipex.optimize(model, dtype = torch.float32)

    with torch.no_grad():
        # Example input used only to trace the model
        d = torch.rand(1, 3, 224, 224)
        if device=="GPU":
            d = d.to("xpu:0")

        if dataType=="bf16":
            with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
                jit_model = torch.jit.trace(model, d) # JIT trace the optimized model
                jit_model = torch.jit.freeze(jit_model) # JIT freeze the traced model
        else:
            jit_model = torch.jit.trace(model, d) # JIT trace the optimized model
            jit_model = torch.jit.freeze(jit_model) # JIT freeze the traced model
    return jit_model

Measure the model's time for inference. To calculate model accuracy and its inference latency, create the function inferModel.

def inferModel(model, test_loader, device="cpu" , dataType='fp32'):
    correct = 0
    total = 0
    if device == "GPU":
        model = model.to("xpu:0")
    infer_time = 0

    with torch.no_grad():
        num_batches = len(test_loader)
        batches=0

        for i, data in tqdm(enumerate(test_loader)):

            # Wait for pending XPU work so that the timer covers only this batch
            if device == "GPU":
                torch.xpu.synchronize()
            start_time = time()
            images, labels = data
            if device == "GPU":
                images = images.to("xpu:0")

            outputs = model(images)
            outputs = outputs.to("cpu")
            _, predicted = torch.max(outputs.data, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            if device == "GPU":
                torch.xpu.synchronize()
            end_time = time()

            # Skip the first three batches (warm-up) and stop before the final batches
            if i>=3 and i<=num_batches-3:
                infer_time += (end_time-start_time)
                batches += 1
            if i == num_batches - 3:
                break

    accuracy = 100 * correct / total
    return accuracy, infer_time*1000/(batches*batch_size)
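The timing logic in inferModel can be isolated from the model to see what the returned latency means. The following is a framework-free sketch, not the sample's implementation: per_sample_latency_ms and its warmup and cooldown parameters are hypothetical names, it uses perf_counter instead of time, and its skip window is slightly different from the one in inferModel. It times a callable per batch, ignores the first and last few batches, and converts the total to milliseconds per sample.

```python
from time import perf_counter

def per_sample_latency_ms(run_batch, num_batches, batch_size, warmup=3, cooldown=3):
    """Average per-sample latency in milliseconds, ignoring warm-up and tail batches."""
    elapsed = 0.0
    timed = 0
    for i in range(num_batches):
        start = perf_counter()
        run_batch(i)  # stand-in for the forward pass on batch i
        stop = perf_counter()
        # Only batches in the steady-state window contribute to the average.
        if warmup <= i < num_batches - cooldown:
            elapsed += stop - start
            timed += 1
    if timed == 0:
        raise ValueError("not enough batches left after skipping warm-up and tail")
    # Seconds -> milliseconds, then divide by the number of samples processed.
    return elapsed * 1000 / (timed * batch_size)

# Usage with a dummy workload in place of a model call.
latency = per_sample_latency_ms(lambda i: sum(range(10_000)), num_batches=20, batch_size=128)
print(f"{latency:.6f} ms per sample")
```

Skipping the first batches keeps one-time costs such as JIT compilation and cache warm-up out of the average. Skipping the tail also excludes the final partial batch, which would otherwise be counted as a full batch_size in the division.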

Define the evaluation function using the custom loading function (as explained previously) and ipex_jit_optimize and inferModel functions.

def Eval_model(cp_file='checkpoint_model.pth', dataType="fp32" , device="GPU" ):
    model = load_model(cp_file)
    model = ipex_jit_optimize(model, dataType , device)
    accuracy, latency = inferModel(model, test_loader, device, dataType )
    print(f' Model accuracy:{accuracy} and Average Inference latency:{latency}')
    return accuracy, latency

Check accuracy and inference latency using the prepared Eval_model function. Specify different device and data type parameters to check performance of all trained models. The final step is to summarize the inference results.

Eval_model(cp_file = 'checkpoint_gpu_rn50.pth', dataType = "fp32", device="GPU")
Eval_model(cp_file = 'checkpoint_gpu_rn50.pth', dataType = "bf16", device="GPU")
Eval_model(cp_file = 'checkpoint_cpu_rn50.pth', dataType = "fp32", device="CPU")
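Because Eval_model returns an (accuracy, latency) tuple, the summary step can be a few lines of plain Python. The sketch below is one possible way to do it; summarize is a hypothetical helper, and the numbers in the example dictionary are placeholders chosen for easy arithmetic, not measurements from any hardware.

```python
def summarize(results, baseline="GPU fp32"):
    """Format accuracy/latency rows plus each run's speedup relative to the baseline."""
    base_latency = results[baseline][1]
    lines = [f"{'run':<10} {'accuracy %':>10} {'latency ms':>11} {'speedup':>8}"]
    for name, (accuracy, latency) in results.items():
        speedup = base_latency / latency  # > 1 means faster than the baseline
        lines.append(f"{name:<10} {accuracy:>10.2f} {latency:>11.3f} {speedup:>7.2f}x")
    return "\n".join(lines)

# Placeholder values for illustration only; they are NOT measured results.
example = {
    "GPU fp32": (50.0, 2.0),
    "GPU bf16": (50.0, 1.0),
    "CPU fp32": (50.0, 4.0),
}
print(summarize(example))
```

In practice, each dictionary value would be the tuple returned by the corresponding Eval_model call.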

Try out the code sample on Linux* and Jupyter* Notebook. The code sample compares the performance of models trained using FP32 and bf16 on CPU and GPU. Its output illustrates the performance gain from auto-mixed precision with bf16.

What’s Next?

Use the auto-mixed precision feature in Intel Extension for PyTorch to improve performance for training and inference workloads. To learn more, see Introducing Intel Extension for PyTorch for GPUs.

We encourage you to also check out and incorporate Intel’s other AI and machine learning framework optimizations and end-to-end portfolio of tools into your AI workflow. Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel® AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

Framework Optimizations

AI Development Software

Intel oneAPI

AI Development Resources and Tools