
Overview

This tutorial shows you how to run inference on Intel® Gaudi® and Intel Gaudi 2 AI accelerators. These simple, fully runnable examples show how to run inference using the MNIST dataset, a simple checkpoint, and a linear model. For more details, refer to the Inference User Guide. There are three examples on GitHub*:

  • Example 1 is a simple inference example showing how to run using the model.eval() path, which is the most direct path to running inference (a minimal sketch of this path follows the list).
  • Example 2 adds the use of HPU Graphs with the Graph and Stream APIs.
  • Example 3 uses HPU Graphs with the wrap_in_hpu_graph API, a simpler alternative to the Graph and Stream APIs.
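
For orientation, here is a minimal sketch of the Example 1 style (the plain model.eval() path). The tiny nn.Linear model and input shapes are placeholders for illustration, not the model used in the actual example:

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device and provides mark_step

# Placeholder model; Example 1 defines its own MNIST model
model = torch.nn.Linear(784, 10).eval().to("hpu")
x = torch.randn(32, 784, device="hpu")

with torch.no_grad():
    y = model(x)
    htcore.mark_step()  # in lazy mode, flush the accumulated ops to the device
print(y.shape)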

The HPU Graph API provides a performance optimization technique to reduce PyTorch* host overhead. This is done by capturing the PyTorch run on a stream for the first iteration and replaying that in subsequent ones. The replay avoids the PyTorch overhead of accumulating the operations (ops) in the model and makes the execution device-bound.
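
As an illustration of that capture-and-replay pattern (the approach used in Example 2), the sketch below uses the HPUGraph and Stream classes exposed by habana_frameworks.torch.hpu; treat it as an outline and check the reference documentation for the exact API:

import torch
import habana_frameworks.torch as ht

model = torch.nn.Linear(784, 10).eval().to("hpu")   # placeholder model
static_input = torch.randn(32, 784, device="hpu")   # input buffer reused across iterations

g = ht.hpu.HPUGraph()
s = ht.hpu.Stream()

# First iteration: capture the forward pass on a stream
with ht.hpu.stream(s):
    g.capture_begin()
    static_output = model(static_input)
    g.capture_end()

# Subsequent iterations: copy new data into the captured input buffer and replay,
# skipping the per-op host overhead of rebuilding the graph
next_batch = torch.randn(32, 784)
static_input.copy_(next_batch.to("hpu"))
g.replay()
print(static_output.to("cpu").shape)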

For further details on Stream APIs and HPU Graph APIs, refer to the reference documentation.

The HPU Graph API can be used for performance gains and should be applied to real-world models where the application is latency sensitive or the host time ends up greater than the device time due to a low batch size. The HPU Graphs feature can help minimize this host time.
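
One rough way to gauge whether HPU Graphs help your workload is to time a wrapped and an unwrapped copy of the same model. The sketch below is illustrative only: the throwaway model, batch size, and iteration count are placeholders, and actual gains depend on how host-bound your model is:

import time
import torch
import torch.nn as nn
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore

def avg_latency(model, data, iters=100):
    with torch.no_grad():
        model(data)                      # warm-up; captures and caches the HPU Graph if wrapped
        ht.hpu.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(data)
            htcore.mark_step()           # flush lazy-mode ops; redundant for the wrapped model
        ht.hpu.synchronize()
    return (time.perf_counter() - start) / iters * 1e3   # milliseconds per batch

def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

data = torch.randn(32, 1, 28, 28, device="hpu")
eager_model = make_model().to("hpu")
graph_model = ht.hpu.wrap_in_hpu_graph(make_model()).to("hpu")

print("eager:      {:.3f} ms/batch".format(avg_latency(eager_model, data)))
print("HPU Graphs: {:.3f} ms/batch".format(avg_latency(graph_model, data)))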

The three examples are provided in Jupyter* Notebooks. To run these examples, start with one of the following two options:

Amazon EC2* DL1 Instances (based on first-generation Intel Gaudi software):

  • An Amazon Web Services (AWS)* account is required. For instructions on starting a DL1 instance, see the quick start guide.

Intel® Developer Cloud (using Intel Gaudi 2 software):

  • A user account is required.

With either option, see the installation guide to pull and run a Docker* image with PyTorch, and then install the JupyterLab library. The walkthrough below follows Example 3, which uses inference mode with HPU Graphs.

Inference Mode

This tutorial shows how to use inference mode with HPU Graph, along with the built-in wrapper wrap_in_hpu_graph, by using a simple model and the MNIST dataset.

Download the pretrained model checkpoint from the Habana vault.

!wget https://vault.habana.ai/artifactory/misc/inference/mnist/mnist-epoch_20.pth

Import all necessary dependencies.

import os
import sys
import torch
import time
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
import torch.nn as nn
import torch.nn.functional as F

Define a simple neural network model (Net) for MNIST.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1   = nn.Linear(784, 256)
        self.fc2   = nn.Linear(256, 64)
        self.fc3   = nn.Linear(64, 10)
    def forward(self, x):
        out = x.view(-1,28*28)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        out = F.log_softmax(out, dim=1)
        return out

Create the model, load the pretrained checkpoint, and set the model to evaluation mode.

model = Net()
checkpoint = torch.load('mnist-epoch_20.pth')
model.load_state_dict(checkpoint)
model = model.eval()

Wrap the model with HPU graph, and move it to the Intel Gaudi accelerator (hpu). Use wrap_in_hpu_graph to wrap the module forward function with HPU Graphs. This wrapper captures, caches, and replays the graph.

model = ht.hpu.wrap_in_hpu_graph(model)
model = model.to("hpu")

Typical output from this step shows the system configuration, for example:

============SYSTEM CONFIGURATION===============
Num CPU Cores = 96
CPU RAM = 784300908 KB
================================================

Create an MNIST dataset for evaluation. This is pulled from the Torchvision library.

transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])

data_path = './data'
test_kwargs = {'batch_size': 32}
dataset1 = datasets.MNIST(data_path, train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(dataset1,**test_kwargs)

Do a warm-up run. The HPU graph is captured and cached.

warmup_input = torch.randn(32, 1, 28, 28, device='hpu')
warmup_output = model(warmup_input)

Run inference.

The model was already wrapped with wrap_in_hpu_graph as shown earlier, so there is no need to copy and replay the stream. It is done in the background. Use asynchronous copies (copy with non_blocking=True followed by mark_step) to further optimize the inference. Adding mark_step after model() is not required with HPU Graphs as it is handled implicitly.

For more information, refer to the inference optimization guidelines.

correct = 0
for batch_idx, (data, label) in enumerate(test_loader):
    data = data.to("hpu", non_blocking=True)
    htcore.mark_step()
    output = model(data)
    # Bring the predictions back to the CPU before comparing with the CPU labels
    correct += output.max(1)[1].cpu().eq(label).sum()

print('Accuracy: {:.2f}%'.format(100. * correct / (len(test_loader) * 32)))

Accuracy: 94.36%

Summary

Running inference on Intel Gaudi accelerators is straightforward, and adding HPU Graphs can improve the performance of your model.

Licensed under the Apache License, Version 2.0 (the “License”)

You may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.