Accelerate TensorFlow* Inference with Intel® Neural Compressor

Overview

As a data scientist or AI developer, one of your common tasks is to optimize deep learning models for inference. Intel® Neural Compressor is a tool that helps you easily perform model compression to reduce model size and increase the speed of deep learning inference for deployment on Intel hardware.

This article walks through a code sample that shows how to use Intel Neural Compressor to accelerate inference for a TensorFlow* model without sacrificing accuracy.

Optimizing TensorFlow Model Inference

TensorFlow is one of the most popular deep learning frameworks and improving the inference performance of your TensorFlow model is an important part of optimizing your AI workflow. Intel Neural Compressor is an open source library that automates model compression technologies, such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks. This Python* library can quantize activations and weights to int8, bfloat16, or a mixture of FP32, bfloat16, and int8 to reduce model size and accelerate inference while minimizing precision loss. Intel Neural Compressor requires four elements to run model quantization and tuning:
 

  1. Calibration dataloader – a class that loads the dataset, for example images and their corresponding labels, in batches
  2. Model – an FP32 model to be quantized
  3. Configuration file – a YAML file that specifies all necessary parameters
  4. Evaluation function – a function that returns the accuracy achieved by the model on a given dataset
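
In this code sample, the evaluation is driven by the YAML configuration and an evaluation dataloader rather than a standalone function. For illustration only, a minimal evaluation function, assuming a compiled Keras model and a test split (x_test, y_test) such as the MNIST data used later, might look like this:

    def eval_func(model):
        # Hypothetical example: return a single accuracy value for the model,
        # which is what Intel Neural Compressor expects from an evaluation function.
        # Assumes 'model' is a compiled Keras model and x_test/y_test are the test split.
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        return accuracy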

Code Sample

This code sample shows the process of building a convolutional neural network (CNN) model to recognize handwritten numbers and demonstrates how to increase the inference performance by using Intel Neural Compressor. Intel Neural Compressor simplifies the process of converting the FP32 model to int8 or bfloat16 (BF16) and can achieve higher inference performance. In addition, Intel Neural Compressor tunes the quantization method to reduce the accuracy loss.

Get the Code Sample

The following steps are implemented in the code sample:
 

  1. Setup
  2. Model training
  3. Quantization of the model using Intel Neural Compressor
  4. Performance comparison between models

Setup

  1. Import Python packages and verify that the correct versions are installed. The required packages are:

          •  TensorFlow 2.2 and later
          •  Intel Neural Compressor 1.2.1 and later
          •  Matplotlib
    import tensorflow as tf
    print("Tensorflow version {}".format(tf.__version__))
    tf.compat.v1.enable_eager_execution()

    # Intel Neural Compressor was previously released as LPOT and, before that,
    # as iLiT, so fall back to the older package names if needed.
    try:
        import neural_compressor as inc
        print("neural_compressor version {}".format(inc.__version__))
    except ImportError:
        try:
            import lpot as inc
            print("LPOT version {}".format(inc.__version__))
        except ImportError:
            import ilit as inc
            print("iLiT version {}".format(inc.__version__))

    import matplotlib.pyplot as plt
    import numpy as np

    from IPython import display
  2. Enable the Intel optimizations for TensorFlow by setting the TF_ENABLE_MKL_NATIVE_FORMAT=0 environment variable (needed for TensorFlow 2.5 and later). The variable must be set before running Intel Neural Compressor to quantize the FP32 model and before deploying the quantized model.
    %env TF_ENABLE_MKL_NATIVE_FORMAT=0
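
The %env magic above applies to a Jupyter* Notebook. If you run the sample as a plain Python script, the same variable can be set with os.environ; a minimal sketch, setting it before TensorFlow is imported so the flag is visible to the runtime:

    import os

    # Set before importing TensorFlow so the flag is visible to the runtime.
    os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

    import tensorflow as tf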

Train a CNN Model Based on Keras

The code sample includes a Python script that runs all of the training steps.
 

  1. Load the dataset. This sample uses the MNIST dataset of handwritten digits.
    import alexnet
    
    data = alexnet.read_data()
    x_train, y_train, label_train, x_test, y_test, label_test = data
    print('train', x_train.shape, y_train.shape, label_train.shape)
    print('test', x_test.shape, y_test.shape, label_test.shape)
  2. Train the model with the dataset. The number of training epochs is 3.
    # 'model' is the Keras CNN built earlier in the sample by the alexnet script.
    epochs = 3
    alexnet.train_mod(model, data, epochs)
    
  3. Freeze and save the model to a single protobuf (*.pb) file. Set the input node name to x.
    from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

    def save_frezon_pb(model, mod_path):
        # Convert the Keras model to a ConcreteFunction
        full_model = tf.function(lambda x: model(x))
        concrete_function = full_model.get_concrete_function(
            x=tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

        # Get the frozen ConcreteFunction
        frozen_model = convert_variables_to_constants_v2(concrete_function)

        # Write the frozen graph to a single protobuf file
        tf.io.write_graph(graph_or_graph_def=frozen_model.graph,
                          logdir=".",
                          name=mod_path,
                          as_text=False)

    fp32_frezon_pb_file = "fp32_frezon.pb"
    save_frezon_pb(model, fp32_frezon_pb_file)
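
The alexnet.load_pb helper used in the next section reads this frozen graph back from disk. Its exact implementation lives in the sample's scripts; a minimal sketch of such a loader, assuming only that the file is the frozen protobuf written above, could be:

    def load_pb(pb_path):
        # Read the serialized GraphDef from the frozen .pb file.
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())

        # Import it into a new Graph so it can be used for inference or quantization.
        with tf.Graph().as_default() as graph:
            tf.compat.v1.import_graph_def(graph_def, name='')
        return graph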
    

Model Quantization using Intel Neural Compressor

Similar to the training process, a Python script is prepared for quantization. It contains all the steps needed to quantize and tune the model, as explained in a previous section.
 

  1. Define the Dataloader – a class that provides an __iter__ function to return images and their labels in batches of the given batch size. This sample uses the validation data of the MNIST dataset.
    import mnist_dataset
    import math


    class Dataloader(object):
        def __init__(self, batch_size):
            self.batch_size = batch_size

        def __iter__(self):
            x_train, y_train, label_train, x_test, y_test, label_test = mnist_dataset.read_data()
            batch_nums = math.ceil(len(x_test)/self.batch_size)

            for i in range(batch_nums-1):
                begin = i*self.batch_size
                end = (i+1)*self.batch_size
                yield x_test[begin: end], label_test[begin: end]

            begin = (batch_nums-1)*self.batch_size
            yield x_test[begin:], label_test[begin:]
  2. Load the FP32 model that was saved previously.
    # input_graph_path is the path to the frozen FP32 model, for example "fp32_frezon.pb".
    fp32_graph = alexnet.load_pb(input_graph_path)
    
  3. Define the configuration file. This sample uses the YAML file from the oneAPI code samples on GitHub*. The file holds all the parameters that Intel Neural Compressor needs to perform quantization and tuning; an illustrative sketch is shown after this list. For more information about the YAML configuration file, see the template in the Intel Neural Compressor repository.
  4. Define the tuning function. Intel Neural Compressor quantizes the model and uses a validation dataset for tuning; as a result, it returns a frozen, quantized int8 model. The function defined below does this for the given configuration file and FP32 model path.
    def auto_tune(input_graph_path, yaml_config, batch_size):
        fp32_graph = alexnet.load_pb(input_graph_path)
        quan = inc.Quantization(yaml_config)
        dataloader = Dataloader(batch_size)
        assert(dataloader)
        q_model = quan(
            fp32_graph,
            q_dataloader=dataloader,
            eval_func=None,
            eval_dataloader=dataloader)
        return q_model
  5. Define how the quantized model is written to a file. For this purpose, the function save_int8_frezon_pb is created.
    def save_int8_frezon_pb(q_model, path):
        from tensorflow.python.platform import gfile
        f = gfile.GFile(path, 'wb')
        f.write(q_model.as_graph_def().SerializeToString())
        print("Save to {}".format(path))
  6. Call the auto_tune function to quantize the model, and remember to save the resulting model.
    yaml_file = "alexnet.yaml"
    batch_size = 200

    fp32_frezon_pb_file = "fp32_frezon.pb"
    int8_pb_file = "alexnet_int8_model.pb"

    q_model = auto_tune(fp32_frezon_pb_file, yaml_file, batch_size)
    save_int8_frezon_pb(q_model, int8_pb_file)
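
For reference, a minimal configuration in the spirit of alexnet.yaml is sketched below. The top-level sections (model, quantization, tuning) follow the Intel Neural Compressor 1.x YAML template; the values shown are illustrative assumptions rather than the exact contents of the sample's file.

    model:
      name: alexnet
      framework: tensorflow      # deep learning framework of the FP32 model
      inputs: x                  # input node name set when the model was frozen
      outputs: Identity          # illustrative output node name

    quantization:
      calibration:
        sampling_size: 500       # number of calibration samples drawn from the dataloader

    tuning:
      accuracy_criterion:
        relative: 0.01           # allow at most 1% relative accuracy loss
      exit_policy:
        timeout: 0               # 0: tune until the accuracy criterion is met
      random_seed: 100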

Compare Models

The Python script profiling_inc.py is created to compare the performance of the FP32 and int8 models. The performance results are saved to a separate file for each model. Additionally, Intel® Deep Learning Boost helps speed up int8 inference on supported hardware.
 

  1. Run the profiling_inc.py script with the original FP32 model. The results are saved in the 32.json file.
    python profiling_inc.py --input-graph=./fp32_frezon.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=32
  2. Do the same with the int8 model. The results are saved in the 8.json file.
    python profiling_inc.py --input-graph=./alexnet_int8_model.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=8
  3. A summary of the results for both models is shown using the draw_bar function (a sketch of such a helper follows this list).
    res_32 = load_res('32.json')
    res_8 = load_res('8.json')

    accuracys = [res_32['accuracy'], res_8['accuracy']]
    throughputs = [res_32['throughput'], res_8['throughput']]
    latencys = [res_32['latency'], res_8['latency']]

    # Convert accuracy to percent so the difference can be reported in percentage points.
    accuracys_perc = [accu*100 for accu in accuracys]

    throughputs_times = [1, throughputs[1]/throughputs[0]]
    latencys_times = [1, latencys[1]/latencys[0]]
    accuracys_times = [0, accuracys_perc[1] - accuracys_perc[0]]
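
The load_res and draw_bar helpers are defined in the sample's notebook. A minimal matplotlib sketch of a draw_bar-style helper, assuming it simply plots one bar per model, could look like this:

    import matplotlib.pyplot as plt

    def draw_bar(values, title, ylabel):
        # Hypothetical helper: plot one bar per model (FP32 vs. int8).
        labels = ['FP32', 'int8']
        plt.bar(labels, values)
        plt.title(title)
        plt.ylabel(ylabel)
        plt.show()

    # Example usage with the relative numbers computed above.
    draw_bar(throughputs_times, 'Throughput (relative to FP32)', 'speedup (x)')
    draw_bar(latencys_times, 'Latency (relative to FP32)', 'ratio')
    draw_bar(accuracys_times, 'Accuracy difference', 'percentage points')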
    

Get the Software

AI Tools

Accelerate data science and AI pipelines—from preprocessing through machine learning—and provide interoperability for efficient model development.

Intel® Neural Compressor

Speed up AI inference without sacrificing accuracy with this open source Python library that automates popular model compression technologies.
