Accelerate TensorFlow* Inference with Intel® Neural Compressor

Overview

As a data scientist or AI developer, one of your common tasks is to optimize deep learning models for inference. Intel® Neural Compressor is a tool that helps you easily perform model compression to reduce model size and increase the speed of deep learning inference for deployment on Intel hardware.

This article walks through a code sample that shows how to use Intel Neural Compressor to accelerate inference for a TensorFlow* model without sacrificing accuracy.

Optimizing TensorFlow Model Inference

TensorFlow is one of the most popular deep learning frameworks and improving the inference performance of your TensorFlow model is an important part of optimizing your AI workflow. Intel Neural Compressor is an open source library that automates model compression technologies, such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks. This Python* library can quantize activations and weights to int8, bfloat16, or a mixture of FP32, bfloat16, and int8 to reduce model size and accelerate inference while minimizing precision loss. Intel Neural Compressor requires four elements to run model quantization and tuning:
 

  1. Calibration dataloader – a class that loads the dataset, for example images and their corresponding labels, in batches
  2. Model – an FP32 model to be quantized
  3. Configuration file – a YAML file that specifies all necessary parameters
  4. Evaluation function – a function that returns the accuracy achieved by the model on a given dataset
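
In this code sample, the evaluation is driven by the YAML configuration and an evaluation dataloader rather than a standalone function. For illustration only, a minimal evaluation function, assuming a compiled Keras model and a test split (x_test, y_test) such as the MNIST data used later, might look like this:

    def eval_func(model):
        # Hypothetical example: return a single accuracy value for the model,
        # which is what Intel Neural Compressor expects from an evaluation function.
        # Assumes 'model' is a compiled Keras model and x_test/y_test are the test split.
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        return accuracy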

Code Sample

This code sample shows the process of building a convolutional neural network (CNN) model to recognize handwritten numbers and demonstrates how to increase the inference performance by using Intel Neural Compressor. Intel Neural Compressor simplifies the process of converting the FP32 model to int8 or bfloat16 (BF16) and can achieve higher inference performance. In addition, Intel Neural Compressor tunes the quantization method to reduce the accuracy loss.

Get the Code Sample

The following steps are implemented in the code sample:
 

  1. Setup
  2. Model training
  3. Quantization of the model using Intel Neural Compressor
  4. Performance comparison between models

Setup

  1. Import Python packages and verify that the correct versions are installed. The required packages are:

          •  TensorFlow 2.2 and later
          •  Intel Neural Compressor 1.2.1 and later
          •  Matplotlib
    import tensorflow as tf
    print("Tensorflow version {}".format(tf.__version__))
    tf.compat.v1.enable_eager_execution()

    # Intel Neural Compressor was previously released as LPOT and, before that,
    # as iLiT, so fall back to the older package names if needed.
    try:
        import neural_compressor as inc
        print("neural_compressor version {}".format(inc.__version__))
    except ImportError:
        try:
            import lpot as inc
            print("LPOT version {}".format(inc.__version__))
        except ImportError:
            import ilit as inc
            print("iLiT version {}".format(inc.__version__))

    import matplotlib.pyplot as plt
    import numpy as np

    from IPython import display
  2. Enable the Intel optimizations for TensorFlow by setting the TF_ENABLE_MKL_NATIVE_FORMAT=0 environment variable (needed for TensorFlow 2.5 and later). The variable must be set before running Intel Neural Compressor to quantize the FP32 model and before deploying the quantized model.
    %env TF_ENABLE_MKL_NATIVE_FORMAT=0
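
The %env magic above applies to a Jupyter* Notebook. If you run the sample as a plain Python script, the same variable can be set with os.environ; a minimal sketch, setting it before TensorFlow is imported so the flag is visible to the runtime:

    import os

    # Set before importing TensorFlow so the flag is visible to the runtime.
    os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

    import tensorflow as tf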

Train a CNN Model Based on Keras

The code sample includes a Python script that runs all of the training steps.
 

  1. Load the dataset. This sample uses the MNIST dataset of handwritten digits.
    import alexnet
    
    data = alexnet.read_data()
    x_train, y_train, label_train, x_test, y_test, label_test = data
    print('train', x_train.shape, y_train.shape, label_train.shape)
    print('test', x_test.shape, y_test.shape, label_test.shape)
  2. Train the model with the dataset. The number of training epochs is 3.
    # 'model' is the Keras CNN built earlier in the sample by the alexnet script.
    epochs = 3
    alexnet.train_mod(model, data, epochs)
    
  3. Freeze and save the model to a single protobuf (*.pb) file. Set the input node name to x.
    from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

    def save_frezon_pb(model, mod_path):
        # Convert the Keras model to a ConcreteFunction
        full_model = tf.function(lambda x: model(x))
        concrete_function = full_model.get_concrete_function(
            x=tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

        # Get the frozen ConcreteFunction
        frozen_model = convert_variables_to_constants_v2(concrete_function)

        # Write the frozen graph to a single protobuf file
        tf.io.write_graph(graph_or_graph_def=frozen_model.graph,
                          logdir=".",
                          name=mod_path,
                          as_text=False)

    fp32_frezon_pb_file = "fp32_frezon.pb"
    save_frezon_pb(model, fp32_frezon_pb_file)
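
The alexnet.load_pb helper used in the next section reads this frozen graph back from disk. Its exact implementation lives in the sample's scripts; a minimal sketch of such a loader, assuming only that the file is the frozen protobuf written above, could be:

    def load_pb(pb_path):
        # Read the serialized GraphDef from the frozen .pb file.
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())

        # Import it into a new Graph so it can be used for inference or quantization.
        with tf.Graph().as_default() as graph:
            tf.compat.v1.import_graph_def(graph_def, name='')
        return graph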
    

Model Quantization using Intel Neural Compressor

Similar to the training process, a Python script is prepared for quantization. It contains all the steps needed to quantize and tune the model, as explained in a previous section.
 

  1. Define the Dataloader – a class that provides an __iter__ function to return images and their labels in batches of the given batch size. This sample uses the validation data of the MNIST dataset.
    import mnist_dataset
    import math


    class Dataloader(object):
        def __init__(self, batch_size):
            self.batch_size = batch_size

        def __iter__(self):
            x_train, y_train, label_train, x_test, y_test, label_test = mnist_dataset.read_data()
            batch_nums = math.ceil(len(x_test)/self.batch_size)

            for i in range(batch_nums-1):
                begin = i*self.batch_size
                end = (i+1)*self.batch_size
                yield x_test[begin: end], label_test[begin: end]

            begin = (batch_nums-1)*self.batch_size
            yield x_test[begin:], label_test[begin:]
  2. Load the FP32 model that was saved previously.
    # input_graph_path is the path to the frozen FP32 model, for example "fp32_frezon.pb".
    fp32_graph = alexnet.load_pb(input_graph_path)
    
  3. Define the configuration file. This sample uses the YAML file from the oneAPI code samples on GitHub*. The file holds all the parameters that Intel Neural Compressor needs to perform quantization and tuning; an illustrative sketch is shown after this list. For more information about the YAML configuration file, see the template in the Intel Neural Compressor repository.
  4. Define the tuning function. Intel Neural Compressor quantizes the model and uses a validation dataset for tuning; as a result, it returns a frozen, quantized int8 model. The function defined below does this for the given configuration file and FP32 model path.
    def auto_tune(input_graph_path, yaml_config, batch_size):
        fp32_graph = alexnet.load_pb(input_graph_path)
        quan = inc.Quantization(yaml_config)
        dataloader = Dataloader(batch_size)
        assert(dataloader)
        q_model = quan(
            fp32_graph,
            q_dataloader=dataloader,
            eval_func=None,
            eval_dataloader=dataloader)
        return q_model
  5. Define how the quantized model is written to a file. For this purpose, the function save_int8_frezon_pb is created.
    def save_int8_frezon_pb(q_model, path):
        from tensorflow.python.platform import gfile
        f = gfile.GFile(path, 'wb')
        f.write(q_model.as_graph_def().SerializeToString())
        print("Save to {}".format(path))
  6. Call the auto_tune function to quantize the model, and remember to save the resulting model.
    yaml_file = "alexnet.yaml"
    batch_size = 200

    fp32_frezon_pb_file = "fp32_frezon.pb"
    int8_pb_file = "alexnet_int8_model.pb"

    q_model = auto_tune(fp32_frezon_pb_file, yaml_file, batch_size)
    save_int8_frezon_pb(q_model, int8_pb_file)
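
For reference, a minimal configuration in the spirit of alexnet.yaml is sketched below. The top-level sections (model, quantization, tuning) follow the Intel Neural Compressor 1.x YAML template; the values shown are illustrative assumptions rather than the exact contents of the sample's file.

    model:
      name: alexnet
      framework: tensorflow      # deep learning framework of the FP32 model
      inputs: x                  # input node name set when the model was frozen
      outputs: Identity          # illustrative output node name

    quantization:
      calibration:
        sampling_size: 500       # number of calibration samples drawn from the dataloader

    tuning:
      accuracy_criterion:
        relative: 0.01           # allow at most 1% relative accuracy loss
      exit_policy:
        timeout: 0               # 0: tune until the accuracy criterion is met
      random_seed: 100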

Compare Models

The Python script profiling_inc.py is created to compare the performance of the FP32 and int8 models. The performance results are saved to a separate file for each model. Additionally, Intel® Deep Learning Boost helps speed up int8 inference on supported hardware.
 

  1. Run the profiling_inc.py script with the original FP32 model. The results are saved in the 32.json file.
    python profiling_inc.py --input-graph=./fp32_frezon.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=32
  2. Do the same with the int8 model. The results are saved in the 8.json file.
    python profiling_inc.py --input-graph=./alexnet_int8_model.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=8
  3. A summary of the results for both models is shown using the draw_bar function (a sketch of such a helper follows this list).
    res_32 = load_res('32.json')
    res_8 = load_res('8.json')

    accuracys = [res_32['accuracy'], res_8['accuracy']]
    throughputs = [res_32['throughput'], res_8['throughput']]
    latencys = [res_32['latency'], res_8['latency']]

    # Convert accuracy to percent so the difference can be reported in percentage points.
    accuracys_perc = [accu*100 for accu in accuracys]

    throughputs_times = [1, throughputs[1]/throughputs[0]]
    latencys_times = [1, latencys[1]/latencys[0]]
    accuracys_times = [0, accuracys_perc[1] - accuracys_perc[0]]
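
The load_res and draw_bar helpers are defined in the sample's notebook. A minimal matplotlib sketch of a draw_bar-style helper, assuming it simply plots one bar per model, could look like this:

    import matplotlib.pyplot as plt

    def draw_bar(values, title, ylabel):
        # Hypothetical helper: plot one bar per model (FP32 vs. int8).
        labels = ['FP32', 'int8']
        plt.bar(labels, values)
        plt.title(title)
        plt.ylabel(ylabel)
        plt.show()

    # Example usage with the relative numbers computed above.
    draw_bar(throughputs_times, 'Throughput (relative to FP32)', 'speedup (x)')
    draw_bar(latencys_times, 'Latency (relative to FP32)', 'ratio')
    draw_bar(accuracys_times, 'Accuracy difference', 'percentage points')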
    

Get the Software

AI Tools

Accelerate data science and AI pipelines—from preprocessing through machine learning—and provide interoperability for efficient model development.

Intel® Neural Compressor

Speed up AI inference without sacrificing accuracy with this open source Python library that automates popular model compression technologies.
