Intel® Neural Compressor (INC) is an open-source Python library that runs on Intel CPUs and GPUs and optimizes neural network models using popular compression technologies such as quantization, pruning, and knowledge distillation. More information can be found at https://github.com/intel/neural-compressor.
In this article, we show how to get started with INC and its quantization feature on an AWS EC2 instance. We used a c5.12xlarge instance running Ubuntu 20.04, with an Intel® Xeon® Platinum 8275CL CPU, 48 virtual cores, and 96 GB of memory. We deliberately chose a larger instance so that the benefit of multiple cores is visible when comparing the throughput of the quantized model against the fp32 model.
Here are the steps we took to quantize an fp32 model and compare performance results between fp32 and int8:
- Download and install the Intel® AI Analytics Toolkit. INC is included as part of the toolkit.
- A simple Google search for “Intel® AI Analytics Toolkit” will point you to the link below: https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html
- Choose the most appropriate download method for you.
- We used the online installer method and executed the commands below to download and install the package on the EC2 instance.
- wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/18235/l_AIKit_p_2021.4.0.1460.sh
- sudo bash l_AIKit_p_2021.4.0.1460.sh
- Install INC.
The AI Analytics Toolkit package is installed under /opt/intel/oneapi by default. Go to the LPOT directory and run the LPOT installation script. LPOT stands for Low Precision Optimization Tool, the former name of INC.
cd /opt/intel/oneapi/LPOT/latest
sudo ./install_LPOT.sh
- Now that the toolkit and INC are installed, run the script that sets the environment variables, then list and activate the conda environments:
- source /opt/intel/oneapi/setvars.sh
- conda env list
- conda activate tensorflow
We will work in the tensorflow environment since our sample is based on the TensorFlow framework.
- We download the Intel® oneAPI samples; the INC sample is located under the AI-and-Analytics/Getting-Started-Samples folder.
- The INC sample shows how to train a CNN model with Keras, how to quantize the Keras model using INC, and finally how to compare the performance of the quantized int8 model against the fp32 model.
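For orientation, the Keras part of that flow looks roughly like the sketch below. This is a generic minimal example, not the sample's exact architecture or dataset; the model, dataset and file name here are placeholders chosen for illustration.

import tensorflow as tf
from tensorflow import keras

# Small CNN on MNIST, standing in for the sample's model and dataset
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)
model.save("cnn_fp32_saved_model")  # SavedModel directory; name chosen for illustration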
There is a Jupyter notebook inside the sample folder that contains step-by-step instructions and explains how and why to run each step. To use it, install Jupyter and launch the notebook server with the commands below:
- pip install notebook
- export PATH=/home/ubuntu/.local/bin:$PATH
- jupyter notebook --no-browser --port 8888
When connecting to the EC2 instance using PuTTY, for example, we need to create an SSH tunnel by adding the source port (8888) and destination (localhost:8888). This way we can open the Jupyter notebook locally in our own browser (not on the EC2 instance) using the link that the notebook server generates.
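If you connect from a terminal with OpenSSH instead of PuTTY, an equivalent tunnel can be created with a command along these lines (the key file and host name are placeholders):
- ssh -i <your-key.pem> -L 8888:localhost:8888 ubuntu@<ec2-public-dns>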
- If you are unable to open the Jupyter notebook, it is also possible to run the sample directly from the EC2 console using the steps below:
- Modify the alexnet.py file to include the 3 lines below:
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
# end of the main function
fp32_frezon_pb_file = "fp32_frezon.pb"
save_frezon_pb(model, fp32_frezon_pb_file)
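save_frezon_pb is defined inside the sample's alexnet.py; for reference, a helper like it typically freezes the trained Keras model along the lines of the minimal sketch below (not the sample's exact code):

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def save_frezon_pb(model, pb_file):
    # Trace the Keras model into a concrete function with a fixed input signature
    func = tf.function(lambda x: model(x))
    concrete = func.get_concrete_function(
        tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))
    # Fold the variables into constants so the graph can be written as a single frozen .pb
    frozen = convert_variables_to_constants_v2(concrete)
    tf.io.write_graph(frozen.graph, ".", pb_file, as_text=False)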
- The alexnet.py file builds the CNN model, trains it, and saves the trained model both as a SavedModel and as a frozen model. Run alexnet.py using the Python* command:
- python alexnet.py
- Next, we quantize the fp32 model by running the commands below, which generate the alexnet_int8_model.pb file:
export TF_ENABLE_MKL_NATIVE_FORMAT=0
python lpot_quantize_model.py
lpot_quantize_model.py shows how to use the INC API to perform the quantization; a rough sketch of that flow is shown below.
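The sketch assumes the experimental LPOT API bundled with this toolkit release; the YAML configuration file name and the calibration data are placeholders, so consult the sample for the exact code:

import numpy as np
from lpot.experimental import Quantization, common

# Dummy calibration data standing in for the sample's real dataset (shape is illustrative)
calib_dataset = [(np.random.rand(28, 28).astype("float32"), 0) for _ in range(100)]

quantizer = Quantization("./alexnet.yaml")            # hypothetical YAML config: framework, accuracy target, calibration settings
quantizer.model = common.Model("./fp32_frezon.pb")    # fp32 frozen graph produced by alexnet.py
quantizer.calib_dataloader = common.DataLoader(calib_dataset)
q_model = quantizer()                                 # calibration plus accuracy-driven tuning
q_model.save("./alexnet_int8_model.pb")               # writes the int8 frozen graph used below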
- Next, we use a utility script (profiling_lpot.py) that runs inference with both the fp32 and int8 models and outputs latency and throughput results as follows:
- python profiling_lpot.py --input-graph=./fp32_frezon.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=32
Tensorflow version 2.5.0
Loading data ...
Done
2021-11-16 23:55:04.081027: I tensorflow/core/platform/cpu_feature_guard.cc:142]
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-16 23:55:04.219435: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-16 23:55:04.238076: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2999995000 Hz
accuracy: 0.9893
max throughput(fps): 1133.1466433490598
latency(ms): 1.4293733908205617
Save result to 32.json
- python profiling_lpot.py --input-graph=./alexnet_int8_model.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=8
Tensorflow version 2.5.0
Loading data ...
Done
2021-11-17 00:07:36.600982: I tensorflow/core/platform/cpu_feature_guard.cc:142]
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-17 00:07:36.669854: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-17 00:07:36.690049: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2999995000 Hz
accuracy: 0.9893
max throughput(fps): 3843.266917624619
latency(ms): 0.7340088182566117
Save result to 8.json
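profiling_lpot.py ships with the sample; a minimal sketch of how such a utility typically loads a frozen graph and measures latency and throughput is shown below. The tensor names, input shape and iteration count are illustrative assumptions, not the sample's exact code.

import time
import numpy as np
import tensorflow as tf

def load_frozen_graph(pb_file):
    # Read the serialized GraphDef and import it into a fresh graph
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_file, "rb") as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.compat.v1.import_graph_def(graph_def, name="")
    return graph

graph = load_frozen_graph("./fp32_frezon.pb")
# Inspect the graph to find the real input/output tensor names; these are placeholders
x = graph.get_tensor_by_name("x:0")
y = graph.get_tensor_by_name("Identity:0")

batch = np.random.rand(32, 28, 28).astype("float32")  # shape must match the model input
with tf.compat.v1.Session(graph=graph) as sess:
    sess.run(y, feed_dict={x: batch})                  # warm-up run
    start = time.time()
    for _ in range(100):
        sess.run(y, feed_dict={x: batch})
    elapsed = time.time() - start
    print("throughput (fps):", 100 * batch.shape[0] / elapsed)
    print("latency (ms):", elapsed / 100 * 1000)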
During our test run of the INC sample on the c5.12xlarge instance, we observed that the int8 model incurs almost 2X lower latency and achieves almost 3.4X higher throughput compared to the fp32 version. However, these latency and throughput numbers are not guaranteed and may vary depending on the software versions used and the hardware platform on which the sample is run.
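These ratios follow directly from the profiling output above:

# Speedup of the int8 model over fp32, taken from the profiling runs above
fp32_latency_ms, int8_latency_ms = 1.4294, 0.7340
fp32_fps, int8_fps = 1133.15, 3843.27
print(f"latency reduction:  {fp32_latency_ms / int8_latency_ms:.2f}x")  # ~1.95x
print(f"throughput speedup: {int8_fps / fp32_fps:.2f}x")                # ~3.39x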
In this article, we showed how to install the INC tool on an AWS instance, run a sample that converts an fp32 model into int8 using INC, run inference with both versions of the model under the same settings, and compare the latency and throughput obtained from the inference test.