Intel® Low Precision Optimization Tool

Key Takeaways

  • Learn how Intel technology accelerates low-precision inference for deep learning workloads on 2nd and 3rd Gen Intel Xeon Scalable processors.

  • Follow a step-by-step tutorial on using Intel Low Precision Optimization Tool to develop low-precision inference solutions quickly on Intel platforms.

Authors

Feng Tian, Haihao Shen, Jiong Gong, Huma Abidi

Deep neural networks (DNNs) show state-of-the-art (SOTA) accuracy in a wide range of computation tasks. However, they still face challenges in industrial deployment because of the high computational complexity of inference. Low precision is one of the key techniques being actively studied to address this problem. With hardware acceleration support, low-precision inference can compute more operations per second, reduce memory access pressure, make better use of the cache, and deliver higher throughput and lower latency.

This document introduces Intel® Low Precision Optimization Tool, which aims to help Intel customers deploy low-precision inference solutions easily and rapidly on multiple deep learning frameworks (TensorFlow, PyTorch, and MXNet).

Introduction

Intel® Low Precision Optimization Tool is an open-source Python library that helps users quickly deploy low-precision inference solutions on popular deep learning frameworks, including TensorFlow, PyTorch, and MXNet. The recently released Intel® Low Precision Optimization Tool v1.0 alpha features:

·       Built-in tuning strategies, including Basic, Bayesian, and MSE

·       Built-in evaluation metrics, including TopK (image classification), F1 (NLP), and CocoMAP (object detection)

·       Built-in tuning objectives, including Performance, Model Size, and Footprint

·       Extensible API design for adding new strategies, framework backends, metrics, and objectives

·       KL-divergence calibration for TensorFlow and MXNet

·       Resuming the tuning process from a checkpoint

Figure 1: Intel® Low Precision Optimization Tool architecture. For more details, please refer to the project on GitHub.

Intel® DL Boost

Intel® DL Boost is built into second-generation Intel® Xeon® Scalable processors. Based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), the Intel DL Boost Vector Neural Network Instructions (VNNI) deliver a significant performance improvement by combining three instructions into one, thereby maximizing the use of compute resources, making better use of the cache, and avoiding potential bandwidth bottlenecks.
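To make the "three instructions into one" point concrete, the sketch below models in plain Python what one 32-bit lane of the fused VNNI instruction (VPDPBUSD) computes: four unsigned 8-bit by signed 8-bit products are summed and added to a 32-bit accumulator. Without VNNI, the same result takes the AVX-512 sequence VPMADDUBSW, VPMADDWD, and VPADDD. The function name is illustrative only, and the model ignores 32-bit overflow of the accumulator.

```python
def vnni_dot_accumulate(acc, a_u8, b_s8):
    """Model one 32-bit lane of a VNNI-style fused multiply-accumulate.

    acc  : running integer accumulator (32-bit in hardware)
    a_u8 : four unsigned 8-bit values (0..255), e.g. activations
    b_s8 : four signed 8-bit values (-128..127), e.g. weights
    """
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    # Four 8-bit products are summed and added to the accumulator in one step.
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))


# Accumulate an 8-element INT8 dot product in two fused steps.
activations = [10, 20, 30, 40, 50, 60, 70, 80]
weights = [1, -2, 3, -4, 5, -6, 7, -8]
acc = 0
for i in range(0, len(activations), 4):
    acc = vnni_dot_accumulate(acc, activations[i:i + 4], weights[i:i + 4])
print(acc)
```

Each call stands for one fused instruction per lane, which is where the instruction-count saving comes from.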

Starting with second-generation Intel® Xeon® Scalable processors, Intel® DL Boost provides a theoretical peak speedup of 4x for INT8 inference compared with FP32 inference. Developers can use this tool to convert a trained FP32 model to an INT8 quantized model. When used for inference in place of the original FP32 model, the INT8 model benefits from Intel® DL Boost acceleration on second-generation Intel® Xeon® Scalable processors.
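As an illustration of what converting FP32 values to INT8 involves, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not the tool's actual implementation, and the function names are made up for this example.

```python
def quantize_int8(values):
    """Map FP32 values to INT8 codes using one symmetric per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return [x * scale for x in q]


fp32 = [0.02, -1.27, 0.64, 1.0]
q, scale = quantize_int8(fp32)
approx = dequantize_int8(q, scale)
# The round-trip error of each value is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(fp32, approx))
print(q, scale, max_err)
```

The INT8 codes are what the hardware multiplies; the scale is kept alongside them so results can be mapped back to real values.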

Easy quantization

Intel® Low Precision Optimization Tool provides an easy way to enable quantization from scratch. Given an FP32 model for deployment, a user can produce the quantized model in two steps:

·       Configure the yaml file. It defines the tuning configuration and model-specific information. Here is a sample yaml for TensorFlow MobileNetv1.0:

framework:
  - name: tensorflow
    inputs: input                              # tensorflow only
    outputs: MobilenetV1/Predictions/Reshape_1 # tensorflow only

tuning:
    metric:
      - topk: 1
    accuracy_criterion:
      - relative: 0.01
    timeout: 3600                              # tuning time(seconds)
    random_seed: 9527

With the above settings, the tool searches for the quantized model with the best inference performance whose accuracy loss relative to the FP32 baseline is within 1%, within a time budget of 3600 seconds.

·       Use Tuner.tune(). This API is the main entry point of automatic tuning and is defined as follows:

class Tuner(object):
    def tune(self, model, q_dataloader, q_func=None,
             eval_dataloader=None, eval_func=None, resume_file=None):
        ...

The Intel® Low Precision Optimization Tool v1.0a release supports two usages:

a)      The user specifies the FP32 "model", the calibration dataset "q_dataloader", the evaluation dataset "eval_dataloader", and the accuracy metric in the tuning.metric field of the yaml config file.

This usage is designed for seamless enablement of DL model tuning, leveraging the pre-defined accuracy metrics supported by the tool. We expect it to be the most common usage. It currently works well for most image classification models, and we are improving the tool to cover more workload categories.

b)      The user specifies the FP32 "model", the calibration dataset "q_dataloader", and a custom "eval_func" that encapsulates the evaluation dataset and accuracy metric itself.

This usage is designed to ease tuning enablement for models with custom metric evaluation or with metrics the tool does not yet support. It currently works for object detection and NLP networks.
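The second usage boils down to handing the tuner a callable that takes a model and returns a single accuracy number. A minimal sketch of such a callable, with the dataset and metric captured in a closure, could look like the following. Everything in it is a stand-in: `make_eval_func`, the toy dataset, and the toy models are hypothetical, and a real eval_func would run framework inference instead.

```python
def make_eval_func(samples, labels):
    """Build an eval_func that owns its evaluation data and metric."""
    def eval_func(model):
        # `model` is any callable mapping one sample to a predicted label.
        correct = sum(1 for x, y in zip(samples, labels) if model(x) == y)
        return correct / len(labels)
    return eval_func


# Toy data: classify integers as even (0) or odd (1).
samples = [1, 2, 3, 4, 5, 6]
labels = [1, 0, 1, 0, 1, 0]
eval_func = make_eval_func(samples, labels)


def perfect_model(x):
    return x % 2


def constant_model(x):
    return 0


print(eval_func(perfect_model), eval_func(constant_model))
```

The sketch assumes the returned value is a higher-is-better score, which matches how the Example 2 eval_func returns mAP.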

Example 1

Below are the steps to enable easy quantization for TensorFlow ResNet50 V1.5 using the first usage.

Prepare Yaml File

Copy examples/template.yaml to the work directory and keep the mandatory items. Here is the yaml file for TensorFlow ResNet50 V1.5:

framework:
  - name: tensorflow
    inputs: input_tensor
    outputs: softmax_tensor

tuning:
    metric:
      - topk: 1
    accuracy_criterion:
      - relative: 0.01
    timeout: 0
    random_seed: 9527

Here we choose the built-in topk metric and set the accuracy target to tolerate a 1% relative accuracy loss from the baseline. The default tuning strategy is basic. A timeout of 0 means tuning stops early, as soon as a tuning config meets the accuracy target.
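For reference, a top-k accuracy metric such as the built-in topk counts a prediction as correct when the true label is among the k highest-scoring classes. The following is a generic sketch of that computation in plain Python, not the tool's own metric class; the function name and the toy scores are made up for illustration.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k top scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],    # highest score: class 1
    [0.5, 0.3, 0.2],    # highest score: class 0
    [0.2, 0.3, 0.5],    # highest score: class 2
    [0.4, 0.35, 0.25],  # highest score: class 0
]
labels = [1, 0, 1, 2]
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(top1, top2)
```

With `topk: 1` in the yaml, the metric corresponds to the k=1 case.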

Code Changes:

1.      Import the ilit Python package

2.      Create a tuner object using the yaml file

3.      Invoke the tuner.tune() interface with the calibration dataloader

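import ilit

# self.args, Dataloader and RESNET_IMAGE_SIZE are assumed to be defined by
# the surrounding example script.
tuner = ilit.Tuner(self.args.config)
# The same dataloader is used for both calibration and evaluation.
dataloader = Dataloader(self.args.data_location, 'validation',
                        RESNET_IMAGE_SIZE, RESNET_IMAGE_SIZE,
                        self.args.batch_size,
                        num_cores=self.args.num_cores, resize_method='crop')
q_model = tuner.tune(self.args.input_graph, q_dataloader=dataloader,
                     eval_func=None, eval_dataloader=dataloader)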

Example 2

Below are the steps to enable easy quantization for SSD-ResNet50v1.0 using the second usage, in which the user-provided eval_func() performs the evaluation.

Prepare Yaml File

Here is the yaml file for MXNet SSD-ResNet50v1.0:

framework:
  - name: mxnet

tuning:
    accuracy_criterion:
      - relative: 0.01
    timeout: 0                                  # 0 means early stop
    random_seed: 9527

Here we set the accuracy target to tolerate a 1% relative accuracy loss from the baseline. The default tuning strategy is basic. A timeout of 0 means tuning stops early, as soon as a tuning config meets the accuracy target.
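The early-stop behaviour of `timeout: 0` can be pictured as a loop that tries candidate quantization configurations in order and returns the first one that meets the accuracy target. The sketch below is a simplified model of that idea, not the tool's basic strategy: the candidate names, the `evaluate` callback, and all accuracy numbers are hypothetical.

```python
def tune_with_early_stop(candidates, evaluate, fp32_acc, relative=0.01):
    """Return the first candidate whose accuracy meets the relative target."""
    # A relative tolerance of 1% means accuracy >= 99% of the FP32 baseline.
    target = fp32_acc * (1.0 - relative)
    tried = []
    for config in candidates:
        acc = evaluate(config)
        tried.append(config)
        if acc >= target:
            return config, acc, tried  # early stop
    return None, None, tried  # no candidate met the target


# Made-up accuracies for three hypothetical tuning configurations.
fake_results = {
    "all_int8": 0.280,
    "conv_int8_only": 0.295,
    "first_layer_fp32": 0.296,
}
best, best_acc, tried = tune_with_early_stop(
    list(fake_results), fake_results.get, fp32_acc=0.297)
print(best, best_acc, tried)
```

In this toy run the third configuration is never evaluated, because the second one already satisfies the target.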

Code Changes:

1.      Import the ilit Python package

2.      Create a tuner object using the yaml file

3.      Implement eval_func() as shown below

4.      Invoke the tuner.tune() interface with the calibration dataloader; here we reuse the existing validation dataloader as the calibration dataloader

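import mxnet as mx
import ilit

# args, net, get_dataset, get_dataloader and validate are assumed to come
# from the existing evaluation script.
def eval_func(graph):
    val_dataset, val_metric = get_dataset(args.dataset, args.data_shape)
    val_data = get_dataloader(
        val_dataset, args.data_shape, args.batch_size, args.num_workers)
    classes = val_dataset.classes  # class names
    size = len(val_dataset)
    ctx = [mx.cpu()]
    results = validate(graph, val_data, ctx, classes, size, val_metric)
    mAP = float(results[-1][-1])
    return mAP

# Build the calibration dataloader; it reuses the validation data.
val_dataset, _ = get_dataset(args.dataset, args.data_shape)
val_data = get_dataloader(
    val_dataset, args.data_shape, args.batch_size, args.num_workers)

tuner = ilit.Tuner("./ssd.yaml")
# With a custom eval_func, no eval_dataloader argument is needed.
quantized_model = tuner.tune(net, q_dataloader=val_data, eval_func=eval_func)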

Tuning Results

The Intel® Low Precision Optimization Tool v1.0 alpha release already supports 30 deep learning workloads, covering popular use cases including image classification, object detection, NLP, and recommendation systems. The tables below show the results for three Intel-optimized frameworks on CLX8280 with TSX disabled. For detailed steps to reproduce the results, please refer to this link.

Future Work

We plan to add more sophisticated tuning strategies and metrics to make accuracy-driven tuning more effective. We are also exploring quantization support for more backends.

Please try this tool if you want to deploy a low-precision solution quickly. You are also welcome to submit feature requests or issues to ilit.maintainers@intel.com.

MXNet v1.6.x

Model            | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup
ResNet50 V1      | mse             | 76.40%               | 76.80%                 | -0.52%                                    | 3.73x
MobileNet V1     | mse             | 71.60%               | 72.10%                 | -0.69%                                    | 3.02x
MobileNet V2     | mse             | 71.00%               | 71.10%                 | -0.14%                                    | 3.88x
SSD-ResNet50     | basic           | 29.50%               | 29.70%                 | -0.67%                                    | 1.86x
SqueezeNet V1    | mse             | 57.30%               | 57.20%                 | 0.18%                                     | 2.88x
ResNet18         | mse             | 70.50%               | 70.40%                 | 0.14%                                     | 2.98x
Inception V3     | mse             | 78.20%               | 78.00%                 | 0.26%                                     | 3.35x

TensorFlow v1.15.2

Model            | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup
ResNet50 V1      | mse             | 73.28%               | 73.54%                 | -0.35%                                    | 2.99x
ResNet50 V1.5    | bayesian        | 75.70%               | 76.26%                 | -0.73%                                    | 1.95x
ResNet101        | basic           | 76.68%               | 75.58%                 | 1.46%                                     | 3.03x
Inception V1     | basic           | 69.54%               | 69.48%                 | 0.09%                                     | 2.18x
Inception V2     | basic           | 74.32%               | 74.38%                 | -0.08%                                    | 1.69x
Inception V3     | basic           | 76.54%               | 76.90%                 | -0.47%                                    | 2.02x
Inception V4     | basic           | 79.74%               | 80.12%                 | -0.47%                                    | 3.40x
ssd_resnet50_v1  | basic           | 37.80%               | 38.01%                 | -0.55%                                    | 1.82x

PyTorch v1.5.0

Model            | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup
DLRM             | basic           | 80.21%               | 80.27%                 | -0.08%                                    | 1.87x
BERT-Large MRPC  | basic           | 87.90%               | 88.30%                 | -0.45%                                    | 2.38x
BERT-Large SQUAD | basic           | 92.15%               | 93.05%                 | -0.96%                                    | 1.42x
BERT-Large CoLA  | basic           | 62.10%               | 62.60%                 | -0.80%                                    | 1.76x
BERT-Base STS-B  | basic           | 88.50%               | 89.30%                 | -0.90%                                    | 3.05x
BERT-Base CoLA   | basic           | 58.30%               | 58.80%                 | -0.85%                                    | 3.01x
BERT-Base MRPC   | basic           | 88.30%               | 88.70%                 | -0.45%                                    | 2.34x
BERT-Base SST-2  | basic           | 90.90%               | 91.90%                 | -1.09%                                    | 1.64x
BERT-Base RTE    | basic           | 69.30%               | 69.70%                 | -0.57%                                    | 2.95x
BERT-Large RTE   | basic           | 72.90%               | 72.60%                 | 0.41%                                     | 2.38x
BERT-Large QNLI  | basic           | 91.00%               | 91.80%                 | -0.87%                                    | 2.25x
ResNet50 V1.5    | bayesian        | 75.60%               | 76.10%                 | -0.66%                                    | 2.76x
ResNet18         | bayesian        | 69.50%               | 69.80%                 | -0.43%                                    | 2.61x
ResNet101        | bayesian        | 77.00%               | 77.40%                 | -0.52%                                    | 2.64x
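The relative accuracy drop column follows the formula in its header, (INT8-FP32)/FP32, so a negative value means the INT8 model is less accurate than the FP32 baseline. As a worked check, the short sketch below recomputes the column for three rows of the results using the published, rounded accuracies; the helper name is made up for this example. Because the inputs are rounded, a recomputed value can differ from a published one in the last digit for some rows.

```python
def relative_accuracy_drop(int8_acc, fp32_acc):
    """Relative accuracy change in percent: (INT8 - FP32) / FP32 * 100."""
    return (int8_acc - fp32_acc) / fp32_acc * 100.0


# (framework, model, INT8 accuracy %, FP32 accuracy %) taken from the results.
rows = [
    ("MXNet", "ResNet50 V1", 76.40, 76.80),
    ("TensorFlow", "ResNet101", 76.68, 75.58),
    ("PyTorch", "BERT-Large RTE", 72.90, 72.60),
]
drops = {}
for framework, model, int8_acc, fp32_acc in rows:
    drops[(framework, model)] = relative_accuracy_drop(int8_acc, fp32_acc)
    print(f"{framework} {model}: {drops[(framework, model)]:+.2f}%")
```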

 

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit:  http://www.intel.com/performance.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.