Getting Started with Mixed Precision Support in MKL

Overview

Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory.

There are two options to enable BF16 mixed precision in Intel Optimized TensorFlow.

  1. AutoMixedPrecisionMkl grappler pass through low-level session configuration
  2. Keras mixed precision API

This guide describes how to enable BF16 mixed precision in Intel Optimized TensorFlow using these two options.

AutoMixedPrecisionMkl

Introduction

AutoMixedPrecisionMkl is a grappler pass that automatically converts a model written in the FP32 data type to operate in the BFloat16 data type. To be precise, it scans the data-flow graph corresponding to the model, looks for nodes in the graph (also called operators) that can operate in the BFloat16 type, and inserts Cast nodes that convert between FP32 and BFloat16 where appropriate. This feature is supported starting with TensorFlow* 2.3. We will demonstrate this with examples illustrating:

  • How to convert the graph to BFloat16 on-the-fly as you train the model

  • How to convert a pre-trained fp32 model to BFloat16

Let's consider a simple neural network consisting of a typical pattern: Conv2D with the addition of a bias, and the output of Conv2D clipped using ReLU. The inputs x and w of Conv2D and the input b of bias_add are TensorFlow* Variables. Notice that the default type of a Variable in TensorFlow* is FP32, so this neural network model operates completely in the FP32 data type. The TensorFlow* data-flow graph corresponding to this model looks like the one below.


Steps to set up the runtime environment

Before running the examples, build or install the latest TensorFlow* with BFloat16 support.

Follow the instructions below to create an environment and install TensorFlow*:

conda create -n tf_mkl_2 python=3.7 -c intel 
source activate tf_mkl_2 
(tf_mkl_2) pip install intel-tensorflow
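
The auto_mixed_precision_mkl option described in this guide requires TensorFlow* 2.3 or later, so it is worth confirming the installed version before continuing; for example:

(tf_mkl_2) python -c "import tensorflow as tf; print(tf.__version__)"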

 

Steps to Prepare an FP32 Model

Let's create a sample workload, conv2D_fp32.py, with Conv2D and ReLU layers.

import tensorflow as tf 
from tensorflow.core.protobuf import rewriter_config_pb2  

# Disable Eager execution mode 
tf.compat.v1.disable_eager_execution()  

def conv2d(x, w, b, strides=1): 
    # Conv2D wrapper, with bias and relu activation 
    x = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME') 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x)  

X = tf.Variable(tf.compat.v1.random_normal([784])) 
W = tf.Variable(tf.compat.v1.random_normal([5, 5, 1, 32])) 
B = tf.Variable(tf.compat.v1.random_normal([32])) 
x = tf.reshape(X, shape=[-1, 28, 28, 1])  

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto()) as sess: 
    sess.run(tf.compat.v1.global_variables_initializer()) 
    sess.run([conv2d(x, W, B)]) 
Run the script:

python conv2D_fp32.py

This runs a plain FP32 Conv2D graph.

Steps to Train a BFloat16 Model

Porting the above workload to the BFloat16 data type using AutoMixedPrecisionMkl requires adding the following lines of Python* code to the model:

graph_options=tf.compat.v1.GraphOptions( 
        rewrite_options=rewriter_config_pb2.RewriterConfig( 
            auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON)) 

In a nutshell, the AutoMixedPrecisionMkl grappler pass is controlled through the RewriterConfig proto of the GraphOptions proto. Possible values for auto_mixed_precision_mkl are rewriter_config_pb2.RewriterConfig.ON and rewriter_config_pb2.RewriterConfig.OFF.

The default value is rewriter_config_pb2.RewriterConfig.OFF.

Add the code snippet given above just before initializing a TensorFlow* session, and pass graph_options to the session configuration to enable the grappler pass.

Below is the complete code for the neural network model ported to BFloat16 type using AutoMixedPrecisionMkl.

import tensorflow as tf 
from tensorflow.core.protobuf import rewriter_config_pb2 

tf.compat.v1.disable_eager_execution()

def conv2d(x, w, b, strides=1): 
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME') 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x)  

X = tf.Variable(tf.compat.v1.random_normal([784])) 
W = tf.Variable(tf.compat.v1.random_normal([5, 5, 1, 32])) 
B = tf.Variable(tf.compat.v1.random_normal([32])) 
x = tf.reshape(X, shape=[-1, 28, 28, 1])  

graph_options=tf.compat.v1.GraphOptions( 
        rewrite_options=rewriter_config_pb2.RewriterConfig( 
            auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON))  

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto( 
        graph_options=graph_options)) as sess: 
    sess.run(tf.compat.v1.global_variables_initializer()) 
    sess.run([conv2d(x, W, B)]) 

Notice that the graph_options variable created by turning AutoMixedPrecisionMkl ON is passed to ConfigProto, which is eventually passed to the tf.compat.v1.Session API.

Create a Python* file, conv2D_bf16.py, with the above code and run it:

python conv2D_bf16.py 

Console output:

2020-06-12 08:44:07.306404: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.306987: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No allowlist ops found, nothing to do 
2020-06-12 08:44:07.309752: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.309866: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No allowlist ops found, nothing to do 
2020-06-12 08:44:07.326082: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.326278: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 2/11 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 
2020-06-12 08:44:07.327683: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.327814: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/14 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 

The data-flow graph after porting the model to BFloat16 type looks like below.


Notice that two operators, Conv2D+BiasAdd and ReLU, are automatically converted to operate in the BFloat16 type. Also note that appropriate Cast nodes are inserted in the graph to convert TensorFlow* tensors from FP32 to BFloat16 and vice versa.
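
If you prefer to verify the conversion programmatically rather than from the console log, a small helper like the sketch below can list the BFloat16 nodes and any inserted Cast nodes in an optimized graph_def (for example, the output_graph produced by the gen_bf16_pb.py script later in this guide). The helper name count_bf16_nodes is illustrative and is not part of TensorFlow*.

from tensorflow.core.framework import types_pb2

def count_bf16_nodes(graph_def):
    # Nodes whose "T" attribute is DT_BFLOAT16 now operate in BFloat16
    bf16_nodes = [n.name for n in graph_def.node
                  if "T" in n.attr and n.attr["T"].type == types_pb2.DT_BFLOAT16]
    # Cast nodes inserted by the pass convert between FP32 and BFloat16
    cast_nodes = [n.name for n in graph_def.node if n.op == "Cast"]
    print("BFloat16 nodes:", bf16_nodes)
    print("Cast nodes:", cast_nodes)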

 

Steps to Convert a Pretrained FP32 Model to BFloat16

In the previous section, we saw how AutoMixedPrecisionMkl can automatically convert certain nodes to BFloat16 while running a sample model. This section covers how to convert a pretrained FP32 model to BFloat16.

1. Modify the conv2D_fp32.py to save the trained model

import tensorflow as tf 
import tensorflow.python.saved_model 
from tensorflow.python.saved_model import tag_constants 
from tensorflow.python.saved_model.signature_def_utils_impl import predict_signature_def 

# Disable Eager execution mode 
tf.compat.v1.disable_eager_execution() 

def conv2d(x, w, b, strides=1): 
    # Conv2D wrapper, with bias and relu activation 
    x = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME',name="myInput") 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x, name="myOutput")  

X = tf.Variable(tf.compat.v1.random_normal([784], name="myInput")) 
W = tf.Variable(tf.compat.v1.random_normal([5, 5, 1, 32])) 
B = tf.Variable(tf.compat.v1.random_normal([32])) 
x = tf.reshape(X, shape=[-1, 28, 28, 1]) 

export_dir='./model'  

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto()) as sess: 
    sess.run(tf.compat.v1.global_variables_initializer()) 
    y=conv2d(x,W,B) 
    sess.run([y]) 
    builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_dir) 
    signature = predict_signature_def(inputs={'myInput': X}, 
                                  outputs={'myOutput': y}) 
    builder.add_meta_graph_and_variables(sess=sess, 
                                     tags=["myTag"], 
                                     signature_def_map={'predict': signature}) 
    builder.save() 

2. Run the Python* script and check that saved_model.pb is stored in the ./model directory

python conv2D_fp32.py
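
Optionally, the saved tag and signature can be inspected with the saved_model_cli tool that ships with TensorFlow*; the flags below assume the myTag tag and predict signature used in the script above.

saved_model_cli show --dir ./model --tag_set myTag --signature_def predict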

3. Run AutoMixedPrecisionMkl on the Saved Model

3.1 Create gen_bf16_pb.py Python* Script

Initialize the graph_options with AutoMixedPrecisionMkl, load and convert the saved fp32 model.

from argparse import ArgumentParser 
from tensorflow.core.protobuf import config_pb2 
from tensorflow.core.protobuf import rewriter_config_pb2 
from tensorflow.python.grappler import tf_optimizer 
from tensorflow.python.tools import saved_model_utils 
import tensorflow as tf 
import time   

parser = ArgumentParser() 
parser.add_argument("input_dir", help="Input directory containing saved_model.pb.", type=str) 
parser.add_argument("as_text", help="Output graph in text protobuf format." 
                    "If False, would dump in binary format", type=bool) 
parser.add_argument("output_dir", help="Directory to store output graph.", type=str) 
parser.add_argument("output_graph", help="Output graph name. (e.g., foo.pb," 
                    "foo.pbtxt, etc)", type=str) 
args = parser.parse_args()   

graph_options = tf.compat.v1.GraphOptions(rewrite_options=rewriter_config_pb2.RewriterConfig(
                               auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON)) 
optimizer_config = tf.compat.v1.ConfigProto(graph_options=graph_options) 
metagraph_def = saved_model_utils.get_meta_graph_def(args.input_dir, "myTag") 
output_graph = tf_optimizer.OptimizeGraph(optimizer_config, metagraph_def) 
tf.io.write_graph(output_graph, args.output_dir, args.output_graph, 
                  as_text=args.as_text) 
3.2 Run the Conversion Script

python gen_bf16_pb.py ./model True ./model saved_model_bf16.pbtxt

Console output:

2020-06-15 11:45:19.612781: I tensorflow/core/grappler/devices.cc:78] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support) 
2020-06-15 11:45:19.612978: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session 
2020-06-15 11:45:19.663030: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3300000000 Hz 
2020-06-15 11:45:19.675988: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5592f0fe00f0 initialized for platform Host (this does not guarantee that XLA will be used). devices: 
2020-06-15 11:45:19.676016: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-15 11:45:19.696569: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 2/59 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 
2020-06-15 11:45:19.700334: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/49 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 

Again, notice that two operators are converted to the BFloat16 type. The script also supports dumping the model to a protobuf binary file (.pb) or a text file (.pbtxt).
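
As a quick sanity check, the text-format graph written above can be loaded back into a GraphDef; this minimal sketch assumes the output path ./model/saved_model_bf16.pbtxt from the command above, and the same BFloat16/Cast inspection sketched earlier can then be applied to it.

from google.protobuf import text_format
from tensorflow.core.framework import graph_pb2

# Load the text-format GraphDef produced by gen_bf16_pb.py
with open("./model/saved_model_bf16.pbtxt") as f:
    graph_def = text_format.Parse(f.read(), graph_pb2.GraphDef())
print(len(graph_def.node), "nodes in the converted graph")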

Controlling the Operators that will be Ported to BFloat16 type Automatically

We provide some more details about AutoMixedPrecisionMkl for interested readers. An important point to note is that not all of TensorFlow's* operators for the CPU backend support the BFloat16 type. This could be because the support is missing (and is a work in progress), or because the BFloat16 version of an operator may not offer much performance improvement over the FP32 version.

Furthermore, using the BFloat16 type for certain operators could lead to numerical instability of a neural network model. So we categorize the TensorFlow* operators supported by the MKL backend in BFloat16 type into three groups: 1) operators that are always numerically stable, 2) operators that are always numerically unstable, and 3) operators whose stability depends on the context. The AutoMixedPrecisionMkl pass uses Allow, Deny, and Infer lists of operators, respectively, to capture these groups. The exact lists can be found in the auto_mixed_precision_lists.h file in the TensorFlow* GitHub repository.

Note that the default contents of these lists already capture the most common BFloat16 usage models while also ensuring numerical stability of the model. There is, however, a way to add or remove operators from any of these lists by setting environment variables that control them. For instance, executing

export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD=Conv2D 

before running the model would add the Conv2D operator to the Deny list, while executing

export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_ALLOWLIST_REMOVE=Conv2D 

before running the model would remove Conv2D from the Allow list. Executing both of these commands before running a model would move Conv2D from the Allow list to the Deny list.

In general, the template corresponding to the names of the environment variables controlling these lists is:

TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_${LIST}_${OP}=operator

where ${LIST} is one of {ALLOW, DENY, INFER} and ${OP} is one of {ADD, REMOVE}.
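
If exporting shell variables is inconvenient, the same variables can presumably also be set from inside the Python* script, as long as this happens before the session is created (and hence before the grappler pass runs); a minimal sketch:

import os

# Set the list-control variables before the tf.compat.v1.Session is created,
# so the grappler pass sees them when it optimizes the graph.
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD"] = "Conv2D"
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_ALLOWLIST_REMOVE"] = "Conv2D"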

To test this feature of adding an op to the Deny list and removing it from the Allow list, run the code sample conv2D_bf16.py with the environment variables set for the Conv2D op:

export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD=Conv2D 
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_ALLOWLIST_REMOVE=Conv2D  
python conv2D_bf16.py 

As you can see by comparing this console output with the earlier conv2D_bf16.py run, the Conv2D op has been removed from the Allow list and added to the Deny list, which prevents the node from being converted to its BFloat16 equivalent.

2020-06-16 17:13:57.655035: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3300000000 Hz 
2020-06-16 17:13:57.668690: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f592d69360 initialized for platform Host (this does not guarantee that XLA will be used). devices: 
2020-06-16 17:13:57.668723: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version 
2020-06-16 17:13:57.668818: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance. 
2020-06-16 17:13:57.674947: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-16 17:13:57.675318: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No allowlist ops found, nothing to do 
2020-06-16 17:13:57.676833: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-16 17:13:57.676945: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No allowlist ops found, nothing to do 
2020-06-16 17:13:57.688769: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-16 17:13:57.688879: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No allowlist ops found, nothing to do 
2020-06-16 17:13:57.689922: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer
2020-06-16 17:13:57.690025: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No allowlist ops found, nothing to do

 

 

Keras Mixed Precision API

Introduction

This section describes how to use the experimental Keras mixed precision API to speed up your models, using a Simple MNIST convnet as the example.

Starting from TensorFlow 2.4, Intel Optimized TensorFlow supports the experimental Keras mixed precision API.

Steps to Prepare an FP32 Model

By using an existing sample, users can easily understand how to change their own Keras* model.

First, get the sample by using the command below:

wget https://raw.githubusercontent.com/keras-team/keras-io/master/examples/vision/mnist_convnet.py

We don't need to train the model for 15 epochs, so change the epochs value from 15 to 2 around line 67 of mnist_convnet.py:

epochs = 2

Then, use the command below to train the convnet model.

python mnist_convnet.py

Steps to Train a BFloat16 Model

To port the high-level Keras* API convnet sample, users only need to add the three lines below around line 16 of mnist_convnet.py.

The model will use bfloat16 as its compute data type.

from tensorflow.keras import mixed_precision
policy = mixed_precision.Policy('mixed_bfloat16')
mixed_precision.set_global_policy(policy)
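
As a quick check that the policy is in effect, you can inspect the compute and variable dtypes of a newly created layer; with the mixed_bfloat16 policy, the layer computes in bfloat16 while its variables stay in float32. The Dense layer below is only an illustration and is not part of the convnet sample.

from tensorflow.keras import layers, mixed_precision

mixed_precision.set_global_policy('mixed_bfloat16')
layer = layers.Dense(10)
print(layer.compute_dtype)  # bfloat16: computations run in BFloat16
print(layer.dtype)          # float32: variables are kept in FP32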

Users can also check the oneDNN verbose log to verify that the modified mnist_convnet.py indeed uses the bfloat16 data type for computation.

By running the command below, users will get a oneDNN verbose log file, dnnl_log.csv:

DNNL_VERBOSE=1 python mnist_convnet.py > dnnl_log.csv

By following the Analyze Verbose Logs section in this tutorial, users can parse dnnl_log.csv and get a oneDNN JIT kernel breakdown.

Here is the result from the modified mnist_convnet.py. As the breakdown below shows, the avx512_core_bf16 JIT kernel was indeed used, and it took around 82% of the total elapsed time.

jit kernel                   elapsed time (ms)
simple:any                      232.060092
jit:avx512_common              1763.848252
gemm:jit                       9761.591859
jit:uni                       32606.045378
jit:avx512_core_bf16          41308.567345
jit_bf16:avx512_core_bf16    394847.022056
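
For reference, a breakdown like the one above can be produced with a small script along the lines of the sketch below. It assumes the usual oneDNN verbose format in which each exec line is comma-separated, the kernel implementation is the fifth field, and the elapsed time in milliseconds is the last field; field positions can vary between oneDNN versions, so treat it as a starting point rather than a drop-in tool.

import collections

# Sum oneDNN execution time (ms) per JIT kernel implementation
totals = collections.defaultdict(float)
with open("dnnl_log.csv") as f:
    for line in f:
        fields = line.strip().split(",")
        if len(fields) > 5 and fields[0] == "dnnl_verbose" and fields[1] == "exec":
            kernel = fields[4]                    # e.g. jit_bf16:avx512_core_bf16
            totals[kernel] += float(fields[-1])   # elapsed time in ms

for kernel, elapsed in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{kernel:28s}{elapsed:>16.6f}")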

 
