Guide to TensorFlow* Runtime Optimizations for CPU

ID 764550
Updated 11/29/2022
Version Latest
Public

Overview

Runtime settings can greatly affect the performance of TensorFlow* workloads running on CPUs, particularly with regard to threading and data layout.

OpenMP* and TensorFlow both have settings that should be considered for their effect on performance. The Intel® oneAPI Deep Neural Network Library (oneDNN) within the Intel® Optimization for TensorFlow* reads OpenMP settings from environment variables to tune performance on Intel CPUs. TensorFlow itself exposes performance settings through ConfigProto (1.x) or the tf.config module (2.x).

Most of the recommendations work on both official x86-64 TensorFlow and Intel® Optimization for TensorFlow. Some recommendations, such as OpenMP tuning, only apply to Intel® Optimization for TensorFlow.

This guide describes how to set these runtime variables to optimize TensorFlow* for CPU.

OpenMP* settings descriptions

  • OMP_NUM_THREADS
    • Maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application.
    • Recommend: start with the number of physical cores on the test system, and try increasing and decreasing.
  • KMP_BLOCKTIME
    • Time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.
    • Recommend: start with 1 and try increasing.
  • KMP_AFFINITY
    • Binds OpenMP threads to physical processing units, restricting execution to a subset of the processors in a multiprocessor computer. The recommended value below assumes Hyper-Threading is enabled.
    • Recommend: granularity=fine,verbose,compact,1,0
  • KMP_SETTINGS
    • Enables (TRUE) or disables (FALSE) printing of OpenMP run-time library environment variables during execution.
    • Recommend: Start with TRUE to ensure settings are being utilized, then use as needed.

How to apply OpenMP settings

These settings are applied as environment variables by two methods: 

  • Shell
    • Example:
export OMP_NUM_THREADS=16
  • Python code
    • Example:
import os
os.environ["OMP_NUM_THREADS"] = "16"
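
Putting the recommendations above together, the sketch below sets all four variables from Python. The values are illustrative starting points rather than tuned results; because the OpenMP runtime reads these variables when TensorFlow loads, set them before importing TensorFlow.

import os
# Set OpenMP variables before importing TensorFlow so the runtime picks them up.
os.environ["OMP_NUM_THREADS"] = "16"  # start at the physical core count
os.environ["KMP_BLOCKTIME"] = "1"     # milliseconds
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["KMP_SETTINGS"] = "TRUE"   # print the settings at startup
import tensorflow as tf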

TensorFlow* settings

  • intra_op_parallelism_threads
    • Number of threads used within an individual op for parallelism.
    • Recommend: start with the number of physical cores on the test system, and try increasing and decreasing.
  • inter_op_parallelism_threads
    • Number of threads used for parallelism between independent operations.
    • Recommend: start with the number of sockets on the test system, and try increasing and decreasing.
  • device_count
    • Maximum number of devices (CPUs in this case) to use.
    • Recommend: start with the number of physical cores on the test system, and try increasing and decreasing.
  • allow_soft_placement
    • Set to True so that TensorFlow automatically falls back to CPU when an operation has no GPU implementation or the requested device is unavailable.

How to apply TensorFlow settings

These settings are applied in Python* code using ConfigProto (TensorFlow 1.x) or the tf.config module (TensorFlow 2.x).

  • Example in TensorFlow version 1.X:
import tensorflow as tf
# Configure threading, soft placement, and CPU device count for the session.
config = tf.ConfigProto(intra_op_parallelism_threads=16, inter_op_parallelism_threads=2,
                        allow_soft_placement=True, device_count={'CPU': 16})
session = tf.Session(config=config)
  • Example in TensorFlow 2.X:
import tensorflow as tf
# Values mirror the 1.x example above; tune for your hardware.
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.set_soft_device_placement(True)
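
In TensorFlow 2.x, the threading values can be read back as a quick sanity check. Note that the setters must be called before TensorFlow executes its first operation; once the runtime has initialized, attempts to change them raise a RuntimeError.

# Confirm the threading configuration took effect.
print(tf.config.threading.get_intra_op_parallelism_threads())  # 16
print(tf.config.threading.get_inter_op_parallelism_threads())  # 2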

Intel® oneDNN enabling settings

TensorFlow* is highly optimized with the Intel® oneAPI Deep Neural Network Library (oneDNN) on CPU. The oneDNN optimizations have been available in both the official x86-64 TensorFlow binary and the Intel® Optimization for TensorFlow* since v2.5.

  • For official x86-64 TensorFlow v2.5 through v2.8, users can enable the oneDNN optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1. Since v2.9, no environment setting is needed, as oneDNN is the default DNN library in official x86-64 TensorFlow.
export TF_ENABLE_ONEDNN_OPTS=1
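
The same variable can also be set from Python, as long as it happens before TensorFlow is imported, since it is read at library load time. A minimal sketch; on v2.9 and later, setting the variable to 0 in the same way disables oneDNN, which is useful for A/B performance comparisons.

import os
# Must be set before TensorFlow is imported to take effect.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"
import tensorflow as tf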

  • Users can enable or disable use of the oneDNN blocked data format in TensorFlow with the TF_ENABLE_MKL_NATIVE_FORMAT environment variable. By exporting TF_ENABLE_MKL_NATIVE_FORMAT=0, TensorFlow will use the oneDNN blocked data format instead. See the oneDNN memory format documentation for more information about the oneDNN blocked data format.

We recommend enabling the native format with the command below to achieve good out-of-the-box performance.

export TF_ENABLE_MKL_NATIVE_FORMAT=1

oneDNN-related environment variables used within TensorFlow 2.9:

  • TF_ENABLE_ONEDNN_OPTS
    • Default: True
    • Stock TensorFlow: enables or disables the oneDNN optimizations.
  • TF_ONEDNN_ASSUME_FROZEN_WEIGHTS
    • Default: False
    • AreWeightsFrozen(): tells TensorFlow whether weights are frozen. Better performance is achieved with frozen graphs. Set for inference only. Related ops: forward convolution, fused MatMul.
  • TF_ONEDNN_USE_SYSTEM_ALLOCATOR
    • Default: False
    • UseSystemAlloc(): tells oneDNN whether to use the system allocator instead of MklCPUAllocator. Set to true for better performance in the case of small allocations.
  • TF_MKL_ALLOC_MAX_BYTES
    • Default: 64
    • MklCPUAllocator: sets an upper bound on memory allocation. Unit: GB.
  • TF_MKL_OPTIMIZE_PRIMITIVE_MEMUSE
    • Default: False
    • Enables or disables primitive caching. Disabling primitive caching reduces memory usage but impacts performance; set to false to enable primitive caching.
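
As with the OpenMP variables, these can be set from Python before TensorFlow is imported. A minimal sketch with illustrative values only; whether they help depends on the workload, and the defaults are usually a reasonable starting point.

import os
# Illustrative values only; measure before and after changing them.
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "1"      # native format, per the section above
os.environ["TF_ONEDNN_USE_SYSTEM_ALLOCATOR"] = "1"   # may help with many small allocations
os.environ["TF_MKL_ALLOC_MAX_BYTES"] = "64"          # allocator upper bound, in GB
import tensorflow as tf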

Notices and Disclaimers

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.