Overview
Runtime settings can greatly affect the performance of TensorFlow* workloads running on CPUs, particularly settings that control threading and data layout.
Both OpenMP* and TensorFlow have settings that should be considered for their effect on performance. The Intel® oneAPI Deep Neural Network Library (oneDNN) within Intel® Optimization for TensorFlow* reads OpenMP settings from environment variables to tune performance on Intel CPUs. TensorFlow also exposes its own performance settings, through the ConfigProto class in 1.x or the tf.config module in 2.x.
Most of the recommendations work on both official x86-64 TensorFlow and Intel® Optimization for TensorFlow. Some, such as OpenMP tuning, apply only to Intel® Optimization for TensorFlow.
This guide describes how to set these runtime variables to optimize TensorFlow* for CPU.
OpenMP* settings descriptions
- OMP_NUM_THREADS
- Maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application.
- Recommend: start with the number of physical cores per socket on the test system, and try increasing and decreasing.
- KMP_BLOCKTIME
- Time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping.
- Recommend: start with 1 and try increasing.
- KMP_AFFINITY
- Restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer. The recommended value below is intended for systems with Hyper-Threading enabled.
- Recommend: granularity=fine,verbose,compact,1,0
- KMP_SETTINGS
- Enables (TRUE) or disables (FALSE) printing of OpenMP run-time library environment variables during execution.
- Recommend: Start with TRUE to ensure settings are being utilized, then use as needed.
How to apply OpenMP settings
These settings are applied as environment variables by two methods:
- Shell
- Example:
export OMP_NUM_THREADS=16
- Python code
- Example:
import os
os.environ["OMP_NUM_THREADS"] = "16"
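Putting these together, below is a minimal sketch that applies all four OpenMP settings from Python using the recommended starting values; the value 16 is a placeholder for the number of physical cores per socket on your own system. The variables must be set before TensorFlow is first imported so that the OpenMP runtime sees them at load time.
import os

# Set OpenMP variables before importing TensorFlow so the OpenMP
# runtime reads them at load time.
os.environ["OMP_NUM_THREADS"] = "16"  # placeholder: physical cores per socket
os.environ["KMP_BLOCKTIME"] = "1"     # start at 1 ms and try increasing
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["KMP_SETTINGS"] = "TRUE"   # print the effective settings at startup

import tensorflow as tf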
TensorFlow* settings
- intra_op_parallelism_threads
- Number of threads used within an individual op for parallelism.
- Recommend: start with the number of physical cores per socket on the test system, and try increasing and decreasing.
- inter_op_parallelism_threads
- Number of threads used for parallelism between independent operations.
- Recommend: start with the number of physical cores on the test system, and try increasing and decreasing.
- device_count
- Maximum number of devices (CPUs in this case) to use.
- Recommend: start with the number of physical cores per socket on the test system, and try increasing and decreasing.
- allow_soft_placement
- Set to True to allow TensorFlow to fall back to CPU when an operation has no GPU implementation or the requested device is unavailable.
How to apply TensorFlow settings
These settings are applied in Python* code using ConfigProto (TensorFlow 1.x) or the tf.config API (TensorFlow 2.x).
- Example in TensorFlow version 1.X:
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=16, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count = {'CPU': 16})
session = tf.Session(config=config)
- Example in TensorFlow 2.X:
import tensorflow as tf
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.set_soft_device_placement(True)
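Note that in TensorFlow 2.x the thread settings cannot be changed once TensorFlow has initialized its thread pools, so make these calls at program startup, before running any operations. As a sketch, the matching getter functions can confirm the values took effect:
import tensorflow as tf

tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.set_soft_device_placement(True)

# Read the values back (before running any ops) to confirm they are in effect.
print(tf.config.threading.get_inter_op_parallelism_threads())  # 2
print(tf.config.threading.get_intra_op_parallelism_threads())  # 16
print(tf.config.get_soft_device_placement())                   # True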
Intel® oneDNN enabling settings
TensorFlow* is highly optimized with the Intel® oneAPI Deep Neural Network Library (oneDNN) on CPU. Since v2.5, the oneDNN optimizations are available in both the official x86-64 TensorFlow binary and Intel® Optimization for TensorFlow*.
- For the official x86-64 TensorFlow v2.5-v2.8, users can enable the oneDNN optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1. Since v2.9, no environment setting is needed because oneDNN is the default DNN library in official x86-64 TensorFlow.
export TF_ENABLE_ONEDNN_OPTS=1
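As with the OpenMP variables, this can also be set from Python, provided it happens before TensorFlow is first imported; the sketch below assumes official x86-64 TensorFlow v2.5-v2.8, where the flag is still required. When the optimizations are active, recent TensorFlow versions print a startup message noting that oneDNN custom operations are on.
import os

# Must be set before the first `import tensorflow` to take effect.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf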
- Users can enable or disable use of the oneDNN blocked data format in TensorFlow with the TF_ENABLE_MKL_NATIVE_FORMAT environment variable. By exporting TF_ENABLE_MKL_NATIVE_FORMAT=0, TensorFlow will use the oneDNN blocked data format instead. Please check the oneDNN memory format documentation for more information about the oneDNN blocked data format.
We recommend enabling the native format with the command below to achieve good out-of-box performance.
export TF_ENABLE_MKL_NATIVE_FORMAT=1
Environment Variable | Default | Purpose
---|---|---
TF_ENABLE_ONEDNN_OPTS | True | Stock TensorFlow: enables/disables the oneDNN optimizations.
TF_ONEDNN_ASSUME_FROZEN_WEIGHTS | False | AreWeightsFrozen(): tells TensorFlow whether weights are frozen. Better performance is achieved with frozen graphs. Set for inference only. Related ops: forward conv, fused matmul.
TF_ONEDNN_USE_SYSTEM_ALLOCATOR | False | UseSystemAlloc(): tells oneDNN whether to use the system allocator instead of MklCPUAllocator. Set to true for better performance when allocations are small.
TF_MKL_ALLOC_MAX_BYTES | 64 | MklCPUAllocator: sets an upper bound on memory allocation. Unit: GB.
TF_MKL_OPTIMIZE_PRIMITIVE_MEMUSE | False | When true, reduces memory usage by limiting primitive caching; disabling primitive caching lowers memory usage but impacts performance. Set to false to enable primitive caching.
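Like the settings above, these variables are read at process startup, so they can be exported in the shell or set from Python before TensorFlow is imported. The combination below is illustrative only, sketching an inference scenario with a frozen graph and many small allocations, not a measured recommendation.
import os

# Illustrative values only; set before `import tensorflow`.
os.environ["TF_ONEDNN_ASSUME_FROZEN_WEIGHTS"] = "true"  # inference-only, frozen graph
os.environ["TF_ONEDNN_USE_SYSTEM_ALLOCATOR"] = "true"   # many small allocations
os.environ["TF_MKL_ALLOC_MAX_BYTES"] = "64"             # allocator cap, in GB

import tensorflow as tf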
References
- Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads
- Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs
- TensorFlow Guide: Optimizing for CPU
- TensorFlow ConfigProto for TensorFlow 1.x
- TensorFlow config for TensorFlow 2.x
Notices and Disclaimers
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.