Guide to TensorFlow* Runtime optimizations for CPU

Published: 07/30/2020  

Last Updated: 07/30/2020


Runtime settings can greatly affect the performance of TensorFlow* workloads running on CPUs, particularly regarding threading.

OpenMP* and TensorFlow both have settings that should be considered for their effect on performance. The Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) within the Intel® Optimization for TensorFlow* uses OpenMP settings as environment variables to affect performance on Intel CPUs. TensorFlow has a class (ConfigProto or config, depending on version) with settings that affect performance.

This guide describes these settings, their usage, and how to apply them.

OpenMP* settings descriptions

  • OMP_NUM_THREADS
    • Maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application
    • Recommend: start with the number of physical cores/socket on the test system, then try increasing and decreasing
  • KMP_BLOCKTIME
    • Time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping
    • Recommend: start with 1 and try increasing
  • KMP_AFFINITY
    • Restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer; only valid if hyperthreading is enabled
    • Recommend: granularity=fine,verbose,compact,1,0
  • KMP_SETTINGS
    • Enables (TRUE) or disables (FALSE) printing of OpenMP run-time library environment variables during execution
    • Recommend: start with TRUE to confirm the settings are being applied, then use as needed

How to apply OpenMP settings

These settings are applied as environment variables.

  • Can be set in the shell
    • Example:
export OMP_NUM_THREADS=16
  • Can be set in Python code
    • Example:
import os
os.environ["OMP_NUM_THREADS"] = "16"
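The same approach extends to all of the OpenMP variables described above. A minimal sketch, assuming the recommended starting values from this guide (the numbers are illustrative, not universal); note that these must be set before TensorFlow is imported so the OpenMP runtime picks them up:

```python
import os

# Illustrative starting values; tune for your own system.
# Set these before "import tensorflow" so the OpenMP runtime reads them.
os.environ["OMP_NUM_THREADS"] = "16"  # start with physical cores/socket
os.environ["KMP_BLOCKTIME"] = "1"     # ms a thread spins before sleeping
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"  # thread pinning
os.environ["KMP_SETTINGS"] = "TRUE"   # print OpenMP settings at start-up
```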

TensorFlow* settings

  • intra_op_parallelism_threads
    • Number of threads used within an individual op for parallelism
    • Recommend: start with the number of cores/socket on the test system, and try increasing and decreasing
  • inter_op_parallelism_threads
    • Number of threads used for parallelism between independent operations.
    • Recommend: start with the number of physical cores on the test system, and try increasing and decreasing
  • device_count
    • Maximum number of devices (CPUs in this case) to use
    • Recommend: start with the number of cores/socket on the test system, and try increasing and decreasing
  • allow_soft_placement
    • Set to True to allow TensorFlow to fall back to the CPU when an operation has no GPU implementation
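The starting points above can be bundled into a small helper. This is a sketch: `starting_thread_settings` and its parameter are hypothetical names, the inter-op value of 2 mirrors the example later in this guide, and the logical core count from `os.cpu_count()` is used as a rough stand-in when the physical count is unknown (the standard library reports only logical CPUs):

```python
import os

def starting_thread_settings(physical_cores_per_socket=None):
    """Derive illustrative starting values for TensorFlow's threading settings.

    Pass the physical cores/socket if you know it; otherwise the logical
    CPU count is used as a rough stand-in.
    """
    cores = physical_cores_per_socket or os.cpu_count() or 1
    return {
        "intra_op_parallelism_threads": cores,  # parallelism within one op
        "inter_op_parallelism_threads": 2,      # parallelism between independent ops
        "device_count": {"CPU": cores},         # upper bound on CPU devices
        "allow_soft_placement": True,           # fall back to CPU when needed
    }

settings = starting_thread_settings(physical_cores_per_socket=16)
print(settings["intra_op_parallelism_threads"])  # 16
```

From here, try increasing and decreasing each value while measuring, as recommended above.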

How to apply TensorFlow settings

These settings are applied in Python* code through the ConfigProto class (TensorFlow 1.x) or the tf.config module (TensorFlow 2.x).

  • Example in TensorFlow version 1.X:
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=16,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': 16})
session = tf.Session(config=config)
  • Example in TensorFlow 2.X:
import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.set_soft_device_placement(True)


Notices and Disclaimers

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at