Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs

Published: 06/01/2018  

Last Updated: 06/01/2018

Introduction

The purpose of this document is to help developers speed up the execution of the programs that use popular deep learning frameworks in the background. There are situations where we have observed that the deep learning code, with default settings, does not take advantage of the full compute capability of the underlying machine on which it runs. This is often the case, especially when the code runs on Intel® Xeon® processors.

Optimization

The primary goal of the performance optimization tips given in this section is to make use of all the cores available in the machine. Intel® DevCloud consists of Intel® Xeon® Gold 6128 processors.

Assume that the number of cores per socket in the machine is denoted as NUM_PARALLEL_EXEC_UNITS. On the Intel DevCloud, assign NUM_PARALLEL_EXEC_UNITS to 6.

TensorFlow*

To get the best performance from a machine, change the parallelism threads and OpenMP* settings as below:

import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count = {'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

os.environ["OMP_NUM_THREADS"] = "NUM_PARALLEL_EXEC_UNITS"

os.environ["KMP_BLOCKTIME"] = "30"

os.environ["KMP_SETTINGS"] = "1"

os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"

Keras with TensorFlow Backend

To get the best performance from a machine, change the parallelism threads and OpenMP settings as below:

from keras import backend as K

import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count = {'CPU': NUM_PARALLEL_EXEC_UNITS })

session = tf.Session(config=config)

K.set_session(session)

os.environ["OMP_NUM_THREADS"] = "NUM_PARALLEL_EXEC_UNITS"

os.environ["KMP_BLOCKTIME"] = "30"

os.environ["KMP_SETTINGS"] = "1"

os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"

Caffe*

To get the best performance from the underlying machine, change the OpenMP settings as below:

export OMP_NUM_THREADS= NUM_PARALLEL_EXEC_UNITS

export KMP_AFFINITY= granularity=fine,verbose,compact,1,0

In general:

export OMP_NUM_THREADS= <number of threads to use>

export KMP_AFFINITY= <your affinity settings of choice>

For example:

KMP_AFFINITY=granularity=fine,balanced

KMP_AFFINITY=granularity=fine,compact

Conclusion

Even though we have observed a speed up in most cases, please note that the performance is largely code-dependent and there can be multiple other reasons that affect the code performance. A good code profiling tool like the Intel® VTune™ Amplifier can help you dig deeper and analyze performance problems.

Author

Anju Paul is a Technical Solutions Engineer working on behalf of the Intel® AI Academy.

 

 

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.