Using Intel® Xeon® processors for Multi-node Scaling of TensorFlow* with Horovod*

Published: 06/27/2018  

Last Updated: 06/27/2018

By Mohammad Ashraf Bhuiyan, Wei Wang, and Mahmoud Abuzaina

TensorFlow* is one of the leading deep learning (DL) and machine learning frameworks today. In 2017, Intel worked with Google* to incorporate optimizations for Intel® Xeon® processor-based platforms using Intel® Math Kernel Library (Intel® MKL) [4]. Optimizations such as these, applied across multiple popular frameworks, have led to orders-of-magnitude performance improvements: up to 127 times [2] higher performance for training and up to 198 times [1] higher performance for inference. For TensorFlow, Intel has updated the optimizations and performance results for a number of DL models running on the Intel® Xeon® Scalable processor [2, 3].

Intel has mainly reported Intel® Optimization for TensorFlow performance improvements on single nodes [2, 3]. However, some complex DL models train more efficiently in multi-node configurations: either they do not fit in one machine, or their time to train can be reduced significantly by training on a cluster of machines. Therefore, Intel has also performed scaling studies on multi-node clusters of Intel Xeon Scalable processors. This article describes distributed training performance on a cluster of Intel® Xeon® platforms using a Horovod*-based configuration option for the TensorFlow framework.

Horovod, which was developed by Uber*, uses the Message Passing Interface (MPI) as its main communication mechanism. It relies on MPI collective operations such as allgather and allreduce to handle cross-replica communication and weight updates. OpenMPI* can be used with Horovod to provide these operations. Horovod is installed as a separate Python* package. By calling Horovod's API from the DL model script, a regular build of TensorFlow can be used to run distributed training. With Horovod, no additional source code change is required in TensorFlow to support distributed training with MPI.
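To make the two collective operations concrete, here is a minimal sketch in plain Python that simulates them over a list of per-replica vectors. It does not call MPI or Horovod; the function names allreduce_mean and allgather are illustrative only, and averaging is assumed as the reduction because that is the usual choice for gradients in data-parallel training.

```python
from typing import List

Vector = List[float]


def allreduce_mean(per_rank: List[Vector]) -> List[Vector]:
    """Simulate allreduce with averaging.

    Every rank ends up with the same element-wise mean of all ranks' vectors.
    """
    world_size = len(per_rank)
    length = len(per_rank[0])
    mean = [sum(vec[i] for vec in per_rank) / world_size for i in range(length)]
    # Each rank receives its own copy of the reduced result.
    return [list(mean) for _ in range(world_size)]


def allgather(per_rank: List[Vector]) -> List[Vector]:
    """Simulate allgather.

    Every rank ends up with the concatenation of all ranks' vectors,
    in rank order.
    """
    gathered = [x for vec in per_rank for x in vec]
    return [list(gathered) for _ in per_rank]


# Hypothetical gradients computed by three replicas on different mini-batches.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = allreduce_mean(grads)
```

After the call, every simulated replica holds the same averaged gradient, which is what allows all replicas to apply an identical weight update.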

Scaling Results Using Uber Horovod* with TensorFlow* 1.7

In this section, we show the performance of Intel Xeon processor optimizations for TensorFlow 1.7 with ResNet-50* and Inception-v3* training, running on up to 64 nodes containing Intel® Xeon® Gold processors. A real training dataset was used for these runs. As shown in the charts below, by running one MPI process per node, ResNet-50 maintained at least 89.1 percent scalability for up to 64 nodes, while Inception-v3 achieved at least 89.4 percent [5]. With this higher throughput, time to train for ResNet-50 and Inception-v3 is reduced significantly. Although this study shows scaling for up to 64 nodes, the same scalability rate is expected to carry over to 128 nodes.

Performance Scaling for ResNet 50 and InceptionV3
Figure 1. Up to 89 percent scaling efficiency (ResNet-50* and Inception-v3*) can be achieved with TensorFlow* 1.7 on 64 nodes of Intel® Xeon® Gold processors using one MPI process/node.

The same models can also be run with two MPI processes on each node. As shown in the table below, this gives up to 17 percent and 24 percent performance improvements for ResNet-50 and Inception-v3, respectively [5], with no extra hardware cost. Note that the batch size per node remains the same as the one used for running one MPI process per node.

Model         | Batch Size per Node | Gain of TensorFlow* with Horovod* versus without Horovod on Two Sockets
ResNet-50*    | 128                 | 17%
Inception-v3* | 128                 | 24%

Thus, by running two MPI processes per node as shown in the two graphs below, ResNet-50 maintained at least 94.1 percent scalability for up to 64 nodes, while Inception-v3 achieved at least 87.4 percent [5]. With this higher throughput, time to train for ResNet-50 and Inception-v3 is reduced significantly, even more than with one MPI process per node.

Performance Scaling for ResNet 50 and InceptionV3
Figure 2. Up to 94 percent scaling (parallel efficiency) can be achieved with TensorFlow* 1.7 on 64 nodes of Intel® Xeon® Gold processors, using two MPI processes/node.
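The scaling percentages in this section can be turned into an implied speedup with a little arithmetic. The sketch below assumes the usual definition of parallel efficiency (throughput on N nodes divided by N times the single-node throughput) and assumes each configuration is compared against a single node running the same number of MPI processes. The article does not spell out its formula or baseline, so these values are illustrative and may not match speedup figures quoted elsewhere in the text. The function name is hypothetical.

```python
def speedup_from_efficiency(nodes: int, efficiency_percent: float) -> float:
    """Speedup over one node implied by a parallel-efficiency percentage."""
    return nodes * efficiency_percent / 100.0


# "At least" efficiencies quoted in this section for 64 nodes,
# keyed by (model, MPI processes per node).
quoted = {
    ("ResNet-50", 1): 89.1,
    ("Inception-v3", 1): 89.4,
    ("ResNet-50", 2): 94.1,
    ("Inception-v3", 2): 87.4,
}

for (model, procs_per_node), eff in quoted.items():
    s = speedup_from_efficiency(64, eff)
    print(f"{model}, {procs_per_node} MPI process(es)/node: >= {s:.1f}x")
```

Because the quoted efficiencies are lower bounds, the printed speedups are lower bounds as well.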

Gathering and Installing Relevant Software Tools

1. OpenMPI can be installed via Yellowdog Updater, Modified* (YUM) on recent versions of CentOS*. Some existing clusters already have OpenMPI available. In this article, we use OpenMPI 3.0.0, which can be installed by following the instructions at Open MPI: Version 3.0.

2. A recent GNU Compiler Collection* (GCC) version is needed; GCC version 6.2 or newer is recommended. See GCC, the GNU Compiler Collection for the latest installation instructions.

3. Python versions 2.7 and 3.6 have both been tested.

4. Uber Horovod supports running TensorFlow in a distributed fashion. Horovod can be installed as a standalone Python package (for example, horovod-0.11.3) as follows:

pip install --no-cache-dir horovod

Alternatively, Horovod can be installed from source.

5. The TensorFlow benchmarks were recently modified to use Horovod. You can obtain the benchmark code from GitHub* and run tf_cnn_benchmarks.py as explained below.

$ git clone https://github.com/tensorflow/benchmarks
$ cd benchmarks/scripts/tf_cnn_benchmarks
$ python tf_cnn_benchmarks.py

Running TensorFlow Benchmark Using Horovod with TensorFlow

Here, we discuss the commands needed to run distributed TensorFlow using the Horovod framework. For the hardware platform, we use a cluster of dual-socket Intel® Xeon® Gold 6148 processor-based systems. For networking, 10 Gb Ethernet is used. Mellanox InfiniBand* or Cornelis Networks interconnects can also be used for networking the cluster.

Running two MPI processes on a single node:

export LD_LIBRARY_PATH=<path to OpenMP lib>:$LD_LIBRARY_PATH
export PATH=<path to OpenMPI bin>:$PATH
export inter_op=2
export intra_op=18   # set according to the number of cores per socket
export batch_size=64
export MODEL=resnet50   # or inception3
export python_script=<path to tf_cnn_benchmarks.py script>

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by socket --oversubscribe --report-bindings -n 2 python $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

For one MPI process per node, the configuration is as follows. The other environment variables will be the same:

export intra_op=38
export batch_size=128 

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS --bind-to none --report-bindings  -n 1 python  $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

Note that to train models to good accuracy, you should use the configuration flag --distortions=True. Other hyper-parameters may also need to be adjusted.
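The exported values in the two single-node configurations above follow a pattern that can be reproduced with a short calculation. The sketch below is an inference from the numbers in this article (intra_op of 18 and 38 on a node with two 20-core sockets, and inter_op of 2), not a documented formula: it assumes the cores of a node are split evenly among the MPI processes, that the inter-op threads are subtracted from each process's core budget, and that the per-node batch size of 128 is divided among the processes. The function name and dictionary keys are hypothetical.

```python
def per_process_settings(cores_per_socket: int, sockets: int,
                         procs_per_node: int, batch_per_node: int,
                         inter_op: int = 2) -> dict:
    """Derive per-process thread and batch settings for one node."""
    total_cores = cores_per_socket * sockets
    if total_cores % procs_per_node or batch_per_node % procs_per_node:
        raise ValueError("cores and batch size must divide evenly among processes")
    cores_per_proc = total_cores // procs_per_node
    return {
        # Cores reserved for each MPI process (compare with -cpus-per-proc).
        "cpus_per_proc": cores_per_proc,
        # Assumption: inter-op threads are taken out of the core budget.
        "intra_op": cores_per_proc - inter_op,
        "inter_op": inter_op,
        # The batch size per node stays fixed, so each process gets a share.
        "batch_size": batch_per_node // procs_per_node,
    }


two_procs = per_process_settings(20, 2, procs_per_node=2, batch_per_node=128)
one_proc = per_process_settings(20, 2, procs_per_node=1, batch_per_node=128)
```

With these inputs, the two-process case yields intra_op 18 and batch_size 64, and the one-process case yields intra_op 38 and batch_size 128, matching the values exported above. On other hardware the best thread counts may differ and should be tuned experimentally.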

To run a model on a multi-node cluster, use a script similar to the one above. For example, to run on a 64-node cluster (two MPI processes per node), where each node is based on Intel Xeon Gold 6148 processors, the distributed training can be launched as shown below. All the exported variables are the same as above:

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by node  --report-bindings -hostfile host_names  -n 128 python  $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

Here, the host_names file is the list of hosts that you want to run the workload on.
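As an illustration, the hostfile and the process count for the command above could be generated with a few lines of Python. This is a sketch under stated assumptions: the host names node001 through node064 are hypothetical placeholders, and the "hostname slots=N" line format is the Open MPI hostfile syntax, used here so that each node accepts two processes. The helper names are illustrative.

```python
from typing import Iterable


def build_hostfile(hosts: Iterable[str], slots_per_host: int) -> str:
    """Return Open MPI hostfile text: one 'hostname slots=N' line per host."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def total_processes(num_nodes: int, procs_per_node: int) -> int:
    """Value to pass to mpirun's -n option."""
    return num_nodes * procs_per_node


# Hypothetical host names; replace with the real names of your cluster nodes.
hosts = [f"node{i:03d}" for i in range(1, 65)]
hostfile_text = build_hostfile(hosts, slots_per_host=2)
n = total_processes(len(hosts), 2)  # 64 nodes x 2 processes per node
```

Writing hostfile_text to a file named host_names and passing n to -n reproduces the shape of the 64-node launch shown above.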

What Distributed TensorFlow Means for Deep Learning Training on Intel® Xeon® Processors

Various efforts have been made to implement distributed TensorFlow on CPUs and graphics processing units, for example, Remote Procedure Call (gRPC), Remote Direct Memory Access (RDMA), and TensorFlow's built-in MPI support. All of these technologies are incorporated within the TensorFlow codebase. Uber Horovod is one distributed TensorFlow technology that was able to harness the power of Intel Xeon processors. It uses MPI underneath, with ring-based reduction and gather operations for DL parameters. As shown above, Horovod on Intel Xeon processors demonstrates great scaling for existing DL benchmark models, such as ResNet-50 (up to 94 percent) and Inception-v3 (up to 89 percent) for 64 nodes [5]. In other words, time to train a DL network can be accelerated by as much as 57 times (ResNet-50) and 58 times (Inception-v3) using 64 Intel Xeon processor nodes, compared to a single Intel Xeon processor node. Thus, Intel recommends that TensorFlow users use Intel® Optimization for TensorFlow and Horovod MPI for multi-node training on Intel Xeon Scalable processors.
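For readers unfamiliar with ring-based reduction, the following pure-Python simulation sketches the general ring allreduce pattern: a reduce-scatter phase followed by an allgather phase, each taking N-1 steps around a ring of N ranks. It illustrates the textbook algorithm only and is not Horovod's or MPI's actual implementation; the function name and the requirement that the vector length be divisible by the number of ranks are simplifying assumptions.

```python
from typing import List


def ring_allreduce(data: List[List[float]]) -> List[List[float]]:
    """Simulate a ring allreduce (sum) over len(data) ranks.

    data[r] is rank r's vector; its length must be a multiple of the
    number of ranks so that it splits into equal chunks.
    """
    n = len(data)
    length = len(data[0])
    if length % n:
        raise ValueError("vector length must be divisible by the number of ranks")
    size = length // n
    buf = [list(v) for v in data]  # work on copies, leave the input untouched

    def chunk(c: int) -> slice:
        return slice(c * size, (c + 1) * size)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the sends are simultaneous.
        msgs = [((r - step) % n, buf[r][chunk((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            buf[dst][chunk(c)] = [a + b for a, b in zip(buf[dst][chunk(c)], payload)]

    # Phase 2: allgather. Rank r now owns the fully reduced chunk (r + 1) mod n
    # and the reduced chunks are passed around the ring, overwriting stale data.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, buf[r][chunk((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            buf[dst][chunk(c)] = payload

    return buf


# Hypothetical per-rank gradient vectors for three ranks.
result = ring_allreduce([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]])
```

Each rank sends and receives only one chunk per step, so the data volume per rank stays roughly constant as ranks are added, which is one reason this pattern is popular for gradient exchange.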

Acknowledgements

The authors (Mahmoud Abuzaina, Ashraf Bhuiyan, Wei Wang) would like to thank Vikram Saletore, Mikhail Smorkalov, and Srinivas Sridharan for their collaboration with us on using Horovod with TensorFlow.

References

1. Performance is reported at Amazing Inference Performance with Intel® Xeon® Scalable Processors

2. The results are reported at TensorFlow* Optimizations on Modern Intel® Architecture

3. The updated results are reported in TensorFlow* Optimizations for the Intel® Xeon® Scalable Processor

4. Refer to GitHub for more details on Intel® MKL-DNN optimized primitives

5. System configuration

TensorFlow* Source Code: https://github.com/tensorflow/tensorflow
TensorFlow Commit ID: 024aecf414941e11eb643e29ceed3e1c47a115ad
CPU:
   Thread(s) per core: 2
   Core(s) per socket: 20
   Socket(s): 2
   NUMA node(s): 2
   CPU family: 6
   Model: 85
   Model name: Intel® Xeon® Gold 6148 Processor @ 2.40GHz
   Stepping: 4
   Hyper Threading: ON
   Turbo: ON
Memory: 192GB (12 x 16GB) 2666MT/s
Disks: Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)
BIOS: SE5C620.86B.00.01.0013.030920180427
OS: Red Hat* Enterprise Linux* Server release 7.4 (Maipo)
Kernel*: 3.10.0-693.21.1.0.1.el7.knl1.x86_64

 
