Deploy Distributed TensorFlow* Using Horovod* and Kubernetes* on Intel® Xeon® Platforms
Published: 11/06/2018
Last Updated: 11/06/2018
By Asmita Nijsure, Clayne B Robison, and Cong Xu
Intel has been working closely with Google in order to add optimizations to TensorFlow* for Intel® Xeon® platforms. We have seen a magnitude of performance improvement due to these optimizations and recently published an article on how to scale training of deep learning models on Intel Xeon platforms to multiple nodes using TensorFlow and Horovod*, a distributed training framework for TensorFlow. The following article discusses deploying distributed TensorFlow using Horovod on Intel Xeon platforms on a Kubernetes* cluster.
Container systems such as Docker* have made the deployment of TensorFlow easy and convenient. Users do not need to worry about missing TensorFlow dependencies, package versions, etc. during subsequent deployments. Docker containers share the machine’s OS system kernel and provide a thin layer of abstraction, thus the performance of workloads inside containers is comparable to that on the host machine. To launch containers in a cluster, users can take advantage of Kubernetes, an open source container-orchestration platform. With Kubernetes, users can deploy and operate TensorFlow on multiple Intel Xeon processor nodes without having to set each of them up individually.
Begin with a performance comparison between Kubernetes and bare metal when running deep learning training workloads. Then follow the steps to build a container as you learn to deploy multi-node deep learning workloads using Kubernetes.
Performance Results with Multi-node Containers
We trained two commonly-used models -- Resnet-50 and Inception-v3 -- on 16 nodes containing Intel Xeon Platinum processors in both bare metal environment and using containers deployed on a Kubernetes cluster. For both models, we compared the throughput and the training loss trends in the two environments.
Training throughput and container overhead
As shown in table 1, on a single node, containerized TensorFlow is able to achieve 99.77% and 99.43% of throughput on bare metal for Resnet-50 and Inception-v3 topologies respectively. As the number of nodes increases, the same trend is seen - the overhead introduced by containers and Kubernetes is negligible.
Table 1. Kubernetes cluster as a percentage of that on bare metal
Topology | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes | 16 Nodes |
---|---|---|---|---|---|
Resnet-50 | 99.77% | 99.99% | 99.95% | 99.38% | 99.18% |
Inception-v3 | 99.43% | 99.41% | 98.60% | 99.44% | 99.46% |
Docker containers and Kubernetes are lightweight and impose little overhead on computation capacity and network bandwidth of the cluster. As a consequence, containerized TensorFlow on Kubernetes is capable of delivering almost the same scaling efficiency as TensorFlow on bare metal. As shown in figure 1, TensorFlow in both bare metal and Kubernetes environments achieves 92% scaling efficiency with Resnet-50 model and 95% scaling efficiency with Inception-v3 model on 16 Intel Xeon nodes. In our test, we launched two MPI processes per node on each of the 16 nodes and used a batch size of 64 images per process.
Model training loss comparison
As presented in figure 2, we see that for both of the topologies, the trends for training loss match very closely in bare metal and Kubernetes environments. For either topology, the models trained in the two environments produced comparable accuracy (<0.3% difference) on the ILSVRC2012 validation set. Thus, when running distributed deep learning training with this container, users will see no loss in trained model accuracy as compared to bare metal.
Building Your Own Container Image
Build the container image for Intel® Optimization for TensorFlow* with Horovod using Dockerfile*:
- Ensure that you have Docker installed and running
- Check out TensorFlow
git clone https://github.com/TensorFlow/TensorFlow.git
- Set following environment variables:
export TF_DOCKER_BUILD_IS_DEVEL=yes export TF_DOCKER_BUILD_TYPE=mkl-horovod
Choose Python* 2 or Python 3 for following:
export TF_DOCKER_BUILD_PYTHON_VERSION=python2
Choose the TensorFlow version you wish to use:
export TF_DOCKER_BUILD_DEVEL_BRANCH=<TensorFlow branch/tag. Ex. r1.10>
OPTIONAL: follow the installation guide in the "Build TensorFlow from Source with Intel® Math Kernel Library (Intel® MKL)" paragraph to provide compiler switches for the TensorFlow build.
export TF_BAZEL_BUILD_OPTIONS=""
- Build the container image:
cd TensorFlow/TensorFlow/tools/docker/ ./parameterized_docker_build.sh
Multinode Training Example Using Kubernetes
The following provides a way to launch containers on a Kubernetes cluster and try out a training benchmark using TensorFlow. Use this as for guidance when deploying your own containers.
- Ensure you have kubectl installed
- In the TensorFlow git repo cloned previously, use the scripts in following area:
cd TensorFlow/TensorFlow/tools/dist_test
- Launch containers with the script. Provide the name of the image you created previously:
scripts_allreduce/k8s_deploy_TensorFlow.sh \ --num_containers <num_of_containers> \ --image <docker_image> \ --deployment <deployment_name> \ --configs_map <configs_map>
- Wait for a while, then check pods and deployment
kubectl get pods kubectl get deployment
- Log into one of the pods. Choose a pod name from the list of pods shown after using "kubectl get pods" command:
kubectl exec -ti <name_of_pod> -- /bin/bash
- The current TensorFlow benchmarks have been modified to use Horovod. You can obtain the benchmark code from GitHub*:
git clone https://github.com/TensorFlow/benchmarks cd benchmarks/scripts/tf_cnn_benchmarks
- Run the benchmark using dummy data with two message passing interface (MPI) processes per node. The following settings show example values to be used with Intel Xeon processors:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH export PATH=/usr/local/bin:$PATH export num_core_per_socket=$(lscpu | grep "Core(s) per socket:" | awk '{print $4}') export OMP_NUM_THREADS=$num_core_per_socket let intra_op=$num_core_per_socket-2 export inter_op=2 export pyscript=<path for tf_cnn_benchmark.py script> export hostfile=<path to hostfile> export numproc=2*<num_of_containers> mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \ --cpus-per-proc $num_core_per_socket --map-by node \ --hostfile $hostfile --report-bindings -n $numproc \ python $pyscript --forward_only=False \ --num_batches=200 --num_warmup_batches=50 --model=resnet50 \ --data_format=NCHW --batch_size=64 --optimizer=sgd \ --distortions=False --kmp_blocktime=0 --mkl=True \ --num_inter_threads=$inter_op --num_intra_threads=$intra_op \ --variable_update horovod --horovod_device cpu
Intel Optimization for TensorFlow for Intel Xeon platforms have led to magnified performance improvement. Use this guide to launch containers on a Kubernetes cluster and try out a training benchmark using TensorFlow. For additional information, refer to the article on how to scale training of deep learning models on Intel Xeon platforms to multiple nodes using TensorFlow and Horovod*, a distributed training framework for TensorFlow.
System Configuration
CPU | Intel® Xeon® Platinum 8180 CPU @ 2.50GHz |
OS | Ubuntu* 16.04, Kernel 4.15.0-29-generic |
TensorFlow* Source Code | https://github.com/tensorflow/tensorflow |
TensorFlow Commit ID | f2e8ef305e90151dfd3092a77880c9d046878ef8 (v1.10.0-rc0) |
Detailed configuration is as follows:
CPU | |
Thread(s) per core | 2 |
Core(s) per socket | 28 |
Socket(s) | 2 |
NUMA node(s) | 2 |
CPU family | 6 |
Model | 85 |
Model name | Intel® Xeon® Platinum 8180 @ 2.50GHz |
Stepping | 4 |
HyperThreading | ON |
Turbo | ON |
Frequency Governor Policy | powersave |
Memory | 384GB (12 x 32GB), 2666MT/s |
Disks | Intel RS3DC080 x 3 (800GB, 1.6TB, 6TB) |
BIOS | SE5C620.86B.00.01.0013.030920180427 |
Network Fabric | 10Gbit/s Ethernet |
References
- Refer to Intel® MKL-DNN optimized primitives for more details.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.