Deploy Distributed TensorFlow* Using Horovod* and Kubernetes* on Intel® Xeon® Platforms

Published: 11/06/2018  

Last Updated: 11/06/2018

By Asmita Nijsure, Clayne B Robison, and Cong Xu

Intel has been working closely with Google to add optimizations to TensorFlow* for Intel® Xeon® platforms. These optimizations have delivered substantial performance improvements, and we recently published an article on how to scale training of deep learning models on Intel Xeon platforms to multiple nodes using TensorFlow and Horovod*, a distributed training framework for TensorFlow. This article discusses deploying distributed TensorFlow using Horovod on Intel Xeon platforms on a Kubernetes* cluster.

Container systems such as Docker* have made the deployment of TensorFlow easy and convenient: users do not need to worry about missing TensorFlow dependencies, package versions, and so on during subsequent deployments. Docker containers share the host machine's OS kernel and add only a thin layer of abstraction, so the performance of workloads inside containers is comparable to that on the host. To launch containers across a cluster, users can take advantage of Kubernetes, an open source container-orchestration platform. With Kubernetes, users can deploy and operate TensorFlow on multiple Intel Xeon processor nodes without having to set up each of them individually.

This article begins with a performance comparison between Kubernetes and bare metal for deep learning training workloads, then walks through building a container image and deploying multi-node deep learning workloads using Kubernetes.

Performance Results with Multi-node Containers

We trained two commonly used models, Resnet-50 and Inception-v3, on 16 nodes with Intel Xeon Platinum processors, both in a bare metal environment and in containers deployed on a Kubernetes cluster. For both models, we compared the throughput and the training-loss trends in the two environments.

Training throughput and container overhead

As shown in table 1, on a single node, containerized TensorFlow achieves 99.77% and 99.43% of bare-metal throughput for the Resnet-50 and Inception-v3 topologies respectively. As the number of nodes increases, the same trend holds: the overhead introduced by containers and Kubernetes is negligible.

Table 1. Training throughput on the Kubernetes cluster as a percentage of that on bare metal, for 1, 2, 4, 8, and 16 nodes (on a single node: 99.77% for Resnet-50 and 99.43% for Inception-v3, with similarly small overhead at higher node counts)
Docker containers and Kubernetes are lightweight and impose little overhead on computation capacity and network bandwidth of the cluster. As a consequence, containerized TensorFlow on Kubernetes is capable of delivering almost the same scaling efficiency as TensorFlow on bare metal. As shown in figure 1, TensorFlow in both bare metal and Kubernetes environments achieves 92% scaling efficiency with Resnet-50 model and 95% scaling efficiency with Inception-v3 model on 16 Intel Xeon nodes. In our test, we launched two MPI processes per node on each of the 16 nodes and used a batch size of 64 images per process.
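To make the arithmetic behind figure 1 explicit, scaling efficiency is the measured multi-node throughput divided by N times the single-node throughput. The images/sec numbers in this sketch are hypothetical placeholders, not the measured values from our tests:

```shell
# Scaling efficiency = throughput on N nodes / (N * throughput on 1 node).
# Throughput figures below are hypothetical, for illustration only.
single_node_ips=100      # hypothetical single-node images/sec
nodes=16
multi_node_ips=1472      # hypothetical 16-node images/sec
efficiency=$(awk -v s="$single_node_ips" -v n="$nodes" -v m="$multi_node_ips" \
  'BEGIN { printf "%d", (100 * m) / (n * s) }')
echo "scaling efficiency: ${efficiency}%"
```

With these placeholder numbers the result is 92%, matching the Resnet-50 efficiency reported above for 16 nodes.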


Figure 1. Scaling efficiency

Model training loss comparison

As presented in figure 2, we see that for both of the topologies, the trends for training loss match very closely in bare metal and Kubernetes environments. For either topology, the models trained in the two environments produced comparable accuracy (<0.3% difference) on the ILSVRC2012 validation set. Thus, when running distributed deep learning training with this container, users will see no loss in trained model accuracy as compared to bare metal.


Figure 2. Training loss

Building Your Own Container Image

Build the container image for Intel® Optimization for TensorFlow* with Horovod using Dockerfile*:

  1. Ensure that you have Docker installed and running
  2. Check out TensorFlow
    git clone
  3. Set the following environment variables:
    export TF_DOCKER_BUILD_TYPE=mkl-horovod

    Choose Python* 2 or Python 3 for the following:

    export TF_DOCKER_BUILD_PYTHON_VERSION=<python2 or python3>
    Choose the TensorFlow version you wish to use:

    export TF_DOCKER_BUILD_DEVEL_BRANCH=<TensorFlow branch/tag. Ex. r1.10>

    OPTIONAL: follow the installation guide in the "Build TensorFlow from Source with Intel® Math Kernel Library (Intel® MKL)" paragraph to provide compiler switches for the TensorFlow build.

  4. Build the container image by running the parameterized build script:
    cd TensorFlow/TensorFlow/tools/docker/
    ./parameterized_docker_build.sh
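A missing build variable is a common cause of a silently wrong image, so it can help to verify that the variables from step 3 are set before invoking the build. A minimal pre-flight check, using example values for the variables named in the steps above:

```shell
# Fail fast if any required TF_DOCKER_BUILD_* variable is unset or empty.
export TF_DOCKER_BUILD_TYPE=mkl-horovod       # example value from step 3
export TF_DOCKER_BUILD_DEVEL_BRANCH=r1.10     # example branch/tag from step 3
status=ok
for val in "$TF_DOCKER_BUILD_TYPE" "$TF_DOCKER_BUILD_DEVEL_BRANCH"; do
  if [ -z "$val" ]; then
    status=missing
  fi
done
echo "pre-flight check: $status"
```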

Multinode Training Example Using Kubernetes

The following provides a way to launch containers on a Kubernetes cluster and try out a training benchmark using TensorFlow. Use this as guidance when deploying your own containers.

  1. Ensure you have kubectl installed
  2. In the TensorFlow git repo cloned previously, use the scripts in the following directory:
    cd TensorFlow/TensorFlow/tools/dist_test
  3. Launch containers with the script. Provide the name of the image you created previously:
    scripts_allreduce/ \
          --num_containers <num_of_containers> \
          --image <docker_image> \
          --deployment <deployment_name> \
          --configs_map <configs_map>
  4. Wait a few moments, then check the pods and the deployment:
    kubectl get pods
    kubectl get deployment
  5. Log into one of the pods. Choose a pod name from the list shown by the "kubectl get pods" command:
    kubectl exec -ti <name_of_pod> -- /bin/bash
  6. The current TensorFlow benchmarks have been modified to use Horovod. You can obtain the benchmark code from GitHub*:
    git clone
    cd benchmarks/scripts/tf_cnn_benchmarks
  7. Run the benchmark using dummy data with two message passing interface (MPI) processes per node. The following settings show example values to be used with Intel Xeon processors:
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
    export PATH=/usr/local/bin:$PATH
    export num_core_per_socket=$(lscpu | grep "Core(s) per socket:" | awk '{print $4}')
    export OMP_NUM_THREADS=$num_core_per_socket
    let intra_op=$num_core_per_socket-2
    export inter_op=2
    export pyscript=<path for script>
    export hostfile=<path to hostfile>
    let numproc=2*<num_of_containers>
    mpirun --cpus-per-proc $num_core_per_socket --map-by node \
    --hostfile $hostfile --report-bindings -n $numproc \
    python $pyscript --forward_only=False \
    --num_batches=200 --num_warmup_batches=50 --model=resnet50 \
    --data_format=NCHW --batch_size=64 --optimizer=sgd \
    --distortions=False --kmp_blocktime=0 --mkl=True \
    --num_inter_threads=$inter_op --num_intra_threads=$intra_op \
    --variable_update horovod --horovod_device cpu
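The thread and process settings in step 7 can be derived mechanically from the hardware. The sketch below uses a canned `lscpu` output line (28 cores per socket, matching the Intel Xeon Platinum 8180 in the system configuration) and an assumed container count of 16, so the arithmetic is explicit; real runs would parse `lscpu` directly, as in step 7:

```shell
# Derive OMP/intra/inter-op thread settings from the cores-per-socket count,
# and the MPI process count from the number of containers (2 per node).
lscpu_line="Core(s) per socket:    28"    # canned sample; real runs use lscpu
num_core_per_socket=$(echo "$lscpu_line" | awk '{print $4}')
export OMP_NUM_THREADS=$num_core_per_socket
intra_op=$((num_core_per_socket - 2))     # leave 2 cores for inter-op work
export inter_op=2
num_containers=16                         # assumed container count
numproc=$((2 * num_containers))           # two MPI processes per node
echo "OMP=$OMP_NUM_THREADS intra=$intra_op inter=$inter_op np=$numproc"
```

On the 28-core-per-socket system used in this article, this yields 28 OpenMP threads, 26 intra-op threads, 2 inter-op threads, and 32 MPI processes across 16 nodes.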

The Intel Optimization for TensorFlow on Intel Xeon platforms has led to significant performance improvements. Use this guide to launch containers on a Kubernetes cluster and try out a training benchmark using TensorFlow. For additional information, refer to the article on how to scale training of deep learning models on Intel Xeon platforms to multiple nodes using TensorFlow and Horovod*, a distributed training framework for TensorFlow.

System Configuration 

CPU: Intel® Xeon® Platinum 8180 CPU @ 2.50GHz
OS: Ubuntu* 16.04, Kernel 4.15.0-29-generic
TensorFlow* Source Code
TensorFlow Commit ID: f2e8ef305e90151dfd3092a77880c9d046878ef8 (v1.10.0-rc0)

Detailed configuration is as follows:

Thread(s) per core: 2
Core(s) per socket: 28
NUMA node(s): 2
CPU family: 6
Model name: Intel® Xeon® Platinum 8180 @ 2.50GHz
Frequency Governor Policy: powersave
Memory: 384GB (12 x 32GB), 2666MT/s
Disks: Intel RS3DC080 x 3 (800GB, 1.6TB, 6TB)
Network Fabric: 10Gbit/s Ethernet


  1. Refer to Intel® MKL-DNN optimized primitives for more details.

Product and Performance Information


Performance varies by use, configuration and other factors.