When I joined the Intel® Student Developer Program in late 2017, I was excited to try the Intel® Xeon® Scalable processors that are part of the Intel® AI DevCloud, which launched at roughly the same time. To get everyone on the same page, I will begin with what Intel Xeon Scalable processors (CPUs) are and how they affect computation. Then I will discuss what I learned about optimizing deep learning TensorFlow* code to squeeze the last drop of performance out of this beast.
The Intel Xeon Scalable processor family on the "Purley" platform is a new microarchitecture with many additional features compared to the previous-generation Intel® Xeon® processor E5-2600 v4 product family (formerly the Broadwell microarchitecture). The core reason I believe Intel Xeon Scalable processors handle artificial intelligence (AI) computations so well is the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set. It provides ultra-wide 512-bit vector operations, which cover most of the high-performance computing patterns TensorFlow* relies on: the most basic units of computation in TensorFlow are tensors flowing through operations, and those operations are parallelized across the vector processing units. These are called Single Instruction Multiple Data (SIMD) operations. For example, to add two vectors the natural way, we loop over the dimension and add the corresponding elements. A vector-capable central processing unit (CPU), by contrast, adds several elements with a single add instruction, reducing latency by a factor of up to the vector width; in our case it can process a maximum of 512 bits at a time. In practice this yields a performance boost of roughly 3 to 4x over CPUs without such wide vector units.
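To make the arithmetic concrete, here is a plain-Python sketch. This is not real SIMD (that happens inside the CPU); it only models "one instruction per 512-bit chunk" versus "one instruction per element": a 512-bit register holds sixteen 32-bit floats, so one vector add stands in for sixteen scalar adds.

```python
# Illustrative only: models the instruction-count difference between
# scalar and AVX-512-width vector addition. Real SIMD is done in hardware.
LANES = 512 // 32  # a 512-bit register holds 16 float32 values

def scalar_add(a, b):
    """One add per element: len(a) 'instructions'."""
    return [x + y for x, y in zip(a, b)]

def simd_style_add(a, b):
    """One modeled add per 16-element chunk: ~len(a)/16 'instructions'."""
    out = []
    for i in range(0, len(a), LANES):
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
    return out

a = list(range(64))
b = list(range(64))
assert scalar_add(a, b) == simd_style_add(a, b)  # same result, fewer "instructions"
```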
Now I will turn to the experiments I conducted, which prove my point. When I joined the program I already had a 1500-line codebase for a neural image captioning system to try out, which I had previously run only on the Google* Cloud Platform. A neural captioning system generates captions for images through an encoder-decoder neural network. In my case I followed the work of Vinyals et al. with slight modifications. My encoder for images is a VGG16 model, a convolutional neural network presented at ILSVRC for the object recognition competition. It turns out to be a good feature extractor, so I removed everything after its fc7 fully connected layer and used the resulting 4096-length vector. This approach is popularly known as transfer learning in the deep learning community. I pre-extracted the features of all the images in the Microsoft COCO dataset, then performed Principal Component Analysis (PCA) over the data to reduce its dimension to 512. I conducted experiments with both datasets (PCA and non-PCA). The work is still in progress, and the codebase is in my GitHub repository if you want to take a look.
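For readers unfamiliar with the PCA step, here is a minimal, pure-Python sketch of it using power iteration with deflation. The article's pipeline reduces 4096-dimensional VGG16 features to 512; this toy version reduces 3-D points to 2-D, and a real pipeline would of course use a library implementation (for example, sklearn.decomposition.PCA) rather than this hand-rolled version.

```python
# Toy PCA via power iteration (pure Python), illustrating the
# dimensionality-reduction step described above. Not production code.
def pca_reduce(data, k, iters=200):
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]  # center the data
    # covariance matrix of the centered data
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    comps = []
    for _ in range(k):
        v = [1.0] * d
        for _ in range(iters):  # power iteration converges to the dominant eigenvector
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5 or 1.0
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
        # deflate: remove the found component so the next iteration finds the next one
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
        comps.append(v)
    # project each centered sample onto the k principal components
    return [[sum(X[i][j] * c[j] for j in range(d)) for c in comps] for i in range(n)]

features = [[2.0, 0.1, 0.0], [4.0, 0.2, 0.1], [6.0, 0.1, 0.0], [8.0, 0.3, 0.1]]
reduced = pca_reduce(features, k=2)  # 4 samples, each now 2-dimensional
```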
When I first ran my code out of the box on the Intel Xeon processors, I got absolutely zero performance increase; in fact, it was a bit slower. So I spent the last month hoping I could figure it out before Santa knocked on my door. I would like to share some steps I found need to be done before you see improved performance from this complicated yet powerful processor.
- As we batch through the data each epoch, avoid disk reads and writes as far as possible. On the Intel AI DevCloud, our home folder is on a Network File System (NFS) shared between the compute nodes and the login node. Reads and writes take much longer on the cluster than on a home PC because the storage sits farther away. So how do we get around it? TensorFlow provides an elegant queue-based input mechanism through its Dataset API. The API lets us build complex input pipelines from reusable pieces of operation: it wraps your data with the pre-processing operations of your choice and batches the results together. This drastically reduces latency, since the dataset can be cached in memory (within its limitations and requirements) and all the operations are embedded in the computation graph.
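The pattern can be sketched in plain Python. The real code would chain tf.data.Dataset calls, roughly dataset.map(parse_fn).cache().batch(batch_size), where parse_fn is whatever pre-processing you need; this stdlib stand-in (with a hypothetical preprocess function and dummy numeric data) just shows the cache-then-batch idea that keeps the epoch loop off the NFS:

```python
# Plain-Python stand-in for the tf.data cache-then-batch pattern described
# above; preprocess() and the numeric samples are hypothetical placeholders.
def preprocess(x):
    return x * 2  # stand-in for decoding/augmenting a sample read from disk

def pipeline(samples, batch_size, fn):
    cached = [fn(s) for s in samples]      # "cache": preprocess once, keep in memory
    for i in range(0, len(cached), batch_size):
        yield cached[i:i + batch_size]     # "batch": group consecutive elements

batches = list(pipeline(range(10), batch_size=4, fn=preprocess))
print(batches)  # [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18]]
```

On later epochs the expensive preprocess step is never repeated, which is exactly what Dataset caching buys you on a cluster with slow shared storage.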
- There is a very nice paper by Colfax Research from which I would like to pass on a few tips that directly affected my performance. It deals with optimizing an object detection network based on YOLO and recommends tuning certain variables that are quite important from a performance point of view.
- KMP_BLOCKTIME: This is one of several variables that control the behavior of the OpenMP* API, a parallel programming interface that is primarily responsible for multi-threaded operation inside the TensorFlow API. It controls the wait time, in milliseconds, that an OpenMP thread spends spinning after completing a parallel region before going to sleep. A large value keeps your data hot, but at the same time it can easily starve other threads of resources, so this variable needs to be tuned to suit our interests. In my case I kept it at 30.
os.environ["KMP_BLOCKTIME"] = "30"
- OMP_NUM_THREADS: This is the number of parallel threads that a TensorFlow operation can use. The recommended setting for TensorFlow is the number of physical cores. I tried 136 and it worked in my case.
os.environ["OMP_NUM_THREADS"] = "136"
- KMP_AFFINITY: This provides abstract control over the placement of OpenMP threads on physical cores. The recommended setting for TensorFlow is 'granularity=fine,compact,1,0'. 'fine' pins each thread in place, preventing thread migration and thereby reducing cache misses. 'compact' places neighboring threads close together. '1' gives priority to placing threads on different free physical cores rather than on the sibling hyper-threads of an already occupied core, much as electron orbitals are filled in atoms. '0' sets the starting core index for the mapping.
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
- Inter- and intra-op parallelism threads: These are variables provided by TensorFlow to control how many operations can run simultaneously (inter-op) and how many parallel threads each operation can use (intra-op). In my case I set the former to two and the latter equal to OMP_NUM_THREADS (as recommended).
tf.app.flags.DEFINE_integer('inter_op', 2, """Inter op Parallelism Threads""")
tf.app.flags.DEFINE_integer('intra_op', 136, """Intra op Parallelism Threads""")
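Putting the pieces together, a minimal sketch of the whole tuning block might look like this. The values 30, 2, and 136 are the ones reported above for my runs, not universal recommendations, and the TensorFlow 1.x session wiring is shown in comments:

```python
import os

# Threading knobs from the tips above; tune the numbers for your machine.
# These must be set before TensorFlow is imported, or the OpenMP runtime
# may initialize without them.
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["OMP_NUM_THREADS"] = "136"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

# In TensorFlow 1.x, the inter/intra-op settings then feed the session:
# config = tf.ConfigProto(inter_op_parallelism_threads=2,
#                         intra_op_parallelism_threads=136)
# sess = tf.Session(config=config)
```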
After tuning all of these variables I got a performance increase of up to 4x, reducing my per-epoch time from 2.5 hours to 30 minutes. As I said, Intel Xeon Scalable processors are quite powerful, and what we get in the Intel AI DevCloud promises a theoretical peak of 260 TFLOPS, but that cannot be expected out of the box unless certain cards fall into place.
- Intel® Xeon® Scalable Platform
- arXiv:1603.04467: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
- Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
- Single Instruction Multiple Data (SIMD)
- Google Cloud Platform
- arXiv:1411.4555: Show and Tell: A Neural Image Caption Generator
- arXiv:1409.1556: Very Deep Convolutional Networks for Large-Scale Image Recognition
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- arXiv:1405.0312: Microsoft COCO: Common Objects in Context
- TensorFlow Dataset API
- Optimization of Real-Time Object Detection on Intel® Xeon® Scalable Processors, Colfax Research
- arXiv:1506.02640: You Only Look Once: Unified, Real-Time Object Detection