Intel® CPU Outperforms NVIDIA* GPU on ResNet-50 Deep Learning Inference

Published: 05/13/2019  

Last Updated: 05/13/2019

By Haihao Shen, Andres Felipe Rodriguez Perez, Wei Li, Cong Xu, Xu Deng, Feng Tian, and Indu Kalyanaraman

Intel has been advancing both hardware and software rapidly in recent years to accelerate deep learning workloads. Today, we have achieved leadership performance of 7878 images per second on ResNet-50 with our latest generation of Intel® Xeon® Scalable processors, outperforming the 7844 images per second of the NVIDIA Tesla V100*, which is the best GPU result NVIDIA has published on its website, T4 included.

This is a significant milestone for customers who have Intel Xeon Scalable processors widely available in their clouds and data centers. The CPU is general purpose, designed for a broad set of applications. Customers can run whichever workload is important to their business at a given time, because the CPU adapts to dynamic compute demands. Accelerators are appropriate for certain user scenarios, where dedicated hardware is economically justified. Intel is also developing deep learning accelerators for both inference and training. However, a CPU with high deep learning capabilities gives AI customers the flexibility to manage their compute infrastructure uniformly and cost-effectively.

Deep learning is used in image and video processing, natural language processing, personalized recommender systems, and reinforcement learning. The types of workloads and algorithms are expanding rapidly. A general-purpose CPU adapts well to this dynamically changing environment.

We measured the throughput of ResNet-50 on a 2nd gen Intel Xeon Scalable processor (formerly codenamed Cascade Lake), specifically the Intel® Xeon® Platinum 9282 processor, a high core-count, multi-chip packaged server processor, using Intel® Optimization for Caffe*. We achieved 7878 images per second by simultaneously running 28 software instances, each across four cores, with batch size 11. Per NVIDIA's published numbers as of the date of this publication (May 13, 2019), the NVIDIA Tesla V100 reaches 7844 images per second and the NVIDIA Tesla T4 reaches 4944 images per second.
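The instance layout described above can be sketched as simple arithmetic: 28 instances of four cores each cover all 112 physical cores of the two-socket system. The script below is a minimal illustration, not the launch script used for the measurement. The core numbering (cores 0-55 on socket 0, cores 56-111 on socket 1) and the helper names `core_range` and `socket_of` are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: maps 28 instances onto 112 physical cores, 4 cores each.
# Assumption (not from the article): physical cores are numbered 0..111,
# with cores 0-55 on socket 0 and cores 56-111 on socket 1.

INSTANCES=28
CORES_PER_INSTANCE=4
CORES_PER_SOCKET=56
BATCH_SIZE=11

TOTAL_CORES=$(( INSTANCES * CORES_PER_INSTANCE ))   # 112 = 2 sockets x 56 cores
IMAGES_IN_FLIGHT=$(( INSTANCES * BATCH_SIZE ))      # 308 images per step over all instances

# core_range N: print the "first-last" core range for instance N (0-based).
# Hypothetical helper for this example.
core_range() {
  first=$(( $1 * CORES_PER_INSTANCE ))
  echo "${first}-$(( first + CORES_PER_INSTANCE - 1 ))"
}

# socket_of N: print the socket hosting instance N under the numbering assumption.
# Hypothetical helper for this example.
socket_of() {
  echo $(( $1 * CORES_PER_INSTANCE / CORES_PER_SOCKET ))
}

i=0
while [ "$i" -lt "$INSTANCES" ]; do
  echo "instance $i: cores $(core_range "$i") on socket $(socket_of "$i")"
  i=$(( i + 1 ))
done
echo "total cores: $TOTAL_CORES, images in flight: $IMAGES_IN_FLIGHT"
```

Because 56 is a multiple of 4, no instance straddles two sockets under this numbering. Each printed range could be handed to a core-pinning tool such as `taskset -c` or `numactl --physcpubind`. The article does not say which pinning mechanism was used, so check the real core numbering with `lscpu` before relying on this layout.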

In April 2019, Intel announced the 2nd gen Intel® Xeon® Scalable processors with Intel® Deep Learning Boost (Intel® DL Boost) technology. This technology includes integer vector neural network instructions (VNNI), which provide high throughput for 8-bit inference with a theoretical peak compute gain of 4x INT8 OPS over FP32 OPS.
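One way to see where the theoretical 4x figure comes from is to count elements per vector register. The sketch below assumes 512-bit vectors and assumes that one VNNI multiply-accumulate instruction issues at the same rate as one FP32 fused multiply-add. Both are simplifications made for this illustration, and the result is a peak ratio rather than a measured speedup.

```shell
#!/bin/sh
# Sketch: theoretical INT8-over-FP32 gain from element counts per vector.
# Assumptions: 512-bit vector registers, equal instruction issue rate.

VECTOR_BITS=512
FP32_LANES=$(( VECTOR_BITS / 32 ))        # 16 FP32 values fit in one vector
INT8_LANES=$(( VECTOR_BITS / 8 ))         # 64 INT8 values fit in one vector
PEAK_GAIN=$(( INT8_LANES / FP32_LANES ))  # multiply-accumulates per instruction, INT8 vs FP32

echo "FP32 lanes: $FP32_LANES, INT8 lanes: $INT8_LANES, theoretical gain: ${PEAK_GAIN}x"
```

Real workloads usually land below this peak, because memory bandwidth, quantization overhead, and layers that remain in FP32 all reduce the achieved gain.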

Intel Optimized Caffe is an open-source deep learning framework maintained by Intel for the broad deep learning community. We have recently added four general optimizations for INT8 inference: 1) activation memory optimization, 2) weight sharing, 3) convolution algorithm tuning, and 4) first convolution transformation.


We demonstrated the effectiveness of Intel Xeon processors with optimized deep learning software, achieving a ResNet-50 throughput of 7878 images per second on Intel Xeon Platinum 9282 processors and outperforming NVIDIA's best GPUs.

Appendix: Reproducible Instructions

Step 1: Install ICC compiler following

Step 2: Get Caffe source code

git clone caffe

cd caffe

Step 3: Build with ICC

sh scripts/

source /opt/intel/compilers_and_libraries/linux/bin/ intel64

unset CPATH

CC=icc CXX=icpc cmake ../ -DCMAKE_BUILD_TYPE=Release -DBLAS=mkl -DCPU_ONLY=1 -DBOOST_ROOT=<caffe>/boost_1_64_0/install

CC=icc CXX=icpc make all -j262

Step 4: Small update at the header of for CLX-AP

Step 5: Run test


Configuration Details

Intel Xeon Platinum 9282 Processor: Tested by Intel as of 5/10/2019. DL Inference: Platform: Intel S2900WK 2S Intel Xeon Platinum 9282 (56 cores per socket), HT ON, turbo ON, Total Memory 768 GB (24 slots/ 32 GB/ 2933 MHz), BIOS: SE5C620.86B.0D.01.0492.042220190420, CentOS 7.6 3.10.0-957.10.1.el7.x86_64, Deep Learning Framework: Intel Optimized Caffe version: Commit id: 1141d7f, ICC 2019.2.187 for build, MKL DNN version: v0.19 (commit hash: 9a865c2b935e611f4ea451a26bebe45ec5ef4160), model: resnet50_int8_perf_clx_winograd.prototxt, BS=11, synthetic data: 3x230x230, 28 software instances/2 sockets, Datatype: INT8; throughput: 7878 images/s

Performance results are based on testing as of dates shown in configuration and may not reflect all publicly available security updates. No product can be absolutely secure. See configuration disclosure for details.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product user and reference guides for more information regarding the specific instruction sets covered by this notice.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit:

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
