Artificial Intelligence with the New Intel® Xeon® Scalable Processors: The Most Agile AI Platform

The new Intel® Xeon® Scalable processor family provides scalable performance for the widest variety of AI and other data center workloads, including deep learning. The platform offers built-in return on investment (ROI), potent performance, and production-ready support for AI deployments.

In our smart and connected world, machines are increasingly learning to sense, reason, act, and adapt in the real world. Artificial Intelligence (AI) is the next big wave of computing, and Intel uniquely has the experience to fuel the AI computing era. AI will let us accelerate solutions to large-scale problems that would otherwise take months, years, or decades to resolve.

AI will unleash new scientific discoveries, automate undesirable tasks, and extend our human senses and capabilities. Today, machine learning (ML) and deep learning (DL) are two underlying approaches to AI, as are reasoning-based systems.

Deep learning is the most rapidly emerging branch of machine learning, in many cases supplanting classic ML. It relies on massive labeled data sets to iteratively "train" many-layered neural networks inspired by the human brain. Trained neural networks are then used to "infer" the meaning of new data, with increased speed and accuracy for processes like image search, speech recognition, natural language processing, and other complex tasks.
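
To make that train-then-infer pattern concrete, here is a minimal NumPy toy, not one of the frameworks benchmarked below: a one-layer model is iteratively "trained" on labeled data, and the learned weights are then used to "infer" labels for new, unseen data.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Training": iteratively adjust the weights of a tiny one-layer model
# on a labeled dataset (an illustrative toy, not a benchmarked topology).
X = rng.normal(size=(512, 8))                  # features
y = (X.sum(axis=1) > 0).astype(float)          # labels
w = np.zeros(8)
for _ in range(200):                           # iterative weight updates
    p = 1.0 / (1.0 + np.exp(-X @ w))           # forward pass
    w -= 0.5 * X.T @ (p - y) / len(X)          # gradient step

# "Inference": apply the trained weights to new, unseen data.
x_new = rng.normal(size=(3, 8))
print((1.0 / (1.0 + np.exp(-x_new @ w)) > 0.5).astype(int))
```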

Architectural improvements together with enhanced software optimizations deliver potent performance on the new Intel® Xeon® server-class platforms: up to 138x improvement in inference and 113x improvement in training compared to older systems running unoptimized software (see the configuration details below).

Here we present AI workloads showing inference and training performance on the new Intel® Xeon® Scalable processors compared to the previous-generation Intel® Xeon® processors. All performance measurements were current as of July 11, 2017.

Inference is performed at the data center or at the edge. The key metrics for inference performance are throughput and total cost of ownership (TCO). Inference runs on new, unlabeled data, and the output classification can be fed into a number of different usages, including a dashboard for visualization or a decision tree for automated decision making.
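
As a concrete illustration of the throughput metric, the sketch below times a batched inference loop and reports images per second. The `predict` callable is a stand-in for any framework's forward pass; the batch shape and iteration counts are illustrative assumptions, not the benchmarked settings.

```python
import time
import numpy as np

def measure_throughput(predict, batch_size=64, n_batches=20):
    """Time repeated forward passes and report throughput in images/second."""
    batch = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
    predict(batch)                      # warm-up pass, excluded from timing
    start = time.perf_counter()
    for _ in range(n_batches):
        predict(batch)                  # forward pass only, no backpropagation
    elapsed = time.perf_counter() - start
    return batch_size * n_batches / elapsed

# Dummy stand-in for a real framework's forward pass:
print(f"{measure_throughput(lambda x: x.mean(axis=(1, 2, 3))):.0f} images/second")
```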

Here we show inference throughput for image recognition workloads using multiple frameworks (TensorFlow, Caffe, Neon, and MXNet) and multiple topologies (AlexNet, GoogLeNet v1, ResNet-50, and VGG-19).

Today, training runs in the data center. The key metrics are time to train and total cost of ownership (TCO). Training typically requires a large labeled dataset: one portion is used to iteratively adjust the connection weights in the neural network, and another portion is used to test the error rate of the trained network, ensuring that the model has not overfit the training data.
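
A minimal sketch of that split, using NumPy and a toy linear model (the 80/20 ratio and the model itself are illustrative assumptions, not part of the benchmarked setups): one portion fits the weights, the held-out portion estimates the error rate, and a test error far above the training error would signal overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 64))      # toy stand-in for a labeled dataset
y = (X[:, 0] > 0).astype(int)          # toy labels

# Hold out 20% of the data: train on one portion, test on the other.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train, test = idx[:split], idx[split:]

# Stand-in for iteratively adjusting connection weights: fit a linear rule.
w = np.linalg.lstsq(X[train], 2 * y[train] - 1, rcond=None)[0]

def error_rate(rows):
    pred = (X[rows] @ w > 0).astype(int)
    return (pred != y[rows]).mean()

# A test error far above the training error would indicate overfitting.
print(f"train error: {error_rate(train):.3f}, test error: {error_rate(test):.3f}")
```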

Here we show training throughput for image recognition workloads using multiple frameworks (TensorFlow, Caffe, Neon, and MXNet) and multiple topologies (AlexNet, GoogLeNet v1, ResNet-50, and VGG-19).

Configuration Details for Inference Throughput and Training Throughput

| Platform | 2S Intel® Xeon® Platinum 8180 processor @ 2.50GHz (28 cores) | 2S Intel® Xeon® processor E5-2699 v4 @ 2.20GHz (22 cores) |
| --- | --- | --- |
| Hyper-Threading | HT disabled | HT enabled |
| Turbo | Turbo disabled | Turbo disabled |
| Driver | Scaling governor set to "performance" via the intel_pstate driver | Scaling governor set to "performance" via the acpi-cpufreq driver |
| Memory | 384GB DDR4-2666 ECC RAM | 256GB DDR4-2133 ECC RAM |
| OS | CentOS* Linux release 7.3.1611 (Core) | CentOS* Linux release 7.3.1611 (Core) |
| Kernel | Linux kernel 3.10.0-514.10.2.el7.x86_64 | Linux kernel 3.10.0-514.10.2.el7.x86_64 |
| SSD | Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC) | Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC) |
| Performance measurement command variables | Environment variables: KMP_AFFINITY='granularity=fine,compact', OMP_NUM_THREADS=56; CPU frequency set with cpupower frequency-set -d 2.5G -u 3.8G -g performance | Environment variables: KMP_AFFINITY='granularity=fine,compact,1,0', OMP_NUM_THREADS=44; CPU frequency set with cpupower frequency-set -d 2.2G -u 2.2G -g performance |

Caffe

| | 2S Intel® Xeon® Platinum 8180 | 2S Intel® Xeon® E5-2699 v4 |
| --- | --- | --- |
| Revision | http://github.com/intel/caffe/, revision f96b759f71b2281835f690af267158b82b150b5c | http://github.com/intel/caffe/, revision f96b759f71b2281835f690af267158b82b150b5c |
| Other arguments | Inference measured with the "caffe time --forward_only" command; training measured with the "caffe time" command. Caffe run with "numactl -l". | Inference measured with the "caffe time --forward_only" command; training measured with the "caffe time" command. |
| Dataset | For "ConvNet" topologies, a dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. | For "ConvNet" topologies, a dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. |
| Topologies | Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use the newer Caffe prototxt format but are functionally equivalent). | Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use the newer Caffe prototxt format but are functionally equivalent). |
| Compiler | Intel C++ compiler ver. 17.0.2 20170213 | GCC 4.8.5 |
| Library | Intel® MKL small libraries version 2018.0.20170425 | Intel® MKL small libraries version 2017.0.2.20170110 |

TensorFlow

| | 2S Intel® Xeon® Platinum 8180 | 2S Intel® Xeon® E5-2699 v4 |
| --- | --- | --- |
| Revision | https://github.com/tensorflow/tensorflow, commit 207203253b6f8ea5e938a512798429f91d5b4e7e | https://github.com/tensorflow/tensorflow, commit 207203253b6f8ea5e938a512798429f91d5b4e7e |
| Other arguments | Inter-op parallelism threads set to 1 for the AlexNet and VGG benchmarks and 2 for the GoogLeNet benchmarks; intra-op parallelism threads set to 56; data format NCHW; KMP_BLOCKTIME set to 1 for the GoogLeNet and VGG benchmarks and 30 for the AlexNet benchmark. Inference measured with the --forward_only option; training measured with the --forward_backward_only option. | Inter-op parallelism threads set to 1 for the AlexNet and VGG benchmarks and 2 for the GoogLeNet benchmarks; intra-op parallelism threads set to 44; data format NCHW; KMP_BLOCKTIME set to 1 for the GoogLeNet and VGG benchmarks and 30 for the AlexNet benchmark. Inference measured with the --forward_only option; training measured with the --forward_backward_only option. |
| Dataset | Dummy data was used. | Dummy data was used. |
| Topologies | Performance numbers were obtained for three ConvNet benchmarks (AlexNet, GoogLeNet v1, and VGG) from https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow. | Performance numbers were obtained for three ConvNet benchmarks (AlexNet, GoogLeNet v1, and VGG) from https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow. |
| Compiler | GCC 4.8.5 | GCC 4.8.5 |
| Library | Intel® MKL small libraries version 2018.0.20170425 | Intel® MKL small libraries version 2018.0.20170425 |

MXNet

| | 2S Intel® Xeon® Platinum 8180 | 2S Intel® Xeon® E5-2699 v4 |
| --- | --- | --- |
| Revision | https://github.com/dmlc/mxnet/, revision 5efd91a71f36fea483e882b0358c8d46b5a7aa20 | https://github.com/dmlc/mxnet/, revision e9f281a27584cdb78db8ce6b66e648b3dbc10d37 |
| Other arguments | Inference measured with "benchmark_score.py"; training measured with a modified version of benchmark_score.py that also runs backward propagation. | Inference measured with "benchmark_score.py"; training measured with a modified version of benchmark_score.py that also runs backward propagation. |
| Dataset | Dummy data was used. | Dummy data was used. |
| Topologies | Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. | Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. |
| Compiler | GCC 4.8.5 | GCC 4.8.5 |
| Library | Intel® MKL small libraries version 2018.0.20170425 | Intel® MKL small libraries version 2017.0.2.20170110 |

Neon

| | 2S Intel® Xeon® Platinum 8180 | 2S Intel® Xeon® E5-2699 v4 |
| --- | --- | --- |
| Revision | ZP/MKL_CHWN branch, commit 52bd02acb947a2adabb8a227166a7da5d9123b6d | ZP/MKL_CHWN branch, commit 52bd02acb947a2adabb8a227166a7da5d9123b6d |
| Other arguments | The main.py script was used for benchmarking, in mkl mode. | The main.py script was used for benchmarking, in mkl mode. |
| Dataset | Dummy data was used. | Dummy data was used. |
| Compiler | ICC version 17.0.3 20170404 | ICC version 17.0.3 20170404 |
| Library | Intel® MKL small libraries version 2018.0.20170425 | Intel® MKL small libraries version 2018.0.20170425 |
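
For readers reproducing the Caffe measurements, here is a minimal sketch of how the environment variables and commands in the tables above fit together, shown for the 2S Platinum 8180 column. The prototxt path is a placeholder, and the exact flags accepted depend on the Intel Caffe build.

```python
import os
import subprocess

# Environment settings from the configuration table (2S Platinum 8180 column).
env = os.environ.copy()
env["KMP_AFFINITY"] = "granularity=fine,compact"  # pin OpenMP threads to cores
env["OMP_NUM_THREADS"] = "56"                     # 2 sockets x 28 cores

# Inference times forward passes only; dropping --forward_only measures
# forward + backward, as in the training runs. "numactl -l" keeps memory
# allocations on the local NUMA node, matching the table.
subprocess.run(
    ["numactl", "-l", "caffe", "time",
     "--forward_only", "--model", "path/to/deploy.prototxt"],  # placeholder path
    env=env, check=True,
)
```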

Caffe Inference Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet | 1 | 235 | 152 |
| AlexNet | 1024 | 2656 | 1146 |
| GoogLeNet v1 | 1 | 117 | 103 |
| GoogLeNet v1 | 1024 | 814 | 405 |
| ResNet-50 | 1 | 69 | 45 |
| ResNet-50 | 1024 | 226 | 118 |
| VGG-19 | 1 | 73 | 37 |
| VGG-19 | 256 | 136 | 62 |
| AlexNet ConvNet | 1 | 582 | 282 |

TensorFlow Inference Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet ConvNet | 1 | 144 | 126 |
| AlexNet ConvNet | 1024 | 3382 | 2135 |
| GoogLeNet ConvNet | 256 | 533 | 411 |
| GoogLeNet ConvNet | 1024 | 658 | 427 |
| VGG ConvNet | 32 | 236 | 129 |
| VGG ConvNet | 256 | 248 | 140 |

MXNet Inference Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet | 1 | 428 | 251 |
| AlexNet | 1024 | 2439 | 1093 |
| VGG-19 | 1 | 121 | 71 |
| VGG-19 | 256 | 333 | 155 |
| Inception v3 | 16 | 170 | 121 |
| Inception v3 | 1024 | 250 | 164 |
| ResNet-50 | 1 | 47 | 41 |
| ResNet-50 | 256 | 115 | 79 |

Neon Inference Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet ConvNet | 1 | 138 | 86 |
| AlexNet ConvNet | 1024 | 2889 | 1305 |
| GoogLeNet v1 ConvNet | 4 | 153 | 80 |
| GoogLeNet v1 ConvNet | 1024 | 1036 | 445 |
| ResNet-18 | 4 | 224 | 133 |
| ResNet-18 | 1024 | 672 | 286 |

Caffe Training Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet | 256 | 947 | 453.9 |
| GoogLeNet v1 | 96 | 268 | 145.2 |
| ResNet-50 | 50 | 85 | 45.4 |
| VGG-19 | 64 | 40 | 18.9 |
| AlexNet ConvNet | 256 | 1089 | 495.2 |
| GoogLeNet v1 ConvNet | 96 | 288 | 146.3 |
| VGG ConvNet | 64 | 89 | 44.4 |

TensorFlow Training Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet ConvNet | 256 | 969.7 | 387.7 |

MXNet Training Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| AlexNet | 256 | 672.4 | 335.4 |
| VGG | 256 | 94.4 | 51.7 |
| Inception-BN | 256 | 134.5 | 86.1 |
| Inception v3 | 256 | 61.8 | 41.1 |
| ResNet-50 | 256 | 44.3 | 30.3 |

Neon Training Throughput Performance, Measured in Images/Second (BS = Batch Size)

| Topology | BS | 2S Intel® Xeon® Platinum 8180 processor, 28C, 2.5GHz | 2S Intel® Xeon® processor E5-2699 v4, 22C, 2.2GHz |
| --- | --- | --- | --- |
| GoogLeNet v1 ConvNet | 128 | 220.6 | 129 |
| ResNet-18 | 128 | 197.0 | 90.2 |

Begin your AI journey today using existing, familiar Intel® Xeon® infrastructure with server-class reliability, and increase data center utilization without additional, unique investments. Here we show time-to-train performance (lower is better) for an image recognition workload using the Caffe framework and the GoogLeNet v1 topology.
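
To relate the training-throughput tables to time to train, a back-of-the-envelope estimate is time = (images per epoch × epochs) / throughput. The sketch below applies this to the GoogLeNet v1 Caffe training throughputs reported above; the ImageNet-scale dataset size and the 90-epoch budget are illustrative assumptions, not measured values.

```python
def time_to_train_hours(images_per_epoch, epochs, images_per_second):
    """Rough time-to-train estimate from sustained training throughput."""
    return images_per_epoch * epochs / images_per_second / 3600

# Illustrative: ImageNet-scale dataset (~1.28M images), hypothetical 90 epochs,
# GoogLeNet v1 Caffe training throughputs from the tables above.
for label, ips in [("2S Platinum 8180", 268), ("2S E5-2699 v4", 145.2)]:
    print(f"{label}: {time_to_train_hours(1_281_167, 90, ips):.1f} hours")
```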

Configuration Details for Time to Train

| Platform | 2S Intel® Xeon® Platinum 8180 processor @ 2.50GHz (28 cores) | 2S Intel® Xeon® processor E5-2699 v4 @ 2.20GHz (22 cores) |
| --- | --- | --- |
| Hyper-Threading | HT disabled | HT enabled |
| Turbo | Turbo disabled | Turbo disabled |
| Driver | Scaling governor set to "performance" via the intel_pstate driver | Scaling governor set to "performance" via the acpi-cpufreq driver |
| Memory | 384GB DDR4-2666 ECC RAM | 256GB DDR4-2133 ECC RAM |
| OS | CentOS* Linux release 7.3.1611 (Core) | CentOS* Linux release 7.3.1611 (Core) |
| Kernel | Linux kernel 3.10.0-514.10.2.el7.x86_64 | Linux kernel 3.10.0-514.10.2.el7.x86_64 |
| SSD | Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC) | Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC) |
| Performance measurement command variables | Environment variables: KMP_AFFINITY='granularity=fine,compact', OMP_NUM_THREADS=56; CPU frequency set with cpupower frequency-set -d 2.5G -u 3.8G -g performance | Environment variables: KMP_AFFINITY='granularity=fine,compact,1,0', OMP_NUM_THREADS=44; CPU frequency set with cpupower frequency-set -d 2.2G -u 2.2G -g performance |
| Caffe revision | http://github.com/intel/caffe/, revision f96b759f71b2281835f690af267158b82b150b5c | http://github.com/intel/caffe/, revision f96b759f71b2281835f690af267158b82b150b5c |
| Other arguments | Training measured with the "caffe time" command. Caffe run with "numactl -l". | Training measured with the "caffe time" command. |
| Dataset | For "ConvNet" topologies, a dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. | For "ConvNet" topologies, a dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. |
| Topologies | Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet v1) | Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet v1) |
| Compiler | Intel C++ compiler ver. 17.0.2 20170213 | GCC 4.8.5 |
| Library | Intel® MKL small libraries version 2018.0.20170425 | Intel® MKL small libraries version 2017.0.2.20170110 |