Last month, AWS announced three new C5 instances (c5.12xlarge, c5.24xlarge, and c5.metal), all featuring custom 2nd Generation Intel® Xeon® Scalable processors (code-named Cascade Lake) with a sustained all-core turbo frequency of 3.6GHz, a maximum single-core turbo frequency of 3.9GHz, and Intel® Deep Learning Boost (Intel® DL Boost) technology enabled. Intel DL Boost refers to a group of acceleration features, including the new Vector Neural Network Instructions (VNNI), that speed up deep learning operations such as convolution and GEMM by using INT8 instead of FP32 FMA computation at four times the throughput. This improves inference performance across a wide range of deep learning workloads without requiring retraining. Table 1 provides more details on C5 instance types, with the three new instances highlighted. As the price per vCPU remains unchanged, these new C5 instances are a compelling choice for deep learning inference applications.
| Model | vCPU | Memory (GiB) | Instance Storage (GiB) | Network Bandwidth (Gbps) | EBS Bandwidth (Mbps) |
|---|---|---|---|---|---|
| c5.large | 2 | 4 | EBS-Only | Up to 10 | Up to 3,500 |
| c5.xlarge | 4 | 8 | EBS-Only | Up to 10 | Up to 3,500 |
| c5.2xlarge | 8 | 16 | EBS-Only | Up to 10 | Up to 3,500 |
| c5.4xlarge | 16 | 32 | EBS-Only | Up to 10 | 3,500 |
You can start using these new instances today in the following regions: US East (N. Virginia), US West (Oregon), Europe (Frankfurt, Ireland, London, Paris, Stockholm), Asia Pacific (Sydney), and AWS GovCloud (US).
Using MXNet as an example of a typical deep learning framework, Intel DL Boost with VNNI on the new c5.24xlarge instance can boost the performance of popular image classification and object detection models by 2.6x to 3.8x (Figure 1) with minimal or no accuracy loss (Figure 2). For more details on how to speed up inference workloads using VNNI, please see our recent work on Model Quantization for Production-Level Neural Network Inference.
An Introduction to Deep Learning Inference
In deep learning, inference means deploying a pretrained neural network model to perform a wide variety of tasks, including speech detection, image classification, object detection, and other prediction tasks. For enterprises, inference is especially important because it is the stage of the analytics pipeline where production-level data is used to produce valuable insights. A huge number of inference requests from end users is constantly being routed to cloud servers all over the world. Recent studies show that major data centers currently rely heavily on CPUs for inference, and they predict rapid growth in machine learning across existing and new data center and cloud services. Because a majority of data centers run on CPUs today, it is critical to ensure that inference workloads perform efficiently on them.
What is VNNI — And How Does it Work?
Various researchers have demonstrated that both deep learning training and inference can be performed with lower numerical precision, using 16-bit multipliers for training and 8-bit multipliers for inference, with minimal to no loss in accuracy. Lower numerical precision (training with 16-bit multipliers accumulated to 32 bits, and inference with 8-bit multipliers accumulated to 32 bits) will likely become the standard over the next year. VNNI is an ISA implementation of this method: it extends Intel® AVX-512 with vectorized INT8 FMA instructions that deliver four times the throughput of FP32 FMA.
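To make the "8-bit multipliers accumulated to 32 bits" idea concrete, here is a minimal pure-Python sketch that emulates what a VNNI dot-product instruction does in each 32-bit lane. It is modeled on VPDPBUSD, an instruction name taken from Intel's public instruction set documentation rather than from this post. The helper names (`vpdpbusd_lane`, `vpdpbusd`) are hypothetical, and the code illustrates semantics only, not performance.

```python
def vpdpbusd_lane(acc, a_u8, b_s8):
    """Emulate one 32-bit lane: acc += sum of four (unsigned 8-bit * signed 8-bit) products."""
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)      # first operand: unsigned bytes
    assert all(-128 <= b <= 127 for b in b_s8)   # second operand: signed bytes
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    # Wrap to a signed 32-bit value, as the non-saturating form does.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


def vpdpbusd(acc_lanes, a_bytes, b_bytes):
    """Emulate a whole vector: every accumulator lane consumes four byte pairs."""
    assert len(a_bytes) == len(b_bytes) == 4 * len(acc_lanes)
    return [
        vpdpbusd_lane(acc, a_bytes[4 * i:4 * i + 4], b_bytes[4 * i:4 * i + 4])
        for i, acc in enumerate(acc_lanes)
    ]


# A 512-bit register holds 16 lanes of 32 bits.
LANES_512 = 512 // 32
fp32_macs_per_instr = LANES_512       # one FP32 multiply-add per lane
int8_macs_per_instr = LANES_512 * 4   # four INT8 multiply-adds per lane

if __name__ == "__main__":
    result = vpdpbusd([0] * LANES_512, [1] * 64, [2] * 64)
    print(result[:4], int8_macs_per_instr // fp32_macs_per_instr)
```

Each 32-bit lane performs four multiply-accumulates instead of one, which is where the fourfold per-instruction throughput over FP32 FMA comes from.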
Intel DL Boost in Action
Intel DL Boost technology has been integrated into popular deep learning frameworks such as MXNet*, TensorFlow*, PyTorch*, PaddlePaddle*, and Caffe*. The Apache MXNet community has delivered quantization approaches that enable INT8 inference and the use of VNNI. iFLYTEK, which uses 2nd Gen Intel Xeon Scalable processors and Intel® Optane™ SSDs for its AI applications, has reported that Intel DL Boost delivers similar or better performance than inference on alternative architectures. For more information on framework support, please refer to our recent blog post, Increasing AI Performance and Efficiency with Intel® DL Boost.
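As a rough illustration of what framework-level quantization does, the sketch below quantizes FP32 values to INT8 with a simple symmetric per-tensor scale, accumulates the dot product in integers, and rescales the result. This is an assumption-laden toy, not MXNet's actual implementation (which involves calibration and operator fusion). The function names and sample values are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one scale per tensor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(xq, x_scale, wq, w_scale):
    """Accumulate INT8 products as integers, then rescale back to a float."""
    acc = sum(a * b for a, b in zip(xq, wq))  # wide integer accumulator
    return acc * x_scale * w_scale


# Hypothetical sample data, chosen only for illustration.
activations = [0.5, -1.25, 2.0, 0.75]
weights = [0.1, 0.4, -0.3, 0.9]

fp32_result = sum(a * w for a, w in zip(activations, weights))
xq, xs = quantize_symmetric(activations)
wq, ws = quantize_symmetric(weights)
int8_result = int8_dot(xq, xs, wq, ws)

if __name__ == "__main__":
    print("FP32:", fp32_result)
    print("INT8:", int8_result)
```

The two results differ only by a small quantization error, which is the intuition behind INT8 inference running without retraining at little or no accuracy cost.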
Get Started with Intel DL Boost
Follow the steps in our blog post Model Quantization for Production-Level Neural Network Inference to experience the performance improvement from Intel DL Boost on the new EC2 C5 instances, or run the shell scripts directly in the following order:
- ec2_benchmark_base.sh to get the data for FP32 without op fusion (Baseline).
- ec2_benchmark_int8.sh to get the data for FP32 with op fusion (Better) and INT8 with op fusion (Best).
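Before running the scripts above, it can be worth confirming that the instance actually exposes VNNI. The following is a minimal sketch that assumes a Linux guest, where the kernel lists CPU features on the `flags` lines of `/proc/cpuinfo` and reports VNNI as `avx512_vnni`. The helper names are hypothetical and not part of the benchmark scripts.

```python
from pathlib import Path


def has_vnni(cpuinfo_text):
    """Return True if any 'flags' line lists the avx512_vnni CPU feature."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx512_vnni" in value.split():
            return True
    return False


def check_host(path="/proc/cpuinfo"):
    """Return True/False for this host, or None if the file is unavailable."""
    cpuinfo = Path(path)
    if not cpuinfo.exists():
        return None  # not Linux, or /proc is not mounted
    return has_vnni(cpuinfo.read_text())


if __name__ == "__main__":
    print("VNNI available:", check_host())
```

If the check reports `False`, the INT8 script may still run, but it would not be using the VNNI code path described in this post.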
Check out the following blog posts for more details on Intel DL Boost features, configurations, benchmarks and framework integrations:
- Vector Neural Network Instructions Enable Int8 AI Inference on Intel Architecture
- Increasing AI Performance and Efficiency with Intel DL Boost
- Lower Numerical Precision Deep Learning Inference and Training
- Model Quantization for Production-Level Neural Network Inference
- Accelerating TensorFlow Inference with Intel Deep Learning Boost on 2nd Gen Intel Xeon Scalable Processors
- Intel and Facebook Collaborate to Boost PyTorch CPU Performance
- INT8 Inference Support in PaddlePaddle on 2nd Gen Intel Xeon Scalable Processors
- BigDL Model Inference with Intel DL Boost
- Intel® CPU Outperforms NVIDIA* GPU on ResNet-50 Deep Learning Inference
- Artificial Intelligence with 2nd Gen Intel® Xeon® Scalable Processor
Many thanks to my colleagues Yixin Bao, Ciyong Chen, Xinyu Chen, Ying Guo, Zhiyuan Huang, Tao Lv, Eric Lin, Wei Li, Zhennan Qin, Shufan Wu, Zixuan Wei, Pengxin Yuan, Lujia Yin, Patric Zhao, Rong Zhang, and many others at Intel for their great work on optimizing deep learning frameworks with state-of-the-art acceleration technology on Intel processors. Thanks also to Emily Hutson for providing valuable feedback.
Appendix: Notices and Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Performance results are based on testing as of 1st July 2019 by AWS and may not reflect all publicly available security updates. No product or component can be absolutely secure.
Test Configuration:
- Reproduce Script: https://github.com/intel/optimized-models/tree/v1.0.6/mxnet/blog/medium_vnni
- Software: Apache MXNet 1.5.0b20190623 and benchmark script commit id f44f6cfbe752fd8b8036307cecf6a30a30ad8557
- Hardware: AWS EC2 c5.24xlarge, custom 2nd Generation Intel Xeon Scalable processors (Cascade Lake) with a sustained all-core turbo frequency of 3.6GHz and a single-core turbo frequency of up to 3.9GHz.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation