What a difference a few years make! We have seen tremendous shifts in the field of deep learning as it moves from model training to inference deployment within main lines of business across industries. Customers are asking us different questions now: How do I scale in the real world, quickly and cost-effectively, to stay competitive? How do I run AI applications in products with strict latency requirements for inference results? The answer has three key pieces: get more from the architecture you already know, accelerate with purpose, and use software to simplify the environment. Before we look at these key elements of real-world deployments, let’s examine how we got here.
3 Years, 3 Major Changes
As deep learning (DL) has matured, CPU solutions have turned a page, achieving many-fold performance boosts through AI-specific optimization. Today, even older CPUs can deliver performance many times better than once thought possible, let alone new generations of enhanced CPUs.
- Entirely new software: Libraries that didn’t exist just a couple years ago now allow you to access broadly deployed CPU hardware in AI-specific ways, preserving existing software environments and the hardware you use to run other enterprise applications.
- New hardware features: For the past few years we’ve worked to enhance our x86-based architecture with new AI hardware features, and earlier this month, we announced the release of 2nd generation Intel® Xeon® Scalable Processors with Intel® Deep Learning Boost technology to accelerate inference.
- AI cycles shift to inference in lines of business: We’re on the cusp of a major shift to inference deployments at scale that meet a critical blend of performance, cost, and energy efficiency needs. Currently, we estimate the training to inference workload ratio to be around 1:5. CPUs are well-architected for the high-throughput, low-latency compute that the shift to inference demands.
Deploying AI in the Real World
Now that we’ve examined these shifts, let’s revisit the question: How do I scale performant AI applications efficiently and cost-effectively? The answer is clear and manageable:
- Get more performance from the Xeon foundation you know. Thanks to the new software and hardware enhancements discussed above, Intel Xeon Scalable processors have never been more performant for AI applications. Using the latest software libraries and compilers will ensure optimal performance on existing hardware, and upgrading to 2nd Gen Intel Xeon Scalable processors provides a significant hardware-based performance improvement, thanks in part to Intel DL Boost’s new Vector Neural Network Instructions (VNNI).
- Accelerate with purpose, for continuous and intensive tensor compute. The most demanding part of deep learning compute is the arithmetic done on large multi-dimensional arrays called tensors. When an application needs continuous, intensive tensor arithmetic, a purpose-built deep learning accelerator is the right solution. These ASICs, designed to do this specific task extremely well, work in tandem with the main host CPU to offload the intensive deep learning parts of the application.
- Keep the software environment updated and simple. Software is key! Using the latest software versions, libraries, and optimizations with deep learning frameworks (like TensorFlow, MXNet, PyTorch, and PaddlePaddle) will “unlock” the CPU hardware, including features of newer Intel Xeon Scalable processor generations. We’re also focused on delivering a streamlined environment that connects popular deep learning frameworks like TensorFlow* to various hardware platforms like CPUs, accelerators, and FPGAs.
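To make the hardware point above concrete, here is a minimal NumPy sketch (an illustration, not Intel’s implementation) of the arithmetic pattern that Intel DL Boost’s VNNI accelerates in hardware: multiplying 8-bit activations by 8-bit weights and accumulating into 32-bit integers in one fused step. Because INT8 values are a quarter the size of FP32, each vector instruction can process several times more elements, which is where much of the inference speedup comes from.

```python
import numpy as np

def vnni_style_dot(activations_u8: np.ndarray, weights_s8: np.ndarray) -> np.int32:
    """Sketch of one VNNI accumulator lane: u8 x s8 products summed into int32.

    VNNI fuses this multiply-and-accumulate into a single instruction;
    here we simply emulate the arithmetic to show what is being computed.
    """
    assert activations_u8.dtype == np.uint8 and weights_s8.dtype == np.int8
    # Widen to int32 before multiplying so the products cannot overflow.
    return np.int32(np.sum(activations_u8.astype(np.int32) * weights_s8.astype(np.int32)))

# Quantized INT8 tensors: 4x less memory traffic than FP32, and 4x more
# elements per vector register on the same hardware.
acts = np.array([10, 20, 30, 40], dtype=np.uint8)
wts = np.array([1, -2, 3, -4], dtype=np.int8)
print(vnni_style_dot(acts, wts))  # 10 - 40 + 90 - 160 = -100
```

In practice you never write this loop yourself; an optimized framework build (e.g., with MKL-DNN) emits the VNNI instructions for you when the model is quantized to INT8.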
Common Ground Across Industries
Many companies, representing a diverse range of applications, markets, data, and audiences, are using this three-part approach to deploy real-world AI today. Some are long-time users of Intel Xeon processors. Others are taking advantage of the new 2nd Gen Intel Xeon Scalable processors. With their hardware and software optimizations targeting AI workloads, these CPUs deliver up to 14X inference throughput over the previous generation.
Here are a few customers seeing great success deploying AI on Intel:
- Philips - Fortune 500 company that cost-effectively deployed fast deep-learning inference on tens of thousands of servers and scanning machines already in the field.
- Taboola - The world’s largest content recommendation engine sped up inference by 257% while reducing planned infrastructure spend by scaling with CPUs instead of GPUs.
- iFlyTek - This voice recognition leader in China phased out GPUs in favor of CPUs to process six billion transactions daily.
- TACC (Texas Advanced Computing Center) - Their new Frontera system based entirely around 2nd Gen Intel Xeon Scalable processors with Intel® Optane™ DC persistent memory yields 40 petaflops of peak performance to enable groundbreaking discoveries using massively-parallel AI inference on HPC systems.
These collaborations and many others succeeded because the companies and academic institutions were able to meet performance demands and extend their existing solutions with AI capabilities while minimizing the cost of change. Another recurring theme was the flexibility to quickly adapt to new usages and opportunities.
Facebook: A Case Study in Acceleration
Because Intel® Xeon® Scalable processors are relied upon for so many other enterprise workloads, leveraging them for AI comes at minimal extra cost. Yet as AI matures, the path to the future calls for decisions about when further acceleration is needed for intensive, continuous, high-volume tensor compute. Custom ASICs work hand-in-hand with Xeon-based infrastructure to offload and accelerate the intensive, tensor-based deep learning parts of the application, leaving the rest to benefit from the host CPU.
Customers like Facebook, whose deep learning demands grow more intensive and sustained, are looking to augment their current CPU-based inference with this new class of accelerators that offer very high concurrency of large numbers of compute elements (spatial architectures), fast data access, high-speed memory close to the compute, high-speed interconnect, and multi-node scaled solutions.
For this reason, Facebook has been a close collaborator with us on the Intel® Nervana™ Neural Network Processor NNP-I 1000 (codenamed Spring Hill), in production later this year. As a leading community platform that unites nearly half the world, Facebook relies on driving and helping build substantial advancements in AI. That includes this new generation of power-optimized, highly tuned AI inference chips, which we expect to be a leap forward in inference application acceleration, delivering industry-leading performance per watt on real production workloads. The Intel Nervana NNP-I 1000 will be fully integrated with Facebook’s Glow compiler to help keep their software environment simple and highly optimized.
The Scale Challenge: A Solution Strategy
Many companies have just started, or are about to start, their deep learning inference deployment at scale. They span a variety of industries but share a similar overarching, multi-faceted question: how do I add high-impact DL functionality to line-of-business applications, at the required performance, with managed cost and change, while maintaining flexibility for future needs? Here, I offer a basic strategy for deployment in the data center and cloud:
- Optimize and extend CPU platforms (e.g., Intel Xeon Scalable Processors) for a great balance of performance, TCO, investment preservation, and versatility.
- For usages with high-intensity, continuous tensor operations, add purpose-built DL acceleration ASICs (e.g., Intel's upcoming Neural Network Processors) that have tight software integration with the host CPUs for the most effective offloading.
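For the first step of this strategy, much of the CPU performance comes from simply configuring the runtime well. Below is a hedged sketch of the kind of thread-affinity tuning that appears in Intel’s own benchmark configuration in the notices at the end of this post (KMP_AFFINITY and OMP_NUM_THREADS); the specific values here are assumptions that depend on your core count and workload, not a prescription.

```python
import os

# Thread-pinning environment variables, similar to those used in the
# benchmark configuration in the notices below. The values are illustrative:
# tune OMP_NUM_THREADS to your physical core count and workload.
os.environ["OMP_NUM_THREADS"] = "28"                     # e.g., one thread per physical core
os.environ["KMP_AFFINITY"] = "granularity=fine,compact"  # pin OpenMP threads to cores

# Set these BEFORE importing the deep learning framework, because the
# OpenMP runtime reads them once at initialization:
# import tensorflow as tf  # or mxnet, torch, paddle, ...
```

Combined with an optimized framework build and NUMA-aware placement (the benchmark config also uses `numactl -l`), this often recovers a large fraction of the available CPU inference throughput before any hardware change.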
The AI landscape is shifting constantly and quickly. What you couldn’t do three years ago, you can do now. It’s an exciting time to witness the impact of enterprise-scale inference deployments and advancements in both hardware and software, from devices to data centers. I can’t wait to see what the next three years bring!
Notices and Disclaimers
Up to 14X AI performance improvement with Intel® Deep Learning Boost (Intel DL Boost) compared to Intel® Xeon® Platinum 8180 processor (July 2017).

New configuration, tested by Intel as of 2/20/2019: 2-socket Intel® Xeon® Platinum 8280 processor, 28 cores, HT on, Turbo on; total memory 384 GB (12 slots / 32GB / 2933 MHz); BIOS: SE5C620.86B.0D.01.0271.120720180605 (ucode: 0x200004d); Ubuntu 18.04.1 LTS, kernel 4.15.0-45-generic; SSD 1x sda INTEL SSDSC2BA80 SSD 745.2GB, nvme1n1 INTEL SSDPE2KX040T7 SSD 3.7TB. Deep learning framework: Intel® Optimization for Caffe* version 1.1.3 (commit hash: 7010334f159da247db3fe3a9d96a3116ca06b09a); ICC version 18.0.1; MKL DNN version v0.17 (commit hash: 830a10059a018cd2634d94195140cf2d8790a75a); model: https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/resnet50_int8_full_conv.prototxt; BS=64, DummyData, 4 instances / 2 sockets, datatype: INT8.

Baseline configuration, tested by Intel as of July 11th, 2017: 2-socket Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, Turbo disabled, scaling governor set to “performance” via intel_pstate driver; 384GB DDR4-2666 ECC RAM; CentOS* Linux release 7.3.1611 (Core), Linux kernel* 3.10.0-514.10.2.el7.x86_64; SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC). Performance measured with environment variables KMP_AFFINITY=’granularity=fine,compact’, OMP_NUM_THREADS=56; CPU frequency set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: (https://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command; training measured with “caffe time” command. For “ConvNet” topologies, a dummy dataset was used; for other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (ResNet-50). Intel® C++ Compiler ver. 17.0.2 20170213, Intel® Math Kernel Library (Intel® MKL) small libraries version 2018.0.20170425. Caffe run with “numactl -l”.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation