All around the world, customers like Novartis, Warner Bros., GE Healthcare, and Ziva Dynamics are achieving excellent real-world AI results on Intel® architecture. However, AI hardware is nothing without software. The complex set of machine learning, deep learning, and advanced analytics workloads that comprise modern AI applications requires versatile, performant software optimized to make the best use of that hardware’s features.
My team and I deliver software optimizations for deep learning on current-gen and future-gen Intel® Xeon® Scalable processors. I’m excited to share our progress this week at O’Reilly AI San Francisco.
Making an Impact in AI
In 2017 alone, Intel produced more than $1 billion in AI-driven Intel Xeon processor revenue. “One billion” is a big number, but it still doesn’t fully capture the effect that Intel Xeon Scalable processors are having in AI. Much of today’s AI runs on Intel Xeon processor-based servers that organizations already use for tasks that keep critical infrastructure up and running, perform advanced analytics, or enable high-performance computing. With this in mind, we enhanced the Intel Xeon Scalable platform specifically to run high-performance AI workloads alongside the other cloud and data center workloads these servers already run. This gives you the best of both worlds. At Intel’s 2018 Data-Centric Innovation Summit, we showcased new features coming in future generations of the Intel Xeon Scalable platform, called Intel® Deep Learning Boost (Intel® DL Boost), that will further accelerate deep learning inferencing on Intel architecture.
The first of these technologies, the Vector Neural Network Instruction set (VNNI), will be included in the next generation of the Intel Xeon Scalable platform and will accomplish in a single instruction what formerly required three. With VNNI, we’ve projected a performance increase of up to 11X in low-precision inferencing for this next-generation platform, compared to the performance of the Intel Xeon Scalable platform at its launch in July 2017. The microarchitecture to follow will add support for bfloat16, a new numeric format quickly being adopted by AI practitioners for highly accurate algorithmic performance and increased parallelism at a fraction of the power.
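To make the "one instruction instead of three" idea concrete, here is a minimal pure-Python sketch of the arithmetic involved. It assumes the 8-bit case, where groups of four unsigned 8-bit activations are multiplied by four signed 8-bit weights and summed into a 32-bit accumulator. The function names are hypothetical, the instruction names in the comments are given only as the likely hardware counterparts, and the sketch ignores 32-bit overflow because Python integers do not wrap.

```python
def saturate_int16(value):
    """Clamp to the signed 16-bit range, as the intermediate step does."""
    return max(-32768, min(32767, value))


def dot_accumulate_three_step(acc, a_u8, b_s8):
    """Emulate the older three-step sequence on plain Python lists."""
    # Step 1 (in the spirit of VPMADDUBSW): multiply u8 by s8 and add
    # adjacent pairs, saturating each pair sum to 16 bits.
    pairs = [
        saturate_int16(a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1])
        for i in range(0, len(a_u8), 2)
    ]
    # Step 2 (in the spirit of VPMADDWD with a vector of ones): add
    # adjacent 16-bit values into one 32-bit value per group of four.
    quads = [pairs[i] + pairs[i + 1] for i in range(0, len(pairs), 2)]
    # Step 3 (in the spirit of VPADDD): add into the running accumulators.
    return [c + q for c, q in zip(acc, quads)]


def dot_accumulate_fused(acc, a_u8, b_s8):
    """Emulate the fused form: multiply, sum four products, accumulate."""
    return [
        c + sum(a_u8[4 * k + j] * b_s8[4 * k + j] for j in range(4))
        for k, c in enumerate(acc)
    ]


activations = [1, 2, 3, 4, 10, 20, 30, 40]   # unsigned 8-bit values
weights = [1, -1, 2, -2, 3, 3, -3, -3]       # signed 8-bit values
accumulators = [100, -5]                      # one per group of four

print(dot_accumulate_three_step(accumulators, activations, weights))
print(dot_accumulate_fused(accumulators, activations, weights))
```

For these inputs both functions print `[97, -125]`. The two paths can differ when a pair sum exceeds the 16-bit range, because the intermediate step of the older sequence saturates while the fused form keeps full 32-bit precision.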
Accelerating the Most Popular Deep Learning Frameworks
Many recent results point to the efficacy of Intel Xeon Scalable processors for deep learning applications across enterprises and in the cloud.
- Stanford DAWNBench - In April 2018, Intel® Optimized Caffe* running in Amazon EC2 [c5.18xlarge] demonstrated the ability to classify one ImageNet image using a model with a top-5 validation accuracy of 93% or greater in just a few milliseconds. As of September 2018, Intel has posted the three fastest completion times for this particular inferencing task.
- Novartis - Pharmaceutical leader Novartis accelerated the time to train a multiscale convolutional neural network (M-CNN) for 10K high-content cellular microscopic images from hours to minutes, with more than 99 percent accuracy, using multi-node Intel Xeon Scalable processor-based servers, Intel® Omni-Path Architecture (Intel® OPA), and multi-node TensorFlow*. This amounts to an improvement of greater than 6x.
- Apache MXNet* - As of its v1.2.0 release, MXNet integrates Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to speed the execution of deep neural network operations including Convolution, Deconvolution, FullyConnected, Pooling, Batch Normalization, Activation, LRN, Softmax, as well as common operators such as sum and concat. In early testing by an Intel AI team, these optimizations have been shown to decrease latency for single-picture inference by up to 43x and increase throughput by up to 56.9x with a batch size of 32 images.
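The results above quote two different metrics: latency for a single image and throughput at a batch size of 32. As a rough illustration of how the two are typically measured, here is a minimal timing sketch. It is not the harness used for any of the results above; `toy_predict` and `make_batch` are hypothetical stand-ins for a real framework's inference call and input pipeline.

```python
import time


def benchmark(predict, make_batch, batch_size, iterations=50, warmup=5):
    """Return (latency per batch in ms, throughput in images per second)."""
    batch = make_batch(batch_size)
    # Warm-up runs keep one-time setup costs out of the measurement.
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(iterations):
        predict(batch)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iterations
    throughput = batch_size * iterations / elapsed
    return latency_ms, throughput


def make_batch(batch_size, pixels=1024):
    """Hypothetical input: each 'image' is a flat list of floats."""
    return [[0.5] * pixels for _ in range(batch_size)]


def toy_predict(batch):
    """Hypothetical stand-in for a model's forward pass."""
    return [sum(image) for image in batch]


single_latency_ms, _ = benchmark(toy_predict, make_batch, batch_size=1)
_, batched_throughput = benchmark(toy_predict, make_batch, batch_size=32)
print("latency at batch size 1 (ms):", single_latency_ms)
print("throughput at batch size 32 (images/s):", batched_throughput)
```

Latency is reported at batch size 1 because it reflects how long one request waits, while throughput is reported at a larger batch size because batching lets the hardware process many images per call.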
Facilitating Deep Learning Application Development and Deployment
Our work prioritizes the out-of-box experience for data scientists and developers using TensorFlow through Optimized Wheels and the Anaconda* Python* distribution. Our goal is to improve access to the latest performance improvements for Intel processors in TensorFlow. These performance improvements are largely due to the integration of and improvements to Intel MKL-DNN.
Gaining the benefit of Intel MKL-DNN in TensorFlow formerly required building TensorFlow with the MKL tag, which could be a tedious, time-consuming process. We’re now easing this process through the release of Intel-optimized Wheels (or pre-built binaries) and containers for TensorFlow. Customers can now simply use ‘pip’ to install these existing libraries instead of building a new optimized TensorFlow instance.
We’re additionally excited to showcase that the latest Intel optimizations (using Intel MKL-DNN libraries) can be installed easily and quickly using “conda install” in a conda environment on Linux* OS. Anaconda is a Python distribution that includes many of the most popular packages for data science, analytics, machine learning, and deep learning. Anaconda users can now easily install TensorFlow optimized with Intel MKL-DNN from Anaconda.org into their virtual environments. These performance-optimized wheels and streamlined TensorFlow installations through Anaconda represent great improvements in terms of ease of use.
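After installing through pip or conda, it can be useful to confirm that the TensorFlow build in the active environment actually reports MKL support. The sketch below is one hedged way to do that. It assumes the build exposes an `IsMklEnabled` function in one of a few internal modules; these locations are assumptions, they have moved between TensorFlow versions, and they are not a stable public API, so the helper (a hypothetical name) returns `None` when it cannot tell.

```python
import importlib

# Internal module paths where IsMklEnabled has been reported to live.
# These are assumptions and may not exist in a given TensorFlow version.
CANDIDATE_MODULES = (
    "tensorflow.python.pywrap_tensorflow",
    "tensorflow.python.util._pywrap_util_port",
    "tensorflow.python._pywrap_util_port",
)


def tensorflow_mkl_enabled():
    """Return True/False if the build reports MKL status, else None."""
    try:
        importlib.import_module("tensorflow")
    except Exception:
        # TensorFlow is not installed or failed to load in this environment.
        return None
    for name in CANDIDATE_MODULES:
        try:
            module = importlib.import_module(name)
        except Exception:
            continue
        check = getattr(module, "IsMklEnabled", None)
        if callable(check):
            return bool(check())
    # No known location exposed the check; the status is unknown.
    return None


print("MKL enabled:", tensorflow_mkl_enabled())
```

A result of `None` means only that this probe found nothing, not that the optimizations are absent.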
Accelerating Real AI on Intel® Architecture
Software is key to moving AI forward. Intel – and my team – will continue to deliver the performance and simplicity needed to shorten the distance between idea and production AI solution. For more on software optimizations and tools for AI on Intel architecture, please look for us at O’Reilly AI San Francisco this week, follow @intelAI on Twitter, and stay tuned to ai.intel.com.
Notices and Disclaimers
The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation
 § Configuration: CPU: Intel Xeon 6148 processor @ 2.4GHz, Hyper-threading: Enabled. NIC: Intel® Omni-Path Host Fabric Interface, TensorFlow: v1.7.0, Horovod: 0.12.1, OpenMPI: 3.0.0. OS: CentOS 7.3, OpenMPU 23.0.0, Python 2.7.5. Time to Train to converge to 99% accuracy in model. Performance results are based on testing as of May 25th, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.