Inside Intel: The Race for Faster Machine Learning

Teams across Intel spent the last 18 months refining hardware, software, and algorithms as part of a broad initiative to improve machine learning on our architecture. Here's an inside look at that effort.


In 2012, researchers at the University of Toronto designed an image recognition system substantially more accurate than the previous state of the art. Dubbed AlexNet, the system relied on a deep neural network that performed multiple layers of calculations before producing an answer.

Data scientists widely viewed AlexNet as a big step forward for the young field of machine learning. Researchers at Intel, meanwhile, noticed that AlexNet ran on graphics processing units (GPUs), not central processing units (CPUs). Developers of subsequent neural networks also used GPUs, usually citing a speed advantage over CPUs.

Machine learning, which involves programs that get more accurate with experience, is fundamentally different from any kind of computing that’s come before.

“There’s always been a simple division of labor: machines do number crunching, and humans make decisions,” says Pradeep Dubey, an Intel Fellow at the company’s Intel Labs division.

Machine-learning programs are different, particularly those in the high-profile deep-learning subset, which can teach themselves. These programs have the potential to discover new drug compounds or identify consumer trends without human intervention.

For Dubey and others at Intel, it was clear that they needed to find a way to make machine-learning programs work well on Intel’s architecture. That meant improving the speed, accuracy, and efficiency of the hardware, software, frameworks, and algorithms on which they rely.

“The goal is not just the fastest but the most productive machine-learning platform for researchers,” Dubey says.

Now, two-and-a-half years after the company first prioritized machine learning, Dubey believes that Intel has pulled it off.

A dual-socket Intel® Xeon® processor-based system running newly optimized AlexNet code, for example, can classify images at a rate 10 times faster than it did with code not optimized for CPUs. And Intel just released two software libraries containing CPU-optimized algorithms that are, on average, five times faster than previous versions.1

The new Intel® Xeon Phi™ chip adds hardware improvements as well. An Intel Xeon Phi processor-based system can train an AlexNet image-classification model up to 2.3 times faster than a similarly configured system using Nvidia* GPUs.2

Intel's internal research also finds that the Intel Xeon Phi processor delivers up to nine times more performance per dollar than a hosted GPU solution, and up to eight times more performance per watt.3

These enhancements make machine learning practical on today’s most widely deployed and scalable CPU-based data center architecture. They make machine learning broadly accessible for the first time.


The Need for Speed

The notion that computer systems could learn on their own has been around for decades. But getting a computer to recognize, say, pictures of cats requires first exposing it to millions of images so it can learn to distinguish cats from other animals.

When Dubey began looking into machine learning in earnest, businesses were beginning to compile data sets large enough to teach computers. Speed is critical when processing so much data.

Since it introduced its first 64-bit processor in 2003, Intel has focused on developing CPUs that can perform many highly accurate double-precision floating-point operations per second, or FLOPS. GPUs, by contrast, emphasize single-precision arithmetic, which requires less work per operation and lets them post high FLOPS scores on benchmarking tests.
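
To make the precision point concrete, here is a small, purely illustrative NumPy sketch that times the same matrix multiply in single and double precision. The matrix size, repeat count, and whatever BLAS backend the local NumPy build happens to use are arbitrary assumptions; the numbers it prints are not Intel's benchmark results.

```python
# Illustrative only: time an n x n matrix multiply in single and double
# precision and convert the result to GFLOPS. Actual ratios depend on the
# CPU and the BLAS library NumPy is built against.
import time
import numpy as np

def gflops(dtype, n=2048, repeats=5):
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = (time.perf_counter() - start) / repeats
    return 2 * n ** 3 / elapsed / 1e9  # a matrix multiply is ~2*n^3 flops

print(f"float32: {gflops(np.float32):8.1f} GFLOPS")
print(f"float64: {gflops(np.float64):8.1f} GFLOPS")
```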

But as Dubey and others at Intel examined performance on machine-learning tasks, it became clear that the speed difference between CPUs and GPUs wasn't inherent to the hardware platforms themselves.

“The performance gap [had] nothing to do with a FLOPS gap,” Dubey says. “It was coming from missing integration, missing libraries.”

Intel began “working on all levels of the stack at the same time,” he says. It launched parallel efforts to optimize its hardware and software, as well as open-source algorithms and frameworks, for machine learning.

“More optimized middleware in the middle, more power at the bottom, more ways for academics and researchers to easily interact with us,” Dubey says.


Studying the Math

To Craig Garland, a software development manager at Intel, getting machine-learning programs to run better on Intel’s architecture meant optimizing algorithms.

"Deep neural-network algorithms boil down to mathematics," Garland says. He and a team of about 15 developers at Intel's site in Russia researched how leading academics were structuring programs focused on image and speech recognition. Then "we studied the math."

With a goal of increasing speed at least fivefold, the developers focused on optimizing about two dozen machine-learning algorithms for Intel® Architecture.

Among them: the Apriori algorithm, which finds hidden correlations, such as the fact that people who buy milk are also likely to buy bread; and the K-means clustering algorithm, which groups similar records and can surface abnormalities, such as a potentially fraudulent transaction, in a large data set.
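
To give a flavor of the math Garland's team was tuning, here is a minimal K-means sketch in plain NumPy. It is not Intel's optimized implementation, and the cluster count, iteration cap, and synthetic data below are arbitrary choices made purely for illustration.

```python
# A bare-bones K-means: repeatedly assign points to the nearest center,
# then move each center to the mean of its points. Points that end up far
# from every center are candidate anomalies.
import numpy as np

def kmeans(points, k=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

data = np.random.default_rng(1).normal(size=(500, 2))
labels, centers = kmeans(data)
residuals = np.linalg.norm(data - centers[labels], axis=1)
print("most anomalous point:", data[residuals.argmax()])
```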

Existing open-source versions of these algorithms ran slowly on Intel architecture. Benchmarking and adapting each algorithm is "laborious work" familiar to data scientists, Garland says: a repetitive cycle of adjusting the code, running tests, and then looking for further ways to improve performance, as sketched below.
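
As a toy version of that cycle, the sketch below reuses the assignment step from the K-means example above and times a straightforward Python loop against a vectorized NumPy rewrite of the same computation. The data shapes and trial counts are arbitrary, and real tuning for a specific CPU goes much further (threading, cache blocking, vector instructions).

```python
# Toy tune-and-measure loop: compare a plain-Python nearest-center
# assignment against a vectorized NumPy version of the same step.
import timeit
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 16))
centers = rng.normal(size=(8, 16))

def assign_naive():
    labels = []
    for p in points:
        dists = [np.sum((p - c) ** 2) for c in centers]
        labels.append(int(np.argmin(dists)))
    return labels

def assign_vectorized():
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

for fn in (assign_naive, assign_vectorized):
    per_pass = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__:>18}: {per_pass * 1000:8.2f} ms per pass")
```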

The developers tested the results, sometimes using data shared by customers seeking improvement for a particular use case. Intel's engineers also built their own tools for generating test data sets. The results of this work were made available to software developers in the Intel® Math Kernel Library (Intel® MKL) and Intel® Data Analytics Acceleration Library (Intel® DAAL) 2017 releases.
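
For developers who reach these libraries indirectly, one common route is a NumPy build linked against Intel MKL (many Anaconda distributions ship this way), in which case dense linear algebra calls are handed off to MKL automatically. The short check below simply reports which BLAS/LAPACK back end a given installation was built with; its output format varies across NumPy versions.

```python
# Print the BLAS/LAPACK libraries this NumPy build links against;
# MKL-backed builds list MKL here, and calls such as numpy.dot or the
# @ operator on large arrays then run on its optimized kernels.
import numpy as np

np.show_config()
```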


Across the Stack

Meanwhile, an Intel team in Oregon was studying the impact of machine learning on the company’s enterprise customers.

"We look at where demand for different workloads is increasing and create a roadmap for where our products should be going," says Ananth Sankaranarayanan, director of engineering. Analytics use is increasing across virtually all industries as the connected devices that make up the Internet of Things generate significant volumes of data. In the next few years, he says, "we expect 50 billion devices to come online and connect via gateways to the data center."

Based on its own research—as well as the development efforts by Intel’s teams of data scientists and developers—Sankaranarayanan’s team came up with a plan to adapt processor design to boost performance.

These efforts, taken together, are improving performance for machine learning at every layer of Intel’s technology stack, from the computing silicon up through networking, software primitives, middleware, and applications.

“What’s super exciting about this is the degree of performance improvement that’s possible,” Dubey says. Initial research showed that small adjustments centered on machine learning could significantly increase speed. In some tests, he says, the “improvement in performance was dramatic.”

Big speed jumps like the ones Intel is seeing on specific algorithms are rarely possible in traditional high-performance computing, where problems are well defined and optimization work has been under way for many years. Machine-learning algorithms still have room for improvement.

Andrey Nikolaev, a software architect on the team in Russia, says that at times it seems he and his colleagues have optimized an algorithm as far as it can go.

“Then tomorrow we will understand—or someone will come to us and show us—how we can make it faster,” he says. “Optimization is something you can do forever.”

Product and Performance Information

1. Configuration information - Hardware: Intel® Xeon® Processor E5-2699 v3, 2 eighteen-core CPUs (45MB LLC, 2.3GHz), Intel® Turbo Boost Technology off, Intel® Hyper-Threading Technology off, 64GB of RAM; Operating System: RHEL 6.5 GA x86_64; testing source: internal Intel measurements.

2. Up to 2.3x faster training per system claim based on AlexNet* topology workload (batch size = 256) using a large image database running a 4-node Intel® Xeon Phi™ processor 7250 (16 GB, 1.4 GHz, 68 cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux* 6.7 (Santiago), 1.0 TB SATA drive WD1003FZEX-00MK2A0 system disk, running Intel® Optimized DNN Framework and Intel® Optimized Caffe (source: https://github.com/intelcaffe), training 1.33 billion images/day in 10.5 hours, compared to a 1-node host with four NVIDIA "Maxwell" GPUs training 1.33 billion images/day in 25 hours (source: http://www.slideshare.net/NVIDIA/gtc-2016-opening-keynote slide 32).

3. http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-competitive-performance.html
Configuration information: One 2-socket Intel® Xeon® Processor E5-2697 v4 (45M cache, 2.3GHz, 18 cores), 128GB memory, vs. one NVIDIA* Tesla K80 GPU, NVIDIA CUDA* 7.5.17 (Driver 352.39), ECC enabled, persistence mode enabled.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter.