Accelerating Compute-Intensive Applications at Google

Key Takeaways

  • Google Cloud has already shown the potential of its infrastructure for compute-intensive applications. Using C2 instances based on 2nd Generation Intel® Xeon® Scalable processors, Climacell achieved 40% better price/performance than N1 instances.1

  • Using Intel® Advanced Vector Extensions 512 (Intel® AVX-512), the Genomics team at Google Brain cut the runtime of the most computationally intensive stage of its genome-sequencing application from over 14 hours to 3 hours 25 minutes.2

  • Intel® Deep Learning Boost (Intel® DL Boost) can help to speed up applications including image classification, speech recognition, language translation, and object detection.

Deploying compute-intensive applications in the cloud, or consuming them as online services, is more affordable and faster for customers than using their own hardware. Google has experience providing infrastructure for compute-intensive applications to its cloud customers, and similar infrastructure can be used for other compute-intensive applications within Google.

The C2 instance, Google Cloud’s first compute-optimized instance, offers 40 percent better compute performance3 and a 90 percent higher CPU frequency4 than the previous N1 instance. It’s based on 2nd Generation Intel® Xeon® Scalable processors, and Google Cloud Platform (GCP) customers are already seeing the performance gains C2 offers:

  • WP Engine powers some of its WordPress Digital Experience Platform services with 2nd Generation Intel® Xeon® Scalable-based “C2” (compute-optimized) instances on GCP. Combined with other software optimizations, WP Engine achieved platform performance that was 60 percent faster than before5.
  • Climacell uses C2 for its weather micro-forecasting tools. In internal benchmarking of a proof-of-concept solution comparing a Google Cloud C2 instance with previous-generation N1 clusters6, Climacell achieved 40 percent better price/performance with C2 than with N1.

Accelerating Compute-Intensive and Artificial Intelligence (AI) Applications

Compute-intensive and artificial intelligence (AI) applications can often benefit from being optimized for Single Instruction Multiple Data (SIMD) instructions, which enable a single processor instruction to process multiple data items at the same time. Intel® Advanced Vector Extensions 512 (Intel® AVX-512), introduced with the Intel® Xeon® Scalable processor, doubles the vector register width to 512 bits compared with previous-generation Intel® Xeon® processors. The wider registers can dramatically increase throughput for applications that can be vectorized.

For example, using 512-bit vectors with up to two 512-bit fused multiply-add (FMA) units, applications can execute 32 double-precision or 64 single-precision floating-point operations per clock cycle, twice the number possible with Intel® Advanced Vector Extensions 2 (Intel® AVX2)7.
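
To make that concrete, here is a minimal C sketch of the kind of fused multiply-add loop that AVX-512 accelerates. It is illustrative only: the function name fma_avx512 and the array sizes are assumptions, and it presumes a CPU with AVX-512F and a compiler flag such as -mavx512f.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Multiply-accumulate over float arrays using 512-bit vectors (AVX-512F). */
    static void fma_avx512(const float *a, const float *b, float *c, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {            /* 16 single-precision lanes per 512-bit vector */
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            __m512 vc = _mm512_loadu_ps(c + i);
            vc = _mm512_fmadd_ps(va, vb, vc);     /* c = a * b + c in a single FMA instruction */
            _mm512_storeu_ps(c + i, vc);
        }
        for (; i < n; ++i)                        /* scalar tail for leftover elements */
            c[i] += a[i] * b[i];
    }

    int main(void) {
        float a[32], b[32], c[32];
        for (int i = 0; i < 32; ++i) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }
        fma_avx512(a, b, c, 32);
        printf("c[31] = %.1f\n", c[31]);          /* expect 31 * 2 + 1 = 63.0 */
        return 0;
    }

A vectorizing compiler can often generate similar code automatically from the scalar loop; the intrinsics simply make the data-parallel work explicit.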

Not all applications benefit from Intel AVX-512, but for those that do, the performance gains can be significant.

Deep learning applications are among those that can benefit, because Intel AVX-512 lets them process more data at the same time with a single instruction.

The Genomics team at Google Brain has been using Intel AVX-512 to improve the performance of its genome-sequencing application, DeepVariant, an open-source tool built on top of TensorFlow.

The most computationally intensive stage in the process, known as call_variants, compares an individual’s genome with a reference genome for use in medical diagnosis, treatment, or drug research. Using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and Intel AVX-512, the team cut the runtime for call_variants from over 14 hours to 3 hours 25 minutes, and it notes a three-fold reduction in cost as well8.

Intel MKL-DNN is an open-source library for enhancing the performance of deep learning frameworks on Intel® architecture. It provides building blocks to take advantage of Intel AVX-512 and multithreading to accelerate convolutional neural networks (CNNs).

Figure: Accelerating deep learning for genome analysis using Intel® MKL-DNN.

Accelerating Deep Learning Inference

The 2nd Generation Intel Xeon Scalable processor introduced Intel® Deep Learning Boost (Intel® DL Boost), a new Vector Neural Network Instruction (VNNI) that accelerates deep learning inference. It is a new Intel AVX-512 instruction for the fused multiply-add operations used heavily in the matrix manipulations of deep learning inference, and it can help speed up applications including image classification, speech recognition, language translation, and object detection.

Previously, three separate instructions were required to carry out this multiply-accumulate work. The new instruction in the 2nd Generation Intel® Xeon® Scalable processor combines them into a single instruction, saving clock cycles on the processor.
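
To illustrate how this looks to a programmer, here is a hedged C sketch using the AVX512-VNNI intrinsic _mm512_dpbusd_epi32, which maps to the VPDPBUSD instruction. It assumes a CPU with AVX512-VNNI support and compilation with -mavx512f -mavx512vnni; the data values are illustrative.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* INT8 dot-product accumulation with AVX512-VNNI (Intel DL Boost).
       One VPDPBUSD multiplies unsigned bytes from va with signed bytes from vb,
       sums each group of four products, and accumulates into 32-bit lanes;
       previously this work required VPMADDUBSW, VPMADDWD, and VPADDD. */
    int main(void) {
        uint8_t a[64];
        int8_t  b[64];
        for (int i = 0; i < 64; ++i) { a[i] = 1; b[i] = 2; }

        __m512i va  = _mm512_loadu_si512(a);
        __m512i vb  = _mm512_loadu_si512(b);
        __m512i acc = _mm512_setzero_si512();

        acc = _mm512_dpbusd_epi32(acc, va, vb);  /* 64 INT8 multiplies plus accumulation in one instruction */

        int32_t out[16];
        _mm512_storeu_si512(out, acc);
        printf("lane 0 = %d\n", out[0]);         /* four products of 1 * 2 -> 8 */
        return 0;
    }

In practice, deep learning libraries such as Intel MKL-DNN issue these instructions on the programmer’s behalf, so most applications benefit without hand-written intrinsics.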

Intel DL Boost enables up to a 30X improvement in deep learning throughput9 and is available now for Google Cloud customers to use in their own applications.

Performance can be further enhanced by using lower precision data, based on 8-bit integers (INT8) instead of 32-bit floating-point (FP32) numbers. Research by Intel found that using INT8 and Intel DL Boost technology together on the Wide & Deep Recommender System improved performance by 200 percent, with a minimal loss of accuracy (less than 0.5 percent), compared to FP32 precision10.
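
As a rough illustration of the INT8 path, the sketch below shows symmetric per-tensor quantization, mapping FP32 values onto INT8 with a single scale factor. It is a minimal example for clarity and is not the calibration scheme used in the cited study; production frameworks typically calibrate scales per layer or per channel.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Symmetric per-tensor quantization: derive one scale from the largest
       absolute value and map each FP32 value onto the INT8 range [-127, 127]. */
    static float quantize(const float *x, int8_t *q, int n) {
        float max_abs = 0.0f;
        for (int i = 0; i < n; ++i)
            if (fabsf(x[i]) > max_abs) max_abs = fabsf(x[i]);
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;  /* FP32 units per INT8 step */
        for (int i = 0; i < n; ++i)
            q[i] = (int8_t)lrintf(x[i] / scale);
        return scale;                             /* keep the scale to dequantize results later */
    }

    int main(void) {
        float  x[4] = { -1.0f, -0.5f, 0.25f, 1.0f };
        int8_t q[4];
        float  scale = quantize(x, q, 4);
        printf("scale = %f, q = [%d %d %d %d]\n", scale, q[0], q[1], q[2], q[3]);
        return 0;
    }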

Google Compute Engine will be enabling N2 instance customers to automatically upgrade to the 3rd Generation Intel® Xeon® Scalable processor, with previews coming later this year. 3rd Gen Intel Xeon Scalable processors deliver 1.5 times more performance than other CPUs across 20 popular machine and deep learning workloads11. For more information, see the 3rd Generation Intel Xeon Scalable fact sheet.

Introducing Intel® Advanced Matrix Extensions (Intel® AMX)

The next-generation Intel Xeon Scalable processors, codenamed Sapphire Rapids, will continue Intel’s strategy of providing built-in AI acceleration on Intel Xeon processors with a new accelerator called Intel® Advanced Matrix Extensions (Intel® AMX).

Intel AMX introduces a new programming paradigm, based on two-dimensional registers called tiles. An accelerator, called TMUL (short for tile matrix multiply unit), carries out operations on the tiles. TMUL is a grid of fused multiply-add units that can read and write tiles. The matrix multiplications in the TMUL instruction set compute:

C[M][N] += A[M][K] * B[K][N]

Each tile has a maximum size of 16 rows of 64 bytes (a total of 1 KB). Programmers can configure a smaller size for each tile if it better fits their algorithm. Data is loaded into tiles from memory using the traditional Intel architecture register set as pointers.
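
In plain C, the work that a TMUL matrix multiply performs over a pair of source tiles and an accumulator tile is equivalent to the reference loop below. The INT8/INT32 data types and the dimensions are illustrative; the hardware operates on whole tiles in a single instruction.

    #include <stdint.h>

    /* Reference semantics of C[M][N] += A[M][K] * B[K][N] with INT8 inputs
       and INT32 accumulation, written as plain scalar loops. */
    void tile_matmul_ref(int32_t *C, const int8_t *A, const int8_t *B,
                         int M, int N, int K) {
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n)
                for (int k = 0; k < K; ++k)
                    C[m * N + n] += (int32_t)A[m * K + k] * (int32_t)B[k * N + n];
    }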

The new instructions include:

  • TDPBF16PS, which performs a set of SIMD dot-products on pairs of Bfloat16 (BF16) elements, accumulating the results into single-precision values.
  • TDPBSSD, TDPBSUD, TDPBUSD, and TDPBUUD, which multiply byte elements from two different tiles in the four signedness combinations (signed * signed, signed * unsigned, unsigned * signed, unsigned * unsigned) and accumulate the results into 32-bit values (see the sketch below).
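
As a preview of how these instructions are expected to surface to software, the sketch below uses the AMX intrinsics declared in immintrin.h (compiled with -mamx-tile and -mamx-int8). It is an illustrative, assumption-laden example rather than production code: the tile sizes, the pre-packed layout of B, and the function name are placeholders, and on Linux the kernel must additionally grant AMX tile-state permission (via arch_prctl), which is omitted here.

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    /* 64-byte tile configuration consumed by _tile_loadconfig (LDTILECFG). */
    typedef struct {
        uint8_t  palette_id;   /* 1 selects the standard tile palette */
        uint8_t  start_row;
        uint8_t  reserved[14];
        uint16_t colsb[16];    /* bytes per row for each tile register */
        uint8_t  rows[16];     /* rows for each tile register */
    } tile_config_t;

    /* One INT8 tile multiply: C (16x16 INT32) += A (16x64 INT8) * B (pre-packed). */
    void amx_int8_tile_matmul(int32_t *C, const int8_t *A, const int8_t *B) {
        tile_config_t cfg;
        memset(&cfg, 0, sizeof(cfg));
        cfg.palette_id = 1;
        cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tile 0: accumulator C */
        cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tile 1: A */
        cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tile 2: B, pre-packed for the dot-product layout */
        _tile_loadconfig(&cfg);

        _tile_loadd(1, A, 64);                 /* load A with a 64-byte row stride */
        _tile_loadd(2, B, 64);                 /* load B */
        _tile_zero(0);                         /* clear the accumulator tile */
        _tile_dpbssd(0, 1, 2);                 /* TDPBSSD: signed INT8 dot-products, INT32 accumulate */
        _tile_stored(0, C, 64);                /* write the INT32 results back to memory */

        _tile_release();                       /* hand the tile state back to the OS */
    }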

To find out more about the upcoming instructions, see Chapter 3 of the Intel® Architecture Instruction Set Extensions Programming Reference.

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Product and Performance Information

1“The new Compute-Optimized VMs offer a greater than 40% performance improvement compared to current GCP VMs.” Bart Sano, VP of Platforms, Google Cloud. Learn more here. (source: https://cloud.google.com/intel).
2https://google.github.io/deepvariant/posts/2019-04-30-the-power-of-building-on-an-accelerating-platform-how-deepVariant-uses-intels-avx-512-optimizations/
3“The new Compute-Optimized VMs offer a greater than 40% performance improvement compared to current GCP VMs.” Bart Sano, VP of Platforms, Google Cloud. Learn more here. (source: https://cloud.google.com/intel).
4From Google Cloud’s CPU platforms page: Learn more here. C2 instance all-core turbo CPU frequency is 3.8 GHz, with the lowest first gen N1 machine type base frequency being 2.0 GHz. Note, all-core turbo not available for first gen N1 instances. Improvement from 2.0GHz to 3.8GHz = 90%. (source: https://cloud.google.com/intel).
6“The new Compute-Optimized VMs offer a greater than 40% performance improvement compared to current GCP VMs.” Bart Sano, VP of Platforms, Google Cloud. Learn more here. From Google Cloud’s CPU platforms page: Learn more here. C2 instance all-core turbo CPU frequency is 3.8 GHz, with the lowest first-generation N1 machine type base frequency being 2.0 GHz. Note, all-core turbo not available for first gen N1 instances. Improvement from 2.0GHz to 3.8GHz = 90%. Also see https://cloud.google.com/blog/products/compute/expanding-virtual-machine-types-to-drive-performance-and-efficiency.
7Intel® AVX 2.0 delivers 16 double precision and 32 single precision floating point operations per clock cycle within the 256-bit vectors, with up to two 256-bit fused-multiply add (FMA) units. https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html.
9Up to 30X AI performance with Intel® DL Boost compared to Intel® Xeon® Platinum 8180 processor (July 2017). Tested by Intel as of 2/26/2019. Platform: Dragon rock two-socket Intel® Xeon® Platinum 9282(56 cores per socket), HT ON, turbo ON, Total Memory 768 GB (24 slots/ 32 GB/ 2933 MHz), BIOS:SE5C620.86B.0D.01.0241.112020180249, CentOS 7 Kernel 3.10.0-957.5.1.el7. x86_64, Deep Learning Framework: Intel® Optimization for Caffe version: https://github.com/intel/caffe d554cbf1, ICC 2019.2.187, MKL DNN version: v0.17 (commit hash: 830a10059a018cd2634d94195140cf2d8790a75a), model: https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/resnet50_int8_full_conv.prototxt, BS=64, No data layer DummyData:3x224x224, 56 instance/2 socket, Datatype: INT8 vs Tested by Intel as of July 11th 2017: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 error-correcting code (ECC) RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. Solid-state drive: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, multi-level cell). Performance measured with: Environment variables: KMP_AFFINITY=’granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (ResNet-50). Intel C++ Compiler version 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.
10See: https://software.intel.com/content/www/us/en/develop/articles/accelerate-int8-inference-performance-for-recommender-systems-with-intel-deep-learning.html. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks. Testing by Intel as of March 1, 2019. Intel® Xeon® Platinum 8280L Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32GB/ 2933 MHz), BIOS: SE5C620.86B.0D.01.0348.011820191451 (ucode:0x5000017), CentOS 7.6, Kernel 4.19.5-1.el7.elrepo.x86_64, solid-state drive 1x INTEL SSDSC2KG96 960GB, Compiler gcc 6.3.1; Deep Learning Framework: TensorFlow on GitHub at tensorflow/tensorflow applying Pull Request PR26169, Pull Request PR26261 and Pull Request PR26271, MKL-DNN version: v0.18, Wide & Deep on GitHub at IntelAI/models, Models: FP32 pretrained model and INT8 pretrained model; MXNet on GitHub at apache/incubator-mxnet applying patch; MKL-DNN on GitHub and intel/mkl-dnn; Wide & Deep on GitHub at intel/optimized-models; Dataset: Criteo Display Advertisement Challenge, Batch Size=512 (with 28 concurrent batches in TensorFlow).
11See (43) at www.intel.com/3gen-xeon-config for details. Results may vary. Source: https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/Ice-Lake-Launch-Product-Fact-Sheet-460649.pdf.