Executive Summary
3rd Gen Intel® Xeon® Scalable Processors are based on Intel’s 10nm+ process technology. They offer more CPU cores, higher memory capacity, and higher frequency than the previous generation. Technologists from Alibaba Group and Intel worked together to explore what these capabilities mean for AI applications, particularly when used with Intel® Deep Learning Boost (Intel® DL Boost). We also explored the Intel® Neural Compressor (formerly known as Intel® Low Precision Optimization Tool), which helps customers rapidly develop and deploy their AI INT8 models on Intel Xeon Scalable processor-based platforms. We optimized the Alibaba Transformer model on 3rd Gen Intel Xeon Scalable Processors and demonstrated 1.36x and 1.42x¹ performance improvements in FP32 and INT8 inference, respectively, over Intel’s previous generation processors.
Transformer is a key model used in the Alibaba Machine Learning Platform for AI (PAI). It is widely used in real-world natural language processing (NLP) tasks, serving millions of users through Alibaba’s online service. Low latency and high throughput are key to Transformer’s success, and 8-bit low precision is a promising technique to meet such requirements.
Intel DL Boost offers powerful capabilities for 8-bit low precision inference on AI workloads. With the support of Intel Neural Compressor, we can optimize 8-bit inference performance while significantly reducing accuracy loss. These capabilities demonstrate leadership in AI inference and show the power of Intel DL Boost and 3rd Gen Intel Xeon Scalable Processors.
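To make the idea of 8-bit low precision concrete, the following is a minimal sketch of symmetric per-tensor INT8 quantization written with NumPy. It is an illustration only: it does not use Intel Neural Compressor or Intel DL Boost, the function names `quantize_int8` and `dequantize` are hypothetical, and the random tensor merely stands in for model weights. It shows how FP32 values map to 8-bit integers through a single scale factor, and where the accuracy loss mentioned above comes from.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor quantization of FP32 values to INT8.

    Hypothetical helper for illustration; not part of any Intel tool.
    """
    max_abs = float(np.max(np.abs(x)))
    # One scale for the whole tensor; guard against an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map INT8 values back to approximate FP32 values."""
    return q.astype(np.float32) * scale


# Random values standing in for a small weight matrix (assumption).
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 8)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the error by about scale / 2.
max_error = float(np.max(np.abs(weights - restored)))
print(f"scale={scale:.6f}, max abs error={max_error:.6f}")
```

Each INT8 value occupies a quarter of the storage of an FP32 value, which is the source of the throughput benefit; the price is a rounding error of roughly half a quantization step per element. Calibration and tuning tools aim to keep the effect of that error on model accuracy within an acceptable bound.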