Authors: Lisha Guo, Junwei Fu, Ningxin Hu, Mingming Xu, Intel
In recent years, deep learning (DL) has become increasingly important and is now widely used in many applications. Hardware vendors, including Intel, are actively optimizing the performance of DL workloads, not only by extending the capabilities of CPUs and GPUs but also by introducing new specialized accelerators. Edge devices have limited memory, computing resources, and power, so these dedicated accelerators help both to improve performance and to reduce power consumption.
Vector Neural Network Instructions (VNNI) [1] is an x86 instruction set extension, part of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA, designed to accelerate convolutional neural networks for INT8 inference. INT8 inference offers better performance than higher-precision types because more data can be packed into a single instruction, at the cost of reduced but acceptable accuracy. For power saving, the Intel® Gaussian & Neural Accelerator (Intel® GNA) [2], available on ICL+ platforms, is designed to offload continuous inference workloads such as noise suppression and speech recognition, freeing CPU resources.
However, applications running on the Web Platform are still disconnected from these hardware improvements. With that in mind, we are proposing a new domain-specific Web Neural Network (WebNN) API that gives web applications access to hardware acceleration for machine learning. By taking advantage of these new hardware features, WebNN can provide access to purpose-built machine learning hardware and close the gap between the web and native.
WebNN: Efficient Access to DL Capabilities from the Web Platform
Architecture of WebNN
The primitives of WebNN can be mapped to the native machine learning APIs available on different operating systems, such as the Android Neural Networks API*, DirectML on Windows*, and Metal Performance Shaders on macOS* and iOS, as well as to OpenVINO and the Intel® oneAPI Deep Neural Network Library (oneDNN). These native APIs in turn talk to compilers and drivers to run the primitives on various machine learning hardware. Before compiling the constructed graph, the user can select the most suitable underlying native ML backend for acceleration. Deep learning frameworks and libraries that are not built into the OS, such as OpenVINO and oneDNN, must be compiled and installed in advance; otherwise, WebNN throws an error when the user tries to use the corresponding framework for acceleration.
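The select-then-compile flow described above can be sketched as follows. This is only an illustrative model in Python, not the WebNN API: the names `Graph`, `compile_graph`, `BackendUnavailableError`, and the backend strings are all hypothetical, and the set of available backends is passed in by hand rather than detected.

```python
from dataclasses import dataclass, field


class BackendUnavailableError(RuntimeError):
    """Raised when the requested native backend is not present (hypothetical)."""


@dataclass
class Graph:
    """A toy stand-in for a constructed graph: just an ordered list of op names."""
    ops: list = field(default_factory=list)

    def add(self, op: str) -> "Graph":
        self.ops.append(op)
        return self


def compile_graph(graph: Graph, backend: str, available: set) -> dict:
    """Bind a constructed graph to one native backend before execution.

    `available` models the backends present on this machine: OS built-ins
    plus any library (for example OpenVINO or oneDNN) installed in advance.
    """
    if backend not in available:
        raise BackendUnavailableError(f"backend '{backend}' is not installed")
    # A real implementation would hand the ops to the native API here.
    return {"backend": backend, "ops": tuple(graph.ops)}


graph = Graph().add("conv2d").add("relu")
compiled = compile_graph(graph, "directml", available={"directml"})
```

Requesting a backend that was not installed in advance, for example `compile_graph(graph, "openvino", available={"directml"})`, raises `BackendUnavailableError` in this sketch, which mirrors the error behavior described in the paragraph.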
- The Android* Neural Networks API (NNAPI) [8] is an Android C API designed for running computationally intensive machine learning operations on Android devices. NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks, such as TensorFlow Lite and Caffe2, that build and train neural networks.
- DirectML [9] is a high-performance, hardware-accelerated DirectX* 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers.
- The oneAPI Deep Neural Network Library (oneDNN), an important part of Intel oneAPI, is intended for deep learning application and framework developers interested in improving application performance on Intel® CPUs (Central Processing Units) and GPUs (Graphics Processing Units). Although oneDNN supports only a small fraction of the operations defined for neural networks, they cover most of the computation-heavy operations that are critical for inference time optimization. In addition, oneDNN implements matrix-multiplication-like operations with u8 and s8 operands on VNNI, which fuses the VPMADDUBSW, VPMADDWD, VPADDD instruction sequence into a single instruction (VPDPBUSD). The figure below illustrates how to implement int8 convolution using VNNI.
- The OpenVINO toolkit for Linux* and Windows is a comprehensive toolkit for developing and deploying deep-learning-based solutions on Intel® platforms for native applications. The Inference Engine [10] in the Intel Distribution of OpenVINO toolkit is a set of C++ libraries that provide a common API to access hardware including CPU, GPU, VPU, and GNA. The Inference Engine therefore needs to be implemented as a backend in Chromium, using the nGraph API to create the neural network, set input and output formats, and execute the model on GNA. For C++ compatibility, we wrapped the network-building flow in a C API similar to the Android NN API and generated a new binary that uses the official Intel Distribution of OpenVINO toolkit libraries.
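The u8/s8 instruction sequence named in the oneDNN item above can be emulated for a single 32-bit lane. The Python sketch below models the arithmetic only; it is not oneDNN code, the function names are hypothetical, and using VPMADDWD against a vector of ones to widen the 16-bit sums is an assumption about how the sequence is typically applied. It also models VPDPBUSD, the instruction AVX-512 VNNI adds, which performs the same multiply-accumulate in one step without the intermediate 16-bit saturation.

```python
def sat16(x: int) -> int:
    """Saturate to the signed 16-bit range, as VPMADDUBSW does."""
    return max(-32768, min(32767, x))


def wrap32(x: int) -> int:
    """Wrap to the signed 32-bit range, as a plain 32-bit add does."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def lane_three_instructions(acc: int, a_u8: list, w_s8: list) -> int:
    """One 32-bit lane of VPMADDUBSW -> VPMADDWD (with ones) -> VPADDD."""
    # VPMADDUBSW: u8 * s8 products, adjacent pairs summed and saturated to int16.
    p0 = sat16(a_u8[0] * w_s8[0] + a_u8[1] * w_s8[1])
    p1 = sat16(a_u8[2] * w_s8[2] + a_u8[3] * w_s8[3])
    # VPMADDWD against ones: widen the two int16 values and add them as int32.
    s = p0 * 1 + p1 * 1
    # VPADDD: add into the int32 accumulator.
    return wrap32(acc + s)


def lane_vpdpbusd(acc: int, a_u8: list, w_s8: list) -> int:
    """One 32-bit lane of the fused VNNI instruction VPDPBUSD."""
    return wrap32(acc + sum(a * w for a, w in zip(a_u8, w_s8)))


activations = [1, 2, 3, 4]      # unsigned 8-bit inputs
weights = [5, -6, 7, -8]        # signed 8-bit weights
print(lane_three_instructions(100, activations, weights))  # 82
print(lane_vpdpbusd(100, activations, weights))            # 82
```

For typical values the two paths agree. They differ only when a pair of products overflows 16 bits: with activations `[255, 255, 0, 0]` and weights `[127, 127, 0, 0]`, the three-instruction path in this model saturates to 32767, while the fused path returns 64770.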
We adopted a use-case-driven methodology to ensure that the results were meaningful and relevant to the Web Community. We first collected DL use cases in the browser. These heavily favored two areas: computer vision, such as image classification, and text/natural language processing, such as speech recognition.
For the VNNI case, we selected MobileNet* V1 as the test model and completed the quantization process using PyTorch*. With WebNN on the CPU, the INT8 model achieves a 2.56x speedup over the fp32 model. Moreover, INT8 WebNN inference is 20x faster than WebAssembly (fp32) optimized with SIMD and multithreading on the CPU, and 16x faster than WebGL (fp32). With WebNN, the performance gap between the web and the native Intel Distribution of OpenVINO toolkit becomes much smaller. On machines with a VNNI accelerator from Intel, web applications can achieve much more efficient AI computation by accessing Intel® hardware accelerators.
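To make the quantization step concrete, here is a minimal sketch of affine INT8 quantization in plain Python. It is not the PyTorch pipeline used for the measurements above; the helper names and the per-tensor min/max calibration are assumptions chosen for illustration.

```python
def quantize_params(xmin: float, xmax: float, qmin: int = -128, qmax: int = 127):
    """Derive a scale and zero point that map [xmin, xmax] onto [qmin, qmax]."""
    # Make sure 0.0 is inside the range so that it is exactly representable.
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)
    scale = (xmax - xmin) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - xmin / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quantize_params(min(weights), max(weights))
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value now occupies one byte instead of the four bytes of an fp32 value, which is what lets a single vector instruction process more elements. The price is a rounding error on the order of `scale`, the "reduced but acceptable accuracy" trade-off noted earlier.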
GNA Power and Performance
For speech recognition, we completed acoustic model inference based on Kaldi* neural networks and speech feature vectors.
Summary and Standardization Work
Based on these solid results, we launched the “Machine Learning for the Web” Community Group (WebML CG) [7] in the World Wide Web Consortium (W3C) and gained strong industry support, including from major browser vendors. The W3C* Web Machine Learning Working Group (WG) will also be launched in the first half of 2021, formalizing a hardware-accelerated Web Neural Network API optimized for Intel AI use cases per Intel®’s XPU vision.
1. VNNI for Intel AVX-512
3. Intel Distribution of OpenVINO toolkit
7. W3C Machine Learning for the Web Community Group
8. The Android Neural Networks API (NNAPI)
10. Inference Engine