Artificial intelligence is already solving problems in all aspects of our lives, from animation filmmaking and space exploration to fast-food recommendation systems that improve ordering efficiency. These real-world examples of AI systems are just the beginning of what is possible in an AI Everywhere future, and they are already testing the limits of compute power. Tomorrow's AI solutions will require optimization up and down the stack, from hardware to software, including the tools and frameworks used to implement end-to-end AI and data science pipelines.
AI math operations need powerful hardware
A simple example helps illustrate the root of the challenge. In Architecture All Access: Artificial Intelligence Part 1 – Fundamentals, Andres Rodriguez, Intel Fellow and AI Architect, shows how a simple deep neural network (DNN) that identifies handwritten digits in 28×28 black-and-white images requires over 100,000 weight parameters for just the first layer of multiplications.
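A rough back-of-the-envelope count shows where a figure like that comes from; the 128-unit first hidden layer below is an assumed, typical width, not a value taken from the video.

```python
# Rough parameter count for the first fully connected layer of a digit
# classifier. The 128-unit hidden width is an assumption for illustration.
inputs = 28 * 28            # 784 pixel values per black-and-white image
hidden_units = 128          # assumed first-layer width
weights = inputs * hidden_units
biases = hidden_units
print(f"first-layer parameters: {weights + biases:,}")  # 100,480
```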
Today’s AI solutions process image frames of 1280×720 and higher, with separate channels for red, green and blue. The neural networks are also more complex, as they must identify and track multiple objects from frame to frame, or extract meaning from different arrangements of words that may or may not affect that meaning. We are already seeing models that surpass a trillion parameters and require multiple weeks to train. Tomorrow’s AI solutions will be even more complex, combining multiple types of models and data.
AI application development is an iterative process, so speeding up compute-intensive tasks can increase a developer’s ability to explore more options or just get their job done more quickly. As Andres explains in the video above, matrix multiplications are often the bulk of the compute load during the training process.
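To see why, consider a minimal NumPy sketch of one forward pass through two fully connected layers; the layer and batch sizes are illustrative assumptions, and nearly all of the arithmetic sits in the two matrix multiplications.

```python
import numpy as np

# Illustrative sizes: a batch of 64 flattened 28x28 images pushed through
# two fully connected layers (784 -> 128 -> 10). All sizes are assumptions.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 784))
w1 = rng.standard_normal((784, 128))
w2 = rng.standard_normal((128, 10))

h = np.maximum(x @ w1, 0.0)   # matrix multiply + ReLU: ~6.4M multiply-adds
y = h @ w2                    # matrix multiply: ~82K multiply-adds
print(y.shape)                # (64, 10)
```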
In Architecture All Access: Artificial Intelligence Part 2 – Hardware, Andres compares the different capabilities of CPUs, GPUs and various specialized architectures. The AI-specific devices, and many new GPUs, have systolic arrays of multiply-accumulates (MACs) that can parallelize the matrix multiplications inherent to the training process.
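A multiply-accumulate is the "multiply two values, add to a running sum" pattern behind every entry of an output matrix. A naive pure-Python sketch makes explicit the structure that a systolic array executes in parallel.

```python
# Naive matrix multiply written as explicit multiply-accumulate (MAC)
# operations. A systolic array performs many of these MACs at once.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]   # one multiply-accumulate
            out[i][j] = acc
    return out

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```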
Size and complexity of AI systems require hardware heterogeneity
As neural networks become more complex and dynamic (for instance, those with a directed acyclic graph (DAG) structure), the opportunities to parallelize these computations shrink. Their irregular memory access patterns also demand low-latency access to memory. CPUs can be a good fit for these requirements due to their flexibility and higher operating frequencies.
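As a hypothetical illustration of that kind of branching structure, the small PyTorch module below has a forward pass that follows a DAG rather than a straight chain; both branches must finish before the merge, which creates synchronization points. The layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class DagBlock(nn.Module):
    """A toy module whose forward pass is a DAG, not a simple chain
    (hypothetical sizes, chosen only for illustration)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Linear(64, 64)
        self.branch_a = nn.Linear(64, 64)
        self.branch_b = nn.Linear(64, 64)
        self.head = nn.Linear(128, 10)

    def forward(self, x):
        s = torch.relu(self.stem(x))
        a = torch.relu(self.branch_a(s))     # two branches share the stem...
        b = torch.relu(self.branch_b(s))
        merged = torch.cat([a, b], dim=-1)   # ...and both must finish before the merge
        return self.head(merged)

print(DagBlock()(torch.randn(8, 64)).shape)  # torch.Size([8, 10])
```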
With increasing network size, even larger amounts of data need to be moved between compute and memory. Given the growth of MACs available in hardware devices, memory bandwidth and the bandwidth between nodes within a server and across servers are becoming the limiting factors for performance.
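A rough arithmetic-intensity estimate (with assumed layer and batch sizes) shows why: at small batch sizes a fully connected layer performs only a couple of operations per byte of weights streamed from memory, so the MAC hardware ends up waiting on memory rather than computing.

```python
# Back-of-the-envelope arithmetic intensity for one fully connected layer.
# All sizes are assumptions chosen for illustration.
batch, n_in, n_out = 1, 4096, 4096
flops = 2 * batch * n_in * n_out       # one multiply + one add per weight
bytes_moved = 4 * n_in * n_out         # FP32 weights streamed from memory
print(f"{flops / bytes_moved:.1f} FLOPs per byte")  # 0.5 at batch size 1
```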
Once a network is trained and ready for deployment as part of an application, a new set of hardware requirements typically arises. AI inference often requires low latency, that is, producing an answer as quickly as possible, whether that means keeping up with city traffic in real time, inspecting parts on a production line, or providing timely fast-food recommendations to reduce wait times. Additional requirements, such as cost, form factor and power profile, tend to be more application-specific. CPUs are often used for deployment because one core can be dedicated to AI inference, leaving the other cores available for the application and other tasks.
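One common way to express that division of labor, sketched here with PyTorch's thread controls and a placeholder model, is to cap the threads used for inference so the remaining cores stay free for the rest of the application.

```python
import torch

# Keep AI inference on a single intra-op thread so the remaining cores
# stay available to the rest of the application (a sketch, not a recipe).
torch.set_num_threads(1)

model = torch.nn.Linear(128, 10).eval()   # placeholder for a trained model
with torch.no_grad():
    prediction = model(torch.randn(1, 128))
print(prediction.shape)                    # torch.Size([1, 10])
```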
The trend is toward more heterogeneous computing, combining general-purpose CPU-like compute with dedicated AI-specific resources. The degree to which these devices are specialized for training versus inference may differ, but they share common features that improve AI processing. The bfloat16 (BF16) data type provides floating-point dynamic range with shorter word lengths, reducing the size of the data and enabling more parallelism. The 8-bit integer (INT8) data type enables further optimization but limits the dynamic range, requiring the AI developer to make some implementation tradeoffs. Dedicated MAC-based systolic arrays parallelize the heavy computation loads associated with training and inference. And high-bandwidth memory provides wide highways to speed data between compute and memory. But these hardware features require software that can take advantage of them.
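The tradeoff is easy to see by inspecting the data types directly. The PyTorch snippet below simply compares per-value storage and round-tripped values for FP32, BF16 and INT8; the quantization scale factor is a made-up example.

```python
import torch

x = torch.tensor([0.1234567, 3.1415926, 100.25], dtype=torch.float32)

bf16 = x.to(torch.bfloat16)        # FP32-like exponent range, fewer mantissa bits
print(bf16.element_size(), bf16.float())          # 2 bytes/value, slightly rounded

scale = 0.05                        # an assumed quantization scale for illustration
int8 = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
print(int8.element_size(), int8.float() * scale)  # 1 byte/value, coarser and range-limited
```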
Unifying and optimizing the software stack
A key focus of this trend toward heterogeneous computing is software. oneAPI is a cross-industry, open, standards-based unified programming model that delivers performance across multiple architectures. The oneAPI initiative encourages community and industry collaboration on the open oneAPI specification and compatible oneAPI implementations across the ecosystem.
We have already outlined how a given AI system may require different types of hardware for training and inference. Even before training, while preparing the dataset and exploring network options, a data scientist is more productive when data extraction, transformation and loading (ETL) tasks respond quickly. Different systems will also have different hardware requirements depending on what type of AI is being developed.
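For example, a pandas-based ETL step like the sketch below (with a made-up in-memory dataset standing in for a real source) is exactly the kind of interactive work where faster response times compound across many iterations.

```python
import pandas as pd

# Extract: a made-up in-memory dataset standing in for a real source.
raw = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "item": ["burger", "fries", "burger", "shake"],
    "price": ["3.99", "1.99", "3.99", "2.49"],
})

# Transform: fix types and aggregate revenue per item.
raw["price"] = raw["price"].astype(float)
summary = raw.groupby("item", as_index=False)["price"].sum()

# Load: write the prepared table for the next stage of the pipeline.
summary.to_csv("item_revenue.csv", index=False)
print(summary)
```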
AI is typically part of a specific application, for instance the filmmaking, space exploration or fast-food recommendation examples listed earlier. This top layer of the stack is developed using middleware and AI frameworks.
There is no shortage of AI tools and frameworks available for creating, training and deploying AI models. Developers choose among them based on the task at hand (for instance, PyTorch*, TensorFlow* or others for deep learning, and XGBoost, scikit-learn* or others for machine learning) and on their experience, preferences or opportunities for code reuse. This layer of the stack also includes the libraries used during application and AI development, such as NumPy, SciPy and pandas. These frameworks and libraries are the engines that automate data science and AI tasks.
But with all the innovation in hardware to accelerate AI, how can a framework or library take advantage of whatever hardware resources it’s running on? This bottom layer of the software stack enables all the software in the upper layers to interact with the specific hardware it’s running on, without having to write hardware-specific code. As Huma Abidi, Senior Director of Artificial Intelligence and Deep Learning at Intel, explains in Architecture All Access: Artificial Intelligence Part 3 – Software, this layer is created by developers who understand the capabilities and instruction sets available on a given device.
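Frameworks surface some of this plumbing. For instance, PyTorch can report whether its oneDNN (mkldnn) backend is available and summarize the optimized libraries it was built against, all without the application author writing any hardware-specific code.

```python
import torch

# The framework, not the application code, decides whether optimized
# oneDNN (mkldnn) kernels can be used on this machine.
print(torch.backends.mkldnn.is_available())

# First few lines of the build/backend summary baked into this install.
print(*torch.__config__.show().splitlines()[:5], sep="\n")
```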
As models become larger and larger, a single device no longer has enough compute or memory resources to efficiently perform all the necessary calculations. Innovations in hardware architectures enable clusters of compute power with high-bandwidth memory that can distribute tasks across the cluster. Software that splits and distributes large workloads is a very active area of innovation for tomorrow’s needs.
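The basic idea behind data-parallel distribution can be sketched without any cluster at all: shard a large workload, let each shard compute independently, then gather the results. The NumPy sketch below uses in-process chunks as stand-ins for separate devices; the sizes are arbitrary.

```python
import numpy as np

# Conceptual sketch: split one large matrix multiplication across
# "devices" (here, just chunks processed independently) and reassemble.
rng = np.random.default_rng(0)
activations = rng.standard_normal((512, 1024))
weights = rng.standard_normal((1024, 1024))

chunks = np.array_split(activations, 4, axis=0)   # shard the batch 4 ways
partials = [chunk @ weights for chunk in chunks]  # each shard computes locally
result = np.concatenate(partials, axis=0)         # gather the results

assert np.allclose(result, activations @ weights)
```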
Co-optimization is essential
Co-optimization of hardware and software in the Intel oneAPI Deep Neural Network Library (oneDNN) has delivered significant performance gains in deep learning applications based on TensorFlow*, PyTorch*, and Apache MXNet*. Similarly, machine learning applications using XGBoost and scikit-learn have benefited from Intel oneAPI Data Analytics Library (oneDAL).
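One concrete way applications pick up those oneDAL optimizations, assuming the Intel Extension for Scikit-learn (scikit-learn-intelex) package is installed, is to patch scikit-learn before importing its estimators. The dataset below is random placeholder data.

```python
# Requires the scikit-learn-intelex package; its availability is an assumption here.
from sklearnex import patch_sklearn
patch_sklearn()                      # route supported estimators through oneDAL

import numpy as np
from sklearn.cluster import KMeans

data = np.random.default_rng(0).standard_normal((10_000, 20))
labels = KMeans(n_clusters=8, n_init=3, random_state=0).fit_predict(data)
print(labels[:10])
```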
This same type of cross-domain collaboration between people — specifically data scientists, software engineers and architects — will be what truly unlocks the potential of AI development. A good example is model compression, which is a technique to reduce the size of a deep learning network for deployment. One model compression technique, quantization to INT8 data type, was mentioned earlier.
A software engineer may appreciate that moving to 8-bit word lengths reduces the size of the weights and of each operation, which allows for more parallel operations if the instruction set supports them. But the loss in dynamic range may require a data scientist to transform the data, for instance by normalizing it to a set range of values. And since the precision of the model decreases, the architect typically weighs how much precision loss is tolerable against the inference speedup, or whether a mixed-precision model is possible. These and other powerful model compression techniques can all be managed with software like Intel® Neural Compressor, but they require collaboration between experts from different disciplines.
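Intel Neural Compressor wraps flows like this behind its own tooling; purely as a generic illustration of the underlying idea, the sketch below uses PyTorch's built-in dynamic quantization to convert a placeholder model's Linear layers to INT8. This is not the Neural Compressor API itself.

```python
import torch

# A placeholder "trained" model; in practice this would be a real network.
model = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10)).eval()

# Post-training dynamic quantization: weights stored as INT8, Linear ops
# run in reduced precision. Shown only to illustrate the idea; Intel
# Neural Compressor provides its own tooling for this kind of flow.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

print(quantized(torch.randn(1, 784)).shape)   # torch.Size([1, 10])
```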
The future requires flexibility and system performance
Tomorrow’s AI systems will be even more application-specific, which means each project will have different needs in terms of the type of data, the type of AI algorithms, how the AI is integrated into the application, and what type of hardware it will be deployed on. Additionally, models will be more complex, for instance combining models that predict the health of utility assets, models of security threats, and active visual identification of potential issues.
The common requirement through all of this is high-performance hardware-plus-software systems that can adapt to a variety of project needs.
Learn More: Intel AI, Software AI Accelerators, Data Science Workstations for the Real World