Fast Inference on LLMs: BLOOMZ on Intel® Gaudi® 2 Accelerators

Optimize with Intel® Gaudi® AI Accelerators

  • Create new deep learning models or migrate existing code in minutes.

  • Deliver generative AI performance with simplified development and increased productivity.


This article was originally written by Régis Pierrard, Hugging Face* machine learning engineer, and published on Hugging Face.

In this article, learn how to deploy multibillion-parameter language models on Intel® Gaudi® 2 accelerators and get a view into Hugging Face's performance evaluation of Intel Gaudi 2 accelerators and NVIDIA* A100 80GB Tensor Core GPUs on BLOOMZ.

As the benchmark presented in this post demonstrates, this setup enables you to run inference faster than with any GPU currently available on the market. As models get bigger, deploying them into production for inference has become increasingly challenging, and both hardware and software have seen many innovations to address these challenges. Let's dive in to see how to overcome them efficiently.

BLOOMZ

BLOOM is a 176-billion-parameter autoregressive model that was trained to complete sequences of text. It can handle 46 languages and 13 programming languages. Designed and trained as part of the BigScience initiative, BLOOM is an open science project involving a large number of researchers and engineers from all over the world. More recently, another model with the exact same architecture was released: BLOOMZ, a version of BLOOM fine-tuned on several tasks, which leads to better generalization and zero-shot1 capabilities.

Such large models raise new challenges in terms of memory and speed for training and inference. Even in 16-bit precision, one instance requires 352 GB to fit. State-of-the-art hardware, like the Intel Gaudi 2 accelerator, makes it possible to perform inference on BLOOM and BLOOMZ models with low latencies.
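That 352 GB figure follows directly from the parameter count: at 16-bit precision, each parameter takes 2 bytes. A quick sanity check (this counts the weights only, ignoring activations and the key-value cache, which add to the total):

    params = 176e9          # BLOOM/BLOOMZ parameter count
    bytes_per_param = 2     # 16-bit precision (float16/bfloat16)
    print(f"{params * bytes_per_param / 1e9:.0f} GB")  # 352 GB for the weights alone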

Intel Gaudi 2 Accelerators

The Intel Gaudi 2 accelerator is a second-generation AI hardware accelerator designed by Intel. A single server contains 8 accelerator devices with 96 GB of memory each, which provides room for very large models. However, hosting the model is not very interesting if the computation is slow. Fortunately, the Intel Gaudi 2 accelerator differs from GPUs in that its architecture enables the accelerator to perform General Matrix Multiplication (GeMM) and other operations in parallel, which speeds up deep learning workflows. These features make Intel Gaudi 2 accelerators great candidates for large language model (LLM) training and inference.

The SynapseAI* SDK supports PyTorch* and DeepSpeed* for accelerating LLM training and inference. The SynapseAI graph compiler optimizes the execution of the operations accumulated in the graph (for example, operator fusion, data layout management, parallelization, pipelining and memory management, and graph-level optimizations).
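For a feel of how this graph accumulation works, here is a minimal sketch of PyTorch lazy mode on an Intel Gaudi device (assuming a machine with SynapseAI and its PyTorch bridge installed; mark_step is the synchronization point that triggers compilation and execution):

    import torch
    import habana_frameworks.torch.core as htcore  # SynapseAI PyTorch bridge

    x = torch.randn(1024, 1024).to("hpu")
    y = torch.randn(1024, 1024).to("hpu")
    z = torch.matmul(x, y).relu()  # in lazy mode, ops are recorded into a graph, not run eagerly
    htcore.mark_step()             # compile and execute the accumulated graph on the device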

Moreover, support for HPU Graphs and DeepSpeed inference was recently introduced in SynapseAI, and these are well-suited for latency-sensitive applications, as shown in the following benchmark.

All these features are integrated into the Optimum for Intel Gaudi library, which simplifies deploying your model on Intel Gaudi accelerators. For more information, see Quickstart.
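As a minimal sketch of what single-device generation with Optimum for Intel Gaudi (optimum-habana) looks like (illustrative only; the exact API surface varies across releases, and the multidevice benchmark below goes through the library's example scripts instead):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

    # Patch Transformers model classes with Gaudi-optimized implementations.
    adapt_transformers_to_gaudi()

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1")
    model = AutoModelForCausalLM.from_pretrained(
        "bigscience/bloomz-7b1", torch_dtype=torch.bfloat16
    )
    model = model.eval().to("hpu")  # "hpu" is the Intel Gaudi device type in PyTorch

    prompt = "DeepSpeed is a machine learning framework"
    inputs = tokenizer(prompt, return_tensors="pt").to("hpu")
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy decoding
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))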

For access to Intel Gaudi 2 accelerators, visit Intel® Developer Cloud.

Benchmarks

In this section, we provide an early benchmark of BLOOMZ on an Intel Gaudi 2 accelerator, a first-generation Intel Gaudi accelerator, and an NVIDIA A100 80GB. Although these devices have quite a lot of memory, the model is so large that a single device is not enough to contain a single instance of BLOOMZ. To solve this issue, we use DeepSpeed, which is a deep learning optimization library that enables many memory and speed improvements to accelerate the model and make it fit the device. In particular, we rely here on DeepSpeed inference: It introduces several features, such as model (or pipeline) parallelism, to make the most of the available devices. For Intel Gaudi 2 accelerators, we use the DeepSpeed fork, which adds support for Intel Gaudi accelerators.
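To illustrate the core DeepSpeed inference call, here is a hedged sketch using the smaller 7-billion-parameter checkpoint (the mp_size and replace_with_kernel_inject arguments follow the DeepSpeed v0.8-era API used at the time of this benchmark; in practice, this code must be launched once per device by a distributed launcher, as shown in the reproduction steps below):

    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM

    # Each process (one per device) loads the model and hands it to DeepSpeed,
    # which shards the weights across the devices (model parallelism).
    model = AutoModelForCausalLM.from_pretrained(
        "bigscience/bloomz-7b1", torch_dtype=torch.bfloat16
    )
    model = deepspeed.init_inference(
        model,
        mp_size=8,                        # shard the model across 8 devices
        dtype=torch.bfloat16,
        replace_with_kernel_inject=True,  # use optimized inference kernels where available
    )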

Latency

We measured latencies (batch of one sample) for two different sizes of BLOOMZ, both with multibillion parameters: the full 176-billion-parameter checkpoint (BLOOMZ) and the 7-billion-parameter BLOOMZ-7b1.

Runs were performed with DeepSpeed inference in 16-bit precision on 8 devices and with a key-value cache. Note that while CUDA* graphs are not currently compatible with model parallelism in DeepSpeed (as of DeepSpeed v0.8.2), HPU Graphs are supported in the Intel Gaudi fork of DeepSpeed. All benchmarks perform greedy generation of 100-token outputs. The input prompt is:

"DeepSpeed is a machine learning framework"

which consists of 7 tokens with BLOOM's tokenizer.
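You can check the token count yourself with the BLOOM tokenizer:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
    ids = tokenizer("DeepSpeed is a machine learning framework")["input_ids"]
    print(len(ids))  # 7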

The results for inference latency are displayed in the following table.

Model      | Number of Devices | Intel Gaudi 2 Accelerator Latency (Seconds) | NVIDIA A100 80GB Latency (Seconds) | First-Gen Intel Gaudi Accelerator Latency (Seconds)
BLOOMZ     | 8                 | 3.717                                       | 4.402                              | /
BLOOMZ-7b1 | 8                 | 0.737                                       | 2.417                              | 3.029
BLOOMZ-7b1 | 1                 | 1.066                                       | 2.119                              | 2.865

DeepSpeed inference support was recently introduced in SynapseAI 1.8, quickly enabling inference for models with more than 100 billion parameters. For the 176-billion-parameter checkpoint, the Intel Gaudi 2 accelerator is 1.2x faster than the NVIDIA A100 80GB. Smaller checkpoints present interesting results too: an Intel Gaudi 2 accelerator is 3x faster than an NVIDIA A100 80GB for BLOOMZ-7b1. It is also interesting to note that the Intel Gaudi 2 accelerator benefits from model parallelism (it is faster on 8 devices than on 1), whereas the NVIDIA A100 80GB is faster on a single device.

We also ran these models on a first-gen Intel Gaudi accelerator. While it is slower than an Intel Gaudi 2 accelerator, it is interesting from a price perspective, as a DL1 instance on Amazon Web Services (AWS)* costs approximately $13 per hour. Latency for BLOOMZ-7b1 on a first-gen Intel Gaudi accelerator is 2.865 seconds. Thus, for the 7-billion-parameter checkpoint, a first-gen Intel Gaudi accelerator offers a better price-performance ratio than an NVIDIA A100 80GB, which costs more than $30 per hour.
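A rough back-of-the-envelope calculation makes the price-performance point concrete, using the 8-device latencies from the table above and the approximate hourly instance prices just quoted (single-sample, batch-1 inference; real deployments would batch requests, and prices change over time):

    def cost_per_generation(latency_s: float, price_per_hour: float) -> float:
        """Approximate cost of one 100-token, batch-1 generation."""
        return latency_s * price_per_hour / 3600

    # First-gen Intel Gaudi: 8-device BLOOMZ-7b1 latency on a ~$13/hour DL1 instance
    print(f"${cost_per_generation(3.029, 13):.4f}")  # ~$0.0109
    # NVIDIA A100 80GB: 8-device BLOOMZ-7b1 latency on a >$30/hour instance
    print(f"${cost_per_generation(2.417, 30):.4f}")  # ~$0.0201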

We expect to optimize the performance of these models in upcoming SynapseAI releases. For example, in our last benchmark, we saw that an Intel Gaudi 2 accelerator performs Stable Diffusion* inference 2.2x faster than an NVIDIA A100 80GB, a figure that has since improved to 2.37x with the latest optimizations. With the prerelease version of SynapseAI 1.9, we saw the inference latency for the 176-billion-parameter BLOOMZ model improve to 3.5 seconds. We will update these numbers as new versions of SynapseAI are released and integrated within Optimum for Intel Gaudi.

Running Inference on a Complete Dataset

The script we wrote (see How to Reproduce These Results) enables you to use your model to complete sentences across a whole dataset. This is useful for trying BLOOMZ inference on Intel Gaudi 2 accelerators on your own data.

Following is an example with the tldr_news dataset. It contains both the headline and content of several articles (you can visualize it on the Hugging Face Hub).

Input: 'Businesses Will Not Be Able to Hide': Spy Satellites May Give Edge From Above ;
Output: 'Businesses Will Not Be Able to Hide': Spy Satellites to Track Illegal Logging in Indonesia
The Indonesian government has announced that it will use spy satellites to track illegal logging in the country.

Input: Alphabet and SoftBank’s solar-powered drone provides first LTE connection ;
Output: Alphabet and SoftBank’s solar-powered drone provides first LTE connection ⯑⯑⯑
⯑⯑⯑The drone, which is about the size of a small car, is equipped with a solar panel and can fly for up

Input: SQLFlow (GitHub Repo);
Output: SQLFlow (GitHub Repo):
https://github.com/davidfowl/SQLFlow
The code is written in Java and is available on GitHub. It is a simple, lightweight

Input: Secret to keeping ice cream creamy (not crunchy);
Output: Secret to keeping ice cream creamy (not crunchy) is to freeze it in a container that is airtight.

Input: Tesla's giant battery saved $40 million during its first year, report says ;
Output: Tesla's giant battery saved $40 million during its first year, report says
Tesla's giant battery saved $40 million during its first year, report says
Tesla's giant battery saved $40 million during its first year,

Input: Python 3.9: Cool New Features for You to Try (28 minute read);
Output: Python 3.9: Cool New Features for You to Try (28 minute read): This is a great article for those

Input: A company aims to power the world for millions of years by digging the deepest holes ever ;
Output: A company aims to power the world for millions of years by digging the deepest hole ever made in

Input: In Nevada desert, a technology firm aims to be a government ;
Output: In Nevada desert, a technology firm aims to be a government Introduction
The use of the Internet has become a common practice in the daily life of people. The Internet has become

In the next section, we explain how to use the script we wrote to perform this benchmark or to apply it to any dataset you like from the Hugging Face Hub.

How to Reproduce These Results

  1. For the script used for benchmarking BLOOMZ on Intel Gaudi 2 accelerators and first-gen Intel Gaudi accelerators, see the Optimum for Intel Gaudi repository on GitHub*.
     
  2. Make sure that the latest versions of SynapseAI and the Intel Gaudi accelerator drivers are installed by following the Intel Gaudi Accelerator Installation Guide.
     
  3. Run the following:
    git clone https://github.com/huggingface/optimum-habana.git
    
    cd optimum-habana && pip install . && cd examples/text-generation
    
    pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.8.0

     

  4. Launch the script as follows (this runs the 176-billion-parameter BLOOMZ checkpoint on 8 devices with DeepSpeed inference, HPU Graphs, and a key-value cache, matching the benchmark settings above):
    python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path bigscience/bloomz --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100

     

For multinode inference, follow the Multinode Training Usage Guide.

You can also load any dataset from the Hugging Face Hub to get prompts for generation by using the argument --dataset_name my_dataset_name (for example, --dataset_name JulesBelveze/tldr_news loads the dataset shown in the previous section).

This benchmark was performed with Transformers* v4.27.1, SynapseAI v1.8.0, and the Optimum for Intel Gaudi library installed as described above.

For the inference script used in the examples shown in Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate, refer to the transformers-bloom-inference GitHub repository.

CUDA graphs require static shapes, which are not supported in Transformers. For an implementation that enables static shapes, see the Model-References GitHub repository.
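To make "static shapes" concrete, the sketch below pads every prompt to the same fixed length so that input tensor shapes never change between iterations, which is what graph capture requires (the max_length value is an arbitrary illustrative choice, not a setting from the benchmark):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
    # Padding every prompt to the same fixed length keeps input tensor shapes
    # constant across iterations; graph capture replays a fixed graph, so it
    # cannot handle shape changes.
    inputs = tokenizer(
        "DeepSpeed is a machine learning framework",
        return_tensors="pt",
        padding="max_length",
        max_length=32,
    )
    print(inputs["input_ids"].shape)  # torch.Size([1, 32]) regardless of prompt length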

Conclusion

This article shows that an Intel Gaudi 2 accelerator performs BLOOMZ inference faster than an NVIDIA A100 80GB. The Optimum for Intel Gaudi library provides easy-to-use tools that eliminate the need to write a complicated script to run inference on multibillion-parameter models on Intel Gaudi accelerators. Future releases of the SynapseAI SDK are expected to improve performance, so we will update this benchmark regularly as LLM inference optimizations in SynapseAI continue to advance. We are also looking forward to the performance benefits that will come with FP8 inference on Intel Gaudi 2 accelerators.

We also presented the results achieved with a first-generation Intel Gaudi accelerator. For smaller models, it can perform as well as or even better than an NVIDIA A100 80GB at almost a third of the price. It is a good alternative to GPUs for running inference on large models such as BLOOMZ.

Additional Resources

Phillip Howard and Anahita Bhiwandiwalla, research scientists with the Intel Labs Cognitive AI team, put Intel Gaudi 2 accelerators and BLOOMZ to the test in this brief video. Watch to see how you can easily put LLMs like BLOOMZ to work on Intel Gaudi 2 accelerators for your organization.

1 "Zero-shot" refers to the ability of a model to complete a task on new or unseen input data (that is, without having been provided any training examples of this kind of data). We provide the model with a prompt and a sequence of text that describes what we want our model to do, in natural language. Zero-shot classification excludes any examples of the desired task being completed. This differs from single or few-shot classification, as these tasks include a single or a few examples of the selected task.

 

You Might Also Like

Memory-Efficient Training on Intel Gaudi Accelerators with DeepSpeed

Fine-Tuning GPT2* with Hugging Face and Intel Gaudi Accelerators

Intel Gaudi Software Version 1.7.0