BLOOM 176B Inference on Intel® Gaudi® 2 Processors

With support for DeepSpeed* inference in the Intel® Gaudi® software 1.8.0 release, you can run inference on large language models (LLMs), including BLOOM 176B.

LLMs have become popular in the field of AI, especially since the introduction of GPT-3* by OpenAI* in 2020. LLMs can perform a wide range of natural language processing tasks, generating text that is often indistinguishable from human-written text. People have been using ChatGPT*, OpenAI's publicly available chat interface to its GPT models, to create AI-generated email, stories, recipes, and even film scripts. The caveat is that the source code and trained checkpoints of these models are not publicly available; users can only interact with them through OpenAI's API and web interface.

BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is an open source initiative to bring an LLM of a scale similar to GPT-3 to the public. It uses a transformer architecture similar to GPT-3's and was trained on 384 GPUs on the Jean Zay supercomputer provided by the French government.

Intel has enabled DeepSpeed inference capabilities to run the 176B parameter BLOOM model on eight Intel® Gaudi® 2 accelerators. You’ll need the HabanaAI fork of DeepSpeed to run this model. Additionally, we use HPU Graphs to reduce time spent on the host.

To get started, you can clone the Model-References repository.

git clone https://github.com/HabanaAI/Model-References 
cd Model-References/PyTorch/nlp/bloom

In this folder you can find modeling_bloom.py, which was adapted from the Hugging Face* version found in Transformers. Additionally, you can find graph_utils.py, which uses the HPU Graph API to optimize the inference graph for HPU. Check the documentation for information about using HPU graphs and the API’s current limitations.
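For reference, HPU Graphs are exposed through Habana's PyTorch bridge. The following is a minimal sketch of wrapping a model for graph capture, using the small bigscience/bloom-560m checkpoint as a stand-in; graph_utils.py handles this for the actual inference script, and the exact API can vary between software releases.

import habana_frameworks.torch.hpu as ht
import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch: load a small causal LM and move it to the HPU device.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")

# Wrap the module so repeated forward passes replay a captured HPU graph,
# reducing host-side launch overhead between decoding steps.
model = ht.wrap_in_hpu_graph(model)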
Next, install the requirements with pip and download the model checkpoints.

$PYTHON -m pip install -r requirements.txt
mkdir checkpoints
$PYTHON utils/fetch_weights.py --weights ./checkpoints --model bigscience/bloom

Note that the checkpoints for this model are about 330 GB, so make sure your instance has enough storage. Alternatively, you can run the model using a smaller parameter set (bloom-3b or bloom-7b1), but expect slightly less comprehensible output in your sentence completion queries than with the full 176B parameter model.
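For example, to try the 7B1 variant, you can point the same tooling at the bigscience/bloom-7b1 checkpoint. The commands below simply mirror the flags used elsewhere in this article and are illustrative; check the repository README for the exact options supported by your release.

$PYTHON utils/fetch_weights.py --weights ./checkpoints --model bigscience/bloom-7b1
$PYTHON ./bloom.py --weights ./checkpoints --model bloom-7b1 --max_length 128 --dtype bf16 "Does he know about phone hacking"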

Once the checkpoints are downloaded, we can run some sentence completion tasks on different-sized versions of the model.

The size of BLOOM 176B requires the DeepSpeed library so that the model fits and runs across a minimum of eight Intel Gaudi 2 accelerators; DeepSpeed inference provides the model parallelism needed for large transformer models. Smaller versions of BLOOM, such as BLOOM 7B1, can run on a single first-gen Intel Gaudi or Intel Gaudi 2 accelerator.

To initialize the model for inference with DeepSpeed, first wrap the model creation in a deepspeed.OnDevice() context, which lets us declare the data type and construct the model with meta tensors so that no real weights are allocated until the checkpoint is loaded.

with deepspeed.OnDevice(dtype=dtype, device='meta'):
   # Build the model on the meta device so no real weights are allocated yet;
   # DeepSpeed materializes and shards the weights later from the checkpoint files.
   model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)

We then call deepspeed.init_inference(). The injection_policy argument tells DeepSpeed which submodule outputs need an all-reduce so that partial results are accumulated across the Intel Gaudi 2 accelerators. For more information about injection policies and DeepSpeed inference, consult the official DeepSpeed documentation.

model = deepspeed.init_inference(
    model,
    mp_size=args.world_size,   # shard the model across all available devices
    dtype=dtype,
    # All-reduce the outputs of these BloomBlock submodules so partial results
    # are accumulated across the accelerators.
    injection_policy={code.BloomBlock: ('mlp.dense_4h_to_h', 'self_attention.dense')},
    args=args,
    enable_cuda_graph=args.use_graphs,   # graph capture/replay when --use_graphs is set
    checkpoint=f.name)                   # JSON file describing the checkpoint shards
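The checkpoint argument points to a small JSON file (written to a temporary file by the script, hence f.name) that lists the downloaded weight shards so DeepSpeed can load them directly onto the devices. Its contents look roughly like the abridged sketch below; the exact paths depend on where fetch_weights.py stored the checkpoints.

{
  "type": "BLOOM",
  "checkpoints": [
    "./checkpoints/pytorch_model_00001-of-00072.bin",
    "./checkpoints/pytorch_model_00002-of-00072.bin",
    ...
  ],
  "version": 1.0
}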

Install the HabanaAI fork of the DeepSpeed library, and then use the command below to call the model with a text prompt. By default, the inference script applies a greedy search on the HPU and ignores end-of-sentence tokens, picking the most probable next token at each step until it reaches the user-specified --max_length. We have found that ignoring end-of-sentence tokens is more performant because it lets the device run continuously without synchronizing with the CPU after each token. Due to the size of the model, it takes a few minutes to load the weights into memory.

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.8.0
deepspeed --num_gpus 8 ./bloom.py --weights ./checkpoints --model bloom --max_length 128 --dtype bf16 "Does he know about phone hacking"
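For intuition, greedy decoding that ignores end-of-sentence tokens amounts to a loop like the following. This is a simplified sketch, not the script's actual implementation, shown with the small bigscience/bloom-560m checkpoint for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
model.eval()

input_ids = tokenizer("Does he know about phone hacking", return_tensors="pt").input_ids
max_length = 32

with torch.no_grad():
    while input_ids.shape[-1] < max_length:
        logits = model(input_ids).logits
        # Greedy search: always take the most probable next token. There is no
        # check for the end-of-sentence token, so generation stops only at max_length.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))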

After the deepspeed command finishes loading the model and running inference, the end of your console output looks like the following:

Init took 141.525s
Starting inference...
------------------------------------------------------
step:0 time:8.433s tokens:122 tps:14.466 hpu_graphs:13
------------------------------------------------------
Q0.0: Does he know about phone hacking
A0.0: Does he know about phone hacking?
- No.
- Good.
- What about the rest of the team?
- No.
Good.
This is going to be a closed shop.
I want you to keep it that way.
Understood?
Tony, I think we should meet.
Yeah, of course.
- Tomorrow?
- Yeah, good.
The House of Commons is expected to vote on the new anti-terrorism bill today.
The legislation has been controversial and some have argued that it infringes on civil liberties.
The government claims the bill is necessary to keep the country safe.
The vote is expected to be close.
Back to

Next Steps

To run the full BLOOM 176B model and other LLMs using DeepSpeed, you can access Intel® Gaudi® 2 accelerators on the Intel® Tiber™ Developer Cloud. To run the smaller BLOOM 7B1 or 3B models, you can use the Amazon* EC2 DL1 instances, which are based on first-gen Intel Gaudi accelerators.

More Resources