Access several optimized LLM inference reference models (including GPT-2*, GPTNeoX, GPT-J, OPT, and LLaMA) from Hugging Face* and Habana Labs for Intel® Gaudi® AI accelerators and Intel Gaudi 2 AI accelerators with the Optimum-Habana library.
Developers can now run these models as well as other LLMs using a repository that Hugging Face hosts. To access Habana hardware, you can either:
- Go to the Intel® Developer Cloud, and use the Gaudi2 instance
- Go to Amazon EC2* and use a DL1 instance, which is based on first-generation Gaudi accelerators
This guide shows:
- How GPTNeoX was enabled to run on Intel Gaudi AI accelerators
- How to run text generation with the GPTNeoX model on a single Intel Gaudi AI accelerator, as well as on eight devices using the lightweight DeepSpeed framework
To run any of these models, enter the model's name in your Python* command:
--model_name_or_path
For example, to use GPT-J, enter the following:
--model_name_or_path EleutherAI/gpt-j-6b
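For instance, a complete single-device invocation for GPT-J might look like the following sketch, which uses the run_generation.py script and the arguments introduced later in this guide (the prompt is only a placeholder):
python run_generation.py \
--model_name_or_path EleutherAI/gpt-j-6b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--bf16 \
--prompt 'Once upon a time'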
Two optimizations improve inference performance for these models. The models listed above already include these modifications and are ready to use as part of the Optimum-Habana library.
Note: To get better inference performance from other LLMs, we recommend making the changes needed to apply these same optimizations.
The first optimization fixes the input shapes so that they remain static during new token generation, which prevents unnecessary graph recompilations:
- Pad the input tokens and the self-attention mask to the maximum token length during each call to the generate function, as shown in the following code block.
- To keep track of the current token being generated in the output sequence, introduce a new variable, token_idx. During padding, this variable tracks the position of the latest token in the padded sequence.
For more information, see the list of current models optimized with static shapes.
if generation_config.static_shapes:
    # token_idx is the current index in the generation process; it is incremented each time a new token is generated
    model_kwargs["token_idx"] = torch.tensor(inputs_tensor.shape[-1], device=inputs_tensor.device)
    # Pad inputs to have static shapes during generation; this gives better performance than dynamic shapes on HPUs
    inputs_tensor = torch.nn.functional.pad(
        inputs_tensor, (0, generation_config.max_new_tokens), value=generation_config.pad_token_id
    )
    if model_kwargs["attention_mask"] is not None:
        model_kwargs["attention_mask"] = torch.nn.functional.pad(
            model_kwargs["attention_mask"], (0, generation_config.max_new_tokens), value=0
        )
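To see why this helps, here is a minimal, self-contained sketch (not Optimum-Habana code; the token IDs and the generation loop are stand-ins) showing that once the input is padded to a fixed length, its shape stays constant for the whole generation loop, so the device graph only needs to be compiled once:
import torch

pad_token_id = 0
max_new_tokens = 8
prompt_ids = torch.tensor([[101, 2023, 2003]])  # hypothetical prompt token IDs
token_idx = prompt_ids.shape[-1]  # position where the next generated token is written
input_ids = torch.nn.functional.pad(prompt_ids, (0, max_new_tokens), value=pad_token_id)

for _ in range(max_new_tokens):
    next_token = torch.randint(0, 30000, (1,)).item()  # stand-in for a real model forward pass
    input_ids[0, token_idx] = next_token  # written in place; the tensor shape never changes
    token_idx += 1

print(input_ids.shape)  # torch.Size([1, 11]) on every iteration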
The second optimization uses a static key-value cache, which eliminates recompilations of the self-attention forward pass when new tokens are generated. The implementation is as follows:
if layer_past is not None:
    past_key, past_value = layer_past
    if token_idx is not None:
        past_key.index_copy_(2, token_idx - 1, key)
        past_value.index_copy_(2, token_idx - 1, value)
        key = past_key
        value = past_value
    else:
        key = torch.cat((past_key, key), dim=-2)
        value = torch.cat((past_value, value), dim=-2)
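The standalone sketch below (tensor shapes and the token_idx value are illustrative assumptions) contrasts the two branches above: with a cache preallocated at the maximum sequence length, index_copy_ writes the new key in place and the cache shape never changes, whereas the concatenation path grows the cache by one slot per token and changes its shape on every step:
import torch

batch, heads, max_len, head_dim = 1, 4, 16, 8

# Static cache: preallocated once at the maximum sequence length
static_key = torch.zeros(batch, heads, max_len, head_dim)
new_key = torch.randn(batch, heads, 1, head_dim)  # key for the token at position token_idx - 1
token_idx = torch.tensor([5])  # hypothetical current position in the sequence
static_key.index_copy_(2, token_idx - 1, new_key)  # shape stays (1, 4, 16, 8)

# Dynamic cache: grows by one slot per generated token
dynamic_key = torch.randn(batch, heads, 5, head_dim)
dynamic_key = torch.cat((dynamic_key, new_key), dim=-2)  # now (1, 4, 6, 8)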
The model examples with these optimizations include a dedicated Gaudi subclass that inherits from the original upstream model code. We followed the same process for each model, creating a causal language modeling subclass whose only differences from the upstream class are the two changes that enable the static shape and key-value cache optimizations.
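As a rough illustration of this subclassing pattern (the class name and override below are a simplified sketch, not the actual Optimum-Habana implementation), a Gaudi variant can inherit everything from the upstream Hugging Face model and override only the generation-time plumbing needed to pass token_idx through to the attention layers:
from transformers import GPTNeoXForCausalLM

class GaudiGPTNeoXForCausalLM(GPTNeoXForCausalLM):
    # Everything else is inherited from the upstream model; only the pieces needed
    # for static shapes and the static key-value cache are overridden.
    def prepare_inputs_for_generation(self, input_ids, token_idx=None, **kwargs):
        model_inputs = super().prepare_inputs_for_generation(input_ids, **kwargs)
        # Forward token_idx so the attention layers can update the preallocated cache in place
        model_inputs["token_idx"] = token_idx
        return model_inputs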
We also enabled support for hpu_graphs and DeepSpeed inference in Optimum-Habana, which are additional ways to optimize model performance. You can enable them through command-line options, shown in the following sections.
Notes
- Ensure that you set up hpu_graphs for Training or Inference.
- To get better inference performance from other LLMs on Intel® Gaudi® platforms, you can also implement the two optimization techniques described above.
Set Up and Run Inference on One Intel Gaudi AI Accelerator
1. To clone Optimum-Habana and install the necessary dependencies, run the following commands:
Note: This example uses Intel® Gaudi® software v1.10.0 with Optimum-Habana v1.6.1.
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && pip install . && cd examples/text-generation
pip install -r requirements.txt
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0
2. To get a text generation output using the 20-billion-parameter version of GPTNeoX, run the following command.
Note: Feel free to modify the prompt. You must include the --use_kv_cache argument, which implements the optimization discussed earlier.
python run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt 'A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt'
The prompt returns the following output using an Intel Gaudi 2 AI accelerator:
Stats:
--------------------------------------------------------------------
Throughput (including tokenization) = 42.177969335170665 tokens/second
Memory allocated = 39.27 GB
Max memory allocated = 39.68 GB
Total memory available = 94.65 GB
Graph compilation duration = 15.923236011061817 seconds
--------------------------------------------------------------------
Input/outputs:
--------------------------------------------------------------------
input 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt',)
output 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the storage industry.\n\nThe company, called Storj, is a peer-to-peer cloud storage service. It is a peer-to-peer network that allows users to store their data on other users' computers.\n\nThe company is based in San Francisco, and was founded by Shawn Wilkinson, a former Google employee.\n\nThe company is currently in the process of raising a $30 million round of funding.\n\nThe company is currently in the process of raising',)
Run Inference on Multiple Gaudi Devices Using DeepSpeed
To run on eight Intel Gaudi 2 AI accelerators with DeepSpeed enabled, launch the multicard run with the gaudi_spawn.py script, which invokes mpirun, using the same arguments as in the previous section:
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt 'A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt'
Here is the output:
Stats:
--------------------------------------------------------------------
Throughput (including tokenization) = 85.745799548246 tokens/second
Memory allocated = 6.5 GB
Max memory allocated = 6.54 GB
Total memory available = 94.65 GB
Graph compilation duration = 5.841280916007236 seconds
--------------------------------------------------------------------
Input/outputs:
--------------------------------------------------------------------
input 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt',)
output 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the cloud storage market.\n\nThe company, called Storj, is a peer-to-peer cloud storage service that is currently in beta. The company is currently in the process of raising a $1.2 million seed round.\n\nThe company is led by John Quinn, a former executive at Dropbox, and Shawn Wilkinson, a former executive at Box.\n\nThe company is currently in the process of raising a $1.2 million seed round.\n\nThe',)
Next Steps
Hugging Face and Habana Labs continue to enable reference models and publish them in the optimum-habana and Model-References repositories, where anyone can freely access them. For helpful articles and forum posts, see the developer site.