
Develop Solutions on Intel® Gaudi® AI Accelerators

 

 


Run an Inference Use Case on the Intel® Gaudi® 2 AI Accelerator

Learn how to select a model, set up the environment, run the workload, and then see a price-performance comparison. The accelerator supports PyTorch* as the main framework for inference.

The following steps guide you through how to:

  • Get access to an Intel Gaudi AI accelerator node on the Intel® Tiber™ AI Cloud.
  • Ensure that all the software is installed and configured properly by running the PyTorch* version of the Docker* image for the accelerator.
  • Select the model to run by loading the desired model repository and the appropriate libraries for model acceleration.
  • Run the model and extract the details for evaluation.

There are four methods for running inference on models:

  1. Using Hugging Face* models with the Optimum for Intel Gaudi AI accelerators library.
  2. Using the Intel Gaudi AI accelerators model references repository, which provides built-in PyTorch models.
  3. Using the GPU Migration toolkit to automatically convert GPU-based models to be compatible with Intel Gaudi AI accelerators.
  4. Manually migrating PyTorch models from the public domain.

The Optimum for Intel Gaudi AI accelerators library and the model-reference repository contain fully optimized and fully documented model examples. Use them as a starting point for running a model.

This example shows model inference with Hugging Face by running the Meta* Llama-2-70b model using the Optimum for Intel Gaudi AI accelerators library. Hugging Face models are always run against an associated task; here, inference uses the text-generation task.
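For context, here is a minimal sketch of what the Hugging Face text-generation task looks like with the plain transformers pipeline API, using the small GPT-2 checkpoint as a stand-in; the actual Gaudi runs in this example use the optimum-habana run_generation.py script shown later rather than this generic path.

# Minimal sketch of the Hugging Face "text-generation" task (generic path).
# The small "gpt2" checkpoint is only a stand-in for illustration; the Gaudi
# runs below use the optimum-habana run_generation.py example script instead.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
outputs = generator("Deep learning accelerators are", max_new_tokens=32)
print(outputs[0]["generated_text"])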

 

Performance Evaluation

Before running the model, let's look at the performance measurements and a price-performance comparison against an equivalent H100 inference run. In this case, the 70-billion-parameter Llama-2-70b model was run in FP8 with 128 input tokens, 2048 output tokens, and four Intel Gaudi AI accelerators.

The tokens per dollar, or inference runs per dollar, are significantly higher than those of the NVIDIA solution (a simple way to compute this metric is sketched after the chart below).

  • View the Intel Gaudi benchmarks and performance data.
  • The model is compared to the same model configuration on the H100 GPU, using NVIDIA*-published inference benchmarks from June 25, 2024.

[Chart: Performance cost differences]
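As a rough illustration of how such a price-performance metric can be computed, the following sketch converts a throughput figure and an hourly instance price into tokens per dollar. The numeric values in it are placeholders, not measured or published results.

# Illustrative tokens-per-dollar calculation. All numeric values below are
# placeholders, not benchmark results; substitute your own measured throughput
# and your cloud provider's hourly instance pricing.
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    # Tokens generated in one hour divided by the cost of that hour.
    return tokens_per_second * 3600 / price_per_hour

example = tokens_per_dollar(tokens_per_second=9000.0, price_per_hour=10.0)  # placeholder values
print(f"~{example:,.0f} tokens per dollar")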


Runtime Instructions

The following instructions cover setting up the node and the model infrastructure, and the full runtime commands for the model.

Accessing the Intel® Gaudi® Node

To access an Intel® Gaudi® node in the Intel® Tiber™ AI Cloud, go to the Intel® Tiber™ AI Cloud console, open the hardware instances, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.

The website provides an ssh command to log in to the node. It's advisable to add local port forwarding to that command so you can reach a local Jupyter Notebook; for example, add -L 8888:localhost:8888 to the provided ssh command to access the notebook.

Details about setting up Jupyter Notebooks on an Intel® Gaudi® Platform are available here.

Docker Setup

With access to the node, use the latest Docker image by first issuing the docker run command, which automatically downloads the image and starts the container:

docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Enter the running container by issuing the following command:

docker exec -it Gaudi_Docker bash

More information on Gaudi Docker setup and validation can be found here.

Model Setup

In the running Docker environment, install the remaining libraries and model repositories:

Start in the home directory and install the DeepSpeed* library; DeepSpeed is used to reduce memory consumption on Intel® Gaudi® while running large language models.

cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

Now install the Hugging Face Optimum for Intel® Gaudi® library and clone the GitHub examples, selecting the latest validated release of optimum-habana:

pip install optimum-habana==1.16.0
git clone -b v1.16.0 https://github.com/huggingface/optimum-habana

Then, change to the text-generation example directory and install the final set of requirements to run the model:

cd ~/optimum-habana/examples/text-generation
pip install -r requirements.txt
pip install -r requirements_lm_eval.txt

How to Access and Use the Llama 2 Model

Use of the pretrained model is subject to compliance with third-party licenses, including the "Llama 2 Community License Agreement" (LLAMAV2). For guidance on the intended use of the Llama 2 model, what is considered misuse or out of scope, who the intended users are, and additional terms, review and follow the model's instructions. Users bear sole liability and responsibility for following and complying with any third-party licenses, and Habana Labs disclaims and bears no liability with respect to users' use of or compliance with third-party licenses.

To run gated models such as Llama-2-70b-hf, you need to do the following:

  • Have a Hugging Face account and agree to the model's terms of use on its model card on the Hugging Face hub.
  • Create a read token and request access to the Llama 2 model from meta-llama.
  • Log in to your account using the Hugging Face CLI (a programmatic alternative is sketched after this list):
huggingface-cli login --token <your_hugging_face_token_here>
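If you prefer to authenticate from Python instead of the CLI, a minimal sketch using the huggingface_hub library (installed with the Hugging Face stack) looks like this; the token string is a placeholder for your own read token.

# Programmatic alternative to `huggingface-cli login`.
# The token below is a placeholder; use the read token created for your account.
from huggingface_hub import login

login(token="<your_hugging_face_token_here>")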

If you want to run inference with the associated Jupyter Notebook, see the run and fine-tune addendum section for Jupyter Notebook setup; you can then run these steps directly in the Jupyter interface.

 

 

 

  • Intel® Tiber® AI Cloud
  • Text-generation example on GitHub
  • PyTorch Inference Jupyter Notebook

 

Running the Llama 2 70B Model Using the FP8 Datatype

Note To learn more about Intel® Gaudi® FP8 quantization, see the user guide.

First, run a quantization measurement by invoking the local quantization tool with the maxabs_measure.json file, which is already included in the example's quantization_config folder on GitHub.

PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lm_eval.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
-o acc_70b_bs1_measure4.txt \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--bf16 \
--batch_size 1 \
--use_flash_attention \
--flash_attention_recompute

Note The model will ask permission to run custom code associated with dataset loading. If this is acceptable, answer yes and execution will proceed.

The code generates a set of measurement values in an hqt_output folder, showing which operations were converted to the FP8 datatype. A quick way to inspect these files is sketched after the listing below.

-rw-r--r--  1 root root 297695 Dec 22 02:31 measure_hooks_maxabs_0_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_0_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_0_8_mod_list.json
-rw-r--r--  1 root root 297684 Dec 22 02:31 measure_hooks_maxabs_1_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_1_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_1_8_mod_list.json
-rw-r--r--  1 root root 297751 Dec 22 02:32 measure_hooks_maxabs_2_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:32 measure_hooks_maxabs_2_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:32 measure_hooks_maxabs_2_8_mod_list.json
-rw-r--r--  1 root root 297751 Dec 22 02:31 measure_hooks_maxabs_3_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_3_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_3_8_mod_list.json
-rw-r--r--  1 root root 297710 Dec 22 02:32 measure_hooks_maxabs_4_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:32 measure_hooks_maxabs_4_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:32 measure_hooks_maxabs_4_8_mod_list.json
-rw-r--r--  1 root root 297835 Dec 22 02:32 measure_hooks_maxabs_5_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:32 measure_hooks_maxabs_5_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:32 measure_hooks_maxabs_5_8_mod_list.json
-rw-r--r--  1 root root 297764 Dec 22 02:31 measure_hooks_maxabs_6_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_6_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_6_8_mod_list.json
-rw-r--r--  1 root root 297655 Dec 22 02:31 measure_hooks_maxabs_7_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_7_8.npz
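A quick, assumption-light way to sanity-check the measurement output is to load each *_mod_list.json file and report its size. This sketch only assumes the files are valid JSON; the exact schema is defined by the quantization tool.

# Report the size and top-level type of each *_mod_list.json measurement file.
# Assumes only that the files are valid JSON; the schema itself is defined by
# the quantization tool.
import json
from pathlib import Path

for path in sorted(Path("hqt_output").glob("*_mod_list.json")):
    with path.open() as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{path.name}: {type(data).__name__} with {size} top-level entries")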

You can use these measurements to run the throughput benchmark of the model. In this case, a standard input prompt is used:

PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--max_new_tokens 2048 \
--max_input_tokens 128 \
--bf16 \
--batch_size 210 \
--use_flash_attention \
--flash_attention_recompute

Notice that the maxabs_quant.json config file is used instead of the measurement file, and additional input and output parameters are added: --max_new_tokens 2048 determines the number of output tokens generated, and --max_input_tokens 128 defines the number of input tokens.

You can now see the final values that align with the published numbers.

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 9095.210106548433 tokens/second
Memory allocated                    = 93.69 GB
Max memory allocated                = 94.57 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 164.68858193303458 seconds
----------------------------------------------------------------------------------
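As a rough sanity check on what that throughput means in wall-clock terms, the following back-of-the-envelope calculation estimates the time to generate one full batch from the reported figure; it ignores tokenization, warm-up, and graph-compilation overhead, so treat it only as an approximation.

# Rough wall-clock estimate from the reported throughput figure above.
# Ignores tokenization, warm-up, and graph-compilation time.
batch_size = 210                     # --batch_size used in the run
max_new_tokens = 2048                # --max_new_tokens used in the run
throughput_tokens_per_s = 9095.21    # reported "Throughput (including tokenization)"

batch_seconds = batch_size * max_new_tokens / throughput_tokens_per_s
print(f"~{batch_seconds:.0f} s to generate one batch "
      f"of {batch_size} sequences (~{batch_seconds / 60:.1f} minutes)")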

Note These performance numbers were generated using Gaudi 2 devices (HL-225). Better throughput can be achieved on Gaudi 3 devices by running with larger --batch_size values.

Next Steps

Now that you have run a full inference case, you can go back to the Hugging Face Optimum Intel® Gaudi® validated models to see more options for running inference.

 

 

 
