Setup Instructions
Follow the Driver Installation guide to install the Gaudi driver on the system.
It is recommended to use the PyTorch Docker image to run the examples below.
To use the provided Dockerfile for the sample, follow the Docker Installation guide to set up the Habana runtime for Docker images.
The Docker image provides the PyTorch software stack and most of the packages needed to run the samples. However, additional required packages such as DeepSpeed must still be installed, as described below.
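For reference, the Gaudi PyTorch Docker image used in the Docker Run step below can be pulled ahead of time. The tag here matches the docker run command later in this guide; adjust it to the Gaudi software version installed on your system:
docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest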
Get the examples from the optimum-habana GitHub repository
To benchmark the Llama2 and Llama3 models, obtain optimum-habana from the GitHub repository using the following commands:
git clone -b v1.15.0 https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/text-generation
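To confirm that the expected release was checked out, the tag can be inspected (an optional sanity check; it should print v1.15.0):
git describe --tags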
Docker Run
After building the Docker image, run the following command to start a Docker container. The shell opens in the text-generation folder inside the container.
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=ALL --privileged=true --net=host --ipc=host -v "$PWD/../../":/workspace --workdir /workspace/examples/text-generation vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
NOTE: Hugging Face model files can be large, so it is recommended to keep the Hugging Face hub folder on an external disk. Set the HF_HOME environment variable to a path on the external disk and mount that path into the Docker container, e.g. "-e HF_HOME=/mnt/huggingface -v /mnt:/mnt".
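Putting this together with the command above, a docker run invocation that redirects the hub folder to an external disk might look like the following (a sketch, assuming /mnt/huggingface exists on the host):
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_HOME=/mnt/huggingface --cap-add=ALL --privileged=true --net=host --ipc=host -v /mnt:/mnt -v "$PWD/../../":/workspace --workdir /workspace/examples/text-generation vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest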
Install required packages inside the Docker container
First, install optimum-habana:
pip install --upgrade-strategy eager optimum[habana]
Second, install the requirements:
pip install -r requirements.txt
For run_lm_eval.py:
pip install -r requirements_lm_eval.txt
Then, to use DeepSpeed-inference, install DeepSpeed as follows:
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
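As an optional sanity check (a minimal sketch, assuming the installs completed without errors), verify that the key packages import and report a version:
python3 -c "import optimum.habana; import deepspeed; print(deepspeed.__version__)"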
Tensor quantization statistics measurement
This step needs to be completed only once for each model and its corresponding world size value.
The hqt_output files generated in this step are used for the FP8 run. If you change the model for the FP8 run, repeat this step to obtain the matching hqt_output files.
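After a measurement run completes, the statistics files land in the folder set by dump_stats_path in the quantization config JSON; the hqt_output path below is an assumption based on that naming, so adjust it to match your config:
ls hqt_output/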
Llama2
Here is an example to measure the tensor quantization statistics on Llama2.
Set the following environment variables to change the parameters used for the tensor quantization statistics measurement:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-7b-hf |
| world_size | 1, 2, 8 |
export model_name=meta-llama/Llama-2-70b-hf
export world_size=2
QUANT_CONFIG=./quantization_config/maxabs_measure.json python3 ../gaudi_spawn.py \
--use_deepspeed --world_size ${world_size} run_lm_eval.py \
-o acc_llama2_bs1_measure.txt \
--model_name_or_path ${model_name} \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--use_flash_attention \
--flash_attention_recompute \
--bf16 \
--batch_size 1
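For a single-card measurement (world_size=1), the gaudi_spawn.py launcher and --use_deepspeed can be dropped and run_lm_eval.py invoked directly. This is a sketch under that assumption, using the 7b model from the table above (the output filename is illustrative):
QUANT_CONFIG=./quantization_config/maxabs_measure.json python3 run_lm_eval.py \
-o acc_llama2_7b_bs1_measure.txt \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--use_flash_attention \
--flash_attention_recompute \
--bf16 \
--batch_size 1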
Llama3
Here is an example to measure the tensor quantization statistics on Llama3 with 8 cards.
Please note that Llama3-405B requires a minimum of 8 Gaudi3 cards.
Set the following environment variables to change the parameters used for the tensor quantization statistics measurement:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-3.1-405B-Instruct, meta-llama/Llama-3.1-70B-Instruct, meta-llama/Llama-3.1-8B-Instruct |
| world_size | 8 |
export model_name=meta-llama/Llama-3.1-405B-Instruct
export world_size=8
QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python3 ../gaudi_spawn.py \
--use_deepspeed --world_size ${world_size} run_lm_eval.py \
-o acc_llama3_bs1_quant.txt \
--model_name_or_path ${model_name} \
--use_hpu_graphs \
--use_kv_cache \
--trim_logits \
--batch_size 1 \
--bf16 \
--reuse_cache \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask
Quantize and run the FP8 model
Here is an example to quantize the model based on the previous measurements for the Llama2 or Llama3 models.
Set the following environment variables to change the parameters of the FP8 run:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-7b-hf, meta-llama/Llama-3.1-405B-Instruct, meta-llama/Llama-3.1-70B-Instruct, meta-llama/Llama-3.1-8B-Instruct |
| input_len | 128, 2048, etc. |
| output_len | 128, 2048, etc. |
| batch_size | 350, 1512, 1750, etc. |
| world_size | 1, 2, 8 |
Please note that Llama3-405B requires a minimum of 8 Gaudi3 cards.
Here is an example that runs Llama2-70b with input token length 128, output token length 128, and batch size 1750:
export model_name=meta-llama/Llama-2-70b-hf
export input_len=128
export output_len=128
export batch_size=1750
export world_size=2
After setting the environment variables, run the FP8 model using the following command:
QUANT_CONFIG=./quantization_config/maxabs_quant.json python3 ../gaudi_spawn.py \
--use_deepspeed --world_size ${world_size} run_generation.py \
--model_name_or_path ${model_name} \
--attn_softmax_bf16 \
--use_hpu_graphs \
--limit_hpu_graphs \
--trim_logits \
--use_kv_cache \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask \
--bucket_size=128 \
--bucket_internal \
--bf16 \
--batch_size ${batch_size} \
--max_new_tokens ${output_len} \
--max_input_tokens ${input_len} \
--book_source \
--warmup 2
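To sweep several batch sizes against the same measurement files, the command can be wrapped in a small bash loop. This is a sketch; the batch sizes are the illustrative values from the table above, not tuned recommendations:
for batch_size in 350 1512 1750; do
QUANT_CONFIG=./quantization_config/maxabs_quant.json python3 ../gaudi_spawn.py \
--use_deepspeed --world_size ${world_size} run_generation.py \
--model_name_or_path ${model_name} \
--attn_softmax_bf16 \
--use_hpu_graphs \
--limit_hpu_graphs \
--trim_logits \
--use_kv_cache \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask \
--bucket_size=128 \
--bucket_internal \
--bf16 \
--batch_size ${batch_size} \
--max_new_tokens ${output_len} \
--max_input_tokens ${input_len} \
--book_source \
--warmup 2
done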