Setup Instructions
Please make sure to follow the Driver Installation guide to install the Gaudi driver on the system.
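To confirm the driver is installed before continuing, you can run hl-smi, which ships with the Gaudi driver tooling (treat its availability as an assumption if your installation differs):
# Should list the Gaudi cards in the system and report the installed driver version.
hl-smi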
It is recommended to use the Optimum-Habana fp8 Benchmark Dockerfile to run the examples below.
To use the provided Dockerfile for these examples, follow the Docker Installation guide to set up the Habana runtime for Docker images.
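If the runtime was registered correctly, it should show up in Docker's runtime list. A quick way to verify (assuming a standard habana-container-runtime setup):
# The habana runtime should appear in the Runtimes line of the output.
docker info | grep -i habana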
Build and Run the Benchmark Docker instance
Optimum-Habana model-agnostic Container (Single Node)
This folder contains scripts and configuration files that can be used to build an Optimum-Habana container with support for the following models:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.1-70B-Instruct, meta-llama/Llama-3.3-70B-Instruct |
| world_size | 1, 2, 8 |
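These variables are exported inside the container in the steps below. For example, one pairing used later in this document is:
# Illustrative pairing from the table above; use the world_size recommended for your model.
export model_name=meta-llama/Llama-3.1-70B-Instruct
export world_size=2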
Quick Start
To run these models on your Gaudi machine:
First, obtain the Dockerfile and benchmark scripts from the Optimum-Habana repository using the commands below:
git clone https://github.com/huggingface/optimum-habana
cd optimum-habana/examples/text-generation/docker
IMPORTANT: All build and run steps listed in this document need to be executed on Gaudi hardware.
To build the oh-1.22.0-gaudi image from the Dockerfile, use the command below.
## Set the next line only if you are using an HTTP proxy on your build machine
BUILD_ARGS="--build-arg http_proxy --build-arg https_proxy --build-arg no_proxy"
docker build -f Dockerfile $BUILD_ARGS -t oh-1.22.0-gaudi .
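To confirm the image was built and tagged as expected:
# The oh-1.22.0-gaudi image should be listed with a recent CREATED timestamp.
docker images oh-1.22.0-gaudi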
Single Card Models
1. Run the container and open a shell in it. For HABANA_VISIBLE_DEVICES, choose a single available card from 0 through 7.
HFCACHE="/mnt/hf_cache"
MOUNT="/mnt/data/"
VOLUME_OPTS="-v ${HFCACHE}:/hf_cache -v ${MOUNT}:/data"
DOCKER_OPTS="-e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -d --runtime=habana --restart always"
DOCKER_OPTS="${DOCKER_OPTS} -e HF_TOKEN=$hf_token -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy"
docker run --entrypoint /bin/bash $DOCKER_OPTS -e HABANA_VISIBLE_DEVICES=0 ${VOLUME_OPTS} --name oh-1.22.0 oh-1.22.0-gaudi -c "sleep infinity"
docker exec -it oh-1.22.0 bash
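Optionally, once inside the container, point the Hugging Face cache at the mounted volume so model downloads persist across container restarts. HF_HOME is the standard Hugging Face cache variable; skip this if the image already configures a cache location.
# Optional: persist model downloads on the host volume mounted at /hf_cache.
export HF_HOME=/hf_cache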
2. Build measurement files for single-card models - this needs to be run only once per model.
- Set model_name to the model you want to run
- Set world_size to the number of HPUs recommended for that model in the table above
export model_name=meta-llama/Llama-3.1-8B-Instruct
export world_size=1
export PT_HPU_LAZY_MODE=1
export HF_TOKEN=<YOUR_TOKEN_HERE>
export HF_DATASETS_TRUST_REMOTE_CODE=true
export TQDM_DISABLE=1
export QUANT_CONFIG=/root/optimum-habana/examples/text-generation/quantization_config/maxabs_measure.json
cd /root/optimum-habana/examples/text-generation/
python3 run_lm_eval.py \
-o acc_llama_quant.json \
--model_name_or_path ${model_name} \
--warmup 0 \
--flash_attention_causal_mask \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bf16 \
--batch_size 1 \
--bucket_size=128 \
--bucket_internal \
--trust_remote_code \
--tasks hellaswag lambada_openai piqa winogrande \
--use_flash_attention \
--flash_attention_recompute
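The measurement run writes the per-op statistics that the fp8 quantization step below consumes. As a quick sanity check, list the output directory; ./hqt_output/ is an assumption based on the default dump_stats_path in maxabs_measure.json, so consult that file if nothing appears there.
# Measurement files should appear under the dump_stats_path configured in maxabs_measure.json.
ls -l ./hqt_output/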
3. Run the benchmark for single-card models (world_size=1)
export QUANT_CONFIG=/root/optimum-habana/examples/text-generation/quantization_config/maxabs_quant.json
export input_len=128
export output_len=128
export batch_size=2816
export world_size=1
cd /root/optimum-habana/examples/text-generation/
python3 run_generation.py \
--model_name_or_path ${model_name} \
--attn_softmax_bf16 \
--trim_logits \
--warmup 2 \
--use_kv_cache \
--use_hpu_graphs \
--limit_hpu_graphs \
--bucket_size=128 \
--bucket_internal \
--attn_batch_split 2 \
--bf16 \
--flash_attention_causal_mask \
--use_flash_attention \
--flash_attention_recompute \
--batch_size ${batch_size} \
--max_new_tokens ${output_len} \
--max_input_tokens ${input_len}
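The batch_size, input_len, and output_len values above describe a single operating point. If you want to explore other batch sizes, a simple sweep such as the following sketch can help; the values are examples only and should be adjusted to the model and available device memory, with all other flags kept as in the command above.
# Illustrative batch-size sweep; COMMON_ARGS mirrors the flags used above.
COMMON_ARGS="--model_name_or_path ${model_name} --attn_softmax_bf16 --trim_logits \
  --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 \
  --bucket_internal --attn_batch_split 2 --bf16 --flash_attention_causal_mask \
  --use_flash_attention --flash_attention_recompute \
  --max_new_tokens ${output_len} --max_input_tokens ${input_len}"
for bs in 1024 2048 2816; do
  python3 run_generation.py ${COMMON_ARGS} --batch_size ${bs}
done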
Multi-card Models
1. Run the container and open a shell in it. For HABANA_VISIBLE_DEVICES, choose multiple available cards based on the number of HPUs recommended for the model in the table above.
HFCACHE="/mnt/hf_cache"
MOUNT="/mnt/data/"
VOLUME_OPTS="-v ${HFCACHE}:/hf_cache -v ${MOUNT}:/data"
DOCKER_OPTS="-e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -d --runtime=habana --restart always"
DOCKER_OPTS="${DOCKER_OPTS} -e HF_TOKEN=$hf_token -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy"
docker run --entrypoint /bin/bash $DOCKER_OPTS -e HABANA_VISIBLE_DEVICES=0,1 ${VOLUME_OPTS} --name oh-1.22.0_multicard oh-1.22.0-gaudi -c "sleep infinity"
docker exec -it oh-1.22.0_multicard bash
2. Build measurement files - this needs to be run only once per model.
- Set model_name to the model you want to run
- Set world_size to the number of HPUs recommended for that model in the table above
export model_name=meta-llama/Llama-3.1-70B-Instruct
export world_size=2
export PT_HPU_LAZY_MODE=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export HF_TOKEN=<YOUR_TOKEN_HERE>
export HF_DATASETS_TRUST_REMOTE_CODE=true
export TQDM_DISABLE=1
export QUANT_CONFIG=./quantization_config/maxabs_measure.json
cd /root/optimum-habana/examples/text-generation/
python3 ../gaudi_spawn.py \
--use_deepspeed --world_size ${world_size} run_lm_eval.py \
-o acc_llama_quant.json \
--model_name_or_path ${model_name} \
--warmup 0 \
--flash_attention_causal_mask \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bf16 \
--batch_size 1 \
--bucket_size=128 \
--bucket_internal \
--trust_remote_code \
--tasks hellaswag lambada_openai piqa winogrande \
--use_flash_attention \
--flash_attention_recompute
3. Run the benchmark for multi-card models (world_size > 1)
export QUANT_CONFIG=./quantization_config/maxabs_quant.json
export input_len=128
export output_len=128
export batch_size=1792
cd /root/optimum-habana/examples/text-generation/
python3 ../gaudi_spawn.py \
--use_deepspeed \
--world_size ${world_size} \
run_generation.py \
--model_name_or_path ${model_name} \
--attn_softmax_bf16 \
--trim_logits \
--warmup 2 \
--use_kv_cache \
--use_hpu_graphs \
--limit_hpu_graphs \
--bucket_size=128 \
--bucket_internal \
--attn_batch_split 2 \
--bf16 \
--flash_attention_causal_mask \
--use_flash_attention \
--flash_attention_recompute \
--batch_size ${batch_size} \
--max_new_tokens ${output_len} \
--max_input_tokens ${input_len}
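For an 8-card run (world_size=8 in the table above), the same measurement and benchmark steps apply; only the card selection and world_size change. An illustrative variant - the model/world_size pairing here is an example rather than a validated recommendation:
# Start the container with all cards visible, e.g. -e HABANA_VISIBLE_DEVICES=all in the
# docker run command above, then inside the container:
export model_name=meta-llama/Llama-3.3-70B-Instruct
export world_size=8
# Re-run the measurement command from step 2 and the benchmark command from step 3 unchanged.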