Model Performance Data for Intel® Gaudi® AI Accelerators
These performance numbers are measured using the latest SynapseAI* software release version 1.17.0-495, unless otherwise noted.
Note: All models for both training and inference use the PyTorch* 2.3.1 framework. Other applicable frameworks used for training or inference are noted for each model.
Intel® Gaudi® 2 Accelerator with MLPerf* v4.0
Model | #HPU | Precision | Performance | Framework Version |
---|---|---|---|---|
MLPerf4.0 Llama 2 70B Server | 8 | fp8 | 6252 token/sec | PyTorch 2.3.1 |
MLPerf4.0 Llama 2 70B Offline | 8 | fp8 | 7581 token/sec | PyTorch 2.3.1 |
MLPerf4.0 Stable Diffusion XL Server | 8 | fp8 | 6.25 samples/sec | PyTorch 2.3.1 |
MLPerf4.0 Stable Diffusion XL Offline | 8 | fp8 | 6.48 samples/sec | PyTorch 2.3.1 |
LLMs for Throughput with Intel Gaudi 2 Accelerator
Model | #HPU | Precision | Input Length | Output Length | Throughput | Batch | Framework Version |
---|---|---|---|---|---|---|---|
LLaMA 2 7B | 1 | fp8 | 128 | 128 | 12772 tokens/sec | 1230 | Optimum Habana 1.12.1 |
LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4787 tokens/sec | 163 | Optimum Habana 1.12.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1318 tokens/sec | 94 | Optimum Habana 1.12.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1967 tokens/sec | 81 | Optimum Habana 1.12.1 |
LLaMA 3 8B | 1 | fp8 | 128 | 128 | 17331 tokens/sec | 2429 | Optimum Habana 1.12.1 |
LLaMA 3 8B | 1 | fp8 | 128 | 2048 | 11106 tokens/sec | 289 | Optimum Habana 1.12.1 |
LLaMA 3 8B | 1 | fp8 | 2048 | 128 | 1762 tokens/sec | 179 | Optimum Habana 1.12.1 |
LLaMA 3 8B | 1 | fp8 | 2048 | 2048 | 5379 tokens/sec | 155 | Optimum Habana 1.12.1 |
LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2784 tokens/sec | 1750 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
LLaMA 2 70B | 2 | fp8 | 128 | 2048 | 3186 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 292 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 1392 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
Mistral 7B | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.12.1 |
Mistral 7B | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.12.1 |
Mistral 7B | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.12.1 |
Mistral 7B | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.12.1 |
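As a rough consistency check, the throughput rows above relate to batch size and output length: with static batching, the wall-clock time to complete one batch is approximately batch × output length ÷ throughput. A minimal sketch (values copied from the first LLaMA 2 7B row; the formula ignores prefill time and scheduling overhead):

```python
def batch_generation_time_s(batch: int, output_len: int, tokens_per_sec: float) -> float:
    """Approximate wall-clock seconds to generate one full batch,
    assuming throughput is dominated by decode tokens."""
    return batch * output_len / tokens_per_sec

# LLaMA 2 7B, input 128 / output 128: 12772 tokens/sec at batch 1230
print(f"{batch_generation_time_s(1230, 128, 12772):.1f} s")  # ~12.3 s per batch
```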
LLMs for Low Latency with Intel Gaudi 2 Accelerator
Model | #HPU | Precision | Input Length | Latency | Batch | Framework Version |
---|---|---|---|---|---|---|
LLaMA 2 7B | 1 | fp8 | 128 | 7.62 ms | 1 | Optimum Habana 1.12.1 |
LLaMA 2 7B | 1 | fp8 | 2048 | 56.31 ms | 1 | Optimum Habana 1.12.1 |
LLaMA 3 8B | 1 | fp8 | 128 | 8.17 ms | 1 | Optimum Habana 1.12.1 |
LLaMA 3 8B | 1 | fp8 | 2048 | 60.52 ms | 1 | Optimum Habana 1.12.1 |
LLaMA 2 70B | 8 | fp8 | 128 | 26.93 ms | 1 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
LLaMA 2 70B | 8 | fp8 | 2048 | 116 ms | 1 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
Mistral 7B | 1 | fp8 | 128 | 10.8 ms | 1 | Optimum Habana 1.12.1 |
Mistral 7B | 1 | fp8 | 2048 | 92 ms | 1 | Optimum Habana 1.12.1 |
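Because these rows are measured at batch size 1, the average next-token latency converts directly to a per-stream decode rate: tokens/sec ≈ 1000 ÷ latency in milliseconds. A small sketch using two rows from the table above:

```python
def decode_tokens_per_sec(next_token_latency_ms: float) -> float:
    """Per-stream decode rate implied by an average next-token latency."""
    return 1000.0 / next_token_latency_ms

print(round(decode_tokens_per_sec(7.62)))   # LLaMA 2 7B, 128-token input: ~131 tokens/sec
print(round(decode_tokens_per_sec(26.93)))  # LLaMA 2 70B on 8 HPUs: ~37 tokens/sec
```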
Reference Models for Intel Gaudi 2 Accelerator
Model | #HPU | Precision | Throughput | Latency‡ | Batch | Framework Version |
---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512)** | 1 | bf16 | 1.23 img/sec | 813 ms | 1 | Lightning 2.2.0 |
Stable Diffusion v2.1 (768X768)** | 1 | bf16 | 0.4 img/sec | 2500 ms | 1 | Lightning 2.2.0 |
Bert FT (torch.compile) | 1 | bf16 | 814 token/sec | 29.48 ms | 24 | |
Resnet50 (torch.compile) | 1 | bf16 | 17018 img/sec | 15.04 ms | 256 | |
Unet2D | 1 | bf16 | 7525 img/sec | 8.5 ms | 64 | Lightning 2.3.3 |
Unet3D | 1 | bf16 | 114.74 img/sec | 17.43 ms | 2 | Lightning 2.3.3 |
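The throughput and latency columns in this table are linked through the batch size: latency ≈ 1000 × batch ÷ throughput. A quick check against two rows confirms the figures are internally consistent:

```python
def implied_latency_ms(batch: int, throughput_per_sec: float) -> float:
    """Batch latency implied by a steady-state throughput figure."""
    return 1000.0 * batch / throughput_per_sec

print(f"{implied_latency_ms(256, 17018):.2f} ms")  # Resnet50: ~15.04 ms (matches table)
print(f"{implied_latency_ms(64, 7525):.2f} ms")    # Unet2D: ~8.50 ms (table: 8.5 ms)
```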
Hugging Face* Optimum with Intel Gaudi 2 Accelerator
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub*).
Model | #HPU | Precision | Throughput | Latency | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|
Bert (Language Modeling) | 1 | bf16 | 89 token/sec | 44.94 ms | 4 | language-modeling | Optimum Habana 1.12.1 |
Bert (Question Answering) | 1 | bf16 | 662 token/sec | 12.08 ms | 8 | question-answering | Optimum Habana 1.12.1 |
Bert (Text Classification) | 1 | bf16 | 1992 token/sec | 4.01 ms | 8 | text-classification | Optimum Habana 1.12.1 |
Bloomz | 8 | bf16 | 37 token/sec | 27.02 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
BridgeTower | 1 | bf16 | 3224 token/sec | 4.96 ms | 16 | contrastive-image-text | Optimum Habana 1.12.1 |
ESMFold | 1 | bf16 | 2.91 token/sec | 343.64 ms | 1 | protein-folding | Optimum Habana 1.12.1 |
GPT-J | 8 | bf16 | 588 token/sec | 6.8 ms | 4 | text-generation | Optimum Habana 1.12.1 |
MPT-7B 1932 Tokens | 1 | bf16 | 121 token/sec | 8.26 ms | 1 | text-generation | Optimum Habana 1.12.1 |
OPT | 1 | bf16 | 1013 token/sec | 0.98 ms | 1 | text-generation | Optimum Habana 1.12.1 |
StableDiffusion v2.1 (512x512) | 1 | bf16 | 1.35 images/sec | 2962.96 ms | 4 | stable-diffusion | Optimum Habana 1.12.1 |
StableLM-7B 2048 Tokens | 1 | bf16 | 128 token/sec | 7.81 ms | 1 | text-generation | Optimum Habana 1.12.1 |
StarCoder | 1 | bf16 | 65 token/sec | 15.38 ms | 1 | text-generation | Optimum Habana 1.12.1 |
T5-3B Summarization Greedy | 1 | bf16 | 12.38 token/sec | 5331.17 ms | 1 | summarization | Optimum Habana 1.12.1 |
Wav2vec (Audio Classification) | 1 | bf16 | 1817 token/sec | 2.2 ms | 4 | audio-classification | Optimum Habana 1.12.1 |
Wav2vec (Speech Recognition) | 1 | bf16 | 19.48 token/sec | 205.33 ms | 4 | speech-recognition | Optimum Habana 1.12.1 |
Reference Models for Intel Gaudi Accelerator
Model | #HPU | Precision | Throughput | Latency | Batch Size | Framework Version |
---|---|---|---|---|---|---|
Bert | 1 | bf16 | 147.6 token/sec | 162.6 ms | 24 | |
Unet2D | 1 | bf16 | 2359 img/sec | 27.13 ms | 64 | Lightning 2.3.3 |
Unet3D | 1 | bf16 | 29.6 img/sec | 67.56 ms | 2 | Lightning 2.3.3 |
Hugging Face Optimum for Intel Gaudi Accelerator
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub).
Model | #HPU | Precision | Throughput | Latency | Batch | Task | Framework Version |
---|---|---|---|---|---|---|---|
HF Bert (Language Modeling) | 1 | bf16 | 38.7 token/sec | 103.35 ms | 4 | language-modeling | Optimum Habana 1.12.1 |
HF Bert (Question Answering) | 1 | bf16 | 128.7 token/sec | 62.16 ms | 8 | question-answering | Optimum Habana 1.12.1 |
HF Bert (Text Classification) | 1 | bf16 | 434.2 token/sec | 18.42 ms | 8 | text-classification | Optimum Habana 1.12.1 |
Bart-Greedy | 1 | bf16 | 3.1 token/sec | 645.16 ms | 2 | summarization | Optimum Habana 1.12.1 |
ESMFold | 1 | bf16 | 13.9 token/sec | 71.94 ms | 1 | protein-folding | Optimum Habana 1.12.1 |
StableDiffusion V2-1 (512x512) | 1 | bf16 | 0.4 images/sec | 10000 ms | 4 | stable-diffusion | Optimum Habana 1.12.1 |
Wav2vec (Audio Classification) | 1 | bf16 | 1287 token/sec | 3.1 ms | 4 | audio-classification | Optimum Habana 1.12.1 |
** These models used the previous 1.15.0 software release.
‡ For large language inference models, this is the average next-token latency.
System Configuration
Intel® Gaudi® Platform
System: HLS-1 with eight Intel® Gaudi® platform HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs at 2.70 GHz, and 756 GB of system memory
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.17.0-495
- PyTorch: Models run with PyTorch v2.2.2 use this Docker* image.
- Environment: These workloads run in Docker containers directly on the host operating system.
For each model's support and validation coverage, see Model-References on GitHub. All information provided there is subject to change without notice. Your costs and results may vary.