Model Performance Data for Intel® Gaudi® AI Accelerators
These performance numbers are measured with the latest Intel Gaudi software (SynapseAI*) release, v1.17.0-495, unless otherwise noted.
Note: All models for both training and inference use the PyTorch* 2.3.1 framework. Other frameworks used for training or inference are noted for each model.
Training Performance Highlights
Microsoft DeepSpeed* for Megatron 0.12.4:
- Llama-2-70B: 1,024 HPUs, BS=4,096
- Llama-2-70B: 512 HPUs, BS=2,048
- Llama-2-70B: 256 HPUs, BS=1,024
Intel® Gaudi® 2 Accelerator with MLPerf* v3.1
These performance numbers were generated with previous versions of the Intel Gaudi software. They will be updated with the next MLPerf* training submission, which will be part of an upcoming Intel Gaudi software release.
Model | #HPU | Precision | Time To Train | Framework Version |
---|---|---|---|---|
MLPerf 3.1 - GPT3 | 384 | fp8 | 153.58 min** | |
MLPerf 3.1 - GPT3 | 256 | fp8 | 223.75 min† | |
MLPerf 3.1 - Stable Diffusion v2 | 64 | bf16 | 19.4 min† | Lightning 2.1.2 |
MLPerf 3.1 - ResNet | 8 | bf16 | 16.4 min‡ | |
MLPerf 3.1 - BERT | 8 | bf16 | 15.01 min‡ | |
** The GPT-3* measurement with 384 cards was taken using a prelaunch version of the Intel Gaudi software v1.13.0.
† The GPT-3 measurement with 256 cards and Stable Diffusion* measurement were taken using the Intel Gaudi software v1.13.0.
‡ The ResNet* and BERT measurements were taken using the Intel Gaudi software v1.15.0.
Large Language Models (LLM) for Intel Gaudi 2 Accelerator
Model | #HPU | Precision | Throughput | Sequence Length | TP,PP,DP | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
LLaMA 2 7B | 8 | FP8 | 70688 tokens/sec | 4,096 | 1,1,8 | 1,024 | Megatron DeepSpeed PR #374 |
LLaMA 2 13B | 16 | FP8 | 55296 tokens/sec | 4,096 | 2,2,4 | 256 | Megatron DeepSpeed PR #374 |
LLaMA 2 70B | 64 | FP8 | 54067 tokens/sec | 4,096 | 8,2,4 | 1,024 | Megatron DeepSpeed PR #374 |
LLaMA 2 70B** | 256 | bf16 | 137625 tokens/sec | 4,096 | 8,8,4 | 1,024 | Megatron DeepSpeed PR #307 |
LLaMA 2 70B** | 512 | bf16 | 226918 tokens/sec | 4,096 | 8,8,8 | 2,048 | Megatron DeepSpeed PR #307 |
LLaMA 2 70B** | 1024 | bf16 | 427622 tokens/sec | 4,096 | 8,8,16 | 4,096 | Megatron DeepSpeed PR #307 |
TP, PP, DP: the tensor-parallel, pipeline-parallel, and data-parallel sizes used for the DeepSpeed for Megatron training runs. Their product equals the number of HPUs in each row.
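That product rule is a quick consistency check on any row above. A minimal sketch in Python (the helper name is illustrative, not part of any Gaudi API):

```python
# Hypothetical helper: derive the data-parallel size from the HPU count
# and the tensor-/pipeline-parallel sizes used in 3D parallelism.
def data_parallel_size(num_hpus: int, tp: int, pp: int) -> int:
    assert num_hpus % (tp * pp) == 0, "TP * PP must divide the device count"
    return num_hpus // (tp * pp)

# LLaMA 2 70B on 64 HPUs with TP=8 and PP=2 leaves DP=4, matching the 8,2,4 row.
print(data_parallel_size(64, tp=8, pp=2))  # -> 4
```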
Reference Models for Intel Gaudi 2 Accelerator
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
Llama 2 13B | 16 | bf16 | 10.16 samples/sec | | | 256 | DeepSpeed 0.14.0 |
Llama 2 70B | 64 | bf16 | 9.13 samples/sec | | | 1024 | DeepSpeed 0.14.0 |
Llama 2 70B | 64 | FP8 | 13.17 samples/sec | | | 1024 | DeepSpeed 0.14.0 |
MIXTRAL-8x7B-32K | 32 | bf16 | 0.7 samples/sec | 88.46 | 345 min | 128 | DeepSpeed 0.14.0 |
Stable Diffusion | 64 | bf16 | 11122 img/sec | | | 32 | Lightning 2.3.3 |
Stable Diffusion Fine Tuning** | 1 | bf16 | 73 img/sec | | | 7 | Lightning 2.3.3 |
Stable Diffusion Fine Tuning Textual Inversion** | 1 | bf16 | 19.7 img/sec | | | 7 | Lightning 2.3.3 |
ResNet50 LARS | 32 | bf16 | 18399 img/sec | 76.38 | 7.26 min | 256 | |
ResNet50 LARS | 8 | bf16 | 48166.02 img/sec | 76.04 | 17.81 min | 256 | |
ResNet50 LARS | 1 | bf16 | 6201.14 img/sec | | | 256 | |
BERT Pre Training Phase 1 (torch.compile) | 32 | bf16 | 33179.52 sent/sec | | 238 min | 64 | |
BERT Pre Training Phase 1 (torch.compile) | 8 | bf16 | 8593.03 sent/sec | 0 | | 64 | |
BERT Pre Training Phase 1 (torch.compile) | 1 | bf16 | 1074.45 sent/sec | | | 64 | |
BERT Pre Training Phase 2 (torch.compile) | 32 | bf16 | 9861.81 sent/sec | 0 | 87 min | 16 | |
BERT Pre Training Phase 2 (torch.compile) | 8 | bf16 | 2568.65 sent/sec | 0 | | 16 | |
BERT Pre Training Phase 2 (torch.compile) | 1 | bf16 | 320.41 sent/sec | | | 16 | |
BERT SQUAD Fine Tuning | 8 | bf16 | 2013 sent/sec | 90.52 | 4.68 min | 24 | |
ResNext101 | 8 | bf16 | 21851 img/sec | 77.81 | 102 min | 256 | |
Transformer | 8 | bf16 | 1121879 tokens/sec | 27.9 | 236 min | 8,192 | |
Unet2D (torch.compile) | 8 | bf16 | 19888 img/sec | 72.5 | 10.21 min | 64 | Lightning 2.3.3 |
Unet3D PTL | 8 | bf16 | 252 img/sec | 74.17 | 17.96 min | 2 | Lightning 2.3.3 |
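Several rows above are marked "(torch.compile)", meaning the model was run through PyTorch's torch.compile with the Gaudi backend rather than in lazy mode. A minimal sketch, assuming a host with the Intel Gaudi software stack and its habana_frameworks PyTorch bridge installed:

```python
# Minimal sketch of compiling a module for HPU with torch.compile.
import torch
import habana_frameworks.torch.core  # noqa: F401 -- registers the "hpu" device

model = torch.nn.Linear(1024, 1024).to("hpu")
compiled = torch.compile(model, backend="hpu_backend")  # Gaudi compile backend

x = torch.randn(64, 1024, device="hpu")
y = compiled(x)  # first call compiles the graph; later calls reuse it
```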
Hugging Face* Optimum with Intel Gaudi 2 Accelerator
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub*); a minimal launch sketch follows the table below.
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
---|---|---|---|---|---|---|---|---|
Llama2-70B Fine Tuning FSDP (LoRA with torch.compile) | 8 | bf16 | 1.81 sentences/sec | 2.13 | 60 min | 10 | language-modeling | Optimum Habana 1.12.1 |
Llama2-70B Fine Tuning (LoRA) | 8 | bf16 | 2.66 sentences/sec | 2.13 | 38.86 min | 10 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
Falcon-180B Fine Tuning (LoRA) | 8 | bf16 | 2.47 sentences/sec | 3.74 | 162.13 min | 1 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
GPTJ-CLM | 8 | bf16 | 22.17 sentences/sec | 0.53 | 21.56 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
GPTNEOX-20B-CLM | 16 | bf16 | 257 sentences/sec | 0.53 | 41 min | 2 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
BridgeTower | 8 | bf16 | 1031 sentences/sec | | 7.28 min | 40 | contrastive-image-text | Optimum Habana 1.12.1 |
GPT2-XL | 8 | bf16 | 95.69 sentences/sec | 0.47 | 8.81 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
ALBERT-XXL | 8 | bf16 | 422 sentences/sec | 94.8 | 7.4 min | 16 | question-answering | Optimum Habana 1.12.1 |
BERT Base (torch.compile) | 8 | bf16 | 4513 sentences/sec | 85.29 | 0.93 min | 24 | question-answering | Optimum Habana 1.12.1 |
BERT-Large Fine Tuning (torch.compile) | 8 | bf16 | 2099 sentences/sec | 93.18 | 1.93 min | 32 | question-answering | Optimum Habana 1.12.1 |
ClipRoBERTa (torch.compile) | 8 | bf16 | 6420 images/sec | | 8.95 min | 64 | contrastive-image-text | Optimum Habana 1.12.1 |
DistilBERT (torch.compile) | 8 | bf16 | 12192 sentences/sec | 82.02 | 0.56 min | 64 | question-answering | Optimum Habana 1.12.1 |
Flan-T5 XXL | 8 | bf16 | 27.11 sentences/sec | 37.06 | 356 min | 22 | question-answering | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
RoBERTa Large (torch.compile) | 8 | bf16 | 2084 sentences/sec | 94.84 | 1.95 min | 32 | question-answering | Optimum Habana 1.12.1 |
Swin Transformer | 8 | bf16 | 5830 images/sec | 99.09 | 1.8 min | 160 | image-classification | Optimum Habana 1.12.1 |
T5-LARGE | 8 | bf16 | 86 sentences/sec | 44.34 | 226 min | 4 | summarization | DeepSpeed 0.14.0 Optimum Habana 1.12.1 |
Vision Transformer | 8 | bf16 | 6273 images/sec | 98.85 | 0.91 min | 128 | image-classification | Optimum Habana 1.12.1 |
Wav2Vec2.0 AC | 8 | bf16 | 1933 sentences/sec | 81.47 | 2.46 min | 16 | audio-classification | Optimum Habana 1.12.1 |
Wav2Vec2.0 ASR | 8 | bf16 | 88 sentences/sec | 3.96 | 17.5 min | 4 | speech-recognition | Optimum Habana 1.12.1 |
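The rows above are produced with the task scripts from the Validated Models repository; the underlying training loop uses Optimum Habana's drop-in replacements for the Hugging Face Trainer classes. A minimal sketch, assuming optimum-habana is installed; the model, dataset, and Gaudi config names are illustrative examples, not the exact benchmark recipes:

```python
# Minimal sketch of a bf16 fine-tuning run with Optimum Habana's GaudiTrainer.
from datasets import load_dataset
from optimum.habana import GaudiTrainer, GaudiTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # example model, not a benchmark recipe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# A small slice of SST-2 keeps the example self-contained and quick.
dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,        # run on HPU
    use_lazy_mode=True,     # Gaudi lazy-execution mode
    bf16=True,              # matches the bf16 precision column above
    per_device_train_batch_size=64,
    num_train_epochs=1,
    gaudi_config_name="Habana/distilbert-base-uncased",  # hosted Gaudi config
)

GaudiTrainer(
    model=model, args=args, train_dataset=dataset, tokenizer=tokenizer
).train()
```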
MosaicML for Intel Gaudi 2 Accelerator
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
MosaicML MPT-1B | 8 | bf16 | 23542 samples/sec | 6.95 | 13.83 min | 512 | DeepSpeed 0.14.0 |
MosaicML MPT-70B | 32 | bf16 | 17955 samples/sec | 7.47 | 106 min | 512 | DeepSpeed 0.14.0 |
Reference Models for Intel Gaudi Accelerator
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
---|---|---|---|---|---|---|---|
ResNet50 LARS (torch.compile) | 32 | bf16 | 46508 img/sec | 76.39 | 23.4 min | 256 | |
ResNet50 LARS (torch.compile) | 8 | bf16 | 11959 img/sec | 76.39 | 2.7 min | 256 | |
BERT Pre Training combine | 32 | bf16 | 4851 sent/sec | | 1735 min | 64 | |
BERT Pre Training combine | 8 | bf16 | 1240 sent/sec | | | 64 | |
BERT Pre Training Phase 1 | 32 | bf16 | 5810 sent/sec | Loss: | 1302 min | 64 | |
BERT Pre Training Phase 1 | 8 | bf16 | 1489 sent/sec | | | 64 | |
BERT Pre Training Phase 2 | 32 | bf16 | 1932 sent/sec | Loss: | 433 min | 16 | |
BERT Pre Training Phase 2 | 8 | bf16 | 490 sent/sec | | | 16 | |
BERT SQUAD Fine Tuning | 8 | bf16 | 406 sent/sec | 90.68 | 12.96 min | 24 | |
BART Fine Tuning | 8 | bf16 | 1782 sent/sec | | | 32 | |
Transformer | 8 | bf16 | 186020 tokens/sec | 27.8 | 1034 min | 4096 | |
Unet2D (torch.compile) | 8 | bf16 | 4776 img/sec | 72.88 | 67.4 min | 64 | Lightning 2.3.3 |
Unet3D PTL | 8 | bf16 | 60.77 img/sec | 74.28 | 59.4 min | 2 | Lightning 2.3.3 |
YOLOX | 8 | bf16 | 312.37 img/sec | 39.93 | 2331.2 min | 16 | |
Hugging Face Optimum with Intel Gaudi Accelerator
For information on running each task, including model naming and hyperparameter use, see Validated Models (GitHub).
Model | #HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
---|---|---|---|---|---|---|---|---|
GPT2-XL | 8 | bf16 | 19.49 sentences/sec | 0.47 | 101 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
T5-LARGE | 8 | bf16 | 49.67 sentences/sec | 44.34 | 368 min | 4 | summarization | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
ALBERT-XXL | 8 | bf16 | 71.2 sentences/sec | 94.88 | 43.6 min | 12 | question-answering | Optimum Habana 1.12.1 |
BERT-BASE FT (torch.compile) | 8 | bf16 | 1187 sentences/sec | 85.53 | 3.1 min | 24 | question-answering | Optimum Habana 1.12.1 |
BERT-Large FT (torch.compile) | 8 | bf16 | 415 sentences/sec | 93.36 | 8.6 min | 24 | question-answering | Optimum Habana 1.12.1 |
Clip-RoBERTa | 8 | bf16 | 630 images/sec | | 46 min | 64 | contrastive-image-text | Optimum Habana 1.12.1 |
RoBERTa Base (torch.compile) | 8 | bf16 | 1141 sentences/sec | 91.77 | 3.2 min | 12 | question-answering | Optimum Habana 1.12.1 |
RoBERTa Large (torch.compile) | 8 | bf16 | 412 sentences/sec | 94.58 | 8.6 min | 12 | question-answering | Optimum Habana 1.12.1 |
Swin Transformer | 8 | bf16 | 1592 images/sec | 98.68 | 4.6 min | 64 | image-classification | Optimum Habana 1.12.1 |
Vision Transformer | 8 | bf16 | 2461 images/sec | 97.19 | 2.81 min | 64 | image-classification | Optimum Habana 1.12.1 |
** These models were run with the previous Intel Gaudi software release, v1.15.0.
System Configuration
Intel® Gaudi® Platform
System: HLS-1 with eight Intel® Gaudi® platform HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs at 2.70 GHz, and 756 GB of system memory
Intel Gaudi 2 Platform
System: HLS-Gaudi2 with eight Intel Gaudi 2 platform HL-225H mezzanine cards, two Intel Xeon Platinum 8380 CPUs at 2.30 GHz, and 1 TB of system memory
Common Software
- Ubuntu* v22.04
- Intel Gaudi software v1.17.0-495
- PyTorch: Models run with PyTorch v2.3.1 use this Docker* image.
- Environment: These workloads run in Docker images running directly on the host operating system.
For each model's support and validation coverage, see Model-References on GitHub. All information provided there is subject to change without notice. Your costs and results may vary.