Model	#HPU	Precision	Throughput	Accuracy	Time to Train	Batch Size	Task	Framework Version
Llama2-70B Fine Tuning FSDP (LoRA with torch.compile)	8	bf16	1.81 sentences/sec	2.13	60 min	10	language-modeling	Optimum Habana 1.12.1
Llama2-70B Fine Tuning (LoRA)	8	bf16	2.66 sentences/sec	2.13	38.86 min	10	language-modeling	DeepSpeed 0.14.0 Optimum Habana 1.12.1
Falcon-180B Fine Tuning (LoRA)	8	bf16	2.47 sentences/sec	3.74	162.13 min	1	language-modeling	DeepSpeed 0.14.0 Optimum Habana 1.12.1
GPTJ-CLM	8	bf16	22.17 sentences/sec	0.53	21.56 min	4	language-modeling	DeepSpeed 0.14.0 Optimum Habana 1.12.1
GPTNEOX-20B-CLM	16	bf16	257 sentences/sec	0.53	41 min	2	language-modeling	DeepSpeed 0.14.0 Optimum Habana 1.12.1
BridgeTower	8	bf16	1031 sentences/sec		7.28 min	40	contrastive-image-text	Optimum Habana 1.12.1
GPT2-XL	8	bf16	95.69 sentences/sec	0.47	8.81 min	4	language-modeling	DeepSpeed 0.14.0 Optimum Habana 1.12.1
ALBERT-XXL	8	bf16	422 sentences/sec	94.8	7.4 min	16	question-answering	Optimum Habana 1.12.1
BERT Base (torch.compile)	8	bf16	4513 sentences/sec	85.29	0.93 min	24	question-answering	Optimum Habana 1.12.1
BERT-Large Fine Tuning (torch.compile)	8	bf16	2099 sentences/sec	93.18	1.93 min	32	question-answering	Optimum Habana 1.12.1
ClipRoBERTa (torch.compile)	8	bf16	6420 images/sec		8.95 min	64	contrastive-image-text	Optimum Habana 1.12.1
DistilBERT (torch.compile)	8	bf16	12192 sentences/sec	82.02	0.56 min	64	question-answering	Optimum Habana 1.12.1
Flan-T5 XXL	8	bf16	27.11 sentences/sec	37.06	356 min	22	summarization	DeepSpeed 0.14.0 Optimum Habana 1.12.1
RoBERTa Large (torch.compile)	8	bf16	2084 sentences/sec	94.84	1.95 min	32	question-answering	Optimum Habana 1.12.1
Swin Transformer	8	bf16	5830 images/sec	99.09	1.8 min	160	image-classification	Optimum Habana 1.12.1
T5-LARGE	8	bf16	86 sentences/sec	44.34	226 min	4	summarization	DeepSpeed 0.14.0 Optimum Habana 1.12.1
Vision Transformer	8	bf16	6273 images/sec	98.85	0.91 min	128	image-classification	Optimum Habana 1.12.1
Wav2Vec2.0 AC	8	bf16	1933 sentences/sec	81.47	2.46 min	16	audio-classification	Optimum Habana 1.12.1
Wav2Vec2.0 ASR	8	bf16	88 sentences/sec	3.96	17.5 min	4	speech-recognition	Optimum Habana 1.12.1

Llama2-70B Fine Tuning FSDP (LoRA with torch.compile): Go to the language-modeling example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the language-modeling folder.

MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 LOWER_LIST=ops_bf16.txt PT_HPU_LAZY_MODE=0   python3 ../gaudi_spawn.py   --world_size 8 --use_mpi run_lora_clm.py --model_name_or_path meta-llama/Llama-2-70b-hf --dataset_name tatsu-lab/alpaca --bf16 True --output_dir /tmp/lora_fsdp_out --max_seq_len 2048 --gradient_checkpointing --per_device_train_batch_size 5 --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --dataset_concatenation --do_train --use_habana --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --attn_softmax_bf16 True --validation_split_percentage 4 --flash_attention_causal_mask True --use_lazy_mode False --fsdp_config fsdp_config.json --fsdp auto_wrap --num_train_epochs 1 --evaluation_strategy no --pipelining_fwd_bwd False --use_fused_rope False --gradient_accumulation_steps 2 --torch_compile_backend hpu_backend --torch_compile --use_flash_attention True
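The FSDP command above combines three settings that together determine how many samples contribute to each optimizer step. The sketch below works through that arithmetic, assuming the usual Hugging Face Trainer definition of effective batch size (per-device batch size times number of devices times gradient accumulation steps). The variable names are illustrative and simply mirror the command-line flags.

```bash
#!/usr/bin/env bash
# Effective (global) batch size per optimizer step, assuming the usual
# Hugging Face Trainer definition:
#   per-device batch size x number of devices x gradient accumulation steps
per_device_train_batch_size=5   # from --per_device_train_batch_size 5
world_size=8                    # from --world_size 8
gradient_accumulation_steps=2   # from --gradient_accumulation_steps 2

global_batch=$(( per_device_train_batch_size * world_size * gradient_accumulation_steps ))
echo "samples per optimizer step: ${global_batch}"
```

Under that assumption, each optimizer step sees 80 samples across the 8 devices.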

Llama2-70B Fine Tuning (LoRA): Go to the language-modeling example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the language-modeling folder.

MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_MAX_COMPOUND_OP_SIZE=10   python3 ../gaudi_spawn.py   --world_size 8 --use_deepspeed run_lora_clm.py --model_name_or_path meta-llama/Llama-2-70b-hf --dataset_name tatsu-lab/alpaca --bf16 True --output_dir /tmp/lora_out --max_seq_len 2048 --gradient_checkpointing --per_device_train_batch_size 10 --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --dataset_concatenation --do_train --use_habana --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --attn_softmax_bf16 True --validation_split_percentage 4 --flash_attention_causal_mask True --pipelining_fwd_bwd --use_lazy_mode --use_flash_attention True --deepspeed llama2_ds_zero3_config.json --num_train_epochs 0.5 --evaluation_strategy no

Falcon-180B Fine Tuning (LoRA): Go to the language-modeling example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the language-modeling folder.

MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 LOWER_LIST=ops_bf16.txt   python3 ../gaudi_spawn.py   --world_size 8 --use_deepspeed run_lora_clm.py --model_name_or_path tiiuae/falcon-180B --dataset_name timdettmers/openassistant-guanaco --cache_dir /software/data/pytorch/falcon/models--tiiuae--falcon-180B --bf16 True --output_dir /tmp/model_lora_falcon_ddp --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --evaluation_strategy "no" --save_strategy "no" --save_steps 200 --save_total_limit 1 --learning_rate 0.0004 --max_grad_norm 0.3 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --do_train --use_habana --use_lazy_mode --pipelining_fwd_bwd --throughput_warmup_steps 3 --lora_rank 64 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "query_key_value" "dense" "dense_h_to_4h" "dense_4h_to_h" --dataset_concatenation --max_seq_length 256 --adam_epsilon 1e-08 --validation_split_percentage 5 --deepspeed ds_falcon_180b_z3.json --token <your_hf_token> --max_steps 25
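The Falcon-180B command passes a Hugging Face access token through `--token`. One way to avoid typing a token inline is to read it from an environment variable and check that it is set before launching. The sketch below is only an illustration: `HF_TOKEN` is an assumed variable name and `require_hf_token` is a hypothetical helper, neither of which comes from the source.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail early if no access token is available.
# HF_TOKEN is an assumed variable name, not something the source defines.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export a Hugging Face access token first" >&2
    return 1
  fi
}

# Usage sketch: call the helper, then pass the variable instead of a literal,
# e.g.  ... run_lora_clm.py ... --token "$HF_TOKEN" --max_steps 25
if require_hf_token 2>/dev/null; then
  echo "token found, ready to launch"
else
  echo "token missing, not launching"
fi
```

Keeping the token in the environment also keeps it out of shell history and out of copied command lines.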

GPTJ-CLM: Go to the language-modeling example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the language-modeling folder.

MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0   python3 ../gaudi_spawn.py   --world_size 8 --use_deepspeed run_clm.py --deepspeed /root/optimum-habana/tests/configs/deepspeed_zero_2.json --model_name_or_path EleutherAI/gpt-j-6b --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --num_train_epochs 1 --do_train --output_dir ~/gptj --gaudi_config_name Habana/gpt2 --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --gradient_checkpointing --use_hpu_graphs_for_inference

GPTNEOX-20B-CLM: Go to the language-modeling example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the language-modeling folder.

PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0  python3 ../gaudi_spawn.py   --hostfile ./hostsfile --world_size 8 --use_deepspeed run_clm.py --deepspeed deepspeed_zero_2.json --model_name_or_path EleutherAI/gpt-neox-20b --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --num_train_epochs 1 --do_train --output_dir ~/gpt-neox-20b --gaudi_config_name Habana/gpt2 --gradient_checkpointing --learning_rate 5e-05 --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --use_hpu_graphs_for_inference
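The GPT-NeoX command points at `./hostsfile` and uses `--world_size 8`, while the table lists 16 HPUs for this model. One reading, assumed here, is two nodes with 8 devices each. The sketch below writes such a file in the DeepSpeed `<host> slots=<n>` hostfile format and sums the slots as a sanity check. The hostnames `node-1` and `node-2` are placeholders, and the sketch writes to a temporary file rather than `./hostsfile` so it does not overwrite anything.

```bash
#!/usr/bin/env bash
# Write a two-node hostfile in the DeepSpeed "<host> slots=<n>" format.
# node-1 and node-2 are placeholder hostnames; replace them with real ones.
hostsfile="$(mktemp)"
cat > "$hostsfile" <<'EOF'
node-1 slots=8
node-2 slots=8
EOF

# Sum the slots to confirm the total device count described by the file.
total_slots=$(awk '{ split($2, kv, "="); sum += kv[2] } END { print sum }' "$hostsfile")
echo "total HPUs: ${total_slots}"
```

With these placeholder entries the total is 16, matching the #HPU value in the table.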

BridgeTower: Go to the contrastive-image-text example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the contrastive-image-text folder.

MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 HF_DATASETS_TRUST_REMOTE_CODE=True PT_HPU_MAX_COMPOUND_OP_SIZE=512   python3 ../gaudi_spawn.py   --world_size 8 --use_mpi run_bridgetower.py --model_name_or_path 'BridgeTower/bridgetower-large-itm-mlm-itc' --do_train --dataset_name 'jmhessel/newyorker_caption_contest' --dataset_config matching --dataset_revision '3c6c4f6c0ff7e902833d3afa5f8f3875c2b036e6' --image_column image --caption_column image_description --remove_unused_columns False --mediapipe_dataloader --output_dir /tmp/bridgetower-test --per_device_train_batch_size 48 --per_device_eval_batch_size 16 --learning_rate 1e-05 --overwrite_output_dir --use_habana --use_lazy_mode --use_hpu_graphs_for_inference --gaudi_config_name Habana/clip --save_strategy epoch --throughput_warmup_steps 3 --num_train_epochs 1 --logging_steps 10 --dataloader_num_workers 2 --adjust_throughput True

GPT2-XL: Go to the language-modeling example on the Optimum for Intel Gaudi GitHub page, follow the setup instructions, and then run the following command from the language-modeling folder.

MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0   python3 ../gaudi_spawn.py   --world_size 8 --use_deepspeed run_clm.py --deepspeed /root/optimum-habana/tests/configs/deepspeed_zero_2.json --model_name_or_path gpt2-xl --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --num_train_epochs 1 --do_train --output_dir ~/gpt2-xl --gaudi_config_name Habana/gpt2 --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --gradient_checkpointing --learning_rate 0.0004 --use_hpu_graphs_for_inference