| Model | #HPU | Precision | Throughput | Accuracy | Time to Train | Batch Size | Task | Framework Version | Run Instructions |
|---|---|---|---|---|---|---|---|---|---|
| Llama2-70B Fine Tuning FSDP (LoRA with torch.compile) | 8 | bf16 | 1.81 sentences/sec | 2.13 | 60 min | 10 | language-modeling | Optimum Habana 1.12.1 | |
| Llama2-70B Fine Tuning (LoRA) | 8 | bf16 | 2.66 sentences/sec | 2.13 | 38.86 min | 10 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| Falcon-180B Fine Tuning (LoRA) | 8 | bf16 | 2.47 sentences/sec | 3.74 | 162.13 min | 1 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| GPTJ-CLM | 8 | bf16 | 22.17 sentences/sec | 0.53 | 21.56 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| GPTNEOX-20B-CLM | 16 | bf16 | 257 sentences/sec | 0.53 | 41 min | 2 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| BridgeTower | 8 | bf16 | 1031 sentences/sec | | 7.28 min | 40 | contrastive-image-text | Optimum Habana 1.12.1 | |
| GPT2-XL | 8 | bf16 | 95.69 sentences/sec | 0.47 | 8.81 min | 4 | language-modeling | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| ALBERT-XXL | 8 | bf16 | 422 sentences/sec | 94.8 | 7.4 min | 16 | question-answering | Optimum Habana 1.12.1 | |
| BERT Base (torch.compile) | 8 | bf16 | 4513 sentences/sec | 85.29 | 0.93 min | 24 | question-answering | Optimum Habana 1.12.1 | |
| BERT-Large Fine Tuning (torch.compile) | 8 | bf16 | 2099 sentences/sec | 93.18 | 1.93 min | 32 | question-answering | Optimum Habana 1.12.1 | |
| ClipRoBERTa (torch.compile) | 8 | bf16 | 6420 images/sec | | 8.95 min | 64 | contrastive-image-text | Optimum Habana 1.12.1 | |
| DistilBERT (torch.compile) | 8 | bf16 | 12192 sentences/sec | 82.02 | 0.56 min | 64 | question-answering | Optimum Habana 1.12.1 | |
| Flan-T5 XXL | 8 | bf16 | 27.11 sentences/sec | 37.06 | 356 min | 22 | summarization | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| RoBERTa Large (torch.compile) | 8 | bf16 | 2084 sentences/sec | 94.84 | 1.95 min | 32 | question-answering | Optimum Habana 1.12.1 | |
| Swin Transformer | 8 | bf16 | 5830 images/sec | 99.09 | 1.8 min | 160 | image-classification | Optimum Habana 1.12.1 | |
| T5-LARGE | 8 | bf16 | 86 sentences/sec | 44.34 | 226 min | 4 | summarization | DeepSpeed 0.14.0 Optimum Habana 1.12.1 | |
| Vision Transformer | 8 | bf16 | 6273 images/sec | 98.85 | 0.91 min | 128 | image-classification | Optimum Habana 1.12.1 | |
| Wav2Vec2.0 AC | 8 | bf16 | 1933 sentences/sec | 81.47 | 2.46 min | 16 | speech-recognition | Optimum Habana 1.12.1 | |
| Wav2Vec2.0 ASR | 8 | bf16 | 88 sentences/sec | 3.96 | 17.5 min | 4 | speech-recognition | Optimum Habana 1.12.1 | |
Llama2-70B Fine Tuning FSDP (LoRA with torch.compile): Go to the GitHub page for the Optimum for Intel Gaudi library for language modeling. Follow the setup instructions, and then run the model directly from the language-modeling folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 LOWER_LIST=ops_bf16.txt PT_HPU_LAZY_MODE=0 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_lora_clm.py --model_name_or_path meta-llama/Llama-2-70b-hf --dataset_name tatsu-lab/alpaca --bf16 True --output_dir /tmp/lora_fsdp_out --max_seq_len 2048 --gradient_checkpointing --per_device_train_batch_size 5 --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --dataset_concatenation --do_train --use_habana --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --attn_softmax_bf16 True --validation_split_percentage 4 --flash_attention_causal_mask True --use_lazy_mode False --fsdp_config fsdp_config.json --fsdp auto_wrap --num_train_epochs 1 --evaluation_strategy no --pipelining_fwd_bwd False --use_fused_rope False --gradient_accumulation_steps 2 --torch_compile_backend hpu_backend --torch_compile --use_flash_attention True
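The table lists a batch size of 10 for this run, while the command sets `--per_device_train_batch_size 5` and `--gradient_accumulation_steps 2`. One plausible reading, which is an assumption and not something the source states, is that the table reports the per-device batch after gradient accumulation. The arithmetic can be sketched as follows; the variable names are illustrative only.

```shell
# Values copied from the FSDP command above.
per_device_batch=5
grad_accum_steps=2
world_size=8

# Samples each device contributes to one optimizer step.
per_device_effective=$((per_device_batch * grad_accum_steps))
# Samples consumed across all devices in one optimizer step.
global_batch=$((per_device_effective * world_size))

echo "per-device effective batch: $per_device_effective"
echo "global batch per optimizer step: $global_batch"
```

The same calculation applies to the other commands by substituting their batch, accumulation, and world-size values.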
Llama2-70B Fine Tuning (LoRA): Go to the GitHub page for the Optimum for Intel Gaudi library for language modeling. Follow the setup instructions, and then run the model directly from the language-modeling folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_MAX_COMPOUND_OP_SIZE=10 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_lora_clm.py --model_name_or_path meta-llama/Llama-2-70b-hf --dataset_name tatsu-lab/alpaca --bf16 True --output_dir /tmp/lora_out --max_seq_len 2048 --gradient_checkpointing --per_device_train_batch_size 10 --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --dataset_concatenation --do_train --use_habana --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --attn_softmax_bf16 True --validation_split_percentage 4 --flash_attention_causal_mask True --pipelining_fwd_bwd --use_lazy_mode --use_flash_attention True --deepspeed llama2_ds_zero3_config.json --num_train_epochs 0.5 --evaluation_strategy no
Falcon-180B Fine Tuning (LoRA): Go to the GitHub page for the Optimum for Intel Gaudi library for language modeling. Follow the setup instructions, and then run the model directly from the language-modeling folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 LOWER_LIST=ops_bf16.txt python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_lora_clm.py --model_name_or_path tiiuae/falcon-180B --dataset_name timdettmers/openassistant-guanaco --cache_dir /software/data/pytorch/falcon/models--tiiuae--falcon-180B --bf16 True --output_dir /tmp/model_lora_falcon_ddp --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --evaluation_strategy "no" --save_strategy "no" --save_steps 200 --save_total_limit 1 --learning_rate 0.0004 --max_grad_norm 0.3 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --do_train --use_habana --use_lazy_mode --pipelining_fwd_bwd --throughput_warmup_steps 3 --lora_rank 64 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "query_key_value" "dense" "dense_h_to_4h" "dense_4h_to_h" --dataset_concatenation --max_seq_length 256 --adam_epsilon 1e-08 --validation_split_percentage 5 --deepspeed ds_falcon_180b_z3.json --token <your_hf_token> --max_steps 25
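The Falcon-180B command passes a Hugging Face access token through `--token`. One way to keep the token out of the command text is to read it from the environment. The sketch below assumes a variable named `HF_TOKEN` and a helper named `token_arg`; both are illustrative choices and not part of the original instructions.

```shell
# Hypothetical helper: print the --token argument built from HF_TOKEN,
# or fail with a message on stderr when the variable is unset or empty.
token_arg() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set" >&2
        return 1
    fi
    printf '%s %s\n' --token "$HF_TOKEN"
}
```

With the helper defined, the tail of the launch command could be written as `... --deepspeed ds_falcon_180b_z3.json $(token_arg) --max_steps 25`, so the token never appears in shell history or shared documentation.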
GPTJ-CLM: Go to the GitHub page for the Optimum for Intel Gaudi library for language modeling. Follow the setup instructions, and then run the model directly from the language-modeling folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py --deepspeed /root/optimum-habana/tests/configs/deepspeed_zero_2.json --model_name_or_path EleutherAI/gpt-j-6b --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --num_train_epochs 1 --do_train --output_dir ~/gptj --gaudi_config_name Habana/gpt2 --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --gradient_checkpointing --use_hpu_graphs_for_inference
GPTNEOX-20B-CLM: Go to the GitHub page for the Optimum for Intel Gaudi library for language modeling. Follow the setup instructions, and then run the model directly from the language-modeling folder.
PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 python3 ../gaudi_spawn.py --hostfile ./hostsfile --world_size 8 --use_deepspeed run_clm.py --deepspeed deepspeed_zero_2.json --model_name_or_path EleutherAI/gpt-neox-20b --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --num_train_epochs 1 --do_train --output_dir ~/gpt-neox-20b --gaudi_config_name Habana/gpt2 --gradient_checkpointing --learning_rate 5e-05 --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --use_hpu_graphs_for_inference
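The GPT-NeoX command reads its node list from `./hostsfile`, and the table reports 16 HPUs for this run. The sketch below assumes the DeepSpeed-style `hostname slots=N` hostfile format and two hypothetical node names with 8 devices each; it writes to a temporary file so an existing hostfile is not overwritten, then sums the slots as a sanity check.

```shell
# Hypothetical node names; replace them with the addresses of the real machines.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node-1 slots=8
node-2 slots=8
EOF

# Sum the slots= values to confirm the total device count.
total_slots=$(awk -F'slots=' '{ sum += $2 } END { print sum }' "$hostfile")
echo "total devices: $total_slots"
rm -f "$hostfile"
```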
BridgeTower: Go to the GitHub page for the Optimum for Intel Gaudi library for contrastive image text. Follow the setup instructions, and then run the model directly from the contrastive-image-text folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 HF_DATASETS_TRUST_REMOTE_CODE=True PT_HPU_MAX_COMPOUND_OP_SIZE=512 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_bridgetower.py --model_name_or_path 'BridgeTower/bridgetower-large-itm-mlm-itc' --do_train --dataset_name 'jmhessel/newyorker_caption_contest' --dataset_config matching --dataset_revision '3c6c4f6c0ff7e902833d3afa5f8f3875c2b036e6' --image_column image --caption_column image_description --remove_unused_columns False --mediapipe_dataloader --output_dir /tmp/bridgetower-test --per_device_train_batch_size 48 --per_device_eval_batch_size 16 --learning_rate 1e-05 --overwrite_output_dir --use_habana --use_lazy_mode --use_hpu_graphs_for_inference --gaudi_config_name Habana/clip --save_strategy epoch --throughput_warmup_steps 3 --num_train_epochs 1 --logging_steps 10 --dataloader_num_workers 2 --adjust_throughput True
GPT2-XL: Go to the GitHub page for the Optimum for Intel Gaudi library for language modeling. Follow the setup instructions, and then run the model directly from the language-modeling folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py --deepspeed /root/optimum-habana/tests/configs/deepspeed_zero_2.json --model_name_or_path gpt2-xl --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --num_train_epochs 1 --do_train --output_dir ~/gpt2-xl --gaudi_config_name Habana/gpt2 --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --gradient_checkpointing --learning_rate 0.0004 --use_hpu_graphs_for_inference
ALBERT-XXL: Go to the GitHub page for the Optimum for Intel Gaudi library for question answering. Follow the setup instructions, and then run the model directly from the question-answering folder.
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path albert-xxlarge-v1 --do_train --dataset_name squad --do_eval --cache_dir /cache/SquadAL --max_seq_length 384 --torch_compile_backend hpu_backend --torch_compile --use_lazy_mode false --bf16 --per_device_eval_batch_size 2 --per_device_train_batch_size 16 --learning_rate 7e-05 --num_train_epochs 2.0 --logging_steps 20 --save_steps 5000 --output_dir /tmp/SQUAD --seed 42 --doc_stride 128 --use_habana --overwrite_output_dir --gaudi_config_name Habana/albert-xxlarge-v1
BERT Base (torch.compile): Go to the GitHub page for the Optimum for Intel Gaudi library for question answering. Follow the setup instructions, and then run the model directly from the question-answering folder.
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path Habana/bert-base-uncased --do_train --dataset_name squad --cache_dir ./cache/transformers --max_seq_length 384 --torch_compile_backend hpu_backend --torch_compile --use_lazy_mode false --bf16 --per_device_eval_batch_size 8 --per_device_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 1.0 --logging_steps 20 --save_steps 5000 --max_steps 500 --output_dir /tmp/SQUAD --seed 42 --doc_stride 128 --use_habana --overwrite_output_dir --gaudi_config_name Habana/bert-base-uncased --throughput_warmup_steps 5
BERT-Large Fine Tuning (torch.compile): Go to the GitHub page for the Optimum for Intel Gaudi library for question answering. Follow the setup instructions, and then run the model directly from the question-answering folder.
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path Habana/bert-large-uncased-whole-word-masking --do_train --dataset_name squad --cache_dir /software/data/pytorch/cache/transformers --max_seq_length 384 --torch_compile_backend hpu_backend --torch_compile --use_lazy_mode false --bf16 --per_device_eval_batch_size 8 --per_device_train_batch_size 32 --learning_rate 8e-05 --num_train_epochs 1.0 --logging_steps 20 --save_steps 5000 --max_steps 500 --output_dir /tmp/SQUAD --seed 42 --doc_stride 128 --use_habana --overwrite_output_dir --gaudi_config_name Habana/bert-large-uncased-whole-word-masking --throughput_warmup_steps 5
ClipRoBERTa (torch.compile): Go to the GitHub page for the Optimum for Intel Gaudi library for contrastive image text. Follow the setup instructions, and then run the model directly from the contrastive-image-text folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 HF_DATASETS_TRUST_REMOTE_CODE=True PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py --output_dir=/tmp/clip_roberta --model_name_or_path clip-roberta --data_dir=/data --dataset_name="ydshieh/coco_dataset_script" --dataset_config_name="2017" --image_column="image_path" --caption_column="caption" --remove_unused_columns="False" --do_train --do_eval --mediapipe_dataloader --per_device_train_batch_size="64" --per_device_eval_batch_size="64" --learning_rate="5e-5" --warmup_steps="0" --weight_decay="0.1" --overwrite_output_dir --save_strategy="epoch" --use_habana --use_lazy_mode=False --gaudi_config_name="Habana/clip" --throughput_warmup_steps=20 --use_hpu_graphs --max_steps=100 --torch_compile_backend=hpu_backend --torch_compile --logging_nan_inf_filter
DistilBERT (torch.compile): Go to the GitHub page for the Optimum for Intel Gaudi library for question answering. Follow the setup instructions, and then run the model directly from the question-answering folder.
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path Habana/distilbert-base-uncased --do_train --dataset_name squad --cache_dir /cache/transformers --max_seq_length 384 --torch_compile_backend hpu_backend --torch_compile --use_lazy_mode false --bf16 --per_device_eval_batch_size 8 --per_device_train_batch_size 64 --learning_rate 0.0005 --num_train_epochs 1.0 --logging_steps 20 --save_steps 5000 --max_steps 500 --output_dir /tmp/SQUAD --seed 42 --doc_stride 128 --use_habana --overwrite_output_dir --gaudi_config_name Habana/distilbert-base-uncased --throughput_warmup_steps 5
Flan-T5 XXL: Go to the GitHub page for the Optimum for Intel Gaudi library for summarization. Follow the setup instructions, and then run the model directly from the summarization folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 PT_HPU_MAX_COMPOUND_OP_SIZE=512 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_summarization.py --deepspeed ds_flan_t5_z3_config_bf16.json --model_name_or_path google/flan-t5-xxl --do_train --source_prefix '"summarize:"' --dataset_name cnn_dailymail --dataset_config '"3.0.0"' --output_dir /tmp/tst-summarization --per_device_train_batch_size 22 --per_device_eval_batch_size 22 --learning_rate 0.0001 --overwrite_output_dir --predict_with_generate --use_habana --use_lazy_mode --gaudi_config_name Habana/t5 --ignore_pad_token_for_loss False --pad_to_max_length --generation_max_length 129 --save_strategy epoch --throughput_warmup_steps 5 --gradient_checkpointing --adam_epsilon 1e-08 --max_eval_samples 880 --num_train_epochs 1 --max_steps 400
RoBERTa Large (torch.compile): Go to the GitHub page for the Optimum for Intel Gaudi library for question answering. Follow the setup instructions, and then run the model directly from the question-answering folder.
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path Habana/roberta-large --do_train --dataset_name squad --cache_dir /software/data/pytorch/cache/transformers --max_seq_length 384 --torch_compile_backend hpu_backend --torch_compile --use_lazy_mode false --bf16 --per_device_eval_batch_size 8 --per_device_train_batch_size 32 --learning_rate 8e-05 --num_train_epochs 1.0 --logging_steps 20 --save_steps 5000 --max_steps 500 --output_dir /tmp/SQUAD --seed 42 --doc_stride 128 --use_habana --overwrite_output_dir --gaudi_config_name Habana/bert-large-uncased-whole-word-masking --throughput_warmup_steps 5
Swin Transformer: Go to the GitHub page for the Optimum for Intel Gaudi library for image classification. Follow the setup instructions, and then run the model directly from the image-classification folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 numactl --cpunodebind=1 --membind=1 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_image_classification.py --model_name_or_path microsoft/swin-base-patch4-window7-224 --dataset_name cifar10 --output_dir /tmp/swint_hf/results/ --remove_unused_columns False --do_train --learning_rate 0.0002 --per_device_train_batch_size 160 --evaluation_strategy no --save_strategy no --load_best_model_at_end True --save_total_limit 3 --seed 1337 --use_habana --use_lazy_mode --gaudi_config_name Habana/swin --throughput_warmup_steps 20 --ignore_mismatched_sizes --bf16 --image_column_name img --num_train_epochs 5 --logging_steps 20 --dataloader_num_workers 2 --use_hpu_graphs_for_inference --pipelining_fwd_bwd True --non_blocking_data_copy True
T5-LARGE: Go to the GitHub page for the Optimum for Intel Gaudi library for summarization. Follow the setup instructions, and then run the model directly from the summarization folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_summarization.py --deepspeed /root/optimum-habana/tests/configs/deepspeed_zero_2.json --do_train --overwrite_output_dir --predict_with_generate --use_habana --use_lazy_mode --gaudi_config_name Habana/t5 --ignore_pad_token_for_loss False --pad_to_max_length --save_strategy no --throughput_warmup_steps 15 --model_name_or_path t5-large --source_prefix '"summarize:"' --dataset_name cnn_dailymail --dataset_config '"3.0.0"' --output_dir /tmp/tst-summarization --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_train_samples 2000
Vision Transformer: Go to the GitHub page for the Optimum for Intel Gaudi library for image classification. Follow the setup instructions, and then run the model directly from the image-classification folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_image_classification.py --model_name_or_path google/vit-base-patch16-224-in21k --dataset_name cifar10 --output_dir /tmp/vit_hf/results/ --remove_unused_columns False --do_train --learning_rate 0.0005 --per_device_train_batch_size 128 --evaluation_strategy no --save_strategy no --load_best_model_at_end True --save_total_limit 3 --seed 1337 --use_habana --use_lazy_mode --gaudi_config_name Habana/vit --throughput_warmup_steps 6 --image_column_name img --num_train_epochs 5 --logging_steps 20 --bf16 --dataloader_num_workers 1 --non_blocking_data_copy True --pipelining_fwd_bwd True --use_hpu_graphs_for_inference
Wav2Vec2.0 AC: Go to the GitHub page for the Optimum for Intel Gaudi library for audio classification. Follow the setup instructions, and then run the model directly from the audio-classification folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 HF_DATASETS_TRUST_REMOTE_CODE=True python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_audio_classification.py --model_name_or_path facebook/wav2vec2-base --dataset_name common_language --overwrite_output_dir --remove_unused_columns False --do_train --do_eval --learning_rate 0.0003 --warmup_ratio 0.1 --use_habana --use_lazy_mode --attention_mask False --gaudi_config_name Habana/wav2vec2 --throughput_warmup_steps 3 --use_hpu_graphs_for_training --use_hpu_graphs_for_inference --max_length_seconds 8 --num_train_epochs 5 --per_device_train_batch_size 16 --per_device_eval_batch_size 32 --seed 0 --audio_column_name audio --label_column_name language --output_dir /tmp/wav2vec2-base-lang-id
Wav2Vec2.0 ASR: Go to the GitHub page for the Optimum for Intel Gaudi library for speech recognition. Follow the setup instructions, and then run the model directly from the speech-recognition folder.
MASTER_ADDR=127.0.0.1 MASTER_PORT=12345 PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_speech_recognition_ctc.py --model_name_or_path facebook/wav2vec2-large-lv60 --dataset_name librispeech_asr --output_dir /tmp/wav2vec2-librispeech-clean-100h-demo-dist --dataset_config_name clean --train_split_name train.100 --eval_split_name validation --per_device_train_batch_size 4 --learning_rate 0.0003 --overwrite_output_dir --text_column_name text --warmup_steps 500 --layerdrop 0.0 --freeze_feature_encoder --use_habana --use_lazy_mode --gaudi_config_name Habana/wav2vec2 --throughput_warmup_steps 20 --do_train --do_eval --adjust_throughput --chars_to_ignore '",?.!-;:\"“%‘”"' --use_hpu_graphs_for_training --use_hpu_graphs_for_inference --num_train_epochs 2 --preprocessing_num_workers 64 --per_device_eval_batch_size 8