| Model | # HPUs | Precision | Input Length | Output Length | Throughput | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 12772 tokens/sec | 1230 | Optimum Habana 1.12.1 |
| LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4787 tokens/sec | 163 | Optimum Habana 1.12.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1318 tokens/sec | 94 | Optimum Habana 1.12.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1967 tokens/sec | 81 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 128 | 128 | 17331 tokens/sec | 2429 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 128 | 2048 | 11106 tokens/sec | 289 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 2048 | 128 | 1762 tokens/sec | 179 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 2048 | 2048 | 5379 tokens/sec | 155 | Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2784 tokens/sec | 1750 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 128 | 2048 | 3186 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 292 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 1392 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.12.1 |

The Measurement and Run commands for each configuration are listed below.
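
Assuming the Throughput column reports the aggregate across all HPUs used by a row (so the two-HPU LLaMA 2 70B rows combine both devices), per-device throughput can be derived by dividing by the HPU count. A minimal shell sketch, with two table rows inlined for illustration:

```shell
# Per-HPU throughput, assuming the Throughput column is the aggregate across
# all HPUs used by a row. Two rows from the table above are inlined here.
awk -F'[ ]*[|][ ]*' 'NF > 7 {
  printf "%s: %d tokens/sec per HPU\n", $2, ($7 + 0) / $3
}' <<'EOF'
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 12772 tokens/sec | 1230 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2784 tokens/sec | 1750 |
EOF
```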
Go to the Optimum Habana text-generation example in the Hugging Face GitHub repo and follow the setup instructions:

* `pip install -r requirements.txt`
* `pip install -r requirements_lm_eval.txt`
* `pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.17.0`

Run all commands below directly from the text-generation folder. For each configuration, run the Measurement command first, then the corresponding Run command.

Measurement command for LLaMA 2 7B (identical for all configurations of this model):

    QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python3 run_lm_eval.py -o ./test_results_measure.json --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --bf16 --batch_size 1 --bucket_size=128 --bucket_internal --use_flash_attention --flash_attention_recompute --warmup 0
Run command for LLaMA 2 7B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 128 --bf16 --batch_size 1230 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
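
The Measurement-then-Run sequence can also be scripted. The sketch below is illustrative only: the `step` helper and `DRY_RUN` variable are this document's own invention, and the flag lists are abbreviated; on a Gaudi machine, substitute the full commands above and unset `DRY_RUN`:

```shell
# Hedged sketch: calibration measurement first, then the quantized run.
# DRY_RUN=1 makes each step only print its command instead of executing it.
# Flag lists are abbreviated; see the full Measurement/Run commands above.
DRY_RUN=1
step() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}
step env QUANT_CONFIG=./quantization_config/maxabs_measure.json \
  python3 run_lm_eval.py -o ./test_results_measure.json \
  --model_name_or_path meta-llama/Llama-2-7b-hf/ --bf16 --batch_size 1 --warmup 0
step env QUANT_CONFIG=./quantization_config/maxabs_quant.json \
  python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ \
  --bf16 --batch_size 1230 --max_input_tokens 128 --max_new_tokens 128
```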
Run command for LLaMA 2 7B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 128 --bf16 --batch_size 163 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 7B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 2048 --bf16 --batch_size 94 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 7B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 81 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Measurement command for LLaMA 3 8B (identical for all configurations of this model):

    QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python3 run_lm_eval.py -o ./test_results_measure.json --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --bf16 --batch_size 1 --bucket_size=128 --bucket_internal --use_flash_attention --flash_attention_recompute --warmup 0
Run command for LLaMA 3 8B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 128 --bf16 --batch_size 2429 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 3 8B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 128 --bf16 --batch_size 289 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 3 8B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 2048 --bf16 --batch_size 179 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 3 8B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 155 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Measurement command for LLaMA 2 70B, launched across two HPUs with DeepSpeed (identical for all configurations of this model):

    QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_lm_eval.py -o ./test_results_measure.json --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --bf16 --batch_size 1 --bucket_size=128 --bucket_internal --use_flash_attention --flash_attention_recompute --warmup 0
Run command for LLaMA 2 70B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 128 --bf16 --batch_size 1750 --disk_offload --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 70B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 128 --bf16 --batch_size 750 --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 70B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 2048 --bf16 --batch_size 95 --disk_offload --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 70B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 78 --disk_offload --use_flash_attention --flash_attention_recompute
Measurement command for Mistral 7B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1
Run command for Mistral 7B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs
Measurement command for Mistral 7B (used before each of the remaining configurations):

    QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --n_iterations 1 --use_kv_cache --reuse_cache --bf16 --batch_size 1
Run command for Mistral 7B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --limit_hpu_graphs --bucket_internal --bucket_size 128
Run command for Mistral 7B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs
Run command for Mistral 7B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --limit_hpu_graphs --bucket_internal --bucket_size 128