| Model | # HPUs | Precision | Input Length | Output Length | Throughput | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 12772 tokens/sec | 1230 | Optimum Habana 1.12.1 |
| LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4787 tokens/sec | 163 | Optimum Habana 1.12.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1318 tokens/sec | 94 | Optimum Habana 1.12.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1967 tokens/sec | 81 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 128 | 128 | 17331 tokens/sec | 2429 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 128 | 2048 | 11106 tokens/sec | 289 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 2048 | 128 | 1762 tokens/sec | 179 | Optimum Habana 1.12.1 |
| LLaMA 3 8B | 1 | fp8 | 2048 | 2048 | 5379 tokens/sec | 155 | Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2784 tokens/sec | 1750 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 128 | 2048 | 3186 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 292 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 1392 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.12.1 |
| Mistral 7B | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.12.1 |

The Measurement and Run commands for each configuration are listed below.
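
Assuming the Throughput column reports the aggregate across all HPUs used by a row (so the two-HPU LLaMA 2 70B rows combine both devices), per-device throughput can be derived by dividing by the HPU count. A minimal shell sketch, with two table rows inlined for illustration:

```shell
# Per-HPU throughput, assuming the Throughput column is the aggregate across
# all HPUs used by a row. Two rows from the table above are inlined here.
awk -F'[ ]*[|][ ]*' 'NF > 7 {
  printf "%s: %d tokens/sec per HPU\n", $2, ($7 + 0) / $3
}' <<'EOF'
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 12772 tokens/sec | 1230 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2784 tokens/sec | 1750 |
EOF
```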
Go to the Optimum Habana text-generation example in the Hugging Face GitHub repo and follow the setup instructions:

* `pip install -r requirements.txt`
* `pip install -r requirements_lm_eval.txt`
* `pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.17.0`

Run all commands below directly from the text-generation folder. For each configuration, run the Measurement command first, then the corresponding Run command.

Measurement command for LLaMA 2 7B (identical for all configurations of this model):

    QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python3 run_lm_eval.py -o ./test_results_measure.json --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --bf16 --batch_size 1 --bucket_size=128 --bucket_internal --use_flash_attention --flash_attention_recompute --warmup 0
Run command for LLaMA 2 7B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 128 --bf16 --batch_size 1230 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
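
The Measurement-then-Run sequence can also be scripted. The sketch below is illustrative only: the `step` helper and `DRY_RUN` variable are this document's own invention, and the flag lists are abbreviated; on a Gaudi machine, substitute the full commands above and unset `DRY_RUN`:

```shell
# Hedged sketch: calibration measurement first, then the quantized run.
# DRY_RUN=1 makes each step only print its command instead of executing it.
# Flag lists are abbreviated; see the full Measurement/Run commands above.
DRY_RUN=1
step() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}
step env QUANT_CONFIG=./quantization_config/maxabs_measure.json \
  python3 run_lm_eval.py -o ./test_results_measure.json \
  --model_name_or_path meta-llama/Llama-2-7b-hf/ --bf16 --batch_size 1 --warmup 0
step env QUANT_CONFIG=./quantization_config/maxabs_quant.json \
  python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ \
  --bf16 --batch_size 1230 --max_input_tokens 128 --max_new_tokens 128
```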
Run command for LLaMA 2 7B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 128 --bf16 --batch_size 163 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 7B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 2048 --bf16 --batch_size 94 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 7B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf/ --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 81 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Measurement command for LLaMA 3 8B (identical for all configurations of this model):

    QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python3 run_lm_eval.py -o ./test_results_measure.json --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --bf16 --batch_size 1 --bucket_size=128 --bucket_internal --use_flash_attention --flash_attention_recompute --warmup 0
Run command for LLaMA 3 8B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 128 --bf16 --batch_size 2429 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 3 8B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 128 --bf16 --batch_size 289 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 3 8B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 2048 --bf16 --batch_size 179 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Run command for LLaMA 3 8B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --attn_softmax_bf16 --trim_logits --warmup 2 --use_kv_cache --use_hpu_graphs --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 155 --flash_attention_causal_mask --book_source --use_flash_attention --flash_attention_recompute
Measurement command for LLaMA 2 70B, launched across two HPUs with DeepSpeed (identical for all configurations of this model):

    QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_lm_eval.py -o ./test_results_measure.json --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --bf16 --batch_size 1 --bucket_size=128 --bucket_internal --use_flash_attention --flash_attention_recompute --warmup 0
Run command for LLaMA 2 70B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 128 --bf16 --batch_size 1750 --disk_offload --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 70B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 128 --bf16 --batch_size 750 --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 70B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 128 --max_input_tokens 2048 --bf16 --batch_size 95 --disk_offload --use_flash_attention --flash_attention_recompute
Run command for LLaMA 2 70B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf/ --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 78 --disk_offload --use_flash_attention --flash_attention_recompute
Measurement command for Mistral 7B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1
Run command for Mistral 7B (input 128, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs
Measurement command for Mistral 7B (used before each of the remaining configurations):

    QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --n_iterations 1 --use_kv_cache --reuse_cache --bf16 --batch_size 1
Run command for Mistral 7B (input 128, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --limit_hpu_graphs --bucket_internal --bucket_size 128
Run command for Mistral 7B (input 2048, output 128):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs
Run command for Mistral 7B (input 2048, output 2048):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --limit_hpu_graphs --bucket_internal --bucket_size 128