Learn How to Reduce Model Latency When Deploying Meta* Llama 3 on CPUs
The much-anticipated release of the third generation of Meta* Llama is here, and this tutorial shows you how to deploy this state-of-the-art large language model (LLM) optimally. We focus on performing weight-only quantization (WOQ) to compress the 8B parameter model and improve inference latency.
Llama 3
To date, the Llama 3 family includes models ranging from 8B to 70B parameters, with more versions coming in the future. The models come with a permissive Meta Llama 3 license. Review and accept the terms required to use them. This marks an exciting chapter for the Llama model family and open-source AI.
Architecture
Llama 3 is an auto-regressive LLM based on a decoder-only transformer. Compared to Llama 2, the Meta team has made the following notable improvements:
- Adoption of grouped query attention (GQA), which improves inference efficiency
- Optimized tokenizer with a vocabulary of 128K tokens designed to encode language more efficiently (see the quick tokenizer check after this list)
- Trained on a 15 trillion token dataset, which is 7x larger than Llama 2’s training dataset and includes 4x as much code
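As a quick way to see the new tokenizer for yourself, the following minimal sketch loads it and inspects the vocabulary size and a sample tokenization. It assumes the Llama 3 license has already been accepted on Hugging Face and that you are logged in (both covered later in this tutorial).

from transformers import AutoTokenizer

# Minimal sketch: inspect the Llama 3 tokenizer (gated model; requires an accepted license and login).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(len(tokenizer))  # expected to be on the order of 128K tokens
print(tokenizer.tokenize("Low-latency Llama 3 inference on CPUs"))  # see how the text is split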
Figure 1 is the result of print(model), where model is meta-llama/Meta-Llama-3-8B-Instruct. The model comprises 32 LlamaDecoderLayers, each composed of LlamaAttention self-attention components. Additionally, it has LlamaMLP, LlamaRMSNorm, and a linear head.
Figure 1. Output of print(model) showcasing the distribution of layers across Llama-3-8B-Instruct's architecture
Language Modeling Performance
The model was evaluated on various industry-standard language modeling benchmarks, such as Massive Multitask Language Understanding (MMLU), Google*-Proof Q&A (GPQA), HumanEval, Grade School Math 8K (GSM8K), MATH, and more. Figure 2 reviews the performance of instruction-tuned models. The most remarkable aspect of these figures is that the Llama 3 8B parameter model outperforms Llama 2 70B by 62% to 143% across the reported benchmarks while being an 88% smaller model.
Figure 2. Summary of Llama 3 instruction model performance metrics across the MMLU, GPQA, HumanEval, GSM8K, and MATH LLM benchmarks
The increased language modeling performance, permissive licensing, and architectural efficiencies included with this latest Llama generation mark the beginning of an exciting chapter in the generative AI (GenAI) space. Explore how we can optimize inference on CPUs for scalable, low-latency deployments of Llama 3.
Optimize Llama 3 Inference with PyTorch*
A previous article covers the importance of model compression and overall inference optimization in developing LLM-based applications. This tutorial focuses on applying WOQ to meta-llama/Meta-Llama-3-8B-Instruct. WOQ offers a balance between performance, latency, and accuracy, with options to quantize to int4 or int8. A key component of WOQ is the dequantization step, which converts int4 or int8 weights back to bfloat16 before computation.
Figure 3. Simple illustration of WOQ with prequantized weights in orange and quantized weights in green. Note that this depicts the initial quantization to int4 or int8 and dequantization to FP16 or bfloat16 for the computation step.
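To build intuition for what Figure 3 depicts, here is a minimal conceptual sketch of the quantize/dequantize round trip for a single weight tensor in plain PyTorch. It is illustrative only; Intel Extension for PyTorch implements weight-only quantization internally with optimized kernels, not this code.

import torch

# Conceptual int8 weight-only quantization of one weight tensor (not the kernel ipex uses).
w = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Symmetric per-tensor quantization: a single scale maps the weight range onto int8.
scale = w.float().abs().max() / 127.0
w_int8 = torch.clamp(torch.round(w.float() / scale), -128, 127).to(torch.int8)  # stored compactly

# At compute time, weights are dequantized back to bfloat16 before the matmul.
w_dequant = (w_int8.float() * scale).to(torch.bfloat16)

print(f"bytes per element: {w_int8.element_size()} (int8) vs {w.element_size()} (bfloat16)")
print(f"max abs reconstruction error: {(w.float() - w_dequant.float()).abs().max().item():.4f}")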
Environment Setup
You need approximately 60 GB of RAM to perform WOQ on Llama-3-8B-Instruct. This includes about 30 GB to load the full model and approximately 30 GB of peak memory during quantization. The quantized (WOQ) model itself consumes only about 10 GB of RAM, which means you can free approximately 50 GB of RAM by releasing the full model from memory.
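If you are unsure whether your machine meets this requirement, a quick pre-flight check is sketched below. It assumes the psutil package is available in your environment; psutil is not part of this tutorial's dependency list.

import psutil  # assumption: psutil is installed in the environment

# Rough pre-flight check: WOQ on Llama-3-8B-Instruct needs roughly 60 GB of free RAM.
available_gb = psutil.virtual_memory().available / 1024**3
print(f"Available RAM: {available_gb:.1f} GB")
if available_gb < 60:
    print("Warning: quantization may exhaust memory on this machine.")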
You can run this tutorial in the free JupyterLab environment on Intel® Tiber™ Developer Cloud. This environment offers a 4th generation Intel® Xeon® CPU with 224 threads and 504 GB of memory, more than enough to run this code.
If running this in your own integrated development environment, you may need to address additional dependencies like installing Jupyter* or configuring a conda* or Python* environment. Before getting started, ensure that you have the following dependencies installed.
intel-extension-for-pytorch==2.2
transformers==4.35.2
torch==2.2.0
huggingface_hub
Access and Configure Llama 3
You need a Hugging Face* account to access Llama 3's model and tokenizer.
Do the following:
- From the Settings menu, select Access Tokens (see Figure 4) and then create a token.
Figure 4. Snapshot of the Hugging Face token configuration console
- Run the following code:
from huggingface_hub import notebook_login, Repository

# Log in to Hugging Face
notebook_login()
- Copy your access token, and then paste it into the Token field generated inside your Jupyter cell.
- Go to meta-llama/Meta-Llama-3-8B-Instruct and evaluate the terms and license before providing your information and submitting the Llama 3 access request.
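Before moving on, you can optionally confirm that the notebook is authenticated. This verification is not part of the access steps above; it simply queries the Hugging Face Hub for the account associated with the active token.

from huggingface_hub import whoami

# Optional sanity check: prints the account name tied to the active token.
print(whoami()["name"])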
Quantize Llama-3-8B-Instruct with WOQ
Use the Intel® Extension for PyTorch* to apply WOQ to Llama 3. This extension contains the latest PyTorch optimizations for Intel® hardware. Follow these steps to quantize and perform inference with an optimized Llama 3 model:
- Llama 3 model and tokenizer: Import the required packages and use the AutoModelForCausalLM.from_pretrained() and AutoTokenizer.from_pretrained() methods to load the Llama-3-8B-Instruct weights and tokenizer.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

Model = 'meta-llama/Meta-Llama-3-8B-Instruct'

model = AutoModelForCausalLM.from_pretrained(Model)
tokenizer = AutoTokenizer.from_pretrained(Model)
- Quantization recipe configuration: Configure the WOQ quantization recipe. Set the weight_dtype variable to the desired in-memory datatype, choosing torch.quint4x2 or torch.qint8 for int4 and int8, respectively. Additionally, use lowp_mode to define the dequantization precision. For now, keep it as ipex.quantization.WoqLowpMode.NONE to retain the default BF16 computation precision.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.quint4x2,  # or torch.qint8
    lowp_mode=ipex.quantization.WoqLowpMode.NONE,  # or FP16, BF16, INT8
)

checkpoint = None  # optionally load int4 or int8 checkpoint

# Model optimization and quantization
model_ipex = ipex.llm.optimize(
    model,
    quantization_config=qconfig,
    low_precision_checkpoint=checkpoint,
)

del model
- Use ipex.llm.optimize() to apply WOQ, and then del model to delete the full model from memory and free about 30 GB of RAM.
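An optional follow-up, not part of the snippet above, is to nudge Python's garbage collector so the freed memory is reclaimed promptly:

import gc

# Optional: prompt the garbage collector to reclaim the deleted full-precision model's memory.
gc.collect()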
- Prompt Llama 3: Llama 3, like Llama 2, has a predefined prompting template for its instruction-tuned models. Using this template, developers can define specific model behavior instructions and provide user prompts and conversation history.
system= """\n\n You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. If you don't know the answer to a question, please don't share false information.""" user= "\n\n You are an expert in astronomy. Can you tell me 5 fun facts about the universe?" model_answer_1 = 'None' llama_prompt_tempate = f""" <|begin_of_text|>\n<|start_header_id|>system<|end_header_id|>{system} <|eot_id|>\n<|start_header_id|>user<|end_header_id|>{user} <|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>{model_answer_1}<|eot_id|> """ inputs = tokenizer(llama_prompt_tempate, return_tensors="pt").input_ids
Provide the required fields and then use the tokenizer to convert the entire template into tokens for the model.
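As an alternative to hand-writing the template, recent versions of transformers can build the prompt from the chat template bundled with the tokenizer. This is a sketch under the assumption that your installed transformers version supports apply_chat_template and that the Llama 3 tokenizer ships a chat template; the hand-written template above works without it.

# Alternative sketch: let the tokenizer apply its bundled chat template (if supported).
messages = [
    {"role": "system", "content": "You are a helpful, respectful and honest assistant."},
    {"role": "user", "content": "You are an expert in astronomy. Can you tell me 5 fun facts about the universe?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")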
- Llama 3 inference: For text generation, use TextStreamer to generate a real-time inference stream instead of printing the entire output at once. This results in a more natural text generation experience for readers. Provide the configured streamer to model_ipex.generate() along with the other text-generation parameters.
# Create a streamer that prints tokens as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)

with torch.inference_mode():
    tokens = model_ipex.generate(
        inputs,
        streamer=streamer,
        pad_token_id=128001,
        eos_token_id=128001,
        max_new_tokens=300,
        repetition_penalty=1.5,
    )
When the code runs, the model starts generating outputs. Keep in mind that these are unfiltered and unguarded outputs. For real-world use cases, you will need to apply additional postprocessing and guardrails.
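If you also want the complete response as a single string after streaming finishes, the returned token IDs can be decoded; this small optional step is not shown in the figure below.

# Optional: recover the full generated text (prompt plus completion) from the returned token IDs.
full_text = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(full_text)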
Figure 5. Streamed inference of Llama-3-8B-Instruct with WOQ compression at int4 running in the JupyterLab environment on Intel Tiber Developer Cloud
That’s it. With less than 20 lines of code, you now have a low-latency, CPU-optimized version of the latest state-of-the-art LLM in the ecosystem.
Considerations for Deployment
Depending on your inference service deployment strategy, there are a few things to consider:
- If deploying instances of Llama 3 in containers, WOQ offers a smaller memory footprint and lets you serve multiple inference services of the model on a single hardware node.
- When deploying multiple inference services, optimize the threads and memory reserved for each service instance. Leave enough additional memory (about 4 GB) and threads (around 4 threads) to handle background processes (see the thread-budget sketch after this list).
- Consider saving the WOQ version of the model and storing it in a model registry to eliminate the need to requantize the model for each instance deployment.
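For the thread-budget point above, here is a minimal sketch of reserving a fixed slice of cores per instance. The core count and the four-core reserve are assumptions to adapt to your node, and OMP_NUM_THREADS is normally exported in the service's launch environment before Python starts; it is set here only for illustration.

import os
import torch

# Hypothetical per-instance thread budget: reserve roughly 4 threads for background processes.
instance_threads = max(1, (os.cpu_count() or 8) - 4)

os.environ["OMP_NUM_THREADS"] = str(instance_threads)  # normally set before the process starts
torch.set_num_threads(instance_threads)                # limits intra-op parallelism in PyTorch

print(f"Serving with {torch.get_num_threads()} threads")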
Conclusion
The Meta Llama 3 LLM family delivers remarkable improvements over previous generations with a diverse range of configurations. This tutorial explored enhancing CPU inference with weight-only quantization (WOQ), a technique that reduces latency with minimal impacts to accuracy.
By integrating the new generation of performance-oriented Llama 3 LLMs with optimization techniques like WOQ, developers can unlock new possibilities for GenAI applications. This combination simplifies the hardware requirements to achieve high-fidelity, low-latency results from LLMs integrated into new and existing systems.
Next Steps
A few exciting things to try next would be:
- Experiment with quantization levels: Test int4 and int8 quantization to identify the best compromise between performance and accuracy for your specific applications.
- Performance monitoring: It is crucial to continuously assess the performance and accuracy of the Llama 3 model across different real-world scenarios to ensure that quantization maintains the desired effectiveness.
- Test more Llamas: Explore the entire Llama 3 family and evaluate the impact of WOQ and other PyTorch quantization recipes.
Check out and incorporate other AI and machine learning framework optimizations and end-to-end portfolios of tools from Intel into your AI workflow. Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of AI development from Intel to help you prepare, build, deploy, and scale your AI solutions.
Additional Resources
- AI Developer Tools and Resources from Intel
- oneAPI Unified Programming Model
- Documentation for PyTorch* Optimizations from Intel
- Documentation for Intel® Extension for PyTorch*
- AI Concepts: Machine Learning
- AI Concepts: Inference
- AI Concepts: Computer Vision
Articles
- How to Build an Interactive Chat-Generation Model Using DialoGPT and PyTorch*
- Language Identification: Building an End-to-End AI Solution Using PyTorch*
- PyTorch* Quantization Using Intel® Neural Compressor