As generative AI develops rapidly, the need to run large language models (LLMs) on client hardware (such as laptops, desktops, or workstations) is becoming increasingly significant. Deploying LLMs directly on client hardware can offer more personalized experiences, ensure data privacy, and foster technological innovation. A growing number of technology companies and developers are dedicated to effectively deploying and running LLMs on client hardware. Intel® Core™ Ultra processors and Intel® Arc™ A-series graphics represent ideal platforms for LLM inference.
The IPEX-LLM library (previously known as BigDL-LLM) is a PyTorch* library for running LLMs on Intel CPUs and GPUs with low latency. The library contains state-of-the-art optimizations for LLM inference and fine-tuning, low-bit (int4, FP4, int8, and FP8) LLM accelerations, and seamless integration with community libraries such as Hugging Face*, LangChain*, LlamaIndex, and vLLM. This article introduces the performance and the instructions for running LLM inference on Intel Core Ultra processors and Intel Arc A-series graphics using the IPEX-LLM library.
LLM Inference on Intel® Core™ Ultra Processors
These processors are designed and optimized for high-performance slimline laptops and are suitable for local deployment of generative AI workloads such as LLM inference.
The following chart shows the next token latency for LLMs ranging from 6 billion to 13 billion parameters running on an Intel Core Ultra processor. The tests were conducted on Windows* 11 with 1,024 input tokens and a batch size of 1, using the IPEX-LLM library.
Figure 1. Next token latency on Intel® Core™ Ultra. For details, see Configurations and Disclaimers.
LLM Inference on Intel® Arc™ A-series Graphics
Intel Arc A-series graphics is optimized to deliver a premium gaming experience and also provides essential acceleration for advanced content creation and workloads such as LLM inference.
The following chart shows the next token latency for LLMs ranging from 6 billion to 13 billion parameters running on Intel Arc A-series A770 graphics (with an Intel® Core™ i7-12700 processor as the host platform) using the IPEX-LLM library. The tests were conducted on Ubuntu* 22.04 with 1,024 input tokens and a batch size of 1.
Figure 2. Next token latency on Intel Arc A-series graphics. For detailed information, see Configurations and Disclaimers.
Running LLMs on Your Local PC
It just takes a few steps to set up and run LLM inference using the IPEX-LLM library on your local PC (such as a laptop equipped with an Intel Core Ultra processor or a desktop equipped with Intel Arc A-series graphics).
Installation
Go to the instructions for installing the IPEX-LLM library in a Windows* environment for Intel Core Ultra processors or a Linux* environment for Intel Arc A-series graphics. More details are available in the Installation Guide.
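For reference, a GPU-enabled installation of the library is typically a single pip command along the following lines; check the Installation Guide for the exact command and wheel index URL for your platform and driver version:
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/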
Inferencing LLMs with Low-Bit Optimizations
To run your favorite LLM models on Intel Core Ultra processors and Intel Arc A-series graphics, see the GPU examples.
You can use the Hugging Face Transformers API for LLM inference with the IPEX-LLM library. You only need to change the import statement and set load_in_4bit=True in the from_pretrained call to enable the IPEX-LLM library's low-bit optimizations. The following is an example of the code change:
# Use the IPEX-LLM drop-in replacement for the Hugging Face transformers class
from ipex_llm.transformers import AutoModelForCausalLM

# load_in_4bit=True converts the weights to int4; .to("xpu") places the model on the Intel GPU
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True).to("xpu")
The model is converted to low bit (int4 in this example) and loaded onto the target device (the XPU); afterward, various hardware and software optimizations are applied to accelerate LLM inference.
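For illustration, the following sketch shows how the converted model can then be used for text generation. It assumes the standard Hugging Face tokenizer for the same model; the prompt and generation parameters are placeholders you would adjust for your own model:
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/path/to/model/')
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    # Generation runs on the Intel GPU; the first generate call is typically slower due to warm-up
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))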
More examples are available, including FP8, int8, and FP4 inference; saving and loading IPEX-LLM low-bit models; and directly loading quantized models in GGUF, Activation-aware Weight Quantization (AWQ), and GPTQ formats. You can also build IPEX-LLM library applications with community libraries such as LangChain and LlamaIndex.
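For example, the save and load flow for low-bit models follows the pattern below (a minimal sketch based on the IPEX-LLM examples; the path is a placeholder):
# Save the int4-converted weights so later runs can skip the conversion step
model.save_low_bit('/path/to/low-bit-model/')

# Load the already-converted weights directly and place the model on the GPU
model = AutoModelForCausalLM.load_low_bit('/path/to/low-bit-model/').to("xpu")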
Benchmarking
You can run LLM inference benchmarking on an Intel Core Ultra processor or Intel Arc A-series graphics. See the instructions to prepare the hardware and software environment, update the test scripts for your own test configuration, and interpret the test results in the output file.
Before benchmarking, it is recommended to run the IPEX-LLM environment check utility scripts to verify the IPEX-LLM installation and runtime environment.
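As a rough illustration of the next token latency metric used in the benchmark (not a substitute for the provided benchmark scripts), the per-token decoding latency can be approximated by timing generate calls, assuming the model and tokenizer from the earlier example:
import time
import torch

prompt = "..."  # placeholder; the benchmark below uses a 1,024-token prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32)    # warm-up run
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1)     # roughly the first token (prefill) time
    t1 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=128)   # 128 output tokens, as in the tests below
    t2 = time.perf_counter()

# Subtract the prefill time and average over the remaining decode steps
next_token_latency = ((t2 - t1) - (t1 - t0)) / 127
print(f"Approximate next token latency: {next_token_latency * 1000:.1f} ms")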
Summary
This article introduced how to run state-of-the-art LLMs on Intel Core Ultra processors and Intel Arc A-series graphics, and presented the performance data. More information is available for Intel Core Ultra processors and Intel Arc A-series graphics. You can also visit the IPEX-LLM library on GitHub* for the latest updates and LLM examples.
Related Software
PyTorch* Optimizations from Intel
Intel is one of the largest contributors to PyTorch*, providing regular upstream optimizations to the PyTorch deep learning framework that deliver superior performance on Intel® architectures. The AI Tools include the latest binary version of PyTorch tested to work with the rest of the kit, along with Intel® Extension for PyTorch*, which adds the newest Intel optimizations and usability features.
Acknowledgements
We would like to thank Yuwen Hu, Yishuo Wang, and Jason Dai for the IPEX-LLM library optimizations on Intel Core Ultra processors and Intel Arc A-series graphics. Special thanks to Padma Apparao and Juan Ouyang for their great support.
Configurations and Disclaimers
The benchmark uses next token latency to measure the inference performance. Batch size 1, greedy search, input tokens: 1,024, output tokens: 128, data type: int4. The measurements used BigDL-LLM 2.5.0b20240303 for the int4 benchmark, PyTorch 2.1.0a0+cxx11.abi, Intel® Extension for PyTorch* 2.1.10+xpu, transformers: 4.31.0 for Llama 2 and 4.36.0 for Mistral*, and Intel® oneAPI Base Toolkit 2024.0. Windows 11 22H2 LTSC. Tests performed by Intel in March 2024.
Intel Arc A-series A770 graphics results were measured on a system with an Intel Core i7-12700 processor and 32 GB DDR4-3200, Ubuntu 22.04 with kernel version 5.19.0-50. Tests performed by Intel in March 2024.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation.