In our previous blog, Intel announced functional support for Llama 4 models on Intel® Gaudi® 3 AI accelerators and Intel® Xeon® processors. In this blog, we go a step further and share our performance benchmarks for the Llama 4 herd of models, Scout and Maverick.
As a reminder, Llama 4 Scout is a general-purpose multimodal model with 17 billion active parameters, 16 experts, and 109 billion total parameters that delivers state-of-the-art performance for its class. Scout dramatically increases the supported context length from 128K tokens in Llama 3 to an industry-leading 10 million tokens. Llama 4 Maverick contains 17 billion active parameters, 128 experts, and 400 billion total parameters, offering higher quality at a lower price compared to Llama 3.3 70B.
Intel® Gaudi® AI Accelerators
Intel Gaudi 3 AI accelerators are designed from the ground up for AI workloads. Each accelerator combines vector processors with eight large Matrix Multiplication Engines, in contrast to the many small matrix multiplication units found in a GPU; this reduces data transfers and improves energy efficiency. Gaudi 3 is equipped with 128 GB of HBM2e memory, and the new Llama 4 Maverick model can run on a single Gaudi 3 node with eight accelerators.
vLLM, a fast and easy-to-use library for LLM inference and serving, has added support for the Llama 4 herd of models. Intel Gaudi 3 is one of the officially supported hardware backends of vLLM, and we followed the same configurations as the vLLM Llama 4 blog to conduct our performance studies on Intel Gaudi 3.
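As a rough sketch of what serving looks like with vLLM's standard CLI, the command below launches Llama 4 Scout with tensor parallelism across 8 accelerators. The model ID and flag values are illustrative assumptions, not the exact configuration used in the benchmarks; the blog's Gaudi runs use the Gaudi-enabled vLLM fork linked in the configuration notes below.

```shell
# Illustrative sketch: serve Llama 4 Scout with vLLM across 8 accelerators.
# Assumes a Gaudi-enabled vLLM install; model ID and flags are examples only.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 8192
```

Once the server is up, it exposes an OpenAI-compatible endpoint that standard clients can query.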
Table 1. Llama 4 inference on 8 Intel® Gaudi® 3 Accelerators
Intel® Xeon® Processors
Intel Xeon processors are an excellent choice for running Mixture of Experts (MoE) models in deployments with small to medium concurrency. Xeon's substantial memory capacity can comfortably hold the large weights associated with MoE models, and because only a subset of experts is activated for each input, Intel Xeon 6 processors are particularly well-suited to efficient MoE inference with Intel® Advanced Matrix Extensions (Intel® AMX) instructions and MRDIMMs.
SGLang is one of the most popular LLM serving frameworks in the open-source community. Integrating Intel Xeon support into this framework creates a compelling solution that offers strong performance, reliability, and cost-effectiveness, which is especially advantageous for large MoE-based LLMs.
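A minimal sketch of launching an SGLang server on a CPU backend is shown below. This assumes an SGLang build with the Xeon/CPU backend enabled; the model ID, host, and port are illustrative assumptions rather than the benchmark configuration.

```shell
# Illustrative sketch: launch an SGLang server on Intel Xeon (CPU backend).
# Assumes an SGLang build with CPU support; model ID and flags are examples.
python -m sglang.launch_server \
    --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --device cpu \
    --host 0.0.0.0 --port 30000
```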
This solution delivers promising results, as shown in Table 2, and has been verified to work well in multimodal scenarios.
Table 2. Llama 4 inference on Intel® Xeon® 6 Processor
Intel is actively working with the SGLang community to upstream the optimizations for Xeon.
Summary
In conclusion, deploying the Llama 4 herd of models with vLLM on Intel Gaudi 3 accelerators offers competitive throughput, while Xeon processors are well suited for MoE models and offer an option for broader deployments across standard servers. More performance enhancements are on the way, but developers can start deploying the Llama 4 herd of models on Intel platforms today using the configurations listed below.
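Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so a deployed Llama 4 endpoint can be exercised with a plain `curl` request like the one sketched below. The port and model ID are illustrative assumptions and should match whatever your server was launched with.

```shell
# Illustrative sketch: query a running vLLM or SGLang server through the
# OpenAI-compatible chat completions API. Port and model ID are examples.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```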
Product and Performance Information
Intel Gaudi 3 Configurations: Measured with 1 Intel Gaudi server with 8 Gaudi 3 AI Accelerators. 2 socket Intel® Xeon® Platinum 8480+ CPU @ 2.00GHz. 1TB System Memory. OS: Ubuntu 22.04. Intel Gaudi software suite, version 1.20.0-543. vLLM from https://github.com/HabanaAI/vllm-fork/tree/llama4. Tested on 16th April 2025.
Intel Xeon Configurations: Measurement on Intel Xeon 6 Processor (formerly code-named Granite Rapids) using: 2x Intel® Xeon® 6 6980P with P-cores, HT On, Turbo On, NUMA 6, Integrated Accelerators Available [used]: DLB [8], DSA [8], IAA [8], QAT [on CPU, 8], Total Memory 1536GB (24x64GB MRDIMM 8800 MT/s [8800 MT/s]), BIOS BHSDCRB1.IPC.3544.D02.2410010029, microcode 0x11000314, 1x Ethernet Controller I210 Gigabit Network Connection, 1x Micron_7450_MTFDKBG960TFR 894.3G, CentOS Stream 9, 6.6.0-gnr.bkc.6.6.16.8.23.x86_64. Tested by Intel on 17th April 2025.
AI Disclaimer
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC. Results may vary.