Meta* recently released the first models of the Llama 4 herd, which enable the creation of more personalized multimodal experiences. Llama 4 models use a Mixture of Experts (MoE) architecture and are designed with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone. The Llama 4 herd includes the Llama 4 Scout and Llama 4 Maverick models.
Llama 4 Scout is a general-purpose multimodal model with 17 billion active parameters, 16 experts, and 109 billion total parameters that delivers state-of-the-art performance for its class. Scout dramatically increases the supported context length from 128K tokens in Llama 3 to an industry-leading 10 million tokens. Llama 4 Maverick contains 17 billion active parameters, 128 experts, and 400 billion total parameters, offering high quality at a lower cost than Llama 3.3 70B.
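The gap between active and total parameters comes from MoE routing: for each token, a router selects only a few experts, so only a fraction of the model's weights participate in that token's computation. The sketch below illustrates top-k expert routing in PyTorch; the dimensions, expert count, and top-k value are illustrative placeholders, not Llama 4's actual configuration.

```python
# A minimal sketch of top-k Mixture-of-Experts routing, showing why only a
# fraction of a model's total parameters is "active" per token. All sizes
# below are illustrative, not Llama 4's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)           # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                  # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([8, 512]); each token touched only top_k experts
```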
As a close partner of Meta, Intel today announces functional support for Llama 4 models across Intel® Gaudi® 3 AI accelerators and Intel® Xeon® processors. Intel Gaudi 3 AI accelerators are designed from the ground up for AI workloads; they pair Tensor cores with eight large Matrix Multiplication Engines, in contrast to the many small matrix multiplication units found in a typical GPU. This design reduces data transfers and improves energy efficiency. The new Llama 4 Maverick model can be run on a single Gaudi 3 node with 8 accelerators.
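As one possible way to run Maverick across the 8 accelerators of a Gaudi 3 node, the sketch below uses vLLM offline inference with tensor parallelism. It assumes a vLLM build with Intel Gaudi support is installed; the Hugging Face model id is an assumption (check the meta-llama organization for the exact, gated repository name).

```python
# A minimal sketch of offline inference with vLLM on a single 8-accelerator
# Gaudi 3 node, assuming a Gaudi-enabled vLLM build is installed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed model id
    tensor_parallel_size=8,  # shard the 400B total parameters across 8 accelerators
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Summarize the benefits of mixture-of-experts models."], params
)
print(outputs[0].outputs[0].text)
```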
Intel Xeon processors address demanding end-to-end AI workloads. Available across major cloud service providers, Intel Xeon processors have an AI engine, Intel® Advanced Matrix Extensions (AMX), in every core that unlocks new levels of performance for inference and training. The combination of Intel Xeon AMX instructions, large memory capacity, and increased memory bandwidth in Intel® Xeon® 6 processors makes Xeon a cost-efficient solution for deploying MoE models like Llama 4.
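For CPU deployment, a minimal sketch with Hugging Face Transformers is shown below, loading the model in bfloat16, the datatype AMX accelerates on Xeon. It assumes a recent Transformers release with Llama 4 support; the model id is an assumption, and the smaller Scout variant is used here to fit more easily in host memory.

```python
# A minimal sketch of text generation on a Xeon host with Hugging Face
# Transformers. bf16 matmuls can dispatch to AMX tiles via oneDNN.
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed, gated model id

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,  # AMX-friendly datatype on Xeon
    device_map="cpu",
)
result = pipe("Explain mixture-of-experts in one sentence.", max_new_tokens=64)
print(result[0]["generated_text"])
```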
Open ecosystem software such as PyTorch*, Hugging Face*, vLLM*, and OPEA is optimized for Intel Gaudi accelerators and Intel Xeon processors, helping make AI system deployment easier.
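As one example of how these ecosystem pieces compose, vLLM exposes an OpenAI-compatible server (e.g. started with `vllm serve <model> --tensor-parallel-size 8`), so a deployment on Gaudi or Xeon can be queried with the standard openai Python client. The endpoint URL and model name below are placeholders for illustration.

```python
# A minimal sketch of querying a vLLM OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint
resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed model id
    messages=[{"role": "user", "content": "What is early fusion in multimodal models?"}],
)
print(resp.choices[0].message.content)
```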
Intel is excited to support these multimodal models and will continue performance optimizations. Performance benchmarks for the Llama 4 herd of models will be published in the coming weeks.