UC Berkeley Collaborates with Intel to Drive vLLM* Performance and Efficiency

December 11, 2024

We are excited to announce that the Department of Electrical Engineering and Computer Sciences (EECS) at the University of California, Berkeley (UC Berkeley) is contributing its innovative prowess to a joint effort with Intel to take generative AI (GAI) large language model (LLM) inference efficiency to the next level.

Under the guidance of Professor Ion Stoica, Director of the Sky Computing Lab and a key contributor to the Berkeley Artificial Intelligence Research Center (BAIR), UC Berkeley is establishing a new AI Center of Excellence. Graduate students Woosuk Kwon, Zhuohan Li, and Simon Mo will help coordinate efforts across BAIR, the Sky Computing Lab, and other affiliated research labs. They will focus on optimizing performance and efficiency for the vLLM project on Intel® hardware platforms and beyond.

By focusing on Intel-optimized solutions, the center aims to push the boundaries of what is possible with generative AI, ensuring that these advanced models can deliver unparalleled performance and scalability on Intel-powered systems from data centers to client devices.

Find out more about the AI Center of Excellence launch in the official announcement:
→ UC Berkeley Announces AI Center of Excellence

New AI Center of Excellence

The Computer Science Division at UC Berkeley, part of a unified, highly integrated electrical engineering and computer science department with world-renowned research labs and a rich history at the forefront of artificial intelligence and high-performance computing, is ideally positioned to accelerate the development of system-level compute, memory, and communication optimizations that further democratize AI.

UC Berkeley and Intel are excited about this opportunity to combine UC Berkeley's research community with Intel's expertise in hardware and software innovation to help shape the future of open standards-based AI for everyone.

We are excited to collaborate with Intel on optimizing vLLM. Establishing this Center of Excellence will provide the ideal environment for this effort.

Ion Stoica, Professor of EECS, University of California, Berkeley

As long-time innovators in open source and AI development, we are delighted to collaborate with the University of California, Berkeley, as a new AI Center of Excellence member focused on enhancing the performance and efficiency of the GenAI software stack. Partnership on open technologies such as vLLM running on oneAPI software can not only drive the future of AI software but also influence hardware architecture and software development, delivering capabilities for the next generation of advanced AI applications.

Melissa Evers, VP Office of the CTO, GM of Software Ecosystem Enablement, Intel Corporation

Taking vLLM Inference Performance to the Next Level

vLLM was developed at UC Berkeley, where it is still actively maintained; Prof. Ion Stoica is its principal researcher. Upstream support currently includes Intel® Xeon® Processors, Intel® Gaudi® AI Accelerators, and Intel® Data Center GPU Max Series platforms, among other hardware platforms.

The vLLM library showcases the open source, open standards philosophy and developer-ecosystem focus that both UC Berkeley and Intel embrace. It is fast and easy to use, making trained models accessible for real-world use: it provides an inference infrastructure that lets applications send data to a model and receive predictions or outputs in real time.
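
As an illustration, here is a minimal offline-inference sketch using vLLM's Python API (the model name is only an example; any supported Hugging Face model works):

```python
from vllm import LLM, SamplingParams

# Load an example model (swap in any model vLLM supports).
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Send a prompt and receive the generated output.
outputs = llm.generate(["What makes LLM inference fast?"], sampling_params)
print(outputs[0].outputs[0].text)
```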

It achieves this through the following techniques (a configuration sketch follows the list):

  • Efficient management of attention key and value memory
  • Continuous batching of incoming requests
  • Fast model execution via graph APIs for compute-kernel memory-dependency optimization, kernel-execution batching, and launch-latency reduction
  • Offload kernel optimization
  • Advanced CPU and GPU parallelism
  • Quantization, taking advantage of various data types
  • Speculative decoding
  • Chunked prefill
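
Several of these features can be enabled directly when constructing the engine. Flag names vary across vLLM releases, so treat the following as an illustrative sketch rather than an authoritative configuration:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # example quantized checkpoint
    quantization="awq",               # quantization with reduced-precision weights
    enable_chunked_prefill=True,      # chunked prefill
)
# Continuous batching and attention key/value memory management are
# handled automatically by the engine; speculative decoding can be
# enabled with additional, version-dependent engine arguments.
```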

vLLM is flexible and easy to use with the following (a serving sketch follows the list):

  • Seamless integration with popular Hugging Face models
  • High throughput serving with various decoding algorithms
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Prefix caching support
  • And more …
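
For instance, once vLLM is installed, its OpenAI-compatible server can be launched from a shell and queried with the standard openai Python client; the model name, port, and tensor-parallel degree below are illustrative:

```python
# Launch the server first (newer releases; older ones use
# `python -m vllm.entrypoints.openai.api_server --model ...`):
#
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2

from openai import OpenAI

# vLLM speaks the OpenAI API; the client requires an API key, but it is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens back as they are generated.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```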

Together, we will work on taking each of these software architecture features to the next level, leveraging oneAPI, SYCL, the Unified Acceleration Foundation (UXL), and Intel® Artificial Intelligence Solutions. The AI Center of Excellence at UC Berkeley will push the boundaries of GAI and make it accessible to a wide range of platforms so everyone can fully leverage its potential.  

Join the Future of AI

With UC Berkeley’s AI Center of Excellence, we commit to further advancing the state of the art in AI. Together, let us develop AI solutions that are not only innovative but also efficient and scalable, making advanced AI applicable to a wide range of real-world applications and available across a wide range of compute capabilities and form factors.

Ensure your AI workloads are ready for the next generation of Intel® CPUs, GPUs, and AI accelerators!

Join UC Berkeley’s AI Center of Excellence effort by actively contributing to the vLLM Project.

→ Download and install vLLM today!

Learn more about Intel’s AI Frameworks and Tools and get started with your AI solution software development using the AI Tool Selector.

We are looking forward to hearing from you!
