UC Berkeley Collaborates with Intel to Drive vLLM* Performance and Efficiency

December 11, 2024

We are excited to announce that the Department of Electrical Engineering and Computer Sciences (EECS) at the University of California, Berkeley (UC Berkeley) is contributing its innovative prowess to a joint effort with Intel to take generative AI (GAI) large language model (LLM) inference efficiency to the next level.

Under the guidance of Professor Ion Stoica, Director of the Sky Computing Lab and a key contributor to the Berkeley Artificial Intelligence Research Center (BAIR), UC Berkeley is establishing a new AI Center of Excellence. Graduate students Woosuk Kwon, Zhuohan Li, and Simon Mo will help coordinate efforts across BAIR, the Sky Computing Lab, and other affiliated research labs. They will focus on optimizing performance and efficiency for the vLLM project on Intel® hardware platforms and beyond.

By focusing on Intel-optimized solutions, the center aims to push the boundaries of what is possible with generative AI, ensuring that these advanced models can deliver unparalleled performance and scalability on Intel-powered systems from data centers to client devices.

Find out more about the AI Center of Excellence launch in the official announcement:
→ UC Berkeley Announces AI Center of Excellence

New AI Center of Excellence

The Computer Science Division at UC Berkeley, part of a unified, highly integrated electrical engineering and computer science department with world-renowned research labs and a rich history at the forefront of artificial intelligence and high-performance computing, is ideally positioned to accelerate the development of system-level compute, memory, and communication optimizations that further democratize AI.

UC Berkeley and Intel are excited about this opportunity to combine UC Berkeley's research community with Intel's expertise in hardware and software innovation to help shape the future of open standards-based AI for everyone.

We are excited to collaborate with Intel on optimizing vLLM. Establishing this Center of Excellence will provide the ideal environment for this effort.

Ion Stoica, Professor of EECS, University of California, Berkeley

As long-time innovators in open source and AI development, we are delighted to collaborate with the University of California, Berkeley, as a new AI Center of Excellence member focused on enhancing the performance and efficiency of the GenAI software stack. Partnership on open technologies such as vLLM running on oneAPI software can not only drive the future of AI software but also influence hardware architecture and software development, delivering capabilities for the next generation of advanced AI applications.

Melissa Evers, VP Office of the CTO, GM of Software Ecosystem Enablement, Intel Corporation

Taking vLLM Inference Performance to the Next Level

vLLM was developed at UC Berkeley, where it is still actively maintained; Prof. Ion Stoica is its principal researcher. Upstream support currently includes Intel® Xeon® Processors, Intel® Gaudi® AI Accelerators, and Intel® Data Center GPU Max Series platforms, among other hardware platforms.

The vLLM library showcases the open source, open standards philosophy and developer-ecosystem focus that both UC Berkeley and Intel embrace. It is fast and easy to use, making trained models accessible for real-world use: it provides an inference infrastructure that lets applications send data to a model and receive predictions or outputs in real time.
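
As an illustration, here is a minimal offline-inference sketch using vLLM's Python API (the model name is only an example; any supported Hugging Face model works):

```python
from vllm import LLM, SamplingParams

# Load an example model (swap in any model vLLM supports).
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Send a prompt and receive the generated output.
outputs = llm.generate(["What makes LLM inference fast?"], sampling_params)
print(outputs[0].outputs[0].text)
```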

It achieves this through the following techniques (a configuration sketch follows the list):

  • Efficient management of attention key and value memory
  • Continuous batching of incoming requests
  • Fast model execution via graph APIs for compute-kernel memory-dependency optimization, kernel-execution batching, and launch-latency reduction
  • Offload kernel optimization
  • Advanced CPU and GPU parallelism
  • Quantization, taking advantage of various data types
  • Speculative decoding
  • Chunked prefill
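
Several of these features can be enabled directly when constructing the engine. Flag names vary across vLLM releases, so treat the following as an illustrative sketch rather than an authoritative configuration:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # example quantized checkpoint
    quantization="awq",               # quantization with reduced-precision weights
    enable_chunked_prefill=True,      # chunked prefill
)
# Continuous batching and attention key/value memory management are
# handled automatically by the engine; speculative decoding can be
# enabled with additional, version-dependent engine arguments.
```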

vLLM is flexible and easy to use with the following (a serving sketch follows the list):

  • Seamless integration with popular Hugging Face models
  • High throughput serving with various decoding algorithms
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Prefix caching support
  • And more …
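
For instance, once vLLM is installed, its OpenAI-compatible server can be launched from a shell and queried with the standard openai Python client; the model name, port, and tensor-parallel degree below are illustrative:

```python
# Launch the server first (newer releases; older ones use
# `python -m vllm.entrypoints.openai.api_server --model ...`):
#
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2

from openai import OpenAI

# vLLM speaks the OpenAI API; the client requires an API key, but it is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens back as they are generated.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```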

Together, we will work on taking each of these software architecture features to the next level, leveraging oneAPI, SYCL, the Unified Acceleration Foundation (UXL), and Intel® Artificial Intelligence Solutions. The AI Center of Excellence at UC Berkeley will push the boundaries of GAI and make it accessible to a wide range of platforms so everyone can fully leverage its potential.  

Join the Future of AI

With UC Berkeley’s AI Center of Excellence, we commit to further advancing the state of the art in AI. Together, let us develop AI solutions that are not only innovative but also efficient and scalable, making advanced AI applicable to a wide range of real-world applications and available across a wide range of compute capabilities and form factors.

Ensure your AI workloads are ready for the next generation of Intel® CPUs, GPUs, and AI accelerators!

Join UC Berkeley’s AI Center of Excellence effort by actively contributing to the vLLM Project.

→ Download and install vLLM today!

Learn more about Intel’s AI Frameworks and Tools and get started with your AI solution software development using the AI Tool Selector.

We are looking forward to hearing from you!
