From the course: LLaMa for Developers

Using VLLM for serving LLaMA

- [Instructor] In this video, we're going to discuss another large language model server, called vLLM. In 2023, this blog post came out, promising easier, faster, and cheaper LLM serving with a technique called paged attention. If you remember from a previous video, we talked about paged attention as a way of saving state for our large language models. If we scroll down, we can see the performance impact: compared to TGI and native Hugging Face serving, vLLM has much better benchmarks. Thanks to paged attention and other key techniques, the authors were able to optimize serving in an easy-to-access manner. vLLM is an open source project; we can see it here on their GitHub page at GitHub.com/vllm-project. Now we're going to be implementing vLLM in our Colab. It's actually the easiest server to implement, in my experience. So let's head over to our notebook. I'm here in our notebook, which is from chapter four, 0404 B,…
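As a rough sketch of what the notebook walks through, here is a minimal example using vLLM's offline Python API. The model ID, prompts, and sampling settings are illustrative placeholders, and the exact cells in the course notebook may differ.

```python
# Minimal sketch of serving a LLaMA-family model with vLLM's Python API.
# The model ID and prompts are placeholders; the course notebook may differ.
#
# Install vLLM first (e.g. in a Colab cell):
#   !pip install vllm

from vllm import LLM, SamplingParams

# Load the model; paged-attention-based KV-cache management is handled
# internally by the vLLM engine.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain what paged attention does in one sentence.",
    "Write a haiku about fast LLM serving.",
]

# Generate completions for a batch of prompts.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

For deploying vLLM as a standalone server rather than an in-notebook engine, the project also ships an OpenAI-compatible HTTP server (`python -m vllm.entrypoints.openai.api_server --model <model>`), which exposes the same model behind a familiar REST interface.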
