Using VLLM for serving LLaMA
From the course: LLaMa for Developers
- [Instructor] In this video, we're going to discuss another large language model server, called vLLM. In 2023, this blog post came out, promising easier, faster, and cheaper LLM serving with a technique called paged attention. If you remember from a previous video, we talked about paged attention as a way of saving state for our large language models. If we scroll down, we can see the performance impact. Compared to TGI and native Hugging Face serving, vLLM has much better benchmarks. Due to paged attention and other key techniques, the authors were able to optimize serving in an easy-to-access manner. vLLM is an open source project. We can see it here on their GitHub page at GitHub.com/vllm-project. Now we're going to be implementing vLLM in our Colab. It's actually the easiest server to implement from my experience. So let's head over to our notebook. I'm here in our notebook, which is from chapter four, 0404 B,…
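As a rough sketch of what such a notebook does, the snippet below uses vLLM's offline generation API with a LLaMA-family model. The checkpoint name, prompt, and sampling settings are illustrative assumptions, not necessarily the values used in the course notebook.

```python
# Minimal sketch of serving a LLaMA model with vLLM in a notebook,
# assuming `pip install vllm` has been run and the checkpoint below
# (an assumption, not the course's exact model) is available from
# Hugging Face.
from vllm import LLM, SamplingParams

# vLLM loads the model and manages the KV cache with paged attention.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Illustrative sampling settings; tune these for your workload.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = ["Explain paged attention in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

vLLM can also expose an OpenAI-compatible HTTP server (via its `vllm.entrypoints.openai.api_server` entrypoint) if you would rather serve requests over a network endpoint than call the model in-process.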
Contents
- Resources required to serve LLaMA (4m 35s)
- Quantizing LLaMA (4m 7s)
- Using TGI for serving LLaMA (2m 40s)
- Using VLLM for serving LLaMA (5m 27s)
- Using DeepSpeed for serving LLaMA (4m 13s)
- Explaining LoRA and SLoRA (1m 59s)
- Using a vendor for serving LLaMA (3m 16s)