From the course: LLaMa for Developers
Using DeepSpeed for serving LLaMA
- [Instructor] Let's cover a third LLM serving framework called DeepSpeed MII. You might be familiar with DeepSpeed as a way to train and run models, but Microsoft has recently updated it to make serving even faster. I'm here on the GitHub page to look at some of DeepSpeed's features. If we scroll down, we can see charts showing how fast DeepSpeed is. Compared to the previous serving framework we covered, it's up to two times faster, depending on the model. Key features include KV caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and additional CUDA kernels. We're going to use this framework to run the 7-billion-parameter Llama model. Let's head over to our notebook. I'm here under 04_05, and we're going to run MII in our Colab notebook. Before we get started, let's walk through two pieces of information. On the left-hand side here, we have our secrets tab.…
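To make the workflow concrete, here is a minimal sketch of serving a Llama model with DeepSpeed-MII in a notebook. It assumes the `deepspeed-mii` package is installed and a GPU runtime is available; the Colab secret name `HF_TOKEN` and the model ID `meta-llama/Llama-2-7b-hf` are illustrative assumptions, not details taken from the video.

```python
# Minimal DeepSpeed-MII serving sketch (assumes: pip install deepspeed-mii,
# a GPU runtime, and access to a gated Llama checkpoint on Hugging Face).
import os

# In Colab, gated-model credentials typically live in the secrets tab.
# The secret name "HF_TOKEN" is a hypothetical placeholder.
try:
    from google.colab import userdata
    os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
except ImportError:
    pass  # Not running in Colab; assume the token is already in the env.

import mii

# Load the model into a non-persistent MII pipeline. Optimizations such as
# KV caching, continuous batching, and Dynamic SplitFuse are applied by MII.
pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")

# Generate a completion; max_new_tokens caps the response length.
response = pipe(["DeepSpeed MII makes LLM serving"], max_new_tokens=128)
print(response)
```

For a persistent endpoint rather than an in-process pipeline, MII also provides `mii.serve(...)`, which stands up a deployment that clients can query; the pipeline form above is the simpler fit for a notebook walkthrough.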