From the course: LLaMa for Developers


Using DeepSpeed for serving LLaMA


- [Instructor] Let's cover a third LLM serving framework called DeepSpeed MII. You might be familiar with DeepSpeed as a way to train and run models, but Microsoft has recently updated it to make it even faster. I'm here on the GitHub page to take a look at some of the features of DeepSpeed. If we scroll down, we can see some charts about how fast DeepSpeed is. Compared to the previous serving framework we covered, it's up to two times faster, depending on the model. Some of the key features include KV-caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and additional CUDA kernels. So we're going to use this framework to run the 7-billion-parameter Llama model. Let's head over to our notebook. I'm here under 04_05, and we're going to be running MII in our Colaboratory notebook. And before we get started, let's walk through two pieces of information. On the left-hand side here, we have our secrets tab.…
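As a rough sketch of what a notebook like this does, the snippet below loads a Llama checkpoint behind a DeepSpeed-MII pipeline and generates text. The secret name HF_TOKEN and the model ID meta-llama/Llama-2-7b-hf are assumptions for illustration, not taken from the video; gated Llama weights require a Hugging Face token, which in Colab can be stored in the secrets tab the instructor points to.

```python
# Minimal sketch of serving Llama with DeepSpeed-MII in a Colab notebook.
# Assumes the package is installed (pip install deepspeed-mii) and that a
# Hugging Face token is saved as a Colab secret named "HF_TOKEN" (both the
# secret name and the model ID below are illustrative assumptions).
import os

from google.colab import userdata  # Colab's secrets API
import mii

# Expose the token so Hugging Face Hub can download the gated weights.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

# Build an MII pipeline around the 7B checkpoint; this is where MII applies
# the optimizations mentioned above (KV caching, continuous batching, etc.).
pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")

# Generate completions for a batch of prompts.
responses = pipe(["DeepSpeed MII is"], max_new_tokens=128)
print(responses[0])
```

For a persistent server rather than an in-process pipeline, MII also exposes a deployment mode (mii.serve) that keeps the model loaded between requests, which is closer to a production serving setup.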
