
This episode is based on a technical blog post from LMSYS Org detailing the deployment of the DeepSeek large language model (LLM) with the SGLang inference system on 96 H100 GPUs. The central focus is on two advanced optimization techniques, Prefill-Decode (PD) Disaggregation and Large-Scale Expert Parallelism (EP), which are needed to serve DeepSeek's Mixture of Experts (MoE) architecture efficiently. The authors explain how their implementation, which includes toolkits such as Disposable Tensor and the Expert Parallelism Load Balancer (EPLB), achieves throughput nearly matching the official DeepSeek profile while significantly reducing cost. Through extensive evaluation, they demonstrate substantial speedups over vanilla tensor parallelism, present detailed kernel breakdowns, and outline future work to address remaining latency and scalability limitations.
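
For intuition about what expert parallelism means in an MoE layer, here is a minimal single-process sketch of top-k expert routing. It is purely illustrative and is not SGLang's or DeepSeek's code: all names (NUM_EXPERTS, EP_SIZE, route_and_combine) and sizes are assumptions, and the per-expert loop stands in for the cross-GPU all-to-all dispatch that a real expert-parallel system performs.

```python
# Illustrative sketch only: single-process simulation of MoE top-k routing.
# In true expert parallelism, experts are sharded across EP_SIZE GPU ranks
# and tokens are exchanged via all-to-all; here everything runs locally.
import torch

NUM_EXPERTS = 8   # total experts in the MoE layer (assumed toy value)
EP_SIZE = 4       # hypothetical expert-parallel group size
TOP_K = 2         # experts activated per token
HIDDEN = 16       # toy hidden dimension

# Each rank would own NUM_EXPERTS // EP_SIZE experts; experts are tiny MLPs here.
experts = [torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)]
gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)

def route_and_combine(x: torch.Tensor) -> torch.Tensor:
    """Route each token to its top-k experts, run them, and sum weighted outputs."""
    logits = gate(x)                                    # [tokens, NUM_EXPERTS]
    weights, topk_idx = logits.softmax(-1).topk(TOP_K)  # both [tokens, TOP_K]
    out = torch.zeros_like(x)
    for e in range(NUM_EXPERTS):
        # In real EP, this grouping feeds an all-to-all so tokens land on the
        # rank hosting expert e (rank = e // (NUM_EXPERTS // EP_SIZE)).
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * experts[e](x[token_ids])
    return out

tokens = torch.randn(32, HIDDEN)        # a toy batch of 32 token embeddings
print(route_and_combine(tokens).shape)  # torch.Size([32, 16])
```

The skew visible in this sketch, where popular experts receive many more tokens than others, is exactly the imbalance that a load balancer like EPLB is designed to smooth out across GPUs.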