
This episode is based on a technical blog post from LMSYS Org detailing the deployment of the DeepSeek large language model (LLM) with the SGLang inference system on 96 H100 GPUs. The central focus is on two advanced optimization techniques, Prefill-Decode (PD) Disaggregation and Large-Scale Expert Parallelism (EP), which are needed to serve DeepSeek's Mixture of Experts (MoE) architecture efficiently. The authors explain how their implementation, which includes toolkits such as Disposable Tensor and the Expert Parallelism Load Balancer (EPLB), achieves throughput nearly matching the official DeepSeek profile while significantly reducing cost. Through extensive evaluation, they demonstrate substantial speedups over vanilla tensor parallelism, present detailed kernel breakdowns, and outline future work to address remaining latency and scalability limitations.
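
For intuition about what expert parallelism means in an MoE layer, here is a minimal single-process sketch of top-k expert routing. It is purely illustrative and is not SGLang's or DeepSeek's code: all names (NUM_EXPERTS, EP_SIZE, route_and_combine) and sizes are assumptions, and the per-expert loop stands in for the cross-GPU all-to-all dispatch that a real expert-parallel system performs.

```python
# Illustrative sketch only: single-process simulation of MoE top-k routing.
# In true expert parallelism, experts are sharded across EP_SIZE GPU ranks
# and tokens are exchanged via all-to-all; here everything runs locally.
import torch

NUM_EXPERTS = 8   # total experts in the MoE layer (assumed toy value)
EP_SIZE = 4       # hypothetical expert-parallel group size
TOP_K = 2         # experts activated per token
HIDDEN = 16       # toy hidden dimension

# Each rank would own NUM_EXPERTS // EP_SIZE experts; experts are tiny MLPs here.
experts = [torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)]
gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)

def route_and_combine(x: torch.Tensor) -> torch.Tensor:
    """Route each token to its top-k experts, run them, and sum weighted outputs."""
    logits = gate(x)                                    # [tokens, NUM_EXPERTS]
    weights, topk_idx = logits.softmax(-1).topk(TOP_K)  # both [tokens, TOP_K]
    out = torch.zeros_like(x)
    for e in range(NUM_EXPERTS):
        # In real EP, this grouping feeds an all-to-all so tokens land on the
        # rank hosting expert e (rank = e // (NUM_EXPERTS // EP_SIZE)).
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * experts[e](x[token_ids])
    return out

tokens = torch.randn(32, HIDDEN)        # a toy batch of 32 token embeddings
print(route_and_combine(tokens).shape)  # torch.Size([32, 16])
```

The skew visible in this sketch, where popular experts receive many more tokens than others, is exactly the imbalance that a load balancer like EPLB is designed to smooth out across GPUs.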