
The core problem is identified as memory fragmentation caused by inefficient management of the Key-Value (KV) cache, which stores the attention keys and values of previously processed tokens. The presenters explain that PagedAttention adopts principles from operating-system paging and virtual memory by partitioning the KV cache into fixed-size KV blocks, significantly reducing both internal and external fragmentation and achieving a 2.5 to 5 times improvement in memory utilization. The system also supports memory sharing for parallel sampling and beam search, using a copy-on-write technique to handle divergent outputs and increasing overall serving throughput by up to 4x over existing methods. Finally, they discuss preemption strategies, recomputation and swapping, for managing unpredictable output lengths, and conclude with a presentation of their open-source system, vLLM, and its evaluation results.
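
To make the paging analogy concrete, here is a minimal sketch of a block manager in the spirit described above: fixed-size KV blocks, per-sequence block tables, reference-counted sharing for forked samples, and copy-on-write on divergence. All names (BlockManager, fork, write_last_block, and so on) are illustrative assumptions, not vLLM's actual API, and the real KV tensor copies are elided.

```python
class BlockManager:
    """Hypothetical paged KV-cache manager (illustrative, not vLLM's API)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size           # tokens stored per KV block
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}                    # physical block -> #sequences sharing it
        self.block_tables = {}                 # seq_id -> list of physical block ids

    def _alloc(self) -> int:
        block = self.free_blocks.pop()         # raises IndexError when the pool is empty
        self.ref_count[block] = 1
        return block

    def grow(self, seq_id: int, num_tokens: int) -> None:
        """Extend a sequence's block table to cover num_tokens tokens,
        allocating blocks lazily so only the last block can sit partly empty."""
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) * self.block_size < num_tokens:
            table.append(self._alloc())

    def fork(self, parent_id: int, child_id: int) -> None:
        """Share the parent's blocks with a new sample or beam candidate
        without copying any KV data; only reference counts change."""
        table = self.block_tables[parent_id]
        for block in table:
            self.ref_count[block] += 1
        self.block_tables[child_id] = list(table)

    def write_last_block(self, seq_id: int) -> int:
        """Copy-on-write: before a sequence appends into a shared block,
        give it a private copy so its siblings keep the old contents."""
        table = self.block_tables[seq_id]
        block = table[-1]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            new_block = self._alloc()
            table[-1] = new_block              # a real system also copies the KV tensors
            return new_block
        return block
```

Because the last block can waste at most block_size - 1 token slots, internal fragmentation is bounded per sequence, and because every block has the same size, freed blocks are always reusable, which is what eliminates external fragmentation.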
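
Continuing the hypothetical sketch above, the two preemption strategies mentioned could look roughly like this: recomputation drops a victim's KV blocks and keeps only its tokens, so the cache can be rebuilt later in one prefill pass, while swapping stashes the KV contents in host memory before freeing GPU blocks. Again, these function names and the cpu_store structure are assumptions for illustration.

```python
def free_seq(manager: BlockManager, seq_id: int) -> None:
    """Release a sequence's blocks; return fully unshared ones to the pool."""
    for block in manager.block_tables.pop(seq_id):
        manager.ref_count[block] -= 1
        if manager.ref_count[block] == 0:
            del manager.ref_count[block]
            manager.free_blocks.append(block)

def preempt(manager: BlockManager, victim_id: int, victim_tokens: list,
            cpu_store: dict, mode: str = "recompute") -> None:
    """Evict one sequence when the GPU block pool is exhausted."""
    if mode == "recompute":
        # Keep only the tokens; the KV cache is recomputed on resume.
        cpu_store[victim_id] = ("tokens", list(victim_tokens))
    else:
        # "swap": a real system copies the KV tensors to CPU RAM here.
        cpu_store[victim_id] = ("swapped", list(manager.block_tables[victim_id]))
    free_seq(manager, victim_id)  # GPU blocks are reclaimed in either case
```

The trade-off the presenters allude to is that recomputation costs extra GPU compute on resume, while swapping costs PCIe bandwidth and host memory; which wins depends on block sizes and hardware.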