
The core problem is identified as memory fragmentation caused by inefficient management of the Key-Value (KV) cache, which stores the attention keys and values of previously processed tokens. The presenters explain that PagedAttention adopts principles from operating-system paging and virtual memory by partitioning the KV cache into fixed-size KV blocks, significantly reducing both internal and external fragmentation and achieving a 2.5 to 5 times improvement in memory utilization. The system also supports memory sharing for parallel sampling and beam search, using a copy-on-write technique to handle divergent outputs and increasing overall serving throughput by up to 4x over existing methods. Finally, they discuss preemption strategies, recomputation and swapping, for managing unpredictable output lengths, and conclude with a presentation of their open-source system, vLLM, and its evaluation results.
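
To make the paging analogy concrete, here is a minimal sketch of a block manager in the spirit described above: fixed-size KV blocks, per-sequence block tables, reference-counted sharing for forked samples, and copy-on-write on divergence. All names (BlockManager, fork, write_last_block, and so on) are illustrative assumptions, not vLLM's actual API, and the real KV tensor copies are elided.

```python
class BlockManager:
    """Hypothetical paged KV-cache manager (illustrative, not vLLM's API)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size           # tokens stored per KV block
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}                    # physical block -> #sequences sharing it
        self.block_tables = {}                 # seq_id -> list of physical block ids

    def _alloc(self) -> int:
        block = self.free_blocks.pop()         # raises IndexError when the pool is empty
        self.ref_count[block] = 1
        return block

    def grow(self, seq_id: int, num_tokens: int) -> None:
        """Extend a sequence's block table to cover num_tokens tokens,
        allocating blocks lazily so only the last block can sit partly empty."""
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) * self.block_size < num_tokens:
            table.append(self._alloc())

    def fork(self, parent_id: int, child_id: int) -> None:
        """Share the parent's blocks with a new sample or beam candidate
        without copying any KV data; only reference counts change."""
        table = self.block_tables[parent_id]
        for block in table:
            self.ref_count[block] += 1
        self.block_tables[child_id] = list(table)

    def write_last_block(self, seq_id: int) -> int:
        """Copy-on-write: before a sequence appends into a shared block,
        give it a private copy so its siblings keep the old contents."""
        table = self.block_tables[seq_id]
        block = table[-1]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            new_block = self._alloc()
            table[-1] = new_block              # a real system also copies the KV tensors
            return new_block
        return block
```

Because the last block can waste at most block_size - 1 token slots, internal fragmentation is bounded per sequence, and because every block has the same size, freed blocks are always reusable, which is what eliminates external fragmentation.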
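
Continuing the hypothetical sketch above, the two preemption strategies mentioned could look roughly like this: recomputation drops a victim's KV blocks and keeps only its tokens, so the cache can be rebuilt later in one prefill pass, while swapping stashes the KV contents in host memory before freeing GPU blocks. Again, these function names and the cpu_store structure are assumptions for illustration.

```python
def free_seq(manager: BlockManager, seq_id: int) -> None:
    """Release a sequence's blocks; return fully unshared ones to the pool."""
    for block in manager.block_tables.pop(seq_id):
        manager.ref_count[block] -= 1
        if manager.ref_count[block] == 0:
            del manager.ref_count[block]
            manager.free_blocks.append(block)

def preempt(manager: BlockManager, victim_id: int, victim_tokens: list,
            cpu_store: dict, mode: str = "recompute") -> None:
    """Evict one sequence when the GPU block pool is exhausted."""
    if mode == "recompute":
        # Keep only the tokens; the KV cache is recomputed on resume.
        cpu_store[victim_id] = ("tokens", list(victim_tokens))
    else:
        # "swap": a real system copies the KV tensors to CPU RAM here.
        cpu_store[victim_id] = ("swapped", list(manager.block_tables[victim_id]))
    free_seq(manager, victim_id)  # GPU blocks are reclaimed in either case
```

The trade-off the presenters allude to is that recomputation costs extra GPU compute on resume, while swapping costs PCIe bandwidth and host memory; which wins depends on block sizes and hardware.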