
vLLM is a high-throughput serving system for large language models (LLMs). Existing systems manage key-value (KV) cache memory inefficiently: fragmentation and the inability to share cache between sequences waste memory, which limits the batch size. vLLM introduces PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems, which stores the KV cache in fixed-size blocks that need not be contiguous in memory. This minimizes memory waste and enables flexible sharing of KV cache blocks, so vLLM can batch significantly more requests. As a result, vLLM achieves 2-4x higher throughput than state-of-the-art systems such as FasterTransformer and Orca at the same level of latency.
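The paging idea can be illustrated with a small bookkeeping sketch. This is a toy model, not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, the block size of 16 tokens is an assumed value, and only block metadata is tracked (no tensors or attention kernels). It shows how a per-sequence block table maps logical blocks to arbitrary physical blocks, and how reference counts let sequences share blocks until one of them writes.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks with reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence points at the same physical block.
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """block_table[i] is the physical block holding logical block i."""

    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or there is none): grab any free block.
            self.block_table.append(allocator.allocate())
        else:
            last = self.block_table[-1]
            if allocator.ref_count[last] > 1:
                # Copy-on-write: the partially filled block is shared, so
                # this sequence gets a private block before writing.
                # (A real system would also copy the KV data.)
                new_block = allocator.allocate()
                allocator.release(last)
                self.block_table[-1] = new_block
        self.num_tokens += 1

    def fork(self, allocator: BlockAllocator) -> "Sequence":
        # The child shares every physical block with the parent.
        shared = [allocator.share(b) for b in self.block_table]
        return Sequence(block_table=shared, num_tokens=self.num_tokens)

    def wasted_slots(self) -> int:
        # Unused slots exist only in the last block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
parent = Sequence()
for _ in range(20):
    parent.append_token(allocator)
child = parent.fork(allocator)   # shares both blocks, no new memory used
child.append_token(allocator)    # triggers copy-on-write on the last block
```

In this model a sequence never reserves memory for tokens it has not produced, so the unused space per sequence stays below one block, and forked sequences consume extra blocks only where their contents diverge.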