Join us as we discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV Cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Paper: https://arxiv.org/abs/2505.10938 Slides: https://bit.ly/45wolpr Join us for Ar...
All content for Deep Papers is the property of Arize AI and is served directly from their servers
with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.
Join us as we discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV Cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Paper: https://arxiv.org/abs/2505.10938 Slides: https://bit.ly/45wolpr Join us for Ar...
In this episode, we dive into the intriguing mechanics behind why chat experiences with models like GPT often start slow but then rapidly pick up speed. The key? The KV cache. This essential but under-discussed component enables the seamless and snappy interactions we expect from modern AI systems.Harrison Chu breaks down how the KV cache works, how it relates to the transformer architecture, and why it's crucial for efficient AI responses. By the end of the episode, you'll have a clearer und...
Deep Papers
Join us as we discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV Cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Paper: https://arxiv.org/abs/2505.10938 Slides: https://bit.ly/45wolpr Join us for Ar...