Xiaoqiang Lin is a Ph.D. student at the National University of Singapore. During his time at Meta, Xiaoqiang led the research behind REFRAG: Rethinking RAG-based Decoding. Traditional RAG systems use vectors to retrieve relevant context with semantic search, but then throw those vectors away when passing the context to the LLM. REFRAG instead feeds the LLM these pre-computed vectors, achieving massive gains in long-context processing and LLM inference speed! REFRAG makes Time-To-First-Token (TTFT) 31x faster and Time-To-Iterative-Token (TTIT) 3x faster, boosting overall LLM throughput by 7x while also handling much longer contexts!
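To make the core idea concrete, here is a minimal sketch of what "feeding the LLM pre-computed vectors" could look like, assuming a learned projection from the retriever's embedding space into the decoder's token-embedding space. Names and shapes are illustrative, not the paper's actual code:

```python
# Sketch of the REFRAG idea: retrieved chunks arrive as pre-computed
# embeddings, a small projection maps them into the decoder's embedding
# space, and each chunk occupies ONE sequence position instead of
# hundreds of tokens -- which is where the TTFT savings come from.
import torch
import torch.nn as nn

class ChunkProjector(nn.Module):
    def __init__(self, chunk_dim: int = 768, model_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(chunk_dim, model_dim)

    def forward(self, chunk_embeddings: torch.Tensor) -> torch.Tensor:
        # (num_chunks, chunk_dim) -> (num_chunks, model_dim)
        return self.proj(chunk_embeddings)

query_embeds = torch.randn(1, 12, 4096)   # 12 embedded query tokens
chunk_embeds = torch.randn(8, 768)        # 8 pre-computed chunk vectors
projector = ChunkProjector()
chunk_as_tokens = projector(chunk_embeds).unsqueeze(0)      # (1, 8, 4096)
decoder_input = torch.cat([chunk_as_tokens, query_embeds], dim=1)
print(decoder_input.shape)  # (1, 20, 4096) vs. thousands of raw text tokens
```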
There are so many interesting aspects to this and I really loved diving into the details with Xiaoqiang! I hope you enjoy the podcast!
This episode dives into Weaviate's partnership with SAS! We are super excited about our recent collaboration on the SAS Retrieval Agent Manager (RAM), featuring a first-party integration with Weaviate! The conversation covers all sorts of aspects of Enterprise AI adoption, from what has changed to what has NOT changed with recent breakthroughs in AI systems!
Charles Pierse is the Director of the Weaviate Labs team, where he recently led the GA release of the Weaviate Query Agent. The podcast begins with the journey from alpha to GA release, discussing unexpected lessons and the collaborations between teams at Weaviate. Continuing with product design, we cover the design of the Python and TypeScript clients and how to think about response models with Agent products. Then, diving into the tech, we cover several different aspects of the Query Agent, from question answering with citations, to schema introspection and typing for database querying, multi-collection routing, and the newly introduced Search Mode. We also discuss the Weaviate Query Agent's integration with the Cloud Console, a GUI home for the Weaviate Database! We are also super excited to share a case study from one of the Query Agent's power users, MetaBuddy! The podcast concludes with the MetaBuddy case study and some exciting directions for the future development of the Query Agent.
Lakshya A. Agrawal is a Ph.D. student at U.C. Berkeley! Lakshya has led the research behind GEPA, one of the newest innovations in DSPy and the use of Large Language Models as Optimizers! GEPA makes three key innovations in how exactly we use LLMs to propose prompts for LLMs: (1) Pareto-Optimal Candidate Selection, (2) Reflective Prompt Mutation, and (3) System-Aware Merging. The podcast discusses all of these details further, as well as topics such as Test-Time Training and the LangProBe benchmarks used in the paper! I hope you find the podcast useful!
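To give a flavor of innovation (1), here is a toy sketch of Pareto-optimal candidate selection under my reading of GEPA: instead of keeping only the prompt with the best average score, keep every prompt that is the top scorer on at least one training instance, preserving diverse "lessons" for reflective mutation. The scores below are made up for illustration:

```python
# Toy Pareto-optimal candidate selection: a candidate survives if it is
# the best on at least one instance, even if its average is mediocre.
def pareto_candidates(scores: dict[str, list[float]]) -> set[str]:
    """scores maps candidate prompt id -> per-instance scores."""
    num_instances = len(next(iter(scores.values())))
    winners: set[str] = set()
    for i in range(num_instances):
        best = max(s[i] for s in scores.values())
        winners |= {cand for cand, s in scores.items() if s[i] == best}
    return winners

scores = {
    "prompt_a": [0.9, 0.2, 0.4],   # best on instance 0
    "prompt_b": [0.5, 0.8, 0.4],   # best on instance 1
    "prompt_c": [0.6, 0.6, 0.3],   # never the best -> pruned
}
print(pareto_candidates(scores))   # {'prompt_a', 'prompt_b'}
```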
Maarten Grootendorst is a psychologist turned AI engineer who created BERTopic and authored "Hands-On Large Language Models" with Jay Alammar. The rise of LLMs and Agents is transforming many areas of software! This podcast dives deep into their impact on Topic Modeling! Maarten designed BERTopic from the start with modularity in mind -- letting you swap out embedding models, dimensionality reduction, clustering algorithms, and more. This early insight to prioritize modularity makes BERTopic incredibly well positioned to become more "Agentic". An "Agentic" Topic Modeling algorithm can use LLMs to generate topics or topic descriptions, as well as contrast them with other topics. It can decide which topics to subdivide, and it can integrate human feedback and evaluate topics in novel ways... I hope you find the podcast interesting!
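To show what that modularity looks like in practice, here is a short example following BERTopic's documented constructor; the component choices are illustrative and defaults may differ across versions:

```python
# Each pipeline stage is a swappable component -- the modularity that
# makes it natural to later slot an LLM in as the representation stage.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),   # embeddings
    umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=15),                # clustering
)

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```

Swapping in an LLM-backed representation model (BERTopic ships several under `bertopic.representation`) is exactly the hook that lets an LLM label and describe the clusters it is handed.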
Hailey Joren is a Ph.D. student at UCSD! Hailey and collaborators at Duke University and Google recently published Sufficient Context: A New Lens on Retrieval Augmented Generation Systems at ICLR 2025! There are so many interesting nuggets in this work! Firstly, it really helped me understand the difference between *relevant* search results and sufficient context for answering the question. Armed with this lens for looking at retrieved context, Hailey and collaborators make all sorts of interesting observations about the current state of hallucination. RAG unfortunately makes the models far less likely to abstain from answering, and the existing RAG benchmarks unfortunately do not stress retrieval quality well enough -- indicated by LLMs outputting correct answers despite insufficient context 35-62% of the time! However, there is reason for optimism! Hailey and team developed an autorater that can detect insufficient context 93% of the time! There are all sorts of interesting ideas around this paper! I really hope you find the podcast useful!
Nandan Thakur is a Ph.D. student at the University of Waterloo! Nandan has worked on many of the most impactful projects in Retrieval-Augmented Generation (RAG) and Information Retrieval. His work ranges from benchmarks such as BEIR, MIRACL, TREC, and FreshStack, to improving the training of embedding models and re-rankers, and more!
Multi-vector retrieval offers richer, more nuanced search, but often comes with a significant cost in storage and computational overhead. How can we harness the power of multi-vector representations without breaking the bank? Rajesh Jayaram, first author of the groundbreaking MUVERA paper from Google, and Roberto Esposito from Weaviate, who spearheaded its implementation, reveal how MUVERA tackles this critical challenge.
Dive deep into MUVERA, a novel compression technique specifically designed for multi-vector retrieval. Rajesh and Roberto explain how it leverages contextualized token embeddings and innovative fixed dimensional encodings to dramatically reduce storage requirements while maintaining high retrieval accuracy. Discover the intricacies of quantization within MUVERA, the interpretability benefits of this approach, and how LSH clustering can play a role in topic modeling with these compressed representations.
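For intuition, here is a heavily simplified sketch of a fixed dimensional encoding based on my reading of the MUVERA paper (it omits repetitions, final projections, and empty-bucket filling): random hyperplanes hash each token embedding into a bucket, buckets are aggregated, and the concatenation becomes one fixed-size vector whose dot product approximates multi-vector (Chamfer) similarity:

```python
# Simplified fixed dimensional encoding (FDE): SimHash partitioning plus
# per-bucket aggregation turns a variable-size set of token embeddings
# into ONE vector that standard single-vector ANN search can index.
import numpy as np

def fde(token_embeddings: np.ndarray, hyperplanes: np.ndarray,
        is_query: bool) -> np.ndarray:
    k, dim = hyperplanes.shape[0], token_embeddings.shape[1]
    buckets = np.zeros((2 ** k, dim))
    counts = np.zeros(2 ** k)
    for emb in token_embeddings:
        bits = (hyperplanes @ emb) > 0                  # SimHash signature
        idx = int("".join("1" if b else "0" for b in bits), 2)
        buckets[idx] += emb
        counts[idx] += 1
    if not is_query:                                    # docs average,
        nonzero = counts > 0                            # queries sum
        buckets[nonzero] /= counts[nonzero, None]
    return buckets.reshape(-1)                          # (2^k * dim,)

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 128))                      # k=3 -> 8 buckets
q = fde(rng.normal(size=(32, 128)), planes, is_query=True)
d = fde(rng.normal(size=(80, 128)), planes, is_query=False)
print(q.shape, float(q @ d))       # one dot product approximates Chamfer
```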
This conversation explores the core mechanics of efficient multi-vector retrieval, the challenges of benchmarking these advanced systems, and the evolving landscape of vector database schemas designed to handle such complex data. Rajesh and Roberto also share their insights on the future directions in artificial intelligence where efficient, high-dimensional data representation is paramount.
Whether you're an AI researcher grappling with the scalability of vector search, an engineer building advanced retrieval systems, or fascinated by the cutting edge of information retrieval and AI frameworks, this episode delivers unparalleled insights directly from the source. You'll gain a fundamental understanding of MUVERA, practical considerations for its application in making multi-vector retrieval feasible, and a clear view of future directions in AI.
AI agents are getting more complex and harder to debug. How do you know what's happening when your agent makes 20+ function calls? What if you have a Multi-Agent System orchestrating several Agents? Anand Kannappan, co-founder of Patronus AI, reveals how their groundbreaking tool Percival transforms agent debugging and evaluation. Percival can instantly analyze complex agent traces, pinpoint failures across 60 different failure modes, and automatically suggest prompt fixes to improve performance. Anand unpacks several of these common failure modes, including the critical challenge of "context explosion," where agents process millions of tokens. He also explains domain adaptation for specific use cases and the complex challenge of multi-agent orchestration. The paradigm of AI Evals is shifting from static evaluation to dynamic oversight! Also learn how Percival's memory architecture leverages both episodic and semantic knowledge with Weaviate! This conversation explores powerful concepts like process vs. outcome rewards and LLM-as-judge approaches. Anand shares his vision for "agentic supervision," where equally capable AI systems provide oversight for complex agent workflows. Whether you're building AI agents, evaluating LLM systems, or interested in how debugging autonomous systems will evolve, this episode delivers concrete techniques. You'll gain philosophical insights on evaluation and a roadmap for how evaluation must transform to keep pace with increasingly autonomous AI systems.
How do you ensure your AI systems actually do what you expect them to do? Leonard Tang takes us deep into the revolutionary world of AI evaluation with concrete techniques you can apply today. Learn how Haize Labs is transforming AI testing through "scaling judge-time compute" - stacking weaker models to effectively evaluate stronger ones. Leonard unpacks the game-changing Verdict library that outperforms frontier models by 10-20% while dramatically reducing costs. Discover practical insights on creating contrastive evaluation sets that extract maximum signal from human feedback, implementing debate-based judging systems, and building custom reward models that align with enterprise needs. The conversation reveals powerful nuggets like using randomized agent debates to achieve consensus and lightweight guardrail models that run alongside inference. Whether you're developing AI applications or simply fascinated by how we'll ensure increasingly powerful AI systems perform as expected, this episode delivers immediate value with techniques you can implement right away, philosophical perspectives on AI safety, and a glimpse into the future of evaluation that will fundamentally shape how AI evolves.
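As a concrete, if toy, illustration of the judge-stacking idea (emphatically not the Verdict library's actual API), here is a minimal sketch where several weaker judges each vote multiple times and a majority decides; `call_llm` is a hypothetical stub for whatever LLM client you use:

```python
# Hand-rolled "scaling judge-time compute": more judges and more samples
# per judge buy more votes, and a simple majority yields the verdict.
from collections import Counter

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # stub

JUDGE_PROMPT = (
    "Question: {question}\nAnswer: {answer}\n"
    "Is this answer faithful to the question? Reply PASS or FAIL."
)

def ensemble_judge(question: str, answer: str,
                   judges: list[str], samples_per_judge: int = 3) -> str:
    votes: Counter = Counter()
    for model in judges:
        for _ in range(samples_per_judge):   # judge-time compute knob
            verdict = call_llm(model, JUDGE_PROMPT.format(
                question=question, answer=answer)).strip().upper()
            if verdict in {"PASS", "FAIL"}:
                votes[verdict] += 1          # ignore malformed output
    return votes.most_common(1)[0][0] if votes else "FAIL"
```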
Ben walks us through Box's three-layer infrastructure puzzle: First, the mind-boggling base infrastructure (think millions of interactions per second and trillions of files). Second, their unique multi-tenant security challenge - unlike most SaaS platforms, Box users share content across company boundaries, making traditional tenant isolation impossible. And third, ensuring AI respects all these complex permissions while still delivering value. The podcast then dives further into how vector embeddings can balloon file sizes - a few hundred bytes of text can require 4-6KB of vector data storage! We also dig into why RAG remains essential despite growing context windows, and how Box is developing AI agents that transform painful enterprise processes like RFP responses.
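The storage math behind that claim is easy to check; a quick sketch, assuming a typical 1024- to 1536-dimensional float32 embedding per chunk:

```python
# Back-of-the-envelope for the storage blow-up Ben describes: a short
# text chunk versus its float32 embedding.
text = "a couple hundred bytes of enterprise text, roughly one paragraph"
dims, bytes_per_float = 1536, 4
print(f"text:   ~{len(text.encode())} bytes")      # a few hundred bytes
print(f"vector: {dims * bytes_per_float} bytes")   # 6144 bytes ~ 6 KB
# A 1024-dim float32 embedding lands at 4096 bytes -- hence the 4-6KB range.
```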
Hey everyone! Thanks so much for watching another episode of the Weaviate Podcast! Dive into the fascinating world of structured outputs with Will Kurt and Cameron Pfeiffer, the brilliant minds behind Outlines, the revolutionary open-source library from .txt that's changing how we interact with LLMs. In this episode, we explore how constrained decoding enables predictable, reliable outputs from language models—unlocking everything from perfect JSON generation to guided reasoning processes. Will and Cameron share their journey to founding .txt, explain the technical magic behind Outlines (hint: it involves finite state machines!), and debunk misconceptions around structured generation performance. You'll discover practical applications like knowledge graph construction, metadata extraction, and report generation that simply weren't possible before this technology. Whether you're building AI systems or curious about where the field is heading, you'll gain valuable insights on how structured outputs integrate with inference engines like vLLM, why multi-task inference outperforms single-task approaches, and how this technology enables scalable agent systems that could transform software architecture forever. Join us for this mind-expanding conversation about one of AI's most important but underappreciated innovations—and discover why the future might belong to systems that combine freedom with structure.
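For a taste of what this looks like in code, here is a small example against Outlines' classic `outlines.generate` interface; the API has shifted across versions, so treat this as a sketch rather than canonical current usage (model choice is illustrative):

```python
# JSON-constrained generation with Outlines: a finite state machine
# compiled from the schema masks invalid tokens at every decoding step,
# so the output always parses into the Pydantic model.
from pydantic import BaseModel
import outlines

class Paper(BaseModel):
    title: str
    year: int
    topics: list[str]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Paper)   # FSM-constrained decoding
paper = generator("Extract metadata: 'Attention Is All You Need (2017)'")
print(paper.title, paper.year)                     # always a valid Paper
```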
Synthetic Data: The Building Blocks of AI's Future! Hey everyone! I am SUPER EXCITED to publish the 118th episode of the Weaviate Podcast featuring David Berenstein and Ben Burtenshaw from Hugging Face! This podcast explores the intricacies of synthetic data generation, detailing methodologies such as data augmentation, distillation, and instruction refinement. The conversation delves into persona-driven synthetic data, highlighting applications like Persona Hub, and discusses algorithms to enhance diversity, complexity, and quality of generated data. Additionally, they cover integration with Hugging Face's ecosystem, including Argilla for annotation, AutoTrain for fine-tuning, and advanced data exploration tools like the Data Studio and SQL console. The podcast also touches upon the potential for synthetic image data generation and the exciting future of AI education and accessibility.
Hey everyone! Thank you so much for watching the 117th episode of the Weaviate podcast! In this episode, we dive deep into the cutting edge of AI agent development with Sarah Wooders, co-founder and CTO of Letta AI. Emerging from Berkeley's Sky Computing Lab, Sarah and her team have pioneered a revolutionary approach to stateful agents - AI systems that genuinely remember both you and themselves across extended conversations. The conversation explores how the groundbreaking MemGPT project evolved into Letta's comprehensive Agent Development Environment (ADE), which empowers developers to build truly persistent AI experiences. Sarah shares powerful insights on context management, memory prioritization, and the critical role of databases in agent architecture. Whether you're building AI systems or simply curious about where conversational AI is heading, this episode illuminates how the future of agents depends not just on their reasoning capabilities, but on their ability to maintain coherent identity and memory over time.
Hey everyone! Thank you so much for watching another episode of the Weaviate Podcast! I am SUPER excited to welcome Matt Biilmann, Co-Founder and CEO of Netlify, as well as Sebastian Witalec and Charles Pierse from Weaviate to discuss Agent Experience! You have probably heard about how you can connect LLMs to external software tools. This supercharges the capabilities of AI systems and what they can do. So what does that mean for you as a software developer? This podcast explores different ideas around designing software user experiences for Agents as well as Humans. How do we write documentation for Agents differently than for Humans? How do we design REST or gRPC APIs, or programming-language clients, for Agents differently than for Humans? llms.txt, JSON tool definitions, agent-to-agent communication, breaking changes, … there were so many interesting topics explored in this podcast! I really hope you find it useful! As always, more than happy to discuss these ideas further with you!
Hey everyone! Thank you so much for watching the 115th episode of the Weaviate Podcast featuring Shirley Wu from Stanford University!
We explore the innovative Avatar Optimizer—a novel framework that leverages contrastive reasoning to refine LLM agent prompts for optimal tool usage. Shirley explains how this self-improving system evolves through iterative feedback by contrasting positive and negative examples, enabling agents to handle complex tasks more effectively.
We also dive into the STaRK Benchmark, a comprehensive testbed designed to evaluate retrieval systems on semi-structured knowledge bases. The discussion highlights the challenges of unifying textual and relational retrieval, exploring concepts such as multi-vector embeddings, relational graphs, and dynamic data modeling. Learn how these approaches help overcome information loss, enhance precision, and enable scalable, context-aware retrieval in diverse domains—from product recommendations to precision medicine.
Whether you’re interested in advanced prompt optimization, multi-agent system design, or the future of human-centered language models, this episode offers a wealth of insights and a forward-looking perspective on integrating sophisticated AI techniques into real-world applications.
Hey everyone! Thank you so much for watching the 114th episode of the Weaviate Podcast featuring Amanpreet Singh, Co-Founder and CTO of Contextual AI! Contextual AI is at the forefront of production-grade RAG agents! I learned so much from this conversation! We began by discussing the vision of RAG 2.0, jointly optimizing generative and retrieval models! This then led us to discuss Agentic RAG and how the RAG 2.0 roadmap is evolving with emerging perspectives on tool use. Amanpreet goes on to motivate the importance of continual learning of the model and the prompt / few-shot examples -- discussing the limits of prompt engineering. Personally, I have to admit I think I have been a bit too bullish on only tuning instructions / examples; Amanpreet made an excellent case for updating the weights of the models as well -- citing issues such as parametric knowledge conflicts, and later on discussing how Mechanistic Interpretability is used to audit models and their updates in enterprise settings. We then discussed Contextual AI's LMUnit for evaluating these systems. This then led us into my favorite part of the podcast, a deep dive into RL algorithms for LLMs. I highly recommend checking out the links below to learn more about Contextual's innovations on APO and KTO! We then discuss the importance of domain-specific data, Mechanistic Interpretability, return to another question on RAG 2.0, and conclude with Amanpreet's most exciting future directions for AI! I hope you enjoy the podcast!
Hey everyone! Thank you so much for watching the 113th episode of the Weaviate Podcast with Karan Goel from Cartesia AI! Cartesia AI is leading the AI world in text-to-speech models! As exciting as these new applications in speech generation are, Cartesia is also building around an incredibly exciting new neural network architecture that cuts across all of AI -- State Space Models. State Space Models (SSMs) present a new approach to modeling long sequences circumventing the quadratic attention bottlenecks of transformers. In the podcast, we discuss Karan's perspectives around end-to-end modeling, long context and Multimodal processing, building and deploying a new kind of model, and more! I hope you find the podcast interesting and useful! As always more than happy to discuss these ideas further with you! Thanks for listening!
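For intuition on why SSMs sidestep the quadratic bottleneck, here is a bare-bones discrete linear state-space recurrence, ignoring the discretization and selectivity machinery that models like Mamba add on top: each token updates a fixed-size state in O(1), so a length-T sequence costs O(T) rather than attention's O(T^2):

```python
# Minimal discrete linear state space model: h_t = A h_{t-1} + B x_t,
# y_t = C h_t. The state h has constant size no matter how long the
# sequence gets, which is the key contrast with attention.
import numpy as np

def ssm_scan(A: np.ndarray, B: np.ndarray, C: np.ndarray,
             xs: np.ndarray) -> np.ndarray:
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                 # one O(1) update per token
        h = A @ h + B * x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
out = ssm_scan(A=0.9 * np.eye(4), B=rng.normal(size=4),
               C=rng.normal(size=4), xs=rng.normal(size=1000))
print(out.shape)                 # (1000,) -- linear in sequence length
```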
Hey everyone! Thank you so much for watching the 112th episode of the Weaviate Podcast! This is another super exciting one, diving into the release of the Vertex AI RAG Engine, its integration with Weaviate, and thoughts on the future of connecting AI systems with knowledge sources! The podcast begins by reflecting on Bob's experience speaking at Google in 2016 on Knowledge Graphs! This transitions into discussing the evolution of knowledge representation perspectives and things like the semantic web, ontologies, search indexes, and data warehouses. This then leads to discussing how much knowledge is encoded in the prompts themselves and the resurrection of rule-based systems with LLMs! The podcast transitions back to topics around the modern consensus in RAG pipeline engineering. Lewis suggests that parsing in data ingestion is the biggest bottleneck and the lowest-hanging fruit to fix. Bob presents the re-indexing problem and how it is additionally complicated with embedding models! Discussing the state of knowledge representation systems inspired me to ask Bob further about his vision for Generative Feedback Loops and controlling databases with LLMs: how open-ended will this be? We then discuss the role that Agentic Architectures and Compound AI Systems are having on the state of AI. What is the right way to connect prompts with other prompts, external tools, and agents? The podcast then concludes by discussing a really interesting emerging pattern in the deployment of RAG systems. Whereas the first generation of RAG systems was typically user-facing, such as customer support chatbots, the next generation is more API-based. The launch of the Vertex AI RAG Engine quickly shows you how to use the RAG Engine as a tool for a Gemini Agent!
Hey everyone! I am SUPER EXCITED to publish the 111th Weaviate Podcast with Aravind Kesiraju from Morningstar! Aravind is a Principal Software Engineer who has led the development of the Morningstar Intelligence Engine! There are so many interesting aspects to this, and if you are building Agentic systems that would benefit from a high-quality financial retrieval API, you should check this out right now! The podcast dives into all sorts of ingredients that went into building this system: from custom RAG data pipelines with content management system integrations and embedding task queues, to exploring new chunking strategies, tool marketplaces, ReAct Agents, Text-to-SQL, and all sorts of other things!