Breaking Agent Backbones: AI agents are being deployed at scale, but their security is challenged by non-deterministic behavior and novel vulnerabilities. This episode introduces the "threat snapshot" framework and the new b3 benchmark, which systematically isolate and evaluate security risks stemming from the backbone LLM. We reveal crucial findings: enhanced reasoning capabilities generally improve security, yet model size does not correlate with lower vulnerability scores.
In this episode, OpenAI leaders share unprecedented transparency regarding their research goals, aiming for a fully automated AI researcher by March 2028 and discussing the rapid approach of superintelligence. They detail a new structure, featuring a nonprofit foundation that governs a Public Benefit Corporation, essential for attracting the resources needed for their colossal $1.4 trillion infrastructure commitment. The discussion also covers the pivot to an AI cloud platform model, the importance of accelerating scientific discovery, and the establishment of AI resilience efforts to handle societal risks.
Welcome to the new era of coding collaboration: Agent HQ is here, establishing GitHub as the centralized home for developers and a fleet of AI coding agents. We explore how the fully-fledged GitHub Copilot agent, alongside partners like Claude and Codex, now operates with deeper context and the ability to execute and coordinate tasks across the developer workflow. Discover how innovations like Mission Control and Plan Mode provide developers with the confidence and control to orchestrate parallel tasks and integrate AI natively into their existing processes, fundamentally changing the developer tool chain.
We delve into Jensen Huang's vision that Artificial Intelligence marks the New Industrial Revolution, positioning it as essential national infrastructure and America's next Apollo moment. We explore how NVIDIA's extreme co-design and Accelerated Computing enable new "AI Factories," achieving 10X generational performance leaps to drive down the cost of generating intelligence. The episode concludes by examining new strategic platforms, including 6G telecommunications (NVIDIA ARC), hybrid quantum computing, and the exponential rise of physical AI and robotics.
The modern workplace often buries professionals under context switching and scattered technology, hindering the productivity gains promised by AI. This episode explores the three stages of working smarter: Block Distractions, Scale Yourself, and Get Results, focusing on how a unified AI platform removes friction. Discover how to move past busywork, amplify your natural curiosity, and channel your enhanced capabilities toward strategic, measurable outcomes that define your career progression.
Join Lance from LangChain and Pete from Manus as they dive deep into the crucial discipline of Context Engineering for building effective AI agents. This webinar explores the challenge of context explosion—where performance drops as long-running agents accumulate tool call observations—and the core themes used to combat it: offloading, reducing, retrieving, and isolating context. Pete shares fresh lessons from building Manus, detailing the difference between reversible compaction and irreversible summarization, and how their layered action space manages tool confusion.
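As a rough sketch of the compaction-versus-summarization distinction discussed here, consider the following Python; the `Entry` structure and the external store are illustrative assumptions, not Manus internals:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    tool: str
    args: dict
    observation: str        # full tool output as seen by the agent
    ref: str | None = None  # where the full output lives once offloaded

def compact(history: list[Entry], store: dict) -> list[Entry]:
    """Reversible compaction: offload each full observation to an
    external store and keep only a pointer in context. Nothing is
    lost, because the agent can re-read the stored content later."""
    out = []
    for i, e in enumerate(history):
        key = f"obs/{i}.txt"
        store[key] = e.observation
        out.append(Entry(e.tool, e.args, f"[stored at {key}]", ref=key))
    return out

def summarize(history: list[Entry]) -> list[Entry]:
    """Irreversible summarization: collapse old entries into a single
    summary entry. Detail is gone for good, so this should fire only
    after compaction no longer frees enough tokens."""
    digest = "; ".join(f"{e.tool} -> {e.observation[:40]}" for e in history)
    return [Entry("summary", {}, digest)]
```

The ordering is the point: compact first because it is free to undo, and summarize only as a last resort.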
Welcome to an essential discussion on Vibe Coding, the new paradigm where developers shift from writing code line-by-line to orchestrating and validating outputs from autonomous AI agents. We'll formalize Vibe Coding as an engineering discipline, exploring its foundations in Large Language Models, complex agent architectures (like planning and memory mechanisms), and integrated feedback loops. Join us as we break down the five distinct development models—from Unconstrained Automation to Test-Driven approaches—and debate the critical challenges of achieving reliable, secure, and scalable human-AI collaboration in software engineering.
Chip Huyen, author of AI Engineering and AI strategy expert from NVIDIA and Netflix, breaks down the technical basics of building successful AI products, covering pre-training, RAG, RLHF, and effective evaluation design. We tackle the growing AI "idea crisis" and the crucial gap between what builders think improves AI applications (like chasing the latest news) versus what actually works (like focusing on user feedback and data preparation). Chip offers essential, in-depth insights into system thinking, organizational structure shifts, and where real productivity gains are being found in the field of AI engineering.
Amid the hype around ChatGPT Atlas, let's talk about the dark side of browsing AI agents.
Welcome to the show, where we discuss DeepSeek-OCR and its investigation into using optical 2D mapping for context compression, addressing the computational challenges of quadratic scaling faced by Large Language Models. We explore the DeepEncoder, the core engine designed to achieve high compression ratios, delivering near-lossless OCR precision (approximately 97%) even at a 10× token reduction. This groundbreaking work demonstrates strong practical value, achieving state-of-the-art document parsing performance on OmniDocBench while using the fewest vision tokens, offering a promising direction for future memory systems.
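As a back-of-the-envelope illustration of the compression claim (the function is plain arithmetic; the token counts are illustrative):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for when a
    page is rendered to an image and fed to the encoder."""
    return text_tokens / vision_tokens

# A page whose transcript costs ~1,000 text tokens, encoded into
# ~100 vision tokens, is the 10x regime where the paper reports
# roughly 97% OCR precision.
print(compression_ratio(1000, 100))  # 10.0
```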
This episode explores Anthropic's revolutionary 'Skills,' a new way to implement Standard Operating Procedures (SOPs) for LLM agents, ensuring consistent, high-quality output for specialized tasks like Excel analysis and document formatting. We dive into how these portable folders contain instructions and executable code, allowing Claude to efficiently access deep, specialized expertise only when needed. Learn the best practices for authoring these skills—from conciseness and appropriate degrees of freedom to iterative testing—as LLM platforms rapidly evolve into customizable agentic environments.
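A minimal sketch of the "load expertise only when needed" pattern, assuming the published SKILL.md convention with `name:` and `description:` frontmatter lines; the loader itself is hypothetical:

```python
from pathlib import Path

def skill_index(skills_dir: str) -> list[str]:
    """Keep only each skill's one-line description in context, so the
    model can cheaply decide whether a skill applies."""
    entries = []
    for meta in Path(skills_dir).glob("*/SKILL.md"):
        fields = {}
        for line in meta.read_text().splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        entries.append(f"{fields.get('name', meta.parent.name)}: "
                       f"{fields.get('description', '')}")
    return entries

def load_skill(skills_dir: str, name: str) -> str:
    """Pull the full instructions (and any bundled code) into context
    only once the model has chosen to use this skill."""
    return (Path(skills_dir) / name / "SKILL.md").read_text()
```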
SEAL, the Self-Adapting Language Model framework, is revolutionizing how LLMs learn by enabling them to generate their own finetuning data and update directives. We explore how these powerful models create "self-edits"—synthetic training data and optimization parameters—which are continuously refined through a reinforcement learning loop. Discover how this meta-learning approach allows LLMs to efficiently incorporate new factual knowledge and significantly improve few-shot generalization success rates.
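A toy rendering of that loop in Python; `ToyModel` reduces an LLM to two scalars, so this shows the shape of the algorithm, not the paper's implementation:

```python
import random

class ToyModel:
    """Stand-in for an LLM; `skill` is a scalar proxy for its weights."""
    def __init__(self, skill=0.1, edit_policy=0.5):
        self.skill = skill               # task competence after finetuning
        self.edit_policy = edit_policy   # how good its self-edits tend to be

    def generate_self_edit(self):
        # In SEAL a self-edit bundles synthetic training data with
        # update directives; here it is just a sampled quality score.
        return random.gauss(self.edit_policy, 0.2)

    def finetune(self, edit_quality):
        # Inner loop: apply the self-edit as a weight update.
        return ToyModel(self.skill + max(edit_quality, 0.0) * 0.1, self.edit_policy)

def seal_loop(model, steps=20):
    """Outer RL loop: reward the edit-generating policy by how much
    the applied edit improved downstream performance."""
    for _ in range(steps):
        edit = model.generate_self_edit()
        candidate = model.finetune(edit)
        reward = candidate.skill - model.skill
        model.edit_policy += 0.5 * reward       # crude policy-gradient step
        if reward > 0:
            candidate.edit_policy = model.edit_policy
            model = candidate                   # keep edits that helped
    return model

print(f"final skill: {seal_loop(ToyModel()).skill:.2f}")
```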
Are expensive Large Language Model (LLM) fine-tuning methods holding back your specialized agents, demanding massive computational resources and data? We dive into Training-Free Group Relative Policy Optimization (Training-Free GRPO), a novel non-parametric method that enhances LLM agent behavior by distilling semantic advantages from group rollouts into lightweight token priors, eliminating costly parameter updates. Discover how this highly efficient approach achieves significant performance gains in specialized domains like mathematical reasoning and web searching, often surpassing traditional fine-tuning while using only dozens of training samples.
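One schematic step of the idea, with `llm` and `judge` as assumed stand-in callables rather than any real API:

```python
def training_free_grpo_step(llm, judge, query: str, prior: list[str],
                            group_size: int = 4) -> list[str]:
    """Sample a group of rollouts, score them, and distill why the
    best beat the worst into a natural-language lesson appended to a
    lightweight token prior -- the 'semantic advantage' lives in
    tokens, not in weight updates."""
    rollouts = [llm(query, context=prior) for _ in range(group_size)]
    scores = [judge(query, r) for r in rollouts]
    mean = sum(scores) / len(scores)
    advantages = [s - mean for s in scores]        # group-relative, as in GRPO
    best = rollouts[advantages.index(max(advantages))]
    worst = rollouts[advantages.index(min(advantages))]
    lesson = llm(f"Contrast two answers to '{query}'. In one sentence, "
                 f"what did the better one do right?\n"
                 f"Better: {best}\nWorse: {worst}")
    prior.append(lesson)                           # reused on future queries
    return prior

# Stand-in usage:
prior = training_free_grpo_step(lambda q, context=None: "an answer",
                                lambda q, r: len(r), "2+2?", [])
print(prior)
```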
Join us for a deep dive with Greg Brockman on the future of AI, where he reveals the internal struggle ("pain and suffering") of managing compute scarcity and the immense physical infrastructure build required to scale systems like Sora 2. Brockman discusses the shift from viewing AGI as a destination to a continuous process, emphasizing that current scaling curves and algorithmic progress continue unabated. We also explore the inevitable move toward proactive AI agents and a fully generative web, predicting a major change to the social contract and web monetization.
Tune in as we explore Agentic Context Engineering (ACE), a novel framework designed to overcome limitations like "brevity bias" and "context collapse" that plague traditional LLM context adaptation methods. ACE transforms model contexts into continuously evolving, structured "playbooks" by employing a modular process of generation, reflection, and curation. We discuss how this approach enables scalable, self-improving agents, yielding substantial performance gains on complex tasks—such as +10.6% on agent benchmarks—while significantly lowering adaptation latency and cost.
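A compressed sketch of one generation-reflection-curation round; the three callables are illustrative stand-ins, not the paper's components:

```python
def ace_update(playbook: list[str], task: str, generator, reflector, curator) -> list[str]:
    """One adaptation round in the ACE style: generate with the current
    playbook, reflect on the trajectory, and curate the insight in as an
    incremental delta instead of rewriting the whole context -- the
    rewrite-everything habit is what drives brevity bias and context
    collapse."""
    trajectory = generator(task, playbook)   # attempt the task using the playbook
    insight = reflector(task, trajectory)    # diagnose what worked or failed
    return curator(playbook, insight)        # append/merge a structured bullet

# Stand-in usage: the curator appends rather than rewrites.
pb = ace_update([], "book a flight",
                generator=lambda t, p: f"tried '{t}' with {len(p)} tips",
                reflector=lambda t, tr: f"lesson from: {tr}",
                curator=lambda p, i: p + [i])
print(pb)
```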
This episode explores the Tiny Recursive Model (TRM), a novel approach that leverages a single, tiny network (as small as 7M parameters) to tackle hard puzzle tasks like Sudoku, Maze, and ARC-AGI. We investigate how this simplified, recursive reasoning strategy achieves significantly higher generalization and outperforms much larger models, including complex Large Language Models (LLMs) and the Hierarchical Reasoning Model (HRM). Discover why this "less is more" philosophy is driving breakthroughs in parameter-efficient AI reasoning by discarding the complex mathematical theorems and biological justifications that earlier recursive approaches relied on.
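The recursive core is simple enough to sketch in a few lines of PyTorch; dimensions, loop counts, and the shared core network here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TinyRecursiveModel(nn.Module):
    """Schematic TRM: one small network applied recursively. It keeps a
    latent scratchpad z and a current answer y, refines z several times
    against the question x, then refines y; the outer loop repeats the
    whole cycle."""
    def __init__(self, d=64):
        super().__init__()
        self.core = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x, y, z, inner=6, outer=3):
        for _ in range(outer):
            for _ in range(inner):                          # refine latent reasoning state
                z = self.core(torch.cat([x, y, z], dim=-1))
            y = self.core(torch.cat([x, y, z], dim=-1))     # refine the answer itself
        return y

x = y = z = torch.zeros(1, 64)
print(TinyRecursiveModel()(x, y, z).shape)  # torch.Size([1, 64])
```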
We demystify Large Language Model (LLM) evaluation, breaking down the four main methods used to compare models: multiple-choice benchmarks, verifiers, leaderboards, and LLM judges. We offer a clear mental map of these techniques, distinguishing between benchmark-based and judgment-based approaches to help you interpret performance scores and measure progress in your own AI development. Discover the pros and cons of each method—from MMLU accuracy checks to the dynamic Elo ranking system—and learn why combining them is key to holistic model assessment.
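For the leaderboard method, the Elo arithmetic is standard and easy to show:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update as used by LLM leaderboards: convert the
    rating gap into an expected win probability, then move both
    ratings toward the observed head-to-head outcome."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Two models at 1000: the winner of a single vote gains 16 points.
print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0)
```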
Original blog post: https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches
OpenAI DevDay 2025 marked the start of the "agentic era" of software development, focusing on making it "easier to build with AI" and transitioning AI from a "chatbot" into a "doer". We break down the revolutionary AgentKit, featuring Agent Builder, a visual, drag-and-drop platform launched to help developers rapidly deploy multi-step AI agents from prototype to production. We also discuss the new Apps SDK for seamlessly integrating third-party services into ChatGPT and the debut of powerful models like GPT-5 Pro and Sora 2, signifying that software development now takes minutes, not months.
Join us as Turing Award recipient Yann LeCun, Chief Scientist at Meta, critiques the state of AI, arguing that current systems, including Large Language Models (LLMs), are nowhere near matching the learning efficiency observed in humans and animals. LeCun proposes a major architectural shift, advocating that AI must abandon generative models for training and instead focus on building internal "World Models" to enable reasoning and planning. Discover how the Joint Embedding Predictive Architecture (JEPA) uses self-supervised learning to train machines to acquire robust, abstract representations of reality, a crucial step toward achieving common sense and human-level intelligence.
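A minimal joint-embedding sketch in PyTorch: the model predicts the target's embedding rather than reconstructing the input, which is the core departure from generative training. Sizes are illustrative; real JEPA variants such as I-JEPA use ViT encoders and an EMA-updated target encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJEPA(nn.Module):
    """Encode a context view and a target view, then predict the
    target's representation from the context's -- no pixel-space
    reconstruction anywhere in the loss."""
    def __init__(self, d_in=128, d=32):
        super().__init__()
        self.context_enc = nn.Linear(d_in, d)
        self.target_enc = nn.Linear(d_in, d)   # typically an EMA copy in practice
        self.predictor = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def loss(self, context_view, target_view):
        pred = self.predictor(self.context_enc(context_view))
        with torch.no_grad():                   # stop-gradient on the target branch
            tgt = self.target_enc(target_view)
        return F.mse_loss(pred, tgt)            # distance in representation space

model = ToyJEPA()
x = torch.randn(8, 128)
print(model.loss(x, x + 0.1 * torch.randn_like(x)).item())
```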
Are smart machines making us forget how to think? This episode dives into the quiet phenomenon of AI-induced skill erosion, where relying on intelligent systems creates an "illusion of mastery" while core competence fades. We explore the organizational implications of deskilling and discuss strategies, such as targeted auditing and better system design, needed to preserve expertise when AI handles essential tasks.