Episode 60: 10 Things I Hate About AI Evals with Hamel Husain

Vanishing Gradients

Hugo Bowne-Anderson

62 episodes

6 days ago

A podcast about all things data, brought to you by data scientist Hugo Bowne-Anderson. It's time for more critical conversations about the challenges in our industry in order to build better compasses for the solution space! To this end, this podcast will consist of long-format conversations between Hugo and other people who work broadly in the data science, machine learning, and AI spaces. We'll dive deep into all the moving parts of the data world, so if you're new to the space, you'll have an opportunity to learn from the experts. And if you've been around for a while, you'll find out what's happening in many other parts of the data world.

Technology

1 hour 13 minutes 15 seconds

1 month ago

Most AI teams find "evals" frustrating, but ML Engineer Hamel Husain argues they’re just using the wrong playbook. In this episode, he lays out a data-centric approach to systematically measure and improve AI, turning unreliable prototypes into robust, production-ready systems.

Drawing from his experience getting countless teams unstuck, Hamel explains why the solution requires a "revenge of the data scientists." He details the essential mindset shifts, error analysis techniques, and practical steps needed to move beyond guesswork and build AI products you can actually trust.

We talk through:

The 10(+1) critical mistakes that cause teams to waste time on evals
Why "hallucination scores" are a waste of time (and what to measure instead)
The manual review process that finds major issues in hours, not weeks
A step-by-step method for building LLM judges you can actually trust
How to use domain experts without getting stuck in endless review committees
Guest Bryan Bischof's "Failure as a Funnel" for debugging complex AI agents

If you're tired of ambiguous "vibe checks" and want a clear process that delivers real improvement, this episode provides the definitive roadmap.

LINKS

🎓 Learn more:

Hugo's course: Building LLM Applications for Data Scientists and Software Engineers — https://maven.com/s/course/d56067f338