Scientific Incident Management with Dan Slimmon

https://is1-ssl.mzstatic.com/image/thumb/Podcasts221/v4/20/f3/2e/20f32e67-e699-3cec-f77a-b7c1b381baa4/mza_5590371432551785944.jpg/600x600bb.jpg

Humans of Reliability

Rootly

20 episodes

4 days ago

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Brondsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to make large language model (LLM) reliable in production. Conor dives into the shift from traditional test-driven development to evaluation-driven development, where metrics like context adherence, completeness, and ac...

Technology

RSS

All content for Humans of Reliability is the property of Rootly and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Technology

Scientific Incident Management with Dan Slimmon

Humans of Reliability

37 minutes

8 months ago

Scientific Incident Management with Dan Slimmon

Dan Slimmon is an incident management veteran who's worked at Etsy, HashiCorp, and now leads consulting and training on pragmatic, non-bureaucratic incident response. In this episode, Dan shares his philosophy on "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems. We also explore: Why asking the right questions is more important than knowing all the answers. How to use nerd sniping...