Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Brondsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to making large language models (LLMs) reliable in production. Conor dives into the shift from traditional test-driven development to evaluation-driven development, where metrics like context adherence, completeness, and ac...
Beyond SLOs: How an ex-Google SRE scaled reliability at the largest e-commerce in the Nordics
Humans of Reliability
7 minutes
9 months ago
What happens when a Google-trained SRE joins a fast-moving e-commerce company? Gastón Rial Saibene, SRE Lead at Boozt.com, joins Humans of Reliability to talk about adapting reliability practices for different company sizes, the limits of SLOs, and the importance of automation. We also dive into decision-making, his favorite books, and—just for fun—whether he’d survive a zombie apocalypse. Tune in for insights, laughs, and a fresh perspective on the world of reliability engineerin...