How AI Is Built
Nicolay Gerold
63 episodes
6 days ago
Real engineers. Real deployments. Zero hype. We interview the top engineers who actually put AI in production. Learn what the best engineers have figured out through years of experience. Hosted by Nicolay Gerold, CEO of Aisbach and CTO at Proxdeal and Multiply Content.
Technology
#051 Build systems that can be debugged at 4am by tired humans with no context
How AI Is Built
1 hour 5 minutes 51 seconds
4 months ago

Nicolay here,

Today I have the chance to talk to Charity Majors, CEO and co-founder of Honeycomb, who has recently been writing about the cost crisis in observability.

"Your source of truth is production, not your IDE - and if you can't understand your code there, you're flying blind."

The key insight is architecturally simple but operationally transformative: replace your 10-20 observability tools with wide structured events that capture everything about a request in one place. Most teams store the same request data across metrics, logs, traces, APM, and error tracking - creating a 20X cost multiplier while making debugging nearly impossible because you're reconstructing stories from fragments.

Charity's approach flips this: instrument once with rich context, derive everything else from that single source. This isn't just about cost - it's about giving engineers the connective tissue to understand distributed systems. When you can correlate "all requests failing from Android version X in region Y using language pack Z," you find problems in minutes instead of days.
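
To make the wide-event idea concrete, here is a minimal sketch in Python of emitting one structured event per unit of work instead of scattered log lines. It is an illustration only - the service name, field names, and the process() helper are hypothetical, and it prints plain JSON to stdout rather than using Honeycomb's or OpenTelemetry's actual APIs.

import json
import sys
import time


def process(request):
    """Placeholder for the real business logic."""
    return {"item_count": 3}


def handle_request(request, user):
    """Handle one unit of work and emit a single wide, structured event for it."""
    event = {
        "timestamp": time.time(),
        "service": "checkout-api",                                # hypothetical service
        "endpoint": request["path"],
        "user_id": user["id"],
        "app_version": request["headers"].get("x-app-version"),   # e.g. Android build
        "region": request["headers"].get("x-region"),
        "language_pack": user.get("language_pack"),
    }
    start = time.monotonic()
    try:
        result = process(request)
        event["status"] = 200
        event["cart_items"] = result["item_count"]
    except Exception as exc:
        event["status"] = 500
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        # One wide event per request: enough context in one place to later ask
        # "why are requests from Android version X in region Y with pack Z failing?"
        print(json.dumps(event), file=sys.stdout)


handle_request(
    request={"path": "/checkout",
             "headers": {"x-app-version": "android-14.2", "x-region": "eu-west-1"}},
    user={"id": "u_42", "language_pack": "de-DE"},
)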

The second shift is putting developers on call for their own code. This creates the tight feedback loop that makes engineers write more reliable software - because nobody wants to get paged at 3am for their own bugs.

In the podcast, we also touch on:

  • Why deploy time is the foundational feedback loop (15 minutes vs 15 hours changes everything)
  • The controversial "developers on call" stance and why ops people rarely found companies
  • How microservices made everything trace-shaped and killed traditional metrics approaches
  • The "normal engineer" philosophy - building for 4am debugging, not peak performance
  • AI making "code of unknown quality" the new normal
  • Progressive deployment strategies (kibble → dogfood → production) - see the sketch after this list
  • and more
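
A rough sketch of the progressive deployment idea above: promote a build one stage at a time and stop at the first unhealthy stage. The stage names come from the episode; deploy(), healthy(), and the soak logic are hypothetical placeholders, not a real pipeline.

# Hypothetical stage names from the episode: internal "kibble",
# employee-facing "dogfood", then real customer traffic in "production".
STAGES = ["kibble", "dogfood", "production"]


def deploy(build_id: str, stage: str) -> None:
    """Placeholder: ship build_id to the given environment."""
    print(f"deploying {build_id} to {stage}")


def healthy(stage: str) -> bool:
    """Placeholder: let the build soak, then check SLO/error signals for the stage."""
    return True


def progressive_rollout(build_id: str) -> None:
    """Promote a build one stage at a time; stop at the first unhealthy stage."""
    for stage in STAGES:
        deploy(build_id, stage)
        if not healthy(stage):
            print(f"halting rollout at {stage}; roll back and investigate")
            return
    print(f"{build_id} is fully rolled out")


progressive_rollout("build-1234")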

💡 Core Concepts

  • Wide Structured Events: Capturing all request context in one instrumentation event instead of scattered log lines - enables correlation analysis that's impossible with fragmented data.
  • Observability 2.0: Moving from metrics-as-workhorse to structured-data-as-workhorse, where you instrument once and derive metrics/alerts/dashboards from the same rich dataset.
  • SLO-based Alerting: Replacing symptom alerts (CPU, memory, disk) with customer-impact alerts that measure whether you're meeting promises to users - see the sketch after this list.
  • Progressive Deployment: Gradual rollout through staged environments (kibble → dogfood → production) that builds confidence without requiring 2X infrastructure.
  • Trace-shaped Systems: Architecture pattern recognizing that distributed systems problems are fundamentally about correlating events across time and services, not isolated metrics.
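
And a back-of-the-envelope sketch of SLO-based alerting: page on how fast the error budget is burning, not on CPU or memory thresholds. The numbers are made up, and the 14.4x fast-burn threshold is a commonly used default rather than something prescribed in the episode.

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in this window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    A burn rate of 1.0 consumes the budget exactly at the allowed pace;
    well above 1.0 means customers are feeling it and someone should be paged.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target              # allowed failure fraction
    observed_failure_rate = bad_events / total_events
    return observed_failure_rate / error_budget


# Example: 99.9% availability SLO, last five minutes of traffic.
rate = burn_rate(bad_events=120, total_events=10_000, slo_target=0.999)
if rate > 14.4:    # fast-burn threshold; tune for your own SLO and window
    print(f"PAGE: burning error budget at {rate:.1f}x the sustainable rate")
elif rate > 1.0:
    print(f"TICKET: elevated burn rate of {rate:.1f}x, review during business hours")
else:
    print("OK: within error budget")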

📶 Connect with Charity:

  • LinkedIn
  • Bluesky
  • Personal Blog
  • Company

📶 Connect with Nicolay:

  • LinkedIn
  • X / Twitter
  • Website

⏱️ Important Moments

  • Gateway Drug to Engineering: [01:04] How IRC and bash tab completion sparked Charity's fascination with Unix command line possibilities
  • ADHD and Incident Response: [01:54] Why high-pressure outages brought out her best work - getting "dead calm" when everything's broken
  • Code vs. Production Reality: [02:56] Evolution from focusing on code beauty to understanding performance, behavior, and maintenance over time
  • The Alexander's Horse Principle: [04:49] Auto-deployment as daily practice - if you grow up deploying constantly, it feels natural by the time you scale
  • Production as Source of Truth: [06:32] Why your IDE output doesn't matter if you can't understand your code's intersection with infrastructure and users
  • The Logging Evolution: [08:03] Moving from debugger-style spam logs to fewer, wider structured events oriented around units of work
  • Bubble Up Anomaly Detection: [10:27] How correlating dimensions reveals that failures cluster around specific Android versions, regions, and feature combinations
  • Everything is Trace-Shaped: [12:45] Why microservices complexity is about locating problems in distributed systems, not just identifying them
  • AI as Acceleration of Automation: [15:57] Why most AI panic reads the same if you swap "AI" for "automation" - it's the same pattern, just with faster feedback loops
  • Non-determinism as Genuinely New: [16:51] The one aspect of AI that's actually novel in software systems, requiring new architectural patterns
  • The Cost Crisis: [22:30] How 10-20 observability tools create unsustainable cost multipliers as businesses scale
  • The Instrumentation Habit: [23:15] Always looking at your code in production after deployment to build informed instincts about system behavior
  • SLO Revolution: [28:40] Deleting 90% of alerts by focusing on customer impact instead of system symptoms
  • Shrinking Feedback Loops: [34:28] Keeping deploy-to-validation under one hour so engineers can connect actions to outcomes
  • Progressive Deployment Strategy: [36:43] Kibble → Dog Food → Production pipeline for gradual confidence building
  • Normal Engineer Design: [38:12] Building systems that work for tired humans at 4am, not just heroes during business hours
  • Real Engineering Bar: [49:00] Discussion on what actually makes exceptional vs normal engineers

🛠️ Tools & Tech Mentioned

  • Honeycomb - Observability platform for structured events
  • OpenTelemetry - Vendor-neutral instrumentation framework
  • IRC - Early gateway to computing
  • Parse - Mobile backend where Honeycomb's origin story began

📚 Recommended Resources

  • "In Praise of Normal Engineers" - Charity's blog post
  • "How I Failed" by Tim O'Reilly
  • "Looking at the Crux" by Richard Rumelt
  • "Fluke" - Book about randomness in history
  • "Engineering Management for the Rest of Us" by Sarah Drasner