How AI Is Built
Nicolay Gerold
63 episodes
6 days ago
Real engineers. Real deployments. Zero hype. We interview the top engineers who actually put AI in production. Learn what the best engineers have figured out through years of experience. Hosted by Nicolay Gerold, CEO of Aisbach and CTO at Proxdeal and Multiply Content.
Technology
#051 Build systems that can be debugged at 4am by tired humans with no context
How AI Is Built
1 hour 5 minutes 51 seconds
4 months ago

Nicolay here,

Today I have the chance to talk to Charity Majors, CEO and co-founder of Honeycomb, who has recently been writing about the cost crisis in observability.

"Your source of truth is production, not your IDE - and if you can't understand your code there, you're flying blind."

The key insight is architecturally simple but operationally transformative: replace your 10-20 observability tools with wide structured events that capture everything about a request in one place. Most teams store the same request data across metrics, logs, traces, APM, and error tracking - creating a 20X cost multiplier while making debugging nearly impossible because you're reconstructing stories from fragments.

Charity's approach flips this: instrument once with rich context, derive everything else from that single source. This isn't just about cost - it's about giving engineers the connective tissue to understand distributed systems. When you can correlate "all requests failing from Android version X in region Y using language pack Z," you find problems in minutes instead of days.
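
To make the wide-event idea concrete, here is a minimal sketch in Python of emitting one structured event per unit of work instead of scattered log lines. It is an illustration only - the service name, field names, and the process() helper are hypothetical, and it prints plain JSON to stdout rather than using Honeycomb's or OpenTelemetry's actual APIs.

import json
import sys
import time


def process(request):
    """Placeholder for the real business logic."""
    return {"item_count": 3}


def handle_request(request, user):
    """Handle one unit of work and emit a single wide, structured event for it."""
    event = {
        "timestamp": time.time(),
        "service": "checkout-api",                                # hypothetical service
        "endpoint": request["path"],
        "user_id": user["id"],
        "app_version": request["headers"].get("x-app-version"),   # e.g. Android build
        "region": request["headers"].get("x-region"),
        "language_pack": user.get("language_pack"),
    }
    start = time.monotonic()
    try:
        result = process(request)
        event["status"] = 200
        event["cart_items"] = result["item_count"]
    except Exception as exc:
        event["status"] = 500
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        # One wide event per request: enough context in one place to later ask
        # "why are requests from Android version X in region Y with pack Z failing?"
        print(json.dumps(event), file=sys.stdout)


handle_request(
    request={"path": "/checkout",
             "headers": {"x-app-version": "android-14.2", "x-region": "eu-west-1"}},
    user={"id": "u_42", "language_pack": "de-DE"},
)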

The second shift is putting developers on call for their own code. This creates the tight feedback loop that makes engineers write more reliable software - because nobody wants to get paged at 3am for their own bugs.

In the podcast, we also touch on:

  • Why deploy time is the foundational feedback loop (15 minutes vs 15 hours changes everything)
  • The controversial "developers on call" stance and why ops people rarely found companies
  • How microservices made everything trace-shaped and killed traditional metrics approaches
  • The "normal engineer" philosophy - building for 4am debugging, not peak performance
  • AI making "code of unknown quality" the new normal
  • Progressive deployment strategies (kibble → dogfood → production) - see the sketch after this list
  • and more
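
A rough sketch of the progressive deployment idea above: promote a build one stage at a time and stop at the first unhealthy stage. The stage names come from the episode; deploy(), healthy(), and the soak logic are hypothetical placeholders, not a real pipeline.

# Hypothetical stage names from the episode: internal "kibble",
# employee-facing "dogfood", then real customer traffic in "production".
STAGES = ["kibble", "dogfood", "production"]


def deploy(build_id: str, stage: str) -> None:
    """Placeholder: ship build_id to the given environment."""
    print(f"deploying {build_id} to {stage}")


def healthy(stage: str) -> bool:
    """Placeholder: let the build soak, then check SLO/error signals for the stage."""
    return True


def progressive_rollout(build_id: str) -> None:
    """Promote a build one stage at a time; stop at the first unhealthy stage."""
    for stage in STAGES:
        deploy(build_id, stage)
        if not healthy(stage):
            print(f"halting rollout at {stage}; roll back and investigate")
            return
    print(f"{build_id} is fully rolled out")


progressive_rollout("build-1234")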

💡 Core Concepts

  • Wide Structured Events: Capturing all request context in one instrumentation event instead of scattered log lines - enables correlation analysis that's impossible with fragmented data.
  • Observability 2.0: Moving from metrics-as-workhorse to structured-data-as-workhorse, where you instrument once and derive metrics/alerts/dashboards from the same rich dataset.
  • SLO-based Alerting: Replacing symptom alerts (CPU, memory, disk) with customer-impact alerts that measure whether you're meeting promises to users - see the sketch after this list.
  • Progressive Deployment: Gradual rollout through staged environments (kibble → dogfood → production) that builds confidence without requiring 2X infrastructure.
  • Trace-shaped Systems: Architecture pattern recognizing that distributed systems problems are fundamentally about correlating events across time and services, not isolated metrics.
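
And a back-of-the-envelope sketch of SLO-based alerting: page on how fast the error budget is burning, not on CPU or memory thresholds. The numbers are made up, and the 14.4x fast-burn threshold is a commonly used default rather than something prescribed in the episode.

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in this window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    A burn rate of 1.0 consumes the budget exactly at the allowed pace;
    well above 1.0 means customers are feeling it and someone should be paged.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target              # allowed failure fraction
    observed_failure_rate = bad_events / total_events
    return observed_failure_rate / error_budget


# Example: 99.9% availability SLO, last five minutes of traffic.
rate = burn_rate(bad_events=120, total_events=10_000, slo_target=0.999)
if rate > 14.4:    # fast-burn threshold; tune for your own SLO and window
    print(f"PAGE: burning error budget at {rate:.1f}x the sustainable rate")
elif rate > 1.0:
    print(f"TICKET: elevated burn rate of {rate:.1f}x, review during business hours")
else:
    print("OK: within error budget")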

📶 Connect with Charity:

  • LinkedIn
  • Bluesky
  • Personal Blog
  • Company

📶 Connect with Nicolay:

  • LinkedIn
  • X / Twitter
  • Website

⏱️ Important Moments

  • Gateway Drug to Engineering: [01:04] How IRC and bash tab completion sparked Charity's fascination with Unix command line possibilities
  • ADHD and Incident Response: [01:54] Why high-pressure outages brought out her best work - getting "dead calm" when everything's broken
  • Code vs. Production Reality: [02:56] Evolution from focusing on code beauty to understanding performance, behavior, and maintenance over time
  • The Alexander's Horse Principle: [04:49] Auto-deployment as daily practice - if you grow up deploying constantly, it feels natural by the time you scale
  • Production as Source of Truth: [06:32] Why your IDE output doesn't matter if you can't understand your code's intersection with infrastructure and users
  • The Logging Evolution: [08:03] Moving from debugger-style spam logs to fewer, wider structured events oriented around units of work
  • Bubble Up Anomaly Detection: [10:27] How correlating dimensions reveals that failures cluster around specific Android versions, regions, and feature combinations
  • Everything is Trace-Shaped: [12:45] Why microservices complexity is about locating problems in distributed systems, not just identifying them
  • AI as Acceleration of Automation: [15:57] Why most AI panic reads the same if you swap "AI" for "automation" - it's the same pattern, just with faster feedback loops
  • Non-determinism as Genuinely New: [16:51] The one aspect of AI that's actually novel in software systems, requiring new architectural patterns
  • The Cost Crisis: [22:30] How 10-20 observability tools create unsustainable cost multipliers as businesses scale
  • The Instrumentation Habit: [23:15] Always looking at your code in production after deployment to build informed instincts about system behavior
  • SLO Revolution: [28:40] Deleting 90% of alerts by focusing on customer impact instead of system symptoms
  • Shrinking Feedback Loops: [34:28] Keeping deploy-to-validation under one hour so engineers can connect actions to outcomes
  • Progressive Deployment Strategy: [36:43] Kibble → Dog Food → Production pipeline for gradual confidence building
  • Normal Engineer Design: [38:12] Building systems that work for tired humans at 4am, not just heroes during business hours
  • Real Engineering Bar: [49:00] Discussion on what actually makes exceptional vs normal engineers

🛠️ Tools & Tech Mentioned

  • Honeycomb - Observability platform for structured events
  • OpenTelemetry - Vendor-neutral instrumentation framework
  • IRC - Early gateway to computing
  • Parse - Mobile backend where Honeycomb's origin story began

📚 Recommended Resources

  • "In Praise of Normal Engineers" - Charity's blog post
  • "How I Failed" by Tim O'Reilly
  • "Looking at the Crux" by Richard Rumelt
  • "Fluke" - Book about randomness in history
  • "Engineering Management for the Rest of Us" by Sarah Drasner