Explore every episode of the podcast Humans of Reliability

Dive into the complete episode list for Humans of Reliability. Each episode is cataloged with detailed descriptions, making it easy to find and explore specific topics. Keep track of all episodes from your favorite podcast and never miss a moment of insightful content.

Title | Pub. Date | Duration
The End of “Good Code”? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber | 10 Sep 2025 | 00:37:38

Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of Humans of Reliability, Rob Zuber, CircleCI CTO, joins Sylvain to explore how coding assistants are reshaping developer workflows and changing what teams value.

Rob shares what he’s seeing across CircleCI’s customer base: a clear boost in throughput, new bottlenecks shifting from code creation to code review, and the rise of “vibe coding,” where engineers trust AI-generated code they may not fully understand.

He challenges long-held assumptions about readability and maintainability, arguing that software engineering is on the edge of a paradigm shift. For SREs and developers alike, this conversation is a candid look at how to stay relevant, embrace simplicity, and rethink reliability in the age of AI.

Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando) | 20 Aug 2025 | 00:31:03

What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner (Razor, ex-Zalando) shares her unique journey from frontend and backend engineering to leading site reliability practices. She explains why protecting the user journey is the key to effective incident management, how SLOs cut through noisy alerts, and why observability must come first.

Shery also talks about practical steps teams can take to adopt an SLO-driven strategy, the pitfalls of over-instrumentation, and how AI is shaping the future of incident response. Whether you’re an engineer, manager, or reliability leader, you’ll walk away with concrete ideas on how to protect what matters most: your customers’ experience.

Are AI and Platforms Making SRE Obsolete? With Kaspar von Grünberg, Humanitec’s CEO | 24 Mar 2025 | 00:25:44

Last year, over 89% of companies claimed to have adopted platform engineering. And, in the past month, LLMs have been disrupting how we think about software development. In this context, Kaspar asks whether the role of Site Reliability Engineers as we know it is becoming obsolete. He argues that while SREs aren’t going anywhere, their responsibilities are evolving fast.

We talk about:

  • The need for the SRE role to be transformed
  • How to build reliability as part of every golden path
  • The role of AI and LLMs in Developer Experience
  • The limits of LLMs for reliability and infrastructure

Scientific Incident Management with Dan Slimmon | 14 Mar 2025 | 00:37:35

Dan Slimmon is an incident management veteran who's worked at Etsy, HashiCorp, and now leads consulting and training on pragmatic, non-bureaucratic incident response. 

In this episode, Dan shares his philosophy on "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems. 

We also explore:

  • Why asking the right questions is more important than knowing all the answers. 
  • How to use nerd sniping to unlock insights from engineers. 
  • Common failure patterns he sees across organizations. 

How AI broke serverless and what to do about it with Vercel’s Mariano Fernández Cocirio | 06 Mar 2025 | 00:13:52

Mariano, Staff Product Manager at Vercel, explains why serverless architectures are hitting unexpected limits—they’re too fast. 

The industry has spent millions optimizing serverless for speed, but AI workloads are changing the game. In the AI realm, slower execution often leads to better results. The challenge? Paying for all that idle compute time while waiting for AI responses. 

Mariano explains how Vercel Fluid is introducing a new execution model that blends the best of serverless and traditional servers—scaling efficiently while reducing costs. Mariano breaks down Fluid’s architecture, its built-in reliability features, and how it redefines cloud computing for LLM-powered applications. 

Tune in to learn how Fluid could reshape the industry and what it means for developers. 

I Want My Shoes Fast! Observability, SRE Burnout, and OTel with Dynatrace’s Adriana Villela | 27 Feb 2025 | 00:34:23

In this episode, we sit down with Adriana Villela, Principal DevRel at Dynatrace and OpenTelemetry contributor, to break down how observability impacts reliability.

We dive into what contributes to SRE burnout and how managers can create psychologically safer spaces for responders. 

Adriana also shares her perspective on AI as an observability buddy for navigating incidents.

AI in Production with GitHub’s Sean Goedecke | 18 Feb 2025 | 00:17:33

In this episode, we sit down with Sean Goedecke, Staff Software Engineer at GitHub, to discuss where LLMs fit into real-world development. 

Sean shares how he’s using LLMs and where he’s drawing the line for AI assistance in the codebases he manages, though, as he says, this might all change by next summer.

Sean also weighs in on how LLMs could assist SREs during outages, especially when you’re only half-awake at 3 a.m. after a rather inconvenient page.

Tune in for a nuanced take on the future of AI in software engineering, “vibe coding,” and the evolution of rubber ducks. 

The Reliability Diagnosis: Google’s Steve McGhee on Debugging and Incident Response | 10 Feb 2025 | 00:15:32

In this episode of Humans of Reliability, we sit down with Steve McGhee, Reliability Advocate at Google, to discuss his journey from early SRE work to advocating for reliability best practices. 

Steve shares fascinating stories from his time at Google, the challenges of implementing SRE in enterprises, and what people often misunderstand about the discipline. 

He also offers valuable insights on incident response, distributed systems, and the underrated skill every reliability engineer should master. Whether you're new to SRE or a seasoned professional, this conversation is packed with wisdom and practical takeaways.

This episode is also available as a video interview on YouTube.

No CS Degree, No Problem: Building a Career in Tech Leadership | 05 Feb 2025 | 00:11:09

What does it take to lead service delivery at a company experiencing massive growth? 

Hannah Hammonds, Service Delivery Lead at Prolific, shares her journey from an IT networking apprentice to a tech leader shaping reliability and incident response. 

We discuss the evolving role of service delivery, the power of mentorship, and how confidence transforms careers.

Plus, we debate hot dogs, spoilers, and The Office.

Tune in for career insights, leadership lessons, and a few laughs! 🎙️🚀

This podcast episode is also available on YouTube if you want to see a video version of this interview.

Beyond SLOs: How an ex-Google SRE scaled reliability at the largest e-commerce in the Nordics | 03 Feb 2025 | 00:07:34

What happens when a Google-trained SRE joins a fast-moving e-commerce company? 

Gastón Rial Saibene, SRE Lead at Boozt.com, joins Humans of Reliability to talk about adapting reliability practices for different company sizes, the limits of SLOs, and the importance of automation. 

We also dive into decision-making, his favorite books, and—just for fun—whether he’d survive a zombie apocalypse. Tune in for insights, laughs, and a fresh perspective on the world of reliability engineering! 

The Domino Effect of Outages with Nuno Tomás, Founder of isDown.app | 24 Jan 2025 | 00:34:45

🎙️ Humans of Reliability: Keeping systems up and the lights on isn’t just about technology—it’s about the people behind it. In this episode, we’re thrilled to chat with Nuno Tomás, founder of IsDown.app, a vendor outage monitoring tool transforming how teams handle third-party incidents.

Nuno shares his journey from software engineer to entrepreneur, the pivotal 4 a.m. moment that inspired IsDown, and the challenges of balancing startup life with family. We dive into the complexities of incident communication, how to tackle alert fatigue, and why transparency is key to building trust in SaaS.

If you’ve ever woken up to a midnight alert or wondered how to streamline incident response, this episode is packed with relatable stories, actionable insights, and a fresh perspective on modern reliability. Don't miss it! 🚀

https://rootly.com/humans-of-reliability

Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold) | 03 Jul 2025 | 00:13:23

Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the intersection of traditional finance and crypto markets.

Brian shares how unexpected market events can create massive traffic spikes, how their platform architecture and Kubernetes setup help them stay resilient, and why Uphold's transparency and regulatory approach make them both trustworthy and a high-profile target.

The conversation also touches on AI's emerging role in operations, lessons from major incidents, and the delicate balance between innovation and stringent compliance in financial engineering.

Command Under Pressure: David Owczarek on Incident Leadership and Human-Centered Reliability | 17 Jun 2025 | 00:23:25

Incident response is as much about people as it is about systems. In this episode, David Owczarek, a veteran engineering leader and seasoned incident commander, joins Sylvain Kalache to unpack the human dynamics behind effective reliability leadership.

Drawing on experiences across startups and global enterprises, David shares what really matters when everything breaks, including:

  • How incident response strategies shift between small companies and large enterprises
  • Why not every engineer should be an incident commander
  • How empathy and transparency during outages can deepen customer trust instead of eroding it
  • Where AI is showing promise in SRE workflows, and where human judgment still reigns supreme

If you’re leading incidents, training ICs, or thinking about the future of AI-assisted reliability, this episode is packed with hard-won insights and grounded strategies for navigating chaos with clarity.

AI at the Frontlines of Healthcare Reliability with Ryan Lockard (CVS Health) | 30 May 2025 | 00:24:07

AI is transforming reliability work—from reactive firefighting to proactive engineering. In this episode, Ryan Lockard, VP of Platform Engineering and AI Enablement at CVS Health, joins Sylvain Kalache to break down how AI is showing up on the frontlines of healthcare infrastructure and operations.

From LLM copilots to cultural shifts in ownership, Ryan walks us through:

  • How AI tools help troubleshoot legacy systems and assist during real-time incidents
  • Why proactive reliability is finally within reach thanks to AI-enhanced tooling and workflows
  • What MCP servers are, and how natural language interfaces are streamlining cloud operations
  • How engineering culture and on-call models shift when teams truly own their reliability posture
  • What AI means for the next generation of developers—and why prompt engineering matters

Whether you're managing incidents, building platforms, or scaling team culture, this conversation offers a pragmatic lens on how AI is changing the work of reliability from the inside out.

Trust Is the Product: Building Reliable Billing in the AI Era with Cosmo Wolfe (Metronome) | 26 May 2025 | 00:20:16

In this episode, we sit down with Cosmo Wolfe, Head of Technology at Metronome, to unpack how reliability, trust, and architecture intersect in one of the most critical and overlooked parts of the AI product stack: billing.

As AI workloads introduce unpredictable usage patterns and nontraditional pricing models—from token-based to outcome-based—companies are navigating a new frontier of customer trust. Cosmo explains why billing is more than just a backend function; it’s a key moment of truth in the product experience.

We explore how event-sourced systems, rigorous monitoring, and internal accountability help avoid trust-eroding mistakes, like misbilled invoices or opaque usage tracking. Whether it’s ensuring a usage cap actually activates or being able to reconstruct billing history for an enterprise customer, this episode reveals how reliability at the billing layer becomes a strategic advantage in a competitive, AI-driven software landscape.

The Golden Path to Nowhere: When Platforms Undermine Reliability with Chase Roberts (Northflank) | 14 May 2025 | 00:27:26

Internal platforms promise speed, consistency, and scale — but what happens when they become a distraction? In this episode, Chase Roberts, COO at Northflank, joins Sylvain Kalache to examine the quiet ways platforms erode developer experience when not planned carefully. 

From abandoned golden paths to shadow deployments and brittle YAML pipelines, Chase walks us through: 

  • Why early PaaS got developer experience right and what it missed 
  • The cultural bias toward building over buying (and its hidden costs) 
  • How complexity quietly kills productivity and reliability, even when everything “works” 
  • The three questions every team should ask before building an IDP 
  • What a future of truly portable, resilient platforms might look like 

AI can boost developer productivity, if used right, with Justin Reock, Deputy CTO at DX | 30 Apr 2025 | 00:37:48

In this episode of Humans of Reliability, we sit down with Justin Reock, Deputy CTO at DX, to unpack the real impact of generative AI on developer productivity. Drawing from early data in DX’s GenAI Impact Report, he explains why time savings alone don’t tell the full story and why the real value might lie in shifting cognitive load toward meaningful work. 

We also explore how traditional productivity metrics like PR throughput can backfire, why teams need to move beyond DORA, and how modern frameworks like SPACE and the DX Core 4 offer a more complete view of engineering health. 

Why Reliability in the AI Era Starts with the Network with Marino Wijay | 17 Apr 2025 | 00:27:03

In this episode, we explore how networking has shaped reliability as we know it. Marino Wijay, cloud networking expert and Staff Solutions Architect at Kong, shares how his journey began not as an SRE, but with cables, routers, and switches.

Marino explains how the fabric holding systems together evolved through virtualization and software-defined networking, which is now a key element of resilient applications.

This episode also dives into the new challenges LLMs are introducing into networking. Marino discusses how these workloads introduce new types of reliability challenges: longer response times, context preservation, model switching, and request sanitization.

Finally, Marino emphasizes the human role. Behind every protocol, abstraction layer, and model prompt is a human providing context, compassion, and a sense of accountability.

Metrics That Matter: Measuring Developer Productivity in the AI Era | 09 Apr 2025 | 00:39:36

In this episode of Humans of Reliability, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at 90, for a conversation that cuts through the noise around developer productivity metrics and AI.

Mark dives deep into how teams can measure what matters—without falling into the trap of turning every measure into a target. He shares how tools like Developer NPS, DORA metrics, and balanced scorecards can help teams optimize for both output and well-being—but only when framed with the right intent.

As AI tools like GitHub Copilot begin to shift how engineering work gets done, Mark also explores the tension between time saved and value created. Drawing on his experience at Wayfair and now 90, he shares real numbers, practical pitfalls, and a healthy dose of executive common sense on how to track gains without over-engineering the data.
