Google SRE Prodcast


SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!

MP English, Viv, Salim Virji


    • Apr 16, 2025 LATEST EPISODE
    • monthly NEW EPISODES
    • 34m AVG DURATION
    • 33 EPISODES



    Latest episodes from Google SRE Prodcast

    We're back with Season 4!

    Apr 16, 2025 (15:03)


    In this "bumpisode", hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as a video format and a feedback form.

    Special Episode: You Missed a Page from Telebot

    Jan 29, 2025 (16:14)


    This episode features Javi Beltran, a Google engineering lead who created the "Telebot" theme song. With our beloved hosts, Steve McGhee and Jordan Greenberg, Beltran discusses the origins of the song, created in 2012 for Google's paging system. The song was meant to add a touch of levity to what could be a stressful situation for engineers on-call. Beltran also unveils a new, more modern remix of "Telebot" (created in collaboration with our host, Jordan Greenberg!) which will be used as the intro theme for the podcast's next season.

    Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

    Dec 11, 2024 (36:10)


    In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE, Google) join hosts Steve McGhee and Jordan Greenberg to dive into configurations. They discuss the differences between imperative and declarative configuration, explore the benefits and challenges of each approach, and stress the need for careful consideration when choosing between the two. Ultimately, the goal is to achieve reliable and maintainable systems through effective configuration management.

    Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

    Dec 4, 2024 (41:18)


    This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discuss how resilience appears in Software Engineering and SRE and explore the importance of understanding the human factors involved in adapting to system failures—highlighting the need for a more qualitative and holistic approach to understanding how engineers successfully adapt to system behavior and improving overall reliability.

    Embracing Complexity with Christina Schulman & Dr. Laura Maguire

    Nov 20, 2024 (33:59)


    In this episode of the Prodcast, we are joined by guests Christina Schulman (Staff SRE, Google) and Dr. Laura Maguire (Principal Engineer, Trace Cognitive Engineering). They emphasize the human element of SRE and the importance of fostering a culture of collaboration, learning, and resilience in managing complex systems. They touch on the need for diverse perspectives and collaboration in incident response and the necessity of embracing complexity, and explore concepts such as aerodynamic stability.

    Maglev: load balancing at Google with Cody Smith and Trisha Weir

    Nov 13, 2024 (32:53)


    In this episode, Cody Smith (CTO and Co-founder, Camus Energy) and Trisha Weir (SRE Department Lead, Google) join hosts Steve McGhee and Jordan Greenberg to discuss their experience developing Maglev, a highly available, distributed network load balancer (NLB) that manages traffic coming into a datacenter and is an integral part of the cloud architecture. Starting with Maglev's humble beginnings as a skunkworks effort, Cody and Trisha recount the challenges they faced and emphasize the importance of psychological safety, collaboration, and adaptability in SRE innovation.

    Profiling data with Pat Somaru and Narayan Desai

    Oct 30, 2024 (42:22)


    In this episode, guests Narayan Desai (Principal SRE, Google) and Pat Somaru (Senior Production Engineer, Meta) join hosts Steve McGhee and Florian Rathgeber to discuss the challenges of observability and working with profiling data. The discussion covers intriguing topics like noise reduction, workload modeling, and the need for better tools and techniques to handle high-cardinality data.

    Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

    Oct 23, 2024 (32:07)


    This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, and the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and offer advice for aspiring SREs.

    SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

    Oct 16, 2024 (33:40)


    Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.

    Incident Response with Sarah Butt and Vrai Stacey

    Oct 9, 2024 (43:53)


    Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.

    Building Reliable Systems with Silvia Botros and Niall Murphy

    Oct 2, 2024 (42:06)


    Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!

    Creating Systems that are Safe with Liz Fong-Jones

    Sep 25, 2024 (28:40)


    Liz Fong-Jones (former Google SRE and current Field CTO at honeycomb.io) joins hosts Steve McGhee and Jordan Greenberg for a lively discussion centered around observability, its evolution from monitoring, and its role in modern software development. Tune in for more on the importance of observability as a spectrum, the evolving role of SREs, and advice to aspiring software engineers.

    Production Problems Are For All! with Ben Treynor Sloss

    Sep 18, 2024 (31:21)


    Ben Treynor Sloss (VP of Engineering, Google) joins hosts Steve McGhee and Dr. Jennifer Petoff (Director of Technical Infrastructure Education, Google) to share the evolution of SRE and its impact on software development, how AI and ML significantly impact SRE practices, and the future of SRE. Ben coined the term "Site Reliability Engineering" for his team of (now) 4,000 software engineers, engaged in what were traditionally operations functions. Under Ben's leadership, Google SRE wrote two best-selling books on SRE. Since then, the rest of the SaaS industry has come to adopt the SRE name, mission, and practices.

    There Remains a Huge Amount of Work to Do, with Healfdene Goguen

    Sep 11, 2024 (26:14)


    In this episode, Healfdene Goguen (Principal Engineer, Google) joins hosts Steve McGhee and Jordan Greenberg to discuss the vast amount of work to be done by SREs, and the fascinating challenges to tackle with clear real-world implications. It's a truly exciting time to be an SRE at Google!

    SRE, a Basis of Influence, with Amy Tobey & Vladyslav Ukis

    Sep 4, 2024 (41:02)


    In this season of Google Prodcast, current and former SREs, both within and outside of Google, chat with hosts Steve McGhee and Jordan Greenberg to discuss software systems designed and built by SREs. For "episode zero", guests Amy Tobey (Live Services SRE, Netflix) and Dr. Vladyslav Ukis (Head of R&D, Siemens Healthineers, Author of "Establishing SRE Foundations") will set the stage for the season with a lively discussion about what Software Engineering means to Site Reliability Engineering.

    Life of An SRE: Life after Google SRE, with Carla Geisser, Cody Smith, and Laura Nolan

    Nov 7, 2023 (46:32)


    Former Google SREs, or "Xooglers", talk with hosts MP and Steve McGhee about site reliability engineering outside of Google. What's the difference in scale? What skills are generally valuable? And why can't you build “SRE in a box” that jump-starts pretty much any organization? Join Carla Geisser, Cody Smith, and Laura Nolan in their lively conversation about what SRE skills and knowledge they have found useful in roles outside of Google. 

    Life of An SRE with Sabrina Farmer

    Oct 31, 2023 (51:11)


    Sabrina Farmer, VP of Engineering at Google, talks about her career journey through Site Reliability Engineering. What does management mean? What's involved in being an effective manager? And what's a feasibility study? Hear some great advice on how to get what you expect out of a role, wherever on the ladder it is.

    Life of An SRE with Dave Reisner

    Oct 17, 2023 (29:44)


    Dave Reisner talks about his path to Staff SRE, from Arch Linux contributor through DevOps to software engineer. This episode emphasizes the value of strong mentoring and manager relationships, and the challenges of work-life balance.

    Life of an SRE with Stephen Benjamin

    Oct 10, 2023 (32:04)


    Explore the role and responsibilities of an SRE manager with Stephen Benjamin.

    Life of An SRE with Jessica Theodat

    Oct 3, 2023 (25:45)


    Explore the role and responsibilities of a Senior SRE with Jessica Theodat, as she discusses life-work balance, the value of mentoring, and being a Black woman in SRE.

    Life of An SRE with Shannon Brady and Theo Klein

    Sep 26, 2023 (44:01)


    Explore the careers of SREs Shannon Brady and Theo Klein as they discuss their paths to Site Reliability Engineering and finding their areas of expertise.

    Life of An SRE with Mariuxi Vasconez and Julian Alarcon

    Sep 19, 2023 (34:30)


    In this episode, Mariuxi and Julian discuss their paths to SRE: what drew them initially to SRE, and what motivates them to continue developing skills.

    Life of An SRE Episode 1: Tom Cranitch and Megan Yin

    Sep 12, 2023 (27:14)


    How does one become an SRE? And what's the career like? In this episode, Tom and Megan discuss their paths to SRE.

    Creating the SRE Prodcast with John Reese (JTR)

    Jun 7, 2022 (10:55)


    Host MP English and former Google SRE John Reese (JTR) chat about the creation of the Prodcast. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Postmortems with Ayelet Sachto

    May 31, 2022 (28:36)


    Ayelet Sachto offers advice on creating an actionable, transparent, and blameless postmortem culture. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Incident Management with Adrienne Walcer

    May 24, 2022 (39:57)


    Adrienne Walcer discusses how to approach and organize incident management efforts throughout the production lifecycle. Visit https://sre.google/prodcast for transcripts and links to further reading.

    On-Call Rotations with Andrew Widdowson (APW)

    May 17, 2022 (43:58)


    Andrew Widdowson (APW) shares strategies for successful on-call rotations. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Automation with Pierre Palatin

    May 10, 2022 (60:29)


    Pierre Palatin dives into different automation strategies, how to build confidence in your system, and why designing the UI may be your biggest challenge. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Client-Transparent Migrations with Pavan Adharapurapu

    May 3, 2022 (40:28)


    Pavan Adharapurapu details how to approach large-scale migrations while optimizing for user experience. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Rethinking SLOs with Narayan Desai

    Apr 26, 2022 (25:14)


    Narayan Desai explains why SLOs can be problematic and proposes alternative methods for monitoring complex, large-scale systems. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Alerting with Amelia Harrison

    Apr 19, 2022 (26:54)


    Amelia Harrison advises on when and how to alert, ideal coverage, and tuning. Visit https://sre.google/prodcast for transcripts and links to further reading.

    Customer-Centric Monitoring

    Apr 12, 2022 (31:05)


    Silvia Esparrachiari talks about the challenges of monitoring and the importance of understanding your users. Visit https://sre.google/prodcast for transcripts and links to further reading.

    SRE Philosophy

    Apr 5, 2022 (33:04)


    What is SRE, anyway? Jennifer Mace (Macey) gives us her definition of "site reliability engineer," discusses how to manage risk, and shares key questions to ask developers. Visit https://sre.google/prodcast for transcripts and links to further reading.
