Podcasts about Site Reliability Engineering

Discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems

154 podcasts
391 episodes
33m average duration
1 new episode monthly
Latest episode: Jun 4, 2026

Popularity by year, 2019 to 2026 (chart)

Best podcasts about Site Reliability Engineering

Software Delivery in Small Batches

120 episodes with Site Reliability Engineering

Google SRE Prodcast

30 episodes with Site Reliability Engineering

Screaming in the Cloud

10 episodes with Site Reliability Engineering

Coding Blocks

8 episodes with Site Reliability Engineering

S.R.E.path Podcast

12 episodes with Site Reliability Engineering

Google Cloud Platform Podcast

11 episodes with Site Reliability Engineering

PurePerformance

4 episodes with Site Reliability Engineering

Software Engineering Radio - The Podcast for Professional Software Developers

3 episodes with Site Reliability Engineering

Software Misadventures

5 episodes with Site Reliability Engineering

Latest podcast episodes about Site Reliability Engineering

The Missing GitHub Status Page with Marek Šuppa

Smart Software with SmartLogic

Jun 4, 2026 · 41:35

In this episode of Elixir Wizards, hosts Charles Suggs and Emma Whamond sit down with Marek Šuppa, creator of the Missing GitHub Status page, a project that reconstructs GitHub's historical uptime data and reveals discrepancies between official status reporting and the platform's actual reliability.

Marek tells us about his dev journey from open source contributor at DuckDuckGo to machine learning engineer at Cisco-acquired Slido. Then, we discuss GitHub's evolution from a hosted Git service into a critical developer tool. We cover reliability, transparency, AI-driven platform growth, developer workflows, and the challenges of balancing convenience with resilience. Along the way, we cover alternative platforms, self-hosted solutions, and whether recent outages are changing how developers think about ownership, dependency, and the future of software collaboration.

Topics Discussed in this Episode:
Why did Mr. Shu create the Missing GitHub Status Page?
GitHub's reported uptime versus developer experiences
How open source contributions shaped Marek's career
The evolution of GitHub from tool to critical infrastructure
Centralization risks in modern software development
Git's distributed roots and today's platform-centric workflows
Developer reactions to GitHub outages
Transparency and accountability in status reporting
AI's impact on developer platforms and infrastructure demands
Microsoft's stewardship of GitHub
Forgejo, Codeberg, and alternative Git hosting platforms
Self-hosted Git solutions and tradeoffs
Network effects and platform lock-in
The social side of software collaboration
Building resilience into developer workflows
What GitHub outages teach us about infrastructure dependency

Links Mentioned:
The Missing GitHub Status Page https://mrshu.github.io/github-statuses/
Slido https://www.slido.com/
https://duckduckgo.com/
The official GitHub Status Page https://www.githubstatus.com/
Statuspage.io https://www.atlassian.com/software/statuspage
Zig Leaves GitHub https://ziglang.org/news/migrating-from-github-to-codeberg/
Ghostty Leaves GitHub https://mitchellh.com/writing/ghostty-leaving-github
GitLab https://about.gitlab.com/
Codeberg https://codeberg.org/
https://git.kernel.org/
Forgejo Lightweight Self-Hosting https://forgejo.org/
Former GitHub CEO Thomas Dohmke launches Entire https://entire.io/news/former-github-ceo-thomas-dohmke-raises-60-million-seed-round
Update on Spain and LALIGA blocks of the internet https://vercel.com/blog/update-on-spain-and-laliga-blocks-of-the-internet

Cloud Fragility & Distributed Systems with Somtochi Onyekwere

Smart Software with SmartLogic

May 21, 2026 · 46:06

In Elixir Wizards S15E04, Charles Suggs and Emma Whamond are joined by Somtochi Onyekwere, a software engineer at Fly.io and contributor to the Corrosion distributed database project, to talk about distributed systems, infrastructure resilience, and the growing fragility of centralized cloud platforms.

We discuss what recent outages across major providers reveal about modern infrastructure and why more teams are starting to rethink assumptions around reliability, failover, and system design. Somtochi explains how Fly.io approaches geographic distribution, eventual consistency, and replication across nodes, along with the trade-offs that come with building systems this way. The conversation explores CRDTs (Conflict-free Replicated Data Types), consensus, split-brain prevention, and what actually happens when distributed systems fail in production.

We also talk about testing strategies, rollback planning, property-based testing tools, and how teams can reduce blast radius when things inevitably go wrong. Along the way, we discuss AI infrastructure, sandboxing AI agents, and how newer workloads may add pressure to already centralized systems. The episode closes with practical advice for developers who want to build more resilient applications without over-complicating their architecture.

Topics Discussed in this Episode:
Corrosion and distributed database replication
Centralized cloud fragility and recent outage patterns
Distributed systems versus traditional cloud architectures
Multi-region deployment strategies for Phoenix applications
CRDTs and conflict resolution in distributed systems
Eventual consistency versus strict consistency tradeoffs
Consensus, leader election, and split-brain prevention
Testing failover and recovery scenarios
Property-based testing and Antithesis
Rollback planning for database schema migrations
Reducing blast radius through system isolation
Health checks and blue-green deployment strategies
Fly Proxy request routing and replay behavior
Cross-region synchronization and replication challenges
Single points of failure inside “redundant” systems
Backup restoration testing and disaster recovery planning
Network partitions and failure handling in production
Infrastructure monitoring and operational visibility
AI infrastructure workloads and operational strain
Sandboxing and securing AI agents
Sprites and AI workflows at Fly.io
Latency improvements from geographic distribution
Distributed systems tradeoffs in real-world environments
Transitive dependency failures across cloud providers
Practical resilience strategies for modern engineering teams

Links Mentioned:
https://fly.io
https://github.com/superfly/corrosion
https://docs.gitops.weaveworks.org/
FluxCD https://fluxcd.io/
Fly.io Stateful Sandbox Environments https://sprites.dev/
Cloudflare Workers AI Inference Platform https://www.cloudflare.com/products/workers-ai/
“An AI Agent Just Destroyed Our Production Data. It Confessed in Writing” Twitter post from PocketOS founder: https://x.com/lifeof_jer/status/2048103471019434248
Oct 2025 AWS Outage https://www.theguardian.com/technology/2025/oct/24/amazon-reveals-cause-of-aws-outage
Dec 2025 Cloudflare Outage https://www.theguardian.com/technology/2025/dec/05/another-cloudflare-outage-takes-down-websites-linkedin-zoom
July 2025 Crowdstrike Outage https://www.ibm.com/think/news/recent-crowdstrike-outage-what-you-should-know
March 2026 Stryker Cyber Attack https://www.stryker.com/us/en/about/news/2026/a-message-to-our-customers-03-2026.html
https://aws.amazon.com/
https://cloud.google.com/
https://azure.microsoft.com/en-us
https://fly.io/docs/elixir/
CRDTs!! https://smartlogic.io/podcast/elixir-wizards/s13-e03-local-first-liveview-svelte-pwa/
https://antithesis.com/docs/resources/property_based_testing/
https://hex.pm/packages/proper

BONUS: Leadership Is Contextual With Daniel Harcek

Scrum Master Toolbox Podcast

Mar 8, 2026 · 41:44

In this CTO Series episode, Daniel Harcek shares how leading engineering teams across radically different scales, from a 7-person fintech startup to a 2,000-person cybersecurity company, taught him that leadership isn't one-size-fits-all. We explore how he builds AI-first organizations, drives agile transformations, and why he believes every person in a company should think like a tech person.

What Works at 10 People Breaks at 100

"Leadership is contextual, not absolute. What works with 10 people breaks at 50, at 100."

Daniel's career spans from building a 30-person team for a German startup out of Žilina, Slovakia, to leading 70 engineers at Avast's mobile division within a 2,000-person organization, and now running a 7-person team at WageNow. Each scale demanded a fundamentally different approach. At smaller scales, you strip away operational overhead and push ownership directly to the people. At larger scales, you need guardrails, dedicated roles, and structured processes that the smaller team would find suffocating. The lesson: don't carry your playbook from one context to another; rebuild it for the reality you're in.

End-to-End Ownership Replaces Specialized Roles

"Each engineer owns quality for the task he delivers. And he owns the fact that it comes to production."

At WageNow, Daniel runs without dedicated QA people, in a fintech company where quality can't be compromised. Instead, each developer owns quality end-to-end, from code to production. This isn't recklessness; it's intentional design. When teams are small, you set up the system so that it's safe to break things, then trust people with hard tasks. The result: people grow faster, move faster, and care more about what they ship. In larger organizations, you might need specialized DevOps, QA, and platform roles, but the principle of ownership stays the same.

The Buddy System and Scaling Without Losing Alignment

"The buddy system is one of the easiest things you can do. One buddy for a newcomer for the first 1, 3, 6 months; they often become friends."

When scaling fast, Daniel focuses on three things: strong on-boarding guides, well-maintained documentation (now much easier with AI), and a buddy system that pairs every newcomer with a dedicated colleague. The buddy system works because it scales the human side of on-boarding: a tech lead or manager can do one-on-ones, but that's formal, and new people might be scared to speak up. The buddy creates a safe channel for questions, concerns, and cultural integration. Beyond people, scaling also means investing in automation and observability: as you grow with customers, you grow with failures too, and incident reporting should not burn out the team.

Building an AI-First Organization

"Every person uses AI. Every person has the capability to use AI. The company builds a second brain so AI can build on top of that."

At WageNow, Daniel has implemented what he calls an AI-first organization, inspired by Spotify and other companies pioneering this approach. The concept is simple: before doing any task, ask whether AI can help you deliver the output faster or better. This applies across the entire company, not just engineering. Daniel looks for people in HR, accounting, and UX who understand automation tools like n8n or Make.com alongside AI.

The key ingredients:
Curate the data: Build a company "second brain" with clean, structured context for AI tools to work with
Train the muscle: AI ability is like a muscle; people must use it daily because these skills didn't exist 2-3 years ago
Share what works: Exponential AI adoption happened at WageNow once people started sharing their successes and failures with AI tools
Respect the guardrails: Data privacy and regulation compliance remain non-negotiable

The hidden productivity gains, Daniel argues, lie not in engineering (which gets all the attention) but in operations, accounting, HR, and every other area of the business.

Selling Transformation: Financial Arguments for Leaders, Ownership for Teams

"For the leaders, it's the financial thing and the cultural thing. For the people doing the work, it's personal development: having more control, having more ownership."

At Ringier Axel Springer, Daniel proposed and led a company-wide agile transformation, a 1-2 year effort that required convincing the CEO, product teams, marketing, and sales to change how they operate. His approach: build a dual argument. For leadership, frame the change in financial and cultural terms: more revenue with the same people, better visibility into how work translates to business outcomes. For the people doing the work, emphasize personal growth, increased ownership, and transparency. The transformation breaks silos between engineering and product, creating a shared backlog agreed with all stakeholders. Daniel looks for people with high agency: those who can reinvent and change themselves from the inside, not just wait for a change agent from the outside.

Balancing Experimentation with Operational Excellence

"The SRE books helped me understand quality as a feature, because quality is basically how reliable you are for your customers."

When asked about the books that most influenced his approach as a CTO, Daniel points to the Site Reliability Engineering series from Google: three books that frame quality as reliability, a feature your customers experience directly. Alongside those, he recommends The Lean Startup by Eric Ries, because he believes all tech people should have a sense of business and customer understanding. Together, these books guide how to balance rapid experimentation with operational excellence as the organization scales.

About Daniel Harcek

Daniel is a technology executive with a proven record of scaling engineering organizations across fintech, cybersecurity, and digital media. He builds AI-first teams, operating models, and delivery cultures aligned with product strategy. He has led platforms serving 30M MAU, deployed fintech capital pilots, transformed agile delivery at internet scale, and actively mentors tech communities and ecosystems worldwide. You can connect with Daniel Harcek on LinkedIn.

Latest podcast episodes about Site Reliability Engineering

The One With Damion Yates and Building AI systems

The One With Denia Del Cid and AI

You (and AI) can't automate reliability away

Reliability Engineering Mindset • Alex Ewerlöf & Charity Majors

The Zen of Programming: New Manuscripts

Azure SRE Agents with Deepthi Chelupati

90% of AI Demos Fail. Here's How to Build One that Won't

Intentional Journeys: How Jennifer Petoff Reinvented Success in Tech

The One with Ben Good and Our Kubernetes Friends

The One With AI Agents, Ramón Llamas, and Swapnil Haria

The One with Technical Program Managers and Karanveer Anand

The One with STPA, Jeffrey Snover, and Theo Klein

The One with Startups and Adam Fletcher

The One with SLOs and Sal Furino

The One With the Future of SRE and Matt Zelesko

The One with AI and Todd Underwood

The One With Data Centers and Peter Pellerzi

#722: The Frugal Architect w/Werner Vogels: How Warner Bros. Discovery keeps streaming seamless

The One With Security and Jessica Theodat

Autonomous IT, Live! – Spring Into Automation: Clean Up Tech Debt & Refresh Your IT Operations, E03

We're back with Season 4!

Crusoe Opens European Headquarters in Dublin

Special Episode: You Missed a Page from Telebot

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

Small Batches returns in 2025

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

Maglev: load balancing at Google with Cody Smith and Trisha Weir

The Zen of Programming: Part Six

Profiling data with Pat Somaru and Narayan Desai

The Zen of Programming: Part Five

Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

The Zen of Programming: Part Four

Incident Response with Sarah Butt and Vrai Stacey

Building Reliable Systems with Silvia Botros and Niall Murphy

The Zen of Programming: Part Three

Creating Systems that are Safe with Liz Fong-Jones

The Zen of Programming. Part Two.

Postgres Emergency Room

Reading the Zen of Programming. Part One.

What it Means to be an SRE-Driven Organization - Six Five in the Booth

Incidents & Operations with Dan Slimmon

Strategic Thinking with Alex Nesbitt

Playing to Win

On the Gemba

Site Reliability Engineering with Dan Salinas & Sarv Shah of Nobl9 – IT in the D 484