S.R.E.path Podcast

Share on

Most advice on Site Reliability Engineering (SRE) is by BigTech for BigTech. SREpath's more realistic approach helps you make SRE work in a "normal" organization. Join your hosts, Ash Patel and Sebastian Vietz, as we demystify SRE jargon, interview experts, and share practical insights. Our mission is to help you boost your SRE efforts to succeed in areas like observability, incident response, release engineering, and more. We're reliability-focused professionals from companies where software is critical, but is not the product itself. srepath.substack.com

Ash P

Dec 2, 2025 LATEST EPISODE
infrequent NEW EPISODES
26m AVG DURATION
70 EPISODES

Search for episodes from S.R.E.path Podcast with a specific topic:

Latest episodes from S.R.E.path Podcast

You (and AI) can't automate reliability away

Play Episode Listen Later Dec 2, 2025 28:20

What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the human workflows wrapped around the engineering work.Everyone seems to think AI will automate reliability away. I keep hearing the same story: “Our tooling will catch it.” “Copilots will reduce operational load.” “Automation will mitigate incidents before they happen.”But here's a hard truth to swallow: AI only automates the mechanical parts of reliability — the machine in the machine.The hard parts haven't changed at all.You still need teams with clarity on system boundaries.You still need consistent approaches to resolution.You still need postmortems that drive learning rather than blame.AI doesn't fix any of that. If anything, it exposes every organizational gap we've been ignoring. And that's exactly why I wanted today's guest on.Jennifer Petoff is Director of Program Management for Google Cloud Platform and Technical Infrastructure education. Every day, she works with SREs at Google, as well as with SREs at other companies through her public speaking and Google Cloud Customer engagements.Even if you have never touched GCP, you have still been influenced by her work at some point in your SRE career. She is co-editor of Google's original Site Reliability Engineering book from 2016. Yeah, that one!It was my immense pleasure to have her join me to discuss the internal dynamics behind successful reliability initiatives. Here are 5 highlights from our talk:3 issues stifling individual SREs' workTo start, I wanted to know from Jennifer the kinds of challenges she has seen individual SREs face when attempting to introduce or reinforce reliability improvements within their teams or the broader organization.She categorized these challenges into 3 main categories* Cultural issues (with a look into Westrum's typology of organizational culture)* Insufficient buy-in from stakeholders* Inability to communicate the value of reliability workOrganizations with generative cultures have 30% better organizational performance.A key highlight from this topic came from her look at DORA research, an annual survey of thousands of tech professionals and the research upon which the book Accelerate is based.It showed that organizations with generative cultures have 30% better organizational performance. In other words, you can have the best technology, tools, and processes to get good results, but culture further raises the bar. A generative culture also makes it easier to implement the more technical aspects of DevOps or SRE that are associated with improved organizational performance.Hands-on is the best kind of trainingWe then explored structured approaches that ensure consistency, build capability, and deliberately shape reliability culture. As they say – Culture eats strategy for breakfast!One key example Jennifer gave was the hands-on approach they take at Google. She believes that adults learn by doing. In other words, SREs gain confidence by doing hands-on work. Where possible, training programs should move away from passive listening to lectures toward hands-on exercises that mimic real SRE work, especially troubleshooting.One specific exercise that Google has built internally is Simulating Production Breakages. Engineers undergoing that training have a chance to troubleshoot a real system built for this purpose in a safe environment. The results have been profound, with a tremendous amount of confidence that Jennifer's team saw in survey results. This confidence is focused on job-related behaviors, which when repeated over time reinforce that culture of reliability.Reliability is mandatory for everybodyAnother thing Jennifer told me Google did differently was making reliability a mandatory part of every engineer's curriculum, not only SREs.When we first spun up the SRE Education team, our focus was squarely on our SREs. However, that's like preaching to the choir. SREs are usually bought into reliability. A few years in, our leadership was interested in propagating the reliability-focused culture of SRE to all of Google's development teams, a challenge an order of magnitude greater than training SREs. How did they achieve this mandate?* They developed a short and engaging (and mandatory) production safety training* That training has now been taken by tens of thousands of Googlers* Jennifer attributes this initiative's success to how they“SRE'ed the program”. “We ran a canary followed by a progressive roll-out. We instituted monitoring and set up feedback loops so that we could learn and drive continuous improvement.”The result of this massive effort? A very respectable 80%+ net promoter score with open text feedback: “best required training ever.”What made this program successful is that Jennifer and her team SRE'd its design and iterative improvement. You can learn more about “How to SRE anything” (from work to life) using her rubric: https://www.reliablepgm.com/how-to-sre-anything/Reliability gets rewarded just like feature workJennifer then talked about how Google mitigates a risk that I think every reliability engineer wishes could be solved at their organization. That is, having great reliability work rewarded at the same level as great feature work.For development and operations teams alike at Google, this means making sure “grungy work” like tech debt reduction, automation, and other activities that improve reliability are rewarded equally to shiny new product features. Organizational reward programs that recognize outstanding work typically have committees. These committees not only look for excellent feature development work, but also reward and celebrate foundational activities that improve reliability. This is explicitly built into the rubric for judging award submissions.Keep a scorecard of reliability performanceJennifer gave another example of how Google judges reliability performance, but more specifically for SRE teams this time. Google's Production Excellence (ProdEx) program was created in 2015 to assess and improve production excellence (aka reliability improvements) across SRE teams.ProdEx acts like a central scorecard to aggregate metrics from various production health domains to provide a comprehensive overview of an SRE team's health and the reliability of the services they manage. Here are some specifics from the program:* Domains include SLOs, on-call workload, alerting quality, and postmortem discipline* Reviews are conducted live every few quarters by senior SREs (directors or principal engineers) who are not part of the team's direct leadership* There is a focus on coaching and accountability without shame (to elicit psychological safety)ProdEx serves various levels of the SRE organization through:* providing strategic situational awareness regarding organizational and system health to leadership and* keeping forward momentum around reliability and surfacing team-level issues early to support engineers in addressing themWrapping upHaving an inside view of reliability mechanisms within a few large organizations, I know that few are actively doing all — or sometimes any — of the reliability enhancers that Google uses and Jennifer has graciously shared with us. It's time to get the ball rolling. What will you do today to make it happen? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

director ai culture google hands cultural engineers automation accelerate automate organizational devops reliability inability domains sre insufficient gcp program management google cloud platform sres site reliability engineering slos westrum workto technical infrastructure

#67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran

Play Episode Listen Later Jul 15, 2025 30:47

A new or growing SRE team. A copy of the book. A company that says it cares about reliability. What happens next? Usually… not much.In this episode, I sit down with Dave O'Connor, a 16-year Google SRE veteran, to talk about what happens when organizations cargo-cult reliability practices without understanding the context they were born in.You might know him for his self-deprecating wit and legendary USENIX blurb about being “complicit in the development of the SRE function.”This one's a treat — less “here's a shiny new tool” and more “here's what reliability actually looks like when you've seen it all.”✨ No vendor plugs from Dave at all, just a good old-fashioned chat about what works and what doesn't.Here's what we dive into:* The adoption trap: Why SRE efforts often fail before they begin—especially when new hires care more about reliability than the org ever intended.* The SRE book dilemma: Dave's take on why following the SRE book chapter-by-chapter is a trap for most companies (and what to do instead).* The cost of “caring too much”: How engineers burn out trying to force reliability into places it was never funded to live.* You build it, you run it (but should you?): Not everyone's cut out for incident command—and why pretending otherwise sets teams up to fail.* Buying vs. building: The real reason even conservative enterprises are turning into software shops — and the reliability nightmare that follows.We also discuss the evolving role of reliability in organizations today, from being mistaken for “just ops” to becoming a strategic investment (when done right).Dave's seen the waves come and go in SRE — and he's still optimistic. That alone is worth a listen. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

google lessons veterans fails sre orgs

#66 - Unpacking 2025 SRE Report's Damning Findings

Play Episode Listen Later Jul 1, 2025 30:16

I know it's already six months into 2025, but we recorded this almost three months ago. I've been busy with my foray into the world of tech consulting and training —and, well, editing these podcast episodes takes time and care.This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings:* 53% of orgs still define reliability as uptime only, ignoring degraded experience and hidden toil* Manual effort is creeping back in, reversing five years of automation gains* 41% of engineers feel pressure to ship fast, even when it undermines long-term stabilityTo unpack what this actually means inside organizations, I sat down with Sebastian Vietz, Director of Reliability Engineering at Compass Digital and co-host of the Reliability Enablers podcast. Sebastian doesn't just talk about technical fixes — he focuses on the organizational frictions that stall change, burn out engineers, and leave “reliability” as a slide deck instead of a lived practice.We dig into:* How SREs get stuck as messengers of inconvenient truths* What it really takes to move from advocacy to adoption — without turning your whole org into a cost center* Why tech is more like milk than wine (Sebastian explains)* And how SREs can strengthen—not compete with—security, risk, and compliance teamsThis one's for anyone tired of reliability theatrics. No kumbaya around K8s here. Just an exploration of the messy, human work behind making systems and teams more resilient. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

director unpacking manual findings damning k8s sres reliability engineering

#65 - In Critical Systems, 99.9% Isn't Reliable — It's a Liability

Play Episode Listen Later Jun 17, 2025 28:28

Most teams talk about reliability with a margin for error. “What's our SLO? What's our budget for failure?” But in the energy sector? There is no acceptable downtime. Not even a little.In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who's spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.What makes this episode different is that Wade isn't a reliability engineer by title, but it's baked into everything his team touches. And that matters more than ever as software creeps deeper into operational technology (OT), and the cloud tries to stake its claim in critical systems.We cover:* Why 100% uptime is the minimum bar, not a stretch goal* How the rise of renewables has increased system complexity — and what that means for monitoring* Why bespoke integration and SCADA spaghetti are still normal (and here to stay)* The reality of cloud risk in critical infrastructure (“the cloud is just someone else's computer”)* What software engineers need to understand if they want their products used in serious environmentsThis isn't about observability dashboards or DevOps rituals. This is reliability when the lights go out and people risk getting hurt if you get it wrong.And it's a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. No matter what. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

director australia ot reliable liability devops slo scada

#64 - Using AI to Reduce Observability Costs

Play Episode Listen Later Jan 28, 2025 20:42

Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.It's been a hot minute since the last episode of the Reliability Enablers podcast.Sebastian and I have been working on a few things in our realms. On a personal and work front, I've been to over 25 cities in the last 3 months and need a breather.Meanwhile, listen to this interesting vendor, Ruchir Jha from Cardinal, working on the cutting edge of o11y to help reduce costs from spiraling out of control. (To the skeptics, he did not pay me for this episode)Here's an AI-generated summary of what you can expect in our conversation:In this conversation, we explore cutting-edge approaches to FinOps i.e. cost optimization for observability. You'll hear about three pressing topics:* Managing Tool Sprawl: Insights into the common challenge of juggling 5-15 tools and how to identify which ones deliver real value.* Reducing Observability Costs: Techniques to track and trim waste, including how to uncover cost hotspots like overused or redundant metrics.* AI for Observability Decisions: Practical ways AI can simplify complex data, empowering non-technical stakeholders to make informed decisions.We also touch on the balance between open-source solutions like OpenTelemetry and commercial observability tools. Learn how these strategies, informed by Ruchir's experience at Netflix, can help streamline observability operations and cut costs without sacrificing reliability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

netflix ai exploring reduce costs using ai cardinal observability finops ruchir

#63 - Does "Big Observability" Neglect Mobile?

Play Episode Listen Later Nov 12, 2024 29:11

Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he's vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short. * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.* Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.* Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.* Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.* Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.* Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.* Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.* Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager's observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.* OpenTelemetry's Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don't align with traditional time-based observability.* SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences—a critical factor in retaining app users.* Amazon's Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon's practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.* Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

#62 - Early Youtube SRE shares Modern Reliability Strategy

Play Episode Listen Later Nov 5, 2024 35:33

Andrew Fong's take on engineering cuts through the usual role labels, urging teams to start with the problem they're solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It's a values-first, practical approach to tackling tough challenges that engineers face every day.Here's a slightly deeper dive into the concepts we discussed:* Career and Evolution in Tech: Andrew shares his journey through various roles, from early SRE at Youtube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive experience in infrastructure through three distinct eras of the internet. He emphasized the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it.* Building Prodvana and the Future of SRE: As CEO of startup, Prodvana, they're focused on an "intelligent delivery system" designed to simplify production management for engineers, addressing cognitive overload. They highlight SRE as a field facing new demands due to AI, discussing insights shared with Niall Murphy and Corey Bertram around AI's potential in the space, distinguishing it from "web three" hype, and affirming that while AI will transform SRE, it will not eliminate it.* Challenges of Migration and Integration: Reflecting on experiences at YouTube post-acquisition by Google, the speaker discusses the challenges of migrating YouTube's infrastructure onto Google's proprietary, non-thread-safe systems. This required extensive adaptation and “glue code,” offering insights into the intricacies and sometimes rigid culture of Google's engineering approach at that time.* SRE's Shift Toward Reliability as a Core Feature: The speaker describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. They emphasize that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices.* Organizational Culture and Leadership Influence: Leadership's role in SRE success is highlighted as crucial, with examples from Dropbox and Google emphasizing that strong, supportive leadership can shape positive, reliability-centered cultures. The speaker advises engineers to gauge leadership attitudes towards SRE during job interviews to find environments where reliability is valued over mere incident response.* Outcome-Focused Work Over Titles: Emphasis on assembling the right team based on skills, not titles, to solve technical problems effectively. Titles often distract from focusing on outcomes, and fostering a problem-solving culture over role-based thinking accelerates teamwork and results.* Engineers as Problem Solvers: Engineers, especially natural ones, generally resist job boundaries and focus on solving problems rather than sticking rigidly to job descriptions. This echoes how iconic engineers like Steve Jobs valued versatility over predefined roles.* Culture as Core Values: Organizational culture should be driven by core values like reliability, efficiency, and inclusivity rather than rigid processes or roles. For instance, Dropbox's infrastructure culture emphasized being a “force multiplier” to sustain product velocity, an approach that ensured values were integrated into every decision.* Balancing SRE and Platform Priorities: The fundamental difference between SRE (Site Reliability Engineering) and platform engineering is their focus: SRE prioritizes reliability, while platform engineering is geared toward increasing velocity or reducing costs. Leaders must be cautious when assigning both roles simultaneously, as each requires a distinct focus and expertise.* Strategic Trade-Offs in Smaller Orgs: In smaller companies with limited resources, leaders often face challenges balancing cost, reliability, and other objectives within single roles. It's advised to sequence these priorities rather than burden one individual with conflicting objectives. Prioritizing platform stability, for example, can help improve reliability in the long term.* DevOps as a Philosophy: DevOps is viewed here as an operational philosophy rather than a separate role. The approach enhances both reliability and platform functions by fostering a collaborative, efficient work culture.* Focus Investments for Long-Term Gains: Strategic technology investments, even if they might temporarily hinder short-term metrics (like reliability), can drive long-term efficiency and reliability improvements. For instance, Dropbox invested in a shared metadata system to enable active-active disaster recovery, viewing this as essential for future reliability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

#61 Scott Moore on SRE, Performance Engineering, and More

Play Episode Listen Later Oct 22, 2024 38:13

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

performance engineering scott moore

#60 How to NOT fail in Platform Engineering

Play Episode Listen Later Oct 1, 2024 30:34

Here's what we covered:Defining Platform Engineering* Platform engineering: Building compelling internal products to help teams reuse capabilities with less coordination.* Cloud computing connection: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas.Ankit's career journey* Didn't choose platform engineering; it found him.* Early start in programming (since age 11).* Transitioned from a product engineer mindset to building internal tools and platforms.* Key experience across startups, the public sector, unicorn companies, and private cloud projects.Singapore Public Sector Experience* Public sector: Highly advanced digital services (e.g., identity services for tax, housing).* Exciting environment: Software development in Singapore's public sector is fast-paced and digitally progressive.Platform Engineering Turf Wars* Turf wars: Debate among DevOps, SRE, and platform engineering.* DevOps: Collaboration between dev and ops to think systemically.* SRE: Operations done the software engineering way.* Platform engineering: Delivering operational services as internal, self-service products.Dysfunctional Team Interactions* Issue: Requiring tickets to get work done creates bottlenecks.* Ideal state: Teams should be able to work autonomously without raising tickets.* Spectrum of dysfunction: From one ticket for one service to multiple tickets across teams leading to delays and misconfigurations.Quadrant Model (Autonomy vs. Cognitive Load)* Challenge: Balancing user autonomy with managing cognitive load.* Goal: Enable product teams with autonomy while managing cognitive load.* Solution: Platforms should abstract unnecessary complexity while still giving teams the autonomy to operate independently.How it pans out* Low autonomy, low cognitive load: Dependent on platform teams but a simple process.* Low autonomy, high cognitive load: Requires interacting with multiple teams and understanding technical details (worst case).* High autonomy, high cognitive load: Teams have full access (e.g., AWS accounts) but face infrastructure burden and fragmentation.* High autonomy, low cognitive load: Ideal situation—teams get what they need quickly without detailed knowledge.Shift from Product Thinking to Cognitive Load* Cognitive load focus: More important than just product thinking—consider the human experience when using the system.* Team Topologies: Mentioned as a key reference on this concept of cognitive load management.Platform as a Product Mindset* Collaboration: Building the platform in close collaboration with initial users (pilot teams) is crucial for success.* Product Management: Essential to have a product manager or team dedicated to communication, user journeys, and internal marketing.Self-Service as a Platform Requirement* Definition: Users should easily discover, understand, and use platform capabilities without human intervention.* User Testing: Watch how users interact with the platform to understand stumbling points and improve the self-service experience.Platform Team Cognitive Load* Burnout Prevention: Platform engineers need low cognitive load as well. Moving from a reactive (ticket-based) model to a proactive, self-service approach can reduce the strain.* Proactive Approach: Self-service models allow platform teams to prioritize development and avoid being overwhelmed by constant requests. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

#59 Who handles monitoring in your team and how?

Play Episode Listen Later Sep 24, 2024 8:17

Why many copy Google's monitoring team setup* Google's Influence. Google played a key role in defining the concept of software reliability.* Success in Reliability. Few can dispute Google's ability to ensure high levels of reliability and its ability to share useful ways to improve it in other settingsBUT there's a problem:* It's not always replicable. While Google's practices are admired, they may not be a perfect fit for every team.What is Google's monitoring approach within teams?Here's the thing that Google does:* Google assigns one or two people per team to manage monitoring.* Even with centralized infrastructure, a dedicated person handles monitoring.* Many organizations use a separate observability team, unlike Google's integrated approachIf your org is large enough and prioritizes reliability highly enough, you might find it feasible to follow Google's model to the tee. Otherwise, a centralized team with occasional “embedded x engineer” secondments might be more effective.Can your team mimic Google's model?Here are a few things you should factor in:Size mattersGoogle's model works because of its scale and technical complexity. Many organizations don't have the size, resources, or technology to replicate this.What are the options for your team?Dedicated monitoring team (very popular but $$$)If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget so it's not something that a startup or SME can easily justify. Dedicate SREs to monitoring work (effective but difficult to manage)You might do this on rotation or make an SRE permanently “responsible for all monitoring matters”. Putting SREs on permanent tasks might lead to burnout as it might not suit their goals, and rotation work requires effective planning.Internal monitoring experts (useful but hard capability)One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team's needs. This should be how we get monitoring work done, but it's hard to get volunteers across a majority of teams. Transitioning monitoring from project work to maintenance2 distinct phasesInitial Setup (the “project”) SREs may help set up the monitoring/observability infrastructure. Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.Post-project phase (“keep the lights on”)Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?Who will maintain the monitoring system?Answer: usually not the same teamAfter the project phase, a new set of people—often different from the original team—typically handles maintenance.Options to consider (once again)* Spin up a monitoring/observability team. Create a dedicated team for observability infrastructure.* Take a decentralized approach. Engineers across various teams take on observability roles as part of their regular duties.* Internal monitoring/observability experts. They can take responsibility for monitoring and ensure best practices are followed.The key thing to remember here is…Adapt to Your Organizational ContextOne size doesn't fit allGoogle's model may not work for everyone. Tailor your approach based on your organization's specific needs.The core principle to keep in mindAs long as people understand why monitoring/observability matters and pay attention to it, you're on the right track.Work according to engineer awarenessIf engineers within product and other non-operations teams are aware of monitoring: You can attempt to **decentralize the effort** and involve more team members.If awareness or interest is low: consider **dedicated observability roles** or an SRE team to ensure monitoring gets the attention it needs.In conclusionThere's no universal solution. Whether you centralize or decentralize monitoring depends on your team's structure, size, and expertise. The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting i.e. “executive short-hand”. Tell me what you think. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

success google work influence options ps transitioning engineers internal dedicated adapt monitoring sme reliability tailor handles sre sres

#58 Fixing Monitoring's Bad Signal-to-Noise Ratio

Play Episode Listen Later Sep 17, 2024 8:27

Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It's a challenge that's been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep.Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise. When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.Sebastian proposes a fundamental fix for this data overload: be deliberate with the data you emit. When instrumenting your systems, be intentional about what data you collect and transport. Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.To combat this, focus on:* Being Deliberate with Data. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.* Filtering Data Effectively. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.* Refining Alerts. Optimize alert rules such as creating tiered alerts to distinguish between critical issues and minor warnings.Dan Ravenstone, who leads platform at Top Hat, discussed “triaging alerts” recently. He shared that managing millions of alerts, often filled with noise, is a significant issue. His advice: scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don't impact the user journey.According to Dan, the anatomy of a good alert includes:* A run book* A defined priority level* A corresponding dashboard* Consistent labels and tags* Clear escalation paths and ownershipTo elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.The learning point is simple: aim for quality over quantity. By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

data clear fixing consistent signal optimize monitoring top hat overloading noise ratio

#57 How Technical Leads Support Software Reliability

Play Episode Listen Later Sep 10, 2024 31:34

The question then condenses down to: Can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years — even spending a few years doing that at the coveted consultancy, Thoughtworks — and now coaches others.She and I discussed the link between this role and software reliability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

leads software technical reliability thoughtworks

#56 Resolving DORA Metrics Mistakes

Play Episode Listen Later Sep 4, 2024 26:47

We're already well into 2024 and it's sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas.Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations.Nathen Harvey is no stranger to this problem.He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018. His focus has been on questions like:How do we help teams get better at delivering and operating software? You and I can agree that this is an important question to ask. I'd listen to what he has to say about DORA because he's got a wealth of experience behind him, having also run community engineering at Chef Software.Before we continue, let's explore What is DORA? in Nathen's (paraphrased) words:DORA is a software research program that's been running since 2015.This research program looks to figure out:How do teams get good at delivering, operating, building, and running software? The researchers were able to draw out the concept of the metrics based on correlating teams that have good technology practices with highly robust software delivery outcomes.They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction.Essentially, all those things that matter to the business.One of the challenges the researchers found over the last decade was working out: how do you measure something like software delivery? It's not the same as a factory system where you can go and count the widgets that we're delivering necessarily.The unfortunate problem is that the factory mindset I think still leaks in. I've personally noted some silly metrics over the years like lines of code.Imagine being asked constantly: “How many lines of code did you write this week?”You might not have to imagine. It might be a reality for you. DORA's researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. They settled on and validated 4 key measures for software delivery performance.Nathen elaborated that 2 of these measures look at throughput:[Those] two [that] look at throughput really ask two questions:* How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, a change to go from the developer's workstation. right through to production?And then the second question on throughput is:* How frequently are you updating production?In plain English, these 2 metrics are:* Deployment Frequency. How often code is deployed to production? This metric reflects the team's ability to deliver new features or updates quickly.* Lead Time for Changes: Measures the time it takes from code being committed to being deployed to production. Nathen recounted his experience of working at organizations that differed in how often they update production from once every six months to multiple times a day. They're both very different types of organizations, so their perspective on throughput metrics will be wildly different. This has some implications for the speed of software delivery.Of course, everyone wants to move faster, but there's this other thing that comes in and that's stability.And so, the other two stability-oriented metrics look at:What happens when you do update production and... something's gone horribly wrong. “Yeah, we need to roll that back quickly or push a hot fix.” In plain English, they are:* Change Failure Rate: Measures the percentage of deployments that cause a failure in production (e.g., outages, bugs). * Failed Deployment Recovery Time: Measures how long it takes to recover from a failure in production. You might be thinking the same thing as me. These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics.But keep in mind, it's about balancing all 4 metrics. Nathen believes it's fair to say today that across many organizations, they look at these concepts of throughput and stability as tradeoffs of one another. We can either be fast or we can be stable. But the interesting thing that the DORA researchers have learned from their decade of collecting data is that throughput and stability aren't trade-offs of one another.They tend to move together. They've seen organizations of every shape and size, in every industry, doing well across all four of those metrics. They are the best performers. The interesting thing is that the size of your organization doesn't matter the industry that you're in.Whether you're working in a highly regulated or unregulated industry, it doesn't matter.The key insight that Nathen thinks we should be searching for is: how do you get there? To him, it's about shipping smaller changes. When you ship small changes, they're easier to move through your pipeline. They're easier to reason about. And when something goes wrong, they're easier to recover from and restore service.But along with those small changes, we need to think about those feedback cycles.Every line of code that we write is in reality a little bit of an experiment. We think it's going to do what we expect and it's going to help our users in some way, but we need to get feedback on that as quickly as possible.Underlying all of this, both small changes and getting fast feedback, is a real climate for learning. Nathen drew up a few thinking points from this:So what is the learning culture like within our organization? Is there a climate for learning? And are we using things like failures as opportunities to learn, so that we can ever be improving? I don't know if you're thinking the same as me already, but we're already learning that DORA is a lot more than just metrics. To Nathen (and me), the metrics should be one of the least interesting parts of DORA because it digs into useful capabilities, like small changes and fast feedback. That's what truly helps determine how well you're going to do against those performance metrics.Not saying “We are a low to medium performer. Now go and improve the metrics!”I think the issue is that a lot of organizations emphasize the metrics because it's something that can sit on an executive dashboard But the true reason we have metrics is to help drive conversations.Through those conversations, we drive improvement.That's important because currently an unfortunately noticeable amount of organizations are doing this according to Nathen:I've seen organizations [where it's like]: “Oh, we're going to do DORA. Here's my dashboard. Okay, we're done. We've done DORA. I can look at these metrics on a dashboard.” That doesn't change anything. We have to go the step further and put those metrics into action.We should be treating the metrics as a kind of compass on a map. You can use those metrics to help orient yourself and understand, “Where are we heading?”.But then you have to choose how are you going to make progress toward whatever your goal is.The capabilities enabled by the DORA framework should help answer questions like:* Where are our bottlenecks?* Where are our constraints?* Do we need to do some improvement work as a team?We also talked about the SPACE framework, which is a follow-on tool from DORA metrics. It is a framework for understanding developer productivity. It encourages teams or organizations to look at five dimensions when trying to measure something from a productivity perspective.It stands for:* S — satisfaction and well-being* P — performance* A — activity* C — communication and collaboration* E — efficiency and flowWhat the SPACE framework recommends is that youFirst, pick metrics from two to three of those five categories. (You don't need a metric from every one of those five but find something that works well for your team.)Then write down those metrics and start measuring them. Here's the interesting thing: DORA is an implementation of SPACE. You can correlate each metric with the SPACE acronym!* Lead time for changes is a measure of Efficiency and flow* Deployment frequency is an Activity* Change fail rate is about Performance.* Failed deployment recovery time is about Efficiency and flowKeep in mind that SPACE itself has no metrics. It is a framework for identifying metrics.Nathen reiterated that you can't use the space metrics because there is no such thing. I mentioned earlier how DORA is a means of identifying the capabilities that can improve the metrics.These can be technical practices like using continuous integration.But they can also be capabilities like collaboration and communication. As an example, you might look at what your change approval process looks like. You might look at how collaboration and communication have failed when you've had to send changes off to an external approval board like a CAB (change approval board).DORA's research backs the above up:What our research has shown through collecting data over the years, is that while they do exist on the whole, an external change approval body will slow you down.That's no surprise. So your change lead time is going to increase, your deployment frequency will decrease. But, at best, they have zero impact on your change fail rate. In most cases, they have a negative impact on your change fail rate. So you're failing more often.It goes back to the idea of smaller changes, faster feedback, and being able to validate that. Building in audit controls and so forth.This is something that reliability-focused engineers should be able to help with because one of the things Sebastian and I talk about a lot is embracing and managing risk effectively and not trying to mitigate it through stifling measures like CABs. In short, DORA and software reliability are not mutually exclusive concepts.They're certainly in the same universe.Nathen went as far as to say that some SRE practices necessarily get a little bit deeper than sort of the capability level that DORA has and provide even more sort of specific guidance on how to do things.He clarified a doubt I had because a lot of people have argued with me (mainly at conferences) that DORA is this thing that developers do, earlier in the SDLC.And then SRE is completely different because it focuses on the production side. The worst possible situation could be turning to developers and saying, “These 2 throughput metrics, they're yours. Make sure they go up no matter what,” and then turn to our SREs and say “Those stability metrics, they're yours. Make sure they stay good” All that does is put these false incentives in place and we're just fighting against each other.We talked a little more about the future of DORA in our podcast episode (player/link right at the top of this post) if you want to hear about that.Here are some useful links from Nathen for further research:DORA online community of practiceDORA homepage[Article] The SPACE of Developer ProductivityNathen Harvey's Linktree This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

english google space building performance mistakes software failed efficiency resolving deployment underlying cab sre cabs sdlc sres nathen you first lead time flow what nathen harvey dora metrics

#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards

Play Episode Listen Later Aug 27, 2024 11:02

We'll explore 3 use cases for monitoring data. They are:* Analyzing long-term trends* Comparing over time or experiment groups* Conducting ad hoc retrospective analysis Analyzing long-term trends You can ask yourself a couple of simple questions as a starting point:* How big is my database?* How fast is the database growing? * How quickly is my user count growing?As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like:* How is the database performance evolving? Are there signs of degradation?* Is there consistent growth in data volume that may require future infrastructure adjustments?* How is overall resource utilization trending over time across different services?* How is the cost of cloud resources evolving, and what does that mean for budget forecasting?* Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them?Sebastian mentioned that it's a part of observability he enjoys doing. I can understand why. It's exciting to see how components are changing over a period and working out solutions before you end up in an incident response nightmare.Getting to effectively analyze the trends requires the right level of data retention settings. Because if you're throwing out your logs, traces, and metrics too early, you will not have enough historical data to do this kind of work.Doing this right means having the right amount of data in place to be able to analyze those trends over time, and that will of course depend on your desired period. Comparing over time or experiment groupsGoogle's definitionYou're comparing the data results for different groups that you want to compare and contrast. Using a few examples from the SRE (2016) book:* Are your queries faster in this version of this database or this version of that database? * How much better is my memcache hit rate with an extra node and is my site slower than it was last week? You're comparing it to different buckets of time and different types of products.A proper use case for comparing groupsSebastian did this particular use case recently because he had to compare two different technologies for deploying code: AWS Lambda vs AWS Fargate ECS. He took those two services and played around with different memories and different virtual CPUs. Then he ran different amounts of requests against those settings and tried to figure out which one was the better technology option most cost-effectively.His need for this went beyond engineering work but enabling product teams with the right decision-making data. He wrote out a knowledge base article to give them guidance for a more educated decision on the right AWS service.Having the data to compare the two services allowed him to answer questions like:* When should you be using either of these technologies? * What use cases would either technology be more suitable for?This data-based decision support is based mainly on monitoring or observability data. The idea of using the monitoring data to compare tools and technologies for guiding product teams is something I think reliability folk can gain a lot of value from doing. Conducting ad hoc retrospective analysis (debugging)Debugging is a bread-and-butter responsibility for anyone who is a software engineer of any level. It's something that everybody should know a little bit more about than other tasks because there are very effective and also very ineffective ways of going about debugging. Monitoring data can help make the debugging process fall into the effective side.There are organizations where you have 10 different systems. In one system, you might get one fragmented piece of information. In another, you'll get another fragment. And so on for all the different systems. And then you have to correlate these pieces of information in your head and hopefully, you get some clarity out of the fragments to form some kind of insight. Monitoring data that are brought together into one datastream can help correlate and combine all these pieces of information. With it, you can:* Pinpoint slow-running queries or functions by analyzing execution times and resource usage, helping you identify inefficiencies in your code* Correlate application logs with infrastructure metrics to determine if a performance issue is due to code errors or underlying infrastructure problems* Track memory leaks or CPU spikes by monitoring resource usage trends, which can help you identify faulty code or services* Set up detailed error tracking that automatically flags code exceptions and matches them with infrastructure events, to get to the root cause faster* Monitor system load alongside application performance to see if scaling issues are related to traffic spikes or inefficient code pathsBeing able to do all this makes the insight part easier for you. And so your debugging approach becomes very different. It becomes much more effective. It becomes much less time-consuming. It potentially makes the debugging task fun.Because you get to the root cause of the thing that is not working much faster. Your monitoring/observability data setup can make it nice and fun to a certain degree, or it can make it downright miserable. If it's done well, it's just one of those things you don't even have to think about. It's just part of your job. You do it. It's very effective and you move on. Wrapping upSo we've covered three more use cases for monitoring data, other than the usual alerts and dashboards.They are once again:* analyzing long-term trends* comparing over time or experiment groups and* conducting ad hoc retrospective analysis, aka debuggingNext time your boss asks you what all these systems do, you now have three more reasons that you need to focus on your monitoring and be able to use it more effectively. Until next time, happy monitoring. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

data track comparing analyzing monitoring monitor aws conducting cpu alerts sre dashboards cpus pinpoint debugging aws lambda

#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity

Play Episode Listen Later Aug 20, 2024 37:23

Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He's dedicated much of his talk time at DevOps events to talk about a topic less covered at such technical events. A lot of what he said alluded to ways to become a more valuable engineer.I've broken them down into the following areas:* Avoid the heroic efforts* Mind + heart > Mind alone * Curiosity > Credentials* Experience > Certifications * Thinking for complexityWhen I saw him in Toronto, I thought he would talk about pre-production observability. It would only make sense after watching the previous presenter do a deep dive into Kubernetes tooling.But surprisingly, he started about culture and the need to prevent burnout among engineers — a topic that is as important today as it was 2 years ago when he did the talk. Here's a look into Shlomo's philosophy and the practices he champions.Avoid the heroic effortsShlomo's perspective on heroics in engineering and operations challenges a traditional mindset that often glorifies excessive individual efforts at the cost of long-term sustainability. He emphasizes that relying on heroics — where individuals consistently go above and beyond to save the day — creates an unhealthy work environment. "We shouldn't be rewarding people for pulling all-nighters to save a project; we should be asking why those all-nighters were necessary in the first place."This approach not only burns out engineers but also masks underlying systemic issues that need to be addressed. So, instead of celebrating these heroic efforts, Shlomo advocates for creating processes and metrics that ensure smooth operations without the need for constant intervention. Mind + Heart > Mind aloneOne of the challenges Shlomo has faced recently is scaling his engineering organization amidst rapid growth. His approach to hiring is unique; he doesn't just look for technical skills but prioritizes self-awareness and kindness. "Hiring with heart means looking for individuals who bring empathy and integrity to the team, not just expertise."When he joined The Score, a subsidiary of Penn Interactive, Shlomo immediately revamped the hiring practices by integrating the values above into the process. He favors role-playing scenarios over solely using behavioral interviews to evaluate candidates, as this method reveals how individuals might react in real production situations. I tend to agree with this approach as seeing how people are doing the work is more enlightening than asking them how they behaved in a past situation alone. Curiosity > credentialsHow it plays into career progressionWhen it comes to career progression, Shlomo places little value on traditional markers like education or years of experience. Instead, he values adaptability, resilience, and curiosity. This last trait is the one he doubles down on.According to Shlomo, curiosity is the cornerstone of continuous growth and innovation. It's not just about asking questions. It's about fostering a mindset that constantly seeks to understand the 'why' behind everything. Shlomo advocates for a deep, insatiable curiosity that drives engineers to explore beyond the surface of problems, looking for underlying causes and potential improvements. He believes that this kind of curiosity is what separates good engineers from great ones, as it leads to discovering solutions that aren't immediately obvious and pushes the boundaries of what's possible.How it plays into teamworkFor Shlomo, curiosity also plays a crucial role in building a cohesive and forward-thinking team. He encourages leaders to cultivate an environment where questions are welcomed, and no stone is left unturned. This approach not only sparks creativity but also ensures that everyone is engaged in a continuous learning process, which is vital in a field that evolves as rapidly as DevOps and SRE.By nurturing curiosity, teams can stay ahead of the curve. They can anticipate challenges before they arise and develop right-fit solutions that keep their work relevant and impactful.Shlomo advises engineers not to let their current organization limit them and to always seek out new challenges and learning opportunities. This mindset will make them valuable to any organization they may work with.Experience > Certifications Shlomo's stance on certifications is clear: they don't necessarily lead to career advancement. He argues that the best engineers are those who are too busy doing the work to focus on accumulating certifications. Instead, he encourages engineers to network with industry leaders, demonstrate their skills, and seek mentorship opportunities. Experience and mentorship, he believes, are far more critical to growth than any piece of paper.Thinking for complexityIt's a well-tread saying now, almost a cliche, but still very relevant to standing out in a crowded engineering talent market. Shlomo and I talked about the issue of many engineers being trained to think in terms of best practices. I feel like over time, this emphasis will reduce, especially for more senior roles. Best practices are not directly applicable to solving today's problems that are increasing in complexity. Shlomo tries to test potential hires to see if they can handle the complexity. During interviews, he presents candidates with unreasonable scenarios to test their ability to think outside the box. This approach not only assesses their problem-solving skills but also helps them understand the interconnectedness of the challenges they will face.Wrapping upThe insights Shlomo shared with me underscore a crucial point:The most successful engineers are those who combine technical prowess with a strong sense of curiosity, a commitment to continuous improvement, and a genuine understanding of their role within the team. By embracing these qualities, you not only enhance your current contributions but also set yourself on a path for long-term growth and success. The takeaway is clear: to truly stand out and advance in your career, it's not just about doing your job well — it's about constantly seeking to learn more, improve processes, and connect with your team on a deeper level.These are the traits that make you not just a good engineer, but a valuable one. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

head thinking toronto mind hiring cloud engineers score valuable sanity sacrificing devops kubernetes sre shlomo

#53 What's Missing in Incident Response Processes?

Play Episode Listen Later Aug 15, 2024 9:43

Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because processes supporting the incident response are not robust.Incident response software alone isn't going to fix bad incident processes. It's gonna help for sure. You need these incident management tools to manage the data and communications within the incident. But you also need to have effective processes and human-technology integration. Dr Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority setting. According to Vladislav, at the beginning of your SRE journey, it's not going to be focused on incident response in terms of setting up an incident response process, but more on core SRE artifacts like SLIs, availability measurement, SLOs, etc. And now we are safely investing more into the customer-facing features and things like this. So this is going to be the core SRE concepts. But then at some point, once you've got these things, more or less established in the organization. Understanding and Leveraging SLOsOnce your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they've been validated through production. Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively.Implementing a Formal Incident ResponseBefore you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place. Without this, the process might not be as effective. When the foundational SLOs and organizational culture are strong, a well-structured incident response process can significantly enhance its effectiveness.Coordinating During Major IncidentsWhen a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams. Consider appointing incident commanders and coordinators, as recommended in PagerDuty's documentation, to manage this coordination. Develop a lightweight process to guide how incidents are handled.Classifying IncidentsEstablish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three. Due to the inherently fuzzy nature of incidents, your classification system should also include guidelines for handling ambiguous cases. For instance, if uncertain whether an incident is Priority One or Two, default to Priority One.Deriving Actions from Incident ClassificationBased on the incident classification, outline specific actions. For example, Priority One incidents might require immediate involvement from an incident commander. They might take the following actions:* Create a communication channel, assemble relevant teams, and start coordination. * Simultaneously inform stakeholders according to their priority group. * Define stakeholder groups and establish protocols for notifying them as the situation evolves.Keep Incident Response Processes Simple and AccessibleEnsure that your incident response process is concise and easily understandable. Ideally, it should fit on a single sheet of paper. Complexity can lead to confusion and inefficiencies, so aim for simplicity and clarity in your process diagram. This approach ensures that the process is practical and can be followed effectively during an incident.Preparing Your OrganizationAn effective incident response process relies on an organization's readiness for such rigor. Attempting to implement this process in an organization not yet mature enough may result in poor adherence during critical times. Make sure your organization is prepared to follow the established procedures.For a deeper dive into these concepts, consider reading "Establishing SRE Foundations," available on Amazon and other book retailers. For further inquiries, you can also connect with the author, Vlad, on LinkedIn. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

amazon missing product develop define incident implementing complexity processes vlad sre incident response slo vladislav pagerduty priority one slos slis

Can ITIL Benefit from Site Reliability Engineering?

Play Episode Listen Later Aug 13, 2024 29:23

According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale.However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function.Dr. Vladislav Ukis is well qualified to talk about reliability, being at Siemens Healthineers and leading 250 people globally to offer their cloud platform running off Microsoft Azure.We discussed key concepts from his book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations.Unlike other technical books in this field, Dr Ukis' book is aimed at technology professionals who are beginners to the reliability journey. This is different from the Site Reliability Engineering (2016) book by Google, which covers all the bells and whistles that SRE encompasses. That book requires a degree of prior knowledge and also prior experience in the field. Vlad wanted to make it more accessible:What I did with my book is to say, ‘Okay, so now you've never done operations, but you now are thrown in the world of online services where you have to operate them. How do you get started?' So this is what the book is for. So for people who want to learn how to get started in the world of operating online services.ITIL was originally developed by the UK government in the 80s to improve IT governance. It is best related to SRE through its service management and incident management components. But it's for managing systems that are more predictable and can be handled through strict process control.Modern product delivery doesn't have the luxury of bureaucratic levels of predictability that older IT services have. It requires a more engineer-oriented approach to solving problems/incidents and providing services. So how was Vlad's experience bringing SRE into an organization that previously had run solely on the ITIL model?Siemens Healthineers for many years operated like a traditional software development organization. In other words, they were developing on-prem software, not cloud software. The company would ship the physical software product to its hospital customers and then those hospitals would have the software operated and supported by their IT departments. The change came about when Siemens Healthineers began to work on a new digital health platform, which would be cloud-based from the beginning. So they would no longer ship physical software in discs to customers, but provide online services in the cloud centrally for the customers to use.The early days were haphazardly done with the software deployed to the cloud with no major issues. Not many customers were on the cloud platform so the team could get away with “handcrafted operating procedures”.But as traffic and service count started to rise rapidly, the Healthineers team learned that they needed a more professional approach. They began to understand that their initial approach to operations could not continue as-is.This is when Vladislav began to drive SRE practices in the organization. This was a sub-30-minute conversation that covered a lot of ground that would be relevant to the needs of organizations looking to transition to product delivery of online services at scale. Have a listen. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

google uk modern benefit vlad step guide microsoft azure sre itil vladislav site reliability engineering

#52 Navigating Complexity within Incidents

Play Episode Listen Later Aug 6, 2024 36:52

Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is!Our systems are becoming more complex and so are the resulting incidents.Learning about complexity can help reliability folk go into an incident with less anxiety, which we'll explore in this episode.We'll explore the causes of complexity in incidents and how the Cynefin framework classifies incidents.We'll also deep dive into the concept of complexity itself and dispel a common issue where it gets mixed up with complicatedness.About SonjaSonja is a co-founder of Complexity Fit and founder of More Beyond focusing on helping teams build capacity for sensemaking, collaboration, and wayfinding.She has a background in programming from her early career as a meteorologist, having worked in C and Fortran, and then progressing to working as a web developer.You can connect with Sonja to learn more about complexity via LinkedIn. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

learning navigating complexity incidents fortran cynefin

#51 Whitebox vs Blackbox Monitoring

Play Episode Listen Later Jul 30, 2024 9:56

Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down to white box versus black box monitoring.It's not the same as internal versus external monitoring, which we'll explore further.We'll cover topics like:- (quickly) What is monitoring?- What is whitebox monitoring?- What is black box monitoring?- The rising importance of blackbox monitoringThis is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) book. Chapter written by Rob Ewaschuk and edited by Betsy Beyer. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

google monitoring black box whitebox sres

#50 Making Better Sense of Observability Data

Play Episode Listen Later Jul 9, 2024 24:38

Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data.We crammed into just under 25 minutes ideas like these 7 takeaways:* Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management.* Prioritize Customer Health: in Jack's words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers for a more comprehensive view of your system's impact.* Apply Mathematical Techniques: Incorporate advanced mathematical concepts, like the Nyquist Shannon law and T Digest algorithm, to enhance data accuracy and observability metrics.* Build Accurate Percentiles: Implement techniques to accurately reproduce percentiles from raw data to ensure reliable performance metrics.* Manage High Cardinality Data: Develop strategies to handle high cardinality data without overwhelming your resources, ensuring you extract valuable insights.* Standardize Log Records: Use readily available frameworks to emit standardized log records makes data easier to process and visualize.* Handle High-Velocity Data Efficiently: Develop methods for collecting and processing high-velocity data without incurring prohibitive costs.Watch Jack's Monitorama talk via this link: https://vimeo.com/843996971 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

data sense devops palo alto networks observability jack neely

#49 Alert Fatigue is Still an Issue - Here's How We Fix it

Play Episode Listen Later Jul 2, 2024 30:13

Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone who gave a talk at Monitorama about this very topic. He also happens to be an avid skateboarder!Here are 9 takeaways from our conversation:* Regularly Review and Update Monitoring Systems: Don't set up monitoring once and forget about it. Continuously assess and update your monitoring systems to ensure they remain relevant and effective.* Focus on Relevant Alerts: Ensure your alerting system is tailored to indicate real problems. Avoid relying on outdated criteria such as high CPU or memory usage unless they directly impact user experience.* Adopt a User-Centric Approach: Develop alerts based on how issues affect the user experience rather than purely technical metrics. This helps prioritize what truly matters to the end user.* Evaluate Alert Value: Critically assess each alert for its value. Ask whether the alert provides actionable information and if it impacts the user or business. Eliminate or adjust alerts that don't meet these criteria.* Reduce Alert Noise: Strive to minimize unnecessary alerts contributing to noise and obscure real issues. This makes it easier to detect and respond to genuine problems.* Understand the User Journey: Document the user journey and create Service Level Objectives (SLOs) to align alerts with user-impacting events. This ensures alerts are meaningful and actionable.* Secure Leadership Support: Gain buy-in from leadership by demonstrating the long-term benefits of an effective alerting system. Emphasize how it can improve user satisfaction and operational efficiency.* Improve Documentation and Preparedness: Ensure thorough documentation for all systems and alerts. This reduces stress and increases efficiency, particularly for engineers handling on-call duties.* Automate Alert Responses: Implement automation to handle routine alerts. This reduces the manual burden on engineers and allows them to focus on more complex issues. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

focus fatigue eliminate adopt cpu alert continuously emphasize

#48 Cutting Down "Toil" aka Manual Work in Software

Play Episode Listen Later Jun 25, 2024 44:03

Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil.We hit the jackpot with concepts like:* what is toil according to a 5-point criteria* why even care about toil?* where you can find toil in your software system* Google's goal for how much work (%) should be toil* the fact that toil isn't always all that badDon't have time to listen to what we learned or added to the concepts? Check out the takeaways toward the end of this email.But first…Before we jump into the takeaways, here's a new segment I'm trying out for newsletters. I'll highlight a new reliability tool that I think could help you. Do you struggle to visualize your Kubernetes workloads?In that case, have you heard of kube-ops-view?It helps you visualize your complex K8s clusters and everything inside them.For a deeper rundown, visit the LinkedIn post I made about kube-ops-view which shares a few more details. Back to our original programming…Here are key takeaways from our chat* Define and Identify ToilRegularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction.* Prioritize AutomationLook for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency.* Embrace the Role of an SRERealize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues.* Address Common Sources of ToilIdentify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas.* Adopt a Toil Elimination MindsetCultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes.* Develop a Culture of Continuous ImprovementEncourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams.Until next time, happy toil hunting! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

culture google focus embrace advocates software develop identify define cutting recognize manual implement adopt toil kubernetes sre k8s site reliability engineering

#47 How to Grow Team Impact Through Learning Culture

Play Episode Listen Later Jun 18, 2024 28:38

The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture.We might think we already have a good learning culture because we talk about problems and deep-dive them into retrospectives.But how often do we explore the nuances of how we are learning?Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (software consultancy) and now does the same work under her own banner.Her work ties in well with the ideas shared by Manuel Pais in episode #45 about how enabling teams can support a continuous learning culture. We tackled issues like the value of certifications, comparing technical with non-technical skills, and more. You can ⁠connect with Sorrel via LinkedInLearn more about what Sorrel does via LaaS.consultingHere's a bonus section because you read all this way. It covers 5 public outages and how the affected teams could improve their learning culture: 1. Slack Outage (February 2023)Slack experienced a global outage disrupting communication for hours due to backend infrastructure issues. Perhaps the team could focus their learning on more robust infrastructure management and resilience improvement.2. Twitter Algorithm Glitch (April 2023)A glitch in Twitter's algorithm caused timeline issues, stemming from a problematic software update. Perhaps the team could focus their learning on thorough testing and game days to rectify critical system errors swiftly.3. Microsoft Azure AD Outage (March 2023)Azure Active Directory faced a significant outage due to an internal configuration change. Perhaps the team could focus their learning on the importance of rigorous change management and how to address misconfigurations quickly.4. Google Cloud Platform Networking Issue (May 2023)Google Cloud Platform experienced widespread service disruptions from a software bug in its networking infrastructure. Perhaps the team could focus their learning on the need for comprehensive testing and preventing disruptions.5. GitHub Outage (June 2023)GitHub suffered a major outage caused by a cascading failure in its storage infrastructure. Perhaps the team could focus their learning on robust fault-tolerance mechanisms and ways to address the root causes of failures. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

slack github google cloud platform learning culture sorrel laas azure active directory team impact manuel pais

#46 Platform Team Design According to Team Team Topologies

Play Episode Listen Later Jun 11, 2024 24:07

I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book about team topologies suitable for reliability teams.In this second part, we will talk about platform teams. A quick refresher on what platform teams doIn the team topologies context:Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity.They achieve this directive by abstracting away common infrastructure and operational concerns. By doing this, they aim to allow stream-aligned teams to focus on delivering business value.Here are the key takeaways from our conversation For those who don't have time to listen to this episode (but you're missing out on a great conversation):* Focus on User-Centric Design: Prioritize the user experience in platform development. Regularly collaborate with internal teams to ensure the platform meets their needs and reduces their pain points.* Build and Maintain Trust: Establish and nurture trust with your platform's users. Trust is crucial for platform adoption and can prevent resistance thus assuring sustained use.* Justify Platform Value: Continuously demonstrate the value of your platform to management and stakeholders, especially during economic downturns. Highlight its contributions to avoid cuts and maintain support.* Understand Adoption Lifecycle: Recognize that platforms go through different stages of adoption. Identify and support early adopters, and gradually bring in late adopters by showcasing successful use cases.* Enhance Collaboration: Foster open communication between platform teams and other teams. Avoid rigid roadmaps and be adaptable to changing needs to prevent barriers and build stronger internal relationships.* Manage Cognitive Load: Be mindful of the cognitive load on your teams. Simplify processes and reduce unnecessary complexities to enhance productivity and efficiency.* Use Tools to Measure Cognitive Load: Implement tools like Teamperature to assess the cognitive load on your teams regularly. Use the insights to identify and mitigate factors contributing to cognitive overload.* Leverage Experienced Product Managers: Ensure experienced product managers are part of your platform team. They can balance long-term goals with the flexibility needed to adapt to the evolving needs of internal users.I think the uncommon takeaway here is #9 in that platform teams should treat their platform as a product. Product Managers like and Marty Cagan are doing great work in laying out the roadmap for product management. Did you end up checking out the reliability workstreams map I published last week?It's free and can help you stay focused on the right priorities at work.Check it out via this link This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

trust design focus identify platform simplify product managers doin marty cagan team topologies manuel pais platform team

#45 How Team Topologies Can Guide Enabling Teams

Play Episode Listen Later Jun 4, 2024 25:09

I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, to explain in a 2-part series about 2 of the most relevant team topologies for reliability work. In this first part, we will talk about enabling teams.A quick refresher on what enabling teams doIn the team topologies context:Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas.This kind of team is available to provide expertise, guidance, and support to other teams working to adopt new technologies, practices, or skills.In other news…This podcast has a new nameWhat more a fitting moment to announce renaming the SREpath podcast to “The Reliability Enablers” podcast?This name change reinforces our quest to demystify and enable reliability efforts so that more organizations successfully implement SRE principles and beyond. Before we get to the 8 takeawaysHere's something relevant to enabling reliability work — a reliability workflows map I've had in my private notes for years, now going public.What is a workstream?

guide work adoption takeaways enabling collaborate doin sre team topologies manuel pais

#44 - Making SLOs Matter to Stakeholders

Play Episode Listen Later May 30, 2024 20:22

Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs and make them relevant to stakeholders. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

stakeholders slos

#43 - SLOs: a Deeper Dive into its Mechanics

Play Episode Listen Later May 28, 2024 31:44

This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs.Here are 5 takeaways from the show:* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.* Continuous Improvement: Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success.* Effective Communication Skills: Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate.* Understanding User Needs: Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

develop commit user defend mechanics start small deeper dive site reliability engineering slos

#42 - Hitting Software SLA Targets through SLOs and SLIs

Play Episode Listen Later May 21, 2024 29:18

In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show:* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

focus software hitting align targets slas sres site reliability engineering slos slis

#41 Curbing High Observability Costs

Play Episode Listen Later May 14, 2024 24:34

No one wants to get Coinbase's $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed the concrete and perceived value on balance sheets and the minds of leaders. Sofia Fosdick shares practical insights on curbing high observability costs. She's a senior account executive at Honeycomb.io and has held similar titles at Turbunomic, Dynatrace, and Grafana. Like always, this is not a sponsored episode!We tackled the cost issue by covering ideas like aligning cost with value, event-based systems, and dynamic sampling. You will not want to miss this conversation if your observability bill is starting to look dangerous.You can ⁠connect with Sofia via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

costs coinbase curbing honeycomb observability grafana dynatrace

#40 How to Enable Observability for Success

Play Episode Listen Later May 7, 2024 27:59

Observability is more than a set of technologies. It's a practice. Timothy Mahoney is no stranger to this practice, enabling many developer teams to take on better practices in observability. He's a senior systems engineer at IKEA and is part of its observability enabling team. Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of switching them out of this. You can ⁠connect with Timothy via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

success ikea enable observability timothy mahoney

#39 How Chaos Engineering Helps Reduce Incident Risk

Play Episode Listen Later Apr 30, 2024 24:47

Chaos Engineering is no longer a nice to have, as Ananth Movva explains in this episode of the SREpath podcast. His experiences with it drove a reduced number and severity of serious incidents and outages.He's been at the helm of reliability-focused decision-making at one of Canada's largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are considered seriously.Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about issues in bringing developers to integrate chaos into SDLC. You will not want to miss this conversation!You can ⁠connect with Ananth via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

canada risk helps reduce incident bmo sdlc chaos engineering ananth

#38 The Real Cost of Software Reliability & Downtime

Play Episode Listen Later Apr 23, 2024 23:51

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and choosing not to do it well or at all.Here are key takeaways from our conversation:* Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively.* Consider Cost-Effectiveness: When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to your organization.* Advocate Continuously: Continuously advocate for the importance of reliability engineering within your organization. Communicate transparently about the value SRE teams add and the impact of their work on the organization's success.* Explore Alternative Metrics: Explore alternative availability metrics beyond traditional time-based measurements. Consider event-based metrics to gain a more nuanced understanding of service availability and performance.* Embrace Regional Focus: Shift from relying solely on global availability metrics to more granular regional metrics. Understand the varying impacts on different customer audiences and prioritize improvements accordingly.* Navigate Regulatory Challenges: Be mindful of regulatory challenges, such as GDPR, and understand their implications on service availability and reliability. Adapt strategies and solutions to comply with regulations while maintaining operational efficiency.* Align Reliability with Revenue: Recognize the direct correlation between service availability and revenue generation, particularly for revenue-driven services like ad platforms. Invest in reliability engineering to ensure consistent revenue streams.* Tier Services Strategically: Implement a tiered approach to prioritize reliability efforts, with revenue-generating services like ad platforms placed in the top tier. Allocate resources based on the criticality of services to the organization's objectives. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

invest software communicate adapt evaluate gdpr reliability downtime real cost sre allocate site reliability engineering

#37 An SRE Approach to Managing Technology Risk

Play Episode Listen Later Apr 16, 2024 30:08

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective. We'll cover how it's very different to the typical IT risk management mindset. Here are key takeaways from our conversation: Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation. Reevaluate Risk Management Approaches: Challenge traditional approaches to risk management, especially in larger organizations with extensive governance procedures. Explore alternative methods that prioritize agility and efficiency without compromising reliability. Conceptualize Risk as a Continuum: View risk as a continuous spectrum and assess it based on various dimensions, such as the complexity of changes, the criticality of systems, and the impact on user experience. Continuously evaluate and adjust risk management strategies accordingly. Balance Stability and Innovation: Recognize that extreme reliability comes at a cost and may hinder the pace of innovation. Aim for an optimal balance between stability and innovation, prioritizing user satisfaction and efficient service operations. Implement Service-Level Objectives (SLOs): Deliver services with explicitly delineated levels of service, allowing clients to make informed risk and cost trade-offs when building their systems. Define SLOs based on the importance and criticality of services to enable better decision-making. Visualize Risk Assessment: Utilize visual representations, such as whiteboard diagrams, to assess and communicate different levels of risk within your software systems. Encourage collaborative discussions among team members to determine acceptable risk levels. Prioritize Customer Impact: Consider the impact of changes on customer experience and prioritize risk management efforts accordingly. Differentiate between critical user journeys and cosmetic changes to allocate scrutiny appropriately. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

technology managing explore risk encourage aim strive differentiate continuously sre site reliability engineering

#36 Avoiding Critical Platform Engineering Mistakes

Play Episode Listen Later Apr 9, 2024 26:56

Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell us how to achieve greater maturity in this aspect of software operations. She's previously held SRE roles and currently works as Principal Engineer at Syntasso, the company behind the popular Kratix platform framework. Abby highlighted the need for concrete definitions and maturity models in platform engineering trends, cautioning against equating developer portals with fully functional platforms. We also dived into the need to understand your socio-technical landscape with an emphasis on the value of frameworks and method-based approaches.You can ⁠connect with Abby via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

mistakes platform jokes devops sre principal engineer platform engineering

#35 Boosting Your Observability Data's Usability

Play Episode Listen Later Apr 2, 2024 35:04

The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected?Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data.He is the founder and CEO of SquaredUp, a dashboard software company based out of Maidenhead, UK with over 10 years of experience in the monitoring space. Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights. You can connect with Richard via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

ceo uk data boosting usability observability maidenhead

#34 From Cloud to Concrete: Should You Return to On-Prem?

Play Episode Listen Later Mar 26, 2024 22:59

This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem, which is analogous to that other debate we have in the technology of build vs buy. Here are key takeaways from our conversation: Adapt your storage solutions to business needs: Understand the diverse storage options available and tailor them to specific business needs, considering factors like data type, access patterns, and scalability requirements. Optimize your load balancing: Implement global load balancing strategies to optimize user experience and performance by directing traffic to the nearest data center to minimize latency, and maximize resource utilization. Don't hesitate to continuously evaluate your cloud: Assess the suitability of cloud solutions against your organization's needs, considering factors like cost, control, scalability, and security, and be open to reevaluating decisions based on evolving requirements. Make strategic decisions for your operations footprint: Lean on decisions based on thorough analysis that considers: Encourage objective evaluation and formal planning processes in decision-making: avoid emotional reactions or being swayed by external influences, to ensure decisions are based on sound analysis and truly aligned with organizational goals. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

cloud encourage adapt optimize implement concrete assess prem site reliability engineering

#33 Inside Google's Data Center Design

Play Episode Listen Later Mar 19, 2024 23:11

This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design outlined in the book. One thing is for sure. Building a data center for your own needs is HARD work with many considerations you must make.Here are key takeaways from our conversation: Importance of understanding data center fundamentals: Even if you're not operating at the scale of companies like Google, understanding the fundamentals behind data center infrastructure can help. This knowledge can inform decisions on cloud services, high availability strategies, and the architectural design of systems to ensure resilience and scalability. The impetus to leverage cloud infrastructure: The transition from traditional on-premises infrastructure to cloud-based solutions is a critical trend. Organizations can learn from how tech giants manage resources efficiently at scale, to improve their resource allocation. Cyclical trends in technology adoption: trends in technology are cyclical and that can inform strategic decisions. As there's a current discussion around moving from cloud-centric models back to more traditional data center approaches, understanding the history and evolution of tech infrastructure can prepare organizations to adapt to and anticipate future shifts in the technological landscape. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

google building design organizations data centers cyclical site reliability engineering

#32 Clarifying Platform Engineering's Role (with Ajay Chankramath) BONUS EP

Play Episode Listen Later Mar 14, 2024 16:58

Will Platform Engineering replace DevOps or SRE or both? I don't think this is the case at all. Neither does Ajay Chankramath.He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. I'd take his word for it since he's held senior leadership roles in release engineering and more since 2002.In this bonus episode of the SREpath podcast, Ajay shared his perspective on the debate about SRE vs DevOps vs Platform Engineering. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

head devops clarifying sre platform engineering

#31 Introduction to FinOps (with Ajay Chankramath)

Play Episode Listen Later Mar 12, 2024 26:37

FinOps is on the tip of many tongues in the software space right now, as we try to curb our cloud costs. Ajay Chankramath has given talks on FinOps at conferences like the DevOps Enterprise Summit (DOES) among others.He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. His peers like Martin Fowler and Neal Ford have originated ideas like refactoring, microservices, and more.He shared practical advice for avoiding a harsh, restrictive cost control approach and instead taking a holistic financial view of your software operations. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

head finops platform engineering martin fowler neal ford

#30 Clearing Delusions in Observability (with David Caudill)

Play Episode Listen Later Mar 7, 2024 37:16

Observability is going through interesting times. David Caudill believes that delusions are getting in the way of our success in this area. He's a senior engineering manager at Capital One, a US-based bank. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

clearing capital one delusions observability caudill

#29 - Reacting to Google's SRE book 2016 (Chapter 1 Part 2)

Play Episode Listen Later Feb 27, 2024 31:25

Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jennifer Pettof, Niall Murphy, et al. We covered passages like: Monitoring is one of the primary means by which service owners keep track of a system's health and availability. Efficient use of resources is important anytime a service cares about money. Humans add latency, even if a given system experiences more actual failures. A system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands on intervention. SRE has found that roughly, 70 percent of outages are due to changes in a live system. Best practices in this domain use automation to accomplish implementing progressive rollouts. Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand, the required availability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

google humans reacting monitoring efficient sre site reliability engineering niall murphy

#28 - Reacting to Google's SRE Book 2016 (Chapter 1 Part 1)

Play Episode Listen Later Feb 20, 2024 25:45

Sebastian and I got together to react to and discuss 5 passages from Chapter 1 of Google's Site Reliability Engineering book (2016) by Betsy Beyer, Jennifer Pettof, Niall Murphy, et al. We covered passages like: The sysadmin approach and the accompanying development ops split have a number of disadvantages and pitfalls Google has chosen to run our systems with a different approach. Our Site Reliability Engineering teams focus on hiring software engineers to run our products The term DevOps emerged in industry. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions. Google caps operational work for SREs at 50 percent of their time. Their remaining time should be spent using their coding skills on project work. Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

google product reacting devops sre sres site reliability engineering niall murphy

#27 - Growing as a Site Reliability Engineer (Part 3)

Play Episode Listen Later Feb 13, 2024 16:29

Third and final instalment of the Growing as an SRE series covering practical ideas for planning your career progression This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

engineers sre site reliability engineer

#26 - Growing as a Site Reliability Engineer (Part 2)

Play Episode Listen Later Feb 8, 2024 19:06

In part 1, we covered the first truth - that you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 more truths that are somewhat trickier...Background music credit: Luna by KaizanBlue This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

engineers site reliability engineer

#25 - DORA and the Pursuit of Engineering Excellence (with Tim Wheeler)

Play Episode Listen Later Jan 30, 2024 37:48

DORA metrics are a hot topic among technology executives in all kinds of enterprise. But there's more to engineering culture than solely relying on the numbers it goes you. We have a rare treat for you because Ash got Tim Wheeler on the pod. He doesn't do much of social media or podcast episodes. Tim is Director of Engineering Excellence at SquaredUp where he follows the DORA metrics but emphasizes starting conversations around them rather than setting directives. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

director excellence pursuit engineering ash tim wheeler

#24 - Growing as a Site Reliability Engineer (Part 1)

Play Episode Listen Later Jan 23, 2024 8:42

How can you grow as an SRE? You've probably thought about your career progression at some point. Ash put together his initial thoughts on this topic. Listen on to learn how he unpacks the first idea of "You don't get promotions with tenure". This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

engineers ash sre site reliability engineer

#23 - The Danger of Unreliable Platforms (with Jade Rubick)

Play Episode Listen Later Jan 16, 2024 29:05

Jade Rubick needs no introduction in the reliability and observability space. He was VP of Engineering at New Relic from 2010 to 2019. It was my pleasure to take on his non-obvious ideas on managing expectations with teams, especially platform-based teams. We had a few spicy ideas to dive into.We also touched on topics like enhancing engineering practices, DORA metrics, and so much more. Be sure to listen all the way through to learn Jade's amazing insights. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

danger engineering platforms unreliable new relic rubick

#22 - How Google does SRE Consulting (with Yury Niño Roa)

Play Episode Listen Later Jan 9, 2024 35:51

I did not know that Google itself does consulting around its SRE practices. This is not a sponsored episode LOL! I wanted to talk with my SRE friend, Yury Niño Roa, about her drawings and SRE ideas, but we dove into a whole lot more than that. We spoke about her work at Google's PSO office, the antipatterns she's seen, and a whole lot more. Listen in for an engaging conversation.You can follow Yury and her amazing drawings via: https://www.linkedin.com/in/yurynino/ This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

google consulting lol sre roa pso yury

#21 - Better SRE in 2024 is all we can hope for

Play Episode Listen Later Jan 2, 2024 32:25

Sebastian is back for this episode to help set out direction for 2024. We reflected during the holidays on the problems SREs faced in 2023 in terms of job insecurity, burnout, and "that really shouldn't be my sole job". Sebastian and I talked about what we hope to bring to the community in 2024 to make SREs and SRE teams stronger, happier, and healthier at their work. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

sre sres

#20 Holiday Special with Stephen Townshend

Play Episode Listen Later Dec 19, 2023 29:32

Join Ash Patel and Stephen Townshend for a friendly chat about what they've learned in SRE as 2023 comes toward a wrap! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

holiday special sre townshend

#19 How to Develop Early Career Engineers (with John Hyland)

Play Episode Listen Later Dec 12, 2023 40:35

Ash Patel talks with John Hyland who ran the Ignite Program at New Relic, which is dedicated to developing early career engineers.John shares insights about driving better outcomes for the organization and the early career professionals who join them. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

develop engineers early career hyland new relic ignite program

Claim S.R.E.path Podcast

In order to claim this podcast we'll send an email to with a verification link. Simply click the link and you will be able to edit tags, request a refresh, and other features to take control of your podcast page!

Claim Cancel