In this episode of Experts Unleashed, I had the pleasure of hearing from Rhonda Melogy, the founder of Business Command Center, as she shared her expertise on running self-liquidating offers (SLOs). Rhonda took me through her journey, from her background in business to her transition into digital products, and how she created innovative Trello templates to help entrepreneurs declutter their digital chaos. One of the most fascinating parts of our conversation was how she scaled her ad spend from $5,000 to $15,000 per month. She broke down her winning Facebook ad strategies, including the importance of revealing the offer upfront and making her marketing relatable, especially for her primary audience: women over 40 juggling multiple responsibilities. Rhonda didn't hold back on the lessons she's learned along the way. She opened up about overcoming tech challenges, the power of authenticity, and why creating a personalized customer experience is a game-changer. Her biggest takeaway? Keep things simple and focus on solving specific pain points, because that's where true business success lies.
The Scaling Lounge: Business Strategy • Operations • Team
So, fun fact: If you've ever thought, “Ugh, funnels feel so bro-markety and sleazy,” you're not alone. But here's the thing – you already have a funnel, whether you call it that or not. The real question is… is it actually working for you?

In this episode, I'm diving deep into what's working in funnels right now (because let's be real—so much of what used to work just… doesn't anymore). I'm breaking down the three classic funnels that once ran the online business world, why they're struggling today, and the one strategy I swear by that blends the best of evergreen, live launching, and low-ticket sales without the pushy sales tactics.

If you want a funnel that feels aligned, converts better, and doesn't make your audience roll their eyes, you're in the right place.

What's Inside This Episode:
The three funnels that dominated online marketing—and why they aren't cutting it anymore
What “bro marketers” got so wrong about funnels and how to build one that feels high-integrity instead of high-pressure
The sneaky reason evergreen webinars aren't converting like they used to
Why live launches aren't *dead* exactly... but they do need a serious refresh
The big problem (and bigger future problem) with self-liquidating offers (SLOs) that is rarely talked about
How to structure a funnel that nurtures your audience instead of shoving them into a sale
The Invisible Sales Triggers that sell for you without gross persuasion tactics

Resources mentioned in this episode:
Learn more about The Evergreen Expert, where I'll show you how to build a high-converting Value Add Funnel for your high-ticket offer

Connect with Adriane:
Adriane's Instagram
Visionaries' Instagram
Adriane's LinkedIn
Visit the Visionaries website

⭐️ Love this episode? We'd GREATLY appreciate a 5-star review! ⭐️
Observability is expensive because traditional tools weren't designed for the complexity and scale of modern cloud-native systems, explains Christine Yen, CEO of Honeycomb.io. Logging tools, while flexible, were optimized for manual, human-scale data reading. This approach struggles with the massive scale of today's software, making logging slow and resource-intensive. Monitoring tools, with their dashboards and metrics, prioritized speed over flexibility, which doesn't align with the dynamic nature of containerized microservices. Similarly, traditional APM tools relied on “magical” setups tailored for consistent application environments like Rails, but they falter in modern polyglot infrastructures with diverse frameworks.

Additionally, observability costs are rising due to evolving demands from DevOps, platform engineering, and site reliability engineering (SRE). Practices like service-level objectives (SLOs) emphasize end-user experience, pushing teams to track meaningful metrics. However, outdated observability tools often hinder this, forcing teams to cut back on crucial data. Yen highlights the potential of AI and innovations like OpenTelemetry to address these challenges.

Learn more from The New Stack about the latest trends in observability:
Honeycomb.io's Austin Parker: OpenTelemetry In-Depth
Observability in 2025: OpenTelemetry and AI to Fill In Gaps
Observability and AI: New Connections at KubeCon

Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
SecOps continues to be one of the most challenging areas of cybersecurity. It involves addressing alert fatigue, minimizing dwell time and mean time to respond (MTTR), automating repetitive tasks, integrating with existing tools, and demonstrating ROI.

In this episode, we sit down with Grant Oviatt, Head of SecOps at Prophet Security and an experienced SecOps leader, to discuss how AI SOC Analysts are reshaping SecOps by addressing systemic security operations challenges and driving down organizational risk.

Grant and I dug into a lot of great topics, such as:
Systemic issues impacting the SecOps space, including alert fatigue, triage, burnout, staffing shortages, and the inability to keep up with threats.
What makes SecOps such a compelling niche for Agentic AI, and the key ways AI can help with these systemic challenges.
How Agentic AI and platforms such as Prophet Security can improve key metrics such as SLOs or mean time to remediate (MTTR) to drive down organizational risk.
Addressing the skepticism around AI, including its use in production operational environments and how the human-in-the-loop still plays a critical role for many organizations.
How Agentic AI may augment or replace Managed Detection and Response (MDR) providers, which many organizations use, depending on the organization's maturity, complexity, and risk tolerance.
How Prophet Security differs from vendor-native offerings such as Microsoft Copilot, and the role of cloud-agnostic offerings for Agentic AI.
Leading Improvements in Higher Education with Stephen Hundley
This episode features a conversation with a higher education assessment leader involved in championing Student Learning Outcomes, or SLOs as they are also known. Our guest is Jarek Janio, who, in addition to being a faculty member at Santa Ana College, is also one of the founders of the California Outcomes & Assessment Coordinator Hub, otherwise known as COACHes, which provides plentiful resources for those involved in working with SLOs.

Link to resource mentioned in this episode:
California Outcomes & Assessment Coordinator Hub (COACHes): https://coaches.institute/

This season of Leading Improvements in Higher Education is sponsored by the Center for Assessment and Research Studies at James Madison University; learn more at jmu.edu/assessment. Episode recorded: December 2024. Host: Stephen Hundley. Producers: Chad Beckner and Angela Bergman. Original music: Caleb Keith. This award-winning podcast is a service of the Assessment Institute in Indianapolis; learn more at go.iu.edu/assessmentinstitute.
eBay, Yahoo, Netflix, and then 10+ years at Uber. In this episode we sit down with Vishnu Acharya, Head of Network Infrastructure EMEA and Platform Engineering at Uber. Vishnu shares how Uber has scaled over the years to about 4,000 engineers and how his team makes sure that infrastructure and platform engineering scale with the growing company and the growing demand on its digital services.

Tune in and learn how Vishnu thinks about SLOs across all layers of the stack, how his team gets better insights with their cloud providers, and why it's important to have an end-to-end understanding of the most critical end-user journeys.

Links we discussed:
Conference talk at Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/vishnu-acharya
Vishnu's LinkedIn page: https://www.linkedin.com/in/vishnuacharya/
Uber Engineering Blog: https://www.uber.com/blog/engineering/
Hometown Radio 12/19/24 6p: Learn the history of SLO's Mee Heng Low Cafe
Launch Your Box Podcast with Sarah Williams | Start, Launch, and Grow Your Subscription Box
This past year in my business was filled with wins, challenges, and hard-earned lessons. And that's okay. Because being an entrepreneur, and especially a subscription box owner, isn't as glamorous as the internet sometimes makes it seem. It's more like a rollercoaster – holding on tight through the highs, the lows, and all the in-betweens.

In this episode, I'm taking you behind the scenes of my business, sharing the good, the bad, and the downright frustrating moments. Why share all of this with you? You know I believe in being transparent. And I want to inspire you to take a good look at your own business, reflect on your year, and use those insights to propel you into 2025, making it your best year yet!

Why Reflecting on the Past Year in Your Business Matters
Looking back helps us see what worked, what didn't, and how we can grow. As entrepreneurs, it's easy to focus on where we fell short. I'm challenging you to shift your mindset to celebrate the progress you've made instead. I started my subscription box business in 2017. To date, I've surpassed $10 million in revenue. That's the gain I choose to focus on, even when things don't go as planned. And there were definitely some times this year when things didn't go to plan.

The Highs – What Worked in 2024
Recurring Revenue – My three subscriptions—The Monogram Box™, T-Shirt Club, and Tees 4 Teachers—account for 75% of my revenue. Recurring income is the foundation of my business.
Successful Launch – My April launch, featuring low-cost bonuses, brought in over 400 new subscribers in just a few days.
Turning Inventory into Growth – Self-liquidating offers (SLOs) helped me clear out excess inventory while attracting new customers and turning one-time buyers into subscribers.

The Lows – What Didn't Work in 2024
A Slow Start to the Year – Taking my eye off the numbers in January led to slower months and a revenue deficit I worked all year to recover from.
Campaigns that Fell Short – An "intro offer" flopped early in the year. But… instead of giving up on the strategy, I revamped and relaunched it later with much better results.

2024 Had Its Surprises, Too
Retention Challenges – My retention has always been high. Always. Like 94%+ high. Until it wasn't. A payment failure issue in September caused my retention rate to drop to a never-before-seen 79%. This was a hard pill to swallow – and sent me for the wine and chocolate – but it taught me to stay resilient.

Key Takeaways for Your Business
Track Your Numbers – Every month. Don't let busy months distract you from keeping an eye on your metrics. What happens today impacts the months ahead.
Embrace Launches – Launches breathe new life into your business, giving customers a reason to subscribe or make a purchase.
Experiment and Adjust – Not every campaign will succeed. But even those that don't can teach you how to improve and grow.
Focus on the Gain – Instead of dwelling on where you want to be, celebrate how far you've come. Progress is worth celebrating.

Entrepreneurship isn't easy, but it's worth it. By reflecting on your year, focusing on the gains, and staying intentional with your plans for 2025, you can set yourself up for success. Join me for this episode where I take an honest look at my subscription box business—and challenge you to reflect on 2024 in your business. Tune in and let's dive into the lessons of the year!
Resources Mentioned in This Episode: The Ed Mylett Show (Podcast) The Gap and The Gain by Dan Sullivan and Benjamin Hardy Subscribe to the podcast on your favorite podcast platform and leave a 5-star rating and a review! Join me in all the places: Facebook Instagram Launch Your Box with Sarah Website Are you ready for Launch Your Box? Our complete training program walks you step by step through how to start, launch, and grow your subscription box business. Join the waitlist today!
How can teams unlock elite productivity while navigating the complexities of DevOps? Expert consultant Xing Zhou reveals how DORA metrics—from deployment frequency to change failure rate—drive performance and bridge gaps between leadership and team members. Xing's impressive career spans companies like Amazon, Pivotal Labs, and Integral, where he has led high-performing teams, implemented cutting-edge DevOps strategies, and helped organizations make meaningful progress toward elite DORA benchmarks. Xing and Ashok discuss actionable steps for C-Suite leaders and product teams, focusing on SLAs, CI/CD, behavior-driven development, and event storming workshops to drive alignment and enhance productivity. Packed with real-world examples, this conversation is essential listening for anyone aiming to turn metrics into a reflection of team health and organizational success.

Inside the episode...
A clear explanation of DORA metrics: what they are and why they matter.
How SLAs and SLOs empower teams while aligning with organizational goals.
Continuous delivery (CD): balancing technical capability with business priorities.
The difference between pull request (PR) and trunk-based development for CI.
Behavior-driven development (BDD) and its role in improving test automation.
Event storming: how this simple tool drives clarity in business processes.
Practical strategies for introducing new practices like BDD or event storming to your team.

Mentioned in this episode:
Behavior-Driven Development (BDD)
Gherkin language for test automation
Event storming

Unlock the full potential of your product team with Integral's player coaches, experts in lean, human-centered design. Visit integral.io/convergence for a free Product Success Lab workshop to gain clarity and confidence in tackling any product design or engineering challenge.

Subscribe to the Convergence podcast wherever you get podcasts, including video episodes, to get updated on the other crucial conversations that we'll post on YouTube at youtube.com/@convergencefmpodcast

Learn something? Give us a 5-star review and like the podcast on YouTube. It's how we grow.

Follow the Pod
LinkedIn: https://www.linkedin.com/company/convergence-podcast/
X: https://twitter.com/podconvergence
Instagram: @podconvergence
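For anyone hearing about them for the first time, here is a minimal sketch of how the four DORA metrics the episode unpacks can be computed from deployment records. The record format and the numbers are illustrative assumptions, not data from the show.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records (fields are assumptions for illustration).
deploys = [
    {"deployed": datetime(2024, 6, 3), "committed": datetime(2024, 6, 1),
     "failed": False, "restored": None},
    {"deployed": datetime(2024, 6, 5), "committed": datetime(2024, 6, 4),
     "failed": True, "restored": datetime(2024, 6, 5, 4)},
]
window_days = 30

# Deployment frequency: deploys per day over the window.
deployment_frequency = len(deploys) / window_days

# Lead time for changes: median commit-to-deploy time, in hours.
lead_time_h = median((d["deployed"] - d["committed"]).total_seconds() / 3600
                     for d in deploys)

# Change failure rate: share of deploys that caused a production failure.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# Time to restore service (MTTR): median recovery time for failed deploys.
recovery = [(d["restored"] - d["deployed"]).total_seconds() / 3600
            for d in deploys if d["failed"]]
mttr_h = median(recovery) if recovery else 0.0

print(f"{deployment_frequency:.2f} deploys/day, lead time {lead_time_h:.0f}h, "
      f"CFR {change_failure_rate:.0%}, MTTR {mttr_h:.0f}h")
```

Elite benchmarks are usually framed in these same units (multiple deploys per day, lead time under a day, and so on), so even a simple report like this is enough to see where a team stands.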
Decoding the ABCs of Acronym Overload - A Pep Talk by YOUR Principal

"Good morning, team! Today's priorities: align PPG with MTSS strategies, focusing on PBIS to reduce ODRs. Address RTI needs for students with ADD, ADHD, or OHI in IEPs and 504s. Boost ELA, MAPE, and SEL through targeted SLOs. Support ELL students via SSI and ensure compliance with FAPE and ESSA. Plan interventions during PLC to discuss SPED and ESY. Tonight: report PBIS, IEP, and SEL gains to the BOE. In fact, I'm going to take them directly to the BOE for you. Let's make this count, people! We've got this together."

Holy crap! It's like the Da Vinci Code. We need to have some kind of code breaker.

Do you have a story to share? Do you just want to talk? Send us a text!

#holidayParties #StaffParties #TheyDontPayMeEnoughForThis, #IGiveUp, #HandsToYourself, #WhyDoIBotherDressingUp, #WhatItsAllAbout, #LessonPlan, #BathroomBreak, #DanielsonModel, #TryingToBeNice, #StopTheWorldIWantToGetOff, #WeDidntKnow, #WeDidntKnowWhatWeDidntKnow, #StressedTeacher, #funny, #NiceTry #StillFail #elementaryHumor, #DoAsISayNotAsIDo, #AForEffort, #IsItSummerYet, #ImHip #CoolTeacher, #WhyIsThisSticky, #ClassPets

Please contact us with comments or questions at podcastwedidntknow@gmail.com.
Facebook: https://www.facebook.com/SueandLisa
Instagram: https://www.instagram.com/wedidntknowpodcast/
YouTube: https://www.youtube.com/channel/UCgpsWcy93XJpleqVCML4IBQ
Thanks for listening! -Sue and Lisa

#teacherlife #teachersofinstagram #teacher #iteach #teachers #iteachtoo #funnyteacherstories #education #teaching #school #teach #teacherstyle #classroom #teacherretirement #teachertribe #learning #teacherproblems #students #elementaryteacher #primaryteacher #cryingteachers #elementary #thirdgrade #fourthgrade #fifthgrade #cryinginmycar #teacherfunny #ageism #proudtoteach #teachermamas #recessduty
Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he's vocal about the need for more user-focused observability, especially in mobile, where traditional practices fall short.

* Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.
* Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.
* Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.
* Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.
* Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.
* Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.
* Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.
* Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager's observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.
* OpenTelemetry's Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don't align with traditional time-based observability.
* SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences—a critical factor in retaining app users.
* Amazon's Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon's practice of operational readiness reviews before launching new services.
These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.
* Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
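Andrew's point about user-centric mobile signals can be made concrete with a small example. Here is a minimal sketch of a crash-free-session SLI, the kind of basic signal tools like Crashlytics popularized, checked against an SLO target; the session data and the 99.5% target are assumptions for illustration, not figures from the episode.

```python
# Hypothetical session outcomes: True means the session ended in a crash.
sessions = [False] * 9_941 + [True] * 59  # 10,000 sessions, 59 crashed

# SLI: fraction of sessions that were crash-free.
crash_free_rate = 1 - sum(sessions) / len(sessions)

slo_target = 0.995  # assumed objective: 99.5% of sessions crash-free

print(f"Crash-free sessions: {crash_free_rate:.2%} (target {slo_target:.1%})")
if crash_free_rate < slo_target:
    print("SLO breached: users are hurting even if backend uptime looks fine.")
```

A session-based SLI like this captures what the user actually saw, which is exactly the gap Andrew describes between backend uptime and mobile experience.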
Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They discuss the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.
Market, Scale, Grow: Facebook Ad Strategy for Teacherpreneurs
Can a simple, low-cost offer transform your marketing strategy? Discover the potential of tripwires in this Saturday strategy session, where I break down this powerful tool that can help offset ad expenses and grow your email list. In this episode, we'll cover...
✨ the essentials of tripwire pricing
✨ the inclusion of bumps and one-time offers
✨ how this technique compares to self-liquidating offers (SLOs)
Whether you're a seasoned marketer or just getting started, this episode is packed with actionable insights that could revolutionize your approach.
__________________
Find me on Instagram → https://www.instagram.com/heyitsjenzaia/
Email Me → support@jenzaiadimartile.com
Join the Facebook community → https://www.facebook.com/groups/marketscalegrow
Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because the processes supporting incident response are not robust.

Incident response software alone isn't going to fix bad incident processes. It will help, for sure. You need these incident management tools to manage the data and communications within the incident. But you also need effective processes and human-technology integration.

Dr. Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority setting. According to Vladislav, the beginning of your SRE journey should not be focused on setting up an incident response process, but on core SRE artifacts like SLIs, availability measurement, and SLOs. Only once those are more or less established in the organization can you safely invest more in customer-facing features and a formal incident response process.

Understanding and Leveraging SLOs
Once your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they've been validated through production. Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively.

Implementing a Formal Incident Response
Before you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place. Without this, the process might not be as effective. When the foundational SLOs and organizational culture are strong, a well-structured incident response process can significantly enhance its effectiveness.

Coordinating During Major Incidents
When a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams. Consider appointing incident commanders and coordinators, as recommended in PagerDuty's documentation, to manage this coordination. Develop a lightweight process to guide how incidents are handled.

Classifying Incidents
Establish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three. Due to the inherently fuzzy nature of incidents, your classification system should also include guidelines for handling ambiguous cases. For instance, if uncertain whether an incident is Priority One or Two, default to Priority One.

Deriving Actions from Incident Classification
Based on the incident classification, outline specific actions (a small sketch of what this mapping can look like in code follows these notes). For example, Priority One incidents might require immediate involvement from an incident commander. They might take the following actions:
* Create a communication channel, assemble relevant teams, and start coordination.
* Simultaneously inform stakeholders according to their priority group.
* Define stakeholder groups and establish protocols for notifying them as the situation evolves.

Keep Incident Response Processes Simple and Accessible
Ensure that your incident response process is concise and easily understandable. Ideally, it should fit on a single sheet of paper.
Complexity can lead to confusion and inefficiencies, so aim for simplicity and clarity in your process diagram. This approach ensures that the process is practical and can be followed effectively during an incident.

Preparing Your Organization
An effective incident response process relies on an organization's readiness for such rigor. Attempting to implement this process in an organization not yet mature enough may result in poor adherence during critical times. Make sure your organization is prepared to follow the established procedures.

For a deeper dive into these concepts, consider reading "Establishing SRE Foundations," available on Amazon and other book retailers. For further inquiries, you can also connect with the author, Vlad, on LinkedIn. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
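As referenced above, here is a minimal sketch of what the classification-to-actions mapping described in these notes could look like in code. The priority levels and the default-to-higher-priority rule come from the text; the specific action strings and function names are illustrative assumptions.

```python
from enum import IntEnum

class Priority(IntEnum):
    P1 = 1  # lower value = higher priority
    P2 = 2
    P3 = 3

# Actions derived from the classification, per the process described above.
PLAYBOOK = {
    Priority.P1: ["engage incident commander immediately",
                  "create comms channel, assemble teams, start coordination",
                  "inform stakeholders by priority group"],
    Priority.P2: ["page the owning team's on-call",
                  "notify affected stakeholder groups"],
    Priority.P3: ["open a ticket for the next business day"],
}

def classify(candidates: list[Priority]) -> Priority:
    """Ambiguity rule from the notes: if unsure between two priorities
    (e.g., P1 or P2), default to the higher one."""
    return min(candidates)

def respond(priority: Priority) -> None:
    for action in PLAYBOOK[priority]:
        print(f"[{priority.name}] {action}")

# Responders can't agree whether an incident is P1 or P2 -> treat it as P1.
respond(classify([Priority.P1, Priority.P2]))
```

The whole scheme still fits on one sheet of paper, which is the point: the code is only a mechanical restatement of a deliberately simple process.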
On this episode of the Six Five in the Booth, host Paul Nashawaty is joined by Flexera's Kristian Dell'Orso, Vice President, Site Reliability Engineering & Site Leader, highlighting Flexera's collaboration with Nobl9 for a conversation on becoming an SRE-driven organization. This in-depth discussion explores the transformative impact of adopting Service Level Objectives (SLOs) over traditional Service Level Agreements (SLAs), and how Flexera has shifted its approach to prioritize reliability and enhance customer experiences.

Their discussion covers:
The transition from SLAs to SLOs at Flexera and its impact on organizational key performance indicators (KPIs), including improvements in reliability and customer experience.
The limitations of SLAs in capturing the full spectrum of service reliability and customer satisfaction, and the move towards a more proactive and accountable approach within organizations.
How adopting SLOs has led to consistency in measuring reliability across different groups in the company, fostering a culture of accountability and transparency.

Learn more about how Nobl9 and Flexera articulate their strategy:
Customer testimonial/webinar
SLOConf Presentation
Daniel Ruby is a VP of Marketing at Nobl9. Ruby is a dynamic marketing executive with a focus on B2B marketing, and has significant experience building teams and driving successful, data-driven programs for a range of startups and mid-sized organizations. As the Director of Online Marketing for Localytics, Ruby was the first marketing hire and scaled his team to a full-fledged marketing department with domain specialists focused on mobile apps. Ruby also has a background in journalism and spent several years guest lecturing marketing courses at Bentley University, bringing this dynamic skill set to his current role at Nobl9. Ruby holds a BA in Broadcast Journalism from the University of Missouri-Columbia and an MBA in International Business from Brandeis University.

Questions
· We always like to start off by asking our guests if they could share with us a little bit about their journey, how you got from where you were to where you are today.
· Could you share with our listeners, what is Nobl9 and what exactly are you providing? What service are you providing? How are you adding value to your customers' lives using this platform?
· Could you give us an example, like a use case of an application, you can choose any industry, and kind of just give us an idea of what that looks like so the audience can get a more practical view.
· Could you distinguish or differentiate for our listeners, what's the difference between an SLO and an SLA?
· Now, Daniel, could you share with us what's the one online resource, tool, website or app that you absolutely can't live without in your business?
· Where can listeners find you online?
· Before we wrap our episodes up, we always like to ask our guests, do you have a quote or saying that during times of adversity or challenge, you'll tend to revert to if for any reason you get derailed or you get off track.

Highlights

Daniel's Journey
Me: Now, Daniel, could you share with our listeners, we always like to start off by asking our guests if they could share with us a little bit about their journey, how you got from where you were to where you are today.
Daniel shared that he was a journalist for a while, and kind of realized that there was not much of a market for it, so he kind of went and got an MBA to get a job, and kind of fell backwards into marketing as it is. And over the years, he's been with more than 10, mostly 10 startups, driving marketing, driving the customer experience in terms of from the moment that they're introduced to them as a brand until the moment that they no longer want to be or need to be working with them. He finds a lot of joy in trying to make that experience pleasant, for lack of a better term. He's self-taught with most of the marketing stuff that he does, and over the course of the years he's become very arrogantly convinced about a few core tenets of really communicating to customers and communicating to potential customers that have served him well, and it kind of always comes back to giving value at every step of the journey.

About Nobl9 – Enhancing Software Reliability and Value for Your Business
Me: Now, could you share with our listeners, what is Nobl9 and what exactly are you providing? What service are you providing? How are you adding value to your customers' lives using this platform?
Daniel shared that Nobl9 is basically a platform for software reliability. So, if we think about how somebody engages with a digital product, or even an in-person product with a digital back end to it.
They are the premier, and kind of really only well-established, provider of what's called Service Level Objectives, or SLOs. And an SLO is basically taking all of the different data points that make up your product, be it from the software, from the infrastructure, from third-party microservices, etc., and rolling it up to actually give a customer-centric view into how reliable the product is. And the reason that they do this is because most reliability, historically, has been around is the product up. Is it up? Is it down? But anybody who's ever used a mobile app or a digital product, or even, like, a scan-to-pay service at a cafe knows that it's not just is it up; it's is everything within it doing what it's supposed to do. So, he knows marketers like himself have taken the word holistic and kind of beaten it to death, but they do provide a holistic, customer-centric view. What is the customer experience like when actually using your product?

Me: So, Nobl9 is helping the application to maintain its reliability and have as little or no downtime as possible while the customer is interacting with it, correct?
Daniel agreed, correct, and kind of beyond downtime: does it load fast enough? Do all of the different features load fast enough? Is there anything blocking my ability as a customer to buy from you or to use your service, or whatever you're trying to do? So, that's effectively what they do.

Practical Use Case – How Nobl9 Enhances Software Reliability
Me: Okay, could you give us an example, like a use case of an application, you can choose any industry, and kind of just give us an idea of what that looks like so the audience can get a more practical view.
Daniel shared that it's a little engineering, and he's a marketer, so he feels like he's not necessarily smart enough to completely understand the engineering speak of it. But if you think about, like, an e-commerce app: if you open up the app and it loads, in traditional reliability terms, that's reliable. But how many times have you gone into an app? So, let's say your path in using an e-commerce app is: you're going to load the app, search for the product you want to buy, add it to your cart, go to your cart, and check out. And maybe there's a login to My Account somewhere in the way. With SLOs, what you can do is have all of the different steps of that path viewed as part of this overarching experience with the app. So, if you go to add something to your cart and the app sits there and hangs, that's not good reliability. If you go to the cart itself, and for whatever reason you go to pay and the app's connection to PayPal isn't working, and you get this message back that says, hey, we can't complete your payment, try another credit card or something, that's a bad experience. And that's the kind of reliability breakdown that leads people to quit using a product. So, they make sure that, strategically, for all of the things that make up that path (you'll have a server dedicated to your shopping cart, you'll have a microservice that is dedicated to completing a purchase), you've got visibility into how the experience is for the end user, not just some third-party service that tries to connect and says yes, it can connect, no it can't connect. It's the whole experience of a digital product, and they make it easy to understand.

Understanding the Difference – SLOs vs.
SLAs
Me: Now, at the beginning, you mentioned that it's based on your SLOs or Service Level Objectives. And just wanted to know if you could distinguish or differentiate for our listeners, what's the difference between an SLO and an SLA?
Daniel stated that that's a great question. So, an SLA is typically an agreement that you have with your users. You say you'll be able to use my product or my service xx percent of the time. And that's more of a contractual conversation than it is a reliability conversation. When you start actually building it, in order to make sure that you're achieving this SLA that you've agreed upon with your clients, you've got to make sure that all of these different elements of their experience are operating at a certain efficiency. So, he hates to overcomplicate things, but there's actually a step in between called an SLI, a Service Level Indicator. And so, a Service Level Indicator is things like, "I want my website to load in less than 100 milliseconds." That's a Service Level Indicator. And the SLI is effectively a consistent goal, but then what you do with SLOs is you take that a little bit further. So, an SLO operates within what's called an error budget, and you get a certain number of errors per month, per week, however you want to set it up. And an error is when you don't meet that SLI and you say, "I know that it's maybe not impossible, but either improbable or just extremely expensive to make my website load in less than 100 milliseconds every single time." So, what an SLO does is it says, okay, I expect to meet this SLI however often I have decided is actually impactful on the customer experience, and whenever I don't, that's an error, and at that point, you're like, okay, it's an error, but an error is not an outage. Daniel stated that he doesn't know if he's explaining this really well, but basically if you look at an SLO graph, it's going to be going down and to the right, and that is your budget. You say, okay, I expect this part of my service will meet my SLI 95% of the time. And then you see the little graph going down, which, every time it doesn't meet that, throws an error. And a certain amount of errors is just part of digital product development: understanding that you have to accept some errors. 100% perfection is impossible.
Me: Does not exist.
Daniel agreed, it does not exist. And a lot of people think about reliability in the number of nines: "My product is available 99.99% of the time." That would be called four nines. Every nine that you add to that number basically increases your IT costs by an order of magnitude. So, at a certain point, you've got to understand, and SLOs are really the ideal way to do it: at what point does this actually impact my customers? At what point does this actually impact my users? It allows you to be strategic, and it allows you to see things when they begin. So, you mentioned outages earlier; an outage, your whole product crashes, it's unavailable to anybody. Those don't happen in a vacuum; there's something that causes them. And if you've got an SLO, or a set of SLOs really, displaying the health of your service, the health of your product, you're suddenly going to see a bunch of errors coming in. And you may or may not have enough forewarning to say, "Oh, crap, we're about to have an outage.
Well, this server in my cloud provider is having massive latency issues, it looks like the server is about to crash, and that's going to take everything down." You can have a little bit more of a runway to try and identify the actual issue behind a potential outage. And that's where you strategically define how many errors you can tolerate in a month. And then you basically have a real-time view at all times: is my product, is my service running the way I expect it to? And the way I expect it to is from the perspective of the customer. Are my customers able to do what they expect to be able to do when they launch my app, or when they use my product? (A worked sketch of this error-budget arithmetic follows these show notes.)

App, Website or Tool that Daniel Absolutely Can't Live Without in His Business
When asked about an online resource that he can't live without in his business, Daniel shared that personally, it's HubSpot. He spent a few years as a HubSpot consultant, a third-party consultant for HubSpot integrations. And he could say a marketing automation platform, but he gets a little fanboyish about HubSpot because it takes everything that used to be complicated about sales and marketing and customer support and customer experience, and kind of rolls it up. You've got your CRM, you've got your automation platform, you've got your email platform. He's actually taken to feeding his HubSpot metrics into Nobl9 SLOs lately, because there's so much rich data within HubSpot. And he's been running little things like: they'll change a script for their biz dev team, and then they'll just run a metric on calls connected to calls having a success, either a demo or a follow-up request. And HubSpot's got such great data, and then he can turn that into an SLO where he can say, "Okay, I expect the submission rate on my forms to be X percent. I expect the completion rate of our inside sales calls to be X percent over 80% of the time." And he can run that in an SLO; he can see that if he changes something, even simple little things like the colour of the submit button on a form, he can see in real time how the view-to-submission rate is changing, and he can act on that. He can put a little note in the SLO with that timestamp and say, "Hey, I changed the color of this button" and see how that's making an impact. Or, if he looks in HubSpot and sees, "Oh, crap, my submission rate has been terrible for the past 36 hours, what's happened?" he can go into his SLO and say, "Oh, there's a little note here, I changed something in the UX. I changed something colour-wise. I added a question, I removed a question." And you see the direct historical cause for that. But going back to HubSpot in general, he really respects HubSpot's approach to what they call the flywheel, which is all about delighting customers and delighting prospects. And they really do a great job of giving you the tools to actually give value in your marketing and sales and customer service processes. He doesn't know what he'd do without HubSpot right now; he'd probably try and hack something together with Salesforce and Marketo, and it would be more complicated and less easy to get adoption internally. But luckily, he doesn't need to.

Where can listeners find Daniel online?
LinkedIn – Nobl9inc
X – @nobl9inc

Quote or Saying that During Times of Adversity Daniel Uses
When asked about a quote or saying that he tends to revert to, Daniel shared, "Nothing is a failure unless you fail to learn from it." He stated that he's got a great team that he works with at Nobl9.
And some are early in their career, some have been working in marketing for several years, some have advanced degrees, some don't. And so, he's been with Nobl9 since January, and one of the things that he wanted to do quickly was make sure that people were comfortable making mistakes and understanding that an action and its outcome is not the end of that process. If you don't learn from it, why it worked, why it didn't, then you've made a mistake. But failure is an important element in driving success in any business scenario. So, he likes making sure that people know that, and he likes making sure that they feel comfortable, because without being comfortable with failure, how are you going to try anything revolutionary?

Me: True. Thank you so much for sharing, Daniel. Now, we'd just like to thank you so much for hopping on this podcast and sharing all of the great insights about Nobl9 and the wonderful work that your team is doing as it relates to ensuring that customers are having a more seamless and frictionless experience across these platforms, increasing reliability. And we really appreciate some of the great nuggets that you shared with us as it relates to what is a Service Level Objective versus a Service Level Agreement and a Service Level Indicator: great information to learn, to understand the whole process, because in order to navigate the customer experience, there is really a lot that you have to take into account to ensure that the customer walks away feeling like, yeah, that was fun, it was easy, it wasn't hard, and I was able to do it really quickly. So, thank you so much.

Please connect with us on Twitter @navigatingcx and also join our Private Facebook Community – Navigating the Customer Experience and listen to our FB Lives weekly with a new guest

The ABC's of a Fantastic Customer Experience
Grab the Freebie on Our Website – TOP 10 Online Business Resources for Small Business Owners
Do you want to pivot your online customer experience and build loyalty? Get a copy of "The ABC's of a Fantastic Customer Experience." The ABC's of a Fantastic Customer Experience provides 26 easy-to-follow steps and techniques that help your business to achieve success and build brand loyalty. This Guide to Limitless, Happy and Loyal Customers will help you to strengthen your service delivery, enhance your knowledge and appreciation of the customer experience and provide tips and practical strategies that you can start implementing immediately! This book will develop your customer service skills and sharpen your attention to detail when serving others. Master your customer experience and develop those knock-your-socks-off techniques that will lead to lifetime customers. Your customers will only want to work with your business and it will be your brand differentiator. It will lead recruiters to seek you out by providing practical examples on how to deliver a winning customer service experience!
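To make the "nines" and error-budget arithmetic from Daniel's explanation concrete, here is a minimal worked sketch; the availability targets shown are common illustrative values, not Nobl9 recommendations.

```python
# Error budget: the unavailability an availability target still permits.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999):  # two, three, and four nines
    budget_minutes = (1 - target) * MINUTES_PER_30_DAYS
    print(f"{target:.2%} target -> {budget_minutes:5.1f} minutes of "
          f"error budget per 30 days")
```

Each added nine shrinks the budget tenfold (432 minutes, then 43.2, then 4.3), which is the arithmetic behind Daniel's point that every extra nine costs roughly an order of magnitude more to deliver.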
Service Level Objectives (SLOs) are a set of reliability measurements for customer or user expectations of services; in other words, are people having a good experience with your application or service? Today's Day Two Cloud explores SLOs, the relevant metrics, and how to measure them. We also talk about how SLOs are a cross-discipline objective...
Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs and make them relevant to stakeholders. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs.

Here are 5 takeaways from the show:
* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.
* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.
* Continuous Improvement: Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success.
* Effective Communication Skills: Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate.
* Understanding User Needs: Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
In this episode Bob and Randy invite Dan Salinas and Sarv Shah from Nobl9 to dive deep into the complexities of Site Reliability Engineering (SRE) and Service Level Objectives (SLOs). Discover the origins of SRE, the significance of SLOs in improving customer experience, and the impact of digital reliability on businesses today. From the challenges of maintaining microservices to the advent of cloud dependency, this episode is packed with insights on ensuring operational excellence in our digital world.
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016).

Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
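As a minimal illustration of the SLI → SLO → SLA layering the episode works through, here is a hedged sketch; the request counts and the specific targets are assumptions for the example, not values from the book.

```python
# SLI: a measured ratio of good events to total events.
good_requests, total_requests = 999_412, 1_000_000
sli = good_requests / total_requests  # e.g., successful responses under 300 ms

slo = 0.999  # internal objective, chosen from customer expectations
sla = 0.995  # external, contractual promise; deliberately looser than the SLO

print(f"SLI = {sli:.4%} | SLO met: {sli >= slo} | SLA met: {sli >= sla}")
```

Keeping the SLA looser than the SLO gives the team room to notice and fix an objective miss before it becomes a contractual (legal) miss, mirroring the SLA-versus-SLO distinction in the takeaways above.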
Adam presents the mental model behind T1 and T2 signals, a necessary lexicon for understanding production operations. Want more?
Adam presents a catch-all episode on ops reviews, visual management, calls to action, and SLOs. Want more?
SREs (Site Reliability Engineers) have varying roles across different organizations: from codifying your infrastructure, handling high-priority incidents, automating resiliency, and ensuring proper observability, to defining SLOs or getting rid of alert fatigue. What an SRE team must not be is a SWAT team - or - as Dana Harrison, Staff SRE at Telus, puts it: "You don't want to be the fire brigade along the DevOps Infinity Loop."

In his years of experience as an SRE, Dana also used to run one-week boot camps for developers to educate them on making apps observable, proper logging, resiliency architecture patterns, and defining good SLIs & SLOs. He talked about the three things that are the foundation of a good SRE: understand the app, understand the current state, and make sure you know when your systems are down before your customers tell you so!

If you are interested in seeing Dana and his colleagues from Telus talk about their observability and SRE journey, check out the on-demand session from Dynatrace Perform 2024: https://www.dynatrace.com/perform/on-demand/perform-2024/?session=simplifying-observability-automations-and-insights-with-dynatrace#sessions
Embrace the Site Reliability Mindset with Alex Ewerlöf, Sr. Staff Engineer @ Volvo Cars
Lead platform SRE Viktor Peacock takes us through what SRE engineering is and why it is important to have an SRE function. Viktor then goes on to explain how you can use SLOs and error budgets to build your alerts and decide what a team should work on.
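Since the episode centers on driving alerts from SLOs and error budgets, here is a minimal sketch of the widely used burn-rate idea (how fast the error budget is being spent); the window and thresholds are common illustrative choices, not Viktor's specific recommendations.

```python
# Burn rate: observed error rate divided by the rate the SLO allows.
slo_target = 0.999                    # assumed availability SLO
allowed_error_rate = 1 - slo_target   # 0.1% of requests may fail

def burn_rate(errors: int, total: int) -> float:
    return (errors / total) / allowed_error_rate

# Example: over the last hour we saw 42 errors in 10,000 requests.
rate = burn_rate(errors=42, total=10_000)

# A sustained burn rate of 1.0 exhausts the budget exactly at the end of
# the SLO period; paging only on much faster burns keeps alerts actionable.
if rate >= 14.4:      # an often-cited fast-burn threshold for a 1-hour window
    print(f"PAGE: burn rate {rate:.1f}x")
elif rate >= 1.0:
    print(f"WARN: budget burning at {rate:.1f}x")
else:
    print(f"OK: burn rate {rate:.1f}x")
```

This is how error budgets "build your alerts": the alert condition is expressed in terms of budget spend rather than raw error counts.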
Laura Nolan, Principal Software Engineer at Stanza, joins Corey on Screaming in the Cloud to offer insights on how to use SRE to avoid disastrous and lengthy production delays. Laura gives a rich history of her work with SREcon, why her approach to SRE is about first identifying the biggest fire instead of toiling with day-to-day issues, and why the lack of transparency in systems today actually hurts new engineers entering the space. Plus, Laura explains to Corey why she dedicates time to work against companies like Google who are building systems to help the government (inefficiently) select targets during wars and conflicts.

About Laura
Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand and control their production systems. Laura also serves as a member of the USENIX Association board of directors. In her copious spare time after that, she volunteers for the Campaign to Stop Killer Robots, and is half-way through the MSc in Human Factors and Systems Safety at Lund University. She lives in rural Ireland in a small village full of medieval ruins.

Links Referenced:
Company Website: https://www.stanza.systems/
Twitter: https://twitter.com/lauralifts
LinkedIn: https://www.linkedin.com/in/laura-nolan-bb7429/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is someone that I have been low-key annoying to come onto this show for years, and finally, I have managed to wear her down. Laura Nolan is a Principal Software Engineer over at Stanza. At least that's what you're up to today, last I've heard. Is that right?

Laura: That is correct. I'm working at Stanza, and I don't want to go on and on about my startup, but I'm working with Niall Murphy and Joseph Bironas and Matthew Girard and a bunch of other people who more recently joined us. We are trying to build a load management SaaS service. So, we're interested in service observability out of the box, knowing if your critical user journeys are good or bad out of the box, being able to prioritize your incoming requests by what's most critical in terms of visibility to your customers. So, an emerging space. Not in the Gartner Group Magic Circle yet, but I'm sure at some point [laugh].

Corey: It is surreal to me to hear you talk about your day job because for, it feels like, the better part of a decade now, "Laura, Laura… oh, you mean USENIX Laura?" Because you are on the USENIX board of directors, and in my mind, that is what is always short-handed to what you do. It's, "Oh, right. I guess that isn't your actual full-time job." It's weird. It's almost like seeing your teacher outside of the elementary school. You just figure that they fold themselves up in the closet there when you're not paying attention. I don't know what you do when SREcon is not in process. I assume you just sit there and wait for the next one, right?

Laura: Well, no. We've run four of them in the last year, so there hasn't been very much waiting.
I'm afraid. Everything got a little bit smooshed up together during the pandemic, so we've had a lot of events coming quite close together. But no, I do have a full-time day job. But the work I do with USENIX is just as a volunteer. So, I'm on the board of directors, as you say, and I'm on the steering committee for all of the global SREcon events, and I've often served on the program committee as well. And I'm sort of there, annoying the chairs to, "Hey, do your thing on time," very much like an elementary school teacher, as you say.

Corey: I've been a big fan of USENIX for a while. One of the best interview processes I ever saw was closely aligned with evaluating candidates along with USENIX SAGE levels to figure out what level of seniority are they in different areas. And it was always viewed through the lens of in what types of consulting engagements will the candidate shine within, not the idea of, "Oh, are you good or are you crap? And spoiler, if I'm asking the question, I'm of course defaulting myself to goading you to crap." Like the terrible bespoke artisanal job interview process that so many companies do. I love how this company had built this out, and I asked them about it, and, "Oh, yeah, it comes—that dates back to the USENIX SAGE things." That was one of my first encounters with what USENIX actually did. And the more I learned, the more I liked. How long have you been involved with the group?

Laura: A relatively short period of time. I think I first got involved with USENIX in around 2015, going to [Lisa 00:03:29] and then going on to SREcon. And it was all by accident, of course. I fell onto the SREcon program committee somehow because I was around. And then because I was still around and doing stuff, I got eventually—you know, got co-opted into chairing and onto the steering committee and so forth.

And you know, it's like everything volunteer. I mean, people who stick around and do stuff tend to be kept around. But USENIX is quite important to me. We have an open access policy, which is something that I would like to see a whole lot more of, you know, we put everything right out there for free as soon as it is ready. And we are constantly plagued by people saying, "Hey, where's my SREcon video? The conference was like two weeks ago." And we're like, "No, no, we're still processing the videos. We'll be there; they'll be there."

We've had people, like, literally offer to pay extra money to get the videos sooner, but [laugh] we're, like, we are open access. We are not keeping the videos away from you. We just aren't ready yet. So, I love the open access policy and I think what I like about it more than anything else is the fact that it's… we are staunchly non-vendor. We're non-technology specific and non-vendor.

So, it's not, like, say, AWS re:Invent for example or any of the big cloud vendor conferences. You know, we are picking vendor-neutral content by quality. And as well, as anyone who's ever sponsored SREcon or any of the other events will also tell you, that does not get you a talk in the conference program. So, the content selection is completely independent, and in fact, we have a complete Chinese wall between the sponsorship organization and the content organization. So, I mean, I really like how we've done that.

I think, as well, it's for a long time been one of the conferences in our family of conferences that has had the best diversity.
Not perfect, but certainly better than it was, although very, very unfortunately, I see conference diversity everywhere going down after the pandemic—particularly gender diversity—which is a real shame.
Corey: I've been a fan of the SREcon conferences for a while before someone—presumably you; I'm not sure—screwed up before the pandemic and apparently thought they were talking about someone else, and I was invited to give a keynote at SREcon in EMEA that I co-presented with John Looney. Which was fun because he and I met in person for the first time three hours beforehand, beat together our talk, then showed up an hour beforehand, found there would be no confidence monitor, went away for the next 45 minutes and basically loaded it all into short-term cache and gave a talk that we could not repeat if we had to for a million dollars, just because it was so… you're throwing the ball to your partner on stage and really hoping they're going to be able to catch it. And it worked out. It was an anger subtext translator skit for a bit, which was fun. All the things that your manager says but actually means, you know, the fun sort of approach. It was zany, ideally had some useful takeaways to it. But I loved the conference. That was one of the only SREcons that I found myself not surprised to discover was coming to town the next week because for whatever reason, there's presumably a mailing list that I'm not on somewhere where I get blindsided by, "Oh, yeah, hey, didn't you know SREcon is coming up?" There's probably a notice somewhere that I really should be paying attention to, but on the plus side, I get to be delightfully surprised every time.
Laura: Indeed. And hopefully, you'll be delightfully surprised in March 2024. I believe it's the 18th to the 20th, when SREcon will be coming to town in San Francisco, where you live.
Corey: So historically, in addition to, you know, the work with USENIX, which is, again, not your primary occupation most days, you spent over five years at Google, which of course means that you have strong opinions on SRE. I know that that is a bit dated, where the gag was always, it's only called SRE if it comes from the Mountain View region of California, otherwise it's just sparkling DevOps. But the initial take of a lot of the SRE stuff was, "Here's how to work at Google." It has progressed significantly beyond that to the point where companies who have SRE groups are no longer perceived incorrectly as, "Oh, we just want to be like Google," or, "We hired a bunch of former Google people." But you clearly have opinions on this. You've contributed to multiple books on SRE, you have spoken on it at length. You have enabled others to speak on it at length, which in many ways, is by far the better contribution. You can only go so far scaling yourself, but scaling other people, that has a much better multiplier on it, which feels almost like something an SRE might observe.
Laura: It is indeed something an SRE might observe. And also, you know, good catch because I really felt you were implying there that you didn't like my book contributions. Oh, the shock.
Corey: No.
And to be clear, I meant [unintelligible 00:08:13], strictly speaking.
Laura: [laugh].
Corey: Books are also a great one-to-many multiplier because it turns out, you can only shove so many people into a conference hall, but books have this ability to just carry your words beyond the room that you're in, in a way that video just doesn't seem to.
Laura: Ah, but open access video that was published on YouTube, like, six weeks ahead [laugh]. That scales.
Corey: I wish. People say they want to write a book and I think they're all lying. I think they want to have written the book. That's my philosophy on it. I do not understand people who've written a book. Like, "So, what are you going to do now?" "I'm going to write another book." "Okay." I'm going to smile, not take my eyes off you for a second and back away slowly because I do not understand your philosophy on that. But you've worked on multiple books with people.
Laura: I actually enjoy writing. I enjoy the process of it because I always learn something when I write. In fact, I learn a lot of things when I write, and I enjoy that crafting. I will say I do not enjoy having written things because for me, any achievement once I have achieved it is completely dead. I will never think of it again, and I will think only of my excessively lengthy to-do list, so I clearly have problems here. But nevertheless. It's exactly the same with programming projects, by the way. But back to SRE. We were talking about SRE. SRE is 20 now. SRE can almost drink alcohol in the US, and that is crazy.
Corey: So, 2003 was the founding of it, then.
Laura: Yes.
Corey: Yay, I can do simple arithmetic in my head, still. I wondered how far my math skills had atrophied.
Laura: Yes. Good job. Yes, apparently invented in roughly 2003. So—I mean, from what I understand of Google's "20 years of SRE at Google" publications, in the absence of an actual definite start date, they've simply picked Ben Treynor's start date at Google as the start date of SRE. But nevertheless, [unintelligible 00:09:58] about 20 years old. So, is it all grown up? I mean, I think it's become heavily commodified. My feeling about SRE is that it's always been this—I mean, you said it earlier, like, it's about, you know, how do I scale things? How do I optimize my systems? How do I intervene in systems to solve problems to make them better, to see where we're going to be in pain in six months, and work to prevent that? That's kind of what SRE work is to me: figure out where the problems are, figure out good ways to intervene and to improve. But there's a lot of SRE as bureaucracy around at the moment where people are like, "Well, we're an SRE team, so you know, you will have your SLO Golden Signals, and you will have your Production Readiness Checklists, which will be the things that we say, no matter how different your system is from what we designed this checklist for, and that's it. We're doing SRE now. It's great." So, I think we miss a lot there. My personal way of doing SRE is very much more about thinking, not so much about the day-to-day SLO [excursion-type 00:10:56] things because—not that they're not important; they are important, but they will always be there. I always tend to spend more time thinking about how do we avoid the risk of, you know, a giant production fire that will take you down for days, or God forbid, more than days, you know? The sort of big Roblox fire, or the time that Meta nearly took down the internet in late 2021, that kind of thing.
So, I think that modern SRE misses quite a lot of that. It's a little bit like… so, BP: on the very same day they had the Deepwater Horizon disaster, they received an award for minimizing occupational safety risks in their environment. So, you know, [unintelligible 00:11:41] things like people tripping and—
Corey: Must have been fun the next day. "Yeah, we're going to need that back."
Laura: [laugh] people tripping and falling, and you know, hitting themselves with a hammer, they got an award because it was so safe, they had very little of that. And then this thing goes boom.
Corey: And now they've tried to pivot into an optimization award for efficiency, like, we just decided to flash-fry half the sea life in the Gulf at once.
Laura: Yes. Extremely efficient. So, you know, I worry that we're doing SRE a little bit like BP. We're doing it the way they were back before Deepwater Horizon.
Corey: I should disclose that I started my technical career as a grumpy old Unix sysadmin—because it's not like you ever see one of those who's happy or young; didn't matter that I was 23 years old, I was grumpy and old—and I have viewed the evolution since then as going from calling myself a sysadmin to a DevOps engineer to an SRE to a platform engineer to whatever we're calling it this week. I still view it as fundamentally the same job, in the sense that the responsibility has not changed, and that is keeping the site or environment up. But the tools, the processes and the techniques we apply to it have evolved. Is that accurate? Does it sound like I'm spouting nonsense? You're far closer to the SRE world than I ever was, but I'm curious to get your take on that perspective. And please feel free to tell me I'm wrong.
Laura: No, no. I think you're completely right. And I think one of the ways that it has shifted, and it's really interesting, is that when you and I were young, we could see everything that was happening. We were deploying on some sort of Linux box or other sort of Unix box somewhere, most likely, and if we wanted, we could go and see the entire source code of everything that our software was running on. And kids these days, they're coming up, and they are deploying their stuff on RDS and ECS and, you know, how many layers of abstraction are sitting between them and—
Corey: "I run Kubernetes. That means I don't know where it runs, and neither does anyone else." It's great.
Laura: Yeah. So, there's no transparency anymore in what's happening. So, it's very easy, you get to a point where sometimes you hit a problem, and you just can't figure it out because you do not have a way to get into that system and see what's happening. You know, even at work, we ran into a problem with Amazon-hosted Prometheus. We were like, "This will be great. We'll just do that." And we could not get some particular type of remote write operation to work. We just could not. Okay, so we'll have to do something else. So, one of the many, many things I do when I'm not, you know, trying to run the SREcon conference or do actual work or definitely not write a book, I'm studying at Lund University at the moment. I'm doing this master's degree in human factors and systems safety. And one of the things I've realized since doing that program is, in tech, we missed this whole 1980s and 1990s discipline of cognitive systems theory, cognitive systems engineering. This is what people were doing.
They were like, how can people in the control room in nuclear plants and in the cockpit of the airplane, how can they get along with their systems and build a good mental model of the automation and understand what's going on? We missed all that. We came of age when safety science was asking questions like how can we stop organizational failures like Challenger and Columbia, where people are just not making the correct decisions? And that was a whole different sort of focus. So, we've missed all of this 1980s and 1990s cognitive system stuff. And there's this really interesting idea there where you can build two types of systems: you can build a prosthesis, which does all your interaction with a system for you, and you can see nothing, feel nothing, do nothing, it's just this black box, or you can have an amplifier, which lets you do more stuff than you could do just by yourself, but lets you still get into the details. And we build mostly prostheses. We do not build amplifiers. We're hiding all the details; we're building these very, very opaque abstractions. And I think it's to the detriment of—I mean, it makes our life harder in a bunch of ways, but I think it also makes life really hard for systems engineers coming up because they just can't get into the systems as easily anymore unless they're running them themselves.
Corey: I have to confess that I have a certain aversion to aspects of SRE, and I'm feeling echoes of it around a lot of the human factors stuff that's coming out of that Lund program. And I think I know what it is, and it's not a problem with either of those things, but rather a problem with me. I have never been a good academic. I have an eighth-grade education because school is not really for me. And what I loved about being a systems administrator for years was the fact that it was like solving puzzles every day. I got to do interesting things, I got to chase down problems, and firefight all the time. And what SRE has represented is a step away from that to being more methodical, to taking on keeping the site up as a discipline rather than an occupation or a task that you're working on. And I think that a lot of the human factors stuff plays directly into it. It feels like the field is becoming a lot more academic, which is a luxury we never had, when holy crap, the site is down, we're going to go out of business if it isn't back up immediately: panic mode.
Laura: I've got to confess here, I have three master's degrees. Three. I have problems, like I said before. I get what you mean. You don't like when people are speaking in generalizations and sort of being all theoretical rather than looking at the actual messy details that we need to deal with to get things done, right? I know. I know what you mean, I feel it too. And I've talked about the human factors stuff and theoretical stuff a fair bit at conferences, and what I always try to do is I always try and illustrate with the details. Because I think it's very easy to get away from the actual problems and, you know, spend too much time in the models and in the theory. And I like to do both. I will confess, I like to do both. And that means that the luxury I miss out on is mostly sleep. But here we are.
Corey: I am curious as to what you've seen as far as human factors adoption in this space because every company for a while claimed to be focused on blameless postmortems. But then there would be issues that quickly turned into a blame Steve postmortem instead.
And it really feels, at least from a certain point of view, that there was a time where it seemed to be gaining traction, but that may have been a zero-interest-rate phenomenon, as weird as that sounds. Do you think that the idea of human factors being tied to keeping systems running in a computer sense has demonstrated staying power or are you seeing a recession? It could be I'm just looking at headlines too much.
Laura: It's a good question. There's still a lot of people interested in it. There was a conference in Denver last February that was decently well attended for, you know, an initial conference focusing on this issue, and there's this very vibrant Slack community, LFI, the Learning from Incidents in Software community. I will say, everything is a little bit stretched at the moment in industry, as you know, with all the layoffs, and a lot of people are just… there's definitely a feeling that people want to hunker down and do the basics to make sure that they're not seen as doing useless stuff and on the line for layoffs. But the question is, is this stuff actually useful or not? I mean, I contend that it is. I contend that we can learn from failures, we can learn from what we're doing day-to-day, and we can do things better. Sometimes you don't need a lot of learning because the biggest problem is obvious, right [laugh]? You know, in that case, yeah, your focus should just be on solving your big obvious problem, for sure.
Corey: If there was a hierarchy of needs here, on some level, okay, step one, is the building—
Laura: Yes.
Corey: Currently on fire? Maybe solve that before thinking about the longer-term context of what this does to corporate culture.
Laura: Yes, absolutely. And I've gone into teams before where people are like, "Oh, well, you're an SRE, so obviously, you wish to immediately introduce SLOs." And I can look around and go, "Nope. Not the biggest problem right now. Actually, I can see a bunch of things are on fire. We should fix those specific things." I actually personally think that if you want to go in and start improving reliability in a system, the best thing to do is to start a weekly production meeting if the team doesn't have that: actually create a dedicated space and time for everyone to be able to get together, discuss what's been happening, discuss concerns and risks, and get all that stuff out in the open. I think that's very useful, and you don't need to spend however long it takes to formally sit down and start creating a bunch of SLOs. Because if you're not dealing with a perfectly spherical web service where you can just use the Golden Signals, if you start getting into any sort of thinking about data integrity, or backups, or any sort of asynchronous processing, these sorts of things need SLOs that are a lot more interesting than your standard error rate and latency. Error rate and latency gets you so far, but it's really just very cookie-cutter stuff. But people know what's wrong with their systems, by and large. They may not know everything that's wrong with their systems, but they'll know the big things, for sure. Give them space to talk about it.
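Laura's "cookie-cutter" point is easier to see with the arithmetic written out. Below is a minimal sketch, in Python, of the standard error-rate-plus-latency SLO she is describing; the request counts, the 99.9% target, and the single-window burn rate are all invented for illustration and come from no particular tool. (Real alerting would typically use multiple windows; a single window keeps the idea visible.)

```python
from dataclasses import dataclass

# A hypothetical window of request data. In practice these counts would come
# from a metrics backend; all numbers here are invented for illustration.
@dataclass
class Window:
    total_requests: int
    failed_requests: int   # e.g., HTTP 5xx responses
    slow_requests: int     # latency above some threshold, say 300 ms

SLO_TARGET = 0.999         # 99.9% of requests should be good over the period

def sli(w: Window) -> float:
    """Fraction of requests that were both successful and fast enough."""
    if w.total_requests == 0:
        return 1.0         # no traffic: nothing violated the objective
    bad = w.failed_requests + w.slow_requests
    return 1.0 - bad / w.total_requests

def burn_rate(w: Window) -> float:
    """How fast the error budget is burning; 1.0 means exactly on budget."""
    error_budget = 1.0 - SLO_TARGET
    return (1.0 - sli(w)) / error_budget

# 100,000 requests in the last hour; 150 failed and 80 were too slow.
w = Window(total_requests=100_000, failed_requests=150, slow_requests=80)
print(f"SLI {sli(w):.4%}, burn rate {burn_rate(w):.1f}x")
# Prints: SLI 99.7700%, burn rate 2.3x
```

Notice what the sketch cannot express: a backup or data-integrity objective has no request stream to count, which is exactly why Laura argues the cookie-cutter version only gets you so far.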
Specifically, you're volunteering for the Campaign to Stop Killer Robots, which ten years ago would have made you sound ridiculous, and now it makes you sound like someone who is very rationally and reasonably sounding an alarm on something that is on our doorstep. What are you doing over there?
Laura: Well, I mean, let's be real, it sounds ridiculous because it is ridiculous. I mean, who would let a computer fly around the sky and choose what to shoot at? But it turns out that there are, in fact, a bunch of people who are building systems like that. So yeah, I've been volunteering with the campaign for about the last five years, since roughly around the time that I left Google, in fact, because I got interested in that around about the time that Google was doing the Project Maven work, which was when Google said, "Hey, wouldn't it be super cool if we took all of this DoD drone video footage, and, you know, did a whole bunch of machine-learning analysis on it and figured out where people are going all the time? Maybe we could click on this house and see, like, a whole timeline of people's comings and goings and which other people they are sort of in a social network with." So, I kind of said, "Ahh… maybe I don't want to be involved in that." And I left Google. And I found out that there was this campaign. And this campaign was largely lawyers and disarmament experts, people of that nature—philosophers—but also a few technologists. And for me, having run computer systems for a large number of years at this point, the idea that you would want to rely on a big distributed system running over some janky network with a bunch of 18-year-old kids running it to actually make good decisions about who should be targeted in a conflict seems outrageous. And I think almost every [laugh] software operations person, or in fact, software engineer that I've spoken to, tends to feel the same way. And yet there is this big practical debate about this in international relations circles. But luckily, there has just been a resolution in the UN in the last day or two as we record this: the First Committee has, by a very large majority, voted to try and do something about this. So hopefully, we'll get some international law. The specific interventions that most of us in this field think would be good would be to limit the amount of force that an autonomous weapon, or in fact an entire set of autonomous weapons in a region, would be able to wield, because there's a concern that should there be some bug or problem or a sort of weird factor that triggers these systems to—
Corey: It's an inevitability that there will be. Like, that is not up for debate. Of course, it's going to break. In 2020, the template slide deck that AWS sent out for re:Invent speakers had a bunch of clip art, and one of them was a line art drawing of a ham with a bone in it. So, I wound up taking that image, slapping it on a t-shirt, captioning it "AWS Hambone," and selling that as a fundraiser for 826 National.
Laura: [laugh].
Corey: Now, what happened next is that for a while, anyone who tweeted the phrase "AWS Hambone" would find themselves banned from Twitter for the next 12 hours due to some weird algorithmic thing where it thought that was doxxing or harassment or something. And people on the other side of the issue that you're talking about are straight-facedly suggesting that we give that algorithm [unintelligible 00:24:32] tool a gun.
Laura: Or many guns.
Many guns.
Corey: I'm sorry, what?
Laura: Absolutely.
Corey: Yes, or missiles or, heck, let's build a whole bunch of them and turn them loose with no supervision, just like we do with junior developers.
Laura: Exactly. Yes, so many people think this is a great idea, or at least they purport to think this is a great idea, which is not always the same thing. I mean, there's lots of different vested interests here. Some people who are proponents of this will say, well, actually, we think that this will make targeting more accurate, fewer civilians will actually die as a result of this. And the question there that you have to ask is—there's a really good book called Drone by Grégoire Chamayou, and he says that there are actually three meanings to accuracy. So, are you hitting what you're aiming at? That's one thing. And that's been a solved problem in military circles for quite some time. You got, you know, laser targeting, very accurate. Then the other question is, how big is the blast radius? So, that's just a matter of, you know, how big an explosion are you going to get? That's not something that autonomy can help with. The only thing that autonomy could even conceivably help with in terms of accuracy is better target selection. So, instead of selecting targets that are not valid targets, selecting more valid targets. But I don't think there's any good reason to think that computers can solve that problem. I mean, in fact, if you read stuff that military experts write on this, and I've got, you know, lots of academic handbooks on military targeting processes, they will tell you, it's very hard and there are a lot of gray areas, a lot of judgment. And that's exactly what computers are pretty bad at. Although mind you, I'm amused by your Hambone story and I have to ask: is AWS Hambone a database?
Corey: Anything is a database, if you hold it wrong.
Laura: [laugh].
Corey: It's fun. I went through a period of time where, just for fun, I would ask people to name an AWS service and I would talk about how you could use it incorrectly as a database. And then someone mentioned, "What about AWS Neptune," which is their graph database, which absolutely no one understands, and the answer there is, "I give up. It's impossible to use that thing as a database." But everything else can be. Like, you know, the tagging system. Great, that has keys and values; it's a database now. Welcome aboard. And I didn't say it was a great database, but it is a free one, and it scales to a point. Have fun with it.
Laura: All I'll say is this: you can put labels on anything.
Corey: Exactly.
Laura: We missed you at the most recent SREcon EMEA. There was a talk about Google's internal Chubby system and how people started using it as a database. And I did summon you in Slack, but you didn't show up.
Corey: No. Sadly, I've gotten a bit out of the SRE space. And also, frankly, I've gotten out of the community space for a little while, when it comes to conferences. And I have a focused effort at the start of 2024 to start changing that. I am submitting CFPs left and right. My biggest fear is that a conference will accept one of these because a couple of them are aspirational. "Here's how I built the thing with generative AI," which spoiler, I have done no such thing yet, but by God, I will by the time I get there. I have something similar around Kubernetes, which I've never used in anger, but soon will if someone accepts the right conference talk.
This is how I learned Git: I shot my mouth off in a CFP, and I had four months to learn the thing. It was effective, but I wouldn't say it was the best approach.
Laura: [laugh]. You shouldn't feel bad about lying about having built things in Kubernetes and with LLMs, because everyone has, right?
Corey: Exactly. It'll be true enough by the time I get there. Why not? I'm not submitting for a conference next week. We're good. Yeah, Future Corey is going to hate me.
Laura: Have it build you a database system.
Corey: I like that. I really want to thank you for taking the time to speak with me today. If people want to learn more, where's the best place for them to find you these days?
Laura: Ohh, I'm sort of homeless on social media since the whole Twitter implosion, but you can still find me there. I'm @lauralifts on Twitter and I have the same tag on Bluesky, but haven't started to use it yet. Yeah, socials are hard at the moment. I'm on LinkedIn. Please feel free to follow me there if you wish to message me as well.
Corey: And we will, of course, put links to that in the [show notes 00:28:31]. Thank you so much for taking the time to speak with me. I appreciate it.
Laura: Thank you for having me.
Corey: Laura Nolan, Principal Software Engineer at Stanza. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that soon—due to me screwing up a database system—will be transmogrified into a CFP submission for an upcoming SREcon.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
Rachel Dines, Head of Product and Technical Marketing at Chronosphere, joins Corey on Screaming in the Cloud to discuss why creating a cloud-native observability strategy is so critical, and the challenges that come with both defining and accomplishing that strategy to fit your current and future observability needs. Rachel explains how Chronosphere is taking an open-source approach to observability, and why it's more important than ever to acknowledge that the stakes and costs are much higher when it comes to observability in the cloud.
About Rachel
Rachel leads product and technical marketing for Chronosphere. Previously, Rachel wore lots of marketing hats at CloudHealth (acquired by VMware), and before that, she led product marketing for cloud-integrated storage at NetApp. She also spent many years as an analyst at Forrester Research. Outside of work, Rachel tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston where she's based.
Links Referenced: Chronosphere: https://chronosphere.io/ LinkedIn: https://www.linkedin.com/in/rdines/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's featured guest episode is brought to us by our friends at Chronosphere, and they have also brought us Rachel Dines, their Head of Product and Solutions Marketing. Rachel, great to talk to you again.
Rachel: Hi, Corey. Yeah, great to talk to you, too.
Corey: Watching your trajectory has been really interesting, just because starting off, when we first started, I guess, learning who each other were, you were working at CloudHealth, which has since been acquired by VMware. And I was trying to figure out, huh, the cloud runs on money. How about that? It feels like it was a thousand years ago, but neither one of us is quite that old.
Rachel: It does feel like several lifetimes ago. You were just this snarky guy with a few followers on Twitter, and I was trying to figure out what you were doing mucking around with my customers [laugh]. Then [laugh] we kind of both figured out what we're doing, right?
Corey: So, speaking of that iterative process, today, you are at Chronosphere, which is an observability company. We would have called it a monitoring company five years ago, but now that's become an insult after the observability war dust has settled. So, I want to talk to you about something that I've been kicking around for a while because I feel like there's a gap somewhere. Let's say that I build a crappy web app—because all of my web apps inherently are crappy—and it makes money through some mystical form of alchemy. And I have a bunch of users, and I eventually realize, huh, I should probably have a better observability story than waiting for the phone to ring and a customer telling me it's broken. So, I start instrumenting various aspects of it that seem to make sense. Maybe I go too low-level, like looking at all the disks on every server to tell me if they're getting full or not, like they're ancient servers. Maybe I just have a Pingdom equivalent of is the website up enough to respond to a packet?
And as I wind up experiencing different failure modes and getting yelled at by different constituencies—in my own career trajectory, my own boss—you start instrumenting for all those different kinds of breakages, you start aggregating the logs somewhere and the volume gets bigger and bigger with time. But it feels like it's sort of a reactive process as you stumble through that entire environment. And I know it's not just me because I've seen this unfold in similar ways in a bunch of different companies. It feels to me, very strongly, like it is something that happens to you, rather than something you set about from day one with a strategy in mind. What's your take on an effective way to think about strategy when it comes to observability?
Rachel: You just nailed it. That's exactly the kind of progression that we so often see. And that's what I really was excited to talk with you about today—
Corey: Oh, thank God. I was worried for a minute there that you'd be like, "What the hell are you talking about? Are you just, like, some sort of crap engineer?" And, "Yes, but it's mean of you to say it." But yeah, what I'm trying to figure out is: is there some magic that I just was never connecting with? Because it always feels like you're in trouble because the site's always broken, and oh, like, if the disk fills up, yeah, oh, now we're going to start monitoring to make sure the disk doesn't fill up. Then you wind up getting barraged with alerts, and no one wins, and it's an uncomfortable period of time.
Rachel: Uncomfortable period of time. That is one very polite way to put it. I mean, I will say, it is very rare to find a company that actually sits down and thinks, "This is our observability strategy. This is what we want to get out of observability." Like, you can think about a strategy in, like, the old-school sense, and you know, I was an industry analyst, so I'm going to have to go back to, like, my roots at Forrester: thinking about the people, and the process, and the technology. But really, the bigger component here is: what's the business impact? What do you want to get out of your observability platform? What are you trying to achieve? And a lot of the time, people have thought, "Oh, observability strategy. Great, I'm just going to buy a tool. That's it. Like, that's my strategy." And I hate to break it to you, but buying tools is not a strategy. I'm not going to say, like, buy this tool. I'm not even going to say, "Buy Chronosphere." That's not a strategy. Well, you should buy Chronosphere. But that's not a strategy.
Corey: Of course. I'm going to throw money by the wheelbarrow at various observability vendors, and hope it solves my problem. But if that solved the problem—I've got to be direct—I've never spoken to those customers.
Rachel: Exactly. I mean, that's why this space is such a great one to come in and be very disruptive in. And I think, back in the days when we were running in data centers, maybe even before virtual machines, you could probably get away with not having a monitoring strategy—I'm not going to call it observability; that's not what we called it back then—you could get away with not having a strategy because what was the worst that was going to happen, right? There was a finite amount that your monitoring bill could be, a finite amount that your customer impact could be. Like, you're playing the penny slots, right? We're not on the penny slots anymore.
We're at the $50 craps table, and it's Las Vegas, and if you lose the game, you're going to have to run down the street without your shirt. Like, the game and the stakes have changed, and we're still pretending like we're playing penny slots, and we're not anymore.
Corey: That's a good way of framing it. I mean, I still remember some of my biggest observability challenges were building highly available rsyslog clusters so that you could bounce a member and not lose any log data because some of that was transactionally important. And we've gone beyond that to a stupendous degree, but it still feels like you don't wind up building this into the application from day one. More's the pity because if you did, and did that intelligently, that opens up a whole world of possibilities. I dream of that changing where one day, whenever you start to build an app, oh, and we just push the button and automatically instrument with OTel, so you instrument the thing once everywhere it makes sense to do it, and then you can do your vendor selection and, like you said, make those decisions later in time. But these days, we're not there.
Rachel: Well, I mean, and there's also the question of just the legacy environment and the tech debt. Even if you wanted to, the—actually I was having a beer yesterday with a friend who's a VP of Engineering, and he's got his new environment that they're building with observability instrumented from the start. How beautiful. They've got OTel, they're going to have tracing. And then he's got his legacy environment, which is a hot mess. So, you know, there's always going to be this bridge of the old and the new. But this is where it comes back to: no matter where you're at, you can stop and think, like, "What are we doing and why?" What is the cost of this? And not just cost in dollars, which I know you and I could talk about very deeply for a long period of time, but like, the opportunity costs. Developers are working on stuff when they could be working on something that's more valuable. Or like the cost of making people work round the clock, trying to troubleshoot issues when there could be an easier way. So, I think it's like stepping back and thinking about cost in terms of dollars and cents, time, opportunity, and then also impact, and starting to make some decisions about what you're going to do in the future that's different. Once again, you might be stuck with some legacy stuff that you can't really change that much, but [laugh] you got to be realistic about where you're at.
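Corey's "it happens to you" instrumentation arc from a few exchanges back (a disk gauge after the disk fills, request metrics after the first complaint) tends to look something like the following in practice. This is a minimal sketch using the Python prometheus_client library; the metric names and the toy request handler are hypothetical, and the point is the pattern, not a recommendation.

```python
import shutil
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# The reactive first pass: each metric below typically gets added only after
# the corresponding thing has already broken once.
DISK_FREE = Gauge("disk_free_bytes", "Free bytes on the root filesystem")
REQUESTS = Counter("http_requests_total", "Requests served", ["status"])
LATENCY = Histogram("http_request_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request() -> str:
    # Stand-in for real application work.
    REQUESTS.labels(status="200").inc()
    return "ok"

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus server to scrape
    while True:
        DISK_FREE.set(shutil.disk_usage("/").free)
        handle_request()
        time.sleep(5)
```

Rachel's point lands on exactly this shape: every metric above answers yesterday's incident, and none of it came from deciding up front what the business needed out of observability.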
Let's fix it before it does.” It's the, “What are you buying by bringing that person in?” “Experience, mostly.”Rachel: Yeah, that's an interesting point you make, and it kind of leads me down this little bit of a side note, but a really interesting antipattern that I've been seeing in a lot of companies is that more seasoned ops person, they're the one who everyone calls when something goes wrong. Like, they're the one who, like, “Oh, my God, I don't know how to fix it. This is a big hairy problem,” I call that one ops person, or I call that very experienced person. That experience person then becomes this huge bottleneck into solving problems that people don't really—they might even be the only one who knows how to use the observability tool. So, if we can't find a way to democratize our observability tooling a little bit more so, like, just day-to-day engineers, like, more junior engineers, newer ones, people who are still ramping, can actually use the tool and be successful, we're going to have a big problem when these ops people walk out the door, maybe they retire, maybe they just get sick of it. We have these massive bottlenecks in organizations, whether it's ops or DevOps or whatever, that I see often exacerbated by observability tools. Just a side note.Corey: Yeah. On some level, it feels like a lot of these things can be fixed with tooling. And I'm not going to say that tools aren't important. You ever tried to implement observability by hand? It doesn't work. There have to be computers somewhere in the loop, if nothing else.And then it just seems to devolve into a giant swamp of different companies, doing different things, taking different approaches. And, on some level, whenever you read the marketing or hear the stories any of these companies tell you also to normalize it from translating from whatever marketing language they've got into something that comports with the reality of your own environment and seeing if they align. And that feels like it is so much easier said than done.Rachel: This is a noisy space, that is for sure. And you know, I think we could go out to ten people right now and ask those ten people to define observability, and we would come back with ten different definitions. And then if you throw a marketing person in the mix, right—guilty as charged, and I know you're a marketing person, too, Corey, so you got to take some of the blame—it gets mucky, right? But like I said a minute ago, the answer is not tools. Tools can be part of the strategy, but if you're just thinking, “I'm going to buy a tool and that's going to solve my problem,” you're going to end up like this company I was talking to recently that has 25 different observability tools.And not only do they have 25 different observability tools, what's worse is they have 25 different definitions for their SLOs and 25 different names for the same metric. And to be honest, it's just a mess. I'm not saying, like, go be Draconian and, you know, tell all the engineers, like, “You can only use this tool [unintelligible 00:10:34] use that tool,” you got to figure out this kind of balance of, like, hands-on, hands-off, you know? How much do you centralize, how much do you push and standardize? Otherwise, you end up with just a huge mess.Corey: On some level, it feels like it was easier back in the days of building it yourself with Nagios because there's only one answer, and it sucks, unless you want to start going down the world of HP OpenView. Which step one: hire a 50-person team to manage OpenView. 
Okay, that's not going to solve my problem either. So, let's get a little more specific. How does Chronosphere approach this? Because historically, when I've spoken to folks at Chronosphere, there isn't that much of a day-one story, in that, "I'm going to build a crappy web app. Let's instrument it for Chronosphere." There's a certain, "You must be at least this tall to ride," implicit expectation built into the product just based upon its origins. And I'm not saying that doesn't make sense, but it also means there's really no such thing as a greenfield build-out for you either.
Rachel: Well, yes and no. I mean, I think there are no greenfields out there because everyone's doing something for observability, or monitoring, or whatever you want to call it, right? Whether they've got Nagios, whether they've got the Dog, whether they've got something else in there, they have some way of introspecting their systems, right? So, one of the things that Chronosphere is built on, and I actually think this is a way you might think about building out an observability strategy as well, is this concept of control and open-source compatibility. So, we can only collect data via open-source standards. You have to send this data via Prometheus or via OpenTelemetry; it could be older standards, like, you know, StatsD or Graphite, but we don't have any proprietary instrumentation. And if I was making a recommendation to somebody building out their observability strategy right now, I would say open, open, open, all day long because that gives you a huge amount of flexibility in the future. Because guess what? You know, you might put together an observability strategy that seems like it makes sense for right now—actually, I was talking to a B2B SaaS company that told me that they made a choice a couple of years ago on an observability tool. It seemed like the right choice at the time. They were growing so fast, they very quickly realized it was a terrible choice. But now, it's going to be really hard for them to migrate because it's all based on proprietary standards. Now, of course, a few years ago, they didn't have the luxury of OpenTelemetry and all of these, but now that we have them, we can use these standards to kind of future-proof against our mistakes. So, that's one big area that is, once again, both my recommendation and our approach at Chronosphere.
Corey: I think that that's a fair way of viewing it. It's a constant challenge, too, just because increasingly—you mentioned the Dog earlier, for example—I will say that for years, I have been asked whether or not at The Duckbill Group, we look at Azure bills or GCP bills. Nope, we are pure AWS. Recently, we started to hear that same inquiry specifically around Datadog, to the point where it has become a board-level concern at very large companies. And that is a challenge, on some level. I don't deviate from my typical path of I fix AWS bills, and that's enough impossible problems for one lifetime, but there is a strong sense of you want to record as much as possible for a variety of excellent reasons, but there's an implicit cost to doing that, and in many cases, the cost of observability becomes a massive contributor to the overall cost. Netflix has said in talks before that they're effectively an observability company that also happens to stream movies, just because it takes so much effort, engineering, and raw computing resources in order to get that data and do something actionable with it.
It's a hard problem.
Rachel: It's a huge problem, and it's a big part of why I work at Chronosphere, to be honest. Because when I was—you know, towards the tail end of my time at my previous company in cloud cost management, I had a lot of customers coming to me saying, "Hey, when are you going to tackle our Dog or our New Relic or whatever?" Similar to the experience you're having now, Corey, this was happening to me three, four years ago. And I noticed that there is definitely a correlation between people who are having these really big challenges with their observability bills and people that were adopting, like, Kubernetes and microservices and cloud-native. And it was around that time that I met the Chronosphere team, and that's exactly what we do, right? We focus on observability for these cloud-native environments where observability data just goes, like, wild. We see 10X, 20X as much observability data, and that's what's driving up these costs. And yeah, it is becoming a board-level concern. I mean, and coming back to the concept of strategy, like if observability is the second or third most expensive item in your engineering bill—like, obviously, cloud infrastructure, number one—number two and number three is probably observability. How can you not have a strategy for that? How can this be something the board asks you about, and you're like, "What are we trying to get out of this? What's our purpose?" "Uhhhh… troubleshooting?"
Corey: Right, because it turns into business metrics as well. It's not just about is the site up or not. There's a—like, one of the things that always drove me nuts not just in the observability space, but even in cloud costing is where, okay, your costs have gone up this week so you get a frowny face, or it's in red, like traffic light coloring. Cool, but for a lot of architectures and a lot of customers, that's because you're doing a lot more volume. That translates directly into increased revenues, increased things you care about. You don't have the position or the context to say, "That's good," or, "That's bad." It simply is. And you can start deriving business insight from that. And I think that is the real observability story that I think has largely gone untold at tech conferences, at least.
Rachel: It's so right. I mean, spending more on something is not inherently bad if you're getting more value out of it. And it's definitely a challenge on the cloud cost management side. "My costs are going up, but my revenue is going up a lot faster, so I'm okay." And I think part of it is, like, you know, we put observability in this box of, like, it's for low-level troubleshooting, but really, if you step back and think about it, there are a lot of larger, bigger-picture initiatives that observability can contribute to in an org, like digital transformation. I know that's a buzzword, but, like that is a legit thing that a lot of CTOs are out there thinking about. Like, how do we, you know, get out of the tech debt world, and how do we get into cloud-native? Maybe it's developer efficiency. God, there's a lot of people talking about developer efficiency. Last week at KubeCon, that was one of the big, big topics. I mean, and yeah, what [laugh] what about cost savings? To me, we've put observability in a smaller box, and it needs to bust out. And I see this also in our customer base, you know? Customers like DoorDash use observability, not just to look at their infrastructure and their applications, but also to look at their business.
At any given minute, they know how many Dashers are on the road, how many orders are being placed, cut by geos, down to the—actually down to the second, and they can use that to make decisions.
Corey: This is one of those things that I always found a little strange coming from the world of running systems in large [unintelligible 00:17:28] environments to fixing AWS bills. There's nothing that even resembles a fast, reactive response in the world of AWS billing. You wind up with a runaway bill, they're going to resolve that over a period of weeks, on Seattle business hours. If you wind up spinning something up that creates a whole bunch of very expensive drivers behind your bill, it's going to take three days, in most cases, before that starts showing up anywhere that you can reasonably expect to get at it. The idea of near real-time is a lie unless you want to start instrumenting everything that you're doing to trap the calls and then run cost extrapolation from there. That's hard to do. Observability is a very different story, where latencies start to matter, where being able to get leading indicators of certain events, be they technical or business, starts to be very important. But it seems like it's so hard to wind up getting there from where most people are. Because I know we like to talk dismissively about the past, but let's face it, conference-ware is the stuff we're the proudest of. The reality is the burning dumpster of regret in our data centers that still also drives giant piles of revenue, so you can't turn it off, nor would you want to, but you feel bad about it as a result. It just feels like it's such a big leap.
Rachel: It is a big leap. And I think the very first step I would say is trying to get to this point of clarity and being honest with yourself about where you're at and where you want to be. And sometimes not making a choice is a choice, right, as well. So, sticking with the status quo is making a choice. And so, like, as we get into things like the holiday season right now, and I know there's going to be people that are on-call 24/7 during the holidays, potentially, to keep something that's just duct-taped together barely up and running, I'm making a choice; you're making a choice to do that. So, I think that's the first step: the kind of… at least acknowledging where you're at, where you want to be, and if you're not going to make a change, just understanding the cost and being realistic about it.
Corey: Yeah, being realistic, I think, is one of the hardest challenges because it's easy to wind up going for the aspirational story of, "In the future when everything's great." Like, "Okay, cool. I appreciate the need to plant that flag on the hill somewhere. What's the next step? What can we get done by the end of this week that materially improves us from where we started the week?" And I think that with the aspirational conference-ware stories, it's hard to break that down into things that are actionable, that don't feel like they're going to be an interminable slog across your entire existing environment.
Rachel: No, I get it. And for things like, you know, instrumenting and adding tracing and adding OTel, a lot of the time, the return that you get on that investment is… it's not quite like, "I put a dollar in, I get a dollar out." I mean, something like tracing, you can't get to 60% instrumentation and get 60% of the value. You need to be able to get to, like, 80, 90%, and then you'll get a huge amount of value.
So, it's sort of like you're trudging up this hill, you're charging up this hill, and then finally you get to the plateau, and it's beautiful. But that hill is steep, and it's long, and it's not pretty. And I don't know what to say other than there's a plateau near the top. And those companies that do this well really get a ton of value out of it. And that's the dream, that we want to help customers get up that hill. But yeah, I'm not going to lie, the hill can be steep.
Corey: One thing that I find interesting is there's almost a bimodal distribution in companies that I talk to. On the one side, you have companies like, I don't know, a Chronosphere is a good example of this. Presumably you have a cloud bill somewhere and the majority of your cloud spend will be on what amounts to a single application, probably in your case called, I don't know, Chronosphere. It shares the name of the company. The other side of that distribution is the large enterprise conglomerates where they're spending, I don't know, $400 million a year on cloud, but their largest workload is 3 million bucks, and it's just a very long tail of a whole bunch of different workloads, applications, teams, et cetera. So, what I'm curious about from the Chronosphere perspective—or the product you have, not the 'you' in this metaphor, which gets confusing—is, it feels easier to instrument a Chronosphere-like company that has a primary workload that is the massive driver of most things and get that instrumented and start getting an observability story around that than it does to try and go to a giant company and say, "Okay, 1500 teams that are all going in different directions need to all implement this thing." How do you see it playing out among your customer base, if that bimodal distribution holds up in your world?
Rachel: It does and it doesn't. So, first of all, for a lot of our customers, we often start with metrics. And starting with metrics means Prometheus. And Prometheus has hundreds of exporters. It is basically built into Kubernetes. So, if you're running Kubernetes, getting Prometheus metrics out is actually not a very big lift. So, we find that we start with Prometheus, we start with getting metrics in, and we can get a lot—I mean, customers—we have a lot of customers that use us just for metrics, and they get a massive amount of value. But then once they're ready, they can start instrumenting for OTel and start getting traces in as well. And yeah, in large organizations, it does tend to be one team, one application, one service, one department that kind of goes at it and gets all that instrumented. But I've even seen very large organizations, when they get their act together and decide, like, "No, we're doing this," they can get OTel instrumented fairly quickly. So, I guess it's, like, a matter of lining up. It's more of a people issue than a technical issue a lot of the time. Like, getting everyone lined up and making sure that, like, yes, we all agree. We're on board. We're going to do this. But usually, like, it's start small, and it doesn't have to be all or nothing. We also just recently added the ability to ingest events, which is actually a really beautiful thing, and it's very, very straightforward. It basically just—we connect to your existing other DevOps tools, so whether it's, like, a Buildkite, or a GitHub, or, like, a LaunchDarkly, and then anytime something happens in one of those tools, that gets registered as an event in Chronosphere. And then we overlay those events over your alerts.
So, when an alert fires, the first thing I do is go look at the alert page, and if it says, "Hey, someone did a deploy five minutes ago," or, "There was a feature flag flipped three minutes ago," I've solved the problem right then. I don't think of this as—there's not an all-or-nothing nature to any of this stuff. Yes, tracing is a little bit of a—you know, like I said, it's one of those things where you have to make a lot of investment before you get a big reward, but that's not the case in all areas of observability.
Corey: Yeah. I would agree. Do you find that there's a significant easy, early win when customers start adopting Chronosphere? Because one of the problems that I've found, especially with things that are holistic, and as you talk about tracing, well, you need to get to a certain point of coverage before you see value. But human psychology being what it is, you kind of want to be able to demonstrate, oh, see, the Mean Time To Dopamine needs to come down, to borrow an old phrase. Do you find that there are some easy wins that start to help people see the light? Because otherwise, it just feels like a whole bunch of work for no discernible benefit to them.
Rachel: Yeah, at least for the Chronosphere customer base, one of the areas where we're seeing a lot of traction this year is in optimizing the costs, like, coming back to the cost story of their overall observability bill. So, we have this concept of the control plane in our product where all the data that we ingest hits the control plane. At that point, that customer can look at the data, analyze it, and decide this is useful, this is not useful. And actually, not just decide that, but we show them what's useful, what's not useful; what's being used; what's high-cardinality and high-cost but maybe no one's touched it. And then we can make decisions around aggregating it, dropping it, combining it, doing all sorts of fancy things, changing the—you know, downsampling it. We can do this on the trace side, both head-based and tail-based. On the metrics side, it's as it hits the control plane and then streams out. And then they only pay for the data that we store. So typically, customers are—they come on board and immediately reduce their observability dataset by 60%. Like, that's just straight up, that's the average. And we've seen some customers get really aggressive, get up to, like, in the 90s, where they realize we're only using 10% of this data. Let's get rid of the rest of it. We're not going to pay for it. So, paying a lot less helps in a lot of ways. It also helps companies get more coverage of their observability, more coverage of their overall stack. So, I was talking recently with an autonomous vehicle company that recently came to us from the Dog, and they had made some really tough choices and were no longer monitoring their pre-prod environments at all because they just couldn't afford to do it anymore. It's like, well, now they can, and we're still saving them money.
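The aggregate-or-drop decision Rachel describes is easier to picture with a toy version. The sketch below assumes nothing about Chronosphere's actual API; it only shows the shape of the idea, with an invented rule that strips a high-cardinality per-pod label and sums the per-pod series into per-service ones before anything is stored.

```python
from collections import defaultdict

# Incoming samples as (metric name, labels, value). The per-pod label is the
# invented high-cardinality culprit in this example.
samples = [
    ("http_requests_total", {"service": "checkout", "pod": "checkout-7f9c"}, 41.0),
    ("http_requests_total", {"service": "checkout", "pod": "checkout-d21a"}, 37.0),
    ("http_requests_total", {"service": "search", "pod": "search-0b3e"}, 19.0),
]

DROP_LABELS = {"pod"}  # a rule a human chose after seeing what nobody queries

def aggregate(stream):
    """Sum samples together after removing the labels nobody is using."""
    out = defaultdict(float)
    for name, labels, value in stream:
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in DROP_LABELS))
        out[(name, kept)] += value
    return out

for (name, labels), value in aggregate(samples).items():
    print(name, dict(labels), value)
# Three per-pod series collapse into two per-service series; only the
# aggregate is stored and paid for.
```

Whether a rule like this lives in a vendor's control plane or in a Prometheus recording rule, the decision it encodes (keep what is queried, drop what is not) is where reductions on the scale Rachel quotes come from.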
And that tends to mean, oh, yeah, it's not just the first-order effect; it's the second and third and fourth-order effects this winds up having. It becomes almost a holistic challenge. I think that trying to put observability in its own bucket, on some level—when you're looking at it from a cost perspective—starts to be a, I guess, a structure that makes less and less sense in the fullness of time.Rachel: Yeah, I would agree with that. I think that just looking at the bill from your vendor is one very small piece of the overall cost you're incurring. I mean, all of the things you mentioned, the egress, the CloudWatch, the other services, it's impacting, what about the people?Corey: Yeah, it sure is great that your team works for free.Rachel: [laugh]. Exactly, right? I know, and it makes me think a little bit about that viral story about that particular company with a certain vendor that had a $65 million per year observability bill. And that impacted not just them, but, like, it showed up in both vendors' financial filings. Like, how did you get there? How did you get to that point? And I think this all comes back to the value in the ROI equation. Yes, we can all sit in our armchairs and be like, “Well, that was dumb,” but I know there are very smart people out there that just got into a bad situation by kicking the can down the road on not thinking about the strategy.Corey: Absolutely. I really want to thank you for taking the time to speak with me about, I guess, the bigger picture questions rather than the nuts and bolts of a product. I like understanding the overall view that drives a lot of these things. I don't feel I get to have enough of those conversations some weeks, so thank you for humoring me. If people want to learn more, where's the best place for them to go?Rachel: So, they should definitely check out the Chronosphere website. Brand new beautiful spankin' new website: chronosphere.io. And you can also find me on LinkedIn. I'm not really on the Twitters so much anymore, but I'd love to chat with you on LinkedIn and hear what you have to say.Corey: And we will, of course, put links to all of that in the [show notes 00:28:26]. Thank you so much for taking the time to speak with me. It's appreciated.Rachel: Thank you, Corey. Always fun.Corey: Rachel Dines, Head of Product and Solutions Marketing at Chronosphere. This has been a featured guest episode brought to us by our friends at Chronosphere, and I'm Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment that I will one day read once I finished building my highly available rsyslog system to consume it with.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
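Rachel mentions that trace data can be shaped both head based and tail based before it is stored. A generic sketch of the difference, assuming invented span fields and thresholds rather than Chronosphere's actual implementation:

import random

def head_sample(rate: float = 0.1) -> bool:
    # Head-based: decided when the trace starts, before its outcome is known.
    # Cheap, but errors are dropped at the same rate as successes.
    return random.random() < rate

def tail_sample(spans: list[dict]) -> bool:
    # Tail-based: decided after the whole trace has arrived, so the decision
    # can key off outcomes: keep every error and every slow request.
    if any(span.get("error") for span in spans):
        return True
    return max(span.get("duration_ms", 0) for span in spans) > 500

The tradeoff is that tail sampling needs somewhere to buffer complete traces before deciding, which is exactly the kind of work a control plane takes on.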
The State Board of Education has adopted new teacher evaluation guidelines that allow districts to move away from ratings, SLOs, and standardized indicators. To ensure districts move to an evaluation plan that really helps educators grow, teacher representatives to professional development and evaluation committees (PDECs) need to speak up and advocate for themselves and their colleagues. In this episode CEA President Kate Dias and Vice President Joslyn DeLancey talk with CEA Teacher Development Specialist Kate Field about changes to teacher evaluation and what local association members need to do to ensure their district gets evaluation right. Check out the new evaluation guidelines, CEA's implementation guide, and CEA's model teacher evaluation plan: https://cea.org/certification
Dr. Vladyslav Ukis is Head of R&D at Siemens Healthineers with a strong interest in Software Architecture, Cloud Computing, and Digital Transformation, to name a few topics. Vlad is also an author; his book “Establishing SRE Foundations” describes his practical experience starting the SRE practice. Listen to our conversation on introducing SRE to enterprise organizations.
Gerhard joins us for the 12th Kaizen, and this time we talk about what we DIDN'T do. We were holding S3 wrong, we put some cash back in our pockets, we enabled HTTP/3, Brotli compression, and Fastly WebSockets, we improved our SLOs, we improved Changelog Nightly, and we're going to KubeCon 2023 in Chicago.
Noticias Virgas ✨ La Mona has a stroke! But he survives. Elon chickens out and backs out of the octagon match with Zuckerberg. The trailer for Netflix's new animated Scott Pilgrim series is out. Miyazaki's new movie is smashing the Japanese box office. Charles Martinet is no longer the voice of Mario. Overwatch 2 hits an Overwhelmingly Negative score on Steam. Nintendo Argentina's eShop won't let you buy with cards from another country. Those crooks at Rockstar hire modders to work with them, even though they had banned them from the game in 2015! Lenovo announces the "Legion Go" and steps up to compete with the Switch and the Steam Deck. This past August 14 the Sega Genesis turned 34! Plus, the movies/shows we watched.
The death of Vincent van Gogh and the birth of Indian businessman J.R.D. Tata. Join us and discover what happened on this day some years ago while you improve your listening comprehension and learn new words. Share your thoughts with us by emailing podcasting@babbel.com. Useful vocabulary: postimpresionista (post-Impressionist): an artist belonging to the French post-Impressionist movement, which was more interested in showing emotion than in making a realistic painting; aerolínea nacional (national airline): an airline that belongs, or has belonged, to a country's government. The events presented are written in simplified form for listeners with an intermediate level of Spanish and reflect the information available as of April 2022. You can listen and read at the same time! Use the episode transcript: https://bit.ly/3R86ZIv
The births of Albert Namatjira and Fahmida Riaz, and a severe earthquake strikes Tangshan. Join us and discover what happened on this day some years ago while you improve your listening comprehension and learn new words. Share your thoughts with us by emailing podcasting@babbel.com. Useful vocabulary: ciudadanía (citizenship): recognition of a person as a citizen of a country; liberal: someone who does not adhere to strict traditions and rules; exiliarse (to go into exile): to be forced to leave for another country. The events presented are written in simplified form for listeners with an intermediate level of Spanish and reflect the information available as of April 2022. You can listen and read at the same time! Use the episode transcript: https://bit.ly/3r2JANU Warning: violence
Please Rate and Review us on your podcast app of choice!If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see hereEpisode list and links to all available episode transcripts (most interviews from #32 on) hereProvided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.Manisha's LinkedIn: https://www.linkedin.com/in/evermanisha/'A streamlined developer experience in Data Mesh' articles by Manisha: Part 1 - Platform: https://www.thoughtworks.com/insights/blog/data-strategy/dev-experience-data-mesh-platformPart 2 - Product: https://www.thoughtworks.com/insights/blog/data-strategy/dev-experience-data-mesh-productArticle on Lean Value Tree: https://rolandbutler.medium.com/what-is-the-lean-value-tree-e90d06328f09Blog post on mentioned data mesh workshops: https://martinfowler.com/articles/data-mesh-accelerate-workshop.htmlIn this episode, Scott interviewed Manisha Jain, Data Engineer at Thoughtworks.Some key takeaways/thoughts from Manisha's point of view:Manisha's top 3 pieces of early mesh journey advice: A) put together a specification of what a data product is, make it clear. Your definition will evolve/improve but if people don't understand the building blocks, it's going to be hard to build value. B) start to create standardized input and output ports because that is how data products - your units of value - actually exchange value. C) make it easy to discover your important SLOs to create trust; partner with consumers to serve what they need.When working with your first domain, it's crucially important to make sure they have strong data engineering talent - whether that...
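A rough sketch of what Manisha's first piece of advice, a clear data product specification with standardized ports and discoverable SLOs, might look like in practice. The field names are illustrative, not a Thoughtworks standard:

from dataclasses import dataclass, field

@dataclass
class Port:
    name: str
    format: str      # e.g. "parquet", "kafka+avro"
    schema_ref: str  # pointer to a registered schema

@dataclass
class DataProductSpec:
    name: str
    owner_domain: str
    input_ports: list[Port] = field(default_factory=list)
    output_ports: list[Port] = field(default_factory=list)
    slos: dict[str, str] = field(default_factory=dict)  # e.g. {"freshness": "< 1 hour"}

Even a small spec like this gives consumers one place to discover how to connect and what reliability to expect, which is where the trust Manisha describes comes from.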
Eswar Bala, Director of Amazon EKS at AWS, joins Corey on Screaming in the Cloud to discuss how and why AWS built a Kubernetes solution, and what customers are looking for out of Amazon EKS. Eswar reveals the concerns he sees from customers about the cost of Kubernetes, as well as the reasons customers adopt EKS over ECS. Eswar gives his reasoning on why he feels Kubernetes is here to stay and not just hype, as well as how AWS is working to reduce the complexity of Kubernetes. Corey and Eswar also explore the competitive landscape of Amazon EKS, and the new product offering from Amazon called Karpenter.About EswarEswar Bala is a Director of Engineering at Amazon and is responsible for Engineering, Operations, and Product strategy for Amazon Elastic Kubernetes Service (EKS). Eswar leads the Amazon EKS and EKS Anywhere teams that build, operate, and contribute to the services customers and partners use to deploy and operate Kubernetes and Kubernetes applications securely and at scale. With a 20+ year career in software , spanning multimedia, networking and container domains, he has built greenfield teams and launched new products multiple times.Links Referenced: Amazon EKS: https://aws.amazon.com/eks/ kubernetesthemuchharderway.com: https://kubernetesthemuchharderway.com kubernetestheeasyway.com: https://kubernetestheeasyway.com EKS documentation: https://docs.aws.amazon.com/eks/ EKS newsletter: https://eks.news/ EKS GitHub: https://github.com/aws/eks-distro TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: It's easy to **BEEP** up on AWS. Especially when you're managing your cloud environment on your own!Mission Cloud un **BEEP**s your apps and servers. Whatever you need in AWS, we can do it. Head to missioncloud.com for the AWS expertise you need. Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. Today's promoted guest episode is brought to us by our friends at Amazon. Now, Amazon is many things: they sell underpants, they sell books, they sell books about underpants, and underpants featuring pictures of books, but they also have a minor cloud computing problem. In fact, some people would call them a cloud computing company with a gift shop that's attached. Now, the problem with wanting to work at a cloud company is that their interviews are super challenging to pass.If you want to work there, but can't pass the technical interview for a long time, the way to solve that has been, “Ah, we're going to run Kubernetes so we get to LARP as if we worked at a cloud company but don't.” Eswar Bala is the Director of Engineering for Amazon EKS and is going to basically suffer my slings and arrows about one of the most complicated, and I would say overwrought, best practices that we're seeing industry-wide. Eswar, thank you for agreeing to subject yourself to this nonsense.Eswar: Hey, Corey, thanks for having me here.Corey: [laugh]. So, I'm a little bit unfair to Kubernetes because I wanted to make fun of it and ignore it. But then I started seeing it in every company that I deal with in one form or another. 
So yes, I can still sit here and shake my fist at the tide, but it's turned into, “Old Man Yells at Cloud,” which I'm thrilled to embrace, but everyone's using it. So, EKS has recently crossed, I believe, the five-year mark since it was initially launched. What is EKS other than Amazon's own flavor of Kubernetes?Eswar: You know, the best way I can define EKS is, EKS is just Kubernetes. Not Amazon's version of Kubernetes. It's just Kubernetes that we get from the community and offer it to customers to make it easier for them to consume. So, EKS. I've been with EKS from the very beginning when we thought about offering a managed Kubernetes service in 2017.And at that point, the goal was to bring Kubernetes to enterprise customers. So, we had many customers telling us that they want us to make their life easier by offering a managed version of Kubernetes that they were actually beginning to [erupt 00:02:42] at that time period, right? So, my goal was to figure out what that service looks like and which customer base we should be targeting the service towards.Corey: Kelsey Hightower has a fantastic learning tool out there in a GitHub repo called, “Kubernetes the Hard Way,” where he talks you through building the entire thing, start to finish. I wound up forking it and doing that on top of AWS, and you can find that at kubernetesthemuchharderway.com. And that was fun.And I went through the process and my response at the end was, “Why on earth would anyone ever do this more than once?” And we got that sorted out, but now it's—customers aren't really running these things from scratch. It's like the Linux from Scratch project. Great learning tool; probably don't run this in production in the same way that you might otherwise because there are better ways to solve for the problems that you will have to solve yourself when you're building these things from scratch. So, as I look across the ecosystem, it feels like EKS stands in the place of the heavy, undifferentiated lifting of running the Kubernetes control plane so customers functionally don't have to. Is that an effective summation of this?Eswar: That is precisely right. And I'm glad you mentioned, “Kubernetes the Hard Way,” I was a big fan of that when it came out. And anyone who did that tutorial, and also your tutorial, “Kubernetes the Harder Way,” would walk away thinking, “Why would I pick this technology when it's super complicated to set up?” But then you see that customers love Kubernetes and you see that reflected in the adoption, even in 2016, 2017 timeframes.And the reason is, it made life easier for application developers in terms of offering web services that they wanted to offer to their customer base. And because of all the features that Kubernetes brought on, application lifecycle management, service discovery, and then it evolved to support various application architectures, right, in terms of stateless services, stateful applications, and even daemon sets, right, like for running your logging and metrics agents. And these are powerful features, at the end of the day, and that's what drove Kubernetes. And because it's super hard to get going to begin with and then to operate, the day-two operator experience is super complicated.Corey: And the day one experience is super hard and the day two experience of, “Okay, now I'm running it and something isn't working the way it used to. Where do I start,” has been just tremendously overwrought. And frankly, more than a little intimidating.Eswar: Exactly. Right? 
And that exactly was our opportunity when we started in 2017. And when we started, there was a question of, okay, should we really build a service when you have an existing service like ECS in place? And by the way, like, I did work in ECS before I started working in EKS from the beginning.So, the answer then was, it was about giving customers what they want. And there's space for many container orchestration systems, right, ECS was the AWS service at that point in time. And our thinking was, how do we give customers what they wanted? They wanted a Kubernetes solution. Let's go build that. But we built it in a way that removes the undifferentiated heavy lifting of managing Kubernetes.Corey: One of the weird things that I find is that everyone's using Kubernetes, but I don't see it in the way that I contextualize the AWS universe, which of course, is on the bill. That's right. If you don't charge for something in AWS Lambda, and preferably a fair bit, I don't tend to know it exists. Like, “What's an IAM and what might that possibly do?” Always a reassuring thing to hear from someone who's often called an expert in this space. But you know, if it doesn't cost money, why do I pay attention to it?The control plane is what EKS charges for, unless you're running a bunch of Fargate-managed pods and containers to wind up handling those things. So, it mostly just shows up as an addendum to the actual big, meaty portions of the bill. It just looks like a bunch of EC2 instances with some really weird behavior patterns, particularly with regard to auto-scaling and crosstalk between all of those various nodes. So, it's a little bit of a murder mystery, figuring out, “So, what's going on in this environment? Do you folks use containers at all?” And the entire Kubernetes shop is looking at me like, “Are you simple?”No, it's just I tend to disregard the lies that customers tell, mostly to themselves, because everyone has this idea of what's going on in their environment, but the bill speaks. It's always been a little bit of an investigation to get to the bottom of anything that involves Kubernetes at significant points of scale.Eswar: Yeah, you're right. Like if you look at EKS, right, like, we started with managing the control plane to begin with. And managing the control plane is a drop in the bucket when you actually look at the costs in terms of operating a Kubernetes cluster or running a Kubernetes cluster. When you look at how our customers use and where they spend most of their cost, it's about where their applications run; it's actually the Kubernetes data plane, and the amount of compute and memory that the applications end up using ends up driving 90% of the cost. And beyond that is the storage, beyond that is the networking costs, right, and then after that is the actual control plane costs. So, the problem right now is figuring out, how do we optimize our costs for the application to run on?Corey: On some level, it requires a little bit of understanding of what's going on under the hood. There have been a number of cost optimization efforts that have been made in the Kubernetes space, but they tend to focus around stuff that I find relatively, well, I call it banal because it basically is. You're looking at the idea of, okay, what size instances should you be running, and how well can you fill them and make sure that all the resources per node wind up being taken advantage of? But that's also something that, I guess from my perspective, isn't really the interesting architectural point of view. 
Whether or not you're running a bunch of small instances or a few big ones or some combination of the two, that doesn't really move the needle on any architectural shift, whereas ingesting a petabyte a month of data and passing 50 petabytes back and forth between availability zones, that's where it starts to get really interesting as far as tracking that stuff down.But what I don't see is a whole lot of energy or effort being put into that. And I mean, industry-wide, to be clear. I'm not attempting to call out Amazon specifically on this. That's [laugh] not the direction I'm taking this in. For once. I know, I'm still me. But it seems to be just an industry-wide issue, where zone affinity for Kubernetes has been a very low priority item, even on project roadmaps on the Kubernetes project.Eswar: Yeah, Kubernetes does provide the ability for customers to restrict their workloads within a particular [unintelligible 00:09:20], right? Like, there are constraints that you can place on your pod specs that end up driving applications towards a particular AZ if they want, right? You're right, it's still left to the customers to configure. Just because there's a configuration available doesn't mean the customers use it. If it's not defaulted, most of the time, it's not picked up.That's where it's important for service providers—like EKS—to not only provide visibility by means of reporting, which is available using tools like [Cue Cards 00:09:50] and Amazon Billing Explorer, but also provide insights and recommendations on what customers can do. I agree that there's a gap today in EKS, for example, in terms of that. Like, we're slowly closing that gap and it's something that we're actively exploring. How do we provide insights across all the resources customers end up using from within a cluster? That includes not just compute and memory, but also storage and networking, right? And that's what we are actually moving towards at this point.Corey: Part of the weird problem I've found is that, on some level, you get to play almost data center archaeologist when you start exploring what's going on in these environments. I've found one of the only reliable ways to get answers to some of this stuff has been oral tradition of, “Okay, this Kubernetes cluster just starts hurling massive data quantities at 3 a.m. every day. What's causing that?” And it leads to, “Oh, no no, have you talked to the data science team,” like, “Oh, you have a data science team. A common AWS billing mistake.” And exploring down that particular path sometimes pays dividends. But there's no holistic way to solve that globally. Today. I'm optimistic about tomorrow, though.Eswar: Correct. And that's where we are spending our efforts right now. For example, we recently launched our partnership with Cue Cards, and Cue Cards is now available as an add-on from the Marketplace that you can easily install and provision on EKS clusters, for example. And that is a start. And Cue Cards is amazing in terms of features, in terms of the insight it offers, right, looking into the compute, the memory, and the optimizations and insights it provides you.And we are also working with the AWS Cost and Usage Reporting team to provide a native AWS solution for the cost reporting and the insights aspect as well in EKS. And it's something we are going to be working on really closely to solve the networking gaps in the near future.Corey: What are you seeing as far as customer concerns go, with regard to cost and Kubernetes? 
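For reference, the pod spec constraint Eswar alludes to can be as small as a nodeSelector on the well-known zone label, which keeps a chatty workload's replicas in a single AZ and avoids cross-AZ transfer between them. A minimal sketch with the official Kubernetes Python client (the zone, names, and image are placeholders):

from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="zonal-worker"),
    spec=client.V1PodSpec(
        # Only schedule onto nodes in this availability zone.
        node_selector={"topology.kubernetes.io/zone": "us-east-1a"},
        containers=[client.V1Container(name="worker", image="example/worker:latest")],
    ),
)

As Eswar notes, though, a knob that exists but is not defaulted mostly goes unused.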
I see some things, but let's be very clear here, I have a certain subset of the market that I spend an inordinate amount of time speaking to and I always worry that what I'm seeing is not holistically what's going on in the broader market. What are you seeing customers concerned about?Eswar: Well, let's start from the fundamentals here, right? Customers really want to get to market faster, whatever services and applications they want to offer. And they want to have it cheaper to operate. And if they're adopting EKS, they want it to be cheaper to operate Kubernetes in the cloud. They also want high performance, they also want scalability, and they want security and isolation.There are so many parameters that they have to deal with before they put their service on the market and continue to operate. And there's a fundamental tension here, right? Like they want cost efficiency, but they also want to be available in the market quicker and they want performance and availability. Developers have uptime, SLOs, and SLAs to consider and they want the maximum possible resources. And on the other side, you've got financial leaders and the business leaders who want to look at the spending and worry about, like, okay, are we allocating our capital wisely? And are we allocating where it makes sense? And are we doing it in a manner where there's very little wastage and it's aligned with our customer use, for example? And this is where the actual problems arise from [unintelligible 00:13:00].Corey: I want to be very clear that for a long time, one of the most expensive parts about running Kubernetes has not been the infrastructure itself. It's been the people to run this responsibly, where it's the day two, day three experience where, for an awful lot of companies, it's like, oh, we're moving to Kubernetes because, I don't know, we read it in an in-flight magazine or something and all the cool kids are doing it, which honestly during the pandemic is why suddenly everyone started making better IT choices because their execs were not being exposed to airport ads. I digress. The point, though, is that as customers are figuring this stuff out and playing around with it, it's not sustainable that every company that wants to run Kubernetes can afford a crack SRE team that is individually incredibly expensive and collectively staggeringly so. It seems to be that the real cost is the complexity tied to it.And EKS has been great in that it abstracts an awful lot of the control plane complexity away. But I still can't shake the feeling that running Kubernetes is mind-bogglingly complicated. Please argue with me and tell me I'm wrong.Eswar: No, you're right. It's still complicated. And it's a journey towards reducing the complexity. When we launched EKS, we launched only with managing the control plane to begin with. And that's where we started, but customers had the complexity of managing the worker nodes.And then we evolved to manage the Kubernetes worker nodes in terms of two products: we've got Managed Node Groups and Fargate. And then customers moved on to installing more agents in their clusters before they actually installed their business applications, things like Cluster Autoscaler, things like Metrics Server, critical components that they have come to rely on but that don't drive their business logic directly. They are supporting aspects of driving core business logic.And that's how we evolved into managing the add-ons to make life easier for our customers. 
And it's a journey where we continue to reduce the complexity of making it easier for customers to adopt Kubernetes. And once you cross that chasm—and we are still trying to cross it—once you cross it, you have the problem of, okay so, adopting Kubernetes is easy. Now, we have to operate it, right, which means that we need to provide better reporting tools, not just for costs, but also for operations. Like, how easy it is for customers to get to the application level metrics, how easy it is for customers to troubleshoot issues, and how easy it is for customers to actually upgrade to newer versions of Kubernetes. All of these challenges come out beyond day one, right? And those are initiatives that we have in flight to make it easier for customers [unintelligible 00:15:39].Corey: So, one of the things I see when I start going deep into the Kubernetes ecosystem is, well, Kubernetes will go ahead and run the containers for me, but now I need to know what's going on in various areas around it. One of the big booms in the observability space, in many cases, has come from the fact that you now need to diagnose something in a container you can't log into and that incidentally stopped existing 20 minutes before you got the alert about the issue, so you'd better hope your telemetry is up to snuff. Now, yes, that does act as a bit of a complexity burden, but on the other side of it, we don't have to worry about things like failed hard drives taking systems down anymore. That has successfully been abstracted away by Kubernetes, or you know, your cloud provider, but that's neither here nor there these days. What are you seeing as far as, effectively, the sidecar pattern, for example of, “Oh, you have too many containers and need to manage them? Have you considered running more containers?” Sounds like something a container salesman might say.Eswar: So, running containers demands that you have really solid observability tooling, things that you're able to troubleshoot—successfully—debug without the need to log into the container itself. In fact, that's an anti-pattern, right? You really don't want the ability to SSH into a particular container, for example. And being successful at it demands that you publish your metrics and you publish your logs. All of these are things that a developer needs to worry about today in order to adopt containers, for example.And it's on the service providers to actually make it easier for the developers not to worry about these. And all of these are available automatically when you adopt a Kubernetes service. For example, in EKS, we are working with our managed Prometheus service teams inside Amazon, right—and also CloudWatch teams—to easily enable metrics and logging for customers without having to do a lot of heavy lifting.Corey: Let's talk a little bit about the competitive landscape here. One of my biggest competitors in optimizing AWS bills is Microsoft Excel, specifically, people are going to go ahead and run it themselves because, “Eh, hiring someone who's really good at this, that sounds expensive. We can screw it up for half the cost.” Which is great. 
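Eswar's "publish your metrics and publish your logs" point often starts with structured logs on stdout, where a collector can read them long after the container is gone. A minimal Python sketch (the JSON field names are one common convention, not a standard):

import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # One JSON object per line, so a collector can parse it after the
        # container that emitted it has been replaced.
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # containers log to stdout, not files
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger(__name__).info("checkout completed")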
It seems to me that one of your biggest competitors is people running their own control plane, on some level.I don't tend to accept the narrative that, “Oh, EKS is expensive,” when it winds up being, what, 35 bucks or 70 bucks or whatever it is per control plane per cluster on a monthly basis. Okay, yes, that's expensive if you're trying to stay completely within a free tier perhaps, but if you're running anything that's even slightly revenue-generating or a for-profit company, you will spend far more than that just on people's time. I have no problems—for once—with the EKS pricing model, start to finish. Good work on that. You've successfully nailed it. But are you seeing significant pushback from the industry of, “Nope, we're going to run our own Kubernetes management system instead because we enjoy pain, corporately speaking”?Eswar: Actually, we are in a good spot there, right? Like, at this point, customers who choose to run Kubernetes on AWS by themselves and not adopt EKS just fall into one main category, so—or two main categories: number one, they have an existing technical stack built on running Kubernetes themselves and they'd rather maintain that than move to EKS. Or they demand certain custom configurations of the Kubernetes control plane that EKS doesn't support. And those are the only two reasons why we see customers not moving into EKS and preferring to run their own Kubernetes clusters on AWS.[midroll 00:19:46]Corey: It really does seem, on some level, like there's going to be a… I don't want to say reckoning because that makes it sound vaguely ominous and that's not the direction that I intend for things to go in, but there has to be some form of collapsing of the complexity that is inherent to all of this because the entire industry has always done that. An analogy that I fall back on because I've seen this enough times to have the scars to show for it is that in the '90s, running a web server took about a week of spare time and an in-depth knowledge of GCC compiler flags. And then it evolved to ah, I could just unzip a tarball of precompiled stuff, and then RPM or Deb became a thing. And then Yum, or something else, or I guess apt over in the Debian land to wind up wrapping around that. And then you had things like Puppet where it was ensure installed. And now it's Docker Run.And today, it's a checkbox in the S3 console that proceeds to yell at you because you're making a website public. But that's neither here nor there. Things don't get harder with time. But I've been surprised by how I haven't yet seen that sort of geometric complexity collapse around Kubernetes to make it easier to work with. Is that coming or are we going to have to wait for the next cycle of things?Eswar: Let me think. I actually don't have a good answer to that, Corey.Corey: That's good, at least because if you did, I'd worry that I was just missing something obvious. That's kind of the entire reason I ask. Like, “Oh, good. I get to talk to smart people and see what they're picking up on that I'm absolutely missing.” I was hoping you had an answer, but I guess it's cold comfort that you don't have one off the top of your head. But man, is it confusing.Eswar: Yeah. So, there are some discussions in the community out there, right? Like, is Kubernetes the right layer to interact with? And there's some tooling that's built on top of Kubernetes, for example, Knative that tries to provide a serverless layer on top of Kubernetes, for example. 
There are also attempts at abstracting Kubernetes completely and providing tooling that just completely removes any sort of Kubernetes API out of the picture, and maybe a specific CI/CD-based solution that takes it from the source and deploys the service without even showing you that there's Kubernetes underneath, right?All of these are evolutions that are being tested out there in the community. Time will tell whether these end up sticking. But what's clear here is the gravity around Kubernetes. All sorts of tooling that gets built on top of Kubernetes, all the operators, all sorts of open-source initiatives that are built to run on Kubernetes. For example, Spark, for example, Cassandra, so many of these big, large-scale, open-source solutions are now built to run really well on Kubernetes. And that is the gravity that's pushing Kubernetes at this point.Corey: I'm curious to get your take on one other space I would consider interestingly competitive. Now, because I have a domain problem, if you go to kubernetestheeasyway.com, you'll wind up on the ECS marketing page. That's right, the worst competition in the world: the people who work down the hall from you.If someone's considering using ECS, Elastic Container Service versus EKS, Elastic Kubernetes Service, what is the deciding factor when a customer's making that determination? And to be clear, I'm not convinced there's a right or wrong answer. But I am curious to get your take, given that you have a vested interest, but also presumably don't want to talk complete smack about your colleagues. But feel free to surprise me.Eswar: Hey, I love ECS, by the way. Like I said, I started my life at AWS in ECS. So look, ECS is a hugely successful container orchestration service. I know we talk a lot about Kubernetes, I know there's a lot of discussions around Kubernetes, but I do want to make it a point that, like, ECS is a hugely successful service. Now, what determines which way customers go?If customers are… if the customer's tech stack is entirely on AWS, right, they use a lot of AWS services and they want an easy way to get started in the container world that has really tight integration with other AWS services without them having to configure a lot, ECS is the way, right? And customers have actually seen terrific success adopting ECS for that particular use case. Whereas EKS customers, they start with, “Okay, I want an open-source solution. I really love Kubernetes. I lo—or, I have a tooling that I really like in the open-source land that really works well with Kubernetes. I'm going to go that way.” And those kinds of customers end up picking EKS.
It's becoming the de facto container orchestration API. And with all the open-source tooling that's continuing to be built on top of Kubernetes, and the CNCF tooling ecosystem that has spawned to support Kubernetes adoption, all of this is solid proof that Kubernetes is here to stay and is a really strong, powerful API for customers to adopt.Corey: So, four years ago, I had a prediction on Twitter, and I said, “In five years, nobody will care about Kubernetes.” And it was in February, I believe, and every year, I wind up updating and incrementing a link to it, like, “Four years to go,” “Three years to go,” and I believe it expires next year. And I have to say, I didn't really expect when I made that prediction for it to outlive Twitter, but yet, here we are, which is neither here nor there. But I'm curious to get your take on this. But before I wind up just letting you savage the naive interpretation of that, my impression has been that it will not be that Kubernetes has gone away. That is ridiculous. It is clearly in enough places that even if they decided to rip it out now, it would take them ten years, but rather that it's going to slip below the surface level of awareness.Once upon a time, there was a whole bunch of energy and drama and debate around the Linux virtual memory management subsystem. And today, there's, like, a dozen people on the planet who really have to care about that, but for the rest of us, it doesn't matter anymore. We are so far past having to care about that having any meaningful impact in our day-to-day work that it's just, it's the part of the iceberg that's below the waterline. I think that's where Kubernetes is heading. Do you agree or disagree? And what do you think about the timeline?Eswar: I agree with you; that's a perfect analogy. It's going to go the way of Linux, right? It's here to stay; it's just going to get abstracted out if any of the abstraction efforts are going to stick around. And that's where we're testing the waters. There are many, many open-source initiatives there trying to abstract Kubernetes. All of these are yet to gain ground, but there are some reasonable efforts being made.And if they are successful, they just end up being a layer on top of Kubernetes. Many of the customers, many of the developers, don't have to worry about Kubernetes at that point, but a certain subset of us in the tech world will need to deal with Kubernetes, and most likely teams like mine that end up managing and operating their Kubernetes clusters.Corey: So, one last question I have for you is that if there's one thing that AWS loves, it's misspelling things. And you have an open-source offering called Karpenter spelled with a K that is extending that tradition. What does Karpenter do and why would someone use it?Eswar: Thank you for that. Karpenter is one of my favorite launches in the last year.Corey: Presumably because you were terrible at the spelling bee back when you were a kid. But please tell me more.Eswar: [laugh]. So, Karpenter is an open-source, flexible, and high-performance cluster auto-scaling solution. So basically, when your cluster needs more capacity to support your workloads, Karpenter automatically scales the capacity as needed. For people that know the Kubernetes space well, there's an existing component called Cluster Autoscaler that fills this space today. And it's our take on, okay, what if we could reimagine the capacity management solution available in Kubernetes? And can we do something better? 
Especially for cases where we expect terrific performance at scale to enable cost efficiency and optimization use cases for our customers, and most importantly, provide a way for customers not to pre-plan a lot of capacity to begin with.Corey: This is something we see a lot, in the sense of very bursty workloads where, okay, you're going to steady state load. Cool. Buy a bunch of savings plans, get things set up the way you want them, and call it a day. But when it's bursty, there are challenges with it. Folks love using Spot, but in the event of a sudden capacity shortfall, the question is, is can we spin up capacity to backfill it within those two minutes that we have a warning on that on? And if the answer is no, then it becomes a bit of a non-starter.Customers have had to build an awful lot of those things around EC2 instances that handle a lot of that logic for them in ways that are tuned specifically for their use cases. I'm encouraged to see there's a Kubernetes story around this that starts to remove some of that challenge from the customer side.Eswar: Yeah. So, the burstiness is where complexity comes [here 00:29:42], right? Like many customers for steady state, they know what their capacity requirements are, they set up the capacity, they can also reason out what is the effective capacity needed for good utilization for economical reasons and they can actually pre plan that and set it up. But once burstiness comes in, which inevitably does it at [unintelligible 00:30:05] applications, customers worry about, “Okay, am I going to get the capacity that I need in time that I need to be able to service my customers? And am I confident at it?”If I'm not confident, I'm going to actually allocate capacity beforehand, assuming that I'm going to actually get the burst that I needed. Which means, you're paying for resources that you're not using at the moment. And the burstiness might happen and then you're on the hook to actually reduce the capacity for it once the peak subsides at the end of the [day 00:30:36]. And this is a challenging situation. And this is one of the use cases that we targeted Karpenter towards.Corey: I find that the idea that you're open-sourcing this is fascinating because of two reasons. One, it does show a willingness to engage with the community that… again, it's difficult. When you're a big company, people love to wind up taking issue with almost anything that you do. But for another, it also puts it out in the open, on some level, where, especially when you're talking about cost optimization and decisions that affect cost, it's all out in public. So, people can look at this and think, “Wait a minute, it's not—what is this line of code that means if it's toward the end of the month, crank it up because we might need to hit our numbers.” Like, there's nothing like that in there. At least I'm assuming. I'm trusting that other people have read this code because honestly, that seems like a job for people who are better at that than I am. But that does tend to breed a certain element of trust.Eswar: Right. It's one of the first things that we thought about when we said okay, so we have some ideas here to actually improve the capacity management solution for Kubernetes. Okay, should we do it out in the open? And the answer was a resounding yes, right? 
I think there's a good story here that enables not just AWS to offer these ideas out there, right; we want to bring it to all sorts of Kubernetes customers.And one of the first things we did was to architecturally figure out all the core business logic of Karpenter, which is, okay, how to schedule better, how quickly to scale, what are the best instance types to pick for this workload. All of that business logic was abstracted out from the actual cloud provider implementation. And the cloud provider implementation is super simple. It's just creating instances, deleting instances, and describing instances. And it's something that we baked in from the get-go so it's easier for other cloud providers to come in and to add their support to it. And we as a community actually can take these ideas forward in a much faster way than just AWS doing it.Corey: I really want to thank you for taking the time to speak with me today about all these things. If people want to learn more, where's the best place for them to go?Eswar: The best place to learn about EKS, right, as EKS evolves, is our documentation; we have an EKS newsletter that you can go subscribe to, and you can also find us on GitHub where we share our product roadmap. So, those are great places to learn about how EKS is evolving and also for sharing your feedback.Corey: Which is always great to hear, as opposed to, you know, in the AWS Console, where we live, waiting for you to stumble upon us, which, yeah. No, it's good that you have a lot of different places for people to engage with you. And we'll put links to that, of course, in the [show notes 00:33:17]. Thank you so much for taking the time to speak with me. It's appreciated.Eswar: Corey, really appreciate you having me.Corey: Eswar Bala, Director of Engineering for Amazon EKS. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice telling me why, when it comes to tracking Kubernetes costs, Microsoft Excel is in fact the superior experience.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
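Eswar describes Karpenter's core scheduling logic as sitting on top of a deliberately thin cloud provider seam: create instances, delete instances, describe instances. Karpenter itself is written in Go; this is a toy Python sketch of that shape, with invented method names rather than Karpenter's actual interface:

from typing import Protocol

class CloudProvider(Protocol):
    # The thin seam: everything above it stays provider-agnostic.
    def create_instances(self, instance_type: str, count: int) -> list[str]: ...
    def delete_instances(self, instance_ids: list[str]) -> None: ...
    def describe_instances(self, instance_ids: list[str]) -> list[dict]: ...

def scale_up(provider: CloudProvider, pending_pods: int, pods_per_node: int) -> list[str]:
    # Toy version of the provider-agnostic core: work out how many nodes the
    # unschedulable pods need, then ask the seam for exactly that many.
    nodes_needed = -(-pending_pods // pods_per_node)  # ceiling division
    return provider.create_instances("m5.large", nodes_needed)

Keeping the seam that narrow is what lets another cloud implement three methods and reuse the rest.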
const podcast = {
  episode: 228,
  title: 'Web Apps and Site Reliability Engineering',
  topics: [ 'reliability', 'web apps', 'user focused' ],
  guest: 'Brian Love',
  hosts: [ 'John Papa', 'Ward Bell' ]
};
Recording date: March 23, 2023
John Papa @John_Papa
Ward Bell @WardBell
Dan Wahlin @DanWahlin
Craig Shoemaker @craigshoemaker
Brian Love @Brian_love
Brought to you by AG Grid and IdeaBlade
Resources:
Google Books on SRE
What is SRE
Introduction to Site Reliability Engineering (SRE)
Reliable systems in DevOps
Ping test
Voting with your feet
What is an SLA
Service Level Objectives and Indicators
SLA vs SLO vs SLI
SLIs, SLOs, and SLAs, oh my
Interview with Dave Rensin, SRE Engineering Director, on the SRE Workbook
The Origins of SRE
What it means to be an SRE
Get Polaris (SRE tool)
Send Beacon API
GitHub Copilot X
Prompt Engineering
Learn with Introduction to Prompt Engineering
Timejumps:
00:29 Welcome
01:37 Guest introduction
02:55 What is SRE?
05:38 What is it like if you don't have an SRE?
09:29 Sponsor: Ag Grid
10:36 Available vs reliable
13:35 Is SRE the same as health monitoring?
21:29 Sponsor: IdeaBlade
22:30 How do I make sure I don't cause more reliability issues?
27:36 Who's providing the infrastructure?
31:04 Where's the AI in all of this?
33:59 Final thoughts
Podcast editing on this episode done by Chris Enns of Lemon Productions.
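Since the episode spends time on SLA vs. SLO vs. SLI, one back-of-the-envelope helper: an SLO implies an error budget, the amount of bad time you can absorb in a window before the objective is blown. A generic sketch, not something from the episode:

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    # A 99.9% SLO over 30 days allows (1 - 0.999) * 43,200 = 43.2 bad minutes.
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    return error_budget_minutes(slo, window_days) - bad_minutes

print(error_budget_minutes(0.999))    # 43.2
print(budget_remaining(0.999, 10.0))  # 33.2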
Lex Neva, Staff Site Reliability Engineer at Honeycomb and Curator of SRE Weekly, joins Corey on Screaming in the Cloud to discuss reliability and the life of a newsletter curator. Lex shares some interesting insights on how he keeps his hobbies and side projects separate, as well as the intrusion that open-source projects can have on your time. Lex and Corey also discuss the phenomenon of newsletter curators being much more demanding of themselves than their audience typically is. Lex also shares his views on how far reliability has come, as well as how far we have to go, and the critical implications reliability has on our day-to-day lives. About LexLex Neva is interested in all things related to running large, massively multiuser online services. He has years of SRE, Systems Engineering, tinkering, and troubleshooting experience and perhaps loves incident response more than he ought to. He's previously worked for Linden Lab, DeviantArt, Heroku, and Fastly, and currently works as an SRE at Honeycomb while also curating the SRE Weekly newsletter on the side.Lex lives in Massachusetts with his family including 3 adorable children, 3 ridiculous cats, and assorted other awesome humans and animals. In his copious spare time he likes to garden, play tournament poker, tinker with machine embroidery, and mess around with Arduinos.Links Referenced: SRE Weekly: https://sreweekly.com/ Honeycomb: https://www.honeycomb.io/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by our friends at Chronosphere. Tired of observability costs going up every year without getting additional value? Or being locked into a vendor due to proprietary data collection, querying, and visualization? Modern-day, containerized environments require a new kind of observability technology that accounts for the massive increase in scale and attendant cost of data. With Chronosphere, choose where and how your data is routed and stored, query it easily, and get better context and control. 100% open-source compatibility means that no matter what your setup is, they can help. Learn how Chronosphere provides complete and real-time insight into ECS, EKS, and your microservices, wherever they may be at snark.cloud/chronosphere that's snark.cloud/chronosphere.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Once upon a time, I decided to start writing an email newsletter, and well, many things happened afterwards, some of them quite quickly. But before that, I was reading a number of email newsletters in the space. One that I'd been reading for a year at the time, was called SRE Weekly. It still comes out. I still wind up reading it most weeks.And it's written by Lex Neva, who is not only my guest today but also a staff site reliability engineer at Honeycomb. Lex, it is so good to finally talk to you, other than reading emails that we send to the entire world that pass each other like ships in the night.Lex: Yeah. I feel like we should have had some kind of meeting before now. But yeah, it's really good to [laugh] finally meet you.Corey: It was one of the inspirations that I had. 
And to be clear, when I signed up for your newsletter originally—I was there for issue 15, which is many, many years ago—I was also running a small-scale SRE team at the time. It was, I found as useful as a part of doing my job and keeping abreast of what was going on in the ecosystem. And I found myself, once I went independent, wishing that your newsletter and a few others had a whole bunch more AWS content. Well, why doesn't it?And the answer is because you are, you know, a reasonable person who understands that mental health is important and boundaries exist for a reason. No one sensible is going to care that much about one cloud provider all the time [sigh]. If only we were all that wise.Lex: Right? Well, [laugh] well, first of all, I love your newsletter, and also the content that you write that—I mean, I would be nowhere without content to link to. And I'm glad you took on the AWS thing because, much like how I haven't written Security Weekly, I also didn't write any kind of AWS Weekly because there's just too much. So, thanks for falling on that sword.Corey: I fell on another one about two years ago and started the Thursdays, which are Last Week in AWS Security. But I took a different bent on it because there are a whole bunch of security newsletters that litter the landscape and most of them are very good—except for the ones that seem to be entirely too vendor-captured—but the problem is, is that they lacked both a significant cloud focus, as well as an understanding that there's a universe of people out here who care about security—or at least should—but don't have the word security baked into their job title. So, it was very insular, using acronyms they assume that everyone knows, or it's totally vendor-captured and it's trying to the whole fear, uncertainty, and doubt thing, “And that's why you should buy this widget.” “Will it solve problems?” “Well, it'll solve our revenue problems at our company that sells the widgets, but other than that, not really.” And it just became such an almost incestuous ecosystem. I wanted something different.Lex: Yeah. And the snark is also very useful [laugh] in order to show us that you're not in their pocket. So yeah, nice work.Corey: Well, I'll let you in on a secret, now that we are—what, I'm somewhat like 300 and change issues in, which means I've been doing this for far too long, the snark is a byproduct of what I needed to do to write it myself. Because let's face it, this stuff is incredibly boring. I needed to keep myself interested as I started down that path. And how can I continually keep it fresh and funny and interesting, but not go too far? That's a fun game, whereas copying and pasting some announcement was never fun.Lex: Yeah, that's not—I hear you on trying to make it interesting.Corey: One regret that I've had, and I'm curious if you've ever encountered this yourself because most people don't get to see any of this. They see the finished product that lands in their inbox every Monday, and—in my case, Monday; I forget the exact day that yours comes out. I collect them and read through them for them all at once—but I find that I have often had caused a look back and regret the implicit commitment in Last Week in AWS as a name because it would be nice to skip a week here and there, just because either I don't particularly feel like it, or wow, there was not a lot of news worth talking about that came out last week. But it feels like I've forced myself onto a very particular treadmill schedule.Lex: Yeah. 
Yeah, it comes with, like, calling it SRE Weekly. I just followed suit for some of the other weeklies. But yeah, that can be hard. And I do give myself permission to take a week off here and there, but you know, I'll let you in on a secret.What I do is I try to target eight to ten articles a week. And if I have more than that, I save some of them. And then when it comes time to put out an issue, I'll go look at what's in that ready queue and swap some of those in and swap some of the current ones out just so I keep things fresh. And then if I need a week off, I'll just fill it from that queue, you know, if it's got enough in it. So, that lets me take vacations and whatnot. Without that, I think I would have had a lot harder of a time sticking with this, or there just would have been more gaps. So yeah.Corey: You're fortunate in that you have what appears to be a single category of content when you construct your newsletter, whereas I have three that are distinct: AWS releases and announcements and news and things to make fun of for the past week; the things from the larger community folks who do not work there, but are talking about interesting approaches or news that is germane; and then ideally a tip or a tool of the week. And I found, at least lately, that I've been able to build out the tools portion of it significantly far in advance. Because a tool that makes working with AWS easier this week is probably still going to be fairly helpful a month from now.Lex: Yeah, that's fair. Definitely.Corey: But putting some of the news out late has been something of a challenge. I've also learned—by getting it wrong—that I'm holding myself to a tighter expectation of turnaround time than any part of the audience is. The Thursday news is all written the week before, almost a full week beforehand and no one complains about that. I have put out the newsletter a couple of times an hour or two after its usual 7:30 pacific time slot that it goes out in; not a single person has complained. In one case, I moved it by a day to accommodate an announcement but didn't explain why; not a single person emailed in. So, okay. That's good to know.Lex: Yeah, I've definitely gotten to, like, Monday morning, like, a couple of times. Not much, not many times, but a couple of times, I've gotten a Monday morning be like, “Oh, hey. I didn't do that thing yesterday.” And then I just release it in the morning. And I've never had a complaint.I've cancelled last minute because life interfered. The most I've ever had was somebody emailing me and be like, you know, “Hope you feel better soon,” like when I had Covid, and stuff like that. So, [laugh] yeah, sometimes maybe we do hold ourselves to a little bit of a higher standard than is necessary. I mean, there was a point where I got—I had major eye surgery and I had to take a month off of everything and took a month off the newsletter. And yeah, I didn't lose any subscribers. I didn't have any complaints. So people, I think, appreciate it when it's there. And, you know, if it's not there, just wait till it comes out.Corey: I think that there is an additional challenge that I started feeling as soon as I started picking up sponsors for it because it's well, but at this point, I have a contractual obligation to put things out. And again, life happens, but you also don't want to have to reach out on apology tours every third week or whatnot. 
And I think that's in part due to the fact that I have multiple sponsors per issue and that becomes a bit of a juggling act logistically on this end. Lex: Yeah. When I started, I really didn't think I necessarily wanted to have sponsors because, you know, it's like, I have a job. This is just for fun. It got to the point where it's like, you know, I'll probably stop this if there's not some kind of monetary advantage [laugh]. And having a sponsor has been really helpful. But I have been really careful. Like, I have always had only a single sponsor because I don't want that many people to apologize to. And that meant I took in maybe less money than I could have, but that's okay. And I was also very clear, you know, even from the start, having a contract that says I may miss a week without notice. And yes, they're paying in advance, but it's not for a specific range of time, it's for a specific number of issues, whenever those come out. That definitely helped to reduce the stress a little bit. And I think without that, you know, having that much over my head would make it hard to do this, you know? It has to stay fun, right? Corey: That's part of what kept me from, honestly, getting into tech for the first part of my 20s. It was the fear that I would be taking a hobby, something that I love, and turning it into something that I hated. Lex: Yeah, there is that. Corey: It's almost 20 years now and I'm still wondering whether I actually succeeded or not in avoiding hating this. Lex: Well, okay. But I mean, are you, you know, are you depressed? [unintelligible 00:09:16] So, there's this other thing, there's this thing that people like to say, which is like, "You should only do a job that you really love." And I used to think that. And I don't actually think that anymore. I think that it is important to have a job that you can do and not hate day-to-day, but there's no shame in not being passionate about your work and I don't think that we should require passion from anyone when we're hiring. And I think to do so is even, like, a form of privilege. So, you know, I think that it's totally fine to just do something because it pays the bills. Corey: Oh, absolutely. I find it annoying as hell when I'm talking to folks who are looking to hire for roles and, "Well, include a link to your GitHub profile," is a mandatory field. It's, well, great. What about people who work in places where they're not working on open-source projects as a result, and they can't really disclose what they're doing? And then there's the expectation that oh, well, outside of work, you should be doing public stuff, too. It's, I used to do a lot of public open-source style work on GitHub, but I got yelled at all the time for random, unrelated reasons and it's, I don't want to put something out there that I have to support and people start to ask me questions about. It feels like impromptu, unasked-for code review. No, thanks. So, my GitHub profile looks fairly barren. Lex: You mean like yelling at you, like, "Oh, you're not contributing enough." Or, you know, "We need this free thing you're doing, like, immediately," or that kind of thing? Corey: Worse than that.
The worst example I've ever had of this was when I was giving a talk called "Terrible Ideas in Git," and because I wanted to give some hilariously contrived demos that took a fair bit of work to set up, I got them ready to go inside of a Docker container, because I didn't trust that my laptop would always work and I might have to borrow someone else's. I pushed that image, called "Terrible Ideas," up to Docker Hub. And I wound up with people asking questions about it. Like, "Is this vulnerable to Shellshock?" And it's, "You do realize that this is intentionally designed to be awful? It is only for giving a very specific version of a very specific talk. It's in public just because I didn't bother to make it private. What are you doing? Please tell me you're not running this in production at a bank?" "No comment." Right. I don't want that responsibility of people yelling at me for things I didn't do on purpose. I want to get yelled at for the things I did intentionally. Lex: Exactly. It's funny that sometimes people expect more out of you when you're giving them something free versus when they're paying you for it. It's an interesting quirk of psychology that I'm sure that professionals could tell me all about. Maybe there's been research on it, I don't know. But yeah, that can be difficult. Corey: Oh, absolutely. I used to work at a web hosting company and the customers spending thousands a month with us were uniformly great. But there was always the lowest-tier customer of the cheapest thing that we offered who seemed to expect that that entitled them to 80 hours a month of support for engineering problems and whatnot. And it was not profitable to service some of those folks. I've also found that there's a real barrier that begins as soon as you find a way to charge someone a dollar for something. There's a bit of a litmus test of can you transfer a dollar from your bank account to mine? And suddenly, the entire tenor of the conversations with people who have crossed that boundary changes. I have toyed, on some level, with the idea of launching a version of this newsletter—or wondering if I retcon the whole thing—do I charge people to subscribe to this? And the answer I keep coming away with is not at all, because it started, in many respects, as marketing for AWS bill consulting and I want to grow the audience as fast as possible. Artificially limiting its distribution via a pay-for model just seemed a little on the strange side. Lex: Yeah. And then you're beholden to very many people and there's that disproportionality. So, years ago, before I even started my career in, I guess, you know, things that were SRE before SRE was cool, I worked for a living in Second Life. Are you familiar with Second Life? Corey: Oh, yes. I'm very familiar with that. Linden Lab. Lex: Yep. So, I worked for Linden Lab years later, but before I worked for them, I sort of spent a lot of my time living in Second Life. And I had a product that I sold for two or three dollars. And actually, it's still in there; you could still buy it. It's interesting. I don't know if it's because the purchase price was 800 Linden dollars, which equates to, like, $2.16, or something like that, but— Corey: The original cryptocurrency. Lex: Right, exactly. Except there's no crypto involved. Corey: [laugh]. Lex: But people seemed to have a disproportionate sense of, like, how much of my time they expected for support. You know, I'm going to support them a little bit.
You have to recognize at some point, I actually can't come give you a tutorial on using this product because you're one of 500 customers this month. And you gave me two dollars, and I don't have ten hours to give you. You know, like, sorry [laugh]. Yeah, so that can be really tough. Corey: And on some level, you need to find a way to either charge more or charge for support on top of it, or ideally—I wish more open-source projects would take this approach—"Huh. We've had 500 people asking us the exact same question. Should we improve our docs? No, of course not. They're the ones who are wrong. It's the children who are getting it wrong." I don't find that approach [laugh] to be particularly useful, but it bothers me to no end when I keep running into the same problem onboarding with something new and I ask about it, and, "Oh, yeah, everyone runs into that problem. Here's how you get around it." This would have been useful to mention in the documentation. I try not to ask questions without reading the manual first. Lex: Well, so there's a couple of different directions I could go with this. First of all, there's a really interesting thing that happened with the core-js project that I recommend people check out. The direction I think I'll go at the moment—we can bookmark that other one—is that I have an open-source project on the side that I kind of did for my own fun, which is a program for creating designs that can be processed by computer-controlled embroidery machines. So, this is sewing machines that can plot stitches in the x-y plane based on a program that you give it. And there really wasn't much in the way of open-source software available that could help you create these designs, and so I just sort of hacked something together, hacking with Python for my own fun, and then put it out there and open-sourced it. And it's kind of taken off, kind of, like, gotten a life of its own. But of course, I've got a newsletter, I've got three kids, I've got a family, and a day job, and I definitely hear you on the, like, you know, yeah, we should put this FAQ in the docs, but there can be so little time to even do that. And I'm finding that there's, like—you know, people talk about work-life balance, there's, like, work slash life slash open-source balance that you really—you know, you have to, like, balance all three of them. And a lot of weeks, I don't have any time to spend on the project. But you know what, it still kicks along and people just kind of, they use my terrible little project [laugh] as best they can, even though it has a ton of rough edges. I'm sorry, everyone, I'm so sorry. I know it has a t—the UI is terrible. But yeah, it's interesting how these things sometimes take on a life of their own and you can feel dragged along by your own open-source work, you know? Corey: It always bothers me—I think this might tie back to the core-js issue you talked about a second ago—where there are people who are building and supporting open-source tools or libraries that they originally constructed to scratch an itch, and now they are core dependencies of basically half the internet. And these people are still wondering on some level, how do I put food on the table this month? It's wild to me. If there were justice in the world, you'd start to think these people would wind up in never-have-to-work-again-if-they-don't-want-to positions. But in many cases, it's exactly the opposite. Lex: Well, that's the really interesting thing.
So, first of all, I'm hugely privileged to have any time to get to work on open-source. There's plenty of people that don't, and yeah, so requiring people to have a GitHub link to show their open-source contributions is inherently unfair and biased and discriminatory. That aside, people have asked all along, like, "Lex, this is decent software, you could sell this. You could charge money for this thing and you could probably make a, you know, a decent living at this." And I categorically refuse to accept money for that project because I don't want to have to support it on a commercial level like that. If I take your money, then you have an expectation—especially if I charge what one would expect. So, this software: part of the reason I decided to write my own is because the commercial competitors start at two-hundred-some-odd dollars and go up into the five- to ten-thousand-dollar range. For a software package. Mine is free. If I started charging money, then yeah, I'm going to have to build a support department and we're going to have a knowledge base, I'm going to have to incorporate. I don't want to do that for something I'm doing for fun, you know? So yeah, I'm going to keep it free and terrible [laugh]. Corey: It becomes that thing where something you love turns into something you hate without your even noticing that it happens. Or at least something that you start to resent. Lex: Yeah. I don't think I would necessarily hate machine embroidery because I love it. It's an amazingly fun little quirky hobby, but I think it would definitely take away some of the magic for me. Right now there's no stress at all; I can spend months noodling on an algorithm, getting it right, whereas, you know, if I start having to have deliverables, it changes it entirely. Yeah. Corey: It's odd, it seems, on some level, too, that the open-source world that I got started with has evolved in a whole bunch of different ways. Whereas it used to be write a quick fix for something and it would get merged, in many cases by the time you got back from lunch. And these days, it seems like it takes multiple weeks, especially with a corporate-controlled open-source project, and there's so much back and forth. And even getting the boilerplate, like the CLA—the Contributor License Agreement—aside and winding up getting other people to sign off on it, then there's back and forth, in some cases for weeks, about, well, the right kind of test coverage and how to look at this and the right holistic framework. And I appreciate that there is validity and value to these things, but is that where the bulk of the effort should be going when there's a pull request ready to go that solves a breaking customer problem? "But the test coverage isn't right, so we're going to delay it for two or three releases." It's, what are you doing there? Someone lost the plot somewhere. And I'm sure there are reasons that make sense, given the framework people are operating within. I just find it maddening from the side of having to [laugh] deal with this as a human. Lex: Yeah, I hear you. And it sometimes can go even beyond test coverage to something like code style, you know? It's like, "Oh, that's not really in the style of this project," or, "You know, I would have written it this way." And one thing I've had to really work on, on this project, is to make it as inviting to developers as possible. I have to sometimes look at things and be like, yeah, I might do that a different way. But does that actually matter?
Like, do I have a reason for that that really matters or is it just my style? And maybe because it's a group project, I should just be like, no, that's good as it is. [midroll 00:20:23] Corey: So, you've had an interesting career. And clearly you have opinions about SRE as a result. When I started seeing that you were the author of SRE Weekly, years ago, I just assumed something that, it turns out, isn't true. Is it possible that you have been contributing to the community around SRE, but somehow have never worked at Google? Lex: I have never worked at Google. I have never worked at Netflix. I've never worked at any of those big companies. The biggest company I've worked for is Salesforce. Although I worked for Heroku, which had been bought by Salesforce a couple of years prior, and so it was kind of like working for a startup inside a big company. And here's the other thing. I created that newsletter two months after starting my first job where I had a—like, the first job in which I was titled 'SRE.' So, that's possibly contentious right there. Corey: You know, I hadn't thought of it this way, but you're right. I did almost the exact same thing. I was no expert in AWS when I started these things. It came out of something I needed to do anyway: keeping in touch with everything that came out that had potential economic impact, which, it turns out, is most things, once you understand that architecture and cost are the same thing when it comes to cloud. But I was more or less gathering what smart people were saying. And somehow there's been this osmotic effect, where people start to view me as the wise old sage of the mountain when it comes to AWS. And no, no, no, I'm just old and grumpy. That just looks like wisdom. Don't mistake one for the other. But people will now seek me out to get my opinion on things and I have no idea what the answer looks like for most of the stuff. But that's the old SRE model—or sysadmin model that I've followed, which is when you don't know the answer, well, how do you get to a place where you can find the answer? How do you troubleshoot this? Click the button. It doesn't work? Well, time to start taking the button apart to figure out why. Lex: Yeah, definitely. I hear you on the people thing. So, first of all, thanks to everyone who writes the articles that I include. I would be nothing without—I mean, literally, I could not have a newsletter without content creators. I also kind of started the newsletter as an exploration of this new career title. I mean, I've been doing things that basically fit along with SRE for a long time, but also, I think my view of SRE might not really be the same as a lot of folks', or, like, the one that Google passed down from the [Google Book Model 00:22:46]. I don't—I'm going to be a little heretical here—I don't necessarily a hundred percent believe in the SLI/SLO/SLA error budget model. I don't think that necessarily fits everyone, and I'm not sure it even suits the bigger companies as well as they think it does. I think that there's a certain point past which you can't actually predict failure. And just slowing down on your deploys so that there are fewer incidents, so that you can, you know, go back to passing your error budget, to passing your SLO, I'm not sure that actually makes sense or is realistic and works in the real world. Corey: I've been left with the distinct impression that it's something of a framework for how to think about a lot of those things.
And for folks at a certain point in their development, along whatever maturity model or maturity curve you want to talk about, it becomes extraordinarily useful. And at some point, it feels like the path that a given company is on will deviate from that. And, on some level, if you don't wind up addressing it, it turns into what it seems like Agile did, where you wind up with the Cult of Agile around it and the entire purpose of it is to perpetuate the Cult of Agile. And I don't know that I'm necessarily willing to go so far as to say that's where SLOs are headed right now, but I'm starting to get the same sort of feeling around the early days of the formalization of frameworks like that, and the ex cathedra proclamation that this is right for everyone. So, I'm starting to wonder whether there's a reckoning, in that sense, coming down the road. I'm fortunate that I don't run anything that's production-facing, so for me, it's, I don't have to care about these things. Mostly. Lex: Yeah. I mean, we are in… we're in 2023. Things have come so much further than when I was a kid. I have a little computer in my pocket. Yeah, you know, "Hey, math teacher, turns out yeah, we do carry calculators around with us wherever we go." We've built all these huge, complicated systems online and built our entire society around them. We're still in our infancy. We still don't know what we're doing. We're still feeling out what SRE even is, if it even makes sense, and I think there's—yeah, there's going to be more evolution. I mean, there's been the, like, "what is DevOps" thing: people coining the term DevOps and then having it, you know, almost immediately subsumed or turned into whatever other people want. Same thing for observability. I think same thing for SRE. So honestly, I'm feeling it out as I go and I think we all are. And I don't think anyone really knows what we're doing. And I think that the moment we feel like we do is probably where we're in trouble. Because this is all just so new. Look where we were even 40 years, 30, even 20 years ago. We've come really far. Corey: For me, one of the things that concerns slash scares me has been that once someone learns something and it becomes rote, it sort of crystallizes in amber within their worldview, and they don't go back and figure out, "Okay, is this still the right approach?" Or, "Has the thing that I know changed?" And I see this on a constant basis just because I'm working with AWS so often. And there are restrictions and things you cannot do and constraints that the cloud provider imposes on you. Until one day, that thing that was impossible is now possible and supported. But people don't keep up with that, so they still operate under the model of what used to be. I still remember, a year or so after they raised the global per-resource tag limit to 50, I was seeing references to only ten tags being allowed per resource in the AWS console, because not even internal service teams are allowed to talk to each other over there, apparently. And if they can't keep it straight internally, what hope do the rest of us have? It's the same problem of once you get this knowledge solidified, it's hard to keep current and adapt to things that are progressing. Especially in tech, where things are advancing so rapidly and so quickly. Lex: Yeah, I gather things are a little feudalistic over there inside AWS, although I've never worked there, so I don't know. But it's also just so big.
I mean, there's just—like, do you even know all of the—like, I challenge you to go through the list of AWS services. I bet you're going to find one you don't know about. Maybe that's a challenge I would lose, but it's so hard to keep track of all this stuff with how fast it's changing that I don't blame people for not getting that. Corey: I would agree. We've long since passed the point where I can talk incredibly convincingly about AWS services that do not exist and not get called out on it by AWS employees. Because who would just go and make something up like that? That would be psychotic. No one in their right mind would do it. "Hi, I'm Corey, we haven't met yet. But you're going to remember this, whether I want you to or not, because I make an impression on people. Oops." Lex: Yeah. Mr. AWS Snark. You're exactly who I would expect to do that. And then there was Hunter, what's his name? The guy who made the—[singing] these are the many services of AWS—song. That was pretty great, too. Corey: Oh, yeah. Forrest Brazeal. He was great. I loved having him in the AWS community. And then he took a job as head of content over at Google Cloud. It's, well, suddenly, you can't very well make fun of AWS anymore, not without it taking a very different tone. So, I feel like that's our collective loss. Lex: Yeah, definitely. But yeah, I feel like we've done amazing things as a society, but the problem is that we're still, like, at the level of, we don't know how to program the VCR as far as, like, trying to run reliable services. It's really hard to build a complex system that, by its nature of being useful for customers, must increase in complexity. Trying to run that reliably is hugely difficult, and trying to do so profitably is almost impossible. And then I look at how hard that is and then I look at people trying to make self-driving cars. And I think that I will never set foot in one of those things until I see us getting good at running reliable services. Because if we can't do this with all of these people involved, how do I expect that a little car is going to be—that they're going to be able to produce a car that can drive and understand the complexities of navigating around and all the hazards that are involved to keep me safe? Corey: It's wild to me. The more I learn about the internet, the more surprised I am that any of it works at all. It's like, "Well, at least you're only using it for ridiculous things like cat pictures, right?" "Oh, no, no, no. We do emergency services and banking and insurance on top of that, too." "Oh, good. I'm sure that won't end horribly one day." Lex: Right? Yeah. I mean, you look at, like—you look at how much of a concerted effort towards safety they've had to put in, in the aviation industry, to go from where they were in the '70s and '80s to where we are now, where it's so incredibly safe. We haven't made that kind of full industry push toward reliability and safety. And it's going to have to happen soon, as more and more of the services we're building are, exactly as you say, life-critical. Corey: Yeah, the idea of having this stuff be life-critical means you have to take a very different approach to it than you do when you're running, I don't know, Twitter for Pets. Though, I probably need a new fake reference startup now that Twitter for reality is becoming more bizarre than anything I can make up.
But the idea that, "Well, our ad network needs to have the same rigor and discipline applied to it as the life support system," maybe that's the wrong framing. Lex: Or maybe it's not. I keep finding instances of situations—maybe not necessarily ad networks, although I wouldn't put it past them—but situations where a system that we're dealing with becomes life-critical when we had no idea that it possibly could. So, for example, a couple companies back, there was this billing situation where a vendor of ours accidentally billed our customers incorrectly and wiped out bank accounts, and real people were unable to make their mortgage payments, and, like, their bank accounts were empty, so they couldn't buy food. Like, that's starting to become life-critical, and it all came down to a single, like, this could have been any outage at any company. And that's going to happen more and more, I think. Corey: I really want to thank you for taking time to speak with me. If people want to learn more, where's the best place for them to find you? Lex: sreweekly.com. You can subscribe there. Thank you so much for having me on. It has been a real treat. Corey: It really has. You'll have to come back and we'll find other topics to talk about, I'm sure, in the very near future. Thank you so much for your time. I appreciate it. Lex: Thanks. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
This week Adam talks with Marcin Kurc about chasing the 9s. Marcin is the Co-founder and CEO of Nobl9, where they build tools for managing service level objectives, aka SLOs. We also talk about service level agreements (SLAs), service level indicators (SLIs), error budgets, and monitoring, and how it all comes together to help teams align on goals, improve customer satisfaction, manage risks, increase transparency, and of course, a favorite around here…continuous improvement. Kaizen! This is an awesome deep dive into the world of chasing those 9s, and how teams are leveraging SLOs to earn the trust of their customers as well as showcase transparency.
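For readers who want the arithmetic behind "chasing the 9s": each extra 9 in an SLO target shrinks the allowed downtime, the error budget, by a factor of ten. Below is a minimal illustrative sketch; the function name and numbers are ours, not from the episode:

```python
from datetime import timedelta

def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed by an SLO over a compliance window."""
    return window * (1.0 - slo_target)

window = timedelta(days=30)
for target in (0.99, 0.999, 0.9999):  # two, three, and four 9s
    print(f"{target:.2%} over 30 days allows {error_budget(target, window)} of downtime")

# Budget left after, say, a 25-minute outage against a 99.9% target:
remaining = error_budget(0.999, window) - timedelta(minutes=25)
print(f"Error budget remaining: {remaining}")
```

The numbers make the framing concrete: a 99.99% target leaves barely four minutes a month, which is why teams negotiate targets rather than defaulting to more 9s.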
Guest: Ana Oprea, Staff Security Engineer, European Lead of Vulnerability Coordination Center @ Google Topics: What is the scope for the vulnerability management program at Google? Does it cover OS, off-the-shelf applications, custom code we wrote … or all of the above? Our vulnerability prioritization includes a process called "impact assessment." What does our impact assessment for a vulnerability look like? How do we prioritize what to remediate? How do we decide on the speed of remediation needed? How do we know if we've done a good job? When we look backwards, what are our critical metrics (SLIs and SLOs) and how high up the security stack is the reporting on our progress? What of the "Google Approach" should other companies not try to emulate? Surely some things work because of Google being Google, so what are the weird or surprising things that only work for us? Resources: SRS Book, Chapter 20: Understanding Roles and Responsibilities and Chapter 21: Building a Culture of Security and Reliability Why Google Stores Billions of Lines of Code in a Single Repository SRE book and SRE Workbook "How Google Secures Its Google Cloud Usage at Massive Scale" (ep107) "Is This Binary Legit? How Google Uses Binary Authorization and Code Provenance" (ep66) "How We Scale Detection and Response at Google: Automation, Metrics, Toil" (ep75)
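To make the "critical metrics (SLIs and SLOs)" idea concrete for vulnerability management: one common shape for such a metric is the fraction of vulnerabilities remediated within a severity-based deadline. The sketch below is a hypothetical illustration; the deadlines, names, and data are invented for the example and are not Google's actual process:

```python
from dataclasses import dataclass

# Hypothetical remediation deadlines (in days) by severity; a real program
# would set its own. The SLI: fraction of vulns fixed within their deadline.
DEADLINE_DAYS = {"critical": 7, "high": 30, "medium": 90}

@dataclass
class Vuln:
    severity: str
    days_to_fix: int  # days from triage to confirmed remediation

def remediation_sli(vulns: list[Vuln]) -> float:
    on_time = sum(1 for v in vulns if v.days_to_fix <= DEADLINE_DAYS[v.severity])
    return on_time / len(vulns)

found = [Vuln("critical", 3), Vuln("critical", 12), Vuln("high", 20), Vuln("medium", 45)]
print(f"SLI: {remediation_sli(found):.0%} remediated on time")  # 75%; an SLO might demand 95%
```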
Tracking Mean Time To Restore (MTTR) is standard industry practice for incident response and analysis, but should it be? Courtney Nash, an Internet Incident Librarian, argues that MTTR is not a reliable metric - and we think she's got a point. We caught up with Courtney at the DevOps Enterprise Summit in Las Vegas, where she was making her case against MTTR in favor of alternative metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents. Show Notes: Check out Courtney's The Void podcast. Learn how to increase your dev team's efficiency by up to 30% at our free Scaling Developer Efficiency workshop on February 15th. Register: linearb.io/efficiency Support the show: Subscribe to our Substack Follow us on YouTube Review us on Apple Podcasts or Spotify Follow us on Twitter or LinkedIn Offers: Learn about Continuous Merge with gitStream Want to try LinearB? Book a Demo & use discount code "Dev Interrupted Podcast"...
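A quick way to see one statistical problem with MTTR of the kind Nash points to: incident durations tend to be heavily skewed, so the mean is dominated by rare long outages and says little about the typical incident. A small sketch with fabricated numbers of our own:

```python
from statistics import mean, median

# Fabricated restore times in minutes: nine quick fixes and one long outage.
restore_minutes = [8, 9, 9, 10, 11, 11, 12, 13, 14, 480]

print(f"MTTR (mean): {mean(restore_minutes):.1f} min")  # 57.7 -- pulled up by the outlier
print(f"Median TTR:  {median(restore_minutes):.1f} min")  # 11.0 -- the typical incident
```

Neither summary number tells you why the 480-minute incident was hard, which is the larger argument for incident analysis over single aggregate metrics.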
Software Engineering Radio - The Podcast for Professional Software Developers
Alex Hidalgo, principal reliability advocate at Nobl9 and author of Implementing Service Level Objectives, joins SE Radio's Robert Blumen for a discussion of service-level objectives (SLOs) and error budgets. The conversation covers the meaning...
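One concept from the SLO literature that pairs naturally with error budgets is the burn rate: how fast the budget is being consumed relative to plan. This is our own minimal sketch under assumed numbers, not code from the episode or the book:

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How many times faster than planned the error budget is burning.
    bad_fraction: observed fraction of failed requests in the window."""
    budget_fraction = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return bad_fraction / budget_fraction

# 0.5% of requests failing against a 99.9% SLO burns budget at 5x plan,
# so a 30-day budget would be exhausted in about 6 days.
rate = burn_rate(bad_fraction=0.005, slo_target=0.999)
print(f"Burn rate: {rate:.1f}x; 30-day budget gone in {30 / rate:.0f} days")
```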
Laura Maguire is a Researcher at Jeli.io, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Victoria talks to Laura about incident management, giving companies a powerful tool to learn from their incidents, and what types of customers are ideal for taking on a platform like Jeli.io. Jeli.io (https://www.jeli.io/) Follow Jeli.io on Instagram (https://www.instagram.com/jeli_io/), Twitter (https://twitter.com/jeli_io) or LinkedIn (https://www.linkedin.com/company/jeli-inc/). Follow Laura Maguire on Twitter (https://twitter.com/LauraMDMaguire) or LinkedIn (https://www.linkedin.com/in/lauramaguire/). Follow thoughtbot on Twitter (https://twitter.com/thoughtbot) or LinkedIn (https://www.linkedin.com/company/150727/). Become a Sponsor (https://thoughtbot.com/sponsorship) of Giant Robots! Transcript: VICTORIA: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Victoria Guido. And with me today is Laura Maguire, Researcher at Jeli, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Laura, thank you for joining me. LAURA: Thanks for having me, Victoria. VICTORIA: This might be a very introductory level question but just right off the bat, what is an incident? LAURA: What we find is a lot of companies define this very differently across the space, but typically, it's where they are seeing an impact, either a customer impact or a degradation of their service. This can be either formal, in that it impacts their SLOs or their SLAs, or informal, in that it's something that someone on the team notices, or something one of their users notices, as degraded performance or something not working as intended. VICTORIA: Gotcha. From my background being in IT operations, I'm familiar with incidents, and it's been a practice in IT for a long time. But what brought you to be a part of building this platform and creating a product around incidents? LAURA: I am a, let's say, recovering safety professional. VICTORIA: [chuckles] LAURA: I started my career in the safety and risk management realm within natural resource industries in the physical world. And so I worked with people who were at the sharp end in high-risk, high-consequence type work. And they were really navigating risk and navigating safety in the real world. And as I was working in this domain, I noticed that there was a delta between what was said to create safety and help risk management and what I was actually seeing with the people that I was working with on the front lines. And so I started to pull the thread on this, and I thought, is work as done really the same as work as written or work as prescribed? And what I found was a whole field of research, a whole field of practice around thinking about safety and risk management in the world of cognitive work. And so this is how people think about risk, how they manage risk, and how they interpret change and events in the world around them. And so as I started to do my master's degree in human factors and system safety and then later my Ph.D.
in cognitive systems engineering, I realized that whether you are on the frontlines of a wildland fire or you're on the frontlines of responding to an incident in the software realm, the ways in which people detect, diagnose, and repair the issues that they're facing are quite similar in terms of the cognitive work. And so when I was starting my Ph.D. work, I was working with Dr. David Woods at the Cognitive Systems Engineering Lab at The Ohio State University. And I came into it, and I was thinking I'm going to work with astronauts, or with fighter pilots, or emergency room doctors, these really exciting domains. And he was like, "We're going to have you work with software engineers." And at first, I really failed to see the connection there, but as I started to learn more about site reliability engineering, about DevOps, about the continuous deployment, continuous integration world, I realized software engineers are really at the forefront of managing critical digital infrastructure. They're keeping up the systems that run society, both for recreation and pleasure, in the sense of Netflix, for example, as well as the critical functions within society like our 911 call routing systems and our financial markets. And so the ability to study how software engineers detect outages, manage outages, and work together collaboratively across the team was really giving us a way to study this kind of work that could actually feed back into other types of domains like emergency response, like emergency rooms, and even back to the fighter pilots and astronauts. VICTORIA: Wow, that's so interesting. And so did the research that went into your Ph.D. help you define the product strategy and kind of market fit for what you've been building at Jeli? LAURA: Yeah, absolutely. So Nora Jones, who is the founder and CEO of Jeli, reached out to me at a conference and told me a little bit about what she was thinking about, about how she wanted to support software engineers using a lot of this literature and a lot of the learnings from these other domains to build this product to help support incident management in software engineering. So we base a lot of our thinking around how to help support this cognitive work and how to support resilient performance in these very dynamic, constantly changing, large-scale, you know, distributed software systems on this research, as well as the research that we do with our own users and with members of the Learning From Incidents in Software Engineering Slack community that Nora and several other fairly prominent names within the software community started: Lorin Hochstein, John Allspaw, Dr. Richard Cook, Jessica DeVita, Ryan Kitchens, and I may be missing someone else but...and myself, oh, Will Galego as well. Yeah, we based a lot of this on really deep qualitative understandings of what work is like for software engineers when they're, you know, in continuous deployment type environments. And we've translated this into building a product that we think helps but doesn't hinder, that doesn't get in the way of engineers while they're under time pressure and there's a lot of uncertainty. And there's often quite a bit of stress involved with responding to incidents. VICTORIA: Right. And you mentioned resilience engineering. And for those who don't know, David Woods, whom you worked with on your Ph.D., wrote "Resilience Engineering: Concepts and Precepts."
So maybe you could talk a little bit about resilience engineering and what that really means, not just in technology but in the people who are running the tools, right? LAURA: Yeah. So resilience engineering is different from how we think about protecting and defending our software systems. And it's different in the sense that we aren't just thinking about how do we prevent incidents from happening again, like, how do we fix things that have happened to us in the past? But how do we better understand the ways in which our systems operate under a wide variety of conditions? So that includes normal operating conditions as well as abnormal or anomalous operating conditions, such as during an incident response. And so resilience engineering was kind of this way of thinking differently about predicting failure, about managing failure, and navigating these kinds of worlds. And one of the fundamental differences about it is it sees people as being the most adaptive component within the system of work. So we can have really good processes and practices around deploying code; we can institute things like cross-checking and peer review of code; we can have really good, robust backup and failover systems; but ultimately, it's very likely that in these kinds of complex and adaptive, always-changing systems, you're going to encounter problems that you weren't able to anticipate. And so this is where the resilience part comes in because if you're faced with a novel problem, if you're faced with an issue you've never seen before, or a hidden dependency within your system, or an unanticipated failure mode, you have to adapt. You have to be able to take all of the information that's available to you in the moment. You have to interpret that in real-time. You have to think of who else might have skills, knowledge, expertise, access to information, or access to certain kinds of systems or software components. And you have to bring all of those people together in real-time to be able to manage the problem at hand. And so this is really quite a different way of thinking about supporting this work than just let's keep the runbooks updated, and let's make sure that we can write prescriptive processes for everything that we're going to encounter. Because this really is the difference I was talking about earlier, about work as done versus work as prescribed. The rules don't cover all of the situations. And so you have to think of how do you help people adapt? How do you help people access information in real-time to be able to handle unforeseen failures? VICTORIA: Right. That makes a lot of sense. It's an interesting evolution of site reliability engineering where you're thinking about the users' experience of your site. It's also thinking about the people who are running your site and what their experience is, and what freedom they have to be able to solve the problems that you wouldn't be able to predict, right? LAURA: Yeah, it's a really good point, actually, because there is sort of this double layer in the product that we are building. So, as you mentioned earlier, we are an incident analysis platform, and so what does that mean? Well, it means that we pull in data whenever there's been an incident, and we help you to look at it a little bit more deeply than you may if you're just following a template and sort of reconstructing a timeline.
And so we pull in the actual Slack data from, you know, say, an ops channel or an incident channel that's been spun up following a report of degraded performance or of an outage. And we look very closely at how did people talk to one another? Who did they bring into the incident? What kinds of things did they think were relevant and important at different points in time? And in doing this, it helps us to understand what information was available to people at different points in time. Because after the incident and after it's been resolved, people often look back and say, "Oh, there's nothing we can learn from that. We figured out what it was." But if we go back and we start looking at how people detected it, how they diagnosed it, who they brought into the event, we can start to unpack these patterns and these ways of understanding how do people work together? What information is useful at different points in time? Which helps us get a deeper understanding of how our systems actually work and how they actually fail. VICTORIA: Right. And I see there are a few different ways the platform does that: there's a narrative builder, a people view, and also a visual timeline. So, do you find that combining all those things together really gives companies a powerful tool to learn from their incidents? LAURA: Yeah. So let me talk a little bit about each of those different components. With our MVP of the product, we started out with this understanding of the incident analyst and the incident investigator who, you know, was ready to dive in and ready to understand their incident and apply some qualitative analysis techniques to thinking about their incidents. And what we found was there are a number of these people who are really interested in this deep dive within the software industry. But there's a broader set of folks that they work with who maybe only do these kinds of incident analyses every once in a while, and they're not as interested in going quite as deep. And so the narrative builder is really this kind of bridge between those two types of users. And what it does is help construct a timeline, which is typically what most companies do to drive the discussion that they might have in a post-mortem or to drive their kind of findings in their summary report. And it helps them take this closer look at the interactions that happened in that Slack transcript and raise questions about what kinds of uncertainties there were, point out who was involved, or interesting aspects of the event at that point in time. And it helps them to summarize what was happening. What did people think was happening at this point in time to create this story about the incident? And the story element is really important because we all learn from stories. It helps bring to life some of the details about what was hard, who was involved, how did they get brought in, what the sources of technical failure were, and whether those were easy or difficult to understand and to repair once the source of the failure was actually understood. And so that narrative builder helps reconstruct this timeline in a much richer way but also do it very efficiently. And as you mentioned, the visual timeline is something that we've created to help that lightweight user or that every-once-in-a-while user go a little bit deeper on their analysis. And how we do that is because it lays out the progression of the event in a way that helps you see, oh, this maybe wasn't straightforward.
We didn't detect it in the beginning, and then diagnose it, and then repair it at the end. What happened actually was the detection was intermittent. The signals about what was going wrong were intermittent, and so that was going on in parallel with the diagnosis. The diagnosis took a really long time, and that may have been because we can also see the repair was happening concurrently. And so it starts to show these kinds of characteristics about whether the incident was difficult, whether it was challenging and hard, or whether it was simple and straightforward. This helps lend a bit more depth to metrics like MTTR and TTD by saying, oh, there was a lot more going on in this incident than we initially thought. The last thing that you mentioned was the people view, and so that really sets our product apart from other products in that we look at the sociotechnical system. So it's not just about the software that broke; it is about who was involved in managing that system, in repairing that system, and in communicating about that system outwardly. And so the people view kind of pulls in some HR data. It helps us to understand who was involved. How long have they been in their role? Were they on-call? Were they not on-call? And other kinds of relevant details that show us what their engagement or their interaction with this event was. And so when we start to bring in the socio part of the sociotechnical system, we can identify things like what knowledge do we have within the organization? Is that knowledge well-distributed, or is it just isolated in one or two people, such that those people are constantly getting pulled into incidents when they may not be on-call? Which can start to show us whether or not these folks are in danger of burning out or whether their knowledge might need to be transferred more broadly throughout the organization. So this is kind of where the resilience piece comes in because it helps us to distribute knowledge. It helps us to identify who is relevant and useful and how do they partner and collaborate with other people, and their knowledge and skill sets, to be able to manage some of the outages that they face? VICTORIA: That's wonderful because one of my follow-up questions would be, as a CEO, as a founder, what kind of insights or choices do you get to make now that you have this insight to help make your team more resilient? [laughs] LAURA: So if this is a manager, or a founder, or a CEO that is looking at their data in Jeli, they can start to understand how to resource their teams more appropriately, as I mentioned, how to spread that knowledge around. They can start to see what parts of their system are creating the most problems, or what parts of their system they have less insight into: how it works, how it interacts with other parts of the system, and what this actually means for their ability to meet their SLOs or their SLAs. So it gives you a more in-depth understanding of how your business is actually operating on both the technical side of things, as well as on the people side of things. VICTORIA: That makes a lot of sense. Thank you for that overview of the platform. There's the incident analysis platform, and you also have the bot, the response chatbot. Can you tell me a little bit more about that? LAURA: Yeah, absolutely. We think that incident management should be conducted wherever your work actually takes place, and so for most of our customers and a lot of folks that we know about in the industry, that's Slack.
And so, if you are communicating in real-time with your team in Slack, we think that you should stay there. And so, we built this incident management bot that is free and will be free for the lifetime of the product, because we think that this is really the fundamental basis for helping you manage your incidents more efficiently and more effectively. So it's a pretty lightweight bot. It gives kind of some guardrails or some guidance around collaboration by spinning up a new incident channel, helping you to bring the right kinds of responders into that, and helping you to communicate to interested stakeholders by broadcasting to channels they might be in. It kind of nudges you to think about how to communicate about what's happening during different stages of the event progression. And so it's prompting you in a very lightweight way; hey, do you have a status update? Do you have a summary of what the current thinking is? What are the hypotheses about what's going on? Who's conducting what kinds of activities right now? So that if I'm a responder that's coming into the event 20 or 30 minutes after it started, I can very quickly come up to speed, understand what's going on, who's doing what, and figure out what's useful for me to do to help step in and not disrupt the incident management that's underway right now. Our users can choose to use the bot independently of the incident analysis platform. But of course, being able to ingest that incident into Jeli helps you understand who's been involved in the incident, if they've been involved in similar incidents in the past, and helps you start to see some patterns and some themes that emerge over time when you start to look at incidents across the organization. VICTORIA: That makes sense. And I love that it's free and that there's something for every type of organization to take advantage of there. And I wonder if at Jeli you have data about what type of customer would be targeted or really ideal for taking on this kind of platform. LAURA: So most organizations...I was actually recently at SREcon EMEA, and there was a really interesting series of talks; one was SRE for Enterprise, and the next talk was SRE for Startups. And so it was a very thought-provoking discussion around whether SRE, site reliability engineering, is for everyone. Even smaller teams are starting to have to be responsible for reliability and responsible for running their service. And so we kind of have built our platform thinking about how do we help not just big enterprises or organizations that may have dedicated teams for this, but also small startups, to learn from their incidents. So internally, we actually call incidents opportunities, as in, they are learning opportunities for checking out how does your system actually work? How do your people work together? What things were difficult and challenging about the incident? And how do you talk about those things as a team to help create more resilient performance in the future? So in terms of an ideal customer, it's really folks that are interested in conducting these sorts of lightweight but in-depth looks at how their system actually works on both the people side of things and the technical side of things. Those who we found are most successful with our product are interested not so much in figuring out who did the thing and who they can blame for the incident itself, but rather in how they learn from what happened.
And would another engineer, or another product owner, another customer service representative, whoever the incident may be sort of focused around, would another person in their shoes have taken the same actions that they took or made the same decisions that they made? Which helps us understand, from a systems level, how do we repair or how do we adjust the system of work surrounding folks so that they are better supported when they're faced with uncertainty, or with that kind of time pressure, or that ambiguity about what's actually going on? VICTORIA: And I love that you said that because part of the reason [laughs] I invited you on to the podcast is that a lot of companies I have experience with don't think about incidents until one happens to them, and then it can be a scramble. It can impact their customer base. It can stress their team out. But if you go about creating...the term obviously you all use is psychological safety on your team, and maybe you use some of the free tools from Jeli like the Post-Incident Guide and the Incident Analysis 101 blog to set your team up for success from the beginning, then you can increase your customer loyalty, and your team's loyalty to the company as well. Is that your experience? LAURA: Yeah, absolutely. So one thing that I have learned throughout my career, you know, starting way back in forestry and looking at safety and risk in that domain, was as soon as there is an accident or even a serious near miss, right away, everybody gets sweaty palms. Everybody is concerned about, uh-oh, am I going to get blamed for this? Am I going to get fired? Am I going to get publicly shamed for the decisions that I made when I was in this situation? And what that response, that reaction does is it drives a lot of the communication and a lot of the understanding of the conditions that that person was in. It drives that underground. And it's important to allow people to talk about here's what I was seeing, here's what I was experiencing because, in these kinds of complex systems, information is not readily available to people. The signals are not always coming through loud and clear about what's going on or about what the appropriate actions to take are. Instead, it's messy; it's loud, it's noisy. There are usually multiple different demands on that person's attention and on their time, and they're often managing trade-offs: do I keep the system down so that I can gather more information about what's actually going on, or do I just try and bring it up as quickly as I can so that there's less impact to users? Those kinds of decisions are having to be made under pressure. So when we create these conditions of psychological safety, when we say you know what? This happened. We want to learn from it. We've already made this investment. Richard Cook mentioned in the very first SNAFU Catchers Report, which was a report that came out of Ohio State, that incidents are unplanned investments into understanding how your system works. And so you've already had the incident. You've already paid the price of that downtime or of that outage. So you might as well extract some learning from it so that you can help create a safer and more resilient system in the future. So by helping people to reconstruct what was actually happening in real-time, not what they were retrospectively saying, "Oh, I should have done this," well, you didn't do that.
So let's understand why you thought at that moment in time that was the right way to respond because, more than likely, other people in that same position would have made that same choice. And so it helps us to think more broadly about ways that we can support decision-making and sense-making under conditions of stress and uncertainty. And ultimately, that helps your system be more resilient and be more reliable for your customers. VICTORIA: What a great reframing: unplanned investment. [laughs] And if you don't learn from it, then you're going to lose out on what you've already invested that time in resolving it, right? LAURA: Absolutely. MID-ROLL AD: Are you an entrepreneur or start-up founder looking to gain confidence in the way forward for your idea? At thoughtbot, we know you're tight on time and investment, which is why we've created targeted 1-hour remote workshops to help you develop a concrete plan for your product's next steps. Over four interactive sessions, we work with you on research, product design sprint, critical path, and presentation prep so that you and your team are better equipped with the skills and knowledge for success. Find out how we can help you move the needle at: tbot.io/entrepreneurs. VICTORIA: Getting more into that psychological safety and how to create that culture where people feel safe talking about what really happened, how does that relate to...Jeli says that they are a people software. [laughs] Talk to me more about that. Like, what advice do you give founders and CEOs on how to create that psychological safety, which makes them more resilient in these types of incidents? LAURA: So you mentioned the Howie Guide that we published last year, and this is our guidance around how to do incident analysis, how to help your team start to learn from their incidents, and Howie stands for "how we got here." And that's really important, that language, because what it says is there's a history that led up to this incident. And most teams, when they've had an outage, they'll kind of look backwards from that outage, maybe an hour, maybe a day, maybe to the last deploy. But they don't think about how the decisions got made to use that piece of software in the first place. They don't think about how engineers actually got onboarded to being on-call. They don't necessarily think about what kinds of skills, and knowledge, and expertise we're looking for when we're hiring a "DevOps engineer," and I'm using air quotes here, or an "SRE." What kinds of skills and knowledge do they actually have? Those are very broad terms. And what it means to be a DevOps engineer or an SRE is quite underspecified. And so the knowledge behind the folks that you might hire into the company is going to necessarily be very diverse. It's going to be partial and incomplete in many ways because not everyone can know everything about the system. And so, we need to have multiple diverse perspectives about how the system works, how our customers use that system, what kinds of pressures and constraints exist within our company that allow us some possibilities over others. We need to bring all of those perspectives together to get a more reflective picture of what was actually happening before this incident took place and how we actually got here. This reframing helps a lot of people disarm that initial defensive response, or that initial, "oh, shoot, I'm going to get in trouble for this" kind of response. And it says to them, "Hey, you're a part of this bigger system of work.
You are only one piece of this puzzle. And what we want to try and do is understand what was happening within the company, not just what you did, what you said, and what you decided." So once people realize that you're not just trying to find fault or place blame, but you're really trying to understand their work, and you're trying to understand their work with other teams and other vendors, and trying to understand their work relative to the competing demands that were going on, those are some of the things that help create psychological safety. About ten years ago, John Allspaw and the team at Etsy put out The Etsy Debriefing Facilitation Guide, which also poses a number of questions and helps to frame the post-incident learnings in a way that moves it from the individual and looks more collectively at the company as a whole. And so these things are helpful for founders or for CEOs to help bring forward more information about what's really going on, more information about what are the real risks and threats and opportunities within the company, and gives you an opportunity to step back and do what we call microlearning, which is sharing knowledge about how the system works, sharing understandings of what people think is going on, and what people know about the system. We don't typically talk about those things unless there's a reason to, and incidents kind of give us that reason because they're uncomfortable and they can be painful. They can be very public. They can be very disruptive to what we think about how resilient and reliable we actually are. And so if you can kind of step away from this defensiveness and step away from this need to place blame and instead try and understand the conditions, you will get a lot more learning and a lot more resilience and reliability out of your teams and out of your systems. VICTORIA: That makes sense to me. And I'd like to draw a connection between that and some other things you mentioned with The 2022 Accelerate State of DevOps Report that highlights that the people who are often responding to those incidents or in that high-stress situation tend to be historically underrepresented or historically excluded groups. And so do you see that having this insight into both who is actually taking on a lot of the work when these incidents happen and creating that psychological safety can make a better environment for diversity, equity, inclusion at a company as well? LAURA: Well, I think anytime you work to establish trust and transparency, and you focus on recognizing the skills that people do have, the knowledge that they do have, and not over-assuming that someone knows something or that they have been involved in the discussions that may have been relevant to an incident, anytime you focus on that trust and transparency you are really signaling to people within your organization that you value their contributions and that you recognize that they've come to work trying to do a good job. But they have multiple competing demands on their attention and on their time. And so we're not making assumptions about people being complacent, or people being reckless or being sloppy in their work. So that creates an environment where people feel more willing to speak up and to talk about some of the challenges that they might face, to talk about the ways in which it's not clear to them how certain parts of the system work or how certain teams actually operate. So you're just opening the channels for communication, which helps to share more knowledge.
It helps to share more information about what teams are doing at different points in time. And this helps people anticipate how a change they might be making in their part of the system could influence upstream or downstream teams. And so this creates more resilience because now you're thinking laterally about your system and about your involvement across teams and across boundary lines. An example of this (and this is a story that Nora tells quite a bit) is a marketing team launching a Super Bowl commercial for their company without telling the engineers on-call that it's about to happen. You can create all sorts of breakdowns when all of a sudden you have a surge of traffic to your website because people see the Super Bowl commercial and want to go to the site, and then you have a single person trying to respond to that in real time. Instead, when you do start thinking about that trust and transparency, you're helping teams help each other and think more broadly about how their work is actually impacting other parts of the system. So from a diversity, inclusion, and underrepresented-groups perspective, this is creating the conditions for more people to be involved, more people to feel like their voice is going to be heard and that their perspective actually matters.

VICTORIA: That sounds really powerful, and I'm glad we were able to touch on that. Shifting gears a little bit, I wanted to ask two different questions; the first is, if you could travel back in time to when Jeli first started, what advice would you give your past self?

LAURA: I would encourage myself to recognize that our ability to experiment is fundamental to our ability to learn. And learning is what helps us to iterate faster. Learning is what helps us to reflect on the tool or the feature that we're building and what it actually means to our users. I actually borrowed that advice from Zoran Perkov, CEO of the Long-Term Stock Exchange. They launched a whole new stock market during the pandemic with a fully remote team, and I had interviewed him for an article that I wrote about resilient leadership. And he said to me, "My job as a CEO is 100% about protecting our ability to experiment as a company because if we stop learning, we're not going to be able to iterate. We're not going to be able to adapt to the changes that we see in the market and in our users." So I think I would tell myself to continually experiment. One of the things I talk to our customers about a lot, because many of them are implementing new incident management programs or trying to level up their engineering teams around incident analysis, is that this doesn't have to be a fully fleshed-out program where you know all of the ways in which it's going to unfold. It's really about trying experiments: conduct some training, start small. Do one incident analysis on a particularly spicy incident that you may have had, or a really challenging incident where a lot of people were surprised by what happened. Bring together that group and say, "Hey, we're going to try something a little bit different here. We'll use some questions from the Howie Guide. We'll use the format and the structure from the Etsy Debriefing Guide. And we're just going to try and learn what we can about this event. We're not going to try and place blame. We're not going to try and generate corrective actions.
We just want to see what we can learn from this." Then ask the people who were involved, "How did this go? What did we learn from it? What should we do differently next time?" And continually iterate on those small, little experiments so that you can grow your product and grow your team's capacity. I think it took us a little bit of time to figure that out within the organization, but once we did, we were able to collaborate and work more effectively by integrating the feedback that we were getting from our users. And then the last piece of advice I would give myself is to really invest in cross-discipline coordination and collaboration. Engineers, designers, researchers, CEOs: they all have a different view of the product. They all have a different understanding of what the goals and priorities are. And those mental models of the product and of what the right thing to do is are constantly changing. They all have different language that they use to talk about the product and about their processes for integrating this understanding of the changing conditions and the changing user into the product. So I would say invest in establishing common ground across the different disciplines within your team: be able to talk about what people are seeing, and be able to stop and identify when we're making assumptions about what other people know or what other people's orientation towards the problem or the product is. And spend a little bit of time saying, "When I say this is important, I'm saying it's important because of XYZ," not just "this is important." Spending a little bit of time elaborating on what your mental model is and where you're drawing from can help teams work more effectively together across those disciplines.

VICTORIA: That's pretty powerful advice. You're iterating and experimenting at Jeli. What's on the horizon? What new experiments are you excited about?

LAURA: One of the things that has been front and center for us since we started is this idea of cross-incident analysis. We've built out a number of different features within the product: being able to tag the incident with the relevant services and technologies that were involved, being able to identify which teams were involved, and being able to identify different kinds of themes or patterns that emerge from individual incidents. All of this data, which comes mostly from the ingested incident itself, the incident that you bring into Jeli, but also from the analysis that you do on it, helps us start to see across incidents what's happening, and not just on the technical side of things. So is it always Travis that is causing a problem? Are there components that work together that have these really hidden and strange interdependencies that are really hard for the team to cope with? What kinds of themes are emerging across your suite of opportunities, the incidents that you've ingested? One of the things we're starting to see from those experiments is the ability to look at where the knowledge islands are within your organization. Do you have an engineer who, if they were to leave, would take with them the majority of your knowledge about your database, or your users, or some critical aspect of your system, so that all of that tacit knowledge would disappear? Or are there engineers who work really effectively together during really difficult incidents?
And so you can start to unpack the characteristics of the people, the teams, and the technologies that offer either opportunities or threats to your organization. So basically, what we're doing is helping you see how your system performs under different kinds of conditions, which, as a safety and risk professional who has worked in a variety of domains for the last 15 years, I think is really where the rubber hits the road in helping teams be more reliable, more resilient, and more proactive about where investments in maintenance, training, or headcount are going to have the biggest bang for your buck.

VICTORIA: That makes a lot of sense. In my experience, sometimes those decisions are made more on intuition or on limited data, so having a fuller picture to rely on probably produces better results. [laughs]

LAURA: Yeah, and I think that we all want to be data-driven, thinking not only about the quantitative data (how many incidents do we have around certain parts of the system, certain teams, or certain services?) but also about the qualitative side of things: what does this actually mean? And what does it mean for our ability to grow and change over time and to scale? The partnership of that quantitative data and qualitative data means we're being data-driven on a whole other level.

VICTORIA: Wonderful. And it seems like we're getting close to the end of our time here. Is there anything else you want to give as a final takeaway to our listeners?

LAURA: Yeah. As a field, software engineering is increasingly responsible for critical infrastructure within society, and we have a responsibility to our users and to each other within our companies to help make work better and to make our services more reliable and more resilient over time. There are a variety of lessons we can learn from other domains. As I mentioned before, aviation, healthcare, nuclear power: all of those domains have been thinking about supporting cognitive work and supporting frontline operators. And we can learn from the history and the literature that exist out there. There is a GitHub repo that Lorin Hochstein has curated with a number of other folks within the industry that points to some of these resources. As well, we'll be hosting the first Learning From Incidents in Software Engineering Conference in Denver on February 15th and 16th. One feature of this conference that I'm super excited about is affectionately called CasesConf. It is going to be an opportunity for software engineers from a variety of organizations to tell real stories about incidents they had: how they handled them, what was challenging, what went surprisingly well, and what is actually going on within their organizations. This is kind of a new thing for the software industry, talking very publicly about failures and sharing the messy details of our incidents. It won't be a recorded part of the conference. It is going to be conducted under the Chatham House Rule, which means participants who are in the room while these stories are being told can share the stories but not any identifying details about the company or the engineers involved. And these kinds of real-world situations help us, as I talked about before with psychological safety, to say: this is the reality of operating complex systems. They're going to fail.
We're going to have to learn from them. And the more that we can talk at an industry level about what's going on and about what kinds of things are creating problems or opportunities for each other, the more we're going to be able to lift the bar for the industry as a whole. So you can check out register.learningfromincidents.io for more information about the conference. And we can link Lorin's resilience engineering GitHub repo in the notes as well.

VICTORIA: Wonderful. Well, I was looking for an excuse to come to Denver in February anyways.

LAURA: We would love to have ya.

VICTORIA: Thank you. And thank you so much for taking time to share with us today, Laura. You can subscribe to the show and find notes along with a complete transcript for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @victori_ousg. This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time.

ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

Special Guest: Laura Maguire.
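Laura's description of cross-incident analysis, tagging incidents with services and responders and then looking for knowledge islands, lends itself to a small illustration. The sketch below is not Jeli's product or API; it is a minimal, hypothetical Python example (the incident records, the knowledge_islands helper, and the 75% threshold are all invented) of how that kind of aggregation can surface a service whose incident response depends heavily on one person.

from collections import defaultdict

# Hypothetical ingested incidents, each tagged with services and responders.
incidents = [
    {"id": "INC-101", "services": ["database"], "responders": ["priya"]},
    {"id": "INC-102", "services": ["database", "api"], "responders": ["priya", "sam"]},
    {"id": "INC-103", "services": ["database"], "responders": ["priya"]},
    {"id": "INC-104", "services": ["api"], "responders": ["sam", "alex"]},
]

def knowledge_islands(incidents, threshold=0.75):
    """Flag services where one responder accounts for at least `threshold`
    of all incident involvements: a possible single point of knowledge."""
    counts = defaultdict(lambda: defaultdict(int))
    for inc in incidents:
        for svc in inc["services"]:
            for person in inc["responders"]:
                counts[svc][person] += 1
    islands = {}
    for svc, people in counts.items():
        total = sum(people.values())
        top_person, top_count = max(people.items(), key=lambda kv: kv[1])
        if top_count / total >= threshold:
            islands[svc] = top_person
    return islands

print(knowledge_islands(incidents))  # {'database': 'priya'}

On this toy data, priya accounts for three of the four database involvements, so that service is flagged; a real analysis would, as Laura notes, pair counts like these with the qualitative story of what those people actually did during the incidents.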
“If I wouldn't measure it, I wouldn't know it!” or “Build, Measure, Learn!” These quotes could be from any engineer building new digital services, observing them in production, and learning from that how to improve their software. They are, however, from Bernhard Dominguez, Digital Consultant at FACTOR, whom we invited to the show. Bernhard highlights a lot of parallels between his work planning and executing digital marketing strategies and the world we live in: designing, operating, and optimizing complex software systems. Tune in and learn how important it is to understand your real target groups (=end users), how to define clear goals (=SLOs), how to shift from campaign to funnel activities (=User Journeys), and why it is so important to get an outsider's opinion before implementing your next big project (=“We have always done it this way”)! If you want to follow up with Bernhard and his work, check out the following links we discussed during the podcast:
Bernhard on LinkedIn
FACTOR
Podcast (German): Newsletter Marketing
Podcast (German): Build - Measure - Learn
#179: For many decades, the use of service-level objectives (SLOs) in IT has been a routine part of day-to-day business. The objectives are based on measurable impacts that each individual customer or operational unit should experience with a specific IT service, and they are typically associated with costs and benefits that can be tracked over time. But what do you need to do if you are new to SLOs? What are the pitfalls of introducing SLOs into an environment where they have not yet been part of the culture? For those of you who have been working in IT for any length of time, SLOs have probably received more flak from different teams than almost any other idea in the space. Many people and teams still do not understand what SLOs are, how to use them, and why some companies use them and others do not. In this episode, we speak with Brian Singer, the Chief Product Officer at Nobl9, about the mechanics of implementing and maintaining SLOs in your organization.
Brian's contact information:
Twitter: https://twitter.com/brian_singer
LinkedIn: https://www.linkedin.com/in/briantsinger/
YouTube channel: https://youtube.com/devopsparadox/
Books and Courses: Catalog, Patterns, And Blueprints https://www.devopstoolkitseries.com/posts/catalog/
Review the podcast on Apple Podcasts: https://www.devopsparadox.com/review-podcast/
Slack: https://www.devopsparadox.com/slack/
Connect with us at: https://www.devopsparadox.com/contact/
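To make the mechanics discussed in this episode more concrete, here is a minimal sketch of the arithmetic that usually sits underneath an SLO: the error budget. This is not Nobl9's API; the function name, the request counts, and the 99.9% target are all invented for illustration.

def error_budget_remaining(slo_target, total_requests, failed_requests):
    """slo_target of 0.999 means at most 0.1% of requests may fail.
    Returns the remaining budget in absolute and fractional terms."""
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return remaining, remaining / allowed_failures

remaining, fraction = error_budget_remaining(0.999, 1_000_000, 600)
print(f"{remaining:.0f} failures left ({fraction:.0%} of budget)")
# 1,000,000 requests at a 99.9% SLO allow 1,000 failures;
# 600 have occurred, so 400 remain (40% of the budget).

The value of the budget framing is that it turns an abstract target into a spendable quantity: with 40% of the budget left, a team can decide whether to keep shipping or slow down and invest in reliability.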
The increasing complexity of software systems has made it challenging for service providers to maximize efficiency and reduce technical debt. Marcin Kurc, the Co-founder and CEO of Nobl9, claims that these companies can find help in the burgeoning market of SLOs (service-level objectives). Tune in to hear the software innovator explain why SLOs are saving service providers precious resources and helping them maintain customer satisfaction.
Definitions:
Five 9s: “The five-9s term indicates how many 9s there are in the uptime calculation. For example, saying something is up (and available) 99.99% of the time would be four nines. Five nines means something is available 99.999% of the time, or it only has downtime of 5.26 minutes per year.” (404 Tech Support article by Jason, 05/10/2012)
SLA (service-level agreement): “A service-level agreement (SLA) defines the level of service you expect from a vendor, laying out the metrics by which service is measured, as well as remedies or penalties should agreed-on service levels not be achieved. It is a critical component of any technology vendor contract.” (CIO article by Stephanie Overby, Lynn Greiner, and Lauren Gibbons Paul, 07/05/17)
SLO (service-level objective): “Within service-level agreements (SLAs), SLOs are the objectives that must be achieved — for each service activity, function and process — to provide the best opportunity for service recipient success.” (Gartner glossary)
SRE (site reliability engineering): “SRE leverages operations data and software engineering to automate IT operations tasks, and to accelerate software delivery while minimizing IT risk.” (IBM Cloud Education, 11/12/20)
Tune in to learn:
About Nobl9 and how it was founded (04:23)
About service-level objectives (SLOs) (06:33)
What the market demand is for SLOs (23:47)
Marcin's advice for startup founders (33:42)
IT Visionaries is brought to you by Salesforce Platform. If you love the thought leadership on this podcast, Salesforce has even more meaty IT thoughts to chew on. Take your company to the next level with in-depth research and trends right in your inbox. Subscribe to a newsletter tailored to your role at Salesforce.com/newsletter. Mission.org is a media studio producing content for world-class clients. Learn more at mission.org.
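The "five 9s" arithmetic in the definition above is easy to sanity-check. A few lines of Python (assuming a 365-day year, as the quoted 5.26-minute figure does) reproduce the downtime allowances for two through five nines:

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines in range(2, 6):
    availability = 1 - 10 ** -nines  # e.g. five nines -> 0.99999
    downtime = MINUTES_PER_YEAR * 10 ** -nines
    print(f"{availability:.5%} uptime -> {downtime:,.2f} minutes/year of downtime")

# Five nines: 525,600 * 0.00001 = 5.256 minutes/year,
# matching the ~5.26 minutes quoted above.

Each extra nine cuts the allowable downtime by a factor of ten, which is why moving from four nines (about 53 minutes per year) to five (about 5 minutes) is so much more expensive to operate.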