Podcast appearances and mentions of John Allspaw

Sep 2, 2025 7:30

Please enjoy this encore of Word Notes. The set of people, process, technology, and cultural norms that integrates software development and IT operations into a system-of-systems. CyberWire Glossary link: https://thecyberwire.com/glossary/devops Audio reference link: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr," by John Allspaw and Paul Hammond, Velocity 09, 25 July 2009.

devops velocity flickr noun paul hammond john allspaw

DevOps (noun)

Sep 2, 2025 7:30

devops velocity flickr noun paul hammond john allspaw

Agile Software Development Method (noun) [Word Notes]

Aug 19, 2025 7:45

Please enjoy this encore of Word Notes. A software development philosophy that emphasizes incremental delivery, team collaboration, continual planning, and continual learning. CyberWire Glossary link: https://thecyberwire.com/glossary/agile-software-development Audio reference link: "Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe," John Allspaw and Paul Hammond, 2009 Velocity Conference, YouTube, 25 June 2009.

method velocity noun agile software development paul hammond john allspaw velocity conference

Agile Software Development Method (noun)

Aug 19, 2025 7:45

method velocity noun agile software development paul hammond john allspaw velocity conference

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

Google SRE Prodcast

Dec 4, 2024 41:18

This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discuss how resilience appears in Software Engineering and SRE and explore the importance of understanding the human factors involved in adapting to system failures—highlighting the need for a more qualitative and holistic approach to understanding how engineers successfully adapt to system behavior and improving overall reliability.

principal rosenthal software engineering human factors sre complex systems site reliability engineering site reliability engineer john allspaw

Replay - Finding a Common Language for Incidents with John Allspaw

The Engineering Leadership Podcast

Nov 26, 2024 29:36

On this Screaming in the Cloud Replay, Corey is joined by John Allspaw, Founder/Principal at Adaptive Capacity Labs. John was foundational in the DevOps movement, but he's continued to bring much more to the table. He's written multiple books and seems to always be at the forefront, which is why he is now at Adaptive Capacity Labs. John tells us what exactly Adaptive Capacity Labs does, how it works, and how he convinced some heroes to get behind it. John brings much-needed insight into how to get multiple people in an organization, engineers and non-engineers alike, on the same level when it comes to dealing with incidents. John points out the issues surrounding public vs. private write-ups and the roadblocks they may put up. Adaptive Capacity Labs is working towards bringing those roadblocks down; tune in to find out how!

Show Highlights
(0:00) Introduction
(0:59) The Duckbill Group sponsor read
(1:33) What is Adaptive Capacity Labs and the work that they do?
(3:00) How to effectively learn from incidents
(7:33) What is the root of confusion in incident analysis
(13:20) Identifying if an organization has truly learned from their incidents
(18:23) Gitpod sponsor read
(19:35) Adaptive Capacity Labs' reputation for positively shifting company culture
(24:22) What the tech industry is missing when it comes to learning effectively from incidents
(28:44) Where you can find more from John and Adaptive Capacity Labs

About John Allspaw
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Links
The Art of Capacity Planning: https://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/1491939206/
Web Operations: https://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/
The DevOps Handbook: https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002/
Adaptive Capacity Labs: https://www.adaptivecapacitylabs.com
John Allspaw Twitter: https://twitter.com/allspaw
Richard Cook Twitter: https://twitter.com/ri_cook
Dave Woods Twitter: https://twitter.com/ddwoods2

Original Episode
https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/finding-a-common-language-for-incidents-with-john-allspaw/

Sponsors
The Duckbill Group: duckbillgroup.com
Gitpod: http://www.gitpod.io/

amazon art cloud identifying engineers cto etsy msc screaming aws devops velocity incidents human factors common language founder principal capacity planning paul hammond duckbill group john allspaw web operations last week in aws

585: From Ops to Dev and Back Again

Coder Radio

Aug 28, 2024 53:30

We reflect on the rise of DevOps and the frustrating dynamics that led to it. Plus, tech's latest bright idea: Roombas with attitude.

developers devops rails roomba docker foxconn procreate tdd motion capture chris fisher development podcast security vulnerabilities paul hammond john allspaw coder radio

Resilience engineering, learning from incidents and unintuitive perspectives on incident analysis w/ John Allspaw #116

Feb 7, 2023 42:38

We cover resilience engineering & learning from incidents with John Allspaw, former CTO @ Etsy and current Founder & Principal @ Adaptive Capacity Labs! Co-hosted by Kenji Kiuchi (Head of Quality and Performance @ Postman), this episode also addresses common unintuitive perspectives within resilience engineering, strategies for effective incident response / problem solving, how to identify current sources of resilience, and practical tips for implementing these resiliency tactics in your organization today.

ABOUT JOHN ALLSPAW
John Allspaw (@allspaw) has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

"The competitive advantage is not for a leader to say, 'Why did it take so long to restore this issue or resolve this outage?' A competitive advantage is, 'Oh my God, that is amazing. Tell me what made this hard and what are any of the things that made it difficult to resolve? Is there anything I can do to help get out of the way for people to do the work?'" - John Allspaw

ABOUT KENJI KIUCHI
Kenji Kiuchi (@dr_kiuchi) is Head of Quality and Performance at Postman, an API platform whose mission is to maximize everyone's creativity through the power of connected software. There he leads a global team with a focus on maximizing user delight and innovating the practice of testing. Before coming to Postman, he spent several years "Helping people get Jobs" at Indeed. There, he worked on scaling teams and practice to optimize engineering delivery as well as leading Diversity, Inclusion and Belonging initiatives as an Associate Site Director. Prior to Indeed, Kenji spent several years as an Engineering Manager at Twitter where he led Quality efforts across monetization, growth, infra and the delivery of live video. When Kenji isn't driving engineering excellence, he's driving his motorcycle, spending quality time with his 3 daughters, and mentoring leaders across the globe.

Check out our friends and sponsor, Jellyfish
To learn more about Jellyfish and how they can help you increase engineering satisfaction and create happier, higher-performing engineering teams... Learn more at Jellyfish.co/elc

SHOW NOTES:
John's perspective on production (4:27)
What drove John toward resilience engineering (6:22)
How complex systems relate to resilience engineering (9:23)
Differences between robustness and resilience (13:13)
The role of productive adaptation in resilience engineering (17:26)
Identify sources of resilience already present in your organization (22:52)
Examples of unintuitive perspectives involving incident analysis (27:15)
How to make room for unintuitive perspectives (31:41)
Practical tips for implementing resiliency tactics & understanding incidents (36:12)
Rapid fire questions (39:51)

LINKS AND RESOURCES
Learning From Incidents Conference 2023 - This is a forum for sharing stories of incidents, incident handling, and the learnings from software engineers who handle large-scale distributed software systems.
Hindsight and Sacrifice Decisions - Blog post on Adaptive Capacity Labs' reaction to the NYSE halting trading to resolve an issue
Using Language by Herbert H. Clark - Herbert Clark argues that language use is more than the sum of a speaker speaking and a listener listening. It is the joint action that emerges when speakers and listeners, writers and readers perform their individual actions in coordination, as ensembles. In contrast to work within the cognitive sciences, which has seen language use as an individual process, and to work within the social sciences, which has seen it as a social process, the author argues strongly that language use embodies both individual and social processes.
Papers We Love Talk
Visual Momentum

god head learning art performance diversity resilience jobs practical inclusion engineering identify differences perspectives belonging cto etsy rapid api msc problem solving influencing hindsight el c devops velocity incidents jellyfish nyse postman human factors kenji engineering manager lund university complex systems engineering management engineering leadership capacity planning paul hammond john allspaw web operations

Encore: Agile Software Development Method (noun)

Giant Robots Smashing Into Other Giant Robots

Jan 17, 2023 7:45

A software development philosophy that emphasizes incremental delivery, team collaboration, continual planning, and continual learning. CyberWire Glossary link: https://thecyberwire.com/glossary/agile-software-development Audio reference link: "Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe," John Allspaw and Paul Hammond, 2009 Velocity Conference, YouTube, 25 June 2009.

method velocity noun agile software development paul hammond john allspaw velocity conference

456: Jeli.io with Laura Maguire

Jan 5, 2023 46:37

Laura Maguire is a Researcher at Jeli.io, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Victoria talks to Laura about incident management, giving companies a powerful tool to learn from their incidents, and what types of customers are ideal for taking on a platform like Jeli.io.

Jeli.io (https://www.jeli.io/)
Follow Jeli.io on Instagram (https://www.instagram.com/jeli_io/), Twitter (https://twitter.com/jeli_io) or LinkedIn (https://www.linkedin.com/company/jeli-inc/).
Follow Laura Maguire on Twitter (https://twitter.com/LauraMDMaguire) or LinkedIn (https://www.linkedin.com/in/lauramaguire/).
Follow thoughtbot on Twitter (https://twitter.com/thoughtbot) or LinkedIn (https://www.linkedin.com/company/150727/).
Become a Sponsor (https://thoughtbot.com/sponsorship) of Giant Robots!

Transcript:
VICTORIA: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Victoria Guido. And with me today is Laura Maguire, Researcher at Jeli, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Laura, thank you for joining me. LAURA: Thanks for having me, Victoria. VICTORIA: This might be a very introductory level question but just right off the bat, what is an incident? LAURA: What we find is a lot of companies define this very differently across the space, but typically, it's where they are seeing an impact, either a customer impact or a degradation of their service. This can be either formally, it kind of impacts their SLOs or their SLAs, or informally it's something that someone on the team notices or someone, you know, one of their users notices as being degraded performance or something not working as intended. VICTORIA: Gotcha. 
From my background being in IT operations, I'm familiar with incidents, and it's been a practice in IT for a long time. But what brought you to be a part of building this platform and creating a product around incidents? LAURA: I am a, let's say, recovering safety professional. VICTORIA: [chuckles] LAURA: I started my career in the safety and risk management realm within natural resource industries in the physical world. And so I worked with people who were at the sharp end in high-risk, high-consequence type work. And they were really navigating risk and navigating safety in the real world. And as I was working in this domain, I noticed that there was a delta between what was being said, created safety, and helped risk management and what I was actually seeing with the people that I was working with on the front lines. And so I started to pull the thread on this, and I thought, is work as done really the same as work as written or work as prescribed? And what I found was a whole field of research, a whole field of practice around thinking about safety and risk management in the world of cognitive work. And so this is how people think about risk, how they manage risk, and how do they interpret change and events in the world around them. And so as I started to do my master's degree in human factors and system safety and then later my Ph.D. in cognitive systems engineering, I realized that whether you are on the frontlines of a wildland fire or you're on the frontlines of responding to an incident in the software realm, the ways in which people detect, diagnose, and repair the issues that they're facing are quite similar in terms of the cognitive work. And so when I was starting my Ph.D. work, I was working with Dr. David Woods at the Cognitive Systems Engineering Lab at The Ohio State University. And I came into it, and I was thinking I'm going to work with astronauts, or with fighter pilots, or emergency room doctors, these really exciting domains. 
And he was like, "We're going to have you work with software engineers." And at first, I really failed to see the connection there, but as I started to learn more about site reliability engineering, about DevOps, about the continuous deployment, continuous integration world, I realized software engineers are really at the forefront of managing critical digital infrastructure. They're keeping up the systems that run society, both for recreation and pleasure in the sense of Netflix, for example, as well as the critical functions within society like our 911 call routing systems, our financial markets. And so the ability to study how software engineers detect outages, manage outages, and work together collaboratively across the team was really giving us a way to study this kind of work that could actually feed back into other types of domains like emergency response, like emergency rooms, and even back to the fighter pilots and astronauts. VICTORIA: Wow, that's so interesting. And so is your research that went into your Ph.D. did that help you help define the product strategy and kind of market fit for what you've been building at Jeli? LAURA: Yeah, absolutely. So Nora Jones, who is the founder and CEO of Jeli, reached out to me at a conference and told me a little bit about what she was thinking about, about how she wanted to support software engineers using a lot of this literature and a lot of the learnings from these other domains to build this product to help support incident management in software engineering. 
So we base a lot of our thinking around how to help support this cognitive work and how to help resilient performance in these very dynamic, these very changing large scale, you know, distributed software systems on this research, as well as the research that we do with our own users and with our own members from the Learning From Incidents in software engineering Slack community that Nora and several other fairly prominent names within the software community started: Lorin Hochstein, John Allspaw, Dr. Richard Cook, Jessica DeVita, Ryan Kitchens, and I may be missing someone else but...and myself, oh, Will Gallego as well. Yeah, we based a lot of our understandings, really deep qualitative understandings of what is work like for software engineers when they're, you know, in continuous deployment type environments. And we've translated this into building a product that we think helps rather than hinders, without getting in the way of engineers while they're under time pressure and there's a lot of uncertainty. And there's often quite a bit of stress involved with responding to incidents. VICTORIA: Right. And you mentioned resilience engineering. And for those who don't know, David Woods, who you worked with on your Ph.D., wrote "Resilience Engineering: Concepts and Precepts." So maybe you could talk a little bit about resilience engineering and what that really means, not just in technology but in the people who were running the tools, right? LAURA: Yeah. So resilience engineering is different from how we think about protecting and defending our software systems. And it's different in the sense that we aren't just thinking about how do we prevent incidents from happening again, like, how do we fix things that have happened to us in the past? But how do we better understand the ways in which our systems operate under a wide variety of conditions? So that includes normal operating conditions as well as abnormal or anomalous operating conditions, such as an incident response. 
And so resilience engineering was kind of this way of thinking differently about predicting failure, about managing failure, and navigating these kinds of worlds. And one of the fundamental differences about it is it sees people as being the most adaptive component within the system of work. So we can have really good processes and practices around deploying code; we can institute things like cross-checking and peer review of code; we can have really good robust backup and failover systems, but ultimately, it's very likely that in these kinds of complex and adaptive always-changing systems that you're going to encounter problems that you weren't able to anticipate. And so this is where the resilience part comes in because if you're faced with a novel problem, if you're faced with an issue you've never seen before, or a hidden dependency within your system, or an unanticipated failure mode, you have to adapt. You have to be able to take all of the information that's available to you in the moment. You have to interpret that in real-time. You have to think of who else might have skills, knowledge, expertise, access to information, or access to certain kinds of systems or software components. And you have to bring all of those people together in real-time to be able to manage the problem at hand. And so this is really quite a different way of thinking about supporting this work than just let's keep the runbooks updated, and let's make sure that we can write prescriptive processes for everything that we're going to encounter. Because this really is the difference that I saw when I was talking earlier about work as done versus work as prescribed. The rules don't cover all of the situations. And so you have to think of how do you help people adapt? How do you help people access information in real-time to be able to handle unforeseen failures? VICTORIA: Right. That makes a lot of sense. 
It's an interesting evolution of site reliability engineering where you're thinking about the users' experience of your site. It's also thinking about the people who are running your site and what their experience is, and what freedom they have to be able to solve the problems that you wouldn't be able to predict, right? LAURA: Yeah, it's a really good point, actually, because there is sort of this double layer in the product that we are building. So, as you mentioned earlier, we are an incident analysis platform, and so what does that mean? Well, it means that we pull in data whenever there's been an incident, and we help you to look at it a little bit more deeply than you may if you're just following a template and sort of reconstructing a timeline. And so we pull in the actual Slack data that, you know, say, an ops channel or an incident channel that's been spun up following a report of a degraded performance or of an outage. And we look very closely at how did people talk to one another? Who did they bring into the incident? What kinds of things did they think were relevant and important at different points in time? And in doing this, it helps us to understand what information was available to people at different points in time. Because after the incident and after it's been resolved, people often look back and say, "Oh, there's nothing we can learn from that. We figured out what it was." But if we go back and we start looking at how people detected it, how they diagnosed it, who they brought into the event, we can start to unpack these patterns and these ways of understanding how do people work together? What information is useful at different points in time? Which helps us get a deeper understanding of how our systems actually work and how they actually fail. VICTORIA: Right. And I see there are a few different ways the platform does that: there's a narrative builder, a people view, and also a visual timeline. 
So, do you find that combining all those things together really gives companies a powerful tool to learn from their incidents? LAURA: Yeah. So let me talk a little bit about each of those different components. With our MVP of the product, we started out with this understanding of the incident analyst and the incident investigator who, you know, was ready to dive in and ready to understand their incident and apply some qualitative analysis techniques to thinking about their incidents. And what we found was there are a number of these people who are really interested in this deep dive within the software industry. But there's a broader subset of folks that they work with who maybe only do these kinds of incident analyses every once in a while, and they're not as interested in going quite as deep. And so the narrative builder is really this kind of bridge between those two types of users. And what it does is helps construct a timeline which is typically what most companies do to help drive the discussion that they might have in a post-mortem or to drive their kind of findings in their summary report. And it helps them take this closer look at the interactions that happened in that Slack transcript and raise questions about what kinds of uncertainties there were, point out who was involved, or interesting aspects of the event at that point in time. And it helps them to summarize what was happening. What did people think was happening at this point in time to create this story about the incident? And the story element is really important because we all learn from stories. It helps bring to life some of the details about what was hard, who was involved, how did they get brought in, what the sources of technical failure were, and whether those were easy or difficult to understand and to repair once the source of the failure was actually understood. And so that narrative builder helps reconstruct this timeline in a much richer way but also do it very efficiently. 
And as you mentioned, the visual timeline is something that we've created to help that lightweight user or that every once in a while user to go a little bit deeper on their analysis. And how we do that is by laying out the progression of the event in a way that helps you see, oh, this maybe wasn't straightforward. We didn't detect it in the beginning, and then diagnose it, and then repair it at the end. What happened actually was the detection was intermittent. The signals about what was going wrong were intermittent, and so that was going on in parallel with the diagnosis. The diagnosis took a really long time, and that may have been because we can also see the repair was happening concurrently. And so it starts to show these kinds of characteristics about whether the incident was difficult, whether it was challenging and hard, or whether it was simple and straightforward. This helps lend a bit more depth to metrics like MTTR and TTD by saying, oh, there was a lot more going on in this incident than we initially thought. The last thing that you mentioned was the people view, and so that really sets our product apart from other products in that we look at the sociotechnical system. So it's not just about the software that broke; it is about who was involved in managing that system, in repairing that system, and in communicating about that system outwardly. And so the people view kind of pulls in some HR data. It helps us to understand who was involved. How long have they been in their role? Were they on-call? Were they not on-call? And other kinds of relevant details that show us what was their engagement or their interaction with this event. And so when we start to bring in the socio part of the sociotechnical system, we can identify things like what knowledge do we have within the organization? Is that knowledge well-distributed, or is it just isolated in one or two people? 
And so those people are constantly getting pulled into incidents when they may be not on-call, which can start to show us whether or not these folks are in danger of burning out or whether their knowledge might need to be transferred more broadly throughout the organization. So this is kind of where the resilience piece comes in because it helps us to distribute knowledge. It helps us to identify who is relevant and useful and how do they partner and collaborate with other people, and their knowledge and skill sets to be able to manage some of the outages that they face? VICTORIA: That's wonderful because one of my follow-up questions would be, as a CEO, as a founder, what kind of insights or choices do you get to make now that you have this insight to help make your team more resilient? [laughs] LAURA: So if this is a manager, or a founder, or a CEO that is looking at their data in Jeli, they can start to understand how to resource their teams more appropriately, as I mentioned, how to spread that knowledge around. They can start to see what parts of their system are creating the most problems or what parts of their system do they have maybe less insight into how it works, how it interacts with other parts of the system, and what this actually means for their ability to meet their SLOs or their SLAs. So it gives you a more in-depth understanding of how your business is actually operating on both the technical side of things, as well as on the people side of things. VICTORIA: That makes a lot of sense. Thank you for that overview of the platform. There's the incident analysis platform, and you also have the bot, the response chatbot. Can you tell me a little bit more about that? LAURA: Yeah, absolutely. We think that incident management should be conducted wherever your work actually takes place, and so for most of our customers and a lot of folks that we know about in the industry, that's Slack. 
And so, if you are communicating in real-time with your team in Slack, we think that you should stay there. And so, we built this incident management bot that is free and will be free for the lifetime of the product. Because we think that this is really the fundamental basis for helping you manage your incidents more efficiently and more effectively. So it's a pretty lightweight bot. It gives kind of some guardrails or some guidance around collaboration by spinning up a new incident channel, helping you to bring the right kinds of responders into that, helping you to communicate to interested stakeholders by broadcasting to channels they might be in. It kind of nudges you to think about how to communicate about what's happening during different stages of the event progression. And so it's prompting you in a very lightweight way; hey, do you have a status update? Do you have a summary of what the current thinking is? What are the hypotheses about what's going on? Who's conducting what kinds of activities right now? So that if I'm a responder that's coming into the event after 20-30 minutes after it started, I can very quickly come up to speed, understand what's going on, who's doing what, and figure out what's useful for me to do to help step in and not disrupt the incident management that's underway right now. Our users can choose to use the bot independently of the incident analysis platform. But of course, being able to ingest that incident into Jeli it helps you understand who's been involved in the incident, if they've been involved in similar incidents in the past, and helps them start to see some patterns and some themes that emerge over time when you start to look at incidents across the organization. VICTORIA: That makes sense. And I love that it's free and that there's something for every type of organization to take advantage of there. 
And I wonder if at Jeli you have data about what type of customer is it who'd be targeted or really ideal to take on this kind of platform. LAURA: So most organizations...I was actually recently at SREcon EMEA, and there was a really interesting series of talks; one was SRE for Enterprise, and the next talk was SRE for Startups. And so it was a very thought-provoking discussion around is SRE for everyone, so site reliability engineering? Even smaller teams are starting to have to be responsible for reliability and responsible for running their service. And so we kind of have built our platform thinking about how do we help not just big enterprises or organizations that may have dedicated teams for this but also small startups to learn from their incidents. So internally, we actually call incidents opportunities as in they are learning opportunities for checking out how does your system actually work? How do your people work together? What things were difficult and challenging about the incident? And how do you talk about those things as a team to help create more resilient performance in future? So in terms of an ideal customer, it's really folks that are interested in conducting these sort of lightweight but in-depth looks at how their system actually works on both the people side of things and the technical side of things. Those who we found are most successful with our product are interested in not so much figuring out who did the thing and who can they blame for the incident itself but rather how do they learn from what happened? And would another engineer, or another product owner, another customer service representative, whoever the incident may be sort of focused around, would another person in their shoes have taken the same actions that they took or made the same decisions that they made? 
Which helps us understand from a systems level how do we repair or how do we adjust the system of work surrounding folks so that they are better supported when they're faced with uncertainty, or with that kind of time pressure, or that ambiguity about what's actually going on? VICTORIA: And I love that you said that because part of the reason [laughs] I invited you on to the podcast is that a lot of companies I have experience with don't think about incidents until it happens to them, and then it can be a scramble. It can impact their customer base. It can stress their team out. But if you go about creating...the term obviously you all use is psychological safety on your team, and maybe you use some of the free tools from Jeli like the Post-Incident Guide and the Incident Analysis 101 blog to set your team up for success from the beginning, then you can increase your customer loyalty and your team loyalty as well to the company. Is that your experience? LAURA: Yeah, absolutely. So one thing that I have learned throughout my career, you know, starting way back in forestry and looking at safety and risk in that domain, was as soon as there is an accident or even a serious near miss, right away, everybody gets sweaty palms. Everybody is concerned about, uh-oh, am I going to get blamed for this? Am I going to get fired? Am I going to get publicly shamed for the decisions that I made when I was in this situation? And what that response, that reaction does is it drives a lot of the communication and a lot of the understanding of the conditions that that person was in. It drives that underground. And it's important to allow people to talk about here's what I was seeing, here's what I was experiencing because, in these kinds of complex systems, information is not readily available to people. The signals are not always coming through loud and clear about what's going on or about what the appropriate actions to take are. Instead, it's messy; it's loud, it's noisy. 
There are usually multiple different demands on that person's attention and on their time, and they're often managing trade-offs: do I keep the system down so that I can gather more information about what's actually going on, or do I just try and bring it up as quickly as I can so that there's less impact to users? Those kinds of decisions are having to be made under pressure. So when we create these conditions of psychological safety, when we say you know what? This happened. We want to learn from it. We've already made this investment. Richard Cook mentioned in the very first SNAFU Catchers Report, which was a report that came out of Ohio State, that incidents are unplanned investments into understanding how your system works. And so you've already had the incident. You've already paid the price of that downtime or of that outage. So you might as well extract some learning from it so that you can help create a safer and more resilient system in the future. So by helping people to reconstruct what was actually happening in real-time, not what they were retrospectively saying, "Oh, I should have done this," well, you didn't do that. So let's understand why you thought at that moment in time that was the right way to respond because, more than likely, other people in that same position would have made that same choice. And so it helps us to think more broadly about ways that we can support decision-making and sense-making under conditions of stress and uncertainty. And ultimately, that helps your system be more resilient and be more reliable for your customers. VICTORIA: What a great reframing: unplanned investment. [laughs] And if you don't learn from it, then you're going to lose out on what you've already invested that time in resolving it, right? LAURA: Absolutely. MID-ROLL AD: Are you an entrepreneur or start-up founder looking to gain confidence in the way forward for your idea? 
At thoughtbot, we know you're tight on time and investment, which is why we've created targeted 1-hour remote workshops to help you develop a concrete plan for your product's next steps. Over four interactive sessions, we work with you on research, product design sprint, critical path, and presentation prep so that you and your team are better equipped with the skills and knowledge for success. Find out how we can help you move the needle at: tbot.io/entrepreneurs. VICTORIA: Getting more into that psychological safety and how to create that culture where people feel safe telling about what really happened, but how does that relate to...Jeli says that they are a people software. [laughs] Talk to me more about that. Like, what advice do you give founders and CEOs on how to create that psychological safety which makes them be more resilient in these types of incidents? LAURA: So you mentioned the Howie Guide that we published last year, and this is our guidance around how to do incident analysis, how to help your team start to learn from their incidents, and Howie stands for how we got here. And that's really important, that language because what it says is there's a history that led up to this incident. And most teams, when they've had an outage, they'll kind of look backwards from that outage, maybe an hour, maybe a day, maybe to the last deploy. But they don't think about how the decisions got made to use that piece of software in the first place. They don't think about how did engineers actually get on-boarded to being on-call. They don't necessarily think about what kinds of skills, and knowledge, and expertise when we're hiring a DevOps engineer, and I'm using air quotes here or an SRE. What kinds of skills and knowledge do they actually have? Those are very broad terms. And what it means to be a DevOps engineer or an SRE is quite underspecified. And so the knowledge behind the folks that you might hire into the company is going to necessarily be very diverse. 
It's going to be partial and incomplete in many ways because not everyone can know everything about the system. And so, we need to have multiple diverse perspectives about how the system works, how our customers use that system, what kinds of pressures and constraints exist within our company that allow us some possibilities over others. We need to bring all of those perspectives together to get a more reflective picture of what was actually happening before this incident took place and how we actually got here. This reframing helps a lot of people disarm that initial defensiveness response or that initial, oh, shoot; I'm going to get in trouble for this kind of response. And it says to them, "Hey, you're a part of this bigger system of work. You are only one piece of this puzzle. And what we want to try and do is understand what was happening within the company, not just what you did, what you said, and what you decided." So once people realize that you're not just trying to find fault or place blame, but you're really trying to understand their work, and you're trying to understand their work with other teams and other vendors, and trying to understand their work relative to the competing demands that were going on, so those are some of the things that help create psychological safety. About ten years ago, John Allspaw and the team at Etsy put out The Etsy Debriefing Facilitation Guide, which also poses a number of questions and helps to frame the post-incident learnings in a way that moves it from the individual and looks more collectively at the company as a whole. 
And so these things are helpful for founders or for CEOs to help bring forward more information about what's really going on, more information about what are the real risks and threats and opportunities within the company, and gives you an opportunity to step back and do what we call microlearning, which is sharing knowledge about how the system works, sharing understandings of what people think is going on, and what people know about the system. We don't typically talk about those things unless there's a reason to, and incidents kind of give us that reason because they're uncomfortable and they can be painful. They can be very public. They can be very disruptive to what we think about how resilient and reliable we actually are. And so if you can kind of step away from this defensiveness and step away from this need to place blame and instead try and understand the conditions, you will get a lot more learning and a lot more resilience and reliability out of your teams and out of your systems. VICTORIA: That makes sense to me. And I'd like to draw a connection between that and some other things you mentioned with The 2022 Accelerate State of DevOps Report that highlights that the people who are often responding to those incidents or in that high-stress situation tend to be historically underrepresented or historically excluded groups. And so do you see that having this insight into both who is actually taking on a lot of the work when these incidents happen and creating that psychological safety can make a better environment for diversity, equity, inclusion at a company as well? 
LAURA: Well, I think anytime you work to establish trust and transparency, and you focus on recognizing the skills that people do have, the knowledge that they do have, and not over assuming that someone knows something or that they have been involved in the discussions that may have been relevant to an incident, anytime you focus on that trust and transparency you are really signaling to people within your organization that you value their contributions and that you recognize that they've come to work and trying to do a good job. But they have multiple competing demands on their attention and on their time. And so we're not making assumptions about people being complacent, or people being reckless or being sloppy in their work. So that creates an environment where people feel more willing to speak up and to talk about some of the challenges that they might face, to talk about the ways in which it's not clear to them how certain parts of the system work or how certain teams actually operate. So you're just opening the channels for communication, which helps to share more knowledge. It helps to share more information about what teams are doing at different points in time. And this helps people to preemptively anticipate how a change that they might be making in their part of the system could be influencing up or downstream teams. And so this helps create more resilience because now you're thinking laterally about your system and about your involvement across teams and across boundary lines. And an example of this is if a marketing team...this is a story that Nora tells quite a bit; if a marketing team is, say, launching a Super Bowl commercial for their company but they don't actually tell the engineers on-call that that is about to happen, you can create all sorts of breakdowns when all of a sudden you have this surge of traffic to your website because people see the Super Bowl commercial and they want to go to the site. 
And then you have a single person who's trying to respond to that in real-time. So, instead, when you do start thinking about that trust and transparency, you're helping teams to help each other and to think more broadly about how their work is actually impacting other parts of the system. So from a diversity and inclusion and underrepresented groups perspective, this is creating the conditions for more people to be involved, more people to feel like their voice is going to be heard, and that their perspective actually matters. VICTORIA: That sounds really powerful, and I'm glad we were able to touch on that. Shifting gears a little bit, I wanted to talk about two different questions; so one is if you could travel back in time to when Jeli first started, what advice would you give yourself, your past self? LAURA: I would encourage myself to recognize that our ability to experiment is fundamental to our ability to learn. And learning is what helps us to iterate faster. Learning is what helps us to reflect on the tool that we're building or the feature that we're building and what this actually means to our users. I actually copped that advice to myself from CEO Zoran Perkov of the Long-Term Stock Exchange. They launched a whole new stock market during the pandemic with a fully remote team. And I had interviewed him for an article that I wrote about resilient leadership. And he said to me, like, "My job as a CEO is 100% about protecting our ability to experiment as a company because if we stop learning, we're not going to be able to iterate. We're not going to be able to adapt to the changes that we see in the market and in our users." So I think I would tell myself to continually experiment. 
One of the things that I talk to our customers about a lot because many of them are implementing new incident management programs or they're trying to level up their engineering teams around incident analysis, and I would say, "This doesn't have to be a fully-fleshed out program where you know all of the ways in which this is going to unfold." It's really about trying experiments: conduct some training, start small. Do one incident analysis on a really particularly spicy incident that you may have had or a really challenging incident where a lot of people were surprised by what happened. Bring together that group and say, "Hey, we're going to try something a little bit different here. We'll use some questions from the Howie Guide. We'll use the format and the structure from the Etsy Debriefing Guide. And we're just going to try and learn what we can about this event. We're not going to try and place blame. We're not going to try and generate corrective actions. We just want to see what we can learn from this." Then ask people that were involved, "How did this go? What did we learn from it? What should we do differently next time?" And continually iterate on those small, little experiments so that you can grow your product and grow your team's capacity. I think it took us a little bit of time to figure that out within the organization, but once we did, we were just able to collaborate more effectively and work more effectively by integrating some of the feedback that we were getting from our users. And then the last piece of advice that I would give myself is to really invest in cross-discipline coordination and collaboration. Engineers, designers, researchers, CEOs, they all have a different view of the product. They all have a different understanding of what the goals and priorities are. And those mental models of the product and of what the right thing to do is are constantly changing. 
And they all have different language that they use to talk about the product and to talk about their processes for integrating this understanding of the changing conditions and the changing user into the product. And so I would say invest in establishing common ground across the different disciplines within your team to be able to talk about what people are seeing, to be able to stop and identify when we're making assumptions about what other people know or what other people's orientation towards the problem or towards the product are. And spend a little bit of time saying, "When I say this is important, I'm saying it's important because of XYZ, not just this is important." So spending a little bit of time elaborating on what your mental model is and where you're drawing from can help the teams work more effectively together across those disciplines. VICTORIA: That's pretty powerful advice. You're iterating and experimenting at Jeli. What's on the horizon that you are...what new experiments are you excited about? LAURA: One of the things that has been front and center for us since we started is this idea of cross-incident analysis. And so we've kind of built out a number of different features within the product, being able to help tag the incident with the relevant services and technologies that were involved, being able to identify which teams were involved, and also being able to identify different kinds of themes or patterns that emerge from individual incidents. So all of this data that we can get from mostly just from the ingested incident itself or from the incident that you bring into Jeli but also from the analysis that you do on it this helps us start to be able to see across incidents what's happening not just with the technical side of things. So is it always Travis that is causing a problem? Are there components that work together that kind of have these really hidden and strange interdependencies that are really hard for the team to actually cope with? 
What kinds of themes are emerging across your suite of opportunities, your suite of incidents that you've ingested? Some of the things that we're starting to see from those experiments is an ability to look at where are your knowledge islands within your organization? Do you have an engineer who, if they were to leave, would take the majority of your systems knowledge about your database, or about your users, or about some critical aspect of your system that would disappear with all of that tacit knowledge? Or are there engineers that work really effectively together during really difficult incidents? And so you can start to unpack what are these characteristics of these people, and of these teams, and of these technologies that offer both opportunities or threats to your organization? So basically, what we're doing is we're helping you to see how your system performs under different kinds of conditions, which I think as a safety and risk professional working in a variety of different domains for the last 15 years, I think this is really where the rubber hits the road in helping teams be more reliable, and be more resilient, and more proactive about where investments in maintenance, or training, or headcount are going to have the biggest bang for your buck. VICTORIA: That makes a lot of sense. In my experience, sometimes those decisions are made more on intuition or on limited data so having a more full picture to rely on probably produces better results. [laughs] LAURA: Yeah, and I think that we all want to be data-driven, thinking about not only the quantitative data is how many incidents do we have around certain parts of the system, or certain teams, or certain services? But also, the qualitative side of things is what does this actually mean? And what does this mean to our ability to grow and change over time and to scale? The partnership of that quantitative data and qualitative data means we're being data-driven on a whole other level. VICTORIA: Wonderful. 
And it seems like we're getting close to the end of our time here. Is there anything else you want to give as a final takeaway to our listeners? LAURA: Yeah. So I think that we are, you know, as a domain, as a field, software engineering is increasingly becoming responsible for not only critical infrastructure within society, but we have a responsibility to our users and to each other within our companies to help make work better, help make our services more reliable and more resilient over time. And there's a variety of lessons that we can learn from other domains. As I mentioned before, aviation, healthcare, nuclear power, all of those kinds of domains have been thinking about supporting cognitive work and supporting frontline operators. And we can learn from this history and this literature that exists out there. There is a GitHub repo that Lorin Hochstein has curated with a number of other folks within the industry that points to some of these resources. And as well, we'll be hosting the first Learning From Incidents in Software Engineering Conference in Denver in February, February 15th and 16th. And one feature of this conference that I'm super excited about is affectionately called CasesConf. And it is going to be an opportunity for software engineers from a variety of organizations to tell real stories about incidents that they had, how they handled them, what was challenging, what went surprisingly well, and just what is actually going on within their organizations. And this is kind of a new thing for the software industry to be talking very publicly about failures and sharing the messy details of our incidents. This won't be a recorded part of the conference. It is going to be conducted under the Chatham House Rule, which is participants who are in the room while these stories are being told can share some of the stories but not any identifying details about the company or the engineers that were involved. 
And so these kinds of real-world situations help us, as I talked about before with that psychological safety, to say this is the reality of operating complex systems. They're going to fail. We're going to have to learn from them. And the more that we can talk at an industry level about what's going on and about what kinds of things are creating problems or opportunities for each other, the more we're going to be able to lift the bar for the industry as a whole. So you can check out register.learningfromincidents.io for more information about the conference. And we can link Lorin's resilience engineering GitHub repo in the notes as well. VICTORIA: Wonderful. Well, I was looking for an excuse to come to Denver in February anyways. LAURA: We would love to have ya. VICTORIA: Thank you. And thank you so much for taking time to share with us today, Laura. You can subscribe to the show and find notes along with a complete transcript for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @victori_ousg. This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time. ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success. Special Guest: Laura Maguire.
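
Editor's sketch: the "knowledge islands" idea Laura describes in this episode (a service whose incidents are almost always handled by the same person) can be illustrated with a few lines of Python. This is only a minimal illustration under stated assumptions, not how Jeli works: the incident records, the field names (`services`, `responders`), the function name `knowledge_islands`, and the thresholds are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical incident records; names and fields are invented for illustration.
incidents = [
    {"id": "INC-1", "services": ["database"], "responders": ["ana"]},
    {"id": "INC-2", "services": ["database", "api"], "responders": ["ana", "ben"]},
    {"id": "INC-3", "services": ["database"], "responders": ["ana"]},
    {"id": "INC-4", "services": ["api"], "responders": ["ben", "chen"]},
    {"id": "INC-5", "services": ["api"], "responders": ["chen"]},
]


def knowledge_islands(incidents, min_incidents=3, threshold=0.8, backup_share=0.5):
    """Return {service: (responder, share)} for services where one responder
    took part in at least `threshold` of the incidents and nobody else took
    part in `backup_share` or more of them."""
    per_service = defaultdict(Counter)  # service -> responder -> incident count
    totals = Counter()                  # service -> number of incidents
    for incident in incidents:
        for service in set(incident["services"]):
            totals[service] += 1
            for responder in set(incident["responders"]):
                per_service[service][responder] += 1

    islands = {}
    for service, total in totals.items():
        if total < min_incidents:
            continue  # too few incidents to draw any conclusion
        ranked = per_service[service].most_common(2)
        top_name, top_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if top_count / total >= threshold and runner_up / total < backup_share:
            islands[service] = (top_name, top_count / total)
    return islands


print(knowledge_islands(incidents))
```

With the sample data, only the hypothetical `database` service is flagged, because one responder appears in all of its incidents. A count like this covers only the quantitative half of what the episode describes; the qualitative question of what that person actually knows still has to come from talking to the people involved.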

ceo netflix learning talk super bowl startups ceos mvp shifting engineers researchers enterprise etsy ohio state slack ohio state university github devops howie maguire xyz mandy moore sre precepts lorin slas giant robots david woods ttd slos jeli richard cook devops report john allspaw laura yeah laura so

Incidents, Solutions, and ChatOps Integration with Chris Evans

Play Episode Listen Later Jul 7, 2022 33:28

About Chris

Chris is the Co-founder and Chief Product Officer at incident.io, where they're building incident management products that people actually want to use. A software engineer by trade, Chris is no stranger to gnarly incidents, having participated in (and caused!) them at everything from early stage startups through to enormous IT organizations.

Links Referenced:

incident.io: https://incident.io
Practical Guide to Incident Management: https://incident.io/guide/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: DoorDash had a problem. As their cloud-native environment scaled and developers delivered new features, their monitoring system kept breaking down. In an organization where data is used to make better decisions about technology and about the business, losing observability means the entire company loses their competitive edge. With Chronosphere, DoorDash is no longer losing visibility into their applications suite. The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution that gives the observability lead at DoorDash business, confidence, and peace of mind. Read the full success story at snark.cloud/chronosphere. That's snark.cloud slash C-H-R-O-N-O-S-P-H-E-R-E.

Corey: Let's face it, on-call firefighting at 2am is stressful! So there's good news and there's bad news. The bad news is that you probably can't prevent incidents from happening, but the good news is that incident.io makes incidents less stressful and a lot more valuable. 
incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues and learn from incident insights to improve site reliability and fix your vulnerabilities. Try incident.io, recover faster and sleep more.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest is Chris Evans, who's the CPO and co-founder of incident.io. Chris, first, thank you very much for joining me. And I'm going to start with an easy question—well, easy question, hard answer, I think—what is an incident.io exactly?

Chris: Incident.io is a software platform that helps entire organizations to respond to, recover from, and learn from incidents.

Corey: When you say incident, that means an awful lot of things. And depending on where you are in the ecosystem in the world, that means different things to different people. For example, oh, incident. Like, “Are you talking about the noodle incident because we had an agreement that we would never speak about that thing again,” style, versus folks who are steeped in DevOps or SRE culture, which is, of course, a fancy way to say those who are sad all the time, usually about computers. What is an incident in the context of what you folks do?

Chris: That, I think, is the killer question. I think if you look at organizations in the past, I think incidents were those things that happened once a quarter, maybe once a year, and they were the thing that brought the entirety of your site down because your big central database that was in a data center sort of disappeared. The way that modern companies run means that the definition has to be very, very different. So, most places now rely on distributed systems and there is no, sort of, binary sense of up or down these days. 
And essentially, in the general case, like, most companies are continually in a sort of state of things being broken all of the time.

And so, for us, when we look at what an incident is, it is essentially anything that takes you away from your planned work with a sense of urgency. And that's the sort of the pithy definition that we use there. Generally, that can mean anything—it means different things to different folks, and, like, when we talk to folks, we encourage them to think carefully about what that threshold is, but generally, for us at incident.io, that means basically a single error that is worthwhile investigating that you would stop doing your backlog work for is an incident. And also an entire app being down, that is an incident.

So, there's quite a wide range there. But essentially, by sort of having more incidents and lowering that threshold, you suddenly have a heap of benefits, which I can go very deep into and talk for hours about.

Corey: It's a deceptively complex question. When I talk to folks about backups, one of the biggest problems in the world of backup and building a DR plan, it's not building the DR plan—though that's no picnic either—it's okay. In the time of cloud, all your planning figures out, okay. Suddenly the site is down, how do we fix it? There are different levels of down and that means different things to different people where, especially the way we build apps today, it's not is the service or site up or down, but with distributed systems, it's how down is it?

And oh, we're seeing elevated error rates in us-tire-fire-1 region of AWS. At what point do we begin executing on our disaster plan? Because the worst answer, in some respects is, every time you think you see a problem, you start failing over to other regions and other providers and the rest, and three minutes in, you've irrevocably made the cutover and it's going to take 15 minutes to come back up. 
And oh, yeah, then your primary site comes back up because whoever unplugged something, plugged it back in and now you've made the wrong choice. Figuring out all the things around the incident, it's not what it once was.

When you were running your own blog on a single web server and it's broken, it's pretty easy to say, “Is it up or is it down?” As you scale out, it seems like that gets more and more diffuse. But it feels to me that it's also less of a question of how the technology has scaled, but also how the culture and the people have scaled. When you're the only engineer somewhere, you pretty much have no choice but to have the entire state of your stack shoved into your head. When that becomes 15 or 20 different teams of people, in some cases, it feels like it's almost less of a technology problem than it is a problem of how you communicate and how you get people involved. And the issues in front of the people who are empowered and insightful in a certain area that needs fixing.

Chris: A hundred percent. This is, like, a really, really key point, which is that organizations themselves are very complex. And so, you've got this combination of systems getting more and more complicated, more and more sort of things going wrong and perpetually breaking but you've got very, very complicated information structures and communication throughout the whole organization to keep things up and running. The very best orgs are the ones where they can engage the entire, sort of, every corner of the organization when things do go wrong. And I lived and breathed this firsthand at various different previous companies, but most recently at Monzo—which is a bank here in the UK—when an incident happened there, like, one of our two physical data center locations went down, the bank wasn't offline. Everything was resilient to that, but that required an immediate response.

And that meant that engineers were deployed to go and fix things. 
But it also meant the customer support folks might be required to get involved because we might be slightly slower processing payments. And it means that risk and compliance folks might need to get involved because they need to be reporting things to regulators. And the list goes on. There's, like, this need for a bunch of different people who almost certainly have never worked together or rarely worked together to come together, land in this sort of like empty space of this incident room or virtual incident room, and figure out how they're going to coordinate their response and get things back on track in the sort of most streamlined way and as quick as possible.

Corey: Yeah, when your bank is suddenly offline, that seems like a really inopportune time to be introduced to the database team. It's, “Oh, we have one of those. Wonderful. I feel like you folks are going to come in handy later today.” You want to have those pathways of communication open well in advance of these issues.

Chris: A hundred percent. And I think the thing that makes incidents unique is that fact. And I think the solution to that is this sort of consistent, level playing field that you can put everybody on. So, if everybody understands that the way that incidents are dealt with is consistent, we declare it like this, and under these conditions, these things happen. And, you know, if I flag this kind of level of impact, we have to pull in someone else to come and help make a decision.

At the core of it, there's this weird kind of duality to incidents where they are both kind of semi-formulaic and that you can basically encode a lot of the processes that happen, but equally, they are incredibly chaotic and require a lot of human impact to be resilient and figure these things out because stuff that you have never seen happen before is happening and failing in ways that you never predicted. 
And so, this is where incident.io plays into this is that we try to take the first half of that off of your hands, which is, we will help you run your process so that all of the brain capacity you have, it goes on to the bit that humans are uniquely placed to be able to do, which is responding to these very, very chaotic, sort of, surprise events that have happened.

Corey: I feel as well—because I played around in this space a bit before I used to run ops teams—and, more or less I really should have had a t-shirt then that said, “I am the root cause,” because yeah, I basically did a lot of self-inflicted outages in various environments because it turns out, I'm not always the best with computers. Imagine that. There are a number of different companies that play in the space that look at some part of the incident lifecycle. And from the outside, first, they all look alike because it's, “Oh, so you're incident.io. I assume you're PagerDuty. You're the thing that calls me at two in the morning to make sure I wake up.”

Conversely, for folks who haven't worked deeply in that space, as well, of setting things on fire, what you do sounds like it's highly susceptible to the Hacker News problem. Where, “Wait, so what you do is effectively just getting people to coordinate and talk during an incident? Well, that doesn't sound hard. I could do that in a weekend.” And no, no, you can't.

If this were easy, you would not have been in business as long as you have, have the team the size that you do, the customers that you do. But it's one of those things that until you've been in a very specific set of a problem, it doesn't sound like it's a real problem that needs solving.

Chris: Yeah, I think that's true. And I think that the Hacker News point is a particularly pertinent one and that someone else, sort of, in an adjacent area launched on Hacker News recently, and the amount of feedback they got around, you know, “You're a Slack bot. How is this a company?” Was kind of staggering. 
And I think generally where that comes from is—well, first of all that bias that engineers have, which is just everything you look at as an engineer is like, “Yeah, I can build that in a weekend.” I think there's often infinite complexity under the hood that just gets kind of brushed over. But yeah, I think at the core of it, you probably could build a Slack bot in a weekend that creates a channel for you in Slack and allows you to post somewhere that some—

Corey: Oh, good. More channels in Slack. Just what everyone wants.

Chris: Well, there you go. I mean, that's a particularly pertinent one because, like, our tool does do that. And one of the things—so I built, at Monzo, a version of incident.io that we used at the company there, and that was something that I built evenings and weekends. And among the many, many things I never got around to building, archiving and cleaning up channels was one of the ones that was always on that list.

And so, Monzo did have this problem of littered channels everywhere. I think part of the problem here is, like, it is easy to look at a product like ours and sort of assume it is this sort of friendly Slack bot that helps you orchestrate some very basic commands. And I think when you actually dig into the problems that organizations above a certain size have, they're not solved by Slack bots. They're solved by platforms that help you to encode your processes that otherwise have to live in a Google Doc somewhere which is five pages long, and when it's 2 a.m. and everything's on fire, I guarantee you not a single person reads that Google Doc, so your process is as good as not in place at all. That's the beauty of a tool like ours. We have a powerful engine that helps you basically to encode that and take some load off of you.

Corey: To be clear, I'm also not coming at this from a position of judging other people. 
I just look right now at the Slack workspace that we have The Duckbill Group, and we have something like a ten-to-one channel-to-human ratio. And the proliferation of channels is a very real thing. And the problem that I've seen across the board with other things that try to address incident management has always been fanciful at best about what really happens when something breaks. Like, you talk about, oh, here's what happens. Step one: you will pull up the Google Doc, or you will pull up the wiki or the rest, or in some aspirational places, ah, something seems weird, I will go open a ticket in Jira.Meanwhile, here in reality, anyone who's ever worked in these environments knows that step one, “Oh shit, oh shit, oh shit, oh shit, oh shit. What are we going to do?” And all the practices and procedures that often exist, especially in orgs that aren't very practiced at these sorts of things, tend to fly out the window and people are going to do what they're going to do. So, any tool or any platform that winds up addressing that has to accept the reality of meeting people where they are not trying to educate people into different patterns of behavior as such. One of the things I like about your approach is, yeah, it's going to be a lot of conversation in Slack that is a given we can pretend otherwise, but here in reality, that is how work gets communicated, particularly in extremis. And I really appreciate the fact that you are not trying to, like, fight what feels almost like a law of nature at this point.Chris: Yeah, I think there's a few things in that. The first point around the document approach or the clearly defined steps of how an incident works. In my experience, those things have always gone wrong because—Corey: The data center is down, so we're going to the wiki to follow our incident management procedure, which is in the data center just lost power.Chris: Yeah.Corey: There's a dependency problem there, too. [laugh].Chris: Yeah, a hundred percent. [laugh]. 
A hundred percent. And I think part of the problem that I see there is that very, very often, you've got this situation where the people designing the process are not the people following the process. And so, there's this classic distinction; I've heard it through John Allspaw, but there's a bunch of other folks who talk about the difference between people, you know, at the sharp end or the blunt end of the work.

And I think the problem that people have faced in the past is you have these people who sit in the, sort of, metaphorical upstairs of the office and think that they make a company safe by defining a process on paper. And they ship the piece of paper and go, “That is a good job for me done. I'm going to leave and know that I've made the bank—or whatever your organization does—much, much safer.” And I think this is where things fall down because—

Corey: I want to ambush some of those people in their performance reviews with, “Cool. Just for fun, all the documentation here, we're going to pull up the analytics to see how often that stuff gets viewed. Oh, nobody ever sees it. Hmm.”

Chris: It's frustrating. It's frustrating because that never ever happens, clearly. But the point you made around, like, meeting people where they are, I think that is a huge one, which is incidents are founded on great communication. Like, as I said earlier, this is, like, you form a team with someone you've never ever worked with before and the last thing you want to do is be, like, “Hey, Corey, I've never met you before, but let's jump out onto this other platform somewhere that I've never been or haven't been for weeks and we'll try and figure stuff out over there.” It's like, no, you're going to be communicating—

Corey: We use Slack internally, but we have a WhatsApp chat that we wind up using for incident stuff, so go ahead and log into WhatsApp, which you haven't done in 18 months, and join the chat. 
Yeah, in the dawn of time, in the mists of antiquity, you vaguely remember hearing something about that your first week and then never again. This stuff has to be practiced and it's important to get it right. How do you approach the inherent and often unfortunate reality that incident response and management inherently becomes very different depending upon the specifics of your company or your culture or something like that? In other words, how cookie-cutter is what you have built versus adaptable to different environments it finds itself operating in?Chris: Man, the amount of time we spent as a founding team in the early days deliberating over how opinionated we should be versus how flexible we should be was staggering. The way we like to describe it as we are quite opinionated about how we think incidents should be run, however we let you imprint your own process into that, so putting some color onto that. We expect incidents to have a lead. That is something you cannot get away from. However, you can call the lead whatever makes sense for you at your organization. So, some folks call them an incident commander or a manager or whatever else.Corey: There's overwhelming militarization of these things. Like, oh, yes, we're going to wind up taking a bunch of terms from the military here. It's like, you realize that your entire giant screaming fire is that the lights on the screen are in the wrong pattern. You're trying to make them in the right pattern. No one dies here in most cases, so it feels a little grandiose for some of those terms being tossed around in some cases, but I get it. You've got to make something that is unpleasant and tedious in many respects, a little bit more gripping. I don't envy people. Messaging is hard.Chris: Yeah, it is. And I think if you're overly virtuoustic and inflexible, you're sort of fighting an uphill battle here, right? So, folks are going to want to call things what they want to call things. 
And you've got people who want to import [ITIL 00:15:04] definitions for severity ease into the platform because that's what they're familiar with. That's fine.What we are opinionated about is that you have some severity levels because absent academic criticism of severity levels, they are a useful mechanism to very coarsely and very quickly assess how bad something is and to take some actions off of it. So yeah, we basically have various points in the product where you can customize and put your own sort of flavor on it, but generally, we have a relatively opinionated end-to-end expectation of how you will run that process.Corey: The thing that I find that annoys me—in some cases—the most is how heavyweight the process is, and it's clearly built by people in an ivory tower somewhere where there's effectively a two-day long postmortem analysis of the incident, and so on and so forth. And okay, great. Your entire site has been blown off the internet, yeah, that probably makes sense. But as soon as you start broadening that to things like okay, an increase in 500 errors on this service for 30 minutes, “Great. Well, we're going to have a two-day postmortem on that.” It's, “Yeah, sure would be nice if we could go two full days without having another incident of that caliber.” So, in other words, whose foot—are we going to hire a new team whose full-time job it is, is to just go ahead and triage and learn from all these incidents? Seems to me like that's sort of throwing wood behind the wrong arrows.Chris: Yeah, I think it's very reductive to suggest that learning only happens in a postmortem process. So, I wrote a blog, actually, not so long ago that is about running postmortems and when it makes sense to do it. And as part of that, I had a sort of a statement that was [laugh] that we haven't run a single postmortem when I wrote this blog at incident.io. 
Which is probably shocking to many people because we're an incident company, and we talk about this stuff, but we were also a company of five people and when something went wrong, the learning was happening and these things were sort of—we were carving out the time, whether it was called a postmortem, or not to learn and figure out these things. Extrapolating that to bigger companies, there is little value in following processes for the sake of following processes. And so, you could have—Corey: Someone in compliance just wound up spitting their coffee over their desktop as soon as you said that. But I hear you.Chris: Yeah. And it's those same folks who are the ones who care about the document being written, not the process and the learning happening. And I think that's deeply frustrating to me as—Corey: All the plans, of course, assume that people will prioritize the company over their own family for certain kinds of disasters. I love that, too. It's divorced from reality; that's ridiculous, on some level. Speaking of ridiculous things, as you continue to grow and scale, I imagine you integrate with things beyond just Slack. You grab other data sources and over in the fullness of time.For example, I imagine one of your most popular requests from some of your larger customers is to integrate with their HR system in order to figure out who's the last engineer who left, therefore everything immediately their fault because lord knows the best practice is to pillory whoever was the last left because then they're not there to defend themselves anymore and no one's going to get dinged for that irresponsible jackass's decisions, even if they never touched the system at all. I'm being slightly hyperbolic, but only slightly.Chris: Yeah. I think [laugh] that's an interesting point. I am definitely going to raise that feature request for a prefilled root cause category, which is, you know, the value is just that last person who left the organization. 
That it's a wonderful scapegoat situation there. I like it.To the point around what we do integrate with, I think the thing is actually with incidents that's quite interesting is there is a lot of tooling that exists in this space that does little pockets of useful, valuable things in the shape of incidents. So, you have PagerDuty is this system that does a great job of making people's phone making noise, but that happens, and then you're dropped into this sort of empty void of nothingness and you've got to go and figure out what to do. And then you've got things like Jira where clearly you want to be able to track actions that are coming out of things going wrong in some cases, and that's a great tool for that. And various other things in the middle there. And yeah, our value proposition, if you want to call it that, is to bring those things together in a way that is massively ergonomic during an incident.So, when you're in the middle of an incident, it is really handy to be able to go, “Oh, I have shipped this horrible fix to this thing. It works, but I must remember to undo that.” And we put that at your fingertips in an incident channel from Slack, that you can just log that action, lose that cognitive load that would otherwise be there, move on with fixing the thing. And you have this sort of—I think it's, like, that multiplied by 1000 in incidents that is just what makes it feel delightful. And I cringe a little bit saying that because it's an incident at the end of the day, but genuinely, it feels magical when some things happen that are just like, “Oh, my gosh, you've automatically hooked into my GitHub thing and someone else merged that PR and you've posted that back into the channel for me so I know that that happens. That would otherwise have been a thing where I jump out of the incident to go and figure out what was happening.”Corey: This episode is sponsored in part by our friend EnterpriseDB. 
EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL on-premises, private cloud, and they just announced a fully-managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor because they're too busy launching another half-dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications—including Oracle—to the cloud. To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.Corey: The problem with the cloud, too, is the first thing that, when there starts to be an incident happening is the number one decision—almost the number one decision point is this my shitty code, something we have just pushed in our stuff, or is it the underlying provider itself? Which is why the AWS status page being slow to update is so maddening. Because those are two completely different paths to go down and you are having to pursue both of them equally at the same time until one can be ruled out. And that is why time to identify at least what side of the universe it's on is so important. That has always been a bit of a tricky challenge.I want to talk a bit about circular dependencies. You target a certain persona of customer, but I'm going to go out on a limb and assume that one explicit company that you are not going to want to do business with in your current iteration is Slack itself because a tool to manage—okay, so our service is down, so we're going to go to Slack to fix it doesn't work when the service is Slack itself. So, that becomes a significant challenge. As you look at this across the board, are you seeing customers having problems where you have circular dependency issues with this? 
Easy example: Slack is built on top of AWS.

When there's an underlying degradation of, huh, suddenly us-east-1 is not doing what it's supposed to be doing, now Slack is degraded, as well as the customer site, and it seems like at that point, you're sort of in a bit of a tricky position as a customer. Counterpoint, when neither Slack nor your site are working, figuring out what caused that issue doesn't seem like it's the biggest stretch of the imagination at that point.

Chris: I've spent a lot of my career working in infrastructure, platform-type teams, and I think you can end up tying yourself in knots if you try and over-optimize for, like, avoiding these dependencies. I think it's one of those, sort of, turtles all the way down situations. So yes, Slack are unlikely to become a customer because they are clearly going to want to use our product when they are down.

Corey: They reach out, “We'd like to be your customer.” Your response is, “Please don't be.” None of us are going to be happy with this outcome.

Chris: Yeah, I mean, the interesting thing there is that we're friends with some folks at Slack, and, believe it or not, they do use Slack to navigate their incidents. They have an internal tool that they have written. And I think this sort of speaks to the point we made earlier, which is that incidents and things failing are not these sort of big binary events. And so—

Corey: All of Slack is down is not the only kind of incident that a company like Slack can experience.

Chris: I'd go as far as to say that it's most commonly not that. It's most commonly that you're navigating incidents where it is a degradation, or some edge case, or something else that's happened. 
And so, like, the pragmatic solution here is not to avoid the circular dependencies, in my view; it's to accept that they exist and make sure you have sensible escape hatches so that when something does go wrong—so a good example, we use incident.io at incident.io to manage incidents that we're having with incident.io. And 99% of the time, that is absolutely fine because we are having some error in some corner of the product or a particular customer is doing something that is a bit curious.And I could count literally on one hand the number of times that we have not been able to use our products to fix our product. And in those cases, we have a fallback which is jump into—Corey: I assume you put a little thought into what happened. “Well, what if our product is down?” “Oh well, I guess we'll never be able to fix it or communicate about it.” It seems like that's the sort of thing that, given what you do, you might have put more than ten seconds of thought into.Chris: We've put a fair amount of thought into it. But at the end of the day, [laugh] it's like if stuff is down, like, what do you need to do? You need to communicate with people. So, jump on a Google Chat, jump on a Slack huddle, whatever else it is we have various different, like, fallbacks in different order. And at the core of it, I think this is the thing is, like, you cannot be prepared for every single thing going wrong, and so what you can be prepared for is to be unprepared and just accept that humans are incredibly good at being resilient, and therefore, all manner of things are going to happen that you've never seen before and I guarantee you will figure them out and fix them, basically.But yeah, I say this; if my SOC 2 auditor is listening, we also do have a very well-defined, like, backup plan in our SOC 2 [laugh] in our policies and processes that is the thing that we will follow that. But yeah.Corey: The fact that you're saying the magic words of SOC 2, yes, exactly. 
Being in a responsible adult and living up to some baseline compliance obligations is really the sign of a company that's put a little thought into these things. So, as I pull up incident.io—the website, not the company to be clear—and look through what you've written and how you talk about what you're doing, you've avoided what I would almost certainly have not because your tagline front and center on your landing page is, “Manage incidents at scale without leaving Slack.” If someone were to reach out and say, well, we're down all the time, but we're using Microsoft Teams, so I don't know that we can use you, like, the immediate instinctive response that I would have for that to the point where I would put it in the copy is, “Okay, this piece of advice is free. I would posit that you're down all the time because you're the kind of company to use Microsoft Teams.” But that doesn't tend to win a whole lot of friends in various places. In a slightly less sarcastic bent, do you see people reaching out with, “Well, we want to use you because we love what you're doing, but we don't use Slack.”Chris: Yeah. We do. A lot of folks actually. And we will support Teams one day, I think. There is nothing especially unique about the product that means that we are tied to Slack.It is a great way to distribute our product and it sort of aligns with the companies that think in the way that we do in the general case but, like, at the core of what we're building, it's a platform that augments a communication platform to make it much easier to deal with a high-stress, high-pressure situation. And so, in the future, we will support ways for you to connect Microsoft Teams or if Zoom sought out getting rich app experiences, talk on a Zoom and be able to do various things like logging actions and communicating with other systems and things like that. But yeah, for the time being very, very deliberate focus mechanism for us. 
We're a small company with, like, 30 people now, and so yeah, focusing on that sort of very slim vertical is working well for us.

Corey: And it certainly seems to be working to your benefit. Every person I've talked to who has encountered you folks has nothing but good things to say. We have a bunch of folks in common listed on the wall of logos, the social proof eye chart thing of here's people who are using us. And these are serious companies. I mean, your last job before starting incident.io was at Monzo, as you mentioned.

You know what you're doing in a regulated, serious sense. I would be, quite honestly, extraordinarily skeptical if your background were significantly different from this because, “Well, yeah, we worked at Twitter for Pets in our three-person SRE team, we can tell you exactly how to go ahead and handle your incidents.” Yeah, there's a certain level of operational maturity that, just based upon the name of the company there, I kind of don't think that Twitter for Pets is going to nail. Monzo is a bank. Guess you know what you're talking about, given that you have not, basically, been shut down by an army of regulators. It really does breed an awful lot of confidence.

But what's interesting to me is that a number of the people that we talk to in common are not themselves banks. Some are and they do very serious things, but others are not these highly regulated, command-and-control, top-down companies. You are nimble enough that you can get embedded at the most startup-y of startup companies once they hit a certain point of scale and wind up helping them arrive at a better outcome. It's interesting in that you don't normally see a whole lot of tools that wind up being able to speak to both sides of that very broad spectrum—and most things in between—very effectively. But you've somehow managed to thread that needle. Good work.

Chris: Thank you. Yeah. What else can I say other than thank you? 
I think, like, it's a deliberate product positioning that we've gone down to try and be able to support those different use cases. So, I think, at the core of it, we have always tried to maintain the incident.io should be installable and usable in your very first incident without you having to have a very steep learning curve, but there is depth behind it that allows you to support a much more sophisticated incident setup.So, like, I mean, you mentioned Monzo. Like, I just feel incredibly fortunate to have worked at that company. I joined back in 2017 when they were, I don't know, like, 150,000 customers and it was just getting its banking license. And I was there for four years and was able to then see it scale up to 6 million customers and all of the challenges and pain that goes along with that both from building infrastructure on the technical side of things, but from an organizational side of things. And was, like, front-row seat to being able to work with some incredibly smart people and sort of see all these various different pain points.And honestly, it feels a little bit like being in sort of a cheat mode where we get to this import a lot of that knowledge and pain that we felt at Monzo into the product. And that happens to resonate with a bunch of folks. So yeah, I feel like things are sort of coming out quite well at the moment for folks.Corey: The one thing I will say before we wind up calling this an episode is just how grateful I am that I don't have to think about things like this anymore. There's a reason that the problem that I chose to work on of expensive AWS bills being very much a business-hours only style of problem. We're a services company. We don't have production infrastructure that is externally facing. “Oh, no, one of our data analysis tools isn't working internally.”That's an interesting curiosity, but it's not an emergency in the same way that, “Oh, we're an ad network and people are looking at ads right now because we're broken,” is. 
So, I am grateful that I don't have to think about these things anymore. And also a little wistful because there's so much that you do that would have made dealing with expensive and dangerous outages back in my production years a lot nicer.

Chris: Yep. I think that's what a lot of folks are telling us essentially. There's this curious thing with, like, this product didn't exist however many years ago and I think it's sort of been quite emergent in a lot of companies that, you know, as sort of things have moved on, that something needs to exist in this little pocket of space, dealing with incidents in modern companies. So, I'm very pleased that what we're able to build here is sort of working and filling that for folks.

Corey: Yeah. I really want to thank you for taking so much time to go through the ethos of what you do, why you do it, and how you do it. If people want to learn more, where's the best place for them to go? Ideally, not during an incident.

Chris: Not during an incident, obviously. Handily, the website is the company name. So, incident.io is a great place to go and find out more. We've literally—literally just today, actually—launched our Practical Guide to Incident Management, which is, like, a really full piece of content which, hopefully, will be useful to a bunch of different folks.

Corey: Excellent. We will, of course, put a link to that in the show notes. I really want to thank you for being so generous with your time. Really appreciate it.

Chris: Thanks so much. It's been an absolute pleasure.

Corey: Chris Evans, Chief Product Officer and co-founder of incident.io. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. 
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice along with an angry comment telling me why your latest incident is all the intern's fault.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.


KubeCon, Kindness, and Legos with Michael Chenetz

Break Things On Purpose

Play Episode Listen Later May 31, 2022 27:57

Today we chat with Cisco's head of developer content, community, and events, Michael Chenetz. We discuss everything from KubeCon to kindness and Legos! Michael delves into some of the main themes he heard from creators at KubeCon, and we discuss methods for increasing adoption of new concepts in your organization. We have a conversation about attending live conferences, COVID protocol, and COVID shaming, and then we talk about how Legos can be used in talks to demonstrate concepts. We end the conversation with a discussion about combining passions to practice creativity. We discuss our time at KubeCon in Spain (5:51) Themes Michael heard at KubeCon talking with creators (7:46) Increasing adoption of new concepts (9:27) We talk conferences, COVID shaming, and blamelessness (12:21) Legos and reliability (18:04) Michael talks about ways to exercise creativity (23:20) Links: KubeCon October 2022: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/ Nintendo Lego Set: https://www.amazon.com/dp/B08HVXMQ87?ref_=cm_sw_r_cp_ud_dp_ED7NVBWPR8ANGT8WNGS5 Cloud Unfiltered podcast episode featuring Julie and Jason:https://podcasts.apple.com/us/podcast/ep125-chaos-engineering-with-julie-gunderson-and-jason/id1215105578?i=1000562393884 Links Referenced: Cisco: https://www.cisco.com/ Cloud Unfiltered Podcast with Julie and Jason: https://podcasts.apple.com/us/podcast/ep125-chaos-engineering-with-julie-gunderson-and-jason/id1215105578?i=1000562393884 Cloud Unfiltered Podcast: https://www.cisco.com/c/en/us/solutions/cloud/podcasts.html Nintendo Lego: https://www.amazon.com/dp/B08HVXMQ87 TranscriptJulie: And for folks that are interested in, too, what day it is—because I think we're all still a little bit confused—it is Monday, May 24th that we are recording this episode.Jason: Uh, Julie's definitely confused on what day it is because it's actually Tuesday, [laugh] May 24th.Michael: Oh, my God. [laugh]. That's great. 
I love it.

Julie: Welcome to Break Things on Purpose, a podcast about reliability, learning from each other, and blamelessness. In this episode, we talk to Michael Chenetz, head of developer content, community, and events at Cisco, about all of the learnings from KubeCon, the importance of being kind to each other, and of course, how Lego translates into technology.

Julie: Today, we are joined by Michael Chenetz. Michael, do you want to tell us a little bit about yourself?

Michael: Yeah. [laugh]. Well, first of all, thank you for having me on the show. And I'm really good at breaking things, so I guess that's why I'm asked to be here, because I'm superb at it. What I'm not so good at is, like, putting things back together.

Like when I was a kid, I remember taking my dad's stereo apart; he wasn't too happy about that. Wasn't very good at putting it back together. But you know, so that's just going back a little ways there. But yeah, so I work for DevRel at Cisco and my whole responsibility is, you know, to get people to know a little bit about us in terms of, you know, all the developer-related topics.

Julie: Well, and Jason and I had the awesome opportunity to hang out with you at KubeCon, where we got to join your Cloud Unfiltered podcast. So folks, definitely go check out that episode. We had a lot of fun. We'll put a link in the show notes. But yeah, let's talk a little bit about KubeCon. So, as of recording this episode, we all just recently traveled back from Spain, for KubeCon EU, which was… amazing. I really enjoyed being there. My first time in Spain. I got back, I can tell you, less than 24 hours ago. Michael, I think—when did you get back?

Michael: So, I got back Saturday night, but my bags have not arrived yet. So, they're still traveling and they're enjoying Europe. And they should be back soon, I guess when they feel like they're—you know, they should be back from vacation.

Julie: [laugh].

Michael: So. 
[laugh].Julie: Jason, how about you? When did you get home?Jason: I got home on Sunday night. So, I took the train from Valencia to Barcelona on Saturday evening, and then an early morning flight on Sunday and got home late Sunday night.Julie: And for folks that are interested in, too, what day it is—because I think we're all still a little bit confused—it is Monday, May 24th that we are recording this episode.Jason: Uh, Julie's definitely confused on what day it is because it's actually Tuesday, [laugh] May 24th.Michael: Oh, my God. [laugh]. That's great. I love it. By the way, yesterday was my birthday so I'm going to say—Julie: Happy birthday.Michael: —happy birthday to myself.Julie: Oh, my gosh, happy birthday. [laugh].Michael: Thank you [laugh].Julie: So… what is time anyway?Jason: Yeah.Michael: It's all good. It's all relative. Time is relative.Julie: Time is relative. And so, you know, tell us a little bit about—I'd love to know a little bit about why you want folks to know about, like, what is the message you try to get across?Jason: Oh, that's not the question I thought you were going to ask. I thought you were going to ask, “What's on your Amazon wishlist so people can send you birthday presents?”Julie: Yeah, let's back up. Let's do that. So, let's start with your Amazon wishlist. We know that there might be some Legos involved.Michael: Oh, my God, yeah. I mean, you just told me about a cool one, which was Optimus Prime and I just—I'm already on the website, my credit card is out and I'm ready to buy. So, you know, this is the problem with talking to you guys. [laugh]. It's definitely—you know, that's definitely on my list. So, anything that, anything music-related because obviously behind me is a lot of music equipment—I love music stuff—and anything tech. The combination of tech and music, and if you can combine Legos and that, too, man that would just match all the boxes. [laugh].Julie: Just to let you know, there's a Lego Con. 
Like, I did not know this until last night, actually. But it is a virtual conference.Michael: Really.Julie: Yeah. But one of the things I was looking at actually on Lego, when you look at their website, like, to request one of their speakers, to request one of their engineers as a speaker, they actually don't do that because they get so many requests for their folks to speak at conferences, they actually have a dedicated part of their website that talks about this. So, I thought that was interesting.Michael: Well listen, just because of that, if they want somebody that's in, you know, cloud computing, I'm not going to go talk for Lego. And I know they really want somebody from cloud computing talking to Lego, so, you know… it's, you know, quid pro quo there, so that's just the way it's going to work. [laugh].Julie: I want to be best friends with Lego people.Michael: [laugh]. I know, me too.Julie: I'm just going to make it a goal in life now to have one of their engineers speak at DevOpsDays Boise. It's like a challenge.Michael: It is. I accept it.Julie: [laugh]. With that, though, just on other Lego news, before we start talking about all the other things that folks may also want to hear about, there is another new Lego, which is the Van Gogh Starry Night that has been newly released by the time this episode comes out.Michael: With a free ear, right?Julie: I mean—[laugh].Michael: Is that what happens?Julie: —well played. Well, played. [laugh]. So, now you really got to spend a lot of time at KubeCon, you were just really recording podcast after podcast.Michael: Oh, my God. Yeah. So, I mean, it was great. I love—because I'm a techie, so I love tech and I love to find out origin stories of stuff. So, I love to, like, talk to these people and like, “Why did that come about? How did—” you know, “What happened in your life that made you want to do this? 
Who hurt you?” [laugh].And so, that's what I constantly try and figure out is, like, [laugh], “What is that?” So, it was really cool because I had, like, Jimmy Zelinskie who came from CoreOS, and he came from—you know, they create, you know, Quay and some of this other kinds of stuff. And you know, just to talk about, like, some of the operators and how they came about, and like… those were the original operators, so that was pretty cool. Varun from Tetrate was supposed to come on, and he created Istio, you know? So, there were so many of these things that I just geek out knowing about, you know?And then the other thing that was really high on our list, and it's really high from where I am, is API quality, API testing, API—so really, that's why I got in touch with you guys because I was like, “Wow, that fits in really good, you know? You guys are doing stuff that's around chaos, and you know, I think that's amazing.” So, all of this stuff is just so interesting to me. But man, it was just a whirlwind of every day just recording, and by the end that was just like, you know, “I'm so sorry, but I just, I can't talk anymore.” You know, and that was it. [laugh].Jason: I love that chatting with the creators. We had Zack Butcher on who is also from Tetrate and one of the early Istio—Michael: Yeah, yeah.Jason: Contributors. And I find it fascinating because I feel like when you chat with these folks, you start to understand the context of why things were built. And it—Michael: Yes.Jason: —it opens your brain up to, like, cool, there's a software—oh, now I know exactly why it's doing things that way, right? Like, it's just so, so eye-opening. I love it.Julie: With that, though, like, did you see any trends or any themes as you were talking to all these folks?Michael: Yeah, so a few real big trends. One is everybody wants to know about eBPF. 
That was the biggest thing at KubeCon, by far, was that, “We want to learn how to do this low-level kernel stuff that's really fast, that can give us all the information we need, and we don't have to use sidecars and things like that.” I mean it was—you know, that was the most excitement that I saw. OTel was another one for OpenTelemetry, which was a big one.The other thing was simplification. You know, a lot of people were looking to simplify the Kubernetes ecosystem because there's so much out there, and there's so many things that you have to learn about that it was super hard, you know, for somebody to come into it to say, “Where do I even start?” You know? So, that was a big theme was simplification.I'm trying to think. I think another one is APIs, for sure. You know, because there's this whole thing about API sprawl. And people don't know what their APIs are, people just, like—you know, I always say people can see—like, developers are lazy in a good way, and I consider myself one of them. So, what that means is that when we want to develop something, what we're going to do is we're just going to pull down the nearest API that does what we need, that has the best documentation, that has the best blog, that has the best everything.We don't know what their testing strategy is; we don't know what their security strategy is; we don't know if they use other libraries. And you have to figure that stuff out. And that's the thing that—you know, so everything around APIs is super important. And you really have to test that stuff out. Yes, people, you have to test it [laugh] and know more about it. So, those are those were the big themes, I think. [laugh].Julie: You know, I know that Kerim and I gave a talk on observability where we kind of talked more high-level about some of the overarching concepts, but folks were really excited about that. 
I think it was because we briefly touched on OpenTelemetry, which we should have gone into a little bit more depth, but there's only so much you can fit into a 30-minute talk, so hopefully we'll be able to talk about that more at a KubeCon in the future, we [crosstalk 00:09:54] to the selection committee. Michael: Hashtag topics? Julie: Uh-huh. [laugh]. You know, that said, though, it really did seem like a huge topic that people just wanted to learn more about. I know, too, at the Gremlin booth, a lot of folks were also interested in talking about, like, how do we just get our organization to adopt some of these concepts that we're hearing about here? And I think that was the thing that surprised me the most is I expected people to be coming up to the booth and deep-diving into very, very deep, technical-level questions, and really, a lot of it was how do we get our organization to do this? How can we increase adoption? So, that was a surprise for me. Michael: Yeah, you know what, and I would say two things to that. One is, when you talk about Chaos Engineering, I think people think it's like rocket science and people are really scared and they don't want to claim to be experts in it, so they're like, “Wow, this is, like, next-level stuff, and you know, we're really scared. You guys are the experts. I don't want to even attempt this.” And the other thing is that organizations are scared because they think that it's going to, like, create mass hysteria throughout their organization. And really, none of this is true in either way. In reality, it's a very, very scripted, very exacting stuff that you're testing, and you throw stuff out there and see what kind of response you get. So, you know, it's not this, like, you know—I think people just have—there needs to be more education around a lot of areas in cloud-native. But you know, that's one of the areas. So, I think it's really interesting there. Julie: I think so too. How about for you, Jason? 
Like, what was your surprise from the conference or something that maybe—Jason: Yeah, I mean, I think my surprise was mostly around just seeing people coming back, right? Because we're now I would say, six months into conferences being back as a thing, right? Like, we had re:Invent last year in Vegas; we had KubeCon last year in LA, and so, like, those are okay events. They weren't, like, back to normal. And this was, I feel like, one of the first conferences, that it really started to feel back to normal.Like, there was much better attendance, there was much more just buzz and hallway tracking and everything else that we're used to. Like, the whole reason that we go to conferences is getting together with people and hanging out and stuff, and this one has so far felt the most back-to-normal out of any event that I've been to over the past six months.Michael: Can I just talk about one thing that I think, you know, people have to get over is, you know, I see a lot online, I think it was—I forget who it was that was talking about it. But this whole idea of Covid shaming. I mean, we're going to this event, and it's like, yeah, everybody wants to get out, everybody wants to learn things, but don't shame people just because they got Covid, everybody's getting Covid, okay? That's just the point of life at this point. So, let's just, you know, let's just be nice to each other, be friendly to each other, you know? I just have to say that because I think it's a shame that people are getting shamed, you know, just for going to an event. [laugh].Julie: See, and I think that—that's an interesting—there's been a lot of conversation around this. And I don't think anybody should be Covid-shamed. Look, I think that we all took a calculated risk in coming—Michael: Absolutely.Julie: To this event. I personally gave out a lot of hugs. I hugged some of the folks that have mentioned that they have come up positive from Covid, so there's a calculated risk in going. 
I think there has been a little bit of pushback on maybe how some of the communication has come out around it. That said, as an organizer of a small conference with, like, 400 people, I think that these are very complicated matters. And what I really think is important is to listen to feedback from attendees and to take that.And then we're always looking to improve, right?Michael: Absolutely.Julie: If everything that we did was perfect right out of the gate, then we wouldn't have Chaos Engineering because there'd be nothing [crosstalk 00:13:45] be just perfectly reliable. And so, if we take away anything, let's take away—just like what you said, first of all, Covid, you should never shame somebody for having Covid. Like, that's not cool. It's not somebody's fault that they caught an illness.Michael: Yes.Julie: I mean unless they were licking doorknobs. And that's a whole different—Michael: Yes. [laugh]. That's a whole different thing, right there.Julie: Conversation. But when we talk about just like these questions around cultural adoption, we talk about blamelessness; we talk about learning from failure; we talked about finding ways to improve, and I think all of that can come into play. So, it'll be interesting to see how we learn and grow as we move forward. And like, thank you to re:Invent, thank you to KubeCon, thank you to DevOpsDays Boise. But these conferences that have started going back in-person, at great risk to organizers and the committee because people are going to be mad, one way or the other.Michael: Yeah. And you can see that people want to be back because it was huge, you know?Julie: Yeah.Michael: Maybe you guys, I'm going to put in a feature request for Gremlin to chaos engineer crowds. Can we do that so we can figure out, like, what's going to happen when we have these big events? Can we do that?Julie: I mean, that sounds fun. 
I think what's going to happen is there's going to be hugs, there's going to be people getting sick, but there's going to be people learning and growing.Michael: Yes.Julie: And ultimately, I just think that we have to remember that just, like, our systems aren't perfect, and neither are people. Like, the fact that we expect people to be perfect, and maybe we should just keep some mask mandates for a little bit longer when we're at conferences with 8000 people.Michael: Sure.Julie: I mean, that's—Michael: That makes sense.Jason: Yeah. I mean, it's all about risk management, right? This is, essentially what we do in SRE is there's always a risk of a massive outage, and so it's that balance of, right, do what you can, but ultimately, that's why we have SLOs and things is, you can never be a hundred percent, so like, where do we draw the line of here are the things that we're going to do to help manage this risk, but you can never shoot for a perfectly, entirely safe space, right? Because then we'd all be having conferences in padded rooms, and not touching each other, and things like that. There's a balance there.And I think we're all just trying to find that, so yeah, as you mentioned, that whole, like, DevOps blamelessness thing, you know, treat each other with the notion that we're all trying to get through this together and do what we think is best. Nobody's just like John Allspaw said, you know, “Nobody goes to work thinking that, like, their intent is to crash everything and destroy the company.” No one's going to KubeCon or any of these conferences thinking, “Yeah, I'm going to be a super-spreader.”Julie: [laugh].Michael: Yeah, that would be [crosstalk 00:16:22].Jason: Like, everyone's trying not to do it. They're doing their best. They're not actively, like, aggressively trying to get you sick or intentionally about it. But you know—so just be kind to one another.Michael: Yeah. And that's the key.Julie: It is.Michael: The key. Be kind to one another, you know? 
I mean, it's a great community. People are really nice, so, you know, let's keep that up. I think that's something special about the, you know, the community around KubeCon, specifically.Julie: As we can refine this and find ways, I would take all of the hugs over virtual conferences—Michael: Yes.Julie: Any day now. Because, as Jason mentioned, is even just with you, Michael, the time we got to spend with you, or the time I kept going up to Jfrog's booth and Baruch and I would have conversations as he made me a delicious coffee, these hallway tracks, these conversations, that's what no one figured out how to recreate during the virtual events—Michael: Absolutely.Julie: —and it's just not possible, right?Michael: Yeah. I mean, I think it would take a little bit of VR and then maybe some, like, suit that you wear in order to feel the hug. And, you know, so it would take a lot more in order to do that. I mean, I guess it's technologically possible. I don't know if the graphics are there yet, so it might be like a pixelated version, like, you know, like, NES-style, or something like that. But it could look pretty cool. [laugh]. So, we'll have to see, you know?Julie: Everybody listening to this episode, I hope you're getting as much of a kick out of it as we are recording it because I mean, there are so many different topics here. One of the things that Michael and I bonded about years ago, for our listeners that are—not years ago; months ago. Again, what is time?Michael: Yeah. What is time? It's all relative.Julie: It is. It was Lego, though, and so we've been talking about that. But Michael, you asked a great question when we were recording with you, which is, like—Michael: Wow.Julie: Can—just one. Only one great question.Michael: [laugh].Julie: [laugh]. Which was, how would you incorporate Lego into a talk? And, like, when we look at our systems breaking and all of that, I've really been thinking about that and how to make our systems more reliable. 
And here's one of the things I really wanted to clarify that answer. I kind of went… I went talking about my Lego that I build, like, my Optim—not my Optimus Primes, I don't have it, but my Voltron or my Nintendo Lego. And those are all box sets.Michael: Yep.Julie: But one of the things if you're not playing with a box set with instruction, if you're just playing with just the—or excuse me, architecting with just the Lego blocks because it's not playing because we're adults now, I think.Michael: Yes, now it's architecting. Yes.Julie: Yes, now that we're architecting, like, that's one of the things that I was really thinking about this, and I think that it would make something really fun to talk about is how you're building upon each layer and you're testing out these new connection pieces. And then that really goes into, like, when we get into Technics, into dependencies because if you forget that one little one-inch plastic piece that goes from the one to the other, then your whole Lego can fall apart. So anyway, I just thought that was really interesting, and I'd wondered if you or Jason even gave that any more thought, or if it was just fleeting for you.Michael: It was definitely fleeting for me, but I will give it some more thought, you know? But you know, when—as you're saying that though, I'm thinking these Lego pieces really need names because you're like that little two-inch Lego piece that kind of connects this and this, like, we got to give these all names so that people can know, that's x-54 that's—that you're putting between x-53 and x-52. I don't know but you need some kind of name for these parts now.Julie: There are Lego names. You just Google it. There are actual names for all of the parts but—Michael: Wow. [laugh].Julie: Like, Jason, what do you think? I know you've got [unintelligible 00:19:59].Jason: Yeah, I mean, I think it's interesting because I am one of those, like, freeform folks, right? 
You know, my standard practice when I was growing up with Legos was you build the thing that you bought once and then you immediately, like, tear it apart, and you build whatever the hell you want.Michael: Absolutely.Jason: So, I think that that's kind of an interesting thing as we think about our systems and stuff, right? Like, part of it is, like, yeah, there's best practices and various companies will publish, like, you know, “Here's how to architect such-and-such system.” And it's interesting because that's just not reality, right? You're not going to go and take, like, the Amazon CloudFormation thing, and like, congrats, you're done. You know, you just implement that and your job's done; you just kick back for the rest of the week.It never works that way, right? You're taking these little bits of, like, cool, I might have, like, set that up once just to see what's happening but then you immediately, like, deconstruct it, and you take the knowledge of what you learned in those building blocks, and you, like, go and remix it to build the thing that you actually need to build.Michael: But yeah, I mean, that's exactly—so you know, Legos is what got me interested in that as a kid, but when you look at, you know, cloud services and things like that, there's so many different ways to combine things and so many different ways to, like—you know, you could use Terraform, you could use Crossplane, you could use, you know, any of the services in the cloud, you could use FaaS, you could use serverless, you could use, you know, all these different kinds of solutions and tie them together. So, there's so much choice, and what Lego teaches you is that, embrace the choice. Figure out and embrace the different pieces, embrace all the different things that you have and what the art of possibility is, and then start to build on that. So, I think it's a really good thing. 
And that's why there's so much correlation between, like, kind of, art and tech and things like that because that's the kind of mentality that you need in order to be really successful in tech.Jason: And I think the other thing that works really well with what you said is, as you're playing with Legos, you start to learn these hacks, right? Like, I don't have, like, a four-by-one brick, but I know that if I have three four-by-one flats, I can stack those three and it's the same height as a brick, right?Michael: Yep.Jason: And you can start combining things. And I love that engineering mentality of, like, I have this problem that I need to solve, I have a limited toolbox for whatever constraints, right, and understanding those constraints, and then cool, how can I remix what I've got in my toolbox to get this thing done?Michael: And that's a thing that I'm always doing. Like, when I used to do a lot of development, you know, it was always like, what is the right code? Or what is the library that's going to solve my problem? Or what is the API that's going to solve my problem, you know?And there's so many different ways to do it. I mean, so many people are afraid of, like, making the wrong choice, when really in programming, there is no wrong choice. It's all about how you want to do it and what makes sense to you, you know? There might be better options in formatting and in the way that you kind of, you know, format that code together and put them in different libraries and things like that, but making choices on, like, APIs and things like that, that's all up to the artist. I would say that's an artist. [laugh]. So, you know, I think it all stems though, when you go back from, you know, just being creative with things… so creativity is king.Jason: So Michael, how do you exercise your creativity, then? How do you keep up that creativity?Michael: Yeah, so there's multiple ways. 
And that's a great segment because one of the things that I really enjoy—so you know, I like development, but I'm also a people person. And I like product management, but I also like dealing with people. So really, to me, it's about how do I relate products, how do I relate solutions, how do I talk to people about solutions that people can understand? And that's a creative process.Like, what is the right media? What is the right demos? What is the right—you know, what do people need? And what do people need to, kind of, embrace things? And to me, that's a really creative medium to me, and I love it.So, I love that I can use my technical, I love that I can use my artistic, I love that I can use, you know, all these pieces all at once. And sometimes maybe I'll play guitar and just put it in the intro or something, I don't know. So, that kind of combines that together, too. So, we'll figure that piece out later. Maybe nobody wants to hear me play guitar, that's fine, too. [laugh].But I love to be able to use, you know, both sides of my brain to do these creative aspects. So, that's really what does it. And then sometimes I'll program again and I'll find the need, and I'll say, “Hey, look, you know, I realized there's a need for this,” just like a lot of those creators are. But I haven't created anything cool, but you know, maybe someday I will. I feel like it's just been in between all those different intersections that's really cool.Jason: I love the electric guitar stuff that you mentioned. So, for folks who are listening to this show, during our recording of the Cloud Unfiltered you were talking about bringing that art and technical together with electric guitars, and you've been building electric guitar pickups.Michael: Yes. Yeah. So, I mean, I love anything that can combine my music passion with tech, so I have a CNC machine back here that winds pickups and it does it automatically. 
So, I can say, “Hey, I need a 57 pickup, you know, whatever it is,” and it'll wind it to that exact spec. But that's not the only thing I do. I mean, I used to design control surfaces for artists that were a big band, and I really can't—a lot of them I can't mention because we're under NDA. But I designed a lot of these big, you know, control surfaces for a lot of the big electronic and rock bands that are out there. I taught people how to use Max for Live, which is an artist's, kind of, programming language that's graphical, so [NMax 00:25:33] and MSP and all that kind of stuff. So, I really, really like to combine that. Nowadays, you know, I'm talking about doing some kind of events that maybe combine tech with art. So, maybe doing things like Algorave, and you know, things that are live-coding music and an art. So, being able to combine all these things together, I love that. That's my ultimate passion. Jason: That is super cool. Julie: I think we have learned quite a bit on this episode of Break Things on Purpose, first of all, from the guy who said he hasn't created much—because you did say that, which I'm going to call you out on that because you just gave a long list of things that you created. And I think we need to remember that we're all creators in our own way, so it's very important to remember that. But I think that right now we've created a couple of options for talks in the future, whether or not it's with Lego, or guitar pickups. Michael: Yeah. Julie: Is that— Michael: Hey— Julie: Because I— Michael: Yeah, why not? Julie: —know you do kind of explain that a little bit to me as well when I was there. So, Michael, this has just been amazing having you. We're going to put a lot of links in the notes for everybody today. So, to Michael's podcast, to some Lego, and to anything else Michael wants to share with us as well. Oh, real quick, is there anything you want to leave our listeners with other than that? You know, are you looking to hire at Cisco? 
Is there anything you wanted to share with us?Michael: Yeah, I mean, we're always looking for great people at Cisco, but the biggest thing I'd say is, just realize that we are doing stuff around cloud-native, we're not just network. And I think that's something to note there. But you know, I just love being on the show with you guys. I love doing anything with you guys. You guys are awesome, you know. So.Julie: You're great too, and I think we'll probably do more stuff, all of us together, in the future. And with that, I just want to thank everybody for joining us today.Michael: Thank you. Thanks so much. Thanks for having me.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

covid-19 god spotify amazon time live europe google technology las vegas battle chaos spain resilience kindness barcelona engineering vr figure lego increasing infrastructure programming api gremlins cisco nes faa apis reliability devops invent nda voltron kubernetes cnc baruch msp optimus prime sre varun quay terraform pogs technics devrel kerim break things komiku chaos engineering kubecon ebpf istio slos otel coreos crossplane john allspaw tetrate

DevOps (noun) [Word Notes]

Play Episode Listen Later May 24, 2022 7:00

The set of people, process, technology, and cultural norms that integrates software development and IT operations into a system-of-systems. CyberWire Glossary link: Audio reference link: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr," by John Allspaw and Paul Hammond, Velocity 09, 25 July 2009.

devops velocity flickr noun paul hammond john allspaw

Reliability Starts in Cultural Change with Amy Tobey

Play Episode Listen Later May 11, 2022 46:37

About Amy
Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she spends her time building an innovative Site Reliability Engineering program at Equinix, where she is a principal engineer. When she's not working, she can be found with her nose in a book, watching anime with her son, making noise with electronics, or doing yoga poses in the sun.

Links Referenced:
Equinix Metal: https://metal.equinix.com
Personal Twitter: https://twitter.com/MissAmyTobey
Personal Blog: https://tobert.github.io/

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. “Screaming in the Cloud” listeners can try Vultr for free today with a $150 in credit when they visit getvultr.com/screaming. 
That's G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast. Corey: Finding skilled DevOps engineers is a pain in the neck! And if you need to deploy a secure and compliant application to AWS, forgettaboutit! But that's where DuploCloud can help. Their comprehensive no-code/low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks, while automating the full DevSecOps lifestyle. Get started with DevOps-as-a-Service from DuploCloud so that your cloud configurations are done right the first time. Tell them I sent you and your first two months are free. To learn more visit: snark.cloud/duplo. That's snark.cloud/D-U-P-L-O-C-L-O-U-D. Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while I catch up with someone that it feels like I've known for ages, and I realize somehow I have never been able to line up getting them on this show as a guest. Today is just one of those days. And my guest is Amy Tobey who has been someone I've been talking to for ages, even in the before-times, if you can remember such a thing. Today, she's a Senior Principal Engineer at Equinix. Amy, thank you for finally giving in to my endless wheedling. Amy: Thanks for having me. You mentioned the before-times. Like, I remember it was, like, right before the pandemic we had beers in San Francisco wasn't it? There was Ian there— Corey: Yeah, I— Amy: —and a couple other people. It was a really great time. And then— Corey: I vaguely remember beer. Yeah. And then— Amy: And then the world ended. Corey: Oh, my God. Yes. It's still March of 2020, right? Amy: As far as I know. Like, I haven't checked in a couple years. Corey: So, you do an awful lot. And it's always a difficult question to ask someone, so can you encapsulate your entire existence in a paragraph? It's— Amy: [sigh]. Corey: —awful, so I'd like to give a bit more structure to it. 
Let's start with the introduction: You are a Senior Principal Engineer. We know it's high level because of all the adjectives that get put in there, and none of those adjectives are 'associate' or 'beginner' or 'junior,' or all the other diminutives that companies like to play games with to justify paying people less. And you're at Equinix, which is a company that is a bit unlike most of the, shall we say, traditional cloud providers. What do you do over there, both as a company and as a person?

Amy: So, as a company Equinix, what most people know about is that we have a whole bunch of data centers all over the world. I think we have the most of any company. And what we do is we lease out space in that data center, and then we have a number of other products that people don't know as well, one of which is Equinix Metal, which is what I specifically work on, where we rent you bare-metal servers. None of that fancy stuff that you get on any other clouds on top of it; there's things you can get that are… partner things that you can add on, like, you know, storage and other things like that, but we just deliver you bare-metal servers with really great networking. So, what I work on is the reliability of that whole system. All of the things that go into provisioning the servers, making them come up, making sure that they get delivered to the server, making sure the API works right, all of that stuff.

Corey: So, you're on the Equinix cloud side of the world more so than you are on the building data centers by the sweat of your brow, as they say?

Amy: Correct. Yeah, yeah. Software side.

Corey: Excellent. I spent some time in data centers in the early part of my career before cloud ate that. That was sort of contemporaneous with the discovery that I'm the hardware destruction bunny, and I should go to great pains to keep my aura from anything expensive and important, like, you know, the SAN.
So—

Amy: Right, yeah.

Corey: Companies moving out of data centers, and me getting out was a great thing.

Amy: But the thing about SANs though, is, like, it might not be you. They're just kind of cursed from the start, right? They just always were kind of fussy and easy to break.

Corey: Oh, yeah. I used to think—and I kid you not—that I had a limited upside to my career in tech because I sometimes got sloppy and I was fairly slow at crimping ethernet cables.

Amy: [laugh].

Corey: That is very similar to growing up in third grade when it became apparent that I was going to have problems in my career because my handwriting was sloppy. Yeah, it turns out the future doesn't look like we predicted it would.

Amy: Oh, gosh. Are we going to talk about, like, neurological development now or… [laugh] okay, that's a thing I struggle with, too right, is I started typing as soon as they would let—in fact, before they would let me. I remember in high school, I had teachers who would grade me down for typing a paper out. They want me to handwrite it and I would go, “Cool. Go ahead and take a grade off because if I handwrite it, you're going to take two grades off my handwriting, so I'm cool with this deal.”

Corey: Yeah, it was pretty easy early on. I don't know when the actual shift was, but it became more and more apparent that more and more things are moving towards a world where you could type. And I was almost five when I started working on that stuff, and that really wound up changing a lot of aspects of how I started seeing things. One thing I think you're probably fairly well known for is incidents. I want to be clear when I say that you are not the root cause as—“So, why are things broken?” “It's Amy again. What's she gotten into this time?” Great.

Amy: [laugh].
But it does happen, but not all the time.

Corey: Exa—it's a learning experience.

Amy: Right.

Corey: You've also been deeply involved with SREcon and a number of—a lot of aspects of what I will term—and please don't yell at me for this—SRE culture—

Amy: Yeah.

Corey: Which is sometimes a challenging thing to wind up describing or putting a definition around. The one that I've always been somewhat partial to is, “SRE is DevOps, except you worked at Google for a while.” I don't know how necessarily accurate that is, but it does rile people up.

Amy: Yeah, it does. Dave Stanke actually did a really great talk at SREcon San Francisco just a couple weeks ago, about the DORA report. And the new DORA report, they split SRE out into its own function and kind of is pushing against that old model, which actually comes from Liz Fong-Jones—I think it's from her, or older—about, like, class SRE implements DevOps, which is kind of this idea that, like, SREs make DevOps happen. Things have evolved, right, since then. Things have evolved since Google released those books, and we've all just figured out what works and what doesn't a little bit.

And so, it's not that we're implementing DevOps so much. In fact, it's that ops stuff that kind of holds us back from the really high impact work that SREs, I think, should be doing, that aren't just, like, fixing the problems, the symptoms down at the bottom layer, right? Like what we did as sysadmins 20 years ago. You know, we'd go and a lot of people are SREs that came out of the sysadmin world and still think in that mode, where it's like, “Well, I set up the systems, and when things break, I go and I fix them.” And, “Why did the developers keep writing crappy code? Why do I always have to get up in the middle of the night because this thing crashed?”

And it turns out that the work we need to do to make things more reliable, there's a ceiling to how far the platform can take us, right?
Like, we can have the best platform in the world with redundancy, and, you know, nine-way replicated data storage and all this crazy stuff, and still if we put crappy software on top, it's going to be unreliable. So, how do we make less crappy software? And for most of my career, people would be, like, “Well, you should test it.” And so, we started doing that, and we still have crappy software, so what's going on here? We still have incidents.

So, we write more tests, and we still have incidents. We had a QA group, we still have incidents. We send the developers to training, and we still have incidents. So like, what is the thing we need to do to make things more reliable? And it turns out, most of it is culture work.

Corey: My perspective on this stems from being a grumpy old sysadmin. And at some point, I started calling myself a systems engineer or DevOps or production engineer, or SRE. It was all, from my point of view, the same job, but you know, if you call yourself a sysadmin, you're just asking for a 40% pay cut off the top.

Amy: [laugh].

Corey: But I still tended to view the world through that lens. I tended to be very good at Linux systems internals, for example, understanding system calls and the rest, but increasingly, as the DevOps wave or SRE wave, or Google-isation of the internet wound up being more and more of a thing, I found myself increasingly in job interviews, where, “Great, now, can you go wind up implementing a sorting algorithm on the whiteboard?” “What on earth? No.” Like, my lingua franca is shitty Bash, and no one tends to write that without a bunch of tab completions and quick checking with manpages—die.net or whatnot—on the fly as you go down that path.

And it was awful, and I felt… like my skill set was increasingly eroding. And it wasn't honestly until I started this place where I really got into writing a fair bit of code to do different things because it felt like an orthogonal skill set, but in the fullness of time, it seems like it's not.
And it's a reskilling. And it made me wonder, does this mean that the areas of technology that I focused on early in my career, was that all a waste? And the answer is not really. Sometimes, sure, in that I don't spend nearly as much time worrying about inodes—for example—as I once did. But every once in a while, I'll run into something and I look like a wizard from the future, but instead, I'm a wizard from the past.

Amy: Yeah, I find that a lot in my work, now. Sometimes things I did 20 years ago come back, and it's like, oh, yeah, I remember I did all that threading work in 2002 in Perl, and I learned everything the very, very, very hard way. And then, you know, this January, did some threading work to fix some stability issues, and all of it came flooding back, right? Just that the experiences really, more than the code or the learning or the text and stuff; more just the, like, “this feels like thread [BLEEP]-ery” is a diagnostic thing that sometimes we have to say.

And then people are like, “Can you prove it?” And I'm like, “Not really,” because it's literally thread [BLEEP]-ery. Like, the definition of it is that there's weird stuff happening that we can't figure out why it's happening. There's something acting in the system that isn't synchronized, that isn't connected to other things, that's happening out of order from what we expect, and if we had a clear signal, we would just fix it, but we don't. We just have, like, weird stuff happening over here and then over there and over there and over there.

And, like, that tells me there's just something happening at that layer and then I have to go and dig into that, right, and like, just basically charge through. My colleagues are like, “Well, maybe you should look at this, and go look at the database,” the things that they're used to looking at and that their experiences inform, whereas then I bring that ancient toiling through the threading mines experiences back and go, “Oh, yeah.
So, let's go find where this is happening, where people are doing dangerous things with threads, and see if we can spot something.” But that came from that experience.

Corey: And there's so much that just repeats itself. And history rhymes. The challenge is that, do you have 20 years of experience, or do you have one year of experience repeated 20 times? And as the tide rises, doing the same task by hand, it really is just a matter of time before your full-time job winds up being something a piece of software does. An easy example is, “Oh, what's your job?” “I manually place containers onto specific hosts.” “Well, I've got news for you, and you're not going to like it at all.”

Amy: Yeah, yeah. I think that we share a little bit. I'm allergic to repeated work. I don't know if allergic is the right word, but you know, if I sit and I do something once, fine. Like, I'll just crank it out, you know, it's this form, or it's a datafile I got to write and I'll—fine I'll type it in and do the manual labor.

The second time, the difficulty goes up by ten, right? Like, just mentally, just to do it, be like, I've already done this once. Doing it again is anathema to everything that I am. And then sometimes I'll get through it, but after that, like, writing a program is so much easier because it's like exponential, almost, growth in difficulty. You know, the third time I have to do the same thing that's like just typing the same stuff—like, look over here, read this thing and type it over here—I'm out; I can't do it. You know, I got to find a way to automate. And I don't know, maybe normal people aren't driven to live this way, but it's kept me from getting stuck in those spots, too.

Corey: It was weird because I spent a lot of time as a consultant going from place to place and it led to some weird changes. For example, “Oh, thank God, I don't have to think about that whole messaging queue thing.” Sure enough, next engagement, it's message queue time. Fantastic.
I found that repeating myself drove me nuts, but you also have to be very sensitive not to wind up, you know, stealing IP from the people that you're working with.

Amy: Right.

Corey: But what I loved about the sysadmin side of the world is that the vast majority of stuff that I've taken with me lives in my shell config. And what I mean by that is I'm not—there's nothing in there that is proprietary, but when you have a weird problem with trying to figure out the best way to figure out which Ruby process is stealing all the CPU, great, turns out that you can chain seven or eight different shell commands together through a bunch of pipes. I don't want to remember that forever. So, that's the sort of thing I would wind up committing as I learned it. I don't remember what company I picked that up at, but it was one of those things that was super helpful.

I have a sarcastic—it's a one-liner, except no sane editor setting is going to show it in any less than three—of a whole bunch of Perl, piped into du, piped into the rest, that tells you one of the largest consumers of files in a given part of the system. And it rates them with stars and it winds up doing some neat stuff. I would never sit down and reinvent something like that today, but the fact that it's there means that I can do all kinds of neat tricks when I need to. It's making sure that as you move through your career, on some level, you're picking up skills that are repeatable and applicable beyond one company.

Amy: Skills and tooling—

Corey: Yeah.

Amy: —right? Like, you just described the tool. Another SREcon talk was John Allspaw and Dr. Richard Cook talking about above the line; below the line. And they started with these metaphors about tools, right, showing all the different kinds of hammers.

And if you're a blacksmith, a lot of times you craft specialized hammers for very specific jobs.
And that's one of the properties of a tool that they were trying to get people to think about, right, is that tools get crafted to the job. And what you just described is a bespoke tool that you had created on the fly, that kind of floated under the radar of intellectual property. [laugh].

So, let's not tell the security or IP people right? Like, because there's probably billions and billions of dollars of technically, like, made-up IP value—I'm doing air quotes with my fingers—you know, that's just basically people's shell profiles. And my God, the Emacs automation that people have done. If you've ever really seen somebody who's amazing at Emacs and has 10, 20, 30, maybe 40 years of experience encoded in their Emacs settings, it's a wonder to behold. Like, I look at it and I go, “Man, I wish I could do that.”

It's like listening to a really great guitar player and be like, “Wow, I wish I could play like them.” You see them just flying through stuff. But all that IP in there is both that person's collection of wisdom and experience and working with that code, but also encodes that stuff like you described, right? It's just all these little systems tricks and little fiddly commands and things we don't want to remember and so we encode them into our toolset.

Corey: Oh, yeah. Anything I wound up taking, I always would share it with people internally, too. I'd mention, “Yeah, I'm keeping this in my shell files.” Because I disclosed it, which solves a lot of the problem. And also, none of it was even close to proprietary or anything like that. I'm sorry, but the way that you wind up figuring out how much of a disk is being eaten up and where in a more pleasing way, is not a competitive advantage. It just isn't.

Amy: It isn't to you or me, but, you know, back in the beginning of our careers, people thought it was worth money and should be proprietary.
You know, like, oh, that disk-checking script is a competitive advantage for our company because there are only a few of us doing this work. Like, it was actually being able to, like, manage your—[laugh] actually manage your servers was a competitive advantage. Now, it's kind of commodity.

Corey: Let's also be clear that the world has moved on. I wound up buying a DaisyDisk a while back for Mac, which I love. It is a fantastic, pretty effective, “Where's all the stuff on your disk going?” And it does a scan and you can drag and collect things and delete them when trying to clean things out. I was using it the other day, so it's top of mind at the moment.

But it's way more polished than that crappy Perl three-liner. And I see both sides, truly I do. The trick also, for those wondering [unintelligible 00:15:45], like, “Where is the line?” It's super easy. Disclose what you're doing. In those scenarios, in the event someone says no because they believe that finding the right man page section for something is somehow proprietary, great: when you go home that evening, in a completely separate environment, build it yourself from scratch to solve the problem, reimplement it, and save that. And you're done. There are lots of ways to do this. Don't steal from your employer, but your employer employs you; they don't own you and the way that you think about these problems.

Every person I've met who has had a career that's longer than 20 minutes has a giant doc somewhere on some system of all of the scripts that they wound up putting together, all of the one-liners, the notes on, “Next time you see this, this is the thing to check.”

Amy: Yeah, the cheat sheet or the notebook with all the little commands, or again the Emacs config, sometimes for some people, or shell profiles.
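
The disk-usage helper described above (size up everything under a directory, rank the biggest consumers, rate them with stars) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Perl script from the conversation is never shown, so the names `dir_size` and `largest_consumers`, the `top` and `max_stars` parameters, and the star-scaling rule are all made up for this example.

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files below path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                pass
    return total


def largest_consumers(base, top=5, max_stars=10):
    """Return (name, bytes, stars) for the largest entries directly under base.

    Stars are scaled relative to the single largest entry (an assumed
    rating rule, chosen only for this sketch).
    """
    sizes = []
    for entry in os.scandir(base):
        if entry.is_symlink():
            continue
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
        else:
            size = entry.stat(follow_symlinks=False).st_size
        sizes.append((entry.name, size))

    # Largest first; ties broken by name so the output is stable.
    sizes.sort(key=lambda item: (-item[1], item[0]))
    biggest = sizes[0][1] if sizes and sizes[0][1] > 0 else 1
    return [
        (name, size, "*" * round(max_stars * size / biggest))
        for name, size in sizes[:top]
    ]
```

Calling `largest_consumers("/var/log")` and printing each tuple gives a ranked list with a rough visual weight per entry, which is the kind of small, non-proprietary tool that tends to end up in a personal cheat sheet.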
Yeah.

Corey: Here's the awk one-liner that I put that automatically spits out from an Apache log file what—the httpd log file that just tells me what are the most frequent talkers, and what are the—

Amy: You should probably let go of that one. You know, like, I think that one's lifetime is kind of past, Corey. Maybe you—

Corey: I just have to get it working with Nginx, and we're good to go.

Amy: Oh, yeah, there you go. [laugh].

Corey: Or S3 access logs. Perish the thought. But yeah, like, what are the five most high-volume talkers, and what are those relative to each other? Huh, that one thing seems super crappy and it's coming from Russia. But that's—hmm, one starts to wonder; maybe it's time to dig back in.

So, one of the things that I have found is that a lot of the people talking about SRE seem to have descended from an ivory tower somewhere. And they're talking about how some of the best-in-class companies out there, renowned for their technical cultures—at least externally—are doing these things. But there's a lot more folks who are not there. And honestly, I consider myself one of those people who is not there. I was a competent engineer, but never a terrific one.

And looking at the way this was described, I often came away thinking, “Okay, was the purpose of this conference talk just to reinforce how smart people are, and how I'm not,” and/or, “There are the 18 cultural changes you need to make to your company, and then you can do something kind of like we were just talking about on stage.” It feels like there's a combination of problems here. One is making this stuff more accessible to folks who are not themselves in those environments, and two, how to drive cultural change as an individual contributor if that's even possible. And I'm going to go out on a limb and guess you have thoughts on both aspects of that, and probably some more. Hit me, please.

Amy: So, the ivory tower, right. Let's just be straight up, like, the ivory tower is Google.
I mean, that's where it started. And we get it from the other large companies that, you know, want to do conference talks about what this stuff means and what it does. What I've kind of come around to in the last couple of years is that those talks don't really reach the vast majority of engineers, they don't really apply to a large swath of the enterprise especially, which is, like, where a lot of the—the bulk of our industry sits, right? We spend a lot of time talking about the darlings out here on the West Coast in high tech culture and startups and so on.

But, like, we were talking about before we started the show, right, like, the interior of even just America, is filled with all these, like, insurance and banks and all of these companies that are cranking out tons of code and servers and stuff, and they're trying to figure out the same problems. But they're structured in companies where their tech arm is still, in most cases, considered a cost center, often is bundled under finance, for—that's a whole show in itself about that historical blunder. And so, the tech cultures tend to be very, very different from what we experience in—what do we call it anymore? Like, I don't even want to say West Coast anymore because we've gone remote, but, like, high tech culture we'll say. And so, like, thinking about how to make SRE and all this stuff more accessible comes down to, like, thinking about who those engineers are that are sitting at the computers, writing all the code that runs our banks, all the code that makes sure that—I'm trying to think of examples that are more enterprise-y, right?

Or, shoot, buying clothes online. You go to Macy's for example. They have a whole bunch of servers that run their online store and stuff. They have internal IT-ish people who keep all this stuff running and write that code and probably integrating open-source stuff much like we all do.
But when you go to try to put in a reliability program that's based on the current SRE models, like SLOs; you put in SLOs and you start doing, like, this incident management program that's, like, you know, you have a form you fill out after every incident, and then you [unintelligible 00:20:25] retros.

And it turns out that those things are very high-level skills, skills and capabilities in an organization. And so, when you have this kind of IT mindset or the enterprise mindset, bringing the culture together to make those things work often doesn't happen. Because, you know, they'll go with the prescriptive model and say, like, okay, we're going to implement SLOs, we're going to start measuring SLIs on all of the services, and we're going to hold you accountable for meeting those targets. If you just do that, right, you're just doing more gatekeeping and policing of your tech environment. My bet is, reliability almost never improves in those cases.

And that's been my experience, too, and why I get charged up about this is, if you just go slam in these practices, people end up miserable, the practices then become tarnished because people experienced the worst version of them. And then—

Corey: And with the remote explosion as well, it turns out that changing jobs basically means their company sends you a different Mac, and the next Monday, you wind up signing into a different Slack team.

Amy: Yeah, so the culture really matters, right? You can't cover it over with foosball tables and great lunch. You actually have to deliver tools that developers want to use and you have to deliver a software engineering culture that brings out the best in developers instead of demanding the best from developers. I think that's a fundamental business shift that's kind of happening.
If I'm putting on my wizard hat and looking into the future and dreaming about what might change in the world, right, is that there's kind of a change in how we do leadership and how we do business that's shifting more towards that model where we look at what people are capable of and we trust in our people, and we get more out of them, the knowledge work model.

If we want more knowledge work, we need people to be happy and to feel engaged in their community. And suddenly we start to see these kind of generational, bigger-pie kind of things start to happen. But how do we get there? It's not SLOs. Maybe it's a little bit starting with incidents. That's where I've had the most success, and you asked me about that. So, getting practical, incident management is probably—

Corey: Right. Well, as I see it, the problem with SLOs across the board is it feels like it's a very insular community so far, and communicating it to engineers seems to be the focus of where the community has been, but from my understanding of it, you absolutely need buy-in at significantly high executive levels, to at the very least buy you air cover while you're doing these things and making these changes, but also to help drive that cultural shift. None of this is something I have the slightest clue how to do, let's be very clear. If I knew how to change a company's culture, I'd have a different job.

Amy: Yeah. [laugh]. The biggest omission in the Google SRE books was [Ers 00:22:58]. There was a guy at Google named Ers who owns availability for Google, and when anything is, like, in dispute and bubbles up the management team, it goes to Ers, and he says, “Thou shalt…” right? Makes the call.
And that's why it works, right?

Like, it's not just that one person, but that system of management where the whole leadership team—there's a large, very well-funded team with a lot of power in the organization that can drive availability, and they can say, this is how you're going to do metrics for your service, and this is the system that you're in. And it's kind of, yeah, sure it works for them because they have all the organizational support in place. What I was saying to my team just the other day—because we're in the middle of our SLO rollout—is that really, I think an SLO program isn't [clear throat] about the engineers at all until late in the game. At the beginning of the game, it's really about getting the leadership team on board to say, “Hey, we want to put in SLIs and SLOs to start to understand the functioning of our software system.” But if they don't have that curiosity in the first place, that desire to understand how well their teams are doing, how healthy their teams are, don't do it. It's not going to work. It's just going to make everyone miserable.

Corey: It feels like it's one of those difficult to sell problems as well, in that it requires some tooling changes, absolutely. It requires cultural change and buy-in and whatnot, but in order for that to happen, there has to be a painful problem that a company recognizes and is willing to pay to make go away. The problem with stuff like this is that once you pay, there's a lot of extra work that goes on top of it as well, that does not have a perception—rightly or wrongly—of contributing to feature velocity, of hitting the next milestone. It's, “Really? So, we're going to be spending how much money to make engineers happier? They should get paid an awful lot and they're still complaining and never seem happy. Why do I care if they're happy other than the pure mercenary perspective of otherwise they'll quit?” I'm not saying that it's not worth pursuing, or that it's not a worthy goal.
I am saying that it becomes a very difficult thing to wind up selling as a product.

Amy: Well, as a product for sure, right? Because—[sigh] gosh, I have friends in the space who work on these tools. And I want to be careful.

Corey: Of course. Nothing but love for all of those people, let's be very clear.

Amy: But a lot of them, you know, they're pulling metrics from existing monitoring systems, they are doing some interesting math on them, but what you get at the end is a nice service catalog and dashboard, which are things we've been trying to land as products in this industry for as long as I can remember, and—

Corey: “We've got it this time, though. This time we'll crack the nut.” Yeah. Get off the island, Gilligan.

Amy: And then the other, like, risky thing, right, is the other part that makes me uncomfortable about SLOs, and why I will often tell folks that I talk to out in the industry that are asking me about this, like, one-on-one, “Should I do it here?” And it's like, you can bring the tool in, and if you have a management team that's just looking to have metrics to drive productivity, instead of you know, trying to drive better knowledge work, what you get is just a fancier version of more Taylorism, right, which is basically scientific management, this idea that we can, like, drive workers to maximum efficiency by measuring random things about them and driving those numbers. It turns out, that doesn't really work very well, even at industrial scale, it just happened to work because, you know, we have a bloody enough society that we pushed people into it. But the reality is, if you implement SLOs badly, you get more really bad Taylorism that's bad for your developers. And my suspicion is that you will get worse availability out of it than you would if you just didn't do it at all.

Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O.
It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I've been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn't limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up screening all of their talent on English ability, as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday with your team. It's an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That's R-E-V-E-L-O dot I-O slash screaming.

Corey: That is part of the problem is, in some cases, to drive some of these improvements, you have to go backwards to move forwards. And it's one of those, “Great, so we spent all this effort and money and the rest, and now things are worse?” No, not necessarily, but suddenly you're aware of things that were slipping through the cracks previously.

Amy: Yeah.
Yeah.

Corey: Like, the most realistic thing about first The Phoenix Project and then The Unicorn Project, both by Gene Kim, has been the fact that companies have these problems and actively cared enough to change it. In my experience, that feels a little on the rare side.

Amy: Yeah, and I think that's actually the key, right? It's for the culture change, and for, like, if you're really looking to be, like, do I want to work at this company? Am I investing myself in here? It's: look at the leadership team and be, like, do these people actually give a crap? Are they looking just to punt another number down the road?

That's the real question, right? Like, the technology and stuff, at the point where I'm at in my career, I just don't care that much anymore. [laugh]. Just… fine, use Kubernetes, use Postgres, [unintelligible 00:27:30], I don't care. I just don't. Like, Oracle, I might have to ask, you know, go to finance and be like, “Hey, can we spend 20 million for a database?” But like, nobody really asks for that anymore, so. [laugh].

Corey: As one does. I will say that I mostly agree with you, but a technology that I found myself getting excited about, given the time of the recording on this is… fun, I spent a bit of time yesterday—from when we're recording this—teaching myself just enough Go to wind up putting together a binary that I needed to do something actively ridiculous for my camera here. And I found myself coming away deeply impressed by a lot of things about it, how prescriptive it was for one, how self-contained for another. And after spending far too many years of my life writing shitty Perl, and shitty Bash, and worse Python, et cetera, et cetera, the prescriptiveness was great. The fact that it wound up giving me something I could just run, I could cross-compile for anything I need to run it on, and it just worked. It's been a while since I found a technology that got me this interested in exploring further.

Amy: Go is great for that.
You mentioned one of my two favorite features of Go. One is usually when a program compiles—at least the way I code in Go—it usually works. I've been working with Go since about 0.9, like, just a little bit before it was released as 1.0, and that's what I've noticed over the years of working with it is that most of the time, if you have a pretty good data structure design and you get the code to compile, usually it's going to work, unless you're doing weird stuff.
The other thing I really love about Go and that maybe you'll discover over time is the malleability of it. And the reason why I think about that more than probably most folks is that I work on other people's code most of the time. And maybe this is something that you probably run into with your business, too, right, where you're working on other people's infrastructure. And the way that we encode business rules and things in the languages, in our programming language or our config syntax and stuff has a huge impact on folks like us and how quickly we can come into a situation, assess, figure out what's going on, figure out where things are laid out, and start making changes with confidence.
Corey: Forget other people for a minute they're looking at what I built out three or four years ago here, myself, like, I look at past me, it's like, “What was that rat bastard thinking? This is awful.” And it's—forget other people's code; hell is your own code, on some level, too, once it's slipped out of the mental stack and you have to re-explore it and, “Oh, well thank God I defensively wound up not including any comments whatsoever explaining what the living hell this thing was.” It's terrible. But you're right, the other people's shell scripts are finicky and odd.
I started poking around for help when I got stuck on something, by looking at GitHub, and a few bits of searching here and there. Even these large, complex, well-used projects started making sense to me in a way that I very rarely find.
It's, “What the hell is that thing?” is my most common refrain when I'm looking at other people's code, and Go for whatever reason avoids that, I think because it is so prescriptive about formatting, about how things should be done, about the vision that it has. Maybe I'm romanticizing it and I'll hate it a week from now, and I want to go back and remove this recording, but.
Amy: The size of the language helps a lot.
Corey: Yeah.
Amy: But probably my favorite. It's more of a convention, which is actually funny the way I'm going to talk about this because the two languages I work on the most right now are Ruby and Go. And I don't feel like two languages could really be more different.
Syntax-wise, they share some things, but really, like, the mental models are so very, very different. Ruby is all the way in on object-oriented programming, and, like, the actual real kind of object-oriented with messaging and stuff, and, like, the whole language kind of springs from that. And it kind of requires you to understand all of these concepts very deeply to be effective in large programs. So, what I find is, when I approach a Ruby codebase, I have to load all this crap into my head and remember, “Okay, so yeah, there's this convention, when you do this kind of thing in Ruby”—or especially Ruby on Rails is even worse because they go deep into convention over configuration. But what that's code for is, this code is accessible to people who have a lot of free cognitive capacity to load all this convention into their heads and keep it in their heads so that the code looks pretty, right?
And so, that's the trade-off as you said, okay, my developers have to be these people with all these spare brain cycles to understand, like, why I would put the code here in this place versus this place? And all these, like, things that are in the code, like, very compact, dense concepts. And then you go to something like Go, which is, like, “Nah, we're not going to do Lambdas.
Nah”—[laugh]—“We're not doing all this fancy stuff.” So, everything is there on the page.
This drives some people crazy, right, is that there's all this boilerplate, boilerplate, boilerplate. But the reality is, I can read most Go files from top to the bottom and understand what the hell it's doing, whereas I can go sometimes look at, like, a Ruby thing, or sometimes Python and e—Perl is just [unintelligible 00:32:19] all the time, right, it's there's so much indirection. And it just be, like, “What the [BLEEP] is going on? This is so dense. I'm going to have to sit down and write it out in longhand so I can understand what the developer was even doing here.” And—
Corey: Well, that's why I got the Mac Studio; for when I'm not doing A/V stuff with it, that means that I'll have one core that I can use for, you know, front-end processing and the rest, and the other 19 cores can be put to work failing to build Nokogiri in Ruby yet again.
Amy: [laugh].
Corey: I remember the travails of working with Ruby, and the problem—I have similar problems with Python, specifically in that—I don't know if I'm special like this—it feels like it's a SRE DevOps style of working, but I am grabbing random crap off of GitHub constantly and running it, like, small scripts other people have built. And let's be clear, I run them on my test AWS account that has nothing important because I'm not a fool that I read most of it before I run it, but I also—it wants a different version of Python every single time. It wants a whole bunch of other things, too. And okay, so I use ASDF as my version manager for these things, which for whatever reason, does not work for the way that I think about this ergonomically. Okay, great.
And I wind up with detritus scattered throughout my system.
It's, “Hey, can you make this reproducible on my machine?” “Almost certainly not, but thank you for asking.” It's like ‘Step 17: Master the Wolf' level of instructions.
Amy: And I think Docker generally… papers over the worst of it, right, is when we built all this stuff in the aughts, you know, [CPAN 00:33:45]—
Corey: Dev containers and VS Code are very nice.
Amy: Yeah, yeah. You know, like, we had CPAN back in the day, I was doing chroots, I think in, like, '04 or '05, you know, to solve this problem, right, which is basically I just—screw it; I will compile an entire distro into a directory with a Perl and all of its dependencies so that I can isolate it from the other things I want to run on this machine and not screw up and not have these interactions. And I think that's kind of what you're talking about is, like, the old model, when we deployed servers, there was one of us sitting there and then we'd log into the server and be like, I'm going to install the Perl. You know, I'll compile it into, like, [/app/perl 558 00:34:21] whatever, and then I'll CPAN all this stuff in, and I'll give it over to the developer, tell them to set their shebang to that and everything just works. And now we're in a mode where it's like, okay, you got to set up a thousand of those. “Okay, well, I'll make a tarball.” [laugh]. But it's still like we had to just—
Corey: DevOps, but [unintelligible 00:34:37] dev closer to ops. You're interrelating all the time. Yeah, then Docker comes along, and dev is, like, “Well, here's the container. Good luck, asshole.” And it feels like it's been cast into your yard to worry about.
Amy: Yeah, well, I mean, that's just kind of business, or just—
Corey: Yeah. Yeah.
Amy: I'm not sure if it's business or capitalism or something like that, but just the idea that, you know, if I can hand off the shitty work to some other poor schlub, why wouldn't I? I mean, that's most folks, right? Like, just be like, “Well”—
Corey: Which is fair.
Amy: —“I got it working.
Like, my part is done, I did what I was supposed to do.” And now there's a lot of folks out there, that's how they work, right? “I hit done. I'm done. I shipped it. Sure. It's an old [unintelligible 00:35:16] Ubuntu. Sure, there's a bunch of shell scripts that rip through things. Sure”—you know, like, I've worked on repos where there's hundreds of things that need to be addressed.
Corey: And passing to someone else is fine. I'm thrilled to do it. Where I run into problems with it is where people assume that well, my part was the hard part and anything you schlubs do is easy. I don't—
Amy: Well, that's the underclass. Yeah. That's—
Corey: Forget engineering for a second; I throw things to the people over in the finance group here at The Duckbill Group because those people are wizards at solving for this thing. And it's—
Amy: Well, that's how we want to do things.
Corey: Yeah, specialization works.
Amy: But we have this—it's probably more cultural. I don't want to pick, like, capitalism to beat on because this is really, like, a human cultural thing, and it's not even really particularly Western. Is the idea that, like, “If I have an underclass, why would I give a shit what their experience is?” And this is why I say, like, ops teams, like, get out of here because most ops teams, the extant ops teams are still called ops, and a lot of them have been renamed SRE—but they still do the same job—are an underclass. And I don't mean that those people are below us. People are treated as an underclass, and they shouldn't be. Absolutely not.
Corey: Yes.
Amy: Because the idea is that, like, well, I'm a fancy person who writes code at my ivory tower, and then it all flows down, and those people, just faceless people, do the deployment stuff that's beneath me. That attitude is the most toxic thing, I think, in tech orgs to address.
Like, if you're trying to be like, “Well, our reliability is bad, we have security problems, people won't fix their code.” And go look around and you will find people that are treated as an underclass that are given code thrown over the wall at them and then they just have to toil through and make it work. I've worked on that a number of times in my career.
And I think just like saying, underclass, right, or caste system, is what I found is the most effective way to get people actually thinking about what the hell is going on here. Because most people are just, like, “Well, that's just the way things are. It's just how we've always done it. The developers write the code, then give it to the sysadmins. The sysadmins deploy the code. Isn't that how it always works?”
Corey: You'd really like to hope, wouldn't you?
Amy: [laugh]. Not me. [laugh].
Corey: Again, the way I see it is, in theory—in theory—sysadmins, ops, or that should not exist. People should theoretically be able to write code as developers that just works, the end. And write it correct the first time and never have to change it again. Yeah. There's a reason that I always like to call staging environments in places I work ‘theory' because it works in theory, but not in production, and that is fundamentally the—like, that entire job role is the difference between theory and practice.
Amy: Yeah, yeah. Well, I think that's the problem with it. We're already so disconnected from the physical world, right? Like, you and I right now are talking over multiple strands of glass and digital transcodings and things right now, right? Like, we are detached from the physical reality.
You mentioned earlier working in data centers, right? The thing I miss about it is, like, the physicality of it. Like, actually, like, I held a server in my arms and put it in the rack and slid it into the rails. I plugged into power myself; I pushed the power button myself. There's a server there.
I physically touched it.
Developers who don't work in production, we talked about empathy and stuff, but really, I think the big problem is when they work out in their idea space and just writing code, they write the unit tests, if we're very lucky, they'll write a functional test, and then they hand that wad off to some poor ops group. They're detached from the reality of operations. It's not even about accountability; it's about experience. The ability to see all of the weird crap we deal with, right? You know, like, “Well, we pushed the code to that server, but there were three bit flips, so we had to do it again. And then the other server, the disk failed. And on the other server…” You know? [laugh].
It's just, there's all this weird crap that happens, these systems are so complex that they're always doing something weird. And if you're a developer that just spends all day in your IDE, you don't get to see that. And I can't really be mad at those folks, as individuals, for not understanding our world. I figure out how to help them, and the best thing we've come up with so far is, like, well, we start giving this—some responsibility in a production environment so that they can learn that. People do that, again, is another one that can be done wrong, where it turns into kind of a forced empathy.
I actually really hate that mode, where it's like, “We're forcing all the developers online whether they like it or not. On-call whether they like it or not because they have to learn this.” And it's like, you know, maybe slow your roll a little buddy because the stuff is actually hard to learn. Again, minimizing how hard ops work is. “Oh, we'll just put the developers on it. They'll figure it out, right? They're software engineers. They're probably smarter than you sysadmins.” Is the unstated thing when we do that, right? When we throw them in the pit and be like, “Yeah, they'll get it.” [laugh].
Corey: And that was my problem [unintelligible 00:39:49] the interview stuff.
It was in the write code on a whiteboard. It's, “Look, I understood how the system fundamentally worked under the hood.” Being able to power my way through to get to an outcome even in a language I don't know, was sort of part and parcel of the job. But this idea of doing it in an artificially constrained environment, in a language I'm not super familiar with, off the top of my head, it took me years to get to a point of being able to do it with a Bash script because who ever starts with an empty editor and starts getting to work in a lot of these scenarios? Especially in an ops role where we're not building something from scratch.
Amy: That's the interesting thing, right? In the majority of tech work today—maybe 20 years ago, we did it more because we were literally building the internet we have today. But today, most of the engineers out there working—most of us working stiffs—are working on stuff that already exists. We're making small incremental changes, which is great; that's what we're doing. And we're dealing with old code.
Corey: We're gluing APIs together, and that's fine. Ugh. I really want to thank you for taking so much time to talk to me about how you see all these things. If people want to learn more about what you're up to, where's the best place to find you?
Amy: I'm on Twitter every once in a while as @MissAmyTobey, M-I-S-S-A-M-Y-T-O-B-E-Y. I have a blog I don't write on enough. And there's a couple things on the Equinix Metal blog that I've written, so if you're looking for that. Otherwise, mainly Twitter.
Corey: And those links will of course be in the [show notes 00:41:08]. Thank you so much for your time. I appreciate it.
Amy: I had fun. Thank you.
Corey: As did I. Amy Tobey, Senior Principal Engineer at Equinix. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or on the YouTubes, smash the like and subscribe buttons, as the kids say.
Whereas if you've hated this episode, same thing, five-star review all the platforms, smash the buttons, but also include an angry comment telling me that you're about to wind up subpoenaing a copy of my shell script because you're convinced that your intellectual property and secrets are buried within.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.

united states america god amazon english google starting master man mexico service san francisco russia spanish western brazil argentina launch starts wolf software cloud mac west coast south america costa rica latin america ip oracle fantastic developers slack antarctica nah api io screaming makes python aws bash linux github apis qa reliability devops apache ubuntu cpu perish ide docker kubernetes bleep optimized sre disclose ruby on rails devsecops syntax slo ers cultural change mac studio vs code postgres equinix nginx phoenix project gene kim emacs sres site reliability engineering revelo corey quinn slos taylorism lambdas senior principal engineer richard cook slis amd epyc cpan amy it unicorn project amy you duckbill group asdf john allspaw amy well daisydisk amy so it's amy amy oh nokogiri chief cloud economist equinix metal last week in aws humblepod

Agile Software Development Method (noun) [Word Notes]

Play Episode Listen Later May 3, 2022 7:15

A software development philosophy that emphasizes incremental delivery, team collaboration, continual planning, and continual learning. CyberWire Glossary link: https://thecyberwire.com/glossary/agile-software-development Audio reference link: "Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe," John Allspaw and Paul Hammond, 2009 Velocity Conference, YouTube, 25 June 2009.

method velocity noun agile software development paul hammond john allspaw velocity conference

DevOps (noun)

Play Episode Listen Later May 3, 2022 7:00

The set of people, process, technology, and cultural norms that integrates software development and IT operations into a system-of-systems. CyberWire Glossary link: https://thecyberwire.com/glossary/devops Audio reference link: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr," by John Allspaw and Paul Hammond, Velocity 09, 25 July 2009.

devops velocity flickr noun paul hammond john allspaw

Agile Software Development Method (noun)

The Idealcast with Gene Kim by IT Revolution

Play Episode Listen Later Apr 19, 2022 7:15

method velocity noun agile software development paul hammond john allspaw velocity conference

#43 Applying Resilience Engineering Practices to Scale Data Sharing - Interview w/ Tim Tischler

Data Mesh Radio

Play Episode Listen Later Mar 18, 2022 75:08

Provided as a free resource by DataStax https://www.datastax.com/products/datastax-astra?utm_source=DataMeshRadio (AstraDB) https://www.patreon.com/datameshradio (Patreon) In this episode, Scott interviewed Tim Tischler, Principal Engineer at Wayfair. Prior to Wayfair, Tim worked as a Site Reliability Champion at New Relic and is well known in the "human factors" and resilience engineering space. Per Tim, our current work culture is overly action-item driven - every meeting must have a set of agenda items generated from it. This prevents people from having learning-focused meetings exclusively designed for context sharing. Humans' brains work differently between learning and fixing mode and we ask totally different questions. To be able to scale our knowledge sharing, we need to have the space to have learning-focused meetings. A good way to center learning-focused meetings, be they "show and tell" or event storming sessions, is via sharing stories - human communication is founded on story sharing through the millennia. Tim's "show and tell" and event storming sessions at Wayfair have had extremely positive reviews so far. Tim sees ticket-based interactions - just throwing requirements on someone's JIRA backlog or similar - as fundamentally flawed. If Team A gives Team B requirements, Team B just looks to close the ticket versus getting both sides in the room to exchange context and have a negotiation. Tim prefers two modes of interactions over ticket systems: #1 - no human-touch, automated interactions, e.g. an API; and #2 - high touch, high context sharing interactions. For resilience engineering specifically, you should apply learnings to each data product AND the mesh as a whole. Part of that is a broad acceptance that you are in a highly dynamic and highly changing org - there will be changes! 
A few anti-patterns to resilience engineering that apply to data mesh are: 1) a hub and spoke relationship model where one person is the key glue - this is bad at a human level and even worse at a technical level :); 2) business leaders pushing for metrics without sharing the specific context as the results end up as completely empty and useless things you are tracking; and 3) not embedding people building platforms into the teams they are building the platform for - they must really understand the workflows.
Books/posts/papers mentioned:
Blameless PostMortems and a Just Culture by John Allspaw - https://www.etsy.com/codeascraft/blameless-postmortems/
The Theory of Graceful Extensibility: Basic rules that govern adaptive systems by David D Woods - https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems
The Field Guide to Understanding 'Human Error' by Sidney Dekker - https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/1472439058
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here: https://docs.google.com/document/d/1WkXLhSH7mnbjfTChD0uuYeIF5Tj0UBLUP4Jvl20Ym10/edit?usp=sharing
All music used this episode created by Lesfm (intro includes slight edits by Scott Hirleman): https://pixabay.com/users/lesfm-22579021/
Data Mesh Radio is brought to you as a community resource by DataStax.
Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under "add payment"): https://www.datastax.com/products/datastax-astra?utm_source=DataMeshRadio (AstraDB)

books resilience theory humans scale engineering practices api apis wayfair jira data sharing principal engineer human error tischler new relic lesfm datastax team b just culture sidney dekker john allspaw field guide understanding human error

Personal DevOps Aha Moments, the Rise of Infrastructure, and the DevOps Enterprise Scenius: Interviews with The DevOps Handbook Coauthors (Part 1 of 2: Patrick Debois and John Willis)

Play Episode Listen Later Dec 16, 2021 139:36

In part one of this two-part episode on The DevOps Handbook, Second Edition, Gene Kim speaks with coauthors Patrick Debois and John Willis about the past, present, and future of DevOps. By sharing their personal stories and experiences, Kim, Debois, and Willis discuss the scenius that inspired the book, and why and how the DevOps movement took hold around the world. They also examine the updated content in the book, including new case studies, updated metrics, and practices. Finally, they each share the new lessons they have learned since writing the handbook and the future challenges they think DevOps professionals need to solve. Kim will conclude the series in Part 2, where he interviews the remaining two coauthors, Jez Humble and Dr. Nicole Forsgren.
ABOUT THE GUEST(S)
Patrick Debois is considered to be the godfather of the DevOps movement after he coined the term DevOps accidentally in 2008. Through his work, he creates synergies between projects and operations by using Agile techniques in development, project management, and system administration. He has worked in several companies such as Atlassian, Zender, and VRT Media Lab. Currently, he is a Labs Researcher at Snyk and an independent IT consultant.
John Willis is an author and Senior Director of the Global Transformation Office at Red Hat. He has been an active force in the IT management industry for over 35 years. Willis' experience includes being the Director of Ecosystem Development at Docker, the VP of Solutions for Socketplane, and the VP of Training and Services at Opscode. He also founded Gulf Breeze Software, an award-winning IBM business partner, which specializes in deploying Tivoli technology for the enterprise.
Patrick Debois and John Willis are two of five coauthors of The DevOps Handbook along with Gene Kim, Jez Humble, and Nicole Forsgren, PhD.
YOU'LL LEARN ABOUT
The DevOps origin story from coining the term, why it took off, to launching the DevOps Days conference as an offshoot of the Velocity conference.
How people thought of DevOps when it was first presented (their reactions, their mentalities, and their willingness to adopt it).
What has changed in the DevOps world since the first edition of The DevOps Handbook was published.
How the rise of SaaS companies is altering the DevOps world and participating in its evolution, and how building solid relationships with SaaS vendors and communicating comprehensive feedback to them is integral to DevOps.
The significance of speed in changing team dynamics.
Why resilient companies like Google and Amazon engineer chaos, and why companies like Toyota are happy when production stoppages happen.
Why you can't afford to provide a high variety of products if you also offer high product variation.
RESOURCES
Get The DevOps Handbook (Second Edition) Nudge vs Shove: A Conversation With Richard Thaler Solaris Zones wiki Agile Conference in Toronto 2008 Sys Advent article: In Defense of the Modern Day JVM (Java Virtual Machine) by Gene Kim Mob programming Breaking Traditional IT Paradigms to... (San Francisco 2015) Crowdsourcing Technology Governance (Las Vegas 2018) Laying Down the Tracks for Technical Change at Comcast (Las Vegas 2020) 10+ Deploys Per Day by John Allspaw and Paul Hammond 10+ Deploys Per Day How chaos engineering works at Vanguard Patrick Debois tweet mapping out all the failure modes of an online conference. Jesse Robbins LinkedIn Jesse Robbins on Twitter How A Hotel Company Ran $30B of Revenue In Containers (Las Vegas 2020) by Dwayne Holmes Google Cloud Certified Fellow Program Operations is a competitive advantage… (Secret Sauce for Startups!)
Love Letter To Conferences (And What Makes Some Truly Amazing) by Gene Kim Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results by Mike Rother Profound podcast by John Willis Ben Rockwood on Twitter Luke Kanies on LinkedIn DevOps 2020 - The Next Decade (London 2020) Beyond the Phoenix Project: The Origins and Evolution of DevOps by Gene Kim and John Willis The Goal: A Process of Ongoing Improvement by Eliyahu M. Goldratt and Jeff Cox The Convergence Of DevOps Operations as a Strategic Weapon by John Willis Iterative Enterprise SRE Transformation (US 2021)
TIMESTAMPS
[00:00] Intro
[01:18] What's new and improved in the second edition of the DevOps Handbook
[03:56] Meet Patrick Debois
[10:35] How faster technology made ideas like DevOps possible
[18:11] The myths and inefficiencies of team autonomy
[20:04] What the first DevOps days were like
[27:59] Different opinions between the dev community and ops community
[30:49] Mob programming and the future of collaboration
[39:31] Two surprising things Patrick learned about DevOps
[47:20] Patrick Debois' favorite DevOps patterns
[51:28] How fear of not delivering on time can mask technical errors
[59:45] What Patrick Debois is working on these days
[1:04:38] What was expanded in the second edition of the DevOps Handbook
[1:06:30] How Gene Kim entered the DevOps world.
[1:07:38] Meet John Willis
[1:10:42] Why the DevOps movement took off
[1:16:00] Mastering production disasters
[1:23:32] The birth of the DevOps Days conference
[1:37:37] Feelings of belonging and connection in a conference
[1:41:29] A few clarifications
[1:49:32] Two of the greatest DevOps open spaces
[1:52:40] The difference between variety and variation (the cost of knowledge work).
[2:07:12] Why you should want more stoppages in your production line
[2:10:16] John Willis' two favorite DevOps case studies
[2:18:55] Outro

director amazon google personal training san francisco phd evolution startups feelings services mastering ibm infrastructure saas senior director enterprise toyota willis agile handbook secret sauce co authors mob devops red hat docker atlassian aha moments in defense second edition tivoli laying down gene kim john willis synk devopsdays eliyahu m jez humble zender nicole forsgren ongoing improvement john allspaw

Mandi Walls

Break Things On Purpose

Play Episode Listen Later Dec 14, 2021 36:53

In this episode, we cover:
00:00:00 - Introduction
00:04:30 - Early Dark Days in Chaos Engineering and Reliability
00:08:27 - Anecdotes from the “Long Dark Time”
00:16:00 - The Big Changes Over the Years
00:20:50 - Mandi's Work at PagerDuty
00:27:40 - Mandi's Tips for Better DevOps
00:34:15 - Outro
Links:
PagerDuty: https://www.pagerduty.com
Transcript
Jason: — hilarious or stupid?
Mandi: [laugh]. I heard that; I listened to the J. Paul Reed episode and I was like, “Oh, there's, like, a little, like, cold intro.” And I'm like, “Oh, okay.”
Jason: Welcome to Break Things on Purpose, a podcast about reliability and learning from failure. In this episode, we take a trip down memory lane with Mandi Walls to discuss how much technology, reliability practices, and chaos engineering have evolved over her extensive career in technology.
Jason: Everybody, welcome to the show, Julie Gunderson, who recently joined Gremlin on the developer advocacy team. How's it going, Julie?
Julie: Great, Jason. Really excited to be here.
Jason: So, Mandi is actually a guest of yours. I mean, we both have been friends with Mandi for quite a while but you had the wonderful opportunity of working with Mandi.
Julie: I did, and I was really excited to have her on our podcast now as we ran a podcast together at PagerDuty when we worked there. Mandi has such a wealth of knowledge that I thought we should have her share it with the world.
Mandi: Oh, no. Okay.
Julie: [laugh].
Jason: “Oh, no?” Well, in that case, Mandi, why don't you—
Mandi: [crosstalk 00:01:28]. I don't know.
Jason: Well, in that case with that, “Oh no,” let's have Mandi introduce herself. [laugh].
Mandi: Yeah hi. So, thanks for having me. I am Mandi Walls. I am currently a DevOps advocate at PagerDuty, Julie's last place of employment before she left us to join Jason at Gremlin.
Julie: And Mandi, we worked on quite a few things over at PagerDuty.
We actually worked on things together, joint projects between Gremlin, when it was just Jason and us where we would run joint workshops to talk about chaos engineering and actually how you can practice your incident response. And I'm sure we'll get to that a little bit later in the episode, but will you kick us off with your background so everybody knows why we're so excited to talk to you today?
Mandi: Oh, goodness. Well, so I feel like I've been around forever. [laugh]. Prior to joining PagerDuty, I spent eight-and-a-half years at Chef Software, doing all kinds of things there, so if I ever trained you on Chef, I hope it was good.
Prior to joining Chef, I was a systems administrator for AOL.com and a bunch of other platforms and sites at AOL for a long time. So, things like Moviefone, and the AOL Sports Channel, and dotcom, and all kinds of things. Most of them ran on one big platform because the monolith was a thing. So yeah, my background is largely in operations, and just systems administration on that side.
Jason: I'm laughing in the background because you mentioned Moviefone, and whenever I think of Moviefone, I think of the Seinfeld episode where Kramer decides to make a Moviefone competitor, and it's literally just his own phone number, and people call up and he pretends to be that, like, robotic voice and has people, like, hit numbers for which movie they want to see and hear the times that it's playing. Gives a new meaning to the term on-call.
Mandi: Indeed. Yes, absolutely.
Julie: And I'm laughing just because I recently watched Hackers and, you know, they needed that AOL.com disc.
Mandi: That's one of my favorite movies. Like, it's so ridiculous, but also has so many gems of just complete nonsense in it. Absolutely love Hackers.
“Hack the planet.”Julie: “Hack the planet.” So, with hacking the planet, Mandi, and your time working at AOL with the monolith, let's talk a little bit because you're in the incident business right now over at PagerDuty, but let's talk about the before times, the before we practiced Chaos Engineering and before we really started thinking about reliability. What was it like?Mandi: Yeah, so I'll call this the Dark Ages, right? So before the Enlightenment. And, like, for folks listening at home, [laugh] the timeline here is probably—so between two-thousand-and-fi—four, five, and 2011. So, right before the beginning of cloud, right before the beginning of, like, Infrastructure as Code, and DevOps and all those things that's kind of started at, like, the end of my tenure at AOL. So, before that, right—so in that time period, right, like, the web was, it wasn't like it was just getting started, but, like, the Web 2.0 moniker was just kind of getting a grip, where you were going from the sort of generic sites like Yahoo and Yellow Pages and those kinds of things and AOL.com, which was kind of a collection of different community bits and news and things like that, into more personalized experiences, right?So, we had a lot of hook up with the accounts on the AOL side, and you could personalize all of your stuff, and read your email and do all those things, but the sophistication of the systems that we were running was such that like, I mean, good luck, right? It was migration from commercial Unixes into Linux during that era, right? So, looking at when I first joined AOL, there were a bunch of Solaris boxes, and some SGIs, and some other weird stuff in the data center. You're like, good luck on all that. And we migrated most of those platforms onto Linux at that time; 64 bit. Hurray.At least I caught that. 
And there was an increase in the use of open-source software for big commercial ventures, right, and so less of a reliance on commercial software and COTS solutions for things, although we did have some very interesting commercial web servers that—God help them, they were there, but were not a joy, exactly, to work on because the goals were different, right? That time period was a huge acceleration. It was like a Cambrian explosion of software pieces, and tools, and improvements, and metrics, and monitoring, and all that stuff, as well as improvements on the platform side. Because you're talking about that time period also being the migration from bare metal and, like, ordering machines by the rack, which really only a handful of players need to do now, and that was what everybody was doing then. And in through the earliest bits of virtualization and really thinking about only deploying the structures that you needed to meet the needs of your application, rather than saying, “Oh, well, I can only order gear, I can only do my capacity planning once a year when we do the budget, so like, I got to order as much as they'll let me order and then it's going to sit in the data center spinning until I need it because I have no ability to have any kind of elastic capacity.” So, it was a completely, [laugh] completely different paradigm from what things are now. We have so much more flexibility, and the ability to, you know, expand and contract when we need to, and to shape our infrastructures to meet the needs of the application in such a more sophisticated and almost graceful way that we really didn't have then. So, it was like, “Okay, so I'm running these big websites; I've got thousands of machines.” Like, not containers, not services. Like, there's tens of thousands of services, but there's a thousand machines in one location, and we've got other things spread out. 
There's like, six different pods of things in different places and all this other crazy business going on. At the same time, we were also running our own CDN, and like, I totally recommend you never, ever do that for any reason. Like, just—yeah. It was a whole experience and I still sometimes have, like, anxiety dreams about, like, the configuration for some of our software that we ran at that point. And all of that stuff is—it was a long… dark time. Julie: So, now speaking of anxiety dreams, during that long, dark time that you mentioned, there had to have been some major incidents, something that stands out that you just never want to relive. And, Mandi, I would like to ask you to relive that for us today. Mandi: [laugh]. Okay, well, okay, so there's two that I always tell people about because they were so horrific in the moment, and they're still just, like, horrible to think about. But, like, the first one was Thanksgiving morning, sometime early in the morning, like, maybe 2 a.m., something like that, I was on call. I was at my mom's, so at the time, my mom had terrible internet access. And again, this time period don't have a lot of—there was no LTE or any kind of mobile data, right? So, I'm, like, on my mom's, like, terrible modem. And something happened to the database behind news.aol.com—which was kind of a big deal at the time—and unfortunately, we were in the process of, like, migrating off of one kind of database onto another kind of database. News was on the target side but, like, the actual platform that we were planning to move to for everything else, but the [laugh] database on-call, the poor guy was only trained up in the old platform, so he had no idea what was going on. And yeah, we were on that call—myself, my backup, the database guy, the NOC analyst, and a handful of other people that we could get hold of—because we could not get in touch with the team lead for the new database platform to actually fix things. And that was hours. 
Like, I missed Thanksgiving dinner. So, my family eats Thanksgiving at midday rather than in the evening. So, that was a good ten-hour call. So, that was horrifying. The other one wasn't quite as bad as that, but like, the interesting thing about the platform we were running at the time was it was AOLserver, don't even look it up. Like, it was just crazytown. And it was—some of the interesting things about it was you could actually get into the server platform and dig around in what the threads were doing. Each of the servers had, like, a control port on it and I could log into the control port and see what all the requests were doing on each thread that was live. And we had done a big push of a new release of dotcom onto that platform, and everything fell over. And of course, we've got, like, sites in half a dozen different places. We've got, you know, distributed DNS that's, like, trying to throw traffic between different locations as they fall over. So, I'm watching, like, all of these graphs oscillate as, like, traffic pours out of the [Secaucus 00:11:10] or whatever we were doing, and into Mountain View or something and, like, then all the machines in Secaucus recover. So, then they start pinging and traffic goes back, and, like, they just fall over, over and over again. So, what happened there was we didn't have enough threads configured in the server for the new time duration for the requests, so we had to, like, just boost up all of the threads we could handle and then restart all of the applications. But that meant pushing out new config to all the thousands of servers that were in the pool at the time and then restarting all of them. So, that was exciting. That was the outage where I learned that the CTO knew how to call my desk. So, highly don't recommend that. But yeah, it was an experience. So. Julie: So, that's really interesting because there's been so many investments now in reliability. 
And when we talk about the Before Times when we had to cap our text messages because they cost us ten cents a piece, or when we were using those AOL discs, the thought was there; we wanted to make that user experience better. And you brought up a couple of things, you know, you were moving to those more personalized experiences, you were migrating those platforms, and you actually talked about your metrics and monitoring. And I'd like to dig in a little on that and see, how did that help you during those incidents? And after those incidents, what did you do to ensure that these types of incidents didn't occur again in the future?Mandi: Yeah, so one of the interesting things about, you know, especially that time period was that the commercially available solutions, even some of the open-source solutions were pretty immature at that time. So, AOL had an internally built solution that was fascinating. And it's unfortunate that they were never able to open-source it because it would have been something interesting to sort of look at. Scale of it was just absolutely immense. But the things that we could look at the time to sort of give us, you know, an indication of something, like, an AOL.com, it's kind of a general purpose website; a lot of different people are going to go there for different reasons.It's the easiest place for them to find their email, it's the easiest place for them to go to the news, and they just kind of use it as their homepage, so as soon as traffic starts dropping off, you can start to see that, you know, maybe there's something going on and you can pull up sort of secondary indicators for things like CPU utilization, or memory exhaustion, or things like that. 
Some of the other interesting things that would come up there is, like, for folks who are sort of intimately tied to these platforms for long periods of time, to get to know them as, like, their own living environment, something like—so all of AOL's channels at the time were on a single platform.—like, hail to the monolith; they all live there—because it was all linked into one publishing site, so it made sense at the time, but like, oh, my goodness, like, scaling for the combination of entertainment plus news plus sports plus all the stuff that's there, there's 75 channels at one time, so, like, the scaling of that is… ridiculous.But you could get a view for, like, what people were actually doing, and other things that were going on in the world. So like, one summer, there were a bunch of floods in the Midwest and you could just see the traffic bottom out because, like, people couldn't get to the internet. So, like, looking at that region, there's, like, a 40% drop in the traffic or whatever for a few days as people were not able to be online. Things like big snowstorms where all the kids had to stay home and, like, you get a big jump in the traffic and you get to see all these things and, like, you get to get a feel for more of a holistic attachment or holistic relationship with a platform that you're running. It was like it—they are very much a living creature of their own sort of thing.Like, I always think of them as, like, a Kraken or whatever. 
Like, something that's a little bit menacing, you don't really see all of it, and there's a lot of things going on in the background, but you can get a feel for the personality and the shape of the behaviors, and knowing that, okay, well, now we have a lot of really good metrics to say, “All right, that one 500 error, it's kind of sporadic, we know that it's there, it's not a huge deal.” Like, we did not have the sophistication of tooling to really be able to say that quantitatively, like, and actually know that but, like, you get a feel for it. It's kind of weird. Like, it's almost like you're just kind of plugged into it yourself. It's like the scene in The Matrix where the operator guy is like, “I don't even see the text anymore.” Right? Like, he's looking directly into the Matrix. And you can, kind of like—you spend a lot of time with [laugh] those applications, you get to know how they operate, and what they feel like, and what they're doing. And I don't recommend it to anyone, but it was absolutely fascinating at the time. Julie: Well, it sounds like it. I mean, anytime you can relate anything to The Matrix, it is going to be quite an experience. With that said, though, and the fact that we don't operate in these monolithic environments anymore, how have you seen that change? Mandi: Oh, it's so much easier to deal with. Like I said, like, your monolithic application, especially if there are lots of different and diverse functionalities in it, like, it's impossible to deal with scaling them. And figuring out, like, okay, well, this part of the application is memory-bound, and here's how we have to scale for that; and this part of the application is CPU-bound; and this part of the application is I/O-bound. 
And, like, peeling all of those pieces apart so that you can optimize for all of the things that the application is doing in different ways when you need to make everything so much smoother and so much more efficient, across, like, your entire ecosystem over time, right?Plus, looking at trying to navigate the—like an update, right? Like, oh, you want to do an update to your next version of your operating system on a monolith? Good luck. You want to update the next version of your runtime? Plug and pray, right? Like, you just got to hope that everybody is on board.So, once you start to deconstruct that monolith into pieces that you can manage independently, then you've got a lot more responsibility on the application teams, that they can see more directly what their impacts are, get a better handle on things like updates, and software components, and all the things that they need independent of every other component that might have lived with them in the monolith. Noisy neighbors, right? Like, if you have a noisy neighbor in your apartment building, it makes everybody miserable. Let's say if you have, like, one lagging team in your monolith, like, nobody gets the update until they get beaten into submission.Julie: That is something that you and I used to talk about a lot, too, and I'm sure that you still do—I know I do—was just the service ownership piece. Now, you know who owns this. Now, you know who's responsible for the reliability.Mandi: Absolutely.Julie: You know, I'm thinking back again to these before times, when you're talking about all of the bare metal. Back then, I'm sure you probably didn't pull a Jesse Robbins where you went in and just started unplugging cords to see what happened, but was there a way that AOL practiced Chaos Engineering with maybe not calling it that?Mandi: It's kind of interesting. 
Like, watching the evolution of Chaos Engineering from the early days when Netflix started talking about it and, like, the way that it has emerged as being a more deliberate practice, like, I cannot say that we ever did any of that. And some of the early internet culture, right, is really built off of telecom, right? It was modem-based; people dialed into your POP, and like, that was the reliability they were expecting was very similar to what they expect out of a telephone, right? Like, the reason we have, like, five nines as a thing is because you want to pick up dial tone, and—pick up your phone and get dial tone on your line 99.999% of the time.Like, it has nothing to do with the internet. It's like 1970s circuits with networking. For part of that reason, like, a lot of the way things were built at that time—and I can't speak for Yahoo, although I suspect they had a very similar setup—that we had a huge integration environment. It's completely insane to think now that you would build an integration environment that was very similar in scope and scale to your production environment; simply does not happen. But for a lot of the services that we had at that time, we absolutely had an integration environment that was extraordinarily similar.You simply don't do that anymore. Like, it's just not part of—it's not cost effective. And it was only cost effective at that time because there wasn't anything else going on. Like, you had, like, the top ten sites on the internet, and AOL was, like, number three at the time. So like, that was just kind of the way things are done.So, that was kind of interesting and, like, figuring out that you needed to do some kind of proactive planning for what would happen just wasn't really part of the culture at the time. 
Like, we did have a NOC and we had some amazing engineers on the NOC that would help us out and do some of the things that we automate now: putting a call together, or when paging other folks into an incident, or helping us with that kind of response. I don't ever remember drilling on it, right, like we do. Like, practicing that, pulling a game day, having, like, an actual plan for your reliability along those lines.Julie: Well, and now I think that yeah, the different times are that the competitive landscape is real now—Mandi: Yeah, absolutely.Julie: And it was hard to switch from AOL to something else. It was hard to switch from Facebook to MySpace—or MySpace to Facebook, I should say.Mandi: Yeah.Julie: I know that really ages me quite a bit.Mandi: [laugh].Julie: But when we look at that and when we look at why reliability is so important now, I think it's because we've drilled it into our users; the users have this expectation and they aren't aware of what's happening on the back end. They just kn—Mandi: Have no idea. Yeah.Julie: —just know that they can't deposit money in their bank, for example, or play that title at Netflix. And you and I have talked about this when you're on Netflix, and you see that, “We can't play this title right now. Retry.” And you retry and it pops back up, we know what's going on in the background.Mandi: I always assume it's me, or, like, something on my internet because, like, Netflix, they [don't ever 00:21:48] go down. But, you know, yeah, sometimes it's [crosstalk 00:21:50]—Julie: I just always assume it's J. Paul doing some chaos engineering experiments over there. But let's flash forward a little bit. I know we could spend a lot of time talking about your time at Chef, however, you've been over at PagerDuty for a while now, and you are in the incident response game. You're in that lowering that Mean Time to Identification and Resolution. And that brings that reliability piece back together. 
Do you want to talk a little bit about that?Mandi: One of the things that is interesting to me is, like, watching some of these slower-moving industries as they start to really get on board with cloud, the stairstep of sophistication of the things that they can do in cloud that they didn't have the resources to do when they were using their on-premises data center. And from an operation standpoint, like, being able to say, “All right, well, I'm going from, you know, maybe not bare metal, but I've got, like, some kind of virtualization, maybe some kind of containerization, but like, I also own the spinning disks, or whatever is going on there—and the network and all those things—and I'm putting that into a much more flexible environment that has modern networking, and you know, all these other elastic capabilities, and my scaling and all these things are already built in and already there for me.” And your ability to then widen the scope of your reliability planning across, “Here's what my failure domains used to look like. Here's what I used to have to plan for with thinking about my switching networks, or my firewalls, or whatever else was going on and, like, moving that into the cloud and thinking about all right, well, here's now, this entire buffet of services that I have available that I can now think about when I'm architecting my applications for the cloud.” And that, just, expanded reliability available to you is, I think, absolutely amazing.Julie: A hundred percent. And then I think just being able to understand how to respond to incidents; making sure that your alerting is working, for example, that's something that we did in that joint workshop, right? We would teach people how to validate their alerting and monitoring, both with PagerDuty and Gremlin through the practice of incident response and of chaos engineering. 
And I know that one of the practices at PagerDuty is Failure Fridays, and having those regular game days that are scheduled are so important to ensuring the reliability of the product. I mean, PagerDuty has no maintenance windows, correct?Mandi: No that—I don't think so, right?Julie: Yeah. I don't think there's any planned maintenance windows, and how do we make sure for organizations that rely on PagerDuty—Mandi: Mm-hm.Julie: —that they are one hundred percent reliable?Mandi: Right. So, you know, we've got different kinds of backup plans and different kinds of rerouting for things when there's some hiccup in the platform. And for things like that, we have out of band communications with our teams and things like that. And planning for that, having that game day to just be able to say—well, it gives you context. Being able to say, “All right, well, here's this back-end that's kind of wobbly. Like, this is the thing we're going to target with our experiments today.”And maybe it's part of the account application, or maybe it's part of authorization, or whatever it is; the team that worked on that, you know, they have that sort of niche view, it's a little microcosm, here's a little thing that they've got and it's their little widget. And what that looks like then to the customer, and that viewpoint, it's going to come in from somewhere else. So, you're running a Failure Friday; you're running a game day, or whatever it is, but including your customer service folks, and your front-end engineers, and everyone else so that, you know, “Well, hey, you know, here's what this looks like; here's the customers' report for it.” And giving you that telemetry that is based on customer experience and your actual—what the business looks like when something goes wrong deep in the back end, right, those deep sea, like, angler fish in the back, and figuring out what all that looks like is an incredible opportunity. 
Like, just being able to know what's going to happen there, what the interface is going to look like, what things don't load, when things take a long time, what your timeouts look like, did you really even think about that, but they're cascading because it's actually two layers back, or whatever you're working on, like that kind of insight, like, is so valuable for your application engineers as they're improving all the pieces of architecture, whether it's the most front-end user-facing things, or in the deep back-end that everybody relies on. Julie: Well, absolutely. And I love that idea of bringing in the different folks like the customer service teams, the product managers. I think that's important on a couple of levels because not only are you bringing them into this experience so they're understanding the organization and how folks operate as a whole, but you're building that culture, that failure is acceptable and that we learn from our failures and we make our systems more resilient, which is the entire goal. Mandi: The goal. Julie: And you're sharing the learning. When we operate in silos—which even now as much as we talk about how terrible it is to be in siloed teams and how we want to remove silos, it happens. Silos just happen. And when we can break down those barriers, any way that we can to bring the whole organization in, I think it just makes for a stronger organization, a stronger culture, and then ultimately a stronger product where our customers are living. Mandi: Yeah. Julie: Now, I really do want to ask you a couple of things for some fun here. But if you were to give one tip, what is your number one tip for better DevOps? Mandi: Your DevOps is always going to be—like, I'm totally on board with John Willis's [CAMS 00:27:57] to, like, move to CALMS sort of model, right? So, you've got your culture, your automation, your learning, your metrics, and your sharing. 
For better DevOps, I think one of the things that's super important—and, you know, you and I have hashed this out in different things that we've done—we hear about it in other places, is definitely having empathy for the other folks in your organization, for the work that they're doing, and the time constraints that they're under, and the pressures that they're feeling. Part of that then sort of rolls back up to the S part of that particular model, the sharing. Like, knowing what's going on, not—when we first started out years ago doing sort of DevOps consulting through Chef, like, one of the things we would occasionally run into is, like, you'd ask people where their dashboards were, like, how are they finding out, you know, what's going on, and, like, the dashboards were all hidden and, like, nobody had access to them; they were password protected, or they were divided up by teams, like, all this bonkers nonsense.And I'm like, “You need to give everybody a full view, so that they've all got a 360 view when they're making decisions.” Like you mentioned your product managers as part of, like, being part of your practice; that's absolutely what you want. They have to see as much data as your applications engineers need to see. Having that level of sharing for the data, for the work processes, for the backlog, you know, the user inputs, what the support team is seeing, like, you're getting all of this input, all this information, from everywhere in your ecosystem and you cannot be selfish with it; you cannot hide it from other people.Maybe it doesn't look as nice as you want it to, maybe you're getting some negative feedback from your users, but pass that around, and you ask for advice; you ask for other inputs. How are we going to solve this problem? And not hide it and feel ashamed or embarrassed. We're learning. 
All this stuff is brand new, right?Like, yeah, I feel old talking about AOL stuff, but, like, at the same time, like, it wasn't that long ago, and we've learned an amazing amount of things in that time period, and just being able to share and have empathy for the folks on your team, and for your users, and the other folks in your ecosystem is super important.Julie: I agree with that. And I love that you hammer down on the empathy piece because again, when we're working in ones and zeros all day long, sometimes we forget about that. And you even mentioned at the beginning how at AOL, you had such intimate knowledge of these applications, they were so deep to you, sometimes with that I wonder if we forget a little bit about the customer experience because it's something that's so close to us; it's a feature maybe that we just believe in wholeheartedly, but then we don't see our customers using it, or the experience for them is a little bit rockier. And having empathy for what the customer may go through as well because sometimes we just like to think, “Well, we know how it works. You should be able to”—Mandi: Yes.Julie: Yes. And, “They're definitely not going to find very unique and interesting ways to break my thing.” [laugh].Mandi: [laugh]. No, never.Julie: Never.Mandi: Never.Julie: And then you touched on sharing and I think that's one thing we haven't touched on yet, but I do want to touch on a little bit. Because with incident—with incident response, with chaos engineering, with the learning and the sharing, you know, an important piece of that is the postmortem.Mandi: Absolutely.Julie: And do you want to talk a little bit about the PagerDuty view, your view on the postmortems?Mandi: As an application piece, like, as a feature, our postmortem stuff is under review. But as a practice, as a thing that you do, like, a postmortem is an—it should be an active word; like, it's a verb, right? 
You hol—and if you want to call it a post-incident review, or whatever, or post-incident retrospective, if you're more comfortable with those words, like that's great, and that's—as long as you don't put a hyphen in postmortem, I don't care. So, like— Julie: I agree with you. No hyphen— Mandi: [laugh]. Julie: —please. [laugh]. Mandi: Please, no hyphen. Whatever you want to call that, like, it's an active thing. And you and I have talked a number of times about blamelessness and, like, making sure that what you do with that opportunity, this is—it's a gift, it's a learning opportunity after something happened. And honestly, you probably need to be running them, good or bad, for large things, but if you have a failure that impacted your users and you have this opportunity to sit down and say, all right, here's where things didn't go as we wanted them to, here's what happened, here's where the weaknesses are in our socio-technical systems, whether it was a breakdown in communication, or breakdown in documentation, or, like, we found a bug or, you know, [unintelligible 00:32:53] defect of some kind, like, whatever it is, taking that opportunity to get that view from as many people as possible is super important. And they're hard, right? And, like, we—John Allspaw, on our podcast, right, last year talked a bit about this. And, like, there's a tendency to sort of write the postmortem and put it on a shelf like it's, like, in a museum or whatever. They are hopefully, like, they're learning documents that are things that maybe you have your new engineers sort of review to say, “Here's a thing that happened to us. What do you think about this?” Like, maybe having, like, a postmortem book club or something internally so that the teams that weren't maybe directly involved have a chance to really think about what they can learn from another application's learning, right, what opportunities are there for whatever has transpired? 
So, one of the things that I will say about that is like they aren't meant to be write-only, right? [laugh]. They're—Julie: Yeah.Mandi: They're meant to be an actual living experience and a practice that you learn from.Julie: Absolutely. And then once you've implemented those fixes, if you've determined the ROI is great enough, validate it.Mandi: Yes.Julie: Validate and validate and validate. And folks, you heard it here first on Break Things on Purpose, but the postmortem book club by Mandi Walls.Mandi: Yes. I think we should totally do it.Julie: I think that's a great idea. Well, Mandi, thank you. Thank you for taking the time to talk with us. Real quick before we go, did you want to talk a little bit about PagerDuty and what they do?Mandi: Yes, so Page—everyone knows PagerDuty; you have seen PagerDuty. If you haven't seen PagerDuty recently, it's worth another look. It's not just paging anymore. And we're working on a lot of things to help people deal with unplanned work, sort of all the time, right, or thinking about automation. We have some new features that integrate more with our friends at Rundeck—PagerDuty acquired Rundeck last year—we're bringing out some new integrations there for Rundeck actions and some things that are going to be super interesting for people.I think by the time this comes out, they'll have been in the wild for a few weeks, so you can check those out. As well as, like, getting better insight into your production platforms, like, with a service graph and other insights there. So, if you haven't looked at PagerDuty in a while or you think about it as being just a place to be annoyed with alerts and pages, definitely worth revisiting to see if some of the other features are useful to you.Julie: Well, thank you. And thanks, Mandi, and looking forward to talking to you again in the future. And I hope you have a wonderful day.Mandi: Thank you, Julie. 
Thank you very much for having me.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

god spotify netflix thanksgiving technology battle work real news chaos tips resolutions chefs code pop web matrix software scale midwest roi hack walls infrastructure hackers yahoo cto programming enlightenment seinfeld plug gremlins myspace io kraken san francisco bay area kramer linux aol identification devops cpu anecdotes dark ages noisy dns silos mountain view solaris cdn cams hurray lte yellow pages noc calms pogs cambrian pagerduty paul reed break things komiku chaos engineering john wallace retry secaucus moviefone john allspaw jesse robbins

Episode #35: Interview with John Allspaw

Naturalistic Decision Making

Play Episode Listen Later Nov 20, 2021 45:05

Date recorded: November 12, 2021

Show Description: Today we welcome John Allspaw. John is an engineering leader and researcher with over 20 years of experience in building and leading teams engaged in software and systems engineering. He is a co-founder of Adaptive Capacity Labs, LLC. Previously, he was Chief Technology Officer at Etsy. He has also worked at Flickr, Friendster, InfoWorld, Salon, Genentech, Volpe National Transportation Center, and a bunch of other places as a consultant from time to time. John has spent the last decade bridging insights from Human Factors, Cognitive Systems Engineering, and Resilience Engineering to the domain of software engineering and operations. His publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. He holds a Master's degree in Human Factors and Systems Safety from Lund University.

Where to find John: LinkedIn, Twitter

Learn more about NDM: NaturalisticDecisionMaking.org, Journal of Cognitive Engineering and Decision Making

Where to find hosts Brian Moon and Laura Militello: Brian's website, Brian's LinkedIn, Brian's Twitter, Laura's website, Laura's LinkedIn, Laura's Twitter

art master llc etsy salon chief technology officer devops velocity flickr human factors genentech lund university friendster capacity planning infoworld paul hammond john allspaw web operations cognitive engineering

Non-Incidentally Keeping Tabs on the Internet with Courtney Nash

Play Episode Listen Later Oct 5, 2021 33:40

About Courtney

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she's held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O'Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.

Links:
Verica: https://www.verica.io
Twitter: https://twitter.com/courtneynash
Email: courtney@verica.io

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in front of your office chair, bleary eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic light colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered, why is it that sales and marketing get all this shiny, awesome analytics and inside tools? Whereas, engineering basically gets left with the dregs. Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack but this is 2021.
They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's Jellyfish.co and tell them Corey sent you! Watch for the wince, that's my favorite part.

Corey: This episode is sponsored in part by our friends at VMware. Let's be honest—the past year has been far from easy. Due to, well, everything. It caused us to rush cloud migrations and digital transformation, which of course means long hours refactoring your apps, surprises on your cloud bill, misconfigurations and headache for everyone trying to manage disparate and fractured cloud environments. VMware has an answer for this. With VMware multi-cloud solutions, organizations have the choice, speed, and control to migrate and optimize applications seamlessly without recoding, take the fastest path to modern infrastructure, and operate consistently across the data center, the edge, and any cloud. I urge you to take a look at vmware.com/go/multicloud. You know my opinions on multi cloud by now, but there's a lot of stuff in here that works on any cloud. But don't take it from me. That's VMware.com/go/multicloud, and my thanks to them again for sponsoring my ridiculous nonsense.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Periodically, websites like to fall into the sea and explode. And it's sort of a thing that we've accepted happens. Well, most of us have. My guest today is Courtney Nash, Internet Incident Librarian at Verica. Courtney, thank you for joining me.

Courtney: Hi, Corey.
Thanks so much for having me.

Corey: So, I'm going to assume that my intro is somewhat accurate, that we've sort of accepted that sites will crash into the sea, the internet will break, and then everyone tears their hair out and complains on Twitter, assuming that's not the thing that fell over this time—

Courtney: [laugh].

Corey: —but what does an Internet Incident Librarian do?

Courtney: Yeah, I'll come back to the first part about how—some people have accepted it and some people haven't, I think is the interesting part. So technically, I think my official real title is, like, research analyst or something really boring, but I have a background in the cognitive sciences and also in technology, and I'm really—have always been fascinated by how these socio-technical systems work. And so as an Internet Incident Librarian, I am doing a number of things to try to better understand—both for myself and, obviously, the company I work for, but for the industry as a whole—what do we really know about how incidents happen, why they happen, when they happen, and what do we do when they happen? And how do we learn from that? So, one of the first things that I'm doing along those lines is actually collecting a database of all of the public write-ups of incidents that happened at companies that are software-related.

So, there's already bodies of work of people who collect airline incidents and other kinds of things. And we don't have that [laugh] as an industry, which I think is—I want to solve that problem because I think other industries that have spent some time introspecting about why things fall down, or when things fall down and how they fall down. Take the airline industry for example; planes don't really fall out of the sky very often.

Corey: No. When it does, it makes news and everyone's scared about flying, but at the same time, it's yeah, do you have any idea how many people die in car crashes in a given hour?

Courtney: Yeah, yeah.
And we'll come back to how the media covers things in a minute because that is definitely something I have opinions about. But, I'm not trying to say I want to create the NTSB of the internet; I don't think that's quite the same thing, and I really want something in the spirit of software, and the internet, and open-source that's more collaborative and it's very open to all of us. So, the first step is to just get them in one place. There is no single place where you could go and say, “Oh, where are all of the X incident reports? Where are all the ones that Microsoft's written, and also Amazon, or Google, or, you know, whoever.”

Corey: They have them, but they hide them so thoroughly. It turns out that they don't really put that in big letters on their corporate blog with links to it. And when you look at one incident report, they don't say, “Here, look at our previous incident reports.” They really—

Courtney: Yeah.

Corey: —should but no one does.

Courtney: And I think that's fascinating because there's a precedent. So, there's two precedents, and I just gave you basically one side of the two, which is, the airline industry has done this and it's not like people don't fly, right? So, a lot of internet companies, a lot of software-based companies, seem to be afraid of what their customers, or what the stock market, or what folks will think. Mind you, these are publicly traded [laugh] airline companies. People aren't going to stop using Amazon just because you give more of this information out.

And so I think that piece is—I would love to see that stop being the case. Because the flip side of the coin is that this is a rising tide lifts all boats kind of thing, which granted, not all companies agree on, especially really big ones because their boats are already mowing all the little ones out of the ocean. But that's another story.

Corey: Sure, but also, it's easy to hide an outage. “Our site is down for you can say three days.
Great, if a customer didn't try to access the site at all during those three days, was the site really down in the first place?”

Courtney: Oh, the tree in the forest of internet outages. Yes, it's true, although I think that companies are—they know that people go complain on social media, right? I think there's more and more of that happening now. It's not like you can hide it as easily as you could have before Twitter or Instagram or—

Corey: Right. Whereas a plane falls out of the sky, generally it's one of those things that people notice.

Courtney: Yeah. Even if you weren't interested in that flight at all.

Corey: Right. When it lands in your garden, you sort of have a comment on this.

Courtney: [laugh]. Yeah. Pieces fall out of the sky. That has happened. But I think the other flip side of the coin I already mentioned is the safety of airline industry has increased so significantly over the past, you know, whatever, 30, 40 years because of this concerted effort.

And the other piece of it, then, as an industry, as technologists, as people who use software to run their businesses, some of those things are now safety-critical. And this comes back to the whole software is running the world now. Planes now actually could fall out of the sky because of software, not just because of hardware failures. And nuclear power plants are [laugh] run by software, and your electronic grid, and your health care systems, heart rate monitors, insulin pumps. There are a lot of really critical things, and now our phone services and our internet stuff is so entwined in our lives, that people can't be on their Zoom calls, people can't run their businesses. So, this stuff has a massive impact on people's lives.
It's no longer just pictures of cats on the internet, which admittedly, we've really honed the machine for that.

Corey: No, but now when software goes down, the biggest arguments people make, the stories people tell is, “Oh, well, it meant that the company lost this much money during that timeframe.” And great, maybe. We can argue about is that really true or is it not? It depends entirely on the company's business model, but I don't like to tend to accept those things at face value. But yeah, that's the small-scale thing, especially when you start getting to these massive platform providers. There are a lot of second and third-order effects that are a lot more interesting slash important to people's lives, than, well, we couldn't show ads to people for an hour and a half.

Courtney: Right. Yes. Absolutely. So, T-Mobile had this outage, what is it, how is time—time is still not working very well, for me. I'm trying to remember if it was earlier this year, or if it was in—it was last year. I think it was 2020. And you're like, T-Mobile, oh okay, whatever. You know, like, cell phones, yadda, yadda. 911 stopped working. [laugh].

And it was a fascinating outage because these are now actually regulated industries that are heavily software-backed. There was a government investigation into that the same way we have NTSB investigations into airline accidents, and they looked at all of those, kind of, second or third-order effects of people who—you know, a grandma who was stranded on the road, people who couldn't call 911, those kinds of things that are really significant impacts on people's lives. And the second-order effect is, oh, yeah, AWS goes down—like you said—and Amazon or people like to say, Jeff Bezos—I guess, now, are they going to complain about how much money Andy loses?
I guess so—but [laugh] what lives on AWS, that's crazy to think about, right?

Corey: Yeah, the more I learn the answer to that question, the more disturbed I become.

Courtney: Well, you'd probably know a better answer to that question [laugh] than a lot of people.

Corey: They have the big companies they can talk about. What's really interesting is the companies that they don't and can't. An easy example: financial services is an industry that is notorious for never granting logo rights. Like, at some point, they'll begrudgingly admit, “Yes, our multinational bank does use computers.” But it's always like pulling teeth, and I get it on some level; the entire philosophy of a lot of these companies is risk-mitigation, rather than growth and advancing the current awareness of knowledge. But it does become a problem.

Courtney: Yeah. It's interesting, I need more data, which we'll get to—help me, people—but I am able to start seeing some of those interesting graphs of, kind of these cascading effects of these kinds of outages. And so I strongly believe that we need to talk about them more, that more companies need to write them up, and publish them, and be a lot more transparent about it. And I think there's a number of companies that are showing the way there that—and it has to do with your first question which is, we've all sort of accepted this, right? But I disagree with that.

I think those of us who are super close to these kinds of complex, dynamic distributed systems totally know that they're going to fail, and that's not shocking, nor the case of incompetence. We are building systems that are so big and so complex, no one person, no 10X engineer out there could possibly model or hold the whole thing in their head. Especially because it's not even just your systems… we were just talking about, right? Your stuff's on GitHub; it's on AWS; there's, like, three other upstream providers; there's this API from over there.
These systems are too intricate, too complex; they're going to fail.

Corey: So, we're back to why all these things failed simultaneously and it comes out it's a Northern woods, middle of nowhere backhoe incident. That's right, if we look at the natural food chain of things, fiber optic cable has a natural predator in the form of a backhoe. To the point where if I'm ever lost in the woods, I will drop a length of fiber, kick some dirt over it, wait a few minutes; a backhoe will be along to sever it. Then I can follow the backhoe back to civilization. They don't teach that one in the Boy Scout manual, but they really should.

Courtney: Yeah. Oh, my gosh. There was a beaver outage in Canada, which is the—[laugh] God, that's the most Canadian thing ever.

Corey: Can you come up with a more Canadian—

Courtney: No.

Corey: —story than that? I would posit you could not, but give it a shot.

Courtney: No, probably not. Anyhoo. So, I think, like I was saying, those of us close to it accept that, understand it, and are trying to now think about, okay, well, how do we change our approach and our philosophy about this, knowing that things will fall down? But I think if you look at a lot of the rest of the world, people are still like, “What are those idiots doing over there? Why did their site fall down?”

Corey: Oh, my God—

Courtney: Right?

Corey: —the general population is the worst on stuff like this. The absolute worst.

Courtney: The media is the worst. [laugh].

Corey: It's, “How did they wind up going down?” “Yeah, because this stuff is complicated.” Back when I was getting started in tech, I thought the whole thing worked on magic, so I started figuring out how different pieces of it worked. And now I'm convinced; it runs on magic. The most amazing thing is this all works together. Because—

Courtney: Yeah.

Corey: —spit and duct tape and baling wire holding this stuff together would be an upgrade from a lot of the stuff that currently exists in the real world.
And it's amazing.

Courtney: I know the secret, Corey. You know what holds it all together?

Corey: Hit me with it. Hope? Tears?

Courtney: People.

Corey: Mmm.

Courtney: Technology is Soylent Green, Corey. It's Soylent Green. It's made of people.

Corey: And that's the thing that always bugs me on Twitter. The whole HugOps movement has it right. When you see a big provider taking an outage, all their competitors are immediately there with, “Man, hope things get back together soon. Best of luck. Let us know if we can help.” And that's super reassuring because today is their outage; tomorrow it's yours.

Courtney: Yep.

Corey: And once in a blue moon, you see someone who's relatively new to the industry starting trying to market their stuff based on someone else's outage, and they basically get their butts fed to them, just because it's this—it's not what you do, and it's not how we operate. And it's one of the few moments where I look at this and realize that maybe people's inherent nature isn't all terrible.

Courtney: [laugh]. Oh. Oh, I would hope that would be something that comes out of all of this.

Corey: Yeah.

Courtney: No one goes to work at their day job doing what we do, to suck. [laugh]. Right? To do a bad job.

Corey: Right. Unless you're in Facebook's ethics department, I completely agree with you.

Courtney: Okay. Yes. All right. There are a few caveats to that, probably. But you know, we all want to show up and do good stuff. So, nobody's going in trying to take the site down, barring bad actor stuff that's not relevant.

Corey: When Azure takes an outage, AWS is not sitting there going, “Ah, we're going to win more cloud deals because of this,” because they're smarter than that. It's, no, people are going to look at this and say, “Ah, see. Told you the cloud was dangerous.” It sets the entire industry back.

Courtney: Yeah.
That's why we need to talk about it more, and we need to just normalize that these things happen and that we can all level up as an industry if we get a lot smarter about how we, A) think about that, and B) how we react to them. And we will develop much more useful models of our safety boundaries, right? That's really it. You don't know—no one at any of these companies hardly knows if you're five steps from the cliff, five feet, driving a Ferrari 90 miles an hour towards the edge of it.

Like, we don't know, it's amazing to me just how much in the dark we are as an industry and how much of the world we're running. So, I think this is one tiny, first little step in what could be sort of a sea change about how all of this works. So, that's a big part of why I'm doing what I'm doing.

Corey: Well, let's talk about something else you're doing. So, tell me a little bit about VOID?

Courtney: Yeah. So, that's the first iteration of this. So, it's the [Verica Open Incident Database 00:14:10]. I feel like I have to say this almost every time: John Allspaw would like me to say that it's the Verica Open Incident Report Database, but VOID is way cooler than—

Corey: VOIRD?

Courtney: VOIRD.

Corey: Yeah, that sounds like you're trying to make fun of someone ineffectively.

Courtney: Yeah. And there's a reason why he's not in marketing. But what this is, is a collection of all of the publicly available incident reports in one place, easily searchable. You can search by company, you can search by technology, you can filter things by the types of, sort of, kinds of failure modes that we're seeing. And it's, I hope, valuable to a wide swath of folks, both technologists and otherwise: researchers, media and press types, analysts, and whatnot.

And my biggest desire is that people will look at it, realize how incomplete it is, and then help me fill it. [laugh]. Help me fill the VOID, people. I think I have right now, at the time we're talking, about 1700, maybe 1800 of these. And they run the gamut.
And I know some people who like to quibble about language—and I am one of those people having been an editor in various flavors of my life—not all of these are what a lot of people directly related to these, sort of, incident management and whatnot would call ‘incident reports.’

I wanted to collect a corpus that reflects all of the public information about software-related incidents. So, it's anything from tweets—either from a company or just from people—to a status page, to a media article, a news article, an online article, to a full-blown deep-dive retrospective or post-mortem from a company that really does go into detail. It's the whole gamut. It's all of those things. I have no opinionated take on that.

I want that all to be available to people. And we've collected some metadata on all of the incidents as well. So, we're collecting the obvious things like when did it happen? What date was it, if we can figure it out, or if it's explicit—how long was it? And those kinds of things and then we collect some metadata, like I said. We add some tags: was this a complete production outage, was it a partial outage? Those kinds of things.

And this is all directly just taken from the language of the report. And we're not trying—like I said—we're trying not to have any sort of really subjective takes on any of that, but a bit of metadata that helps people spelunk some of this stuff. So, if it is the kind of report—these are usually from a status page, or a company post about it—what kinds of things were involved in this outage? So, sometimes you'll get lucky and the company will tell you, “It was DNS,” because, you know, it's always DNS.

Corey: On some level, it always is. That's why—

Courtney: It always is.

Corey: —DNS is my database. It's a database problem.

Courtney: It's a database problem. And sometimes you get even more detail. And so we will put as much of that that's in the report into a set of metadata about these things.
So, I think there's some fascinating, really easy things that I've already seen from some of these data, and we kind of hit on one of these, which is the way that companies themselves talk about these outages versus the way that press and media and other types of organizations talk about these things. So, I think there's a whole bunch of really fascinating analysis that's going to be available to nerdy research-minded type folks like myself.

I think it's a place, though, where technologists can also go and spelunk things that they're interested in, looking for patterns, anything that's really—there's an opportunity for experts in the field to add insights to what we can discern from these public incident reports. They are, like, two orders abstracted from what happened internally, but I think there's still a lot that we can learn from those. So, the first iteration of the VOID will allow people to get a first look at some of the data and to help me, hopefully, add to it, grow that corpus over time, and we'll see where that goes.

This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security.

And - let me be clear here - it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers, needed to support the application that you want to build.

With Always Free you can do things like run small scale applications, or do proof of concept testing without spending a dime. You know that I always like to put asterisks next to the word free.
This is actually free. No asterisk. Start now. Visit https://snark.cloud/oci-free. That's https://snark.cloud/oci-free.

Corey: I love the idea of having a centralized place where outages, post-mortems, root cause analyses—I'll let you tear into that in a minute—and other things that are all tied to where can I find a list of outages. Because companies list these on their websites, they put them in blog posts, and it's always very begrudging; they don't link them from any other place, you have to know the magic incantation to find the buried link on their site. Having something that is easily searchable for outages is really something that's kind of valuable.

Courtney: Yeah. And I mean, some of them are like—I'm looking at you, Microsoft—I like you for a lot of reasons, but hey, I have to scroll your status page. I can't link directly to their write-ups, and—this is Azure—and it [laugh] please stop. Make it easier. [laugh]. You're driving me crazy; I don't even have a data model to figure out how to make this work for people, other than, like, taking screenshots of them.

So yeah, so there's shades of grey and black in how much they'll share, or how easy it is to find these things. So, it'll be interesting to see if there's any less-than-positive [laugh] reactions to all of this being available in one place. I'm anticipating at least a little bit of that.

There is one other type of metadata that we collect for the VOID. And that is the type of analysis that is conducted if it is clear what that type of analysis is. And there, some companies explicitly say, or call it an RCA, “We did a Root Cause Analysis.” There's a few other types; some people talk about having a Contributing Factors Analysis. Most people don't consider a formal analysis type, but I am trying to collect and categorize these because I do think there are some fascinating implications buried therein, and I would like to see if I can keep track of whether or not those change over time.
And yes, you've hit on one of my favorite hot-take soapbox things, which is root cause.

Corey: Please, take it away.

Courtney: Yeah. Well, and anyone who's close to these systems and has watched these things fall down has the inherent sense that there is no root cause. Like—[laugh]—let's—great. One of my favorite ones: human error. We don't have enough hours for this, Corey. I'm sorry. That's one of my favorite other ones. But let's say somebody fat-fingers a config change. Which happens—

Corey: That was fundamentally the S3 service disruption back in—

Courtney: Yes.

Corey: —2017 that took down S3 for hours on end.

Courtney: And took down so many other people that relied on S3.

Corey: Everything was tied to that. And that's an interesting question; when something like that hits, does that mean that everything it takes down gets its own entry in VOID?

Courtney: I hope so. If everybody writes them up, then yes. [laugh]. So, if S3 goes down, and you go down, and you write it up, and you put it in the VOID, then we can see those things, which would be so cool. But let's go back to the fat-fingered config file—which if you haven't ever done, you're lying, first of all—

Corey: Or you haven't been allowed to touch anything large and breakable yet, which, either way, you're lying on some level. So, please—

Courtney: Yeah. I mean, I took down [Halloway's 00:20:53] homepage when it was on Hacker News because of YAML. So, anywho. Even if you fat-finger a config change, that's not the root cause because you have this system wherein a fat-fingered config change can take down S3. That is a very big, complex, and I might add, socio-technical system.

There are decisions that were made long ago about why it was structured that way, or why this happens that way, or what kinds of checks and balances you have. It's just, get over it people. There is no root cause.
These are complex, highly dynamic systems that when they fail, they fail in unpredictable and weird ways because we've built them that way. They're complex because you're successful at pushing the envelope and your safety boundaries.

So, if we could get past the root cause thing as an industry, I mean, I could probably just retire happy, honestly. [laugh]. I'm a simple woman; could we just get one thing, people? [laugh]. First of all, then it gives non-technologists, people outside of our bubble, the media, you can't hang it on these things anymore. We all have to then grapple with the complexity, which admittedly humans, not big fans of, but—

Corey: People want simple stories, simple narratives. When people say, “Oh, remember the S3 outage?” They don't want to sit there and have to recount 50,000 different details. They want to say, “Oh, yeah. It took down a few big sites like Instagram, United Airlines, and it was a real mess.” The end. They want something that fits in a tweet, not something that fits in a thesis.

Courtney: Well, and if you have a single root cause, then you can fix the root cause and it will never happen again. Right?

Corey: That's the theory. If we're just a little bit more careful, we're never going to have outages anymore.

Courtney: Yeah, if we could just train those humans to not try to make the best possible high-quality decision they could possibly make in that situation given the information they have at the time, then we'll do better. But I mean, that's why your systems stay up most of the time, if you think about it. It's shocking how well these things actually work the vast majority of the time. And that's what we could learn from this, too. We could, you know—oh if we would write near-misses up, please.

I mean, if I could have one more wish, I think one of the coolest things the airline industry and the government side of that did was start writing up near-misses.
It's, wow, what do we learn from when we're successful, versus trying to, like, spelunk and nitpick the failures.

Corey: Most of us aren't so good at the whole introspection part. We need failures, we need painful outages to really force us to make difficult, introspective, soul-searching decisions and learn from them.

Courtney: Yeah. And I don't disagree with that. I just wish one of the things we would learn is that we should study our successes, too. There's more to be mined from our successes, if we can figure out how to do that, than there is from our failures. So, I have a metadata category in the VOID called ‘near-miss.’

And oh man, I really wish people would write those up more. I mean, I think there's, like, five things in there that I've found so far. Because the humans hold these systems together. We make these things work the vast majority of the time. That's why there is no root cause, and even when we're involved in these things, we're also involved in preventing them, or solving them, or remediating them. So, yeah, there's no root cause. Humans aren't the problem. Those are my big hot button ones.

Corey: I really wish more places would embrace that. Even Amazon uses the ‘root cause’ terminology internally, and I'm not going to sit here and tell them how to run large things at scale; that's what I pay them to figure out for me. But I can't shake the feeling that by using that somewhat reductive terminology that they're glossing over an awful lot of things the rest of us could really benefit from.

Courtney: Well, so the question then—one of the other things that I look at is, personally when I read and analyze these incident reports, these public ones a lot, I always ask myself, “Who's the audience for this?” And there are different audiences for different types of incident reports and different things. The vast majority of them are for customers, partners, investors.

Corey: The stock market. Yes. Yes.

Courtney: They're not actually for the organization.
There's usually an internal one that we don't get to see—maybe—that's for the organization. But a lot of places feel that if you have a process, and a template, and a checklist, and a list of action items at the end, then you've done the right thing. You've had your incident, you've talked about it, you've got your action items. Move on.

Corey: Right, and it always seems with companies, that as you get further into the company, the more honest and transparent the actual analysis is. Like, at some point, you wind up with the, like, they're very public and very cagey, and under NDA, they open up a little bit more, and a little bit more, and finally, when you work there, their executive team, it turns out, the actual thing was, “Well, Dewey was carrying an armful of boxes in the data center, tripped, went cascading face-first into the EPO cutoff switch that cut power to the entire facility.” The cagier they get, the—I guess, not to be unkind here—but the more ridiculous whatever the actual answer is. It's one of those things where, “Really? Someone tripped and hit a button. You didn't have a plan for that?” “Well, not really. We sort of assumed that people would”—

Courtney: Why would you have a plan for that, right?

Corey: Right.

Courtney: I mean like—[laugh].

Corey: Why would you have a plan for that, the first time?

Courtney: Yeah. I mean, so imagine this exercise: sitting down in a room with a bunch of people and going, “What are all the things that could go wrong?” I mean, [laugh] ain't nobody got time for that. That's not how it works. You all have other jobs to do, too, and systems to build, and pressures, and customers, and partners, and features to build, so admit and acknowledge that you just won't know all of the antecedents and how do you respond when things happen? Which is a whole other, you know—I know you told me you recorded an episode with Dr. 
Christina Maslach on burnout, which I'm so happy you did, and there's a whole ‘nother piece of incidents and incident response, and burning people out, and blaming people, and all that stuff that's a whole ‘nother pod—it sounds like you might—you know, probably not incidents with her. But still, these things take a toll on people. And people who, like I said, show up every day really hoping to do their best job, and go up a ladder, and get a promotion, and whatever. So, I think not just treating those things as checklists has broader implications as well, just for the wellbeing of your organization.Corey: On some level, the biggest problem that I think we've run into is that, as you said, it all comes down to people. Unfortunately, legally, we can't patch those. Yet.Courtney: No, [laugh]. No, no. Not most kinds of patches, no. And that's messy. And I know some people are like, “Everyone should learn to code.” And I'm like, “Actually, everyone should get a liberal arts degree.” Come on, help me out people. Because there's so much of these socio-technical systems where the socio part of it is more relevant than the actual technical part.Corey: I believe you're right, for better or worse; there's no way around it. Thank you so much for taking the time to speak with me. If people want to learn more about what you're up to, where can they find you? And we will, of course, throw a link to VOID in the [show notes 00:28:06].Courtney: Yeah, I also like to talk on Twitter, like you do. I'm not as good at it as you are, but I try. So yeah, I'm @courtneynash on Twitter. And at Verica, you can find me at Verica as well, courtney@verica.io. And those are the best ways to find me, I would say. And yeah, please people, write up your incidents, send them to the VOID and let's all learn and get better together, please.Corey: Thank you so much for taking the time to speak with me today. I really do appreciate it.Courtney: Thank you for having me on. 
I know—do people say this: I'm like, “Yeah, big fan,” but I am. I'm a [laugh] big fan [laugh] of the podcast.Corey: Oh, dear Lord, find better things to listen to. My God.Courtney: [laugh]. But it's been a treat. Thank you.Corey: Courtney Nash, Internet Incident Librarian at Verica. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment making it very clear that for whatever reason the website is down, it is most certainly not your fault.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.


Molding Leadership Within Tech with Adam Zimman

Play Episode Listen Later Sep 22, 2021 37:46

About Adam

Adam Zimman is a start-up Advisor providing guidance on leadership, platform architecture, product marketing, and GTM strategy. He has over 20 years of experience working in a variety of roles from software engineering to technical sales. He has worked in both enterprise and consumer companies such as VMware, EMC, GitHub, and LaunchDarkly. Adam is driven by a passion for inclusive leadership and solving problems with technology. As an Advisor he works with a number of startups and nonprofits. His perspective on life has been shaped by a background in Physics and Visual Art, an ongoing adventure as a husband and father, and a childhood career as a fire juggler.

Links:

Twitter: https://twitter.com/azimman

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

This episode is sponsored in part by our friends at VMware. Let's be honest—the past year has been far from easy. Due to, well, everything. It caused us to rush cloud migrations and digital transformation, which of course means long hours refactoring your apps, surprises on your cloud bill, misconfigurations and headaches for everyone trying to manage disparate and fractured cloud environments. VMware has an answer for this. With VMware multi-cloud solutions, organizations have the choice, speed, and control to migrate and optimize applications seamlessly without recoding, take the fastest path to modern infrastructure, and operate consistently across the data center, the edge, and any cloud. I urge you to take a look at vmware.com/go/multicloud. You know my opinions on multi cloud by now, but there's a lot of stuff in here that works on any cloud. 
But don't take it from me. That's VMware.com/go/multicloud, and my thanks to them again for sponsoring my ridiculous nonsense.

Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in front of your office chair, bleary eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic light colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered, why is it that sales and marketing get all this shiny, awesome analytics and insight tools? Whereas, engineering basically gets left with the dregs. Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack but this is 2021. They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's Jellyfish.co and tell them Corey sent you! Watch for the wince, that's my favorite part.

Corey: Welcome to Screaming in the Cloud. I'm Cloud Economist Corey Quinn, and periodically I like to talk to people about different aspects of the industry. One that I think is interesting that doesn't get spoken about a lot directly is the idea of leadership. 
My guest today is Adam Zimman, who's a startup advisor providing guidance on—as mentioned—leadership, platform architecture, Product Marketing, and GTM Strategy—GTM, of course, standing for go-to-market. Who goes to market? That's right, little piggies. Adam, thank you for joining me.Adam: Thank you, Corey. It's a pleasure to be here.Corey: I imagine that you usually don't advise your clients to call their GTM execs, little piggies?Adam: Well, I mean, I guess it depends. You know, if you're actually a bacon manufacturer then that might be actually a reasonable thing to do.Corey: Yeah, that's a level of investment in the product that you usually don't see in most environments, but we take what we can get. So, snark and cynicism aside, what is it you do?Adam: Ultimately, I look for ways in which I can add value. And I've had the privilege in my career to be exposed to a lot of amazing companies, and I look for ways to be able to take the lessons that I've learned, mainly through mistakes and failure, and be able to translate those into success for others.Corey: Most recently, you were at LaunchDarkly for a while, taking a number of different VP roles. While you were there we spoke, back in 2017, briefly while you were in that environment. And in fact, my first guest on the show was one of the folks on your team, Heidi Waterhouse, who has been back at least once since then, and hopefully more than that. But it's been an interesting ride there. Before that you were at places like GitHub—or JIF-ub as I insist on pronouncing it—EMC-slash-VMware—where does one start and the other stop? Hard to say, it's sort of a giant corporate shell game—but you've spent a lot of time in large companies and small ones as well, and now you're effectively hanging out your shingle as a strategic advisor.Adam: This is true. 
I mean, I think that one of the things that I've found is that it doesn't really matter what size of company you're at; you're going to find new and interesting challenges, and you really don't have to look that hard. And so one of the things that I found consistently, and I would say that this was most pointedly phrased for me by Emily Freeman in the context of, “DevOps is this amazing thing of people, process, and technology. And the reality is, is the only one that's complicated is the people.” And oddly enough, small companies, you still got people; big companies, you still got people. So, therein lies some of the challenges.

Corey: And people are inherently non-deterministic; you never know what you're going to get by applying the same input, even to the same person just separated out by time. It's a challenge, and the problem that I see across the industry is that very often, you'll have a team of engineers and you'll pick the best and brightest one of those engineers, and, “Congratulations, you manage the team now.” Now, management's an inherently orthogonal skill, and what you've simultaneously done is gotten rid of a great engineer and introduced a terrible manager. And that's through no fault of this person's own. But when I started managing teams, I got surprisingly far by just doing the exact opposite of all the stuff that my previous terrible bosses had done. And that works really well right up until it doesn't in a variety of probably fairly easily predictable ways. And the challenge that I'm seeing is that there is no book on how to do these things. If you want to climb an engineering ladder, great; there's a bunch of very qualified people who will tell you how to go from wherever you are technically, to where you want to go, and what you have to demonstrate, and what you have to do. Leadership is squishy, in that sense. 
At least it always has been to me.Adam: The interesting part that I would challenge you a little bit on is that there are thousands of interesting books on leadership, even smaller subsection on management specifically. I think one of the challenges there is that they're not well circulated within tech as an industry. I think that there are a few that people come back to, like Andy Grove's book on his experience building Intel. There are a lot of books out there that have done a lot for talking about how to manage people and how to think about what are the specific tactical things that you do. It's having one-on-ones, it's having meetings with clear agendas, it's being able to look for ways to set expectations with your organization.I think one of the challenges that I see pretty consistently, is the fact that that effort to be able to go out and find that information or to learn those skills is something that is put on to, as you said, this individual who is coming to management through punishment. They've been extraordinarily successful and now you will punish them by putting them in a role where they can no longer do all the things that they enjoyed, that made them successful. And I think that you see time and time again, where organizations put people in these roles, but they don't do anything to either prepare them for it or do anything to continue that notion of professional development or training for those individuals once they're in those roles.Corey: There are a lot of books out there for any discipline under the sun; some are good, some are terrible, most are somewhere in the middle of the road law of averages winds up working out. I think a key difference, on some level, is I can take to Twitter, or a forum, or something like that, and complain about software; the computer isn't doing the thing I think the computer should be doing. And that's great. 
I can't very well go and complain about managerial issues while actively having a team and not find myself no longer having managerial issues, if you catch my meaning. It's hard to find communities around this stuff.Adam: I think that you're right. And I think that this is one of those things where not only that, but I think that we also in tech have predominantly taken a very hierarchical structure to the way that we think about management and leadership, to the sense where oftentimes, it is not only discouraged but downright forbidden for an individual contributor to challenge their manager if they want to continue to have gainful employment. And I think that this is a cultural thing that, you know, it's funny; I know that you recently did an episode with John Allspaw and were talking about incident remediation. And I think that one of the things that I've always tried to do as a manager, as a leader, is think about opportunities for being able to do that type of incident response, for people. If you have a person that leaves, whether that is forced attrition, whether that is voluntary attrition, whether that is something that you wanted to happen, something that you didn't want to happen, what are you doing from a perspective of kind of a post-incident assessment to learn from that? And I think that the next level that is, how do you do it so that you actually, in some way, incorporate that for the individual that's actually leaving. Because ideally, they're learning from that experience, as well.Corey: Back when I was a generally terrible employee, I decided at some point, I was tired of dealing with computer problems and wanted to deal with people problems instead. Now, let's be clear, I found a path to do that in a very different direction than I expected at the time, but at the time, it was, “Great. 
I'm going to go ahead and become a manager of a team.” And I talked to a number of folks about all right, what is the path to go from decent technical engineer—I was a senior SRE type at most of these places—into management. And not just talking to people at the companies I was at, but talking to people in the larger community, and every engineering manager who I respected and talked to about, it always seemed like they got this lucky break at just the right time and that made them a manager for the first time.And once you have a track record of having managed people, then you're in. You can go back and forth between IC and management roles. But, “Well, you've never managed people before, so we're not going to take a chance on you to manage people.” The way that I did it, honestly, was I—a few times—I wound up joining startups where I was effectively the only ops person; we suddenly started scaling and having fun problems, and well, I did negotiate for that director title, so all right, I have teams now. I was more of a team lead than most things, in some cases.But it led to a really pretty interesting evolution in how I approach these things. I find now that the right answer is for me not to manage people at all because what I fundamentally do here at The Duckbill Group is basically become the loud, obnoxious center of attention. And I think that what managers need to do is showcase their people instead. And those two things, at least in my view, are opposed. And it's very challenging to do both of them, let alone well. For me at least, I tend to back away from the management side of things almost entirely and abdicate the role. Which is great. People self-manage, right?Adam: Well, I mean, I think that there are individuals who definitely will take—have the ability to self-organize and self-manage to a degree. 
I think that the challenge that you run into is, as the organization scales, as the nature of their role tends to change with that scaling organization, it becomes more challenging for them to navigate through those changes. A great example would be, I have had the pleasure and the privilege a number of times in my career of managing extraordinarily senior individuals; these are individuals who, to your point, don't need a whole lot of care and feeding. But what they do sometimes need is they need someone who is able to be in rooms that they're not in, whether that's from a higher-level leadership meeting understanding larger organizational goals, or they need someone that's going to check them; they need someone that they can trust, someone that they can bounce their ideas off of to know is this something that's going to be perceived value or something that's going to actually take me in the wrong direction, or somebody that's, kind of like, paying attention to the work product that they're doing and giving them some coaching, whether that's cheerleading or whether that's connecting of saying, “Hey, there's also this other person you should talk to.” Those types of things are really valuable for those individuals who are, to your point, a little bit more self-sufficient.Corey: On some level, I ran into this trap a lot, and having over drinks conversations with a bunch of people who went on similar paths, it's blindingly obvious that it's a dumb move in hindsight, but an awful lot of us did it, where we're sitting there as engineers with the belief of, “Ah, if I can make my manager—or beyond, several skip-levels up—look incredibly foolish in the middle of a large meeting, they will inherently see the value of what I have to say and will thus elevate me to management.” As it turns out, they elevate you to customer because you're not working there anymore, in many cases. 
And when I talk to people about this, it usually has that lightbulb-coming-on moment of as soon as you hear it, of course, it is blindingly obvious that you aren't going to sarcastically obnoxious your way into being management. Instead, the path there—in hindsight, also blindingly obvious—is act as if: act managerial; help to effectively carry on your manager's message to the rest of the team, and when you have reservations or whatnot, talk to them in private rather than calling them out. And it's the obvious stuff of who gets promoted to management? Well, the people that look managerial. And that is what that looks like, in many respects.

Adam: And this is one of the reasons why, when I talk about management I like to separate the notion of management from leadership. Because I think that anyone can be a leader. You don't actually have to be the administrative manager of an individual to be a leader to them.

Corey: I saw a great poster once when I was younger. “Leaders are like eagles. We don't have either of them here.”

Adam: [sigh]. Yeah, yeah. Ugh. I do miss good motivational posters.

Corey: Oh, yeah.

Adam: You know, I think that there's some truth to it. I think that finding people who are genuinely invested in being able to enable the success of others—which is how I define leadership—is challenging. I think that, especially in a rather capitalistic-type industry like we're in, there is a lot of measurement of people's success by their own personal achievements and by their ability to beat their own drum. And I think that it's something that is, frankly, a failing of our industry, where we don't do a better job of encouraging folks, and rewarding folks that actually look out for others and enable the success of others. Because I think that's something that is—ultimately you think about how you build strong teams, and it's not about getting a bunch of individuals who can do amazing things individually. 
It's about getting individuals who are capable of working together and being able to do more than they would be able to if they were simply working individually.Corey: Do you ever find that people are chasing management in many respects because they think that it's something very different than what it is, and then find themselves in situations where well, I'm the dog that caught the car that I was chasing and only now do I realize that I have no idea how to drive the thing?Adam: Oh, absolutely. So, this is something that has been interesting me a lot recently, in the sense that I think we as an industry also do a very poor job of measuring management, measuring leadership. We give a lot of power to managers through performance reviews to measure their individual contributors, but there are very few companies who actually efficiently do things like 360 reviews, which has always confused me because I think that implies that you're getting feedback from all around you, as opposed to what you really want is you want feedback pointed back at you, which would be 180. But maybe that's just—Corey: Let's be clear, that was also pioneered by the German [Wehrmacht 00:13:48] in World War II, which is yeah, basically how some people I've worked with do tend to manage.Adam: Yeah. I think that if we can think about how do we measure the success of a manager, is it simply a function of the output of their team, or are there other efficiency metrics that you should be looking at? Very obvious one is how efficient is a manager from a perspective of the utilization of their resources? And when I think about that, I think about are they actually able to effectively hire? Are they able to effectively retain the people that they hire?What does it look like for the people on their organization from a promotion perspective in terms of skill growth? Do they become more valuable over time? Those are ways in which we can think about how we measure the manager, potentially, directly. 
And then there's indirect things like what's the qualitative aspect of those individuals that work for them? Are they people who are enjoying the work that they're doing? Are they motivated to continue to work towards the company's vision and mission, to be able to actually make their manager look good, but also make the company successful?

Corey: A challenge, too, because I've seen this myself is, all right, you're now elevated to manager. Congratulations. It's not really a promotion. It's a lateral move. However, a lot of companies don't treat it that way. They don't compensate it that way, et cetera. And oh, okay, management, it turns out is not for me. There's no real good way to say, “I'm going back to being an IC,” especially at the same company, without it being perceived by many—rightly or wrongly—as a demotion or a failure.

Adam: This question of, like, motivation to people, why do they want to go into management? I think that oftentimes this is misplaced. A lot of times the number one motivation that I've heard has nothing to do with wanting to actually help people or solve people problems, as you said earlier; it has to do with I want a bigger paycheck, I want more seniority, I want more responsibility, and therefore the only path available to me is management. In fact, many career ladders at organizations require an individual contributor to go to a management position before they can become a principal or a staff-level engineer, which is nonsense. First of all, why would you torture the individual to do something that is so completely and utterly outside of where their interests are? Secondly, why would you just decimate your lower-level individual contributors, your newer individual contributors by having someone who is completely non-inclined towards management be responsible for them? Oh.

Corey: Oh, yeah. Used to be your peer; now they manage you, and great. I think people underestimate exactly how broad the blast radius of a manager is.

Adam: Yeah. 
Talk to anyone, and they'll be more than happy to tell you the worst manager that they've ever had. At the same time, they'll also probably be able to tell you the best manager they've ever had.

Corey: Oh, yeah. I called both of those out—only one of those by name, by the way—in conference talks that I've had because it's—yeah, you can probably guess which one I would call out and which one I would not name publicly—yeah—

Adam: It depends on the conference, I guess. But yeah.

Corey: Oh, yeah, absolutely. If it was you-know-what-your-problem-is con, yeah, it went super well.

Adam: [laugh].

Corey: It was fun. And management, especially in the current era is getting interesting, as we're seeing the heating up of the market in a bunch of different ways. And I understand, to be clear, that Twitter is not a perfect microcosm of the industry, but there's a recurring theme that I'm seeing among a number of engineering types that seemed to get—and again, I don't want to get letters for this, so if I misstate it, audience, please go ahead and be kind—but there seems to be a certain thread running through engineering communities that the purpose of a company is to provide a utopian work environment for its staff. Now, as someone who runs a company myself, yeah, I absolutely want to provide the kind of working environment I wish I'd had in a bunch of different environments. And that's not going to work for everyone, but that's okay. But fundamentally we're here to make money, and ideally, enough monies that we can keep the lights on. And that does mean that, however we want to treat our staff, that has to be subordinate to: can we continue as a going concern? So yeah, it turns out, we can't—sustainably—outbid Netflix on every hire that we make and we aren't able to wind up having three catered meals a day as a full remote company delivered to everyone's house. Now, I'd like to, in a world where money flows like water, but it doesn't. 
For better or worse, there are constraints, and constraints shape us.But there's a thread that I'm starting to see of… I hesitate to call it entitlement, but it trends slightly toward the direction of folks who are in tech, and in some ways seem very far removed from business realities—now, let's be clear in the FAANG world, yeah, it's pretty attenuated. And in startup land where well, we're the VC backed, so we're losing money by the billion but we're making it up in volume. Great. That is not necessarily what I'm talking about here. I'm seeing a thread where, oh, engineers are clearly the smartest people in any company, which means that every other department should defer to them. I disagree with that position.Adam: I want to follow that thread a little bit with regards to engineers. So, I've worked as a software developer—Corey: My condolences.Adam: Yeah. I've worked as a technical salesperson. I've had the opportunity to work in pretty much every department with the exceptions of HR and finance. So, that has been part of my career of jack of all trades, master of none, but it has given me some interesting insights in terms of the value that different organizations, different individuals, bring to a company. And I think that—one of the things that I will say is that for the longest time, in large organizations, especially non-tech industry organizations, the engineer or the developer was at the same expectations or the role as someone in the janitorial staff.It was basically, “You're part of the plumbing. You just do the things so that the tech just works, and we're going to have the other business folks that are more responsible for actually making decisions that are going to make our business money.” The quintessential example is someone like Kraft Foods or someone like John Deere, right, where you're building tractors; for the longest time, the guy who ran the website wasn't going to be the guy who was going to make or break John Deere's quarterly earnings. 
Now, you've got tractors that literally are more computers than they are mechanical devices and so you suddenly have this change in dynamic with regards to the importance of that developer. But I think that something that's interesting, also, is that those other people who worked at the company didn't go away. They're still there; they're still important. In fact, they're still oftentimes making the buying decisions on behalf of the developers. The developers aren't the ones that are making those choices. And so you need to figure out, how do you actually make the technology choices and the technology outcomes accessible to individuals that are in roles that, historically, had nothing to do with tech.

This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And - let me be clear here - it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free you can do things like run small scale applications, or do proof of concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free. No asterisk. Start now. Visit https://snark.cloud/oci-free that's https://snark.cloud/oci-free.

Corey: I've always been a big believer in the idea that if you're going to transition into a new field, be it into tech, out of tech, et cetera, great. In almost every case, you should find ways to do that laterally. 
I think that this idea that, oh, you're going to go ahead and just start over with an entry-level job after you've been in a field for five years—no. Find the position that's halfway between where you are and where you think you want to go next and start getting exposure there. In time, it's those niches that add value that distinguish you from other folks.

It turns out that they don't generally want to hire someone in almost any role that comes from Central Casting, where it's alright, give me a standard MBA with the following pedigree and drop them in as my new executive, whatever. No. They want to see things like industry experience; they want to see things that distinguish folks, and having experience in industries that are not traditionally, purely what this role is, is super helpful in a lot of different ways. What I do pretty clearly blends finance and tech; that goes reasonably well. Increasingly it starts to blend media, which is something I don't pretend to understand. But here we are, he said into the microphone.

Adam: Yeah. Well, as long as you're not starting the next Fox News, I'm fine with that.

Corey: No, no. Generally not.

Adam: Okay, fair enough. But I think that you're right. This is one of the things where, trailing back, as we have throughout this conversation, to the notion of leadership, this is something that I found extraordinarily rewarding and empowering that I've done with individuals that I've brought into new organizations, either through initial conversations during an interview process, or as part of their onboarding, is I sit down, and I actually talk to them about what are their plans? What are their expectations? What are their goals, not only for the next 30, 60, 90 days in this role that we're talking about but what are they thinking about from a perspective of what do they want to do in the next year? In the next three years? Five years? Ten years? What are those checkpoints of what do you want to do in this role?
What do you want to do at this company? What do you want to do with your career? Like, where do you see it headed?

And it doesn't mean that you're writing this in stone, or that I'm going to hold you to it, but I think that one of those things that's really empowering for a leader is to be able to help those individuals find those connective threads that tie one position to the next and help them get there. If they're somebody who is saying, “Hey, look, I'm currently a developer, but I really wish that I could give more talks.” Okay, well, that's great for me to know. Let's put you on some projects that maybe actually would result in great content for a talk that you could give at a conference. And then we'll figure out, how do we work with the marketing department to be able to help you bring that to fruition?

There's a lot of ways to be able to leverage this experience that you have as a leader, as a manager, to an individual who's coming up in their career and saying, “Hey, look. This is how some more ancillary things are connected.” And being able to bring those back to them.

Corey: I really wish, on some level, that there was a more defined path toward a lot of these things, where the stuff is explained to folks. So often, I had terrible managers that, in hindsight, weren't that terrible. Because I didn't understand where the role started and stopped, I tended to view the role of the manager as being there to protect the team. The end. And be our advocate in the organization, and get us the thing that we want, and what do we want? Comfy chairs.

And it turns out that isn't ever how it really works. If I had to define management, it would basically be balancing competing priorities more than it is almost anything else. And counterintuitively, the higher you rise in an organization, the more responsibility you have, and the less you can actually directly do. Everything you do drives influence. And that's it.
That's how it distills down.

Adam: You talk about the engineer that wants to move into a management role because that's how they see their career progressing. This is a close corollary to the engineer that wants to move into a product management role because they want to have greater oversight into the decisions that are being made about what's getting built. And what you come to realize, for any engineer who successfully made that transition, is it's really complicated and difficult to be able to have that mental switch take place between this is how I'm going to build it versus this is the priority of what needs to get built next. And all too often you see engineers that land in product management roles that are dictating how something should be built, and suddenly the engineers are just like, “No, I have no respect for you. Because that's not your job.”

And likewise, in a management role, oftentimes people view that as an opportunity for them to make all the choices, make all the decisions, and suddenly lose sight of the fact that they used to be on the other side of that outcome themselves, and were disappointed when they weren't included in some way, shape or form, or their priorities weren't taken into consideration.

Corey: As you look at your own career, what is the worst job experience you've ever had? Or the worst job you've ever had? Or the worst boss you've ever had? That's always a good one to do.

Adam: [laugh].

Corey: Pick a superlative and not the good kind. Hit me.

Adam: Yeah, no, I mean, look, I think that probably the worst… experience that I ever had with a manager, with a boss, was actually when I was first a software developer. And my manager would occasionally just come up behind me and just stand and watch me code. And we're not talking about pair programming, where it was just like, we're working together. No, it was, literally would come up, stand behind me on my shoulder, and just stand there. Not saying anything; just watching me write Java code.
And that was probably the most disconcerting experience that I've ever had in a job ever. I lasted about six months and then I was just like, “I need to move on to something else.”

Corey: It turns out one of my failure modes was that I was great for the first three months in new ops roles because things were invariably a fire, and—

Adam: [laugh].

Corey: —I know how to solve those things. And then it becomes a maintenance role, and I'm bad at that. For the longest time, I thought I was just a crap employee. And I am, but for different reasons. Instead, though, for me, it turned into a, I need to find the thing that I'm good at and embrace that. And I have to say, it was not being, basically, a cloud comedian on Twitter where my primary means of communication is shitposting. But you know, here we are, and this is how we've gotten there.

Adam: I mean, know your strengths, man. Know your strengths.

Corey: Yeah, lean into it. I mean, you went to college in Maine; you know what it's like there. It's dark and cold nine months out of the year, so all we do is sit inside and develop personality disorders. And well, here we are.

Adam: Well, hey, I mean, I took a break from tech after that first job in software development and I actually went back and worked for a guy that I met while I was in school, and I worked for him, he was a general contractor. So, I have an appreciation for Maine winters in a way that I never gained as a privileged college student, when I was actually digging snow out of ditches to be able to pour concrete at six in the morning and then later in the day, I got to go up and use 80-pound weight shingles to reshingle the roof in 20-degree weather. So, it was an eye-opening experience.
But I'll tell you, I learned pretty much everything that I know about how to build infrastructure from that eight months that I spent doing everything from framing, ditch-digging, to electrical, and plumbing, and roofing.

Corey: Kind of fun how often it is that we wind up trying other things. And this is part of it, too. As much fun as it is to complain about various jobs and whatnot that we have, let's be very clear here for a minute that I'm not dealing with hot tar, being paid seven bucks an hour. There are advantages to the [unintelligible 00:28:08] jobs I have.

Adam: I mean, that was a number of years ago, but I still got ten bucks an hour.

Corey: My first job at the University of Maine call center working in tech, in those days, I think I was being paid something like $5.35 an hour. To answer phones, which again, not that hard of a job. I made a lot more money a couple years later when I moved to construction. Yeah, I wouldn't recommend any of those things for me these days, but it was instructive.

Adam: But at the same time, I would argue that you also have benefited from those experiences in the way that you approach the things that you do now. And I think that's one of the things that I've tried to bring forward in my career is look for those opportunities to make those connections, and understand the value of those experiences, and be able to help to enable other people because I've had those experiences.

Corey: To me at least, the answer is to turn whatever you've done or whatever happened to you into some form of empathy. The idea of well, I had to struggle coming up, so you should, too. Let's instead focus on making it better for people who follow us. Send the elevator back down, as it were.

Adam: I mean, I think that's great advice, and I think that it's something that's done far too infrequently.
One of the things that I've noticed is that that aspect, unless somebody has actually been through the experience where somebody has done that for them, it is oftentimes something that is a lot harder for people to see. This goes to your earlier statement around the expectations that maybe are changing, and in not such great ways, with regards to what people are expecting from companies, what people are expecting from managers. I think that there is a distinct lack of expectation setting that takes place at companies in terms of what is the role of the company, what is the role of an employee, and how can those two come together to still have a positive interaction, but aren't overstepping on either side? Because that's really where you get into problems. That's where all of a sudden you have these companies that are looking to fill the role of, I will take care of all aspects of your life, when in reality that's not a very healthy relationship for an individual to have with a company.

Corey: So, I want to thank you for coming to speak with me. What are you up to these days, and where can people find you? And why should people find you?

Adam: Well, I don't know that anybody should find me.

Corey: “I hope this email finds you never. I hope you're free.”

Adam: Yeah, exactly. No, I mean, I would love to find folks that I can add value to and help out. It's easy enough to find me on Twitter. It's just @-A-Z-I-M-M-A-N—azimman. And they're welcome to reach out to me there. My DMs are open—much to my displeasure sometimes—but happy to help people who are looking for help.
I'm particularly interested in spending my time with those individuals who maybe are coming from underrepresented backgrounds in tech and looking for ways to be able to either get into tech or to move up within leadership roles in tech.

But I'm spending a lot of my time doing a lot of coaching, doing a lot of advising for small startups, and then also just as a small side project have been working pretty extensively with James Governor and a woman by the name of Kim Harrison on this little thing called Progressive Delivery, which is, as far as we're concerned, the next iteration of the software development lifecycle that we've written about and talked about pretty extensively. James and Kim and I are working on a book together to be able to capture all those ideas and bring them and coalesce them for people, to make them more consumable. But ultimately, we're trying to say, “Hey, look. The way that we've done things leading up till now, moving from waterfall to agile to continuous delivery into what's next?” And look at some of the market conditions that have changed. A lot of stuff that you talk about. I think that you would be the first to point out how things have changed since the launch of AWS.

Corey: Oh, yes. It's more confusing now.

Adam: Oh, way more confusing. And the ways in which people consume cloud-based services have radically changed. And so I think that the way that we are building software and the way that we're consuming software is something that we need to put some serious thought into. And the players that are—you know, as I spoke about earlier on this talk with you—are different. It's no longer just your developers that care about your AWS choices or care about the cloud service choices that you're making.

You've got other individuals, whether it's the finance side you focus on or thinking about it from the perspective of the marketing team, or the HR team that's thinking about which cloud service HRIS are they going to use.
There's a lot of people that need to be party to those choices that you're making and how you build out your company stack, as it were. And the Progressive Delivery model looks to take into consideration that changing and evolving group of people.

Corey: And we will, of course, have links to that in the show notes. Thank you so much for taking the time to speak with me. I appreciate it.

Adam: Corey, thank you so much for having me. It was a pleasure.

Corey: Adam Zimman, startup advisor, and oh, so much more. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with a scathing comment telling me why you as an engineer are best suited to be the manager of everything.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

jesus christ university amazon netflix world leadership talk tech leaders world war ii mba cloud maine fox news congratulations oracle intel designed physics counting vc advisor generally screaming aws java github devops ic visual arts master of none john deere jellyfish vmware product marketing ide git gtm comfy sre emc faang molding kraft foods jif andy grove oracle cloud hris launchdarkly corey quinn always free microsoft powerpoint my dms emily freeman central casting kim harrison german wehrmacht duckbill group adam you adam it john allspaw adam yeah adam well chief cloud economist adam oh last week in aws humblepod

Incident investigation: What can we learn from the software world?

Embracing Differences

Play Episode Listen Later Aug 17, 2021 53:44

In this podcast, John Allspaw, the Founder of Adaptive Capacity Labs, and I talk about the world of software engineering and operations and its connections to safety science and human factors. Together we explore what opportunities and challenges exist in the domain, what incident analysis and genuine learning from incidents look like, and what makes this an exciting time for exploring this domain from a safety perspective.

founders software investigation incident john allspaw

Finding a Common Language for Incidents with John Allspaw