Podcast appearances and mentions of John Allspaw

  • 32 podcasts
  • 47 episodes
  • 40m average duration
  • 1 new episode per month
  • Latest: Nov 26, 2024


Latest podcast episodes about John Allspaw

Screaming in the Cloud
Replay - Finding a Common Language for Incidents with John Allspaw

Nov 26, 2024 (29:36)


On this Screaming in the Cloud Replay, Corey is joined by John Allspaw, Founder/Principal at Adaptive Capacity Labs. John was foundational in the DevOps movement, but he's continued to bring much more to the table. He's written multiple books and seems to always be at the forefront, which is why he is now at Adaptive Capacity Labs. John tells us what exactly Adaptive Capacity Labs does, how it works, and how he convinced some heroes to get behind it. John brings much-needed insight into how to get multiple people in an organization, engineers and non-engineers alike, on the same level when it comes to dealing with incidents. John points out the issues surrounding public vs. private write-ups and the roadblocks they may prop up. Adaptive Capacity Labs is working towards bringing those roadblocks down; tune in for how!

Show Highlights
(0:00) Introduction
(0:59) The Duckbill Group sponsor read
(1:33) What is Adaptive Capacity Labs and the work that they do?
(3:00) How to effectively learn from incidents
(7:33) What is the root of confusion in incident analysis
(13:20) Identifying if an organization has truly learned from their incidents
(18:23) Gitpod sponsor read
(19:35) Adaptive Capacity Labs' reputation for positively shifting company culture
(24:22) What the tech industry is missing when it comes to learning effectively from incidents
(28:44) Where you can find more from John and Adaptive Capacity Labs

About John Allspaw
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Links
The Art of Capacity Planning: https://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/1491939206/
Web Operations: https://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/
The DevOps Handbook: https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002/
Adaptive Capacity Labs: https://www.adaptivecapacitylabs.com
John Allspaw Twitter: https://twitter.com/allspaw
Richard Cook Twitter: https://twitter.com/ri_cook
Dave Woods Twitter: https://twitter.com/ddwoods2

Original Episode
https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/finding-a-common-language-for-incidents-with-john-allspaw/

Sponsors
The Duckbill Group: duckbillgroup.com
Gitpod: http://www.gitpod.io/

Coder Radio
585: From Ops to Dev and Back Again

Aug 28, 2024 (53:30)


We reflect on the rise of DevOps and the frustrating dynamics that led to it. Plus, tech's latest bright idea: Roombas with attitude.

The Engineering Leadership Podcast
Resilience engineering, learning from incidents and unintuitive perspectives on incident analysis w/ John Allspaw #116

Feb 7, 2023 (42:38)


We cover resilience engineering & learning from incidents with John Allspaw, former CTO @ Etsy and current Founder & Principal @ Adaptive Capacity Labs! Co-hosted by Kenji Kiuchi (Head of Quality and Performance @ Postman), this episode also addresses common unintuitive perspectives within resilience engineering, strategies for effective incident response / problem solving, how to identify current sources of resilience, and practical tips for implementing these resiliency tactics in your organization today.

ABOUT JOHN ALLSPAW
John Allspaw (@allspaw) has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

"The competitive advantage is not for a leader to say, ‘Why did it take so long to restore this issue or resolve this outage?’ A competitive advantage is, ‘Oh my God, that is amazing. Tell me what made this hard and what are any of the things that made it difficult to resolve? Is there anything I can do to help get out of the way for people to do the work?’" - John Allspaw

ABOUT KENJI KIUCHI
Kenji Kiuchi (@dr_kiuchi) is Head of Quality and Performance at Postman, an API platform whose mission is to maximize everyone's creativity through the power of connected software. There he leads a global team with a focus on maximizing user delight and innovating the practice of testing. Before coming to Postman, he spent several years “helping people get jobs” at Indeed. There, he worked on scaling teams and practice to optimize engineering delivery as well as leading Diversity, Inclusion and Belonging initiatives as an Associate Site Director. Prior to Indeed, Kenji spent several years as an Engineering Manager at Twitter where he led Quality efforts across monetization, growth, infra and the delivery of live video. When Kenji isn't driving engineering excellence, he's driving his motorcycle, spending quality time with his 3 daughters, and mentoring leaders across the globe.

Check out our friends and sponsor, Jellyfish
To learn more about Jellyfish and how they can help you increase engineering satisfaction and create happier, higher-performing engineering teams... Learn more at Jellyfish.co/elc

SHOW NOTES:
John's perspective on production (4:27)
What drove John toward resilience engineering (6:22)
How complex systems relate to resilience engineering (9:23)
Differences between robustness and resilience (13:13)
The role of productive adaptation in resilience engineering (17:26)
Identify sources of resilience already present in your organization (22:52)
Examples of unintuitive perspectives involving incident analysis (27:15)
How to make room for unintuitive perspectives (31:41)
Practical tips for implementing resiliency tactics & understanding incidents (36:12)
Rapid fire questions (39:51)

LINKS AND RESOURCES
Learning From Incidents Conference 2023 - This is a forum for sharing stories of incidents, incident handling, and the learnings from software engineers who handle large-scale distributed software systems.
Hindsight and Sacrifice Decisions - Adaptive Capacity Labs blog post reacting to the NYSE halting trading to resolve an issue.
Using Language by Herbert H. Clark - Herbert Clark argues that language use is more than the sum of a speaker speaking and a listener listening. It is the joint action that emerges when speakers and listeners, writers and readers perform their individual actions in coordination, as ensembles. In contrast to work within the cognitive sciences, which has seen language use as an individual process, and to work within the social sciences, which has seen it as a social process, the author argues strongly that language use embodies both individual and social processes.
Papers We Love Talk
Visual Momentum

Word Notes
Encore: Agile Software Development Method (noun)

Jan 17, 2023 (7:45)


A software development philosophy that emphasizes incremental delivery, team collaboration, continual planning, and continual learning.

CyberWire Glossary link: https://thecyberwire.com/glossary/agile-software-development

Audio reference link: "Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe," John Allspaw and Paul Hammond, 2009 Velocity Conference, YouTube, 25 June 2009.

Giant Robots Smashing Into Other Giant Robots
456: Jeli.io with Laura Maguire

Jan 5, 2023 (46:37)


Laura Maguire is a Researcher at Jeli.io, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Victoria talks to Laura about incident management, giving companies a powerful tool to learn from their incidents, and what types of customers are ideal for taking on a platform like Jeli.io.

Jeli.io (https://www.jeli.io/)
Follow Jeli.io on Instagram (https://www.instagram.com/jeli_io/), Twitter (https://twitter.com/jeli_io) or LinkedIn (https://www.linkedin.com/company/jeli-inc/).
Follow Laura Maguire on Twitter (https://twitter.com/LauraMDMaguire) or LinkedIn (https://www.linkedin.com/in/lauramaguire/).
Follow thoughtbot on Twitter (https://twitter.com/thoughtbot) or LinkedIn (https://www.linkedin.com/company/150727/).
Become a Sponsor (https://thoughtbot.com/sponsorship) of Giant Robots!

Transcript:

VICTORIA: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Victoria Guido. And with me today is Laura Maguire, Researcher at Jeli, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Laura, thank you for joining me.

LAURA: Thanks for having me, Victoria.

VICTORIA: This might be a very introductory level question but just right off the bat, what is an incident?

LAURA: What we find is a lot of companies define this very differently across the space, but typically, it's where they are seeing an impact, either a customer impact or a degradation of their service. This can be either formally, it kind of impacts their SLOs or their SLAs, or informally it's something that someone on the team notices or someone, you know, one of their users notices as being degraded performance or something not working as intended.

VICTORIA: Gotcha.
From my background being in IT operations, I'm familiar with incidents, and it's been a practice in IT for a long time. But what brought you to be a part of building this platform and creating a product around incidents?

LAURA: I am a, let's say, recovering safety professional.

VICTORIA: [chuckles]

LAURA: I started my career in the safety and risk management realm within natural resource industries in the physical world. And so I worked with people who were at the sharp end in high-risk, high-consequence type work. And they were really navigating risk and navigating safety in the real world. And as I was working in this domain, I noticed that there was a delta between what was being said, created safety, and helped risk management and what I was actually seeing with the people that I was working with on the front lines. And so I started to pull the thread on this, and I thought, is work as done really the same as work as written or work as prescribed? And what I found was a whole field of research, a whole field of practice around thinking about safety and risk management in the world of cognitive work. And so this is how people think about risk, how they manage risk, and how do they interpret change and events in the world around them. And so as I started to do my master's degree in human factors and system safety and then later my Ph.D. in cognitive systems engineering, I realized that whether you are on the frontlines of a wildland fire or you're on the frontlines of responding to an incident in the software realm, the ways in which people detect, diagnose, and repair the issues that they're facing are quite similar in terms of the cognitive work. And so when I was starting my Ph.D. work, I was working with Dr. David Woods at the Cognitive Systems Engineering Lab at The Ohio State University. And I came into it, and I was thinking I'm going to work with astronauts, or with fighter pilots, or emergency room doctors, these really exciting domains.
And he was like, "We're going to have you work with software engineers." And at first, I really failed to see the connection there, but as I started to learn more about site reliability engineering, about DevOps, about the continuous deployment, continuous integration world, I realized software engineers are really at the forefront of managing critical digital infrastructure. They're keeping up the systems that run society, both for recreation and pleasure in the sense of Netflix, for example, as well as the critical functions within society like our 911 call routing systems, our financial markets. And so the ability to study how software engineers detect outages, manage outages, and work together collaboratively across the team was really giving us a way to study this kind of work that could actually feed back into other types of domains like emergency response, like emergency rooms, and even back to the fighter pilots and astronauts.

VICTORIA: Wow, that's so interesting. And so is your research that went into your Ph.D. did that help you help define the product strategy and kind of market fit for what you've been building at Jeli?

LAURA: Yeah, absolutely. So Nora Jones, who is the founder and CEO of Jeli, reached out to me at a conference and told me a little bit about what she was thinking about, about how she wanted to support software engineers using a lot of this literature and a lot of the learnings from these other domains to build this product to help support incident management in software engineering.
So we base a lot of our thinking around how to help support this cognitive work and how to help resilient performance in these very dynamic, these very changing large scale, you know, distributed software systems on this research, as well as the research that we do with our own users and with our own members from the Learning from Incidents in Software Engineering Slack community that Nora and several other fairly prominent names within the software community started, Lorin Hochstein, John Allspaw, Dr. Richard Cook, Jessica DeVita, Ryan Kitchens, and I may be missing someone else but...and myself, oh, Will Gallego as well. Yeah, we based a lot of our understandings, really deep qualitative understandings of what is work like for software engineers when they're, you know, in continuous deployment type environments. And we've translated this into building a product that we think helps but not hinders by getting in the way of engineers while they're under time pressure and there's a lot of uncertainty. And there's often quite a bit of stress involved with responding to incidents.

VICTORIA: Right. And you mentioned resilience engineering. And for those who don't know, David Woods, who you worked with on your Ph.D., wrote "Resilience Engineering: Concepts and Precepts." So maybe you could talk a little bit about resilience engineering and what that really means, not just in technology but in the people who were running the tools, right?

LAURA: Yeah. So resilience engineering is different from how we think about protecting and defending our software systems. And it's different in the sense that we aren't just thinking about how do we prevent incidents from happening again, like, how do we fix things that have happened to us in the past? But how do we better understand the ways in which our systems operate under a wide variety of conditions? So that includes normal operating conditions as well as abnormal or anomalous operating conditions, such as an incident response.
And so resilience engineering was kind of this way of thinking differently about predicting failure, about managing failure, and navigating these kinds of worlds. And one of the fundamental differences about it is it sees people as being the most adaptive component within the system of work. So we can have really good processes and practices around deploying code; we can institute things like cross-checking and peer review of code; we can have really good robust backup and failover systems, but ultimately, it's very likely that in these kinds of complex and adaptive always-changing systems that you're going to encounter problems that you weren't able to anticipate. And so this is where the resilience part comes in because if you're faced with a novel problem, if you're faced with an issue you've never seen before, or a hidden dependency within your system, or an unanticipated failure mode, you have to adapt. You have to be able to take all of the information that's available to you in the moment. You have to interpret that in real-time. You have to think of who else might have skills, knowledge, expertise, access to information, or access to certain kinds of systems or software components. And you have to bring all of those people together in real-time to be able to manage the problem at hand. And so this is really quite a different way of thinking about supporting this work than just let's keep the runbooks updated, and let's make sure that we can write prescriptive processes for everything that we're going to encounter. Because this really is the difference that I saw when I was talking earlier about work as done versus work as prescribed. The rules don't cover all of the situations. And so you have to think of how do you help people adapt? How do you help people access information in real-time to be able to handle unforeseen failures?

VICTORIA: Right. That makes a lot of sense.
It's an interesting evolution of site reliability engineering where you're thinking about the users' experience of your site. It's also thinking about the people who are running your site and what their experience is, and what freedom they have to be able to solve the problems that you wouldn't be able to predict, right?

LAURA: Yeah, it's a really good point, actually, because there is sort of this double layer in the product that we are building. So, as you mentioned earlier, we are an incident analysis platform, and so what does that mean? Well, it means that we pull in data whenever there's been an incident, and we help you to look at it a little bit more deeply than you may if you're just following a template and sort of reconstructing a timeline. And so we pull in the actual Slack data that, you know, say, an ops channel or an incident channel that's been spun up following a report of a degraded performance or of an outage. And we look very closely at how did people talk to one another? Who did they bring into the incident? What kinds of things did they think were relevant and important at different points in time? And in doing this, it helps us to understand what information was available to people at different points in time. Because after the incident and after it's been resolved, people often look back and say, "Oh, there's nothing we can learn from that. We figured out what it was." But if we go back and we start looking at how people detected it, how they diagnosed it, who they brought into the event, we can start to unpack these patterns and these ways of understanding how do people work together? What information is useful at different points in time? Which helps us get a deeper understanding of how our systems actually work and how they actually fail.

VICTORIA: Right. And I see there are a few different ways the platform does that: there's a narrative builder, a people view, and also a visual timeline.
So, do you find that combining all those things together really gives companies a powerful tool to learn from their incidents?

LAURA: Yeah. So let me talk a little bit about each of those different components. Our MVP of the product we started out with this understanding of the incident analyst and the incident investigator who, you know, was ready to dive in and ready to understand their incident and apply some qualitative analysis techniques to thinking about their incidents. And what we found was there are a number of these people who are really interested in this deep dive within the software industry. But there's a broader subset of folks that they work with who maybe only do these kinds of incident analysis every once in a while, and they're not as interested in going quite as deep. And so the narrative builder is really this kind of bridge between those two types of users. And what it does is helps construct a timeline which is typically what most companies do to help drive the discussion that they might have in a post-mortem or to drive their kind of findings in their summary report. And it helps them take this closer look at the interactions that happened in that Slack transcript and raise questions about what kinds of uncertainties there were, point out who was involved, or interesting aspects of the event at that point in time. And it helps them to summarize what was happening. What did people think was happening at this point in time to create this story about the incident? And the story element is really important because we all learn from stories. It helps bring to life some of the details about what was hard, who was involved, how did they get brought in, what the sources of technical failure were, and whether those were easy or difficult to understand and to repair once the source of the failure was actually understood. And so that narrative builder helps reconstruct this timeline in a much richer way but also do it very efficiently.
And as you mentioned, the visual timeline is something that we've created to help that lightweight user or that every once in a while user to go a little bit deeper on their analysis. And how we do that is because it lays out the progression of the event in a way that helps you see, oh, this maybe wasn't straightforward. We didn't detect it in the beginning, and then diagnose it, and then repair it at the end. What happened actually was the detection was intermittent. The signals about what was going wrong were intermittent, and so that was going on in parallel with the diagnosis. The diagnosis took a really long time, and that may have been because we can also see the repair was happening concurrently. And so it starts to show these kinds of characteristics about whether the incident was difficult, whether it was challenging and hard, or whether it was simple and straightforward. This helps lend a bit more depth to metrics like MTTR and TTD by saying, oh, there was a lot more going on in this incident than we initially thought. The last thing that you mentioned was the people view, and so that really sets our product apart from other products in that we look at the sociotechnical system. So it's not just about the software that broke; it is about who was involved in managing that system, in repairing that system, and in communicating about that system outwardly. And so the people view kind of pulls in some HR data. It helps us to understand who was involved. How long have they been in their role? Were they on-call? Were they not on-call? And other kinds of relevant details that show us what was their engagement or their interaction with this event. And so when we start to bring in the socio part of the sociotechnical system, we can identify things like what knowledge do we have within the organization? Is that knowledge well-distributed, or is it just isolated in one or two people?
And so those people are constantly getting pulled into incidents when they may be not on-call, which can start to show us whether or not these folks are in danger of burning out or whether their knowledge might need to be transferred more broadly throughout the organization. So this is kind of where the resilience piece comes in because it helps us to distribute knowledge. It helps us to identify who is relevant and useful and how do they partner and collaborate with other people, and their knowledge and skill sets to be able to manage some of the outages that they face?

VICTORIA: That's wonderful because one of my follow-up questions would be, as a CEO, as a founder, what kind of insights or choices do you get to make now that you have this insight to help make your team more resilient? [laughs]

LAURA: So if this is a manager, or a founder, or a CEO that is looking at their data in Jeli, they can start to understand how to resource their teams more appropriately, as I mentioned, how to spread that knowledge around. They can start to see what parts of their system are creating the most problems or what parts of their system do they have maybe less insight into how it works, how it interacts with other parts of the system, and what this actually means for their ability to meet their SLOs or their SLAs. So it gives you a more in-depth understanding of how your business is actually operating on both the technical side of things, as well as on the people side of things.

VICTORIA: That makes a lot of sense. Thank you for that overview of the platform. There's the incident analysis platform, and you also have the bot, the response chatbot. Can you tell me a little bit more about that?

LAURA: Yeah, absolutely. We think that incident management should be conducted wherever your work actually takes place, and so for most of our customers and a lot of folks that we know about in the industry, that's Slack.
And so, if you are communicating in real-time with your team in Slack, we think that you should stay there. And so, we built this incident management bot that is free and will be free for the lifetime of the product. Because we think that this is really the fundamental basis for helping you manage your incidents more efficiently and more effectively. So it's a pretty lightweight bot. It gives kind of some guardrails or some guidance around collaboration by spinning up a new incident channel, helping you to bring the right kinds of responders into that, helping you to communicate to interested stakeholders by broadcasting to channels they might be in. It kind of nudges you to think about how to communicate about what's happening during different stages of the event progression. And so it's prompting you in a very lightweight way; hey, do you have a status update? Do you have a summary of what the current thinking is? What are the hypotheses about what's going on? Who's conducting what kinds of activities right now? So that if I'm a responder that's coming into the event after 20-30 minutes after it started, I can very quickly come up to speed, understand what's going on, who's doing what, and figure out what's useful for me to do to help step in and not disrupt the incident management that's underway right now. Our users can choose to use the bot independently of the incident analysis platform. But of course, being able to ingest that incident into Jeli it helps you understand who's been involved in the incident, if they've been involved in similar incidents in the past, and helps them start to see some patterns and some themes that emerge over time when you start to look at incidents across the organization.

VICTORIA: That makes sense. And I love that it's free and that there's something for every type of organization to take advantage of there.
And I wonder if at Jeli you have data about what type of customer is it who'd be targeted or really ideal to take on this kind of platform.

LAURA: So most organizations...I was actually recently at SREcon EMEA, and there was a really interesting series of talks; one was SRE for Enterprise, and the next talk was SRE for Startups. And so it was a very thought-provoking discussion around is SRE for everyone, so site reliability engineering? Even smaller teams are starting to have to be responsible for reliability and responsible for running their service. And so we kind of have built our platform thinking about how do we help not just big enterprises or organizations that may have dedicated teams for this but also small startups to learn from their incidents. So internally, we actually call incidents opportunities as in they are learning opportunities for checking out how does your system actually work? How do your people work together? What things were difficult and challenging about the incident? And how do you talk about those things as a team to help create more resilient performance in future? So in terms of an ideal customer, it's really folks that are interested in conducting these sort of lightweight but in-depth looks at how their system actually works on both the people side of things and the technical side of things. Those who we found are most successful with our product are interested in not so much figuring out who did the thing and who can they blame for the incident itself but rather how do they learn from what happened? And would another engineer, or another product owner, another customer service representative, whoever the incident may be sort of focused around, would another person in their shoes have taken the same actions that they took or made the same decisions that they made?
Which helps us understand from a systems level how do we repair or how do we adjust the system of work surrounding folks so that they are better supported when they're faced with uncertainty, or with that kind of time pressure, or that ambiguity about what's actually going on?

VICTORIA: And I love that you said that because part of the reason [laughs] I invited you on to the podcast is that a lot of companies I have experience with don't think about incidents until it happens to them, and then it can be a scramble. It can impact their customer base. It can stress their team out. But if you go about creating...the term obviously you all use is psychological safety on your team, and maybe you use some of the free tools from Jeli like the Post-Incident Guide and the Incident Analysis 101 blog to set your team up for success from the beginning, then you can increase your customer loyalty and your team loyalty as well to the company. Is that your experience?

LAURA: Yeah, absolutely. So one thing that I have learned throughout my career, you know, starting way back in forestry and looking at safety and risk in that domain, was as soon as there is an accident or even a serious near miss, right away, everybody gets sweaty palms. Everybody is concerned about, uh-oh, am I going to get blamed for this? Am I going to get fired? Am I going to get publicly shamed for the decisions that I made when I was in this situation? And what that response, that reaction does is it drives a lot of the communication and a lot of the understanding of the conditions that that person was in. It drives that underground. And it's important to allow people to talk about here's what I was seeing, here's what I was experiencing because, in these kinds of complex systems, information is not readily available to people. The signals are not always coming through loud and clear about what's going on or about what the appropriate actions to take are. Instead, it's messy; it's loud, it's noisy.
There are usually multiple different demands on that person's attention and on their time, and they're often managing trade-offs: do I keep the system down so that I can gather more information about what's actually going on, or do I just try and bring it up as quickly as I can so that there's less impact to users? Those kinds of decisions are having to be made under pressure. So when we create these conditions of psychological safety, when we say you know what? This happened. We want to learn from it. We've already made this investment. Richard Cook mentioned in the very first SNAFU Catchers Report, which was a report that came out of Ohio State, that incidents are unplanned investments into understanding how your system works. And so you've already had the incident. You've already paid the price of that downtime or of that outage. So you might as well extract some learning from it so that you can help create a safer and more resilient system in the future. So by helping people to reconstruct what was actually happening in real-time, not what they were retrospectively saying, "Oh, I should have done this," well, you didn't do that. So let's understand why you thought at that moment in time that was the right way to respond because, more than likely, other people in that same position would have made that same choice. And so it helps us to think more broadly about ways that we can support decision-making and sense-making under conditions of stress and uncertainty. And ultimately, that helps your system be more resilient and be more reliable for your customers.

VICTORIA: What a great reframing: unplanned investment. [laughs] And if you don't learn from it, then you're going to lose out on what you've already invested that time in resolving it, right?

LAURA: Absolutely.

MID-ROLL AD: Are you an entrepreneur or start-up founder looking to gain confidence in the way forward for your idea?
At thoughtbot, we know you're tight on time and investment, which is why we've created targeted 1-hour remote workshops to help you develop a concrete plan for your product's next steps. Over four interactive sessions, we work with you on research, product design sprint, critical path, and presentation prep so that you and your team are better equipped with the skills and knowledge for success. Find out how we can help you move the needle at: tbot.io/entrepreneurs. VICTORIA: Getting more into that psychological safety and how to create that culture where people feel safe telling about what really happened, but how does that relate to...Jeli says that they are a people software. [laughs] Talk to me more about that. Like, what advice do you give founders and CEOs on how to create that psychological safety which makes them be more resilient in these types of incidents? LAURA: So you mentioned the Howie Guide that we published last year, and this is our guidance around how to do incident analysis, how to help your team start to learn from their incidents, and Howie stands for how we got here. And that's really important, that language because what it says is there's a history that led up to this incident. And most teams, when they've had an outage, they'll kind of look backwards from that outage, maybe an hour, maybe a day, maybe to the last deploy. But they don't think about how the decisions got made to use that piece of software in the first place. They don't think about how did engineers actually get on-boarded to being on-call. They don't necessarily think about what kinds of skills, and knowledge, and expertise when we're hiring a DevOps engineer, and I'm using air quotes here or an SRE. What kinds of skills and knowledge do they actually have? Those are very broad terms. And what it means to be a DevOps engineer or an SRE is quite underspecified. And so the knowledge behind the folks that you might hire into the company is going to necessarily be very diverse. 
It's going to be partial and incomplete in many ways because not everyone can know everything about the system. And so, we need to have multiple diverse perspectives about how the system works, how our customers use that system, what kinds of pressures and constraints exist within our company that allow us some possibilities over others. We need to bring all of those perspectives together to get a more reflective picture of what was actually happening before this incident took place and how we actually got here. This reframing helps a lot of people disarm that initial defensiveness response or that initial, oh, shoot; I'm going to get in trouble for this kind of response. And it says to them, "Hey, you're a part of this bigger system of work. You are only one piece of this puzzle. And what we want to try and do is understand what was happening within the company, not just what you did, what you said, and what you decided." So once people realize that you're not just trying to find fault or place blame, but you're really trying to understand their work, and you're trying to understand their work with other teams and other vendors, and trying to understand their work relative to the competing demands that were going on, so those are some of the things that help create psychological safety. About ten years ago, John Allspaw and the team at Etsy put out The Etsy Debriefing Facilitation Guide, which also poses a number of questions and helps to frame the post-incident learnings in a way that moves it from the individual and looks more collectively at the company as a whole. 
And so these things are helpful for founders or for CEOs to help bring forward more information about what's really going on, more information about what are the real risks and threats and opportunities within the company, and gives you an opportunity to step back and do what we call microlearning, which is sharing knowledge about how the system works, sharing understandings of what people think is going on, and what people know about the system. We don't typically talk about those things unless there's a reason to, and incidents kind of give us that reason because they're uncomfortable and they can be painful. They can be very public. They can be very disruptive to what we think about how resilient and reliable we actually are. And so if you can kind of step away from this defensiveness and step away from this need to place blame and instead try and understand the conditions, you will get a lot more learning and a lot more resilience and reliability out of your teams and out of your systems. VICTORIA: That makes sense to me. And I'd like to draw a connection between that and some other things you mentioned with The 2022 Accelerate State of DevOps Report that highlights that the people who are often responding to those incidents or in that high-stress situation tend to be historically underrepresented or historically excluded groups. And so do you see that having this insight into both who is actually taking on a lot of the work when these incidents happen and creating that psychological safety can make a better environment for diversity, equity, inclusion at a company as well? 
LAURA: Well, I think anytime you work to establish trust and transparency, and you focus on recognizing the skills that people do have, the knowledge that they do have, and not over assuming that someone knows something or that they have been involved in the discussions that may have been relevant to an incident, anytime you focus on that trust and transparency you are really signaling to people within your organization that you value their contributions and that you recognize that they've come to work and trying to do a good job. But they have multiple competing demands on their attention and on their time. And so we're not making assumptions about people being complacent, or people being reckless or being sloppy in their work. So that creates an environment where people feel more willing to speak up and to talk about some of the challenges that they might face, to talk about the ways in which it's not clear to them how certain parts of the system work or how certain teams actually operate. So you're just opening the channels for communication, which helps to share more knowledge. It helps to share more information about what teams are doing at different points in time. And this helps people to preemptively anticipate how a change that they might be making in their part of the system could be influencing up or downstream teams. And so this helps create more resilience because now you're thinking laterally about your system and about your involvement across teams and across boundary lines. And an example of this is if a marketing team...this is a story that Nora tells quite a bit; if a marketing team is, say, launching a Super Bowl commercial for their company but they don't actually tell the engineers on-call that that is about to happen, you can create all sorts of breakdowns when all of a sudden you have this surge of traffic to your website because people see the Super Bowl commercial and they want to go to the site. 
And then you have a single person who's trying to respond to that in real-time. So, instead, when you do start thinking about that trust and transparency, you're helping teams to help each other and to think more broadly about how their work is actually impacting other parts of the system. So from a diversity and inclusion and underrepresented groups perspective, this is creating the conditions for more people to be involved, more people to feel like their voice is going to be heard, and that their perspective actually matters. VICTORIA: That sounds really powerful, and I'm glad we were able to touch on that. Shifting gears a little bit, I wanted to talk about two different questions; so one is if you could travel back in time to when Jeli first started, what advice would you give yourself, your past self? LAURA: I would encourage myself to recognize that our ability to experiment is fundamental to our ability to learn. And learning is what helps us to iterate faster. Learning is what helps us to reflect on the tool that we're building or the feature that we're building and what this actually means to our users. I actually copped that advice to myself from CEO Zoran Perkov of the Long-Term Stock Exchange. They launched a whole new stock market during the pandemic with a fully remote team. And I had interviewed him for an article that I wrote about resilient leadership. And he said to me, like, "My job as a CEO is 100% about protecting our ability to experiment as a company because if we stop learning, we're not going to be able to iterate. We're not going to be able to adapt to the changes that we see in the market and in our users." So I think I would tell myself to continually experiment. 
I talk to our customers about this a lot, because many of them are implementing new incident management programs or they're trying to level up their engineering teams around incident analysis, and I would say, "This doesn't have to be a fully fleshed-out program where you know all of the ways in which this is going to unfold." It's really about trying experiments: conduct some training, start small. Do one incident analysis on a really particularly spicy incident that you may have had or a really challenging incident where a lot of people were surprised by what happened. Bring together that group and say, "Hey, we're going to try something a little bit different here. We'll use some questions from the Howie Guide. We'll use the format and the structure from the Etsy Debriefing Guide. And we're just going to try and learn what we can about this event. We're not going to try and place blame. We're not going to try and generate corrective actions. We just want to see what we can learn from this." Then ask people that were involved, "How did this go? What did we learn from it? What should we do differently next time?" And continually iterate on those small, little experiments so that you can grow your product and grow your team's capacity. I think it took us a little bit of time to figure that out within the organization, but once we did, we were just able to collaborate more effectively, work more effectively by integrating some of the feedback that we were getting from our users. And then the last piece of advice that I would give myself is to really invest in cross-discipline coordination and collaboration. Engineers, designers, researchers, CEOs, they all have a different view of the product. They all have a different understanding of what the goals and priorities are. And those mental models of the product and of what the right thing to do is are constantly changing. 
And they all have different language that they use to talk about the product and to talk about their processes for integrating this understanding of the changing conditions and the changing user into the product. And so I would say invest in establishing common ground across the different disciplines within your team to be able to talk about what people are seeing, to be able to stop and identify when we're making assumptions about what other people know or what other people's orientation towards the problem or towards the product is. And spend a little bit of time saying, "When I say this is important, I'm saying it's important because of XYZ, not just this is important." So spending a little bit of time elaborating on what your mental model is and where you're drawing from can help the teams work more effectively together across those disciplines. VICTORIA: That's pretty powerful advice. You're iterating and experimenting at Jeli. What's on the horizon that you are...what new experiments are you excited about? LAURA: One of the things that has been front and center for us since we started is this idea of cross-incident analysis. And so we've kind of built out a number of different features within the product, being able to help tag the incident with the relevant services and technologies that were involved, being able to identify which teams were involved, and also being able to identify different kinds of themes or patterns that emerge from individual incidents. So all of this data that we can get, mostly just from the ingested incident itself or from the incident that you bring into Jeli but also from the analysis that you do on it, helps us start to be able to see across incidents what's happening not just with the technical side of things. So is it always Travis that is causing a problem? Are there components that work together that kind of have these really hidden and strange interdependencies that are really hard for the team to actually cope with? 
What kinds of themes are emerging across your suite of opportunities, your suite of incidents that you've ingested? Some of the things that we're starting to see from those experiments is an ability to look at where are your knowledge islands within your organization? Do you have an engineer who, if they were to leave, would take the majority of your systems knowledge about your database, or about your users, or about some critical aspect of your system that would disappear with all of that tacit knowledge? Or are there engineers that work really effectively together during really difficult incidents? And so you can start to unpack what are these characteristics of these people, and of these teams, and of these technologies that offer both opportunities or threats to your organization? So basically, what we're doing is we're helping you to see how your system performs under different kinds of conditions, which I think as a safety and risk professional working in a variety of different domains for the last 15 years, I think this is really where the rubber hits the road in helping teams be more reliable, and be more resilient, and more proactive about where investments in maintenance, or training, or headcount are going to have the biggest bang for your buck. VICTORIA: That makes a lot of sense. In my experience, sometimes those decisions are made more on intuition or on limited data so having a more full picture to rely on probably produces better results. [laughs] LAURA: Yeah, and I think that we all want to be data-driven, thinking about not only the quantitative data is how many incidents do we have around certain parts of the system, or certain teams, or certain services? But also, the qualitative side of things is what does this actually mean? And what does this mean to our ability to grow and change over time and to scale? The partnership of that quantitative data and qualitative data means we're being data-driven on a whole other level. VICTORIA: Wonderful. 
And it seems like we're getting close to the end of our time here. Is there anything else you want to give as a final takeaway to our listeners? LAURA: Yeah. So I think that we are, you know, as a domain, as a field, software engineering is increasingly becoming responsible for not only critical infrastructure within society, but we have a responsibility to our users and to each other within our companies to help make work better, help make our services more reliable and more resilient over time. And there's a variety of lessons that we can learn from other domains. As I mentioned before, aviation, healthcare, nuclear power, all of those kinds of domains have been thinking about supporting cognitive work and supporting frontline operators. And we can learn from this history and this literature that exists out there. There is a GitHub repo that Lorin Hochstein has curated with a number of other folks within the industry that points to some of these resources. And as well, we'll be hosting the first Learning From Incidents in Software Engineering Conference in Denver in February, on February 15th and 16th. And one feature of this conference that I'm super excited about is affectionately called CasesConf. And it is going to be an opportunity for software engineers from a variety of organizations to tell real stories about incidents that they had, how they handled them, what was challenging, what went surprisingly well, and just what is actually going on within their organizations. And this is kind of a new thing for the software industry to be talking very publicly about failures and sharing the messy details of our incidents. This won't be a recorded part of the conference. It is going to be conducted under the Chatham House Rule, which is participants who are in the room while these stories are being told can share some of the stories but not any identifying details about the company or the engineers that were involved. 
And so these kinds of real-world situations help us to, as I talked about before with that psychological safety, say this is the reality of operating complex systems. They're going to fail. We're going to have to learn from them. And the more that we can talk at an industry level about what's going on and about what kinds of things are creating problems or opportunities for each other, the more we're going to be able to lift the bar for the industry as a whole. So you can check out register.learningfromincidents.io for more information about the conference. And we can link Lorin's resilience engineering GitHub repo in the notes as well. VICTORIA: Wonderful. Well, I was looking for an excuse to come to Denver in February anyways. LAURA: We would love to have ya. VICTORIA: Thank you. And thank you so much for taking time to share with us today, Laura. You can subscribe to the show and find notes along with a complete transcript for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @victori_ousg. This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time. ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success. Special Guest: Laura Maguire.

Screaming in the Cloud
Incidents, Solutions, and ChatOps Integration with Chris Evans

Jul 7, 2022 33:28


About Chris

Chris is the Co-founder and Chief Product Officer at incident.io, where they're building incident management products that people actually want to use. A software engineer by trade, Chris is no stranger to gnarly incidents, having participated in (and caused!) them at everything from early stage startups through to enormous IT organizations.

Links Referenced:
incident.io: https://incident.io
Practical Guide to Incident Management: https://incident.io/guide/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: DoorDash had a problem. As their cloud-native environment scaled and developers delivered new features, their monitoring system kept breaking down. In an organization where data is used to make better decisions about technology and about the business, losing observability means the entire company loses their competitive edge. With Chronosphere, DoorDash is no longer losing visibility into their applications suite. The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution that gives the observability lead at DoorDash business, confidence, and peace of mind. Read the full success story at snark.cloud/chronosphere. That's snark.cloud slash C-H-R-O-N-O-S-P-H-E-R-E.

Corey: Let's face it, on-call firefighting at 2am is stressful! So there's good news and there's bad news. The bad news is that you probably can't prevent incidents from happening, but the good news is that incident.io makes incidents less stressful and a lot more valuable. 
incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues and learn from incident insights to improve site reliability and fix your vulnerabilities. Try incident.io, recover faster and sleep more.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest is Chris Evans, who's the CPO and co-founder of incident.io. Chris, first, thank you very much for joining me. And I'm going to start with an easy question—well, easy question, hard answer, I think—what is incident.io exactly?

Chris: Incident.io is a software platform that helps entire organizations to respond to, recover from, and learn from incidents.

Corey: When you say incident, that means an awful lot of things. And depending on where you are in the ecosystem in the world, that means different things to different people. For example, oh, incident. Like, “Are you talking about the noodle incident because we had an agreement that we would never speak about that thing again,” style, versus folks who are steeped in DevOps or SRE culture, which is, of course, a fancy way to say those who are sad all the time, usually about computers. What is an incident in the context of what you folks do?

Chris: That, I think, is the killer question. I think if you look at organizations in the past, I think incidents were those things that happened once a quarter, maybe once a year, and they were the thing that brought the entirety of your site down because your big central database that was in a data center sort of disappeared. The way that modern companies run means that the definition has to be very, very different. So, most places now rely on distributed systems and there is no, sort of, binary sense of up or down these days. 
And essentially, in the general case, like, most companies are continually in a sort of state of things being broken all of the time.

And so, for us, when we look at what an incident is, it is essentially anything that takes you away from your planned work with a sense of urgency. And that's the sort of the pithy definition that we use there. Generally, that can mean anything—it means different things to different folks, and, like, when we talk to folks, we encourage them to think carefully about what that threshold is, but generally, for us at incident.io, that means basically a single error that is worthwhile investigating that you would stop doing your backlog work for is an incident. And also an entire app being down, that is an incident.

So, there's quite a wide range there. But essentially, by sort of having more incidents and lowering that threshold, you suddenly have a heap of benefits, which I can go very deep into and talk for hours about.

Corey: It's a deceptively complex question. When I talk to folks about backups, one of the biggest problems in the world of backup and building a DR plan, it's not building the DR plan—though that's no picnic either—it's okay. In the time of cloud, all your planning figures out, okay. Suddenly the site is down, how do we fix it? There are different levels of down and that means different things to different people where, especially the way we build apps today, it's not is the service or site up or down, but with distributed systems, it's how down is it?

And oh, we're seeing elevated error rates in us-tire-fire-1 region of AWS. At what point do we begin executing on our disaster plan? Because the worst answer, in some respects is, every time you think you see a problem, you start failing over to other regions and other providers and the rest, and three minutes in, you've irrevocably made the cutover and it's going to take 15 minutes to come back up. 
And oh, yeah, then your primary site comes back up because whoever unplugged something, plugged it back in and now you've made the wrong choice. Figuring out all the things around the incident, it's not what it once was.

When you were running your own blog on a single web server and it's broken, it's pretty easy to say, “Is it up or is it down?” As you scale out, it seems like that gets more and more diffuse. But it feels to me that it's also less of a question of how the technology has scaled, but also how the culture and the people have scaled. When you're the only engineer somewhere, you pretty much have no choice but to have the entire state of your stack shoved into your head. When that becomes 15 or 20 different teams of people, in some cases, it feels like it's almost less of a technology problem than it is a problem of how you communicate and how you get people involved. And the issues in front of the people who are empowered and insightful in a certain area that needs fixing.

Chris: A hundred percent. This is, like, a really, really key point, which is that organizations themselves are very complex. And so, you've got this combination of systems getting more and more complicated, more and more sort of things going wrong and perpetually breaking but you've got very, very complicated information structures and communication throughout the whole organization to keep things up and running. The very best orgs are the ones where they can engage the entire, sort of, every corner of the organization when things do go wrong. And I lived and breathed this firsthand at various different previous companies, but most recently at Monzo—which is a bank here in the UK—when an incident happened there, like, one of our two physical data center locations went down, the bank wasn't offline. Everything was resilient to that, but that required an immediate response.

And that meant that engineers were deployed to go and fix things. 
But it also meant the customer support folks might be required to get involved because we might be slightly slower processing payments. And it means that risk and compliance folks might need to get involved because they need to be reporting things to regulators. And the list goes on. There's, like, this need for a bunch of different people who almost certainly have never worked together or rarely worked together to come together, land in this sort of like empty space of this incident room or virtual incident room, and figure out how they're going to coordinate their response and get things back on track in the sort of most streamlined way and as quick as possible.

Corey: Yeah, when your bank is suddenly offline, that seems like a really inopportune time to be introduced to the database team. It's, “Oh, we have one of those. Wonderful. I feel like you folks are going to come in handy later today.” You want to have those pathways of communication open well in advance of these issues.

Chris: A hundred percent. And I think the thing that makes incidents unique is that fact. And I think the solution to that is this sort of consistent, level playing field that you can put everybody on. So, if everybody understands that the way that incidents are dealt with is consistent, we declare it like this, and under these conditions, these things happen. And, you know, if I flag this kind of level of impact, we have to pull in someone else to come and help make a decision.

At the core of it, there's this weird kind of duality to incidents where they are both kind of semi-formulaic in that you can basically encode a lot of the processes that happen, but equally, they are incredibly chaotic and require a lot of human impact to be resilient and figure these things out because stuff that you have never seen happen before is happening and failing in ways that you never predicted. 
And so, this is where incident.io plays into this: we try to take the first half of that off of your hands, which is, we will help you run your process so that all of the brain capacity you have, it goes on to the bit that humans are uniquely placed to be able to do, which is responding to these very, very chaotic, sort of, surprise events that have happened.

Corey: I feel as well—because I played around in this space a bit before I used to run ops teams—and, more or less I really should have had a t-shirt then that said, “I am the root cause,” because yeah, I basically did a lot of self-inflicted outages in various environments because it turns out, I'm not always the best with computers. Imagine that. There are a number of different companies that play in the space that look at some part of the incident lifecycle. And from the outside, first, they all look alike because it's, “Oh, so you're incident.io. I assume you're PagerDuty. You're the thing that calls me at two in the morning to make sure I wake up.”

Conversely, for folks who haven't worked deeply in that space, as well, of setting things on fire, what you do sounds like it's highly susceptible to the Hacker News problem. Where, “Wait, so what you do is effectively just getting people to coordinate and talk during an incident? Well, that doesn't sound hard. I could do that in a weekend.” And no, no, you can't.

If this were easy, you would not have been in business as long as you have, have the team the size that you do, the customers that you do. But it's one of those things that until you've been in a very specific set of a problem, it doesn't sound like it's a real problem that needs solving.

Chris: Yeah, I think that's true. And I think that the Hacker News point is a particularly pertinent one in that someone else, sort of, in an adjacent area launched on Hacker News recently, and the amount of feedback they got around, you know, “You're a Slack bot. How is this a company?” was kind of staggering. 
And I think generally where that comes from is—well, first of all that bias that engineers have, which is just everything you look at as an engineer is like, “Yeah, I can build that in a weekend.” I think there's often infinite complexity under the hood that just gets kind of brushed over. But yeah, I think at the core of it, you probably could build a Slack bot in a weekend that creates a channel for you in Slack and allows you to post somewhere that some—

Corey: Oh, good. More channels in Slack. Just what everyone wants.

Chris: Well, there you go. I mean, that's a particularly pertinent one because, like, our tool does do that. And one of the things—so I built at Monzo, a version of incident.io that we used at the company there, and that was something that I built evenings and weekends. And among the many, many things I never got around to building, archiving and cleaning up channels was one of the ones that was always on that list.

And so, Monzo did have this problem of littered channels everywhere. I think that, sort of, like, part of the problem here is, like, it is easy to look at a product like ours and sort of assume it is this sort of friendly Slack bot that helps you orchestrate some very basic commands. And I think when you actually dig into the problems that organizations above a certain size have, they're not solved by Slack bots. They're solved by platforms that help you to encode your processes that otherwise have to live on a Google Doc somewhere which is five pages long and when it's 2 a.m. and everything's on fire, I guarantee you not a single person reads that Google Doc, so your process is as good as not in place at all. That's the beauty of a tool like ours. We have a powerful engine that helps you basically to encode that and take some load off of you.

Corey: To be clear, I'm also not coming at this from a position of judging other people. 
I just look right now at the Slack workspace that we have at The Duckbill Group, and we have something like a ten-to-one channel-to-human ratio. And the proliferation of channels is a very real thing. And the problem that I've seen across the board with other things that try to address incident management has always been fanciful at best about what really happens when something breaks. Like, you talk about, oh, here's what happens. Step one: you will pull up the Google Doc, or you will pull up the wiki or the rest, or in some aspirational places, ah, something seems weird, I will go open a ticket in Jira.

Meanwhile, here in reality, anyone who's ever worked in these environments knows that step one, “Oh shit, oh shit, oh shit, oh shit, oh shit. What are we going to do?” And all the practices and procedures that often exist, especially in orgs that aren't very practiced at these sorts of things, tend to fly out the window and people are going to do what they're going to do. So, any tool or any platform that winds up addressing that has to accept the reality of meeting people where they are, not trying to educate people into different patterns of behavior as such. One of the things I like about your approach is, yeah, it's going to be a lot of conversation in Slack. That is a given. We can pretend otherwise, but here in reality, that is how work gets communicated, particularly in extremis. And I really appreciate the fact that you are not trying to, like, fight what feels almost like a law of nature at this point.

Chris: Yeah, I think there's a few things in that. The first point around the document approach or the clearly defined steps of how an incident works. In my experience, those things have always gone wrong because—

Corey: The data center is down, so we're going to the wiki to follow our incident management procedure, which is in the data center that just lost power.

Chris: Yeah.

Corey: There's a dependency problem there, too. [laugh].

Chris: Yeah, a hundred percent. [laugh]. 
A hundred percent. And I think part of the problem that I see there is that very, very often, you've got this situation where the people designing the process are not the people following the process. And so, there's this classic, I've heard it through John Allspaw, but it's a bunch of other folks who talk about the difference between people, you know, at the sharp end or the blunt end of the work.And I think the problem that people are facing the past is you have these people who sit in the, sort of, metaphorical upstairs of the office and think that they make a company safe by defining a process on paper. And they ship the piece of paper and go, “That is a good job for me done. I'm going to leave and know that I've made the bank—the other whatever your organization does—much, much safer.” And I think this is where things fall down because—Corey: I want to ambush some of those people in their performance reviews with, “Cool. Just for fun, all the documentation here, we're going to pull up the analytics to see how often that stuff gets viewed. Oh, nobody ever sees it. Hmm.”Chris: It's frustrating. It's frustrating because that never ever happens, clearly. But the point you made around, like, meeting people where you are, I think that is a huge one, which is incidents are founded on great communication. Like, as I said earlier, this is, like, a form of team with someone you've never ever worked with before and the last thing you want to do is be, like, “Hey, Corey, I've never met you before, but let's jump out onto this other platform somewhere that I've never been or haven't been for weeks and we'll try and figure stuff out over there.” It's like, no, you're going to be communicating—Corey: We use Slack internally, but we have a WhatsApp chat that we wind up using for incident stuff, so go ahead and log into WhatsApp, which you haven't done in 18 months, and join the chat. 
Yeah, in the dawn of time, in the mists of antiquity, you vaguely remember hearing something about that your first week and then never again. This stuff has to be practiced and it's important to get it right. How do you approach the inherent and often unfortunate reality that incident response and management inherently becomes very different depending upon the specifics of your company or your culture or something like that? In other words, how cookie-cutter is what you have built versus adaptable to different environments it finds itself operating in?

Chris: Man, the amount of time we spent as a founding team in the early days deliberating over how opinionated we should be versus how flexible we should be was staggering. The way we like to describe it is that we are quite opinionated about how we think incidents should be run; however, we let you imprint your own process into that. So, putting some color onto that: we expect incidents to have a lead. That is something you cannot get away from. However, you can call the lead whatever makes sense for you at your organization. So, some folks call them an incident commander or a manager or whatever else.

Corey: There's overwhelming militarization of these things. Like, oh, yes, we're going to wind up taking a bunch of terms from the military here. It's like, you realize that your entire giant screaming fire is that the lights on the screen are in the wrong pattern. You're trying to make them in the right pattern. No one dies here in most cases, so it feels a little grandiose for some of those terms being tossed around in some cases, but I get it. You've got to make something that is unpleasant and tedious in many respects, a little bit more gripping. I don't envy people. Messaging is hard.

Chris: Yeah, it is. And I think if you're overly virtuoustic and inflexible, you're sort of fighting an uphill battle here, right? So, folks are going to want to call things what they want to call things. 
And you've got people who want to import [ITIL 00:15:04] definitions for severities into the platform because that's what they're familiar with. That's fine.

What we are opinionated about is that you have some severity levels because absent academic criticism of severity levels, they are a useful mechanism to very coarsely and very quickly assess how bad something is and to take some actions off of it. So yeah, we basically have various points in the product where you can customize and put your own sort of flavor on it, but generally, we have a relatively opinionated end-to-end expectation of how you will run that process.

Corey: The thing that I find that annoys me—in some cases—the most is how heavyweight the process is, and it's clearly built by people in an ivory tower somewhere where there's effectively a two-day long postmortem analysis of the incident, and so on and so forth. And okay, great. Your entire site has been blown off the internet, yeah, that probably makes sense. But as soon as you start broadening that to things like okay, an increase in 500 errors on this service for 30 minutes, “Great. Well, we're going to have a two-day postmortem on that.” It's, “Yeah, sure would be nice if we could go two full days without having another incident of that caliber.” So, in other words, whose foot—are we going to hire a new team whose full-time job it is to just go ahead and triage and learn from all these incidents? Seems to me like that's sort of throwing wood behind the wrong arrows.

Chris: Yeah, I think it's very reductive to suggest that learning only happens in a postmortem process. So, I wrote a blog, actually, not so long ago that is about running postmortems and when it makes sense to do it. And as part of that, I had a sort of a statement that was [laugh] that we hadn't run a single postmortem when I wrote this blog at incident.io. 
Which is probably shocking to many people because we're an incident company, and we talk about this stuff, but we were also a company of five people and when something went wrong, the learning was happening and these things were sort of—we were carving out the time, whether it was called a postmortem, or not to learn and figure out these things. Extrapolating that to bigger companies, there is little value in following processes for the sake of following processes. And so, you could have—Corey: Someone in compliance just wound up spitting their coffee over their desktop as soon as you said that. But I hear you.Chris: Yeah. And it's those same folks who are the ones who care about the document being written, not the process and the learning happening. And I think that's deeply frustrating to me as—Corey: All the plans, of course, assume that people will prioritize the company over their own family for certain kinds of disasters. I love that, too. It's divorced from reality; that's ridiculous, on some level. Speaking of ridiculous things, as you continue to grow and scale, I imagine you integrate with things beyond just Slack. You grab other data sources and over in the fullness of time.For example, I imagine one of your most popular requests from some of your larger customers is to integrate with their HR system in order to figure out who's the last engineer who left, therefore everything immediately their fault because lord knows the best practice is to pillory whoever was the last left because then they're not there to defend themselves anymore and no one's going to get dinged for that irresponsible jackass's decisions, even if they never touched the system at all. I'm being slightly hyperbolic, but only slightly.Chris: Yeah. I think [laugh] that's an interesting point. I am definitely going to raise that feature request for a prefilled root cause category, which is, you know, the value is just that last person who left the organization. 
That it's a wonderful scapegoat situation there. I like it.To the point around what we do integrate with, I think the thing is actually with incidents that's quite interesting is there is a lot of tooling that exists in this space that does little pockets of useful, valuable things in the shape of incidents. So, you have PagerDuty is this system that does a great job of making people's phone making noise, but that happens, and then you're dropped into this sort of empty void of nothingness and you've got to go and figure out what to do. And then you've got things like Jira where clearly you want to be able to track actions that are coming out of things going wrong in some cases, and that's a great tool for that. And various other things in the middle there. And yeah, our value proposition, if you want to call it that, is to bring those things together in a way that is massively ergonomic during an incident.So, when you're in the middle of an incident, it is really handy to be able to go, “Oh, I have shipped this horrible fix to this thing. It works, but I must remember to undo that.” And we put that at your fingertips in an incident channel from Slack, that you can just log that action, lose that cognitive load that would otherwise be there, move on with fixing the thing. And you have this sort of—I think it's, like, that multiplied by 1000 in incidents that is just what makes it feel delightful. And I cringe a little bit saying that because it's an incident at the end of the day, but genuinely, it feels magical when some things happen that are just like, “Oh, my gosh, you've automatically hooked into my GitHub thing and someone else merged that PR and you've posted that back into the channel for me so I know that that happens. That would otherwise have been a thing where I jump out of the incident to go and figure out what was happening.”Corey: This episode is sponsored in part by our friend EnterpriseDB. 
EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL on-premises, private cloud, and they just announced a fully-managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor because they're too busy launching another half-dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications—including Oracle—to the cloud. To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.Corey: The problem with the cloud, too, is the first thing that, when there starts to be an incident happening is the number one decision—almost the number one decision point is this my shitty code, something we have just pushed in our stuff, or is it the underlying provider itself? Which is why the AWS status page being slow to update is so maddening. Because those are two completely different paths to go down and you are having to pursue both of them equally at the same time until one can be ruled out. And that is why time to identify at least what side of the universe it's on is so important. That has always been a bit of a tricky challenge.I want to talk a bit about circular dependencies. You target a certain persona of customer, but I'm going to go out on a limb and assume that one explicit company that you are not going to want to do business with in your current iteration is Slack itself because a tool to manage—okay, so our service is down, so we're going to go to Slack to fix it doesn't work when the service is Slack itself. So, that becomes a significant challenge. As you look at this across the board, are you seeing customers having problems where you have circular dependency issues with this? 
Easy example: Slack is built on top of AWS.

When there's an underlying degradation of, huh, suddenly us-east-1 is not doing what it's supposed to be doing, now, Slack is degraded as well, as well as the customer site, it seems like at that point, you're sort of in a bit of tricky positioning as a customer. Counterpoint, when neither Slack nor your site are working, figuring out what caused that issue doesn't seem like it's the biggest stretch of the imagination at that point.

Chris: I've spent a lot of my career working in infrastructure, platform-type teams, and I think you can end up tying yourself in knots if you try and over-optimize for, like, avoiding these dependencies. I think it's one of those, sort of, turtles all the way down situations. So yes, Slack are unlikely to become a customer because they are clearly going to want to use our product when they are down.

Corey: They reach out, “We'd like to be your customer.” Your response is, “Please don't be.” None of us are going to be happy with this outcome.

Chris: Yeah, I mean, the interesting thing there is that we're friends with some folks at Slack, and, believe it or not, they do use Slack to navigate their incidents. They have an internal tool that they have written. And I think this sort of speaks to the point we made earlier, which is that incidents and things failing are not these sort of big binary events. And so—

Corey: All of Slack is down is not the only kind of incident that a company like Slack can experience.

Chris: I'd go as far as to say that it's most commonly not that. It's most commonly that you're navigating incidents where it is a degradation, or some edge case, or something else that's happened. 
And so, like, the pragmatic solution here is not to avoid the circular dependencies, in my view; it's to accept that they exist and make sure you have sensible escape hatches so that when something does go wrong—so a good example, we use incident.io at incident.io to manage incidents that we're having with incident.io. And 99% of the time, that is absolutely fine because we are having some error in some corner of the product or a particular customer is doing something that is a bit curious.And I could count literally on one hand the number of times that we have not been able to use our products to fix our product. And in those cases, we have a fallback which is jump into—Corey: I assume you put a little thought into what happened. “Well, what if our product is down?” “Oh well, I guess we'll never be able to fix it or communicate about it.” It seems like that's the sort of thing that, given what you do, you might have put more than ten seconds of thought into.Chris: We've put a fair amount of thought into it. But at the end of the day, [laugh] it's like if stuff is down, like, what do you need to do? You need to communicate with people. So, jump on a Google Chat, jump on a Slack huddle, whatever else it is we have various different, like, fallbacks in different order. And at the core of it, I think this is the thing is, like, you cannot be prepared for every single thing going wrong, and so what you can be prepared for is to be unprepared and just accept that humans are incredibly good at being resilient, and therefore, all manner of things are going to happen that you've never seen before and I guarantee you will figure them out and fix them, basically.But yeah, I say this; if my SOC 2 auditor is listening, we also do have a very well-defined, like, backup plan in our SOC 2 [laugh] in our policies and processes that is the thing that we will follow that. But yeah.Corey: The fact that you're saying the magic words of SOC 2, yes, exactly. 
Being a responsible adult and living up to some baseline compliance obligations is really the sign of a company that's put a little thought into these things. So, as I pull up incident.io—the website, not the company to be clear—and look through what you've written and how you talk about what you're doing, you've avoided what I would almost certainly have not because your tagline front and center on your landing page is, “Manage incidents at scale without leaving Slack.” If someone were to reach out and say, well, we're down all the time, but we're using Microsoft Teams, so I don't know that we can use you, like, the immediate instinctive response that I would have for that to the point where I would put it in the copy is, “Okay, this piece of advice is free. I would posit that you're down all the time because you're the kind of company to use Microsoft Teams.” But that doesn't tend to win a whole lot of friends in various places. In a slightly less sarcastic bent, do you see people reaching out with, “Well, we want to use you because we love what you're doing, but we don't use Slack”?

Chris: Yeah. We do. A lot of folks actually. And we will support Teams one day, I think. There is nothing especially unique about the product that means that we are tied to Slack.

It is a great way to distribute our product and it sort of aligns with the companies that think in the way that we do in the general case but, like, at the core of what we're building, it's a platform that augments a communication platform to make it much easier to deal with a high-stress, high-pressure situation. And so, in the future, we will support ways for you to connect Microsoft Teams or if Zoom sought out getting rich app experiences, talk on a Zoom and be able to do various things like logging actions and communicating with other systems and things like that. But yeah, for the time being, it's a very, very deliberate focus mechanism for us. 
We're a small company with, like, 30 people now, and so yeah, focusing on that sort of very slim vertical is working well for us.

Corey: And it certainly seems to be working to your benefit. Every person I've talked to who has encountered you folks has nothing but good things to say. We have a bunch of folks in common listed on the wall of logos, the social proof eye chart thing of here's people who are using us. And these are serious companies. I mean, your last job before starting incident.io was at Monzo, as you mentioned.

You know what you're doing in a regulated, serious sense. I would be, quite honestly, extraordinarily skeptical if your background were significantly different from this because, “Well, yeah, we worked at Twitter for Pets in our three-person SRE team, we can tell you exactly how to go ahead and handle your incidents.” Yeah, there's a certain level of operational maturity that I kind of, just based upon the name of the company there, don't think that Twitter for Pets is going to nail. Monzo is a bank. Guess you know what you're talking about, given that you have not, basically, been shut down by an army of regulators. It really does breed an awful lot of confidence.

But what's interesting to me is the number of people that we talk to in common are not themselves banks. Some are and they do very serious things, but others are not these highly regulated, command-and-control, top-down companies. You are nimble enough that you can get embedded at those startup-y of startup companies once they hit a certain point of scale and wind up helping them arrive at a better outcome. It's interesting in that you don't normally see a whole lot of tools that wind up being able to speak to both sides of that very broad spectrum—and most things in between—very effectively. But you've somehow managed to thread that needle. Good work.

Chris: Thank you. Yeah. What else can I say other than thank you? 
I think, like, it's a deliberate product positioning that we've gone down to try and be able to support those different use cases. So, I think, at the core of it, we have always tried to maintain that incident.io should be installable and usable in your very first incident without you having to have a very steep learning curve, but there is depth behind it that allows you to support a much more sophisticated incident setup.

So, like, I mean, you mentioned Monzo. Like, I just feel incredibly fortunate to have worked at that company. I joined back in 2017 when they had, I don't know, like, 150,000 customers and it was just getting its banking license. And I was there for four years and was able to then see it scale up to 6 million customers and all of the challenges and pain that goes along with that, both from building infrastructure on the technical side of things and from an organizational side of things. And was, like, front-row seat to being able to work with some incredibly smart people and sort of see all these various different pain points.

And honestly, it feels a little bit like being in sort of a cheat mode where we get to import a lot of that knowledge and pain that we felt at Monzo into the product. And that happens to resonate with a bunch of folks. So yeah, I feel like things are sort of coming out quite well at the moment for folks.

Corey: The one thing I will say before we wind up calling this an episode is just how grateful I am that I don't have to think about things like this anymore. There's a reason that the problem I chose to work on, expensive AWS bills, is very much a business-hours-only style of problem. We're a services company. We don't have production infrastructure that is externally facing. “Oh, no, one of our data analysis tools isn't working internally.”

That's an interesting curiosity, but it's not an emergency in the same way that, “Oh, we're an ad network and people aren't looking at ads right now because we're broken,” is. 
So, I am grateful that I don't have to think about these things anymore. And also a little wistful because there's so much that you do that would have made dealing with expensive and dangerous outages back in my production years a lot nicer.

Chris: Yep. I think that's what a lot of folks are telling us essentially. There's this curious thing with, like, this product didn't exist however many years ago and I think it's sort of been quite emergent in a lot of companies that, you know, as sort of things have moved on, that something needs to exist in this little pocket of space, dealing with incidents in modern companies. So, I'm very pleased that what we're able to build here is sort of working and filling that for folks.

Corey: Yeah. I really want to thank you for taking so much time to go through the ethos of what you do, why you do it, and how you do it. If people want to learn more, where's the best place for them to go? Ideally, not during an incident.

Chris: Not during an incident, obviously. Handily, the website is the company name. So, incident.io is a great place to go and find out more. We've literally—literally just today, actually—launched our Practical Guide to Incident Management, which is, like, a really full piece of content which, hopefully, will be useful to a bunch of different folks.

Corey: Excellent. We will, of course, put a link to that in the [show notes 00:29:52]. I really want to thank you for being so generous with your time. Really appreciate it.

Chris: Thanks so much. It's been an absolute pleasure.

Corey: Chris Evans, Chief Product Officer and co-founder of incident.io. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. 
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice along with an angry comment telling me why your latest incident is all the intern's fault.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Break Things On Purpose
KubeCon, Kindness, and Legos with Michael Chenetz

May 31, 2022, 27:57


Today we chat with Cisco's head of developer content, community, and events, Michael Chenetz. We discuss everything from KubeCon to kindness and Legos! Michael delves into some of the main themes he heard from creators at KubeCon, and we discuss methods for increasing adoption of new concepts in your organization. We have a conversation about attending live conferences, COVID protocol, and COVID shaming, and then we talk about how Legos can be used in talks to demonstrate concepts. We end the conversation with a discussion about combining passions to practice creativity.

We discuss our time at KubeCon in Spain (5:51)
Themes Michael heard at KubeCon talking with creators (7:46)
Increasing adoption of new concepts (9:27)
We talk conferences, COVID shaming, and blamelessness (12:21)
Legos and reliability (18:04)
Michael talks about ways to exercise creativity (23:20)

Links:
KubeCon October 2022: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/
Cisco: https://www.cisco.com/
Cloud Unfiltered Podcast: https://www.cisco.com/c/en/us/solutions/cloud/podcasts.html
Cloud Unfiltered podcast episode featuring Julie and Jason: https://podcasts.apple.com/us/podcast/ep125-chaos-engineering-with-julie-gunderson-and-jason/id1215105578?i=1000562393884
Nintendo Lego Set: https://www.amazon.com/dp/B08HVXMQ87

Transcript

Julie: And for folks that are interested in, too, what day it is—because I think we're all still a little bit confused—it is Monday, May 24th that we are recording this episode.

Jason: Uh, Julie's definitely confused on what day it is because it's actually Tuesday, [laugh] May 24th.

Michael: Oh, my God. [laugh]. That's great. 
I love it.Julie: Welcome to Break Things on Purpose, a podcast about reliability, learning from each other, and blamelessness. In this episode, we talk to Michael Chenetz, head of developer content, community, and events at Cisco, about all of the learnings from KubeCon, the importance of being kind to each other, and of course, how Lego translates into technology.Julie: Today, we are joined by Michael Chenetz. Michael, do you want to tell us a little bit about yourself?Michael: Yeah. [laugh]. Well, first of all, thank you for having me on the show. And I'm really good at breaking things, so I guess that's why I'm asked to be here is because I'm superb at it. What I'm not so good at is, like, putting things back together.Like when I was a kid, I remember taking my dad's stereo apart; wasn't too happy about that. Wasn't very good at putting it back together. But you know, so that's just going back a little ways there. But yeah, so I work for the DevRel at Cisco and my whole responsibility is, you know, to get people to know that know a little bit about us in terms of, you know, all the developer-related topics.Julie: Well, and Jason and I had the awesome opportunity to hang out with you at KubeCon, where we got to join your Cloud Unfiltered podcast. So folks, definitely go check out that episode. We have a lot of fun. We'll put a link in the [show notes 00:02:03]. But yeah, let's talk a little bit about KubeCon. So, as of recording this episode, we all just recently traveled back from Spain, for KubeCon EU, which was… amazing. I really enjoyed being there. My first time in Spain. I got back, I can tell you, less than 24 hours ago. Michael, I think—when did you get back?Michael: So, I got back Saturday night, but my bags have not arrived yet. So, they're still traveling and they're enjoying Europe. And they should be back soon, I guess when they're when they feel like they're—you know, they should be back from vacation.Julie: [laugh].Michael: So. 
[laugh].Julie: Jason, how about you? When did you get home?Jason: I got home on Sunday night. So, I took the train from Valencia to Barcelona on Saturday evening, and then an early morning flight on Sunday and got home late Sunday night.Julie: And for folks that are interested in, too, what day it is—because I think we're all still a little bit confused—it is Monday, May 24th that we are recording this episode.Jason: Uh, Julie's definitely confused on what day it is because it's actually Tuesday, [laugh] May 24th.Michael: Oh, my God. [laugh]. That's great. I love it. By the way, yesterday was my birthday so I'm going to say—Julie: Happy birthday.Michael: —happy birthday to myself.Julie: Oh, my gosh, happy birthday. [laugh].Michael: Thank you [laugh].Julie: So… what is time anyway?Jason: Yeah.Michael: It's all good. It's all relative. Time is relative.Julie: Time is relative. And so, you know, tell us a little bit about—I'd love to know a little bit about why you want folks to know about, like, what is the message you try to get across?Jason: Oh, that's not the question I thought you were going to ask. I thought you were going to ask, “What's on your Amazon wishlist so people can send you birthday presents?”Julie: Yeah, let's back up. Let's do that. So, let's start with your Amazon wishlist. We know that there might be some Legos involved.Michael: Oh, my God, yeah. I mean, you just told me about a cool one, which was Optimus Prime and I just—I'm already on the website, my credit card is out and I'm ready to buy. So, you know, this is the problem with talking to you guys. [laugh]. It's definitely—you know, that's definitely on my list. So, anything that, anything music-related because obviously behind me is a lot of music equipment—I love music stuff—and anything tech. The combination of tech and music, and if you can combine Legos and that, too, man that would just match all the boxes. [laugh].Julie: Just to let you know, there's a Lego Con. 
Like, I did not know this until last night, actually. But it is a virtual conference.Michael: Really.Julie: Yeah. But one of the things I was looking at actually on Lego, when you look at their website, like, to request one of their speakers, to request one of their engineers as a speaker, they actually don't do that because they get so many requests for their folks to speak at conferences, they actually have a dedicated part of their website that talks about this. So, I thought that was interesting.Michael: Well listen, just because of that, if they want somebody that's in, you know, cloud computing, I'm not going to go talk for Lego. And I know they really want somebody from cloud computing talking to Lego, so, you know… it's, you know, quid pro quo there, so that's just the way it's going to work. [laugh].Julie: I want to be best friends with Lego people.Michael: [laugh]. I know, me too.Julie: I'm just going to make it a goal in life now to have one of their engineers speak at DevOpsDays Boise. It's like a challenge.Michael: It is. I accept it.Julie: [laugh]. With that, though, just on other Lego news, before we start talking about all the other things that folks may also want to hear about, there is another new Lego, which is the Van Gogh Starry Night that has been newly released by the time this episode comes out.Michael: With a free ear, right?Julie: I mean—[laugh].Michael: Is that what happens?Julie: —well played. Well, played. [laugh]. So, now you really got to spend a lot of time at KubeCon, you were just really recording podcast after podcast.Michael: Oh, my God. Yeah. So, I mean, it was great. I love—because I'm a techie, so I love tech and I love to find out origin stories of stuff. So, I love to, like, talk to these people and like, “Why did that come about? How did—” you know, “What happened in your life that made you want to do this? 
Who hurt you?” [laugh].And so, that's what I constantly try and figure out is, like, [laugh], “What is that?” So, it was really cool because I had, like, Jimmy Zelinskie who came from CoreOS, and he came from—you know, they create, you know, Quay and some of this other kinds of stuff. And you know, just to talk about, like, some of the operators and how they came about, and like… those were the original operators, so that was pretty cool. Varun from Tetrate was supposed to come on, and he created Istio, you know? So, there were so many of these things that I just geek out knowing about, you know?And then the other thing that was really high on our list, and it's really high from where I am, is API quality, API testing, API—so really, that's why I got in touch with you guys because I was like, “Wow, that fits in really good, you know? You guys are doing stuff that's around chaos, and you know, I think that's amazing.” So, all of this stuff is just so interesting to me. But man, it was just a whirlwind of every day just recording, and by the end that was just like, you know, “I'm so sorry, but I just, I can't talk anymore.” You know, and that was it. [laugh].Jason: I love that chatting with the creators. We had Zack Butcher on who is also from Tetrate and one of the early Istio—Michael: Yeah, yeah.Jason: Contributors. And I find it fascinating because I feel like when you chat with these folks, you start to understand the context of why things were built. And it—Michael: Yes.Jason: —it opens your brain up to, like, cool, there's a software—oh, now I know exactly why it's doing things that way, right? Like, it's just so, so eye-opening. I love it.Julie: With that, though, like, did you see any trends or any themes as you were talking to all these folks?Michael: Yeah, so a few real big trends. One is everybody wants to know about eBPF. 
That was the biggest thing at KubeCon, by far, was that, “We want to learn how to do this low-level kernel stuff that's really fast, that can give us all the information we need, and we don't have to use sidecars and things like that.” I mean it was—you know, that was the most excitement that I saw. OTel was another one for OpenTelemetry, which was a big one.
The other thing was simplification. You know, a lot of people were looking to simplify the Kubernetes ecosystem because there's so much out there, and there's so many things that you have to learn about that it was super hard, you know, for somebody to come into it to say, “Where do I even start?” You know? So, that was a big theme was simplification.
I'm trying to think. I think another one is APIs, for sure. You know, because there's this whole thing about API sprawl. And people don't know what their APIs are, people just, like—you know, I always say people can see—like, developers are lazy in a good way, and I consider myself one of them. So, what that means is that when we want to develop something, what we're going to do is we're just going to pull down the nearest API that does what we need, that has the best documentation, that has the best blog, that has the best everything.
We don't know what their testing strategy is; we don't know what their security strategy is; we don't know if they use other libraries. And you have to figure that stuff out. And that's the thing that—you know, so everything around APIs is super important. And you really have to test that stuff out. Yes, people, you have to test it [laugh] and know more about it. So, those were the big themes, I think. [laugh].
Julie: You know, I know that Kerim and I gave a talk on observability where we kind of talked more high-level about some of the overarching concepts, but folks were really excited about that.
I think it was because we briefly touched on OpenTelemetry, which we should have gone into a little bit more depth, but there's only so much you can fit into a 30-minute talk, so hopefully we'll be able to talk about that more at a KubeCon in the future, we [crosstalk 00:09:54] to the selection committee.
Michael: Hashtag topics?
Julie: Uh-huh. [laugh]. You know, that said, though, it really did seem like a huge topic that people just wanted to learn more about. I know, too, at the Gremlin booth, a lot of folks were also interested in talking about, like, how do we just get our organization to adopt some of these concepts that we're hearing about here? And I think that was the thing that surprised me the most is I expected people to be coming up to the booth and deep-diving into very, very deep, technical-level questions, and really, a lot of it was how do we get our organization to do this? How can we increase adoption? So, that was a surprise for me.
Michael: Yeah, you know what, and I would say two things to that. One is, when you talk about Chaos Engineering, I think people think it's like rocket science and people are really scared and they don't want to claim to be experts in it, so they're like, “Wow, this is, like, next-level stuff, and you know, we're really scared. You guys are the experts. I don't want to even attempt this.” And the other thing is that organizations are scared because they think that it's going to, like, create mass hysteria throughout their organization.
And really, none of this is true in either way. In reality, it's a very, very scripted, very exacting stuff that you're testing, and you throw stuff out there and see what kind of response you get. So, you know, it's not this, like, you know—I think people just have—there needs to be more education around a lot of areas in cloud-native. But you know, that's one of the areas. So, I think it's really interesting there.
Julie: I think so too. How about for you, Jason?
Like, what was your surprise from the conference or something that maybe—
Jason: Yeah, I mean, I think my surprise was mostly around just seeing people coming back, right? Because we're now I would say, six months into conferences being back as a thing, right? Like, we had re:Invent last year in Vegas; we had KubeCon last year in LA, and so, like, those are okay events. They weren't, like, back to normal. And this was, I feel like, one of the first conferences, that it really started to feel back to normal.
Like, there was much better attendance, there was much more just buzz and hallway tracking and everything else that we're used to. Like, the whole reason that we go to conferences is getting together with people and hanging out and stuff, and this one has so far felt the most back-to-normal out of any event that I've been to over the past six months.
Michael: Can I just talk about one thing that I think, you know, people have to get over is, you know, I see a lot online, I think it was—I forget who it was that was talking about it. But this whole idea of Covid shaming. I mean, we're going to this event, and it's like, yeah, everybody wants to get out, everybody wants to learn things, but don't shame people just because they got Covid, everybody's getting Covid, okay? That's just the point of life at this point. So, let's just, you know, let's just be nice to each other, be friendly to each other, you know? I just have to say that because I think it's a shame that people are getting shamed, you know, just for going to an event. [laugh].
Julie: See, and I think that—that's an interesting—there's been a lot of conversation around this. And I don't think anybody should be Covid-shamed. Look, I think that we all took a calculated risk in coming—
Michael: Absolutely.
Julie: To this event. I personally gave out a lot of hugs. I hugged some of the folks that have mentioned that they have come up positive from Covid, so there's a calculated risk in going.
I think there has been a little bit of pushback on maybe how some of the communication has come out around it. That said, as an organizer of a small conference with, like, 400 people, I think that these are very complicated matters. And what I really think is important is to listen to feedback from attendees and to take that.
And then we're always looking to improve, right?
Michael: Absolutely.
Julie: If everything that we did was perfect right out of the gate, then we wouldn't have Chaos Engineering because there'd be nothing [crosstalk 00:13:45] be just perfectly reliable. And so, if we take away anything, let's take away—just like what you said, first of all, Covid, you should never shame somebody for having Covid. Like, that's not cool. It's not somebody's fault that they caught an illness.
Michael: Yes.
Julie: I mean unless they were licking doorknobs. And that's a whole different—
Michael: Yes. [laugh]. That's a whole different thing, right there.
Julie: Conversation. But when we talk about just like these questions around cultural adoption, we talk about blamelessness; we talk about learning from failure; we talked about finding ways to improve, and I think all of that can come into play. So, it'll be interesting to see how we learn and grow as we move forward. And like, thank you to re:Invent, thank you to KubeCon, thank you to DevOpsDays Boise. But these conferences that have started going back in-person, at great risk to organizers and the committee because people are going to be mad, one way or the other.
Michael: Yeah. And you can see that people want to be back because it was huge, you know?
Julie: Yeah.
Michael: Maybe you guys, I'm going to put in a feature request for Gremlin to chaos engineer crowds. Can we do that so we can figure out, like, what's going to happen when we have these big events? Can we do that?
Julie: I mean, that sounds fun.
I think what's going to happen is there's going to be hugs, there's going to be people getting sick, but there's going to be people learning and growing.
Michael: Yes.
Julie: And ultimately, I just think that we have to remember that just, like, our systems aren't perfect, and neither are people. Like, the fact that we expect people to be perfect, and maybe we should just keep some mask mandates for a little bit longer when we're at conferences with 8000 people.
Michael: Sure.
Julie: I mean, that's—
Michael: That makes sense.
Jason: Yeah. I mean, it's all about risk management, right? This is, essentially what we do in SRE is there's always a risk of a massive outage, and so it's that balance of, right, do what you can, but ultimately, that's why we have SLOs and things is, you can never be a hundred percent, so like, where do we draw the line of here are the things that we're going to do to help manage this risk, but you can never shoot for a perfectly, entirely safe space, right? Because then we'd all be having conferences in padded rooms, and not touching each other, and things like that. There's a balance there.
And I think we're all just trying to find that, so yeah, as you mentioned, that whole, like, DevOps blamelessness thing, you know, treat each other with the notion that we're all trying to get through this together and do what we think is best. Nobody's just like John Allspaw said, you know, “Nobody goes to work thinking that, like, their intent is to crash everything and destroy the company.” No one's going to KubeCon or any of these conferences thinking, “Yeah, I'm going to be a super-spreader.”
Julie: [laugh].
Michael: Yeah, that would be [crosstalk 00:16:22].
Jason: Like, everyone's trying not to do it. They're doing their best. They're not actively, like, aggressively trying to get you sick or intentionally about it. But you know—so just be kind to one another.
Michael: Yeah. And that's the key.
Julie: It is.
Michael: The key. Be kind to one another, you know?
I mean, it's a great community. People are really nice, so, you know, let's keep that up. I think that's something special about the, you know, the community around KubeCon, specifically.
Julie: As we can refine this and find ways, I would take all of the hugs over virtual conferences—
Michael: Yes.
Julie: Any day now. Because, as Jason mentioned, is even just with you, Michael, the time we got to spend with you, or the time I kept going up to JFrog's booth and Baruch and I would have conversations as he made me a delicious coffee, these hallway tracks, these conversations, that's what no one figured out how to recreate during the virtual events—
Michael: Absolutely.
Julie: —and it's just not possible, right?
Michael: Yeah. I mean, I think it would take a little bit of VR and then maybe some, like, suit that you wear in order to feel the hug. And, you know, so it would take a lot more in order to do that. I mean, I guess it's technologically possible. I don't know if the graphics are there yet, so it might be like a pixelated version, like, you know, like, NES-style, or something like that. But it could look pretty cool. [laugh]. So, we'll have to see, you know?
Julie: Everybody listening to this episode, I hope you're getting as much of a kick out of it as we are recording it because I mean, there are so many different topics here. One of the things that Michael and I bonded about years ago, for our listeners that are—not years ago; months ago. Again, what is time?
Michael: Yeah. What is time? It's all relative.
Julie: It is. It was Lego, though, and so we've been talking about that. But Michael, you asked a great question when we were recording with you, which is, like—
Michael: Wow.
Julie: Can—just one. Only one great question.
Michael: [laugh].
Julie: [laugh]. Which was, how would you incorporate Lego into a talk? And, like, when we look at our systems breaking and all of that, I've really been thinking about that and how to make our systems more reliable.
And here's one of the things I really wanted to clarify that answer. I kind of went… I went talking about my Lego that I build, like, my Optim—not my Optimus Primes, I don't have it, but my Voltron or my Nintendo Lego. And those are all box sets.
Michael: Yep.
Julie: But one of the things if you're not playing with a box set with instruction, if you're just playing with just the—or excuse me, architecting with just the Lego blocks because it's not playing because we're adults now, I think.
Michael: Yes, now it's architecting. Yes.
Julie: Yes, now that we're architecting, like, that's one of the things that I was really thinking about this, and I think that it would make something really fun to talk about is how you're building upon each layer and you're testing out these new connection pieces. And then that really goes into, like, when we get into Technics, into dependencies because if you forget that one little one-inch plastic piece that goes from the one to the other, then your whole Lego can fall apart. So anyway, I just thought that was really interesting, and I'd wondered if you or Jason even gave that any more thought, or if it was just fleeting for you.
Michael: It was definitely fleeting for me, but I will give it some more thought, you know? But you know, when—as you're saying that though, I'm thinking these Lego pieces really need names because you're like that little two-inch Lego piece that kind of connects this and this, like, we got to give these all names so that people can know, that's x-54 that's—that you're putting between x-53 and x-52. I don't know but you need some kind of name for these parts now.
Julie: There are Lego names. You just Google it. There are actual names for all of the parts but—
Michael: Wow. [laugh].
Julie: Like, Jason, what do you think? I know you've got [unintelligible 00:19:59].
Jason: Yeah, I mean, I think it's interesting because I am one of those, like, freeform folks, right?
You know, my standard practice when I was growing up with Legos was you build the thing that you bought once and then you immediately, like, tear it apart, and you build whatever the hell you want.
Michael: Absolutely.
Jason: So, I think that that's kind of an interesting thing as we think about our systems and stuff, right? Like, part of it is, like, yeah, there's best practices and various companies will publish, like, you know, “Here's how to architect such-and-such system.” And it's interesting because that's just not reality, right? You're not going to go and take, like, the Amazon CloudFormation thing, and like, congrats, you're done. You know, you just implement that and your job's done; you just kick back for the rest of the week.
It never works that way, right? You're taking these little bits of, like, cool, I might have, like, set that up once just to see what's happening but then you immediately, like, deconstruct it, and you take the knowledge of what you learned in those building blocks, and you, like, go and remix it to build the thing that you actually need to build.
Michael: But yeah, I mean, that's exactly—so you know, Legos is what got me interested in that as a kid, but when you look at, you know, cloud services and things like that, there's so many different ways to combine things and so many different ways to, like—you know, you could use Terraform, you could use Crossplane, you could use, you know, any of the services in the cloud, you could use FaaS, you could use serverless, you could use, you know, all these different kinds of solutions and tie them together. So, there's so much choice, and what Lego teaches you is that, embrace the choice. Figure out and embrace the different pieces, embrace all the different things that you have and what the art of possibility is, and then start to build on that. So, I think it's a really good thing.
And that's why there's so much correlation between, like, kind of, art and tech and things like that because that's the kind of mentality that you need in order to be really successful in tech.
Jason: And I think the other thing that works really well with what you said is, as you're playing with Legos, you start to learn these hacks, right? Like, I don't have, like, a four-by-one brick, but I know that if I have three four-by-one flats, I can stack those three and it's the same height as a brick, right?
Michael: Yep.
Jason: And you can start combining things. And I love that engineering mentality of, like, I have this problem that I need to solve, I have a limited toolbox for whatever constraints, right, and understanding those constraints, and then cool, how can I remix what I've got in my toolbox to get this thing done?
Michael: And that's a thing that I'm always doing. Like, when I used to do a lot of development, you know, it was always like, what is the right code? Or what is the library that's going to solve my problem? Or what is the API that's going to solve my problem, you know?
And there's so many different ways to do it. I mean, so many people are afraid of, like, making the wrong choice, when really in programming, there is no wrong choice. It's all about how you want to do it and what makes sense to you, you know? There might be better options in formatting and in the way that you kind of, you know, format that code together and put them in different libraries and things like that, but making choices on, like, APIs and things like that, that's all up to the artist. I would say that's an artist. [laugh]. So, you know, I think it all stems though, when you go back from, you know, just being creative with things… so creativity is king.
Jason: So Michael, how do you exercise your creativity, then? How do you keep up that creativity?
Michael: Yeah, so there's multiple ways.
And that's a great segment because one of the things that I really enjoy—so you know, I like development, but I'm also a people person. And I like product management, but I also like dealing with people. So really, to me, it's about how do I relate products, how do I relate solutions, how do I talk to people about solutions that people can understand? And that's a creative process.
Like, what is the right media? What is the right demos? What is the right—you know, what do people need? And what do people need to, kind of, embrace things? And to me, that's a really creative medium to me, and I love it.
So, I love that I can use my technical, I love that I can use my artistic, I love that I can use, you know, all these pieces all at once. And sometimes maybe I'll play guitar and just put it in the intro or something, I don't know. So, that kind of combines that together, too. So, we'll figure that piece out later. Maybe nobody wants to hear me play guitar, that's fine, too. [laugh].
But I love to be able to use, you know, both sides of my brain to do these creative aspects. So, that's really what does it. And then sometimes I'll program again and I'll find the need, and I'll say, “Hey, look, you know, I realized there's a need for this,” just like a lot of those creators are. But I haven't created anything cool, but you know, maybe someday I will. I feel like it's just been in between all those different intersections that's really cool.
Jason: I love the electric guitar stuff that you mentioned. So, for folks who are listening to this show, during our recording of the Cloud Unfiltered you were talking about bringing that art and technical together with electric guitars, and you've been building electric guitar pickups.
Michael: Yes. Yeah. So, I mean, I love anything that can combine my music passion with tech, so I have a CNC machine back here that winds pickups and it does it automatically.
So, I can say, “Hey, I need a 57 pickup, you know, whatever it is,” and it'll wind it to that exact spec.
But that's not the only thing I do. I mean, I used to design control surfaces for artists that were a big band, and I really can't—a lot of them I can't mention because we're under NDA. But I designed a lot of these big, you know, control surfaces for a lot of the big electronic and rock bands that are out there. I taught people how to use Max for Live, which is an artist's, kind of, programming language that's graphical, so [NMax 00:25:33] and MSP and all that kind of stuff. So, I really, really like to combine that.
Nowadays, you know, I'm talking about doing some kind of events that maybe combine tech with art. So, maybe doing things like Algorave, and you know, things that are live-coding music and an art. So, being able to combine all these things together, I love that. That's my ultimate passion.
Jason: That is super cool.
Julie: I think we have learned quite a bit on this episode of Break Things on Purpose, first of all, from the guy who said he hasn't created much—because you did say that, which I'm going to call you out on that because you just gave a long list of things that you created. And I think we need to remember that we're all creators in our own way, so it's very important to remember that. But I think that right now we've created a couple of options for talks in the future, whether or not it's with Lego, or guitar pickups.
Michael: Yeah.
Julie: Is that—
Michael: Hey—
Julie: Because I—
Michael: Yeah, why not?
Julie: —know you do kind of explain that a little bit to me as well when I was there. So, Michael, this has just been amazing having you. We're going to put a lot of links in the notes for everybody today. So, to Michael's podcast, to some Lego, and to anything else Michael wants to share with us as well. Oh, real quick, is there anything you want to leave our listeners with other than that? You know, are you looking to hire Cisco?
Is there anything you wanted to share with us?
Michael: Yeah, I mean, we're always looking for great people at Cisco, but the biggest thing I'd say is, just realize that we are doing stuff around cloud-native, we're not just network. And I think that's something to note there. But you know, I just love being on the show with you guys. I love doing anything with you guys. You guys are awesome, you know. So.
Julie: You're great too, and I think we'll probably do more stuff, all of us together, in the future. And with that, I just want to thank everybody for joining us today.
Michael: Thank you. Thanks so much. Thanks for having me.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

Hacking Humans
DevOps (noun) [Word Notes]

May 24, 2022 · 7:00


The set of people, process, technology, and cultural norms that integrates software development and IT operations into a system-of-systems.
CyberWire Glossary link:
Audio reference link: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr," by John Allspaw and Paul Hammond, Velocity 09, 25 July 2009.

Screaming in the Cloud
Reliability Starts in Cultural Change with Amy Tobey

May 11, 2022 · 46:37


About Amy
Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she spends her time building an innovative Site Reliability Engineering program at Equinix, where she is a principal engineer. When she's not working, she can be found with her nose in a book, watching anime with her son, making noise with electronics, or doing yoga poses in the sun.
Links Referenced:
Equinix Metal: https://metal.equinix.com
Personal Twitter: https://twitter.com/MissAmyTobey
Personal Blog: https://tobert.github.io/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. “Screaming in the Cloud” listeners can try Vultr for free today with $150 in credit when they visit getvultr.com/screaming.
That's G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast.
Corey: Finding skilled DevOps engineers is a pain in the neck! And if you need to deploy a secure and compliant application to AWS, forgettaboutit! But that's where DuploCloud can help. Their comprehensive no-code/low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks, while automating the full DevSecOps lifestyle. Get started with DevOps-as-a-Service from DuploCloud so that your cloud configurations are done right the first time. Tell them I sent you and your first two months are free. To learn more visit: snark.cloud/duplo. That's snark.cloud/D-U-P-L-O-C-L-O-U-D.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while I catch up with someone that it feels like I've known for ages, and I realize somehow I have never been able to line up getting them on this show as a guest. Today is just one of those days. And my guest is Amy Tobey who has been someone I've been talking to for ages, even in the before-times, if you can remember such a thing. Today, she's a Senior Principal Engineer at Equinix. Amy, thank you for finally giving in to my endless wheedling.
Amy: Thanks for having me. You mentioned the before-times. Like, I remember it was, like, right before the pandemic we had beers in San Francisco wasn't it? There was Ian there—
Corey: Yeah, I—
Amy: —and a couple other people. It was a really great time. And then—
Corey: I vaguely remember beer. Yeah. And then—
Amy: And then the world ended.
Corey: Oh, my God. Yes. It's still March of 2020, right?
Amy: As far as I know. Like, I haven't checked in a couple years.
Corey: So, you do an awful lot. And it's always a difficult question to ask someone, so can you encapsulate your entire existence in a paragraph? It's—
Amy: [sigh].
Corey: —awful, so I'd like to give a bit more structure to it.
Let's start with the introduction: You are a Senior Principal Engineer. We know it's high level because of all the adjectives that get put in there, and none of those adjectives are ‘associate' or ‘beginner' or ‘junior,' or all the other diminutives that companies like to play games with to justify paying people less. And you're at Equinix, which is a company that is a bit unlike most of the, shall we say, traditional cloud providers. What do you do over there and both as a company, as a person?
Amy: So, as a company Equinix, what most people know about is that we have a whole bunch of data centers all over the world. I think we have the most of any company. And what we do is we lease out space in that data center, and then we have a number of other products that people don't know as well, which one is Equinix Metal, which is what I specifically work on, where we rent you bare-metal servers. None of that fancy stuff that you get any other clouds on top of it, there's things you can get that are… partner things that you can add-on, like, you know, storage and other things like that, but we just deliver you bare-metal servers with really great networking. So, what I work on is the reliability of that whole system. All of the things that go into provisioning the servers, making them come up, making sure that they get delivered to the server, make sure the API works right, all of that stuff.
Corey: So, you're on the Equinix cloud side of the world more so than you are on the building data centers by the sweat of your brow, as they say?
Amy: Correct. Yeah, yeah. Software side.
Corey: Excellent. I spent some time in data centers in the early part of my career before cloud ate that. That was sort of cotemporaneous with the discovery that I'm the hardware destruction bunny, and I should go to great pains to keep my aura from anything expensive and important, like, you know, the SAN.
So—
Amy: Right, yeah.
Corey: Companies moving out of data centers, and me getting out was a great thing.
Amy: But the thing about SANs though, is, like, it might not be you. They're just kind of cursed from the start, right? They just always were kind of fussy and easy to break.
Corey: Oh, yeah. I used to think—and I kid you not—that I had a limited upside to my career in tech because I sometimes got sloppy and I was fairly slow at crimping ethernet cables.
Amy: [laugh].
Corey: That is very similar to growing up in third grade when it became apparent that I was going to have problems in my career because my handwriting was sloppy. Yeah, it turns out the future doesn't look like we predicted it would.
Amy: Oh, gosh. Are we going to talk about, like, neurological development now or… [laugh] okay, that's a thing I struggle with, too right, is I started typing as soon as they would let—in fact, before they would let me. I remember in high school, I had teachers who would grade me down for typing a paper out. They want me to handwrite it and I would go, “Cool. Go ahead and take a grade off because if I handwrite it, you're going to take two grades off my handwriting, so I'm cool with this deal.”
Corey: Yeah, it was pretty easy early on. I don't know when the actual shift was, but it became more and more apparent that more and more things are moving towards a world where you could type. And I was almost five when I started working on that stuff, and that really wound up changing a lot of aspects of how I started seeing things. One thing I think you're probably fairly well known for is incidents. I want to be clear when I say that you are not the root cause as—“So, why are things broken?” “It's Amy again. What's she gotten into this time?” Great.
Amy: [laugh].
But it does happen, but not all the time.
Corey: Exa—it's a learning experience.
Amy: Right.
Corey: You've also been deeply involved with SREcon and a number of—a lot of aspects of what I will term—and please don't yell at me for this—SRE culture—
Amy: Yeah.
Corey: Which is sometimes a challenging thing to wind up describing or putting a definition around. The one that I've always been somewhat partial to is, “SRE is DevOps, except you worked at Google for a while.” I don't know how necessarily accurate that is, but it does rile people up.
Amy: Yeah, it does. Dave Stanke actually did a really great talk at SREcon San Francisco just a couple weeks ago, about the DORA report. And the new DORA report, they split SRE out into its own function and kind of is pushing against that old model, which actually comes from Liz Fong-Jones—I think it's from her, or older—about, like, class SRE implements DevOps, which is kind of this idea that, like, SREs make DevOps happen. Things have evolved, right, since then. Things have evolved since Google released those books, and we're all just figured out what works and what doesn't a little bit.
And so, it's not that we're implementing DevOps so much. In fact, it's that ops stuff that kind of holds us back from the really high impact work that SREs, I think, should be doing, that aren't just, like, fixing the problems, the symptoms down at the bottom layer, right? Like what we did as sysadmins 20 years ago. You know, we'd go and a lot of people are SREs that came out of the sysadmin world and still think in that mode, where it's like, “Well, I set up the systems, and when things break, I go and I fix them.” And, “Why did the developers keep writing crappy code? Why do I have to always getting up in the middle of the night because this thing crashed?”
And it turns out that the work we need to do to make things more reliable, there's a ceiling to how far away the platform can take us, right?
Like, we can have the best platform in the world with redundancy, and, you know, nine-way replicated data storage and all this crazy stuff, and still if we put crappy software on top, it's going to be unreliable. So, how do we make less crappy software? And for most of my career, people would be, like, “Well, you should test it.” And so, we started doing that, and we still have crappy software, so what's going on here? We still have incidents.So, we write more tests, and we still have incidents. We had a QA group, we still have incidents. We send the developers to training, and we still have incidents. So like, what is the thing we need to do to make things more reliable? And it turns out, most of it is culture work.Corey: My perspective on this stems from being a grumpy old sysadmin. And at some point, I started calling myself a systems engineer or DevOps or production engineer, or SRE. It was all from my point of view, the same job, but you know, if you call yourself a sysadmin, you're just asking for a 40% pay cut off the top.Amy: [laugh].Corey: But I still tended to view the world through that lens. I tended to be very good at Linux systems internals, for example, understanding system calls and the rest, but increasingly, as the DevOps wave or SRE wave, or Google-isation of the internet wound up being more and more of a thing, I found myself increasingly in job interviews, where, “Great, now, can you go wind up implementing a sorting algorithm on the whiteboard?” “What on earth? No.” Like, my lingua franca is shitty Bash, and no one tends to write that without a bunch of tab completions and quick checking with manpages—die.net or whatnot—on the fly as you go down that path.And it was awful, and I felt… like my skill set was increasingly eroding. And it wasn't honestly until I started this place where I really got into writing a fair bit of code to do different things because it felt like an orthogonal skill set, but the fullness of time, it seems like it's not. 
And it's a reskilling. And it made me wonder, does this mean that the areas of technology that I focused on early in my career, was that all a waste? And the answer is not really. Sometimes, sure, in that I don't spend nearly as much time worrying about inodes—for example—as I once did. But every once in a while, I'll run into something and I looked like a wizard from the future, but instead, I'm a wizard from the past.Amy: Yeah, I find that a lot in my work, now. Sometimes things I did 20 years ago, come back, and it's like, oh, yeah, I remember I did all that threading work in 2002 in Perl, and I learned everything the very, very, very hard way. And then, you know, this January, did some threading work to fix some stability issues, and all of it came flooding back, right? Just that the experiences really, more than the code or the learning or the text and stuff; more just the, like, this feels like threads [BLEEP]-ery. Is a diagnostic thing that sometimes we have to say.And then people are like, “Can you prove it?” And I'm like, “Not really,” because it's literally thread [BLEEP]-ery. Like, the definition of it is that there's weird stuff happening that we can't figure out why it's happening. There's something acting in the system that isn't synchronized, that isn't connected to other things, that's happening out of order from what we expect, and if we had a clear signal, we would just fix it, but we don't. We just have, like, weird stuff happening over here and then over there and over there and over there.And, like, that tells me there's just something happening at that layer and then have to go and dig into that right, and like, just basically charge through. My colleagues are like, “Well, maybe you should look at this, and go look at the database,” the things that they're used to looking at and that their experiences inform, whereas then I bring that ancient toiling through the threading mines experiences back and go, “Oh, yeah. 
So, let's go find where this is happening, where people are doing dangerous things with threads, and see if we can spot something.” But that came from that experience.Corey: And there's so much that just repeats itself. And history rhymes. The challenge is that, do you have 20 years of experience, or do you have one year of experience repeated 20 times? And as the tide rises, doing the same task by hand, it really is just a matter of time before your full-time job winds up being something a piece of software does. An easy example is, “Oh, what's your job?” “I manually place containers onto specific hosts.” “Well, I've got news for you, and you're not going to like it at all.”Amy: Yeah, yeah. I think that we share a little bit. I'm allergic to repeated work. I don't know if allergic is the right word, but you know, if I sit and I do something once, fine. Like, I'll just crank it out, you know, it's this form, or it's a datafile I got to write and I'll—fine I'll type it in and do the manual labor.The second time, the difficulty goes up by ten, right? Like, just mentally, just to do it, be like, I've already done this once. Doing it again is anathema to everything that I am. And then sometimes I'll get through it, but after that, like, writing a program is so much easier because it's like exponential, almost, growth in difficulty. You know, the third time I have to do the same thing that's like just typing the same stuff—like, look over here, read this thing and type it over here—I'm out; I can't do it. You know, I got to find a way to automate. And I don't know, maybe normal people aren't driven to live this way, but it's kept me from getting stuck in those spots, too.Corey: It was weird because I spent a lot of time as a consultant going from place to place and it led to some weird changes. For example, “Oh, thank God, I don't have to think about that whole messaging queue thing.” Sure enough, next engagement, it's message queue time. Fantastic. 
I found that repeating myself drove me nuts, but you also have to be very sensitive not to wind up, you know, stealing IP from the people that you're working with. Amy: Right. Corey: But what I loved about the sysadmin side of the world is that the vast majority of stuff that I've taken with me lives in my shell config. And what I mean by that is I'm not—nothing in there is proprietary, but when you have a weird problem with trying to figure out the best way to figure out which Ruby process is stealing all the CPU, great, turns out that you can chain seven or eight different shell commands together through a bunch of pipes. I don't want to remember that forever. So, that's the sort of thing I would wind up committing as I learned it. I don't remember what company I picked that up at, but it was one of those things that was super helpful. I have a sarcastic—it's a one-liner, except no sane editor setting is going to show it in any less than three—of a whole bunch of Perl, piped into du, piped into the rest, that tells you what the largest consumers of files are in a given part of the system. And it rates them with stars and it winds up doing some neat stuff. I would never sit down and reinvent something like that today, but the fact that it's there means that I can do all kinds of neat tricks when I need to. It's making sure that as you move through your career, on some level, you're picking up skills that are repeatable and applicable beyond one company. Amy: Skills and tooling— Corey: Yeah. Amy: —right? Like, you just described the tool. Another SREcon talk was John Allspaw and Dr. Richard Cook talking about above the line, below the line. And they started with these metaphors about tools, right, showing all the different kinds of hammers. And if you're a blacksmith, a lot of times you craft specialized hammers for very specific jobs.
And that's one of the properties of a tool that they were trying to get people to think about, right, is that tools get crafted to the job. And what you just described is a bespoke tool that you had created on the fly, that kind of floated under the radar of intellectual property. [laugh]. So, let's not tell the security or IP people, right? Like, because there's probably billions and billions of dollars of technically, like, made-up IP value—I'm doing air quotes with my fingers—you know, that's just basically people's shell profiles. And my God, the Emacs automation that people have done. If you've ever really seen somebody who's amazing at Emacs and has 10, 20, 30, maybe 40 years of experience encoded in their Emacs settings, it's a wonder to behold. Like, I look at it and I go, “Man, I wish I could do that.” It's like listening to a really great guitar player and being like, “Wow, I wish I could play like them.” You see them just flying through stuff. But all that IP in there is both that person's collection of wisdom and experience and working with that code, but also encodes that stuff like you described, right? It's just all these little systems tricks and little fiddly commands and things we don't want to remember and so we encode them into our toolset. Corey: Oh, yeah. Anything I wound up taking, I always would share it with people internally, too. I'd mention, “Yeah, I'm keeping this in my shell files.” Because I disclosed it, which solves a lot of the problem. And also, none of it was even close to proprietary or anything like that. I'm sorry, but the way that you wind up figuring out how much of a disk is being eaten up and where, in a more pleasing way, is not a competitive advantage. It just isn't. Amy: It isn't to you or me, but, you know, back in the beginning of our careers, people thought it was worth money and should be proprietary.
You know, like, oh, that disk-checking script is a competitive advantage for our company because there are only a few of us doing this work. Like, it was actually being able to, like, manage your—[laugh] actually manage your servers was a competitive advantage. Now, it's kind of commodity. Corey: Let's also be clear that the world has moved on. I wound up buying DaisyDisk a while back for Mac, which I love. It is a fantastic, pretty effective, “Where's all the stuff on your disk going?” And it does a scan and you can drag and collect things and delete them when trying to clean things out. I was using it the other day, so it's top of mind at the moment. But it's way more polished than that crappy Perl three-liner. And I see both sides, truly I do. The trick also, for those wondering [unintelligible 00:15:45], like, “Where is the line?” It's super easy. Disclose what you're doing. In those scenarios, in the event someone says no because they believe that finding the right man page section for something is somehow proprietary, great. When you go home that evening, in a completely separate environment, build it yourself from scratch to solve the problem, reimplement it and save that. And you're done. There are lots of ways to do this. Don't steal from your employer, but your employer employs you; they don't own you and the way that you think about these problems. Every person I've met who has had a career that's longer than 20 minutes has a giant doc somewhere on some system of all of the scripts that they wound up putting together, all of the one-liners, the notes on, “Next time you see this, this is the thing to check.” Amy: Yeah, the cheat sheet or the notebook with all the little commands, or again the Emacs config, sometimes for some people, or shell profiles.
Yeah. Corey: Here's the awk one-liner that I put that automatically spits out from an Apache log file what—the httpd log file that just tells me what are the most frequent talkers, and what are the— Amy: You should probably let go of that one. You know, like, I think that one's lifetime is kind of past, Corey. Maybe you— Corey: I just have to get it working with Nginx, and we're good to go. Amy: Oh, yeah, there you go. [laugh]. Corey: Or S3 access logs. Perish the thought. But yeah, like, what are the five most high-volume talkers, and what are those relative to each other? Huh, that one thing seems super crappy and it's coming from Russia. But that's—hmm, one starts to wonder; maybe it's time to dig back in. So, one of the things that I have found is that a lot of the people talking about SRE seem to have descended from an ivory tower somewhere. And they're talking about how some of the best-in-class companies out there, renowned for their technical cultures—at least externally—are doing these things. But there's a lot more folks who are not there. And honestly, I consider myself one of those people who is not there. I was a competent engineer, but never a terrific one. And looking at the way this was described, I often came away thinking, “Okay, was the purpose of this conference talk just to reinforce how smart people are, and how I'm not,” and/or, “Here are the 18 cultural changes you need to make to your company, and then you can do something kind of like we were just talking about on stage.” It feels like there's a combination of problems here. One is making this stuff more accessible to folks who are not themselves in those environments, and two, how to drive cultural change as an individual contributor if that's even possible. And I'm going to go out on a limb and guess you have thoughts on both aspects of that, and probably some more. Hit me, please. Amy: So, the ivory tower, right. Let's just be straight up, like, the ivory tower is Google.
I mean, that's where it started. And we get it from the other large companies that, you know, want to do conference talks about what this stuff means and what it does. What I've kind of come around to in the last couple of years is that those talks don't really reach the vast majority of engineers, they don't really apply to a large swath of the enterprise especially, which is, like, where a lot of the—the bulk of our industry sits, right? We spend a lot of time talking about the darlings out here on the West Coast in high tech culture and startups and so on. But, like, we were talking about before we started the show, right, like, the interior of even just America is filled with all these, like, insurance and banks and all of these companies that are cranking out tons of code and servers and stuff, and they're trying to figure out the same problems. But they're structured in companies where their tech arm is still, in most cases, considered a cost center, often is bundled under finance, for—that's a whole show in itself about that historical blunder. And so, the tech cultures tend to be very, very different from what we experience in—what do we call it anymore? Like, I don't even want to say West Coast anymore because we've gone remote, but, like, high tech culture we'll say. And so, like, thinking about how to make SRE and all this stuff more accessible comes down to, like, thinking about who those engineers are that are sitting at the computers, writing all the code that runs our banks, all the code that makes sure that—I'm trying to think of examples that are more enterprise-y, right? Or, shoot, buying clothes online. You go to Macy's for example. They have a whole bunch of servers that run their online store and stuff. They have internal IT-ish people who keep all this stuff running and write that code and are probably integrating open-source stuff much like we all do.
But when you go to try to put in a reliability program that's based on the current SRE models, like SLOs; you put in SLOs and you start doing, like, this incident management program that's, like, you know, you have a form you fill out after every incident, and then you [unintelligible 00:20:25] retros.And it turns out that those things are very high-level skills, skills and capabilities in an organization. And so, when you have this kind of IT mindset or the enterprise mindset, bringing the culture together to make those things work often doesn't happen. Because, you know, they'll go with the prescriptive model and say, like, okay, we're going to implement SLOs, we're going to start measuring SLIs on all of the services, and we're going to hold you accountable for meeting those targets. If you just do that, right, you're just doing more gatekeeping and policing of your tech environment. My bet is, reliability almost never improves in those cases.And that's been my experience, too, and why I get charged up about this is, if you just go slam in these practices, people end up miserable, the practices then become tarnished because people experienced the worst version of them. And then—Corey: And with the remote explosion as well, it turns out that changing jobs basically means their company sends you a different Mac, and the next Monday, you wind up signing into a different Slack team.Amy: Yeah, so the culture really matters, right? You can't cover it over with foosball tables and great lunch. You actually have to deliver tools that developers want to use and you have to deliver a software engineering culture that brings out the best in developers instead of demanding the best from developers. I think that's a fundamental business shift that's kind of happening. 
If I'm putting on my wizard hat and looking into the future and dreaming about what might change in the world, right, is that there's kind of a change in how we do leadership and how we do business that's shifting more towards that model where we look at what people are capable of and we trust in our people, and we get more out of them, the knowledge work model. If we want more knowledge work, we need people to be happy and to feel engaged in their community. And suddenly we start to see these kind of generational, bigger-pie kind of things start to happen. But how do we get there? It's not SLOs. Maybe it's a little bit starting with incidents. That's where I've had the most success, and you asked me about that. So, getting practical, incident management is probably— Corey: Right. Well, as I see it, the problem with SLOs across the board is it feels like it's a very insular community so far, and communicating it to engineers seems to be the focus of where the community has been, but from my understanding of it, you absolutely need buy-in at significantly high executive levels, to at the very least buy you air cover while you're doing these things and making these changes, but also to help drive that cultural shift. None of this is something I have the slightest clue how to do, let's be very clear. If I knew how to change a company's culture, I'd have a different job. Amy: Yeah. [laugh]. The biggest omission in the Google SRE books was [Ers 00:22:58]. There was a guy at Google named Ers who owns availability for Google, and when anything is, like, in dispute and bubbles up to the management team, it goes to Ers, and he says, “Thou shalt…” right? Makes the call.
And that's why it works, right?Like, it's not just that one person, but that system of management where the whole leadership team—there's a large, very well-funded team with a lot of power in the organization that can drive availability, and they can say, this is how you're going to do metrics for your service, and this is the system that you're in. And it's kind of, yeah, sure it works for them because they have all the organizational support in place. What I was saying to my team just the other day—because we're in the middle of our SLO rollout—is that really, I think an SLO program isn't [clear throat] about the engineers at all until late in the game. At the beginning of the game, it's really about getting the leadership team on board to say, “Hey, we want to put in SLIs and SLOs to start to understand the functioning of our software system.” But if they don't have that curiosity in the first place, that desire to understand how well their teams are doing, how healthy their teams are, don't do it. It's not going to work. It's just going to make everyone miserable.Corey: It feels like it's one of those difficult to sell problems as well, in that it requires some tooling changes, absolutely. It requires cultural change and buy-in and whatnot, but in order for that to happen, there has to be a painful problem that a company recognizes and is willing to pay to make go away. The problem with stuff like this is that once you pay, there's a lot of extra work that goes on top of it as well, that does not have a perception—rightly or wrongly—of contributing to feature velocity, of hitting the next milestone. It's, “Really? So, we're going to be spending how much money to make engineers happier? They should get paid an awful lot and they're still complaining and never seem happy. Why do I care if they're happy other than the pure mercenary perspective of otherwise they'll quit?” I'm not saying that it's not worth pursuing; it's not a worthy goal. 
I am saying that it becomes a very difficult thing to wind up selling as a product. Amy: Well, as a product for sure, right? Because—[sigh] gosh, I have friends in the space who work on these tools. And I want to be careful. Corey: Of course. Nothing but love for all of those people, let's be very clear. Amy: But a lot of them, you know, they're pulling metrics from existing monitoring systems, they are doing some interesting math on them, but what you get at the end is a nice service catalog and dashboard, which are things we've been trying to land as products in this industry for as long as I can remember, and— Corey: “We've got it this time, though. This time we'll crack the nut.” Yeah. Get off the island, Gilligan. Amy: And then the other, like, risky thing, right, is the other part that makes me uncomfortable about SLOs, and why I will often tell folks that I talk to out in the industry that are asking me about this, like, one-on-one, “Should I do it here?” And it's like, you can bring the tool in, and if you have a management team that's just looking to have metrics to drive productivity, instead of, you know, trying to drive better knowledge work, what you get is just a fancier version of more Taylorism, right, which is basically scientific management, this idea that we can, like, drive workers to maximum efficiency by measuring random things about them and driving those numbers. It turns out, that doesn't really work very well, even at industrial scale; it just happened to work because, you know, we have a bloody enough society that we pushed people into it. But the reality is, if you implement SLOs badly, you get more really bad Taylorism that's bad for your developers. And my suspicion is that you will get worse availability out of it than you would if you just didn't do it at all. Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O.
It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I've been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn't limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up vetting all of their talent on English ability, as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday with your team. It's an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That's R-E-V-E-L-O dot I-O slash screaming. Corey: That is part of the problem is, in some cases, to drive some of these improvements, you have to go backwards to move forwards. And it's one of those, “Great, so we spent all this effort and money and the rest, and now things are worse?” No, not necessarily, but suddenly you're aware of things that were slipping through the cracks previously. Amy: Yeah.
Yeah. Corey: Like, the most realistic thing about first The Phoenix Project and then The Unicorn Project, both by Gene Kim, has been the fact that companies have these problems and actively cared enough to change them. In my experience, that feels a little on the rare side. Amy: Yeah, and I think that's actually the key, right? It's for the culture change, and for, like, if you're really looking to be, like, do I want to work at this company? Am I investing myself in here? It's look at the leadership team and be, like, do these people actually give a crap? Are they looking just to punt another number down the road? That's the real question, right? Like, the technology and stuff, at the point where I'm at in my career, I just don't care that much anymore. [laugh]. Just… fine, use Kubernetes, use Postgres, [unintelligible 00:27:30], I don't care. I just don't. Like, Oracle, I might have to ask, you know, go to finance and be like, “Hey, can we spend 20 million for a database?” But like, nobody really asks for that anymore, so. [laugh]. Corey: As one does. I will say that I mostly agree with you, but a technology that I found myself getting excited about, given the time of the recording on this is… fun, I spent a bit of time yesterday—from when we're recording this—teaching myself just enough Go to wind up banging together a binary that I needed to do something actively ridiculous for my camera here. And I found myself coming away deeply impressed by a lot of things about it, how prescriptive it was for one, how self-contained for another. And after spending far too many years of my life writing shitty Perl, and shitty Bash, and worse Python, et cetera, et cetera, the prescriptiveness was great. The fact that it wound up giving me something I could just run, I could cross-compile for anything I need to run it on, and it just worked. It's been a while since I found a technology that got me this interested in exploring further. Amy: Go is great for that.
You mentioned one of my two favorite features of Go. One is usually when a program compiles—at least the way I code in Go—it usually works. I've been working with Go since about 0.9, like, just a little bit before it was released as 1.0, and that's what I've noticed over the years of working with it is that most of the time, if you have a pretty good data structure design and you get the code to compile, usually it's going to work, unless you're doing weird stuff.The other thing I really love about Go and that maybe you'll discover over time is the malleability of it. And the reason why I think about that more than probably most folks is that I work on other people's code most of the time. And maybe this is something that you probably run into with your business, too, right, where you're working on other people's infrastructure. And the way that we encode business rules and things in the languages, in our programming language or our config syntax and stuff has a huge impact on folks like us and how quickly we can come into a situation, assess, figure out what's going on, figure out where things are laid out, and start making changes with confidence.Corey: Forget other people for a minute they're looking at what I built out three or four years ago here, myself, like, I look at past me, it's like, “What was that rat bastard thinking? This is awful.” And it's—forget other people's code; hell is your own code, on some level, too, once it's slipped out of the mental stack and you have to re-explore it and, “Oh, well thank God I defensively wound up not including any comments whatsoever explaining what the living hell this thing was.” It's terrible. But you're right, the other people's shell scripts are finicky and odd.I started poking around for help when I got stuck on something, by looking at GitHub, and a few bit of searching here and there. Even these large, complex, well-used projects started making sense to me in a way that I very rarely find. 
It's, “What the hell is that thing?” is my most common refrain when I'm looking at other people's code, and Go for whatever reason avoids that, I think because it is so prescriptive about formatting, about how things should be done, about the vision that it has. Maybe I'm romanticizing it and I'll hate it a week from now, and I'll want to go back and remove this recording, but. Amy: The size of the language helps a lot. Corey: Yeah. Amy: But probably my favorite. It's more of a convention, which is actually funny the way I'm going to talk about this because the two languages I work on the most right now are Ruby and Go. And I don't feel like two languages could really be more different. Syntax-wise, they share some things, but really, like, the mental models are so very, very different. Ruby is all the way in on object-oriented programming, and, like, the actual real kind of object-oriented with messaging and stuff, and, like, the whole language kind of springs from that. And it kind of requires you to understand all of these concepts very deeply to be effective in large programs. So, what I find is, when I approach a Ruby codebase, I have to load all this crap into my head and remember, “Okay, so yeah, there's this convention, when you do this kind of thing in Ruby”—or especially Ruby on Rails is even worse because they go deep into convention over configuration. But what that's code for is, this code is accessible to people who have a lot of free cognitive capacity to load all this convention into their heads and keep it in their heads so that the code looks pretty, right? And so, that's the trade-off, is you said, okay, my developers have to be these people with all these spare brain cycles to understand, like, why I would put the code here in this place versus this place? And all these, like, things that are in the code, like, very compact, dense concepts. And then you go to something like Go, which is, like, “Nah, we're not going to do Lambdas.
Nah”—[laugh]—“We're not doing all this fancy stuff.” So, everything is there on the page.This drives some people crazy, right, is that there's all this boilerplate, boilerplate, boilerplate. But the reality is, I can read most Go files from top to the bottom and understand what the hell it's doing, whereas I can go sometimes look at, like, a Ruby thing, or sometimes Python and e—Perl is just [unintelligible 00:32:19] all the time, right, it's there's so much indirection. And it just be, like, “What the [BLEEP] is going on? This is so dense. I'm going to have to sit down and write it out in longhand so I can understand what the developer was even doing here.” And—Corey: Well, that's why I got the Mac Studio; for when I'm not doing A/V stuff with it, that means that I'll have one core that I can use for, you know, front-end processing and the rest, and the other 19 cores can be put to work failing to build Nokogiri in Ruby yet again.Amy: [laugh].Corey: I remember the travails of working with Ruby, and the problem—I have similar problems with Python, specifically in that—I don't know if I'm special like this—it feels like it's a SRE DevOps style of working, but I am grabbing random crap off a GitHub constantly and running it, like, small scripts other people have built. And let's be clear, I run them on my test AWS account that has nothing important because I'm not a fool that I read most of it before I run it, but I also—it wants a different version of Python every single time. It wants a whole bunch of other things, too. And okay, so I use ASDF as my version manager for these things, which for whatever reason, does not work for the way that I think about this ergonomically. Okay, great.And I wind up with detritus scattered throughout my system. 
It's, “Hey, can you make this reproducible on my machine?” “Almost certainly not, but thank you for asking.” It's like ‘Step 17: Master the Wolf' level of instructions.

Amy: And I think Docker generally… papers over the worst of it, right, is when we built all this stuff in the aughts, you know, [CPAN 00:33:45]—

Corey: Dev containers and VS Code are very nice.

Amy: Yeah, yeah. You know, like, we had CPAN back in the day, I was doing chroots, I think in, like, '04 or '05, you know, to solve this problem, right, which is basically I just—screw it; I will compile an entire distro into a directory with a Perl and all of its dependencies so that I can isolate it from the other things I want to run on this machine and not screw up and not have these interactions. And I think that's kind of what you're talking about is, like, the old model, when we deployed servers, there was one of us sitting there and then we'd log into the server and be like, I'm going to install the Perl. You know, I'll compile it into, like, [/app/perl 558 00:34:21] whatever, and then I'll CPAN all this stuff in, and I'll give it over to the developer, tell them to set their shebang to that and everything just works. And now we're in a mode where it's like, okay, you got to set up a thousand of those. “Okay, well, I'll make a tarball.” [laugh]. But it's still like we had to just—

Corey: DevOps, but [unintelligible 00:34:37] dev closer to ops. You're interrelating all the time. Yeah, then Docker comes along, and dev is, like, “Well, here's the container. Good luck, asshole.” And it feels like it's been cast into your yard to worry about.

Amy: Yeah, well, I mean, that's just kind of business, or just—

Corey: Yeah. Yeah.

Amy: I'm not sure if it's business or capitalism or something like that, but just the idea that, you know, if I can hand off the shitty work to some other poor schlub, why wouldn't I? I mean, that's most folks, right? Like, just be like, “Well”—

Corey: Which is fair.

Amy: —“I got it working.
Like, my part is done, I did what I was supposed to do.” And now there's a lot of folks out there, that's how they work, right? “I hit done. I'm done. I shipped it. Sure. It's an old [unintelligible 00:35:16] Ubuntu. Sure, there's a bunch of shell scripts that rip through things. Sure”—you know, like, I've worked on repos where there's hundreds of things that need to be addressed.

Corey: And passing to someone else is fine. I'm thrilled to do it. Where I run into problems with it is where people assume that well, my part was the hard part and anything you schlubs do is easy. I don't—

Amy: Well, that's the underclass. Yeah. That's—

Corey: Forget engineering for a second; I throw things to the people over in the finance group here at The Duckbill Group because those people are wizards at solving for this thing. And it's—

Amy: Well, that's how we want to do things.

Corey: Yeah, specialization works.

Amy: But we have this—it's probably more cultural. I don't want to pick, like, capitalism to beat on because this is really, like, human cultural thing, and it's not even really particularly Western. Is the idea that, like, “If I have an underclass, why would I give a shit what their experience is?” And this is why I say, like, ops teams, like, get out of here because most ops teams, the extant ops teams are still called ops, and a lot of them have been renamed SRE—but they still do the same job—are an underclass. And I don't mean that those people are below us. People are treated as an underclass, and they shouldn't be. Absolutely not.

Corey: Yes.

Amy: Because the idea is that, like, well, I'm a fancy person who writes code at my ivory tower, and then it all flows down, and those people, just faceless people, do the deployment stuff that's beneath me. That attitude is the most toxic thing, I think, in tech orgs to address.
Like, if you're trying to be like, “Well, our reliability is bad, we have security problems, people won't fix their code.” And go look around and you will find people that are treated as an underclass that are given code thrown over the wall at them and then they just have to toil through and make it work. I've worked on that a number of times in my career.

And I think just like saying, underclass, right, or caste system, is what I found is the most effective way to get people actually thinking about what the hell is going on here. Because most people are just, like, “Well, that's just the way things are. It's just how we've always done it. The developers write the code, then give it to the sysadmins. The sysadmins deploy the code. Isn't that how it always works?”

Corey: You'd really like to hope, wouldn't you?

Amy: [laugh]. Not me. [laugh].

Corey: Again, the way I see it is, in theory—in theory—sysadmins, ops, or that should not exist. People should theoretically be able to write code as developers that just works, the end. And write it correct the first time and never have to change it again. Yeah. There's a reason that I always like to call staging environments in places I work ‘theory' because it works in theory, but not in production, and that is fundamentally the—like, that entire job role is the difference between theory and practice.

Amy: Yeah, yeah. Well, I think that's the problem with it. We're already so disconnected from the physical world, right? Like, you and I right now are talking over multiple strands of glass and digital transcodings and things right now, right? Like, we are detached from the physical reality.

You mentioned earlier working in data centers, right? The thing I miss about it is, like, the physicality of it. Like, actually, like, I held a server in my arms and put it in the rack and slid it into the rails. I plugged into power myself; I pushed the power button myself. There's a server there.
I physically touched it.

Developers who don't work in production, we talked about empathy and stuff, but really, I think the big problem is when they work out in their idea space and just writing code, they write the unit tests, if we're very lucky, they'll write a functional test, and then they hand that wad off to some poor ops group. They're detached from the reality of operations. It's not even about accountability; it's about experience. The ability to see all of the weird crap we deal with, right? You know, like, “Well, we pushed the code to that server, but there were three bit flips, so we had to do it again. And then the other server, the disk failed. And on the other server…” You know? [laugh].

It's just, there's all this weird crap that happens, these systems are so complex that they're always doing something weird. And if you're a developer that just spends all day in your IDE, you don't get to see that. And I can't really be mad at those folks, as individuals, for not understanding our world. I figure out how to help them, and the best thing we've come up with so far is, like, well, we start giving this—some responsibility in a production environment so that they can learn that. People do that, again, is another one that can be done wrong, where it turns into kind of a forced empathy.

I actually really hate that mode, where it's like, “We're forcing all the developers online whether they like it or not. On-call whether they like it or not because they have to learn this.” And it's like, you know, maybe slow your roll a little buddy because the stuff is actually hard to learn. Again, minimizing how hard ops work is. “Oh, we'll just put the developers on it. They'll figure it out, right? They're software engineers. They're probably smarter than you sysadmins.” Is the unstated thing when we do that, right? When we throw them in the pit and be like, “Yeah, they'll get it.” [laugh].

Corey: And that was my problem [unintelligible 00:39:49] the interview stuff.
It was in the write code on a whiteboard. It's, “Look, I understood how the system fundamentally worked under the hood.” Being able to power my way through to get to an outcome even in language I don't know, was sort of part and parcel of the job. But this idea of doing it in artificially constrained environment, in a language I'm not super familiar with, off the top of my head, it took me years to get to a point of being able to do it with a Bash script because who ever starts with an empty editor and starts getting to work in a lot of these scenarios? Especially in an ops role where we're not building something from scratch.

Amy: That's the interesting thing, right? In the majority of tech work today—maybe 20 years ago, we did it more because we were literally building the internet we have today. But today, most of the engineers out there working—most of us working stiffs—are working on stuff that already exists. We're making small incremental changes, which is great that's what we're doing. And we're dealing with old code.

Corey: We're gluing APIs together, and that's fine. Ugh. I really want to thank you for taking so much time to talk to me about how you see all these things. If people want to learn more about what you're up to, where's the best place to find you?

Amy: I'm on Twitter every once in a while as @MissAmyTobey, M-I-S-S-A-M-Y-T-O-B-E-Y. I have a blog I don't write on enough. And there's a couple things on the Equinix Metal blog that I've written, so if you're looking for that. Otherwise, mainly Twitter.

Corey: And those links will of course be in the [show notes 00:41:08]. Thank you so much for your time. I appreciate it.

Amy: I had fun. Thank you.

Corey: As did I. Amy Tobey, Senior Principal Engineer at Equinix. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or on the YouTubes, smash the like and subscribe buttons, as the kids say.
Whereas if you've hated this episode, same thing, five-star review all the platforms, smash the buttons, but also include an angry comment telling me that you're about to wind up subpoenaing a copy of my shell script because you're convinced that your intellectual property and secrets are buried within.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Hacking Humans
Agile Software Development Method (noun) [Word Notes]

Hacking Humans

Play Episode Listen Later May 3, 2022 7:15


A software development philosophy that emphasizes incremental delivery, team collaboration, continual planning, and continual learning. CyberWire Glossary link: https://thecyberwire.com/glossary/agile-software-development Audio reference link: "Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe," John Allspaw and Paul Hammond, 2009 Velocity Conference, YouTube, 25 June 2009.

Word Notes
DevOps (noun)

Word Notes

Play Episode Listen Later May 3, 2022 7:00


The set of people, process, technology, and cultural norms that integrates software development and IT operations into a system-of-systems. CyberWire Glossary link: https://thecyberwire.com/glossary/devops Audio reference link: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr," by John Allspaw and Paul Hammond, Velocity 09, 25 June 2009.

Word Notes
Agile Software Development Method (noun)

Word Notes

Play Episode Listen Later Apr 19, 2022 7:15


A software development philosophy that emphasizes incremental delivery, team collaboration, continual planning, and continual learning  CyberWire Glossary link: https://thecyberwire.com/glossary/agile-software-development Audio reference link: "Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe," John Allspaw and Paul Hammond, 2009 Velocity Conference, YouTube, 25 June 2009.

Data Mesh Radio
#43 Applying Resilience Engineering Practices to Scale Data Sharing - Interview w/ Tim Tischler

Data Mesh Radio

Play Episode Listen Later Mar 18, 2022 75:08


Provided as a free resource by DataStax https://www.datastax.com/products/datastax-astra?utm_source=DataMeshRadio (AstraDB) https://www.patreon.com/datameshradio (Patreon) In this episode, Scott interviewed Tim Tischler, Principal Engineer at Wayfair. Prior to Wayfair, Tim worked as a Site Reliability Champion at New Relic and is well known in the "human factors" and resilience engineering space. Per Tim, our current work culture is overly action-item driven - every meeting must have a set of agenda items generated from it. This prevents people from having learning-focused meetings exclusively designed for context sharing. Humans' brains work differently between learning and fixing mode and we ask totally different questions. To be able to scale our knowledge sharing, we need to have the space to have learning-focused meetings. A good way to center learning-focused meetings, be they "show and tell" or event storming sessions, is via sharing stories - human communication is founded on story sharing through the millennia. Tim's "show and tell" and event storming sessions at Wayfair have had extremely positive reviews so far. Tim sees ticket-based interactions - just throwing requirements on someone's JIRA backlog or similar - as fundamentally flawed. If Team A gives Team B requirements, Team B just looks to close the ticket versus getting both sides in the room to exchange context and have a negotiation. Tim prefers two modes of interactions over ticket systems: #1 - no human-touch, automated interactions, e.g. an API; and #2 - high touch, high context sharing interactions. For resilience engineering specifically, you should apply learnings to each data product AND the mesh as a whole. Part of that is a broad acceptance that you are in a highly dynamic and highly changing org - there will be changes! 
A few anti-patterns to resilience engineering that apply to data mesh are: 1) a hub and spoke relationship model where one person is the key glue - this is bad at a human level and even worse at a technical level :); 2) business leaders pushing for metrics without sharing the specific context as the results end up as completely empty and useless things you are tracking; and 3) not embedding people building platforms into the teams they are building the platform for - they must really understand the workflows. Books/posts/papers mentioned: Blameless PostMortems and a Just Culture by John Allspaw - https://www.etsy.com/codeascraft/blameless-postmortems/ (Link) The Theory of Graceful Extensibility: Basic rules that govern adaptive systems by David D Woods - https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems (Link) The Field Guide to Understanding 'Human Error' by Sidney Dekker - https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/1472439058 (Link) Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/ (https://www.linkedin.com/in/scotthirleman/) If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/ (https://datameshlearning.com/community/) If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see https://docs.google.com/document/d/1WkXLhSH7mnbjfTChD0uuYeIF5Tj0UBLUP4Jvl20Ym10/edit?usp=sharing (here) All music used this episode created by Lesfm (intro includes slight edits by Scott Hirleman): https://pixabay.com/users/lesfm-22579021/ (https://pixabay.com/users/lesfm-22579021/) Data Mesh Radio is brought to you as a community resource by DataStax. 
Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under "add payment"): https://www.datastax.com/products/datastax-astra?utm_source=DataMeshRadio (AstraDB)

The Idealcast with Gene Kim by IT Revolution
Personal DevOps Aha Moments, the Rise of Infrastructure, and the DevOps Enterprise Scenius: Interviews with The DevOps Handbook Coauthors (Part 1 of 2: Patrick Debois and John Willis)

The Idealcast with Gene Kim by IT Revolution

Play Episode Listen Later Dec 16, 2021 139:36


In part one of this two-part episode on The DevOps Handbook, Second Edition, Gene Kim speaks with coauthors Patrick Debois and John Willis about the past, present, and future of DevOps. By sharing their personal stories and experiences, Kim, Debois, and Willis discuss the scenius that inspired the book, and why and how the DevOps movement took hold around the world.

They also examine the updated content in the book, including new case studies, updated metrics, and practices. Finally, they each share the new lessons they have learned since writing the handbook and the future challenges they think DevOps professionals need to solve. Kim will conclude the series in Part 2, where he interviews the remaining two coauthors, Jez Humble and Dr. Nicole Forsgren.

ABOUT THE GUEST(S)

Patrick Debois is considered to be the godfather of the DevOps movement after he coined the term DevOps accidentally in 2008. Through his work, he creates synergies between projects and operations by using Agile techniques in development, project management, and system administration. He has worked in several companies such as Atlassian, Zender, and VRT Media Lab. Currently, he is a Labs Researcher at Snyk and an independent IT consultant.

John Willis is an author and Senior Director of the Global Transformation Office at Red Hat. He has been an active force in the IT management industry for over 35 years. Willis' experience includes being the Director of Ecosystem Development at Docker, the VP of Solutions for Socketplane, and the VP of Training and Services at Opscode. He also founded Gulf Breeze Software, an award-winning IBM business partner, which specializes in deploying Tivoli technology for the enterprise.

Patrick Debois and John Willis are two of five coauthors of The DevOps Handbook along with Gene Kim, Jez Humble, and Nicole Forsgren, PhD.
YOU'LL LEARN ABOUT

The DevOps origin story from coining the term, why it took off, to launching the DevOps Days conference as an offshoot of the Velocity conference.
How people thought of DevOps when it was first presented (their reactions, their mentalities, and their willingness to adopt it).
What has changed in the DevOps world since the first edition of The DevOps Handbook was published.
How the rise of SaaS companies is altering the DevOps world and participating in its evolution, and how building solid relationships with SaaS vendors and communicating comprehensive feedback to them is integral to DevOps.
The significance of speed in changing team dynamics.
Why resilient companies like Google and Amazon engineer chaos, and why companies like Toyota are happy when production stoppages happen.
Why you can't afford to provide a high variety of products if you also offer high product variation.

RESOURCES

Get The DevOps Handbook (Second Edition)
Nudge vs Shove: A Conversation With Richard Thaler
Solaris Zones wiki
Agile Conference in Toronto 2008
Sys Advent article: In Defense of the Modern Day JVM (Java Virtual Machine) by Gene Kim
Mob programming
Breaking Traditional IT Paradigms to... (San Francisco 2015)
Crowdsourcing Technology Governance (Las Vegas 2018)
Laying Down the Tracks for Technical Change at Comcast (Las Vegas 2020)
10+ Deploys Per Day by John Allspaw and Paul Hammond
10+ Deploys Per Day
How chaos engineering works at Vanguard
Patrick DeBois tweet mapping out all the failure modes of an online conference.
Jesse Robbins LinkedIn
Jesse Robbins on Twitter
How A Hotel Company Ran $30B of Revenue In Containers (Las Vegas 2020) by Dwayne Holmes
Google Cloud Certified Fellow Program
Operations is a competitive advantage… (Secret Sauce for Startups!)
Love Letter To Conferences (And What Makes Some Truly Amazing) by Gene Kim
Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results by Mike Rother
Profound podcast by John Willis
Ben Rockwood on Twitter
Luke Kanies on LinkedIn
DevOps 2020 - The Next Decade (London 2020)
Beyond the Phoenix Project: The Origins and Evolution of DevOps by Gene Kim and John Willis
The Goal: A Process of Ongoing Improvement by Eliyahu M. Goldratt and Jeff Cox
The Convergence Of DevOps Operations as a Strategic Weapon by John Willis
Iterative Enterprise SRE Transformation (US 2021)

TIMESTAMPS

[00:00] Intro
[01:18] What's new and improved in the second edition of the DevOps handbook
[03:56] Meet Patrick DeBois
[10:35] How faster technology made ideas like DevOps possible
[18:11] The myths and inefficiencies of team autonomy
[20:04] What the first DevOps days were like
[27:59] Different opinions between the dev community and ops community
[30:49] Mob programming and the future of collaboration
[39:31] Two surprising things Patrick learned about DevOps
[47:20] Patrick DeBois' favorite DevOps patterns
[51:28] How fear of not delivering on time can mask technical errors
[59:45] What Patrick DeBois is working on these days
[1:04:38] What was expanded in the second edition of the DevOps handbook
[1:06:30] How Gene Kim entered the DevOps world.
[1:07:38] Meet John Willis
[1:10:42] Why the DevOps movement took off
[1:16:00] Mastering production disasters
[1:23:32] The birth of the DevOps Days conference
[1:37:37] Feelings of belonging and connection in a conference
[1:41:29] A few clarifications
[1:49:32] Two of the greatest DevOps open spaces
[1:52:40] The difference between variety and variation (the cost of knowledge work).
[2:07:12] Why you should want more stoppages in your production line
[2:10:16] John Willis' two favorite DevOps case studies
[2:18:55] Outro

Break Things On Purpose
Mandi Walls

Break Things On Purpose

Play Episode Listen Later Dec 14, 2021 36:53


In this episode, we cover:
00:00:00 - Introduction
00:04:30 - Early Dark Days in Chaos Engineering and Reliability
00:08:27 - Anecdotes from the “Long Dark Time”
00:16:00 - The Big Changes Over the Years
00:20:50 - Mandi's Work at PagerDuty
00:27:40 - Mandi's Tips for Better DevOps
00:34:15 - Outro

Links:
PagerDuty: https://www.pagerduty.com

Transcript

Jason: — hilarious or stupid?

Mandi: [laugh]. I heard that; I listened to the J. Paul Reed episode and I was like, “Oh, there's, like, a little, like, cold intro.” And I'm like, “Oh, okay.”

Jason: Welcome to Break Things on Purpose, a podcast about reliability and learning from failure. In this episode, we take a trip down memory lane with Mandi Walls to discuss how much technology, reliability practices, and chaos engineering has evolved over her extensive career in technology.

Jason: Everybody, welcome to the show, Julie Gunderson, who recently joined Gremlin on the developer advocacy team. How's it going, Julie?

Julie: Great, Jason. Really excited to be here.

Jason: So, Mandi is actually a guest of yours. I mean, we both have been friends with Mandi for quite a while but you had the wonderful opportunity of working with Mandi.

Julie: I did, and I was really excited to have her on our podcast now as we ran a podcast together at PagerDuty when we worked there. Mandi has such a wealth of knowledge that I thought we should have her share it with the world.

Mandi: Oh, no. Okay.

Julie: [laugh].

Jason: “Oh, no?” Well, in that case, Mandi, why don't you—

Mandi: [crosstalk 00:01:28]. I don't know.

Jason: Well, in that case with that, “Oh no,” let's have Mandi introduce herself. [laugh].

Mandi: Yeah hi. So, thanks for having me. I am Mandi Walls. I am currently a DevOps advocate at PagerDuty, Julie's last place of employment before she left us to join Jason at Gremlin.

Julie: And Mandi, we worked on quite a few things over at PagerDuty.
We actually worked on things together, joint projects between Gremlin, when it was just Jason and us where we would run joint workshops to talk about chaos engineering and actually how you can practice your incident response. And I'm sure we'll get to that a little bit later in the episode, but will you kick us off with your background so everybody knows why we're so excited to talk to you today?

Mandi: Oh, goodness. Well, so I feel like I've been around forever. [laugh]. Prior to joining PagerDuty, I spent eight-and-a-half years at Chef Software, doing all kinds of things there, so if I ever trained you on Chef, I hope it was good.

Prior to joining Chef, I was a systems administrator for AOL.com and a bunch of other platform and sites at AOL for a long time. So, things like Moviefone, and the AOL Sports Channel, and dotcom, and all kinds of things. Most of them ran on one big platform because the monolith was a thing. So yeah, my background is largely in operations, and just systems administration on that side.

Jason: I'm laughing in the background because you mentioned Moviefone, and whenever I think of Moviefone, I think of the Seinfeld episode where Kramer decides to make a Moviefone competitor, and it's literally just his own phone number, and people call up and he pretends to be that, like, robotic voice and has people, like, hit numbers for which movie they want to see and hear the times that it's playing. Gives a new meaning to the term on-call.

Mandi: Indeed. Yes, absolutely.

Julie: And I'm laughing just because I recently watched Hackers and, you know, they needed that AOL.com disc.

Mandi: That's one of my favorite movies. Like, it's so ridiculous, but also has so many gems of just complete nonsense in it. Absolutely love Hackers.
“Hack the planet.”

Julie: “Hack the planet.” So, with hacking the planet, Mandi, and your time working at AOL with the monolith, let's talk a little bit because you're in the incident business right now over at PagerDuty, but let's talk about the before times, the before we practiced Chaos Engineering and before we really started thinking about reliability. What was it like?

Mandi: Yeah, so I'll call this the Dark Ages, right? So before the Enlightenment. And, like, for folks listening at home, [laugh] the timeline here is probably—so between two-thousand-and-fi—four, five, and 2011. So, right before the beginning of cloud, right before the beginning of, like, Infrastructure as Code, and DevOps and all those things that's kind of started at, like, the end of my tenure at AOL. So, before that, right—so in that time period, right, like, the web was, it wasn't like it was just getting started, but, like, the Web 2.0 moniker was just kind of getting a grip, where you were going from the sort of generic sites like Yahoo and Yellow Pages and those kinds of things and AOL.com, which was kind of a collection of different community bits and news and things like that, into more personalized experiences, right?

So, we had a lot of hook up with the accounts on the AOL side, and you could personalize all of your stuff, and read your email and do all those things, but the sophistication of the systems that we were running was such that like, I mean, good luck, right? It was migration from commercial Unixes into Linux during that era, right? So, looking at when I first joined AOL, there were a bunch of Solaris boxes, and some SGIs, and some other weird stuff in the data center. You're like, good luck on all that. And we migrated most of those platforms onto Linux at that time; 64 bit. Hurray.

At least I caught that.
And there was an increase in the use of open-source software for big commercial ventures, right, and so less of a reliance on commercial software and caught solutions for things, although we did have some very interesting commercial web servers that—God help them, they were there, but were not a joy, exactly, to work on because the goals were different, right? That time period was a huge acceleration. It was like a Cambrian explosion of software pieces, and tools, and improvements, and metrics, and monitoring, and all that stuff, as well as improvements on the platform side. Because you're talking about that time period is also being the migration from bare metal and, like, ordering machines by the rack, which really only a handful of players need to do that now, and that was what everybody was doing then.

And in through the earliest bits of virtualization and really thinking about only deploying the structures that you needed to meet the needs of your application, rather than saying, “Oh, well, I can only order gear, I can only do my capacity planning once a year when we do the budget, so like, I got to order as much as they'll let me order and then it's going to sit in the data center spinning until I need it because I have no ability to have any kind of elastic capacity.” So, it was a completely, [laugh] completely different paradigm from what things are now. We have so much more flexibility, and the ability to, you know, expand and contract when we need to, and to shape our infrastructures to meet the needs of the application in such a more sophisticated and almost graceful way that we really didn't have then. So, it was like, “Okay, so I'm running these big websites; I've got thousands of machines.” Like, not containers, not services.

Like, there's tens of thousands of services, but there's a thousand machines in one location, and we've got other things spread out.
There's like, six different pods of things in different places and all this other crazy business going on. At the same time, we were also running our own CDN, and like, I totally recommend you never, ever do that for any reason. Like, just—yeah. It was a whole experience and I still sometimes have, like, anxiety dreams about, like, the configuration for some of our software that we ran at that point. And all of that stuff is—it was a long… dark time.

Julie: So, now speaking of anxiety dreams, during that long, dark time that you mentioned, there had to have been some major incidents, something that stands out that you just never want to relive. And, Mandi, I would like to ask you to relive that for us today.

Mandi: [laugh]. Okay, well, okay, so there's two that I always tell people about because they were so horrific in the moment, and they're still just, like, horrible to think about. But, like, the first one was Thanksgiving morning, sometime early in the morning, like, maybe 2 a.m. something like that, I was on call.

I was at my mom's, so at the time, my mom had terrible internet access. And again, this time period don't have a lot of—there was no LTE or any kind of mobile data, right? So, I'm, like, on my mom's, like, terrible modem. And something happened to the database behind news.aol.com—which was kind of a big deal at the time—and unfortunately, we were in the process of, like, migrating off of one kind of database onto another kind of database.

News was on the target side but, like, the actual platform that we were planning to move to for everything else, but the [laugh] database on-call, the poor guy was only trained up in the old platform, so he had no idea what was going on. And yeah, we were on that call—myself, my backup, the database guy, the NOC analyst, and a handful of other people that we could get hold of—because we could not get into touch with the team lead for the new database platform to actually fix things. And that was hours.
Like, I missed Thanksgiving dinner. So, my family eats Thanksgiving at midday rather than in the evening. So, that was a good ten hour call. So, that was horrifying.

The other one wasn't quite as bad as that, but like, the interesting thing about the platform we were running at the time was it was AOLserver, don't even look it up. Like, it was just crazytown. And it was—some of the interesting things about it was you could actually get into the server platform and dig around in what the threads were doing. Each of the servers had, like, a control port on it and I could log into the control port and see what all the requests were doing on each thread that was live. And we had done a big push of a new release of dotcom onto that platform, and everything fell over.

And of course, we've got, like, sites in half a dozen different places. We've got, you know, distributed DNS that's, like, trying to throw traffic between different locations as they fall over. So, I'm watching, like, all of these graphs oscillate as, like, traffic pours out of the [Secaucus 00:11:10] or whatever we were doing, and into Mountain View or something and, like, then all the machines in the Secaucus recover. So, then they start pinging and traffic goes back, and, like, they just fall over, over and over again. So, what happened there was we didn't have enough threads configured in the server for the new time duration for the requests, so we had to, like, just boosted up all of the threads we could handle and then restart all of the applications. But that meant pushing out new config to all the thousands of servers that were in the pool at the time and then restarting all of them. So, that was exciting. That was the outage that I learned that the CTO knew how to call my desk. So, highly don't recommend that. But yeah, it was an experience. So.

Julie: So, that's really interesting because there's been so many investments now in reliability.
And when we talk about the Before Times when we had to cap our text messages because they cost us ten cents a piece, or when we were using those AOL discs, the thought was there; we wanted to make that user experience better. And you brought up a couple of things, you know, you were moving to those more personalized experiences, you were migrating those platforms, and you actually talked about your metrics and monitoring. And I'd like to dig in a little on that and see, how did that help you during those incidents? And after those incidents, what did you do to ensure that these types of incidents didn't occur again in the future?Mandi: Yeah, so one of the interesting things about, you know, especially that time period was that the commercially available solutions, even some of the open-source solutions were pretty immature at that time. So, AOL had an internally built solution that was fascinating. And it's unfortunate that they were never able to open-source it because it would have been something interesting to sort of look at. Scale of it was just absolutely immense. But the things that we could look at the time to sort of give us, you know, an indication of something, like, an AOL.com, it's kind of a general purpose website; a lot of different people are going to go there for different reasons.It's the easiest place for them to find their email, it's the easiest place for them to go to the news, and they just kind of use it as their homepage, so as soon as traffic starts dropping off, you can start to see that, you know, maybe there's something going on and you can pull up sort of secondary indicators for things like CPU utilization, or memory exhaustion, or things like that. 
Some of the other interesting things that would come up there is, like, for folks who are sort of intimately tied to these platforms for long periods of time, to get to know them as, like, their own living environment, something like—so all of AOL's channels at the time were on a single platform.—like, hail to the monolith; they all live there—because it was all linked into one publishing site, so it made sense at the time, but like, oh, my goodness, like, scaling for the combination of entertainment plus news plus sports plus all the stuff that's there, there's 75 channels at one time, so, like, the scaling of that is… ridiculous.But you could get a view for, like, what people were actually doing, and other things that were going on in the world. So like, one summer, there were a bunch of floods in the Midwest and you could just see the traffic bottom out because, like, people couldn't get to the internet. So, like, looking at that region, there's, like, a 40% drop in the traffic or whatever for a few days as people were not able to be online. Things like big snowstorms where all the kids had to stay home and, like, you get a big jump in the traffic and you get to see all these things and, like, you get to get a feel for more of a holistic attachment or holistic relationship with a platform that you're running. It was like it—they are very much a living creature of their own sort of thing.Like, I always think of them as, like, a Kraken or whatever. 
Like, something that's a little bit menacing, you don't really see all of it, and there's a lot of things going on in the background, but you can get a feel for the personality and the shape of the behaviors, and knowing that, okay, well, now we have a lot of really good metrics to say, “All right, that one 500 error, it's kind of sporadic, we know that it's there, it's not a huge deal.” Like, we did not have the sophistication of tooling to really be able to say that quantitatively, like, and actually know that but, like, you get a feel for it. It's kind of weird. Like, it's almost like you're just kind of plugged into it yourself.
It's like the scene in The Matrix where the operator guy is like, “I don't even see the text anymore.” Right? Like, he's looking directly into the matrix. And you can, kind of like—you spend a lot of time with [laugh] those applications, you get to know how they operate, and what they feel like, and what they're doing. And I don't recommend it to anyone, but it was absolutely fascinating at the time.
Julie: Well, it sounds like it. I mean, anytime you can relate anything to The Matrix, it is going to be quite an experience. With that said, though, and the fact that we don't operate in these monolithic environments anymore, how have you seen that change?
Mandi: Oh, it's so much easier to deal with. Like I said, like, your monolithic application, especially if there are lots of different and diverse functionalities in it, like, it's impossible to deal with scaling them. And figuring out, like, okay, well, this part of the application is memory-bound, and here's how we have to scale for that; and this part of the application is CPU-bound; and this part of the application is I/O bound. 
And, like, peeling all of those pieces apart so that you can optimize for all of the things that the application is doing in different ways when you need to make everything so much smoother and so much more efficient, across, like, your entire ecosystem over time, right?Plus, looking at trying to navigate the—like an update, right? Like, oh, you want to do an update to your next version of your operating system on a monolith? Good luck. You want to update the next version of your runtime? Plug and pray, right? Like, you just got to hope that everybody is on board.So, once you start to deconstruct that monolith into pieces that you can manage independently, then you've got a lot more responsibility on the application teams, that they can see more directly what their impacts are, get a better handle on things like updates, and software components, and all the things that they need independent of every other component that might have lived with them in the monolith. Noisy neighbors, right? Like, if you have a noisy neighbor in your apartment building, it makes everybody miserable. Let's say if you have, like, one lagging team in your monolith, like, nobody gets the update until they get beaten into submission.Julie: That is something that you and I used to talk about a lot, too, and I'm sure that you still do—I know I do—was just the service ownership piece. Now, you know who owns this. Now, you know who's responsible for the reliability.Mandi: Absolutely.Julie: You know, I'm thinking back again to these before times, when you're talking about all of the bare metal. Back then, I'm sure you probably didn't pull a Jesse Robbins where you went in and just started unplugging cords to see what happened, but was there a way that AOL practiced Chaos Engineering with maybe not calling it that?Mandi: It's kind of interesting. 
Like, watching the evolution of Chaos Engineering from the early days when Netflix started talking about it and, like, the way that it has emerged as being a more deliberate practice, like, I cannot say that we ever did any of that. And some of the early internet culture, right, is really built off of telecom, right? It was modem-based; people dialed into your POP, and like, that was the reliability they were expecting was very similar to what they expect out of a telephone, right? Like, the reason we have, like, five nines as a thing is because you want to pick up dial tone, and—pick up your phone and get dial tone on your  line 99.999% of the time.Like, it has nothing to do with the internet. It's like 1970s circuits with networking. For part of that reason, like, a lot of the way things were built at that time—and I can't speak for Yahoo, although I suspect they had a very similar setup—that we had a huge integration environment. It's completely insane to think now that you would build an integration environment that was very similar in scope and scale to your production environment; simply does not happen. But for a lot of the services that we had at that time, we absolutely had an integration environment that was extraordinarily similar.You simply don't do that anymore. Like, it's just not part of—it's not cost effective. And it was only cost effective at that time because there wasn't anything else going on. Like, you had, like, the top ten sites on the internet, and AOL was, like, number three at the time. So like, that was just kind of the way things are done.So, that was kind of interesting and, like, figuring out that you needed to do some kind of proactive planning for what would happen just wasn't really part of the culture at the time. 
Like, we did have a NOC and we had some amazing engineers on the NOC that would help us out and do some of the things that we automate now: putting a call together, or when paging other folks into an incident, or helping us with that kind of response. I don't ever remember drilling on it, right, like we do. Like, practicing that, pulling a game day, having, like, an actual plan for your reliability along those lines.Julie: Well, and now I think that yeah, the different times are that the competitive landscape is real now—Mandi: Yeah, absolutely.Julie: And it was hard to switch from AOL to something else. It was hard to switch from Facebook to MySpace—or MySpace to Facebook, I should say.Mandi: Yeah.Julie: I know that really ages me quite a bit.Mandi: [laugh].Julie: But when we look at that and when we look at why reliability is so important now, I think it's because we've drilled it into our users; the users have this expectation and they aren't aware of what's happening on the back end. They just kn—Mandi: Have no idea. Yeah.Julie: —just know that they can't deposit money in their bank, for example, or play that title at Netflix. And you and I have talked about this when you're on Netflix, and you see that, “We can't play this title right now. Retry.” And you retry and it pops back up, we know what's going on in the background.Mandi: I always assume it's me, or, like, something on my internet because, like, Netflix, they [don't ever 00:21:48] go down. But, you know, yeah, sometimes it's [crosstalk 00:21:50]—Julie: I just always assume it's J. Paul doing some chaos engineering experiments over there. But let's flash forward a little bit. I know we could spend a lot of time talking about your time at Chef, however, you've been over at PagerDuty for a while now, and you are in the incident response game. You're in that lowering that Mean Time to Identification and Resolution. And that brings that reliability piece back together. 
Do you want to talk a little bit about that?Mandi: One of the things that is interesting to me is, like, watching some of these slower-moving industries as they start to really get on board with cloud, the stairstep of sophistication of the things that they can do in cloud that they didn't have the resources to do when they were using their on-premises data center. And from an operation standpoint, like, being able to say, “All right, well, I'm going from, you know, maybe not bare metal, but I've got, like, some kind of virtualization, maybe some kind of containerization, but like, I also own the spinning disks, or whatever is going on there—and the network and all those things—and I'm putting that into a much more flexible environment that has modern networking, and you know, all these other elastic capabilities, and my scaling and all these things are already built in and already there for me.” And your ability to then widen the scope of your reliability planning across, “Here's what my failure domains used to look like. Here's what I used to have to plan for with thinking about my switching networks, or my firewalls, or whatever else was going on and, like, moving that into the cloud and thinking about all right, well, here's now, this entire buffet of services that I have available that I can now think about when I'm architecting my applications for the cloud.” And that, just, expanded reliability available to you is, I think, absolutely amazing.Julie: A hundred percent. And then I think just being able to understand how to respond to incidents; making sure that your alerting is working, for example, that's something that we did in that joint workshop, right? We would teach people how to validate their alerting and monitoring, both with PagerDuty and Gremlin through the practice of incident response and of chaos engineering. 
And I know that one of the practices at PagerDuty is Failure Fridays, and having those regular game days that are scheduled are so important to ensuring the reliability of the product. I mean, PagerDuty has no maintenance windows, correct?Mandi: No that—I don't think so, right?Julie: Yeah. I don't think there's any planned maintenance windows, and how do we make sure for organizations that rely on PagerDuty—Mandi: Mm-hm.Julie: —that they are one hundred percent reliable?Mandi: Right. So, you know, we've got different kinds of backup plans and different kinds of rerouting for things when there's some hiccup in the platform. And for things like that, we have out of band communications with our teams and things like that. And planning for that, having that game day to just be able to say—well, it gives you context. Being able to say, “All right, well, here's this back-end that's kind of wobbly. Like, this is the thing we're going to target with our experiments today.”And maybe it's part of the account application, or maybe it's part of authorization, or whatever it is; the team that worked on that, you know, they have that sort of niche view, it's a little microcosm, here's a little thing that they've got and it's their little widget. And what that looks like then to the customer, and that viewpoint, it's going to come in from somewhere else. So, you're running a Failure Friday; you're running a game day, or whatever it is, but including your customer service folks, and your front-end engineers, and everyone else so that, you know, “Well, hey, you know, here's what this looks like; here's the customers' report for it.” And giving you that telemetry that is based on customer experience and your actual—what the business looks like when something goes wrong deep in the back end, right, those deep sea, like, angler fish in the back, and figuring out what all that looks like is an incredible opportunity. 
Like, just being able to know that what's going to happen there, what the interface is going to look like, what things don't load, when things take a long time, what your timeouts look like, did you really even think about that, but they're cascading because it's actually two layers back, or whatever you're working on, like that kind of insight, like, is so valuable for your application engineers as they're improving all the pieces of architecture, whether it's the most front-end user-facing things, or in the deep back-end that everybody relies on.Julie: Well, absolutely. And I love that idea of bringing in the different folks like the customer service teams, the product managers. I think that's important on a couple of levels because not only are you bringing them into this experience so they're understanding the organization and how folks operate as a whole, but you're building that culture, that failure is acceptable and that we learn from our failures and we make our systems more resilient, which is the entire goal.Mandi: The goal.Julie: And you're sharing the learning. When we operate in silos—which even now as much as we talk about how terrible it is to be in siloed teams and how we want to remove silos, it happens. Silos just happen. And when we can break down those barriers, any way that we can to bring the whole organization in, I think it just makes for a stronger organization, a stronger culture, and then ultimately a stronger product where our customers are living.Mandi: Yeah.Julie: Now, I really do want to ask you a couple of things for some fun here. But if you were to give one tip, what is your number one tip for better DevOps?Mandi: Your DevOps is always going to be—like, I'm totally on board with John Wallace's [CAMS 00:27:57] to, like, move to CALMS sort of model, right? So, you've got your culture, your automation, your learning, your metrics, and your sharing. 
For better DevOps, I think one of the things that's super important—and, you know, you and I have hashed this out in different things that we've done—we hear about it in other places, is definitely having empathy for the other folks in your organization, for the work that they're doing, and the time constraints that they're under, and the pressures that they're feeling. Part of that then sort of rolls back up to the S part of that particular model, the sharing. Like, knowing what's going on, not—when we first started out years ago doing sort of DevOps consulting through Chef, like, one of the things we would occasionally run into is, like, you'd ask people where their dashboards were, like, how are they finding out, you know, what's going on, and, like, the dashboards were all hidden and, like, nobody had access to them; they were password protected, or they were divided up by teams, like, all this bonkers nonsense.And I'm like, “You need to give everybody a full view, so that they've all got a 360 view when they're making decisions.” Like you mentioned your product managers as part of, like, being part of your practice; that's absolutely what you want. They have to see as much data as your applications engineers need to see. Having that level of sharing for the data, for the work processes, for the backlog, you know, the user inputs, what the support team is seeing, like, you're getting all of this input, all this information, from everywhere in your ecosystem and you cannot be selfish with it; you cannot hide it from other people.Maybe it doesn't look as nice as you want it to, maybe you're getting some negative feedback from your users, but pass that around, and you ask for advice; you ask for other inputs. How are we going to solve this problem? And not hide it and feel ashamed or embarrassed. We're learning. 
All this stuff is brand new, right?Like, yeah, I feel old talking about AOL stuff, but, like, at the same time, like, it wasn't that long ago, and we've learned an amazing amount of things in that time period, and just being able to share and have empathy for the folks on your team, and for your users, and the other folks in your ecosystem is super important.Julie: I agree with that. And I love that you hammer down on the empathy piece because again, when we're working in ones and zeros all day long, sometimes we forget about that. And you even mentioned at the beginning how at AOL, you had such intimate knowledge of these applications, they were so deep to you, sometimes with that I wonder if we forget a little bit about the customer experience because it's something that's so close to us; it's a feature maybe that we just believe in wholeheartedly, but then we don't see our customers using it, or the experience for them is a little bit rockier. And having empathy for what the customer may go through as well because sometimes we just like to think, “Well, we know how it works. You should be able to”—Mandi: Yes.Julie: Yes. And, “They're definitely not going to find very unique and interesting ways to break my thing.” [laugh].Mandi: [laugh]. No, never.Julie: Never.Mandi: Never.Julie: And then you touched on sharing and I think that's one thing we haven't touched on yet, but I do want to touch on a little bit. Because with incident—with incident response, with chaos engineering, with the learning and the sharing, you know, an important piece of that is the postmortem.Mandi: Absolutely.Julie: And do you want to talk a little bit about the PagerDuty view, your view on the postmortems?Mandi: As an application piece, like, as a feature, our postmortem stuff is under review. But as a practice, as a thing that you do, like, a postmortem is an—it should be an active word; like, it's a verb, right? 
You hol—and if you want to call it a post-incident review, or whatever, or post-incident retrospective, if you're more comfortable with those words, like that's great, and that's—as long as you don't put a hyphen in postmortem, I don't care. So, like—
Julie: I agree with you. No hyphen—
Mandi: [laugh].
Julie: —please. [laugh].
Mandi: Please, no hyphen. Whatever you want to call that, like, it's an active thing. And you and I have talked a number of times about blamelessness and, like, making sure that what you do with that opportunity, this is—it's a gift, it's a learning opportunity after something happened. And honestly, you probably need to be running them, good or bad, for large things, but if you have a failure that impacted your users and you have this opportunity to sit down and say, all right, here's where things didn't go as we wanted them to, here's what happened, here's where the weaknesses are in our socio-technical systems, whether it was a breakdown in communication, or breakdown in documentation, or, like, we found a bug or, you know, [unintelligible 00:32:53] defect of some kind, like, whatever it is, taking that opportunity to get that view from as many people as possible is super important.
And they're hard, right? And, like, we—John Allspaw, on our podcast, right, last year talked a bit about this. And, like, there's a tendency to sort of write the postmortem and put it on a shelf like it's, like, in a museum or whatever. They are hopefully, like, they're learning documents that are things that maybe you have your new engineers sort of review to say, “Here's a thing that happened to us. What do you think about this?” Like, maybe having, like, a postmortem book club or something internally so that the teams that weren't maybe directly involved have a chance to really think about what they can learn from another application's learning, right, what opportunities are there for whatever has transpired? 
So, one of the things that I will say about that is like they aren't meant to be write-only, right? [laugh]. They're—
Julie: Yeah.
Mandi: They're meant to be an actual living experience and a practice that you learn from.
Julie: Absolutely. And then once you've implemented those fixes, if you've determined the ROI is great enough, validate it.
Mandi: Yes.
Julie: Validate and validate and validate. And folks, you heard it here first on Break Things on Purpose, but the postmortem book club by Mandi Walls.
Mandi: Yes. I think we should totally do it.
Julie: I think that's a great idea. Well, Mandi, thank you. Thank you for taking the time to talk with us. Real quick before we go, did you want to talk a little bit about PagerDuty and what they do?
Mandi: Yes, so Page—everyone knows PagerDuty; you have seen PagerDuty. If you haven't seen PagerDuty recently, it's worth another look. It's not just paging anymore. And we're working on a lot of things to help people deal with unplanned work, sort of all the time, right, or thinking about automation. We have some new features that integrate more with our friends at Rundeck—PagerDuty acquired Rundeck last year—we're bringing out some new integrations there for Rundeck actions and some things that are going to be super interesting for people.
I think by the time this comes out, they'll have been in the wild for a few weeks, so you can check those out. As well as, like, getting better insight into your production platforms, like, with a service graph and other insights there. So, if you haven't looked at PagerDuty in a while or you think about it as being just a place to be annoyed with alerts and pages, definitely worth revisiting to see if some of the other features are useful to you.
Julie: Well, thank you. And thanks, Mandi, and looking forward to talking to you again in the future. And I hope you have a wonderful day.
Mandi: Thank you, Julie. 
Thank you very much for having me.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

Naturalistic Decision Making
Episode #35: Interview with John Allspaw

Naturalistic Decision Making

Play Episode Listen Later Nov 20, 2021 45:05


Date recorded: November 12, 2021
Show Description: Today we welcome John Allspaw. John is an engineering leader and researcher with over 20 years of experience in building and leading teams engaged in software and systems engineering. He is a co-founder of Adaptive Capacity Labs, LLC. Previously, he was Chief Technology Officer at Etsy. He has also worked at Flickr, Friendster, InfoWorld, Salon, Genentech, Volpe National Transportation Center, and a bunch of other places as a consultant from time to time. John has spent the last decade bridging insights from Human Factors, Cognitive Systems Engineering, and Resilience Engineering to the domain of software engineering and operations. His publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement. He holds a Master's degree in Human Factors and Systems Safety from Lund University.
Where to find John: LinkedIn, Twitter
Learn more about NDM: NaturalisticDecisionMaking.org, Journal of Cognitive Engineering and Decision Making
Where to find hosts Brian Moon and Laura Militello: Brian's website, Brian's LinkedIn, Brian's Twitter, Laura's website, Laura's LinkedIn, Laura's Twitter

Screaming in the Cloud
Non-Incidentally Keeping Tabs on the Internet with Courtney Nash

Screaming in the Cloud

Play Episode Listen Later Oct 5, 2021 33:40


About Courtney
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she's held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O'Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.
Links:
Verica: https://www.verica.io
Twitter: https://twitter.com/courtneynash
Email: courtney@verica.io
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in front of your office chair, bleary-eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic-light-colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered, why is it that sales and marketing get all this shiny, awesome analytics and inside tools? Whereas, engineering basically gets left with the dregs. Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack. Including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack but this is 2021. 
They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's Jellyfish.co and tell them Corey sent you! Watch for the wince, that's my favorite part.
Corey: This episode is sponsored in part by our friends at VMware. Let's be honest—the past year has been far from easy. Due to, well, everything. It caused us to rush cloud migrations and digital transformation, which of course means long hours refactoring your apps, surprises on your cloud bill, misconfigurations and headaches for everyone trying to manage disparate and fractured cloud environments. VMware has an answer for this. With VMware multi-cloud solutions, organizations have the choice, speed, and control to migrate and optimize applications seamlessly without recoding, take the fastest path to modern infrastructure, and operate consistently across the data center, the edge, and any cloud. I urge you to take a look at vmware.com/go/multicloud. You know my opinions on multi-cloud by now, but there's a lot of stuff in here that works on any cloud. But don't take it from me, that's VMware.com/go/multicloud and my thanks to them again for sponsoring my ridiculous nonsense.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Periodically, websites like to fall into the sea and explode. And it's sort of a thing that we've accepted happens. Well, most of us have. My guest today is Courtney Nash, Internet Incident Librarian at Verica. Courtney, thank you for joining me.
Courtney: Hi, Corey. 
Thanks so much for having me.Corey: So, I'm going to assume that my intro is somewhat accurate, that we've sort of accepted that sites will crash into the sea, the internet will break, and then everyone tears their hair out and complains on Twitter, assuming that's not the thing that fell over this time—Courtney: [laugh].Corey: —but what does an Internet Incident Librarian do?Courtney: Yeah, I'll come back to the first part about how—some people have accepted it and some people haven't, I think is the interesting part. So technically, I think my official real title is, like, research analyst or something really boring, but I have a background in the cognitive sciences and also in technology, and I'm really—have always been fascinated by how these socio-technical systems work. And so as an Internet Incident Librarian, I am doing a number of things to try to better understand—both for myself and, obviously, the company I work for, but for the industry as a whole—what do we really know about how incidents happen, why they happen, when they happen, and what do we do when they happen? And how do we learn from that? So, one of the first things that I'm doing along those lines is actually collecting a database of all of the public write-ups of incidents that happened at companies that are software-related.So, there's already bodies of work of people who collect airline incidents and other kinds of things. And we don't have that [laugh] as an industry, which I think is—I want to solve that problem because I think other industries that have spent some time introspecting about why things fall down, or when things fall down and how they fall down. Take the airline industry for example; planes don't really fall out of the sky very often.Corey: No. When it does, it makes news and everyone's scared about flying, but at the same time, it's yeah, do you have any idea how many people die in car crashes in a given hour?Courtney: Yeah, yeah. 
And we'll come back to how the media covers things in a minute because that is definitely something I have opinions about. But, I'm not trying to say I want to create the NTSB of the internet; I don't think that's quite the same thing, and I really want something in the spirit of software, and the internet, and open-source that's more collaborative and it's very open to all of us. So, the first step is to just get them in one place. There is no single place where you could go and say, “Oh, where are all of the X incident reports? Where are all the ones that Microsoft's written, and also Amazon, or Google, or, you know, whoever.”
Corey: They have them, but they hide them so thoroughly. It turns out that they don't really put that in big letters on their corporate blog with links to it. And when you look at one incident report, they don't say, “Here, look at our previous incident reports.” They really—
Courtney: Yeah.
Corey: —should but no one does.
Courtney: And I think that's fascinating because there's a precedent. So, there's two precedents, and I just gave you basically one side of the two, which is, the airline industry has done this and it's not like people don't fly, right? So, a lot of internet companies, a lot of software-based companies, seem to be afraid of what their customers, or what the stock market, or what folks will think. Mind you, these are publicly traded [laugh] airline companies. People aren't going to stop using Amazon just because you give more of this information out.
And so I think that piece is—I would love to see that stop being the case. Because the flip side of the coin is that this is a rising tide lifts all boats kind of thing, which granted, not all companies agree on, especially really big ones because their boat's already mowing all the little ones out of the ocean. But that's another story.
Corey: Sure, but also, it's easy to hide an outage. “Our site is down for, you can say, three days. 
Great, if a customer didn't try to access the site at all during those three days, was the site really down in the first place?”

Courtney: Oh, the tree in the forest of internet outages. Yes, it's true, although I think that companies are—they know that people go complain on social media, right? I think there's more and more of that happening now. It's not like you can hide it as easily as you could have before Twitter or Instagram or—

Corey: Right. Whereas a plane falls out of the sky, generally it's one of those things that people notice.

Courtney: Yeah. Even if you weren't interested in that flight at all.

Corey: Right. When it lands in your garden, you sort of have a comment on this.

Courtney: [laugh]. Yeah. Pieces fall out of the sky. That has happened. But I think the other flip side of the coin I already mentioned is the safety of the airline industry has increased so significantly over the past, you know, whatever, 30, 40 years because of this concerted effort.

And the other piece of it, then, as an industry, as technologists, as people who use software to run their businesses, some of those things are now safety-critical. And this comes back to the whole software is running the world now. Planes now actually could fall out of the sky because of software, not just because of hardware failures. And nuclear power plants are [laugh] run by software, and your electronic grid, and your health care systems, heart rate monitors, insulin pumps. There are a lot of really critical things, and now our phone services and our internet stuff is so entwined in our lives, that people can't be on their Zoom calls, people can't run their businesses. So, this stuff has a massive impact on people's lives.
It's no longer just pictures of cats on the internet, which admittedly, we've really honed the machine for that.

Corey: No, but now when software goes down, the biggest arguments people make, the stories people tell is, “Oh, well, it meant that the company lost this much money during that timeframe.” And great, maybe. We can argue about is that really true or is it not? It depends entirely on the company's business model, but I don't like to tend to accept those things at face value. But yeah, that's the small-scale thing, especially when you start getting to these massive platform providers. There are a lot of second and third-order effects that are a lot more interesting slash important to people's lives, than, well, we couldn't show ads to people for an hour and a half.

Courtney: Right. Yes. Absolutely. So, T-Mobile had this outage, what is it, how is time—time is still not working very well, for me. I'm trying to remember if it was earlier this year, or if it was in—it was last year. I think it was 2020. And you're like, T-Mobile, oh okay, whatever. You know, like, cell phones, yadda, yadda. 911 stopped working. [laugh].

And it was a fascinating outage because these are now actually regulated industries that are heavily software-backed. There was a government investigation into that the same way we have NTSB investigations into airline accidents, and they looked at all of those, kind of, second or third-order effects of people who—you know, a grandma who was stranded on the road, people who couldn't call 911, those kinds of things that are really significant impacts on people's lives. And the second-order effect is, oh, yeah, AWS goes down—like you said—and Amazon or people like to say, Jeff Bezos—I guess, now, are they going to complain about how much money Andy loses?
I guess so—but [laugh] what lives on AWS, that's crazy to think about, right?

Corey: Yeah, the more I learn the answer to that question, the more disturbed I become.

Courtney: Well, you'd probably know a better answer to that question [laugh] than a lot of people.

Corey: They have the big companies they can talk about. What's really interesting is the companies that they don't and can't. An easy example: financial services is an industry that is notorious for never granting logo rights. Like, at some point, they'll begrudgingly admit, “Yes, our multinational bank does use computers.” But it's always like pulling teeth, and I get it on some level; the entire philosophy of a lot of these companies is risk-mitigation, rather than growth and advancing the current awareness of knowledge. But it does become a problem.

Courtney: Yeah. It's interesting, I need more data, which we'll get to—help me, people—but I am able to start seeing some of those interesting graphs of, kind of these cascading effects of these kinds of outages. And so I strongly believe that we need to talk about them more, that more companies need to write them up, and publish them, and be a lot more transparent about it. And I think there's a number of companies that are showing the way there that—and it has to do with your first question which is, we've all sort of accepted this, right? But I disagree with that.

I think those of us who are super close to these kinds of complex, dynamic distributed systems totally know that they're going to fail, and that's not shocking, nor the case of incompetence. We are building systems that are so big and so complex, no one person, no 10X engineer out there could possibly model or hold the whole thing in their head. Especially because it's not even just your systems… we were just talking about, right? Your stuff's on GitHub; it's on AWS; there's, like, three other upstream providers; there's this API from over there.
These systems are too intricate, too complex; they're going to fail.

Corey: So, we're back to why all these things failed simultaneously and it comes out it's a Northern woods, middle of nowhere backhoe incident. That's right, if we look at the natural food chain of things, fiber optic cable has a natural predator in the form of a backhoe. To the point where if I'm ever lost in the woods, I will drop a length of fiber, kick some dirt over it, wait a few minutes; a backhoe will be along to sever it. Then I can follow the backhoe back to civilization. They don't teach that one in the Boy Scout manual, but they really should.

Courtney: Yeah. Oh, my gosh. There was a beaver outage in Canada, which is the—[laugh] God, that's the most Canadian thing ever.

Corey: Can you come up with a more Canadian—

Courtney: No.

Corey: —story than that? I would posit you could not, but give it a shot.

Courtney: No, probably not. Anyhoo. So, I think, like I was saying, those of us close to it accept that, understand it, and are trying to now think about, okay, well, how do we change our approach and our philosophy about this, knowing that things will fall down? But I think if you look at a lot of the rest of the world, people are still like, “What are those idiots doing over there? Why did their site fall down?”

Corey: Oh, my God—

Courtney: Right?

Corey: —the general population is the worst on stuff like this. The absolute worst.

Courtney: The media is the worst. [laugh].

Corey: It's, “How did they wind up going down?” “Yeah, because this stuff is complicated.” Back when I was getting started in tech, I thought the whole thing worked on magic, so I started figuring out how different pieces of it worked. And now I'm convinced; it runs on magic. The most amazing thing is this all works together. Because—

Courtney: Yeah.

Corey: —spit and duct tape and baling wire holding this stuff together would be an upgrade from a lot of the stuff that currently exists in the real world.
And it's amazing.

Courtney: I know the secret, Corey. You know what holds it all together?

Corey: Hit me with it. Hope? Tears?

Courtney: People.

Corey: Mmm.

Courtney: Technology is Soylent Green, Corey. It's Soylent Green. It's made of people.

Corey: And that's the thing that always bugs me on Twitter. The whole HugOps movement has it right. When you see a big provider taking an outage, all their competitors are immediately there with, “Man, hope things get back together soon. Best of luck. Let us know if we can help.” And that's super reassuring because today is their outage; tomorrow it's yours.

Courtney: Yep.

Corey: And once in a blue moon, you see someone who's relatively new to the industry start trying to market their stuff based on someone else's outage, and they basically get their butts fed to them, just because it's this—it's not what you do, and it's not how we operate. And it's one of the few moments where I look at this and realize that maybe people's inherent nature isn't all terrible.

Courtney: [laugh]. Oh. Oh, I would hope that would be something that comes out of all of this.

Corey: Yeah.

Courtney: No one goes to work at their day job doing what we do, to suck. [laugh]. Right? To do a bad job.

Corey: Right. Unless you're in Facebook's ethics department, I completely agree with you.

Courtney: Okay. Yes. All right. There are a few caveats to that, probably. But you know, we all want to show up and do good stuff. So, nobody's going in trying to take the site down, barring bad actor stuff that's not relevant.

Corey: When Azure takes an outage, AWS is not sitting there going, “Ah, we're going to win more cloud deals because of this,” because they're smarter than that. It's, no, people are going to look at this and say, “Ah, see. Told you the cloud was dangerous.” It sets the entire industry back.

Courtney: Yeah.
That's why we need to talk about it more, and we need to just normalize that these things happen and that we can all level up as an industry if we get a lot smarter about how we, A) think about that, and B) how we react to them. And we will develop much more useful models of our safety boundaries, right? That's really it. You don't know—no one at any of these companies hardly knows if you're five steps from the cliff, five feet, driving a Ferrari 90 miles an hour towards the edge of it.

Like, we don't know. It's amazing to me just how much in the dark we are as an industry and how much of the world we're running. So, I think this is one tiny, first little step in what could be sort of a sea change about how all of this works. So, that's a big part of why I'm doing what I'm doing.

Corey: Well, let's talk about something else you're doing. So, tell me a little bit about VOID?

Courtney: Yeah. So, that's the first iteration of this. So, it's the Verica Open Incident Database. I feel like I have to say this almost every time: John Allspaw would like me to say that it's the Verica Open Incident Report Database, but VOID is way cooler than—

Corey: VOIRD?

Courtney: VOIRD.

Corey: Yeah, that sounds like you're trying to make fun of someone ineffectively.

Courtney: Yeah. And there's a reason why he's not in marketing. But what this is, is a collection of all of the publicly available incident reports in one place, easily searchable. You can search by company, you can search by technology, you can filter things by the types of, sort of, kinds of failure modes that we're seeing. And it's, I hope, valuable to a wide swath of folks, both technologists and otherwise: researchers, media and press types, analysts, and whatnot.

And my biggest desire is that people will look at it, realize how incomplete it is, and then help me fill it. [laugh]. Help me fill the VOID, people. I think I have right now, at the time we're talking, about 1700, maybe 1800 of these. And they run the gamut.
And I know some people who like to quibble about language—and I am one of those people having been an editor in various flavors of my life—not all of these are what a lot of people directly related to these, sort of, incident management and whatnot would call ‘incident reports.'

I wanted to collect a corpus that reflects all of the public information about software-related incidents. So, it's anything from tweets—either from a company or just from people—to a status page, to a media article, a news article, an online article, to a full-blown deep-dive retrospective or post-mortem from a company that really does go into detail. It's the whole gamut. It's all of those things. I have no opinionated take on that.

I want that all to be available to people. And we've collected some metadata on all of the incidents as well. So, we're collecting the obvious things like when did it happen? What date was it, if we can figure it out, or if it's explicit—how long was it? And those kinds of things and then we collect some metadata, like I said. We add some tags: was this a complete production outage, was it a partial outage? Those kinds of things.

And this is all directly just taken from the language of the report. And we're not trying—like I said—we're trying not to have any sort of really subjective takes on any of that, but a bit of metadata that helps people spelunk some of this stuff. So, if it is the kind of report—these are usually from a status page, or a company post about it—what kinds of things were involved in this outage? So, sometimes you'll get lucky and the company will tell you, “It was DNS,” because, you know, it's always DNS.

Corey: On some level, it always is. That's why—

Courtney: It always is.

Corey: —DNS is my database. It's a database problem.

Courtney: It's a database problem. And sometimes you get even more detail. And so we will put as much of that that's in the report into a set of metadata about these things.
So, I think there's some fascinating, really easy things that I've already seen from some of these data, and we kind of hit on one of these, which is the way that companies themselves talk about these outages versus the way that press and media and other types of organizations talk about these things. So, I think there's a whole bunch of really fascinating analysis that's going to be available to nerdy research-minded type folks like myself.

I think it's a place, though, where technologists can also go and spelunk things that they're interested in, looking for patterns, anything that's really—there's an opportunity for experts in the field to add insights to what we can discern from these public incident reports. They are, like, two orders abstracted from what happened internally, but I think there's still a lot that we can learn from those. So, the first iteration of the VOID will allow people to get a first look at some of the data and to help me, hopefully, add to it, grow that corpus over time, and we'll see where that goes.

This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security.

And - let me be clear here - it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers, needed to support the application that you want to build.

With Always Free you can do things like run small scale applications, or do proof of concept testing without spending a dime. You know that I always like to put asterisks next to the word free.
This is actually free. No asterisk. Start now. Visit https://snark.cloud/oci-free that's https://snark.cloud/oci-free.

Corey: I love the idea of having a centralized place where outages, post-mortems, root cause analyses—I'll let you tear into that in a minute—and other things that are all tied to where can I find a list of outages. Because companies list these on their websites, they put them in blog posts, and it's always very begrudging; they don't link them from any other place, you have to know the magic incantation to find the buried link on their site. Having something that is easily searchable for outages is really something that's kind of valuable.

Courtney: Yeah. And I mean, some of them are like—I'm looking at you, Microsoft—I like you for a lot of reasons, but hey, I have to scroll your status page. I can't link directly to their write-ups, and—this is Azure—and it [laugh] please stop. Make it easier. [laugh]. You're driving me crazy; I don't even have a data model to figure out how to make this work for people, other than, like, taking screenshots of them.

So yeah, so there's shades of grey and black in how much they'll share, or how easy it is to find these things. So, it'll be interesting to see if there's any less-than-positive [laugh] reactions to all of this being available in one place. I'm anticipating at least a little bit of that.

There is one other type of metadata that we collect for the VOID. And that is the type of analysis that is conducted if it is clear what that type of analysis is. And there, some companies explicitly say, or call it an RCA, “We did a Root Cause Analysis.” There's a few other types; some people talk about having a Contributing Factors Analysis. Most people don't consider a formal analysis type, but I am trying to collect and categorize these because I do think there are some fascinating implications buried therein, and I would like to see if I can keep track of whether or not those change over time.
And yes, you've hit on one of my favorite hot-take soapbox things, which is root cause.

Corey: Please, take it away.

Courtney: Yeah. Well, and anyone who's close to these systems and has watched these things fall down has the inherent sense that there is no root cause. Like—[laugh]—let's—great. One of my favorite ones: human error. We don't have enough hours for this, Corey. I'm sorry. That's one of my favorite other ones. But let's say somebody fat-fingers a config change. Which happens—

Corey: That was fundamentally the S3 service disruption back in—

Courtney: Yes.

Corey: —2017 that took down S3 for hours on end.

Courtney: And took down so many other people that relied on S3.

Corey: Everything was tied to that. And that's an interesting question; when something like that hits, does that mean that everything it takes down gets its own entry in VOID?

Courtney: I hope so. If everybody writes them up, then yes. [laugh]. So, if S3 goes down, and you go down, and you write it up, and you put it in the VOID, then we can see those things, which would be so cool. But let's go back to the fat-fingered config file—which if you haven't ever done, you're lying, first of all—

Corey: Or you haven't been allowed to touch anything large and breakable yet, which, either way, you're lying on some level. So, please—

Courtney: Yeah. I mean, I took down Holloway's homepage when it was on Hacker News because of YAML. So, anywho. Even if you fat-finger a config change, that's not the root cause because you have this system wherein a fat-fingered config change can take down S3. That is a very big, complex, and I might add, socio-technical system.

There are decisions that were made long ago about why it was structured that way, or why this happens that way, or what kinds of checks and balances you have. It's just, get over it people. There is no root cause.
These are complex, highly dynamic systems that when they fail, they fail in unpredictable and weird ways because we've built them that way. They're complex because you're successful at pushing the envelope and your safety boundaries.

So, if we could get past the root cause thing as an industry, I mean, I could probably just retire happy, honestly. [laugh]. I'm a simple woman; could we just get one thing, people? [laugh]. First of all, then it gives non-technologists, people outside of our bubble, the media, you can't hang it on these things anymore. We all have to then grapple with the complexity, which admittedly humans, not big fans of, but—

Corey: People want simple stories, simple narratives. When people say, “Oh, remember the S3 outage?” They don't want to sit there and have to recount 50,000 different details. They want to say, “Oh, yeah. It took down a few big sites like Instagram, United Airlines, and it was a real mess.” The end. They want something that fits in a tweet, not something that fits in a thesis.

Courtney: Well, and if you have a single root cause, then you can fix the root cause and it will never happen again. Right?

Corey: That's the theory. If we're just a little bit more careful, we're never going to have outages anymore.

Courtney: Yeah, if we could just train those humans to not try to make the best possible high-quality decision they could possibly make in that situation given the information they have at the time, then we'll do better. But I mean, that's why your systems stay up most of the time, if you think about it. It's shocking how well these things actually work the vast majority of the time. And that's what we could learn from this, too. We could, you know—oh if we would write near-misses up, please.

I mean, if I could have one more wish, I think one of the coolest things the airline industry and the government side of that did was start writing up near-misses.
It's, wow, what do we learn from when we're successful, versus trying to, like, spelunk and nitpick the failures.

Corey: Most of us aren't so good at the whole introspection part. We need failures, we need painful outages to really force us to make difficult, introspective, soul-searching decisions and learn from them.

Courtney: Yeah. And I don't disagree with that. I just wish one of the things we would learn is that we should study our successes, too. There's more to be mined from our successes, if we can figure out how to do that, than there is from our failures. So, I have a metadata category in the VOID called ‘near-miss.'

And oh man, I really wish people would write those up more. I mean, I think there's, like, five things in there that I've found so far. Because the humans hold these systems together. We make these things work the vast majority of the time. That's why there is no root cause, and even when we're involved in these things, we're also involved in preventing them, or solving them, or remediating them. So, yeah, there's no root cause. Humans aren't the problem. Those are my big hot button ones.

Corey: I really wish more places would embrace that. Even Amazon uses the ‘root cause' terminology internally, and I'm not going to sit here and tell them how to run large things at scale; that's what I pay them to figure out for me. But I can't shake the feeling that by using that somewhat reductive terminology that they're glossing over an awful lot of things the rest of us could really benefit from.

Courtney: Well, so the question then—one of the other things that I look at is, personally when I read and analyze these incident reports, these public ones a lot, I always ask myself, “Who's the audience for this?” And there are different audiences for different types of incident reports and different things. The vast majority of them are for customers, partners, investors.

Corey: The stock market. Yes. Yes.

Courtney: They're not actually for the organization.
There's usually an internal one that we don't get to see—maybe—that's for the organization. But a lot of places feel that if you have a process, and a template, and a checklist, and a list of action items at the end, then you've done the right thing. You've had your incident, you've talked about it, you've got your action items. Move on.

Corey: Right, and it always seems with companies, that as you get further into the company, the more honest and transparent the actual analysis is. Like, at some point, you wind up with the, like, they're very public and very cagey, and under NDA, they open up a little bit more, and a little bit more, and finally, when you work there, their executive team, it turns out, the actual thing was, “Well, Dewey was carrying an armful of boxes in the data center, tripped, went cascading face-first into the EPO cutoff switch that cut power to the entire facility.” The cagier they get, the—I guess, not to be unkind here—but the more ridiculous whatever the actual answer is. It's one of those things where, “Really? Someone tripped and hit a button. You didn't have a plan for that?” “Well, not really. We sort of assumed that people would”—

Courtney: Why would you have a plan for that, right?

Corey: Right.

Courtney: I mean like—[laugh].

Corey: Why would you have a plan for that, the first time?

Courtney: Yeah. I mean, so imagine this exercise: sitting down in a room with a bunch of people and going, “What are all the things that could go wrong?” I mean, [laugh] ain't nobody got time for that? That's not how it works. You all have other jobs to do, too, and systems to build, and pressures, and customers, and partners, and features to build, so admit and acknowledge that you just won't know all of the antecedents and how do you respond when things happen?

Which is a whole other, you know—I know you told me you recorded an episode with Dr.
Christina Maslach on burnout, which I'm so happy you did, and there's a whole ‘nother piece of incidents and incident response, and burning people out, and blaming people, and all that stuff that's a whole ‘nother pod—it sounds like you might—you know, probably not incidents with her. But still, these things take a toll on people. And people who, like I said, show up every day really hoping to do their best job, and go up a ladder, and get a promotion, and whatever. So, I think not just treating those things as checklists has broader implications as well, just for the wellbeing of your organization.

Corey: On some level, the biggest problem that I think we've run into is that, as you said, it all comes down to people. Unfortunately, legally, we can't patch those. Yet.

Courtney: No, [laugh]. No, no. Not most kinds of patches, no. And that's messy. And I know some people are like, “Everyone should learn to code.” And I'm like, “Actually, everyone should get a liberal arts degree.” Come on, help me out people. Because there's so much of these socio-technical systems where the socio part of it is more relevant than the actual technical part.

Corey: I believe you're right, for better or worse; there's no way around it. Thank you so much for taking the time to speak with me. If people want to learn more about what you're up to, where can they find you? And we will, of course, throw a link to VOID in the show notes.

Courtney: Yeah, I also like to talk on Twitter, like you do. I'm not as good at it as you are, but I try. So yeah, I'm @courtneynash on Twitter. And at Verica, you can find me at Verica as well, courtney@verica.io. And those are the best ways to find me, I would say. And yeah, please people, write up your incidents, send them to the VOID and let's all learn and get better together, please.

Corey: Thank you so much for taking the time to speak with me today. I really do appreciate it.

Courtney: Thank you for having me on.
I know—do people say this: I'm like, “Yeah, big fan,” but I am. I'm a [laugh] big fan [laugh] of the podcast.

Corey: Oh, dear Lord, find better things to listen to. My God.

Courtney: [laugh]. But it's been a treat. Thank you.

Corey: Courtney Nash, Internet Incident Librarian at Verica. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment making it very clear that for whatever reason the website is down, it is most certainly not your fault.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Molding Leadership Within Tech with Adam Zimman

Sep 22, 2021 37:46


About Adam

Adam Zimman is a start-up Advisor providing guidance on leadership, platform architecture, product marketing, and GTM strategy. He has over 20 years of experience working in a variety of roles from software engineering to technical sales. He has worked in both enterprise and consumer companies such as VMware, EMC, GitHub, and LaunchDarkly. Adam is driven by a passion for inclusive leadership and solving problems with technology. As an Advisor he works with a number of startups and nonprofits. His perspective on life has been shaped by a background in Physics and Visual Art, an ongoing adventure as a husband and father, and a childhood career as a fire juggler.

Links:
Twitter: https://twitter.com/azimman

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

This episode is sponsored in part by our friends at VMware. Let's be honest—the past year has been far from easy. Due to, well, everything. It caused us to rush cloud migrations and digital transformation, which of course means long hours refactoring your apps, surprises on your cloud bill, misconfigurations and headaches for everyone trying to manage disparate and fractured cloud environments. VMware has an answer for this. With VMware multi-cloud solutions, organizations have the choice, speed, and control to migrate and optimize applications seamlessly without recoding, take the fastest path to modern infrastructure, and operate consistently across the data center, the edge, and any cloud. I urge you to take a look at vmware.com/go/multicloud. You know my opinions on multi cloud by now, but there's a lot of stuff in here that works on any cloud.
But don't take it from me. That's VMware.com/go/multicloud, and my thanks to them again for sponsoring my ridiculous nonsense.

Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in front of your office chair, bleary eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic light colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered, why is it that sales and marketing get all this shiny, awesome analytics and insight tools? Whereas, engineering basically gets left with the dregs. Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack but this is 2021. They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's Jellyfish.co and tell them Corey sent you! Watch for the wince, that's my favorite part.

Corey: Welcome to Screaming in the Cloud. I'm Cloud Economist Corey Quinn, and periodically I like to talk to people about different aspects of the industry. One that I think is interesting that doesn't get spoken about a lot directly is the idea of leadership.
My guest today is Adam Zimman, who's a startup advisor providing guidance on—as mentioned—leadership, platform architecture, Product Marketing, and GTM Strategy—GTM, of course, standing for go-to-market. Who goes to market? That's right, little piggies. Adam, thank you for joining me.

Adam: Thank you, Corey. It's a pleasure to be here.

Corey: I imagine that you usually don't advise your clients to call their GTM execs little piggies?

Adam: Well, I mean, I guess it depends. You know, if you're actually a bacon manufacturer then that might be actually a reasonable thing to do.

Corey: Yeah, that's a level of investment in the product that you usually don't see in most environments, but we take what we can get. So, snark and cynicism aside, what is it you do?

Adam: Ultimately, I look for ways in which I can add value. And I've had the privilege in my career to be exposed to a lot of amazing companies, and I look for ways to be able to take the lessons that I've learned, mainly through mistakes and failure, and be able to translate those into success for others.

Corey: Most recently, you were at LaunchDarkly for a while, taking a number of different VP roles. While you were there we spoke, back in 2017, briefly while you were in that environment. And in fact, my first guest on the show was one of the folks on your team, Heidi Waterhouse, who has been back at least once since then, and hopefully more than that. But it's been an interesting ride there. Before that you were at places like GitHub—or JIF-ub as I insist on pronouncing it—EMC-slash-VMware—where does one start and the other stop? Hard to say, it's sort of a giant corporate shell game—but you've spent a lot of time in large companies and small ones as well, and now you're effectively hanging out your shingle as a strategic advisor.

Adam: This is true.
I mean, I think that one of the things that I've found is that it doesn't really matter what size of company you're at; you're going to find new and interesting challenges, and you really don't have to look that hard. And so one of the things that I found consistently, and I would say that this was most pointedly phrased for me by Emily Freeman in the context of, “DevOps is this amazing thing of people, process, and technology. And the reality is, the only one that's complicated is the people.” And oddly enough, small companies, you still got people; big companies, you still got people. So, therein lies some of the challenges.

Corey: And people are inherently non-deterministic; you never know what you're going to get by applying the same input, even to the same person just separated out by time. It's a challenge, and the problem that I see across the industry is that very often, you'll have a team of engineers and you'll pick the best and brightest one of those engineers, and, “Congratulations, you manage the team now.” Now, management's an inherently orthogonal skill, and what you've simultaneously done is gotten rid of a great engineer and introduced a terrible manager. And that's through no fault of this person's own. But when I started managing teams, I got surprisingly far by just doing the exact opposite of all the stuff that my previous terrible bosses had done.

And that works really well right up until it doesn't in a variety of probably fairly easily predictable ways. And the challenge that I'm seeing is that there is no book on how to do these things. If you want to climb an engineering ladder, great; there's a bunch of very qualified people who will tell you how to go from wherever you are technically, to where you want to go, and what you have to demonstrate, and what you have to do. Leadership is squishy, in that sense.
At least it always has been to me.

Adam: The interesting part that I would challenge you a little bit on is that there are thousands of interesting books on leadership, and an even smaller subsection on management specifically. I think one of the challenges there is that they're not well circulated within tech as an industry. I think that there are a few that people come back to, like Andy Grove's book on his experience building Intel. There are a lot of books out there that have done a lot for talking about how to manage people and how to think about what are the specific tactical things that you do. It's having one-on-ones, it's having meetings with clear agendas, it's being able to look for ways to set expectations with your organization.

I think one of the challenges that I see pretty consistently is the fact that that effort to be able to go out and find that information or to learn those skills is something that is put on to, as you said, this individual who is coming to management through punishment. They've been extraordinarily successful and now you will punish them by putting them in a role where they can no longer do all the things that they enjoyed, that made them successful. And I think that you see time and time again, where organizations put people in these roles, but they don't do anything to either prepare them for it or do anything to continue that notion of professional development or training for those individuals once they're in those roles.

Corey: There are a lot of books out there for any discipline under the sun; some are good, some are terrible, most are somewhere in the middle of the road; the law of averages winds up working out. I think a key difference, on some level, is I can take to Twitter, or a forum, or something like that, and complain about software; the computer isn't doing the thing I think the computer should be doing. And that's great.
I can't very well go and complain about managerial issues while actively having a team and not find myself no longer having managerial issues, if you catch my meaning. It's hard to find communities around this stuff.

Adam: I think that you're right. And I think that this is one of those things where not only that, but I think that we also in tech have predominantly taken a very hierarchical structure to the way that we think about management and leadership, to the sense where oftentimes, it is not only discouraged but downright forbidden for an individual contributor to challenge their manager if they want to continue to have gainful employment. And I think that this is a cultural thing that, you know, it's funny; I know that you recently did an episode with John Allspaw and were talking about incident remediation. And I think that one of the things that I've always tried to do as a manager, as a leader, is think about opportunities for being able to do that type of incident response, for people. If you have a person that leaves, whether that is forced attrition, whether that is voluntary attrition, whether that is something that you wanted to happen, something that you didn't want to happen, what are you doing from a perspective of kind of a post-incident assessment to learn from that? And I think that the next level of that is, how do you do it so that you actually, in some way, incorporate that for the individual that's actually leaving? Because ideally, they're learning from that experience, as well.

Corey: Back when I was a generally terrible employee, I decided at some point, I was tired of dealing with computer problems and wanted to deal with people problems instead. Now, let's be clear, I found a path to do that in a very different direction than I expected at the time, but at the time, it was, “Great.
I'm going to go ahead and become a manager of a team.” And I talked to a number of folks about all right, what is the path to go from decent technical engineer—I was a senior SRE type at most of these places—into management. And not just talking to people at the companies I was at, but talking to people in the larger community, and every engineering manager who I respected and talked to about it, it always seemed like they got this lucky break at just the right time and that made them a manager for the first time.

And once you have a track record of having managed people, then you're in. You can go back and forth between IC and management roles. But, “Well, you've never managed people before, so we're not going to take a chance on you to manage people.” The way that I did it, honestly, was I—a few times—I wound up joining startups where I was effectively the only ops person; we suddenly started scaling and having fun problems, and well, I did negotiate for that director title, so all right, I have teams now. I was more of a team lead than most things, in some cases.

But it led to a really pretty interesting evolution in how I approach these things. I find now that the right answer is for me not to manage people at all because what I fundamentally do here at The Duckbill Group is basically become the loud, obnoxious center of attention. And I think that what managers need to do is showcase their people instead. And those two things, at least in my view, are opposed. And it's very challenging to do both of them, let alone well. For me at least, I tend to back away from the management side of things almost entirely and abdicate the role. Which is great. People self-manage, right?

Adam: Well, I mean, I think that there are individuals who definitely will take—have the ability to self-organize and self-manage to a degree.
I think that the challenge that you run into is, as the organization scales, as the nature of their role tends to change with that scaling organization, it becomes more challenging for them to navigate through those changes. A great example would be, I have had the pleasure and the privilege a number of times in my career of managing extraordinarily senior individuals; these are individuals who, to your point, don't need a whole lot of care and feeding. But what they do sometimes need is they need someone who is able to be in rooms that they're not in, whether that's from a higher-level leadership meeting understanding larger organizational goals, or they need someone that's going to check them; they need someone that they can trust, someone that they can bounce their ideas off of to know, is this something that's going to be perceived as value or something that's going to actually take me in the wrong direction, or somebody that's, kind of like, paying attention to the work product that they're doing and giving them some coaching, whether that's cheerleading or whether that's connecting of saying, “Hey, there's also this other person you should talk to.” Those types of things are really valuable for those individuals who are, to your point, a little bit more self-sufficient.

Corey: On some level, I ran into this trap a lot, and having over-drinks conversations with a bunch of people who went on similar paths, it's blindingly obvious that it's a dumb move in hindsight, but an awful lot of us did it, where we're sitting there as engineers with the belief of, “Ah, if I can make my manager—or beyond, several skip-levels up—look incredibly foolish in the middle of a large meeting, they will inherently see the value of what I have to say and will thus elevate me to management.” As it turns out, they elevate you to customer because you're not working there anymore, in many cases.
And when I talk to people about this, it usually has that lightbulb-coming-on moment of as soon as you hear it, of course, it is blindingly obvious that you aren't going to sarcastically obnoxious your way into being management. Instead, the path there—in hindsight, also blindingly obvious—is act as if: act managerial; help to effectively carry on your manager's message to the rest of the team, and when you have reservations or whatnot, talk to them in private rather than calling them out. And it's the obvious stuff of who gets promoted to management? Well, the people that look managerial. And that is what that looks like, in many respects.

Adam: And this is one of the reasons why, when I talk about management I like to separate the notion of management from leadership. Because I think that anyone can be a leader. You don't actually have to be the administrative manager of an individual to be a leader to them.

Corey: I saw a great poster once when I was younger. “Leaders are like eagles. We don't have either of them here.”

Adam: [sigh]. Yeah, yeah. Ugh. I do miss good motivational posters.

Corey: Oh, yeah.

Adam: You know, I think that there's some truth to it. I think that finding people who are genuinely invested in being able to enable the success of others—which is how I define leadership—is challenging. I think that, especially in a rather capitalistic-type industry like we're in, there is a lot of measurement of people's success by their own personal achievements and by their ability to beat their own drum. And I think that it's something that is, frankly, a failing of our industry, where we don't do a better job of encouraging folks, and rewarding folks that actually look out for others and enable the success of others. Because I think that's something that is—ultimately you think about how you build strong teams, and it's not about getting a bunch of individuals who can do amazing things individually.
It's about getting individuals who are capable of working together and being able to do more than they would be able to if they were simply working individually.

Corey: Do you ever find that people are chasing management in many respects because they think that it's something very different than what it is, and then find themselves in situations where well, I'm the dog that caught the car that I was chasing and only now do I realize that I have no idea how to drive the thing?

Adam: Oh, absolutely. So, this is something that has been interesting me a lot recently, in the sense that I think we as an industry also do a very poor job of measuring management, measuring leadership. We give a lot of power to managers through performance reviews to measure their individual contributors, but there are very few companies who actually efficiently do things like 360 reviews, which has always confused me because I think that implies that you're getting feedback from all around you, as opposed to what you really want is you want feedback pointed back at you, which would be 180. But maybe that's just—

Corey: Let's be clear, that was also pioneered by the German [Wehrmacht 00:13:48] in World War II, which is yeah, basically how some people I've worked with do tend to manage.

Adam: Yeah. I think that if we can think about how do we measure the success of a manager, is it simply a function of the output of their team, or are there other efficiency metrics that you should be looking at? Very obvious one is how efficient is a manager from a perspective of the utilization of their resources? And when I think about that, I think about are they actually able to effectively hire? Are they able to effectively retain the people that they hire?

What does it look like for the people on their organization from a promotion perspective in terms of skill growth? Do they become more valuable over time? Those are ways in which we can think about how we measure the manager, potentially, directly.
And then there's indirect things like what's the qualitative aspect of those individuals that work for them? Are they people who are enjoying the work that they're doing?

Are they motivated to continue to work towards the company's vision and mission, to be able to actually make their manager look good, but also make the company successful?

Corey: A challenge, too, because I've seen this myself is, all right, you're now elevated to manager. Congratulations. It's not really a promotion. It's a lateral move. However, a lot of companies don't treat it that way.

They don't compensate it that way, et cetera. And oh, okay, management, it turns out is not for me. There's no real good way to say, “I'm going back to being an IC,” especially at the same company, without it being perceived by many—rightly or wrongly—as a demotion or a failure.

Adam: This question of, like, motivation to people, why do they want to go into management? I think that oftentimes this is misplaced. A lot of times the number one motivation that I've heard has nothing to do with wanting to actually help people or solve people problems, as you said earlier; it has to do with I want a bigger paycheck, I want more seniority, I want more responsibility, and therefore the only path available to me is management. In fact, many career ladders at organizations require an individual contributor to go to a management position before they can become a principal or a staff-level engineer, which is nonsense. First of all, why would you torture the individual to do something that is so completely and utterly outside of where their interests are? Secondly, why would you just decimate your lower-level individual contributors, your newer individual contributors by having someone who is completely non-inclined towards management be responsible for them? Oh.

Corey: Oh, yeah. Used to be your peer; now they manage you, and great. I think people underestimate exactly how broad the blast radius of a manager is.

Adam: Yeah.
Talk to anyone, and they'll be more than happy to tell you the worst manager that they've ever had. At the same time, they'll also probably be able to tell you the best manager they've ever had.

Corey: Oh, yeah. I called both of those out—only one of those by name, by the way—in conference talks that I've had because it's—yeah, you can probably guess which one I would call out and which one I would not name publicly—yeah—

Adam: It depends on the conference, I guess. But yeah.

Corey: Oh, yeah, absolutely. If it was you-know-what-your-problem-is con, yeah, it went super well.

Adam: [laugh].

Corey: It was fun. And management, especially in the current era is getting interesting, as we're seeing the heating up of the market in a bunch of different ways. And I understand, to be clear, that Twitter is not a perfect microcosm of the industry, but there's a recurring theme that I'm seeing among a number of engineering types that seemed to get—and again, I don't want to get letters for this, so if I misstate it, audience, please go ahead and be kind—but there seems to be a certain thread running through engineering communities that the purpose of a company is to provide a utopian work environment for its staff. Now, as someone who runs a company myself, yeah, I absolutely want to provide the kind of working environment I wish I'd had in a bunch of different environments. And that's not going to work for everyone, but that's okay.

But fundamentally we're here to make money, and ideally, enough monies that we can keep the lights on. And that does mean that, however we want to treat our staff, that has to be subordinate to, can we continue as a going concern? So yeah, it turns out, we can't—sustainably—outbid Netflix on every hire that we make and we aren't able to wind up having three catered meals a day as a full remote company delivered to everyone's house. Now, I'd like to, in a world where money flows like water, but it doesn't.
For better or worse, there are constraints, and constraints shape us.

But there's a thread that I'm starting to see of… I hesitate to call it entitlement, but it trends slightly toward the direction of folks who are in tech, and in some ways seem very far removed from business realities—now, let's be clear, in the FAANG world, yeah, it's pretty attenuated. And in startup land where, well, we're VC-backed, so we're losing money by the billion but we're making it up in volume. Great. That is not necessarily what I'm talking about here. I'm seeing a thread where, oh, engineers are clearly the smartest people in any company, which means that every other department should defer to them. I disagree with that position.

Adam: I want to follow that thread a little bit with regards to engineers. So, I've worked as a software developer—

Corey: My condolences.

Adam: Yeah. I've worked as a technical salesperson. I've had the opportunity to work in pretty much every department with the exceptions of HR and finance. So, that has been part of my career of jack of all trades, master of none, but it has given me some interesting insights in terms of the value that different organizations, different individuals, bring to a company. And I think that—one of the things that I will say is that for the longest time, in large organizations, especially non-tech industry organizations, the engineer or the developer had the same expectations, or the same role, as someone on the janitorial staff.

It was basically, “You're part of the plumbing. You just do the things so that the tech just works, and we're going to have the other business folks that are more responsible for actually making decisions that are going to make our business money.” The quintessential example is someone like Kraft Foods or someone like John Deere, right, where you're building tractors; for the longest time, the guy who ran the website wasn't going to be the guy who was going to make or break John Deere's quarterly earnings.
Now, you've got tractors that literally are more computers than they are mechanical devices and so you suddenly have this change in dynamic with regards to the importance of that developer. But I think that something that's interesting, also, is that those other people who worked at the company didn't go away.

They're still there; they're still important. In fact, they're still oftentimes making the buying decisions on behalf of the developers. The developers aren't the ones that are making those choices. And so you need to figure out, how do you actually make the technology choices and the technology outcomes accessible to individuals that are in roles that, historically, had nothing to do with tech.

This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security.

And - let me be clear here - it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers, needed to support the application that you want to build.

With Always Free you can do things like run small scale applications, or do proof of concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free. No asterisk. Start now. Visit https://snark.cloud/oci-free. That's https://snark.cloud/oci-free.

Corey: I've always been a big believer in the idea that if you're going to transition into a new field, be it into tech, out of tech, et cetera, great. In almost every case, you should find ways to do that laterally.
I think that this idea that, oh, you're going to go ahead and just start over with an entry-level job after you've been in a field for five years—no. Find the position that's halfway between where you are and where you think you want to go next and start getting exposure there. In time, it's those niches that add value that distinguish you from other folks.

It turns out that they don't generally want to hire someone in almost any role that comes from Central Casting, where it's alright, give me a standard MBA with the following pedigree and drop them in as my new executive, whatever. No. They want to see things like industry experience; they want to see things that distinguish folks, and having experience in industries that are not traditionally, purely what this role is, is super helpful in a lot of different ways. What I do pretty clearly blends finance and tech; that goes reasonably well. Increasingly it starts to blend media, which is something I don't pretend to understand. But here we are, he said into the microphone.

Adam: Yeah. Well, as long as you're not starting the next Fox News, I'm fine with that.

Corey: No, no. Generally not.

Adam: Okay, fair enough. But I think that you're right. This is one of the things where, trailing back, as we have throughout this conversation, to the notion of leadership, this is something that I found extraordinarily rewarding and empowering that I've done with individuals that I've brought into new organizations, either through initial conversations during an interview process, or as part of their onboarding, is I sit down, and I actually talk to them about what are their plans? What are their expectations? What are their goals, not only for the next 30, 60, 90 days in this role that we're talking about but what are they thinking about from a perspective of what do they want to do in the next year? In the next three years? Five years? Ten years? What are those checkpoints of what do you want to do in this role?
What do you want to do at this company? What do you want to do with your career? Like, where do you see it headed?

And it doesn't mean that you're writing this in stone, or that I'm going to hold you to it, but I think that one of those things that's really empowering for a leader is to be able to help those individuals find those connective threads that tie one position to the next and help them get there. If they're somebody who is saying, “Hey, look, I'm currently a developer, but I really wish that I could give more talks.” Okay, well, that's great for me to know. Let's put you on some projects that maybe actually would result in great content for a talk that you could give at a conference. And then we'll figure out, how do we work with the marketing department to be able to help you bring that to fruition?

There's a lot of ways to be able to leverage this experience that you have as a leader, as a manager, to an individual who's coming up in their career and saying, “Hey, look. This is how some more ancillary things are connected.” And being able to bring those back to them.

Corey: I really wish, on some level, that there was a more defined path toward a lot of these things, where the stuff is explained to folks. So often, I had terrible managers that, in hindsight, weren't that terrible. Because I didn't understand where the role started and stopped, I tended to view the role of the manager as being there to protect the team. The end. And be our advocate in the organization, and get us the thing that we want, and what do we want? Comfy chairs.

And it turns out that isn't ever how it really works. If I had to define management, it would basically be balancing competing priorities more than it is almost anything else. And counterintuitively, the higher you rise in an organization, the more responsibility you have, and the less you can actually directly do. Everything you do drives influence. And that's it.
That's how it distills down.

Adam: You talk about the engineer that wants to move into management role because that's how they see their career progressing. This is a close corollary to the engineer that wants to move into a product management role because they want to have greater oversight into the decisions that are being made about what's getting built. And what you come to realize, for any engineer who successfully made that transition, is it's really complicated and difficult to be able to have that mental switch take place between this is how I'm going to build it versus this is the priority of what needs to get built next. And all too often you see engineers that land in product management roles that are dictating how something should be built, and suddenly the engineers are just like, “No, I have no respect for you. Because that's not your job.”

And likewise, in a management role, oftentimes people view that as an opportunity for them to make all the choices, make all the decisions, and suddenly lose sight of the fact that they used to be on the other side of that outcome themselves, and were disappointed when they weren't included in some way, shape or form, or their priorities weren't taken into consideration.

Corey: As you look at your own career, what is the worst job experience you've ever had? Or the worst job you've ever had? Or the worst boss you've ever had? That's always a good one to do.

Adam: [laugh].

Corey: Pick a superlative and not the good kind. Hit me.

Adam: Yeah, no, I mean, look, I think that probably the worst… experience that I ever had with a manager, with a boss, was actually when I was first a software developer. And my manager would occasionally just come up behind me and just stand and watch me code. And we're not talking about pair programming, where it was just like, we're working together. No, it was, literally would come up, stand behind me on my shoulder, and just stand there. Not saying anything; just watching me write Java code.
And that was probably the most disconcerting experience that I've ever had in a job ever. I lasted about six months and then I was just like, “I need to move on to something else.”

Corey: It turns out one of my failure modes was that I was great for the first three months in new ops roles because things were invariably a fire, and—

Adam: [laugh].

Corey: —I know how to solve those things. And then it becomes a maintenance role, and I'm bad at that. For the longest time, I thought I was just a crap employee. And I am, but for different reasons. Instead, though, for me, it turned into a, I need to find the thing that I'm good at and embrace that. And I have to say, it was not being, basically, a cloud comedian on Twitter where my primary means of communication is shitposting. But you know, here we are, and this is how we've gotten there.

Adam: I mean, know your strengths, man. Know your strengths.

Corey: Yeah, lean into it. I mean, you went to college in Maine; you know what it's like there. It's dark and cold nine months out of the year, so all we do is sit inside and develop personality disorders. And well, here we are.

Adam: Well, hey, I mean, I took a break from tech after that first job in software development and I actually went back and worked for a guy that I met while I was in school, and I worked for him, he was a general contractor. So, I have an appreciation for Maine winters in a way that I never gained as a privileged college student, when I was actually digging snow out of ditches to be able to pour concrete at six in the morning and then later in the day, I got to go up and use 80-pound weight shingles to reshingle the roof in 20-degree weather. So, it was an eye-opening experience.
But I'll tell you, I learned pretty much everything that I know about how to build infrastructure from that eight months that I spent doing everything from framing, ditch-digging, to electrical, and plumbing, and roofing.

Corey: Kind of fun how often it is that we wind up trying other things. And this is part of it, too. As much fun as it is to complain about various jobs and whatnot that we have, let's be very clear here for a minute that I'm not dealing with hot tar, being paid seven bucks an hour. There are advantages to the [unintelligible 00:28:08] jobs I have.

Adam: I mean, that was a number of years ago, but I still got ten bucks an hour.

Corey: My first job at the University of Maine call center working in tech, in those days, I think I was being paid something like $5.35 an hour. To answer phones, which again, not that hard of a job. I made a lot more money a couple years later when I moved to construction. Yeah, I wouldn't recommend any of those things for me these days, but it was instructive.

Adam: But at the same time, I would argue that you also have benefited from those experiences in the way that you approach the things that you do now. And I think that's one of the things that I've tried to bring forward in my career is look for those opportunities to make those connections, and understand the value of those experiences, and be able to help to enable other people because I've had those experiences.

Corey: To me at least, the answer is to turn whatever you've done or whatever happened to you into some form of empathy. The idea of well, I had to struggle coming up, so you should, too. Let's instead focus on making it better for people who follow us. Send the elevator back down, as it were.

Adam: I mean, I think that's great advice, and I think that it's something that's done far too infrequently.
One of the things that I've noticed is that that aspect, unless somebody has actually been through the experience where somebody has done that for them, it is oftentimes something that is a lot harder for people to see. This goes to your earlier statement around the expectations that maybe are changing, and in not-such-great ways, with regards to what people are expecting from companies, what people are expecting from managers. I think that there is a distinct lack of expectation setting that takes place at companies in terms of what is the role of the company, what is the role of an employee, and how can those two come together to still have a positive interaction, but aren't overstepping on either side? Because that's really where you get into problems. That's where all of a sudden you have these companies that are looking to fill the role of, I will take care of all aspects of your life, when in reality that's not a very healthy relationship for an individual to have with a company.

Corey: So, I want to thank you for coming to speak with me. What are you up to these days, and where can people find you? And why should people find you?

Adam: Well, I don't know that anybody should find me.

Corey: “I hope this email finds you never. I hope you're free.”

Adam: Yeah, exactly. No, I mean, I would love to find folks that I can add value to and help out. It's easy enough to find me on Twitter. It's just @-A-Z-I-M-M-A-N—azimman. And they're welcome to reach out to me there. My DMs are open—much to my displeasure sometimes—but happy to help people who are looking for help.
I'm particularly interested in spending my time with those individuals who maybe are coming from underrepresented backgrounds in tech and looking for ways to be able to either get into tech or to move up within leadership roles in tech.But I'm spending a lot of my time doing a lot of coaching, doing a lot of advising for small startups, and then also just as a small side project have been working pretty extensively with James Governor and a woman by the name of Kim Harrison on this little thing called Progressive Delivery, which is, as far as we're concerned, it is the next iteration of the software development lifecycle that we've written about and talked about pretty extensively. James and Kim and I are working on a book together to be able to capture all those ideas and bring them and coalesce them for people, to make more consumable. But ultimately, we're trying to say, “Hey, look. The way that we've done things leading up till now, moving from waterfall to agile to continuous delivery into what's next?” And look at some of the market conditions that have changed. A lot of stuff that you talk about. I think that you would be the first to point out how things have changed since the launch of AWS.Corey: Oh, yes. It's more confusing now.Adam: Oh, way more confusing. And the ways in which people consume cloud-based services has radically changed. And so I think that the way that we are building software and the way that we're consuming software is something that we need to put some serious thought into. And the players that are—you know, as I spoke about earlier on this talk with you—are different. It's no longer just your developers that care about your AWS choices or care about the cloud service choices that you're making.You've got other individuals, whether it's the finance side you focus on or thinking about it from the perspective of the marketing team, or the HR team that's thinking about which cloud service HRIS are they going to use. 
There's a lot of people that need to be party to those choices that you're making and how you build out your company stack, as it were. And the Progressive Delivery model looks to take into consideration that changing and evolving group of people.Corey: And we will, of course, have links to that in the [show notes 00:32:46]. Thank you so much for taking the time to speak with me. I appreciate it.Adam: Corey, thank you so much for having me. It was a pleasure.Corey: Adam Zimman, startup advisor, and oh, so much more. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with a scathing comment telling me why you as an engineer are best suited to be the manager of everything.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Finding a Common Language for Incidents with John Allspaw

Screaming in the Cloud

Aug 17, 2021 32:19


About John
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement.
John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.
Links:
The Art of Capacity Planning: https://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/1491939206/
Web Operations: https://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/
The DevOps Handbook: https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002/
Adaptive Capacity Labs: https://www.adaptivecapacitylabs.com
John Allspaw Twitter: https://twitter.com/allspaw
Richard Cook Twitter: https://twitter.com/ri_cook
Dave Woods Twitter: https://twitter.com/ddwoods2
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you've heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise.
To learn more, visit circleci.com.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by John Allspaw, who's—well, he's done a lot of things. He was one of the founders of the DevOps movement—although I'm sure someone's going to argue with that—he's also written a couple of books, The Art of Capacity Planning and Web Operations and the foreword of The DevOps Handbook. But he's also been the CTO at Etsy and has gotten his Master's in Human Factors and System Safety from Lund University before it was the cool thing to do. And these days, he is the founder and principal at Adaptive Capacity Labs. John, thanks for joining me.
Corey: And now for something completely different!
John: Thanks for having me. I'm excited to talk with you, Corey.
Corey: So, let's start at the beginning here. So, what is Adaptive Capacity Labs? It sounds like an experiment in auto-scaling, as is every use of auto-scaling, but that's neither here nor there. I'm guessing it goes deeper.
John: Yeah. So, I managed to trick, or let's say convince some of my heroes, Dr. Richard Cook and Dr. David Woods. These folks are what you would call heavies in the human factors, system safety, and resilience engineering world. Dave Woods is credited with creating the field of resilience engineering. And so what we've been doing for the past—since I left Etsy is bringing perspectives, techniques, approaches to the software world that are, I guess, some of the most progressive practices that saved other safety-critical domains, like aviation, and power plants, and all of the stuff that makes news.
And the way we've been doing that is largely through the lens of incidents. And so we do a whole bunch of different things, but that's the core of what we do is activities and projects for clients that have a concern around incidents; both, are we learning well? Can you tell us that?
Or can you tell us how to understand incidents and analyze them in such a way that we can learn from them effectively?Corey: Generally speaking, my naive guess, based upon the times I spent working in various operations role has been, “Great. So, how do we learn from incidents?” Well, if you're like most of the industry, you really don't. You wind up blaming someone in a meeting that's called blameless, so instead of using the person's name, you use a team or a role name, and then you wind up effectively doing a whole bunch of reactive process work that in long enough timeline and enough incidents ossifies you into a whole bunch of processes and procedure that is just horrible. And then how do you learn from this?Well, by the time it actually becomes a problem, you've rotated CIOs four times and there's no real institutional memory here. Great. That's my cynical approach, and I suspect it's not entirely yours because if it were, you wouldn't be doing a business in this because otherwise, it would be this wonderful choreographed song-and-dance number of, “Doesn't it suck to be you? Da-da.” And that's it. I suspect you do more as a consultant than that. So, what does my lived experience of terrible companies differing in what respects from the folks you talk to?John: Oh, well, I mean, just to be blunt, you're absolutely spot on. [laugh]. 
The industry is terrible at this.Corey: Well, crap.John: I mean, look, the good news is, there are inklings, there are signals for some organizations that have been doing the things that they've been told to do by some book or website that they read, and they're doing all the things and they realize, “All right, well, whatever we're doing doesn't seem to be—it doesn't feel—we're doing all the things, checking the boxes, but we're having incidents”—and even more disturbing to them is we're having incidents that seem as if—it'd be one thing to have incidents that were really difficult, hairy, complicated, and complex, and certainly those happen, but there is a view that they're just simply not getting as much out of these sometimes pretty traumatic events as they could be. And that's all that's needed, yeah.Corey: In most companies, it seems like, on some level, you're dealing with every incident that looks a lot like that. Sure, it was a certificate expired, but then you wind up tying into all the relevant things that are touching that. It seems like it's an easy, logical conclusion. Oh, wow. It turns out in big enterprises, nothing is straightforward or simple.Everything becomes complicated, and issues like that happen frequently enough that it seems like the entire career can be spent in pure firefighting reactive mode.John: Yeah, absolutely. And again, I would say that just like these other domains that I mentioned earlier, there's a lot of, sort of, intuitive perspectives that are, let's just say, sort of unproductive. And so in software, we write software; it makes sense if all of our discussions after an incident trying to make sense of it, is entirely focused on the software did this, and Postgres has this weird thing, and Kafka has this tricky bit here. 
But the fact of the matter is, people and—engineers and non-engineers—are struggling when an incident arises, both in terms of what the hell is happening, and generating hypotheses, and working through whether the hypothesis is valid or not, adjusting it if signals show up that it's not, and what can we do, what are some options? If we do feel like we're on a good [unintelligible 00:06:09] productive thread about what's happening, what are some options that we can take?That opens up a doorway for a whole variation of other questions. But the fact of the matter is, handling incidents, understanding really, effectively, time-pressured problem solving, almost always amongst multiple people with different views, different expertise, and piecing together across that group what's happening, and what to do about it, and what are the ramifications of doing this thing versus that thing? This is all what we would call above-the-line work. This is expertise. It shows up in how people weigh ambiguities, and things are uncertain.And that doesn't get this lived experience that people have, it just we're not used to talking about—we're used to talking about networks, and applications, and code, and network. We're not used to talking about and even have vocabulary for what makes something confusing? What makes something ambiguous? And that is what makes for effective incident analysis.Corey: Do you find that most of the people who are confused about these things tend to be more aligned with being individual contributor type engineers, who are effectively boots-on-the-ground, for lack of a better term? Is it high-level executives who are trying to understand why it seems like they're constantly getting paraded in the press? Or is it often folks somewhere between the two?John: Yes.Corey: [laugh].John: Right? 
Like there is something that you point out, which is this contrast between boots-on-the-ground, hands-on keyboard, folks who are resolving incidents, who are wrestling with these problems, and leadership. And sometimes leadership who remember their glory days of being an individual contributor sometimes are a bit miscalibrated. They still believe they have a sufficient understanding of all the messy details when they don't. And so, I mean, the fact of the matter is, there's the age-old story of Timmy stuck in a well, right?There's the people trying to get Timmy out of the well, and then there's what to do about all of the news reporters surrounding the well asking for updates and questions, and how did Timmy get in the well? These are two different activities. And I'll tell you pretty confidently, if you get Timmy out of the well, pretty fluidly, if you can set situations up where people who ostensibly would get Timmy out of the well are better prepared with anticipating Timmy is going to be in the well, and understanding all the various options and tools to get Timmy out of the well, the more you can set up those and have those conditions be in place, there's a whole host of other problems that simply don't go away. And so, these things kind of get a bit muddled. And so when you say ‘learning from incidents,' I would separate that very much from what you tell the world externally from your company about the incident because they're not at all the same.Public write-ups about an incident are not the results of an analysis. It's not the same as an internal review, were the review to be effective. Why? 
Well, first thing is you never see apologies on internal post-incident reviews because who are you going to apologize to?Corey: It's always fun watching the certain level of escalating transparency as you go up through the spectrum of the public explanation of an outage, to ones you put internal customers, to ones you show under NDA to special customers, to the ones who are basically partners who are going to fire you contractually if you don't, to the actual internal discussion about it. And watching that play out is really interesting. As you wind up seeing the things that are buried deeper and deeper, yeah, you wind up with this flowery language on the outside, and it gets more and more transparent, and at the end, it's, “Someone tripped and hit the emergency power switch in a data center.” And it's this great list of how this stuff works.John: Yeah. And to be honest, it would be strange and shocking if they weren't different. Because like I said, the purpose of a public write-up is entirely different than an internal write-up and the audience is entirely different. And so that's why they're cherry-picked. There's a whole bunch of things that aren't included in public write-up because the purpose is, “I want a customer or potential customer to read this and feel at least a little bit better.”Or really, I want them to at least get this notion that we've got a handle on it. “Wow, that was really bad, but nothing to see here, folks. It's all been taken care of.” But again, this is very different, the people inside the organization, even if it's just sort of tacit, they've got a knowledge. Tenured people who have been there for some time, see connections, even if they're not made explicit, between one incident to another incident.To that one that happened—“Remember that one that happened three years ago, that big one? Oh, sorry, you're new. Oh, let me tell you the story. Oh, it's about this and blah, blah, blah. 
And who knew that Unix pipes only passes 4k across it.” Blah, blah, blah, something—some weird, esoteric thing.
And so our focus, largely, although we have done projects with companies about trying to be better about their external language about it, the vast majority of what we do and where our focus is, is to capture the richest understanding of an incident for the broadest audience. And like I said at the very beginning, the bar is real low. There's a lot of, I don't want to say falsehoods, but certainly a lot of myths that just don't play out in the data about whether people are learning. Whenever we have a call with a potential client, we always ask the same question. Ask them about what their post-incident activities look like, and they tell us and throw in some cliches, and everyone—never let a crisis go to waste.
And, “Oh, yes. And we always try to capture the learnings and we put them in a document.” And we always ask the same question, which is, “Oh. So, you put these documents, these write-ups in an area?” “Oh, yes, we want that to be shared as much as possible.”
And then we say, “Who reads them?” And that tends to put a bit of a pause because most people have no idea whether they're being read or not. And the fact is, when we look, very few of these write-ups are being read. Why? I'll be blunt: because they're terrible. [laugh].
There's not much to learn from there because they're not written to be read. They're written to be filed. And so we're looking to change that. And there's a whole bunch of other things that are unintuitive, but just like all of the perspective shifts, DevOps, and continuous deployment, they sound obvious, but only in hindsight after you get it. That's characterization of our work.
Corey: It's easy to wind up, from the outside, seeing a scenario where things go super well in an environment like that, where, okay, we brought you in as a consultant, suddenly, we have better understanding about our outages. Awesome.
But outages still happen. And it's easy to take a cynical view of, okay, so other than talking to you a lot, we say the right things, but how do we know that companies are actually learning from what happened as opposed to just being able to tell better stories about pretending to learn?John: Yeah, yeah. And this is, I think, where the world of software has some advantages over other domains. And the fact is, software engineers don't pay any attention to anything they don't think the attention is warranted, or they're not being judged, or scored, or rewarded for. And so there's no single signal that accompanies learning from incidents. It's more like a constellation, like, a bunch of smaller signals.So, for example, if more people are reading the write-ups. If more people are attending group review meetings. In organizations that do this really well, engineers who start attending meetings, we ask them, “Well, why are you going to this meeting?” And they'll report, “Well, because I can learn stuff here that I can't learn anywhere else. Can't read about it in a runbook, can't read about it on the wiki, can't read about it in an email, or hear about it in an all-hands.”And that they can see a connection between, even incidents handled in some distant group, they can see a connection to their own work. And so those are the sort of signals—we've written about this on our blog—those are the sort of signals that we know that progress is building momentum. But a big part of that is capturing this, again, this experience. Usually, we'll see, there's a timeline, and this is when memcached did X, and this alert happened, and then blah, blah, blah, blah, blah. Right?But very rarely are captured the things that, when you ask an engineer, “Tell me your favorite incident story.” People who will even describe themselves, “Oh, I'm not really a storyteller, but listen to this.” And they'll include parts that make for a good story. 
Social construct is, if you're going to tell a story, you've got the attention of other people, you're going to include the stuff that was not usually kept or captured in write-ups. For example, like, what was confusing?A story that tells about what was confusing, well—“And then we looked, and it said, ‘zero tests failed.'”—this is an actual case that we looked at—“It says ‘zero tests failed.' And so, okay. So, then I deployed. Well, the site went down.” “Okay, well, so what's the story there?” “Well, listen to this. As it turns out, at a fixed font, zeros, like, in Courier or whatever, have a slash through it and at a small enough font, a zero with a slash through it looks a lot like an eight. There were eight tests failed, not zero.” So, that's about the display. And so those are the types of things that make a good story. We all know stories like this, right? The Norway problem with YAML. You ever heard of that Norway problem?Corey: Not exactly. I'm hoping you'll tell me.John: Well, so lay [laugh] it's excellent, and of course it works out that the spec for YAML will evaluate the value no—N-O—to false as if it was a boolean. Yes, for true. Well, but if your YAML contains a list of abbreviations for countries, then you might have Ireland, Great Britain, Spain, US, false instead of Norway. And so that's just an unintuitive surprise. And so, those are the types of things that don't typically get captured in incident writeups.There might be a sentence like, “There was a lack of understanding.” Well, that's unhelpful. At best. Don't tell me what wasn't there. Tell me what was there. “There was confusion.” Great. “What made it confusing?” “Oh, yeah. N-O is both ‘no' and the abbreviation for Norway.”Red herrings is another great example. Red herrings happen a lot; they tend to stick in people's memories; and yet, they never really get captured. But it's, like, one of the most salient aspects of the case that ought to be captured. 
People don't follow red herrings because they know they're a red herring. They follow red herrings because they think it's going to be productive.So therefore, you better describe for all your colleagues what brought you to believe that this was productive. Turns out later—you find out later that it wasn't productive. Those are some of the examples. And so if you can capture what's difficult, what's ambiguous, what's uncertain, and what made it difficult, ambiguous, or uncertain, that makes for good stories. If you can enrich these documents, it means people who maybe don't even work there yet, when they start working there, they'll be interested; they have a set expectation they'll learn something by reading these things.Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking databases, observability, management, and security.And - let me be clear here - it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself all while gaining the networking load, balancing and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build.With Always Free you can do things like run small scale applications, or do proof of concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free. No asterisk. Start now. 
Visit https://snark.cloud/oci-free that's https://snark.cloud/oci-free.Corey: There's an inherent cynicism around… well, from at least from my side of the world, around any third-party that claims to fundamentally shift significant aspects of company culture, and if the counter-argument to that is that you and DORA and a whole bunch of other folks have had significant success with doing it, it's just very hard to see that from the outside. So, I'm curious as to how you wind up telling stories about that because the problem is inherently whenever you have an outsider coming into an enterprise-style environment, is, “Oh, cool. What are they going to be able to change?” And it's hard to articulate that value, and not—well, given what you do, to be direct—come across as an engineering apologist, where it's well, “Engineers are just misunderstood, so they need empathy, and psychological safety, and blameless post-mortems.” And it sounds to crappy executives, if I'm being direct, that, “Oh, in other words, I just can't ever do anything negative to engineers who, from my perspective, just failed me or are invisible, and there's nothing else in my relationship with them.” Or am I oversimplifying?John: No, no. I actually think you're spot on. I mean, that's the thing is that if you're talking with leaders—remember, a.k.a. People who are, even though they're tasked with providing the resources and setting conditions for practitioners—the hands-on folks who get their work done—they're quite happy to talk about these sort of abstract concepts, like psychological safety and insert other sorts of hand-wavy stuff.What is actually pretty magical about incidents is that these are grounded, concrete, messy phenomena that practitioners have, and will remember; they're sometimes visceral experiences. And so that's why we don't do theory at Adaptive Capacity Labs. We understand the theory, happy to talk to you about it, but it doesn't mean as much without the practicality. 
And the fact of the matter is that the engineer apologist is, “If you didn't have the engineers, would you have a business?” That's at the flip side; this is, like, the core unintuitive part of the field of resilience engineering, which is that Murphy's Law is wrong.
What could go wrong almost never does, but we don't pay much attention to that. And the reason why you're not having nearly as many incidents as you could be is because, despite the fact that you make it hard to learn from incidents, people are actually learning. But they're just learning out of view from leaders. When we go to an organization and we see that most of the people who are attending post-incident review meetings are managers, that has a very particular signal. That tells me that the real post-incident review is happening outside that meeting, it probably happened before that meeting, and those people are there to make sure that whatever group that they represent in their organization isn't unnecessarily given the brunt of the bottom of a bus.
And so it's a political due diligence. But the notion that you shouldn't punish or be harsh on engineers for making mistakes completely misses the point. The point is to set up the conditions so that engineers can understand the work that they do. And if you can amplify that, as Andrew Shafer has said, “You're either building a learning organization, or you're losing to someone who is.” And a big part of that is you need people; you have to set up conditions for people to give detailed story about their work, what's hard.
This part of the codebase is really scary, right? All engineers have these notions: this part is really scary, this part is really not that big of a deal, this part is somewhere in between. But there's no place for that outside of the informal discussions. But I would assert that if you can capture that, the organization will be better prepared.
The thing that I would end on that is that it's a bit of a rhetorical device to get this across, but one of the questions we'll ask is, “How can you tell the difference between a difficult case—a difficult incident—handled well, or a straightforward incident handled poorly?”Corey: And from the outside, it's very hard to tell the difference.John: Oh, yeah. Well, certainly if what you're doing is averaging how long these things take. But the fact of the matter is that all the people who were involved in that, they know the difference between a difficult case handled well, and a straightforward one handled poorly. They know it, but there's nowhere, there's no place to give voice to that lived experience.Corey: So, on the whole, what is the tech industry missing when it comes to learning effectively from the incidents that we all continually experience and what feels to be far too frequently?John: They're missing what is captured in that age-old parable of the blind men and the elephant. And I would assert that these blind men that the king sends out—“Go find an elephant and come back and tell me about the elephant”—they come back and they all have—they're all valid perspectives, and they argue about, “No, an elephant is this big flexible thing,” and other one is, “Oh, no, an elephant is this big wall,” and, “No, an elephant is a big flappy thing.” If you were to make a synthesis of their different perspectives, then you'd have a richer picture and understanding of an elephant. You cannot legislate—and this is where what you brought up—you cannot set ahead, a priori, some amount of time and effort. And quite often what we see are leaders saying, “Okay, we need to have some sort of root cause analysis done within 72 hours of an event.” Well, if your goal is to find gaps, and come up with remediation items, that's what you're going to get. 
Remediation items might actually not be that good because you've basically contained the analysis time.Corey: Which does sort of feel, on some level, like it's very much aligned as—from a viewpoint of, yeah, remediation items may not be useful as far as driving lasting change, but without remediation items, good luck explaining to your customers that will never ever, ever happen again.John: Right, yeah. Of course. Well, you'll notice something about those public write-ups; you'll notice that they don't tend to link to previous incidents that have similarities to them because that would undermine the whole purpose, which is to provide confidence. And a reader might actually follow a hyperlink to say, “Wait a minute. You said this wouldn't happen again.”Turns out it would. Of course, that's horseshit. But you're right. And there's nothing wrong with remediation items, but if that's the goal, then that goal is—you know, what you look for is what you find, and what you find is what you fix. If I said, “Here's this really complicated problem and I'm only giving you an hour to describe it,” and it took you eight hours to figure out the solution.Well then, what you come up with in an hour is not actually going to be all that good. So, then the question is, how good are the remediation items? Quite often what we see is—and I'm sure you've had this experience—an incident's been resolved and you and your colleagues are like, “Wow, that was a huge pain in the ass. Oh, dude. I didn't see that coming. That was weird. Yeah.” And one of you might say, “You know what? I'm just going to make this change because I don't want to be woken up tonight, or I know that making this change is going to help things. I'm not waiting for the post-mortem. 
We're just going to do that.” “Is that good?” “Yep.” “Okay, yeah, please do it.”
Quite frequently, those things, those actions, those aren't listed as action items, and yet it was a thing so important that it couldn't wait for the post-mortem - arguably the most important action item - and it doesn't get captured that way. We've seen this take place. And so again, in the end, it's about those who have the lived experience. The live experience is what fuels how reliable you are today.
You don't go to your senior technical people and say, “Hey, listen. We got to do this project. We don't know how. I want you to figure out - we're going to - let's say we're going to move away from this legacy thing, so I want you to get in a room, come up with two or three options. Gather a group of folks who know what they're talking about. Get some options, and then show me what the options. Oh, and by the way, I'm prohibiting you from taking into account any experience you've ever had with incidents.” It sounds ridiculous when you would say that, and yet, that is what [unintelligible 00:27:54].
So, if you can fuel people's memory, you can't say you've learned something if you can't remember it. At least that's what my kids' teachers tell me. And so yeah, you have to capture the lived experience, and including what was hard for people to understand. And those make for good stories. That makes for people reading them. That makes for people to have better questions about it. That's what learning looks like.
Corey: If people want to learn more about what you have to say and how you view these things, where can they find you?
John: You can find me and my colleagues at adaptivecapacitylabs.com where we talk all about the stuff on our blog. And myself, and Richard Cook, and Dave Woods are also on Twitter, as well.
Corey: And we'll, of course, include links to that in the [show notes 00:28:42]. John, thank you so much for taking the time to speak with me today. I really appreciate it.
John: Yeah, thanks. Thanks for having me. I'm honored.
Corey: John Allspaw, co-founder and principal at Adaptive Capacity Labs. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment giving me a list of suggested remediation actions that I can take to make sure it never happens again.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.

Embracing Differences
Incident investigation: What can we learn from the software world?

Embracing Differences

Aug 17, 2021 53:44


In this podcast, John Allspaw, the founder of Adaptive Capacity Labs, and I talk about the world of software engineering and operations and its connections to safety science and human factors. Together we explore what opportunities and challenges exist in the domain, what incident analysis and genuine learning from incidents look like, and what makes this an exciting time for exploring this domain from a safety perspective.

Software Daily
Digital Ocean with John Allspaw

Software Daily

Jun 10, 2021


Corey Quinn is guest hosting on Software Engineering Daily this week, presenting a Tour of the Cloud. Corey Quinn is the Chief Cloud Economist at The Duckbill Group, where he helps companies fix their AWS bill by making it smaller and less horrifying. If you’re looking to lower your AWS bill or negotiate a new

Cloud Engineering – Software Engineering Daily
Digital Ocean with John Allspaw

Cloud Engineering – Software Engineering Daily

Jun 10, 2021 56:03


Corey Quinn is guest hosting on Software Engineering Daily this week, presenting a Tour of the Cloud. Corey Quinn is the Chief Cloud Economist at The Duckbill Group, where he helps companies fix their AWS bill by making it smaller and less horrifying. If you’re looking to lower your AWS bill or negotiate a new

Software Engineering Daily
Digital Ocean with John Allspaw

Software Engineering Daily

Jun 10, 2021 49:00


Corey Quinn is guest hosting on Software Engineering Daily this week, presenting a Tour of the Cloud. Corey Quinn is the Chief Cloud Economist at The Duckbill Group, where he helps companies fix their AWS bill by making it smaller and less horrifying. If you're looking to lower your AWS bill or negotiate a new

Podcast – Software Engineering Daily
Digital Ocean with John Allspaw

Podcast – Software Engineering Daily

Jun 10, 2021 56:03


Corey Quinn is guest hosting on Software Engineering Daily this week, presenting a Tour of the Cloud. Corey Quinn is the Chief Cloud Economist at The Duckbill Group, where he helps companies fix their AWS bill by making it smaller and less horrifying. If you’re looking to lower your AWS bill or negotiate a new

0800-DEVOPS
Learning from incidents with John Allspaw

0800-DEVOPS

May 12, 2021 46:15


John Allspaw is the founder of Adaptive Capacity Labs and one of the people who drove the DevOps movement from its beginnings. His (and Paul Hammond's) famous “10+ deploys per day at Flickr” talk at the Velocity conference in 2009 showed us that there is a better way of collaborating and delivering software. Today, John focuses on helping organizations do better at incident management and learn from their own mistakes. I talked with John about incident management, learning organizations, and how the world has changed since 2009.
Show notes:
This interview is featured in 0800-DEVOPS #17 - John Allspaw, resilience engineering and DOES 2020 conference.
The community sharing stories and learning from incidents.

Reversim Podcast
403 Carburetor 30

Reversim Podcast

Mar 11, 2021


Episode number 403 of Reversim - today's date [at the time of recording] is March 2, 2001 . . . uh, 2021. [similar]
(Ori) I have to say that today we prepared the setup here, and my youngest son always helps me - and I told him "Shachar, do you know that we started recording this podcast before you were even born?", and he answered "What? That long?!" . . .
(Ran) So there - we went back in time . . . Anyway - we're in 2021, for whoever hears this in the future, and we're in our little studio in Karkur, at Ori's home - hi Ori! - (Ori) Hi - (Ran) and hosting Nati - hi Nati!
(Nati) Hi, good to be here, physically . . .
(Ran) Welcome, it's been a long time . . .
(Nati) Not on Zoom, how nice . . .
(Ran) We're once again in a between-lockdowns period [as in the previous Carburetor...], and we're hosting Nati again, probably for the 30th time (!), for another Carburetor episode . . .
(Nati) 30?
(Ran) Yes, I have a counter that counts . . . yes, at least 30. So you surely know Nati [Nati Shalom (Cloudify)], so let's dive right in - Nati loves to talk about, and knows and understands a lot about, everything related to infrastructure, and today we want to talk with you, Nati, about the new wave of [not corona...] DevOps - the topic, or the word DevOps, is already familiar to many, but there is news in this area, so tell us what's interesting there . . .
(Nati) Great - I'll actually start with a question: how long have you been dealing with DevOps? When was the first time you started dealing with it . .
(Ran) Since 2001, I don't know . . .
(Ori) Come on . . DevOps is that word that connects people who go to more than five conferences a year, no?
(Nati) Yes, something like that . . .
(Ran) I think the first time I came across it was probably in 2011 . . . 2010? When was it?
(Nati) That's about right, calendar-wise, I think that's the period when it started . . . Who were the main adopters of DevOps? Which companies?
(Ran) OK, so there were Puppet and Chef, which were "opinion leaders" . . .
(Ori) And Etsy, which was . . with John Allspaw, who is one of the serious pushers of this.
(Nati) And Velocity, and there were all sorts of other iconic things that . . .
(Ori) Velocity as a conference . . .
(Nati) Yes, Velocity as a conference, and of course there was DevOps Days and everything around it - and today when we look at this landscape, we can see that it has reached a kind of plateau in recent years. In terms of hype or trend, it has become, at least in the startup world, a kind of "OK, DevOps, we get it"
(Ran) We need a DevOps person . . .
(Nati) Yes . . .
(Ran) "But DevOps is not a person . . . it's . . . well, never mind - bring one!".
(Nati) And so the question is "OK - we've gotten used to the plateau, so what's new?" What's new is that there is a new wave, and I'll explain shortly what it is. I'll start with a quote from the current CEO of AWS, who said at the last re:Invent, which took place in December 2020 - "4% of the total IT budget is in the cloud". Now - think about that sentence for a second, and then you'll understand what I mean . . .
(Ran) Only 4% . . .
(Nati) Only 4% is in the cloud . . . Now, think - for us, and for everyone who talks here on podcasts, and everyone we more or less meet - almost 100% is in the cloud, and maybe 2% is not in the cloud . . .
(Ori) Hello . . .
(Nati) Sorry, right . . . So there was suddenly a kind of interesting mirror image that says "Wait - what do you mean 4% in the cloud? So what is this 96% that I don't know at all?"
(Ori) "The real internet", like . . the real IT world
(Nati) Exactly - so that's really where the new wave is, and it's also a really interesting wave, it's a wave that has . . .
(Ran) But wait - let's understand what the 96% is . . .
(Nati) Of the budget, not of the people and not of the total workload in the world, but of the budget
(Ran) Meaning - we're not talking about salaries, not talking about licenses - we're talking about computing and infrastructure, actual hardware . . .
(Nati) In terms of organizations' IT budgets - you sum up the total IT budget in the world, and you ask how much of the total IT budget in the world today "sits in the cloud" - and it comes to 4%, which means there is an untapped market. For those who don't know the term - it means there is market potential among those who now, especially after corona [optimistic?], are in a process of acceleration, of moving to the cloud - and it is very, very large. And this is what represents the next wave - and the reason I mentioned this number is to explain that this wave is not some minor wave, it's a very . . .
(Ran) What interests me is . . .
sorry for interrupting, but we're friends so it's fine - we mentioned who the first markers of DevOps were, and we talked about companies like Puppet and Chef, which pretty much disappeared, in my view, following the cloud, so maybe now they, or companies like them, will enjoy a revival, because the need exists?
(Nati) Right . . So I'll mention what, at least in this report, is discussed as the characteristics of this next wave. I'll avoid the technological characteristics, cloud native and all those things . . . So there are a few characteristics - one is regulated business, or - in the report they called it non-tech business. It's non-tech not in the sense that they have no technology, but non-tech in the sense that the business itself is finance or industrial - it's not Facebook or Google, where that is their purpose.
(Ori) But actually this . . . maybe I'm getting a bit ahead of what you're going to say, but it's precisely in those places, where your business is not the technology, or your core competence is not the technology - that it makes more sense to go to SaaS services and . . .
(Nati) So why hasn't it happened until now?
(Ori) I'm not sure it hasn't happened, but . . .
(Nati) It hasn't happened, not at high intensity - there were "islands" and there were beginnings and people talked, and everyone loves to talk about it - but if you still check the amount of workload that sits in a bank today, what percentage of it . . .
(Ran) I can guess why it hasn't happened . . . On the one hand, Ori - I think you're right: that is, for them especially it makes the most sense to move . . .
(Ori) Right - it's not your business, you don't "make money from this IT" . . .
(Ran) Right, and on the other hand it means changing the order of things, changing the way they have operated until now - and these may be organizations that are either a bit more veteran, or precisely because technology is not their strong side, they go for the safe choice, for the familiar. They take the IT people who know how to do what they know - and that's not cloud.
(Ori) Interesting - sometimes there are considerations we don't think about, like unions . . .
(Nati) So first - there is job security, or what you call unions - when people really have a day job in the end and this thing threatens them, and it definitely creates a conflict of interest between the good of the organization, so to speak, and the way it conducts itself. There are of course regulatory aspects, which are far from negligible - these are organizations that invested tens of person-years - hundreds and thousands of person-years - in building an organizational infrastructure that complies with regulation and addresses all the regulatory aspects, the security aspects. And there were also technological challenges in moving to an environment that does not always match their standards and their way of working - and this is a very expensive and large transformation for them, and the ROI is not always clear or quick.
(Ori) We're already talking about quite a few years . . . it's a time span of generations in IT, I don't know . . . how long has Gmail existed? 15 years? [if only there had been an April 1 episode in 2004...] Even Microsoft won't sell you a hosted Outlook today [come on]. . .
(Nati) That's an excellent question - I think several processes happened here that caused, or are causing, this to happen now with greater force. I think the big accelerator - we all know it, it's called corona, or Covid-19, which really brought this to a boiling point where everyone was suddenly asked "where are you in your plans?", and they looked and saw that "wow - we talk and talk but don't do much", in many organizations. Suddenly there was major pressure to make this move. The second thing that really happened is the maturity of cloud infrastructure, that is - today most clouds are able to offer private, secrets too, we've even seen government services moving to the cloud. The ability of cloud providers to deliver very high agility, the ability to provide all kinds of levels of security and isolation and data centers in all kinds of places. Today there are also edge units like Outposts, which you can actually run in your own data center and still get cloud services around it. So the clouds became much more sophisticated, which lowered many barriers that used to be very significant barriers to entering cloud infrastructure.
(Ran) Actually, what you're describing is not necessarily the second wave of DevOps - it's maybe more "cloudiness": those organizations that previously found it hard or perhaps were afraid to move to the cloud are now moving to the cloud - that doesn't necessarily mean more DevOps . . . I thought you were saying . . .
(Nati) That's an interesting point - you touched on a point that requires some explanation: why I linked the move to the cloud with DevOps. So first - there were all sorts of attempts at a move to the cloud that doesn't necessarily look at DevOps - that looks at DevOps as just another means, as something that needs to happen as part of the move to the cloud. I had a conversation about this with Morgan Stanley, they described it very nicely - this is how they started their journey: they said "there is the move to the cloud and there is DevOps - these two things need to live together in the end, but in this order". And they understood pretty quickly that the whole ROI of moving to the cloud without DevOps - simply doesn't exist . . . There were many attempts at lift & shift - companies were bought, also by the cloud players, in order to do this lift & shift. You too (Outbrain) made some attempt to build hybrid models of one kind or another [382 Carburetor 27 - k8s and multi-cloud]. And suddenly you discover that, in terms of the cost aspects of the cloud - many of these models don't hold up if you don't really build them as agile, as something that can really be managed automatically. If you don't have good enough control mechanisms, you'll discover that the cost is even higher than what you ran before . . . And then this matter of moving to the cloud without automation solutions, and without a transformation of also moving to, I'd say, advanced automation - showed that the return on investment is even negative. That's why a very strong coupling formed between moving to the cloud and adopting infrastructure . . .
(Ori) "Negative" is even an understatement . . .
(Nati) Totally
(Ran) So it's not that those 96% decided and said "let's stay with the same data center we have, just bring DevOps into it" - that's not what is happening, or at least not in any significant way. But the 96%, or some part of them, is moving to the cloud - and then they discover that doing it without adopting - let's call it "DevOps of various kinds" - simply doesn't pay off.
(Nati) Right . . .
(Ori) Which is a bit of a paradox, because if you take your own data center and apply methodologies of DevOps and monitoring and automation and . . .
(Nati) . . you'll reach a very similar efficiency . . .
(Ori) . . . you'll reach a very high efficiency, and then when you want to move to the cloud - once again the cloud models will be more expensive . . .
(Nati) True and not true. The reason I say "true" is that there really are several reasons why you can reach higher efficiency in an environment where you control the entire infrastructure yourself. There are skill-set problems here - the quality of the people and their ability to stand up such infrastructure.
(Ori) But that's exactly the point - we're talking about skill-set
(Nati) Right - and most organizations don't have the manpower and the ability, and it's also not necessarily their business. As you said before - focusing on this and turning it into something very, very efficient requires very, very high quality in managing infrastructure, in understanding infrastructure, in knowing how to stand up such a thing . . . Since for most organizations this is not their main business - they also don't know how to stand up such operations effectively - and that's where they fall, not because it's wrong in principle, certainly not in theory. So (1) that's part of the understanding - and I gave, for example, what I think is the representative example of Netflix and Blockbuster. Why is it an excellent example? Because it really represents this transformation - where it succeeds and where it doesn't, and how far you need to go for it to succeed.
(Ran) There are some people now asking themselves - "What? Blockbuster? What's Blockbuster? . . . I know what Netflix is but what's Blockbuster? . . ."
(Nati) I asked my kids this and they didn't remember at all what it is . .
(Ran) So for our young listeners - a bit of nostalgia: Blockbuster was a video rental company that . . . [was . . . Number of employees: 84,300 (2004); 25,000 (2010); 3 (2019)]
(Nati) In fact both Netflix and Blockbuster were competing over media services - movies that you would buy and rent. It used to be packaged on CDs, which I'm also not sure how familiar this whole generation is with . . . but once there was no . . .
(Ori) Let's admit the truth - still, for most of its years Netflix was mailing out CDs . . .
(Nati) Right. So here it . . .
I use this because I think it's a good example of transformation and of the weight that the move to the cloud carries within that transformation, because I think Netflix made a very interesting and very bold move in this context - but it is a very representative example for all transformations. What Netflix came and said back then was "OK - we get it, we need to go to on-demand services, and we need to run this on infrastructure where the main thing is that it is scalable, that it is cheap" - and we'll do it with a partner - called AWS in this case, that's who they started with - and we'll make sure it adapts itself to us in the places where it doesn't suit us or where we're missing things, which is for example the CDN network and all sorts of other things they dealt with. But they went all-in on cloud, and they went all-in on building automation, and they went all-in before everyone else - a traditional company that managed to make a transformation, big-time as they say . . .
(Ran) But, by the way - this company may be traditional but they transformed almost everything, that is - not only the move to the cloud . . .
(Ori) I think that . . I know a bit because I asked people from Netflix what and why - and they did some really interesting things - first of all, they said - if you look at Netflix's cost model, there are of course salaries, there is what they put into cloud services, which essentially replaced their data centers, but there are two components that are the main components of Netflix's costs, they are most of Netflix's costs - and that's where it is: it's the CDN, which essentially replaced the stamps . . . it's the delivery mechanism . .
(Nati) Interesting to think about it that way, really . . .
(Ori) . . and the royalties for content. And now you need to look at it - when these are your main costs, any saving you make in these costs is a lot of money on the company's balance sheet. If you did the management in the cloud or on a more expensive system - it doesn't move the balance sheet much, OK? They developed the CDN themselves, in order to cut a lot of the costs. And you also see what happened with their royalties, for all the content - they simply started developing content themselves, they essentially started competing with the studios.
(Nati) So I'll look at it from a slightly different angle, a bit of a mirror image of what you said . . .
(Ran) Today, by the way, content is the area with the highest spending at Netflix, from what I've read - they spend the most money on productions.
(Nati) So I really want to touch on that, because I think there is an interesting mirror image here - in the past - and you can see this in all industries, when you go to conferences you see it very clearly - when you go to telco conferences or finance conferences, the part of infrastructure and infrastructure management for finance or telco or industry was a very significant part. You would speak a different language - different standards, different vendors, different ecosystems . . . a real walled garden around these worlds. And dealing with the business itself was, I'd even say in percentage terms, much smaller, as a result of a lot of focus being put on how to run the operation, how to manage the infrastructure, and less on how to innovate in the business. What happens in the move to the cloud is that it allows the business to fix this balance, suddenly you look at it differently - wait, infrastructure is not my business, oops . . . it moves somewhere else. And then you suddenly see all the innovation and all the energy of the organization moving to the business - and then you see innovation on the content side and in producing content and in building sleek applications and a very good user experience and things that, for companies moving to digital - suddenly become their focus. Companies that have not gone through the digital transformation or are still on the way there - you see that their mindset is not always . . . I know this well for example from Bezeq - until Bezeq made a very big move to digital, you would get all sorts of boxes that you didn't understand how to operate, you needed a technician to operate them, for years upon years . . . and it's not that it couldn't have been done earlier, simply what happened is that the moment it really . . .
And this is the additional part of this move to cloud infrastructure, let's call it - that it really changes the focus of the business, and it allows the business to really focus on the customer, on the content, on the user experience, on the innovation related to the vertical it is actually engaged in. And that's the big difference between Blockbuster and Netflix in my opinion [closing a parenthesis from a long time ago] and what made Netflix succeed - that the move to the cloud didn't really touch the line item of cheaper infrastructure and infrastructure management - I think you touched on this correctly in terms of the line items - but that it allowed Netflix to release some barrier there and to innovate much faster on the business side, and to flip this whole market. And it's a mirror image of that kind of effect, where if you don't do it all the way you don't reach this impact - and then this multiplier doesn't happen and then the ROI doesn't happen.
(Ran) It's like Facebook replaced Rachel the Gossip, if you remember . . . Rachel the Gossip was a gossip column in Maariv or Yedioth, I don't remember [HaOlam HaZeh], and now you have Facebook for gossip . .
(Ori) I just want to look specifically at the CDN side, OK? With the CDN, it seems Netflix did not give up, it built the CDN itself - and why? Because that is its core business, that is its delivery mechanism in the end, so that it did itself.
(Nati) Indeed the central point is that when we say "move to the cloud" it doesn't really mean 100% of everything outsourced, but rather that you make more informed choices about where you really get value from managing infrastructure and where you can outsource it completely and work against a provider.
(Ori) So say at Outbrain, where our serving infrastructure is the enabler of the business, and there is lots and lots of workload on it - we do manage that in our own data center. But the IT infrastructure - everything we can, we move out . . . by IT I mean business systems and information systems. All those things go out to SaaS.
(Ran) So say data warehouse . . .
(Ori) Data warehouse - part of it with us and part in GCP, but how should I put it . . . customer management, CRM, all those things . . .
(Nati) That, I think, pretty much makes sense - the reason I mentioned all this was that I wanted to show the impact of this wave on these industries, on those 96%. And I gave the example of Netflix and Blockbuster to show how far this goes beyond lowering infrastructure costs or automated infrastructure management. It really changes the way the business itself operates and where its business focus is and how innovative it can be and how much customer experience becomes a significant part. In all the digital companies, for example - suddenly it became a main theme for them: what the UX looks like and what the user experience looks like. A minute earlier, before this thing existed, the move to digital - it wasn't that important.
(Ori) I think there is a wonderful example from corona - the health funds . . . The health funds went through an amazing digital transformation - they simply had to.
(Nati) I don't remember the last time I was at a health fund clinic . . .
(Ran) Although it should be said that Israel was, relative to other countries, already a few steps ahead even before corona, but I completely agree that during corona it improved a lot.
(Ori) I honestly don't know what services the health funds ultimately run this on, but there's no way around it - digital literacy as a whole jumped . . .
(Nati) I think what we're really witnessing, and I think this is what will characterize the second wave (of DevOps), is among other things that when you ask the question "what is the difference between the infrastructure Bezeq runs on and the infrastructure Netflix runs on, and does it really need to be different?" - the answer will probably be - "no". And I'd even say, how the army runs its infrastructure compared to Netflix - with a reservation about security aspects - does it need to be different?
Probably not. The same goes for industry - the electric company and everything called industrial, manufacturing and all that - probably not. There is a process here that is also, as part of this move, a consolidation of a great many industries - think about what that means - I'm talking, and may the listeners forgive me, less about technology today and more about the transformation aspect, because I think it is the most significant characteristic here - it also opens up something here, and we are the Startup Nation, but it opens up something that, if you think about it - is phenomenal in terms of potential, because it means that if you're a product company, a startup, and you set out to sell to these vertical markets - what did that mean? It meant being present a lot - salespeople, being at conferences, being in places, being in New York or in all sorts of places. And now you can produce a generic product / Terraform or whatever it may be - that is built on cloud infrastructure and automatically it is already available to a much larger market . . .
(Ran) That's exactly the next question - this is that second wave we're talking about, and let's assume that indeed those 96% or some fraction of them are going to move to the cloud, adopt DevOps technologies or methodologies - do you see the providers that exist today simply getting a larger market share, or do you actually see the creation of new providers, of new tools? . . .
(Nati) The answer is "both" . . and Netflix is an excellent example because it is exactly an example of how a traditional player too can be a startup, and even more than a startup in that respect. I think the perception until today was that traditional industry is dinosaurs and startups simply know how to run faster so they'll overtake them, and the next unicorn will always come from startups - that's a perception that needs to be challenged a bit. I think this insight has sunk in, and the large organizations already know how to do the work, and financial organizations - they have quite a few good people and innovation . . .
(Ori) Even the health funds
(Nati) The health funds are an excellent example, yes, definitely. Banks have quite a few good people who can make a transformation and somehow close the gap - and some of them will be the Netflix - and many of them will be the Blockbuster . . .
So I wouldn't eulogize them all as dinosaurs, and I wouldn't say that the next thing will come only from some new unicorn-style startup, but there will be a very interesting dynamic here between startups that will challenge traditional industry and organizations like these companies, some of which will adopt a Netflix approach and will have the people there to lead such a transformation. But there will be many, many, many Blockbusters - far more than the others. And there will be a very, very big change here in this respect for a great many companies - as we've seen in other industries - many companies that will shrink significantly or disappear from the world as a result.
(Ran) What I was actually trying to ask is whether you see in the second wave new use cases that didn't exist in the first wave and that require new solutions in the DevOps field?
(Nati) Sure - let's take for example an industry I deal with a lot, which is the networking and telco industry. We all know the world of 5G, and let's put aside for a second all the conspiracies behind it . .
(Ran) We've already talked about that . .
(Ori) We're in Karkur . . . under the influence of the conspiracies more than of 5G . . .
(Nati) So it's pretty much nonsense to anyone who knows the field, but let's leave it in that court for now, I don't want to go there . . The thought that now the antenna you see on the road is also a compute resource, and what runs on it is containers and software, and I can run all kinds of augmented reality there and all kinds of other applications - the same thing we're already used to seeing in our car, and in our WiFi - that all these components become like another piece of infrastructure in our IT - that is something where we already see the innovation behind it. What's the innovation? That if in the past devices left the factory and arrived at the home and were disconnected from the internet, and we were used to seeing them as some separate creature - today they are connected all the time: connected to the Wi-Fi network, to the internet, updating all the time - part of the DevOps process, updating without us even asking questions . . .
(Ran) My TV is constantly updating . . . in my opinion it somewhat gets in the way of the kids watching . . . [wait for the car update. . . ]
(Nati) And now - the fact that they update is one thing, which is already a revolution in my eyes, because it opens a door here to a great many things that weren't here before. But we also create a feedback loop through having sensors and through having a camera and through suddenly knowing . . .
And the best example is all those sports apps - you know what the heart rate is, you know what the battery state is . . . I, by the way, don't wear a watch because of this, but that's another story . . . I suddenly know so much about you through the sensors, and my whole experience toward you is different. The applications look completely different - these are applications that learn you, know you, respond to you . . . it's not some static user that someone designed once, now it is tailored to you. You know this well from the content world . . . And now, when we really connect this to applications and to the world of infrastructure, then almost . . . the imagination here runs pretty wild about where this is going and can go. We've also seen the negative aspects of this - with the film we talked about earlier, with Facebook and others, but also . . . I won't go into that world right now. But it definitely opens a door here to a different way of thinking - how to think about automation, there's a lot of talk about AI-Ops, and how to really bring artificial intelligence in as part of the automation process. We talk a lot about processes in which - how I generate automation scripts without having to write them, meaning there will be some system that knows how to do discovery, to understand what a reference environment looks like and to generate the automation from it, and then to say "I want a similar environment in Azure!" OK - so I'll find the relevant templates and I'll generate such a thing for you there too.
(Ran) So maybe, I'm trying to do a sort of reverse-engineering of what you're saying - you're saying that the early adopters of the cloud were relatively technological, OK. Now the late adopters are arriving, and they need tools that "pet" them a bit more, tools that are a bit less tough, that help them more in getting to the cloud and to DevOps - and now we say those in one breath, it's true that these are two different things but you tie them together - and no, they won't run Puppet and Chef and they also won't run Terraform - they need something much simpler, maybe one tool tailored to the banking industry and another tool tailored to the cosmetics industry or I don't know what else there is . . .
(Ori) But then, Ran - SaaS arrives, OK? And what can you do - today, I don't know what Salesforce's market share in CRM is, but Salesforce is SaaS and it is quite dominant, if not the most dominant in the market, and nobody installs Salesforce on-premise, everyone uses Salesforce's cloud, and Salesforce is not on any, I think, on any cloud . . .
(Nati) That's changing . . I can describe a few reasons why it's changing
(Ori) It is a cloud itself . . .
(Nati) But some of its big customers . . . it has already become very important even for players like Salesforce where their infrastructure runs, and there is also customer demand to say "I run on Azure and that's where I have all the regulatory aspects covered". True, you run on your own cloud, but I want the data to run on Azure because that's where my data is and I don't want it to be anywhere else, because I don't know how to take care of it there, and from a regulatory standpoint I can't take care of it . . . So they have such a demand, and you see that SaaS companies too are suddenly working in these multi-cloud models - they have their own data centers, according to efficiency needs, in places - as you described with the CDN, in places where it really makes sense. But they are starting to work in a hybrid model in which things that are not core business, or alternatively things that are related to data, where the customer is sensitive and wants the data close to them - that's where they start working in more hybrid models.
(Ori) But from the point of view of the . . I don't know - the bank or the gas company or whatever - they don't care, they get a CRM service from Salesforce and that's it.
(Nati) So I'm saying - some of them do care, for the reason that the customer's information is something they are committed to, including where it sits . . .
(Ori) I understand that, but the day-to-day of the IT person . . .
(Nati) To the IT person it maybe doesn't matter, but Salesforce is now going through a process of moving a great deal of its infrastructure to the cloud after all, similar by the way to what you described that you (Outbrain) do - there is also that way of looking at it, but it's not the center of today's discussion. The center of the discussion, as we said, is what happens in the next wave - whose characteristic is those 96% that are not yet in the cloud, those 96% - I want for a second to respond to what you said earlier, Ran, about the petting and how they differ - these are companies that are not greenfield, they have things they've already developed and invested a great deal of time in. If there was a thought that within two years they would move and close their data centers and move to the cloud, that won't happen, and that's also part of the insights around this. So there are processes happening here that are, I'd say, more than two-way, I think even three-way. One trend is to say "OK - I need to move things to the cloud". A second trend says - "I can't part with my on-premise and this infrastructure in one day - it will take a good few years until this consolidation does happen". The meaning of the second statement is that I do need to modernize the existing infrastructure, I can't kill it, and I want it to look as close as possible to the way I work in the public world, because otherwise it becomes a mishmash and problematic. And the third trend is what you said - that not everyone in such organizations knows how to do Infrastructure as Code, or will be developers who write code in Python or YAML or things like that, and so you need some model that is more - you called it "petting" - more hybrid, one that has both the no-code side and the as-code side. And both can live side by side, because there will be those who come and say - "look, just run this thing for me, I want to click and run it" and there will be those who come and say "I want this as part of my CI/CD pipeline because I don't want to talk to anyone along the way, and that's how I'm used to working, and it should do all my automation processes".
(Ran) So, toward the end, let's make a prophecy - we haven't made a prophecy today yet, right?. . .
(Ori) We've been engaged in prophecy here the whole . . . I actually failed the prophets course at the third stage . . . I failed at the Former Prophets.
(Ran) So - the cloud started how long ago? - 15 years, say, give or take? Something like that - and back then 100% of the workload was on-premise. 15 years after that, we're at 96% of the workload being on-premise . . .
(Nati) Of the budget, not of the workload . . .
(Ran) Of the budget, right - what will it be in 10 years?
(Nati) Excellent question . . . (a) I think the percentages will really flip, I think 96% will indeed run in a cloud environment . . .
(Ran) Yes? You think that in ten years we'll reach the inflection point, OK, interesting . . .
(Nati) Yes, I tend to think so - I think mainly because of what happened with Covid-19 - I think it was an event that caused many organizations to make a very significant gear shift here.
(Ori) I think it will only be possible if there is commoditization of this service, and prices come down to sane levels . . .
(Nati) I believe that will be a side effect of this move, because I think there is competition here now, a more balanced one . . .
(Ori) Still - there is, what's it called? Not a monopoly . . .
(Ran) A duopoly, or a triopoly . . .
(Ori) It's a bit of a cartel, it's . . . the prices are pretty much aligned and . . .
(Nati) I look at most organizations, and it may be, again - you (Ori, Outbrain) are a unique case in this respect . . the ability of organizations to keep pace and build something equivalent to what these players know how to provide is close to nonexistent. But I do think they will go to more sophisticated models, which means saying where my core business is, where I do want to invest in infrastructure and where the things I do are not that critical and I can take services . . .
(Ori) But then the question is whether it will really be 96%-4% in the other direction, if the core business still stays in the cheap place . . .
(Nati) I don't think . . I think that . . . I'm talking about these industries, most of which are non-tech, meaning they don't choose technologies . . .
(Ran) Osem, Tnuva, a health fund . . .
(Nati) . . . in the end they choose services . . .
(Ori) You know, Osem, Tnuva, a health fund - most of what they do, not only is it not their core business, most of their IT can be supported by SaaS . . . it doesn't need. . .
(Nati) So even more so . . . the reasons it isn't supported by SaaS are that there are nuances that are unique to them, or at least there were, and that is really shrinking . . .
(Ori) Probably cloud won't . . . you know what? I don't want to prophesy on this, but it won't replace the management of the production line for them . . . the production line is physical, for them . . .
(Nati) Right, and that's why I think they won't be able to avoid something on-premise that will run and on it will be . . .
.פה נוצר, הייתי אומר, מודלים היברידיים שה-Cloud אפילו תומך בהם - אני אתן את הדוגמא של Outposts, כי זה משהו שאני עובד איתו הרבה לאחרונה -תחשוב נניח, אורי - כשאתה הרצת את ה-Data Center שלך, זה היה לקחת hosted business לעומת AWS, ולבנות איזשהו משהו שהוא פה מתנהג בצורה אחת ב-On-Premise ושם מתנהג ב-Cloud בצורה אחרת לגמרי - ואיכשהו לתפור את הדבר הזה.באים AWS ואומרים “אוקיי - תתקין איזה Rack אצלך, שיש בו EC2 - אבל הוא אצלך” - וה-Cost שלו יכול להיות גם Fixed, אתה לא חייב לשלם on-demand בכלל.אתה יכול לשלם On-demand, אבל הוא יכול להיות גם Fixed, בדיוק כמו המודלים העסקיים שאתה מדבר עליהם - אולי לא באותן אופטימיזציות אבל כבר לא רחוק מזה.אבל היתרון הוא שמבחינת תשתית - המיקום הזה יראה לך כמו עוד איזה Region ב-AWS . . . כל שאר הארגון שלך יכול להמשיך לעבוד דרך ממשקי הניהול של AWS ו . . .(אורי) או שאתה עובד הפוך, אתה לוקח את ה . . . אם יש לך את ה-Kubernetes אז נותן את השקיפות לשני הכיוונים . . .(נתי) אגב, שקיפות לא סותרת . . . בדרך כלל אתה משלב בין הדברים, וכשאתה עושה את זה אתה עדיין תריץ את ה-Workload על Kubernetes וזה יאפשר לך גם את הפורטביליות (Portability) בהקשר הזה.אבל מה שאני אומר, בהקשר של השאלה הנבואית, זה שלטווח הקצר אני חושב שהעולם יהיה מאוד היברידי - כן יהיה עדיין, בטווח של העשר שנים הללו של הטרנספורמציה, יהיו עדיין דברים ב-On-premise שיתחילו לעבור לענן, וה-On-premise יצטרך לעבור מודרניזציה כדי שנוכל לראות באמת את הקונסיסטנטיות (Consistency) בין הסביבות האלה.חלק מהתפישות באות מכיוון הענן פנימה - שזה אני חושב שיהיה טרנד מרכזי: Azure עם Azure Stack ו-AWS עם Outposts וכל החבילות שלהם.לכן אני חושב שבטווח הזה, עד עוד 10 שנים, אנחנו כבר נראה את היחס הזה מתהפך בין ה-4% ל-96%.אורי) תגיד - מה יקרה קודם: 96% על הענן או שהקציצות שלנו יהיו מגודלות לא מאשכרה-פרות . . .(נתי) האמת שעשיתי פה טיול בכרכור לפני שבועיים . . .(רן) אתה (אורי) לא מדבר על עדשים, נכון? . . .(נתי) . . . אל תלך רחוק לשם . . .(אורי) לא, בשר מתורבת . . .(נתי) . . . עשינו פה טיול ליקוט בכרכור, לא רחוק מפה, ממש כמה רחובות מפה . . .(רן) . . זה יגמר ב-5G? . . .(נתי) זה לא יגמר ב-5G, זה יגמר בחובזה ויגמר ב . . . 
ברח לי השם של זה, לא זוכר את השם של הצמח שם . . . אבל אם אתה שואל אותי, אז אין טעם ללכת רחוק למה שהתעשייה תעשה, אני אומר תתחברו קצת לטבע שקרוב אליכם לבית ותגלו שאתם לא מכירים - הרבה דברים - ושיש הרבה דברים שהם קרובים אליכם ואפשר לעשות איתם הרבה יותר.זה לא קשור בכלל לדיון שלנו היום, קשור אולי ל-Covid . . .(אורי) אגב, נתי - אני עושה טיולים עם הכלב גם כמה רחובות מכאן, אז אני לא יודע על מה עלית שם . . .(נתי) בסדר . . . זה נושא לדיון אחר, נראה לי . . .(רן) טוב, יופי חברים - אז אולי את הפרק הבא אולי באמת נקדיש ל-5G וקורונה ועל הקשר ביניהם . . .(נתי) אתה יודע מה היה החזון שלי, ב-5G או קורונה או מה שזה לא יהיה? שכשאני אעבור את מחלף תות או איך שקוראים לו . . . המחלף הארור הזה, עין תות?(רן) תעבור מימד?(נתי) לא . . שיגיד לי “נתי! אתה בשיחת לטפון אבל אתה צריך לרדת . . .” ושהוא יזיז אותי ישר לנתיב הנכון . . .(רן) אה, זה המחלף, הקודם, התכוונת לעין עירון . . .(אורי) מחלף עירון(נתי) למי שלא מכיר - אז פיספסתי פה בעשרים דקות את הפגישה . . . אז יש לי תירוץ - רק עכשיו דיברנו על החזון של איך התירוץ הזה גם לא יהיה תירוץ בהמשך . . .(רן) לגמרי . . . היית צריך לשים על נהג אוטומטי(נתי) אנחנו מתקרבים לשם . . .(רן) אז תודה רבה נתי - שוב. היה תענוג ומעניין ומשכיל. להתראות.הקובץ נמצא כאן, האזנה נעימה ותודה רבה לעופר פורר על התמלול

ATARC Federal IT Newscast
DevSecOps Coffee Chat with Director, System Configuration & Delivery Automation Division, U.S. Patent and Trademark Office, Spence Spencer!

ATARC Federal IT Newscast

Play Episode Listen Later Jan 19, 2021 45:46


This week on the ATARC DevSecOps Coffee Chat, we have the pleasure of speaking with the Director of System Configuration & Delivery Automation Division at the U.S. Patent and Trademark Office, Spence Spencer. Spencer shares his experience and insights on leading a team and putting people first! Below is his list of book suggestions. Check out this episode to learn more! Reading List: Leaders Eat Last, Simon Sinek (current read); The Open Organization, Jim Whitehurst (CEO Red Hat); Turn the Ship Around, L. David Marquet; Out of The Crisis, W. Edwards Deming; Deming's Road to Continual Improvement, William Scherkenbach; The Goal, Eliyahu Goldratt; DevOps Handbook, Gene Kim et al.; The Phoenix Project, Gene Kim et al.; The Art of Capacity Planning, John Allspaw

What the Dev?
What the Dev(Ops)? - Why resilience engineering can help your organization continue core functions despite stress with John Allspaw - Episode 59

What the Dev?

Play Episode Listen Later Oct 6, 2020 20:58


We spoke to John Allspaw, the principal and founder of Adaptive Capacity Labs, a consulting company that helps organizations deal with data incidents, to talk about the growing concept of resilience engineering. In a nutshell, resilience engineering is concerned with adverse external events that can lead to system failure. It tests how well an organization can withstand stress and other challenging factors to continue performing its core functions and avoid loss of data. Be sure to check out Allspaw's talk at the DevOps Enterprise Summit Las Vegas on October 13th to learn more!

Developers Eating the World
"Resilience Engineering" - Did I get it wrong!?

Developers Eating the World

Play Episode Listen Later Sep 30, 2020 41:28


Recently I wrote this post on DevOps.com (https://devops.com/what-is-resilience-engineering/). I got a lot of positive, and ... critical feedback. I invited software industry all stars Will Gallego and John Allspaw to explain to me what I got wrong!

Page it to the Limit
Building an Incident Response Plan With John Allspaw

Page it to the Limit

Play Episode Listen Later Aug 19, 2020 27:46


Mandi Walls talks with John Allspaw, Co-Founder and Principal at Adaptive Capacity Labs, about the practice of dealing with technical incidents.

Josh on Narro
Staff Data Engineer at Slack

Josh on Narro

Play Episode Listen Later Apr 14, 2020 14:41


Diana Pojar, Staff Data Engineer at Slack, April 2020. blog, twitter, linkedin. Tell us a little about your current role: your title, the company you wor... https://staffeng.com/stories/diana-pojar technical leadership, Josh Wills, Stan Babourine, Bogdan Gaza, Travis Crawford, Camille Fournier, Lara Hogan, Josh Wills, Vicki Boykis, David Gasca, Julia Grace, Holden Karau, John Allspaw, Charity Majors, Theo Schlossnagle, Jessica Joy Kerr, Sarah Catanzaro, Orange Book, my Goodreads account

Giant Robots Smashing Into Other Giant Robots
316: A Completely Orthogonal Skillset (Lara Hogan)

Giant Robots Smashing Into Other Giant Robots

Play Episode Listen Later Apr 14, 2019 40:42


Lara Hogan, co-founder of Wherewithall, discusses finding her ideal job coaching and mentoring, evaluating for management alignment, what makes for a strong manager, the value of role-play, constructive feedback, and her upcoming book, Resilient Management. This episode of Giant Robots is sponsored by: PricingWire: Monetization & Pricing Strategy for Software & Technology Innovators. Links & Show Notes: Wherewithall; Not Lack of Ability but More Choice: Individual and Gender Differences in Choice of Careers in Science, Technology, Engineering, and Mathematics; "On Being A Senior Engineer" - John Allspaw; Resilient Management; Voltron; Lara on Twitter. See open positions at thoughtbot! Become a Sponsor of Giant Robots!

Technology Leadership Podcast Review
05. Organizationally-Traumatic Management Junk Food

Technology Leadership Podcast Review

Play Episode Listen Later Mar 1, 2019 8:36


Jesse Fewell on Drunken PM, Dave Dame on Agile For Humans, Stephen Bungay on Boss Level, Julia Wester on SPAMCast, and Matty Stratton on Greater Than Code. I'd love for you to email me with any comments about the show or any suggestions for podcasts I might want to feature. Email podcast@thekguy.com. This episode covers the five podcast episodes I found most interesting and wanted to share links to during the two-week period starting February 18, 2019. These podcast episodes may have been released much earlier, but this was the week when I started sharing links to them to my social network followers. JESSE FEWELL ON DRUNKEN PM The Drunken PM podcast featured Jesse Fewell with host Dave Prior. Dave and Jesse talked about the role of the Project Management Office (PMO) in organizations that are transitioning to Agile methods. Jesse talked about the invitation-orientation of the Agile PMO as defined in the Project Management Body Of Knowledge (PMBOK) in which the PMO acts to support teams as they learn to become agile. Dave brought up that most people he has spoken to from PMOs want everyone in the organization to “do Agile” the same way, which Jesse described as management junk food. This led to a further discussion about why people want consistency and why most of their reasons are due to misunderstandings and anti-patterns like optimizing resource efficiency over flow efficiency. They also delved into some of my favorite topics: the leadership circle concept from Anderson and Adams, the competing values framework, and Carol Dweck’s ideas around fixed and growth mindsets. iTunes link: https://itunes.apple.com/ca/podcast/evolving-role-pmo-in-agile-organization-catching-up/id1121124593?i=1000428696329&mt=2 Website link: http://drunkenpm.blogspot.com/2019/01/the-evolving-role-of-pmo-in-agile.html DAVE DAME ON AGILE FOR HUMANS The Agile For Humans podcast featured Dave Dame with host Ryan Ripley. 
Dave talked about growing up with cerebral palsy which led to a discussion about the opportunities brought about by improvements in accessibility in recent years. He talked about how a technology like Apple Pay that might seem like a relatively minor innovation to most people can be a complete game-changer for somebody with cerebral palsy as it lets them pay for something without having to trust a stranger to go into their wallet. He talked about how social media has given him a voice where in previous generations there just wouldn’t be the opportunity. Nowadays, he says, the biggest accessibility obstacles at work for him are not buildings lacking ramps and elevators, but the inaccessible nature of the company’s org charts. iTunes link: https://itunes.apple.com/ca/podcast/afh-105-agile-leadership-and-management-with-dave-dame/id991671232?i=1000429122862&mt=2 Website link: https://ryanripley.com/afh-105-agile-leadership-and-management-with-dave-dame/ STEPHEN BUNGAY ON BOSS LEVEL The Boss Level podcast featured Stephen Bungay with host Sami Honkonen. This episode is a few years old, but I recently finished reading Melissa Perri’s new book The Build Trap which referenced Stephen Bungay’s book The Art Of Action and I have been reading his work non-stop ever since, which got me interested in hearing more from him. I liked what he had to say about uncertainty’s central place in strategy and its distinction from risk. He also told a compelling story about a friend of his working in strategy at a UK retailer and how he went against the traditional rollout of store layout changes to all stores at once and instead rolled out changes a few stores at a time so that he could tweak the design as he went. This is something any entrepreneur would recognize as Lean Startup thinking, but it was completely foreign to the management of this retailer. 
iTunes link: https://itunes.apple.com/ca/podcast/stephen-bungay-and-strategy-under-uncertainty/id1041885043?i=1000376171555&mt=2 Website link: http://www.bosslevelpodcast.com/stephen-bungay-and-strategy-under-uncertainty/ MATTY STRATTON ON GREATER THAN CODE The Greater Than Code podcast featured Matty Stratton with hosts Janelle Klein, Coraline Ehmke, and Jessica Kerr. They began the discussion by having Matty summarize his REdeploy conference talk ‘Fight, Flight, or Freeze – Releasing Organizational Trauma.’ Taking the idea of incidents and outages as a form of organizational trauma, Matty talked about the importance of being able to tell stories about your incident responses and how that helps the organization process the trauma. He cited John Allspaw regarding the idea that incident postmortems should ask questions that trigger conversations rather than give answers. Janelle brought up the point that the stories we tell are sometimes lies that cover up the trauma rather than address it when the environment of the organization lacks psychological safety. This brought them to a discussion of blameless postmortems and how a culture of blamelessness is so hard to build and so easy to lose. iTunes link: https://itunes.apple.com/ca/podcast/116-healing-organizational-trauma-with-matt-stratton/id1163023878?i=1000429285663&mt=2 Website link: http://www.greaterthancode.com/2019/02/06/116-healing-organizational-trauma-with-matt-stratton/ JULIA WESTER ON SPAMCAST The Software Process & Measurement podcast featured Julia Wester with host Thomas Cagley. Tom and Julia talked about the need for spectrum thinking, discussed the distinction between spectrum thinking and binary thinking, and then Julia described how she uses the Cynefin framework to identify whether or not a problem requires spectrum thinking. While this is a straightforward concept, I see binary thinking being applied all the time to address problems that require something more akin to spectrum thinking. 
iTunes link: https://itunes.apple.com/ca/podcast/spamcast-532-spectrum-thinking-interview-julia-wester/id213024387?i=1000429098317&mt=2 Website link: http://spamcast.libsyn.com/spamcast-532-spectrum-thinking-an-interview-with-julia-wester FEEDBACK Ask questions, make comments, and let your voice be heard by emailing podcast@thekguy.com. Twitter: https://twitter.com/thekguy LinkedIn: https://www.linkedin.com/in/keithmmcdonald/ Facebook: https://www.facebook.com/thekguypage Instagram: https://www.instagram.com/the_k_guy/ YouTube: https://www.youtube.com/channel/UCysPayr8nXwJJ8-hqnzMFjw Website:

Technology Leadership Podcast Review
02. Managers, Leaders, A/B Testers, and Bloodletters

Technology Leadership Podcast Review

Play Episode Listen Later Jan 18, 2019 7:48


Courtney Eckhardt on Greater Than Code, Teresa Torres on Product Love, Johanna Rothman on Developer On Fire, Jeff Patton on Scrum Master Toolbox, and Jeff Gothelf on Scrum Master Toolbox. I'd love for you to email me with any comments about the show or any suggestions for podcasts I might want to feature. Email podcast@thekguy.com. This episode covers the five podcast episodes I found most interesting and wanted to share links to during the two-week period starting January 7, 2019. These podcast episodes may have been released much earlier, but this was the week when I started sharing links to them to my social network followers. COURTNEY ECKHARDT ON GREATER THAN CODE The Greater Than Code podcast featured Courtney Eckhardt with hosts John K Sawers, Sam Livingston-Gray, Jamey Hampton and Coraline Ada Ehmke. It was great to hear another conversation that built upon the human factors conversations with Steven Shorrock and John Allspaw in previous episodes. I like how Courtney highlighted the importance of good communication in incident response by helping us picture what the lack of good communication looks like from the customer’s point of view. iTunes link: https://itunes.apple.com/ca/podcast/110-human-incident-response-with-courtney-eckhardt/id1163023878?i=1000426093173&mt=2 Website link: http://www.greaterthancode.com/2018/12/19/110-human-incident-response-with-courtney-eckhardt/ TERESA TORRES ON PRODUCT LOVE The Product Love podcast featured Teresa Torres with host Eric Boduch. I felt that, while A/B testing is a powerful and useful technique, Teresa makes a great point that it is not appropriate in all circumstances and she lists several other techniques that teams should consider when doing product discovery. I also liked the bloodletting metaphor. 
iTunes link: https://itunes.apple.com/ca/podcast/teresa-torres-joins-product-love-to-talk-about-product/id1343610309?i=1000425622664&mt=2 Website link: https://productcraft.com/podcast/product-love-podcast-teresa-torres-product-discovery-coach-and-writer-of-product-talk/ JOHANNA ROTHMAN ON DEVELOPER ON FIRE The Developer On Fire podcast featured Johanna Rothman with host Dave Rael. I can’t count the number of times I’ve heard someone make a distinction between management and leadership. I always felt that it let managers off the hook. I feel that a manager needs to be a good leader to do his or her job well and vice versa. Johanna captured that sentiment. iTunes link: https://itunes.apple.com/ca/podcast/episode-402-johanna-rothman-learning-and-delivering/id1006105326?i=1000426413335&mt=2 Website link: https://developeronfire.com/podcast/episode-402-johanna-rothman-learning-and-delivering JEFF PATTON ON SCRUM MASTER TOOLBOX The Scrum Master Toolbox podcast featured Jeff Patton with host Vasco Duarte. Jeff talked about how, when he got into software development, he quickly learned that building software was about making as many people as happy as you could while still making money. When he found himself on XP and Agile teams in the first decade of the 2000s, he felt something was missing. When he later fell in with product people, he realized that the missing piece was product thinking. They discussed how Jeff came up with user story mapping and Jeff cited three books that emphasize product thinking: Inspired, Escaping The Build Trap, and Inspired. iTunes link: https://itunes.apple.com/ca/podcast/product-owner-role-what-scrum-masters-can-do-to-help/id963592988?i=1000426507266&mt=2 Website link: https://scrum-master-toolbox.org/2018/12/podcast/jeff-patton-shares-his-view-on-the-product-owner-role-and-what-scrum-masters-can-do-to-help/ JEFF GOTHELF ON SCRUM MASTER TOOLBOX The Scrum Master Toolbox podcast featured Jeff Gothelf with host Vasco Duarte. 
Vasco asked Jeff about the key ingredients in Agile transformations that get organizations to continuously think about how the product they’re creating relates to the business and the market. Jeff gave a great answer that finished with an example of how even a change in the name of the team changes the way that the team thinks of themselves and their mission. iTunes link: https://itunes.apple.com/ca/podcast/how-to-redefine-measure-success-for-software-development/id963592988?i=1000426560415&mt=2 Website link: https://scrum-master-toolbox.org/2018/12/podcast/jeff-gothelf-on-how-to-redefine-the-measure-of-success-for-software-development/ Feedback Ask questions, make comments, and let your voice be heard by emailing podcast@thekguy.com. Twitter: https://twitter.com/thekguy LinkedIn: https://www.linkedin.com/in/keithmmcdonald/ Facebook: https://www.facebook.com/thekguypage Instagram: https://www.instagram.com/the_k_guy/ YouTube: https://www.youtube.com/channel/UCysPayr8nXwJJ8-hqnzMFjw Website: https://www.thekguy.com/ Intro/outro music: "waste time" by Vincent Augustus

Greater Than Code
110: Human Incident Response with Courtney Eckhardt

Greater Than Code

Play Episode Listen Later Dec 19, 2018 60:22


RubyConf 2018 - Retrospectives for Humans by Courtney Eckhardt (https://www.youtube.com/watch?v=s7R7V5wC0wA) 01:16 – Courtney’s Superpower: Explaining things. 06:50 – Incident Response: How we talk to people who are affected by incidents Other Great Incident Response GTC Episodes! * 088: The Safety 2 Dance with Steven Shorrock (https://www.greaterthancode.com/2018/07/11/088-the-safety-2-dance-with-steven-shorrock/) * 096: Resilience Engineering with John Allspaw (http://www.greaterthancode.com/2018/09/05/096-resilience-engineering-with-john-allspaw/) 13:52 – Disabilities in the Workplace and Professional Spaces 20:25 – The Tension Between Accessibility and Security 23:20 – Developing Coping Skills in Response to a Troubled Childhood / Combatting the Feeling of Being Othered 29:16 – Incident Retrospectives and Defensiveness as a Natural Instinct to Feedback 35:29 – Showing Vulnerability "In order to understand what another person is saying, you must assume it is true and try to imagine what it could be true of." - George Armitage Miller 43:56 – Emotional Response Trauma and Recovery: The Aftermath of Violence--From Domestic Abuse to Political Terror (https://www.amazon.com/gp/product/0465061710/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=therubyrep-20&creative=9325&linkCode=as2&creativeASIN=0465061710&linkId=b3f61caa5c87c1f98f62945ec2d4a75c) Mental Health First Aid (https://www.mentalhealthfirstaid.org/) Reflections: John: Trauma doesn’t stay in the past. Trauma has a continuous effect on our lives. Coraline: Thinking about therapy and frame it as a blameless retrospective. Sam: Referring to “post mortems” as “retrospectives” and buying the book Agile Retrospectives. (Future book club episode?!) Jamey: Even if you’re not changing things in a higher level, you can still help on a direct level. Courtney: Group therapy and handling retrospectives. This episode was brought to you by @therubyrep (https://twitter.com/therubyrep) of DevReps, LLC (http://www.devreps.com/). 
To pledge your support and to join our awesome Slack community, visit patreon.com/greaterthancode (https://www.patreon.com/greaterthancode). To make a one-time donation so that we can continue to bring you more content and transcripts like this, please do so at paypal.me/devreps (https://www.paypal.me/devreps). You will also get an invitation to our Slack community this way as well. Amazon links may be affiliate links, which means you’re supporting the show when you purchase our recommendations. Thanks! Special Guest: Courtney Eckhardt.

echo, podcast tech / dev
#E23 - Les feedbacks constructifs et la communication non-violente avec Sophie Despeisse

echo, podcast tech / dev

Play Episode Listen Later Oct 24, 2018 31:37


Sophie Despeisse, partner and lead developer at Toucan Toco, is building a positive feedback culture within her team. Topics covered in this podcast:
- Introduction to Sophie and her role at Toucan Toco
- What is feedback? Sophie explains why the question of feedback matters, from its ties to agile principles to everyday issues.
- What can the consequences of poorly delivered feedback be?
- Timing: finding the right moment to give feedback, and not waiting until it is too late.
- Timing (2): prefer constructive cold feedback (given after the fact, objective) and avoid irrational warm feedback (given in the heat of the moment, subjective).
- Writing feedback (especially in code review): collaboration, proofreading, and emojis.
- How and when should you ask for feedback?
- How do you get feedback from your team when you hold a more senior position?
- How do you measure the impact of feedback?
- Observing and understanding through nonviolent communication
- How do you raise awareness of these issues among your colleagues?
- What blameless post-mortems bring
- A few reading and video recommendations (see below)
Resources cited throughout the podcast:
- Marshall B. Rosenberg, Les Mots Sont Des Fenêtres (Ou Bien Ce Sont Des Murs) (1999): https://www.amazon.fr/mots-sont-fen%C3%AAtres-bien-murs/dp/2707143812 Marshall Rosenberg, who died in 2015, conceptualized Nonviolent Communication (NVC) and spread it around the world. Les Mots Sont Des Fenêtres is an introduction to NVC.
- "Blameless PostMortems and a Just Culture", an article on the practice of blameless post-mortems at Etsy, written by John Allspaw, CTO of Etsy: https://codeascraft.com/2012/05/22/blameless-postmortems/
- re:Work, a set of management tools made available by Google: https://rework.withgoogle.com/blog/support-managers-with-rework-tools/
- TEDx talk by Eduardo Briceno, CEO of Mindset Works: "The Power of belief -- mindset and success" (in English): https://www.youtube.com/watch?v=pN34FNbOKXc
Additional resource:
- TED talk by Frances Frei, professor of technology and operations management at Harvard Business School: "How To Build (And Rebuild) Trust" (in English, with French subtitles): https://www.ted.com/talks/frances_frei_how_to_build_and_rebuild_trust?language=fr

Greater Than Code
096: Resilience Engineering with John Allspaw

Greater Than Code

Play Episode Listen Later Sep 5, 2018 69:02


John Allspaw: Etsy’s Debriefing Facilitation Guide for Blameless Postmortems (https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/) 01:32 – John’s Superpower: Seeing connections across domains. 05:45 – All Technical Communities Run Small, the Intersection of People, Technology, and Work, and the Resilience Engineering Community 09:07 – Variety and Complexity Requisite Variety (https://en.wikipedia.org/wiki/Variety_(cybernetics)) The Toyota Way: 14 Management Principles from the World’s Greatest Manufacturer (https://www.amazon.com/gp/product/0071392319/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=therubyrep-20&creative=9325&linkCode=as2&creativeASIN=0071392319&linkId=7ce8452504e8dd06d876719e2898eb3f) The Great Courses (https://www.thegreatcourses.com/learning) Understanding Complexity (https://www.thegreatcourses.com/courses/understanding-complexity.html) 17:51 – Understanding Cognitive Work 25:34 – Heuristics and Biases 31:01 – Strategies for Generating Context-Specific Questions Debriefing Facilitation Guide (Morgan Evans) (https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf) 35:01 – Asking “Why?” Over “What?” Questions The PreAccident Podcast (https://preaccidentpodcast.podbean.com/) Todd Conklin: People screw up – and it happens all the time (https://conferences.oreilly.com/velocity/devops-web-performance-ny-2015/public/schedule/detail/44275) Ten challenges for making automation a “team player” in joint human-agent activity (https://ieeexplore.ieee.org/document/1363742/?reload=true) 49:33 – Analyzing and Aggregating Rational Choice Theory (https://en.wikipedia.org/wiki/Rational_choice_theory) Reflections: Rein: How do we deal with the objective/subjective dialectic? Janelle: The thing we focus on and pay attention to is a clear signal of what matters. Jessica: Looking up the Knowledge Elicitation Methods. Rein: It takes variety to match variety. John A.: Guiding dialogue data. 
Special Guest: John Allspaw.

The Food Fight Show
Food Fight Show - 119 - The STELLA Report

The Food Fight Show

Play Episode Listen Later Apr 10, 2018 54:11


Join a discussion with John Allspaw (@allspaw) about the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity.

Software Engineering Radio - The Podcast for Professional Software Developers
SE-Radio Episode 301: Jason Hand on Handling Outages

Software Engineering Radio - The Podcast for Professional Software Developers

Play Episode Listen Later Aug 29, 2017 63:25


Bryan Reinero talks with Jason Hand about handling outages and responding to failures. The episode explores basic problem-solving strategies and diagnostic techniques, organizing teams to address incidents efficiently, communicating with stakeholders, learning from incidents, and managing stress.   Related Links Episode 284 – John Allspaw on System Failures: Preventing, Responding, and Learning From Episode 225 […]

Software Engineering Radio - The Podcast for Professional Software Developers
SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

Software Engineering Radio - The Podcast for Professional Software Developers

Play Episode Listen Later Mar 7, 2017 51:42


John Allspaw, CTO of Etsy, speaks with Robert Blumen about systemic failures and outages; how are systems defended against outages?; why do they fail anyway?; why are failures not entirely preventable?; why do outages involve multiple failures?; the time that Etsy identified its own office as a potential source of fraud; the human as part […]

Software Engineering Radio - The Podcast for Professional Software Developers
SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

Software Engineering Radio - The Podcast for Professional Software Developers

Play Episode Listen Later Mar 7, 2017 51:43


John Allspaw, CTO of Etsy, speaks with Robert Blumen about systemic failures and outages: why they cannot be totally prevented, how to respond, and what we can learn from them.

Software Defined Talk
Episode 86: Life after artisanal pork rinds (i.e. tech M&A), CostCo Down Under

Software Defined Talk

Play Episode Listen Later Jan 30, 2017 61:52


With a flurry of M&A over the past few weeks, we discuss some of the more popular ones: AppDynamics, Trello, and Apiary. These kind of buys are all about what the acquirer plans to do with the new “asset” and the financial health of the company being acquired. We discuss these recent acquisitions, including who the “losers” are. Also, the low-down on CostCo in Australia! Mid-roll Coté: I’m speaking at DevOpsDays Charlotte (https://www.devopsdays.org/events/2017-charlotte/agenda/), day two keynote, I think. Use the code SDT to get 25% off! Matt: Talking Chef at the AWS Sydney User Group (https://www.meetup.com/AWS-Sydney/events/232172236/) Microsoft Ignite Australia: Chef will have a booth & a talk (https://events.chef.io/events/microsoft-ignite-australia/) ChefConf ChefConf 2017 Teaser (https://www.youtube.com/watch?v=DhHpt-Xhj84) Coté: much self-promotion to catch up on: I’m writing more “original content” on my blog (https://cote.io/), and plan to write more; subscribe to my newsletter for a round-up of stuff I blog, sent out on Sunday night (http://us1.campaign-archive1.com/home/?u=ce6149b4008d62a08093a4fa6&id=806adba588), will tweak more. Also, in the “grim” vein, Coté reviews some books on "automation," (http://thenewstack.io/review-automation-wake-call-fill-vacuum-tech-ethics) which John Allspaw rightly says (https://www.facebook.com/drunkandretired/posts/10155023406864169?comment_id=10155023650914169&comment_tracking=%7B%22tn%22%3A%22R%22%7D) should be called "new technology," fair enough. The 1983 paper on automation and humans (http://www.bainbrdg.demon.co.uk/Papers/Ironies.html) is a good read too. CostCo field report: Australia It’s great! US: No need for a hot pizza sign holder. US: Rayban Wayfarers are like $130 now! AppDynamics files for IPO… Cisco says NOT SO FAST IPO filing... 
(http://finance.yahoo.com/news/appdynamics-files-ipo-133542304.html) “Our revenues for the fiscal years ended January 31, 2014, 2015 and 2016 were $23.6 million, $81.9 million and $150.6 million, respectively” Cisco (http://blogs.cisco.com/news/cisco-announces-enterprise-news) $3.7 billion, about a 14-17X multiplier (https://cote.io/2017/01/25/at-3-7bn-appdynamics-sells-to-cisco-at-17-3x-estimated/) Atlassian Buys Trello for $425 Million Wired coverage (https://www.wired.com/2017/01/trello-simple-app-worth-425-million-dollars/) 451 report, paywall (https://451research.com/report-short?entityId=91333). Public blog from 451 (https://blogs.the451group.com/techdeals/ma/atlassian-inks-its-biggest-buy-with-425m-collaboration-software-deal/). Oracle Buys Apiary “API Integration Cloud” (https://www.oracle.com/corporate/acquisitions/apiary/index.html) Coté’s coverage, with plenty more links (https://cote.io/2017/01/19/oracle-acquiring-apiary-api-design-for-the-660m-in-2020-api-market/): small asset working on a $660m API management market. BONUS LINKS! Not covered in show. HP Buys Stuff Cloud Cruiser for management/chargeback, $650 million (http://www.zdnet.com/article/hpe-to-acquire-cloud-cruiser-for-measuring-it-usage/) SimpliVity for converged systems, $650 million (http://www.forbes.com/sites/petercohan/2017/01/17/hewlett-packard-enterprise-pays-650-million-in-cash-for-simplivity/) You Know What DevOps Needs? An IEEE Standard They’re working on it (https://standards.ieee.org/develop/wg/DevOps.html) Twitter Google buying Fabric. Facebook still king. Do We Talk About Trump? OpenStack Summits leaving the US (https://www.openstack.org/blog/2017/01/supporting-our-global-community/) Red Hat, Microsoft, others making announcements against the Muslim ban Coté says: these people are proven idiots. Don’t work with them (https://cote.io/2017/01/30/tech-must-rethink-working-with-the-hobgoblins-cf-scorpions-turtles-trumptech/). 
Trump’s Twitter Moves Markets Apparently he watches Fox and parrots their lines (http://www.marketwatch.com/story/every-trump-tweet-activates-thousands-of-computer-algorithms-2017-01-12), so maybe someone at Fox is making a killing with “insider trading”? RethinkDB: Why We Failed Good read for how hard it is to crack the DB and OSS markets (http://www.defstartup.org/2017/01/18/why-rethinkdb-failed.html). “In hindsight, two things went wrong – we picked a terrible market and optimized the product for the wrong metrics of goodness.” Coté follow-up: be careful with TAM picking (https://cote.io/2017/01/21/choose-your-tam-wisely-and-remember-to-charge-a-high-price-rethinkdb/). Yahoo is Altaba … wut? (http://www.reuters.com/article/us-yahoo-m-a-verizon-idUSKBN14T2I7) Dreams $45bn (https://twitter.com/IvanTheK/status/818810602839744512) Google’s AI Awakening “How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself.” Extensive article on Google’s AI push from back in December (http://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html) Alexa Amazon’s OS (https://stratechery.com/2017/amazons-operating-system/) Also, there’s an estimated 24.5m of these voice things around (https://cote.io/2017/01/26/alexa-how-many-of-your-type-exists/). ClusterHQ Shutting Down Docker storage startup shuts down (http://www.storagenewsletter.com/rubriques/start-ups/start-up-clusterhq-shutting-down/) Facebook’s 2016 Open Source Contributions Open source continues to be great for recruiting (and probably code) (https://code.facebook.com/posts/1058188987642144/facebook-open-source-2016-year-in-review/) Google buys Twitter’s Fabric CASH! 
(http://variety.com/2017/digital/news/google-buys-fabric-from-twitter-1201962640/) Bruce Sterling/Jon Lebkowsky “State of the World” Always a good read (http://www.well.com/conf/inkwell.vue/topics/495/Bruce-Sterling-and-Jon-Lebkowsky-page01.html) Recommendations Brandon: RTIC 30oz Tumbler (http://amzn.to/2kNhcrw). Matt: Donate to the ACLU (https://www.aclu.org/). RTJ3 is out, and free (https://runthejewels.com/)! My 2016 year in the air (http://cem.re/year-in-review/3b675cc593dc326ec6d2835144db5800d0b28e35.html) Tennis ball making video (http://mentalfloss.com/article/83414/mesmerizing-video-shows-how-tennis-balls-are-made) Coté: big jar of green hatch! Get a 40 (http://amzn.to/2kHP6hQ)! Also, how to feed three people with one bean (https://www.youtube.com/watch?v=KqEVYbPw9lI&feature=youtu.be&t=1m26s).

PreAccident Investigation Podcast
PAPod 57 - System Reliability - John Allspaw

Feb 13, 2016 46:33


Safety Podcast, DevOps, Safety Culture, Safety Differently, Safety Leadership, New View Safety, Organizational Change, Operational Excellence, Safety, Reliability
So this podcast episode is one I have been looking forward to doing for almost a year. When the podcast first started, a person "tweeted" an episode on being wrong... and suddenly dozens of people were asking me about John Allspaw. John has worked in systems operations for over fourteen years in biotech, government and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon.com, InfoWorld.com, Friendster, and Flickr. He is now the CTO at Etsy, and is the author of "The Art of Capacity Planning" and "Web Operations" published by O'Reilly. He speaks from time to time at conferences on topics related to web operations, operations and development culture, infrastructure, and capacity planning. He's a dad, guitarist, engineer, and wiseguy. He is also a great podcast interview. I bet you cannot listen to this episode without learning something new... come on! Take the bet. You know I am good for it. I know you will love this episode. Thanks for listening and tell your friends. You are a part of the fastest growing safety podcast on earth! This episode is sponsored by UA Workplace Health and Safety. Learn more and thank them at UAWHS.com. PS. Hey John. I play the guitar, you play the guitar, we should play guitar.

RunAs Radio
Jeffrey Snover Is Serious About DevOps!

Aug 1, 2012 38:42


Richard chats with Jeffrey Snover about DevOps at Microsoft. The concept of DevOps goes back to a talk that John Allspaw and Paul Hammond gave at Velocity called "10+ Deploys Per Day." DevOps focuses on having developers and operations work closely together to make rapid deployments possible. Jeff discusses his blog post on the subject of Windows Server 2012, PowerShell 3.0 and DevOps. DevOps is coming to the Microsoft world. Are you ready?

USI - Les sessions - iPad / Apple TV
2011 - John Allspaw - Building Resilience in Web Development and Operations

Sep 20, 2011 38:11


Resilience engineering is defined as "the ability of a system to adjust its functioning before, during, or after changes or disturbances, with the aim of ensuring that it keeps functioning after an unexpected event or in environments of continuous stress." (Erik Hollnagel)

USI - Les sessions - iPhone/iPod
2011 - John Allspaw - Building Resilience in Web Development and Operations

Sep 20, 2011 38:11


Resilience engineering is defined as "the ability of a system to adjust its functioning before, during, or after changes or disturbances, with the aim of ensuring that it keeps functioning after an unexpected event or in environments of continuous stress." (Erik Hollnagel)