A podcast about Chaos Engineering, presented by Gremlin. Find us on Twitter at @BTOPpod.

Today Jason and Julie catch up and reflect on their favorite moments from Season 3, including unpopular opinions, chaos engineering, make-or-break moments in engineers' careers, and more. They discuss the unique features of having established engineers and newer engineers on the show and what each one brings to the table, and they talk about some of their favorite “build” episodes, where engineers delve into the story of how they saw a need and then built a product to fulfill it. They conclude their conversation by sharing what's next for Break Things on Purpose. See you next season! In this episode we cover: Introduction to the episode and catching up with Jason and Julie (00:16) Jason and Julie identify some of their favorite guests from the season (4:49) The differences and advantages of having established engineers vs. newer engineers on the show (11:58) Jason and Julie talk about their favorite “build” episodes (15:56) What's coming for Break Things on Purpose (21:20) Links Referenced: January 11th, 2022 episode: https://www.gremlin.com/blog/podcast-break-things-on-purpose-unpopular-opinions/ Twitter: https://twitter.com/btoppod gremlin.com/podcast: https://gremlin.com/podcast loyaltyfreakmusic.com: https://loyaltyfreakmusic.com

In this episode, we cover: Mauricio talks about his background and his role at Pismo (1:14) Jason and Mauricio discuss tech and reliability with regards to financial institutions (5:59) Mauricio talks about the work he has done in Chaos Engineering with reliability (10:36) Mauricio discusses things he and his team have done to maximize success (19:44) Mauricio talks about new technologies his team has been utilizing (22:59) Links Referenced: Pismo: https://pismo.io/ LinkedIn: https://www.linkedin.com/company/pismo/ TranscriptMauricio: That's why the name Cockroach, I guess, if there's a [laugh] a world nuclear war here, all that will survive would be cockroaches in our client's data. [laugh]. So, I guess that's the gist of it.Jason: Welcome to Break Things on Purpose, a podcast about Chaos Engineering and reliability. In this episode, we chat with Mauricio Galdieri, a staff engineer at Pismo about testing versus exploration, reliability and resiliency, and the challenges of bringing new technologies to the financial sector.Jason: Welcome to the show.Mauricio: Hey, thank you. Welcome. Thanks for having me here, Jason.Jason: Yeah. So, Mauricio, you and I have chatted before in the past. We were at Chaos Conf, and you are part of a panel. So, I'm curious, I guess to kick things off, can you tell folks a little bit more about yourself and what you do at Pismo? And then we can maybe pick up from our conversations previously?Mauricio: Okay, awesome. I work as a staff engineer here at Pismo. I work in a squad called staff engineering squad, so we're a bunch of—five squad engineers there. And we're mostly responsible for coming up with new ways of using the existing technology, new technologies for us to have, and also standardize things like how we use those technologies here? How does it fit the whole processes we have here? 
And how does it fit in the pipelines we have here, also?
And so, we do lots of documentation, lots of POCs, and try different things, and we talk to different people from different companies and see how they're solving problems that we also have. So, this is basically our day-to-day activities here. Before that, well, I have a kind of a different story, I guess. Most people that work in this field have a degree in something like a technical degree or something like that. But I actually graduated as an architect in urban planning, so I came from a completely different field.
But I've always worked as a software developer since a long time ago, more than [laugh] willing to disclose. So, at that time when I started working with software development, I like to say that startups were called dotcoms back then, so, [laugh] there were lots of job opportunities back then, so I worked as a software developer at that time. And things evolved. I grew less and less as an architect and more as an engineer, so after I graduated, I started to look for a second degree, but at a more technical college, so I went to an engineering college and graduated as a system analyst.
So, from then on, I've always worked as a software developer and never, never have done any house planning or house project or something like that. And I really doubt if I could do that right now [laugh] so I may be a lousy architect [in that sense 00:03:32]. But anyway, I've worked in different companies in both the private and public sectors. And I've worked with consultancy firms and so on. But just before I came to Pismo, I went to work with a FinTech.
So, this is where I had my first contact with the world of finance in a software context. Since then, I've dug deep into this industry, and here I am now working at Pismo, for almost five years now.
Jason: Wow. That's quite a journey. 
And although it's a unique journey, it's also one that I feel like a lot of folks in tech come from different backgrounds and maybe haven't gone down the traditional computer science route. With that said, you know, one of the things you mentioned was FinTech. Can you give us a little bit of a description of Pismo, just so folks understand the company that you're working at now?
Mauricio: Oh, yeah. Well, Pismo, it's a company that has been around for about six years now. And we provide infrastructure for financial services. So, we're not banks ourselves, but we provide the infrastructure for banks to build their financial projects with this. So basically, what we do is we manage accounts, we manage those accounts' balances, we have connections with credit card networks, so we process—we're also a credit card processor.
We issue cards, although we're not the issuer in the strict sense, but we issue cards here and manage all the lifecycle of those cards. And basically, that's it. But we have a very broad offering of products, from account management to accounting management, and transactions management, and spending control limits and stuff. So, we have a very broad product portfolio. But basically, what we do is provide infrastructure for financial services.
Jason: That's fascinating to me. So, if I were to sum that up, would it be accurate to say that you're basically like Software as a Service for financial institutions? You do all the heavy lifting?
Mauricio: Yeah, yeah. I could say that, yeah.
Jason: It's interesting to me because, you know, traditionally, we always think of banks because they need to be regulated and there needs to be a whole lot more security and reliability around finances, we always think of banks as being very slow when it comes to technology. 
And so, I think it's interesting that, in essence, what you've said with trying the latest technology and getting to play around with new technology and how it applies, especially within your staff engineering group, it's almost the exact opposite. You're sort of this forefront, this leading edge within the world of finance and technology.Mauricio: Yeah. And that actually is, it's something that—it's the most difficult part to sell banks to sign up with us, you know? Because they have those ancient systems running on-premises and most likely running on top of COBOL programs and so on. But at the same time, it's highly, highly reliable. That they've been running those systems for, like, 40 years, even more than that, so it's a very highly reliable.And as you said, it's a very regulated industry, so it's very hard to sell them this kind of new approach to banking. And actually, we consider this as almost an innovation for them. And it's a little bit strange to talk about innovation in a sense that we're proposing other companies to run in the cloud. This doesn't sound innovating at all nowadays. So, every company runs their systems in the cloud nowadays, so it's difficult to [laugh] realize that this is actually innovation in the banking system because they're not used to running those things.And as you said, they're slow in adopting new technologies because of security concerns, and so on. So, we're trying to bring these new things to the table and prove them. And we had to prove banks and other financial institutions that it is possible to run a banking system a hundred percent in the cloud while maintaining security standards and security compliances and governance compliance and all that stuff. It's very hard to do so and we have a very stringent process to evaluate and assess new technologies because we have to make sure it complies with those standards and all those certifications that we need to have in order to operate in this industry. 
So, it's very hard, but at the same time, we have lots of new technologies and different ways we can provide the same services to those banks.
And then I think the most difficult part in this is to map what traditional banks were doing into this new way of doing things in the cloud. So, this mapping, sometimes it gets a little confusing and we have to be very patient and very clear with our clients what they should expect from us and how we will provide the same services they already have now, but using different technologies and different ways. For instance, they are used to these communications with different services, they're used to things like webhooks. But webhooks are not reliable; they can fail and if they fail, you lose that connection, you lose connectivity, and you may lose data and you may have things out of sync using webhooks. So, now we have things like event streaming, or queues and other stuff that you can use to [replay 00:09:47] things and not lose any data.
But at the same time, you have to process this offline, in an asynchronous manner. So, you have to map those synchronous things that they did before to this asynchronous world and this world where things are—we have an eventual consistency. It's very difficult, but at the same time, it's a very fascinating industry.
Jason: Yeah, that is fascinating. But I do love how you mentioned taking the idea of the new technology and what it does, and really trying to map that back to previously—you know, those previous practices that they had. And so, along with that, for folks who are listening again, Mauricio and I had a chat during Chaos Conf a while back, and he was sharing some of the practices that Pismo has done for Chaos Engineering. 
And I always liken that back to, you know, Chaos Engineering really is very similar to traditional disaster recovery testing, in many ways, other than oftentimes, your disaster recovery would never actually, you know, take things down. Mauricio, I'm curious, can you share a little bit more about what you've been doing with Chaos Engineering and in general, with reliability? Are there any new programs or processes that you've worked on within Pismo around Chaos Engineering and reliability?
Mauricio: Well, I think that the first thing to realize, and I think this is the most important point that you need to have very clear in your mind when we're talking about Chaos Engineering, is that we're not testing something when we're doing Chaos Engineering; we're experimenting with something. And there's a subtle but very important distinction between those two concepts. When you test for something, you're testing for something where you know what will happen; you have an idea of how it should behave. You're asserting a certain behavior. You know how the system must behave and you assert that, and you make sure the system doesn't deviate from that by having an automated test, for instance, a unit or integration test, or even functional tests and such.
But Chaos Engineering is more about experimenting. So, it's designed for the unknowns. You don't know what will happen. You're basically experimenting. It's like a lab, you're working in a laboratory, you're trying different stuff and see what happens, you have an idea of what should happen and we call this a hypothesis, but you're not sure if that is how it will behave.
And actually, it doesn't matter if it complies with your expectations. Even if it doesn't behave the way you expect it to behave or the way you want it to behave, you're still gaining knowledge about your system. So, it's much more about experimenting with new things instead of actually testing for something that you know about. 
And our journey here into Chaos Engineering at Pismo, it all began about a year-and-a-half ago when we got a very huge outage on one of our major cloud providers here. And we went down with them; they were out for about almost an hour.But not only we were affected by it, but other digital banks here in Brazil, but also many other services like Slack, Datadog, other observability tools that were running at that time, using that cloud provider went down, together with them. So, it was a major, major outage here. And then we were actually caught off guard on this because we have lots of different ways to make sure the system doesn't go down if something bad happens. But that was so bad that we went down and we couldn't do anything. We were desperate because we couldn't do anything. And also we can even communicate properly because we use Slack as our communication hub, so Slack was down at that time, also, so we cannot communicate properly with our official channels.Also, Datadog that we were using at a time also went down and we couldn't even see what was happening in the system because we didn't have any observability running at the time. So, that was a major, major outage we had there. So, we started thinking about ways we could experiment with those major outages and see how we could find ways of still operating at least partially and not go down entirely or at least have ways to see what was happening even in the face of a major disaster. And those traditional disaster recovery measures that were valid at the time, even those couldn't cope with the kind of outages we were facing at that time. 
So, we were trying to look for different ways that we can improve the reliability of our services as a whole.So, I guess that's when we started looking into Chaos Engineering and started looking for different tools to make that work, and different partnerships we could find, and even different ways we could experiment this with our existing technology and platform.Jason: I really love how you characterized that difference between testing and Chaos Engineering. And I think the idea of being more experimental puts you into a mindset of having this concept of, you know, kind of blamelessness, right, around failure. The idea that, like, failure is going to happen and we want to be open to seeing that and to learning from it. More so than a test, right? When we test things, then there's the notion of a pass-fail and fails are bad, whereas with an experiment, that learning is, if it didn't happen the way you expect, there's learning around that and that's a good thing rather than a bad thing, such as failing a test.Mauricio: Yeah, and that works in a higher framework, I guess, which is resilience itself. So, I guess, chaos experiment, chaos engineering, and all that stuff, it's an important part of a bigger whole that we call resilience. And I guess a key to understand resilience is that this point exactly, the systems never work in unexpected ways. They always behave the way it is expected to behave. They're deterministic in nature. So, we're talking about machines here, computers. We told them what we want them to do.And even if we have complexity and randomness involved, say if a network connection goes down, it still will behave the way we programmed them to behave. So, every failure should be expected. What we have here is that sometimes they behave in ways we don't want them to behave. And sometimes they behave in ways we want them to behave. So, it's more of a matter of desire, you know? 
You want something, you want the system to behave a certain way.So, in that sense, success should be measured as a performance variability, you know? So, sometimes it will work the way you want and sometimes it will work your way in ways that you don't want it to behave. And I guess, realizing that, it's key also to understand another point that is, in that sense, success is the flip side of failure. So, either it works the way you want it or it works the way you don't want it. And what we can do to move the scale towards a more successful operation, the ways you can do this, you must first realize also that—let's go back a little bit then say, if you have a failure and you look at why it happened, almost never it is the result of one single thing.Sometimes it is, but this is very rare. Most of the failures and even mainly when we're talking about major failures, they're most likely the result of a context of things that happened that led to this failure. And you can see that the same thing, it's valid for successes. When you have a success at one point, it's almost never the result of one thing that you did that led to a successful scenario. Most of the time is a context of different things you did that maximizes your chances of success.So, to turn this scale towards success, you should create an environment of several things, of a context of things. And this could be tooling, this could be your organizational culture and stuff, all of those things that you do in your company to maximize their chances of success. It's not, you cannot plan for success in the sense because planning is one thing you can do, and planning doesn't involve strategy, for instance. Because planning should be done thinking about things you can do, tasks you can perform, while strategy, you should be turning tables to [laugh] think in terms of strategy. 
So, you have to put all of this in the same way in a table and try to organize your company and your culture, your tools and your technology in ways you maximize your chances of success and minimize your chances of failures.Jason: That's such an interesting insight. So, I'm curious, can you dive into some of the things that you and your team have done to maximize your chances of success?Mauricio: Okay. When we started working with Chaos Engineering, it was in this sense of trying to do one more thing to maximize our chances of success. And we partnered up with Gremlin and we saw that working with Chaos Engineering, using Gremlin mainly, it's so easy—that is, it's also easy to lose track of what you're doing. It's easy for you to go just for the fun of it and break things down and have fun with it and stuff. So, we had to come up with a way to bring structure to this process.And by doing so, we should also not be too bureaucratic in the sense of creating a set of steps you should take in order to run a chaos session. So, one way we thought about was to come up with a document. That is the bureaucratic part, so this was a step you should take in order to plan for your chaos session, but there is one part of it—and I think it's one of the most important parts of this chaos session planning—is that you should describe what you're going to test, but more importantly, why you're going to test this. And this is one of the most important questions because this is a fundamental question: why you're doing this kind of experiment. And to answer that, you have to think about all the things in context.What are the technologies you're using? Why it fails in the first place? Do the fails that I expect to see are actually fails or is it just different ways of behaving? And sometimes we consider failure in a business rule that was not complied, that was not met. So, this is an opportunity to think about, are those business rules correct? Should we make it more flexible? 
Should we change that business logic?
So, when you start asking why you're doing something, you're asking fundamental questions, and I think that puts you in context. And this is one of the major starting points to maximize our chances of success because it makes every engineer involved in running a chaos session think about their role in the whole process and the role of their services in the whole company. So, I think this is one powerful question to ask before starting any chaos session, and I think this contributes a lot to a successful outcome.
Jason: Yeah, I think that's a really great perspective on how to approach Chaos Engineering. Beyond the Chaos Engineering, you mentioned that the staff engineering group that you're part of at Pismo is really responsible for seeing new technologies and new trends and really trying to bring those in and see how they can be used and applied within the financial services sector. Are there any new technologies that you've used recently or that you're looking at right now that have really been fruitful or really applied to finding more success as you've mentioned?
Mauricio: Yeah, there are some things we're researching. One of those already went past research and we're already using it in production, which is data—cloud-based, multi-region databases and multi-cloud—also—databases. And we're working with CockroachDB as one of our new database technologies we use. And it's a database built from the ground up to be ultra resilient. And that's why the name Cockroach, I guess, if there's a [laugh] a world nuclear war here, all that will survive would be cockroaches and our clients' data. [laugh]. So, I guess that's the gist of it.
And we have to think about that in different ways of how we approach this because we're talking about multi-cloud data stores and multi-region and how we deal with data in different regions. And should we replicate all the data between regions, and how do we partition data? 
So, we have to think in different ways, how we approach data modeling with those new cloud-based and multi-region and globally distributed databases. Another one that we're—this is more like of a research, is having a sharded processing. And that is, how we can deal with, how we group different parts of the data to be processed separately but using the same logic.And this is a way to scale processing in ways that horizontal scaling in a more traditional way doesn't solve in some instances. Like, when we have—for instance, let me describe one scenario that we have that we're exploring things along those lines. We have a system here called ‘The Ledger,' which keeps track of all of the accounts' balances. And for this system, if we have multiple requests or lots of requests for different accounts, there's no problem because we're updating balances for different accounts, and that works fine. And we can deal with lots and lots of requests. We have a very good performance on that.But when we have lots of requests coming in from one particular accounts, and they're all grouped for this particular account, then we cannot—there's no way around locking at some place. So, you have to lock it either at the database level, or at a distributed locking mechanism level, or at the business logic layer. At some point, you have to lock the access to this account balance. So, this degrades performance because you have to wait for this processing to finish and start another. And how can we deal with that without using locks?And this was the challenge we put that to ourselves. And we're exploring different ways, lots of different ways, and different approaches to that. And we have lots of restrictions on that because this system has to respond quickly, has to respond online, and cannot be in an asynchronous process; it has to be synchronous. So, we have very little space for double-checking it and stuff. 
So, we're exploring a sharded processing for this one in which we can have a small subset of accounts being routed to one specific consumer to process this transaction, and by doing so, we may have things like a queue of order transactions so we can give up locking at the database and maybe improve on performance. But we're still on the POC on that, so let's see what we come up with [laugh] in the next few months.Jason: I think that's really fascinating. Both from a, you know, having been there, having worked on systems where, you know, very transaction-driven, and having locks be an issue. And so, you know, back in my day of doing this, you know, was traditionally MySQL or Postgres, trying to figure out, like, how do you structure the database. So, I think it's interesting that you're sort of tackling this in two ways, right? You've got CockroachDB, which is more oriented towards reliability, but a lot of the things that you're doing there around, you know, sharding and multi-cloud also have effects for this new work that you're doing on how do you eliminate that locking and try to do sharded processes as well. So, that's all super fascinating to me.Mauricio: Exactly. Yeah, yeah. This is one of the things that makes you do better the end of the day, you know? [laugh].Jason: Yeah, definitely. As an engineer, you know, if anybody's listening and you're thinking of, “Wow, this all sounds fascinating and really cool stuff,” right, “Really cool technologies to be working with and really interesting challenges to solve,” I know, Mauricio, you said that Pismo is hiring. Do you want to share a little bit more about ways that folks can engage with you? Or maybe even join your team?Mauricio: Yeah, sure. We're hiring; we have lots of jobs open for application. You can go to pismo.io and we have a section for that. 
And also, you can find us on LinkedIn; just search for Pismo and then find us there.
And I think if you're an engineer and looking for some cool challenges on that, be sure to check our open positions because we do have lots and lots of cool stuff going on here. And since we're growing globally, you have a chance to work from wherever you are. And this also imposes some major challenges for [laugh] for new technologies and making our products, our existing products, work in a globally distributed banking system. So, be sure to check out our channels there.
Jason: Fantastic. Before we wrap up, is there anything else that you'd like to promote or share?
Mauricio: Oh no, I think those are the main channels. You can find us: LinkedIn and our own website, pismo.io. Also, you can find us in some GopherCon conferences, KubeCon, and other—Money20/20; we're attending all of those conferences, be it in the software industry or in the financial industry. You can find us there with a booth there or just visiting or participating in some conferences and so on. So, be sure to check that out there also. I guess that's it.
Jason: Very cool. Well, thanks, Mauricio, for joining us. It's been a pleasure to chat with you again.
Mauricio: Thank you, Jason. And thanks for having me here.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.
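
Editor's sketch: the sharded-processing idea Mauricio describes for the ledger (routing a subset of accounts to one specific consumer so that transactions for the same account are applied in order, without a database lock) might look roughly like the Python below. This is not Pismo's implementation, which the episode says is still at the POC stage. The shard count, the CRC32-based routing, and the names `shard_for`, `ShardConsumer`, and `process` are all assumptions made purely for illustration.

```python
import zlib
from collections import defaultdict, deque

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for(account_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an account to a shard with a stable hash.

    The same account always lands on the same shard, so a single
    consumer sees every transaction for that account.
    """
    return zlib.crc32(account_id.encode("utf-8")) % num_shards


class ShardConsumer:
    """Owns the balances for one shard and applies transactions in order."""

    def __init__(self) -> None:
        self.queue = deque()              # ordered transactions for this shard
        self.balances = defaultdict(int)  # account id -> balance in cents

    def enqueue(self, account_id: str, amount_cents: int) -> None:
        self.queue.append((account_id, amount_cents))

    def drain(self) -> None:
        # One transaction at a time: updates to a balance never interleave,
        # so no lock is needed inside the shard.
        while self.queue:
            account_id, amount_cents = self.queue.popleft()
            self.balances[account_id] += amount_cents


def process(transactions, num_shards: int = NUM_SHARDS) -> dict:
    """Route each (account_id, amount_cents) pair to its shard, then apply."""
    consumers = [ShardConsumer() for _ in range(num_shards)]
    for account_id, amount_cents in transactions:
        consumers[shard_for(account_id, num_shards)].enqueue(account_id, amount_cents)

    balances = {}
    for consumer in consumers:
        consumer.drain()
        balances.update(consumer.balances)  # shards hold disjoint accounts
    return balances
```

Because every transaction for a given account lands on the same consumer, and each consumer drains its queue sequentially, two updates to one balance cannot interleave. A real system would still have to answer synchronously, as Mauricio notes, and cope with rebalancing when shards are added or removed; the sketch leaves both out.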

In this episode, we cover: Aaron talks about starting out as a developer and the early stages of cloud development at RBC (1:05) Aaron discusses transitioning to developer advocacy (12:25) Aaron identifies successes he had in his early days of developer advocacy (20:35) Jason asks what it looks like to assist developers in achieving completion with long term maintenance projects, or “sustainable development” (25:40) Jason and Aaron discuss what “innersource” is and why it's valuable in an organization (29:29) Aaron answers the question “how do you keep skills and knowledge up to date?” (33:55) Aaron talks about job opportunities at RBC (38:55) Links Referenced: Royal Bank of Canada: https://www.rbcroyalbank.com Opportunities at RBC: https://jobs.rbc.com/ca/en TranscriptAaron: And I guess some PM asked my boss, “So, Aaron doesn't come to our platform status meetings, he doesn't really take tickets, and he doesn't take support rotation. What does Aaron do for the Cloud Platform Team?”Jason: [laugh].Jason: Welcome to Break Things on Purpose, a podcast about reliability, learning, and building better systems. In this episode, we talk with Aaron Clark, Director of Developer Advocacy at the Royal Bank of Canada. We chat with him about his journey from developer to advocate, the power of applying open-source principles within organizations—known as innersource—and his advice to keep learning.Jason: Welcome to the show, Aaron.Aaron: Thanks for having me, Jason. My name is Aaron Clark. I'm a developer advocate for cloud at RBC. That is the Royal Bank of Canada. And I've been at the bank for… well, since February 2010.Jason: So, when you first joined the bank, you were not a developer advocate, though?Aaron: Right. So, I have been in my current role since 2019. I've been part of the cloud program since 2017. Way back in 2010, I joined as a Java developer. So, my background in terms of being a developer is pretty much heavy on Java. 
Java and Spring Boot, now.I joined working on a bunch of Java applications within one of the many functions areas within the Royal Bank. The bank is gigantic. That's kind of one of the things people sometimes struggle to grasp. It's such a large organization. We're something like 100,000… yeah, 100,000 employees, around 10,000 of that is in technology, so developers, developer adjacent roles like business analysts, and QE, and operations and support, and all of those roles.It's a big organization. And that's one of the interesting things to kind of grapple with when you join the organization. So, I joined in a group called Risk IT. We built solely internal-facing applications. I worked on a bunch of stuff in there.I'm kind of a generalist, where I have interest in all the DevOps things. I set up one of the very first Hudson servers in Risk—well, in the bank, but specifically in Risk—and I admin'ed it on the side because nobody else was doing it and it needed doing. After a few years of doing that and working on a bunch of different projects, I was occasionally just, “We need this project to succeed, to have a good foundation at the start, so Aaron, you're on this project for six months and then you're doing something different.” Which was really interesting. At the same time, I always worry about the problem where if you don't stay on something for very long, you never learn the consequences of the poor decisions you may have made because you don't have to deal with it.Jason: [laugh].Aaron: And that was like the flip side of, I hope I'm making good decisions here. It seemed to be pretty good, people seemed happy with it, but I always worry about that. Like, being in a role for a few years where you build something, and then it's in production, and you're running it and you're dealing with, “Oh, I made this decision that seems like a good idea at the time. Turns out that's a bad idea. 
Don't do that next time.” You never learned that if you don't stay in a role.When I was overall in Risk IT for four, almost five years, so I would work with a bunch of the teams who maybe stayed on this project, they'd come ask me questions. It's like, I'm not gone gone. I'm just not working on that project for the next few months or whatever. And then I moved into another part of the organization, like, a sister group called Finance IT that runs kind of the—builds and runs the general ledger for the bank. Or at least for a part of capital markets.It gets fuzzy as the organization moves around. And groups combine and disperse and things like that. That group, I actually had some interesting stuff that was when I started working on more things like cloud, looking at cloud, the bank was starting to bring in cloud. So, I was still on the application development side, but I was interested in it. I had been to some conferences like OSCON, and started to hear about and learn about things like Docker, things like Kubernetes, things like Spring Boot, and I was like this is some really neat stuff.I was working on a Spark-based ETL system, on one of the early Hadoop clusters at the bank. So, I've been I'm like, super, super lucky that I got to do a lot of this stuff, work on all of these new things when they were really nascent within the organization. I've also had really supportive leadership. So, like, I was doing—that continuous integration server, that was totally on the side; I got involved in a bunch of reuse ideas of, we have this larger group; we're doing a lot of similar things; let's share some of the libraries and things like that. That was before being any, like, developer advocate or anything like that I was working on these.And I was actually funded for a year to promote and work on reuse activities, basically. 
And that was—I learned a lot, I made a lot of mistakes that I now, like, inform some of the decisions I make in my current role, but I was doing all of this, and I almost described it as I kind of taxed my existing project because I'm working on this team, but I have this side thing that I have to do. And I might need to take a morning and not work on your project because I have to, like, maintain this build machine for somebody. And I had really supportive leadership. They were great.They recognize the value of these activities, and didn't really argue about the fact that I was taking time away from whatever the budget said I was supposed to be doing, which was really good. So, I started doing that, and I was working in finance as the Cloud Team was starting to go through a revamp—the initial nascent Cloud Team at the bank—and I was doing cloud things from the app dev side, but at the same time within my group, anytime something surprising became broken, somebody had some emergency that they needed somebody to drop in and be clever and solve things, that person became me. And I was running into a lot of distractions in that sense. And it's nice to be the person who gets to work on, “Oh, this thing needs rescuing. Help us, Aaron.”That's fantastic; it feels really good, right, up until you're spending a lot of your time doing it and you can't do the things that you're really interested in. So, I actually decided to move over to the Cloud Team and work on kind of defining how we build applications for the cloud, which was really—it was a really good time. It was a really early time in the bank, so nobody really knew how we were going to build applications, how we were going to put them on the cloud, what does that structure look like? I got to do a lot of reading and research and learning from other people. 
One of the key things about, like, a really large organization that's a little slow-moving like the bank and is a little bit risk-averse in terms of technology choices, people always act like that's always a bad thing.And sometimes it is because we're sometimes not adopting things that we would really get a lot of benefit out of, but the other side of it is, by the time we get to a lot of these technologies and platforms, a bunch of the sharp edges have kind of been sanded off. Like, the Facebooks and the Twitters of the world, they've adopted it and they've discovered all of these problems and been, like, duct-taping them together. And they've kind of found, “Oh, we need to have actual, like, security built into this system,” or things like that, and they've dealt with it. So, by the time we get to it, some of those issues are just not there anymore. We don't have to deal with them.Which is an underrated positive of being in a more conservative organization around that. So, we were figuring there's a lot of things we could learn from. When we were looking at microservices and, kind of, Spring Boot Spring Cloud, the initial cloud parts that had been brought into the organization were mainly around Cloud Foundry. And we were helping some initial app teams build their applications, which we probably over-engineered some of those applications, in the sense that we were proving out patterns that you didn't desperately need for building those applications. Like, you could have probably just done it with a web app and relational database and it would have been fine.But we were proving out some of the patterns of how do you build something for broader scale with microservices and things like that. We learned a bunch about the complexities of doing that too early, but we also learned a bunch about how to do this so we could teach other application teams. 
And that's kind of the group that I became part of, where I wasn't a platform operator on the cloud, but I was working with dev teams, building things with dev teams to help them learn how to build stuff for cloud. And this was my first real exposure to that scope and scale of the bank. I'd been in the smaller groups and one of the things that you start to encounter when you start to interact with the larger parts of the bank is just, kind of, how many silos there are, how diverse the tech stacks are in an organization of that size.Like, we have areas that do things with Java, we have areas doing things with .NET Framework, we have areas doing lots of Python, we have areas doing lots of Node, especially as the organization started building more web applications. While you're building things with Angular and using npm for the front-end, so you're building stuff on the back-end with Node as well. Whether that is a good technology choice, a lot of the time you're building with what you have. Even within Java, we'd have teams building with Spring Boot, and lots of groups doing that, but someone else is interested in Google Guice, so they're building—instead of Spring, they're using Google Guice as their dependency injection framework.Or they have a… like, there's the mainframe, right? You have this huge technology stack where lots of people are building Java EE applications still and trying to evolve that from the old grungy days of Java EE to the much nicer modern ways of it. And some of the technology conversations are things like, “Well, you can use this other technology; that's fine, but if you're using that, and we're using something else over here, we can't help each other. When I solve a problem, I can't really help solve it for you as well. 
You have to solve it for yourself with your framework.” I talked to a team once using Vert.x in Java, and I asked them, “Why are you using Vert.x?” And they said, “Well, that's what our team knew.” I was like, “That's a good technology choice in the sense that we have to deliver. This is what we know, so this is the thing we know we can succeed with rather than actually learning something new on the job while trying to deliver something.” That's often a recipe for challenges if not outright failure. Jason: Yeah. So, it sounds like that's kind of where you come in; if all these teams are doing very disparate things, right— Aaron: Mm-hm. Jason: That's both good and bad, right? That's the whole point of microservices is independent teams, everyone's decoupled, more velocity. But also, there's huge advantages—especially in an org the size of RBC—to leverage some of the learnings from one team to another, and really, like, start to share these best practices. I'm guessing that's where you come into play now in your current role. Aaron: Yeah. And that's the part where how do we have the flexibility for people to make their own choices while standardizing so we don't have this enormous sprawl, so we can build on things? And this is kind of where I started really getting involved in community stuff and doing developer advocacy. And part of how this actually happened—and this is another one of those cases where I've been very fortunate and I've had great leaders—I was working as part of the Cloud Platform Team. The Special Projects group that I was in, a couple of people left; I was the last one left. It's like, “Well, you can't be your own department, so you're part of Cloud Platform.” But I'm not an operator. I don't take a support rotation. And I'm ostensibly building tooling, but I'm mostly doing innersource. This is where the innersource community started to spin up at RBC. 
I was one of the, kind of, founding members of the innersource community and getting that going. We had built a bunch of libraries for cloud, so those were some of the first projects into innersource where I was maintaining the library for Java and Spring using OIDC. And this is kind of predating Spring Security's native support for OIDC—so Open ID Connect—And I was doing a lot of that, I was supporting app teams who were trying to adopt that library, I was involved in some of the other early developer experience things around, you complain this thing is bad as the developer; why do we have to do this? You get invited to one of the VP's regular weekly meetings to discuss, and now you're busy trying to fix, kind of, parts of the developer experience. I was doing this, and I guess some PM asked my boss, “So, Aaron doesn't come to our platform status meetings, he doesn't really take tickets, and he doesn't take support rotation. What does Aaron do for the Cloud Platform Team?”Jason: [laugh].Aaron: And my boss was like, “Well, Aaron's got a lot of these other things that he's involved with that are really valuable.” One of the other things I was doing at this point was I was hosting the Tech Talk speaking series, which is kind of an internal conference-style talks where we get an expert from within the organization and we try to cross those silos where we find someone who's a machine-learning expert; come and explain how TensorFlow works. Come and explain how Spark works, why it's awesome. And we get those experts to come and do presentations internally for RBC-ers. And I was doing that and doing all of the support work for running that event series with the co-organizers that we had.And at the end of the year, when they were starting up a new initiative to really focus on how do we start promoting cloud adoption rather than just people arrive at the platform and start using it and figure it out for themselves—you can only get so far with that—my boss sits me down. 
He says, “So, we really like all the things that you've been doing, all of these community things and things like that, so we're going to make that your job now.” And this is how I arrived there. It's not like I applied to be a developer advocate. I was doing all of these things on the side and all of a sudden, 75% of my time was all of these side projects, and that became my job. So, it's not really the most replicable, like, career path, but it is one of those things where, like, getting involved in stuff is a great way to find a niche that is the things that you're passionate about. So, I changed my title. You can do that in some of our systems as long as your manager approves it, so I changed my title from the very generic ‘Senior Technical Systems Analyst'—which, who knows what I actually do when that's my title—and I changed that to ‘Developer Advocate.' And that was when I started doing more research, learning about what actual developer advocates do, because I want to be a developer advocate. I want to say I'm a developer advocate. For the longest time in the organization, I'm the only person in the company with that title, which is interesting because then nobody knows what to do with me because I'm not like—am I, like—I'm not a director, I'm not a VP. Like… but I'm not just a regular developer, either. Where—I don't fit in the hierarchy. Which is good because then people stop getting worried about what titles are and things like that, and they just listen to what I say. So, I do, like, design consultations with dev teams, making sure that they knew what they were doing, or were aware of a bunch of the pitfalls when they started to get onto the cloud. I would build a lot of samples, a lot of docs, do a lot of the community engagement, so going to events internally that we'd have, doing a lot of those kinds of things. 
A lot of the innersource stuff I was already doing—the speaking series—but now it was my job formally, and it helped me cross a lot of those silos and work very horizontally. That's one of the different parts about my job versus a regular developer, is it's my job to cover anything to do with cloud—that at least, that I find interesting, or that my boss tells me I need to work at—and anything anywhere in the organization that touches. So, a dev team doing something with Kubernetes, I can go and talk to them. If they're building something in capital markets that might be useful, I can say, “Hey, can you share this into innersource so that other people can build on this work as well?”And that was really great because I develop all of these relationships with all of these other groups. And that was, to a degree, what the cloud program needed from me as well at that beginning. I explained that this was now my job to one of my friends. And they're like, “That sounds like the perfect job for you because you are technical, but you're really good with people.” I was like, “Am I? I guess I am now that I've been doing it for this amount of time.”And the other part of it as we've gone on more and more is because I talk to all of these development teams, I am not siloed in, I'm not as tunneled on the specific thing I'm working with, and now I can talk to the platform teams and really represent the application developer perspective. Because I'm not building the platform. And they have their priorities, and they have things that they have to worry about; I don't have to deal with that. My job is to bring the perspective of an application developer. That's my background.I'm not an operator; I don't care about the support rotation, I don't care about a bunch of the niggly things and toil of the platform. It's my job, sometimes, to say, hey, this documentation is well-intentioned. 
I understand how you arrived at this documentation from the perspective of being the platform team and the things that you prioritize and want to explain to people, but as an application developer, none of the information that I need to build something to run on your platform is presented in a manner that I am able to consume. So, I do, like, that side as well of providing customer feedback to the platform saying, “This thing is hard,” or, “This thing that you are asking the application teams to work on, they don't want to care about that. They shouldn't have to care about this thing.” And that sort of stuff. So, I ended up being this human router sometimes, where platform teams will say, “Do you know anybody who's doing this, who's using this thing?” Or finding one app team and saying, “You should talk to that group over there because they are also doing the same thing, or they're struggling with the same thing, and you should collaborate.” Or, “They have solved this problem.” Because I don't know every single programming language we use, I don't know all of the frameworks, but I know who I ask for Python questions, and I will send teams to that person. And part of that, then, as I started doing this community work was actually building community. One of the great successes was, we have a Slack channel called ‘Cloud Adoption.' And that was the place where everybody goes to ask their questions about how do I do this thing to put something on Cloud Foundry, put it on Kubernetes? How do I do this? I don't understand. And that was sometimes my whole day, just going onto that Slack channel, answering questions, and being very helpful and trying to document things, trying to get a feel for what people were doing. It was my whole day, sometimes. It took a while to get used to the idea that that was actually, like, a successful day, coming from a developer background. I'm used to building things, so I feel like success is because I built something I can show you, that I did this today. 
And then I'd have days where I talked to a bunch of people and I don't have anything I can show you. That was, like, the hard part of taking on this role.But one of the big successes was we built this community where it wasn't just me. Other people who wanted to help people, who were just developers on different dev teams, they'd see me ask questions or answer questions, and they would then know the answers and they'd chime in. And as I started being tasked with more and more other activities, I would then get to go—I'd come back to Slack and see oh, there's a bunch of questions. Oh, it turns out, people are able to help themselves. And that was—like that's success from that standpoint of building community.And now that I've done that a couple times with Tech Talks, with some of the developer experience work, some of the cloud adoption work, I get asked internally how do you build community when we're starting up new communities around things like Site Reliability Engineering. How are we going to do that? So, I get—and that feels weird, but that's one of the things that I have been doing now. And as—like, this is a gigantic role because of all of the scope. I can touch anything with anyone in cloud.One of the scope things with the role, but also with the bank is not only do we have all these tech stacks, but we also have this really, really diverse set of technical acumen, where you have people who are experts already on Kubernetes. They will succeed no matter what I do. They'll figure it out because they're that type of personality, they're going to find all the information. If anything, some of the restrictions that we put in place to manage our environments and secure them because of the risk requirements and compliance requirements of being a regulated bank, those will get in the way. Sometimes I'm explaining why those things are there. Sometimes I'm agreeing with people. “Yeah, it sucks. 
I don't want to have to do this.”But at the same time, you'll have people who they just want to come in, write their code, go home. They don't want to think about technology other than that. They're not going to go and learn things on their own necessarily. And that's not the end of the world. As strange as that sounds to people who are the personality to be constantly learning and constantly getting into everything and tinkering, like, that's me too, but you still need people to keep the lights on, to do all of the other work as well. And people who are happy just doing that, that's also valuable.Because if I was in that role, I would not be happy. And someone who is happy, like, this is good for the overall organization. But the things that they need to learn, the things they need explained to them, the help they need for success is different. So, that's one of the challenges is figuring out how do you address all of those customers? And sometimes even the answer for those customers is—and this is one of the things about my role—it's like the definition is customer success.If the application you're trying to put on cloud should not go on cloud, it is my job to tell you not to put it on cloud. It is not my job to put you on cloud. I want you to succeed, not just to get there. I can get your thing on the cloud in an afternoon, probably, but if I then walk away and it breaks, like, you don't know what to do. So, a lot of the things around how do we teach people to self-serve, how do we make our internal systems more self-serve, those are kind of the things that I look at now.How do I manage my own time because the scope is so big? It's like, I need to figure out where I'm not moving a thousand things forward an inch, but I'm moving things to their completion. 
And I am learning to, while not managing people, still delegate and work with the community, work with the broader cloud platform group around how do I let go and help other people do things?Jason: So, you mentioned something in there that I think is really interesting, right, the goal of helping people get to completion, right? And I think that's such an interesting thing because I think as—in that advocacy role, there's often a notion of just, like, I'm going to help you get unstuck and then you can keep going, without a clear idea of where they're ultimately heading. And that kind of ties back into something that you said earlier about starting out as a developer where you build things and you kind of just, like, set it free, [laugh] and you don't think about, you know, that day two, sort of, operations, the maintenance, the ongoing kind of stuff. So, I'm curious, as you've progressed in your career, as you've gotten more wisdom from helping people out, what does that look like when you're helping people get to completion, also with the mindset of this is an application that's going to be running for quite some time. Even in the short term, you know, if it's a short-term thing, but I feel like with the bank, most things probably are somewhat long-lived. How do you balance that out? How do you approach that, helping people get to done but also keeping in mind that they have to—this app has to keep living and it has to be maintained?Aaron: Yeah, a lot of it is—like, the term we use is sustainable development. And part of that is kind of removing friction, trying to get the developers to a point where they can focus on, I guess, the term that's often used in the industry is their inner loop. And it should come as no surprise, the bank often has a lot of processes that are high in friction. There's a lot of open a ticket, wait for things. This is the part that I take my conversations with dev teams, and I ask them, “What are the things that are hard? 
What are the things you don't like? What are the things you wish you didn't have to do or care about?”And some of this is reading between the lines when you talk to them; it's not so much interviewing them. Like, any kind of requirements gathering, usually, it's not what they say, it's what they talk about that then you look at, oh, this is the problem; how do we unstuck that problem so that people can get to where they need to be going? And this kind of informs some of my feedback to the systems we put in place, the processes we put in place around the platform, some of the tooling we look at. I really, really love the philosophy from Docker and Solomon Hykes around, “Batteries included but removable.” I want developers to have a high baseline as a starting point.And this comes partly from my experience with Cloud Foundry. Cloud Foundry has a really great out-of-the-box dev experience for lots of things where, “I just have a web app. Just run it. It's Nginx; it's some HTML pages; I don't need to know all the details. Just make it go and give me the URL.”And I want more of that for app teams where they have a high baseline of things to work with as a starting point. And kind of every organization ends up building this, where they have—like, Netflix: Netflix OSS or Twitter with Finagle—where they have, “Here's the surrounding pieces that I want to plug in that everybody gets as a starting point. And how do we provide security? 
How do we provide all of these pieces that are major concerns for an app team, that they have to do, we know they have to do?” Some of these are things that only start coming up when they're on the cloud and trying to provide a lot more of that for app teams so they can focus on the business stuff and only get into the weeds when they need to.Jason: As you're talking about these frameworks that, you know, having this high quality or this high baseline of tools that people can just have, right, equipping them with a nice toolbox, I'm guessing that the innersource stuff that you're working on also helps contribute to that.Aaron: Oh, immensely. And as we've gone on and as we've matured, our innersource organization, a huge part of that is other groups as well, where they're finding things that—we need this. And they'll put—it originally it was, “We built this. We'll put it into innersource.” But what you get with that is something that is very targeted and specific to their group and maybe someone else can use it, but they can't use it without bending it a little bit.And I hate bending software to fit it. That's one of the things—it's a very common thing in the corporate environment where we have our existing processes and rather than adopting the standard approach that some tool uses, we need to take it and then bend it until it fits our existing process because we don't want to change our processes. And that gets hard because you run into weird edge cases where this is doing something strange because we bent it. And it's like, well, that's not its fault at that point. As we've started doing more innersource, a lot more things have really become innersource first, where groups realize we need to solve this together.Let's start working on it together and let's design the API as a group. And API design is really, really hard. And how do we do things with shared libraries or services. 
And working through that as a group, we're seeing more of that, and more commonly things where, “Well, this is a thing we're going to need. We're going to start it in innersource, we'll get some people to use it and they'll be our beta customers. And we'll inform it without really specifically targeting an application and an app team's needs.” Because they're all going to have specific needs. And that's where the, like, ‘included but removable' part comes in. How do we build things extensibly where we have the general solution and you can plug in your specifics? And we're still—like, this is not an easy problem. We're still solving it, we're still working through it, we're getting better at it. A lot of it's just how can we improve day-over-day, year-over-year, to make some of these things better? Even our, like, continuous integration and delivery pipelines to our clouds, all of these things are in constant flux and constant evolution. We're supporting multiple languages; we're supporting multiple versions of different languages; we're talking about, hey, we need to get started adopting Java 17. None of our libraries or pipelines do that yet, but we should probably get on that since it's been out for—what—almost a year? And really working on kind of decomposing some of these things where we built it for what we needed at the time, but now it feels a bit rigid. How do we pull out the pieces? One of the big pushes in the organization after the log4j CVE and things like that with broad impact on the industry is we need to do a much more thorough job around software supply chain, around knowing what we have, making sure we have scans happening and everything. And that's where, like, the pipeline work comes in. I'm consulting on the pipeline stuff where I provide a lot of customer feedback; we have a team that is working on that full time. But doing a lot of those things and trying to build for what we need, but not cut ourselves off from the broader industry, as well. 
Like, my nightmare situation, from a tooling standpoint, is that we restrict things, we make decisions around security, or policy or something like that, and we cut ourselves off from the broader CNCF tooling ecosystem, we can't use any of those tools. It's like, well, now we have to build something ourselves, or—which we're never going to do it as well as the external community. Or we're going to just kind of have bad processes and no one's going to be happy so figuring out all of that.Jason: Yeah. One of the things that you mentioned about staying up to speed and having those standards reminds me of, you know, similar to that previous experience that I had was, basically, I was at an org where we said that we'd like to open-source and we used open-source and that basically meant that we forked things and then made our own weird modifications to it. And that meant, like, now, it wasn't really open-source; it was like this weird, hacked thing that you had to keep maintaining and trying to keep it up to date with the latest stuff. Sounds like you're in a better spot, but I am curious, in terms of keeping up with the latest stuff, how do you do that, right? Because you mentioned that the bank, obviously a bit slower, adopting more established software, but then there's you, right, where you're out there at the forefront and you're trying to gather best practices and new technologies that you can use at the bank, how do you do that as someone that's not building with the latest, greatest stuff? How do you keep that skills and that knowledge up to date?Aaron: I try to do reading, I try to set time aside to read things like The New Stack, listen to podcasts about technologies. It's a really broad industry; there's only so much I can keep up with. 
This was always one of the conversations going way back where I would have the conversation with my boss around the business proposition for me going to conferences, and explaining, like, what's the cost to acquire knowledge in an organization? And while we can bring in consultants, or we can hire people in, like, when you hire new people in, they bring in their pre-existing experiences. So, if someone comes in and they know Hadoop, they can provide information and ideas around is this a good problem to solve with Hadoop? Maybe, maybe not. I don't want to bet a project on that if I don't know anything about Hadoop or Kubernetes or… like, using something like Tilt or Skaffold with my tooling. That's one of the things I got from going to conferences, and I actually need to set more time aside to watch the videos now that everything's virtual. Like, not having that dedicated week is a problem, where I'm just disconnected and I'm not dealing with anything. When you're at work, even if KubeCon's going on or Microsoft Build, I'm still doing my day-to-day, I'm getting Slack messages, and I'm not feeling like I can just ignore people. I should probably block out more time, but that's part of how I stay up to date with it. It's really doing a lot of that reading and research, doing conversations like this, like, the DX Buzz that we invited you to where… I explained that event—it's adjacent to internal speakers—I explained that as I had a backlog of videos from conferences I was not watching, and secretly if I make everybody else come to lunch with me to watch these videos, I have to watch the video because I'm hosting the session to discuss it, and now I will at least watch one a month. And that's turned out to be a really successful thing internally within the organization to spread knowledge, to have conversations with people. And the other part I do, especially on the tooling side, is I still build stuff. 
As much as, like, I don't code nearly as much as I used to, I bring an application developer perspective, but I'm not writing code every day anymore.Which I always said was going to be the thing that would make me miserable. It's not. I still think about it, and when I do get to write code, I'm always looking for how can I improve this setup? How can I use this tool? Can I try it out? Is this better? Is this smoother for me so I'm not worrying about this thing?And then spreading that information more broadly within the developer experience group, our DevOps teams, our platform teams, talking to those teams about the things that they use. Like, we use Argo CD within one group and I haven't touched it much, but I know they've got lots of expertise, so talking to them. “How do you use this? How is this good for me? How do I make this work? How can I use it, too?”Jason: I think it's been an incredible, [laugh] as you've been chatting, there are so many different tools and technologies that you've mentioned having used or being used at the bank. Which is both—it's interesting as a, like, there's so much going on in the bank; how do you manage it all? But it's also super interesting, I think, because it shows that there's a lot of interest in just finding the right solutions and finding the right tools, and not really being super-strongly married to one particular tool or one set way to do things, which I think is pretty cool. We're coming up towards the end of our time here, so I did want to ask you, before we sign off, Aaron, do you have anything that you'd like to plug, anything you want to promote?Aaron: Yeah, the Cloud Program is hiring a ton. There's lots of job openings on all of our platform teams. There's probably job openings on my Cloud Adoption Team. 
So, if you think the bank sounds interesting—the bank is very stable; that's always one of the nice things—but the bank… the thing about the bank, I originally joined the bank saying, “Oh, I'll be here two years, and I'll get bored and I'll leave,” and now it's been 12 years and I'm still at the bank. Because I mentioned, like, that scope and scale of the organization, there's always something interesting happening somewhere. So, if you're interested in cloud platform stuff, we've got a huge cloud platform. If you're in—like, you want to do machine-learning, we've got an entire organization. It should come as no surprise, we have lots of data at a bank, and there's a whole organization for all sorts of different things with machine-learning, deep learning, data analytics, big data, stuff like that. Like, if you think that's interesting, and even if you're not specifically in Toronto, Canada, you can probably find an interesting role within the organization if that's something that turns your crank. Jason: Awesome. We'll post links to everything that we've mentioned, which is a ton. But go check us out, gremlin.com/podcast is where you can find the show notes for this episode, and we'll have links to everything. Aaron, thank you so much for joining us. It's been a pleasure to have you. Aaron: Thanks so much for having me, Jason. I'm so happy that we got to do this. Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

Today we chat with Cisco's head of developer content, community, and events, Michael Chenetz. We discuss everything from KubeCon to kindness and Legos! Michael delves into some of the main themes he heard from creators at KubeCon, and we discuss methods for increasing adoption of new concepts in your organization. We have a conversation about attending live conferences, COVID protocol, and COVID shaming, and then we talk about how Legos can be used in talks to demonstrate concepts. We end the conversation with a discussion about combining passions to practice creativity. In this episode, we cover: We discuss our time at KubeCon in Spain (5:51) Themes Michael heard at KubeCon talking with creators (7:46) Increasing adoption of new concepts (9:27) We talk conferences, COVID shaming, and blamelessness (12:21) Legos and reliability (18:04) Michael talks about ways to exercise creativity (23:20) Links Referenced: Cisco: https://www.cisco.com/ KubeCon October 2022: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/ Cloud Unfiltered Podcast with Julie and Jason: https://podcasts.apple.com/us/podcast/ep125-chaos-engineering-with-julie-gunderson-and-jason/id1215105578?i=1000562393884 Cloud Unfiltered Podcast: https://www.cisco.com/c/en/us/solutions/cloud/podcasts.html Nintendo Lego: https://www.amazon.com/dp/B08HVXMQ87 TranscriptJulie: And for folks that are interested in, too, what day it is—because I think we're all still a little bit confused—it is Monday, May 24th that we are recording this episode.Jason: Uh, Julie's definitely confused on what day it is because it's actually Tuesday, [laugh] May 24th.Michael: Oh, my God. [laugh]. That's great. 
I love it.Julie: Welcome to Break Things on Purpose, a podcast about reliability, learning from each other, and blamelessness. In this episode, we talk to Michael Chenetz, head of developer content, community, and events at Cisco, about all of the learnings from KubeCon, the importance of being kind to each other, and of course, how Lego translates into technology.Julie: Today, we are joined by Michael Chenetz. Michael, do you want to tell us a little bit about yourself?Michael: Yeah. [laugh]. Well, first of all, thank you for having me on the show. And I'm really good at breaking things, so I guess that's why I'm asked to be here is because I'm superb at it. What I'm not so good at is, like, putting things back together.Like when I was a kid, I remember taking my dad's stereo apart; wasn't too happy about that. Wasn't very good at putting it back together. But you know, so that's just going back a little ways there. But yeah, so I work for the DevRel at Cisco and my whole responsibility is, you know, to get people to know that know a little bit about us in terms of, you know, all the developer-related topics.Julie: Well, and Jason and I had the awesome opportunity to hang out with you at KubeCon, where we got to join your Cloud Unfiltered podcast. So folks, definitely go check out that episode. We have a lot of fun. We'll put a link in the [show notes 00:02:03]. But yeah, let's talk a little bit about KubeCon. So, as of recording this episode, we all just recently traveled back from Spain, for KubeCon EU, which was… amazing. I really enjoyed being there. My first time in Spain. I got back, I can tell you, less than 24 hours ago. Michael, I think—when did you get back?Michael: So, I got back Saturday night, but my bags have not arrived yet. So, they're still traveling and they're enjoying Europe. And they should be back soon, I guess when they're when they feel like they're—you know, they should be back from vacation.Julie: [laugh].Michael: So. 
[laugh].Julie: Jason, how about you? When did you get home?Jason: I got home on Sunday night. So, I took the train from Valencia to Barcelona on Saturday evening, and then an early morning flight on Sunday and got home late Sunday night.Julie: And for folks that are interested in, too, what day it is—because I think we're all still a little bit confused—it is Monday, May 24th that we are recording this episode.Jason: Uh, Julie's definitely confused on what day it is because it's actually Tuesday, [laugh] May 24th.Michael: Oh, my God. [laugh]. That's great. I love it. By the way, yesterday was my birthday so I'm going to say—Julie: Happy birthday.Michael: —happy birthday to myself.Julie: Oh, my gosh, happy birthday. [laugh].Michael: Thank you [laugh].Julie: So… what is time anyway?Jason: Yeah.Michael: It's all good. It's all relative. Time is relative.Julie: Time is relative. And so, you know, tell us a little bit about—I'd love to know a little bit about why you want folks to know about, like, what is the message you try to get across?Jason: Oh, that's not the question I thought you were going to ask. I thought you were going to ask, “What's on your Amazon wishlist so people can send you birthday presents?”Julie: Yeah, let's back up. Let's do that. So, let's start with your Amazon wishlist. We know that there might be some Legos involved.Michael: Oh, my God, yeah. I mean, you just told me about a cool one, which was Optimus Prime and I just—I'm already on the website, my credit card is out and I'm ready to buy. So, you know, this is the problem with talking to you guys. [laugh]. It's definitely—you know, that's definitely on my list. So, anything that, anything music-related because obviously behind me is a lot of music equipment—I love music stuff—and anything tech. The combination of tech and music, and if you can combine Legos and that, too, man that would just match all the boxes. [laugh].Julie: Just to let you know, there's a Lego Con. 
Like, I did not know this until last night, actually. But it is a virtual conference.Michael: Really.Julie: Yeah. But one of the things I was looking at actually on Lego, when you look at their website, like, to request one of their speakers, to request one of their engineers as a speaker, they actually don't do that because they get so many requests for their folks to speak at conferences, they actually have a dedicated part of their website that talks about this. So, I thought that was interesting.Michael: Well listen, just because of that, if they want somebody that's in, you know, cloud computing, I'm not going to go talk for Lego. And I know they really want somebody from cloud computing talking to Lego, so, you know… it's, you know, quid pro quo there, so that's just the way it's going to work. [laugh].Julie: I want to be best friends with Lego people.Michael: [laugh]. I know, me too.Julie: I'm just going to make it a goal in life now to have one of their engineers speak at DevOpsDays Boise. It's like a challenge.Michael: It is. I accept it.Julie: [laugh]. With that, though, just on other Lego news, before we start talking about all the other things that folks may also want to hear about, there is another new Lego, which is the Van Gogh Starry Night that has been newly released by the time this episode comes out.Michael: With a free ear, right?Julie: I mean—[laugh].Michael: Is that what happens?Julie: —well played. Well, played. [laugh]. So, now you really got to spend a lot of time at KubeCon, you were just really recording podcast after podcast.Michael: Oh, my God. Yeah. So, I mean, it was great. I love—because I'm a techie, so I love tech and I love to find out origin stories of stuff. So, I love to, like, talk to these people and like, “Why did that come about? How did—” you know, “What happened in your life that made you want to do this? 
Who hurt you?” [laugh].And so, that's what I constantly try and figure out is, like, [laugh], “What is that?” So, it was really cool because I had, like, Jimmy Zelinskie who came from CoreOS, and he came from—you know, they create, you know, Quay and some of this other kinds of stuff. And you know, just to talk about, like, some of the operators and how they came about, and like… those were the original operators, so that was pretty cool. Varun from Tetrate was supposed to come on, and he created Istio, you know? So, there were so many of these things that I just geek out knowing about, you know?And then the other thing that was really high on our list, and it's really high from where I am, is API quality, API testing, API—so really, that's why I got in touch with you guys because I was like, “Wow, that fits in really good, you know? You guys are doing stuff that's around chaos, and you know, I think that's amazing.” So, all of this stuff is just so interesting to me. But man, it was just a whirlwind of every day just recording, and by the end that was just like, you know, “I'm so sorry, but I just, I can't talk anymore.” You know, and that was it. [laugh].Jason: I love that chatting with the creators. We had Zack Butcher on who is also from Tetrate and one of the early Istio—Michael: Yeah, yeah.Jason: Contributors. And I find it fascinating because I feel like when you chat with these folks, you start to understand the context of why things were built. And it—Michael: Yes.Jason: —it opens your brain up to, like, cool, there's a software—oh, now I know exactly why it's doing things that way, right? Like, it's just so, so eye-opening. I love it.Julie: With that, though, like, did you see any trends or any themes as you were talking to all these folks?Michael: Yeah, so a few real big trends. One is everybody wants to know about eBPF. 
That was the biggest thing at KubeCon, by far, was that, “We want to learn how to do this low-level kernel stuff that's really fast, that can give us all the information we need, and we don't have to use sidecars and things like that.” I mean it was—you know, that was the most excitement that I saw. OTel was another one for OpenTelemetry, which was a big one.The other thing was simplification. You know, a lot of people were looking to simplify the Kubernetes ecosystem because there's so much out there, and there's so many things that you have to learn about that it was super hard, you know, for somebody to come into it to say, “Where do I even start?” You know? So, that was a big theme was simplification.I'm trying to think. I think another one is APIs, for sure. You know, because there's this whole thing about API sprawl. And people don't know what their APIs are, people just, like—you know, I always say people can see—like, developers are lazy in a good way, and I consider myself one of them. So, what that means is that when we want to develop something, what we're going to do is we're just going to pull down the nearest API that does what we need, that has the best documentation, that has the best blog, that has the best everything.We don't know what their testing strategy is; we don't know what their security strategy is; we don't know if they use other libraries. And you have to figure that stuff out. And that's the thing that—you know, so everything around APIs is super important. And you really have to test that stuff out. Yes, people, you have to test it [laugh] and know more about it. So, those are those were the big themes, I think. [laugh].Julie: You know, I know that Kerim and I gave a talk on observability where we kind of talked more high-level about some of the overarching concepts, but folks were really excited about that. 
I think it was because we briefly touched on OpenTelemetry, which we should have gone into a little bit more depth, but there's only so much you can fit into a 30-minute talk, so hopefully we'll be able to talk about that more at a KubeCon in the future, we [crosstalk 00:09:54] to the selection committee.Michael: Hashtag topics?Julie: Uh-huh. [laugh]. You know, that said, though, it really did seem like a huge topic that people just wanted to learn more about. I know, too, at the Gremlin booth, a lot of folks were also interested in talking about, like, how do we just get our organization to adopt some of these concepts that we're hearing about here? And I think that was the thing that surprised me the most is I expected people to be coming up to the booth and deep-diving into very, very deep, technical-level questions, and really, a lot of it was how do we get our organization to do this? How can we increase adoption? So, that was a surprise for me.Michael: Yeah, you know what, and I would say two things to that. One is, when you talk about Chaos Engineering, I think people think it's like rocket science and people are really scared and they don't want to claim to be experts in it, so they're like, “Wow, this is, like, next-level stuff, and you know, we're really scared. You guys are the experts. I don't want to even attempt this.” And the other thing is that organizations are scared because they think that it's going to, like, create mass hysteria throughout their organization.And really, none of this is true in either way. In reality, it's a very, very scripted, very exacting stuff that you're testing, and you throw stuff out there and see what kind of response you get. So, you know, it's not this, like, you know—I think people just have—there needs to be more education around a lot of areas in cloud-native. But you know, that's one of the areas. So, I think it's really interesting there.Julie: I think so too. How about for you, Jason? 
Like, what was your surprise from the conference or something that maybe—Jason: Yeah, I mean, I think my surprise was mostly around just seeing people coming back, right? Because we're now I would say, six months into conferences being back as a thing, right? Like, we had re:Invent last year in Vegas; we had KubeCon last year in LA, and so, like, those are okay events. They weren't, like, back to normal. And this was, I feel like, one of the first conferences, that it really started to feel back to normal.Like, there was much better attendance, there was much more just buzz and hallway tracking and everything else that we're used to. Like, the whole reason that we go to conferences is getting together with people and hanging out and stuff, and this one has so far felt the most back-to-normal out of any event that I've been to over the past six months.Michael: Can I just talk about one thing that I think, you know, people have to get over is, you know, I see a lot online, I think it was—I forget who it was that was talking about it. But this whole idea of Covid shaming. I mean, we're going to this event, and it's like, yeah, everybody wants to get out, everybody wants to learn things, but don't shame people just because they got Covid, everybody's getting Covid, okay? That's just the point of life at this point. So, let's just, you know, let's just be nice to each other, be friendly to each other, you know? I just have to say that because I think it's a shame that people are getting shamed, you know, just for going to an event. [laugh].Julie: See, and I think that—that's an interesting—there's been a lot of conversation around this. And I don't think anybody should be Covid-shamed. Look, I think that we all took a calculated risk in coming—Michael: Absolutely.Julie: To this event. I personally gave out a lot of hugs. I hugged some of the folks that have mentioned that they have come up positive from Covid, so there's a calculated risk in going. 
I think there has been a little bit of pushback on maybe how some of the communication has come out around it. That said, as an organizer of a small conference with, like, 400 people, I think that these are very complicated matters. And what I really think is important is to listen to feedback from attendees and to take that.And then we're always looking to improve, right?Michael: Absolutely.Julie: If everything that we did was perfect right out of the gate, then we wouldn't have Chaos Engineering because there'd be nothing [crosstalk 00:13:45] be just perfectly reliable. And so, if we take away anything, let's take away—just like what you said, first of all, Covid, you should never shame somebody for having Covid. Like, that's not cool. It's not somebody's fault that they caught an illness.Michael: Yes.Julie: I mean unless they were licking doorknobs. And that's a whole different—Michael: Yes. [laugh]. That's a whole different thing, right there.Julie: Conversation. But when we talk about just like these questions around cultural adoption, we talk about blamelessness; we talk about learning from failure; we talked about finding ways to improve, and I think all of that can come into play. So, it'll be interesting to see how we learn and grow as we move forward. And like, thank you to re:Invent, thank you to KubeCon, thank you to DevOpsDays Boise. But these conferences that have started going back in-person, at great risk to organizers and the committee because people are going to be mad, one way or the other.Michael: Yeah. And you can see that people want to be back because it was huge, you know?Julie: Yeah.Michael: Maybe you guys, I'm going to put in a feature request for Gremlin to chaos engineer crowds. Can we do that so we can figure out, like, what's going to happen when we have these big events? Can we do that?Julie: I mean, that sounds fun. 
I think what's going to happen is there's going to be hugs, there's going to be people getting sick, but there's going to be people learning and growing.Michael: Yes.Julie: And ultimately, I just think that we have to remember that just, like, our systems aren't perfect, and neither are people. Like, the fact that we expect people to be perfect, and maybe we should just keep some mask mandates for a little bit longer when we're at conferences with 8000 people.Michael: Sure.Julie: I mean, that's—Michael: That makes sense.Jason: Yeah. I mean, it's all about risk management, right? This is, essentially what we do in SRE is there's always a risk of a massive outage, and so it's that balance of, right, do what you can, but ultimately, that's why we have SLOs and things is, you can never be a hundred percent, so like, where do we draw the line of here are the things that we're going to do to help manage this risk, but you can never shoot for a perfectly, entirely safe space, right? Because then we'd all be having conferences in padded rooms, and not touching each other, and things like that. There's a balance there.And I think we're all just trying to find that, so yeah, as you mentioned, that whole, like, DevOps blamelessness thing, you know, treat each other with the notion that we're all trying to get through this together and do what we think is best. Nobody's just like John Allspaw said, you know, “Nobody goes to work thinking that, like, their intent is to crash everything and destroy the company.” No one's going to KubeCon or any of these conferences thinking, “Yeah, I'm going to be a super-spreader.”Julie: [laugh].Michael: Yeah, that would be [crosstalk 00:16:22].Jason: Like, everyone's trying not to do it. They're doing their best. They're not actively, like, aggressively trying to get you sick or intentionally about it. But you know—so just be kind to one another.Michael: Yeah. And that's the key.Julie: It is.Michael: The key. Be kind to one another, you know? 
I mean, it's a great community. People are really nice, so, you know, let's keep that up. I think that's something special about the, you know, the community around KubeCon, specifically.Julie: As we can refine this and find ways, I would take all of the hugs over virtual conferences—Michael: Yes.Julie: Any day now. Because, as Jason mentioned, is even just with you, Michael, the time we got to spend with you, or the time I kept going up to JFrog's booth and Baruch and I would have conversations as he made me a delicious coffee, these hallway tracks, these conversations, that's what no one figured out how to recreate during the virtual events—Michael: Absolutely.Julie: —and it's just not possible, right?Michael: Yeah. I mean, I think it would take a little bit of VR and then maybe some, like, suit that you wear in order to feel the hug. And, you know, so it would take a lot more in order to do that. I mean, I guess it's technologically possible. I don't know if the graphics are there yet, so it might be like a pixelated version, like, you know, like, NES-style, or something like that. But it could look pretty cool. [laugh]. So, we'll have to see, you know?Julie: Everybody listening to this episode, I hope you're getting as much of a kick out of it as we are recording it because I mean, there are so many different topics here. One of the things that Michael and I bonded about years ago, for our listeners that are—not years ago; months ago. Again, what is time?Michael: Yeah. What is time? It's all relative.Julie: It is. It was Lego, though, and so we've been talking about that. But Michael, you asked a great question when we were recording with you, which is, like—Michael: Wow.Julie: Can—just one. Only one great question.Michael: [laugh].Julie: [laugh]. Which was, how would you incorporate Lego into a talk? And, like, when we look at our systems breaking and all of that, I've really been thinking about that and how to make our systems more reliable. 
And here's one of the things I really wanted to clarify that answer. I kind of went… I went talking about my Lego that I build, like, my Optim—not my Optimus Primes, I don't have it, but my Voltron or my Nintendo Lego. And those are all box sets.Michael: Yep.Julie: But one of the things if you're not playing with a box set with instruction, if you're just playing with just the—or excuse me, architecting with just the Lego blocks because it's not playing because we're adults now, I think.Michael: Yes, now it's architecting. Yes.Julie: Yes, now that we're architecting, like, that's one of the things that I was really thinking about this, and I think that it would make something really fun to talk about is how you're building upon each layer and you're testing out these new connection pieces. And then that really goes into, like, when we get into Technics, into dependencies because if you forget that one little one-inch plastic piece that goes from the one to the other, then your whole Lego can fall apart. So anyway, I just thought that was really interesting, and I'd wondered if you or Jason even gave that any more thought, or if it was just fleeting for you.Michael: It was definitely fleeting for me, but I will give it some more thought, you know? But you know, when—as you're saying that though, I'm thinking these Lego pieces really need names because you're like that little two-inch Lego piece that kind of connects this and this, like, we got to give these all names so that people can know, that's x-54 that's—that you're putting between x-53 and x-52. I don't know but you need some kind of name for these parts now.Julie: There are Lego names. You just Google it. There are actual names for all of the parts but—Michael: Wow. [laugh].Julie: Like, Jason, what do you think? I know you've got [unintelligible 00:19:59].Jason: Yeah, I mean, I think it's interesting because I am one of those, like, freeform folks, right? 
You know, my standard practice when I was growing up with Legos was you build the thing that you bought once and then you immediately, like, tear it apart, and you build whatever the hell you want.Michael: Absolutely.Jason: So, I think that that's kind of an interesting thing as we think about our systems and stuff, right? Like, part of it is, like, yeah, there's best practices and various companies will publish, like, you know, “Here's how to architect such-and-such system.” And it's interesting because that's just not reality, right? You're not going to go and take, like, the Amazon CloudFormation thing, and like, congrats, you're done. You know, you just implement that and your job's done; you just kick back for the rest of the week.It never works that way, right? You're taking these little bits of, like, cool, I might have, like, set that up once just to see what's happening but then you immediately, like, deconstruct it, and you take the knowledge of what you learned in those building blocks, and you, like, go and remix it to build the thing that you actually need to build.Michael: But yeah, I mean, that's exactly—so you know, Legos is what got me interested in that as a kid, but when you look at, you know, cloud services and things like that, there's so many different ways to combine things and so many different ways to, like—you know, you could use Terraform, you could use Crossplane, you could use, you know, any of the services in the cloud, you could use FaaS, you could use serverless, you could use, you know, all these different kinds of solutions and tie them together. So, there's so much choice, and what Lego teaches you is that, embrace the choice. Figure out and embrace the different pieces, embrace all the different things that you have and what the art of possibility is, and then start to build on that. So, I think it's a really good thing. 
And that's why there's so much correlation between, like, kind of, art and tech and things like that because that's the kind of mentality that you need in order to be really successful in tech.Jason: And I think the other thing that works really well with what you said is, as you're playing with Legos, you start to learn these hacks, right? Like, I don't have, like, a four-by-one brick, but I know that if I have three four-by-one flats, I can stack those three and it's the same height as a brick, right?Michael: Yep.Jason: And you can start combining things. And I love that engineering mentality of, like, I have this problem that I need to solve, I have a limited toolbox for whatever constraints, right, and understanding those constraints, and then cool, how can I remix what I've got in my toolbox to get this thing done?Michael: And that's a thing that I'm always doing. Like, when I used to do a lot of development, you know, it was always like, what is the right code? Or what is the library that's going to solve my problem? Or what is the API that's going to solve my problem, you know?And there's so many different ways to do it. I mean, so many people are afraid of, like, making the wrong choice, when really in programming, there is no wrong choice. It's all about how you want to do it and what makes sense to you, you know? There might be better options in formatting and in the way that you kind of, you know, format that code together and put them in different libraries and things like that, but making choices on, like, APIs and things like that, that's all up to the artist. I would say that's an artist. [laugh]. So, you know, I think it all stems though, when you go back from, you know, just being creative with things… so creativity is king.Jason: So Michael, how do you exercise your creativity, then? How do you keep up that creativity?Michael: Yeah, so there's multiple ways. 
And that's a great segment because one of the things that I really enjoy—so you know, I like development, but I'm also a people person. And I like product management, but I also like dealing with people. So really, to me, it's about how do I relate products, how do I relate solutions, how do I talk to people about solutions that people can understand? And that's a creative process.Like, what is the right media? What is the right demos? What is the right—you know, what do people need? And what do people need to, kind of, embrace things? And to me, that's a really creative medium to me, and I love it.So, I love that I can use my technical, I love that I can use my artistic, I love that I can use, you know, all these pieces all at once. And sometimes maybe I'll play guitar and just put it in the intro or something, I don't know. So, that kind of combines that together, too. So, we'll figure that piece out later. Maybe nobody wants to hear me play guitar, that's fine, too. [laugh].But I love to be able to use, you know, both sides of my brain to do these creative aspects. So, that's really what does it. And then sometimes I'll program again and I'll find the need, and I'll say, “Hey, look, you know, I realized there's a need for this,” just like a lot of those creators are. But I haven't created anything cool, but you know, maybe someday I will. I feel like it's just been in between all those different intersections that's really cool.Jason: I love the electric guitar stuff that you mentioned. So, for folks who are listening to this show, during our recording of the Cloud Unfiltered you were talking about bringing that art and technical together with electric guitars, and you've been building electric guitar pickups.Michael: Yes. Yeah. So, I mean, I love anything that can combine my music passion with tech, so I have a CNC machine back here that winds pickups and it does it automatically. 
So, I can say, “Hey, I need a 57 pickup, you know, whatever it is,” and it'll wind it to that exact spec.But that's not the only thing I do. I mean, I used to design control surfaces for artists that were a big band, and I really can't—a lot of them I can't mention because we're under NDA. But I designed a lot of these big, you know, control surfaces for a lot of the big electronic and rock bands that are out there. I taught people how to use Max for Live, which is an artist's, kind of, programming language that's graphical, so [NMax 00:25:33] and MSP and all that kind of stuff. So, I really, really like to combine that.Nowadays, you know, I'm talking about doing some kind of events that maybe combine tech with art. So, maybe doing things like Algorave, and you know, things that are live-coding music and art. So, being able to combine all these things together, I love that. That's my ultimate passion.Jason: That is super cool.Julie: I think we have learned quite a bit on this episode of Break Things on Purpose, first of all, from the guy who said he hasn't created much—because you did say that, which I'm going to call you out on that because you just gave a long list of things that you created. And I think we need to remember that we're all creators in our own way, so it's very important to remember that. But I think that right now we've created a couple of options for talks in the future, whether or not it's with Lego, or guitar pickups.Michael: Yeah.Julie: Is that—Michael: Hey—Julie: Because I—Michael: Yeah, why not?Julie: —know you do kind of explain that a little bit to me as well when I was there. So, Michael, this has just been amazing having you. We're going to put a lot of links in the notes for everybody today. So, to Michael's podcast, to some Lego, and to anything else Michael wants to share with us as well. Oh, real quick, is there anything you want to leave our listeners with other than that? You know, are you looking to hire Cisco? 
Is there anything you wanted to share with us?Michael: Yeah, I mean, we're always looking for great people at Cisco, but the biggest thing I'd say is, just realize that we are doing stuff around cloud-native, we're not just network. And I think that's something to note there. But you know, I just love being on the show with you guys. I love doing anything with you guys. You guys are awesome, you know. So.Julie: You're great too, and I think we'll probably do more stuff, all of us together, in the future. And with that, I just want to thank everybody for joining us today.Michael: Thank you. Thanks so much. Thanks for having me.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

It's time to shoot for the stars with Dan Isla, VP of Product at itopia, to talk about everything from the astronomical importance of reliability to time zones on Mars. Dan's trajectory has been a propulsion of jobs bordering on the science fiction, with a history at NASA, modernizing cloud computing for them, and loads more. Dan discusses the finite room for risk and failure in space travel with an anecdote from his work on Curiosity. Dan talks about his major takeaways from working at Google, his “baby” Selkies, his work at itopia, and the crazy math involved with accounting for time on Mars! In this episode, we cover: Introduction (00:00) Dan's work at JPL (01:58) Razor thin margins for risk (05:40) Transition to Google (09:08) Selkies and itopia (13:20) Building a reliability community (16:20) What itopia is doing (20:20) Learning, building a “toolbox,” and teams (22:30) Clockdrift (27:36) Links Referenced: itopia: https://itopia.com/ Selkies: https://github.com/danisla/selkies selkies.io: https://selkies.io Twitter: https://twitter.com/danisla LinkedIn: https://www.linkedin.com/in/danisla/ TranscriptDan: I mean, at JPL we had an issue adding a leap second to our system planning software, and that was a fully coordinated, many months of planning, for one second. [laugh]. Because when you're traveling at 15,000 miles per hour, one second off in your guidance algorithms means you missed the planet, right? [laugh]. So, we were very careful. Yeah, our navigation parameters had, like, 15 decimal places, it was crazy.Julie: Welcome to Break Things on Purpose, a podcast about reliability, building things with purpose, and embracing learning. In this episode, we talk to Dan Isla, VP of Product at itopia about the importance of reliability, astronomical units, and time zones on Mars.Jason: Welcome to the show, Dan.Dan: Thanks for having me, Jason and Julie.Jason: Awesome. Also, yeah, Julie is here. [laugh].Julie: Yeah. 
Hi, Dan.Jason: Julie's having internet latency issues. I swear we are not running a Gremlin latency attack on her. Although she might be running one on herself. Have you checked in in the Gremlin control panel?Julie: You know, let me go ahead and do that while you two talk. [laugh]. But no, hi and I hope it's not too problematic here. But I'm really excited to have Dan with us here today because Dan is a Boise native, which is where I'm from as well. So Dan, thanks for being here and chatting with us today about all the things.Dan: You're very welcome. It's great to be here to chat on the podcast.Jason: So, Dan has mentioned working at a few places and I think they're all fascinating and interesting. But probably the most fascinating—being a science and technology nerd—Dan, you worked at JPL.Dan: I did. I was at the NASA Jet Propulsion Lab in Pasadena, California, right, after graduating from Boise State, from 2009 to around 2017. So, it was a quite the adventure, got work on some, literally, out-of-this-world projects. And it was like drinking from a firehose, being kind of fresh out to some degree. I was an intern before that so I had some experience, but working on a Mars rover mission was kind of my primary task. And the Mars rover Curiosity was what I worked on as a systems engineer and flight software test engineer, doing launch operations, and surface operations, pretty much the whole, like, lifecycle of the spacecraft I got to experience. And had some long days and some problems we had to solve, and it was a lot of fun. I learned a lot at JPL, a lot about how government, like, agencies are run, a lot about how spacecraft are built, and then towards the end a lot about how you can modernize systems with cloud computing. That led to my exit [laugh] from there.Jason: I'm curious if you could dive into that, the modernization, right? Because I think that's fascinating. When I went to college, I initially thought I was going to be an aerospace engineer. 
And so, because of that, they were like, “By the way, you should learn Fortran because everything's written in Fortran and nothing gets updated.” Which I was a little bit dubious about, so correct folks that are potentially looking into jobs in engineering with NASA. Is it all Fortran, or… what [laugh] what do things look like?Dan: That's an interesting observation. Believe it or not, Fortran is still used. Fortran 77 and Fortran—what is it, 95. But it's mostly in the science community. So, a lot of data processing algorithms and things for actually computing science, written by PhDs and postdocs is still in use today, mostly because those were algorithms that, like, people built their entire dissertation around, and to change them added so much risk to the integrity of the science, even just changing the language where you go to language with different levels of precision or computing repeatability, introduced risk to the integrity of the science. So, we just, like, reused the [laugh] same algorithms for decades. It was pretty amazing yeah.Jason: So, you mentioned modernizing; then how do you modernize with systems like that? You just take that codebase, stuff it in a VM or a container and pretend it's okay?Dan: Yeah, so a lot of it is done very carefully. It goes kind of beyond the language down to even some of the hardware that you run on, you know? Hardware computing has different endianness, which means the order of bits in your data structures, as well as different levels of precision, whether it's a RISC system or an AMD64 system. And so, just putting the software in a container and making it run wasn't enough. You had to actually compute it, compare it against the study that was done and the papers that were written on it to make sure you got the same result. 
So, it was pretty—we had to be very careful when we were containerizing some of these applications in the software.Julie: You know, Dan, one thing that I remember from one of the very first talks I heard of yours back in, I think, 2015 was you actually talked about how we say within DevOps, embrace failure and embrace risk, but when you're talking about space travel, that becomes something that has a completely different connotation. And I'm kind of curious, like, how do you work around that?Dan: Yeah, so failing fast is not really an option when you only have one thing [laugh] that you have built or can build. And so yeah, there's definitely a lot of adverseness to failing. And what happens is it becomes a focus on testing, stress testing—we call it robustness testing—and being able to observe failures and automate repairs. So, one of the tests programs I was involved with at JPL was, during the descent part of the rover's approach to Mars, there was a power descent phase where the rover actually had a rocket-propelled jetpack and it would descend to the surface autonomously and deliver the rover to the surface. And during that phase it's moving so fast that we couldn't actually remote control it, so it had to do everything by itself.And there were two flight computers that are online, pretty much redundant, everything hardware-wise, and so it's kind of up to the software to recover itself. And so, that's called entry descent and landing, and one of my jobs towards the end of the development phase was to ensure that we tested all of the possible breakage points. So, we would do kind of evil Gremlin-like things. We actually—the people in the testbed, we actually call Gremlins. And [laugh] we would—we—they inject faults during the simulation.So, we had copies of the hardware running on a desk, the software was running, and then we'd have Gremlins go and say like, “Hey, flight computer one just went out. 
You know, what's going to happen?” And you watch the software, kind of, take over and either do the right thing or simulate a crash landing. And we find bugs in the software this way, we'd find, like, hangs in the control loops for recovery, and we had to fix those before we made it to Mars, just in case that ever happened. So, that was like how we, like, really stressed test the hardware, we did the same thing with situational awareness and operations, we had to simulate things that would happen, like, during launch or during the transit from Earth to Mars, and then see how the team itself reacted to those. You know, do our playbooks work? Can we run these in enough time and recover the spacecraft? So, it was a lot of fun. That's I guess that's about as close to, like, actually breaking something I can claim to. [laugh].Julie: Well, I have to say, you've done a good job because according to Wikipedia—which we all know is a very reliable source—as of May 9th, 2022, Curiosity has been active on Mars for 3468 sols or 3563 days, and is still active. Which is really amazing because I don't—was it ever intended to actually be operational that long?Dan: Not really. [laugh]. The hardware was built to last for a very long time, but you know, as with most missions that are funded, they only have a certain amount of number of years that they can be operated, to fund the team, to fund the development and all that. And so, the prime mission was only, like, two years. And so, it just keeps getting extended. As long as the spacecraft is healthy, and, like, doing science and showing results, we usually extend the missions until they just fall apart or die, or be intentionally decommissioned, kind of like the Cassini project. But yeah.Julie: Well, you've heard it here first, folks. In order to keep funding, you just need to be, quote, “Doing science.” [laugh]. But Dan, after JPL, that's when you went over to Google, right?Dan: Yeah, yeah. 
So, it was kind of that transition from learning how to modernize with cloud. I'd been doing a lot with data, a lot with Amazon's government cloud, which is the only cloud we could use at JPL, and falling in love with these APIs and ways to work with data that were not possible before, and saw this as a great way to, you know, move the needle forward in terms of modernization. Cloud is a safe place to prototype a safe place to get things done quick. And I always wanted to work for a big tech company as well, so that was always another thing I was itching to scratch.And so Google, I interviewed there and finally made it in. It was not easy. I definitely failed my first interview. [laugh]. But then try it again a few years later, and I came in as a cloud solution architect to help customers adopt cloud more quickly, get through roadblocks.My manager used to say the solution architects were the Navy Seals of cloud, they would drop in, drop a bunch of knowledge bombs, and then, like, get out, [laugh] and go to the next customer. It was a lot of fun. I got to build some cool technology and I learned a lot about what it's like working in a big public company.Julie: Well, one of my favorite resources is the Google SRE book, which, as much as I talk about it, I'm just going to admit it here now, to everybody that I have not read the entire thing.Dan: It's okay.Julie: Okay, thank you.Dan: Most people probably haven't.Julie: I also haven't read all of Lord of the Rings either. But that said, you know, when you talk about the learnings, how much of that did you find that you practiced day-to-day at Google?Dan: In cloud—I've mostly worked in cloud sales, so we were kind of post-sales, the experts from the technology side, kind of a bridge to engineering and sales. So, I didn't get to, like, interact with the SREs directly, but we have been definitely encouraged, I had to learn the principles so that we could share them with our customers. 
And so, like, everyone wanted to do things like Google did, you know? Oh, these SREs are there, and they're to the rescue, and they have amazing skills. And they did, and they were very special at Google to operate Google's what I would call alien technology.And so, you know, from a principles point of view, it was actually kind of reminded me a lot of what I learned at JPL, you know, from redundant systems and automating everything, having the correct level of monitoring. The tools that I encountered at Google, were incredible. The level of detail you could get very quickly, everything was kind of at your fingertips. So, I saw the SREs being very productive. When there was an outage, things were communicated really well and everyone just kind of knew what they were doing.And that was really inspiring, for one, just to see, like, how everything came together. That's kind of what the best part of working at Google was kind of seeing how the sausage was made, you know? I was like, “Oh, this is kind of interesting.” [laugh]. And still had some of its big company problems; it wasn't all roses. But yeah, it was definitely a very interesting adventure.Jason: So, you went from Google, and did you go directly to the company that you helped start, right now?Dan: I did. I did. I made the jump directly. So, while I was at Google, you know, not only seeing how SRE worked, but seeing how software was built in general and by our customers, and by Google, really inspired me to build a new solution around remote productivity. And I've always been a big fan of containers since the birth of Docker and Kubernetes.And I built the solution that let you run, kind of, per-user workloads on Kubernetes and containers. And this proved to be interesting because you could, you know, stand up your own little data processing system and scale it out to your team, as well as, like, build remote code editors, or remote desktop experiences from containers. 
And I was very excited about this solution. The customers were really starting to adopt it. And as a solution architect, once the stuff we built, we always open-source it. So, I put it on GitHub as a project called Selkies. And so, Selkies is the Kubernetes components and there's also the high performance streaming to a web browser with WebRTC on GitHub. And a small company, itopia, I met at a Google conference, they saw my talk and they loved the technology. They were looking for something like that, to help some of their product line, and they brought me in as VP of Product. So, they said, “We wanted to productize this.” And I'm like, “Well, you're not doing that without me.” [laugh]. Right? So, through the pandemic and work from home and everything, I was like, you know, now is probably a good time to go try something new. This is going to be—and I get to keep working on my baby, which is Selkies. So yeah, I've been at itopia since the beginning of 2021, building a remote desktop, really just remote developer environments and other remote productivity tools for itopia.Julie: Well and, Dan, that's pretty exciting because you actually talked a little bit about that at DevOpsDays Boise, which if that video is posted by the time of publication of this podcast, we'll put a link to that in the show notes. But you're also giving a talk about this at SCaLE 19x in July, right?Dan: Yeah, that's right. Yeah, so SCaLE is the Southern California Linux Expo, and it's a conference I really enjoy going to get to see people from Southern California and other out of town, a lot of JPLers usually go as well and present. And so, it's a good time to reconnect with folks. But yeah, so SCaLE, you know, they usually want to talk more about Linux and some of the technologies and open-source.
And so yeah, really looking forward to sharing more about Selkies and kind of how it came to be, how containers can be used for more than just web servers and microservices, but also, you know, maybe, like, streaming video games that have your container with the GPU attached. The DevOpsDays Boise had a little demo of that, so hopefully, that video gets attached. But yeah, I'm looking forward to that talk at the end of July.Jason: Now, I'm really disappointed that I missed your talk at DevOpsDays Boise. So Julie, since that's your domain, please get those videos online quickly.Julie: I am working on it. But Dan, one of the things that, you know, you talk about is that you are the primary maintainer on this and that you're looking to grow and improve with input from the community. So, tell us, how can the community get involved with this?Dan: Yeah, so Selkies is on GitHub. You can also get to it from selkies.io. And basically, we're looking for people to try it out, run it, to find problems, you know, battle test it. [laugh]. We've been running it in production at itopia, it's powering the products they're building now. So, we are the primary maintainers. I only have a few others, but, you know, we're just trying to build more of an open-source community and level up the, you know, the number of contributors and folks that are using it and making it better. I think it's an interesting technology that has a lot of potential.Jason: I think as we talk about reliability, one of the things that we haven't covered, and maybe it's time for us to actually dive into that with you, is reliability around open-source. And particularly, I think one of the problems that always happens with open-source projects like this is, you're the sole maintainer, right? And how do you actually build a reliable community and start to grow this out? Like, what happens if Dan suddenly just decides to rage quit tech and ups and leaves and lives on his own little private island somewhere?
What happens to Selkies?Do you have any advice for people who've really done this, right? They have a pet project, they put it on GitHub, it starts to gain some traction, but ultimately, it's still sort of their project. Do you have any advice for how people can take that project and actually build a reliable, growing, thriving community around it?Dan: Honestly, I'm still trying to figure that out [laugh] myself. It's not easy. Having the right people on your team helps a lot. Like, having a developer advocate, developer relations to showcase what it's capable of in order to create interest around the project, I think is a big component of that. The license that you choose is also pretty important to that.You know, there's some software licenses that kind of force the open-sourcing of any derivative of what you build, and so that can kind of keep it open, as well, as you know, move it forward a little bit. So, I think that's a component. And then, you know, just, especially with conferences being not a thing in the last couple of years, it's been really hard to get the word out and generate buzz about some of these newer open-source technologies. One of the things I kind of like really hope comes out of a two-year heads-down time for developers is that we're going to see some, like, crazy, amazing tech on the other side. So, I'm really looking forward to the conferences later this year as they're opening up more to see what people have been building. Yeah, very interested in that.Jason: I think the conversation around open-source licenses is one that's particularly interesting, just because there's a lot involved there. And there's been some controversy over the past couple of years as very popular open-source projects have decided to change licenses, thinking of things like Elastic and MongoDB and some other things.Dan: Yeah. Totally.Jason: You chose, for Selkies, it looks like it's Apache v2.Dan: Yep. That was mostly from a Google legal point of view. 
When I was open-sourcing it, everything had to be—you know, had to have the right license, and Apache was the one that we published things under. You know, open-source projects change their license frequently. You saw that, like what you said, with Elastic and Mongo. And that's a delicate thing, you know, because you got to make sure you preserve the community. You can definitely alienate a lot of your community if you do it wrong. So, you got to be careful, but you also, you know, as companies build this tech and they're proud of it and they want to turn it into a product, you want to—it's a very delicate process, trying to productize open-source. It can be really helpful because it can give confidence to your customers, meaning that, like, “Hey, you're building this thing; if it goes away, it's okay. There's this open-source piece of it.” So, it instills a little bit of confidence there, but it also gets a little tricky, you know? Like, what features are we adding that add value that people will still pay for versus what they can get for free? Because free is great, but you know, it's a community, and I think there are things that private companies can add. My philosophy is basically around packaging, right? If you can package up an open-source product to make it easier to consume, easier to deploy, easier to observe and manage, then you know, that's a lot of value that the rest of the free community may not necessarily need. If they're just kind of kicking the tires, or if they have a very experienced Kubernetes team on-site, they can run this thing by themselves, go for it, you know? But for those, the majority that may not have that, you know, companies can come in and repackage things to make it easier to run open-source. I think there's a lot of value there.Jason: So, speaking of companies repackaging things, you mentioned that itopia had really sort of acquired you in order to really build on top of Selkies.
What are the folks at itopia doing and how are they leveraging the software?Dan: That's a good question. So, itopia's mission is to radically improve work-from-anywhere. And we do that by building software to orchestrate and automate access to remote computing. And that orchestration and automation is a key component to this, like, SaaS-like model for cloud computing.And so, Selkies is a core piece of that technology. It's designed for orchestrating per-user workloads, like, remote environments that you would need to stand up. And so, you know, we're adding on things that make it more consumable for an enterprise, things like VPN peering and single-sign-on, a lot of these things that enterprises need from day one in order to check all the boxes with their security teams. And at the heart of that is really just increasing the amount of the productivity you have through onboarding.Basically, you know, setting up a developer environment can take days or weeks to get all the dependencies set up. And the point of itopia—Spaces is the product I'm working on—is to reduce that amount of time as much as possible. And, you know, this can increase risk. If you have a product that needs to get shipped and you're trying to grow or scale your company and team and they can't do that, you can slip deadlines and introduce problems, and having a environment that's not consistent, introduces reliability problems, right, because now you have developers that, “Hey, works on my machine.” But you know, they may have—they don't have the same machine, same environment as everyone else, and now when it comes to reproducing bugs or even fixing them, that you can introduce more problems to the software supply chain.Julie: I mean, that sounds like a great problem to solve and I'm glad you're working on it. With your background being varied, starting as an intern to now where you personally are being acquired by organizations. What's something that you've really learned or taken from that? 
Because one thing that you said was that you failed your first Google interview badly? And—Dan: Yes. [laugh].Julie: I find that interesting because that sounds like, you know, you've taken that learning from failure, you've embraced the fact that you failed it. Actually, I just kind of want to go back. Tell us, do you know what you did?Dan: It was definitely a failure. I don't know how spectacular it was, but, like, [laugh] Google interviews are hard. I mean—and that's just how it is, and it's been—it's notorious for that. And I didn't have enough of the software, core software experience at the time to pass the interview. These are, like, five interviews for a software engineer. And I made it through, like, four of them. The last one was, like, just really, really, really hard and I could not figure it out. You know, because this is, like, back in the day—and I think they still do this, like, where you're, like, coding on a whiteboard, right? Like, okay, write this C code on a whiteboard, and it has to work. You know, the dude is, like, right there compiling it, right? Like, “Okay, [unintelligible 00:23:29], boy.” [laugh]. So, not only is it high stress, but it has to be right as well. [laugh]. And so, like, it was just a very difficult experience. And what I learned from that was basically, “Okay, I need to, one, get more experience in this style and this domain of programming, as well as, you know, get more comfortable speaking and being in front of people I don't know.” [laugh]. So yeah, there's definitely components there of personal growth as well as technical growth. From a technical point of view, like, my philosophy as being an engineer in general, and software developer, is have a really big toolbox and use the tools that are appropriate for the job. This is, like, one of my core philosophies. Like, people ask, you know, ‘What language do you use?’
And I'm like, “Whatever language you needed to solve the problem.”Like, if you're writing software, in a—with libraries that are all written in C, then don't try to do that in, like, Java or something, in some other language that doesn't have those language bindings. Don't reinvent the language bindings. You follow the problem and you follow the tech. What language, what tool will best solve this problem? And I'm always working backwards from the problem and then bringing in the right tools to solve it.And that's something that has paid off in dividends because it's very—problem-solving is fun and it's something I always had a passion for, but when you have a toolbox that is full of interesting gadgets and things you can use, you get excited every time you get to use that tool. Like, just like power tools here, I have a—I don't know, but it's like, “Yeah, I get to use the miter saw for this thing. Awesome. I don't have one? Okay, I'm going to go buy one.” [laugh].Julie: That's actually—that's a really good point, one of the talks that I gave was, “You Can't Buy DevOps.” And it was really all about letting developers be part of the process in choosing the tools that they're going to use. Because sometimes I think organizations put too many constraints around that and force you to use these tools that might not be the best for what you're trying to accomplish. So, I like that you bring up having the ability to be excited about your toolbox, or your miter saw. For me, it would be my dremel. Right? But what tool is going to—Dan: [crosstalk 00:25:39] cool.Julie: Yeah, I mean, they really are—what tool is going to be best for the job that you are trying to accomplish? And I think that that's, that's a big thing. So, when you look to bring people onto your team, what kind of questions do you ask them? What are you looking for?Dan: Well, we're just now starting to really grow the company and try and scale it up. 
And so we're, you know, we're starting to get into more and more interview stuff, I try to tell myself, I don't want to put someone through the Google experience again. And part of that is just because it wasn't pleasant, but also, like, I don't know if it was really that useful [laugh] at the end of the day. And so, you know, there's a lot about culture fit that is really important. People have to be able to communicate and feel comfortable with your team and the pace that your team is working at. And so, that's really important.But you know, technically, you know, I like to see a lot of, you know—you got to be able to show me that you can solve problems. And that can be from, you know, just work that you've done an open-source, you know, having a good resume of projects you've worked on is really important because then we can just talk about tech and story about how you solve the problem. I don't have to—I don't need you to go to the whiteboard and code me something because you have, like, 30 repos on GitHub or something, right? And so, the questions are much more around problem-solving: you know, how would you solve this problem? What technology choices would you use, and why?Sometimes I'll get the fundamentals, like, do you understand how this database works at its core or not? You know, or why is it… why is that good or bad? And so, looking for people who can really think within the toolbox they have—it doesn't have to be a big one, but do they know how to use the tools that they've acquired so far, and really, just really, really critically think through with your problems? So, to me, that's a better skill to have than just, you know, being able to write code on the whiteboard.Julie: Thanks for that, Dan. And earlier, before we started the official recording here, you were talking a little bit about time drift. Do you want to fill everybody in on what you were talking about because I don't think it was Doctor Strange and the Multiverse of Madness?Dan: No. 
[laugh]. I think there were some—we were talking about um…clocks?Julie: Clock skew.Dan: Daylight savings time?Julie: Yeah.Dan: Clock skew, clock drift. There was a time at JPL when we were inserting a leap second to the time. This actually happened all throughout the world, where periodically the clocks will drift far enough because the orbits and the rotation of the planet are not, like, perfectly aligned to 365 days in a year and 24 hours in a day. And so, every so often, you have to insert these leap seconds in order to catch up and make time more precise. Well, space travel, when you're planning, you have to—you're planning to the position of the stars and the planets and the orbital bodies, and those measurements are done at such a large scale that you have—your precision goes, like, way out, you know, many, many decimal places in order to properly plan to the bodies up big. And with the Mars Rover, one of these leap seconds happened to come in, like, right before we launched. And it was like, oh my gosh, this is going to be to—change all of our ephemeris files—the data that you use to track positions—and we had to do it, like, synchronize it all, like, right when the leap second was going in. And we tested this extensively because if you get it wrong, your spacecraft is traveling, like, 15,000 miles an hour towards Mars, and a one-second pointing error from Earth means, like, you missed the whole planet, you won't even get there. [laugh]. We're not talking about, like, missing the landing site of, like, a few kilometers. No, it's like thousands of kilometers in pointing error. So yeah, things are astronomical [laugh] in units. Actually, that's why they're called AU, astronomical units, when you're measuring the distance from the Sun. So yeah, it was a pretty fun time. A little bit nerve-wracking just because of the number of systems that had to be updated and changed at the same time.
It's kind of like doing a rolling update on a piece of software that just had to go out all at the same time. Yeah.Jason: I think that's really interesting, particularly because, you know, for most of us, I think, as we build things, whether that's locally or in the cloud or wherever our servers are at, we're so used to things like NTP, right, where things just automatically sync and I don't have to really think about it and I don't really have to worry about the accuracy because NTP stays pretty tight. Usually, generally.Dan: Mm-hm.Jason: Yeah. So, I'm imagining, obviously, like, on a spacecraft flying 15,000 miles a second or whatever, no NTP out there.Dan: [laugh]. Yeah, no NTP and no GPS. Like, all the things you take for granted on Mars are just not there. And Mars even has a different time system altogether. Like, the days on Mars are about 40 minutes longer because the planet spins slower. And for my first 90 sols—or days on Mars—of the mission, the entire planning team on Earth that I was a part of, we lived on Mars time. So, we had to synchronize our Earth schedule with what the rover was doing so that when the rover was asleep, we were planning the next day's activities. And when it woke up, it was ready to go and do work during the day. [laugh]. So, we did this Mars time thing for 90 days. That was mostly inherited from the Mars Exploration Rovers, Spirit and Opportunity, because they were only designed to live for, like, 90 days. So, the whole team shifted. And we—and now it's kind of done in the spirit of that mission. [laugh]. Our rover, we knew it was going to last a bit longer, but just in case, let's shift everyone to Mars time and see what happened. And it was not good. We had to [laugh] we had to end that after 90 days. People—your brain just gets completely fried after that. But it was bizarre. And there's no time. You have to invent your own time system for Mars. Like, there's no, it was called LMST, or Local Mars Standard Time, local mean standard time.
But it was all, like, relative to, you know, the equator and where you were on the planet. And so, Mars had its own Mars time that counted at a different rate per second. And so, it was funny, we had these clocks in the Mission Control Room that—there was this giant TV screen that had, like, four different time clocks running. It had, like, Pasadena time, UTC time, Mars time, and, like, whatever time it was at the Space Network. And I was like, “Oh, my gosh.” And so, we were always doing these, like, time conversions in our heads. It was mental. [laugh]. So, can't we just all be on UTC time? [laugh].Jason: So, I'm curious, with that time shift of being on Mars time and 40 minutes longer, that inherently means that by the end of that 90 days, like, suddenly, your 8 a.m. Mars local time is, like, shifted, and is now, like, hours off, right? You're waking—Dan: Yeah.Jason: Up in the middle of the night?Dan: Totally, yeah.Jason: Wow.Dan: Yeah, within, like, two weeks, your schedule will be, like, upside down. It's like, every day, you're coming in 40 minutes later. And yeah, it was… it was brutal. [laugh]. Humans are not supposed to do that. If you're actually living on Mars, you're probably okay, but like, [laugh] trying to synchronize those schedules. I thought going from East Coast to West Coast time, working remote, was hard. And, like, [laugh] that's really remote.Julie: Dan, that's just astronomical.Dan: [laugh].Julie: I'm so sorry. I had to do it. But with that—[laugh].Jason: [laugh].Dan: [laugh]. [unintelligible 00:33:15].Julie: With that, Dan, I really just want to thank you for your time on Break Things on Purpose with us today. And as promised, if I can find the links to Dan's talks, if they're available before this episode posts, we will put those in the show notes. Otherwise, we'll put the link to the YouTube channel in the show notes to check for updates.
And with that, I just want to thank you, Dan, and wish you a wonderful day.Jason: Before we go, Dan, do you have anything that you'd like to plug? Any projects that people should check out, where they can find you on the internet, stuff like that?Dan: Yeah, thank you guys very much for having me. It was a great conversation. Really enjoyed it. Please check out our new product, itopia Spaces, remote developer environments delivered, powered by Selkies. We launched it last fall and we're really trying to ramp that up.And then check out the open-source Selkies project, selkies.io will get you there. And yeah, we're looking for contributors. Beyond that, you can also find me on Twitter, I'm @danisla, or on LinkedIn.Jason: Awesome. Well, thanks again for being a part of the show. It's been fantastic.Dan: You're very welcome. Thanks for having me.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: Introduction (00:00) “Embracing Change Fearlessly” (01:45) Fearless change enabling good work (04:00) The culture change that needs to happen (06:10) How to talk to your leaders (10:45) “The Adolescent Version” of engineering (14:40) How Natalie prioritizes time, speed, and efficiency (18:42) Natalie's keynote (26:48) Links Referenced: Gremlin: https://www.gremlin.com/ gremlin.com/podcast: https://gremlin.com/podcast loyaltyfreakmusic.com: https://loyaltyfreakmusic.com TranscriptNatalie: I like this—I call it the adolescent version of engineering. It's where, you know, we're through the baby part, we need to start to grow up a little bit, we need to go from getting stuff done in some way or another, to something that's repeatable and scalable. And so, it's like, that adolescent years, that's my fun. That's what I enjoy doing. I call it creating something out of chaos.Basically, taming the chaos is what it really looks like because it's very chaotic initially, and that's true of every, like, small organization; they always start like that. And as they start to grow, you know, you've got ten different engineers who have ten different opinions on how something should be done, and so they do it ten different ways. And that's fine when you're only ten, but then when you need to go from 10 to 20 to 30 to 100, it no longer works.Julie: Welcome to Break Things on Purpose, a podcast about reliability, culture change, and learning from failure. In this episode, we talk with Natalie Conklin, head of engineering at Gremlin, about the importance of embracing change, and how we can all work through our fears and work together to build more reliable systems. Natalie, I'm so excited to have you here with us today. And today is actually a really big day because it is the fifth year of DevOpsDays Boise, which you are doing the closing keynote for. So, really excited to have you both on the podcast and at the conference today. 
And your talk is titled “Embrace Change Fearlessly.” So, do you want to kick off by telling our listeners a little bit about you and what you're going to be talking about?Natalie: Sure. Thanks for having me. I am excited about both, sort of, [laugh] which is exactly what the talk is about. [laugh]. The talk is really about being able to embrace change fearlessly, and that it's rarely ever fearlessly truly, but mostly around being able to do what makes you afraid anyway.I'm not a big public speaker, so that's something I've had to work hard at trying to be able to be more comfortable doing. And so, this is an exciting time for me. But background-wise, I am the head of engineering currently for Gremlin and had been leading engineering teams for growth companies for just over a decade. And a lot of what I end up doing centers around this: It's helping those engineering teams be willing to move forward in risky—because in growth companies, a lot of times you're building things that are brand new, this is not something that, you know, has been out there and done, so they typically have to do something new for the first time. And so, being able to take calculated risks is tough. It's hard stuff. And so, getting into the right mindset to be able to push through that, that's a lot of what I ended up doing.Julie: I love that. And that's actually a really good point that you're bringing up, you know, growth companies and being in the right mindset. So, one of the things you and I talked about when I was starting here at Gremlin and getting to know you a little bit about your background, which is really cool. You lived in India for a few years, correct?Natalie: I did. I lived there for two years. I was working for a company, we were doing big data analytics for telcos, building big, large platform that we would then do some custom development work off the top of for these various telco companies. 
And the team over there had experienced some turnover, and so there was a lot of quality issues and things of that nature starting to show up for the first time. This had been a very rock-solid team, honestly, and so the company asked if I would be willing to go to India to figure out what was going on. And so, that was what I did. It was a great opportunity; loved doing it.Julie: So now, as you work with teams to embrace change fearlessly, and we talk about you mentioned the ROI and doing things in new ways and building new things, do you have an example of maybe when you built something new or your team built something new, and it changed the way we work?Natalie: Well yes, an easy answer would just be to fall back on the India example for a second, right? So, a lot of what I did when I went there was they were a very waterfall shop, converted them over to Agile practices and DevOps. They had really none of that practice existing. So, when you ask the company—or the, I'll just say the team to go through that sort of transition, you're pretty much asking them to change everything about the way they work. And we focused a lot more—there was a lot of manual processes that they had been doing previously and we were automating all of those had to do the automations, but then also, you know, make sure that work fit into this new automated way of doing things.They also had, just, also the trepidation over am I going to still be needed, right? Those are all those things that come into your mind when you're basically changing from a manual process to an automated process, “Am I still going to be needed? Is my work going to still be important? 
What am I going to do in this new world, in this new environment?” There's a lot of that that pops up into people's heads.So, a lot of making the change successful, there's certainly the technical aspects of getting it automated and all those things, but to really make a change successful on that kind of scale, it requires getting people to think about it differently and to be okay, and to realize that they can learn new stuff and they'll come out of this better than how they went in. And a lot of that takes a lot of, just, communication and talking, being very personal with people, making sure that they personally understand how to do this, but then just also, things like training and coaching and making sure that there are people there to counter the negative energy that comes along with change. There's always negative energy that comes along with it, people are nervous, they're scared, and you have to be able to counter that in some way.Julie: You know, there was a talk I gave a while ago, and I'm trying to remember the name of it, but one of the things that I talked about was the Pareto Principle, which is, what, 20% of people are going to be amazing in an organization, 60% are going to be, you know, middle of the road, then you have that bottom 20% that are going to kind of fight that change. And you shouldn't really necessarily focus on that top 20%, but you should put a lot of the focus on bringing that bottom 20% along with you. And we talk a lot about just the cultural change that needs to happen when we talk about Chaos Engineering, for example. I mean, there's a huge cultural change that organizations need to switch that mindset into embracing failure. Which we talk a lot about, but it's hard for folks to embrace change fearlessly, embrace failure fearlessly.When you've been going through these experiences in the past—and you mentioned that you really need to think about the people—what's one of the common fears? 
You said, you know, people worry about their jobs and worry about being left behind. Work us through how do you help folks with that?Natalie: Yeah, I think that's actually one of the most interesting aspects of this. When you start looking at—when I [start talking about 00:07:18] about, you know, people don't change, when it's something that's personal like getting married or having kids or going off to college, you know, these are all huge life changes, and we celebrate those, we have parties, we're super happy, we think they're fantastic, right? And I mean, if I go back to India for a second, these are the same people that are struggling on, you know, the fact that I'm going to change from a manual testing to an automated testing, will actually go through an arranged marriage where they're marrying someone that they don't know super well, but they're very happy about it, right? So, that's one of the things that I like to point out and have a discussion with people about is that you're not afraid of change; you're afraid of change in your work life, right? And we have to be very specific about that because we start talking about humans are afraid of change, I actually don't agree. I think we're just afraid of changing what we do at work.And usually, that's because that's somehow tied to our needs pyramid, right? Like, that's how we get our needs met from food and shelter and all of these other kinds of things. And so, when we start to threaten that, it gets really, you know, sketchy for a minute, right? So, that's when we have to, like, take a minute and realize what we're doing and realize that we're being overly protective of a part of our world that, you know, we somehow feel like it's going to then have us begging on the street, is the example I give in my talk, right? That's not going to happen. Like, you know, that's just an irrational fear.And it's highly unlikely that that's your right answer. 
So, what I encourage people to do is to actually find a logical, kind of, sounding board, person, a mentor, a friend—and again, if you don't have this person in your life, then you know, find that person, but start talking to them about, like, what's most likely to happen in this scenario? Or, better yet, what can I get out of it? I think if you spent less time on that and spent, you know, more time on, like, what can I actually get out of this, how could this benefit me, and sort of flip that in your brain.Because what our brains are incredibly good at doing is going down that worst possible path. But the real truth is, we're just as capable of imagining the good. It's just a matter of focus. So, why don't we just focus on that instead? We can focus on what's the positive part of this, what could happen, and we're actually much more likely—there's a whole lot of studies around manifestation—and we can manifest that in our life if we want to, right? So, we just need to focus on the positive side of it.So, I—literally it's honestly a bunch of personal conversations, and getting people to just calm down and realize that the likelihood of their worst-case scenario is not really real. And then start to think through, okay, what can you actually learn from this? You know, is there something that you would like to get out of this? Would you like to try a new role? Would you like to try to lead an initiative? Would you like to be part of this in some way, right?So, those conversations—and again, it has to be personal. That's the thing that I think, you know, when you start doing widespread, full organizational changes, which I was doing over there and I had 120 engineers, it's hard to do it personally because you literally have to have one-on-one conversations with everybody and understand what they are going to get out of it. But that is what's required. 
I think, to really get people to a comfort zone, you've got to make sure that they understand how they fit in, and their why; why they're doing it.Julie: And that is all amazing. Now, as the leader, as the head of engineering and an organization, how do you recommend individual contributors talk to their leaders? Or how do they bring up concerns in a way that's productive in an organization? Because I know for me, sometimes—and you're right, I am excellent at going down that every possible negative outcome path; I've planned it out pretty well, to my peril, but that means that when I bring up concerns with leadership, I tend to do so in a heightened emotional state. So, what's your advice for folks?Natalie: Well, and it's just that. I think it's exactly where you're headed with that is that take the emotions out of it—or attempt to—and try to present your concerns logically. Because there's going to be situations where what you're bringing up is something they need to consider, and if you can present it in a logical way, chances are they will, and they'll take that into consideration. So, I would—like, even if they are going to still move forward with the plans that you somehow don't agree with, like, let's assume that some portion of this change, you don't feel is correct, which is actually one of the most legitimate reasons to worry about this, then what you should do is say, “Okay, look, I have this concern, so here's the Plan B. But just in case, this doesn't work. But I think it might not, so here's a Plan B.”Like, that's a way of presenting that in a way that's not challenging to the situation. So, I'll give you an example. In the India conversations, one of the things was that I actually did create a Plan B around was the fact that the person was bringing up—I was attempting to have Agile teams where they needed to have very strong ownership, they also needed to be able to self-manage. We talked about self-managed teams in Agile. 
And India is a very hierarchical culture, and so the thing that they brought up with me is that this isn't going to work here; it culturally isn't a good fit.And frankly, I knew that I was going to—I had this issue within the company, but was it so widespread within India that I couldn't possibly change it? I hadn't lived there my whole life, I couldn't say, right? So, I needed to actually answer that question. And I thought it was a legitimate question, right? And I thought—but it was presented in, you know, a very factual, logical way, and kind of without the emotions, and so it's like, “Okay, let me think through that.”And so, we did this as a—you know, we created an experimental team where we tried this out to see if it would work. And it actually did, ultimately, succeed with that team. And I love this team because—I mean, to be fair, I did handpick who went on this team. Like, I did, you know, try to pick people who I thought might be the most likely to succeed. I'm not crazy; I did want it to work, and so you know, I did sort of seed it a bit.But at the same time, when they came out of that—and they tend to be a little bit younger than I think some of the, you know—because I think their minds were a little bit more open as part of that, but they came out of that, and after about nine sprints, you started to see the junior engineers challenging the more senior engineers, which in India is not like something that you see all that often. They were also able to—the junior engineers were having opinions, they were contributing to the technical discussions. Like, it was actually a pretty radical shift. But they also kind of walked around with this, like, certain swagger that I cannot describe. But it was, like, super fun to watch.So, you know, you've got to see that this was actually going to work, and it could work. And then it became a really good example, for the rest. So, I think the main thing is to help mitigate risk. 
If you have a real concern over a change that's coming your way, and it's something you don't feel like the company should do, just understand that they may do it instead and that's not personal, but at the same time, you know, you can help by offering a Plan B or some risk mitigation to double-check that it is going to work or to help it work.Julie: Absolutely. It's kind of that whole testing hypothesis, right? We're going to see if this works; we're going to evaluate it. One of the things that you brought up that I love and it was something that when I was at PagerDuty, we used to talk about a lot with the postmortem process, which was to involve junior engineers because they tend to look at things differently with that fresh set of eyes.Natalie: Right.Julie: And they kind of get us a little bit—the people who've been doing it for a very long period of time—a little bit out of your comfort zone because all of a sudden, maybe you're having to explain something. Jason and I have talked about this a few more times probably than necessary, but just, “Well, we've always done it this way because…” and then having to explain that because. You know, one of the things that I find interesting just from your background is—you know, we've talked about this, where you scaled that engineering team from 0 to 100, to deliver on custom software engineering contracts, and you've done quite a few things over your career. I mean, even working at Oracle—which we were actually just talking about an Oracle outage this morning—but, driving technical programs. And that seems to be a lot of your background. I mean, even at Facet, that you introduced engineering best practices to standardize code reviews and improve test coverage. Do you want to talk a little bit about that?Natalie: Yeah, I think—I like this—I call it the adolescent version of engineering. 
It's where, you know, we're through the baby part, we need to start to grow up a little bit, we need to go from getting stuff done in some way or another, to something that's repeatable and scalable. And so, it's like, that adolescent years, that's my fun. That's what I enjoy doing. I call it creating something out of chaos.Basically, taming the chaos is what it really looks like because it's very chaotic initially, and that's true of every, like, small organization; they always start like that. And as they start to grow, you know, you've got ten different engineers who have ten different opinions on how something should be done, and so they do it ten different ways. And that's fine when you're only ten, but then when you need to go from 10 to 20 to 30 to 100, it no longer works. And you do have to create some standards and still leave enough leeway for people to be able to have their tool of choice based on, you know, what makes sense, right?So, there needs to be some pragmatism in there, you can't just, like, also go the [unintelligible 00:16:54] where it's just one thing. But at the same time, there is some standards and there is some consistency that needs to be created so that, like, when you're onboarding a new engineer, there's not 20 things to learn; you can reduce that down to something that's manageable and you can get somebody onboard and productive within a reasonable amount of time. Otherwise, that's difficult, even that becomes difficult. So, every part of it that needs to have some level of standards around it—I think the fun in it, too, is finding that balance between introducing enough process that you have some standardization, you have some consistency, but not so much that you slow it down to the point that it's no longer moving. Because you can; you can strangle a small organization with too much process.So, it's finding that middle ground. 
And yeah, that's what I've pretty much done, like, my whole career in some form or another; it's what I enjoy. And if it gets to the point where things become too standard, too stable, too done, then I'm probably… I'm going to need to move on to something different and new. You know, that's going to be where I go do this again, with somebody else.Julie: Hashtag #startuplife, right?Natalie: [laugh].Julie: [laugh]. That's interesting that you bring up, you know, going from ten people to more, right, where you can just buy any tool you want and reimburse it, and there might not even be a central repo of all the tools that the organization has, to whittling that down into processes that you own, that you control, versus processes that control you. And then bringing those ten people that were there at the beginning that could kind of do whatever they want because the whole goal is to bring this product to market, to refining that organization and helping build out features in service of the customer. So, when you're looking at the new things that you want to do or prioritizing your time or the engineering team's time, what are some of the things that you take into consideration?Natalie: It's kind of actually very similar to performance when you look at the performance of a system, right? The engineering organization is no different. You need to find your bottlenecks and then you work from there. And the bottlenecks are different depending on which team that you're looking at, right? So, I like to start to kind of get a feel for what's working, what's not working, and where things are slow, [unintelligible 00:19:15] oftentimes what I'm trying to do is to get some speed, to get some speed and consistency tend to be really big things without losing quality. 
You know, all of those kinds of—those are always the buckets, right?And so, when you start looking at speed, it really starts to look very much like that performance bottleneck exercise where you just start hitting them one at a time until you, you know, you get through the easy ones and then you start tweaking from there. But for instance, I'll tell you when I first started with Gremlin, we had a very large team and because of that, stand-ups were very huge, there was too much conversation, they took too long, people—actually the odd thing is that you'll find people have less ownership when the team is too large because they don't feel like they're as part of something that they're making a huge—as much of an impact on; they don't feel their impact on a team that's too large, so when you're organized in such a way that the teams are very large, you tend to lose some of the qualities of Agile that you're trying to achieve when you're doing these little small Agile teams, or at least that's the thought. So, one of the things I did was split the team. And one of the first things that I did—and that automatically started to create a different dynamic within the teams, and we're starting to see the results of that. And so, I feel like those are the kinds of things that you do.Like, that was an easy one; we have to do this, like, that first. Now, like, what do we do next? It depends. It depends, like, where, like, in some cases—I'll take India, for example—there was a lot of tech debt. 
So, I had some tech debt that I had to contend with and deal with that was—the way it was built, it was built with this very huge monolithic-style service, and I needed to help them start breaking that into smaller services, mainly because—and they were such a large team, and it was still a monolithic sort of situation, the problem was actually more so than the performance because they had tuned the heck out of that, so that wasn't it.Like, the data was very large, so they had already dealt with performance. But the conflict within the engineering teams was a lot because there was so much coordination. And so, by being able to split this up into services that make sense, then the teams can start to own the services and be able to deliver on that with some speed without having to coordinate so much. And every moment of coordination costs you time, right? So, that's the type of things that you start to look at.And it could be a technical solution, like in this case, it was breaking the technology, from an architectural standpoint, down into something that makes the teams operate differently, or it can be splitting the teams itself without changing the architecture. It can be any number of things. But really start to have to look at what's causing this to go slow.Julie: Now, I love that because when everybody owns everything, nobody owns anything, right? And you talked about breaking the teams down into service teams that make sense. And so, it sounds like it was incredibly intentional; owning your services all the way through into production is really helpful with that speed and that quality. And you mentioned that briefly earlier, which is—what is that? The iron triangle, or whatever they call it, but speed, cost, quality. There's three things; you can only have two. Which two do you pick?Natalie: Right. [laugh]. Exactly.Julie: And I've seen that titled as a fallacy saying that you can really have all three, but I don't really know. What do you think? 
Speed, cost, quality, can you have all three?Natalie: Well, so you can maybe have speed, cost, and quality, but if you throw scope in there, [laugh] and you throw that into your [unintelligible 00:22:41], right? Like, because [unintelligible 00:22:42] where you have to start throwing that in. Like, if you look at—so, you know, the triangle that we tend to look at is the time that you're going to deliver it in, the scope, and the price. Those are the three that I think you can only hold two of. You can go—so by speed when you say speed, cost, and quality, if you go back to your, you know, your original one, depends on how you define speed on whether or not you get quality out of that, right? [laugh].And so, when you say—but when you start putting deadlines on things, then yeah, you can get quality so long as I can control the scope, right? Because then I can scope it down enough that I can deliver something within that timeline that is of high quality, right? So, those are the trade-offs that you have to make. And no, I don't—I still feel like in that particular three-legged stool, you know, there's only two of those you get, that somebody else outside of your organization can handle. You do have to—otherwise, you know, you can't possibly deliver everything in the world within a really short timeframe and expect the quality to be high.Julie: Yeah, wouldn't that be nice if you could, right? But that's why we talk about learning from our failures. That's why we talked about Chaos Engineering and understanding our systems. Because in all reality, we do have timeframes that we need to get things out, and we have to make our systems as reliable as possible. 
But then where do we find the gaps that we may have missed because of speed, because of that timeliness?Natalie: Well, and when you start looking at things like, you know, quality, there's certainly things that you can do, but if you go back to Chaos Engineering—we talk about that for just a second, and we look at the changes that people are afraid of. What happens when you go in and you tell a place, “To improve your quality I'm going to actually start shutting down your host.” They're like, “I'm sorry, what?” [laugh].Julie: [laugh].Natalie: That's a very difficult conversation, right? So, I feel like it's one of those things where once you see that and why you would do it and then, like, you make the adjustments to that, and then it becomes a part of your—doing this sort of change is actually, you know, something that you just do on a continuous basis; it's no longer something that you're afraid of, right? And I think that's true of just [unintelligible 00:24:48] in general. Like, you know, once you start getting into the habit of it, whatever that habit might be—and automation, by the way, is one of those things—and whether it be automating regular tests, whether it be automating Chaos Engineering tests, like any of this automation, that's actually a key to speed with engineering. And the reason for that is because those are so closely linked.I go back and I talk about automation and confident mindset. This is really the two things that give you speed in engineering organization. And the reason is because if you can automate it enough, you can—you know, obviously there's just some speed that comes from automation, you know, that you're not doing things manually, that's great. 
But the thing that you miss in that, or that you don't necessarily think of, is the fact that there's, like, an automated safety net under you, like, through testing, through, like, you know, the systems-level testing, Chaos Engineering, you know, the engineers now feel more free, they're more confident, they're able to make changes at a much more rapid pace. It feels less risky because they're able to make this change and then they know that the tests are going to catch them, right?So, if they've screwed something up, something else is going to stop it before it heads to production. So, they're just more—they're able to just move forward at a faster pace than they would otherwise, right? So, that automation, the speed that you get out of it goes far beyond just you taking the manual process down to an automated one; it's creating the safety net that gives them the confidence to just move without thinking. And that's huge. Like, that's a big deal.It's also—back to your thoughts on junior engineers—it's also why I think it's really important to make sure there's people in the engineering team who [unintelligible 00:26:26] three years, like, three years of experience. It's like you know enough that you can make really good progress and you can be useful, but you don't know so much that you're afraid. Like, there—[laugh] because that confident mindset I'm back to, it really matters. Like, it makes such a big difference in the teams that will move quickly and teams that will not.Julie: I love everything that you just said. And I just saw a tweet from Kelsey Hightower that he tweeted just a couple of days ago; I saw it just before we recorded this. So, he said, “…as an industry we've been pushing… Automate. Automate. Automate. And we haven't been saying… Understand. Understand. Understand. Because if you understand what you're doing, you can automate it if you want to.”And I think you just touched on that. 
And I think you touched on a lot of the having confidence, that what you're doing—that there's safety and even if there are failures, that they're going to be caught. And I think that all ties together beautifully. Now, with that, because I do realize that we are running out of time, I just want to say, so for you, you are giving the closing keynote today at DevOpsDays Boise. And we've talked a lot about overcoming fear during this podcast, and I know that this was something that made you a little bit uncomfortable. Can you tell me why you chose to do this? Why did you choose to overcome this fear?Natalie: Because of my position and the fact that I'm female, I get offers. And I just made a deal with myself about, you know, a few months ago that said, you know, I wouldn't turn these down. And primarily it's because I feel like it's important that at least some women are out there and are serving as examples for others. Like, I'm not saying that I'm going to have, like, the best things to say all the time, and I think that's okay. I don't think every man that comes on a podcast has the best things to say either, right?So, I feel like it's just one of those situations where we need examples for ourselves, and I think it's important that, you know, we see ourselves in the—in what's—in what's, I guess, the speakers and the participants, right? And so, I want to make sure that I do my part in that, I guess.Julie: Well, thank you. And you heard it here first, folks. If you need Natalie to speak at your conference, she made a deal with herself [laugh] that she would not say no. We're really excited to have you both on the podcast and speaking at DevOpsDays Boise. So, thank you, Natalie, and thank you for joining us on Break Things on Purpose. And good luck on your talk today.Natalie: Thank you. Appreciate it. Enjoyed it. [laugh].Julie: Have a wonderful day.Natalie: You too.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. 
If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover:
00:00:00 - Introduction
00:00:57 - Rootly, an incident management platform
00:02:20 - Why build Rootly
00:06:00 - Unique aspects of Rootly
00:09:50 - How people should use Rootly
Links Referenced:
rootly.com/demo: https://rootly.com/demo
Transcript
JJ: How do you now get this massive organization to change the way that they work? Even if they were following, like, a checklist in Google Docs, that still marks as a fairly significant cultural change, and so we need to be very mindful of it.
Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build modern applications, or help us fix them when they break. In this episode, JJ Tang, co-founder of Rootly, joins us to chat about incident response, the tool he's built, and the lessons he's learned from incidents.
So, in this episode, we've got with us JJ Tang, who's the co-founder of a company and a tool called Rootly. Welcome to the show.
JJ: Thank you, Jason, super excited to be here. Big fan of what you guys are doing over at Gremlin and all things Chaos Engineering. Quick intro on my side. I'm JJ, as you mentioned. We are building Rootly, which is an incident management platform built on top of Slack.
So, we help a bunch of different companies automate what we believe to be some of the most manual and tedious work when it comes to incidents, like creating virtual war rooms, Zoom Bridges, tracking your action items on Jira, generating your postmortem timeline, adding the right responders, and generally just helping build that consistency. So, we work with a bunch of different fast-growing tech companies like Canva, Grammarly, Bolt, Faire, Productboard, and also some of the more traditional ones like Ford and Shell. So, super excited to be here. Hopefully, I have some somewhat engaging insight, I hope.
[laugh].
Jason: Yeah, I think you will because in our discussions previously, we've always had fantastic conversations. So, you've kind of covered a lot of the first question that I normally ask, and that's what did you build? And so as you explained, Rootly is an incident management tool; works with Slack. But that naturally leads into the other question that I ask our Build Things guests, and that's why did you build this? Was it something from your experience as an engineer that you're just like, “I need a tool to solve this?” What's the story behind Rootly?
JJ: Yeah, definitely. Sorry to jump the gun on the first question. I was a little bit too excited, I think. But yeah, so my co-founder, and I—his name is Quinton—we both used to work at Instacart, the grocery delivery startup. He was there super, super early days; he was actually one of the first SREs there and kind of built out that team.
And I was more on the product side of things, so I helped us build out our enterprise and last-mile delivery products. If you're curious what does [laugh] grocery have to do with reliability, actually, not that much, but the challenges we were dealing with were at very great scale. So, it all started back when the pandemic first started getting kicked off. Instacart was growing rapidly at the time, we were scaling really well, we were hitting the numbers where we wanted to be, but with suddenly the lockdowns occurring, everyone overnight who didn't care about grocery delivery and thought, “Well, why don't I just drive to Walmart,” [laugh] suddenly wanted to order things on Instacart. So, the company grew 500, 600%, nearly overnight.
And with that, our systems just could not handle the load. And it'd be the most obscure incidents you wouldn't think would break, but under such immense stress and demand, we just couldn't keep the site up all the time. And what that really exposed on our end was, we don't have a really good incident management process.
What we were doing was, we kind of just had every engineer in a single incident channel on Slack. And if you got paged, you just kind of ping in there. “I just got woken up. Did anyone else? Does this look legit?”And there was no formal way, so there was no consistency in terms of how the incidents were created. And then, of course, from that top-of-funnel into the postmortem, there wasn't too much discipline there. So, we really thought about, you know, after the dust kind of settled, there must be a better way to do this. And like most organizations that we work with, you start thinking about how can I build this myself?I think there's probably a little bit of a gap right now in this space. People generally understand monitoring tools really well, like New Relic, Datadog, alerting tools super well, PagerDuty, Opsgenie, they do a really good job at it. But everything afterwards, the actual orchestration and learning from the incidents tends to be a little bit sparse. So, we started embarking on our own. And for my co-founder's side of things, he was more at the heart of the incident than I was. I think I was the one complaining about and breathing down his neck a little bit about why things [laugh] sometimes weren't working.And—yeah, and, you know, as we started thinking about internal solutions, we took a step back and thought, “Well, you know, if Instacart is facing this problem then I think a lot of companies must be as well.” And luckily, our hypothesis has proven to be true, and yeah, the rest is just history now.Jason: That's really fascinating, particularly because, I mean, it is such a widespread issue, right? And I think I've experienced that as well, where you've got a general on-call or incidents channel, and literally everybody in the organization's in there, not just engineers, but—like yourself—product people and customer success or support folks are all in there. 
And the idea is this, sort of—it's a giant, giant crowd of folks who are just, like, waiting and wondering. And so having a tool to help manage that is extremely useful. As you started building out this tool, I'm starting to think there are starting to become a lot more incident management tools or incident response management tools, so talk to me about what are the unique points about Rootly?
Because I suspect that a lot of it is influenced from, “These are the pain points that I had during my incidents,” and so you pulled them over? And so I'm curious, what are those that you brought to the tool that really help it shine during an incident?
JJ: Yeah, definitely. I think the space that we're in right now is certainly heating up as you go to the different conferences and the content that's put out there. Which is great because that means everyone is educating the broader audience of what's going on and just makes my job just a little bit easier. There's a couple, you know, original hypotheses that we had for the product that just ended up not being as important. And that has really defined how we think about Rootly and how we differentiate a lot of what we do.
How we did incidents at Instacart wasn't all that unique, you know? We used the same tools everyone else did. We had Opsgenie, we used Slack, Datadog, Jira, we wrote our postmortems on Confluence, stuff like that, and our initial reaction was, “Well, people are using the same tools, they must be following a very similar process.” And we also looked and worked a lot with people that are deep into the space, you know, Google, Stripe, the Airbnbs of the world, people that have a very formal process. And so we actually embarked on this journey building a relatively opinionated tool; “This is how we think the best incidents can be run.” And that actually isn't the best fit for everyone.
I think if you had no incident management process whatsoever, that's great.
You know, we give you super powerful defaults out of the box, like we do today, and you kind of can just hit the ground running super fast. But what we found is despite everyone using basically the same kind of tools, the way they use them is super different. You might only want to create a Zoom Bridge for, you know, high severity incidents, whereas someone else wants to create it for every single incident, for example. So, what we did was really focus on how do we balance between building something that's opinionated versus flexible, where should customers be able to turn the knobs and the dials.
And a big part of it is we built what we call our workflows, and that allows customers to create a process that's very similar to theirs. And a part of that we didn't anticipate at the very beginning was, although the tool is super simple to use, I think our average install time is probably 13 minutes, all the integrations and everything on a quick call with our customers, the really heavy lifting comes with, how do you now get this massive organization to change the way that they work? Even if they were following, like, a checklist in Google Docs, that still marks as a fairly significant cultural change, and so we need to be very mindful of it. So, we can't be just ripping tools out of their existing stack, we can't be wildly changing every process; everything has to happen progressively, almost, in a way.
And that is a lot more digestible than saying you're going to replace everything.So, I think that's probably one of the key differences is we tend to lean more on the side of playing with your existing stack versus changing everything up.Jason: That's a really good insight, particularly because coming from Chaos Engineering, and that is almost entirely changing the way that people work, right, is Chaos Engineering is a new practice, so I definitely empathize with you, or sympathize with you on that struggle of, like, how do you change what people are doing and really get them to embrace it? That said, being opinionated is also a really good thing because you have a chance to lead people, and so that leads me to our final question that we always ask folks—and this is where being opinionated is good—but if folks were to use Rootly, or just even wanted to improve their incident response processes in general, what are some of those opinions that you had about how people should be doing that, that they should consider embracing?JJ: Yeah, that's an awesome question. So, a couple things, a little bit related to your second question that we initially thought but just proved to not be as important for us, everything that we build at the beginning—and still build—is relatively laser-focused on helping you get to that resolution as fast as possible. But from an organizational perspective, what we found is, people don't think about incident management success as how quickly they can resolve an incident. A lot of it's actually just having that security and framework and consistency around the incident. 
So ironically, as a tool in incident management, the most important things are actually around your people and the process and the culture that you can develop around the tool.
No matter how good of something that we build, you know—let's say you're an organization, you just bring in Rootly, you have a very blameful way of handling postmortems, no one generally understands how severities in the organization work, you're super laser-focused on, you know, tracking MTTR, which may not always be the best metric, but you still want to interpret it as such, it's very difficult to make the tool successful. So, that's the biggest advice that we give to our customers is when we see those types of red flags from, like, a process and culture standpoint, we'll try to guide them the best that we can. And we'll also do it from a product perspective. What you get out of the box today, we have companies as small as, you know, 20 for example, just kind of being able to hit the ground running; they'll use workflow templates that are pre-built based on some best practices that we've seen to just kind of layer in that framework. So, I think that would be a really big one that we've noticed is it's not all about us; it's not all about the product and the benefits that we can provide; it's about how we can actually enable our customers to get to that stage.
Jason: I love that answer. Well, JJ, thanks for being a guest on the show and sharing a bit more about your journey and the journey of Rootly. If folks are interested in trying out the product and getting better at incident response, where can they find more info about you and about Rootly?
JJ: Yeah. You can just visit rootly.com/demo. We do offer a 14-day trial if you want to sign up for free.
If you want to talk to one of us or our partnerships team, you're welcome to book a personalized session.
I recommend that because then you get to see my super cute dog that isn't with me right now and wouldn't matter because this is audio only, but I love showing her off. That's my favorite part of my job.Jason: So, if you want to go see JJ's dog, or learn more about Rootly and incident management, go check it out. Thanks again.JJ: Yeah, thanks for having me.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: Introduction (00:00) Elizabeth, AppLand, and AppMap (1:00) Why build AppMap (03:34) Being open-source (06:40) Building community (08:50) Some tips on using AppMap (11:15) Links Referenced: VS Code Marketplace: https://marketplace.visualstudio.com/items?itemName=appland.appmap JetBrains Marketplace: https://plugins.jetbrains.com/plugin/16701-appmap AppLand: https://appland.com TranscriptElizabeth: “Whoa.” [laugh]. That's like getting a map of all of the Planet Earth with street directions for every single city, across all of the continents. You don't need that; you just want to know how to get to the nearest 7/11, right? Like, so just start small. [laugh]. Don't try and map your entire universe, galaxy, you know, out of the gate. [laugh].Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build and operate modern applications. In this episode, Elizabeth Lawler joins us to chat about the challenges of building modern, complex software, and the tool that she's built to help developers better understand where they are and where they're going.Jason: Today on the show, we have Elizabeth Lawler who's the founder of a company called AppLand, they make a product called AppMap. Welcome to the show, Elizabeth.Elizabeth: Thank you so much for having me, Jason.Jason: Awesome. So, tell us a little bit more about AppLand and this product that you've built. What did you build?Elizabeth: Sure. So, AppMap is a product that we're building in the open. It's a developer tool, so it's free and open-source. And we call it Google Maps for code. 
You know, I think that there has been a movement in more assistive technologies being developed—or augmenting technologies being developed for developers, and with some of the new tools, we were looking to create a more visual and interactive experience for developers to understand the runtime of their code better when they code.So, it's interesting how a lot of the runtime of an application when you're writing it or you're actually crafting it is sort of in your imagination because it hasn't yet been. [laugh]. And so, you know, we wanted to make that information apparent and push that kind of observability left so that people could see how things were going to work while they're writing them.Jason: I love that idea of seeing how things are working while you're writing it because you're so right. You know, when I write code, I have a vision in mind, and so, like, you mentally kind of scaffold out here are the pieces that I need and how they'll fit together. And then as you write it, you naturally encounter issues, or things don't work quite as you expect, and you tweak those. And sometimes that idea or the concept in your head gets a little fuzzy. So, having a tool that actually shows you in real-time seems like an extremely valuable tool.Elizabeth: Thank you. Yes. And I think you've nailed how it's not always the issue of dependency, it's really the issue of dependent behavior. And that dependent behavior of other services or code you're interacting with is the hardest thing to imagine while you're writing because you're also focusing on feature and functionality. So, it's really a fun space to work in, and crafting out that data, thinking about what you would need to present, and then trying to create an engaging experience around that has been a really fun journey that the team has been on since 2020. We announced the project in 2021 in March—I think almost about this time last year—and we have over 13,000 users of AppMap now.Jason: That's incredible. 
So, you mentioned two things that I want to dive into. One is that it's open-source, and then the second—and maybe we'll start there—is why did you build this? Is this something that just was organic; you needed a tool for yourself, or… what was the birth of AppMap?Elizabeth: Oh, I think that's such a great question because I think it was—this is the third startup that I've been in, third project of this kind, building developer tooling. My previous company was a cybersecurity company; before that, I helped build applications in the healthcare sector. And before that, I worked in government and healthcare. And—also, again, building platforms and IT systems and applications as part of my work—and creating a common understanding of how software operates—works—understanding and communicating that effectively, and lowering that kind of cognitive load to get everybody on the same page is such a hard problem. I mean, when we didn't all work from home, we had whiteboards [laugh] and we would get in the room and go through sprint review and describe how something was working and seeing if there was anything we could do to improve quality, performance, reliability, scalability, functionality before something shipped, and we did it as a group, in-person. And it's very difficult to do that.And even that method is not particularly effective because you're dealing with whiteboards and people's mental models and so we wanted to, first of all, create something objective that would show you really how things worked, and secondly, we wanted to lower the burden to have those conversations with yourself. Or, you know, kind of rubber ducky debugging when something's not working, and also with the group. So, we created AppMaps as both interactive visualizations you could use to look at runtime, debug something, understand something better, but also something that could travel and help to make communication a lot easier. 
And that was the impetus, you know, just wanting to improve our own group understanding.
Jason: I love that notion of not just having the developer understand more, but that idea of yeah, we work in teams and we often have misalignment simply because people on different sides of the application look at things differently. And so this idea of, can we build a tool that not only helps an individual understand things, but gets everybody on the same page is fantastic.
Elizabeth: And also work in different layers of the application. For example, many observability tools are very highly focused on network, right? And sometimes the people who have the view of the problem aren't able to articulate it clearly or effectively or expeditiously enough to capture the attention of someone who needs to fix the problem. And so, you know, I think also having—we've blended a combination of pieces of information into AppMap, not only code, but also web services, data, I/O, and other elements and so that we can start to talk more effectively as groups.
Jason: That's awesome. So, I think that collaboration leads into that second thing that I brought up that I think is really interesting is that this is an open-source project as well. And so—
Elizabeth: It is.
Jason: Tell me more about that. What's the process? Because that's always, I think, a challenge is this notion of we love open-source, but we're also—we work for companies, we like to get paid. I like to get paid. [laugh]. So, how does that work out and what's that look like as you've gone on this journey?
Elizabeth: Yeah. You know, I think we, quietly working, are certainly looking for other fellow travelers who are interested in this space. We started by creating an open data framework—which AppMap is actually both the name of a code editor extension you can install and use to see the runtime of your code to understand issues and find and fix them faster, but it also is a data standard.
And with that data standard, we're really looking to work with other people. Because, you know, I think this type of information should be widely accessible for people and I think it should be available to understand.
I think, you know, awareness about your software environment is just kind of like a basic developer right. And so, [laugh] you know, the reason why we made the tools free, and the reason why we've made the data structure open-source is to be able to encourage people to get the kind of information that they need to do their job better. And by making our agents open-source, by making our clients open-source, it simply allows people to be able to find and adopt this kind of tooling to improve their own job performance. And so, you know, that was really kind of how we started and I think, ultimately, you know, there are opportunities to provide commercial products, and there will be some coming down the road, but at the moment, right now we're really interested in working with the community and, you know, understanding their needs better.
Jason: That's awesome. Number one, I love the embrace of, you know, when you're in the startup land, there's the advice, never try to monetize too early, right? Build something that's useful that people enjoy and really value, and then it'll naturally come. The other question that I had is, I'm assuming you eat your own dog food, slash drink your own champagne. So, I'm really curious, like, one of the problems that I've had in open-source is the onboarding of new community members, right? Software is complex, and so people often have troubles, and they're like, how do I fix this? They file an issue on GitHub or whatever system you're using, and there's sometimes a notion with open-source of like, that's a good thing that you called out. You can fix that because it's open-source, but people are like, “I don't know how.”
Elizabeth: Yeah.
Jason: Does AppMap actually help in enabling AppMap open-source contributors?
Like, have you seen that?
Elizabeth: So, we've had issues filed. I would say that most of the fixes still come from us. If people wanted to run AppMap on AppMap to identify the bug, [laugh] that would be great, but it doesn't really work that way. So, you know, for us at this time, most of it is community-filed issues that we are working to resolve. But I do think—and I will say—that we have actually used AppMap on open-source projects that we use, and we've found [laugh] flaws and bugs using AppMap with those projects, and have filed issues with them. [laugh].
Jason: That's awesome. I love that. I mean, that's what it means to be open-source, right, and to use open-source is that notion of, like—
Elizabeth: Right.
Jason: Contribute wherever you can.
Elizabeth: Yeah. And if that's the way, you know, we can contribute, you know—and I think similarly, I mean, our relationship to open-source is very strong. So, for example, you know, we came from the Ruby community and there's lots of different kinds of open-source projects that are commonly used for things like security and authentication and we've done a lot of work in our own project to tag and label those commonly-used libraries so that they can be—when you pop open an AppMap everything is all beautiful and tagged and, you know, very nicely and neatly organized for you so you can find everything you're looking for. Similarly, we're working with open-source communities in Python and Java and now JavaScript to do the same thing, which is, you know, to make sure that important information, important commonly used libraries and tools are called out clearly.
Jason: So, as you're adding more languages, you're going to get more users. So, that brings me to our final question. And that's, as you get all these new users, they probably need some guidance. So, if you were to give some users tips, right? Someone goes out there, like, “I want to use AppMap,” what's some advice that you'd give them related to reliability?
How can they get the best experience and build the best code using AppMap?Elizabeth: Yes. So, this has actually been a key piece of feedback, I think, from the community for us, which is, we released this tool out to the world, and we said, “We're going to bring here; we come with gifts of observability in your code editor.” And people have used it for all kinds of different projects: They've used it for refactoring projects, for debugging, for onboarding to code, for all of these different use cases, but one of the things that can be overwhelming is the amount of information that you get. And I think this is true of most kinds of observability tools; you kind of start with this wall of data, and you're like, “Where am I going to start?”And so my recommendation is that AppMap is best used when you have a targeted question in mind, not just kind of like, you know, “I'd like to understand how this new piece of the codebase works. I've shifted from Team A to Team B, and I need to onboard to it.” “I'd like to figure out why I've got a slow—you know, I've been told that we've got a slowdown. Is it my query? Is it my web service? What is it? I'd like to pinpoint, find, and fix the issue fast.”One of the things that we're doing now is starting to leverage the data in a more analytic way to begin to help people focus their attention. And that's a new product that we're going to be bringing out later this spring, and I'm very, very excited about it. But I think that's the key, which is to start small, run a few test cases that are related to the area of code that you're interested in if that's an onboarding case, or look for areas of the code you can record or run test cases around that is related to the bug you have to fix. Because if you just run your whole test suite, you will generate a giant amount of data. Sometimes people generate, like, 10,000 AppMaps on the first pass through. And they're like, “Whoa.” [laugh]. 
That's like getting a map of all of the Planet Earth with street directions for every single city, across all of the continents. You don't need that; you just want to know how to get to the nearest 7/11, right? Like, so just start small. [laugh]. Don't try and map your entire universe, galaxy, you know, out of the gate. [laugh].Jason: That's fantastic advice, and it sounds very similar to what we advise at Gremlin for Chaos Engineering of starting small, starting very specific, really honing in on sort of a hypothesis, “What do I think will happen?” Or, “How do I think I understand things?” And really going from there?Elizabeth: Yeah. It does, it focuses the mind to have a specific question as opposed to asking the universe what does it all mean?Jason: Yeah. Well, thanks for being a guest on the show today. Before we go, where can people find AppMap if they're interested in the tool, and they want to give it a try?Elizabeth: So, we are located in the VS Code Marketplace if you use the VS Code editor, and we're also located in JetBrains Marketplace if you use any of the JetBrains tools.Jason: Awesome. So yeah, for our VS Code and JetBrains users, go check that out. And if you're interested in more about AppMap or AppLand, where can folks find more info about the company and maybe future announcements on the analysis tooling?Elizabeth: That would be appland.com A-P-P-L-A-N-D dot C-O-M. And our dev docs are there, new tooling is announced there, and our community resources are there, so if anyone would like to participate in either helping us build out our data model, feedback on our language-specific plans or any of the tooling, we welcome contributors.Jason: Awesome. Thanks again for sharing all of that info about AppMap and AppLand and how folks can continue to build more reliable software.Elizabeth: Thank you for having me, Jason.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. 
If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: Introduction (00:00) How Chris got into the world of chaos and teaching middle school science (02:11) The Cengage seasonal model and preparing for the (5:56) How Cengage schedules the chaos and the “day of darkness” (11:10) Scaling and migration and “the inches we need” (15:28) Communicating with different teams and the customers (18:18) Chris's biggest lesson from practicing chaos engineering (24:30) Chris and working at Cengage/Outro (27:40) Links Referenced: Cengage: https://www.cengagegroup.com/ Chris Martello on LinkedIn: https://www.linkedin.com/in/christophermartello/ TranscriptJulie: Wait, I got it. You probably don't know this one, Chris. It's not from you. How does the Dalai Lama order a hot dog?Chris: He orders one with everything.Julie: [laugh]. So far, I have not been able to stump Chris on—[laugh].Chris: [laugh]. Then the follow-up to that one for a QA is how many engineers does it take to change a light bulb? The answer is, none; that's a hardware problem.Julie: Welcome to Break Things on Purpose, a podcast about reliability, quality, and ways to focus on the user experience. In this episode, we talk with Chris Martello, manager of application performance at Cengage, about the importance of Chaos Engineering in service of quality.Julie: Welcome to Break Things on Purpose. We are joined today by Chris Martello from Cengage. Chris, do you want to tell us a little bit about yourself?Chris: Hey, thanks for having me today, Julie, Jason. It's nice to be here and chat with you folks about Chaos Engineering, Chaos Testing, Gremlin. As Julie mentioned I'm a performance manager at Cengage Learning Group, and we do a fair amount of performance testing, both individual platforms, and coordinated load testing. 
I've been a software manager at Cengage for about five years, total of nine altogether there at Cengage, and worn quite a few of the testing hats, as you can imagine, from automation engineer, performance engineer, and now QA manager. So, with that, yeah, my team is about—we have ten people that coordinate and test our [unintelligible 00:01:52] platforms. I'm on the higher-ed side. We have Gale Research Library, as well as soft skills with our WebAssign and ed2go offerings. So, I'm just one of a few, but my claim to fame—or at least one of my passions—is definitely chaos testing and breaking things on purpose.Julie: I love that, Chris. And before we hear why that's your passion, when you and I chatted last week, you mentioned how you got into the world of QA, and I think you started with a little bit of different type of chaos. You want to tell us what you did before?Chris: Sure, even before a 20-year career, now, in software testing, I managed chaos every day. If you know anything about teaching middle school, seventh and eighth-grade science, those folks have lots of energy and combine that with their curiosity for life and, you know, their propensity to expend energy and play basketball and run track and do things, I had a good time for a number of years corralling that energy and focusing that energy into certain directions. And you know back, kind of, with the jokes, it was a way to engage with kids in the classroom was humor. And so there was a lot of science jokes and things like that. But generally speaking, that evolved into I had a passion for computers, being self-taught with programming skills, project management, and things like that. 
It just evolved into a different career that has been very rewarding.And that's what brings me to Cengage and why I come to work every day with those folks is because instead of now teaching seventh and eighth-grade science to young, impressionable minds, nowadays I teach adults how to test websites and how to test platforms and services. And the coaching is still the same; the mentoring is still the same. The aptitude of my students is a lot different, you know? We have adults, they're people, they require things. And you know, the subject matter is also different. But the skills in the coaching and teaching is still the same.Jason: If you were, like, anything like my seventh-grade science teacher, then another common thing that you would have with Chaos Engineering and teaching science is blowing a lot of things up.Chris: Indeed. Playing with phosphorus and raw metal sodium was always a fun time in the chemistry class. [laugh].Julie: Well, one of the things that I love, there are so many parallels between being a science teacher and Chaos Engineering. I mean, we talk about this all the time with following the scientific process, right? You're creating a hypothesis; you're testing that. And so have you seen those parallels now with what you're doing with Chaos Engineering over there at Cengage?Chris: Oh, absolutely. It is definitely the basis for almost any testing we do. You have to have your controlled variables, your environment, your settings, your test scripts, and things that you're working on, setting up that experiment, the design of course, and then your uncontrolled variables, the manipulated ones that you're looking for to give you information to tell you something new about the system that you didn't know, after you conducted your experiment. 
So, working with teams, almost half of the learning occurs in just the design phase in terms of, “Hey, I think this system is supposed to do X, it's designed in a certain way.” And if we run a test to demonstrate that, either it's going to work or it's not. Or it's going to give us some new information that we didn't know about it before we ran our experiment. Julie: But you also have a very, like, cyclical reliability schedule that's important to you, right? You have your very important peak traffic windows. And what is that? Is that around the summertime? What does that look like for you? Chris: That's right, Julie. So, our business model, or at least our seasonal model, runs off of typical college semesters. So, you can imagine that August and September are really big traffic months for us, as well as January and part of February. It does take a little extra planning in order to mimic that traffic. Traffic and transactions at the beginning of the semester are a lot different than they are at the middle and even at the end of the semester. So, we see our secondary higher education platforms as courseware. We have our instructors doing course building. They're taking a textbook, a digitized textbook, they're building a course on it, they're adding their activities to it, and they're setting it up. At the same time that's going along, the students are registering, they are signing up to use the course, they're signing up to their course key for Cengage products, and they're logging into the course. The middle section looks a lot like taking activities and tests and quizzes, reading the textbook, flipping pages, and maybe even making some notes off to the side. And then at the end of the semester, when the time is up, quite literally on the course—you know, my course semester starts from this day to this day, in 15th of December. Computers being as precise as they are, when 15th of December at 11:59 p.m.
rolls off the clock, that triggers a whole bunch of cron jobs that say, “Hey, it's done. Start calculating grades.”And it has to go through thousands of courses and say, “Which courses expired today? How many grades are there submitted? How many grades are unsubmitted and now I have to calculate the zeros?” And there's a lot of math that goes in with that analytics. And some of those jobs, when those midnight triggers kick off those jobs, it will take eight to ten hours in order to process that semester's courses that expire on that day.Julie: Well, and then if you experience an outage, I can only assume that it would be a high-stress situation for both teachers and students, and so we've talked about why you focus so heavily on reliability, I'd love to hear maybe if you can share with us how you prepare for those peak traffic events.Chris: So yeah, it's challenging to design a full load test that encompasses an entire semester's worth of traffic and even the peaks that are there. So, what we do is, we utilize our analytics that give us information on where our peak traffic days lie. And it's typically the second or third Monday in September, and it's at one or two o'clock in the afternoon. And those are when it's just what we've seen over the past couple of years is those days are our typical traffic peaks. And so we take the type of transactions that occur during those days, and we calibrate our load tests to use those as a peak, a one-time, our performance capacity.And then that becomes our x-factor in testing. Our 1x factor is what do we see in a semester at those peaks? And we go gather the rest of them during the course of the semester, and kind of tally those up in a load test. So, if our platforms can sustain a three to six-hour load test using peak estimate values that come from our production analysis, then we think we're pretty stable.And then we will turn the dial up to two times that number. And that number gives us an assessment of our headroom. 
How much more headroom past our peak usage periods do we have in order to service our customers reliably? And then some days, when you're rolling the dice, for extra bonus points, we go for 3x. And the 3x is not a realistic number.I have this conversation with engineering managers and directors all the time. It's like, “Well, you overblow that load test and it demonstrated five times the load on our systems. That's not realistic.” I says, “Well, today it's not realistic. But next week, it might be depending on what's happening.”You know, there are things that sometimes are not predictable with our semesters and our traffic but generally speaking it is. So, let's say some other system goes down. Single-sign-on. Happens to the best of us. If you integrate with a partner and your partner is uncontrolled in your environment, you're at their mercy.So, when that goes down, people stop entering your application. When the floodgates open, that traffic might peak for a while in terms of, hey, it's back up again; everybody can log in. It's the equivalent of, like, emptying a stadium and then letting everybody in through one set of doors. You can't do it. So, those types of scenarios become experimental design conversations with engineering managers to say, “At what level of performance do you think your platform needs to sustain?”And as long as our platforms can sustain within two to three, you know, we're pretty stable in terms of what we have now. But if we end up testing at three times the expected load and things break catastrophically, that might be an indication to an architect or an engineering director, that, hey, if our capacity outlives us in a year, it might be time to start planning for that re-architecture. 
Start planning for that capacity because it's not just adding on additional servers; planning for that capacity might include a re-architecture of some kind.Julie: You know, Chris, I just want to say to anybody from Coinbase that's out there that's listening, I think they can find you on [LinkedIn](https://www.linkedin.com/in/christophermartello/) to talk about load testing and preparing for peak traffic events.Chris: Yeah, I think the Superbowl saw one. They had a little QR code di—Julie: Yeah.Chris: —displayed on the screen for about 15 seconds or so, and boy, I sure hope they planned for that load because if you're only giving people 15 seconds and everybody's trying to get their phone up there, man I bet those servers got real hot real fast. [laugh].Julie: Yeah, they did. And there was a blip. There was a blip.Chris: Yeah. [laugh].Julie: But you're on LinkedIn, so that's great, and they can find you there to talk to you. You know, I recently had the opportunity to speak to some of the Cengage folks and it was really amazing. And it was amazing to hear what you were doing and how you have scheduled your Chaos Engineering experiments to be something that's repeatable. Do you want to talk about that a little bit for folks?Chris: Sure. I mean, you titled our podcast today, “A Day of Darkness,” and that's kind of where it all started. So, if I could just back up to where we started there with how did chaos become a regular event? How did chaos become a regular part of our engineering teams' DNA, something that they do regularly every month and it's just no sweat to pull off?Well, that Day of Darkness was 18 hours of our educational platforms being down. Now, arguably, the students and instructors had paid for their subscriptions already, so we weren't losing money. But in the education space and in our course creations, our currency is in grades and activities and submissions. So, we were losing currency that day and losing reputation. 
And so we did a postmortem that involved engineering managers, quality assurance, performance folks, and we looked at all the different downtimes that we've had, and what are the root causes.And after conferring with our colleagues in the different areas—we've never really been brought together in a setting like that—we designed a testing plan that was going to validate a good amount of load on a regular basis. And the secondary reason for coordinating testing like that was that we were migrating from data center to cloud. So, this is, you know, about five, six years ago. So, in order to validate that all that plumbing and connections and integrations worked, you know, I proposed I says, “Hey, let's load test it all the same time. Let's see what happens. Let's make sure that we can run water through the pipes all day long and that things work.”And we plan this for a week; we planned five days. But I traveled to Boston, gathered my engineers kind of in a war room situation, and we worked on it for a week. And in that week, we came up with a list of 90 issues—nine-zero—that we needed to fix and correct and address for our cloud-based offerings before it could go live. And you know, a number of them were low priority, easy to fix, low-hanging fruit, things like that. But there were nine of them that if we hadn't found, we were sure to go down.And so those nine things got addressed, we went live, and our system survived, you know, and things went up. After that, it became a regular thing before the semesters to make sure, “Hey, Chris, we need to coordinate that again. Can you do it?” Sure enough, let's coordinate some of the same old teams, grab my run sheet. 
And we learned that we needed to give a day of preparation because sometimes there were folks that their scripts were old, their environment wasn't a current version, and sometimes the integrations weren't working for various reasons of other platform releases and functionality implementation.So, we had a day of preparation and then we would run. We'd check in the morning and say, “Everybody ready to go? Any problems? Any surprises that we don't know about, yet?” So, we'd all confer in the morning and give it a thumbs up.We started our tests, we do a three-hour ramp, and we learned that the three-hour ramp was pretty optimal because sometimes elastic load balancers can't, like, spin up fast enough in order to pick up the load, so there were some that we had to pre-allocate and there were others that we had to give enough time. So, three hours became that magic window, and then three hours of steady-state at our peak generation. And now, after five years, we are doing that every month.Jason: That's amazing. One of the things you mentioned in there was about this migration, and I think that might tie back to something you said earlier about scaling and how when you're thinking of scaling, especially as I'm thinking about your migration to the cloud, you said, “Scaling isn't just adding servers. Sometimes that requires re-architecting an application or the way things work.” I'm curious, are those two connected? Or some of those nine critical fixes a part of that discovery?Chris: I think those nine fixes were part of the discovery. It was, you can't just add servers for a particular platform. It was, how big is the network pipe? Where is the DNS server? Is it on this side or that side? Database connections were a big thing: How many are there? Is there enough?So, there was some scaling things that hadn't been considered at that level. You know, nowadays, fixing performance problems can be as easy as more memory and more CPU. It can be. Some days it's not. 
Some days, it can be more servers; some days, it can be bigger servers. Other times, it's—just, like, quality is everybody's job, performance fixing is not always a silver bullet. There are things like page optimization by the designers. There's code optimization by your front-end engineers. And your back-end engineers, there are database optimizations that can be made: Indexing, reindexing on a regular basis—whatever that schedule is—for optimizing your database queries. If your front-end goes to an API for five things on the first page, does it make five extra calls, or does it make one call, and all five things come across at the same time? So, those are considerations that load performance testing can tell you where to begin looking. But as quality assurance and that performance lead engineer, I might find five things, but the fixes weren't just more testing and a little bit of extra functionality. It might have involved DevOps to tweak the server connections, it might have involved network to slim down the hops from four different load balancers to two, or something like that. I mean, it was always just something else that you never considered that you utilized your full team and all of their expertise and skills in order to come up with those inches. And that's one of my favorite quotes from Any Given Sunday. It's an older football movie starring Al Pacino. He gives this really awesome speech in a halftime type of setting, and the punch line for this whole thing is, “The inches we need are everywhere around us.” And I tell people that story in terms of performance because performance, at the software level, is a game of inches. And those inches are in all of our systems and it's up to us as engineers to find them and add them up. Julie: I absolutely love everything about that. And that would have made a great title for this episode. “The Inches we Need are Everywhere Around Us.” We've already settled on, “A Day of Darkness with Chris Martello,” though.
On that note, Chris, some of the things that you mentioned involve a lot of communication with different teams. How did you navigate some of those struggles? Or even at the beginning of this, was it easy to get everybody on board with this mindset of a new way of doing things? Did you have some challenges?Chris: There were challenges for sure. It's kind of hard to picture, I guess, Cengage's platform architecture and stuff. It's not just one thing. It's kind of like Amazon. Amazon is probably the example is that a lot of their services and things work in little, little areas.So, in planning this, I looked at an architecture diagram, and there's all these things around it, and we have this landscape. And I just looked down here in the corner. I said, “What's this?” They said, “Well, that's single-sign-on.” I says, “Well, everything that touches that needs to be load tested.”And they're like, “Why? We can't do that. We don't have a performance environment for that.” I said, “You can't afford not to.” And the day of darkness was kind of that, you know, example that kind of gave us the [sigh] momentum to get over that obstacle that said, “Yeah, we really do need a dedicated performance environment in order to prove this out.”So, then whittling down that giant list of applications and teams into the ones that were meaningful to our single-sign-on. And when we whittled that down, we now have 16 different teams that regularly participate in chaos. Those are kind of the ones that all play together on the same playing field at the same time and when we find that one system has more throughput than another system or an unexpected transaction load, sometimes that system can carry that or project that load onto another system inadvertently. And if there's timeouts at one that are set higher than another, then those events start queuing up on the second set of servers. 
It's something that we continually balance on.And we use these bits of information for each test and start, you know, logging and tracking these issues, and deciding whether it's important, how long is it going to take to fix, and is it necessary. And, you know, you're balancing risk and reward with everything you're doing, of course, in the business world, but sometimes the, you know—“Chris, bring us more quality. You can do better this month. Can you give us 20 more units of quality?” It's like, “I can't really package that up and hand it to you. That's not a deliverable.”And in the same way that reputation that we lose when our systems go down isn't as quantifiable, either. Sure, you can watch the tweets come across the interwebs, and see how upset our students are at those kinds of things, but our customer support and our service really takes that to heart, and they listen to those tweets and they fix them, and they coordinate and reach out, you know, directly to these folks. And I think that's why our organization supports this type of performance testing, as well as our coordinated chaos: The service experience that goes out to our customers has to be second to none. And that's second to none is the table stakes is your platform must be on, must be stable, and must be performing. That's just to enter the space, kids. You've got to be there. [laugh].You can't have your platform going down at 9 p.m. on a Sunday night when all these college students are doing their homework because they freak out. And they react to it. It's important. That's the currency. That is the human experience that says this platform, this product is very important to these students' lives and their well-being in their academic career. And so we take that very seriously.Jason: I love that you mentioned that your customer support works with the engineering team. 
Because it makes me think of how many calls have you been on where something went wrong, you contacted customer support, and you end up reaching this thing of, they don't talk to engineering, and they're just like, “I don't know, it's broken. Try again some other time.” Or whatever that is, and you end up lost. And so this idea of we often think of DevOps as developers and operations engineers working together and everybody on the engineering side, but I love that idea of extending that. And so I'm curious, in that vein, does your Chaos Engineering, does your performance testing also interact with some of what customer support is actually doing? Chris: In a support kind of way, absolutely. Our customer call support is very well educated on our products and they have a lot of different tools at their disposal in order to correct problems. And you know, many of those problems are access and permissions and all that kind of stuff that's usual, but what we've seen is even though our customer base is increasing and our call volume increases accordingly, the percentage decreases over time because our customer support people have gotten so good at answering those questions. And to that extent, when we do log issues that are not as easily fixed with a tweak or knob toggle at the customer support side, those get grouped up into a group of tickets that we call escalation tickets, and those go directly to engineering. And when we see groups of them that look and smell kind of the same or have similar symptoms, we start looking at how to design that into chaos, and is it a real performance issue? Especially when it's related to slowness or errors that continuously come at a particular point in that workflow. So, I hope I answered that question there for you, Jason. Jason: Yeah, that's perfect. Julie: Now, I'd like to kind of bring it back a little bit to some of the learnings we've had over this time of practicing Chaos Engineering and focusing on that quality testing.
Is there something big that stands out in your mind that you learned from an experiment? Some big, unknown-unknown that you don't know that you ever could have caught without practicing? Chris: Julie, that's a really good question, and there isn't, you know, big bang or any epiphanies here. When I talk about what is the purpose of chaos and what do we get out of it, there's the human factor of chaos in terms of what does this do for us. It gets us prepared, it gets us a fire drill without the sense of urgency of production, and it gets people focused on solving a problem together. So, by practicing in a performance, in a chaos sort of way, when performance does affect the production, those communication channels are already greased. When there's a problem with some system, I know exactly who the engineer is to go to and ask him a question. And that has also enabled us to reduce our mean time to resolution. That mean time to resolution factor is predicated on our teams knowing what to do, and how to resolve those. And because we've practiced it, now that goes down. So, I think the synergy of being able to work together and triangulate our teams on existing issues in a faster sort of way, definitely helps our team dynamic in terms of solving those problems faster. Julie: I like that a lot because there is so much more than just the technical systems. And that's something that we like to talk about, too. It is your people's systems. And you're not trying to surprise anybody, you've got these scheduled on a calendar, they run regularly, so it's important to note that when you're looking at making your people's systems more resilient, you're not trying to catch Chris off guard to see if he answered the page—Chris: That's right. Julie: —what we're working on is making sure that we're building that muscle memory with practice, right, and iron out the kinks in those communication channels. Chris: Absolutely.
It's definitely been a journey of learning both for, you know, myself and my team, as well as the engineers that work on these things. You know, again, everybody chips in and gets to learn that routine and be comfortable with fighting fires. Another way I've looked at it with Chaos Engineering, and our testing adventures is that when we find something that it looks a little off—it's a burp, or a sneeze, or some hiccup over here in this system—that can turn into a full-blown fever or cold in production. And we've had a couple of examples where we didn't pay attention to that stuff fast enough, and it did occur in production.And kudos to our engineering team who went and picked it up because we had the information. We had the tracking that says we did find this. We have a solution or recommended fix in place, and it's already in process. That speaks volumes to our sense of urgency on the engineering teams.Julie: Chris, thank you for that. And before we end our time with you today, is there anything you'd like to let our listeners know about Cengage or anything you'd like to plug?Chris: Well, Cengage Learning has been a great place for me to work and I know that a lot of people enjoy working there. And anytime I ask my teams, like, “What's the best part of working there?” It's like, “The people. We work with are supportive and helpful.” You know, we have a product that we'd like to help change people's lives with, in terms of furthering their education and their career choices, so if you're interested, we have over 200 open positions at the current moment within our engineering and staffing choices.And if you're somebody interested in helping out folks and making a difference in people's educational and career paths, this is a place for you. Thanks for the offer, Julie. Really appreciate that.Julie: Thank you, Chris.Jason: Thanks, Chris. It's been fantastic to have you on the show.Chris: It's been a pleasure to be here and great to talk to you. 
I enjoy talking about my passions with testing as well as many of my other ones. [laugh].Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Intro 00:01:56 - How Alex and Kolton know each other and the beginnings of their companies 00:10:10 - The change of mindset from Amazon to the smaller scale 00:17:34 - Alex and Kolton's advice for companies that “can't be a Netflix or Amazon” 00:22:57 - PagerDuty, Gremlin and Crossovers/Outro Transcript Kolton: I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don't want your money. I'm going to bootstrap it. We're going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.” Julie: Welcome to the Break Things on Purpose podcast, a show about chaos, culture, building and breaking things with intention. I'm Julie Gunderson and in this episode, we have Alex Solomon, co-founder of PagerDuty, and Kolton Andrus, co-founder of Gremlin, chatting about everything from founding companies to how to change culture in organizations. Julie: Hey everybody. Today we're going to talk about building awesome things with two amazing company co-founders. I'm really excited to be here with Mandi Walls on this crossover episode for Break Things on Purpose and Page it to the Limit. I am Julie Gunderson, Senior Reliability Advocate here over at Gremlin. Mandi? Mandi: Yeah, I'm Mandi Walls, DevOps Advocate at PagerDuty. Julie: Excellent. And today we're going to be talking about everything from reliability, incident management, to building a better internet. Really excited to talk about that. We're joined by Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. So, to get us started, Kolton and Alex, you two have known each other for a little while.
Can you kick us off with maybe how you know each other? Alex: Sure. And thanks for having us on the podcast. So, I think if I remember correctly, I've known you, Kolton, since your days in Netflix while PagerDuty was a young startup, maybe less than 20 people. Is that right? Kolton: Just a touch before I joined Netflix. It was actually that Velocity Conference, we hung out at that suite, I think that was 2013. Alex: Yeah, sounds right. That sounds right. And yeah, it's been how many years? Eight, nine years since? Yeah. Kolton: Yeah. Alex is being humble. He's let me bother him for advice a few times along the journey. And we talked about what it was like to start companies. You know, he was in the startup world; I was still in the corporate world when we met back at that suite. I was debating starting Gremlin at that time, and actually, I went to Netflix and did a couple more years because I didn't feel I was quite ready. But again, it's been great that Alex has been willing to give some of his time and help a fellow startup founder with some advice and help along the journey. And so I've been fortunate to be able to call on him a few times over the years. Alex: Yeah, yeah. For sure, for sure. I'm always happy to help. Julie: That's great that you have your circle of friends that can help you. And also, you know, Kolton, it sounds like you did your tour of duty at Netflix; Alex, you did a tour of duty at Amazon; you, too, Kolton. What are some of the things that you learned? Alex: Yeah, good question. For me, when I joined Amazon, it was a stint of almost three years from '05 to '08, and I would say I learned a ton. Amazon, it was my first job out of school, and Amazon was truly one of the pioneers of DevOps. They had moved to an environment where their architecture was oriented around services, service-oriented architecture, and they were one of the pioneers of doing that, and moving from a monolith, breaking up a monolith into services.
And with that, they also changed the way teams organized, generally oriented around full service-ownership, which is, as an engineer, you own one or more services—your team, rather—owns one or more services, and you're not just writing code, but you're also testing yourself. There's no, like, QA team to throw it to. You are doing deploys to production, and when something breaks, you're also in charge of maintaining the services in production. And yeah, if something breaks back then we used pagers and the pager would go off, you'd get paged, then you'd have to get on it quickly and fix the problem. If you didn't, it would escalate to your boss. So, I learned that was kind of the new way of working. I guess, in my inexperience, I took it for granted a little bit, in retrospect. It made me a better engineer because it evolved me into a better systems thinker. I wasn't just thinking about code and how to build a feature, but I was also thinking about, like, how does that system need to work and perform and scale in production, and how does it deal with failures in production? And it also—my time at Amazon served as inspiration for PagerDuty because in starting a startup, the way we thought about the idea of PagerDuty was by thinking back from our time at Amazon—myself and my other two co-founders, Andrew and Baskar—and we thought about what are useful tools or internal tools that existed at Amazon that we wished existed in the broader world? And we thought about, you know, an internal tool that Amazon developed, which was called the ‘Pager Duty Tool' because it organized the on-call scheduling and paging and it was attached to the incident—to the ticketing system. So, if there was a SEV 1 or SEV 2 ticket, it would actually page either one team—or lots of teams if it was a major incident that impacted revenue and customers and all that good stuff.
So yeah, that's where we got the inspiration for PagerDuty by carrying the pager and seeing that tool exist within Amazon and realizing, hey, Amazon built this, Google has their own version, Facebook has their own version. It seems like there's a need here. That's kind of where that initial germ of an idea came from.Kolton: So, much overlap. So, much similarity. I came, you know, a couple of years behind you. I was at Amazon 2009 to 2013. And I'd had the opportunity to work for a couple of startups out of college and while I was finishing my education, I'd tasted startup world a little bit.My funny story I tell there is I turned down my first offer from Amazon to go work for a small startup that I thought was going to be a better deal. Turns out, I was bad at math, and a couple of years later, I went back to Amazon and said, “Hey, would you still like me?” And I ended up on the availability team, and so very much in the heart of what Alex is describing. It was a ‘you build it, you own it, you operate it' environment. Teams were on call, they got paged, and the rationale was, if you felt the pain of that, then you were going to be motivated to go fix it and ensure that you weren't feeling that pain.And so really, again, and I agree, somewhat taken for granted that we really learned best-in-class DevOps and system thinking and distributed system principles, by just virtue of being immersed into it and having to solve the problems that we had to solve at Amazon. We also share a similar story in that there was a tool for paging within Amazon that served as a bit of an inspiration for PagerDuty. Similarly, we built a tool—may or may not have been named Gremlin—within Amazon that helped us to go do this exact type of testing. And it was one part tooling and it was one part evangelism. 
It was a controversial idea, even at Amazon.

Some teams latched on to it quickly, some teams needed some convincing, but we had that opportunity to go work with those teams and really go develop this concept. It was cool because while Netflix—a lot of folks are familiar with Netflix and Chaos Monkey, this was a couple of years before Chaos Monkey came out. And we went and built something similar to what we built at Gremlin: An API, a front end, a variety of failure modes, to really go help solve a wider breadth of problems. I got to then move into performance, and so I worked on making the website fast, making sure that we were optimizing things. Moved into management.

That was a very useful life experience. It wasn't the most enjoyable year of my life, but learned a lot, got a lot done. And then that was the next summer, as I was thinking about what was next, I bumped into Alex. I was really starting to think about founding a company, and there was a big question: Was what we built at Amazon going to be applicable to everyone? Was it going to be useful for everyone? Were they ready for it?

And at the time, I really wasn't sure. And so I decided to go to Netflix. And that was right after Chaos Monkey had come out, and I thought, “Well, let's go see—let's go learn a bit more before we're ready to take this to market.” And because of that time at Amazon—or at Netflix, I got to see, they had a great start. They had a great culture, people were bought into it, but there was still some room for development on the tooling and on the approach.

And I found myself again, half in the developer mindset, half in the advocacy mindset, where I needed to go and improve the tooling to make it safer and more scalable, and needed to go out and convince folks or help them do it well. But seeing it work at Amazon, that was great. That was a great learning experience. 
Seeing it work at Amazon and Netflix, to me, said, “Okay, this is something that everyone's going to need at some point, and so let's go out and take a stab at it.”

Alex: That's interesting. I didn't realize that it came from Amazon. I always thought Chaos Engineering as a concept came from Netflix because that's where everyone's—I mean, maybe I'm not the only one, but that's—that was my impression, so that's interesting.

Kolton: Well, as you know, Amazon, at times, likes to keep things close to the vest, and if you're not a principal engineer, you're not really authorized to go talk about what you've done. And that actually led to where my opportunity to start a company came from. I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don't want your money. I'm going to bootstrap it. We're going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Mandi: So, what ends up being different? Amazon—I've never worked for Amazon, so full disclosure, I went from AOL to Chef, and now I'm at PagerDuty. So, but I know what that environment was like, and I remember the early days, PagerDuty you got started around the same time, like, Fastly and Chef and, like, that sort of generation of startups. 
And all this stuff that sort of emerged from Amazon, like, what kind of mindset do you—is there a change of mindset when you're talking to developers and engineers that don't work for Amazon? Looking into Amazon from the outside, you kind of feel like there's a lot more buy-in for those kinds of tools, and that kind of participation, and that kind of—like we said before, the full service-ownership and all of those attitudes and all those cultural pieces that come along with it. So when you're taking these sort of practices commercial outside of Amazon, what changes? Like, is there a different messaging? Is there a different sort of relationship you have with the developers that work somewhere else?

Alex: I have some thoughts, and it may not be cohesive, but I'm going to go ahead anyway. Well, one thing that was very interesting from Amazon is that by being a pioneer and being at a scale that's very significant compared to other companies, they had to invent a lot of the tooling themselves because back in the mid-2000s, and beyond, there was no Datadog. There was no AWS; they invented AWS. There wasn't any of these tools, Kubernetes, and so on, that we take for granted around containers, and even virtual servers were a new thing. And Amazon was actually, I think, one of the pioneers of adopting that through open-source rather than through, like, a commercial vendor like VMware, which drove the adoption of virtual everything.

So, that's one observation is they built their own monitoring, they built their own paging systems. They did not build their own ticketing system, but they might as well have because they took Remedy and customized it so much that it's almost like building your own. And deployment tools, a lot of this tooling, and I'm sure Kolton, having worked on these teams, would know more about the tooling than I did as just an engineer who was using the tooling. But they had to build and invent their own tools. 
And I think through that process, they ended up culturally adopting a ‘not invented here’ mindset as well, where they're, generally speaking, not super friendly towards using a vendor versus doing it themselves.

And I think that may make sense and made a lot of sense because they were at such a scale where there was no vendor that was going to meet their needs. But maybe that doesn't make as much sense anymore, so that's maybe a good question for debate. I don't know, Kolton, if you have any thoughts as well.

Kolton: Yeah, a lot of agreement. I think what was needed, we needed to build those things at Amazon because they embraced distributed systems, the service-oriented architectures, early on; that is a new class of problem. I think in a world where you're not dealing with the complexity of distributed systems, Chaos Engineering just looks like testing. And that's fine. If you're in a monolith and it's more straightforward, great.

But when you have hundreds of things with all the interconnections and the combinatorial explosion you have with that, the old approach no longer works and you have to find something new. It's funny you mentioned the tooling. I miss Amazon's monitoring tooling, it was really good. I miss the first iteration of their pipelines, their CI/CD tooling. It was a great iteration.

And I think that's really—you get to see that need, and that evolution, that iteration, and a bit of a head start. You asked a bit about what is it like taking that to market? I think one of the things that surprised me a little bit, or I had to learn, is different companies are at different points in their journey, and when you've worked at Amazon and Netflix, and you think everybody is further along than they are, at times, it can be a little frustrating, or you have to step back and think about how do you catch somebody up? How do you educate them? 
How do you get them to the point where they can take advantage of it?

And so that's, you know, that's really been the learning for me is we know aspirationally where we want to go—and again, it's not that Amazon's perfect; it's not that Netflix is perfect. People that I talk to tend to deify Netflix engineering, and I think they've earned a lot of respect, but the sausage is made the same, fundamentally, at every company. And it can be messy at times, and it's not always—things don't always go well, but that opportunity to look at what has gone well, what it should look like, what it could look like really helps you understand what you're striving for with your customers or with the market as a whole.

Alex: I totally agree with that because those were big learnings for me as well. Like, when you come out of an Amazon, you think that maybe a lot of companies are like Amazon, in that they're… more like I mentioned: Amazon was a pioneer of service-oriented architecture; a pioneer of DevOps; and you build it, you own it; pioneer of adopting virtual servers and virtual hosting. And you, maybe, generalize and think, you know, other companies are there as well, and that's not true. There's a wide variety of maturities, and these trends, these big trends like Cloud, like AWS, like virtualization, like containerization, they take ten years to fully mature from the starting point. With the usual adopter curve of very early adopters all the way to, kind of, the big part of the curve.

And by virtue of starting PagerDuty in 2009, we were on the early side of the DevOps wave. And I would say, very fortunate to be in the right place at the right time, riding that wave and riding that trend. And we worked with a lot of customers who wanted to modernize, but the biggest challenge there is, perhaps it's the people and process problem. If you're already an established company, and you've been around for a while, you do things a certain way, and change is hard. 
And you have to get folks to change and adapt and change their jobs, and change from being a, “sysadmin,” quote-unquote, to an SRE, and learn how to code and use that in your job.

So, that change takes a long time, and companies have taken a long time to do it. And the newer companies and startups will get there from day one because they just adopt the newest thing, the latest and greatest, but the big companies take a while.

Kolton: Yeah, it's both that thing—people can catch up quicker. It's not that the gap is as large, and when you get to start fresh, you get to pick up a lot of those principles and be further along, but I want to echo the people, the culture, getting folks to change how they're doing things, that's something, especially in our world, where we're asking folks to think about distributed system testing and cross-team collaboration in a different way, and part of that is a mental journey, just helping folks get over the idea—we have to deal with some misconceptions, folks think chaos has to be random, they think it has to be done in production. That's not the case. There's ways to do it in dev and staging, there's ways to do it that aren't random that are much safer and more deterministic.

But helping folks get over those misconceptions, helping folks understand how to do it and how to do it well, and then how to measure the outcomes. That's another thing I think we have that's a bit tougher in our SRE ops world is oftentimes when we do a great job, it's the absence of something as opposed to an outcome that we can clearly see. And you have to do more work when you're proving the absence of something than the converse.

Julie: You know, I think it's interesting, having worked with both of you when I was at PagerDuty and now at Gremlin, there's a theme. 
And so we've talked a lot about Amazon and Netflix; one of the things, distinctly, with customers at both companies, is I've heard, “But we're not Amazon and we're not Netflix.” And that can be a barrier for some companies, especially when we talk about this change, and especially when we talk about very rigid organizations, such as, maybe, FinServ, government, those types of organizations, where they're more resistant to that, and they say, “Don't say Amazon. Don't say Netflix. We're not those companies. We can't operate like them.”

I mean, Mandi and I, we were on a call with a customer at one point that said we couldn't use the term DevOps, we had to call it something different because DevOps just meant too forward-thinking, even though we were talking about the same concepts. So, I guess what I would like to hear from both of you, is what advice would you give to those organizations that say, “Oh, no. We can't be Netflix and we can't be Amazon”? Because I think that's just a fear of change conversation. But I'm curious what your thoughts are.

Alex: Yeah. And I can see why folks are allergic to that because you look at these companies, and they're, in a lot of ways, so far ahead that you don't, you know—and if you're a lower level of maturity, for lack of a better word, you can't see a path in your head of how do you get from where you are today to becoming more like a Netflix or an Amazon because it's so different. And it requires a lot of thinking differently. So, I think what I would encourage, and I think this is what you all do really well in terms of advocacy, but what I'd encourage is, like, education and thinking about, like, what's a small step that you can take today to improve things and to improve your maturity? What's an on-ramp?

And there's, you know, lots of ideas there. 
Like, for example, if we're talking about modern incident management, if we're talking Chaos Engineering, if we're talking about public cloud adoption and any of these trends, DevOps, SRE, et cetera, maybe think about how do you—do you have a new greenfield project, a brand new system that you're spinning up, how do you do that in a modern way while leaving your existing systems alone to start? Then you learn how to do it and how to operate it and how to build a new service, a new microservice using these new technologies, you build that muscle. You maybe hire some folks who have done it before; that's always a good way to do it. But start with something greenfield, start small, you don't have to boil the ocean, you don't have to do everything at once. And that's really important.

And then create a plan of taking other systems and migrating them. And maybe some systems don't make sense to migrate at all because they're just legacy. You don't want to put any more investment in them. You just want to run them, they work, leave them alone. And yeah, think about a plan like that. And there's lots of—now, there's lots of advice and lots of organizations that are ready and willing to help folks think through these plans and think through this modernization journey.

Kolton: Yeah, I agree with that. It's daunting to folks that there's a lot, it's a big problem to solve. And so, you know, it'd be great if it's you do X, you get Y, you're done, but that's not really the world we live in. And so I agree with that wisdom: Start small. Find the place that you can make an impact, show what it looks like for it to be successful.

One thing I've found is when you want to drive bottoms-up consensus, people really want to see the proof, they want to see the outcome. 
And so that opportunity to sit down with a team that is already on the cutting edge, that is feeling the pain, and helping them find success, whether that's SRE, DevOps, whether it's Chaos Engineering, helping them see it, see the outcome, see the value, and then let them tell their organization. We all hear from other folks what we should be doing, and there's a lot of that information, there's a lot of that context, and some of it's noise, and so how we cut through that into what's useful, becomes part of it. This one to me is funny because we hear a lot, “Hey, we have enough chaos already. We don't need any more chaos.”

And I get it. It's funny, but it's my least favorite joke because, number one, if you have a lot of chaos, then actually you need this today. It's about removing the chaos, not about adding chaos. The other part of it is it speaks to we need to get better before we're ready to embrace this. And as somebody that works out regularly, a gym analogy comes to mind.

It's kind of like your New Year's, it's your New Year's resolution and you say, “Hey, I'm going to lose ten pounds before I start going to the gym.” Well, it's a little bit backwards. If you want to get the outcome, you have to put in a bit of the work. And actually, the best way to learn how to do it is by doing it, by going out, getting a little bit of—you know, you can get help, you can get guidance. That's why we have companies, we're here to help people and teach them what we've learned, but going out doing a bit of it will help you learn how you can do it better, and better understand your own systems.

Alex: Yeah, I like the workout analogy a lot. I think it's hard to get started, it's painful at first. That's why I like the analogy [laugh]—

Kolton: [laugh].

Alex: —a lot. But it's a muscle that you need to keep practicing, and it's easy to lose; you stop doing it, it's gone. And it's hard to get back again. 
So yeah, I like that analogy a lot.

Julie: Well, I like that, too, because that's something that we talked a lot about for being on call, and understanding how to handle incidents, and building that muscle memory, right, practice. And so there's a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex, your shirt—which our viewers—or our listeners—can't see, says, “The world is always on. Let's keep it this way,” and Kolton, you talk about reliability being no accident.

And so when we talk about the foundations of both of these organizations, it's about helping engineers be better and make better products. And I'm really excited to learn a little bit more about where you think the future of that can go.

For the second part of this episode, check out the PagerDuty podcast at Page it to the Limit. For links to the Page it to the Limit podcast and to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to Break Things on Purpose on Apple Podcasts, Spotify, or wherever you listen to your favorite podcasts.

Jason: Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

Mandi: All right, welcome. This week on Page it to the Limit, we have a crossover episode. If you haven't heard part one of this episode featuring Kolton Andrus and Alex Solomon, you'll need to find it. It's on the Break Things on Purpose podcast from our friends at Gremlin. So, you'll find that at gremlin.com/podcast. You can listen to that episode and then come back and listen to our episode this week as we join the conversation in progress.

Julie: There's a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. 
I mean, Alex, your shirt—which our viewers—or our listeners—can't see, says, “The world is always on. Let's keep it this way,” and Kolton, you talk about reliability being no accident. And so when we talk about the foundations of both of these organizations, it's about helping engineers be better and make better products. And I'm really excited to learn a little bit more about where you think the future of that can go.

Kolton: You hit it though. Like, the key to me is I'm an engineer by trade. I felt this pain, I saw value in the solution. I love to joke, I'm a lazy engineer. I don't like getting woken up in the middle of the night, I'd like my system to just work well, but if I can go save some other people that pain, if I can go help them to more quickly understand, or ramp, or have a better on-call life, have a better work-life balance, that's something we can do that helps the broader market.

And we do that, as you mentioned, in service of a more reliable internet. The world we live in is online, undoubtedly, after the last couple of years, and it's only going to be more so. And people's expectations, if you're an older person like me, you know, maybe you remember downloading AOL for a couple of hours, or when a web page took a minute to load; people's expectations are much different now. And that's why the reliability, the performance, making sure things work when we need them to is critical.

Alex: Absolutely. And I think there's also a trend that I see and that we're part of around automation. 
And automation is a very broad thing, there's lots of ways that you want to automate manual things, including CI/CD and automated testing and things like that, but I also think about automation in the incident context, like when you have an alert that fires off or you have an incident, you have something like that, can you automate the solution or actually even prevent that alert from going off in the first place by creating a set of little robots that are kind of floating around your system and keeping things running and running well and running reliably? So, I think that's an exciting trend for us.

Mandi: Oh, definitely on board with automating all the things for sure. So, of the things that you've learned, what's one thing that you wish you had maybe learned earlier? Or if there was like a gem or a nugget for folks that might be thinking about starting their own company around developer tools or this kind of software, is there anything that you can share with them?

Alex: Kolton, you want to go first?

Kolton: Sure, I'll go first. I was thinking a little bit about this. If I went back—we've only been at it about six years, so Alex has the ten-year version. I can give you the five, six-year version. You know, I think coming into it as a technical founder, you have a lot of thoughts about how the world works that you learn are incorrect or incomplete.

It's easy as an engineer to think that sales is this dirty organization that's only focused on money, and that's just not true or fair. They do a lot of hard work. Getting people to do the right thing is tough. Helping with support, with customer success.

Even marketing. Marketing is, you know, to many engineers, not what they would spend their time doing, and yet marketing has really changed in the last 20 years. And so much of marketing now is about sharing information and teaching what we've learned as opposed to this old approach of you know, whatever you watched on TV as a kid. 
So, I think understanding the broader business is important. Understanding the value you're providing to customers, understanding the relationships you build with those customers and the community as a whole, those are pieces that might be easy to gloss over as an engineer.

Alex: Yeah, and to echo that, I like your point on sales because initially when I first started PagerDuty, I didn't believe in sales. I thought we wouldn't need to hire any salespeople. Like, we sell to other engineers, and if they're anything like me, they don't want to talk to a salesperson. They want to go on the website, look around, learn, maybe try it out—we had a free trial; we still have a free trial—and put in a credit card and off to the races. And that's what we did at first, but then it turns out that when doing so, and acquiring customers in that way, there are folks who want to talk to you to make sure that, first of all, you're a real business, you're going to be around for a while and it's not—you know, you're not going to not be around tomorrow.

And that builds trust, being able to talk to someone, to understand, if you have questions, you have someone to ask, and creating that human connection. And I found myself doing that function, like, myself and then realized, there's not enough time in the day to do this, so I need to hire some folks. And I changed my mind about sales and hired our first two salespeople about two-and-a-half years into PagerDuty. And probably got a little bit lucky because they're technical engineering background type folks who then went into sales, so they ended up being rockstars. And we instantly saw an increase in revenue with that.

And then maybe another more tactical piece of advice is that you can't focus on culture too early when starting a company. 
And so one lesson that we learned the hard way is we hired an engineer that was brilliant, and really smart, but not the best culture fit in terms of, like, working well with others and creating that harmonious team dynamic with their peers. That ended up being an issue. And basically, the takeaway there is don't hire brilliant but asshole folks because it's just going to cause a lot of pain, and they're not going to work out even though they're really smart, and that's kind of the reason why you keep them around because you think, well, it's so hard to hire folks. You can't let this person go because what are we going to do? But you do have to do it because it's going to blow up anyways, and it's going to be worse in the long run.

Kolton: Yeah, hiring and recruiting have their own set of challenges associated with them. And similar to hiring the brilliant jerk, some of the folks that you hire early on aren't going to be the folks that you have at the end. And that one's always tough. These are your friends, these are people you work closely with, and as the company grows, and as things change, people's roles change, and sometimes people choose to leave and that breaks your heart because you've invested a lot of time and effort into that relationship. Sometimes you have to break their heart and tell them it's not the right fit, or things change.

And that's one that if you're a founder or you're part of that early team, you're going to feel a little bit more than everyone else. I don't think anything you read on the internet can prepare you for some of those difficult conversations you have to have. 
And it's great if everything goes well, and everyone grows at the same rate, everyone can be promoted, and you can have the same team at the end, but that's not really how things play out in reality.

Julie: It's interesting that we're talking about culture, as we heard about last week, on the Break Things on Purpose episode, where we also talked about culture and how organizations struggle with the culture shift with adopting new technologies, new ways of working, new tools. And so what I'm hearing from you is focusing on that when hiring and founding your company is important. We also heard about how that's important with changing the way that we work. So, if you could give an advice to maybe a very established—if you are going to give a piece of advice to Amazon—maybe not Amazon, but an established company—on how to overcome some of those objections to culture change, those fears of adopting new technology. I know people are still afraid of holding a pager and being on call, and I know other people are afraid of chaos as we talk about it and those fears that you've mentioned before, Kolton. What would your piece of advice be?

Alex: Yeah, good—great question. This will probably echo what I've said earlier, which is when looking to transform, transform culture especially, and people and process, the way I think about it is try to not boil the ocean and start small, and get some early wins. And learn what good looks like. I think that's really important. 
It's this concept of show, don't tell.

Like, if you want to, you know, you want to change something, you start at the grassroots level, you start small, you start maybe with one or two teams, you try it out, maybe something like I mentioned before, in a greenfield context where you're doing something brand new and you're not shackled by legacy systems or anything like that, then you can build something new or that new system using the new technologies that we're talking about here, whether it's public cloud, whether it's containerization and Kubernetes, or whatnot, or serverless, potentially. And as you build it and you learn how to build it and how to operate it, you share those learnings and you start evangelizing within the company.

And that goes to what I was saying with the show, don't tell, where you're like showing, “Here's what we did and here's what we learned. And not everything went swimmingly and here are things that didn't go so well, and maybe what's our next step beyond this? Do other folks want to opt-in to this kind of new thing that we're doing?” And I'm sure that's a good way to get others excited. And if you're thinking about longer-term, like, how do you transform the entire company, well, this is a good way to start; start small, you learn how to do it, you learn about what good looks like, you get others excited about it, others opt-in, and then at some point through that journey, you start mandating it top-down as well because grassroots is only going to take you so far. And then that's where you start putting together project plans around, like, how do we get other teams to do it, on a timeline? And when are they going to do it? And how are they going to do it? And then bring everyone along for the journey as well.

Kolton: You're making this easy for me. I'll just keep agreeing with you. You hit all the points. 
Yeah, I mean, on one hand, the engineer in me says, you know, a lot of times when we're talking about this transformation, it's not easy, but it's worth it. There's a need that we're trying to solve, there's a problem we're trying to solve.

And in the end, what that becomes is a competitive advantage. The thought that came to mind as Alex was speaking is you need that bottoms-up buy-in; you also need that top-down support. And as engineers, we don't often think about the business impact of what we do. There's an important element and a message I like to reiterate for all the engineers: think about how the business would value the work you do. Think about how you would quantify the value of the work you do to the business because that's going to help that upper level that isn't, in the day-to-day, feeling the pain, understand that what we're doing is important, and it's important for the organization.

I think about this a little bit like remote-by-default work. So, when we founded Gremlin, we decided you know, we didn't want offices. And six years ago that was a little bit exceptional. Folks were still fundamentally working in an office environment. I'm not here to tell you that remote-by-default is easy, works for everyone, or is the answer.

Actually, what we found is you need a little bit of both. You need to be able to have good tooling so folks can be efficient and effective in their work, but it's still important to get folks together in person. And magic happens when you get a group of folks in a room and let them brainstorm and collaborate, chat on the way to lunch or on the way to dinner. But I think that's a good example where we've learned over the last couple of years that the old way of doing it was not as effective as it could be. That maybe we don't need to swing the pendulum entirely the other way, but there's merit in looking at what the right balance is.

And I think that applies to, you know, incident management, to SRE, to Chaos Engineering. 
You know, maybe we don't have to go entirely on the other end of the spectrum for everyone, but are there little—you know, is there an 80/20 solution that gets us a lot of value, that saves a lot of time, that makes us more efficient and effective, without having to rewrite everything from scratch?

Alex: Yeah, I like that a lot. And I think part of it, just to add to that, is make it easy for people to adopt it, too. Like, if you can automate it for folks, “Hey, here's a Terraform thing where you could just hit a button and it does it for you, here's some training around how to leverage it, and here's the easy button for you to adopt.” I think that goes with the technology of adopting, but also the training, also the, you know, how-tos and learnings. That way, it's not going to be, like, a big painful thing, you can plan for it. And yeah, it's off to the races from there.

Kolton: I think that's prudent product advice, as well. Make it easy for people to do the right thing. And I'm sure it's tricky in your space; it's really tricky in our space. We're going out and we're causing failure, and there's inadvertent side effects, and you need to understand what's happening. It's a little scary, but that's where we add a lot of value.

We invest a lot of time and effort in how do we make it easy to understand, easy to understand what to expect, and easy to go do and see what happens and see that value? And it sounds easy. You know, “Hey, just make it easy. Just make it simple,” but actually, as we know, it takes so much more effort and work to get it to be that level of simplicity.

Alex: Yeah, making something easy is very, very hard—

Kolton: [laugh].

Julie: —ironically.

Kolton: Yeah. Ironically.

Mandi: Yeah, so what are you excited for in the future? What's on your horizon that maybe you can share with us that isn't too, like, top-secret or anything? Or even stuff, maybe, not related to your companies? 
Like, what are you seeing in the industry that really has you motivated and excited?Alex: Great question. I think a couple of things come to mind. I already mentioned automation, and we are in the automation space in a couple of different ways, in that we acquired a company called Rundeck over a year ago now, which does runbook automation and just automation in general around something like running a script across a variety of resources. And in the incident context, if an alert fires or an incident fires, it's that self-healing aspect where you can actually resolve the issue without bothering a human.There's two modes to this automation: There's the kind of full self-healing mode where, you know, something happens and the script just fixes it. And then the second mode is a human is involved, they get paged, and they have a toolbox of things that they can do, that they can easily do. We call that the Iron Man mode, where you're getting, like, these buttons you can push to actually resolve the problem, but in that case, it's a type of problem that does require a person to look at it and realize, oh, we should take this action to fix it. So, I'm very excited about the automation and continuing down that path.And then the other thing that really excites me as well is being able to apply AI and ML to the alerting and incident response and incident management space. Especially our pattern detection, looking for patterns and alerts and incidents, and seeing have we seen this kind of problem before? If so, what happened last time? Who worked on the last time? How did they resolve it last time?Because, you know, you don't want to solve the same problems over and over. And that actually ties into automation really nicely as well. 
That pattern detection, it's around reducing noise, like, these alerts are not real alerts, they're false alerts, so let's reduce them automatically, let's suppress them, let's filter them out automatically because the signal to noise is really important. And it's that pattern detection, so if something major is happening, you can see here's the blast radius, here's the services or systems it's impacting. Oh, we've seen something similar before—or we haven't seen something similar before, it's something totally brand new—and try to get the right folks involved quickly so that they can understand that blast radius and know how to approach the problem, and resolve it quickly.Kolton: So, it's not NFT's is your PagerDuty profile picture?Alex: [laugh].Kolton: Because that's, kind of, what I—no, I'm kidding. I couldn't help but just like what do I not see—like, I've, I've tried to think of the best NFT joke I could. That was what I came up with. I agree on the AI/ML stuff. That opportunity to have more data and to be able to do better analysis of it, I've written some of that, you know, anomaly detection stuff—and it was a while back; I'm sure it could be done better—that'll get us to a point.You know, of course, I'm here to push on the proactive. There's things we can do beyond just reacting faster that will be helpful. But I think part of that comes from people being comfortable sharing more about their failures. 
There's a stigma to failing today, and regardless of whether we're talking about a world where we're citing things like blameless postmortems, people still don't want to talk about their failures, and it's hard to get that good outage information, it's hard to get the kind of detail that would let us do better analytics, better automation. And again, back to the conversation, you know, maybe we know what Amazon and Netflix look like, but for us to create something that will help solve a broader problem, we have to know what those companies are feeling in pain; we need to know what their troubles are hitting at. So, I think that's one thing I've been excited about is over the past two years, you've seen the focus on reliable, stable systems be much more important. Five years ago, it was, “Get out of my way, I got features to write, we got money to make, we're not interested in that. If it breaks, we'll fix it.” And you know, as we're looking at the future, we're looking at our bridges, we're looking at our infrastructure, our transportation, the software we're writing is going to be critical to the world, and it operating correctly and reliably is going to be critical. And I think what we'll see is the market and customers are going to catch up to that; that tolerance for failure is going to go down and that willingness to invest in preventing failure is going to go up. Alex: Yeah, I totally agree with that. One thing I would add is, I think it's human nature that people don't want to talk about failures. And this is maybe not going to go away, but there is maybe a middle ground there. 
I mean, talking about postmortems, especially, like, when a big company has a big outage and it makes the news, it makes Hacker News, et cetera, et cetera, I don't see that changing, in that companies are going to become radically more transparent, but where I do think there is a middle ground is for your large customers, for your important customers, creating relationships with them and having more transparency in those cases. Maybe you don't post it on a public status page a full, detailed nitty-gritty postmortem, but what you do do is you talk to your major customers, your important customers, and you give them that deeper view into your systems.And what's good about that is that it creates trust, it helps establish and maintain trust when you're more transparent about problems, especially when you're taking steps to fix them. And that piece is really important. I mean trust is, like, at the core of what we do. I have a saying about this—[unintelligible 00:19:31]—but, “Trust is won in droplets and lost in buckets.” So, if you have these outages all the time, or you have major service degradation, it's easy to lose that trust. So, you want to prevent those, you want to catch them early, you want to create that transparency with your major customers, and you want to let them in the loop on what's happening and how you're preventing these types of issues going forward.Kolton: Yeah, great thoughts. Totally agree.Julie: So, for this episode of deep thoughts with Kolton and Alex, [laugh] I want to thank both of you for being here with Mandy and I today. We're really excited to hear more and to see each of our respective companies grow and change the way people work and make life easier, not just for engineers, but for our customers and everybody that depends on us.Mandy: Yeah, absolutely. I think it's good for folks out there to know, you're not alone. We're all learning this stuff together. 
And some folks are a little further down the path, and we're here to help you learn.Kolton: Totally. Totally, it's an opportunity for us to share. Those that are further along can share what they've learned; those that are new or have some great ideas and suggestions and enthusiasm, and by working together, we all benefit. This is the two plus two equals five, where, by getting together and sharing what we've learned and figuring out the best way, no one of us is going to be able to do it, but as a group, we can do it better.Alex: Yeah. Totally agree. That's a great closing thought.Mandy: Well, thanks, folks. Thank you for joining us for another episode of Page it to the Limit. We're wishing you an uneventful day.

In this episode, we cover: 00:00:00 - Introduction 00:02:00 - Carissa's first job in tech and first bootcamp 00:04:30 - Early Lessons: Carissa breaks production—on a Friday! 00:08:40 - Carissa's work at ClickBank and listening to newer hires 00:10:55 - The metrics that Carissa measures and her attitude about constantly learning 00:16:45 - Carissa's Chaos Engineering experiences 00:18:25 - Some advice for bringing new folks into the fold 00:23:08 - Carissa and ClickBank/Outro Links: ClickBank: https://www.clickbank.com/ LinkedIn: https://www.linkedin.com/in/carissa-morrow/ TranscriptCarissa: It's all learning. I mean, technology is never going to stop changing and it's never going to stop being… a lot to learn, [laugh] so we might as well learn it and try to keep up with the [laugh] times and make our lives easier.Julie: Welcome to Break Things on Purpose, a podcast about reliability, asking questions, and learning from failure. In this episode, we talked with Carissa Morrow about what it's like to be new in tech, and how to learn from mistakes and build your skills.Julie: Carissa, I'm really excited to talk to you. I know we chatted in the past a little bit about some horror stories of breaking production. I think that it's going to be a lot of fun for our listeners. Why don't you tell us a little bit about yourself?Carissa: Yeah, so I actually have only been in this industry about three years. So, I come with kind of a newbie's perspective. I was a certified ophthalmic tech before this. So, completely different field. Hit my ceiling, and my husband said, “You want to try coding?” I said, “Not really.” [laugh]. But I did. And I loved it.So, long story short, I ended up just signing up for a local boot camp, three-month full stack. And then I got really lucky; when I graduated there and walked into my previous employer's place. 
They said, “Do you know what DevOps is?” I said, “I have no idea.” And they still hired me.And it was really great, really, really great experience. I learned so much in a couple years with them. So, and now I'm here at ClickBank and I'm three years in and trying not to break things every day, especially on a Friday.Julie: [laugh]. Why? That's the best day to break things, Carissa—Carissa: [laugh]. No, it's really not.Julie: —preferably at 4:45. Well, that's really amazing. So, that's quite the jump. And as you mentioned, you started with a boot camp and then ended up at an employer—and so, what was your role? What were you doing in your first role?Carissa: So, I started on a really small team; there was just three of us including myself. So, I learned pretty much everything from the ground up, knowing nothing coming into DevOps. So, I had, you know, coding background from the boot camp, but I had to learn Python from scratch. And then from there, just kind of learning everything cloud. I had no idea about AWS or Google or anything in the cloud realm.So, it was very much a rough—very, very rough first year, I had to put my helmet on because it was a very bumpy ride. But I made it and I've come out a heck of a lot stronger because of it.Julie: Well, that's awesome. How about do you have people that you were working with that are mentoring you?Carissa: Yep. So, I actually have been very lucky and have a couple of mentors, from not only my previous employer, but also clients that I worked with that have asked to be my mentor and have stuck it out with me, and helped not just in the DevOps realm or the cloud realm, but for me as a person in that growing area. So, it's been pretty great.Julie: Well, that's awesome. And I guess I should give the disclosure that Carissa and I both worked together, for me a couple of jobs ago. And I know that, Carissa, I've reached out to you for folks who are interested in the boot camp that you went through. 
And I know it's not an advertisement for the boot camp, but I also know that you mentored a friend of mine. Did you want to share where you went?Carissa: Yeah, definitely. So, I went to Boise CodeWorks, which is a local coding school here in Boise. And they did just move locations, so I'm not quite sure where they're at now, but they're definitely in Boise.Julie: And if I remember correctly, that was a three-month very intensive, full-time boot camp where you really didn't have time for anything else. Is that right?Carissa: Yes, it is absolutely 1000% a full-time job for three months. And you will get gray hairs. If you don't, you're doing something wrong. [laugh]. Yep.Julie: So, what would you say is one of the most important things you learned out of that?Carissa: I would say just learning how to be resilient. It was very easy to want to quit because it was so difficult. And not knowing what it was going to look like when I got out of it, but part of me just wanted to throw my hands up half the time. But pushing through that made it just that much sweeter when I was done.Julie: Well now, when we were talking before, you mentioned that you broke production once. Do you want to tell me about that—Carissa: Maybe a few times. [laugh].Julie: —[crosstalk 00:04:34] a few times? [laugh]. You want to share what happened and maybe what you learned from it.Carissa: Yeah, yep. So, I was working for a company that we had clients, so it was a lot of client work. And they were an AWS shop, and I was going in to kind of clean up some of their subnets and some of their VPN issues—of course, this is also on a Friday. Yeah. It has to be on a Friday.Julie: Of course.Carissa: So, I will never forget, I was sitting outside thinking, “This is going to be a piece of cake.” I went in, I just deleted a subnet, thinking, “That's fine. Nothing's going to happen.” Five minutes later Slack's blowing up, production's down and, you know, websites not working. Bad. 
Like, worst-case scenario.So, back then we had, like, a team of, I think I would say ten, and every single person jumped on because you could tell I was panicking. And they all jumped in and we went step-by-step, tried to figure it out, figured out how we could fix it. But it took a good four hours of traumatizing stress [laugh] before we got it fixed. And then I learned my lesson, you know? Double-triple check before you delete anything and try to just make Fridays read-only if you can. [laugh].Julie: Well, and I think that's one of the things right? You always have to have that lesson-learning experience, and it's going to happen. And showing empathy for friends during that, I think, is the really important piece. And I love the fact that you just talked about how the whole team jumped on because they saw that you were stressed out. Were you in person or remote at the time?Carissa: I was remote at the time.Julie: Okay.Carissa: Yeah. And we were traveling in our RV, so nothing like being out in the woods, panicking by yourself, and [laugh] roaming around.Julie: So, did you run a postmortem on it?Carissa: So, back then—actually, we ended up doing that, yes, but that was when I had never really experienced a postmortem before, and that's one thing that, you know, when we talk about this kind of stuff—and everyone has a horror story or two, but that's something that I've had to learn to get better at is RCAs and postmortems because they're so important. I think they're incredibly important. Because these things are going to happen again; they're going to happen to the best of us. So, definitely, everything is a learning experience. And if it's not, you're missing out. So, I try to make everything a learning experience, for sure.Julie: Absolutely. And that's one of the things we talk about is now take that, and how do you learn from this? And how do you put the gates in place so that you can't just delete a subnet? 
I mean, to be fair, you did it, but were there other things that could have prevented this from happening, some additional checks and balances?Carissa: Mm-hm.Julie: And as you mentioned, that's not the only time that you've broken production. But let me ask you was that—did the alerting mechanisms work? Did all of the other—did the monitoring and observability? Like, did everything work correctly, or did you find some holes in that as well?Carissa: So, that's a great question. So, this specific client did not use any monitoring tools whatsoever. So—Julie: Huh.Carissa: Yeah, so that was one of those unique situations where they just tried to get on their own website and it didn't work. And then, you know, it was testing and everything was failing. But it was all manual testing. And I actually—believe it or not—I've seen that more often than I ever thought I would in the last three years. And so with what you guys do, and kind of what I'm seeing with a bunch of different clients, it's not just do they have monitoring, it's how do they use that? And when it's, kind of, bits and pieces here and there and they're not using it to their full potential, that's when a lot of things slip through the cracks. So, I've definitely seen a lot of that.Julie: Absolutely. And it's interesting because I really think that, especially these advanced organizations, that they're just going to have all the ducks in the row, all the right monitoring setup, and it turns out that they don't always have everything set up or set up correctly. And that's one of the things that we talk about, too, is validating with Chaos Engineering, and looking at how can we make sure it's not just that our systems are resilient, but that our tools pick things up, that our people and processes work? And I think that's really important. Now… you're working at ClickBank today?Carissa: Mm-hm.Julie: You want to tell us a little bit about that and about what you do over there?Carissa: Yep. 
So, I came on a few months ago as a cloud engineer for their team. And they are—I have actually learned a lot of monitoring tools through what they have already set up. And as they're growing and continue to grow, I'm learning a lot about what they have in place and maybe how we can improve it. So, not just understanding the metrics has been a learning curve, but understanding what we're tracking, why, and what's an emergency—what's critical, what's not—all of those things is definitely a huge, huge learning curve. But regardless of if it's ClickBank or other companies that I know people work at or that I've worked at, everyone knows there's a humbling aspect when you're using all these tools. We all want to pretend like we know everything all the time, and so being humble enough to ask the questions of, “Why do we use this? Are we using it to its full potential? And what am I looking at?” That's how I've learned the most, even in the last couple of months here is just asking those very humbling questions. Julie: Well, I have to say, you know, you mentioned that you are really still new; it's three years out of school for you doing this, and I think that there actually is quite a lot to be said about listening to newer people because you're going to ask questions that other folks haven't thought of, like, the whys. “Why are we doing things this way?” Or, “Why are we tracking that?” And sometimes—I think you've probably seen this as organizations—we just get into these habits—Carissa: Mm-hm. Julie: —and we do things because somebody who worked here, like, five years ago, set it up that way; we've just always done it this way. Carissa: Mm-hm. Julie: And it's a great idea to look into some of our practices and make sure that they're still serving us. One thing that you mentioned that I love, though, is you said metrics. And metrics are really important when practicing Chaos Engineering because it's good to know where you are now so that you can see improvement. 
Can you talk about some of the metrics that you measure or that might be important to ClickBank?Carissa: Yeah. So, a lot of the things that we measure have to do with orders. So, the big thing with ClickBank with how the model, the infrastructure of this company is set, orders are incredibly important, so between the vendors and the buyers in ClickBank. So, we are always monitoring in great detail how our orders are coming in, going out, all the payment information, you know, make sure everything's always secure and running smoothly. So, those are where most of our metrics that we watch where those live.The one thing that I think is—I've noticed is really important is whether you're monitoring one thing or ten, monitor to the best of your ability so that you're not just buying stuff and using 50% of it. And I think we get really excited when we go and we're like, “Yes, this is a great third-party tool or third-party—we're going to use it.” And then 10% of it, you know, you use and the rest of it, it's like, “That's really cool. Maybe we'll do that later, maybe we'll implement that part of it later.” And that's something that it's just, it's like, I know it's painful, [laugh] but do it now; get it implemented now and start using it, and then go from there.But I feel like why do we bother if we're only going to use 10% to 50% of these amazing things that really make our lives easier, and obviously, more secure and more resilient.Julie: I think you're onto something there. That is really good advice. I remember speaking at a conference in New Zealand and one of the speakers there talked about how their organization will buy any new tool that comes out, any and every new tool that comes out. But just buying that—and as you mentioned, just using a tiny, small portion of that tool can really be kind of ridiculous. 
You're spending a lot of money on these tools, but then these features were built for a reason, and oftentimes—and I saw this, too, at my past company—folks would purchase our tool, but not realize that our tool did so many other things. And so then there are multiple tools that are doing the same things within an organization when in reality, if you look at all the features and truly understand a tool—I would say some folks have a hard time with saying well, it just takes too much time to learn all of that. What's your advice for them? Carissa: Yeah. I think I've caught myself saying that too [laugh] at some point in time. You know, the context-switching, already having our full-time jobs and then bringing on tools, other tools that we need to learn. And it is overwhelming, but my advice is, why make more pain for yourself? [laugh]. Why not make your life easier, just like automation, right? When you're automating things, it's going to be a lot of work up front, but the end goal is make everything more secure, make it easier on yourself, take out the single point of failure or the single-person disaster because they did one wrong thing. Monitoring does the same thing. You know, if you put the investment up ahead of time, if you do it right upfront, it's going to pay off later. The other thing I've seen, and I've been guilty of as well is just looking at it and saying, “Well, it looks like it's working,” but I don't really know what I'm looking at. And so going back to that, you know, if you don't know why things are failing, or what to look out for to catch things from failing, then why even bother having that stuff in front of you? So, it's a lot of learning. It's all learning. 
I mean, technology is never going to stop changing and it's never going to stop being… a lot to learn, [laugh] so we might as well learn it and try to keep up with the [laugh] times and make our lives easier. There was actually a—I wrote this quote down because I ran across this last week, and I loved it because we were talking about failures. It said, “Not responding to failures is one characteristic of the organizational death spiral.” And I loved that because I sat there and thought, “Yeah, if you do have a failure, and you think, ‘Well, I have my monitoring tools in place. It looks like it worked itself out. I don't really know what happened.' And that continues to happen, and everyone on the team has that same mentality, then eventually, things are going to keep breaking, and it's going to get worse and worse over time.” And they're not going to realize that they had a death spiral. [laugh]. So, I just love that quote, I thought that was pretty great. Julie: I love that as well, who was that from? Carissa: Oh, I'll have to pull it up, but it was online somewhere. I was kind of going through—because it was really bothering me when we were talking about some of our monitoring, and I was asking some kind of deep questions about, why? What's the critical threshold? What's the warning? Why are we looking at this? And so I started looking at deeper dives into resiliency, and so that popped up, and I thought that was pretty spot on. Julie: I love it. We will find the author of that. We'll post it in the show notes. I think that is an amazing quote. I think I'm going to steal it from you at some point because that's—it's very true. And learning from those failures and understanding that we can prevent failures from occurring, right? So—Carissa: Absolutely. Julie: —if you have a failure and you've remediated it, and you still want to test to make sure that you're not going to drift back into that failure, right? Our systems are constantly changing. 
So, that's one of the things we talk about with Chaos Engineering, as well, and building that reliability in. Now, have you experienced or practiced Chaos Engineering at all with any of your customers that you've worked on, or at ClickBank?Carissa: There was one, [sigh] one client that we had that I would say yes, but the testing itself needed to be more robust, it needed to be more accurate. It was kind of like an attempt to build testing around—you know, for Chaos Engineering, but looking back now, I wish we would have had more guidance and direction on how to build really strategic testing, not just, “Oh, look, it passed.” It might have been a false pass, [laugh] but it was just kind of absolute basic testing. So, I think there's a growth with that. Because I've talked to a lot of engineers over the years that we say testing is important, right, but then do we actually do it, especially when we're automating and we're using all these third-party tools.A lot of times, I'm going to go with what we don't. We say it's really important, we see the importance of it, but we don't actually implement it. And sometimes it's because we need help to be able to build accurate testing and things that we know really are going to be sustainable testing. So, it's more of probably an intimidation thing that I've seen over the years. And it's kind of going back to, we don't like to ask for help a lot of times in this industry, and so that plays a role there. Sometimes we just need help to be able to build these things out so we're not walking on eggshells waiting for the next thing to break.Julie: Now, I love it because you've drilled down kind of into that a few times about asking for help. And you've worked with some folks that I know you've done a great job. So far, I'm really impressed just seeing your growth over the last three years because I do remember your first day—Carissa: Oh—Julie: [laugh].Carissa: [laugh]. 
Oh, God.Julie: —and seeing you and in these little corner cubes. That was—[laugh]—Carissa: I was sweating bullets that day.Julie: —quite a long time ago. What advice would you give to senior folks who are helping newer folks or more junior folks? What would you want them to know about working with newer people?Carissa: Yeah, that's a good question. So, in my last job, I actually ended up becoming a lead before I left. And so [sigh] the one thing I learned from my mentor at my previous company that really just brought me up from knowing nothing. One thing I learned from him was, when he looked at me on the first day, he said, “Do not be afraid to ask for help. Period. Just don't. Because if you don't, something bad's going to happen and you're not going to learn and you're not going to grow.”And he also was one that said, “Put your helmet on. It's going to be a bumpy ride.” [laugh]. And I loved that. He even got me a little, uh—oh, it kind of like—it was a little bobblehead, and it had a helmet. [laugh]. And I thought that was so spot-on.I think we forget when we get really good at something or we've been doing something for a while, as human beings, we forget what it's like to be new, and to be scared, and to not know what our left and right hand is doing. So, I would say keep that in the forefront of your mind as you're mentoring people, as you're helping ramp them up, is they're going to be afraid to ask questions or remind them it's okay, and also just taking a step back and remembering when you were really new at something. Because it's hard to do. We all want to become experts and we don't want to remember how horrible that felt when we did not know what was in front of us. So, that would be my couple pieces of advice.Julie: Well, and then kind of circling back to that first time that you broke production, right, and everybody rallied around to help you—which is amazing; I love that—after it was over, what was the culture like? Were they supportive? 
What happened?Carissa: Yeah, that's a really good question because I've heard people's horror stories where it was not a good response afterwards, and they felt even more horrible after it was fixed. And my experience was a complete opposite. The support was just 1000% there. And we even hung out—we started a Zoom call and after we'd fixed it, there were people that hopped back on the call and said, “Let me tell you about my production story.” And we just started swapping horror stories.And it was 1000% support, but also it was a nice human reminder that we break things and it's okay. And so that was—it was a pretty great experience, I hope the best—we're all going to break things, but I hope that everyone gets that experience because the other experience, no fun. You know, we already feel terrible enough after we break it. [laugh].Julie: I think that's important. And I love that because that goes back to the embrace failure statement, right? Embrace it, learn from it. If you can take that and learn. And what did you learn? So, you mentioned you learn double, triple, quadruple check.Carissa: Mm-hm.Julie: So, have you made that same mistake again?Carissa: I have not. Knock on wood. I have not. [laugh].Julie: [crosstalk 00:21:58]Carissa: [crosstalk 00:21:59]. [laugh].Julie: It could happen—Carissa: Yep.Julie: —as we all are learning so much, sometimes you make the same mistake twice, right?Carissa: Yeah, absolutely. I would say there's two things. So, I learned that, and then I also learned that not just double and triple check before you do something, but going back to the don't be afraid to ask questions, sometimes you have to ask clarifying questions of your client or your customer before you pull the trigger. So, you might say I've done this a million times, but sometimes the ask is a little vague. 
And so, if you don't ask detailed questions, then yes, you might have done what needed to be done, but not in the way that they hoped for, not in the way that they wanted, your end game results were now not what was hoped for. So, definitely ask layered questions if you need to. To anyone: To your coworkers, to your manager, to your whoever you're using your monitoring tools through. Just ask away because it's better to do it upfront than to just try to get the work done and then, you know, then more fun happens. Julie: More fun indeed. [laugh]. Carissa: [laugh]. Yes. Julie: Now, why don't you tell our listeners who aren't familiar with ClickBank, do you want to promote them a little bit, talk a little bit about what you're doing over there? Carissa: Yeah. So, ClickBank is awesome, which is why I'm there. [laugh]. No, they're a great company. I'm on a fairly, I wouldn't say large team, but it's a good-sized team. They're just really good people. I think that's been one of the things that's incredibly important to me, and I knew when I was making a switch that everyone talks about, they have a great working environment, they have great work-life balance. And for me, it's like you can talk the talk, but I want you to walk the walk, as a company. And I want—you know, if you say you're going to have a family environment, I want to see that. And I have seen that at ClickBank. It's been an awesome couple of months. There's a lot of support on the teams. There's a lot of great management there, and I'm kind of excited to see where this goes. But coming with a fresh perspective of working at ClickBank, it's a really great company. I'm happy. Julie: Well, I love that. And from what I'm aware of, y'all have some positions that are open, so we'll post a link to ClickBank in the show notes as well. 
And, Carissa, I just want to thank you for taking the time to be a little vulnerable and talk about your terrifying breaking production experience, but also about why it's so important to be open to folks asking questions and to show empathy towards those that are learning.Carissa: Mm-hm. Yeah, absolutely. I think that is the number one thing that's going to make us all successful. It's going to make mentors more successful, and they're going to learn as they're doing it and it's going to make—it's going to build confidence in people that are coming into this industry or that are new in this industry to say, “Not only can I do this, I'm going to be really great. And I'm going to eventually mentor somebody someday.”Julie: I love that. And thank you. And thank you for spending time with us today. And, folks, you can find Carissa on LinkedIn. Pretty impressed that you're not on Twitter, so not a huge social media person, so it's just LinkedIn for Carissa. And with that—Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Intro 00:01:45 - AWS Serverless Hero and Gunnar's history using AWS 00:04:42 - Serverless as reliability 00:08:10 - How they are testing the connectivity in serverless 00:12:47 - Gunnar shares a surprising result of Chaos Engineering 00:16:00 - Strategy for improving and advice on tracing 00:20:10 - What Gunnar is excited about at AWS 00:28:50 - What Gunnar has going on/Outro Links: Twitter: https://twitter.com/GunnarGrosch LinkedIn: https://www.linkedin.com/in/gunnargrosch/ TranscriptGunnar: When I started out, I perhaps didn't expect to find that many unexpected things that actually showed more resilience or more reliability than we actually thought.Jason: Welcome to the Break Things on Purpose podcast, a show about Chaos Engineering and building more reliable systems. In this episode, we chat with Gunnar Grosch, a Senior Developer Advocate at AWS about Chaos Engineering with serverless, and the new reliability-related projects at AWS that he's most excited about.Jason: Gunnar, why don't you say hello and introduce yourself.Gunnar: Hi, everyone. Thanks, Jason, for having me. As you mentioned, I'm Gunnar Grosch. I am a Developer Advocate at AWS, and I'm based in Sweden, in the Nordics. And I'm what's called a Regional Developer Advocate, which means that I mainly cover the Nordics and try to engage with the developer community there to, I guess, inspire them on how to build with cloud and with AWS in different ways. And well, as you know, and some of the viewers might know, I've been involved in the Chaos Engineering and resilience community for quite some years as well. So, topics of real interest to me.Jason: Yeah, I think that's where we actually met was around Chaos Engineering, but at the time, I think I knew you as just an AWS Serverless Hero, that's something that you'd gotten into. I'm curious if you could tell us more about that. 
How did you begin that journey?Gunnar: Well, I guess I started out as an AWS user, built things on AWS. As a builder, developer, I've been through a bunch of different roles throughout my 20-plus something year career by now. But started out as an AWS user. I worked for a company, we were a consulting firm helping others build on AWS, and other platforms as well. And I started getting involved in the AWS community in different ways, by arranging and speaking at different meetups across the Nordics and Europe, also speaking at different conferences, and so on.And through that, I was able to combine that with my interest for resiliency or reliability, as someone who's built systems for myself and for our customers. That has always been a big interest for me. Serverless, it came as I think a part of that because I saw the benefits of using serverless to perhaps remove that undifferentiated heavy lifting that we often talk about with running your own servers, with operating things in your own data centers, and so on. Serverless is really the opposite to that. But then I wanted to combine it with resilience engineering and Chaos Engineering, especially.So, started working with techniques, how to use Chaos Engineering with serverless. That gained some traction, it wasn't a very common topic to talk about back then. Adrian Hornsby, as some people might know, also from AWS, he was previously a Developer Advocate at AWS, now in a different role within the organization. He also talked a bit about Chaos Engineering for serverless. So, teamed up a bit with him, and continue those techniques, started creating different tools and some open-source libraries for how to actually do that. And I guess that's how, maybe, the AWS serverless team got their eyes opened for me as well. So somehow, I managed to become what's known as an AWS Hero in the serverless space.Jason: I'm interested in that experience of thinking about serverless and reliability. 
I feel like when serverless was first announced, it was that idea of you're not running any infrastructure, you're just deploying code, and that code gets called, and it gets run. Talk to me about how does that change the perception or the approach to reliability within that, right? Because I think a lot of us when we first heard of serverless it's like, “Great, there's nothing. So theoretically, if all you're doing is calling my code and my code runs, as long as I'm being reliable on my end and, you know, doing testing on my code, then it should be fine, right?” But I think there's some other bits in there or some other angles to reliability that you might want to tune us into.Gunnar: Yeah, for sure. And AWS Lambda really started it all as the compute service for serverless. And, as you said, it's about having your piece of code running that on-demand; you don't have to worry about any underlying infrastructure, it scales as you need it, and so on; the value proposition of serverless, truly. The serverless landscape has really evolved since then. So, now there is a bunch of different services in basically all different categories that are serverless.So, the thing that I started doing was to think about how—I wasn't that concerned about not having my Lambda functions running; they did their job constantly. But then when you start building a system, it becomes a lot more complex. You need to have many different parts. And we know that the distributed systems we build today, they are very complex because they contain so many different moving parts. And that's still the case for serverless.So, even though you perhaps don't have to think about the underlying infrastructure, what servers you're using, how that's running, you still have all of these moving pieces that you've interconnected in different ways. So, that's where the use case for Chaos Engineering came into play, even for serverless. 
So, testing how these different parts work together to then make sure that it actually works as you intended. So, it's a bit harder to create those experiments since you don't have control of that underlying infrastructure. So instead, you have to do it in a few different ways, since you can't install any agents to run on the platform, for instance, you can't control the servers—shut down servers, perhaps the most basic of Chaos Engineering experiments.So instead, we're doing it using different libraries, we're doing it by changing configuration of services, and so on. So, we still apply the same principles, the principles of Chaos Engineering, we just have to be—well, we have to think about it in a different way in how we actually create those experiments. So, for me, it's a lot about testing how the different services work together. Since the serverless architectures that you build, they usually contain a bunch of different services that you stitch together to actually create the output that you're looking for.Jason: Yeah. So, I'm curious, what does that actually look like then in testing, how these are stitched together, as you say? Because I know with traditional Chaos Engineering, you would run a blackhole attack or some sort of network attack to disrupt that connectivity between services. Obviously, with Lambdas, they work a little bit differently in the way that they're called and they're more event-driven. So, what does that look like to test the connectivity in serverless?Gunnar: So, what we started out with, both me and Adrian Hornsby was create these libraries that we could run inside the AWS Lambda functions. So, I created one that was for Node.js, something that you can easily install in your Node.js code. Adrian has created one for Python Lambda functions.So, then they in turn contain a few different experiments. 
So, for instance, you could add latency to your AWS Lambda functions to then control what happens if you add 50 milliseconds per invocation on your Lambda function. So, for each call to a downstream service, say you're using DynamoDB as a data store, so you add latency to each call to DynamoDB to see how this affects your application. Another example could be to have a blackhole or a denial list, so you're denying calls to specific services. Or it could be downstream services, other AWS services, or it could be third-party, for instance; you're using a third-party for authentication. What if you're not able to reach that specific API or whatever it is?We've created different experiments for—a typical use case for AWS Lambda functions has been to create APIs where you're using an API Gateway service, an AWS Lambda function is called, and then returning something back to that API. And usually, it should return a 200 response, but you could then alter that response to test how does your application behave? How does the front-end application, for instance, behave when it's not getting that 200 response that it's expecting, instead getting a 502, a 404, or whatever error code you want to test with. So, that was the way, I think, we started out doing these types of experiments. And just by those simple building blocks, you can create a bunch of different experiments that you can then use to test how the application behaves under those adverse conditions.Then if you want to move to create experiments for other services, well, then serverless, as we talked about earlier, since you don't have control over the underlying infrastructure, it is a bit harder. Instead, you have to think about different ways to do it by, for instance, changing configuration, things like that. You could, for instance, restrict concurrent operations on certain services, or you could do experiments to block access, for instance, using different access control lists, and so on. 
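[Editor's note: The three building blocks Gunnar describes (added latency, a denylist for downstream calls, and an altered status code) can be sketched roughly as below. This is a minimal illustration in plain Python, not the API of the Node.js or Python libraries mentioned in the conversation. The names `CHAOS_CONFIG`, `inject_failure`, and `call_downstream`, and the host `auth.example.com`, are all hypothetical.]

```python
import time
from functools import wraps

# Hypothetical experiment settings. In practice these would likely live in
# external configuration so an experiment can be stopped without a redeploy.
CHAOS_CONFIG = {
    "enabled": True,
    "latency_ms": 50,                  # delay added to every invocation
    "status_code": None,               # e.g. 502 to replace the normal response
    "denylist": {"auth.example.com"},  # hosts the function is not allowed to call
}


def inject_failure(config):
    """Wrap a Lambda-style handler and apply the configured experiments."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, context):
            if not config.get("enabled"):
                return handler(event, context)

            # Latency experiment: pause before running the real handler.
            delay_ms = config.get("latency_ms") or 0
            if delay_ms:
                time.sleep(delay_ms / 1000)

            response = handler(event, context)

            # Status code experiment: replace the response the caller sees.
            forced = config.get("status_code")
            if forced is not None:
                return {"statusCode": forced, "body": "injected failure"}
            return response
        return wrapper
    return decorator


def call_downstream(host, config):
    """Denylist experiment: refuse calls to hosts listed in the config."""
    if config.get("enabled") and host in config.get("denylist", ()):
        raise ConnectionError(f"chaos experiment: call to {host} denied")
    return f"called {host}"  # stand-in for a real network call


@inject_failure(CHAOS_CONFIG)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}
```

[Setting `CHAOS_CONFIG["status_code"] = 502` makes `handler({}, None)` return a 502 response instead of the 200, which is the kind of condition a front-end could be tested against. Setting `"enabled"` to `False` turns every experiment off at once, which plays the role of a stop button.]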
So, different ways, all depending on how that specific service works.Jason: It definitely sounds like you're taking some of those same concepts, and although serverless is fundamentally different in a lot of ways, really just taking that, translating it, and applying those to the serverless.Gunnar: Yeah, exactly. I think that's very important here to think about, that it is still using Chaos Engineering in the exact same way. We're using the traditional principles, we're walking through the same steps. And many times as I know everyone doing Chaos Engineering talks about this, we're learning so much just by doing those initial steps. When we're looking at the steady-state of the application, when we're starting to design the experiments, we learn so much about the application.I think just getting through those initial steps is very important for people building with serverless, as well. So, think about, how does my application behave if something goes wrong? Because many times with serverless—and for good reasons—you don't expect anything to fail. Because it scales as it should, services are reliable, and they are responding. But it is that old, “What if?” What if something goes wrong? So, just starting out doing it in the same way as you normally would do with Chaos Engineering, there is no difference, really.Jason: And you know, when we do these experiments, there's a lot that we end up learning, and a lot that can be very surprising, right? When we assume that our systems are one way, and we run the test, and we follow that regular Chaos Engineering process of creating that hypothesis, testing it, and then getting that unexpected result—Gunnar: Right.Jason: —and having to learn from that. So, I'm interested, if you could share maybe one of the surprising results that you've learned as you've done Chaos Engineering, as you've continued to hone this practice and use it. 
What's a result that was unexpected for you, that you've learned something about?Gunnar: I think those are very common. And I think we see them all the time in different ways. And when I started out, I perhaps didn't expect to find that many unexpected things that actually showed more resilience or more reliability than we actually thought. And I think that's quite common, that we run an experiment, and we often find that the system is more resilient to failure than we actually thought initially, for instance, that specific services are able to withstand more turbulent conditions than we initially thought.So, we create our hypothesis, we expect the system to behave in a certain way. But it doesn't, instead—it doesn't break, but instead, it's more robust. Certain services can handle more stress than we actually thought, initially. And I think those cases, they, well, they are super common. I see that quite a lot. Not only talking about serverless Chaos Engineering experiments; all the Chaos Engineering experiments we run. I think we see that quite a lot.Jason: That's an excellent point. I really love that because it's, as you mentioned, something that we do see a lot of. In my own experience working with some of our customers, oftentimes, especially around networking, networking can be one of the more complex parts of our systems. And I've dealt with customers who have come back to me and said, “I ran a blackhole attack, or latency attack, or some sort of network disruption and it didn't work.” And so you dig into it, well, why didn't it work? And it's actually well, it did; there was a disruption, but your system was designed well enough that you just never noticed it. And so it didn't show up in your metrics dashboards or anything because system just worked around it just fine.Gunnar: Yeah, and I think that speaks to the complexity of the systems we're often dealing with today. 
I think it's Casey Rosenthal who talked about this quite early on with Chaos Engineering, that it's hard for any person to create that mental model of how a system works today. And I think that's really true. And those are good examples of exactly that. So, we create this model of how we think the system should behave, but [unintelligible 00:15:46], sometimes it behaves very unexpected… but in the positive way.Jason: So, you mentioned about mental models and how things work. And so since we've been talking about serverless, that brought to mind one of those things for me with serverless is, as people make functions and things because they're so easy to make and because they're so small, you end up having so many of them that work together. What's your strategy for starting to improve or build that mental model, or document what's going on because you have so many more pieces now with things like serverless?Gunnar: There are different approaches to this, and I think this ties in with observability and the way we observe systems today because as these systems—often they aren't static, they continue to evolve all the time, so we add new functionality, and especially using serverless and building it with AWS Lambda functions, for instance, as soon as we start creating new features to our systems, we add more and more AWS Lambda functions or different serverless ways of doing new functionality into our system. So, having that proper observability, I think that's one of the keys of creating that model of how the system actually works, to be able to actually see tracing, see how the system or how a request flows through the system. Besides that, having proper documentation is something that I think most organizations struggle with; that's been the case throughout all of my career, being able to keep up with the pace of innovation that's inside that organization. 
So, keeping up with the pace of innovation in the system, continuing to evolve your documentation for the system, that's important. But I think it's hard to do it in the way that we build systems today.So, it's not about only keeping that mental model, but keeping documentation and how the system actually looks, the architecture of the system, it's hard today. I think that's just a fact. And ways to deal with that, I think it comes down to how the engineering organization is structured, as well. We have Amazon and AWS, we—well, I guess we're quite famous for our two-pizza teams, the smaller teams that they build and run their systems, their services. And it's very much up to each team to have that exact overview how their part on the bigger picture works. And that's our solution for doing that, but as we know, it differs from organization to organization.Jason: Absolutely. I think that idea of systems being so dynamic that they're constantly changing, documentation does fall out of step. But when you mentioned tracing, that's always been one of those really key parts, for me at least coming from a background of doing monitoring and observability. But the idea of having tracing that's just automatically going to expose things because it's following that request path. As you dive into this, any advice for listeners about how to approach that, how to approach tracing whether that's AWS X-Ray or any other tools?Gunnar: For me, it's always been important to actually do it. And I think what I sometimes see is that's something that's added on later on in the process when people are building. I tend to say that you should start doing it early on because I often think it helps a lot in the development phase as well. So, it shouldn't be an add-on later on, after the fact. So, starting to use tracing no matter if it's as you said, X-Ray or any third-party's service, using it early on, that helps, and it helps a lot while building the system. 
And we know that there are a bunch of different solutions out there that are really helpful, and many AWS partners that are willing to help with that as well.Jason: So, we've talked a bunch about serverless, but I think your role at AWS encompasses a whole lot of things beyond just serverless. What's exciting you now about things in the AWS ecosystem, like, what are you talking about that just gets you jazzed up?Gunnar: One thing that I am talking a lot about right now that is very exciting is fortunately, we're in line with what we've just talked about, with resilience and with reliability. And many of you might have seen the release from AWS recently called AWS Resilience Hub. So, with AWS Resilience Hub, you're able to make use of all of these best practices that we've gathered throughout the years in our AWS Well-Architected Framework that then guides you on the route to building resilient and reliable systems. But we've created a service that will then, in an, let's say, more opinionated but also easier way, will then help you on how to improve your system with resilience in mind. So, that's one super exciting thing. It's early days for Resilience Hub, but we're seeing customers already starting to use it, and already making use of the service to improve on their architecture, use those best practices to then build more resilient and reliable systems.Jason: So, AWS Resilience Hub is new to me. I actually haven't really gotten into it much. As far as I understand it, it really takes the Well-Architected Framework and combines the products or the services from Amazon into that, and as a guide. Is this something for people that have developed a service for them to add on, or is this for people that are about to create a new service, and really helping them start with a framework?Gunnar: I would say that it's a great fit if you've already built something on AWS because you are then able to describe your application using AWS Resilience Hub. 
So, if you build it using Infrastructure as Code, or if you have tagging in place, and so on, you can then define your application using that, or describe your application using that. So, you point towards your CloudFormation templates, for instance, and then you're able to see, these are the parts of my application. Then you'll set up policies for your application. And the policies, they include the RTO and the RPO targets for your application, for your infrastructure, and so on.And then you do the assessment of your application. And this then uses the AWS Well-Architected Framework to assess your application based on the policies you created. And it will then see if your application RTO and RPO targets are in line with what you set up in your policies. You will also then get an output with recommendations what you can do to improve the resilience of your application based, once again, on the Well-Architected Framework and all of the best practices that we've created throughout the years. So, that means that if you, for instance, have built an application that right now is in one single availability zone, well, then Resilience Hub will give you recommendations on how you can improve resilience by spreading your application across multiple availability zones. That could be one example.It could also be an example of recommending you to choose another data store to have a better RTO or RPO, based on how your application works. Then you'll implement these changes, hopefully. And at the end, you'll be able to validate that these new changes then help you reach your targets that you've defined. It also integrates with AWS Fault Injection Simulator, so you're able to actually then run experiments to validate that through the help of this.Jason: That's amazing. So, does it also run those as part of the evaluation, do failure injection to automatically validate and then provide those recommendations? 
Or, those provided sort of after it does the evaluation, for you to continue to ensure that you're maintaining your objectives?Gunnar: It's the latter. So, you will then get a few experiments recommended based on your application, and you can then easily run those experiments at your convenience. So, it doesn't run them automatically. As of now, at least.Jason: That is really cool because I know a lot of people when they're starting out, it is that idea of you get a tool—no matter what tool that is—for Chaos Engineering, and it's always that question of, “What do I do?” Right? Like, “What's the experiment that I should run?” And so this idea of, let's evaluate your system, determine what your goals are and the things that you can do to meet those, and then also providing that feedback of here's what you can do to test to ensure it, I think that's amazing.Gunnar: Yeah, I think this is super cool one. And as a builder, myself who's used the Well-Architected Framework as a base when building application, I know how hard it can be to actually use that. It's a lot of pages of information to read, to learn how to build using best practices, and having a tool that then helps you to actually validate that, and I think it's great. And then as you mentioned, having recommendations on what experiments to run, it makes it easier to start that Chaos Engineering journey. And that's something that I have found so interesting through these last, I don't know, two, three years, seeing how tools like Gremlin, like, now AWS FIS, and with the different open-source tools out there, as well, all of them have helped push that getting-started limit closer to the users. It is so much easier to start with Chaos Engineering these days, which I think it's super helpful for everyone wanting to get started today.Jason: Absolutely. I had someone recently asked me after running a workshop of, “Well, should I use a Chaos Engineering tool or just do my own thing? 
Like do it manually?” And, you know, the response was like, “Yeah, you could do it manually. That's an easy, fast way to get started, but given how much effort has been put into all of these tools, there's just so much available that makes it so much easier.” And you don't have to think as much about the safety and the edge cases of what if I manually do this thing? What are all the ways that can go wrong? Since there are these tools now that just makes it so much easier?Gunnar: Exactly. And you mentioned safety, and I think that's a very important part of it. Having that, we've always talked about that automated stop button when doing Chaos Engineering experiments and having the control over that in the system where you're running your experiments, I think that's one of the key features of all of these Chaos Engineering tools today, to have a way to actually abort the experiments if things start to go wrong.Jason: So, we're getting close to the end of our time here. Gunnar, I wanted to ask if you've got anything that you wanted to plug or promote before we wrap up.Gunnar: What I'd like to promote is the different workshops that we have available that you can use to start getting used to AWS Fault Injection Simulator. I would really like people to get that hands-on experience with AWS Fault Injection Simulators, so get your hands dirty, and actually, run some Chaos Engineering experiments. Even though you are far away from actually doing it in your organization, getting that experience, I think that's super helpful as the first step. Then you can start thinking about how could I implement this in my organization? So, have a look at the different workshops that we at AWS have available for running Chaos Engineering.Jason: Yeah, that's a great thing to promote because it is that thing of when people ask, “Where do I start?” I think we often assume not just that, “Let me try this,” but, “How am I going to roll this out in my organization? 
How am I going to make the business case for this? Who needs to be involved in it?” And then suddenly it becomes a much larger problem that maybe we don't want to tackle. Awesome.Gunnar: Yeah, that's right.Jason: So, if people want to find you around the internet, where can they follow you and find out more about what you're up to?Gunnar: I am available everywhere, I think. I'm on Twitter at @GunnarGrosch. Hard to spell, but you can probably find it in the description. I'm available on LinkedIn, so do connect there. I have a TikTok account, so maybe I'll start posting there as well sometimes.Jason: Fantastic. Well, thanks again for being on the show.Gunnar: Thank you for having me.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Intro 00:02:23 - Iwata is the best, rest in peace 00:06:45 - Sam sneaks some SNES emulators/Engineer prep 00:08:20 - AWS, incidents, and China 00:16:40 - Understanding the big picture and moving from project to product 00:19:18 - Sam's time at Snapchat 00:26:40 - Sam's work at Gremlin, and culture changes 00:34:15 - Pokémon Go and Outro TranscriptSam: It's like anything else: You can have good people and bad people. But I wouldn't advocate for no people.Julie: [laugh].Sam: You kind of need humans involved.Julie: Welcome to the Break Things on Purpose podcast, a show about people, culture, and reliability. In this episode, we talk with Sam Rossoff, principal software engineer at Gremlin, about legendary programmers, data center disasters at AWS, going from 15 to 3000 engineers at Snapchat, and of course, Pokémon.Julie: Welcome to Break Things on Purpose. Today, Jason Yee and I are joined by Sam Rossoff, principal software engineer at Gremlin, and max level 100 Pokémon trainer. So Sam, why don't you tell us real quick who you are.Sam: So, I'm Sam Rossoff. I'm an engineer here at Gremlin. I've been in engineering here for two years. It's a good time. I certainly enjoyed it. And before that, I was at Snapchat for six years, and prior to that at Amazon for four years. And actually, before I was at Amazon, I was at Nokia Research Center in Palo Alto, and prior to that, I was at Activision. This was before they merged with Blizzard, all the way back in 2002. I worked in QA.Julie: And do you have any of those Nokia phones that are holding up your desk, or computer, or anything?Sam: I think I have an N95 around here somewhere. It's, like, a phone circa 2009. Probably. I remember, it was like a really nice, expensive phone at the time and they just gave it to us. And I was like, “Oh, this is really nice.”And then the iPhone came out. And I was like [laugh], “I don't know why I have this.” Also, I need to find a new job. 
That was my primary—I remember I was sitting in a meeting—this was lunch. It wasn't a meeting.I was sitting at lunch with some other engineers at Nokia Research, and they were telling me the story about this app—because the App Store was brand new in those days—it was called iRich, and it was $10,000. It didn't do anything. It was, like, a glowing—it was, like, NFTs, before NFTs—and it was just, like, a glowing thing on your phone. And you just, like, bought it to show you could waste $10,000 an app. And that was the moment where I was like, “I need to get out of this company. I need a new job.” It's depressing at the time, I guess.Julie: So. Sam, you're the best.Sam: No. False. Let me tell you story. There's a guy, his name is Iwata, right? He's a software developer. He works at a company called HAL Laboratories. You may recall, he built a game called Kirby. Very famous game; very popular.HAL Laboratories gets acquired by Nintendo. And Nintendo is like, “Hey, can you”—but Iwata, by the way, is the president of HAL Laboratories. Which is like, you know, ten people, so not—and they're like, “Hey, can you, like, send someone over? We're having trouble with this game we're making.” Right, the game question, at the time they called it Pokémon 2, now we call it Gold and Silver, and Iwata just goes over himself because he's a programmer in addition to be president of HAL Laboratories.And so he goes over there and he's like, “How can I help?” And they're like, “We're over time. We're over budget. We can't fit all the data on the cart. We're just, like, cutting features left and right.” He's like, “Don't worry. I got this.”And he comes up with this crazy compression algorithm, so they have so much space left, they put a second game inside of the game. They add back in features that weren't there originally. And they released on time. And they called this guy the legendary programmer. 
As a kid, he was my hero.Also famous for building Super Smash Brothers, becoming the president of all of Nintendo later on in his life. And he died a couple years ago, of cancer, if I recall correctly. But he did this motion when he was president of Nintendo. So, you ever see somebody in Nintendo go like this, that's a reference to Iwata, the legendary programmer.Jason: And since this is a podcast, Sam is two hands up, or just search YouTube for—Sam: Iwata.Jason: That's the lesson. [laugh].Sam: [laugh]. His big console design after he became President of Nintendo was the Nintendo Wii, as you may recall, with the nunchucks and everything. Yeah. That's Iwata. Crazy.Julie: We were actually just playing the Nintendo Wii the other day. It is still a high-quality game.Sam: Yeah.Jason: The original Wii? Not like the… whatever?Julie: Yeah. Like, the original Wii.Jason: Since you brought up the Wii, the Wii was the first console I ever owned because I grew up with parents that made it important to do schoolwork, and their entire argument was, if you get a Nintendo, you'll stop doing your homework and school stuff, and your grades will suffer, and just play it all the time. And so they refuse to let me get a Nintendo. Until at one point I, like, hounded them enough-I was probably, like, eight or nine years old, and I'm like, “Can I borrow a friend's Nintendo?” And they were like, sure you can borrow it for the weekend. So, of course, I borrowed it and I played it the whole weekend because, like, limited time. And then they used that as the proof of like, “See? All you did this weekend was play Nintendo. This is why we won't get you one.” [laugh].Sam: So, I had the exact same problem growing up. My parents are also very strict. And firm believers in corporal punishment. And so no video games was very clear. And especially, you know, after Columbine, which was when I was in high school.That was like a hard line they held. But I had friends. 
I would go to their houses, I would play at their houses. And so I didn't have any of those consoles growing up, but I did eventually get, like, my dad's old hand-me-down computer for, like, schoolwork and stuff, and I remember—first of all, figuring out how to program, but also figuring out how to run SNES emulators on [laugh] on those machines. And, like, a lot of my experience playing video games was waking up at 2 a.m. in the morning, getting on emulators, playing that until about, you know, five, then turning it off and pretending to go back to bed.Julie: So see, you were just preparing to be an engineer who would get woken up at 2 a.m. with a page. I feel like you were just training yourself for incidents.Sam: What I did learn—which has been very useful—is I learned how to fall asleep very quickly. I can fall asleep anywhere, anytime, on, like, a moment's notice. And that's a fantastic skill to have, let me tell you. Especially when [crosstalk 00:07:53]—Julie: That's a magic skill.Sam: Yeah.Julie: That is a magic skill. I'm so jealous of people that can just fall asleep when they want to. For me, it's probably some Benadryl, maybe add in some melatonin. So, I'm very jealous of you. Now I—Jason: There's probably a reason that I'm drinking all this cheap scotch right now.Sam: [laugh].Julie: We should point out that it's one o'clock in the morning for Jason because he's in Estonia right now. So, thank you, A, for doing this for us, and we did promise that you would get to talk about Pokémon. So—Sam: [laugh].Julie: [laugh].Sam: I don't know if you noticed, immediately, that's what I went to. I got a story about Pokémon.Julie: So, have you heard any of our episodes?Sam: I have. I have listened to some. They're mostly Jason, sort of, interviewing various people about their experience. 
I feel like they come, like, way more well-prepared than I am because they have, like, stuff they want to talk about, usually.Julie: They also generally have more than an hour or two's notice. So.Sam: Well, that's fair. Yeah. That probably [laugh] that probably helps. Whereas, like, I, like, refreshed one story about Iwata, and that's, like, my level of preparation here. So… don't expect too much.Julie: I have no expectations. Jason already had what you should talk about lined up anyway. Something about AWS incidents in China.Sam: Oh, my God. The first question is, which one?Jason: [laugh].Sam: So, I don't know how much you're familiar with the business situation in China, but American businesses are not allowed to operate in China. What happens is you create a Chinese subsidiary that's two-thirds owned by Chinese nationals in some sort of way, you work through other companies directly, and you form, like, these partnerships. And I know you know, very famously, Blizzard did this many years ago, and then, like, when they pulled out China, that company, all the people worked at are like, “Well, we're just going to take your assets and make our own version of World of Warcraft and just, like, run that instead.” But Amazon did, and it was always this long game of telephone, where people from Amazon usually, like, VP, C-level people were asking for various things. And there were people whose responsibility it was to, like, go and make those things happen.And maybe they did or, like, maybe they just said they did, right? And, like, it was never clear how much of it was lost in translation, or they're just, like, dealing with unreasonable requirements, and they're just, like, trying to get something done. But one story is one of my favorites because I was on this call. Amazon required all of their data centers to be multiple zones, right? So, now they talk about availability zones in a region. 
Internally at Amazon, that's not how we referred to things; it'd be like, there's the data center in Virginia, and there's, like, the first one, the second one, the third one, right? They're just, like, numbered; we knew what they were. And you had to have three of them, and then all services had to be redundant such that they could handle a single data center failure. In the earlier days of Amazon, they would actually go turn off data centers to, like, make you prove this was the case. It was, like, a very early version of chaos engineering. Because it's just, like, unreliable. And unfortunately, AWS kind of put the kibosh on that because it turns out people purchasing VMs on AWS don't like it when you turn off their VMs without warning. Which, like, I'm sympathetic, uh… I don't know. As a side note, if you are data center redundant, that means you're running excess capacity. So, if I'm about to lose a data center, I need to be able to maintain traffic without a real loss in error rates, that means I've got to be running, like, 50% excess capacity if I've only got three data centers, or 33% if you've got four data centers. And so capacity of course was always the hard problem when you're dealing with data centers. So, when we were running the Chinese website—z.cn or amazon.cn—there was a data center in China, as you might imagine, as required by the complex business regulations and whatnot. And it had, you know, three availability zones, for lack of a better term. Or we thought it had three availability zones, which of course, this is what happened. One day, I got paged into this call, and they were dealing with a website outage, and we were trying to get people on the ground in China on the call, which as I recall, actually is a real hard problem to get. It was the middle of the night there; there was a very bad rainstorm; people were not near internet connectivity. 
If you're unfamiliar with the Chinese landscape—well, it's more complex today, but in those days, there were just basically two ISPs in China, and, like, Amazon only peered with one of them. And so if you were on the other one, it was very difficult to get back into Amazon systems. And so they'd have places they could go to so they could connect them when they—and so it was pair to. And so it was a very difficult situation. It took us a while to get people on the phone, but basically, we lost two data centers at the same time, which was very surprising. And later we find out what happened is one of the data centers had flooded, which is bad, bunch of electrical machines flooding from a rainstorm that's got whatever else going on. It turns out the other data center was physically inside of the first [laugh] data center. Which is not the sort of isolation you want between two regions. It's not really clear where in the conversation, you know, things got lost, such that this is what got implemented. But we had three data centers in theory, and in practice, we had two data centers, since one was inside the other. And when the first one flooded, the, like, floor gave way, and the servers crashed down on top of the other one. [laugh]. And so they were literally inside of each other after that point. They took down the Chinese website for Amazon. It was an experience. It was also one of those calls where there's not a lot I could do to help, which is always frustrating for a lot of reasons.Julie: So, how did you handle that call? Out of curiosity, I mean, what do you say?Sam: Well, I'll be honest with you, it took us a long time to get that information, to get the state of the world. 
Most of the call actually was trying to get ahold of people, trying to get information, get translators—because almost everybody on the line did not speak either Cantonese or Mandarin, which is what the engineers were working with—and so by the time we got an understanding—I was in Seattle at the time—Seattle got an understanding of what was happening in—I think it was Beijing. I don't recall off the top of my head—the people on the ground had done a lot of work to isolate and get things up and running, and the remainder of the work was reallocating capacity in the remaining data center so that we wouldn't be running data center redundant, but at the very least, we would be able to serve something. It was, as I recall, it was a very long outage we had to take. Although in those days, the amazon.cn website was not really a profit center. The business was—the Amazon business—was willing to sell things at steep discounts in China to establish themselves in that market, and so, there was always sort of a question of whether or not the outage was saving the company money. Which is, like, sort of a—Julie: [laugh].Sam: —it's like a weird place to be in as an engineer, right? Because you're, like, “You're supposed to be adding business value.” I'm like, “I feel like doing nothing might be adding business value here.” It's not true, obviously because the business value was to be in the Chinese market and to build an Amazon presence for some eventual world. Which I don't know if they ever—they got to. I don't work at Amazon, and haven't in almost a decade now. But it was definitely—it's the kind of thing that wears on morale, right? If you know the business is doing something that is sort of questionable in these ways. And look, in the sales, you know, when you're selling physical goods, industry loss leaders are a perfectly normal part of the industry. And you understand. 
Like, you sell certain items at a loss to get people in the door, totally. But as engineering lacked a real strong view of the cohesive situation on the ground, the business inputs, that's hard on engineering, right, where they're sort of not clear what the right thing is, right? And anytime you take the engineers very far away from the product, they're going to make a bunch of decisions that are fundamentally in a vacuum. And if you don't have a good feel for what the business incentives are, or how the product is interacting with customers, then you're making decisions in a vacuum because there's some technical implementation you have to commit in some way, you're going to make a lot of the wrong decisions. And that was definitely a tough situation for us in those days. I hear it's significantly better today. I can't speak to it personally because I don't work there, but I do hear they have a much better situation today.Julie: Well, I'll tell you, just on the data center thing, I did just complete my Amazon Certified Cloud Practitioner. And during the Amazon training, they drilled it into you that the availability zones were tens of miles apart—the data centers were tens of miles apart—and now I understand why because they're just making sure that we know that there's no data centers inside data centers. [laugh].Sam: It was a real concern.Julie: [laugh]. But kind of going back though, to the business outcomes, quite a while ago, I used to give a talk called, “You Can't Buy DevOps,” and a lot of the things in that talk were based off of some of the reading that I did, in the book, Accelerate by Dr. Nicole Forsgren, Gene Kim, and Jez Humble. And one of the things they talked about is high-performing teams understanding the business goals. And kind of going back to that, making those decisions in a vacuum—and then I think, also, when you're making those decisions in a vacuum, do you have the focus on the customer? 
Do you understand the direction of the organization, and why are you making these decisions?Jason: I mean, I think that's also—just to dovetail on to that, that's sort of been the larger—if we look at the larger trend in technology, I think that's been the goal, right? We've moved from project management to product management, and that's been a change. And in our field, in SRE and things, we've moved from just thinking of metrics, and there were all these monitoring frameworks like USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) and monitoring for errors, and we've moved to this idea of SLOs, right? And SLOs are often supposed to be based on what's my customer experience? And so I think, overall, aside from Accelerate and DevOps, DevOps I feel like, has just been one part of this longer journey of getting engineers to understand where they fit within the grander scheme of things.Sam: Yeah. I would say, in general, anytime you have some sort of metric, which you're working towards, in some sort of reasonable way, it's easy to over-optimize for the metric. And if you think of the metric instead as sort of like the needle on a compass, it's like vaguely pointing north, right, but keep in mind, the reason we're heading north is because X, Y, Z, right? It's a lot easier for, like, individuals making the decisions that they have to on a day-to-day basis to make the right ones, right? And if you just optimize for the metric—I'm not saying metrics aren't helpful; they're extremely important. I would rather be lost with a compass than without one, but I also would like to know where I'm going and not just be wandering to northwards with the compass, right?Julie: Absolutely. And then—Jason: I mean, you don't want to get measured on lines of code that you commit.Sam: Listen. I will commit 70 lines of code. Get ready.Julie: Well, and metrics can be gamed, right? 
If people don't understand why those metrics are important—the overall vision; you've just got to understand the vision. Speaking of vision, you also worked at Snap.Sam: I did. I did. That was a really fun place to work. I joined Snapchat; there were 30 people at the company and 14 engineers. Very small company. And a lot of users, you know, 20-plus million users by that point, but very small company.And all the engineers, we used to sit in one room together, and so when you wanted to deploy the production back end, you, like, raised your hand. You're like, “Hey, I'm going to ship out the code. Does anyone have changes that are going out, or is everyone else already doing it?” And one of my coworkers actually wrote something into our deploy script so the speakers on your computer would, like, say, “Deploying production” just so, like, people could hear when it went out the door. Because, like, when you're all in one room, that's, like, a totally credible deployment strategy.We did build automation around that on CircleCI, which in those days was—I think this was 2014—much less big than it is today. And the company did eventually scale to at least 3000 engineers by the time I left, maybe more. It was hard for me to keep track because the company just grown in all these different dimensions. But it was really interesting to live through that.Julie: So, tell me about that. You went from, what, you said, 30 engineers to 3000 in the time that you were there.Sam: Fifteen engineers, I was the fifteenth.Julie: Fifteen. Fifteen engineers. What were some of the pain points that you experienced? And actually maybe even some advice for folks going through big company growth spurts?Sam: Yeah, that hypergrowth? I think it's easier for me to think about the areas that Snap did things wrong, but those were, like, explicit decisions we made, right? It might not be the case that you have these problems at your company. 
Like, one of the problems Snap had for a long time, we did not hire frontline managers or TPMs, and what that did is it created a lot of situations where you have director-levels with, like, 50-plus direct reports who struggled to make sure that—I don't know, there's no way you're going to manage 50-plus direct reports as engineers, right? Like, and it took the company a while to rectify that because we had such a strong hiring pipeline for engineers and not a strong hiring pipeline for managers. I know there's, like, a lot of people at companies saying, “Oh, man, these middle managers and TPMs, all they do is, like, create work for, like, real people.” No. They—I got to see the world without them. Absolutely they had enormous value. [laugh]. They are worth their weight in gold; there's a reason they're there. And it's not to say you can't have bad ones who add negative value, but that's also true for engineers, right? I've worked with engineers, too, who also have added negative value, and I had to spend a lot of my time cleaning up their code, right? It's like anything else: You can have good people and bad people. But I wouldn't advocate for no people.Julie: [laugh].Sam: You kind of need humans involved. The thing that was nice about Snap is Snap was a very product-led company, and so we always had an idea of what the product is that we were trying to build. And that was, like, really helpful. I don't know that we had, like, a grand vision for, like, how to make the internet better like Google does, but we definitely had an idea of what we're building and the direction we're moving it in. And it was very much led by Evan Spiegel, who I got to know personally, who spent a lot of time coming down talking to us about the design of the product and working through the details. Or at least, you know, early on, that was the case. Later on, you know, he was busy with other stuff. 
I guess he's, like, a CEO or something, now.Julie: [laugh].Sam: But yeah, that was very nice. The flip side meant that we under-invested in areas around things like QA and build tools and these other sorts of pieces. And, like, DevOps stuff, absolutely. Snapchat was on an early version of Google Cloud Platform. Actually an early version of something called App Engine. Now, App Engine still exists as a product. It is not the product today that it was back in 2014. I lived through them revving that product, and multiple deprecations and the product I used in 2014 was a disaster and huge pain, and the product they have today is actually semi-reasonable and something I would use again. And so props to Google Cloud for actually making something nice out of what they had. And I got to know some of their engineers quite well over the—[laugh] my tenure, as Snap was the biggest customer by far. But we offboarded, like, a lot of the DevOps work onto Google, and paid them handsomely for it, and what we found is you kind of get whatever Google feels like level of support, which is not in your control. And when you have 15 engineers, that's totally reasonable, right? Like, if I need to run, like, a million servers and I have 15 engineers, it's great to pay Google SREs to, like, keep track of my million servers. When you have, you know, 1000 engineers though, and Google wants half a billion dollars a year, and you're like, “I can't even get you guys to get my, like, Java version revved, right? I'm still stuck on Java 7, and this Java 8 migration has been going on for two years, right?” Like, it's not a great situation to be in. And Snap, to their credit, eventually did recognize this and invested heavily in a multi-cloud solution, built around Kubernetes—maybe not a surprise to anyone here—and they're still migrating to that, to the best of my knowledge. I don't know. I haven't worked at that company for two years now. 
But we didn't have those things, and so we had to sort of rebuild at a very, sort of, large scale. And there was a lot of infrastructure we set up in the early days in, like, 2014, when, like, ah, that's good enough, this, like, janky Python script because that's what we had time for, right? Like, I had an intern write a janky Python script that handled a merge queue so that we could get changes in, and that worked really great when there was like, a dozen engineers just, like, throwing changes at it. When there was, like, 500 engineers, that thing resulted in three-day build times, right? And I remember, uh, what was this… this was 2016… it was the winter of either 2016 to 2017 or 2017 to 2018 where, like, they're like, “Sam, we need to, like, rebuild the system because, like, 72 hours is not an acceptable time to merge code that's already been approved.” And we got down to 14 minutes. So, we were able to do it, right, but you need to be willing to invest the time. And when you're strapped for resources, it's very easy to overlook things like dev tools and DevOps because they're things that you only notice when they're not working, right? But the flip side is, they're also the areas where you can invest and get ten times the output of your investment, right? Because if I put five people on this, like, build system problem, right, all of a sudden, I've got, like, 100x build performance across my, like, 500 engineers. That's an enormous value proposition for your money. And in general, I think, you know, if you're a company that's going through a lot of growth, you have to make sure you are investing there, even if it looks like you don't need it just yet. Because first of all, you do, you're just not seeing it, but second of all, you're going to need it, right? Like, that's what the growth means: You are going to need it. And at Snap I think the policy was 10% of engineering resources were on security—which is maybe reasonable or not; I don't know. 
I didn't work on security—but it might also be the case that you want maybe 5 to 10% of the engineering resources working on your internal tooling.Because that is something that, first of all, great value for your money, but second of all, it's one of those things where all of a sudden, you're going to find yourself staring at a $500 million bill from Google Cloud or AWS, and be like, “How did we do this to ourselves?” Right? Like, that's really expensive for the amount of money we're making. I don't know what the actual bill number is, but you know, it's something crazy like that. And then you have to be like, “Okay, how do we get everything off of Google Cloud and onto AWS because it's cheaper.” And that was a—[laugh] that was one heck of a migration, I'll tell you.Julie: So, you've walked us through AWS and through Snap, and so far, we've learned important things such as no data centers within data centers—Sam: [laugh].Julie: —people are important, and you should focus on your tooling, your internal tooling. So, as you mentioned before, you know, now you're at Gremlin. What are you excited about?Sam: Yeah. I think there's, like, a lot of value that Gremlin provides to our customers. I don't know, one of the things I liked working at Snapchat is, like, I don't particularly like Facebook. I have not liked Facebook since, like, 2007, or something. And there's, like, a real, like, almost, like, parasitic aspect to it.In my work at Snap, I felt a lot better. It's easy to say something pithy, like, “Oh, you're just sending disappearing photos.” Like, yeah, but, like, it's a way people stay connected that's not terrible the way that Facebook is, right? I felt better about my contribution.And so similarly, like, I think Gremlin was another area where, like, I feel a lot be—like, I'm actually helping my customers. I'm not just, like, helping them down a poor path. 
There's some, like, maybe ongoing conversation around if you worked in Amazon, like, what happens in FCs and stuff? I didn't work in that part of the company, but like, I think if I had to go back and work there, that's also something that might, you know, weigh on me to some degree. And so one of the—I think one of the nice things about working at Gremlin is, like, I feel good about my work if that makes sense.And I didn't expect it. I mean, that's not why I picked the job, but I do like that. That is something that makes me feel good. I don't know how much I can talk about upcoming product stuff. Obviously, I'm very excited about upcoming product stuff that we're building because, like, that's where I spend all my time. I'm, like, “Oh, there's, like, this thing and this thing, and that's going to let people do this. And then you can do this other thing.”I will tell you, like, I do—like, when I conceptualize product changes, I spend a lot of time thinking, how is this going to impact individual engineers? How is this going to impact their management chain, and their, like, senior leadership director, VP, C-suite level? And, like, how do we empower engineers to, like, show that senior leadership that work is getting done? Because I do think it's hard—this is true across DevOps and it's not unique to Chaos Engineering—I do think it's hard sometimes to show that you're making progress in, like, the outages you avoided, right? And, like, that is where I spend, like, a lot of my thought time, like, how do I like help doing that?And, like, if you're someone who's, like, a champion, you're, you're like, “Come on, everyone, we should be doing Chaos Engineering.” Like, how do I get people invested? You care, you're at this company, you've convinced them to purchase Gremlin, like, how do I get other engineers excited about Chaos Engineering? 
I think, like, giving you tools to help with that is something that, I would hope, I mean, I don't know what's actually implemented just yet, but I'd hope is somewhere on our roadmap. Because that's the thing like, that I personally think a lot about. I'll tell you another story. This was also when I was at Amazon. I had this buddy, we'll call him Zach because that's his name, and he was really big on testing. And he had all this stuff about, like, testing pyramid, if you're familiar with, like, programming unit testing, integration testing, it's all that stuff. And he worked on a team—a sister team to mine—and a lot of engineers did not care heavily about testing. [laugh]. And he used to try to, like, get people to, like, do things and talk about it and stuff. They just, like, didn't care, even slightly. And I also kind of didn't care, so I wasn't any better, but something I did one day on my team is I was like, “You know, somebody else at Amazon”—because Amazon invested very heavily in developer tools—had built some way that was very easy to publish metrics into our primary metrics thing about code coverage. And so I just tossed in all the products for my team, and that published a bunch of metrics. And then I made a bunch of graphs on a wiki somewhere that pulled live data, and we could see code coverage. And then I, like, showed it in, like, a team meeting one week, and everyone was like, “Oh, that's kind of interesting.” And then people were like, “Oh, I'm surprised that's so low.” And they found, like, some low-hanging fruit and they started moving it up. And then, like, the next year bi-weekly with our skip-level, like, they showed the progress, he's like, “Oh, this is really good.” You made, like, a lot of progress in the code coverage. And then, like, all of a sudden, like, when they're inviting new changes, they start adding testing, or, like, all of a sudden, like, code coverage just seemed to ratchet up. 
Or some [unintelligible 00:30:51] would be like, “Hey, I have this thing so that our builds would fail now if code coverage went down.” Right? Like, all of a sudden, it became, sort of like, part of the culture to do this, to add coverage. I remember—and they, like, sort of pollinated to the sister teams.I remember Zach coming by my desk one day. He's like, “I'm so angry. I've been trying for six months to get people to care. And you do some dumb graphs and our wiki.” And I'm like, “I mean, I don't know. I was just, like, an idea I had.” Right? Like, it wasn't, like, a conscious, like, “I'm going to change the culture moment,” it was very much, like, “I don't know, just thought this was interesting.”And I don't know if you know who [John Rauser](https://www.youtube.com/watch?v=UL2WDcNu_3A) is, but he's got this great talk at Velocity back in 2010, maybe 2011, where he talks about culture change and he talks about how humans do change culture readily—and, you know, Velocity is very much about availability and latency—and what we need to do in the world of DevOps and reliability in general is actually we have to change the culture of the companies we're at. Because you're never going to succeed, just, like, here emoting adding chaos engineering into your environment. I mean because one day, you're going to leave that company, or you're going to give up and there'll be some inertia that'll carry things forward, but eventually, people will stop doing it and the pendulum will swing back the other way, and the systems will become unreliable again. But if you can build a culture, if you can make people care—of course, it's the hardest thing to do in engineering, like, make other engineers care about something—but if you can do it, then it will become sort of self-perpetuating, right, and it becomes, like, a sort of like a stand-alone complex. 
And then it doesn't matter if it's just you anymore. And as an engineer, I'm always looking for ways to, like, remove myself as a critical dependency, right? Like, if I could work myself out of a job, thank you, because, like, [laugh] yeah, I can go work on something else now, right? Like, I can be done, right? Because, like, as we all know, you're never done with software, right? There's always a next version; there's always, like, another piece; you're always, like, migrating to a new version, right? It never really ends, but if you can build something that's more than just yourself—I feel like this is, like, a line from Batman or something. “Mr. Wayne, if you can become a legend”—right? Like, you'd be something more than yourself? Yeah, absolutely. I mean, it's not a great delivery like Liam Neeson. But yeah.Jason: I like what you said, though. You talked about, like, culture change, but I think a big thing of what you did is exposing what you're measuring or starting to measure this thing, right? Because there's always a statement of, “You can't improve until you measure it,” right? And so I think simply because we're engineers, exposing that metric and understanding where we're at is a huge motivator, and can be—and obviously, in your case—enough to change that culture is just, like, knowing about this and seeing that metric. And part of the whole DevOps philosophy is the idea that people want to do the best job that they can, and so exposing that data of, “Look, we're not doing very well on this,” is often enough. Just knowing that you're not doing well is often enough to motivate you to do better.Sam: Yeah, one of the things we used to say at Amazon is, “If you can't measure it, it didn't happen.” And like, it was very true, right? 
I mean, that was a large organization that moves slowly, but, like, it was very true that if you couldn't show a bunch of graphs or reports somewhere, oftentimes people would just pretend like it never happened.Julie: So, I do want to bring it back just a little bit, in the last couple of minutes that we have, to Pokémon. So, you play Pokémon Go?Sam: I do. I do play Pokémon Go.Julie: And then how do people find you on Pokémon Go?Sam: My trainer—Jason: Also, I'm going to say, Sam, you need to open my gifts. I'm in Estonia.Sam: [laugh]. It's true. I don't open gifts. Here's the problem. I have no space because I have, like, all these items from all the, like, quests and stuff they've done recently. They're like, “Oh, you got to, like, make enough space, or you could pay us $2 and we'll give you more space.” I'm like, “I'm not paying $2,” right? Like—Jason: [laugh].Sam: And so, I just, like, I have to go in every now and then and, like, just, like, delete a bunch of, like, Poké Balls or something. Like maybe I don't need 500 Poké Balls. That's fair.Jason: I mean, I'm sitting on 628 Ultra Balls right now. [laugh].Sam: Yeah. Well, maybe you don't need—Jason: It's community day on Sunday.Sam: I know, I know. I'm excited for it. I have a trainer code. If you need my trainer code to find me on Pokémon Go, it's 1172-0487-4013. And you can add me, and I'll add you back because, like, I don't care; I love playing Pokémon, and I'd play every day. [laugh].Julie: And I feel it would be really rude to leave Jason out of this since he plays Pokémon a lot. Jason, do you want to share your…Jason: I'm not sharing my trainer code because at this point, I'm nearing the limit, and I have all of these Best Friends that I'm actually Lucky Friends with, and I have no idea how to contact them to actually make Lucky trades. 
And I know that some of them are, like, halfway around the world, so if you are in the Canary Islands and you are a friend of mine on Pokémon Go, please reach out to me on Twitter. I'm @gitbisect on Twitter. Message me so that we can actually, like, figure out who you are. Because at some point, I will go to the Canary Islands because they are beautiful.Sam: Also, you can get those, like, sweet Estonia gifts, what will give you those eggs from Estonia, and then when you trade them you get huge mileage on the trades. I don't know if this is a thing you [unintelligible 00:36:13], Jason, but, like, my wife and I both compete for who can get the most mileage on the trip. And of course, we traded each other but that's, like, a zero-sum game, right? And so the total mileage on trades is a big thing in my house.Jason: Well, the next time we get together, I've got stuff from New Zealand, so we can definitely get some mileage there.Sam: Excellent.Julie: Well, this is excellent. I feel like we have learned so much on this episode of Break Things on Purpose, from obviously the most important information out there—Pokémon—but back to some of the history of Nintendo and Amazon and Snap and all of it. And so Sam, I just want to thank you for being on with us today. And folks again, if you want to be Sam's friend on Pokémon Go—I'm sorry, I don't really know how it works. I don't even know if that's the right term—Sam: It's fine.Julie: You've got his code. [laugh]. And thanks again for being on our podcast.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Intro 00:00:38 - Death to VPNs 00:02:45 - “I do not like React hooks.” 00:03:50 - A Popular (?) Opinion TranscriptPat: Good thing you're putting that on our SRE focused pod.Brian: Yeah, well, they can take that to their front end developers and say, well, Brian Holt told me that hooks suck.Jason: Welcome to Break Things on Purpose, an opinionated podcast about reliability and technology. As we launch into 2022, we thought it would be fun to ask some of our previous guests about their unpopular opinions.Zack Butcher joined the show in August 2021 to chat about his work on the Istio service mesh and its role in building more reliable, distributed systems. Here's his unpopular opinion on network security.Zack: I mean, can I talk about how I'm going to kill all the VPNs in the world? Uh, VPNs don't need to exist anymore, and that's stuff that I've actually been saying for years now. So it's so funny. We're finally realizing multi cluster Kubernetes. Right? I was so excited maybe two years ago at KubeCon and I finally heard people talk about multi cluster and I was like, oh, we finally arrived! It's not a toy anymore! Because when you have one, it's a toy; when you have multiple, you're actually doing things. However, how do people facilitate that? I had demos four years ago of multicluster routing and traffic management on Istio. It was horrendous to write. It was awful. It's way better the way we do it now. But, you know, the whole point that almost that entire time, I would tell people like, I'm going to kill VPNs, there's no need for VPNs.There's a small need for like user privacy things. Right? That's a different category. But by and large, when organizations use a VPN, it's really about extending their network, right. It's about a network based trust model. And so I know that when you have reachability, that is that authorization, right? That's the old paradigm. VPNs enabled that. 
Fundamentally that doesn't work with the world that we live in anymore. It just doesn't; that's just not how security works, sorry. Uh, in, in these highly dynamic environments that we live in now. And so I actually think at this point in time, for the most part, actually VPNs probably cause more problems than solutions given the other tools that we have around.So yeah, so my unpopular opinion is that I want them to go away and be replaced with Envoy sidecars doing the encryption for all kinds of stuff. I would love to see that on your machine too. Right. I would love to see, you know, I'm, I'm talking to you on a MacBook. I would love for there to be a small sidecar there that actually is proxying that and doing things like identity and credential exchange in some way. Because that's a much stronger way to do security and to build your system than things like a VPN.Jason: In April 2021, Brian Holt shared some insightful, and hilarious, incidents and his perspective on Frontend Chaos Engineering. He shared his unpopular opinion with host Pat Higgins.Brian: My unpopular opinion is that I do not like React hooks. And if you get people from the React community there's going to be some people that are legitimately going to be upset by that.I think they demo really well. And like the first time you show me some of that, it's just amazing and fascinating, but maintaining the large code bases full of hooks just quickly devolves into a performance mess, you get into like weird edge cases. And long-term, I think they actually have more cognitive load because you have to understand closures really well to understand hooks really well. Whereas the opposite way, which is doing it with React components, you have to understand this in context a little bit, but not a lot. 
So anyway, that's my very unpopular React opinion is that I don't like hooks and I wish we didn't have them.Pat: Good thing you're putting that on our SRE focused pod.Brian: Yeah, well, they can take that to their front end developers and say, well, Brian Holt told me that hooks suck.Jason: In November, Gustavo Franco dropped by to chat about building an SRE program at VMware and the early days of Chaos Engineering at Google. We suspect his strongly held opinion is, in fact, quite popular.Gustavo: About technology in general, the first thing that comes to mind, like the latest pet peeve in my head is really AIOps, as a term. It really bothers me. I think it's giving a name to something that is not there yet. It may come one day.So I could rant about AIOps forever. But the thing I would say is that, I dunno, folks selling AIOps solutions, like, look into improving statistics functions in your products first. Yeah, it's, it's just a pet peeve. I know it doesn't really change anything for me on a day-to-day basis; it's just that every time I see something related to AIOps or people asking me, you know, if my teams ever implement AIOps, it bothers me.Maybe about technology at large, just quickly, is kind of the same realm and how everything is artificial intelligence now. Even when people are not using machine learning at all. So everything quote unquote is an AI like queries and keyword matching for things. And people were like, oh, this is like an AI. This is more like for journalists, right? Like, I don't know if any journalists ever listen to this, but if they do, not everything that uses keyword matching's AI or machine learning.The computers are not learning, people! The computers are not learning! Calm down!Jason: The computers are not learning, but we are. And we hope that you'll learn along with us.To hear more from these guests and listen to all of our previous episodes, visit our website at gremlin.com/podcast. 
You can automatically receive all of our new episodes by subscribing to the Break Things on Purpose podcast on Apple Podcasts, Spotify, or your favorite podcast app. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Introduction 00:00:30 - Fastly Outage 00:04:05 - Salesforce Outage 00:07:25 - Hypothesizing 00:10:00 - Julie Joins the Team! 00:14:05 - Looking Forward/Outro TranscriptJason: There's a bunch of cruft that they'll cut from the beginning, and plenty of stupid things to cold-open with, so.Julie: I mean, I probably should have not said that I look forward to more incidents.[audio break 00:00:12]Jason: Hey, Julie. So, it's been quite a year, and we're going to do a year-end review episode here. As with everything, this feels like a year of a lot of incidents and outages. So, I'm curious, what is your favorite outage of the year?Julie: Well, Jason, it has been fun. There's been so many outages, it's really hard to pick a favorite. I will say that one that sticks out as my favorite, I guess, you could say was the Fastly outage, basically because of a lot of the headlines that we saw such as, “Fastly slows down and stops the internet.” You know, “What is Fastly and why did it cause an outage?” And then I think that people started realizing that there's a lot more that goes into operating the internet. So, I think from just a consumer side, that was kind of a fun one. I'm sure that the increases in Google searches for Fastly were quite large in the next couple of days following that.Jason: That's an interesting thing, right? Because I think for a lot of us in the industry, like, you know what Fastly is, I know what Fastly is; I've been friends with folks over there for quite a while and they've got a great service, but for everybody else out there in the general public, suddenly, this company, they never heard of that, you know, handles, like, 25% of the world's internet traffic, like, is suddenly on the front page news and they didn't realize how much of the internet runs through this service. And I feel it that way with a lot of the incidents that we're seeing lately, right? 
We're recording this in December, and a week ago, Amazon had a rather large outage, affecting us-east-1, which it seems like it's always us-east-1. But that took down a bunch of stuff and similarly, there are people, like, you know, my dad, who's just like, “I buy things from Amazon. How did this crash, like, the internet?”Julie: I will tell you that my mom generally calls me—and I hate to throw her under the bus—anytime there is an outage. So, Hulu had some issues earlier this year and I got texts from my mom actually asking me if I could call any of my friends over at Hulu and, like, help her get her Hulu working. She does this similarly for Facebook. So, when that Facebook outage happened, I always—almost—know about an outage first because of my mother. She is my alerting mechanism.Jason: I didn't realize Hulu had an outage, and now it makes me think we've had J. Paul Reed and some other folks from Netflix on the show. We definitely need to have an engineer from Hulu come on the show. So, if you're out there listening and you work for Hulu, and you'd like to be on the show and dish all the dirt on Hulu—actually don't do that, but we'd love to talk with you about reliability and what you're doing over there at Hulu. So, reach out to us at podcast@gremlin.com.Julie: I'm sure my mother would appreciate their email address and phone number just in case—Jason: [laugh].Julie: —for the future. [laugh].Jason: If you do reach out to us, we will connect you with Julie's mother to help solve her streaming issues. You had mentioned one thing though. You said the phrase about throwing your mother under the bus, and that reminds me of one of my favorite outages from this year, which I don't know if you remember, it's all about throwing people under the bus, or one person in particular, and that's the Salesforce outage. Do you remember that?Julie: Oh. Yes, I do. 
So, I was not here at the time of the Salesforce outage, but I do remember the impact that that had on multiple organizations. And then—Jason: Yes—Julie: —the retro.Jason: —the Salesforce outage was one where, similarly, Salesforce affects so much, and it is a major name. And so people like my dad or your mom probably knew like, “Oh, Salesforce. That's a big thing.” The retro on it, I think, was what really stood out. I think, you know, most people understand, like, “Oh, you're having DNS issues.” Like, obviously it's always DNS, right? That's the meme: It's always DNS that causes your issues.In this case it was, but their retro on this that they publicly published was basically, “We had an engineer that went to update DNS, and this engineer decided to push things out using an EBF process, an Emergency Break Fix process.” So, they sort of circumvented a lot of the slow rollout processes because they just wanted to get this change made and get it done without all the hassle. And turns out that they misconfigured it and it took everything down. And so the entire incident retro was basically throwing this one engineer under the bus. Not good.Julie: No, it wasn't. And I think that it's interesting because especially when I was over at PagerDuty, right, we talked a lot about blamelessness. That was very not blameless. It doesn't teach you to embrace failure, it doesn't show that we really just want to take that and learn better ways of doing things, or how we can make our systems more resilient. But going back to the Fastly outage, I mean, the NPR headline was, “Tuesday's Internet Outage was Caused by One Customer Changing a Setting, Fastly says.” So again, we could have better ways of communicating.Jason: Definitely don't throw your engineers under the bus, but even more so, don't throw your customers under the bus. 
I think for both of these, we have to realize, like, for the engineer at Salesforce, like, the blameless lesson learned here is, what safeguards are you going to put in place? Or what safeguards were there? Like, obviously, this engineer thought, like, “The regular process is a hassle; we don't need to do that. What's the quickest, most expedient way to resolve the issue or get this job done?” And so they took that.And similarly with the customer at Fastly, they're just like, “How can I get my systems working the way I want them to? Let's roll out this configuration.” It's really up to all of us, and particularly within our companies, to think about how are people using our products. How are they working on our systems? And, what are the guardrails that we need to put in place? Because people are going to try to make the best decisions that they can, and that obviously means getting the job done as quickly as possible and then moving on to the next thing.Julie: Well, and I think you're really onto something there, too, because I think it's also about figuring out those unique ways that our customers can break our products, things that we didn't think through. And I mean, that goes back to what we do here at Gremlin, right? Then that goes back to Chaos Engineering. Let's think through a hypothesis. Let's see, you know, what if ABC Company, somebody there does something. How can we test for that?And I think that shouldn't get lost in the whole aspect of now we've got this postmortem. But how do we recreate that? How do we make sure that these things don't happen again? And then how do we get creative with trying to figure out, well, how can we break our stuff?Jason: I definitely love that. And that's something that we've done internally at Gremlin this year is, we've really started to build up a better practice around running Chaos Engineering internally on our own systems. 
We've done that for a long time, but a lot of times it was just specific teams, and so earlier this year, the advocacy team was partnering up with the various engineering teams and running Chaos Engineering experiments. And it was interesting to learn and think through some of those ideas of as we're doing this work, we're going to be trying to do things expediently with the least amount of hassle, but what if we decide to do something that's outside of the documented process, but for which there are no technical guardrails? So, some of the things that we ended up doing were testing dependencies, right, things that again, are outside of the normal process.Like, we use LaunchDarkly for feature flagging. What happens if we decide to circumvent that, just push things straight to production? What happens if we decide to just block LaunchDarkly altogether? And we found some actual critical issues and we were able to resolve those without impacting our customers.Julie: That's the key element: Practice, play, think through the what ifs. And I love the what ifs part. You know, going back to my past, I have to tell you that the IT team used to always give me all of the new tech because if something was going to break for some reason—they used to call me the “AllSpark” to be honest with everybody out there—for some reason, if something was going to break, with me it would break in the most unique possible way, so before anything got rolled out to the entire company, I was the one that got to test it.Jason: That's amazing. So, what you're saying is on my next project, I need to give that to you first?Julie: Oh, a hundred percent. Really, it was remarkable how things would break. I mean, I had keyboards that would randomly type letters. I definitely took down some internal things, but I'm just saying that you should leverage those people within your organization, as well. The thing was, it was never a, “Julie is awful; things break because of Julie.” It was, “You know what? 
Leverage Julie to learn about what we're using.” And it was kind of fun. I mean, granted, this was years ago, and that name has stuck, and sometimes they still definitely make fun of me for it, but really, they just used me to break things in unique ways. Because I did.Jason: That's actually a really good segue to some of the stuff that we've been doing because you joined Gremlin, now, a few months back—more than a few months—but late summer, and a lot of what we were doing early on was just, we had these processes that, internally for myself and other folks who'd been around for a while, it was just we knew what to do because we'd done it so much. And it was that nice thing of we're going to do this thing, but let's just have Julie do it. Also, we're not going to tell you anything; we're just going to point you at the docs. It became really evident as you went through that of, like, “Hey, this doc is missing this thing. It doesn't make sense.”And you really helped us improve some of those documentation points, or some of the flows that we had, you would execute, and it's like, “Why are we doing it this way?” And a lot of times, it was like, “Oh, that's a legacy thing. We do it because—oh, right, that thing we did it because of doesn't exist anymore. Like, we're doing it completely backwards because of some sort of legacy thing that doesn't exist. Let's update that.” And you were able to help us do that, which was fantastic.Julie: Oh, yeah. And it was really great on my end, too because I always felt like I could ask the questions. And that is a cultural trait that is really important in an organization, to make sure that folks can ask questions and feel comfortable doing so. 
I've definitely seen it the other way, and when folks don't know the right way to do something or they're afraid to ask those questions, that's also where you see the issues with the systems because they're like, “Okay, I'm just going to do this.” And even going back to my days of being a recruiter—which is when I started in tech, but don't worry, everybody, I was super cool; I was not a bad recruiter—that was something that I always looked for in the interview process. When I'd ask somebody how to do something, would they say, “I don't know, I would ask,” or, “I would do this,” or would they just fumble their way through it? I think that it's important that organizations really adopt that culture of, again, failure, blamelessness, it's okay to ask questions.Jason: Absolutely. I think sort of the flip side of that, or the corollary of that is something that Alex Hidalgo brought up. So, one of our very first episodes of 2021 on this podcast, we had Alex Hidalgo who's now at Nobl9, and he brought up a thing from his time at Google called Hyrum's Law. And Hyrum's Law is this guy Hyrum who worked at Google basically said, “If you've got an API, that API will be used in every way possible. If you don't actually technically prevent it, somebody is going to use your API in a way it wasn't designed for. And that because it allows that, it becomes totally, like, a plausible or a valid use case for this.”And so as we think about this, and thinking about blamelessness, use the end-runaround to deploy this DNS change, like, that's a valid process now because you didn't put anything in place to validate against it, and to guarantee that people weren't using it in ways that were not intended.Julie: I think that that makes a lot of sense. 
Because I know I've definitely used things in ways that were not intended, which people can go back and look at my quest for Diet Cherry 7 Up during the pandemic, when I used tools in ways they weren't intended, but I would like to say that Diet Cherry 7 Up is back, from those tools. Thank you PagerDuty and some APIs that were open to me to be able to leverage in interesting ways.Jason: If you needed an alert for Diet Cherry 7 Up, PagerDuty, I guess it's a good enough tool for that.Julie: Well, the fact is, is I [laugh] was able to get very creative. I mean, what are terms of service, Jason?Jason: I don't know. Does anybody actually read those?Julie: Yeah. I would call them ‘light guardrails.'Jason: [laugh]. So Julie, we're getting towards the end of the year. I'm curious, what are you looking forward to in 2022?Julie: Well, aside from, ideally, the end to the pandemic, I would say that one of the things that I'm looking forward to in 2022, from joining Gremlin, I had a really great opportunity to work on certifications here, and I'm really excited because in 2022 we'll be launching some more certifications and I'm excited for what we're going to do with that and getting creative around that. But I'm also really interested to just see how everybody evolves or learns from this year and the outages that we had. I always love fun outages, so I'm kind of curious what's going to happen over the holiday season to see if we see anything new or interesting. But Jason, what about you? What are you looking forward to?Jason: You know I, similarly, am looking forward to the end of the pandemic. I don't know if there's really going to be an end, but I think we're starting to see a return to some normalcy. And so, we've already participated in some great events, went to KubeCon a couple months ago, went to Amazon re:Invent a few weeks ago, and both of those were fantastic just to see people getting out there, and learning, and building things again. 
So, I'm super excited for this next year. I think we're going to start seeing a lot more events back in person, and a lot of people really eager to get together to learn and build things together. So, that's what I'm excited about. Hopefully, less incidents, but as systems get more complex, I'm not sure that that's going to happen. So, at least if we don't have less incidents, more learning from incidents is really what I'm hoping for.Julie: I like how I'm looking forward to more incidents and you're looking forward to less. To be fair, from my perspective, every incident that we have is an opportunity to talk about something new and to teach folks things, and just sometimes it's fun going down the rabbit holes to find out, well, what was the cause of this? And what was the outcome? So, when I say more incidents, I don't mean that I don't want to be able to watch the Queen's Gambit on Netflix, okay, J. Paul? Just throwing that out there.Jason: Well, thanks, Julie, for being on. And for all of our listeners, whether you're seeing more incidents or less incidents, Julie and I both hope that you're learning from the incidents that you have, that you're working to become more reliable and building more reliable systems, and hopefully testing them out with some chaos engineering. If you'd like to hear more from the Break Things on Purpose podcast, we've got a bunch of episodes that we've published this year, so if you haven't heard some of them, go back into our catalog. You can see all of the episodes at gremlin.com/podcast. And we look forward to seeing you in our next podcast.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Introduction 00:04:30 - Early Dark Days in Chaos Engineering and Reliability 00:08:27 - Anecdotes from the “Long Dark Time” 00:16:00 - The Big Changes Over the Years 00:20:50 - Mandi's Work at PagerDuty 00:27:40 - Mandi's Tips for Better DevOps 00:34:15 - Outro Links: PagerDuty: https://www.pagerduty.com TranscriptJason: — hilarious or stupid?Mandi: [laugh]. I heard that; I listened to the J. Paul Reed episode and I was like, “Oh, there's, like, a little, like, cold intro.” And I'm like, “Oh, okay.”Jason: Welcome to Break Things on Purpose, a podcast about reliability and learning from failure. In this episode, we take a trip down memory lane with Mandi Walls to discuss how much technology, reliability practices, and chaos engineering have evolved over her extensive career in technology.Jason: Everybody, welcome to the show, Julie Gunderson, who recently joined Gremlin on the developer advocacy team. How's it going, Julie?Julie: Great, Jason. Really excited to be here.Jason: So, Mandi is actually a guest of yours. I mean, we both have been friends with Mandi for quite a while but you had the wonderful opportunity of working with Mandi.Julie: I did, and I was really excited to have her on our podcast now as we ran a podcast together at PagerDuty when we worked there. Mandi has such a wealth of knowledge that I thought we should have her share it with the world.Mandi: Oh, no. Okay.Julie: [laugh].Jason: “Oh, no?” Well, in that case, Mandi, why don't you—Mandi: [crosstalk 00:01:28]. I don't know.Jason: Well, in that case with that, “Oh no,” let's have Mandi introduce herself. [laugh].Mandi: Yeah hi. So, thanks for having me. I am Mandi Walls. I am currently a DevOps advocate at PagerDuty, Julie's last place of employment before she left us to join Jason at Gremlin.Julie: And Mandi, we worked on quite a few things over at PagerDuty. 
We actually worked on things together, joint projects between Gremlin, when it was just Jason and us where we would run joint workshops to talk about chaos engineering and actually how you can practice your incident response. And I'm sure we'll get to that a little bit later in the episode, but will you kick us off with your background so everybody knows why we're so excited to talk to you today?Mandi: Oh, goodness. Well, so I feel like I've been around forever. [laugh]. Prior to joining PagerDuty, I spent eight-and-a-half years at Chef Software, doing all kinds of things there, so if I ever trained you on Chef, I hope it was good.Prior to joining Chef, I was a systems administrator for AOL.com and a bunch of other platforms and sites at AOL for a long time. So, things like Moviefone, and the AOL Sports Channel, and dotcom, and all kinds of things. Most of them ran on one big platform because the monolith was a thing. So yeah, my background is largely in operations, and just systems administration on that side.Jason: I'm laughing in the background because you mentioned Moviefone, and whenever I think of Moviefone, I think of the Seinfeld episode where Kramer decides to make a Moviefone competitor, and it's literally just his own phone number, and people call up and he pretends to be that, like, robotic voice and has people, like, hit numbers for which movie they want to see and hear the times that it's playing. Gives a new meaning to the term on-call.Mandi: Indeed. Yes, absolutely.Julie: And I'm laughing just because I recently watched Hackers and, you know, they needed that AOL.com disc.Mandi: That's one of my favorite movies. Like, it's so ridiculous, but also has so many gems of just complete nonsense in it. Absolutely love Hackers. 
“Hack the planet.”Julie: “Hack the planet.” So, with hacking the planet, Mandi, and your time working at AOL with the monolith, let's talk a little bit because you're in the incident business right now over at PagerDuty, but let's talk about the before times, the before we practiced Chaos Engineering and before we really started thinking about reliability. What was it like?Mandi: Yeah, so I'll call this the Dark Ages, right? So before the Enlightenment. And, like, for folks listening at home, [laugh] the timeline here is probably—so between two-thousand-and-fi—four, five, and 2011. So, right before the beginning of cloud, right before the beginning of, like, Infrastructure as Code, and DevOps and all those things that's kind of started at, like, the end of my tenure at AOL. So, before that, right—so in that time period, right, like, the web was, it wasn't like it was just getting started, but, like, the Web 2.0 moniker was just kind of getting a grip, where you were going from the sort of generic sites like Yahoo and Yellow Pages and those kinds of things and AOL.com, which was kind of a collection of different community bits and news and things like that, into more personalized experiences, right?So, we had a lot of hook up with the accounts on the AOL side, and you could personalize all of your stuff, and read your email and do all those things, but the sophistication of the systems that we were running was such that like, I mean, good luck, right? It was migration from commercial Unixes into Linux during that era, right? So, looking at when I first joined AOL, there were a bunch of Solaris boxes, and some SGIs, and some other weird stuff in the data center. You're like, good luck on all that. And we migrated most of those platforms onto Linux at that time; 64 bit. Hurray.At least I caught that. 
And there was an increase in the use of open-source software for big commercial ventures, right, and so less of a reliance on commercial software and COTS solutions for things, although we did have some very interesting commercial web servers that—God help them, they were there, but were not a joy, exactly, to work on because the goals were different, right? That time period was a huge acceleration. It was like a Cambrian explosion of software pieces, and tools, and improvements, and metrics, and monitoring, and all that stuff, as well as improvements on the platform side. Because you're talking about that time period is also being the migration from bare metal and, like, ordering machines by the rack, which really only a handful of players need to do that now, and that was what everybody was doing then.And in through the earliest bits of virtualization and really thinking about only deploying the structures that you needed to meet the needs of your application, rather than saying, “Oh, well, I can only order gear, I can only do my capacity planning once a year when we do the budget, so like, I got to order as much as they'll let me order and then it's going to sit in the data center spinning until I need it because I have no ability to have any kind of elastic capacity.” So, it was a completely, [laugh] completely different paradigm from what things are now. We have so much more flexibility, and the ability to, you know, expand and contract when we need to, and to shape our infrastructures to meet the needs of the application in such a more sophisticated and almost graceful way that we really didn't have then. So, it was like, “Okay, so I'm running these big websites; I've got thousands of machines.” Like, not containers, not services.Like, there's tens of thousands of services, but there's a thousand machines in one location, and we've got other things spread out. 
There's like, six different pods of things in different places and all this other crazy business going on. At the same time, we were also running our own CDN, and like, I totally recommend you never, ever do that for any reason. Like, just—yeah. It was a whole experience and I still sometimes have, like, anxiety dreams about, like, the configuration for some of our software that we ran at that point. And all of that stuff is—it was a long… dark time.
Julie: So, now speaking of anxiety dreams, during that long, dark time that you mentioned, there had to have been some major incidents, something that stands out that you just never want to relive. And, Mandi, I would like to ask you to relive that for us today.
Mandi: [laugh]. Okay, well, okay, so there's two that I always tell people about because they were so horrific in the moment, and they're still just, like, horrible to think about. But, like, the first one was Thanksgiving morning, sometime early in the morning, like, maybe 2 a.m. something like that, I was on call.
I was at my mom's, so at the time, my mom had terrible internet access. And again, this time period don't have a lot of—there was no LTE or any kind of mobile data, right? So, I'm, like, on my mom's, like, terrible modem. And something happened to the database behind news.aol.com—which was kind of a big deal at the time—and unfortunately, we were in the process of, like, migrating off of one kind of database onto another kind of database.
News was on the target side but, like, the actual platform that we were planning to move to for everything else, but the [laugh] database on-call, the poor guy was only trained up in the old platform, so he had no idea what was going on. And yeah, we were on that call—myself, my backup, the database guy, the NOC analyst, and a handful of other people that we could get hold of—because we could not get in touch with the team lead for the new database platform to actually fix things. And that was hours.
Like, I missed Thanksgiving dinner. So, my family eats Thanksgiving at midday rather than in the evening. So, that was a good ten hour call. So, that was horrifying.
The other one wasn't quite as bad as that, but like, the interesting thing about the platform we were running at the time was it was AOL server, don't even look it up. Like, it was just crazytown. And it was—some of the interesting things about it was you could actually get into the server platform and dig around in what the threads were doing. Each of the servers had, like, a control port on it and I could log into the control port and see what all the requests were doing on each thread that was live. And we had done a big push of a new release of dotcom onto that platform, and everything fell over.
And of course, we've got, like, sites in half a dozen different places. We've got, you know, distributed DNS that's, like, trying to throw traffic between different locations as they fall over. So, I'm watching, like, all of these graphs oscillate as, like, traffic pours out of the Secaucus or whatever we were doing, and into Mountain View or something and, like, then all the machines in the Secaucus recover. So, then they start pinging and traffic goes back, and, like, they just fall over, over and over again. So, what happened there was we didn't have enough threads configured in the server for the new time duration for the requests, so we had to, like, just boost up all of the threads we could handle and then restart all of the applications. But that meant pushing out new config to all the thousands of servers that were in the pool at the time and then restarting all of them. So, that was exciting. That was the outage where I learned that the CTO knew how to call my desk. So, highly don't recommend that. But yeah, it was an experience. So.
Julie: So, that's really interesting because there's been so many investments now in reliability.
And when we talk about the Before Times when we had to cap our text messages because they cost us ten cents apiece, or when we were using those AOL discs, the thought was there; we wanted to make that user experience better. And you brought up a couple of things, you know, you were moving to those more personalized experiences, you were migrating those platforms, and you actually talked about your metrics and monitoring. And I'd like to dig in a little on that and see, how did that help you during those incidents? And after those incidents, what did you do to ensure that these types of incidents didn't occur again in the future?
Mandi: Yeah, so one of the interesting things about, you know, especially that time period was that the commercially available solutions, even some of the open-source solutions were pretty immature at that time. So, AOL had an internally built solution that was fascinating. And it's unfortunate that they were never able to open-source it because it would have been something interesting to sort of look at. Scale of it was just absolutely immense. But the things that we could look at at the time to sort of give us, you know, an indication of something, like, an AOL.com, it's kind of a general purpose website; a lot of different people are going to go there for different reasons.
It's the easiest place for them to find their email, it's the easiest place for them to go to the news, and they just kind of use it as their homepage, so as soon as traffic starts dropping off, you can start to see that, you know, maybe there's something going on and you can pull up sort of secondary indicators for things like CPU utilization, or memory exhaustion, or things like that.
Some of the other interesting things that would come up there is, like, for folks who are sort of intimately tied to these platforms for long periods of time, to get to know them as, like, their own living environment, something like—so all of AOL's channels at the time were on a single platform—like, hail to the monolith; they all live there—because it was all linked into one publishing site, so it made sense at the time, but like, oh, my goodness, like, scaling for the combination of entertainment plus news plus sports plus all the stuff that's there, there's 75 channels at one time, so, like, the scaling of that is… ridiculous.
But you could get a view for, like, what people were actually doing, and other things that were going on in the world. So like, one summer, there were a bunch of floods in the Midwest and you could just see the traffic bottom out because, like, people couldn't get to the internet. So, like, looking at that region, there's, like, a 40% drop in the traffic or whatever for a few days as people were not able to be online. Things like big snowstorms where all the kids had to stay home and, like, you get a big jump in the traffic and you get to see all these things and, like, you get to get a feel for more of a holistic attachment or holistic relationship with a platform that you're running. It was like it—they are very much a living creature of their own sort of thing.
Like, I always think of them as, like, a Kraken or whatever.
Like, something that's a little bit menacing, you don't really see all of it, and there's a lot of things going on in the background, but you can get a feel for the personality and the shape of the behaviors, and knowing that, okay, well, now we have a lot of really good metrics to say, “All right, that one 500 error, it's kind of sporadic, we know that it's there, it's not a huge deal.” Like, we did not have the sophistication of tooling to really be able to say that quantitatively, like, and actually know that but, like, you get a feel for it. It's kind of weird. Like, it's almost like you're just kind of plugged into it yourself.
It's like the scene in The Matrix where the operator guy is like, “I don't even see the text anymore.” Right? Like, he's looking directly into the matrix. And you can, kind of like—you spend a lot of time with [laugh] those applications, you get to know how they operate, and what they feel like, and what they're doing. And I don't recommend it to anyone, but it was absolutely fascinating at the time.
Julie: Well, it sounds like it. I mean, anytime you can relate anything to The Matrix, it is going to be quite an experience. With that said, though, and the fact that we don't operate in these monolithic environments anymore, how have you seen that change?
Mandi: Oh, it's so much easier to deal with. Like I said, like, your monolithic application, especially if there are lots of different and diverse functionalities in it, like, it's impossible to deal with scaling them. And figuring out, like, okay, well, this part of the application is memory-bound, and here's how we have to scale for that; and this part of the application is CPU-bound; and this part of the application is I/O bound.
And, like, peeling all of those pieces apart so that you can optimize for all of the things that the application is doing in different ways when you need to, makes everything so much smoother and so much more efficient, across, like, your entire ecosystem over time, right?
Plus, looking at trying to navigate the—like an update, right? Like, oh, you want to do an update to your next version of your operating system on a monolith? Good luck. You want to update the next version of your runtime? Plug and pray, right? Like, you just got to hope that everybody is on board.
So, once you start to deconstruct that monolith into pieces that you can manage independently, then you've got a lot more responsibility on the application teams, that they can see more directly what their impacts are, get a better handle on things like updates, and software components, and all the things that they need independent of every other component that might have lived with them in the monolith. Noisy neighbors, right? Like, if you have a noisy neighbor in your apartment building, it makes everybody miserable. Let's say if you have, like, one lagging team in your monolith, like, nobody gets the update until they get beaten into submission.
Julie: That is something that you and I used to talk about a lot, too, and I'm sure that you still do—I know I do—was just the service ownership piece. Now, you know who owns this. Now, you know who's responsible for the reliability.
Mandi: Absolutely.
Julie: You know, I'm thinking back again to these before times, when you're talking about all of the bare metal. Back then, I'm sure you probably didn't pull a Jesse Robbins where you went in and just started unplugging cords to see what happened, but was there a way that AOL practiced Chaos Engineering with maybe not calling it that?
Mandi: It's kind of interesting.
Like, watching the evolution of Chaos Engineering from the early days when Netflix started talking about it and, like, the way that it has emerged as being a more deliberate practice, like, I cannot say that we ever did any of that. And some of the early internet culture, right, is really built off of telecom, right? It was modem-based; people dialed into your POP, and like, that was the reliability they were expecting was very similar to what they expect out of a telephone, right? Like, the reason we have, like, five nines as a thing is because you want to pick up dial tone, and—pick up your phone and get dial tone on your line 99.999% of the time.
Like, it has nothing to do with the internet. It's like 1970s circuits with networking. For part of that reason, like, a lot of the way things were built at that time—and I can't speak for Yahoo, although I suspect they had a very similar setup—that we had a huge integration environment. It's completely insane to think now that you would build an integration environment that was very similar in scope and scale to your production environment; simply does not happen. But for a lot of the services that we had at that time, we absolutely had an integration environment that was extraordinarily similar.
You simply don't do that anymore. Like, it's just not part of—it's not cost effective. And it was only cost effective at that time because there wasn't anything else going on. Like, you had, like, the top ten sites on the internet, and AOL was, like, number three at the time. So like, that was just kind of the way things are done.
So, that was kind of interesting and, like, figuring out that you needed to do some kind of proactive planning for what would happen just wasn't really part of the culture at the time.
Like, we did have a NOC and we had some amazing engineers on the NOC that would help us out and do some of the things that we automate now: putting a call together, or when paging other folks into an incident, or helping us with that kind of response. I don't ever remember drilling on it, right, like we do. Like, practicing that, pulling a game day, having, like, an actual plan for your reliability along those lines.
Julie: Well, and now I think that yeah, the different times are that the competitive landscape is real now—
Mandi: Yeah, absolutely.
Julie: And it was hard to switch from AOL to something else. It was hard to switch from Facebook to MySpace—or MySpace to Facebook, I should say.
Mandi: Yeah.
Julie: I know that really ages me quite a bit.
Mandi: [laugh].
Julie: But when we look at that and when we look at why reliability is so important now, I think it's because we've drilled it into our users; the users have this expectation and they aren't aware of what's happening on the back end. They just kn—
Mandi: Have no idea. Yeah.
Julie: —just know that they can't deposit money in their bank, for example, or play that title at Netflix. And you and I have talked about this when you're on Netflix, and you see that, “We can't play this title right now. Retry.” And you retry and it pops back up, we know what's going on in the background.
Mandi: I always assume it's me, or, like, something on my internet because, like, Netflix, they don't ever go down. But, you know, yeah, sometimes it's [crosstalk]—
Julie: I just always assume it's J. Paul doing some chaos engineering experiments over there. But let's flash forward a little bit. I know we could spend a lot of time talking about your time at Chef, however, you've been over at PagerDuty for a while now, and you are in the incident response game. You're in that lowering that Mean Time to Identification and Resolution. And that brings that reliability piece back together.
Do you want to talk a little bit about that?
Mandi: One of the things that is interesting to me is, like, watching some of these slower-moving industries as they start to really get on board with cloud, the stairstep of sophistication of the things that they can do in cloud that they didn't have the resources to do when they were using their on-premises data center. And from an operation standpoint, like, being able to say, “All right, well, I'm going from, you know, maybe not bare metal, but I've got, like, some kind of virtualization, maybe some kind of containerization, but like, I also own the spinning disks, or whatever is going on there—and the network and all those things—and I'm putting that into a much more flexible environment that has modern networking, and you know, all these other elastic capabilities, and my scaling and all these things are already built in and already there for me.” And your ability to then widen the scope of your reliability planning across, “Here's what my failure domains used to look like. Here's what I used to have to plan for with thinking about my switching networks, or my firewalls, or whatever else was going on and, like, moving that into the cloud and thinking about all right, well, here's now, this entire buffet of services that I have available that I can now think about when I'm architecting my applications for the cloud.” And that, just, expanded reliability available to you is, I think, absolutely amazing.
Julie: A hundred percent. And then I think just being able to understand how to respond to incidents; making sure that your alerting is working, for example, that's something that we did in that joint workshop, right? We would teach people how to validate their alerting and monitoring, both with PagerDuty and Gremlin through the practice of incident response and of chaos engineering.
And I know that one of the practices at PagerDuty is Failure Fridays, and having those regular game days that are scheduled are so important to ensuring the reliability of the product. I mean, PagerDuty has no maintenance windows, correct?
Mandi: No that—I don't think so, right?
Julie: Yeah. I don't think there's any planned maintenance windows, and how do we make sure for organizations that rely on PagerDuty—
Mandi: Mm-hm.
Julie: —that they are one hundred percent reliable?
Mandi: Right. So, you know, we've got different kinds of backup plans and different kinds of rerouting for things when there's some hiccup in the platform. And for things like that, we have out of band communications with our teams and things like that. And planning for that, having that game day to just be able to say—well, it gives you context. Being able to say, “All right, well, here's this back-end that's kind of wobbly. Like, this is the thing we're going to target with our experiments today.”
And maybe it's part of the account application, or maybe it's part of authorization, or whatever it is; the team that worked on that, you know, they have that sort of niche view, it's a little microcosm, here's a little thing that they've got and it's their little widget. And what that looks like then to the customer, and that viewpoint, it's going to come in from somewhere else. So, you're running a Failure Friday; you're running a game day, or whatever it is, but including your customer service folks, and your front-end engineers, and everyone else so that, you know, “Well, hey, you know, here's what this looks like; here's the customers' report for it.” And giving you that telemetry that is based on customer experience and your actual—what the business looks like when something goes wrong deep in the back end, right, those deep sea, like, angler fish in the back, and figuring out what all that looks like is an incredible opportunity.
Like, just being able to know what's going to happen there, what the interface is going to look like, what things don't load, when things take a long time, what your timeouts look like, did you really even think about that, but they're cascading because it's actually two layers back, or whatever you're working on, like that kind of insight, like, is so valuable for your application engineers as they're improving all the pieces of architecture, whether it's the most front-end user-facing things, or in the deep back-end that everybody relies on.
Julie: Well, absolutely. And I love that idea of bringing in the different folks like the customer service teams, the product managers. I think that's important on a couple of levels because not only are you bringing them into this experience so they're understanding the organization and how folks operate as a whole, but you're building that culture, that failure is acceptable and that we learn from our failures and we make our systems more resilient, which is the entire goal.
Mandi: The goal.
Julie: And you're sharing the learning. When we operate in silos—which even now as much as we talk about how terrible it is to be in siloed teams and how we want to remove silos, it happens. Silos just happen. And when we can break down those barriers, any way that we can to bring the whole organization in, I think it just makes for a stronger organization, a stronger culture, and then ultimately a stronger product where our customers are living.
Mandi: Yeah.
Julie: Now, I really do want to ask you a couple of things for some fun here. But if you were to give one tip, what is your number one tip for better DevOps?
Mandi: Your DevOps is always going to be—like, I'm totally on board with John Willis's CAMS to, like, move to CALMS sort of model, right? So, you've got your culture, your automation, your learning, your metrics, and your sharing.
For better DevOps, I think one of the things that's super important—and, you know, you and I have hashed this out in different things that we've done—we hear about it in other places, is definitely having empathy for the other folks in your organization, for the work that they're doing, and the time constraints that they're under, and the pressures that they're feeling. Part of that then sort of rolls back up to the S part of that particular model, the sharing. Like, knowing what's going on, not—when we first started out years ago doing sort of DevOps consulting through Chef, like, one of the things we would occasionally run into is, like, you'd ask people where their dashboards were, like, how are they finding out, you know, what's going on, and, like, the dashboards were all hidden and, like, nobody had access to them; they were password protected, or they were divided up by teams, like, all this bonkers nonsense.
And I'm like, “You need to give everybody a full view, so that they've all got a 360 view when they're making decisions.” Like you mentioned your product managers as part of, like, being part of your practice; that's absolutely what you want. They have to see as much data as your applications engineers need to see. Having that level of sharing for the data, for the work processes, for the backlog, you know, the user inputs, what the support team is seeing, like, you're getting all of this input, all this information, from everywhere in your ecosystem and you cannot be selfish with it; you cannot hide it from other people.
Maybe it doesn't look as nice as you want it to, maybe you're getting some negative feedback from your users, but pass that around, and you ask for advice; you ask for other inputs. How are we going to solve this problem? And not hide it and feel ashamed or embarrassed. We're learning.
All this stuff is brand new, right?
Like, yeah, I feel old talking about AOL stuff, but, like, at the same time, like, it wasn't that long ago, and we've learned an amazing amount of things in that time period, and just being able to share and have empathy for the folks on your team, and for your users, and the other folks in your ecosystem is super important.
Julie: I agree with that. And I love that you hammer down on the empathy piece because again, when we're working in ones and zeros all day long, sometimes we forget about that. And you even mentioned at the beginning how at AOL, you had such intimate knowledge of these applications, they were so deep to you, sometimes with that I wonder if we forget a little bit about the customer experience because it's something that's so close to us; it's a feature maybe that we just believe in wholeheartedly, but then we don't see our customers using it, or the experience for them is a little bit rockier. And having empathy for what the customer may go through as well because sometimes we just like to think, “Well, we know how it works. You should be able to”—
Mandi: Yes.
Julie: Yes. And, “They're definitely not going to find very unique and interesting ways to break my thing.” [laugh].
Mandi: [laugh]. No, never.
Julie: Never.
Mandi: Never.
Julie: And then you touched on sharing and I think that's one thing we haven't touched on yet, but I do want to touch on a little bit. Because with incident—with incident response, with chaos engineering, with the learning and the sharing, you know, an important piece of that is the postmortem.
Mandi: Absolutely.
Julie: And do you want to talk a little bit about the PagerDuty view, your view on the postmortems?
Mandi: As an application piece, like, as a feature, our postmortem stuff is under review. But as a practice, as a thing that you do, like, a postmortem is an—it should be an active word; like, it's a verb, right?
You hol—and if you want to call it a post-incident review, or whatever, or post-incident retrospective, if you're more comfortable with those words, like that's great, and that's—as long as you don't put a hyphen in postmortem, I don't care. So, like—
Julie: I agree with you. No hyphen—
Mandi: [laugh].
Julie: —please. [laugh].
Mandi: Please, no hyphen. Whatever you want to call that, like, it's an active thing. And you and I have talked a number of times about blamelessness and, like, making sure that what you do with that opportunity, this is—it's a gift, it's a learning opportunity after something happened. And honestly, you probably need to be running them, good or bad, for large things, but if you have a failure that impacted your users and you have this opportunity to sit down and say, all right, here's where things didn't go as we wanted them to, here's what happened, here's where the weaknesses are in our socio-technical systems, whether it was a breakdown in communication, or breakdown in documentation, or, like, we found a bug or, you know, [unintelligible] defect of some kind, like, whatever it is, taking that opportunity to get that view from as many people as possible is super important.
And they're hard, right? And, like, we—John Allspaw, on our podcast, right, last year talked a bit about this. And, like, there's a tendency to sort of write the postmortem and put it on a shelf like it's, like, in a museum or whatever. They are hopefully, like, they're learning documents that are things that maybe you have your new engineers sort of review to say, “Here's a thing that happened to us. What do you think about this?” Like, maybe having, like, a postmortem book club or something internally so that the teams that weren't maybe directly involved have a chance to really think about what they can learn from another application's learning, right, what opportunities are there for whatever has transpired?
So, one of the things that I will say about that is like they aren't meant to be write-only, right? [laugh]. They're—
Julie: Yeah.
Mandi: They're meant to be an actual living experience and a practice that you learn from.
Julie: Absolutely. And then once you've implemented those fixes, if you've determined the ROI is great enough, validate it.
Mandi: Yes.
Julie: Validate and validate and validate. And folks, you heard it here first on Break Things on Purpose, but the postmortem book club by Mandi Walls.
Mandi: Yes. I think we should totally do it.
Julie: I think that's a great idea. Well, Mandi, thank you. Thank you for taking the time to talk with us. Real quick before we go, did you want to talk a little bit about PagerDuty and what they do?
Mandi: Yes, so Page—everyone knows PagerDuty; you have seen PagerDuty. If you haven't seen PagerDuty recently, it's worth another look. It's not just paging anymore. And we're working on a lot of things to help people deal with unplanned work, sort of all the time, right, or thinking about automation. We have some new features that integrate more with our friends at Rundeck—PagerDuty acquired Rundeck last year—we're bringing out some new integrations there for Rundeck actions and some things that are going to be super interesting for people.
I think by the time this comes out, they'll have been in the wild for a few weeks, so you can check those out. As well as, like, getting better insight into your production platforms, like, with a service graph and other insights there. So, if you haven't looked at PagerDuty in a while or you think about it as being just a place to be annoyed with alerts and pages, definitely worth revisiting to see if some of the other features are useful to you.
Julie: Well, thank you. And thanks, Mandi, and looking forward to talking to you again in the future. And I hope you have a wonderful day.
Mandi: Thank you, Julie.
Thank you very much for having me.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.
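A note on two numbers from this conversation. Mandi traces "five nines" back to telephone dial tone, and she describes an outage where the servers did not have enough threads once requests started taking longer. The sketch below works through both as plain arithmetic. It is an illustration only: the function names and traffic figures are hypothetical, not AOL's actual configuration, and the thread estimate assumes Little's law (requests in flight = arrival rate times time per request), which the episode does not name.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def threads_needed(requests_per_second: float, seconds_per_request: float) -> int:
    """Little's law estimate of requests in flight at once, rounded up.

    Hypothetical helper: real sizing would also add headroom for bursts.
    """
    return math.ceil(requests_per_second * seconds_per_request)


if __name__ == "__main__":
    # Five nines leaves a little over five minutes of downtime per year.
    print(f"{downtime_minutes_per_year(0.99999):.2f} minutes/year")

    # Made-up traffic: same request rate, but each request takes 4x as long.
    print(threads_needed(200, 0.25))
    print(threads_needed(200, 1.0))
```

With these made-up numbers, a release that makes each request take four times as long needs four times as many threads to serve the same traffic, which is the shape of the failure Mandi describes.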

In this episode, we cover:
00:00:00 - Introduction
00:05:00 - Itiel's Background in Engineering
00:08:25 - Improving Kubernetes Troubleshooting
00:11:45 - Improving Team Collaboration
00:14:00 - Outro
Links:
Komodor: https://komodor.com/
Twitter: https://twitter.com/Komodor_com
Transcript
Jason: Welcome back to another episode of Build Things On Purpose, a part of the Break Things On Purpose podcast where we talk with people who have built really cool software or systems. Today with us, we have Itiel Shwartz who is the CTO of a company called Komodor. Welcome to the show.
Itiel: Thanks, happy to be here.
Jason: If I go to Komodor's website it really talks about debugging Kubernetes, and as many of our listeners know Kubernetes and complex systems are a difficult thing. Talk to me a little bit more—tell me what Komodor is. What does it do for us?
Itiel: Sure. So, I don't think I need to tell our listeners—your listeners that Kubernetes looks cool, it's very easy to get started, but once you're into it and you have a big company with complex, like, micros—it doesn't have to be big, even, like, medium-size complex system company where you're starting to hit a couple of walls or, like, issues when trying to troubleshoot Kubernetes.
And that usually is due to the nature of Kubernetes which makes making complex systems very easy. Meaning you can deploy in multiple microservices, multiple dependencies, and everything looks like a very simple YAML file. But at the end of the day, when you have an issue, when one of the pods is starting to restart and you try to figure out, like, why the hell is my application not running as it should, you need to use a lot of different tools, methodologies, knowledge that most people don't really have in order to solve the issue.
So, Komodor focuses on making the troubleshooting in Kubernetes an easy and maybe—may I dare say even fun experience by harnessing our knowledge in Kubernetes and allowing our users to get that digest view of the world.
And so usually when you speak about troubleshooting, the first thing that comes to mind is issues are caused due to changes. And the change might be deploying Kubernetes, it can be a ConfigMap that changed, a secret that changed, or even some feature flag, or, like, LaunchDarkly feature that was just turned on and off. So, what Komodor does is we track and we collect all of the changes that happen across your entire system, and we put, like, for each one of your services a [unintelligible] that includes how did the service change over time and how did it behave? I mean, was it healthy? Was it unhealthy? Why wasn't it healthy?
So, by collecting the data from all across your system, plus we sit on top of Kubernetes so we know the state of each one of the pods running in your application, we give our users the ability to understand how did the system behave, and once they have an issue we allow them to understand what changes might have caused this. So, instead of bringing down dozens of different tools, trying to build your own mental picture of what the world looks like, you just go into Komodor and see everything in one place.
I would say that even more than that, once you have an issue, we try to give our best efforts on helping to understand why did it happen. We know Kubernetes, we saw a lot of issues in Kubernetes. We don't try complex AI solution or something like that, but using our very deep knowledge of Kubernetes, we give our users, FYI, your pods that are unhealthy, but the node that they are running on just got restarted or is having disk pressure.
You just upgraded your Kubernetes version or things like that. So, basically we give you everything you need in order to troubleshoot an issue in Kubernetes, and we give it to you in a very nice and informative way. So, our users just spend less time troubleshooting and more time developing features.
Jason: That sounds really extremely useful, at least from my experience, in operating things on Kubernetes. I'm guessing that this all stemmed from your own experience. You're not typically a business guy, you're an engineer. And so it sounds like you were maybe scratching your own itch. Tell us a little bit more about your history and experience with this?
Itiel: I studied computer science, I started working for eBay and I was there in the infrastructure team. From there I joined two Israeli startups and—I learned that the thing that I really liked or do quite well is to troubleshoot issues. I was in very, very, like, production-downtime-sensitive systems. Systems where, when the system is down, it just costs the business a lot of money.
So, in these kinds of systems, you try to respond really fast to the incidents, and you spend a lot of time monitoring the system so once an issue occurs you can fix it as soon as possible. So, I developed a lot of internal tools for the companies I worked for that did something very similar: allow you, once you have an issue, to understand the root cause, or at least to get a better understanding of what the world looks like in those companies.
And we started Komodor because I also try to give advice to people. I really like Kubernetes. I liked it, like, a couple of years ago before it was that cool, and people just consulted with me.
And I saw the lack of knowledge and the lack of skills that most people that are running Kubernetes have, and I saw, like—I'd have to say it's like giving, like, a baby a gun.So, giving an operation person that doesn't really understand Kubernetes tell him, “Yeah, you can deploy everything and everything is a very simple YAML. You want a load balancer, it's easy. You want, like, a persistent storage, it's easy. Just install like—Helm install Postgres or something like that.” I installed quite a lot of, like, Helm-like recipes, GA, highly available. But things are not really highly available most of the time.So, it's definitely scratching my own itch. And my partner, Ben, is also a technical guy. He was in Google where they have a lot of Kubernetes experience. So, together both of us felt the pain. We saw that as more and more companies moved to Kubernetes, the pain became just stronger. And as the shift-left movement is also like taking off and we see more and more dev people that are not necessarily that technical that are expected to solve issues, then again we saw an issue.So, what we see is companies moving to Kubernetes and they don't have the skills or knowledge to troubleshoot Kubernetes. And then they tell their developers, “You are now responsible for the production. You are deploying? You should troubleshoot,” and the developers really don't know what to do. And we came to those companies and basically it makes everything a lot easier.You have any issue in Kubernetes? No issue, like, no issue. And no problem go to Komodor and understand what is the probable root cause. See what's the status? Like, when did it change? When was it last restarted? When was it unhealthy before today? Maybe, like, an hour ago, maybe a month ago. 
So, Komodor just gives you all of this information in a very informative way.Jason: I like the idea of pulling everything into one place, but I think that obviously begs the question: if we're pulling in this information we need to have good information to begin with. I'm interested in your thoughts of if someone were to use Komodor or just want to improve their visibility into troubleshooting Kubernetes, what are some tips or advice that you'd have for them in maybe how to set up their monitoring, or how to tag their changes, things like that? What does that look like?Itiel: I will say the first thing is using more metadata and tagging capabilities across the board. It can be on top of the monitors, the system, the services, like, you name it, you should do it. Once an alert is triggered, you don't necessarily have to go to the perfect playbook because it doesn't really exist. You should understand what's the relevant impact, what system it impacted, and who is the owner, and who should you wake up, like, now or who should look at it?So, spending the time tagging some of the alerts and resources in Kubernetes is super valuable. It's not that hard, but by doing so you just reduced the mental capacity needed in order to troubleshoot an issue. More than that, here in Komodor we read of this metadata label stacks, and we harness it for our own benefits. So, it is best practice to do so and Komodor also utilize this data.And for example, for an alert, say like, the relevant team name that is responsible, and for each service in Kubernetes write the team that owns this service. And this way you can basically understand what teams are responsible for what services or issues. So, this is the number one tip or trick. And the second one is just spend time on exposing these data. 
You can use Komodor I think, like, it's the best solution, but even if not, try to have those notification every time something change.Write those, like, web hooks to which one of your resources and let the team know that things change. If not, like, what we see in companies is something break, no one really know what changed, and in the end of the day they are forced to go into Slack and doing, like, here—someone changed something that might cause production break. And if so, please fix it. It's not a good place to be. If you see yourself asking questions over Slack, you have an issue with the system monitoring and observability.Jason: That's a great point because I feel like a lot of times we do that. And so you look back into your CI/CD logs, like, what pushes are made, what deploys are made. You're trying to parse out, like, which one was it? Especially in a high-velocity organization of multiple changes and which one actually did that breaking.Itiel: We see it across the board. There are so many changes, so many dependencies. Because microservice A talks with microservice B that speak with microservice C using SQS or something like that. And then things break and no one know what is really happening. Especially the developers, they have no idea what is happening. But most of the time also the DevOps themselves.Jason: I think that's a great point of, sort of, that shared confusion. As we've talked about DevOps and that breaking down of the walls between developers and operations, there was always this, “Well, you should work together,” and there is this notion now of we're working together but nobody knows what's going on.As we talk about this world of sharing, what are some of your advice as somebody who's helped both developers and operations? Aside from getting that shared visibility for troubleshooting, do you have any tips for collaborating better to understand as a team how things are functioning?Itiel: I have a couple of thoughts on this area. 
The first thing is you must have the alignment. Both the DevOps, or operation and the developers need to understand they are in this together. And this, like, base point in other organization you see they struggle. Like, the developers are like, yeah, I don't really need—like, it's the ops problem if production is down, and the ops are, like, angry at the devs and say they don't understand anything so they shouldn't be responsible for issues in production.So, first of all, let's create the alignment. The organization needs to understand that both the dev and the ops team need to take shared responsibility over the system and over the troubleshooting process. Once this very key pillar is out of the way, I will say that adding more and more tools and making sure that those tools can be shared between the ops and the dev team.Because a lot of the times we see tools that are designed for the DevOps, and a developer don't really understand what is happening here, what are those numbers, and basically how to use them. So, I think making sure the tools fit both personas is a very crucial thing. And the last thing is learning from past incidents. You are going to have other incidents, other issues. The question is, do you understand how we improve the next time this incident or a similar incident will happen? What processes and what tools are missing in the link between the DevOps and the system to optimize it. Because it's not after you snap your finger and everything works as expected.It is an iterative process and you must have, like, the state of mind of, okay, things are going to get better, or they are going to get better, and so on. So, I think this is the third, like, three most important things. One make sure you have that alignment, two, create tools that can be shared across different teams, and three, learn from past incidents and understand this is like a marathon. It's not a sprint.Jason: Those are excellent tips. 
So, for our listeners, if you would like a tool that can be shared between devs and DevOps or ops teams, and you're interested in Komodor... Itiel, tell us where folks can find more info about Komodor and learn more about how to troubleshoot Kubernetes.
Itiel: So, you can find us on Twitter, but basically on komodor.com. Yeah, you can sign up for a free trial. The installation is, like, 10 seconds or something like that. It's basically Helm install, and it really works. We just finished, like, a very big round, so we are growing really fast and we have more and more customers. So, we'll be happy to hear your use case and to see how we can accommodate your needs.
Jason: Awesome. Well, thanks for being on the show. It's been a pleasure to have you.
Itiel: Thank you. Thank you. It was super fun being here.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover:
00:00:00 - Introduction
00:02:45 - Adopting the Cloud
00:08:15 - POC Process
00:12:40 - Infrastructure Team Building
00:17:45 - “Disaster Roleplay”/Communicating to the Non-Technical Side
00:20:20 - Leadership
00:22:45 - Tomas' Horror Story/Dashboard Organization
00:29:20 - Outro
Links:
Productboard: https://www.productboard.com
Scaling Teams: https://www.amazon.com/Scaling-Teams-Strategies-Successful-Organizations/dp/149195227X
Seeking SRE: https://www.amazon.com/Seeking-SRE-Conversations-Running-Production/dp/1491978864/
Transcript
Jason: Welcome to Break Things on Purpose, a podcast about failure and reliability. In this episode, we chat with Tomas Fedor, Head of Infrastructure at Productboard. He shares his approach to testing and implementing new technologies, and his experiences in leading and growing technical teams.
Today, we've got with us Tomas Fedor, who's joining us all the way from the Czech Republic. Tomas, why don't you say hello and introduce yourself?
Tomas: Hello, everyone. Nice to meet you all, and my name is Tomas, or call me Tom. And I've been working for Productboard for the past two and a half years as an infrastructure leader. And all the time, my experience was in the areas of DevOps, and for the most recent three or four years it has been about management within infrastructure teams. What I'm passionate about, technology-wise, is mainly cloud: mostly Amazon Web Services, Kubernetes, Infrastructure as Code such as Terraform, and recently, I also jumped towards security compliance, such as SOC 2 Type 2.
Jason: Interesting. So, a lot of passions there, things that we actually love chatting about on the podcast. We've had other guests from HashiCorp, so we've talked plenty about Terraform. And we've talked about Kubernetes with some folks who are involved with the CNCF. I'm curious, with your experience, how did you first dive into these cloud-native technologies and adopting the cloud? 
Is that something you went straight for, or is that something you transitioned into?Tomas: I actually slow transition to cloud technologies because my first career started at university when I was like, say, half developer and half Unix administrator. And I had experience with building very small data center. So, those times were amazing to understand all the hardware aspects of how it's going to be built. And then later on, I got opportunity to join a very famous startup at Czech Republic [unintelligible 00:02:34] called Kiwi.com [unintelligible 00:02:35]. And that time, I first experienced cloud technologies such as Amazon Web Services.Jason: So, as you adopted Amazon, coming from that background of a university and having physical servers that you had to deal with, what was your biggest surprise in adopting the cloud? Maybe something that you didn't expect?Tomas: So, that's great question, and what comes to my mind first, is switching to completely different [unintelligible 00:03:05] because during my university studies and career there, I mostly focused on networking [unintelligible 00:03:13], but later on, you start actually thinking about not how to build a service, but what service you need to use for your use case. And you don't have, like, one service or one use case, but you have plenty of services that can suit your needs and you need to choose wisely. So, that was very interesting, and it needed—and it take me some time to actually adopt towards new thinking, new mindset, et cetera.Jason: That's an excellent point. And I feel like it's only gotten worse with the, “How do you choose?” If I were to ask you to set up a web service and it needs some sort of data store, at this point you've got, what, a half dozen or more options on Amazon? [laugh].Tomas: Exactly.Jason: So, with so many services on providers like Amazon, how do you go about choosing?Tomas: After a while, we came up with a thing like RFCs. 
That's like ‘Request For Comments,' where we tried to sum up all the goals, and all the principles, and all the problems and challenges we try to tackle. And with that, we also tried to validate all the alternatives. And once you went through all these information, you tried to sum up all the possible solutions. You typically had either one or two options, and those options were validated with all your team members or the whole engineering organization, and you made the decision then you try to run POC, and you either are confirmed, yeah this is the technology, or this is service you need and we are going to implement it, or you revised your proposal.Jason: I really like that process of starting with the RFC and defining your requirements and really getting those set so that as you're evaluating, you have these really stable ideas of what you need and so you don't get swayed by all of the hype around a certain technology. I'm curious, who is usually involved in the RFC process? Is it a select group in the engineering org? Is it broader? How do you get the perspectives that you need?Tomas: I feel we have very great established process at Productboard about RFCs. It's transparent to the whole organization, that's what I love the most. The first week, there is one or two reporters that are mainly focused on writing and summing up the whole proposal to write down goals, and also non-goals because that is going to define your focus and also define focus of reader. And then you're going just to describe alternatives, possible options, or maybe to sum up, “Hey, okay, I'm still unsure about this specific decision, but I feel this is the right direction.” Maybe I have someone else in the organization who is already familiar with the technology or with my use case, and that person can help me.So, once—or we call it a draft state, and once you feel confident, you are going to change the status of RFC to open. 
The time is open to feedback to everyone, and they typically geared, like, two weeks or three weeks, so everyone can give a feedback. And you have also option to present it on engineering all-hands. So, many engineers, or everyone else joining the engineering all-hands is aware of this RFC so you can receive a lot of feedback. What else is important to mention there that you can iterate over RFCs.So, you mark it as resolved after through two or three weeks, but then you come up with a new proposal, or you would like to update it slightly with important change. So, you can reopen it and update version there. So, that also gives you a space to update your RFC, improve the proposal, or completely to change the context so it's still up-to-date with what you want to resolve.Jason: I like that idea of presenting at engineering all-hands because, at least in my experience, being at a startup, you're often super busy so you may know that the RFC is available, but you may not have time to actually read through it, spend the time to comment, so having that presentation where it's nicely summarized for you is always nice. Moving from that to the POC, when you've selected a few and you want to try them out, tell me more about that POC process. What does that look like?Tomas: So typically, in my infrastructure team, it's slightly different, I believe, as you have either product teams focus on POCs, or you have more platform teams focusing on those. So, in case of the infrastructure team, we would like to understand what code is actually going to be about because typically the infrastructure team has plenty of services to be responsible for, to be maintained, and we try to first choose, like, one specific use case and small use case that's going to suit the need.For instance, I can share about implementation of HashiCorp Vault, like our adoption. We leveraged firstly only key-value engine for storing secrets. 
And what was important to understand here, whether we want to spend hours of building the whole cluster, or we can leverage their cloud service and try to integrate it with one of our services. And we need to understand what service we are going to adopt with Vault.So, we picked cloud solution. It was very simple, the experience that were seamless for us, we understood what we needed to validate. So, is developer able to connect to Vault? Is application able to connect to Vault? What roles does it offer? Was the difference for cloud versus on-premise solution?And at the end, it's often the cost. So, in that case, POC, we spin up just cloud service integrated with our system, choose the easiest possible adaptable service, run POC, validate it with developers, and provide all the feedback, all the data, to the rest of the engineering. So, that was for us, some small POC with large service at the end.Jason: Along with validating that it does what you want it to do, do you ever include reliability testing in that POC?Tomas: It is, but it is in, like, let's say, it's in a later stage. For example, I can again mention HashiCorp Vault. Once we made a decision to try to spin up first on-premise cluster, we started just thinking, like, how many master nodes do we need to have? How many availability zones do we need to have? So, you are going to follow quorum?And we are thinking, “Okay, so what's actually the reliability of Amazon Web Services regions and their availability zones? What's the reliability of multi-cross-region? And what actually the expectations that is going to happen? And how often they happen? Or when in the past, it happened?”So, all those aspects were considered, and we ran out that decision. Okay, we are still happy with one region because AWS is pretty stable, and I believe it's going to be. And we are now successfully running with three availability zones, but before we jumped to the conclusion of having three availability zones, we run several tests. 
So, we make sure that in case one availability zone being down, we are still fully able to run HashiCorp Vault cluster without any issues.Jason: That's such an important test, especially with something like HashiCorp Vault because not being able to log into things because you don't have credentials or keys is definitely problematic.Tomas: Fully agree.Jason: You've adopted that during the POC process, or the extended POC process; do you continue that on with your regular infrastructure work continuing to test for reliability, or maybe any chaos engineering?Tomas: I actually measure something about what we are working on, like, what we have so far improved in terms of post-mortem process that's interesting. So, we started two-and-a-half year ago, and just two of us as infrastructure engineers. At the time, there was only one incident response on-call team, our first iteration within the infrastructure team was with migration from Heroku, where we ran all our services, to Amazon Web Services. And that time, we needed to also start thinking about, okay, the infrastructure team needs to be on call as well. So, that required to update in the process because until then, it works great; you have one team, people know each other, people know the whole stack. Suddenly, you are going to add new people, you're going to add new people a separate team, and that's going to change the way how on-call should be treated, and how the process should look like.You may ask why. You have understanding within the one team, you understand the expectations, but then you have suddenly different skill set of people, and they are going to be responsible for different part of the technical organization, so you need to align the expectation between two teams. And that was great because guys at Productboard are amazing, and they are always helpful. So, we sat down, we made first proposal of how new team is going to work like, what are going to be responsibilities. 
We took inspirations from the already existing on-call process, and we just updated it slightly.And we started to run with first test scenarios of being on call so we understand the process fully. Later on, it evolved to more complex process, but it's still very simple. What is more complex: we have more teams that's first thing being on call; we have better separation of all the alerts, so you're not going to route every alert to one team, but you are able to route it to every team that's responsible for its service; the team have also prepared a set of runbooks so anyone else can easily follow runbook and fix the incident pretty easily, and then we also added section about post-mortems, so what are our expectations of writing down post-mortem once incident is resolved.Jason: That's a great process of documenting, really—right—documenting the process so that everybody, whether they're on a different team and they're coming over or new hires, particularly, people that know nothing about your established practices can take that runbook and follow along, and achieve the same results that any other engineer would.Tomas: Yeah, I agree. And what was great to see that once my team grew—we are currently five and we started two—we saw excitement of the team members to update the process so everybody else we're going to join the on-call is going to be excited, is going to take it as an opportunity to learn more. So, we added disaster roleplay, and that section talks about you are new person joining on-call team, and we would like to make sure you are going to understand all the processes, all the necessary steps, and you are going to be aligned with all the expectations. But before you will actually going to have your first alerts of on-call, we would like to try to run roleplay. Imagine what a HashiCorp Vault cluster is going down; you should be the one resolving it. 
So, what are the first steps, et cetera?And that time you're going to realize whatever is being needs to be done, it's not only from a technical perspective, such as check our go to monitoring, check runbook, et cetera, but also communication-wise because you need to communicate not only with your shadowing buddy, but you also need to communicate internally, or to the customers. And that's going to change the perspective of how an incident should be handled.Jason: That disaster roleplay sounds really amazing. Can you chat a little bit more about the details of how that works? Particularly you mentioned engaging the non-technical side—right—of communication with various people. Does the disaster roleplay require coordinating with all those people, or is it just a mock, you would pretend to do, but you don't actually reach out to those people during this roleplay?Tomas: So, we would like to also combine the both aspects. We would like to make sure that person understands all the communication channels that are set within our organization, and what they are used for, and then we would like to make sure that that person understand how to involve other engineers within the organization. For instance, what was there the biggest difference is that you have plenty of options how to configure assigning or creating an alert. And so for those, you may have a different notification settings. And what happened is that some of the people have settings only for newly created alert, but when you made a change of assigned person of already existing alert, someone else, it might happen that that person didn't notice it because the notification setting was wrong. So, we encountered even these kind of issues and we were able to fix it, thanks to disaster roleplay. 
So, that was amazing to find out.
Jason: That's one of my favorite things to do when we're using chaos engineering, similar to the disaster roleplay: really checking those incident response processes, and validating those alerts is huge. There are so many times that I've found that we thought that someone would be alerted for some random thing, and it turns out that nobody knew anything was going on. I love that you included that into your disaster roleplay process.
Tomas: Yeah, it was also a great experience for all the engineers involved. Unfortunately, we ran it only within our team, but I hope we are going to have a chance to involve all other engineering on-call teams, so the onboarding experience to the engineering on-call teams is going to improve and is going to be amazing.
Jason: So, one of the things that I'm really interested in is, you've gone from being a DevOps engineer, an SRE individual contributor role, and now you're leading a small team. I think a lot of folks, as they look at their career, and I think more people are starting to become interested in this, ask: what does that progression look like? This is sort of a change of subject, but I'm interested in hearing your thoughts on what are the skills that you picked up and have used to become an effective technical leader within Productboard? What's some of that advice that our listeners, as individual contributors, can start to gain in order to advance where they're going with their own careers?
Tomas: Firstly, it's important to understand what makes you passionate in your career, whether it's working with people, understanding their needs and their future, or whether you would like to stay on track as an individual contributor and enlarge your scope of responsibilities towards leading more technically complex initiatives that are going to take a long time to be implemented. 
In the case of infrastructure, or in the case of platform leaders, I would say the position of manager or technical leader also requires certain technical knowledge so you can still be in close touch with your team or with your most senior engineers, so you can set the goals and set the strategy clearly. But still, it's important to be, let's say, a people person and be able to listen because in that case, people are going to be more open to you, and you can start helping them, and you can start making their dreams true and achievable.
Jason: Making their dreams true. That's a great take on this idea because I feel like so many times, having done infrastructure work, that you start to get a mindset of maybe that people just are making demands of you, all the time. And it's sometimes hard to keep that perspective of working together as a team and really trying to excel to give them a platform that they can leverage to really get things done. We were talking about disaster roleplaying, and that naturally leads to a question that we like to ask of all of our guests and that's, do you have any horror stories from your career about an incident, some horror story or outage that you experienced and what you've learned from it?
Tomas: I have one, and it actually happened at the beginning of my career as a DevOps engineer. What is interesting here is that it was one of the toughest incidents I experienced. It happened after midnight. So, at the time I was still new to the company, and we received an alert informing us about too many 502 and 504 errors returned from the API. 
At the time, the API processed thousands of requests per second, and the incident had a huge impact on the services we were offering.
And as I was shadowing my on-call buddy, I tried to check our main alerting channel, see what's happening, what's going on there, how can I help, and I started with checking the monitoring system, reviewing all the reports from the engineers being on call, and I initiated the investigation on my own. I realized that something is wrong or something is not right, and I realized I was just confused and I wanted to sleep, so it took me a while to get back on track. So, I made a side note, like, how can I get my brain working as it does during the day? And then I got back to the incident resolution process.
So, it was really hard for me to start because I didn't know what [unintelligible 00:24:27] you knew about the channel, you knew about your engineers working on the resolution, but there were plenty of different communication funnels. Like, some of the engineers were deep-focused on their own investigation, and some of them were on call. And we needed to provide regular updates to the customers and internally as well. I had that inner feeling of let's share something, but I realized I just can't drop a random message because the message with all the information should have a certain format and should have certain information. But I didn't know what kind of information should be there.
So, I tried to ping someone, so, “Hey, can you share something?” And in the meantime, actually, more other people sent me direct messages. And I saw there were a lot of different tracks of people who tried to solve the incident, who tried to provide the status, but we were not aligned. So, this all showed me how important it is to have a proper communication funnel set up. And we got lucky to actually end up in one channel, and we got lucky to resolve the incident pretty quickly.
And what else I learned, and what I would recommend, is to make sure you know where to look. 
I know it's a pretty obvious sentence, but once your company has plenty of dashboards and you need to find one specific metric, sometimes it looks like mission impossible.
Jason: That's definitely a good lesson learned and feeds back to that disaster roleplay, practicing how you do those communications, understanding where things need to be communicated. You mentioned that it can be difficult to find a metric within a particular dashboard when you have so many. Do you have any advice for people on how to structure their dashboards, or name their dashboards, or organize them in a certain way to make it easier to find the metric or the information that you're looking for?
Tomas: I would have a different approach, and that is to have a basic dashboard that provides you the SLOs of all the services you have in the company. So, we understand firstly what service actually impacts the overall stability or reliability. So, that's my first advice. And then you should be able to either click on the specific service, and that should redirect you to its dashboard, or you're going to have starred your favorite dashboards. So, I believe the most important thing is really to have one main dashboard where you have all the services and their stability resourced, and then you have the option to look.
Jason: Yeah, when you have one main dashboard, you're using that as basically the starting point, and from there, you can branch out and dive deeper, I guess, into each of the services.
Tomas: Exactly, exactly true.
Jason: I like that approach. And I think that a lot of modern dashboarding or monitoring systems now, the nice thing is that they have that ability, right, to go from one particular dashboard or graphic and have links out to the other information, or just click on the graph and it will show you the underlying host dashboard or node dashboard for that metric, which is really, really handy.
Tomas: And I love the connection with other monitoring services, such as application monitoring. 
That gives you so much insight and when it's even connected with your work management tool is amazing so you can have all the important information in one place.Jason: Absolutely. So, oftentimes we talk about—what is it—the three pillars of observability, which I know some of our listeners may hate that, but the idea of having metrics and performance monitoring/APM and logs, and just how they all connect to each other can really help you solve a lot, or uncover a lot of information when you're in the middle of an incident. So Tomas, thanks for being on the show. I wanted to wrap up with one more question, and that's do you have any shoutouts, any plugs, anything that you want to share that our listeners should go take a look at?Tomas: Yeah, sure. So, as we are talking about management, I would like to promote one book that helped make my career, and that's Scaling Teams. It's written by Alexander Grosse and David Loftesness.And another one book is from Google, they have, like, three series, one of those is Seeking SRE, and I believe other parts are also useful to be read in case you would like to understand whether your organization needs SRE team and how to implement it within organization, and also, technically.Jason: Those are two great resources, and we'll have those linked in the show notes on the website. So, for anybody listening, you can find more information about those two books there. Tomas, thanks for joining us today. It's been a pleasure to have you.Tomas: Thanks. Bye.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: 00:00:00 - Introduction 00:03:20 - VMWare Tanzu 00:07:50 - Gustavo's Career in Security 00:12:00 - Early Days in Chaos Engineering 00:16:30 - Catzilla 00:19:45 - Expanding on SRE 00:26:40 - Learning from Customer Trends 00:29:30 - Chaos Engineering at VMWare 00:36:00 - Outro Links: Tanzu VMware: https://tanzu.vmware.com GitHub for SREDocs: https://github.com/google/sredocs E-book on how to start your incident lifecycle program: https://tanzu.vmware.com/content/ebooks/establishing-an-sre-based-incident-lifecycle-program Twitter: https://twitter.com/stratus TranscriptJason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems. In this episode, Gustavo Franco, a senior engineering manager at VMware joins us to talk about building reliability as a product feature, and the journey of chaos engineering from its place in the early days of Google's disaster recovery practices to the modern SRE movement. Thanks, everyone, for joining us for another episode. Today with us we have Gustavo Franco, who's a senior engineering manager at VMware. Gustavo, why don't you say hi, and tell us about yourself.Gustavo: Thank you very much for having me. Gustavo Franco; as you were just mentioning, I'm a senior engineering manager now at VMware. So, recently co-founded the VMware Tanzu Reliability Engineering Organization with Megan Bigelow. It's been only a year, actually. And we've been doing quite a bit more than SRE; we can talk about like—we're kind of branching out beyond SRE, as well.Jason: Yeah, that sounds interesting. For folks who don't know, I feel like I've seen VMware Tanzu around everywhere. It just suddenly went from nothing into this huge thing of, like, every single Kubernetes-related event, I feel like there's someone from VMware Tanzu on it. 
So, maybe as some background, give us some information; what is VMware Tanzu?Gustavo: Kubernetes is sort of the engine, and we have a Kubernetes distribution called Tanzu Kubernetes Grid. So, one of my teams actually works on Tanzu Kubernetes Grid. So, what is VMware Tanzu? What this really is, is what we call a modern application platform, really an end-to-end solution. So, customers expect to buy not just Kubernetes, but everything around, everything that comes with giving the developers a platform to write code, to write applications, to write workloads.So, it's basically the developer at a retail company or a finance company, they don't want to run Kubernetes clusters; they would like the ability to, maybe, but they don't necessarily think in terms of Kubernetes clusters. They want to think about workloads, applications. So, VMware Tanzu is an end-to-end solution, and the engine in there is Kubernetes.Jason: That definitely describes at least my perspective on Kubernetes: I love running Kubernetes clusters, but at the end of the day, I don't want to have to evaluate every single CNCF project and all of the other tools that are required in order to actually maintain and operate a Kubernetes cluster.Gustavo: I was just going to say, and we acquired Pivotal a couple of years ago, so that brought a ton of open-source projects, such as the Spring Framework. So, for Java developers, I think it's really cool, too, just being able to worry about development and the Java layer and a little bit of the reliability, chaos engineering perspective. So, it kind of really gives me full tooling, the ability to have common libraries. It's so important for reliability engineering and chaos engineering as well, to give people this common surface that we can actually use to inject faults, potentially, or even just define standards.Jason: Excellent point of having that common framework in order to do these reliability practices. So, you've explained what VMware Tanzu is.
Tell me a bit more about how SRE fits in with VMware Tanzu?Gustavo: Yeah, so one thing that happened the past few years, the SRE organization grew beyond SRE. We're doing quite a bit of horizontal work, so SRE being one of them. So, just an example, I got to charter a compliance engineering team and one team that we call ‘Customer Zero.' I would call them partially the representatives of growth, and then quote-unquote, “Customer problems, customer pain”, and things that we have to resolve across multiple teams. So, SRE is one function that clearly you can think of.You cannot just think of SRE on a product basis, but you think of SRE across multiple products because we're building a platform with multiple pieces. So, it's kind of like putting the building blocks together for this platform. So then, of course, we're going to have to have a team of specialists, but we need an organization of generalists, so that's where SRE and this broader organization comes in.Jason: Interesting. So, it's not just we're running a platform, we need our own SREs, but it sounds like it's more of a group that starts to think more about the product itself and maybe even works with customers to help their reliability needs?Gustavo: Yeah, a hundred percent. We do have SRE teams that invest the majority of their time running SaaS, so running Software as a Service. So, one of them is the Tanzu Mission Control. It's purely SaaS, and what TMC, Tanzu Mission Control, does is allow the customers to run Kubernetes anywhere. So, if people have Kubernetes on-prem or they have Kubernetes on multiple public clouds, they can use TMC to be that common management surface, both API and web UI, across Kubernetes, really anywhere they have Kubernetes. So, that's SaaS.But for TKG SRE, that's a different problem. We don't currently have a TKG SaaS offering, so customers are running TKG on-prem or on public cloud themselves. So, what does the TKG SRE team do?
So, that's one team that actually [unintelligible 00:05:15] to me, and they are working directly improving the reliability of the product. So, we build reliability as a feature of the product.So, we build a reliability scanner, which is a [unintelligible 00:05:28] plugin. It's open-source. I can give you more examples, but that's the gist of it, of the idea that you would hire security engineers to improve the security of a product that you sell to customers to run themselves. Why wouldn't you hire SREs to do the same to improve the reliability of the product that customers are running themselves? So, kind of, SRE beyond SaaS, basically.Jason: I love that idea because I feel like a lot of times in organizations that I talk with, SRE really has just been a renamed ops team. And so it's purely internal; it's purely thinking about we get software shipped to us from developers and it's our responsibility to just make that run reliably. And this sounds like it is that complete embrace of the DevOps model of breaking down silos and starting to move reliability, thinking of it from a developer perspective, a product perspective.Gustavo: Yeah. A lot of my work is spent on making analogies with security, basically. One example, several of the SREs in my org, yeah, they do spend time doing PRs with product developers, but also they do spend a fair amount of time doing what we call in a separate project right now—we're just about to launch something new—a reliability risk assessment. And then you can see the parallels there. Where like security engineers would probably be doing a security risk assessment or to look into, like, what could go wrong from a security standpoint?So, I do have a couple engineers working on reliability risk assessment, which is, what could go wrong from a reliability standpoint? What are the… known pitfalls of the architecture, the system design that we have? How does the architectural work looks like of the service? 
And yeah, what are the outages that we know already that we could have? So, if you have a dependency on, say, a file on a CDN, yeah, what if the CDN fails?It's obvious and I know most of the audience will be like, “Oh, this is obvious,” but, like, are you writing this down on a spreadsheet and trying to stack-rank those risks? And after you stack-rank them, are you then mitigating, going top-down, look for—there was an SREcon talk by [Matt Brown 00:07:32], a former colleague of mine at Google, it's basically a “know your enemy” tech talk at SREcon. He talks about how SRE needs to have a more conscious approach to reliability risk assessment. So, I really embraced that, and we embraced that at VMware. The SRE work that I do comes from a little bit of my beginnings or my initial background of working in security.Jason: I didn't actually realize that you worked in security, but I was looking at your LinkedIn profile and you've got a long career doing some really amazing work. So, you said you were in security. I'm curious, tell us more about how your career has progressed. How did you get to where you are today?Gustavo: Very first job, I was 16. There was this group of sysadmins at the first internet service provider in Brazil. One of them knew me from BBS, Bulletin Board Systems, and they, you know, were getting hacked, left and right. So, this guy referred me, and he referred me saying, “Look, it's this kid. He's 16, but he knows his way around this security stuff.”So, I show up, they interview me. I remember one of the interview questions; it's pretty funny. They asked me, “Oh, what would you do if we asked you to go and actually physically grab the routing table from AT&T?” It's just, like, a silly question and I told them, “Uh, that's impossible.” So, I kind of told them the gist of what I knew about routing, and it was impossible to physically get a routing table.For some reason, they loved that. I was the only candidate that could be telling them, “No.
I'm not going to do it because it makes no sense.” So, they hired me. And the security work was basically teaching the older sysadmins about SSH because they were all on telnet, nothing was encrypted.There was no IDS—this was a long time ago, right, so the explosion of cybersecurity firms did not exist then, so it was new. To be, like, a security company was a new thing. So, that was the beginning. I did dabble in open-source development for a while. I had a couple other jobs at ISPs.Google found me because of my dev and open-source work in '06, '07. I interviewed, joined Google, and then at Google, all of it is IC, basically, individual contributor. And at Google, I started doing SRE-type work, but for the corporate systems. And there was this failed attempt to migrate from one Linux distribution to another—all the corporate systems—and I tech-led the effort making that successful. I don't think I should take the credit; it was really just a fact of, like you know, trying the second time and kind of, learned—the organization learned the lessons that I had to learn from the first time. So, we did it a second time and it worked.And then yeah, I kept going. I did more SRE work in corp, I did some stuff in production, like all the products. So, I did a ton of stuff. I did—let's see—technical infrastructure, disaster recovery testing, I started a chaos-engineering-focused team. I worked on Google Cloud before we had a name for it. [laugh].So, I was the first SRE on Google Compute Engine and Google Cloud Storage. I managed Google Plus SRE team, and G Suite for a while. And finally, after doing all these runs on different teams, and developing new SRE teams and organizations, and different styles, different programs in SRE, Dave Rensin, who created the CRE team at Google, recruited me with Matt Brown, who was then the tech lead, to join the CRE team, which was the team at Google focused on teaching Google Cloud customers on how to adopt SRE practices.
So, because I had this very broad experience within Google, they thought, yeah, it will be cool if you can share that experience with customers.And then I acquired even more experience working with random customers trying to adopt SRE practices. So, I think I've seen a little bit of it all. VMware wanted me to start, basically, a CRE team following the same model that we had at Google, which culminated all this in TKG SRE that I'm saying, like, we work to improve the reliability of the product and not just teaching the customer how to adopt SRE practices. And my pitch to the team was, you know, we can and should teach the customers, but we should also make sure that they have reasonable defaults, that they are providing a reasonable config. That's the gist of my experience, at a high level.Jason: That's an amazing breadth of experience. And there's so many aspects that I feel like I want to dive into [laugh] that I'm not quite sure exactly where to start. But I think I'll start with the first one, and that's that you mentioned that you were on that initial team at Google that started doing chaos engineering. And so I'm wondering if you could share maybe one of your experiences from that. What sort of chaos engineering did you do? What did you learn? What were the experiments like?Gustavo: So, a little bit of the backstory. This is probably because Kripa mentioned this several times before—and Kripa Krishnan, she actually initiated disaster recovery testing, way, way before there was such a thing as chaos engineering—that was 2006, 2007. That was around the time I was joining Google. So, Kripa was the first one to lead disaster recovery testing. It was very manual; it was basically a room full of project managers with postIts, and asking teams to, like, “Hey, can you test your stuff? Can you test your processes? What if something goes wrong? 
What if there's an earthquake in the Bay Area type of scenario?” So, that was the predecessor.Many, many years later, I work with her from my SRE teams testing, for my SRE teams participating in disaster recovery testing, but I was never a part of the team responsible for it. And then seven years later, I was. So, she recruited me with the following pitch, she was like, “Well, the program is big. We have disaster recovery tests, we have a lot of people testing, but we are struggling to convince people to test year-round. So, people tend to test once a year, and they don't test again. Which is bad. And also,” she was like, “I wish we had a software; there's something missing.”We had the spreadsheets, we track people, we track their tasks. So, it was still very manual. The team didn't have a tool for people to test. It was more like, “Tell me what you're going to test, and I will help you with scheduling, I'll help you to not conflict with the business and really disrupt something major, disrupt production, disrupt the customers, potentially.” A command center, like a center of operations.That's what they did. I was like, “I know exactly what we need.” But then I surveyed what was out there in open-source way, and of course, like, Netflix, gets a lot of—deserves a lot of credit for it; there was nothing that could be applied to the way we're running infrastructure internally. And I also felt that if we built this centrally and we build a catalog of tasks ourselves, and that's it, people are not going to use it. We have a bunch of developers, software engineers.They've got to feel like—they want to, they want to feel—and rightfully so—that they wanted control and they are in control, and they want to customize the system. 
So, in two weeks, I hacked a prototype that was almost like a workflow engine for chaos engineering tests, and I wrote two or three tests, but there was an API for people to bring their own tests to the system, so they could register a new test and basically send me a patch to add their own tests. And, yeah, to my surprise, like, a year later—and the absolute number comparison is not really fair, but we had an order of magnitude more testing being done through the software than manual tests. So, on a per-unit basis, the quality of the automated tests was lower, but the cool thing was that people were testing a lot more often. And it was also very surprising to see the teams that were testing.Because there were teams that refused to do the manual disaster recovery testing exercise, they were using the software now to test, and that was part of the regular integration test infrastructure. So, they're not quite starting with okay, we're going to test in production, but they were testing staging, they were testing a developer environment. And in staging, they had real data; they were finding regressions. I can mention the most popular test, too, because I spoke about this publicly before, which was this fuzz testing. So, a lot of things are RPCs, or RPC services, RPC servers.Fuzz testing is really useful in the sense that, you know, if you send random data in an RPC call, will the server crash? Will the server handle this gracefully? So, we found a lot of people—not us—a lot of people used our shared service, bringing their own tests, and fuzz testing was very popular to run continuously. And they would find a ton of crashes.
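[Editor's sketch: the fuzz testing idea described above, in miniature. Nothing here is Catzilla or any Google API; `handle_request`, `BadRequest`, and the length-prefixed wire format are hypothetical stand-ins chosen for illustration. The loop throws random bytes at a handler and treats any exception other than a deliberate rejection as a crash.]

```python
import random
import struct


class BadRequest(Exception):
    """Raised by the handler for input it rejects on purpose."""


def handle_request(payload: bytes) -> bytes:
    """Hypothetical RPC handler: 2-byte big-endian length, then a UTF-8 name."""
    if len(payload) < 2:
        raise BadRequest("missing length header")
    (length,) = struct.unpack(">H", payload[:2])
    body = payload[2:2 + length]
    if len(body) != length:
        raise BadRequest("truncated body")
    try:
        name = body.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise BadRequest("body is not valid UTF-8") from exc
    return b"hello " + name.encode("utf-8")


def fuzz(handler, iterations: int = 1000, seed: int = 0) -> list:
    """Send random payloads; return the ones that caused an unexpected exception."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(0, 32)
        payload = bytes(rng.randrange(256) for _ in range(size))
        try:
            handler(payload)
        except BadRequest:
            pass  # graceful rejection is the desired behavior
        except Exception as exc:  # anything else counts as a crash
            crashes.append((payload, repr(exc)))
    return crashes


crashes = fuzz(handle_request)
print(f"{len(crashes)} crashing inputs found")
```

[A handler that indexes into the payload without checking its length would show up in `crashes` as soon as the fuzzer sends an empty message, which is the kind of finding the transcript describes.]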
We had a lot of success with that program.This team that I ran that was dedicated to building this shared service as a chaos engineering tool—which was ironically named Catzilla—and I'm not a cat person, so there's a story there, too—was also doing more than just Catzilla, which we can also talk about because there's a little bit more of the incident management space that's out there.Jason: Yeah. Happy to dive into that. Tell me more about Catzilla?Gustavo: Yeah. So, Catzilla was sort of the first project from scratch from the team that ended up being responsible to share a coherent vision around incident prevention. And then we would put Catzilla there, right, so the chaos engineering shared service and prevention, detection, analysis and response. Because once I started working on this, I realized, well, you know what? People are still being paged, they have good training, we had a good incident management process, so we have good training for people to coordinate incidents, but if you don't have SREs working directly with you—and most teams didn't—you also have a struggle to communicate with executives.It was a struggle to figure out what to do with prevention, and then Catzilla sort of resolved that a little bit. So, if you think of a team, like an SRE team in charge of not running a SaaS necessarily, but a team that works in function of a company to help the company to think holistically about incident prevention, detection, analysis, and response. So, we ended up building more software for those. So, part of the software was well, instead of having people writing postmortems—a pet peeve of mine is people write postmortems and then they would give them to the new employees to read. So, people never really learned from the postmortems, and there was like not a lot of information recovery from those retrospectives.Some teams were very good at following up on action items and having discussions.
But that's kind of how you see the community now, people talking about how we should approach retrospectives. It happened but it wasn't consistent. So then, well, one thing that we could do consistently is extract all the information that people spend so much time writing on the retrospectives. So, my pitch was, instead of having these unstructured texts, can we have it both unstructured and structured?So, then we launched a postmortem template that was also machine-readable so we could extract information and then generate reports for business leaders to say, “Okay, here's what we see on a recurring basis, what people are talking about in the retrospectives, what they're telling each other as they go about writing the retrospectives.” So, we found some interesting issues that were resolved that were not obvious on a per retrospective basis. So, that was all the way down to the analysis of the incidents. On the management part, we built tooling. It's basically—you can think of it as a SaaS, but just for the internal employees to use that is similar to externally what would be an incident dashboard, you know, like a status page of sorts.Of course, a lot more information internally for people participating in incidents than they have externally. For me, it's thinking of the SRE—and I manage many SRE teams that were responsible for running production services, such as Compute Engine, Google Plus, Hangouts, but also, you know, I just think of SRE as the folks managing production systems, going on call. But thinking of them as reliability specialists. And there's so many—when you think of SREs as reliability specialists that can do more than respond to pages, then you can slot SREs and SRE teams in many other areas of an organization.Jason: That's an excellent point. Just that idea of an SRE as being more than just the operations on-call unit. I want to jump back to what you mentioned about taking and analyzing those retrospectives and analyzing your incidents.
That's something that we did when I was at Datadog. Alexis Lê-Quôc, who's the CTO, has a fantastic talk about that at Monitorama that I'll link to in the [show notes 00:19:49].It was very clear from taking the time to look at all of your incidents, to catalog them, to really try to derive what's the data out of those and get that information to help you improve. We never did it in an automated way, but it sounds like with an automated tool, you were able to gather so much more information.Gustavo: Yeah, exactly. And to be clear, we did this manually before, and so we understood the cost of. And our bar, company-wide, for people writing retrospectives was pretty low, so I can't give you a hard numbers, but we had a surprising amount of retrospectives, let's say on a monthly basis because a lot of things are not necessarily things that many customers would experience. So, near misses or things that impact very few customers—potentially very few customers within a country could end up in a retrospective, so we had this throughput. So, it wasn't just, like, say, the highest severity outages.Like where oh, it happens—the stuff that you see on the press that happens once, maybe, a year, twice a year. So, we had quite a bit of data to discuss. So, then when we did it manually, we're like, “Okay, yeah, there's definitely something here because there's a ton of information; we're learning so much about what happens,” but then at the same time, we were like, “Oh, it's painful to copy and paste the useful stuff from a document to a spreadsheet and then crunch the spreadsheet.” And kudos—I really need to mention her name, too, Sue [Lueder 00:21:17] and also [Yelena Ortel 00:21:19]. Both of them were amazing project program managers who've done the brunt of this work back in the days when we were doing it manually.We had a rotation with SREs participating, too, but our project managers were awesome. 
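[Editor's sketch: one way a retrospective can be both free text and machine-readable, so the copy-and-paste step described above can be automated. The template shape (a few `key: value` fields, a blank line, then the narrative) and the field names are assumptions for illustration, not the actual Google template or the sredocs format, and the three sample documents are made up.]

```python
from collections import Counter

# Made-up sample retrospectives in a hypothetical template:
# "key: value" fields, a blank line, then the free-text narrative.
POSTMORTEMS = [
    "service: checkout\ntrigger: config change\nseverity: low\n\n"
    "A flag rollout raised error rates in one region.",
    "service: search\ntrigger: binary deploy\nseverity: medium\n\n"
    "A new release crashed on startup for some shards.",
    "service: checkout\ntrigger: config change\nseverity: high\n\n"
    "A quota value was lowered by mistake.",
]


def parse_fields(doc: str) -> dict:
    """Extract the structured header; the narrative is left untouched."""
    header, _, _narrative = doc.partition("\n\n")
    fields = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not "key: value"
            fields[key.strip().lower()] = value.strip()
    return fields


def trigger_report(docs) -> Counter:
    """Count incidents by trigger across all retrospectives."""
    return Counter(parse_fields(d).get("trigger", "unknown") for d in docs)


for trigger, count in trigger_report(POSTMORTEMS).most_common():
    print(f"{trigger}: {count}")
```

[The same parsed fields could be loaded into a spreadsheet or a query engine instead of a `Counter`; the point is only that recurring themes become visible once the fields are extracted consistently.]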
And also…Jason: As you started to analyze some of those incidents, every infrastructure is different, every setup is different, so I'm sure that maybe the trends that you saw are perhaps unique to those Google teams. I'm curious if you could share the, say, top three themes that might be interesting and applicable to our listeners, and things that they should look into or invest in?Gustavo: Yeah, one thing that I tell people about adopting the—in the books, the SRE books, is the—and people joke about it, so I'll explain the numbers a little better. 70, 75% of the incidents are triggered by config changes. And people are like, “Oh, of course. If you don't change anything, there are no incidents, blah, blah, blah.” Well, that's not true, that number really speaks to a change in the service that is impacted by the incident.So, that is not a change in the underlying dependency. Because people were very quick to blame their dependencies, right? So meaning, if you think of a microservice mesh, the service app is going to say, “Oh, sure. I was throwing errors, my service was throwing errors, but it was something with G or H underneath, in a layer below.” 75% of cases—and this is public information, it goes into books, right—of retrospectives that were written, the service that was throwing the errors, it was something that changed in that service, not above or below; 75% of the time, a config change.And it was interesting when we would go and look into some teams where there was a huge deviation from that. So, for some teams, it was like, I don't know, 85% binary deploys. So, they're not really changing config that much, or the configuration issues are not trigger—or the configuration changes are not triggering incidents. For those teams, actually, a common phenomenon was that because they couldn't.
So, they did—the binary deploys were spiking as contributing factors and main triggers for incidents because they couldn't do config changes that well, roll them out in production, so they're like, yeah, of course, like, [laugh] my binary deploys will break more on my own service.But that showed to a lot of people that a lot of things were quote-unquote, “Under their control.” And it also was used to justify a project and a technique that I think is undervalued by SREs in the wild, or folks running production in the wild, which is canary evaluation systems. So, all these numbers and a lot of this analysis was justification to, like, give extra funding for the system that was basically, systematically across the entire company, if you tried to deploy a binary to production, if you tried to deploy a config change to production, it will evaluate a canary; if the binary is in a crash loop, if the binary is throwing many errors, if something is changing in a clearly unpredictable way, it will pause, it will abort the deploy. Which back to—much easier said than done. It sounds obvious, right, “Oh, we should do canaries,” but, “Oh, can you automate your canaries in such a way that they're looking to monitoring time series and that it'll stop a release and roll back a release so a human operator can jump in and be like, ‘oh, okay. Was it a false positive or not?'”Jason: I think that moving to canary deployments, I've long been a proponent of that, and I think we're starting to see a lot more of that with tools such as—things like LaunchDarkly and other tools that have made it a whole lot easier for your average organization that maybe doesn't have quite the infrastructure build-out. As you started to work on all of this within Google, you then went to the CRE team and started to help Google Cloud customers.
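[Editor's sketch: a toy version of the canary evaluation described above, where a rollout is paused or aborted when the canary crash-loops or throws many more errors than the existing fleet. The `Sample` type, the thresholds, and the three-way decision are assumptions for illustration; a real system would read these numbers from monitoring time series.]

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Counters observed over the same time window (hypothetical shape)."""
    requests: int
    errors: int
    restarts: int


def error_rate(sample: Sample) -> float:
    return sample.errors / sample.requests if sample.requests else 0.0


def evaluate_canary(baseline: Sample, canary: Sample,
                    max_restarts: int = 3,
                    max_ratio: float = 2.0,
                    min_requests: int = 100) -> str:
    """Return "abort", "wait", or "promote" for a rollout step."""
    if canary.restarts > max_restarts:
        return "abort"  # crash loop: stop regardless of traffic volume
    if canary.requests < min_requests:
        return "wait"   # not enough traffic yet to judge error rates
    # Allow the canary up to max_ratio times the baseline error rate,
    # with a small floor so a perfectly clean baseline is not impossible to match.
    allowed = max(error_rate(baseline) * max_ratio, 0.001)
    if error_rate(canary) > allowed:
        return "abort"
    return "promote"


baseline = Sample(requests=10_000, errors=10, restarts=0)
print(evaluate_canary(baseline, Sample(requests=500, errors=25, restarts=0)))
```

[An "abort" here would hand control to a human operator, matching the point above that someone still has to decide whether the signal was a false positive.]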
Did any of these tools start to apply to them as well, analyzing their incidents and finding particular trends for those customers?Gustavo: More than one customer, when I describe, say our incident lifecycle management program, and the chaos engineering program, especially this lifecycle stuff, in the beginning, was, “Oh, okay. How do I do that?” And I open-sourced a very crufty prototype which some customers picked up on and they implemented internally in their companies. And it's still on GitHub, so /google/sredocs.There's an ugly parser, an example, like, of template for the machine-readable stuff, and how to basically get your retrospectives, dump the data onto Google BigQuery to be able to query more structurally. So yes, customers would ask us about, “Yeah. I heard about chaos engineering. How do you do chaos engineering? How can we start?”So, like, I remember a retail one where we had a long conversation about it, and some folks in tech want to know, “Yeah, incident response; how do I go about it?” Or, “What do I do with my retrospectives?” Like, people started to realize that, “Yeah, I write all this stuff and then we work on the action items, but then I have all these insights written down and no one goes back to read it. How can I get actionable insights, actionable information out of it?”Jason: Without naming any names because I know that's probably not allowed, are there any trends from customers that you'd be willing to share? Things that maybe—insights that you learned from how they were doing things and the incidents they were seeing that was different from what you saw at Google?Gustavo: Gaming is very unique because a lot of gaming companies, when we would go into incident management, [unintelligible 00:26:59] they were like, “If I launch a game, it's ride or die.” There may be a game that in the first 24, or 48 hours if the customers don't show up, they will never show up. So, that was a little surprising and unusual.
Another trend is, in finance, you would expect them to be a little behind or to be too strict on process, et cetera, which they still are, but they're very sophisticated customers, I would say. The new teams of folks are really interested in learning how to modernize the finance infrastructure.Let's see… well, tech, we basically talk the same language, with the gaming being a little different. In retail, the uniqueness of having a ton of things at the edge was a little bit of a challenge. So, having these hubs, where they have, say, a public cloud or on-prem data center, and then having things running at the stores, so then having this conversation with them about different tiers and how to manage different incidents. Because if a flagship store is offline, it is a big deal. And from a, again, SaaS mindset, if you think of, like, SRE, and you always manage through a public cloud, you're like, “Oh, I just call with my cloud provider; they'll figure it out.”But then for a retail company with things at the edge, at a store, they cannot just sit around and wait for the public cloud to restore their service. So again, a lot of more nuanced conversations there that you have to have of like, yeah, okay, yeah. Here, say a VMware or a Google. Yeah, we don't deal with this problem internally, so yeah, how would I address this? The answers are very long, and they always depend.They need to consider, oh, do you have an operational team that you can drive around? [laugh]. Do you have people, do you have staffing that can go to the stores? How long will it take? So, the SLO conversation there is tricky. A secret weapon of SRE that definitely has other value is the project managers, program managers that work with SREs. And I need to shout out to—if you're a project manager, program manager working with SREs, shout out to you.Do you want to have people on call 24/7? Do you have people near that store that can go physically and do anything about it?
And more often than not, they rely on third-party vendors, so then it's not staffed in-house and they're not super technical, so then remote management conversations come into play. And then you talk about, “Oh, what's your network infrastructure for that remote management?” Right? [laugh].Jason: Things get really interesting when you start to essentially outsource to other companies and have them provide the technology, and you try to get that interface. So, you mentioned doing chaos engineering within Google, and now you've moved to VMware with the Tanzu team. Tell me a bit more about how do you do chaos engineering at VMware, and what does that look like?Gustavo: I've seen varying degrees of adoption. So, right now, within my team, what we are doing is we're actually, as we speak right now, doing a big reliability risk assessment for a launch. Unfortunately, we cannot talk about it yet. We're probably going to announce this in October at VMworld. As a side effect of this big launch, we started by doing a reliability risk assessment.And the way we do this is we interview the developers—so this hasn't launched yet, so we're still designing this thing together. [unintelligible 00:30:05] the developers of the architecture that they basically sketch out, like, what is it that you're going to? What are the user journeys, the user stories? Who is responsible for what? And let's put an architecture diagram, a sketch together.And then we try to poke holes in, “Okay. What could go wrong here?” We write this stuff down. More often than not, from this list—and I can already see, like, that's where that output, that result fits into any sort of chaos engineering plan.
So, that's where, like—so I can get—one thing that I can tell you for that risk assessment because I participated in the beginning was, there is a level of risk involving a CDN, so then one thing that we're likely going to test before we get to general availability is yeah, let's simulate that the CDN is cut off from the clients.But even before we do the test, we're already asking, but we don't trust. Like, trust and verify, actually; we do trust but trust and verify. So, we do trust the client is actually another team. So, we do trust the client team that they cache, but we are asking them, “Okay. Can you confirm that you cache? And if you do cache, can you give us access to flush the cache?”We trust them, we trust the answers; we're going to verify. And how do we verify? It's through a chaos engineering test which is, let's cut the client off from the CDN and then see what happens. Which could be, for us, as simple as let's move the file away; we should expect them to not tell us anything because the client will fail to read but it's going to pick from cache, it's not reading from us anyways. So, there is, like, that level of we tell people, “Hey, we're going to test a few things.”We'll not necessarily tell them what. So, we are also not just testing the system, but testing how people react, and if anything happens. If nothing happens, it's fine. They're not going to react to it. So, that's the level of chaos engineering that our team has been performing.Of course, as we always talk about improving reliability for the product, we talked about, “Oh, how is it that chaos engineering as a tool for our customers will play out in the platform?” That conversation now is a little bit with product. 
So, product has to decide how and when they want to integrate, and then, of course, we're going to be part of that conversation once they're like, “Okay, we're ready to talk about it.” Other teams of VMWare, not necessarily Tanzu, then they do all sorts of chaos engineering testing. So, some of them using tools, open-source or not, and a lot of them do tabletop, basically, theoretical testing as well.Jason: That's an excellent point about getting started. You don't have a product out yet, and I'm sure everybody's anticipating hearing what it is and seeing the release at VMworld, but testing before you have a product; I feel like so many organizations, it's an afterthought, it's the, “I've built the product. It's in production. Now, we need to keep it reliable.” And I think by shifting that forward to thinking about, we've just started diagramming the architecture, let's think about where this can break. And how we can build those tests so that we can begin to do that chaos engineering testing, begin to do that reliability testing during the development of the product so that it ships reliably, rather than shipping and then figuring out how to keep it reliable.Gustavo: Yeah. The way I talked to—and I actually had a conversation with one of our VPs about this—is that you have technical support that is—for the most part, not all the teams from support—but at least one of the tiers of support, you want it to be reactive by design. You can staff quite a few people to react to issues and they can be very good about learning the basics because the customers—if you're acquiring more customers, they are going to be—you're going to have a huge set of customers early in the journey with your product. And you can never make the documentation perfect and the product onboarding perfect; they're going to run into issues. 
So, that very shallow set of issues, you can have a tier of support that is reactive by design.
You don't want that tier of support to really go deep into issues forever because they can get caught up in a problem for weeks or months. You're kind of going to have—and that's when you add another tier and that's when we get to more of, like, support specialists, and then they split into silos. And eventually, you do get an IC SRE being tier three or tier four, where SRE is a good in-between for support organizations and product developers, in the sense that product developers also tend to specialize in certain aspects of a product. SRE wants to be generalists for reliability of a product. And there is nothing better for uncovering reliability issues in a product than understanding the customer pain, the customer issues.
And actually, one thing, one of the projects I can tell you about that we're doing right now is we're improving the reliability of our installation. And we're going for, like, can we accelerate the speed of installs and reduce the issues by better automation, better error handling, and also good—that's where I say day zero. So, day zero is, can we make this install faster, better, and more reliable? And after the installs in day one, can we get better defaults? Because I say the ergonomics for SRE should be pretty good because we're TKG SREs, so there's [unintelligible 00:35:24] and SRE should feel at home after installing TKG.
Otherwise, you can just go install vanilla Kubernetes. And if vanilla Kubernetes does feel at home because it's open-source, it's what most people use and what most people know, but it's missing—because it's just Kubernetes—missing a lot of things around the ecosystem that TKG can install by default, but then when you add a lot of other things, I need to make sure that it feels at home for SREs and operators at large.
Jason: It's been fantastic chatting with you. 
I feel like we can go [laugh] on and on.
Gustavo: [laugh].
Jason: I've gone longer than I had intended. Before we go, Gustavo, I wanted to ask you if you had anything that you wanted to share, anything you wanted to plug, where can people find you on the internet?
Gustavo: Yeah, so I wrote an ebook on how to start your incident lifecycle program. It's not completely out yet, but I'll post on my Twitter account, so twitter.com/stratus. So @stratus, S-T-R-A-T-U-S. We'll put the link on the [notes 00:36:21], too. And so yeah, you can follow me there. I will publish the book once it's out. It kind of explains all about how to establish an incident lifecycle. And if you want to talk about SRE stuff, or VMware Tanzu or TKG, you can also message me on Twitter.
Jason: Thanks for all the information.
Gustavo: Thank you, again. Thank you so much for having me. This was really fun. I really appreciate it.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.
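The CDN test Gustavo describes in this episode (cut the client off from the CDN and confirm it keeps serving from cache) could be sketched roughly as below. This is a toy simulation, not VMware's tooling: the `Cdn` and `CachingClient` classes, the `cache_enabled` flag, and `run_experiment` are all hypothetical names invented for illustration.

```python
class Cdn:
    """Hypothetical stand-in for the CDN; `available` simulates the cut-off."""

    def __init__(self, files):
        self.files = dict(files)
        self.available = True

    def fetch(self, path):
        if not self.available or path not in self.files:
            raise ConnectionError(f"cannot fetch {path}")
        return self.files[path]


class CachingClient:
    """Hypothetical client that may fall back to a local cache."""

    def __init__(self, cdn, cache_enabled=True):
        self.cdn = cdn
        self.cache_enabled = cache_enabled
        self.cache = {}

    def read(self, path):
        try:
            body = self.cdn.fetch(path)
        except ConnectionError:
            # CDN unreachable: serve the cached copy if there is one.
            if path in self.cache:
                return self.cache[path]
            raise
        if self.cache_enabled:
            self.cache[path] = body
        return body


def run_experiment(client, cdn, path):
    """Return True if the client survives a CDN cut-off for `path`."""
    baseline = client.read(path)      # steady state; also warms the cache
    cdn.available = False             # inject the failure
    try:
        try:
            during = client.read(path)
        except ConnectionError:
            return False              # hypothesis rejected: client failed
        return during == baseline     # hypothesis holds: same content served
    finally:
        cdn.available = True          # always restore the CDN afterwards
```

A client team that really caches passes the experiment, and one that only claims to cache does not, which is the "trust and verify" step: `run_experiment(CachingClient(cdn), cdn, "/app.js")` versus the same call with `cache_enabled=False`.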

In this episode, we cover:
00:00:00 - Introduction
00:03:30 - An Engineering Anecdote
00:08:10 - Lessons Learned from Putting Out Fires
00:11:00 - Building “Guardrails”
00:18:10 - Pushing the Chaos Envelope
00:23:35 - OpenGitOps Project
00:30:37 - Where to Find Leo/Costa Rica CNCF
Links:
Weaveworks: https://www.weave.works
GitOps Working Group: https://github.com/gitops-working-group/gitops-working-group
OpenGitOps Project: https://opengitops.dev
Github.com/open-gitops: https://github.com/open-gitops
Twitter: https://twitter.com/murillodigital
LinkedIn: https://www.linkedin.com/in/leonardomurillo/
Costa Rica CNCF: https://community.cncf.io/costa-rica/
Cloudnative.tv: http://cloudnative.tv
Gremlin-certified chaos engineering practitioner: https://www.gremlin.com/certification
Transcript
Jason: Welcome to the Break Things on Purpose podcast, a show about our often self-inflicted failures and what we learn from them. In this episode, Leonardo Murillo, a principal partner solutions architect at Weaveworks, joins us to talk about GitOps, automating reliability, and Pura Vida.
Ana: I like letting our guests kind of say, like, “Who are you? What do you do? What got you into the world of DevOps, and cloud, and all this fun stuff that we all get to do?”
Leo: Well, I guess I'll do a little intro of myself. I'm Leonardo Murillo; everybody calls me Leo, which is fine because I realize that not everybody chooses to call me Leo, depending on where they're from. Like, Ticos and Latinos, they're like, “Oh, Leo,” like they already know me; I'm Leo already. But people in Europe and in other places, they're, kind of like, more formal out there. Leonardo. Everybody calls me Leo.
I'm based out of Costa Rica, and my current professional role is principal solutions architect—principal partner solutions architect at Weaveworks. How I got started in DevOps. A lot of people have gotten started in DevOps, which is not realizing that they just got started in DevOps, you know what I'm saying? 
Like, they did DevOps before it was a buzzword and it was, kind of like, cool. That was back—so I worked probably, like, three roles back, so I was CTO for a Colorado-based company before Weaveworks, and before that, I worked with a San Francisco-based startup called High Fidelity.And High Fidelity did virtual reality. So, it was actually founded by Philip Rosedale, the founder of Linden Lab, the builders of Second Life. And the whole idea was, let's build—with the advent of the Oculus Rift and all this cool tech—build the new metaverse concept. We're using the cloud because, I mean, when we're talking about this distributed system, like a distributed system where you're trying to, with very low latency, transmit positional audio, and a bunch of different degrees of freedom of your avatars and whatnot; that's very massive scale, lots of traffic. So, the cloud was, kind of like, fit for purpose.And so we started using the cloud, and I started using Jenkins, as a—and figure it out, like, Jenkins is a cron sort of thing; [unintelligible 00:02:48] oh, you can actually do a scheduled thing here. So, started using it almost to run just scheduled jobs. And then I realized its power, and all of a sudden, I started hearing this whole DevOps word, and I'm like, “What this? That's kind of like what we're doing, right?” Like, we're doing DevOps. And that's how it all got started, back in San Francisco.Ana: That actually segues to one of the first questions that we love asking all of our guests. We know that working in DevOps and engineering, sometimes it's a lot of firefighting, sometimes we get to teach a lot of other engineers how to have better processes. But we know that those horror stories exist. So, what is one of those horrible incidents that you've encountered in your career? What happened?Leo: This is before the cloud and this is way before DevOps was even something. I used to be a DJ in my 20s. I used to mix drum and bass and jungle with vinyl. 
I never did the digital move. I used to DJ, and I was director for a colocation facility here in Costa Rica, one of the first few colocation facilities that existed in the [unintelligible 00:04:00].
I partied a lot, like every night, [laugh] [unintelligible 00:04:05] party night and DJ night. One night, they had 24/7 support because we were colocation [unintelligible 00:04:12], so I had people doing support all the time. I was mixing in some bar someplace one night, and I don't want to go into absolute detail of my state of consciousness, but it wasn't, kind of like… accurate in its execution. So, I got a call, and they're like, “We're having some problem here with our network.” This is, like, back in Cisco PIX times for firewalls and you know, like… back then.
I wasn't fully there, so I [laugh], just drove back to the office in the middle of the night and had this assistant, Miguel was his name, and he looks at me and he's like, “Are you okay? Are you really capable of solving this problem at [laugh] this very point in time?” And I'm like, “Yeah. Sure, sure. I can do this.”
We had a rack full of networking hardware and there was, like, a big incident; we actually—one of the primary connections that we had was completely offline. And I went in and I started working on a device, and I spent about half an hour, like, “Well, this device is fine. There's nothing wrong with the device.” I had been working for half an hour on the wrong device. They're like, “Come on. You really got to focus.”
And long story short, I eventually got to the right device and I was able to fix the problem, but that was like a bad incident, which wasn't bad in the context of technicality, right? It was a relatively quick fix that I figured it out. It was just at the wrong time. [laugh]. You know what I'm saying?
It wasn't the best thing to occur that particular night. 
So, when you're talking about firefighting, there's a huge burden in terms of the on-call person, and I think that's something that we had experienced, and that I think we should give out a lot of shout-outs and provide a lot of support for those that are on call. Because this is the exact price they pay for that responsibility. So, just as a side note that comes to mind. Here's a lot of, like, shout-outs to all the people on-call that are listening to this right now, and I'm sorry you cannot go party. [laugh].
So yeah, that's telling one story of one incident way back. You want to hear another one because there's a—this is back in High Fidelity times. I was—I don't remember exactly what it was building, but it had to do with emailing users, basically, I had to do something, I can't recall actually what it was. It was supposed to email all the users that were using the platform. For whatever reason—I really can't recall why—I did not mock data on my development environment.
What I did was just use—I didn't mock the data, I actually just used a copy of the production [unintelligible 00:07:02] the users. I basically just emailed everybody, like, multiple times. And that was very embarrassing. And another embarrassing scenario was, one day, I was working on a firewall that was local to my office, and I got the terminals mixed up, and I shut down not my local office firewall, but the one that was at the colocation facility. And that was another embarrassing moment. So yeah, those are three, kind of, self-caused fires that required fighting afterwards.
Ana: The mock data one definitely resonates, especially when you're starting out in your engineering career where you're just like, “Hey, I need to get this working. I'm trying to connect to pull this data from a production service,” or, “I'm trying to publish a new email, I want to see how it all goes out. Yeah, why not grab a copy of what actually usually is being used by my company and, like, press buttons here? 
Oh, wait, no, that actually is hitting a live endpoint? I did not know that.”
Which brings me to that main question; what do you end up learning when you go through these fires? After you went through this incident that you emailed all of your customers, what is something that you learned that you got to take back?
Leo: I learned how you have to pay attention. It's hard to learn without having gone through these experiences because you start picking up on cues that you didn't pick up in the past. You start seeing things that you didn't pay attention to before, particularly because you didn't know. And I'm pretty sure, even if somebody would have told me, “Don't do this,” or, “Don't do that. Be careful,” you still make those mistakes.
There are certain things that you only achieve through experience. And I think that's one of the most important things that I realized. And I've actually seen the analogy of that with my children. There's certain things that I, no matter how well I articulate, they will not learn until they go through those experiences for themselves. But I think that's one of the things that I'd argue, you ha—you will go through this, and it's—it's not okay, but it's okay.
Everybody makes mistakes. You'll also identify whether—like, how supportive your team is and how supportive your—the organization you're working with is when you see the reaction to those errors. Hopefully, it wasn't something too bad, and ideally there's going to be guardrails that prevent that really, really bad scenario, but it's okay to make mistakes. You learn to focus through those mistakes and you really should be paying attention; you should never take anything for granted. There is no safety net. Period.
So, you should never assume that there is, or that you're not going to make a mistake. So, be very careful. Another thing that I learned, how I can work in my development environment. 
How different patterns that I apply in my development environment, how now I'm very careful to never have, kind of like, production [x 00:10:11] readily available within my development environment. And also to build those guardrails.
I think part of what you learn is all the things that could go wrong, might go wrong, so take time to build those guardrails. I think that's important. Like anything else that comes with seniority, when you have a task to accomplish, the task itself is merely a margin, only a percentage of what you really should consider to reach that objective. And a lot of the times, that means building protection around what you're asked, or thinking beyond that scope. And then leverage the team, you know? If you have people around you that know more, which is kind of great about community and collaboration. Like, being—don't—you're not alone.
Ana: I love that you mentioned guardrails and guardrails being a way that you're able to prevent some of these things. Do you think something like chaos engineering could help you find those guardrails when you don't know that you don't have a guardrail?
Leo: I think it definitely could. The more complex your job, the more complex your architecture, the more complex the solution you're building—and we've gotten an increase in complexity over time. We went from monoliths to microservices to fully distributed architectures of services. We went from synchronous to asynchronous to event-driven to—like, there's this increase in complexity that is basically there for a reason because of an increase in scale as well. 
And the number of possible failure conditions that could arise from this hugely diverse and complex set of variables means that we've gotten to a point that likely always was the way, but now it's reached, again, and because of targets aligned with this complexity, new levels of scale, that there is currently more unknown unknowns than we've ever had.The conditions that you can run into because of different problem states of each individual component in your distributed architecture, brings up an orders-of-magnitude increase in the possible issues that you might run into, basically a point where you really have to understand that you have no idea what could fail, and the exercise of identifying what can fail. Or what are the margins of stability of your solution because that's, kind of like, the whole point, the boundaries? There's going to be a set of conditions, there's going to be a combination of conditions that will trigger your—kind of, will tip your solution beyond that edge. And finding those edges of stability can no longer be something that just happens by accident; it has to be premeditated, it has to be planned for. This is basically chaos engineering.Hypothesizing, given a set of conditions, what is the expected outcome? And through the execution of this hypothesis of increasing or varying scope and complexity, starting to identify that perimeter of stability of their solution. So, I guess to answer your question, yes. I mean, chaos engineering allows you to ide—if you think about that perimeter of stability as the guardrails around your solution within which have to remain for your solution to be stable, for instance, there goes—[unintelligible 00:13:48] chaos engineering. 
I was actually talking to somebody the other day, so I'm the organizer for the Costa Rica Cloud-Native Community, the chapter for [unintelligible 00:14:00], and I have this fellow from [unintelligible 00:14:04] who, he works doing chaos engineering.
And he was talking to me about this concept that I had not thought about and considered, how chaos engineering can also be, kind of like, applied at a social level. What happens if a person xyz is not available? What happens if another person has access to a system that they shouldn't have? All these types of scenarios can be used to discover where more guardrails should be applied.
Jason: You know, you start to learn where the on-call person that's completely sober, maybe, is unavailable for some reason, and Leo comes and [crosstalk 00:14:45]—
Leo: Right. [laugh]. Exactly. Exactly. That's what you have to incorporate in your experiment, kind of like, the DJ variable and the party parameter.
Jason: It's a good thing to underscore as well, right? Back to your idea of we can tell our children all sorts of things and they're not going to learn the lesson until they experience it. And similarly with, as you explore your systems and how they can fail, we can imagine and architect systems to maybe be resilient or robust enough to withstand certain failures, but we don't actually learn those lessons or actually know if they're going to work until we really do that, until we really stress them and try to explore those boundaries.
Leo: Wouldn't it be fantastic if we could do that with our lives? You know, like, I want to bungee jump or I want to skydive, and there's a percentage of probability that I'm going to hit the ground and die, and I can just introduce a hypothesis in my life, jump, and then just revert to my previous state if it went wrong. It would be fantastic. I would try many, many things. [laugh].
But you can't. And it's kind of like the same thing with my kids. I would love to be able to say, “You know what? 
Execute the following process, get the experience, and then revert to before it happened.” You cannot do that in real life, but that's, kind of like, the scenario that's brought up by chaos engineering, you don't have to wait for that production incident to learn; you can actually, “Emulate” quote-unquote, those occurrences.You can emulate it, you can experience without the damage, though, if you do it well because I think that's also part of, kind of like, there's a lot to learn about chaos engineering and there's a lot of progress in terms of how the practice of chaos engineering is evolving, and I think there's likely still a percentage of the population or of the industry that still doesn't quite see chaos engineering beyond just introducing chaos, period. They know chaos engineering from calling the Chaos Monkeys kill instances at random, and fix things and, you know, not in the more scientific context that it's evolved into. But yeah, I think the ability to have a controlled experience where you can actually live through failure states, and incidents, and issues, and stuff that you really don't want to happen in real life, but you can actually simulate those, accelerates learning in a way that only experience provides. Which is the beauty of it because you're actually living through it, and I don't think anything can teach us as effectively as living through [unintelligible 00:17:43], through suffering.Ana: I do also very much love that point where it's true, chaos engineering does expedite your learning. 
Not only are you just building and releasing and waiting for failure to happen, you're actually injecting that failure and you get to just be like, “Oh, wait, if this failure was to occur, I know that I'm resilient to it.” But I also love pushing that envelope forward, that it really allows folks to battle-test solutions together of, “I think this architecture diagram is going to be more resilient because I'm running it on three regions, and they're all in just certain zones. But if I was to deploy to a different provider, that only gives me one region, but they say they have a higher uptime, I would love to battle, test that together and really see, I'm throwing both scenarios at you: you're losing your access to the database. What's going to happen? Go, fight.” [laugh].Leo: You know, one thing that I've been mentioning to people, this is my hypothesis as to the future of chaos engineering as a component of solutions architecture. My hypothesis is that just as nowadays, if you look at any application, any service, for that application or service to be production-ready, you have a certain percentage of unit test coverage and you have a certain percentage of end-to-end coverage of testing and whatnot, and you cannot ignore and say I'm going to give you a production-ready application or production-ready system without solid testing coverage. My hypothesis is that [unintelligible 00:19:21]. And as a side note, we are now living in a world of infrastructure as code, and manifested infrastructure, and declarative infrastructure, and all sorts of cool new ways to deploy and deliver that infrastructure and workloads on top of it. 
My theory is that just as unit testing coverage is a requirement for any production-ready solution or application nowadays, a certain percentage of, “Chaos coverage,” quote-unquote, in other words, what percentage of the surface of your infrastructure has been exercised by chaos experiments, is going to also become a requirement for any production-ready architecture. That's where my mind is at. I think you'll start seeing that happen in CI/CD pipelines, you're going to start seeing labels of 90% chaos coverage on Terraform repos. That's kind of the future that I hope for, because I think it's going to help tremendously with reliability, and allow people to party without concern for being called back to the office in the middle of the night. It's just going to have a positive impact overall.
Ana: I definitely love where that vision is going because that's definitely very much of what I've seen in the industry and the community. And with a lot of the open-source projects that we see out there, like, I got to sit in on a project called Keptn, which gets a chance to bring in a little bit more of those SRE-driven operations and try to close that loop, and auto-remediate, and all these other nice things of DevOps and cloud, but a big portion of what we're doing with Keptn is that you also get a chance to inject chaos and validate against service-level objectives, so you get to just really bring to the front, “Oh, we're looking at this metric for business-level and service-level objectives that allow for us to know that we're actually up and running and our customers are able to use us because they are the right indicators that matter to our business.” But you get to do that within CI/CD so that you throw chaos at it, you check that SLO, that gets rolled out to production, or to your next stage and then you throw more chaos at it, and it continues being completely repetitive.
Leo: That's really awesome. And I think, for example, SLOs, I think that's very valuable as well. 
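The “chaos coverage” idea and the SLO check in CI/CD discussed above could be sketched as a simple promotion gate. This is an illustrative assumption, not how Keptn or any other tool computes it: `ExperimentResult`, `chaos_coverage`, `gate`, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    component: str         # infrastructure component the experiment targeted
    success_ratio: float   # fraction of requests that succeeded during the fault


def chaos_coverage(components, results):
    """Fraction of declared components exercised by at least one experiment."""
    if not components:
        return 0.0
    exercised = {r.component for r in results} & set(components)
    return len(exercised) / len(components)


def gate(components, results, slo_target=0.99, min_coverage=0.9):
    """Allow promotion only if coverage is high enough and every
    experiment stayed within the (assumed) SLO target."""
    coverage = chaos_coverage(components, results)
    slo_met = all(r.success_ratio >= slo_target for r in results)
    return coverage >= min_coverage and slo_met
```

A pipeline stage would call `gate(...)` after running its experiments and only promote the build to the next stage when it returns `True`, then repeat with more experiments there.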
And prioritize what you want to improve based on the output of your experiments against that error budget, for example. There's limited time, there's limited engineering capacity, there's limited everything, so this is also something that you—the output, the results, the insights that you get from executing experiments throughout your delivery lifecycle as you promote, as you progress your solution through its multiple stages, also help you identify what should be prioritized because of the impact that it may have in your error budgets. Because I mean, sometimes you just need to burn budget, you know what I'm saying?
So, you can actually, clearly and quantifiably understand where to focus engineering efforts towards site reliability as you introduce changes. So yeah, I think it's—and no wonder it's such a booming concept. Everybody's talking about it. I saw Gremlin just released this new certification thing. What is it, certified chaos engineer?
Jason: Gremlin-certified chaos engineering practitioner.
Leo: Ah, pretty cool.
Jason: Yeah.
Leo: I got to get me one of those. [laugh].
Jason: Yeah, you should—we'll put the link in the [show notes 00:23:19], for everybody that wants to go and take that. One of the things that you've mentioned a bunch is as we talk about automation, and automating and getting chaos engineering coverage in the same way that test coverage happens, one of the things that you're involved in—and I think why you've got so much knowledge around automation—is you've been involved in the OpenGitOps Project, right?
Leo: Mm-hm. Correct.
Jason: Can you tell us more about that? And what does that look like now? Because I know GitOps has become this, sort of, buzzword, and I think a lot of people are starting to look into that and maybe wondering what that is.
Leo: I'm co-chair of the GitOps Working Group by the CNCF, which is the working group that effectively shepherds the OpenGitOps Project. 
The whole idea behind the OpenGitOps Project is to come to a consensus definition of what GitOps is. And this is along the lines of—like, we were talking about DevOps, right?
Like DevOps is—everybody is doing DevOps and everybody does something different. So, there is some commonality but there is not necessarily a community-agreed-upon single perspective as to what DevOps is. So, the idea behind the OpenGitOps Project and the GitOps Working Group is to basically rally the community and rally the industry towards a common opinion as to what GitOps is, eventually work towards conformance and certification—so it's like you guys are doing with chaos engineering—and in an open-source community fashion. GitOps is basically an operating model for cloud-native infrastructure and applications. So, the idea is that you can use the same patterns and you can use the same model to deploy and operate the underlying infrastructure as well as the workloads that are running on top of it.
It's defined by four principles that might resonate as known in common for some with some caveats. So, the first principle is that your desired state, how you want your infrastructure and your workloads to look like is declarative. No, it's—you're not—there's a fundamental difference between the declarative and imperative. Imperative is you're giving instructions to reach a certain state. Declarative is just… defining the characteristics of that state, not the process by which you reached it.
The declared state should be immutable and should be versioned, and this is very much aligned with the whole idea of containers, which are immutable and are versioned, and the whole idea of Git, that if used… [unintelligible 00:26:05] if used following best practices is also immutable and versioned. So, your declared state should be versioned and immutable.
It should be continuously reconciled through agents. 
In other words, it eliminates the human component; you are no longer executing manual jobs and you're no longer running imperative pipelines for the deployment component of your operation. You are allowing your [letting 00:26:41] agents do that for you, continuously and programmatically.
And the fourth principle is, this is the only way by which you interact with the system. In other words it completely eliminates the human component from the operating model. So, for example, when I think about GitOps as a deployment mechanism, and for example, progressive delivery within the context of GitOps, I see a lot of… what's the word I'm looking for? Like, symbiosis.
Jason: Yeah. Symbiosis?
Leo: Yeah. Between chaos engineering, and this model of deployment. Because I think chaos engineering is also eliminating a human component; you're no longer letting humans exercise your system to find problems, you are executing those by agents, you are doing so with a declarative model, where you're declaring the attributes of the experiment and the expected outcome of that experiment, and you're defining the criteria by which you're going to abort that experiment. So, if you incorporate that model of automated, continuous validation of your solution through premeditated chaos, in a process of continuous reconciliation of your desired state, through automated deployment agents, then you have a really, really solid, reliable mechanism for the operation of cloud-native solutions.
Ana: I was like, I think a lot of what we've seen, I mean, especially as I sit in more CNCF stuff, is really trying to get a lot of our systems to be able to know what to do next before we need to interfere, so we don't have to wake up. 
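The declarative, agent-reconciled model Leo outlines above could be sketched as a single reconciliation pass. This is a minimal illustration under stated assumptions, not the OpenGitOps specification or any real agent: the `diff` and `reconcile` helpers are hypothetical, and plain dictionaries stand in for the desired state (which would come from a versioned Git commit) and the live system.

```python
def diff(desired, actual):
    """Return the actions needed to move `actual` towards `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def reconcile(desired, actual):
    """One pass of the agent: apply every action so `actual` converges
    on `desired`. Returns the actions taken (empty when already in sync)."""
    actions = diff(desired, actual)
    for op, name, spec in actions:
        if op == "delete":
            del actual[name]
        else:
            actual[name] = spec
    return actions
```

The desired state only says what should exist, never how to get there; the agent works out the steps. Running `reconcile` in a loop means manual drift (for example, someone adding a debug workload by hand) is reverted on the next pass, and a second pass with nothing changed returns no actions.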
So, between chaos engineering, between GitOps, between Keptn, [unintelligible 00:28:32] how is it that you can make the load of SRE and the DevOps engineer be more about making sure that things get better versus, something just broke and I need to go fix it, or I need to go talk to an engineer to go do a best practice because now those things are built into the system as a guardrail, or there's better mental models and things that are more accurate to real conditions that can happen to a system?

Leo: Actually, I sidetracked. I never ended up talking more about the OpenGitOps Project and the GitOps Working Group. So, it's a community effort by the CNCF. So, it's open for contribution by everybody. You're all in the CNCF Slack, there is an OpenGitOps Slack channel there.

And if you go to github.com/open-gitops, you'll be able to find ways to contribute. We are always looking to get more involvement from the community. This is also an evolving paradigm, which I think also resonates with chaos engineering.

And a lot of its evolution is being driven by the use cases that are being discovered by the end-users of these technologies and the different patterns. Community involvement is very important. Industry involvement is very important. It would be fantastic and we're an open community, and I'd love to get to know more about what you're all doing with GitOps and what it means for you and how these principles apply to the challenges that your teams are running into, and the use cases and problem spaces that you're having to deal with.

Jason: I think that's a fantastic thing for our listeners to get involved in, especially as a new project that's really looking for the insight and the contribution from new members as it gets founded. As we wrap up, Leo, do you have any other projects that you want to share? How can people find you on the internet? Anything else that you want to plug?

Leo: I love to meet people on these subjects that I'm very passionate about. 
So yes, you can find me on Twitter. I guess, it's easier to just type it, it's @murillodigital, but you'll find that in the show notes, I imagine. As well as my LinkedIn.

I have to admit, I'm more of a LinkedIn person. I don't, I hope that doesn't age me or make me uncool, but I never figured out how to really work with Twitter. I'm more of a LinkedIn person, so you can find me there. I'm an organizer in the community in Costa Rica CNCF, and I run.

So, for those that are Spanish speakers, I'm very much for promoting the involvement and openness of the cloud-native ecosystem to the Hispanic and Latin community. Because I think language is a barrier and I think we're coming from countries where a lot of us have struggled to basically get our head above water from lesser resources and difficult access to technology and information. But that doesn't mean that there isn't a huge amount of talent in the region. There is. And so, I run a—there's a recent initiative by the CNCF called cloud-native TV, which is we're ten shows that are streaming on Twitch.

You go to cloudnative.tv, you'll see them. I run a show called Cloud Native LatinX, which is in Spanish. I invite people to talk about cloud-native technologies that are more cloud-native communities in the region.

And my objective is twofold: I want to demonstrate to all Hispanics and all Latin people that they can do it, that we're all the same, doesn't matter if you don't speak the language. There is a whole bunch of people, and I am one of them, that speak the language that are there, and we're there to help you learn, and support and help you push through into this community. Basically, anybody that's listening to come out and say these are actionable steps that I can take to move my career forward. So, it's every other Tuesday on cloudnative.tv, Cloud Native LatinX, if you want to hear and see more of me talking in Spanish. It's on cloudnative.tv. And the OpenGitOps Project, join in; it's open to the community. 
And that's me.

Ana: Yes, I love that shout-out to getting more folks, especially Hispanics and Latinx, be more involved in cloud and CNCF projects itself. Representation matters and folks like me and Leo come in from countries like Costa Rica, Nicaragua, we get to speak English and Spanish, we want to create more content in Spanish and let you know that you can learn chaos engineering in English and you can learn about chaos engineering in Spanish, Ingeniería de Caos. So, come on and join us. Well, thank you Leo. Muchísimas gracias por estar en el show de hoy, y gracias por estar llamando hoy desde Costa Rica, y para todos los que están oyendo hoy que también hablan español... pura vida y que se encuentren bien. Nos vemos en el próximo episodio. [Thank you so much for being on the show today, and thank you for calling in today from Costa Rica, and to everyone listening today who also speaks Spanish... pura vida, and I hope you are all doing well. See you in the next episode.]

Leo: Muchas gracias, Ana, and thanks everybody, y pura vida para todo el mundo y ¡hagamos caos! [and pura vida to everyone, and let's make chaos!]

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.
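
[Editor's sketch] The continuous reconciliation Leo describes, where an agent compares a declared state with the actual state and converges the two, could be illustrated roughly as follows. This is plain Python and not any real GitOps tool's API; the `reconcile` function, the workload names, and the action strings are all hypothetical, chosen only to show the declarative idea.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return the actions an agent would take to converge `actual` onto `desired`.

    Both arguments map a (hypothetical) workload name to a replica count.
    The desired state only describes the end result, never the steps.
    """
    actions = []
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            actions.append(f"create {name} x{replicas}")
        elif actual[name] != replicas:
            actions.append(f"scale {name} {actual[name]}->{replicas}")
    # Anything running that is not declared gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions


if __name__ == "__main__":
    desired = {"web": 3, "api": 2}   # what a versioned Git commit might declare
    actual = {"web": 1, "old": 1}    # what is currently running
    for action in reconcile(desired, actual):
        print(action)
```

A real agent would run this comparison in a loop and apply the actions itself; once the two states match, the function returns an empty list and there is nothing left to do.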

In this episode, we cover:
00:00:00 - Introduction
00:04:25 - Cadence to Temporal
00:09:15 - Breaking down the Technology
00:15:35 - Three Tips for Using Temporal
00:19:21 - Outro

Links:
Temporal: https://temporal.io

Transcript

Jason: And just so I'm sure to pronounce your names right, it's Maxim Fateev?

Maxim: Yeah, that's good enough. That's my real name but it's—[laugh].

Jason: [laugh]. Okay.

Maxim: It's, in American, Maxim Fateev.

Jason: And Samar Abbas.

Samar: That sounds good.

Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build and operate modern applications. In this episode, Maxim Fateev and Samar Abbas join us to chat about the problems with orchestrating microservices and the software they've created to help solve those problems.

Jason: Hey, everyone, welcome to Break Things on Purpose, the podcast about reliability, chaos engineering, and all things SRE and DevOps. With me today, I've got Maxim Fateev and Samar Abbas, the cofounders of a company called Temporal. Maxim, why don't you tell us a little bit more about yourself?

Maxim: Hi, thanks for inviting me to the podcast. I have been around for quite a while. In 2002, I joined Amazon and Amazon was pretty small company back then, I think 800 developers; compared to its current size it was really small. I quickly moved to the platform team, among other things. And it wasn't AWS back then, it was just the software platform, actually was called [Splat 00:01:36].

I worked in a team which owned the old publish-subscribe technologies of Amazon, among other things. As a part of the team, I oversaw implementation of service-oriented architecture, Amazon [unintelligible 00:01:47] roll out services at large scale, and they built services for [unintelligible 00:01:51] Amazon so all this asynchronous communication, there was something my team was involved in. 
And I know this time that this is not the best way to build large-scale service-oriented architectures, or relying on asynchronous messaging, just because it's pretty hard to do without central orchestration. And as part of that, our team conceived and then later built a Simple Workflow Service. And I was tech leader for the public release of the AWS Simple Workflow Service.

Later, I also worked in Google and Microsoft. Later I joined Uber. Samar will tell his part of the story but we together built Cadence, which was, kind of, the open-source version of the same—based on the same ideas of the Simple Workflow. And now we driving Temporal open-source project and the company forward.

Jason: And Samar, tell us a little bit about yourself and how you met Maxim.

Samar: Thanks for inviting us. Super excited to be here. In 2010, I was basically wanted to make a switch from traditional software development like it used to happen back at Microsoft, to I want to try out the cloud side of things. So, I ended up joining Simple Workflow team at AWS; that's where I met Maxim for the first time. Back then, Maxim already had built a lot of messaging systems, and then saw this pattern where messaging turned out—[unintelligible 00:03:08] believe that messaging was the wrong abstraction to build certain class of applications out there.

And that is what started Simple Workflow. And then being part of that journey, I was, like, super excited. Since then, in one shape or another, I've been continuing that journey, which we started back then in the Simple Workflow team while working with Maxim. So, later in 2012, after shipping Simple Workflow, I basically ended up coming back to Azure side of things. 
I wrote this open-source library by the name of Durable Task Framework, which looks like later Azure Functions team ended up adopting it to build what they are calling as Azure Durable Functions.

And then in 2015, Uber opened up office here in Seattle; I ended up joining their engineering team in the Seattle office, and out of coincidence, both me and Max ended up joining the company right about the same time. Among other things we worked on together, like, around 2017, we started the Cadence project together, which was you can think of a very similar idea as like Simple Workflow, but kind of applying it to the problem we were seeing back at Uber. And one thing led to another and then now we are here basically, continuing that journey in the form of Temporal.

Jason: So, you started with Cadence, which was an internal tool or internal framework, and decided to strike out on your own and build Temporal. Tell me about the transition of that. What caused you to, number one, strike out on your own, and number two, what's special about Temporal?

Maxim: We built the Cadence from the beginning as an open-source project. And also it never was, like, Uber management came to us and says, “Let's build this technology to run applications reliably,” or workflow technology or something like that. It was absolutely a bottoms-up creation of that. And we are super grateful to Uber that those type of projects were even possible. But we practically started on our own, we build it first version of that, and we got resources later.

And [unintelligible 00:05:09] just absolutely grows bottoms-up adoption within Uber. It grew from zero to, like, over a hundred use cases within three years that this project was hosted by our team at Uber. But also, it was an open-source project from the beginning, we didn't get much traction first, kind of, year or two, but then after that, we started to see awesome companies like HashiCorp, Box, Coinbase, Checkr, adopt us. 
And there are a lot of others, it's just that not all of them are talking about that publicly. And when we saw this external adoption, we started to realize that thing within Uber, we couldn't really focus on external events, like, because we believe this technology is very widely applicable, we needed, kind of, have separate entity, like a company, to actually drive the technology forward for the whole world.

Like most obvious thing, you cannot have a hosted version [unintelligible 00:06:00] at Uber, right? We would never create a cloud offering, and everyone wants it. So, that is, kind of like, one thing led to another, Samar said, and we ended up leaving Uber and starting our own company. And that was the main reasoning is that we wanted to actually make this technology successful for everybody in the whole world, not just within Uber. Also the, kind of, non-technical but also technical reasons, one of the benefits of doing that was that we had actually accumulated quite pretty large technical debt when running, like, Cadence, just because we were in it for four years without single backwards-incompatible change because since [unintelligible 00:06:37] production, we still were on the same cluster with the same initial users, and we never had downtime, at least, like, without relat—infrequent outages.

So, we had to do everything in backwards-compatible manner. At Temporal, we could go and rethink that a little bit, and we spent almost a year just working on the next generation of technology and doing a lot of fixes and tons of features which we couldn't do otherwise. Because that was our only chance to do backwards-incompatible change. After our first initial production release, we have this promise that they're not going to break anyone, at least—unless we start thinking about the next major change in the project, which probably is not going to come in the next few years.

Samar: Yeah. 
One thing I would add is back then one of the value propositions that Uber was going after is provide rides as reliable as running water. And that translated into interesting system requirements for engineers. Most of the time, what ended up happening is product teams at Uber are spending a large amount of time building this resiliency and reliability into their applications, rather than going after building cool features or real features that the users of the platform cares about. And I think this is where—this is the problem that we were trying to solve with us back at Uber, where let us give you that reliability baked into the platform.

Why does every engineer needs to be a distributed systems engineer to deal with all sorts of failure conditions? We want application teams to be more focused on building amazing applications, which makes a lot of sense for the Uber platform in general. And this is the value proposition that we were going after with Cadence. And it basically hit a nerve with all the developers out there, especially within Uber. One of the very funny incidents early on, when we are getting that early adoption, one of the use cases there, the way they moved onto Cadence as an underlying platform is there was actually an outage, a multi-day outage in one of the core parts of that system, and the way they mitigated the outage is they rewrote that entire system on top of Cadence in a day, and able to port over that entire running system in production and build it on top of Cadence and run it in production. And that's how they mitigated that outage. So, that was, in my opinion, that was a developer experience that we were trying to strive, with Cadence.

Jason: I think, let's dive into that a little bit more because I think for some of our listeners, they may not understand what we're talking about with this technology. I think people are familiar with simple messaging, things like Kafka, like, “I have a distributed system. 
It's working asynchronously, so I do some work, I pass that into a queue of some sort, something pulls that out, does some more work, et cetera, and things act decoupled.” But I think what we're talking about here with workflows, explain that for our listeners a little bit more. What does it provide because I've taken a look at the documentation and some of the demos and it provides a lot of really cool features for reliability. So, explain it a little bit more, first.

Maxim: [crosstalk 00:09:54] describe pretty well how systems are built now. A lot of people, kind of, call it choreography. But basic idea is that you have a bunch of callbacks which listen on queues, then update certain data sources, databases, and then put messages into the queues back. And also they need to actually—in a real system also need to have durable timers, so you either build your own timer service, or you just poll your databases for messages to be in certain state to account for time. And scaling these things is non-trivial.

The worst part is that your system has a bunch of independent callbacks and you practically have very complex state machine and all these business requirements just practically broken in, like, a thousand, thousand little pieces which need to work together. This choreography in theory kind of works, but in practice is usually a mess. On top of that, you have very poor visibility into your system, and all this other requirements about retries and so on are actually pretty hard to get. And then if something goes wrong, good luck finding the problem. It goes into the orchestrat—it does orchestration.

Means that you implement your business logic in one place and then you just call into these downstream services to implement the business logic. The difference is that we know how to do that for short requests. 
Practically, let's say you get a request, your service makes the five downstream API calls, does something with those calls, maybe makes a little bit more calls, then [unintelligible 00:11:14] data. If this transaction takes, let's say, a second, is pretty easy to do, and you don't care about reliability that much if it fails in the middle. But as soon as they come to you and say, “Okay, but any of those calls can fail for three minutes,” or, “This downstream call can take them ten hours,” you practically say, “Okay, I have this nice piece of code which was calling five services and doing a few things. Now, I need to break it into 50 callbacks, queues, and whatever in database, and so on.”

It's Temporal [unintelligible 00:11:38] keep that code [unintelligible 00:11:39]. The main abstraction, which is non-obvious to people is that they practically make your process fully fault-tolerant, including stack variables, [unintelligible 00:11:49], and so on. So, if you make a call, and this call takes five hours, you're still blocked in exactly the same line of code. And in five hours, this line of code returns and then continues. If you're calling sleep for one month, you're blocked on this sleep line of code for one month, and then it just returns to the next line of code.

Obviously, there is some magic there, in the sense that we need to be able to [unintelligible 00:12:11] and activate state of your workflow—and we call it workflows but this code—but in exactly the same state, but this is exactly what Temporal provides out of the box. You write code, this failure doesn't exist because if your process fails, we just reconstruct the exactly the same state in a different process, and it's not even visible to you. I sometimes call it a fault-oblivious programming because your program not even aware that fault happened because it just automatically self-healing. That is main idea. 
So, we'll give you a fault-tolerant code is guaranteed to finish execution.

And on top of that, there are a lot of things which kind of came together there. We don't invoke these services directly, usually, we invoke them from queues. But these queues are hidden in a sense because all you say execute, for instance, some activity, call some API, and then this API is in workflow asynchronously. But for your calling code, it's not visible. It's, kind of, just normal RPC call, but behind the scenes, it's all asynchronous, has infinite retries, exponential retries, a [unintelligible 00:13:11] granular tasks, and so on.

So, there are a lot of features. But the main thing which we call workflow which ties all these together, is just your business logic because all it does is just practically makes this call. And also it has state. It's stateful because you can keep state in variables and you don't need to talk to database. So, it can have you—for example, one of the use cases we saw is customer loyalty program.

Imagine you want to implement the UI airline, you need to implement points, give points to people. So, you listen to external events every time your trip finished, your flight finished, and then you will get event or a need to increment that. So, in normal system, you need to get databases, queues, and so in our world, you would just increment local variables saying, “Yeah, it's a counter.” And then when this counter reaches a hundred, for example, you will call some downstream service and say, “Okay, promote that person to the next tier.” And you could write this practically your type of application in 15, 20 minutes on your desktop because all of the rest is taken care of by Temporal.

It assumes that you can have millions of those objects and hundreds of millions of those objects running because you have a lot of customers—and we do that because we built it at Uber, which had hundreds of millions of customers—and run this reliably because you don't want to lose data. 
It's [unintelligible 00:14:29] your financial data. This is why Temporal is very good for financial transactions, and a lot of companies like Coinbase, for example, uses them for their financial transactions because we provide much better [unintelligible 00:14:39] than alternative solutions there.

Jason: That's amazing. In my past as an engineer of working on systems and trying to do that, particularly when you mentioned things like retries, I'm thinking of timeouts and things where you have a long-running process and you're constantly trying to tune, what if one service takes a long time, but then you realize that up the stack, some other dependency had ended up timing out before your timeout hit, and so it's suddenly failed even though other processes are going, and it's just this nightmare scenario. And you're trying to coordinate amongst services to figure out what your timeout should be and what should happen when those things timeout and how retries should coordinate among teams. So, the idea of abstracting that away and not even having to deal with that is pretty amazing. So, I wanted to ask you, as this tool sounds so amazing, I'm sure listeners will want to give this a try if they're not already using it.

If I wanted to try this out, if I wanted to implement Temporal within my application, what are, say, three things that I need to keep in mind to ensure that I have a good experience with it, that I'm actually maintaining reliability, or improving reliability, due to using this framework, give us some tips for [crosstalk 00:15:53] Temporal.

Maxim: One thing which is different about Temporal, it requires you rethinking how application is structured. It's a kind of new category of software, so you cannot just take your current design and go and translate to that. For example, I've seen cases that people build system using queues. They have some downstream dependency which can be down. 
For example, I've seen the payment system, the guys from the payment system came to us and said, “What do we do? We have downstream dependency to bank and they say in the SLA that can be down for three days. Can we just start workflow on every message if the system is down, and keep retrying for three days?” Well, technically, answer is yes, you can absolutely create workflows, productivity retried options for retry for a month, and it is going to work out of the box, but my question to those people first time is what puts message in that Kafka queue which you are listening to, you know?

And they are, “Oh, it's a proxy service which actually does something and”—“But how does the service is initiated?” And they say, “Oh, it's based on another Kafka queue.” And the whole thing, I kind of ended up understanding that they had this huge pipeline, this enormous complexity, and they had hard time maintaining that because all these multiple services, multiple data sources, and so on. And then we ended up redesigning the whole pipeline as just one workflow, and instead of just doing this by little piece, and it helped them tremendously; it actually practically completely changed the way application was designed and simplified, they removed a lot of code, and all these, practically, technology, and reliability was provided by Temporal out of the box. So, my first thing is that don't try to do piecemeal.

Again, you can. Yeah, actually, they initially even did that, just to try to prove the technology works, but at the end, think about end-to-end scenario, and if you use Temporal for end-to-end scenario, you will get [10x 00:17:37] benefits there. That probably would be the most important thing is just think about design.

Samar: So, I think Max, you gave a pretty awesome description of how you should be approaching building applications on top of Temporal. One of the things that I would add to that is think about how important durability is for your application. 
Do you have some state which needs to live beyond a single request response? If the answer to those questions is yes, then I think Temporal is an awesome technology, which helps you deal with that complexity, where a traditional way of building those applications, using databases, queues, retry mechanisms, durable timers, as Max mentioned, for retrying for three days because we have—instead of building a sub-system which deals with this retrying behavior, you can literally just, when you schedule an activity, you just put an activity options on it to, say, put your retry policy and then it will retry based on three days, based on your policy. So, I think—think holistically about your system, think about the statefulness and how important that state is, and I think Temporal is an amazing technology which is really developer-friendly because the way currently industry—I think people have accepted that building these class of applications for cloud environment is inherently complex.

So, now a lot of innovation which happens is how to deal with that complexity as opposed to what Temporal is trying to do is, no, let's simplify the entire experience of building such applications. I think that's the key value that we are trying to provide.

Jason: I think that's an excellent point, thinking of both of those tips of thinking about how your application is designed end-to-end, and also thinking about those pieces. I think one of the problems that we have when we build distributed applications, and particularly as we break things into smaller and smaller services, is that idea of for something that's long-running, where do I store this information? And a lot of times we end up creating these interesting, sort of, side services to simply pass that information along, ensure that it's there because we don't know what's going to happen with a long-running application. So, that's really interesting. Well, I wanted to thank you for joining the podcast. 
This has been really fantastic. If folks want to find more information about Temporal or getting involved in the open-source project or using the framework, where should they head?

Maxim: At temporal.io. And we have links to our Git repos, we have links to our community forum, and we also have, at the bottom of that page, there is a link to the Slack channel. So, we have pretty vibrant community, so please join it. And we are always there to help.

Jason: Awesome, thanks again.

Maxim: Thank you.

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.
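
[Editor's sketch] The loyalty-program example Maxim describes, a counter kept in a local variable plus a retried downstream call once it reaches a hundred, might look something like the plain Python below. This deliberately does not use the Temporal SDK, and the names `call_with_retry`, `loyalty_workflow`, and `promote` are hypothetical. What it cannot show is the part Temporal is said to provide: keeping that local state alive across process failures.

```python
import time


def call_with_retry(fn, *, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** attempt)


def loyalty_workflow(events, promote, threshold=100):
    """Add up points from finished-flight events; promote once at the threshold.

    `events` is an iterable of point amounts and `promote` stands in for the
    downstream "move this customer to the next tier" service call.
    """
    points = 0          # state lives in an ordinary local variable
    promoted = False
    for earned in events:
        points += earned
        if not promoted and points >= threshold:
            call_with_retry(promote)
            promoted = True
    return points, promoted


if __name__ == "__main__":
    print(loyalty_workflow([40, 40, 40], promote=lambda: None))
```

The business logic stays in one function that reads top to bottom, which is the contrast with the queue-and-callback design discussed earlier in the episode.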

In this episode, we cover:
00:00:00 - Introduction
00:03:15 - FinOps Foundation and Multicloud
00:07:00 - Costs
00:10:40 - John's History in Reliability Engineering
00:16:30 - The Actual Cost of Outages, Security, Etc.
00:21:30 - What John Measures
00:28:00 - What John is Up To/Latinx in Tech

Links:
Palo Alto Networks: https://www.paloaltonetworks.com/
FinOps Foundation: https://www.finops.org
Techqueria.org: https://techqueria.org
LinkedIn: https://www.linkedin.com/in/johnmartinez/

Transcript

John: I would say a tip for better monitoring, uh, would be to, uh turn it on. [laugh]. [unintelligible 00:00:07] sounds, right?

Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode we chat with John Martinez, Director of Cloud R&D at Palo Alto Networks. John's had a long career in tech, and we discuss his new focus on FinOps and how it has been influenced by his past work in security and chaos engineering.

Jason: So, John, welcome to the show. Tell us a little bit about yourself. Who are you? Where do you work? What do you do?

John: Yeah. So, John Martinez. I am a director over at Palo Alto Networks. I have been in the cloud security space for the better part of, I would say, seven, eight years or so. And currently, am in transition in my role at Palo Alto Networks.

So, I'm heading headstrong into the FinOps world. So, turning back into the ops world to a certain degree and looking at what can we do, two things: better manage our cloud spend and gain a lot more optimization out of our usage in the cloud. So, very excited about new role.

Jason: That's an interesting new role. I'd imagine that at Palo Alto Networks, you've got quite a bit of infrastructure and that's probably a massive bill.

John: It can be. It can be. Yeah, [laugh] absolutely. We definitely have large amount of scale, in multi-cloud, too, so that's the added bonus to it all. 
FinOps is kind of a new thing for me, so I'm pretty happy to, as I dig back into the operations world, very happy to discover that the FinOps Foundation exists and it kind of—there's a lot of prescribed ways of both looking at FinOps, at optimization—specifically in the cloud, obviously—and as well as there's a whole framework that I can go adopt.

So, it's not like I'm inventing the wheel, although having been in the cloud for a long time, and I haven't talked about that part of it but a lot of times, it feels like—in my early days anyway—felt like I was inventing new wheels all the time. As being an engineer, the part that I am very excited about is looking at the optimization opportunities of it. Of course, the goal, from a finance perspective, is to either reduce our spend where we can, but also to take a look at where we're investing in the cloud, and if it takes more of a shift as opposed to a straight-up just cut the bill kind of thing, it's really all about making sure that we're investing in the right places and optimizing in the right places when it comes down to it.

Jason: I think one of the interesting perspectives of adopting multi-cloud is that idea of FinOps: let's save money. And the idea, if I wanted to run a serverless function, I could take a look at AWS Lambda, I could take a look at Azure Functions to say, “Which one's going to be cheaper for this particular use case,” and then go with that.

John: I really liked how the FinOps Foundation has laid out the approach to the lifecycle of FinOps. So, they basically go from the crawl, walk, run approach which, in a lot of our world, is kind of like that. It's very much about setting yourself up for success. Don't expect to be cutting your bill by hundreds of thousands of dollars at the beginning. 
It's really all about discovering not just how much we're spending, but where we're spending it.

I would categorize the pitting the cloud providers against each other to be more on the run side of things, and that eventually helps, especially in the enterprise space; it helps enterprises to approach the cloud providers with more of a data-driven negotiation, I would say [laugh] to your enterprise spend.

Jason: I think that's an excellent point about the idea of that is very much a run. And I don't know any companies within my sphere and folks that I know in the engineering space that are doing that because of that price competition. I think everybody gets into the idea of multi-cloud because of this idea of reliability, and—

John: Mm-hm.

Jason: One of my clouds may fail. Like, what if Amazon goes down? I'd still need to survive that.

John: That's the promise, right? At least that's the promise that I've been operating under for the 11 years or so that I've been in the cloud now. And obviously, in the old days, there wasn't a GCP or an Azure—I think they were in their infancy—there was AWS… and then there was AWS, right? And so I think eventually though you're right, you're absolutely right. Can I increase my availability and my reliability by adopting multiple clouds?

As I talk to people, as I see how we're adopting the multiple clouds, I think realistically though what it comes down to is you adopt a cloud, or teams adopt a cloud specifically for, I wouldn't say some of the foundational services, but mostly about those higher-level niche services that we like. For example, if you know large-scale data warehousing, a lot of people are adopting BigQuery and GCP because of that. If you like general purpose compute and you love the Lambdas, you're adopting AWS and so on, and so forth. And that's what I see more than anything is, I really like a cloud's particular higher level service and we go and we adopt it, we love it, and then we build our infrastructure around it. 
From a practical perspective, that's what I see.
I'm still hopeful, though, that there is a future somewhere there where we can commoditize even the cloud providers, maybe [laugh]. And really go from Cloud A to Cloud B to Cloud C, and just adopt one based on pricing that's cheaper, or more performant, or whatever other dimensions are important to me. But maybe, maybe. We'll remain hopeful. [laugh].
Jason: Yeah, we're still very much in that spot where, even for the basics, like if I want a virtual machine, those are still so different between all the clouds. And I mean even last week, I was working on some Terraform and the idea of building it modularly, and in my head thinking, “Well, at some point, we might want to use one of the other clouds so let's build this module,” and thinking, “Realistically, that's probably not going to happen.”
John: [laugh]. Right. I would say that there's the other hidden cost about this and it's the operational costs. I don't think we spend a whole lot of time talking about operational costs, necessarily, but what is it going to cost to retrain my DevOps team to move from AWS to GCP, as an example? What are the underlying hidden costs that are there?
What traps am I going to fall into because of that? It seems cool; Terraform does a great job of giving us that pane into the multiple clouds from an operations perspective. Kubernetes does a great job as well of giving some of that visibility into the underlying—and I hate to use it this way but ‘hardware’ [laugh] virtual hardware—that's like EC2 or Google Compute, for example. And they do great jobs, but at the end of the day we're still spending a lot of time figuring out what the foundational services are. So, what are those hidden costs?
Anyway, long story short, as part of my journey into FinOps, I'm looking forward to not just uncovering the basics of FinOps, which is: what are we spending? Where are we spending it? What are the optimization opportunities? 
But also take a look at some of the more hidden types of costs. I'm very interested in that aspect of the FinOps world as well. So, I'm excited.
Jason: Those hidden costs are also interesting because I think, given your background in security—
John: Mm-hm.
Jason: —one of the challenges in multi-cloud is, if I'm an expert in AWS and suddenly we're multi-cloud and I have to support GCP, I don't necessarily know all of those correct settings and how to harden and build my systems. I know a model and a general framework, but I might be missing something. Talk to me a bit more about that as a security person.
John: Yeah.
Jason: What does that look like?
John: Yeah, yeah. It's very nuanced, for sure. There are definitely some efforts within the industry to help alleviate some of that nuance and some of those hidden settings that I might not think about. For example, CIS as a community: the Foundations Benchmarks that CIS produces can be pretty exhaustive—and there are benchmarks for the major clouds as well—and those go a long way to try and describe, at least, what are the main things I should look at from a security perspective? But obviously, there are new threats coming along every day.
So, if I was advising security teams, security operations teams specifically, it would be definitely to keep abreast of what the latest threats are and go take a look at what some of the exploit kits are looking for or doing, and adopt some of those hidden checks into, for example, your security operations center, what you react to, what the incident responses are going to be to some of those emerging threats. For sure it is a challenge, and it's a challenge that the industry faces and one that we face every day. 
And an exploit that might be available for EC2 may be different on Google Compute or maybe different on Azure Compute.Jason: There's a nice similarity or parallel there to what we often talk about, especially in this podcast, is we talk about chaos engineering and reliability and that idea of let's look at how things fail and take what we know about one system or one service, and how can we apply that to others? From your experience doing a wide breadth of cloud engineering, tell me a bit more about your experience in the reliability space and keeping—all these great companies that you've worked for, keeping their systems up and running.John: I think I have one of the—fortunate to have one of the best experiences ever. So, I'll have to dig way back to 11 years ago, or so [laugh]. My first job in the cloud was at Netflix. I was at Netflix right around the time when we were moving applications out of the data center and into AWS. Again, fortunate; large-scale, at the cusp of everything that was happening in the cloud, back in those days.I had just helped finish—I was a systems engineer; that's where I transitioned from, systems engineering—and just a little bit of a plug there, tomorrow is Sysadmin Day, so I still am an old school sysadmin at heart so I still celebrate Sysadmin Day. [laugh]. But I was doing that transition from systems engineering into cloud engineering at Netflix, just helped move a database application out from the data center into AWS. We were also adopting in those days, very rapidly, a lot of the new services and features that AWS was rolling out. For example, we don't really think about it today anymore, but back then EBS-backed instances was the thing. [laugh].Go forth and every new EC2 instance we create is going to be EBS-backed. Okay, great. March, I believe it was March 2011, one of AWS's very first, and I believe major, EBS outages occurred. [laugh]. 
Yeah, lots of, lots of failure all over the place.And I believe from that a lot of what—at least in Gremlin—a lot of that Chaos Monkey and a lot of that chaos engineering really was born out of a lot of our experiences back then at Netflix, and the early days of the cloud. And have a lot of the scars still on me. But it was a very valuable lesson that I take now every day, having lived through it. I'm sure you guys at Gremlin see a lot of this with your customers and with yourselves, right, is that the best you can do is test those failure scenarios and hope that you are as resilient as possible. Could we have foreseen that there was going to be a major EBS outage in us-east-1? Probably.I think academically we thought about it, and we were definitely preaching the mantra of architect for failure, but it still bit us because it was a major cascading outage in one entire region in AWS. It started with one AZ and it kept rolling, and it kept rolling. And so I don't know necessarily in that particular scenario that we could have engineered—especially with the technology of the day—we could have engineered full-on failover to another region, but it definitely taught us and me personally a lot of lessons around how to architect for failure and resiliency in the cloud, for sure.Jason: I like that point of it's something that we knew theoretically could maybe happen, but it always seems like the odds of the major catastrophes are so small that we often overlook them and we just think, “Well, it's going to be so rare that it'll never happen, so we don't think about it.” As you've moved forward in your career, moving on from Netflix, how has that shaped how you approach reliability—this idea of we didn't think EBS could ever go down and lead to this—how do you think of catastrophic failures now, and how do you go about testing for them or architecting to withstand them?John: It's definitely stayed with me. 
Every ops job that I've had since, it's something that I definitely take into account in any of those roles that I have. As the opportunity came up to speak with you guys, I wanted to think about reliability and chaos in terms of cloud spend, and how can I marry those two worlds together? Obviously, the security aspect of things, for sure, is there. It's expecting the unexpected and having the right types of security monitoring in place.
And I think that's—kind of going back to an earlier comment that I made about these unexpected or hidden costs that are there lying dormant in our cloud adoption, just like I'm thinking about the cost of security incidents, the cost of failure, what does that look like? These are answers I don't have yet but the explorer in me is looking forward to uncovering a lot of what that's going to be. If we talk a year from now and I have some of that prescribed, and thought of, and discovered, I think it'll be awesome to talk about where we are. It's an area that I definitely take seriously, and I have applied it not just to operational roles, but as I got into more customer-facing roles in the last 11 years, advising customers both as a sales engineer and as head of customer success at a cloud security startup that I worked for, Evident.io, and then eventually moving here to Palo Alto Networks, it's like, how do I best advise and think about—when I talk to customers—failure scenarios, reliability, chaos engineering? I owe it all to that time that I spent at Netflix and those experiences very early on, for sure.
Jason: Coming back to those hidden costs is definitely an important thing. Especially I'm sure that as you interact with folks in the FinOps world, there's always that question of, “Why do I have so much redundancy? 
Why am I paying for an entire AZ's worth of infrastructure that I'm never using?” There's always the comment, “Well, it's like a spare tire; you pay for an extra tire in case you have a flat.” But on the other hand, there is this notion of how much are we actually spending versus what does an outage really cost me?
John: Right. We thought about that question very early on at another company I worked at after Netflix and before the startup. I was fortunate again to work in another large-scale environment, at Adobe actually, working on the early days of their Creative Cloud implementation. Very different approach to doing the cloud than Netflix in many ways. It was one of the things that we definitely made a conscious effort on, and we thought about it in terms of an insurance policy.
So, for example, S3 replication—so replicating our data from one region to another—in those days, an expensive proposition but one that we looked at, and we intentionally went in with, “Well, no, this is our customer data. How much is that customer data worth to us?” And so we definitely made the conscious decision to invest. I don't call it ‘cost’ at that point; I call that an investment. To invest in the reliability of that data, having that insurance policy there in case something happened.
You know, the chance of catastrophic failure in one region, especially for a service as reliable and as resilient as S3, is very minuscule, I would say, and in practice, it has been, but we have to think about it in terms of investing. We definitely made the right types of choices, for sure. It's an insurance policy. It's there because we need it to be there because that's our most precious commodity, our customers' data.
Jason: Excellent point about that being the most precious commodity. We often feel that our data isn't as valuable as we think it is and that the value for our companies is derived from all of the other things, and the products, and such. But when it comes down to it, it is that data. 
And it makes me think we're currently in this sort of world where ransomware has become the biggest headline, especially in the security space, and as I've talked with people about reliability, they often ask, “Well, what does Gremlin do security-wise?” And we're not a security product, but it does bring that up: if your data systems were locked and you couldn't get at your customer information, that's pretty similar to having a catastrophic outage of losing that data store and not having a backup.
John: I've thought about this, of course, in the last few weeks, obviously. There have been very, very public, very telling types of issues with ransomware and the underlying issues of supply chain attacks. What would we do [laugh] if something like that were to happen? Obviously, rhetorically, what would we do? And lots of companies are paying the ransom because they're being held at gunpoint, you know, “We have your data.”
So yeah, I mean, in a situation like the example I gave before, couldn't just the replication of, for example, my entire S3 bucket where my customer data is thwart a situation like that? And then you think about, kind of like, okay, let's think about this further. If we do it in the same AWS account, as an example, if the attacker obtained my IAM credentials, then it really comes down to the same thing because, “Oh, look it, there's another bucket in that other region over there. I'm going to go and encrypt all of those objects, too. Why not, right?” [laugh].
And so, it also begs the question of the design principles and decisions of, well, okay, maybe do I ship it to a different account where my security context is different, my identity context is different? And so there's a lot of areas to explore there. And it's a very good question and one that we definitely do need to think about, in terms of catastrophic failure because that's the way to think about it, for sure.
Jason: Yeah. 
So, many parallels between that security and reliability, and all comes together with that FinOps, and how much are you—how much do we pay for all of this?John: Between the reliability and the security world, there's a lot of parallels because your job is about thinking what are the worst-case scenarios? It's, what could possibly go wrong? And how bad could it be? And in many cases, how bad is it? [laugh].Especially as you uncover a lot of the bad things that do happen in the real world every day: how bad is it? How do I measure this? And so absolutely there's a lot of parallels, and I think it's a very interesting point you make. And so… yeah so, Jason, how can we marry the two worlds of chaos engineering and security together? I think that's another very exciting topic, for sure.Jason: That is, absolutely. You mentioned just briefly in that last statement, how do you measure it?John: Yep.Jason: That comes up to something that we were chatting about earlier is monitoring, and what do you measure, and ensuring that you're measuring the right things. From your experience building secure systems, talk to me about what are some of the things that you like to measure, that you like to get observability on, that maybe some folks are overlooking.John: I think the overlooking part is an interesting angle, but I think it's a little bit more basic than that even. I'll go to my time in the startup—so at Evident.io—mainly because I was in customer success and my job was to talk to our customers every day—I would say that a bunch of our customers—and they varied based on maturity level, but we were working with a lot of customers that were new in the cloud world, and I would say a lot of customers were still getting tripped up by a lot of the basic types of things. For example—what do I mean by that? Some of the basic settings that were incorrect were things just, like, EC2 security groups allowing port 22 in from the world, just the simple things like that. 
Or publicly accessible S3 buckets.
So, I would say that a lot of our customers were still missing a lot of those steps. And I would say, in many of the cases, putting my security hat on, the first thing you go to is, well, there's an external hacker trying to do something bad in your AWS accounts, but really, the majority of the cases were all just mistakes; they were honest. I'm an engineer setting up a dev account and it's easier for me, instead of figuring out what my egress IP is for my company's VPN, it's easier for me just to set port 22 to allow all from the world. A few minutes later, there you go. [laugh]. Exploit taken, right? It's just the simple stuff; we really as an industry do still get tripped up by the simple things.
I don't know if this tracks with the reliability world or the chaos engineering world, but I still see that way too much. And that just tells me that even if we are a cloud-mature company or organization, there's still going to be scenarios where that engineer at two in the morning just decides that it's just easier to open up the firewall on EC2 than it is to do, quote-unquote, “The right thing.” Then we have an issue. So, I really do think that we can't let go of not just monitoring the basics, but also getting better as an industry at alerting on the basics and when there are misconfigurations on the basics, and shortening that time to alert, because that really is—especially in the security world—that really is very critical: to shorten that window between when that configuration setting is made and when that same engineer who made the misconfiguration gets alerted to the fact that it is a misconfiguration. So. I'll go to that: it's the basics. [laugh].
Jason: I like that idea of moving the alert forward, though. Because I think a lot of times you think of alerts as something bad has happened and so we're waiting for the alert to happen when there's wrongful access to a system, right? 
Someone breaks in, or we're waiting for that alert to happen when a system goes down. And we're expecting that it's purely a response mechanism, whereas the idea of let's alert on misconfigurations, let's alert on things that could lead to these, or that will likely lead to these wrong outcomes. If we can alert on those, then we can head it off.John: It's all the way. And in the security world, we call it shifting left, shifting security all the way to the left, all the way to the developer. Lots of organizations are making a lot of the right moves in that direction for embedding security well into the development pipeline. So, for example, I'll name two players in the Infrastructure as Code as we call it in the security space. And I'll name the first one just because they're part of Palo Alto Networks now, so Bridgecrew; so very strong, open-source solution in that space, as well as over on the HashiCorp side where Sentinel is another example of a great developer-forward shift-left type of tool that can help thwart a lot of the simple security misconfigurations, right from your CI/CD pipelines, as opposed to the reaction time over here on the right, where you're chasing security misconfigurations.So, there's a lot of opportunity to shorten that alert window. And even, in fact, I've spent a lot of time in the last couple of years—I and my team have spent a lot of time in the last couple of years thinking about what can the bots do for us, as opposed to waiting for an alert to pop up on a Slack message that says, “Hey, engineer. You've got port 22 open to the world. You should maybe think about doing something.” The right thing to do there is for something—could be something as simple as an alert making it to a Lambda function and the Lambda function closing it up for you in the middle of the night when you're not paying attention to Slack, and the bot telling you, “Hey, engineer. By the way, I closed the port up. 
That's why it's broken this morning for you.” [laugh]. “I broke it intentionally so that we can avoid some security problems.”So, I think there's the full gamut where we can definitely do a lot more. And that's where I believe the new world, especially in the security world, the DevSecOps world, can definitely help embed some of that security mindset with the rest of the cloud and DevOps space. It's certainly a very important function that needs to proliferate throughout our organizations, for sure.Jason: And we're seeing a lot of that in the reliability world as well, as people shift left and developers are starting to become more responsible for the operations and the running of their services and applications, and including being on call. That does bring to mind that idea, though—back to alerting on configurations and really starting to get those alerts earlier, not just saying that, “Hey, devs, you're on call so now you share a pain,” but actually trying to alleviate that pain even further to the left. Well, we're coming up close to time here. So, typically at this point, one thing that I like to do is we like to ask folks if they have anything to plug. Oftentimes that's where people can find you on social media or other things. I know that you're connected with Ana through Latinx in Tech, I would love to share more about that, too. So.John: For sure, yeah. So, my job in terms of my leadership role is definitely to promote a lot of diversity, inclusion, and equity, obviously, within the workspace. Personally, I do also feel very strongly that I should be not just preaching it, but also practicing it. 
So, I discovered in the last year—in fact, it's going to be about a year since I joined Techqueria—so techqueria.org—and we definitely welcome anybody and everybody.
We're very inclusive, all the way from if you're a member of the Latinx community and in technology, definitely join us, and if you're an ally, we definitely welcome you with open arms, as well, to join techqueria.org. It is a very active and very vibrant community on Slack that we have. And as part of that, I and a couple of people in Techqueria are running a couple of what we call cafecitos, which is the Spanish word for coffees, coffee meetings.
So, it's a social time, and I'm involved in helping lead both the cybersecurity cafecito—we call it Cafecito Cibernético, which happens every other Friday. And it's security-focused, it's security-minded, we go everywhere from being very social and just talking about what's going on with people personally—so we like to celebrate personal wins, especially for those that are joining the job market or just graduating from school, et cetera, and talk about their personal wins, as well as talk about the happenings, like for example, a very popular topic of late has been supply chain attacks and ransomware attacks, so definitely very, very timely there. As well as I'm also involved—being in the cloud security space, I'm bridging, sort of, two worlds between the DevOps world and the security world; more recently, we started up the DevOps Cafecito, which is more focused on the operations side. And that's where, you know, happy to have Ana there as part of that Cafecito and helping out there. Obviously, there, it's a lot of the operations-type topics that we talk about; lots of Kubernetes talk, lots of looking at how the SRE and the DevOps jobs look in different places.
And I wouldn't say I'm surprised by it, but it's very nice to see that there is also a big difference with how different organizations think about reliability and operations. 
And it's varied all over the place and I love it, I love the diversity of it. So anyway, so that's Techqueria, so very happy to be involved with the organization. I also recently took on the role of being the chapter co-director for the San Francisco chapter, so very happy to be involved. As we come out of the pandemic, hopefully, pretty soon here [laugh] right—as we're coming out of the pandemic, I'll say—but looking forward to that in-person connectivity and socializing again in person, so that's Techqueria.So, big plug for Techqueria. As well, I would say for those that are looking at the FinOps world, definitely check out the FinOps Foundation. Very valuable in terms of the folks that are there, the team that leads it, and the resources, if you're looking at getting into FinOps, or at least gaining more control and looking at your spend, not so much like this, but with your eyes wide open. Definitely take a look at a lot of the work that they've done for the FinOps community, and the cloud community in general, on how to take a look at your cloud cost management.Jason: Awesome. Thanks for sharing those. If folks want to follow you on social media, is that something you do?John: Absolutely. Mostly active on LinkedIn at johnmartinez on LinkedIn, so definitely hit me up on LinkedIn.Jason: Well, it's been a pleasure to have you on the show. Thanks for sharing all of your experiences and insight.John: Likewise, Jason. Glad to be here.Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover: What Kessel Run is Doing: 00:01:27 Failure Never has a Single Point: 00:05:50 Lessons Learned: 00:10:50 Working the DOD: 00:13:40 Automation and Tools: 00:18:02
Links: Kessel Run: https://kesselrun.af.mil Kessel Run LinkedIn: https://www.linkedin.com/company/kesselrun/
Transcript
Omar: But I'll answer as much as I can. And we'll go from there.
Jason: Yeah. Awesome. No spilling state secrets or highly classified info.
Omar: Yes.
Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems.
Jason: Welcome back to Break Things on Purpose. Today with us we have guest Omar Marrero. Omar, welcome to the show.
Omar: Thank you. Thank you, man. Yeah, happy to be here.
Jason: Yeah. So, you've been doing a ton of interesting work, and you've got a long history. For our listeners, why don't you tell us a little bit more about yourself? Who are you? What do you do?
Omar: I've been in the military, I guess, public service for a while. So, I was military before, left that and now I've joined as a government employee. I love what I do. I love serving the country and supporting the warfighters, making sure they have the tools. And throughout my career, it's been basically building tools for them, everything they need to make their stuff happen.
And that's what drives me. That's my passion. If you've got the tool to do your mission, I'm in and I'll make that happen. That's kind of what I've done for the whole of my career, and chaos has always been involved there in some fashion. Yeah, it's been a pretty cool run.
Jason: So, you're currently doing this at a company called Kessel Run. Tell us a little bit more about Kessel Run.
Omar: So, we deliver combat capability that can sense or respond to conflict in any domain, anywhere, any time. We deliver award-winning software that our warfighters love. So, Kessel Run's kind of… you might think of it as a software factory within the DOD. 
So, the whole creation of Kessel Run is to deliver quickly, fast. If you follow the news, you know DOD follows waterfall a little bit.
So, the whole creation of Kessel Run was to change that model. And that's what we do. We deliver continuously non-stop. Our users give us feedback and within hours, they got it. So, that's the nature behind Kessel Run. It's like a hybrid acquisition model within the government.
Jason: So, I'm curious then, I mean, you obviously aren't responsible for the company naming, but I'm sure many of our listeners being Star Wars fans are like, “Oh, that sounds familiar.”
Omar: Yep, yep.
Jason: If you haven't checked out Kessel Run's website, you should go do that; they have a really cool logo. I'm guessing that relates to just the story of Kessel Run being like, doing it really fast and having that velocity, and so bringing that to the DOD, is that the connection?
Omar: Actually, it goes into smuggling DevSecOps into the DOD, so the 12 parsecs. So, that's where it comes from. So, we are smuggling that DevSecOps into the DOD; we're changing that model. So, that's where it comes from.
Jason: I love that idea of we're going to take this thing and smuggle it in, and that rebellious nature. I think that dovetails nicely into the work that you've been doing with chaos engineering. And I'm curious, how did you get into chaos engineering? Where did you get your start?
Omar: I've been breaking things forever. So, part of that, delivering tools that our warfighters can use, that's been my jam. So, I've been doing, you can say, chaos forever. I used to walk around, unplug power cables, network cables, turn down [WAN 00:03:24]. Yeah, that was it.
Because we used to build these tools and they're like, “Oh, I wonder if this happens.” “All right, let's test it out. 
Why not?” Pull the cable and everybody would scream and say, “What are you doing?” It was like, “We figured it out.”
But yeah, I've been following chaos engineering for a while, ever since Netflix started doing it and Chaos Monkey came out and whatnot, so that's been something that's always been on my mind. It's like, “Ah, this would be cool to bring into the DOD.” And Kessel Run just made that happen. With Kessel Run, the way we build tools, our distributed systems, it was like, “Yep, this is the prime time to bring chaos into the DOD.” And Kessel Run just adopted it.
I tossed the idea, I was like, “Hey, we should bring chaos into Kessel Run.” And we slowly started ramping up, and we built a team for it; the team is called Bowcaster. So, we follow the breaking stuff. And that's it. So, we've matured, and we've deployed and, of course, we've learned how to deploy chaos in our different environments. And I mean, yeah, it's been a cool run.
Jason: Yeah, I'm curious. You mentioned starting off simply, and that's always what we recommend to people to do. Tell us a little bit more about that. What were some of the tests that you ran then, and then maybe how have they matured, and what have you moved into?
Omar: So, our first couple of tests were very simple. Hey, we're going to test a database failover, and it was really manual at that point. We would literally go in and turn off Database A and see what happened. So, it was very basic, very manual work. We used to record them so we can show them off like, “Hey, check this out. This is what we did.”
So, from there, we matured. We got a little bit more complex. We eventually got to the point where we were actually corrupting databases in production and seeing what happens. You should have seen everybody's faces when we proposed that. 
So, from there, we're running basically, we call it ‘Chaos Plus' in Kessel Run.So, we've taken chaos engineering, the concept of chaos engineering, right, breaking things on purpose, but we've added performance engineering on top of it, and we've added cybersecurity testing on top of it. So, we can run a degraded system, and at the same time say, “All right, so we're going to ramp up and see what a million users does to our app while it's fully degraded.” And then we would bring in our cyber team and say, “All right, our system is degraded. See if you can find a vulnerability in it.” So, we've kind of evolved.And I call it, put chaos on a little bit of steroids here. But we call it Chaos Plus; that's our thing. We've recently added fuzzing while we're doing chaos. So, now we got performance chaos, our cyber team, and we're fuzzing the systems. So, I'm just going to keep going until somebody screams at me and says, “Omar, that's too much.” But that's essentially a little bit of our ride in Kessel Run.Jason: That's amazing. I love that idea of we're going to do this test, and then we're going to see what else can happen. One of the things that I've been chatting with a bunch of folks recently about is this idea, we always talk about, especially in the resilience engineering space, that failure never has a single point. It's not a singular root cause; it's always contributing factors. 
And the problem is, when you're doing chaos engineering, you're usually testing one thing.
And then it's like, “Oh, I did the failover on that database and that worked.” I've been suggesting that people now start to do, “Well, if this is in a degraded state, what are the contributing factors—if that's still working, what are the contributing factors that can lead to a major catastrophe?” That's one of the nice things that actually performing these failures allows you to do rather than just imagining them and trying to work up some sort of response process to your imagination.
Omar: That's our thing. So, from our perspective, that's what I charge the team to do is like, “Hey, we need to make sure these things are working.” Comes back to my passion, right? We're delivering tools to the warfighters; the warfighter needs to have tools that work. And that's what Kessel Run does; that's what Kessel Run exists for.
We deliver that award-winning software that our airmen love. So, following that trend, that's where chaos comes in place. So, we're building fancy tools, and we got an awesome platform that supports it and all that stuff. We're just there to make sure, “Hey yeah, this is engineered correctly. It's responsive to fault or any kind of failure.” And we just—I mean, we're literally blasting it with anything we can imagine to make sure it could support that.
Jason: I'm curious if you could dive into some details about one of your recent chaos engineering experiments. Was anything unusual or unexpected? And what did you learn from it?
Omar: So, I think one of the cool ones, which is the latest one, was that database corruption. There was a lot of questions on, “Hey, we have some tools in place we built. The engineering is in place to make sure that if the database goes down, nothing is impacting our system and whatnot. What would happen if the database gets corrupted?” For some odd reason. 
I don't know, that's probably going to happen once in a million, I don't know. But it's like, “Hey, let's figure it out.” So, my team came up with an experiment; we went and we started corrupting databases in staging. It's like, “All right yeah, that was cool.” Oh, and then we went to the leader, she was like, “Hey, we want to do this in production and call an outage and see how the team responds.” And at the same time, we're going to throw a whole bunch of curves. We're going to disappear key people, we're going to make sure you don't have access to certain things. It was not just database corruption; we're going to throw curveballs at you like there's no tomorrow here. So, we did, and it was actually a pretty good experience. So, we figured out, hey, yeah, the database corruption just happens, whatnot, and the team, like our SRE team, actually figured it out. It took them a little bit because it was a lot of curveballs, but we learned, all right, if this does happen and we have all these issues happening at once, it's probably a non-realistic—I'd call it—fire drill, but it's something we got to prepare for just in case. We've learned from it and we actually practiced it again. So, from the initial time it took us to go through the curveballs, we did another one, threw different curveballs at them, and that was like a no-brainer. They're like, “Yep. We got this. Don't worry about it. We ran this through once, so we know.” Which is why we do these things. You want to practice and then, if there's an outage, shorten the time, make sure it's not impacting. What was really cool to see is, like, it didn't matter how many databases we corrupted and how many curveballs we threw at the system, there was never an impact to the end-user, which is the goal. We practice chaos to make sure that it's always working. So, we validated that our system can tolerate all these curveballs and all these things we were doing at it.
And it's something that we've never tried before, so it was pretty cool.

Jason: I love that you mentioned what you threw at people was maybe not realistic, it's not something that would happen in the real world, but I think it brings up that idea of when you're training for things, if you train harder, if you're an athlete and you train harder than you would normally in a game, and you're constantly stressing yourself, when it comes to that real-world situation, it just seems easy.

Omar: Yeah. And that's what the SRE team—because we do the normal, “Hey, we did the test,” and then we go, [it's like 00:10:31], “This is what we saw.” And then we actually asked for feedback from the teams. It's like, “Any way we could have done this test better?” The normal process. And they're like, “We loved this. We've learned so much that helps us either automate more scripts or streamline our process.” So, from our standpoint, we'll keep throwing curveballs. And I think they did that, aside from, hey, this is a very realistic scenario, and then we go to the—this is probably a little bit over the edge, but we still want to do it. We do both. It's good. Plus, it doesn't keep a same [unintelligible 00:11:04]. We're used to it. All of a sudden you're throwing all these curveballs at the team, they can nitpick from all these lessons learned and put better processes in place, make it faster, better engineering. The team's awesome. All the team that supports Kessel Run, our SRE team, our platform team, everybody's super smart, super amazing, and I'm just there to test their ability to respond. Which is why I like my job.

Jason: You mentioned lessons learned, and I'm curious, as somebody who's been doing chaos engineering for quite a long time, actually, what are some of the top lessons that you would give, or the top advice you would give to our listeners as they start to do chaos engineering?

Omar: I would say. So, you start simple, and that's key. You start simple.
If you really mention chaos to somebody who's not familiar, the first thing they're going to do is they're going to Google ‘chaos engineering,’ and what they're going to find out is Netflix and Chaos Monkey. That's an awesome tool, but do your research, figure out what other people are doing, and get involved in the chaos community world; there's a lot of people doing some cool stuff. Start with a small test so you can see and get the data from there, and scale up. As you learn and as you go, you scale up. And it helps—chaos scares, sometimes—or not, sometimes. For the most part—your senior leadership because you're telling them, “Hey, I'm going to come in and break stuff.” So, doing small-scale tests allows you to prove and provide, hey, this is why it's beneficial. The actual event is not chaos. We call it chaos engineering, but the actual event is very controlled. We know what we're doing, we're watching, we have somebody in place to say stop in case things are going haywire. So, you have to explain that while you're doing. And just do it; it's just like testing, you have to test your applications, and the more testing you do, the better. The closer you shift left the better, too, but you have to test. You got to make sure your apps are working. So, chaos engineering is just another flavor to that. The word chaos usually scares people. So, you just got to slowly do it and show them the value of doing chaos. Hey, you're doing chaos, this is what it brings. Hey, we just proved your database can failover. That's a good thing. And if it didn't fail over, it's like, how can we make it happen? So, that's a small-scale test that provides that feedback and data you need to say, this is why we have to adopt chaos engineering. And as you're going, get—do—go crazy, right? As your leadership allows you to do stuff like, yeah, let's just do it. And work with your teams. Work with the SRE teams, work with the app teams and get feedback. What do you need? What is your biggest problem?
That's one thing I ask my team to do. So, every month, they go to the team and say, “All right, so what's your biggest hurdle? What right now is your—why don't you sleep?” And we go, “Okay, can we replicate ‘the why don't you sleep’ so we can let you sleep?” So, that's an approach that's worked for us. And a whole bunch of our tests are based on that. It's like, okay, “What keeps you up at night?” We'll test it so you can sleep. And then next month, give me the next thing that keeps you up at night. And we go in and we test it.

Jason: And like that iterative approach of let's work on, what's your biggest pain point? What keeps you up at night? And then let's solve that. And then what's the next thing? And keep working down that chain until, hopefully, nothing keeps you up at night.

Omar: Yeah, that'll be good. We all sleep and it's like, “Oh, this thing's on cruise control. Let's go.”

Jason: You mentioned convincing management or the upper levels of management in allowing you to do this. What's that process like at Kessel Run? And then, what's that process look like as Kessel Run convinces the broader Department of Defense to adopt this?

Omar: Oh, that's a fun one. Yeah, so when we first brought it up, we got the, “What are you trying to do?” Look—because it was like, “Hey, we want to do chaos engineering.” It was like, “Okay, yeah, we've heard a little bit about this. What does that mean?” It's like, “I'm just going to break stuff.” Which probably wasn't the smart approach at the moment, but that's what I said. And they're like, “No, wait. What do you mean?” And I'm like, “Yeah, and eventually I want to do it in production.” So, I just went all out. That was my presentation. You know, I've learned from that. It's like, okay, baby steps, Omar. But initially, it was like, “I want to do chaos and I want to get to production.” They were like, “Yeah, sounds good, but I need a plan.” I was like, “Okay. I'll come up with a plan.
And we'll figure it out.” And so that's how we slowly started. And I stood up the team, Bowcaster, and from there we kind of, all right, how do we show the value of chaos engineering? How do we learn chaos and all that stuff? So, it was easy to get them to adopt it. It was the actual execution of tests that was a concern. Because there was a lot of unknowns. We didn't know what we're going to break. We don't know how it's going to react. And how do we actually do this? And we slowly just kind of did those little tests. It was like, all right, we're going to do this, we're going to do that. And that's how we got it. And now that we're moving to the rest of the DOD, that's a really cool adventure because our framework, what Bowcaster has built in Kessel Run, is what they want to move to the rest of the DOD. So, the Chaos Plus model is what's interesting. The fact that we are moving to the rest of the DOD is very cool because it's something I believe should be in the rest of the DOD. And we're happy to experiment. From the Kessel Run perspective, that's what we're here for. We'll experiment and we'll let you know what fails, what doesn't fail, because we're an experimental lab. And, yeah. But the senior leadership in DOD in charge with all the software development and stuff like that, they're all over it. They just want to—hey, how do we make it happen? What do you need? You'll see there's a different mind change now that chaos engineering is more familiar around the DOD and the tech space. “Hey, yeah. This thing called chaos engineering.” It's not just, yeah, Netflix does chaos engineering. It's like, yeah, everybody's doing chaos engineering. So, you see the little mind shift from, initially, when I brought it in. It was like, “Hey, I want to break stuff in production.” And everybody's like, “Whoa, hold up there. There's [no 00:17:05] baby steps here, Omar.” Now, it's like, “Hey, let's go and do it.” Is like, “Yeah, let's do it. How do we execute?
But let's do it.” So, it's a very cool thing to see.

Jason: I'm wondering if maybe that readiness to adopt things like this since you've spent time in the military—I haven't, but from what I understand, it sounds like the military has ideas of really, really doing testing. And in some cases, not production testing. We don't start wars just to train the military, but there is the idea of things like live-fire testing. Do existing practices within the military influence the perception of chaos engineering, and to help people actually understand it better, maybe more so than with standard civilians and corporate enterprise?

Omar: Yes. Testing is very important in our systems. So, it's a different mindset, I would say. So, because in corporate world, it's all about the money, making the system work and make sure it's not going down because you lose profit. Or if you're—that's the mindset on that one. For us, we are in charge of defending the nation, so our system has to be proven and ready to rock within seconds. So, we do a lot of tests, and chaos engineering is just one extra layer to those tests. And now that we are moving to this massive DevSecOps transformation, chaos engineering is key. There's no way we can do this without having chaos engineering involved. So, that's what our senior leadership is pushing. Hey, yeah, this is another flavor of testing. It's important because we're building distributed complex systems across the cloud and whatnot, to support the DOD mission. So, chaos engineering is there. Same thing with the live-fire testing. We got to do live-fire testing to make sure that the ammunition is working, and the guns are working, and everything's working right.
This is just a different flavor of live-fire testing, just on software, and applications, and infrastructure, and the whole deal.

Jason: You mentioned running game days and throwing curveballs, and that sounds like more of a manual game day where you've got people running the attacks and people responding. You've mentioned Kessel Run and really that velocity, and getting faster at things, and automating. Have you started automating the chaos engineering process as well?

Omar: So, we have and we're following the same approach as when we started. So, the baby steps approach. So, we are going to slowly work with the SRE team to automate some of these tests. And that's ongoing. My team's working on it right now, so we're getting there. It's part of our slowly learning and kind of process. The manual, like, game days won't stop. Those will keep going because of the curveballs we want to keep throwing at the teams, but the automation is coming. The idea is to get the chaos engineering as close to the dev cycle as we can, so shift left as much as possible. And that's our next goal. So, we're working on that. And I think a lot of it comes down to where do we do it. So, we work in different environments. It's not just what we call the internet right now. We have different environments, so how do we automate across all environments? And part of it is how we architect that so it works. So, if we make it work on one environment, how does it work on all the environments? So, that's usually where our timelines are. So, trying to make sure that our architecture supports all environments versus having to spend a lot of resources, you know, all right, we're going to engineer one environment, we've got to engineer another environment, we've got to engineer another environment. We want to make sure just to—out of the box, here we go.
But that is part of our goal, and we are starting baby steps, so the database failover test is probably the one we will automate first.

Jason: As you've done chaos engineering, you're doing the game days manually; what was the process like in terms of tools and adoption? I think a lot of people start off and they hear of Chaos Monkey and so they immediately jump over and, “Cool, let me grab Chaos Monkey and see if I can use that.” For any listeners that have tried that, you've probably quickly recognized that that tool, not so great for public consumption, was very much designed for Netflix. So, I'm curious if you could tell me more about your tools adoption, what have you used? What are you using now? What does that evolution look like?

Omar: Yeah, so we actually—the first thing I told my team was you are going to research tools. [laugh]. I know Chaos Monkey is out there, but I'm like, there's definitely more tools that we should look at. I'm sure there's been a whole bunch of tools created, depending on our platform. And that's what they did. So, they went and they researched a whole bunch of tools. And they came back and they presented the tools they wanted to use, or kind of just integrate into our architecture. When the team started, right, so when we started that chaos team, the Bowcaster team was supposed to focus just on chaos engineering, but the more I kept thinking about it, it was like we need to focus on chaos and some other stuff. So, that's where the performance engineering and the fuzzing came into play, and bringing the cyber team into the game. So, from a tool perspective, when you look at us, Bowcaster the team is also the tool. So, they have a tool, Bowcaster is the tool that we deploy across KR to do chaos engineering. Now, within that tool or that framework, there's the tools behind it.
And there's a combination of open-source tools and other tools that we do there, but those just provide the engine for us to perform all of our tests on what we call Chaos Plus. So, Bowcaster is our tool. Yeah, it's the team and the tool is kind of weird, right? But the team and the tool, so when you go into KR and you say, “Hey, I want to do chaos engineering.” It's like, “All right. Go do chaos engineering with the Bowcaster tool that the Bowcaster team built.” But the architecture behind that, there's a lot of tools. And it was that—that was the task I gave the team. It's like, “I need you to research tools. I know, Chaos Monkey is out there. I know Simian Army, I know all these tools that originally come out when you Google.” It's like, if Netflix created it, that's the first thing that comes up. But there has to be more, especially in the Kubernetes world. There's a whole bunch of tools. So, that's what they did, and we took a combination of those tools and we built Bowcaster. And that's what we got.

Jason: That's an excellent point, though, about not just a chaos engineering tool. And I think a lot of times when people think of chaos engineering because it's chaos engineering it sounds like this well-defined practice of, this is it. If you have chaos engineering, you must have chaos engineers, and so it seems siloed when in actuality, it's just one of many practices that SREs and DevOps and all engineers should practice. So, this idea of, we're going to build a tool that has not just the chaos engineering, but all of these other things that you need, and providing that as a service is, I think, a fantastic idea.

Omar: That's always been the charter I've given the teams. Yes, we want to do chaos engineering; chaos engineering is awesome. We all dig it, we preach it, we're huge advocates of it, but what else can we provide? I mean, we're already degrading the system, so what else can we test?
[unintelligible 00:24:25] break the system and blast it with a million users and see what happens. And it's like, “All right, systems degraded; we're blasting it. Let's see if we can hack it.” And maybe while that's degraded and getting blasted, maybe we figure out there's a vulnerability or something. So, that's always been the concept. It's like putting chaos engineering a little bit on steroids, we call it. And that's what Bowcaster does. Bowcaster's job is to build these things and support it. And I'm sure we'll come up with other crazy stuff as we get feedback from team, like, “Hey, it would be cool if you can do this.” And we'll just build it into our framework and it will just be another service that Bowcaster provides aside from performance and chaos engineering.

Jason: Omar, thanks for coming on the show. Fantastic information. It's inspiring to see the journey of where you've come from and where you're headed, especially with the Bowcaster team at Kessel Run. Before we go, though, I wanted to ask, do you have anything that you want to plug or promote, job openings, upcoming speaking? Where can people find you on the internet to learn more about the stuff you've been doing?

Omar: So, Kessel Run, very active, so you can find us at LinkedIn: Kessel Run, or just go to our site, kesselrun.af.mil and you'll find a whole bunch of information there, careers, so if you're interested come work, we're cool people. I promise we do cool stuff. And if you come work for Bowcaster, we'll hire you and you can break stuff with us, which is why we—can't get better than that, right? Yeah, come check us out, kesselrun.af.mil. Lots of information there, careers, you can follow us and yeah.

Jason: Awesome. Thanks again for coming on the show.

Omar: Thanks.

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform.
Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

In this episode, we cover:
Intro and an Anecdote: 00:00:27
Early Days of Chaos Engineering: 00:04:13
Moving to the Cloud and Important Lessons: 00:07:22
Always Learning and Teaching: 00:11:15
Figuring Out Chaos: 00:16:30
Advice: 00:20:24

Links:
Apex: https://www.apexclearing.com
LinkedIn: https://www.linkedin.com/in/mdcsaenz

Transcript

Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode, Ana Medina is joined by Carmen Saenz, a senior DevOps engineer at Apex Clearing Corporation. Carmen shares her thoughts on what cloud-native engineers can learn from our on-prem past, how she learned to do DevOps work, and what reliable IT systems look like in higher education.

Ana: Hey, everyone. We have a new podcast today, we have an amazing guest; we have Carmen Saenz joining us. Carmen, do you want to tell us a little bit about yourself, a quick intro?

Carmen: Sure. I am Carmen Saenz. I live in Chicago, Illinois, born and raised on the south side. I am currently a senior DevOps engineer at Apex and I have been in high-frequency trading for 11 out of 12 years.

Ana: DevOps engineers, those are definitely the type of work that we love diving in on, making sure that we're keeping those systems up-to-date. But that really brings me into one of the questions we love asking about. We know that in technology, we sometimes are fighting fires, making sure our engineers can deploy quickly and keep collaboration around. What is one incident that you've encountered that has marked your career?
What exactly happened that led up to it, and how is it that your team went ahead and discovered the issue?

Carmen: One of the incidents that happened to us was, it was around—close to the beginning of the teens [over 00:01:23] 2008, 2009, and I was working at a high-frequency trading firm in which we had an XML configuration that needed to be deployed to all the machines that are on-prem at the time—this was before cloud—that needed to connect to the exchanges where we can trade. And one of the things that we had to do is that we had to add specific configurations in order for us to keep track of our trade position. One of the things that happened was, certain machines get a certain configuration, other machines get another configuration. That configuration wasn't added for some machines, and so when it was deployed, we realized that they were able to connect to the exchange and they were starting to trade right away. Luckily, someone noticed from an external system that we weren't getting the position updates. So, then we had to bring down all these on-prem machines by sending out a bash script to hit all these specific machines to kill the connection to the exchange. Luckily, it was just the beginning of the day and it wasn't so crazy, so we were able to kill them within that minute timeframe before it went crazy. We realized that one of the big issues that we had was, one, we didn't have a configuration management system in order to check to make sure that the configurations we needed were there. The second thing that we were missing is a second pair of eyes. We need someone to actually look at the configuration, PR it, and then push it. And once it's pushed, then we should have had a third person as we were going through the deployment system to make sure that this was the new change that needed to be in place. So, we didn't have the measures in place in order for us to actually make sure that these configurations were correct.
And it was chaos because you can lose money because you're down when the trading was starting in the day. And it was just a simple mistake of not knowing these machines needed a specific configuration. So, it was kind of intense, those five minutes. [laugh].

Ana: [laugh]. So, amazing that y'all were able to catch it so quickly because the first thing that comes to mind, as you said, before the cloud—on-prem—and it's like, do we need to start making ‘BC,’ like, ‘Before Cloud’ times when we talk about incidents? Because I think we do. When we look at the world that we live in now in a more cloud-native space, you tell someone about this incident, they're going to look at us and say, “What do you mean? I have containers that manage all my config management. Everything's going to roll out.” Or, “I have observability that's going to make us be resilient to this so that we detect it earlier.” So, with something like chaos engineering, if something like this was to happen in an on-prem type of data center, is there something that chaos engineering could have done to help prepare y'all or to avoid a situation like this?

Carmen: Yeah. One of the things that I believe—the chaos engineering, for what it's worth, I didn't actually know what chaos engineering was till 2012, and the specific thing that you mentioned is actually what they were testing. We had a test system, so we had all these on-prem machines and different co-locations in the country. And we would take some of our test systems—not the production because that was money-based but our test systems that were on simulated exchanges—and what we would do to test to make sure our code was up-to-date is we actually had a Chaos Monkey to break the configuration. We actually had a Chaos Monkey and it would just pick a random function to run that day. It would be either send a bad config to a machine or bring down a machine by disconnecting its connection, doing a networking change in the middle to see how we would react.
[unintelligible 00:05:01] with any machine in our simulation. And then we had to see how it was going to react with the changes that were happening, we had to deduce, we had to figure out how to roll it back. And those are the things that we didn't have at the time. In 2012—this was another company I was working for in high-frequency trading—and they implemented chaos engineering in that simulation, specifically for them, we would catch these problems before we hit production. So yeah, that definitely was needed.

Ana: That's super awesome that a failure encountered four years prior to your next company, you ended up realizing, wait, if this company actually follows what they do have of let's roll out a bad deploy; how does our system actually engage with it? That's such an amazing learning experience. Is there anything more recent that you've done in chaos engineering you'd want to share about?

Carmen: Actually, since I've just started at this company a couple of months ago, I haven't—thankfully—run into anything, so a lot of my stories are more like war stories from the BC days. So.

Ana: Do you usually work now, mostly on-prem systems or do you find yourself in hybrid environments or cloud type of environments?

Carmen: Recently, in the last three to four years I spent in cloud-only. I rarely have to encounter on-prem nowadays. But coming from an on-prem world to a cloud world, it was completely different. And I feel with the tools that we have now we have a lot of built-in checks and balances in which even with us trying to manually delete a node in our cluster, we can see our systems auto-heal because cloud engineering tries to attempt to take care of that for us, or with, you know, infrastructure as code, we're able to redeploy at will.
So, with the cloud infrastructure, a lot of what would cause me anxiety and give me more white hairs is slightly less than [unintelligible 00:06:51].

Ana: I love the way of putting it the less amount of white hairs is because of cloud. So, thank you all, cloud providers. As this comes to mind and we think about your background of coming in from on-prem systems, is there anything that you've encountered in this cloud world that you think that's a gotcha? Like, I've had an incident in bare metal that cloud is not really necessarily having a use case or reliability mechanism built-in, just out of the box.

Carmen: It's easy to catch, but it's a gotcha at the same time. So, when you come from on-prem into Cloud, the networking is… not all the same. The words are there from networking, like ‘gateway,’ and ‘firewall,’ click a few buttons, as opposed to you running [Arista 00:07:38] commands versus on a router. [laugh]. And then you have your VPC, which you can say that's your little world and your internal network. The words are there, but they're different in cloud, and that's the gotcha part of that transition. But at the same time, you have an easier way to visualize those things. For example, if my machine can't connect to another machine, are they in the same subnet? I don't have to run Arista commands to figure that out, or look at the logs on the router; it's literally right there in front of me.
So, a lot of that pain that we would have to going—you know, switching from—going to your Linux machine to then getting into the router and then running these different commands, I feel like you needed to learn more commands and more different types of languages of the things that you were using in order to interact with, as opposed to now in the cloud, I feel that those things are more blatantly in front of you to fix. And they were a little bit more abstract in on-prem and that's why you would need someone like a network engineer more as opposed to a DevOps engineer who—I feel like it's easier in that sense. So, once you know it, you're able to solve those problems that you would need a networking engineer for.

Ana: I guess now when we look at the DevOps site reliability engineers have this cloud world or this hybrid world, you end up wearing a lot of hats and you end up having to master, to an extent, various levels of networking, or knowing at least the operator side of how a lot of our infrastructure is running. It does bring me to the next question where you get a chance to come from that BC world—Before Cloud—and we now see that a lot of DevOps SREs that are joining in, they come into the magic of the cloud. What do you think is one of the things that the engineers that are just getting started and are just touching the cloud are not getting a chance to dive into, that the cloud abstraction layer really misses out on this amazing fundamental of the work that we do.

Carmen: I think it goes down to the nitty-gritty. With DevOps, you wear many hats. You're good at everything, but not a master of one thing. You're a little bit everything [unintelligible 00:09:49].
Before cloud, even though the term DevOps didn't exist and you were called, like, an operations developer, operations engineer, you worked closely with the people who wore that one hat and you were working with them. And now that people are coming into the cloud only with no on-prem, they get the layers abstracted on what is the three-way handshake in networking? What is indexes really used for in databases? How do you know that you're not using—doing a linear search because your indices are incorrect in your database, versus doing an algorithmic search with that specific algorithm, that specific query language is using. Those are things that are so abstracted but are still very necessary because you may have to work with an on-prem system that connects to your cloud infrastructure and you may need to use Wireshark. Who still uses that? But you do. There's systems, older mainframe systems that mainly finance uses, or there's still COBOL systems out there. So, I feel that's what's missing from being in cloud. But I hope that education and other programs like Courseras and Lyndas that if people feel like they're lacking in something, they're able to go and learn those fundamentals somewhere.

Ana: I know you love learning. Two things come to mind. Do you have any resources where DevOps and SREs can start learning more? And do you want to share with our listeners a little bit more about your passion and your path in learning?

Carmen: Sure. So, a lot of the things that people look at are common things like Udemy. I feel that Udemy has a lot of great DevOps courses, believe it or not. I have used them to study, I've used them for refreshers. I came in with Amazon cloud experience but no Google Cloud experience, so I basically took a Udemy to get my feet wet, as you would say, to get into that world. Lynda is also good. If you have your student ID email, you can get it for free. So, [laugh] as a student, now, I use that. And then there's just various resources.
A good thing, also, like, finding groups like Techqueria DevOps, as well as Latinas in Tech, and TECHNOLOchicas that if you join groups where you meet other people who are starting in that space or have been in that space for a long time, they have the resources as well. But those are the resources.

Ana: Do you want to share a little bit more about your path on learning about DevOps and what you're up to now?

Carmen: With DevOps, since I'm passionate in learning—because in DevOps, you have to always keep learning—as I was going through my education, and even now being in industry, is that I don't know that many Latinas, especially back in the 2000s. What I noticed when I was in school is also that the majority of my teachers were men, they went to Harvard and MIT, and their great schools, majority of them, were [unintelligible 00:12:35] and I never had a Latina teacher or any of that. And I said to myself that I wanted to be that teacher that I didn't have. So, I started teaching part-time at my alma mater, at Loyola University. And I loved it. And I loved—I taught, like, data structures in [C+ 00:12:55] and Java. I've taught DevOps classes, I've taught bash scripting, I've taught open-source computing, intro to object-oriented programming. And I just loved engaging with students. I've noticed that I knew I was missing something and then I realized it was teaching, being that difference, being that change, being the face that I didn't have. And I figured, what's the best place than my alma mater to start that at? As I was doing this for my fifth year, I realized that if I love it so much, I should do something about it. So, I decided to get a Ph.D. So, 10 years later [laugh], I went back to the school, and I'm currently in my third year at DePaul University in Chicago as a Ph.D. student, working in the American Sign Language Lab Avatar Project, creating an avatar to do not just American Sign Language but other sign languages—yes, there are many different ones.
And that's where I'm currently at now because I would love to teach again somewhere, full time, be it after industry or maybe at the same time I'm still in industry. Who knows what the path is. I love teaching, and I love helping, and I love engaging, and I love technology, so that's why I wanted to go back to school and become a teacher, at one point.
Ana: So, amazing to see your passions and your background come together into that mission of pushing forward the industry and bringing more representation to it.
Carmen: Definitely. Thanks. It's still a [ride 00:14:19]. I don't know what the outcome will be, but [unintelligible 00:14:22] so I just hope that I pass my second exam for my Ph.D. that's coming up in September. [laugh].
Ana: I wish you a lot of luck, and I'm sure our listeners are also rooting for you. Are we going to be seeing Dr. Carmen Saenz that's going to be teaching DevOps, teaching [unintelligible 00:14:39], teaching chaos engineering, or would you stick to something more in ASL?
Carmen: I think a little bit of both. I think that the experience that I bring as a DevOps engineer is that some of these systems don't exist. Some of the stuff that I was doing that I brought to the lab was, they already have the programmers, but they don't have—they're running on, like, a Windows machine in an internal network at school. So, how can we make this widely available? One of my posters that I did my second year was specifically in engineering, and architecting, and infrastructure that can handle creating high visualizations with, you know, a GPU graphics card, but also being scalable and then it has to be [GDP 00:15:22] compliant because we work with European schools.
So, there is DevOps in my Ph.D.; it's just a little hidden. But I'm hoping that I continue to bring that to the table in my lab and the work that I have won't just be on an internal network in three years. 
I hope in three years, you'll be able to—you all—can connect to it to be like, “That's the work her lab did, and we all know that the reason why it's visually seen by everybody was because Carmen was the one that created the infrastructure for it to work, with a group of other engineers.” So, I'm hoping that I can bring my two loves together in that sense, in three years' time.
Ana: I mean, I think you're already doing it. It's super amazing to just even hear when you get those learning stories of someone is an engineer in the industry, has 10-plus years, and then they go and they help assist them that is not tied to our technology space, that is not running on the cloud, that they're running on just one Windows Server and they're hoping to reach, what, 3000 people a month; like, how is that possibly even going to scale for them to be successful? So, I think even you just going into the lab and putting in some of those DevOps principles. And I love also what you mentioned, you know like, how do you make it be a highly scalable system when you're running with so much GPUs, and you have all this different types of compliance in it? Which I think is always interesting when we talk about certain industries, whether it's healthcare or finance, that it's like, yes, we need to have compliancy based on data that we store, but then there's certain regulations and government that might tell us we also might need to have a certain uptime or you might be breaching this type of service-level-agreement.
Carmen: So, a lot of the things I'm used to is more internal in the sense of we need to keep our logs X amount of years, and we need to know who logs into what machines. And so a lot of the compliance is pretty standard across the board for a lot of internal networks. But for something as big as this project for the Ph.D. that I'm doing on, our group has to make sure that [GDR 00:17:27] compliance is very different. 
And what PII, what data do we have and how can we abstract it or break it down enough where then it won't actually go back to a person? And those are the things that I feel that I personally don't really have to deal with right now at work, but I have to deal with them at school.
So, there is a trade-off, something that I'm lacking at my position at work, I actually have to think about, working with different countries, to have this software and the things that they're lacking like now having a scalable uptime system that people can communicate with, that is something that I do here at work; I'm trading off in both places. And compliance is very difficult to [ticket 00:18:10]. That's chaos engineering, too, because you're going to have to hire someone, or third-party company, or yourself have to literally attack your own system and see what you're missing and make sure you're compliant. And I think that's the beauty of—also—chaos engineering, trying to figure that out and making sure [laugh] that you're good, you know?
Ana: For these highly visual systems that you have in your Ph.D. program, what are some of the unknowns that you've had to encounter as you're working on them since they're not very similar to the stuff that you work on your day-to-day?
Carmen: I actually had to backtrack to my on-prem experience to a point. Unfortunately, the code was written in the late '90s and it's still [unintelligible 00:18:54] like that now, some of it. And it uses an executable; it had to be built by my professor, they gave me the EXE, I had to put it on the machine. 
And one of the things is, in Amazon, they have your GPU systems, right, and you could say, “I want this server with this graphics card and AMD and so forth.” One of the things that a lot of people don't recall if they never worked on on-prem systems is that drivers are problematic.
And as I was trying to run this executable, I kept getting this error and I was like, “This has to be a driver issue, but how do we troubleshoot a driver on a cloud system that is pre-built for you?” So, [laugh] I was trying to figure it out. I'm like, “Is it the executable and how it was built on whatever machine? Or is it the machine that's in the cloud? And if it is, how do I update the driver? How do I downgrade the driver?”
And so I had to Google how to downgrade drivers in VMs in the cloud. There's specific commands that you have to run that are AWS only. You don't have the manageability that you had when it's your own on-prem system. Like, you just know, you run a general AMD command or a general package installer for the driver. It's not the case, all the time, for cloud systems.
You have to run a specific AWS command. Luckily, what I found out was my professor, I brought it up to him, and he's like, “Oh, I have this driver. You're using this driver. I need to do some magic on my end to build this executable and it should work on the driver for this VM.” And I was like, “Sure.” But I didn't know how to troubleshoot those things in the cloud. But I knew how to troubleshoot them from back in the day when it was my on-prem system so there—it's weird understanding that drivers are still an issue, you just didn't think so because they're so abstracted nowadays.
Ana: It's always interesting to remember where the abstraction layers push us forward in so many ways, but that they always bring this kind of catch on the other side of it of, “Wait, no, now you actually don't get a chance to just drive over to your data center, switch out a certain type of resource. 
Oh, the cable is starting to look a little hot, maybe we should stretch it out.” We now assume that a lot of these things are being handled for us.
Carmen: Yes.
Ana: Do you have any advice on how do you maintain systems that you don't build, or how is it that you can hand over things better when you're working with systems that are maybe even BC, Before Cloud? I'm going to trademark this. [laugh].
Carmen: You should trademark it because, seriously, that is such a great way to explain it. That was literally what you do when you started a new job, right? They're like, “There's this old system.” I asked if there's any documentation, they usually laugh and chuckle at you. And then [laugh] they give you some notes.txt that some person left for you to look at. “Do you have any infrastructure as code?” They also might slightly chuckle at you and just give you some version that's 15 versions behind, that if you try to [unintelligible 00:21:52], they'll tell you, you're missing 50 other things.
So, you do have to work with what you've got. And the whole point of being a DevOps engineer is that you investigate, investigate. And you shouldn't be afraid to ask questions. And I think that's something I learned as I got older. I was always afraid to ask questions.
And I always felt like people were going to judge the crap out of me because I was asking questions. But how are you going to understand the system that you didn't build, and try to get into the head of the person that did build it in order for you to make it better? And… that's okay to ask those questions. And you should get that notes.txt, and that rando Terraform that only works for a quarter of the things that were built. And see what's missing and try to see if you can devise a plan of attack of how are you going to break this down for yourself.
And then, there may not be any diagrams. 
So, I'm not telling you, use Creately or anything like that to diagram, but it's also good just to have a piece of paper and a pen and just start drawing some of that stuff out. And then, a lot of it also is okay, let's make a test. On Saturday, I'm going to bring this down—chaos engineering—and I'm going to see who yells about it. Who's going to care?
Who's going to care if I break this. And that's how you know who are the stakeholders. Sometimes, that's what you need to do; you need to create a little chaos, to understand what your next steps are in order to get rid of all that technical debt to make your company and your product better. That's how you have to start, and then from there, you'll get more stakeholders that are going to care because you caused a little chaos, in order to bring the system up-to-date, that is not yours, that now is yours. [laugh].
Ana: You actually touched upon something that I was telling someone about two weeks ago where it's that we have this mental model of what our system looks like, they gave us an architecture diagram because, you know, this was only built five years ago, but we now have the thought that all of this is perfect, and until you start unplugging things, you start doing some chaos engineering of what in this architecture diagram is actually correct? What is not? Do I really have a database in my high critical services, or do I not? And then you can kind of really start thinking about understanding your system and build it to be better.
Carmen: And also, one of the things is that just because it says dev, don't assume it's dev. It might be prod. Just because it's called dev doesn't mean that it's dev. That is one of my biggest rules now that I've learned recently in the last three years. Because with cloud, it's a little bit different than on-premise: if that's the name, that's what it's going to stay and most likely it is what it is. 
But here, because we're iterating so quickly, we try to fail fast in order for us to learn from our mistakes and build our product, dev becomes prod. [laugh]. More so now than it did before, you know?
Ana: It brings me to that portion you mentioned also earlier: always ask questions. Always poke holes at it. If someone tells us, “Oh, no, don't worry, nothing is running here on production,” take a deep dive and try to find out what are some of those services, or what are some of those dependencies that could be going on. I know from my time at Uber, it took forever for us to find out which of the 2000 microservices are needed to just take a trip on the cloud. And it was like, “Uh, we don't know. We know they're running on prod, they're running on dev, but what is needed for this service to actually happen?”
Carmen: Exactly. Sometimes just getting on the machine. And if you have root—which you should—if you're a DevOps engineer, usually—look at the history and then look at the directories of who has a home directory. People don't realize the history can give you so much good nuggets about what's going on in the system. And those are the things that help you figure out, like you said at Uber, what's running on here, and who's using it, and what is the systemd daemon telling me? And like… I mean, right?
Ana: And it's funny, you mentioned that of take a look at the history because that was actually one of the things that I've always done, like, reading post-mortems—
Carmen: Yeah.
Ana: Understanding history that's being run on systems, understanding past PRDs to try to get a better understanding. And a lot of it actually is because of that other point that you also touched upon, being afraid to ask questions. Like, similar to you, I've also been one of the only Latinas in the room, and I'm like, “I don't want to raise my hand in this class, or in this meeting. 
I don't want to be the person that has to ask.” But if I have ways of starting to do my own searching so I make a more informed question, that gave me confidence. So, that was one of the things that I was always doing. But now I tell people, “No. Just ask the questions. Don't spend those five hours trying to look at history because the person next to you might actually know the answer in just two minutes.”
Carmen: Yeah, exactly. And I noticed that. Just asking that question was literally like, “Oh, it was because of X, Y, and Z.” “Okay, cool.” And then, now that I know that, at least when I look at the history, I have some background of why this was this way, and now I can just pull out what I really care about in the history, as opposed to saying, “Why is this happening in the [unintelligible 00:26:59] in the first place?”
But it sucks being the first person to ask that question. And especially if it's just, like, you and a bunch of dudes—which usually it was, and at the time I was usually the youngest, too. I was, like, 22. Up until now, obviously; now I'm one of the oldest, but at one point, I was the youngest. And also age was a thing, and being the only Latina in the room, and it—you know, and it's finance; it was scary. [laugh].
Ana: [laugh]. That's the awesome part. We got to have folks like you, like me, organizations like TECHNOLOchicas and Techqueria that allow for us to create spaces that are going to say, “You're welcome here. Ask as many questions as you want. No question is going to be stupid.”
Because we've all had to start somewhere. And maybe you do get a chance to have Carmen as your teacher and get to pick their brain on what DevOps is. For that, I think that's all the questions that I had. Do you have anything else you want to share with our listeners that you have upcoming for you? Or just any words of advice?
Carmen: My advice is just keep trucking along. 
There's many Carmens, there's many Anas, there's many Jasons out there that are willing to help. And there's spaces now where we can ask those deep questions, like you said, like Techqueria, Latinas in Tech, TECHNOLOchicas, [unintelligible 00:28:17] Girls Can Code, Girls Who Code. There's so many places now where you can really dig in and find the community to uplift you and keep pushing you forward in your technology inquiries and your technology career path. So, stick with it, keep going.
Ana: I love it. What are some ways that folks can get in touch with you?
Carmen: You can go to my LinkedIn, and the LinkedIn will be the slash name, slash M-D-C-S-A-E-N-Z. Or you can find me under Carmen Saenz at Techqueria Slack, or in Latinas in Tech Slack.
Ana: Awesome. Thank you so much, Carmen.
Carmen: Thank you so much for having me. I had such a great time with both of you.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called, “Battle of Pogs” by Komiku, and it's available on loyaltyfreakmusic.com.

Welcome back to another edition of “Build Things on Purpose.” This time Jason is joined by Zack Butcher, a founding engineer at Tetrate. They also break down Istio's ins and outs and the lessons learned there, the role of open source projects and their reception, and more. Tune in to this episode and others for all things chaos engineering!
In this episode, we cover: Istio's History: (1:00) Lessons from Istio: (6:55) Implementing Istio: (11:26)
Links: Tetrate: http://tetrate.io Istio: https://istio.io Twitter: https://twitter.com/zackbutcher
Episode Transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-zack-butcher-founding-engineer-at-tetrate/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Paul's Twitter: https://twitter.com/paulmarsicloud
Episode highlights: Accidental SRE (1:41) Migrating to AWS (4:18) Prod vs Non-prod (8:43) Mentoring and advice (12:57) Failure is normal (19:41)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-paul-marsicovetere-senior-cloud-infrastructure-engineer-at-formidable/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Taylor's Twitter: https://twitter.com/onlydole
Episode highlights: It's always DNS (2:28) Focus on learning (9:29) Chaos Engineering and improvement (11:25) Tips for learning (16:33) More Chaos Engineering (21:53)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-taylor-dolezal-senior-developer-advocate-at-hashicorp

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Episode guests:
Brian Holt (@holtbt): https://www.gremlin.com/blog/podcast-break-things-on-purpose-brian-holt-principal-program-manager-at-microsoft
Jérôme Petazzoni (@jpetazzo): https://www.gremlin.com/blog/podcast-break-things-on-purpose-jerome-petazzoni-tinkerer-and-container-technology-educator
J Paul Reed (@jpaulreed): https://www.gremlin.com/blog/blog/podcast-break-things-on-purpose-j-paul-reed-sr-applied-resilience-engineer-at-netflix
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-the-hill-youll-die-on/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Taylor's Twitter: https://twitter.com/onlydole
For more information about Terraform, see https://terraform.io
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-taylor-dolezal-terraform-special
Listen to our episode with Armon Dadgar, CTO and co-founder of Hashicorp: https://www.gremlin.com/blog/podcast-break-things-on-purpose-armon-dadgar-cto-and-co-founder-of-hashicorp/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Jose's Twitter: https://twitter.com/junr03
Episode highlights: Balancing consistency and complexity (1:22) Extending consistency to mobile clients (3:07) Know what you're trying to solve (9:50)
Episode transcript: https://gremlin.com/blog/podcast-break-things-on-purpose-jose-nino-staff-software-engineer-at-lyft

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Brian's Twitter: https://twitter.com/holtbt
Episode highlights: The importance of reliability in dev tooling (1:57) Chaos at Reddit (4:03) Frontend perspectives on Chaos Engineering (15:09)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-brian-holt-principal-program-manager-at-microsoft

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Armon's Twitter: https://twitter.com/armon
Episode highlights: A tool to unify devs, ops, and release engineers (1:08) Lowering the friction of security (8:11)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-armon-dadgar-cto-and-co-founder-of-hashicorp

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Jérôme's Twitter: https://twitter.com/jpetazzo
Episode highlights: Distributed databases at dotCloud & avoiding a major outage (2:18) Multilayered Kubernetes lasagna (16:06) Empowering others & what's important (24:22)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-jerome-petazzoni-tinkerer-and-container-technology-educator

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
J Paul's Twitter: https://twitter.com/jpaulreed
Episode highlights: What is an Applied Resilience Engineer? (1:12) Facilitating emergent discussions in a remote world (11:14) Incentives and discretionary space (17:40) Shifting from Newtonian to Quantum thinking (24:59)
Episode transcript: https://www.gremlin.com/blog/blog/podcast-break-things-on-purpose-j-paul-reed-sr-applied-resilience-engineer-at-netflix

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Veronica's Twitter: https://twitter.com/maria_fibonacci
Episode highlights: When marketing goes too well (2:34) Introduction to Go (5:28) Using Elixir for fault tolerance (18:50)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-veronica-lopez-senior-software-engineer-at-digital-ocean/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Steve's Twitter: https://twitter.com/spf13
Episode highlights: Origins of Hugo (2:05) Tips for learning Go (4:44) Open source and contributors (9:05)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-steve-francia-product-and-strategy-lead-at-google/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Miko's Twitter: https://twitter.com/mikopawlikowski
Topics include: Why Chaos Engineering? (1:29) Miko's Book (6:55) Chaos Engineering for Frontends (10:21) eBPF (12:10) SLOs (16:28) What Miko is currently excited about (21:56)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-mikolaj-pawlikowski-engineering-lead-at-bloomberg

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Alex's Twitter: https://twitter.com/ahidalgosre
Topics include: Alex's adventure into the absurd (3:00) Google's pager list mishaps (9:37) Crashing NYU's Exchange Server and Hyrum's Law (14:19) Bartending makes you better (19:16) Nobl9 (22:37) What Alex is currently excited about (30:07)
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-alex-hidalgo-director-of-reliability-at-nobl9

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Ryan’s Twitter: https://twitter.com/this_hits_home
Ryan Kitchens - SREcon19 - How Did Things Go Right?
Ryan Kitchens - ReDeploy 2019 - The Meat of It
Ryan Kitchens - Learning From Incidents
See also: Break Things On Purpose Episode 8 - Haley Tucker
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-11-ryan-kitchens-senior-site-reliability-engineer-at-netflix/

Podcast Twitter: https://twitter.com/BTOPPod
Podcast email: podcast@gremlin.com
Kelsey’s Twitter: https://twitter.com/kelseyhightower
Kelsey doing his Tetris demo at PuppetConf
Nigel Kersten’s Twitter
Jepsen
Kelsey’s KubeCon Keynote
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-10-kelsey-hightower-principal-developer-advocate-at-google/

This episode we speak with Kelsey Hightower. Kelsey is a Principal Developer Advocate at Google. Topics include: Promise Theory, is Kubernetes hard, running databases on Kubernetes, the meat cloud, empathy sessions, how Kubernetes has helped standardize Ops practices, learning from failure at scale at Google, and the importance of the Inclusion part of D&I.

Links: Kolton's Twitter Netflix blog on Hystrix Netflix blog on LDFI Netflix blog on FIT Jesse Robbins's Twitter
Episode transcript: https://docs.google.com/document/d/12fUhzpbCfwUJQQi5PDPuE32KraWwDFnpd6k19eesEUQ/edit?usp=sharing
Our music is by Komiku. For more of Komiku’s music visit loyaltyfreakmusic.com.

This episode we speak with Kolton Andrus, the CEO and co-founder of Gremlin. Topics include: The role of a Call Leader in incidents, using Chaos Engineering as runtime validation, FIT and application level fault injection, Jesse Robbins and early experiments at Amazon, oncall training, Lineage Driven Fault Injection (LDFI), the value of looking at real traffic instead of synthetic transactions, and the challenges people face when starting to do Chaos Engineering.

Links: Haley’s Twitter Monocle blog post Netflix blog post on CHAP We Are Netflix podcast John Allspaw’s Twitter
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-8-haley-tucker-resilience-engineering-at-netflix/
Our music is by Komiku. For more of Komiku’s music visit loyaltyfreakmusic.com.

This episode we speak with Haley Tucker. Haley is a Senior Software Engineer on the Resilience Engineering team at Netflix. Topics include: Running Chaos Engineering experiments as A/B tests, testing dependencies, fallbacks, testing in production, and why Chaos Monkey is less interesting at Netflix now.

Links: Matthew’s LinkedIn Matthew’s talk from SREcon Americas 2017 The asteroid that may collide with Earth in 2182
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-7-matthew-simons-senior-product-development-manager-at-workiva/
Our music is by Komiku. For more of Komiku’s music visit loyaltyfreakmusic.com.

This episode we speak with Matthew Simons. Matthew is a Senior Product Development Manager at Workiva and he leads the Quality Assessment team there. Topics include: Supporting and encouraging reliability at Workiva, why Workiva moved from App Engine to EKS, how to tighten the customer feedback loop, how Chaos Engineering can help folks who are oncall, and fatal optimism and the asteroid that may hit the Earth in the year 2182 (it’s real y’all).

Links: Subbu’s Twitter How Complex Systems Fail by Dr. Richard Cook (PDF link) Drift Into Failure by Sidney Dekker Werner Vogels on compartmentalization at AWS “Lorin from Netflix” is Lorin Hochstein and he is a great choice to follow on Twitter
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-6-subbu-allamaraju-senior-technologist-at-expedia/
Our music is by Komiku. For more of Komiku’s music visit loyaltyfreakmusic.com.

This episode we speak with Subbu Allamaraju. Subbu is a Senior Technologist at the Expedia Group. Topics include: Learning from incidents, changing culture, How Complex Systems Fail, drifting into failure, forming a hypothesis, showing value from your reliability work, and the importance of understanding how your business makes money.

Links: Adrian’s Twitter Adrian’s blog post Chaos Engineering - Part 1 Adrian’s talk Patterns for Building Resilient Software Systems Adrian’s blog post The Quest for Availability
Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-5-adrian-hornsby-senior-technical-evangelist-at-amazon-web-services/
Our music is by Komiku. For more of Komiku’s music visit loyaltyfreakmusic.com.

This episode we speak with Adrian Hornsby, a Senior Tech Evangelist at Amazon Web Services. Topics include: Curiosity and breaking things, the cost of downtime, Jesse Robbins and early failure injection at Amazon, making the case to management for Chaos Engineering, forming a hypothesis, and random experiments vs Game Days.
