Monitoring platform for cloud applications
Datadog's Q3 earnings report was well-received by the market, with the stock popping as much as 9% after the release. Considering the macro backdrop and their outperformance during 2021, I thought the results were favorable. Datadog is maintaining its rapid cadence of launching new products and cross-selling them into existing customers, supporting its elevated DBNRR. The growth rates of customers with multiple product subscriptions have been reliably intact, implying little competitive infringement either from incumbent providers or new start-ups.
Andrew Davidson (SVP Products, @MongoDB) talks about MongoDB's evolution from a software company to a cloud services company (MongoDB Atlas), how developers traditionally interacted with databases, and the need for Developer Data Platforms going forward.

SHOW: 671

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW SPONSORS:
Datadog Security Solution: Modern Monitoring and Security
Start investigating security threats before it affects your customers with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
CDN77 - CDN Focused on VOD and Security
CDN77 - ask for a free trial with no duration or traffic limits.
CloudZero - Cloud Cost Intelligence for Engineering Teams

SHOW NOTES:
Press Release: MongoDB Unveils Vision for a Developer Data Platform at MongoDB World 2022
Forbes: MongoDB Extends Developer Data Platform For Modern Applications
TechCrunch: MongoDB puts a spotlight on its developer data platform

Topic 1 - Welcome to the show. Let's talk about your background, and what you focus on at MongoDB.
Topic 2 - Data is a weird beast. It's cheap to create, it's expensive to move, and it's complicated to use because there's so many ways to interact with it depending on the use-case. So for someone that thinks about data a lot, how do you frame up the challenges of how applications interact with data?
Topic 3 - People tend to think about MongoDB as a database company, and then a Cloud database company. What did the company learn as it moved to the cloud, as a lot of barriers for developers got knocked down in that transition?
Topic 4 - As a developer today, do I still need to think about the relationship between the underlying data and the database access model needed to make that useful to an application, or are any of those lines blurring or going away?
Topic 5 - Databases have traditionally followed the CAP theorem, and different choices have different strengths and tradeoffs. As you start to think about this concept of developer data platform, how do you try and reframe those tradeoffs? Do any of them go away?
Topic 6 - What are some examples of how companies and their developers are able to think differently about how their new applications can be built with this new platform approach to data?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
Kayla Taylor (Sr. Product Manager @datadoghq) talks about taking control of cloud costs and managing costs along with observability and troubleshooting.

SHOW: 669

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW SPONSORS:
Datadog Monitoring: Modern Monitoring and Analytics
Start monitoring your infrastructure, applications, logs and security in one place with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
CDN77 - CDN Focused on VOD and Security
CDN77 - ask for a free trial with no duration or traffic limits.

SHOW NOTES:
Gain visibility and control of your cloud spend with Datadog Cloud Cost Management
Datadog introduces Cloud Cost Management

Topic 1 - Welcome to the show. Let's talk about your background, and what you focus on at Datadog.
Topic 2 - Cloud cost management has become a very active focus area since the economy has been slowing down. How are you seeing customers, or engineering teams thinking about cost differently than before?
Topic 3 - Since Datadog provides so much visibility into active applications (APM, Observability, Troubleshooting), do you take a similar approach to Cost Management?
Topic 4 - Accountants understand costs, but not technology. Engineers understand technology, but do you find that they understand costs? How do you help them conceptualize costs and be able to make changes that impact their applications?
Topic 5 - What are some of the unique things that Datadog does with the new Cloud Cost Management offering? How does it tie into the other aspects of Datadog?
Topic 6 - The service is new, but what are some of the surprising things you've observed about how companies are using this insight into their cloud costs?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
This episode is sponsored by Fiberplane. Your platform for collaborative debugging notebooks!

Episode Resources:
Try Fiberplane here
Fiberplane website
Fiberplane Docs
NP-hard Ventures

About Micha Hernandez van Leuffen
Micha Hernandez van Leuffen is the founder and CEO of Fiberplane. He previously founded Wercker, a container-native CI/CD platform that was acquired by Oracle. Micha has dedicated his career to improving the workflows of developers.

Read the whole episode (Transcript)

[If you want, you can help make the transcript better, and improve the podcast's accessibility via Github. I'm happy to lend a hand to help you get started with pull requests, and open source work.]

[00:00:00] Michaela: Hello and welcome to the Software Engineering Unlocked Podcast. I'm your host, Dr. McKayla, and today I have the pleasure to talk to Micha Hernandez van Leuffen. He is the founder and CEO of Fiberplane. He previously was the founder of Wercker, a container-native CI/CD platform that was acquired by Oracle. Micha has dedicated his career to improving the workflow of developers, so he and I have a lot to talk about today. I'm really, really happy that he's here today and he's also sponsoring today's episode. Welcome to the show. I'm happy that you're here, Micha.

[00:00:36] Micha: Thank you for having me. Excited to be on the show.

[00:00:38] Michaela: Yeah, I'm really, really excited. So, Micha, I wanted to start really from the beginning. You are the CEO of Fiberplane and you are the founder of Wercker, which you already sold. So, can you tell me a little bit about how you actually started this entrepreneurial journey of yours and what brought you to the developer experience area?

[00:01:03] Micha: Yeah, sure thing. So I have a background in computer science. I'm originally from Amsterdam, but I did my thesis at USF. And the topic was autonomous resource provisioning using software containers.
This was all before Docker was a thing, you know, the container format that we now know and love. And I got excited by that field of containers and decided to start a company around it. That company was Wercker, a container-native CI/CD platform. So we helped developers build, test, and deploy their applications to the cloud. We went through various iterations of the platform. We started off with LXC as a container format and then eventually ended up replatforming on Docker and Kubernetes. But, you know, it was quite a journey. So that company eventually got acquired by Oracle to bolster their cloud native strategy. And then I spent a couple of years in the Bay Area as a VP of software development focusing on their cloud native efforts. Tried to do a little bit of open source there as well, and then moved back to Europe and started thinking about what's next. Did some angel investing. We're still doing some angel investing, actually, in the same arena: developer tools, infrastructure building blocks for tomorrow. So I run a small pre-seed and seed fund with two other friends of mine. But then I also started thinking about what to build next. And we can get into that, but from our experience running Wercker, this large distributed system, Fiberplane was born.

[00:02:26] Michaela: Cool. Yeah. And so how was the acquisition for you? You said you were studying at the university, but then did you start Wercker right out of university, or maybe already while you were studying?

[00:02:40] Micha: Yeah, more or less just out of university. So it was around 2012, 2013. And then we expanded the team. We got an office in San Francisco and London. And then in 2017 we got acquired by Oracle.
[00:02:56] Michaela: Oh, very cool. Wow. So you were the founder of that and also probably CTO, CEO. At the beginning, were you a one-person shop? Or was it, I have this idea, I get some funding, and I already have a team when I'm starting out? Or was it more the bootstrapped way? How was that?

[00:03:16] Micha: Yeah. In both cases, both Fiberplane and Wercker, we got some funding early on. Then eventually we got a CTO. For Wercker, it was one of the co-founders of OpenStack. So also very early in that container and cloud infrastructure journey. And then for Fiberplane, there's no CTO. I'm both CEO and CTO, I guess, at the same time.

[00:03:38] Michaela: Yeah. Cool, cool. Can you tell me a little bit about Fiberplane? What is Fiberplane? What does it have to do with containers and with developer experience? What kind of a product is it?

[00:03:51] Micha: Yeah, sure thing. So, coming back to the Wercker days, right? We were running this distributed CI/CD system, so we were also running users' arbitrary code. Any sort of job could happen on the platform on top of Kubernetes, inside of containers. So one of the things that stuck with me was that it was always very hard to debug the system, to figure out what's really going on when we had some kind of issue. We were going back and forth between metrics, logs, traces, trying to figure out what is the root cause of an issue. So that was one thing. We were thinking a lot about, surely there must be a better way to help you on this journey.
The other thing that I started thinking about a lot was to challenge the assumption of the dashboard. If you think about it, a lot of the monitoring and observability tools are modeled after the dashboard, a sort of cockpit-like view of your infrastructure. But I'd say that those are great for the known knowns. A dashboard is great: you set it up in advance, you know exactly what's gonna go wrong. These are the things to monitor, these are the things to keep tabs on. But then reality hits and the thing that you're looking at on the dashboard is not necessarily the thing that's going wrong. Right? So I started thinking a lot about what is a better form factor to support that more investigative, explorative debugging of your infrastructure. And that's not to say that dashboards don't have their place, right? That cockpit view of your infrastructure is a good thing to have. But for debugging, you might want a more explorative form factor that also gives you actionable intelligence. I think the other thing that you see a lot with dashboards is that everybody's monitoring everything, and now you get a lot of signal and a lot of inputs, but not necessarily the actionable intelligence to figure out what's going on. So that's the other piece. And then the third thing that stuck with me, I would say, is collaboration. We've come to enjoy tools like Notion, Google Docs obviously. In the design space we've got Figma, where collaboration is built in from the get-go, and I found it kind of odd how in developer tools, and specifically DevOps, we don't really have collaboration built in. Right. If you think about it, the status quo of you and I debugging an issue is we get on a call.
You share your screen, you open some dashboard, and we start talking over it or something. Right. And I guess COVID accelerated this thinking a bit, with everybody going remote: how can you make that experience more collaborative?

[00:06:22] Michaela: Mm-hmm. So it's in the incident space, it's in the monitoring space, and you want to bring more collaboration. So how does it work? What's your solution now?

[00:06:32] Micha: Yeah, exactly. Now I've explained the inception. But what is it, right? So it's a notebook form factor, very much inspired by data science, right? Like Jupyter Notebooks. Think of that form factor. We don't use Jupyter or anything like that. We've written everything from scratch. But it's a notebook form factor with collaboration built in. So you can @-mention people like you would on Slack. You can leave comments or discussions and all that. But where it gets interesting: we've got these things called providers, which are effectively plugins. They're WebAssembly bundles, and we can dive into that as well. But they're providers that connect to your infrastructure, right? So we have, for instance, a provider for Elasticsearch for your logs. We have a provider for Prometheus for your metrics. And it allows you to connect to these observability systems and pull them together into one form factor, the notebook, and then start collaborating around that. So, imagine if Notion and Datadog would have a baby. That's kind of what you get.

[00:07:41] Michaela: That's cool. So I can imagine that. Let's say I'm on call, and hopefully I'm not alone on call. You are also on call, right?
Yeah, and so we would open a Fiberplane notebook.

[00:07:52] Micha: Hopefully we're in the same time zone and we don't need to wake up in the middle of the night. Yeah.

[00:07:57] Michaela: Hopefully. Yes. And then we want to understand how the system is behaving. And so we are pulling in observers. These are data sources, more or less, right? And then we can do some transformation with those data sources, or...

[00:08:12] Micha: Yeah, exactly. That might be the case. The other thing that we integrate with is, for instance, PagerDuty. So an alert goes off while we are on call, and we have this PagerDuty integration. And subsequently a notebook is created for us already, maybe even with some charts and logs that are related to the service that might be down. So depending on the alert, obviously, you're as good as how you've instrumented your alerts. But say we've written some good alerts: we now have a notebook ready to go, based off a template. So that's another thing that we have as well, this template mechanism. And now we're ready to get into things and start debugging. So we might have a checklist: you look at the metrics, I'll look at the logs, this sort of action plan. We pull in that data, we start a discussion around it. Hopefully we come to the root cause of our issue.

[00:09:11] Michaela: Okay. And so this discussion and this pulling in of data all happens in the notebook. Can you explain a little bit more to me, and also to our listeners: when we are on this call now, with a Fiberplane notebook in front of us, what do we see? How does the tool look?

[00:09:28] Micha: It's very similar to, I would say, a Notion page or a Google Doc page.
So we've got different headings. You might have a title for a notebook, right? You know, "the billing API is down." The other thing that we have is this time range. Usually when there's an issue, we've seen this behavior over, say, the last three hours, so we can have that time range locked into place. So we only want to see our data for the last three hours. And that means that any chart that we plot or any log that we pull in will adhere to that global timeframe. So that's what we see. We have support for labels. Obviously we're big fans of Kubernetes and Prometheus, so labels are a first-class primitive on the platform. So you're able to populate the notebook with the labels that might be related to our service. Right? So it's us-east-1, which is our region. Service is billing, say. Environment is production. And the status of our incident is ongoing, stuff like that. So we've got... go ahead.

[00:10:34] Michaela: Cool. Yeah. And so is it then from top to bottom that we are writing and investigating, writing down the questions that we have and the investigation?

[00:10:44] Micha: Yeah, exactly.

[00:10:45] Michaela: So we might have... is it...

[00:10:48] Micha: Yeah, exactly. And I think in the most ideal scenario, you're kind of writing your postmortem as you go along.

[00:10:59] Michaela: I was thinking exactly that. Right. And then maybe next time I'm on call again and I get a PagerDuty alert and something is down, and it's billing again. Can I search in the Fiberplane notebooks to find what we did last time and then...

[00:11:13] Micha: Exactly. So you'll search, jump to the conclusion. Yeah.
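
As a rough illustration of the labels and search just described, here is a minimal sketch of filtering past notebooks by label and time window. It is not Fiberplane's data model or API: `NotebookRecord`, `search_notebooks`, the sample titles, and the dates are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class NotebookRecord:
    """Hypothetical record of a past incident notebook (not a real Fiberplane type)."""
    title: str
    labels: dict[str, str]
    started_at: datetime


def search_notebooks(records, required_labels, since):
    """Return records that carry every required label and started at or after `since`."""
    return [
        r for r in records
        if r.started_at >= since
        and all(r.labels.get(k) == v for k, v in required_labels.items())
    ]


# Arbitrary reference time and made-up history, for illustration only.
now = datetime(2022, 11, 1, 12, 0)
history = [
    NotebookRecord("Billing API is down",
                   {"service": "billing", "environment": "production"},
                   now - timedelta(days=7)),
    NotebookRecord("Checkout latency",
                   {"service": "checkout", "environment": "production"},
                   now - timedelta(days=3)),
    NotebookRecord("Billing staging failure",
                   {"service": "billing", "environment": "staging"},
                   now - timedelta(days=40)),
]

# Find billing incidents from the last 30 days.
matches = search_notebooks(history, {"service": "billing"}, since=now - timedelta(days=30))
print([r.title for r in matches])  # ['Billing API is down']
```

Treating labels as plain key-value pairs keeps the filter the same whether the label is a region, a service name, or an incident status.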
Yeah, exactly. Hopefully, if you experience the same issue multiple times, at some point we'll do a little commit on GitHub and we fix our... yeah. But indeed, you can search on the notebooks and see if you've run into similar issues. So it's great for building up this system of record, right, this knowledge base of infrastructure issues and incidents. And it's also great for onboarding, right? If a new person joins: this is our process, these are some examples that we've run into, have a look. And now you've got a sense for how we handle things and some of the issues that we've investigated. One more thing on the product. So I've explained the notebook form factor. We've got these providers that pull in data from different data sources like Elasticsearch or Prometheus. The other thing that we have is a command line interface, which is called fp. And apart from being able to create notebooks from your terminal, and even invite people from the terminal, all the usual stuff that you would expect from interacting with an API for a product like this, there are two other things that we do. So one is a command called fp run. It allows you, if you are typing a command like kubectl logs for a specific pod, to pipe the output of that command to a notebook. And why that is useful is, of course, when we're debugging this issue, you and I, and you start typing things in your terminal, I have no idea what you just did. And this is a way to capture that. So you're piping these outputs from the stuff that you're typing in your terminal into the notebook.
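
The idea of capturing what a teammate ran, together with what it printed, can be sketched in a few lines. This is only an illustration under stated assumptions: it does not call the fp CLI or any Fiberplane API, and `Notebook`, `Cell`, and `run_and_capture` are hypothetical names for an in-memory stand-in.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Cell:
    """One captured command: what was run, when, and what it printed."""
    command: list[str]
    output: str
    captured_at: datetime


@dataclass
class Notebook:
    """Hypothetical in-memory stand-in for a shared notebook."""
    title: str
    cells: list[Cell] = field(default_factory=list)

    def run_and_capture(self, command: list[str]) -> Cell:
        # Run the command and collect stdout and stderr as text.
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        cell = Cell(
            command=command,
            output=result.stdout + result.stderr,
            captured_at=datetime.now(timezone.utc),
        )
        # Append to the shared record so collaborators can see what was run.
        self.cells.append(cell)
        return cell


nb = Notebook("Billing API incident")
# A portable placeholder command; in practice this could be any shell command.
cell = nb.run_and_capture([sys.executable, "-c", "print('pod started')"])
print(cell.output.strip())  # pod started
```

Keeping the command next to its output is what lets a second person reconstruct the investigation later, instead of seeing only the results.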
And the cool thing is, on your laptop you just see text, right? Monospace output. But for certain outputs, such as the kubectl logs command, we actually know the structure of the data, and we're capable of formatting that in the notebook in a structured manner, so that you can start filtering on the logs, select certain columns, and even highlight certain log lines for posterity, so that you can say, hey, these are the culprits, these are the things that you need to take into consideration next. So we have this command line interface companion, and the other thing that it does is the same use case as I just mentioned, but as a long-running recording. You actually record your entire shell session as you're debugging this thing, and all the output gets piped into the notebook.

[00:13:46] Michaela: Cool. And so I have two questions for Fiberplane. One: is the software engineer the right person to interact with Fiberplane, or is it the site reliability engineer that the tool is really designed for?

[00:14:03] Micha: Yeah, that's an excellent question. So, site reliability engineers you kind of see in larger organizations, right, where you start splitting up your teams. I will say, I think at the end of the day, if you're an engineer, you've built the service, now you need to maintain it, now you need to operate it. It's your baby, right? You probably know better than anybody else how that system behaves. So indeed I would say that the target group is developers.

[00:14:40] Michaela: And so the other question that I had around Fiberplane is: when we are on this call and we are writing in this notebook, what does the whole scenario look like?
Are we still on a call, like, do we have Zoom or Google Meet open, or are we really just writing in the Fiberplane document? Or are we sitting next to each other? Is there a traditional scenario, or is this all possible with Fiberplane? How would you recommend using it?

[00:15:09] Micha: Yeah, that's a great question. Right. I think back in the day, it would be that we maybe sit in the same office and I scoot over and we start looking at a screen, right? And start typing together. The reality is, of course, we're all doing remote work now, and we might not be in the same room. So I do think people will still use a Zoom call or a Google Meet as a companion to talk over stuff. I think people will still communicate in Slack and start chatting back and forth. But what we hope to do away with through Fiberplane is the pasting of screenshots, right? Where you take a screenshot of some kind of chart in your dashboard and you put it in Slack and somebody yells, "Oh, that's not the thing that you should be looking at." All that sort of Slack glue, it's our goal to do away with that.

[00:15:59] Michaela: Yeah. And the Slack glue is also very problematic for search. At least I'm never able to find it again. Right. It's like it's in the dark area.

[00:16:07] Micha: Yeah. Super ephemeral. You can't go back in time easily. And, you know, how did we solve this last time? So again, building up that system of record, I think.

[00:16:17] Michaela: Yeah. Very cool. And so how long have you been working on Fiberplane now?

[00:16:23] Micha: So we've been working on it for about two years now. Which is a long time.
I think one of the things that we've discovered along the way is that we're kind of building two startups at the same time, right? We're doing a Notion-like, collaborative rich text editing experience, which is kind of a startup on its own. And we're building this infrastructure product. So it's taken quite some time and energy to get the product to where it is now.

[00:16:54] Michaela: Yeah. And do you already have users? Can people that listen today hop on Fiberplane already?

[00:17:02] Micha: It's been in private beta, but I think by the time this gets aired it will be in public beta and people can sign up and take it for a spin. And we would love to get feedback on our roadmap, right? People can suggest what other types of providers we need to support, what other types of integrations. We would love to have that conversation.

[00:17:23] Michaela: Cool. So there is the provider side. Is there something else that you want feedback on, that you are exploring maybe?

[00:17:30] Micha: Yeah. So we've got the providers, that's one thing. We've got our templating stack. So I'm curious to see how people start codifying their knowledge, right? What kind of processes people have to debug their infrastructure, run their incidents, or write their postmortems. So I'm curious to see what people come up with there. Other types of integrations, right? We have, as I said, PagerDuty. What other types of alert-to-notebook flows or external systems do we need to plug in with? I would love to get some feedback on that as well.
[00:18:04] Michaela: Yeah, I think I had Paige Bailey on the podcast. She's from GitHub, and they were releasing something with Copilot and, you know, for data scientists, some spaces here. And she also said, well, we really need input from the users, right? So try it out, tell us how it's working. I think it's so valuable, right? You have your vision, and obviously it's going one way, but then if you have your users, sometimes they take your product and they use it in a very different way than you anticipated, which can be very informative. Right. You have done two startups already. Have you seen that? And how do you react to it? Do you instrument the data a little bit? How do you realize that people are using your product in a different way?

[00:18:50] Micha: Yeah. So obviously we have metrics and analytics on usage patterns of the product. And I think that data is excellent, right? But qualitative data, especially at this stage, is probably even better, where you can get somebody on a call: tell us about your use case, tell us about the problem that you're trying to solve here, how can we be helpful, what types of integrations should we support? I think the difference between Wercker and Fiberplane, I would say, is that Wercker was a pretty confined piece of surface area, right? CI/CD: you either have a green check mark next to your build or a red check mark next to your build. It either failed or passed. And we need to do that fast for you, get that result quick. And with Fiberplane, it's a more...
I think the interesting thing here is that it's a more explorative and rich design space, right? It's this notebook, in which you can already start typing text and adding images and headings and checklists and whatnot, right? It's a very open form factor and design space. And then of course, with the integrations, it can be even richer. So I'm very curious, to your point, what direction people will pull the product into. Because you can take it into all sorts of use cases and scenarios.

[00:20:05] Michaela: Yeah, exactly. And I think as a founder, or as the design team or product team, it's also a little bit of a balancing act, right? How far are we going with what the users are doing with our product, and where are we setting some boundaries so that they can't do everything? There's also often the talk about opinionated products, right? That you can do one thing and one thing only, and we have an opinion on how you're supposed to use our product. And if we see people deviate from that, we try to put an end to it. And then there's the other way, where you say, well, if you take Fiberplane and you do X with it and we haven't thought about this, maybe we are okay with it. Or maybe we even support that path, right?

[00:20:48] Micha: Yeah, I think we're indeed more on the latter, right? We talk about this a lot internally: everything is a building block. We've got the notebook, you've got these different cell types, you've got providers, you've got templates. You've got the command line interface. So for us, everything is a building block, and we actually want to retain that flexibility. Not be too prescriptive.
Because if you think about incident debugging or investigating your infrastructure, you might have a certain process, I might have a completely different process, and we need to be able to facilitate these different workflows. So thus far our thinking around the product has been: everything is a building block. And it should be this flexible form factor that people can pull into different scenarios and use.

[00:21:36] Michaela: I mean, we have infrastructure as code, right? And we have security as code. Maybe we have debugging as code. Maybe this is what's coming next. Can you envision that it's going in this direction? Because while we have building blocks, maybe right now it's not a programming language for debugging, but it could go a little bit in that direction, right? No-code coding for debugging.

[00:22:02] Micha: Yeah, we've actually had some of that discussion internally as well. If you think about the templates, right, to some extent that is... we use Jsonnet as a language, but we've codified them in a certain way, and you could argue that the templates are a sort of programming language for, at least, that debugging process, right? And you can take that even further and make it kind of statically typed and make it adhere to certain rules and maybe even have control flow. So I think there's a piece there. And then obviously we have some YAML configuration on how you set up your providers, right? Like how to connect to your infrastructure. So there's some observability as code in that realm. Yeah. I think that'll be an interesting part of the journey, right?
To figure out, can we...

[00:22:55] Michaela: Yeah, and some parts should be, well, don't repeat yourself, right? For example, pulling in these providers, configuring them so that I get the right data. This would actually be something that I'm pulling in again and again. And probably that's what your templates do, right? So you say billing, and then check, check, check, check, check: I have all my signals here and they're configured in a way that's useful. And then this investigation is hopefully a one-of-a-kind thing, right? So I'm investigating, and as we talked about before, once I realize what's going on, hopefully in my postmortem I'm going to make sure that this is not happening again. So this code is probably not going to be reused that often. Maybe some ideas from it, but hopefully we won't reproduce the exact same thing again.

[00:23:44] Micha: Yeah. That's a great point. And I think, coming back to the early part of the conversation around dashboards, right? Thus far, what we've experienced as engineers ourselves is that we probably had a phase around information gathering. All these dashboards are great for information gathering, but now with Kubernetes and containers and microservices, the number of services that we're running and the complexity have increased. So I think there's an opportunity for exactly what you're describing. It's more about action, right? What are we doing? We want to have the information, we want actionable intelligence that informs us what to do.

[00:24:20] Michaela: Yeah, exactly. Because now I'm looking at this dashboard and I'm seeing the signals. But then everything else is outside of this realm, right? So what actions do I take? Do I go to the console? Do I restart that service?
Or whatever I'm doing. And it's also vanishing, right? So I'm doing it, but then who can see what I did? And so now we are capturing this, which is very nice, and then we can learn from it in postmortems as well.

So I looked a little bit through your blog and your Twitter, and you were also talking about blameless postmortems. So how do you think about psychological safety? How should people in an organization look at on-call and incident management to really make sure that we are ending the blame game? You probably have some thoughts about that as well, because you're working in this area.

[00:25:19] Micha: Yeah. I think it's important to not put any blame on any person, right? And I guess that's also why we're building this product. It is a collaborative process to debug an issue or resolve an incident. What you want to achieve is to put the entire team in the best possible position to solve the issue at hand, with a support structure around it. So, coming back to the product, that means being able to open discussions and point people in the right direction.

[00:25:52] Michaela: So maybe also, if it's easier to find a problem and root-cause it, incidents become no issue, or at least a lesser issue. So maybe the blame game is not that important. Can we say it that way?

[00:26:08] Micha: I think so, yeah. If the process becomes repeatable, and we codify that, and we collaborate on it, and we build up, again, that system of record and knowledge base, I think that puts us in a safer position to solve the next one.

[00:26:25] Michaela: That's true. Yeah.
Another thing that I was thinking of when I looked through Fiberplane and what it does is chaos engineering. Chaos engineering is where you try to prevent not only the knowns, but also the unknowns, right? So you really think about what could go wrong and then build a fallback so that your system is reliable. Or, if this database goes down, not the whole system goes down, but only a part of it, and so on. Do you think that chaos engineers can actually source or use those notebooks that you're creating as input for knowing what we should actually look at?

[00:27:02] Micha: Well, one thing: I think it'd be a great provider, integrating with one or many of the chaos engineering services out there. I think it's a great way to train your team, right? We plug in some chaos engineering provider. The provider communicates with your infrastructure and starts pulling out wires from your system. And then: now go ahead and start debugging this issue, use different templates, and you can trial all sorts of different issues. I think it'd be super fun.

[00:27:37] Michaela: Yeah. So Micha, one thing that I also saw is that some of Fiberplane is open source. So what's your vision for open sourcing? Are some parts being open sourced? Can people help with building Fiberplane?

[00:27:51] Micha: Yeah, great question. So right now, what we've open sourced is a project called fp-bindgen. It's an SDK bindings generator for how you would create full-stack WebAssembly plugins. This is what we use to build our own Elasticsearch and Prometheus plugins. So we've open sourced that. It's on GitHub, and we've already gotten quite some feedback on it, but we would love some more.
And then going forward, we'll be open sourcing our templating stack and the proxy, which you install inside your cluster and which sets up the secure connections between the providers, your infrastructure, and the Fiberplane managed service. And then the command line interface that I mentioned will also be open source. So expect to hear more from us on the open source front.

[00:28:36] Michaela: Yeah, cool. I think that's so important, especially for developer tooling, that people can really get it into their hands and then help shape the product, or make the best product for the environments that they have. I think this is such a success strategy.

[00:28:50] Micha: Yeah, exactly. And as I said, we would love to get feedback on the providers and the plugin model. But maybe even, once we open source the provider stack, it would be great if people come up with crazy ideas, right? You can think of any type of provider that could surface data inside of the notebook. It doesn't need to be observability or monitoring data.

[00:29:14] Michaela: Cool. Yeah, I'm super excited about what will come out of that. So I want to come back a little bit to your founding story, because I know a lot of people are interested in developer tools and in startup founding as well. And you did it twice already, right? And maybe several more times in your life, I don't know, but right now we know of two instances. And also, for Fiberplane, you already got funding, right? Several million dollars. So how do you do it out of Europe? That's also one of my questions, because I think it's a bit of a different game here in Europe than it is in Silicon Valley. It doesn't look like there are opportunities around every corner.
I studied in the Netherlands, so I know that the Netherlands is actually a really good place for tech startups. Also, coming out of the universities there, I saw that you get a little bit of help and funding and things like this. But still, I would assume it's harder than in Silicon Valley. So how did you make it work? How did you get funding? You also said that Wercker had some funding at the beginning.

[00:30:26] Micha: Yeah, it's a good question. How did we do it the second time around? To be honest, because it's the second time, it was a bit easier. I mean, it's obviously never easy, but it was definitely easier. I do think that in Europe, if I compare the Wercker days to where we are now, the funding climate and the thinking around startups have improved a lot, right? There's more funding out there; there are more VCs. More importantly, though, what we've seen is that now the European unicorns have exited or gone IPO. So we actually have more operators inside of Europe who have experience in founding a startup, who are able to start doing angel investing, or who have worked at multiple startups. We just have more operating experience helping you out or investing in you, versus, honestly, bankers, right?

So the funds that funded us were Crane Venture Partners, which is a seed fund out of London that's focused on developer tools and infrastructure. I would highly recommend talking to them if you're thinking about building a developer tool company and you need some funding. Of course, my own fund is also focused on developer tools, so shameless plug there for NP-Hard Ventures. You can just Google that and find me.
And then we have Northzone, which is a multi-stage fund, also out of, well, actually quite a few different geographies, and Notion Capital out of London as well. We also have several micro VCs. We have somebody from the West Coast: Alana Anderson, who with Base Case Capital is investing in a lot of infrastructure and enterprise startups. And Max Claussen from System.One in Berlin is another one. So yeah, we have a good crew with different experience and different stages of funding as well.

[00:32:19] Michaela: Yeah, this was my next question for you. It's probably not only about the money. You said experience, right? It's also about the knowledge that people have of how to do things, and probably the people that they know, so that they can connect you with the right people and have the right network, and so on.

[00:32:36] Micha: Yeah, it's introductions, but it's also this: if you think about the funds that actually do developer tools, right? In their portfolio, they've seen startups trying over and over to tackle some kind of go-to-market issue, or trying to build an open source company, right? So they have some pattern matching and some knowledge about what to do and what not to do. Of course, it's all advice, but it's good to have some people in your corner who have at least seen these types of companies being built over and over again. And then other VCs have more experience in how to build up or scale up a sales organization, and in thinking about how to run a SaaS company. So yeah, different experience from different funds.

[00:33:20] Michaela: And so now you listed quite a lot of different investors.
Do you reach out to each one of them, or do you have a whole group meeting where they're all in there and you ask them for advice? How does it work?

[00:33:33] Micha: No, it's one-on-one chats, right? Either over chat, or we meet up for coffee or breakfast. We try to do that on a regular cadence. And then of course, when something exciting happens, such as our launch, we try to group them together and get them all on the same page around the same time. Or of course if an issue arises, right, which could also be the case. Then it's all hands on deck and everybody in the same room, or Zoom.

[00:34:01] Michaela: And what about your biggest struggle on your entrepreneurial journey, maybe now with Fiberplane or maybe with Wercker? When you started Wercker, did you think that somebody was going to buy this, and that this was going to be huge?

[00:34:16] Micha: Yeah, I think the ambition was always there, and that drive to just make better developer tools. That's been true for all the companies, or both of them.

[00:34:30] Michaela: Yeah. And what about the struggle?

[00:34:32] Micha: Yeah. So for Fiberplane now, it's not necessarily a struggle. It's just that with this mission of this flexible form factor, the fact that we're doing sort of two startups at the same time has been an interesting thing to build, right? You're building this rich, collaborative text editor and trying to build this infrastructure-oriented company, and I think that's been just an interesting experience in building out a team, the technology, and the product.

[00:35:01] Michaela: Yeah. Yeah.
So maybe can you tell me a little bit more about this: if people want to hop over to Fiberplane now and try it out, how does that work? Is there a sign-up? Is there a waiting list? I mean, you said that when this airs there will probably be a public beta, but still, do I have to reach out to you so you can give me a demo, or do I just fill in my credentials and I'm off to go?

[00:35:25] Micha: You can just sign up with Google and then you're off to the races. And then of course, if you want a demo and some more help or onboarding, we're happy to get on a call and walk you through it. But yeah, fiberplane.com.

[00:35:40] Michaela: Okay, cool. Is there also a video or something that we can look at?

[00:35:44] Micha: Yes. It's on the website; there's a video.

[00:35:50] Michaela: Okay. I will link that so that people can go there, and it will explain everything to them. What about pricing? Do you already have some idea around pricing?

[00:36:01] Micha: We've got some ideas on how to charge, but I think right now, for us, it's important to get to product-market fit and, as such, to get the feedback from these companies and teams using the product. So we'll introduce pricing at a later stage. For now it's free to use. You just give us your time and your feedback, and we're grateful.

[00:36:20] Michaela: Yeah. And what about my data? Is it safe with you? Do you have some visibility into my data, or do I send it over to you?

[00:36:29] Micha: Yeah, so the way the providers, the plugins, work is that they get activated through a proxy. We install a proxy inside of your cluster.
The proxy sets up a secure bidirectional tunnel from your infrastructure to the Fiberplane managed service. And then, for that specific query, we do store the data that's related to that query. So the result, we do store in the notebook. And yeah, we will probably come up with more enterprisey ideas around how to self-host it...

[00:36:59] Michaela: Right, or something like that.

[00:37:01] Micha: As an example, yeah. But again, we'd love to get some feedback on that.

[00:37:07] Michaela: On how that works, right? Okay, cool. That sounds really good. You could answer all of my questions, at least, but maybe my listeners have questions, and then they can send them to you. I think you would be quite happy with that, right?

[00:37:22] Micha: A hundred percent. @mies on Twitter, M-I-E-S, and @fiberplane on Twitter, fiberplane.com. Sign up, take it for a spin, shoot us a message.

[00:37:33] Michaela: Yeah, it sounds super interesting. I hope that a lot of my listeners will do that, and I will link everything that we talked about in my show notes, your Twitter handle and everything, so that people can reach you. I hope you get a lot of questions, and that people give it a spin and send you their use cases. I wish you all the best with your product. Thank you so much for being on my show today, Micha.

[00:37:59] Micha: Thank you for having me.

[00:38:01] Michaela: Yeah, it was really great. Bye bye.

[00:38:04] Micha: Bye.

[00:38:06] Michaela: This was another episode of the Software Engineering Unlocked podcast. If you enjoyed the episode, please help me spread the word about the podcast. Send the episode to a friend via email, Twitter, LinkedIn, or whatever messaging system you use. Or give it a positive review on your favorite podcasting platform, such as Spotify or iTunes. This would really mean a lot to me. So thank you for listening.
Don't forget to subscribe and I will talk to you in two weeks.
In today's episode of "Alles auf Aktien", financial journalists Daniel Eckert and Philipp Vetter talk about the peak-inflation euphoria and a setback for Deutsche Telekom. They also cover Docusign, DataDog, Auto Desk, Match Group, Advanced Micro Devices, Atlassian, Marvell Technologies, Amazon, Gilead Sciences, Amgen, Vertex Pharmaceuticals, Seagen, Zalando, Siemens Energy, Continental, Binance, Coinbase, FTX, Bitcoin, Ether, iShares MSCI EM Asia (WKN: A1C1H5), iShares MSCI Pacific ex-Japan (WKN: A0RL8Z), iShares MSCI AC Far East ex-Japan (WKN: A0HGV9), Xtrackers MSCI Singapore (WKN: DBX0KG), Xtrackers MSCI Malaysia (WKN: DBX0GW), Lyxor MSCI Korea (WKN: LYX016), and Lyxor MSCI Indonesia (WKN: LYX019). We welcome feedback at firstname.lastname@example.org. Disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher accept no liability for any losses arising from acting on the thoughts or ideas presented. For everyone who wants to know more: you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". Legal notice: https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
About Ben

Ben Whaley is a staff software engineer at Chime. Ben is co-author of the UNIX and Linux System Administration Handbook, the de facto standard text on Linux administration, and is the author of two educational videos: Linux Web Operations and Linux System Administration. He has been an AWS Community Hero since 2014. Ben has held Red Hat Certified Engineer (RHCE) and Certified Information Systems Security Professional (CISSP) certifications. He earned a B.S. in Computer Science from Univ. of Colorado, Boulder.

Links Referenced:
Chime Financial: https://www.chime.com/
alternat.cloud: https://alternat.cloud
Twitter: https://twitter.com/iamthewhaley
LinkedIn: https://www.linkedin.com/in/benwhaley/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.

Basically you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive?

Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale.
Again, that's snark.cloud/tailscale.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn and this is an episode unlike any other that has yet been released on this august podcast. Let's begin by introducing my first-time guest somehow because apparently an invitation got lost in the mail somewhere. Ben Whaley is a staff software engineer at Chime Financial and has been an AWS Community Hero since Andy Jassy was basically in diapers, to my level of understanding. Ben, welcome to the show.

Ben: Corey, so good to be here. Thanks for having me on.

Corey: I'm embarrassed that you haven't been on the show before. You're one of those people that slipped through the cracks and somehow I was very bad at following up slash hounding you into finally agreeing to be here. But you certainly waited until you had something auspicious to talk about.

Ben: Well, you know, I'm the one that really should be embarrassed here. You did extend the invitation and I guess I just didn't feel like I had something to drop. But I think today we have something that will interest most of the listeners without a doubt.

Corey: So, folks who have listened to this podcast before, or read my newsletter, or follow me on Twitter, or have shared an elevator with me, or at any point have passed me on the street, have heard me complain about the Managed NAT Gateway and its egregious data processing fee of four-and-a-half cents per gigabyte. And I have complained about this for small customers because they're in the free tier; why is this thing charging them 32 bucks a month? And I have complained about this on behalf of large customers who are paying the GDP of the nation of Belize in data processing fees as they wind up shoving very large workloads to and fro, which is I think part of the prerequisite requirements for having a data warehouse.
And you are no different than the rest of these people who have those challenges, with the singular exception that you have done something about it, and what you have done is so, in retrospect, blindingly obvious that I am embarrassed the rest of us never thought of it.

Ben: It's interesting because when you are doing engineering, it's often the simplest solution that is the best. I've seen this repeatedly. And it's a little surprising that it didn't come up before, but I think it's in some way, just a matter of timing. But what we came up with... and is this the right time to get into it, do you want to just kind of name the solution, here?

Corey: Oh, by all means. I'm not going to steal your thunder. Please, tell us what you have wrought.

Ben: We're calling it AlterNAT and it's an alternative solution to a high-availability NAT solution. As everybody knows, NAT Gateway is sort of the default choice; it certainly is what AWS pushes everybody towards. But there is, in fact, a legacy solution: NAT instances. These were around long before NAT Gateway made an appearance. And like I said they're considered legacy, but with the help of lots of modern AWS innovations and technologies like Lambdas and auto-scaling groups with max instance lifetimes and the latest generation of networking improved or enhanced instances, it turns out that we can maybe not quite get as effective as a NAT Gateway, but we can save a lot of money and skip those data processing charges entirely by having a NAT instance solution with a failover NAT Gateway, which I think is kind of the key point behind the solution. So, are you interested in diving into the technical details?

Corey: That is very much the missing piece right there. You're right. What we used to use was NAT instances. That was the thing that we used because we didn't really have another option.
And they had an interface in the public subnet where they lived and an interface hanging out in the private subnet, and they had to be configured to wind up passing traffic to and fro.

Well, okay, that's great and all but isn't that kind of brittle and dangerous? I basically have a single instance as a single point of failure and these are the days early on when individual instances did not have the level of availability and durability they do now. Yeah, it's kind of awful, but here you go. I mean, the most galling part of the Managed NAT Gateway service is not that it's expensive; it's that it's expensive, but also incredibly good at what it does. You don't have to think about this whole problem anymore, and as of recently, it also supports IPv4 to IPv6 translation as well.

It's not that the service is bad. It's that the service is stonkingly expensive, particularly at scale. And everything that we've seen before is either oh, run your own NAT instances or bend your knee and pay your money. And a number of folks have come up with different options where this is ridiculous. Just go ahead and run your own NAT instances.

Yeah, but what happens when I have to take it down for maintenance or replace it? It's like, well, I guess you're not going to the internet today. This has the, in hindsight, obvious solution: well, we run the Managed NAT Gateway, because the 32 bucks a month in instance-hour charges don't actually matter at any point of scale when you're doing this, but you wind up using the NAT instance for day in, day out traffic, and the failover mode is simply that you'll use the expensive Managed NAT Gateway until the instance is healthy again and then automatically change the route table back and forth.

Ben: Yep. That's exactly it. So, the auto-scaling NAT instance solution has been around for a long time, well before NAT Gateway was even released.
You could have NAT instances in an auto-scaling group where the size of the group was one, and if the NAT instance failed, it would just replace itself. But this left a period in which you'd have no internet connectivity during that, you know, when the NAT instance was swapped out.

So, the solution here is that when auto-scaling terminates an instance, it fails over the route table to a standby NAT Gateway, rerouting the traffic. So, there's never a point at which there's no internet connectivity, right? The NAT instance is running, processing traffic, gets terminated after a certain period of time, configurable, 14 days, 30 days, whatever makes sense for your security strategy; could be never, right? You could choose that you want to have your own maintenance window in which to do it.

Corey: And let's face it, this thing is more or less sitting there as a network traffic router, for lack of a better term. There is no need to ever log into the thing and make changes to it until and unless there's a vulnerability that you can exploit via somehow just talking to the TCP stack when nothing's actually listening on the host.

Ben: You know, you can run your own AMI that has been pared down to almost nothing, and that instance doesn't do much. It's using just a Linux kernel to sit on two networks and pass traffic back and forth. It has a translation table that kind of keeps track of the state of connections and so you don't need to have any service running. To manage the system, we have SSM so you can use Session Manager to log in, but frankly, you can just disable that. You almost never even need to get a shell. And that is, in fact, an option we have in the solution is to disable SSM entirely.

Corey: One of the things I love about this approach is that it is turnkey. You throw this thing in there and it's good to go.
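The route swap Ben describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AlterNAT's actual code: it assumes `ec2` is a boto3 EC2 client (or any object exposing the same `replace_route` method), and the function names and resource IDs are hypothetical.

```python
# Hypothetical sketch of the failover idea: swap the default route of each
# private route table between the NAT instance and the standby NAT Gateway.
# `ec2` is assumed to be a boto3 EC2 client; all IDs are placeholders.

DEFAULT_ROUTE = "0.0.0.0/0"


def fail_over_to_nat_gateway(ec2, route_table_ids, nat_gateway_id):
    """Point each route table's default route at the standby NAT Gateway."""
    for route_table_id in route_table_ids:
        ec2.replace_route(
            RouteTableId=route_table_id,
            DestinationCidrBlock=DEFAULT_ROUTE,
            NatGatewayId=nat_gateway_id,
        )


def fail_back_to_nat_instance(ec2, route_table_ids, network_interface_id):
    """Point the default route back at the NAT instance's network interface."""
    for route_table_id in route_table_ids:
        ec2.replace_route(
            RouteTableId=route_table_id,
            DestinationCidrBlock=DEFAULT_ROUTE,
            NetworkInterfaceId=network_interface_id,
        )
```

Something would call the first function when the instance is terminated or found unhealthy, and the second once its replacement is passing traffic again. The route change itself is the whole trick, which is also why in-flight connections do not survive it.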
And in the event that the instance becomes unhealthy, great, it fails traffic over to the Managed NAT Gateway while it terminates the old node and replaces it with a healthy one and then fails traffic back. Now, I do need to ask, what is the story of network connections during that failover and failback scenario?

Ben: Right, that's the primary drawback, I would say, of the solution is that any established TCP connections that are on the NAT instance at the time of a route change will be lost. So, say you have...

Corey: TCP now terminates on the floor.

Ben: Pretty much. The connections are dropped. If you have an open SSH connection from a host in the private network to a host on the internet and the instance fails over to the NAT Gateway, the NAT Gateway doesn't have the translation table that the NAT instance had. And not to mention, the public IP address also changes because you have an Elastic IP assigned to the NAT instance, a different Elastic IP assigned to the NAT Gateway, and so because that upstream IP is different, the remote host is, like, tracking the wrong IP. So, those connections, they're going to be lost.

So, there are some use cases where this may not be suitable. We do have some ideas on how you might mitigate that, for example, with the use of a maintenance window to schedule the replacement, replaced less often so it doesn't have to affect your workflow as much, but frankly, for many use cases, my belief is that it's actually fine. In our use case at Chime, we found that it's completely fine and we didn't actually experience any errors or failures. But there might be some use cases that are more sensitive or less resilient to failure in the first place.

Corey: I would also point out that a lot of how software is going to behave is going to be a reflection of the era in which it was moved to cloud. Back in the early days of EC2, you had no real sense of reliability around any individual instance, so everything was written in a very defensive manner.
These days, with instances automatically being able to flow among different hardware so we don't get instance interrupt notifications the way we once did on a semi-constant basis, it more or less has become what presents as bulletproof, so a lot of people are writing software that's a bit more brittle. But it's always been a best practice that when a connection fails, okay, what happens at failure? Do you just give up and throw your hands in the air and shriek for help or do you attempt to retry a few times, ideally backing off exponentially?

In this scenario, those retries will work. So, it's a question of how well have you built your software. Okay, let's say that you made the worst decisions imaginable, and okay, if that connection dies, the entire workload dies. Okay, you have the option to refactor it to be a little bit better behaved, or alternately, you can keep paying the Managed NAT Gateway tax of four-and-a-half cents per gigabyte in perpetuity forever. I'm not going to tell you what decision to make, but I know which one I'm making.

Ben: Yeah, exactly. The cost savings potential of it far outweighs the potential maintenance troubles, I guess, that you could encounter. But the fact is, if you're relying on Managed NAT Gateway and paying the price for doing so, it's not as if there's no chance for connection failure. NAT Gateway could also fail. I will admit that I think it's an extremely robust and resilient solution. I've been really impressed with it, especially so after having worked on this project, but it doesn't mean it can't fail.

And beyond that, upstream of the NAT Gateway, something could in fact go wrong. Like, internet connections are unreliable, kind of by design. So, if your system is not resilient to connection failures, like, there's a problem to solve there anyway; you're kind of relying on hope.
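The retry behavior Corey mentions, retrying a few times while backing off exponentially, might look like the sketch below. The function name, the default delays, and the use of random jitter are assumptions made for illustration; none of them come from the conversation.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff.

    The wait before each retry is drawn at random from 0 up to
    base_delay * 2**attempt (capped at max_delay), so that many clients
    dropped by the same route change do not all reconnect at once.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A caller wraps whatever opens the connection, for example `retry_with_backoff(lambda: fetch(url))` with a hypothetical `fetch`. A connection dropped during a failover then costs a short delay instead of the whole workload.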
So, it's a kind of a forcing function in some ways to build architectural best practices, in my view.

Corey: I can't stress enough that I have zero problem with the capabilities and the stability of the Managed NAT Gateway solution. My complaints about it start and stop entirely with the price. Back when you first showed me the blog post that is releasing at the same time as this podcast (and you can visit that at alternat.cloud), you sent me an early draft of this and what I loved the most was that your math was off because of a not complete understanding of the gloriousness that is just how egregious the NAT Gateway charges are.

Your initial analysis said, “All right, if you're throwing half a petabyte out to the internet, this has the potential of cutting the bill by,” I think it was $10,000 or something like that. It's, “Oh no, no. It has the potential to cut the bill by an entire twenty-two-and-a-half thousand dollars.” Because this processing fee does not replace any egress fees whatsoever. It's purely additive. If you forget to have a free S3 Gateway endpoint in a private subnet, every time you put something into or take something out of S3, you're paying four-and-a-half cents per gigabyte on that, despite the fact there's no internet transitory work, it's not crossing availability zones. It is simply a four-and-a-half cent fee to retrieve something that has only cost you, at most, 2.3 cents per month to store in the first place. Flip that switch, that becomes completely free.

Ben: Yeah. I'm not embarrassed at all to talk about the lack of education I had around this topic. The fact is I'm an engineer primarily and I came across the cost stuff because it kind of seemed like a problem that needed to be solved within my organization. And if you don't mind, I might just linger on this point and kind of think back a few months. I looked at the AWS bill and I saw this egregious ‘EC2 Other' category. It was taking up the majority of our bill.
Like, the single biggest line item was EC2 Other. And I was like, “What could this be?”

Corey: I want to wind up flagging that just because that bears repeating because I often get people pushing back of, “Well, how bad... it's one Managed NAT Gateway. How much could it possibly cost? $10?” No, it is the majority of your monthly bill. I cannot stress that enough.

And that's not because the people who work there are doing anything that they should not be doing or didn't understand all the nuances of this. It's because for the security posture that is required for what you do (you are at Chime Financial, let's be clear here) putting everything in public subnets was not really a possibility for you folks.

Ben: Yeah. And not only that but there are plenty of services that have to be on private subnets. For example, AWS Glue services must run in private VPC subnets if you want them to be able to talk to other systems in your VPC; like, they cannot live in public subnet. So essentially, if you want to talk to the internet from those jobs, you're forced into some kind of NAT solution. So, I dug into the EC2 Other category and I started trying to figure out what was going on there.

There's no way, natively, to look at what traffic is transiting the NAT Gateway. There's not an interface that shows you what's going on, what's the biggest talkers over that network. Instead, you have to have flow logs enabled and have to parse those flow logs. So, I dug into that.

Corey: Well, you're missing a step first because in a lot of environments, people have more than one of these things, so you get to first do the scavenger hunt of, okay, I have a whole bunch of Managed NAT Gateways and first I need to go diving into CloudWatch metrics and figure out which are the heavy talkers. Is usually one or two followed by a whole bunch of small stuff, but not always, so figuring out which VPC you're even talking about is a necessary prerequisite.

Ben: Yeah, exactly.
The data around it is almost missing entirely. Once you come to the conclusion that it is a particular NAT Gateway—like, that's a set of problems to solve on its own—but first, you have to go to the flow logs, you have to figure out what are the biggest upstream IPs that it's talking to. Once you have the IP, it still isn't apparent what that host is. In our case, we had all sorts of outside parties that we were talking to a lot and it's a matter of sorting by volume and figuring out well, this IP, what is the reverse IP? Who is potentially the host there?I actually had some wrong answers at first. I set up VPC endpoints to S3 and DynamoDB and SQS because those were some top talkers and that was a nice way to gain some security and some resilience and save some money. And then I found, well, Datadog; that's another top talker for us, so I ended up creating a nice private link to Datadog, which they offer for free, by the way, which is more than I can say for some other vendors. But then I found some outside parties, there wasn't a nice private link solution available to us, and yet, it was by far the largest volume. So, that's what kind of started me down this track is analyzing the NAT Gateway myself by looking at VPC flow logs. Like, it's shocking that there isn't a better way to find that traffic.Corey: It's worse than that because VPC flow logs tell you where the traffic is going and in what volumes, sure, on an IP address and port basis, but okay, now you have a Kubernetes cluster that spans two availability zones. Okay, great. What is actually passing through that? So, you have one big application that just seems awfully chatty, you have multiple workloads running on the thing. What's the expensive thing talking back and forth? The only way that you can reliably get the answer to that I found is to talk to people about what those workloads are actually doing, and failing that you're going code spelunking.Ben: Yep. You're exactly right about that. 
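The flow-log step Ben describes (sort destinations by volume, then work out who each IP belongs to) can be sketched roughly as below. This is a minimal illustration, not the tooling Ben used: it assumes flow log records in the default version 2 space-separated format, where the fifth field is the destination address and the tenth is the byte count. The `top_talkers` name is hypothetical.

```python
from collections import Counter

def top_talkers(lines, n=3):
    """Sum bytes per destination address and return the n largest."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Default v2 records have 14 fields; skip headers and short lines.
        if len(fields) < 14 or fields[0] == "version":
            continue
        dstaddr, byte_count = fields[4], fields[9]
        # NODATA / SKIPDATA records carry "-" instead of a number.
        if not byte_count.isdigit():
            continue
        totals[dstaddr] += int(byte_count)
    return totals.most_common(n)
```

As the conversation notes, this only yields IP addresses and volumes. Mapping an address to a vendor, or to the workload behind it, remains a manual step.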
In our case, it ended up being apparent because we have a set of subnets where only one particular project runs. And when I saw the source IP, I could immediately figure that part out. But if it's a K8s cluster in the private subnets, yeah, how are you going to find it out? You're going to have to ask everybody that has workloads running there.Corey: And we're talking about in some cases, millions of dollars a month. Yeah, it starts to feel a little bit predatory as far as how it's priced and the amount of work you have to put in to track this stuff down. I've done this a handful of times myself, and it's always painful unless you discover something pretty early on, like, oh, it's talking to S3 because that's pretty obvious when you see that. It's, yeah, flip switch and this entire engagement just paid for itself a hundred times over. Now, let's see what else we can discover.That is always one of those fun moments because, first, customers are super grateful to learn that, oh, my God, I flipped that switch. And I'm saving a whole bunch of money. Because it starts with gratitude. “Thank you so much. This is great.” And it doesn't take a whole lot of time for that to alchemize into anger of, “Wait. You mean, I've been being ridden like a pony for this long and no one bothered to mention that if I click a button, this whole thing just goes away?”And when you mention this to your AWS account team, like, they're solicitous, but they either have to present as, “I didn't know that existed either,” which is not a good look, or, “Yeah, you caught us,” which is worse. There's no positive story on this. It just feels like a tax on not knowing trivia about AWS. I think that's what really winds me up about it so much.Ben: Yeah, I think you're right on about that as well. My misunderstanding about the NAT pricing was data processing is additive to data transfer. 
I expected when I replaced NAT Gateway with NAT instance, that I would be substituting data transfer costs for NAT Gateway costs, NAT Gateway data processing costs. But in fact, NAT Gateway incurs both data processing and data transfer. NAT instances only incur data transfer costs. And so, this is a big difference between the two solutions.Not only that, but if you're in the same region, if you're egressing out of your say us-east-1 region and talking to another hosted service also within us-east-1—never leaving the AWS network—you don't actually even incur data transfer costs. So, if you're using a NAT Gateway, you're paying data processing.Corey: To be clear you do, but it is cross-AZ in most cases billed at one penny egressing, and on the other side, that hosted service generally pays one penny ingressing as well. Don't feel bad about that one. That was extraordinarily unclear and the only reason I know the answer to that is that I got tired of getting stonewalled by people that later turned out didn't know the answer, so I ran a series of experiments designed explicitly to find this out.Ben: Right. As opposed to the five cents to nine cents that is data transfer to the internet. Which, add that to data processing on a NAT Gateway and you're paying between thirteen-and-a-half cents to nine-and-a-half cents for every gigabyte egressed. And this is a phenomenal cost. And at any kind of volume, if you're doing terabytes to petabytes, this becomes a significant portion of your bill. And this is why people hate the NAT Gateway so much.Corey: I am going to short-circuit an angry comment I can already see coming on this where people are going to say, “Well, yes. But it's a multi-petabyte scale. Nobody's paying on-demand retail price.” And they're right. Most people who are transmitting that kind of data, have a specific discount rate applied to what they're doing that varies depending upon usage and use case.Sure, great. 
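The per-gigabyte arithmetic in this exchange can be written out as a small sketch. The rates are the ones quoted in the conversation (4.5 cents of data processing, 5 to 9 cents of internet data transfer), not a current price list, and the function names are hypothetical. `Decimal` is used so the cent values add up exactly.

```python
from decimal import Decimal

# Data processing fee per GB for the Managed NAT Gateway, as quoted above.
NAT_PROCESSING_PER_GB = Decimal("0.045")

def nat_gateway_per_gb(transfer_rate):
    """NAT Gateway traffic pays processing on top of data transfer."""
    return NAT_PROCESSING_PER_GB + transfer_rate

def nat_instance_per_gb(transfer_rate):
    """NAT instance traffic pays data transfer only."""
    return transfer_rate

def processing_fee(gb):
    """Processing fee alone for a given volume in GB."""
    return NAT_PROCESSING_PER_GB * gb
```

With these numbers, `nat_gateway_per_gb` gives 9.5 to 13.5 cents per gigabyte across the quoted transfer range, and `processing_fee(500_000)` comes to 22,500 dollars, which is the volume at which the twenty-two-and-a-half-thousand figure mentioned earlier works out at this rate.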
But I'm more concerned with the people who are sitting around dreaming up ideas for a company where I want to wind up doing some sort of streaming service. I talked to one of those companies very early on in my tenure as a consultant around the billing piece and they wanted me to check their napkin math because they thought that at their numbers when they wound up scaling up, if their projections were right, that they were going to be spending $65,000 a minute, and what did they not understand? And the answer was, well, you didn't understand this other thing, so it's going to be more than that, but no, you're directionally correct. So, that idea that started off on a napkin, of course, they didn't build it on top of AWS; they went elsewhere.And last time I checked, they'd raised well over a quarter-billion dollars in funding. So, that's a business that AWS would love to have on a variety of different levels, but they're never going to even be considered because by the time someone is at scale, they either have built this somewhere else or they went broke trying.Ben: Yep, absolutely. And we might just make the point there that while you can get discounts on data transfer, you really can't—or it's very rare—to get discounts on data processing for the NAT Gateway. So, any kind of savings you can get on data transfer would apply to a NAT instance solution, you know, saving you four-and-a-half cents per gigabyte inbound and outbound over the NAT Gateway equivalent solution. So, you're paying a lot for the benefit of a fully-managed service there. Very robust, nicely engineered fully-managed service as we've already acknowledged, but an extremely expensive solution for what it is, which is really just a proxy in the end. It doesn't add any value to you.Corey: The only way to make that more expensive would be to route it through something like Splunk or whatnot. 
And Splunk does an awful lot for what they charge per gigabyte, but it just feels like it's rent-seeking in some of the worst ways possible. And what I love about this is that you've solved the problem in a way that is open-source, you have already released it in Terraform code. I think one of the first to-dos on this for someone is going to be, okay, now also make it CloudFormation and also make it CDK so you can drop it in however you want.
And anyone can use this. I think the biggest mistake people might make in glancing at this is, well, I'm looking at the hourly charge for the NAT Gateways and that's 32-and-a-half bucks a month and the instances that you recommend are hundreds of dollars a month for the big network-optimized stuff. Yeah, if you care about the hourly rate of either of those two things, this is not for you. That is not the problem that it solves. If you're an independent learner annoyed about the $30 charge you got for a Managed NAT Gateway, don't do this. This will only add to your billing concerns.
Where it really shines is once you're at, I would say probably about ten terabytes a month, give or take, in Managed NAT Gateway data processing; that is where it starts to make sense to consider this. The breakeven is around six or so, but there is value to not having to think about things. Once you get to that level of spend, though, it's worth devoting a little bit of infrastructure time to something like this.
Ben: Yeah, that's effectively correct. The total cost of running the solution, like, all-in, there's eight Elastic IPs, four NAT Gateways, if you're—say you're four zones; could be less if you're in fewer zones—like, n NAT Gateways, n NAT instances, depending on how many zones you're in, and I think that's about it. And I said right in the documentation, if any of those baseline fees are a material number for your use case, then this is probably not the right solution. Because we're talking about saving thousands of dollars. 
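The breakeven reasoning here reduces to dividing the extra fixed monthly cost of running NAT instances by the 4.5 cents per gigabyte of processing that they avoid. The sketch below makes that explicit. The 270-dollar fixed cost used in the usage note is a hypothetical figure chosen only because it reproduces the roughly six-terabyte breakeven Corey mentions; real instance and Elastic IP costs will differ, and the function names are illustrative.

```python
from decimal import Decimal

# Processing fee avoided per GB when traffic bypasses the Managed NAT Gateway.
PROCESSING_SAVED_PER_GB = Decimal("0.045")

def breakeven_gb(extra_fixed_monthly_cost):
    """Monthly GB at which avoided processing fees equal the added fixed cost."""
    return Decimal(extra_fixed_monthly_cost) / PROCESSING_SAVED_PER_GB

def net_monthly_savings(gb_processed, extra_fixed_monthly_cost):
    """Avoided processing fees minus the added fixed cost (negative below breakeven)."""
    return PROCESSING_SAVED_PER_GB * gb_processed - Decimal(extra_fixed_monthly_cost)
```

Under that assumed 270-dollar overhead, `breakeven_gb(270)` is 6,000 GB, and at ten terabytes a month `net_monthly_savings(10_000, 270)` is 180 dollars. The savings grow linearly from there, which is why the baseline fees stop mattering at higher volumes.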
Any of these small numbers for NAT Gateway hourly costs, NAT instance hourly costs, that shouldn't be a factor, basically.Corey: Yeah, it's like when I used to worry about costing my customers a few tens of dollars in Cost Explorer or CloudWatch or request fees against S3 for their Cost and Usage Reports. It's yeah, that does actually have a cost, there's no real way around it, but look at the savings they're realizing by going through that. Yeah, they're not going to come back and complaining about their five-figure consulting engagement costing an additional $25 in AWS charges and then lowering it by a third. So, there's definitely a difference as far as how those things tend to be perceived. But it's easy to miss the big stuff when chasing after the little stuff like that.This is part of the problem I have with an awful lot of cost tooling out there. They completely ignore cost components like this and focus only on the things that are easy to query via API, of, oh, we're going to cost-optimize your Kubernetes cluster when they think about compute and RAM. And, okay, that's great, but you're completely ignoring all the data transfer because there's still no great way to get at that programmatically. And it really is missing the forest for the trees.Ben: I think this is key to any cost reduction project or program that you're undertaking. When you look at a bill, look for the biggest spend items first and work your way down from there, just because of the impact you can have. And that's exactly what I did in this project. I saw that ‘EC2 Other' slash NAT Gateway was the big item and I started brainstorming ways that we could go about addressing that. And now I have my next targets in mind now that we've reduced this cost to effectively… nothing, extremely low compared to what it was, we have other new line items on our bill that we can start optimizing. 
But in any cost project, start with the big things.Corey: You have come a long way around to answer a question I get asked a lot, which is, “How do I become a cloud economist?” And my answer is, you don't. It's something that happens to you. And it appears to be happening to you, too. My favorite part about the solution that you built, incidentally, is that it is being released under the auspices of your employer, Chime Financial, which is immune to being acquired by Amazon just to kill this thing and shut it up.Because Amazon already has something shitty called Chime. They don't need to wind up launching something else or acquiring something else and ruining it because they have a Slack competitor of sorts called Amazon Chime. There's no way they could acquire you [unintelligible 00:27:45] going to get lost in the hallways.Ben: Well, I have confidence that Chime will be a good steward of the project. Chime's goal and mission as a company is to help everyone achieve financial peace of mind and we take that really seriously. We even apply it to ourselves and that was kind of the impetus behind developing this in the first place. You mentioned earlier we have Terraform support already and you're exactly right. I'd love to have CDK, CloudFormation, Pulumi supports, and other kinds of contributions are more than welcome from the community.So, if anybody feels like participating, if they see a feature that's missing, let's make this project the best that it can be. I suspect we can save many companies, hundreds of thousands or millions of dollars. And this really feels like the right direction to go in.Corey: This is easily a multi-billion dollar savings opportunity, globally.Ben: That's huge. I would be flabbergasted if that was the outcome of this.Corey: The hardest part is reaching these people and getting them on board with the idea of handling this. 
And again, I think there's a lot of opportunity for the project to evolve in the sense of different settings depending upon risk tolerance. I can easily see a scenario where in the event of a disruption to the NAT instance, it fails over to the Managed NAT Gateway, but fail back becomes manual so you don't have a flapping route table back and forth or a [hold 00:29:05] downtime or something like that. Because again, in that scenario, the failure mode is just well, you're paying four-and-a-half cents per gigabyte for a while until you wind up figuring out what's going on as opposed to the failure mode of you wind up disrupting connections on an ongoing basis, and for some workloads, that's not tenable. This is absolutely, for the common case, the right path forward.Ben: Absolutely. I think it's an enterprise-grade solution and the more knobs and dials that we add to tweak to make it more robust or adaptable to different kinds of use cases, the best outcome here would actually be that the entire solution becomes irrelevant because AWS fixes the NAT Gateway pricing. If that happens, I will consider the project a great success.Corey: I will be doing backflips like you wouldn't believe. I would sing their praises day in, day out. I'm not saying reduce it to nothing, even. I'm not saying it adds no value. I would change the way that it's priced because honestly, the fact that I can run an EC2 instance and be charged $0 on a per-gigabyte basis, yeah, I would pay a premium on an hourly charge based upon traffic volumes, but don't meter per gigabyte. That's where it breaks down.Ben: Absolutely. And why is it additive to data transfer, also? Like, I remember first starting to use VPC when it was launched and reading about the NAT instance requirement and thinking, “Wait a minute. I have to pay this extra management and hourly fee just so my private hosts could reach the internet? 
That seems kind of janky.”And Amazon established a norm here because Azure and GCP both have their own equivalent of this now. This is a business choice. This is not a technical choice. They could just run this under the hood and not charge anybody for it or build in the cost and it wouldn't be this thing we have to think about.Corey: I almost hate to say it, but Oracle Cloud does, for free.Ben: Do they?Corey: It can be done. This is a business decision. It is not a technical capability issue where well, it does incur cost to run these things. I understand that and I'm not asking for things for free. I very rarely say that this is overpriced when I'm talking about AWS billing issues. I'm talking about it being unpredictable, I'm talking about it being impossible to see in advance, but the fact that it costs too much money is rarely my complaint. In this case, it costs too much money. Make it cost less.Ben: If I'm not mistaken. GCPs equivalent solution is the exact same price. It's also four-and-a-half cents per gigabyte. So, that shows you that there's business games being played here. Like, Amazon could get ahead and do right by the customer by dropping this to a much more reasonable price.Corey: I really want to thank you both for taking the time to speak with me and building this glorious, glorious thing. Where can we find it? And where can we find you?Ben: alternat.cloud is going to be the place to visit. It's on Chime's GitHub, which will be released by the time this podcast comes out. As for me, if you want to connect, I'm on Twitter. @iamthewhaley is my handle. And of course, I'm on LinkedIn.Corey: Links to all of that will be in the podcast notes. Ben, thank you so much for your time and your hard work.Ben: This was fun. Thanks, Corey.Corey: Ben Whaley, staff software engineer at Chime Financial, and AWS Community Hero. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. 
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry rant of a comment that I will charge you not only four-and-a-half cents per word to read, but four-and-a-half cents to reply because I am experimenting myself with being a rent-seeking schmuck.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.
About Mike
Besides his duties as The Duckbill Group's CEO, Mike is the author of O'Reilly's Practical Monitoring, and previously wrote the Monitoring Weekly newsletter and hosted the Real World DevOps podcast. He was previously a DevOps Engineer for companies such as Taos Consulting, Peak Hosting, Oak Ridge National Laboratory, and many more. Mike is originally from Knoxville, TN (Go Vols!) and currently resides in Portland, OR.
Links Referenced:
Twitter: https://twitter.com/Mike_Julian
mikejulian.com: https://mikejulian.com
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for modern infrastructure and applications at every scale. Datadog enables teams to see everything: dashboarding, alerting, application performance monitoring, infrastructure monitoring, UX monitoring, security monitoring, dog logos, and log management, in one tightly integrated platform. With 600-plus out-of-the-box integrations with technologies including all major cloud providers, databases, and web servers, Datadog allows you to aggregate all your data into one platform for seamless correlation, allowing teams to troubleshoot and collaborate together in one place, preventing downtime and enhancing performance and reliability. Get started with a free 14-day trial by visiting datadoghq.com/screaminginthecloud, and get a free t-shirt after installing the agent.
Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. 
That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.
Basically you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive?
Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale
Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn and I'm having something of a crisis of faith based upon a recent conversation I've had with my returning yet again guest, Mike Julian, my business partner and CEO of The Duckbill Group. Welcome back, Mike.
Mike: Hi, everyone.
Corey: So, the revelation that had surfaced unexpectedly was, based upon a repeated talking point where I am a terrible employee slash expensive to manage, et cetera, et cetera, and you pointed out that you've been managing me for four years or so now, at which point I did a spit take, made all the more impressive by the fact that I wasn't drinking anything at the time, and realized, “Oh, my God, you're right, but I haven't had any of the usual problems slash friction with you that I have with basically every boss I've ever had in my entire career.” So, I'm spiraling. Let's talk about that.
Mike: My recollection of that conversation is slightly different than yours. Mine is that you called me and said, “Mike, I just realized that you're my boss.” And I'm like, “How do you feel about that?” He's like, “I'm not really sure.”
Corey: And I'm still not entirely sure how I feel if I'm being fully honest with you. Just because it's such a weird thing to have to deal with. 
Because historically, I always view a managerial relationship as starting from a place of a power imbalance. And that is the one element that is missing from our relationship. We each own half the company, we can fire each other, but it takes the form of tearing the company apart, and that isn't something that we're really set up to entertain.Mike: And you know, I actually think it's deeper than that because you owning the other half of the company is not really… it's not really power in itself. Like, yeah, it is, but you could easily own half the company and have no power. Because, like, really when we talk about power, we're talking about political power, influence, and I think the reason that there is no power imbalance is because each of us does something in the company that is just as important as the other. And they're both equally valuable to the company and we both recognize the other's contributions, as that, as being equally valuable to the company. It's less to do about how much we own and more about the work that we do.Corey: Oh, of course. The ownership starts and stops entirely with the fact that neither one of us can force the other out. So it's, as opposed to well, I own 51% of the company, so when I'm tired of your bullshit, you're leaving. And that is a dynamic that's never entered into it. I'm also going to add one more thing onto what you just said, which is, both of us would sooner tear off our own skin than do the other's job.Mike: Yeah. 
God, I would hate to do your job, but I know you'd hate to do mine.
Corey: You look at my calendar on a busy meeting day and you have a minor panic attack just looking at it where, “Oh, my God, talking to that many people.” And you are going away for a while and you come back with a whole analytical model where your first love language feels like it's spreadsheets on some days, and I look at this and it's like, “Yeah, I know what some of those numbers mean.” And it just drives me up a wall, the idea of building out a plan and an execution thing and then delegating a lot of it to other people, it does not work for my worldview in so many different ways. It's the reason I think that you and I get along. That and our shared values.
Mike: I remember the first time that you and I did a consulting engagement together. We went on a multi-day trip. And at the end of, like, three days of nonstop conversations, you made a comment, it was like, “Cool. So, when are we going to do that again?” Like, you were excited by it. I could tell you were energized. And I was just thinking, “Please, for the love of God, I want to die right now.”
Corey: One of the weirdest parts about all of it, though, is neither one of us is in a scenario where what we do for a living and how we go about it works without the other.
Mike: Right. Yeah, like, this is one of the interesting things about the company we have built is that it would not work with just you or just me; it's us being co-founders is what makes it successful.
Corey: The thing that I do not understand and I don't think I ever will is the idea of co-founder speed dating, where you basically go to some big networking mixer event, pick some rando off the street, and congratulations, that's your business partner. Have fun. It is not that much of an exaggeration to say that co-founding a company with someone else is like a marriage. 
You are creating a legal entity that without very specific controls and guidelines, you are opening yourself up to massive liability issues if the other person decides to screw you over. That is part of the reason that the values match was so important for us.Mike: Yeah, it is surprising to me how similar being co-founders and business partners is to being married. I did not expect how close those two things were. You and I spend an incredible amount of time just on the relationship for each of us, which I never expected, but makes sense in hindsight.Corey: That's I think part of it makes the whole you managing me type of relationship work is because not only can you not, “Fire me,” quote-unquote, but I can't quit without—Mike: [laugh].Corey: Leaving behind a giant pile of effort with nothing to show for it over the last four years. So, it's one of those conversation styles where we go into the conversation knowing, regardless of how heated it gets or how annoyed we are with each other, that we are not going to blow the company up because one of us is salty that week.Mike: Right. Yeah, I remember from the legal perspective, when we put together a partnership agreement, our attorneys were telling us that we really needed to have someone at the 51% owner, and we were both adamant that no, that doesn't work for us. And finally, the way that we handled it is if you and I could not handle a dispute, then the only remedy left was to shut the entire thing down. And that would be an automatic trigger. We've never ever, ever even got close to that point.But like, I like that's the structure because it really means that if you and I can't agree on something and it's a substantial thing, then there's no business, which really kind of sets the stage for how important the conversations that we have are. And of course, you and I, we're close, we have a great relationship, so that's never come up. 
But I do like that it's still there.
Corey: I like the fact that there's always going to be an option to get out. It's not a suicide pact, for lack of a better term. But it's also something that neither one of us would ever entertain lightly. And credit where due, there have been countless conversations where you and I were diametrically opposed; we each talk through it, and one or the other of us will just do a complete one-eighty on our position where, “Okay, you convinced me,” and that's it. What's so odd about that is because we don't have too many examples of that in public society, it just seems like there's now this entire focus on, “Oh, if you make an observation or a point that's wrong, you've got to double down on it.” Why would you do that? That makes zero sense. When you've considered something from a different angle and changed your mind, why waste more time on it?
Mike: I think there's other interesting ones, too, where you and I have come at something from a different angle and one of us will realize that we just actually don't care as much as we thought we did. And we'll just back down because it's not the hill we want to die on.
Corey: Which brings us to a good point. What hill do we want to die on?
Mike: Hmm. I think we've only got a handful. I mean, as it should be; like, there should not be many of them.
Corey: No, no, because most things can change in the fullness of time. Just because it's not something we believe is right for the business right now does not mean it never will be.
Mike: Yeah. I think all of them really come down to questions of values, which is why you and I work so well together, in that we don't have a lot of common interests, we're at completely different stages in our lives, but we have very tightly aligned values. Which means that when we go into a discussion about something, we know where the other stands right away, like, we could generally make a pretty good guess about it. 
And there's often very little question about how some values discussion is going to go. Like, do we take on a certain client that is, I don't know, they build landmines? Is that a thing that we're going to do? Of course not. Like—
Corey: I should clarify, we're talking here about physical landmines; not whatever disastrous failure mode your SaaS application has.
Mike: [laugh]. Yeah.
Corey: We know what those are.
Mike: Yeah, and like, that sort of thing, you and I would never even pose the question to each other. We would just make the decision. And maybe we tell each other later, like, “Hey, haha, look what happened,” but there will never be a discussion around it because it just—our values are so tightly aligned that it wouldn't be necessary.
Corey: Whenever we're talking to someone that's in a new sector or a company that has a different expression, we always like to throw it past each other just to double-check, you don't have a problem with—insert any random thing here; the breadth of our customer base just astounds me—and very rarely has either one of us thrown a flag on something just because we do have this affinity for saying yes and making money.
Mike: Yeah. But you actually wanted to talk about the terribleness of managing you.
Corey: Yeah. I am very curious as to what your experience has been.
Mike: [laugh].
Corey: And before we dive into it, I want to call out a couple of things that make me a little atypical for your typical problem employee. I am ADHD personified. My particular expression of that means that my energy level is very different at different times of day; there are times where I will get nothing done for a day or two, and then in four hours, get three weeks of work done. It is hard to predict and it's hard to schedule around and it's never clear exactly what that energy level is going to be at any given point in time. That's the starting point of this nonsense. Now, take it away.
Mike: Yeah. 
What most people know about Corey is what everyone sees on Twitter, which is what I would call the high highs. Everyone sees you as your most energetic, or at least perceived as the most energetic. If they see you in person at a conference, it's the same sort of thing. What people don't see are your lows, which are really, really low lows.And it's not a matter of, like, you don't get anything done. Like, you know, we can handle that; it's that you disappear. And it may be for a couple hours, it may be for a couple of days, and we just don't really know what's going on. That's really hard. But then, in your high highs, they're really high, but they're also really unpredictable.So, what that means is that because you have ADHD, like, the way that your brain thinks, the way your brain works, is that you don't control what you're going to focus on, and you never know what you're going to focus on. It may be exactly what you should be focusing on, which is a huge win for everyone involved, but sometimes you focus on stuff that doesn't matter to anyone except you. Sometimes really interesting stuff comes out of that, but oftentimes it doesn't. So, helping build a structure to work around those sorts of things and to also support those sorts of things, has been one of the biggest challenges that I've had. And most of my job is really about building a support structure for you and enabling you to do your best work.So, that's been really interesting and really challenging because I do not think that way. Like, if I need to focus on something, I just say, “Great. I'm just going to focus on this thing,” and I'll focus on it until I'm done. But you don't work that way, and you couldn't conceivably work that way, ever. So, it's always been hard because I say things like, “Hey, Corey, I need you to go write this series of emails.” And you'll write them when your brain decides that wants to write them, which might be never.Corey: That's part of the problem. 
I've also found that if I have an idea floating around too long, it'll linger for years and I'll never write anything about it, whereas there are times when I have—the inspiration strikes, I write a one- to 2000-word blog post every week that goes out, and there are times it takes me hours and there are times I bust out the entire thing in first draft form in 20 minutes or less. Like, if it's Domino's, like, there's not going to be a refund on it. So, it's kind of wild and I wish I could harness that somehow I don't know how, but… that's one of the biggest challenges.Mike: I wish I could too, but it's one of the things that you learn to get used to. And with that, because we've worked together for so long, I've gotten to be able to tell in what state of mind you are. Like, are you in a state where if I put something in front of you, you're going to go after it hard, and like, great things are going to happen, or are you more likely to ignore that I said anything? And I can generally tell within the first sentence or so of bringing something up. But that also means that I have other—I have to be careful with how I structure requests that I have for you.In some cases, I come with a punch list of, like, here's six things I need to get through and I'm going to sit on this call while we go through them. In other cases, I have to drip them out one at a time over the span of a week just because that's how your mind is those days. That makes it really difficult because that's not how most people are managed and it's not how most people expect to manage. So, coming up with different ways to do that has been one of the trickiest things I've done.Corey: Let's move on a little bit other than managing my energy levels because that does not sound like a particularly difficult employee to manage. “Okay, great. We've got to build some buffer room into the schedule in case he winds up not delivering for a few days. 
Okay, we can live with that.” But oh, working with me gets so much worse.
Mike: [laugh]. It absolutely does.
Corey: This is my performance review. Please hit me with it.
Mike: Yeah. The other major concern that has been challenging to work through that makes you really frustrating to work with, is you hate conflict. Actually, I don't actually—let me clarify that further. You avoid conflict, except your definition of conflict is more broad than most. Because when most people think of conflicts, like, “Oh, I have to go have this really hard conversation, it's going to be uncomfortable, and, like—”
Corey: “Time to go fire Steven.”
Mike: Right, or things like, “I have to have our performance conversation with someone.” Like, everyone hates those, but, like, there's good ways and bad ways to do them, like, it's uncomfortable even at the best of times. But with you, it's more than that, it's much more broad. You avoid giving direction because you perceive giving direction as potential for conflict, and because you're so conflict-avoidant, you don't give direction to people.
Which means that if someone does something you don't like, you don't say anything and then it leaves everyone on the team to say, like, “I really wish Corey would be more explicit about what he wants. I wish he was more vocal about the direction he wanted to go.” Like, “Please tell us something more.” But you're so conflict-avoidant that you don't, and no amount of begging or asking for it has really changed that, so we end up with these two things where you're doing most of the work yourself because you don't want to direct other people to do it.
Corey: I will push back slightly on one element of that, which is when I have a strong opinion about something, I am not at all hesitant about articulating that. I mean, this is not—like, my Twitter is not performance art; it's very much what I believe.
The challenge is that for so much of what we talk about internally on a day-to-day basis, I don't really have a strong opinion. And what I've always shied away from is the idea of telling people how to do their jobs. So, I want to be very clear that I'm not doing that, except when it's important.
Because we've all been in environments in the corporate world where the president of the company wanders past or your grand-boss walks into the room and asks an idle question, or, “Maybe we should do this,” and it never feels like it's really just idle pondering. It's, “Welp, new strategic priority just dropped from on high.”
Mike: Right.
Corey: And every senior manager has a story about screwing that one up. And I have led us down that path once or twice previously. So—
Mike: That's true.
Corey: When I don't have a strong opinion, I think what I need to get better at is saying, “I don't give a shit,” but when I frame it like that it causes different problems.
Mike: Yeah. Yeah, that's very true. I still don't completely agree with your disagreement there, but I understand your perspective. [laugh].
Corey: Oh, it's not like you can fire me, so it doesn't really matter. I kid. I kid.
Mike: Right. Yeah. So, I think those are the two major areas that make you a real challenge to manage and a challenge to direct. But one of the reasons why I think we've been successful at it, or at least I'll say I've been successful at managing you, is I do so with such a gentle touch that you don't realize that I'm doing anything, and I have all these different—
Corey: Well, it did take me four years to realize what was going on.
Mike: Yeah, like, I have all these different ways of getting you to do things, and you don't realize I'm doing them. And, like, I've shared many of them here for you for the first time. And that really is what has worked out well. Like, a lot of the ways that I manage you, you don't realize are management.
Corey: Managing shards. Maintenance windows. Overprovisioning.
ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's GO M-O-M-E-N-T-O dot co slash screaming.
Corey: What advice would you have for someone for whom a lot of these stories are resonating? Because, “Hey, I have a direct report who is driving me to distraction and a lot of it sounds like what you're describing.” What do you wish you'd known sooner about how to coax performance out of me, for lack of a better phrasing?
Mike: When we first started really working together, I knew what ADHD was, but I knew it from a high school paper that I did on ADHD, and it's um—oh, what was it—“The Overdiagnosis of ADHD,” which was a thing when you and I were at high school. That's all I knew is just that ADHD was suspected to be grossly overdiagnosed and that most people didn't have it. What I have learned is that yeah, that might have been true—maybe; I don't know—but for people that do have ADHD, it's a real thing. Like, it does have some pretty substantial impact.
And I wish I had known more about how that manifests, and particularly how it manifests in different people. And I wish I'd known more earlier on about the coping mechanisms that different people create for themselves and how they manage and how they—[sigh], I'm struggling to come up with the right word here, but many people who are neurodivergent in some way create coping mechanisms and ways to shift themselves to appear more neurotypical. And I wish I had understood that better. Particularly, I wish I had understood that better for you when we first started because I've kind of learned about it over time.
And I spent so much time trying to get you to work the way that I work rather than understand that you work different. Had I spent more time to understand how you work and what your coping mechanisms were, the earlier years of Duckbill would have been so much smoother.Corey: And again, having this conversation has been extraordinarily helpful. On my side of it, one of the things that was absolutely transformative and caused a massive reduction in our interpersonal conflict was the very simple tool of, it's not necessarily a problem when I drop something on the floor and don't get to it, as long as I throw a hand up and say, “I'm dropping this thing,” and so someone else can catch it as we go. I don't know how much of this is ADHD speaking versus how much of it is just my own brokenness in some ways, but I feel like everyone has this neverending list of backlog tasks that they'll get to someday that generally doesn't ever seem to happen. More often than not, I wind up every few months, just looking at my ever-growing list, reset to zero and we'll start over. And every once in a while, I'll be really industrious and knock a thing or two off the list. But so many that don't necessarily matter or need to be me doing them, but it drives people to distraction when something hits my email inbox, it just dies there, for example.Mike: Yeah. One of the systems that we set up here is that if there's something that Corey does not immediately want to do, I have you send it to someone else. And generally it's to me and then I become a router for you. But making that more explicit and making that easier for you—I'm just like, “If this is not something that you're going to immediately take care of yourself, forward it to me.” And that was huge. But then other things, like when you take time off, no one knows you're taking time off. 
And it's an—the easiest thing is no one cares that you're taking time off; just, you know, tell us you're doing it.
Corey: Yeah, there's a difference between, “I'm taking three days off,” and in your case, the answer is generally, “Oh, thank God. He's finally using some of that vacation.”
Mike: [laugh].
Corey: The problem is there's a world of difference between, “Oh, I'm going to take these three days off,” and just not showing up that day. That tends to cause problems for people.
Mike: Yeah. Just waving a hand in the air and saying, “Hey, this is happening,” that's great. But not waving it, not saying anything at all, that's where the pain really comes from.
Corey: When you take a look across your experience managing people, which to my understanding your first outing with it was at this company—
Mike: Yeah.
Corey: What about managing me is the least surprising and the most surprising that you've picked up during that pattern? Because again, the story has always been, “Oh, yeah, you're a terrible manager because you've never done it before,” but I look back and you're clearly the best manager I've ever had, if for no other reason than neither one of us can rage-quit. But there's a lot of artistry to how you've handled a lot of challenges that I present to you.
Mike: I'm the best manager you've had because I haven't fired you. [laugh].
Corey: And also, some of the best ones I have had fired me. That doesn't necessarily disqualify someone.
Mike: Yeah. I want to say, I am by no means experienced as a manager. As you mentioned, this is my first outing into doing management. As my coach tells me, I'm getting better every day. I am not terrible [laugh].
The—let's see—most surprising; least surprising. I don't think I have anything for least surprising. I think most surprising is how easy it is for you to accept feedback and how quickly you do something about it, how quickly you take action on that feedback.
I did not expect that, given all your other proclivities for not liking managers, not liking to be managed, for someone to give feedback to you and you say, “Yep, that sounds good,” and then do it, like, that was incredibly surprising.Corey: It's one of those areas where if you're not embracing or at least paying significant attention to how you are being perceived, maybe that's a problem, maybe it's not, let's be very clear. However, there's also a lot of propensity there to just assume, “Oh, I'm right and screw everyone else.” You can do an awful lot of harm that way. And that is something I've had to become incredibly aware of, especially during the pandemic, as the size of my audience at this point more than quadrupled from the start of the pandemic. These are a bunch of people now who have never met me in person, they have no context on what I do.And I tend to view the world the way you might expect a dog to behave, who caught a car that he has absolutely no idea how to drive, and he's sort of winging it as he goes. Like, step one, let's not kill people. Step two, eh, we'll figure that out later. Like, step one is the most important.Mike: Mm-hm. Yeah.Corey: And feedback is hard to get, past a certain point. I often lament from time to time that it's become more challenging for me to weed out who the jerks are because when you're perceived to have a large platform and more or less have no problem calling large companies and powerful folk to account, everyone's nice to you. And well, “Really? He's terrible and shitty to women. That's odd. He's always been super nice to me.” Is not the glowing defense that so many people seem to think that it is. It's I have learned to listen a lot more clearly the more I speak.Mike: That's a challenge for me as well because, as we've mentioned, my first foray into management. 
As we've had more people in the company, that has gotten more of a challenge of I have to watch what I say because my word carries weight on its own, by virtue of my position. And you have the same problem, except yours is much more about your weight in public, rather than your weight internally.
Corey: I see it as different sides of the same coin. I take it as a personal bit of a badge of honor that almost every person I meet, including the people who've worked here, have come away very surprised by just how true to life my personality on Twitter is to how I actually am when I interact with humans. You're right, they don't see the low sides, but I also try not to take that out on the staff either.
Mike: [laugh]. Right.
Corey: We do the best of what we have, I think, and it's gratifying to know that I can still learn new tricks.
Mike: Yeah. And I'm not firing you anytime soon.
Corey: That's right. Thank you again for giving me the shotgun performance review. It's always appreciated. If people want to learn more, where can they find you, to get their own performance preview, perhaps?
Mike: Yeah, you can find me on Twitter at @Mike_Julian. Or you can sign up for our newsletter, where I'm talking about my upcoming book on consulting at mikejulian.com.
Corey: And we will put links to that into the show notes. Thanks again, sir.
Mike: Thank you.
Corey: Mike Julian, CEO of The Duckbill Group, my business partner, and apparently my boss. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that demonstrates the absolute worst way to respond to a negative performance evaluation.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.
Aaron and Brian talk about all things KubeConNA (Detroit) 2022.
SHOW: 665
CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw
CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"
SHOW SPONSORS:
Datadog Kubernetes Solution: Maximum Visibility into Container Environments
Start monitoring the health and performance of your container environment with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
CloudZero - Cloud Cost Intelligence for Engineering Teams
CDN77 - CDN Focused on VOD and Security
CDN77 - ask for a free trial with no duration or traffic limits.
SHOW NOTES:
CNCF Announcements - KubeConNA 2022 (Detroit)
KubeCon Vendor List
Topic 1 - Let's start with what was good or bad at CloudNativeCon/KubeCon. Overall vibes at the conference?
7,000 attendees, 300 vendor-companies, good amount of end-users
Good: Well-organized, live interactions
Bad: City choice, keynotes, Day1 & 2 pricing models
Is this Big Tent 2.0? (Aaron - I don't think so…)
Topic 2 - Interesting technologies or technology trends?
Kubernetes is no longer the center of this conference
Service Mesh, WASM (Web Assembly), Cost-Mgmt, various forms of Security
Starting to see fragmentation (e.g. Cloud-Native Security is its own conference)
Topic 3 - Are we in a bubble? Lots of companies in each technology category? Will we see consolidation, failures or buyers? What's the mission for CNCF - place for projects to incubate with no “horse in the race”, all areas will eventually consolidate down to a few players over time?
Topic 4 - What's next for KubeCon? Can it survive as a big event without a central technology? Will it splinter into lots of little events?
Did the CNCF turn this into too much of a marketing event?
What's in it for the sponsors? Especially if it splits into different events?
Why do they keep making bad location choices? (Amsterdam 4/20, Chicago - Nov ’23)
FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
About Chetan
Chetan Venkatesh is a technology startup veteran focused on distributed data, edge computing, and software products for enterprises and developers. He has 20 years of experience in building primary data storage, databases, and data replication products. Chetan holds a dozen patents in the area of distributed computing and data storage.
Chetan is the CEO and Co-Founder of Macrometa – a Global Data Network featuring a Global Data Mesh, Edge Compute, and In-Region Data Protection. Macrometa helps enterprise developers build real-time apps and APIs in minutes – not months.
Links Referenced:
Macrometa: https://www.macrometa.com
Macrometa Developer Week: https://www.macrometa.com/developer-week
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.
Basically you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive?
Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale.
Corey: Managing shards.
Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming That's GO M-O-M-E-N-T-O dot co slash screamingCorey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today, this promoted guest episode is brought to us basically so I can ask a question that has been eating at me for a little while. That question is, what is the edge? Because I have a lot of cynical sarcastic answers to it, but that doesn't really help understanding. My guest today is Chetan Venkatesh, CEO and co-founder at Macrometa. Chetan, thank you for joining me.Chetan: It's my pleasure, Corey. You're one of my heroes. I think I've told you this before, so I am absolutely delighted to be here.Corey: Well, thank you. We all need people to sit on the curb and clap as we go by and feel like giant frauds in the process. So let's start with the easy question that sets up the rest of it. Namely, what is Macrometa, and what puts you in a position to be able to speak at all, let alone authoritatively, on what the edge might be?Chetan: I'll answer the second part of your question first, which is, you know, what gives me the authority to even talk about this? Well, for one, I've been trying to solve the same problem for 20 years now, which is build distributed systems that work really fast and can answer questions about data in milliseconds. And my journey's sort of been like the spiral staircase journey, you know, I keep going around in circles, but the view just keeps getting better every time I do one of these things. 
So I'm on my fourth startup doing distributed data infrastructure, and this time really focused on trying to provide a platform that's the antithesis of the cloud. It's kind of like taking the cloud and flipping it on its head because instead of having a single region application where all your stuff runs in one place, on us-west-1 or us-east-1, what if your apps could run everywhere, like, they could run in hundreds and hundreds of cities around the world, much closer to where your users and devices and most importantly, where interesting things in the real world are happening?
And so we started Macrometa about five years back to build a new kind of distributed cloud—let's call it the edge—that kind of looks like a CDN, a Content Delivery Network, but really brings very sophisticated platform-level primitives for developers to build applications in a distributed way around primitives for compute, primitives for data, but also some very interesting things that you just can't do in the cloud anymore. So that's Macrometa. And we're doing something with edge computing, which is a big buzzword these days, but I'm sure you'll ask me about that.
Corey: It seems to be. Generally speaking, when I look around and companies are talking about edge, it feels almost like it is a redefining of what they already do to use a term that is currently trending and deep in the hype world.
Chetan: Yeah. You know, I think humans just being biologically social beings just tend to be herd-like, and so when we see a new trend, we like to slap it on everything we have. We did that 15 years back with cloud, if you remember, you know? Everybody was very busy trying to stick the cloud label on everything that was on-prem. Edge is sort of having that edge-washing moment right now.
But I define edge very specifically as very different from the cloud.
You know, where the cloud is defined by centralization, i.e., you've got a giant hyperscale data center somewhere far, far away, where typically electricity, real estate, and those things are reasonably cheap, i.e., not in urban centers, where those things tend to be expensive.You know, you have platforms where you run things at scale, it's sort of a your mess for less business in the cloud and somebody else manages that for you. The edge is actually defined by location. And there are three types of edges. The first edge is the CDN edge, which is historically where we've been trying to make things faster with the internet and make the internet scale. So Akamai came about, about 20 years back and created this thing called the CDN that allowed the web to scale. And that was the first killer app for edge, actually. So that's the first location that defines the edge where a lot of the peering happens between different network providers and the on-ramp around the cloud happens.The second edge is the telecom edge. That's actually right next to you in terms of, you know, the logical network topology because every time you do something on your computer, it goes through that telecom layer. And now we have the ability to actually run web services, applications, data, directly from that telecom layer.And then the third edge is—sort of, people have been familiar with this for 30 years. The third edge is your device, just your mobile phone. It's your internet gateway and, you know, things that you carry around in your pocket or sit on your desk, where you have some compute power, but it's very restricted and it only deals with things that are interesting or important to you as a person, not in a broad range. So those are sort of the three things. And it's not the cloud. 
And these three things are now becoming important as a place for you to build and run enterprise apps.Corey: Something that I think is often overlooked here—and this is sort of a natural consequence of the cloud's own success and the joy that we live in a system that we do where companies are required to always grow and expand and find new markets—historically, for example, when I went to AWS re:Invent, which is a cloud service carnival in the desert that no one in the right mind should ever want to attend but somehow we keep doing, it used to be that, oh, these announcements are generally all aligned with people like me, where I have specific problems and they look a lot like what they're talking about on stage. And now they're talking about things that, from that perspective, seem like Looney Tunes. Like, I'm trying to build Twitter for Pets or something close to it, and I don't understand why there's so much talk about things like industrial IoT and, “Machine learning,” quote-unquote, and other things that just do not seem to align with. I'm trying to build a web service, like it says on the name of a company; what gives?And part of that, I think, is that it's difficult to remember, for most of us—especially me—that what they're coming out with is not your shopping list. Every service is for someone, not every service is for everyone, so figuring out what it is that they're talking about and what those workloads look like, is something that I think is getting lost in translation. And in our defense—collective defense—Amazon is not the best at telling stories to realize that, oh, this is not me they're talking to; I'm going to opt out of this particular thing. You figure it out by getting it wrong first. Does that align with how you see the market going?Chetan: I think so. You know, I think of Amazon Web Services, or even Google, or Azure as sort of Costco and, you know, Sam's Wholesale Club or whatever, right? 
They cater to a very broad audience and they sell a lot of stuff in bulk and cheap. And you know, so it's sort of a lowest common denominator type of a model. And so emerging applications, and especially emerging needs that enterprises have, don't necessarily get solved in the cloud. You've got to go and build it up yourself on sort of the crude primitives that they provide.
So okay, go use your bare basic EC2, your S3, and build your own edgy, or whatever, you know, cutting edge thing you want to build over there. And if enough people are doing it, I'm sure Amazon and Google start to pay interest and you know, develop something that makes it easier. So you know, I agree with you, they're not the best at this sort of a thing. The edge is also a phenomenon that's orthogonal, and diametrically opposite, to the architecture of the cloud and the economics of the cloud.
And we do centralization in the cloud in a big way. Everything is in one place; we make giant piles of data in one database or data warehouse, slice and dice it, and almost all our computer science is great at doing things in a centralized way. But when you take data and chop it into 50 copies and keep it in 50 different places on Earth, and you have this thing called the internet or the wide area network in the middle, trying to keep all those copies in sync is a nightmare. So you start to deal with some very basic computer science problems like distributed state and how do you build applications that have a consistent view of that distributed state?
So you know, there have been attempts to solve these problems for 15, 18 years, but none of those attempts have really cracked the intersection of three things: a way for programmers to do this in a way that doesn't blow their heads with complexity, a way to do this cheaply and effectively enough where you can build real-world applications that serve billions of users concurrently at a cost point that actually is economical and make sense, and third, a way to do this with adequate levels of performance where you don't die waiting for the spinning wheel on your screen to go away.So these are the three problems with edge. And as I said, you know, me and my team, we've been focused on this for a very long while. And me and my co-founder have come from this world and we created a platform very uniquely designed to solve these three problems, the problems of complexity for programmers to build in a distributed environment like this where data sits in hundreds of places around the world and you need a consistent view of that data, being able to operate and modify and replicate that data with consistency guarantees, and then a third one, being able to do that, at high levels of performance, which translates to what we call ultra-low latency, which is human perception. The threshold of human perception, visually, is about 70 milliseconds. Our finest athletes, the best Esports players are about 70 to 80 milliseconds in their twitch, in their ability to twitch when something happens on the screen. The average human is about 100 to 110 milliseconds.So in a second, we can maybe do seven things at rapid rates. You know, that's how fast our brain can process it. Anything that falls below 100 milliseconds—especially if it falls into 50 to 70 milliseconds—appears instantaneous to the human mind and we experience it as magic. 
And so where edge computing and where my platform comes in is that it literally puts data and applications within 50 milliseconds of 90% of humans and devices on Earth and allows now a whole new set of applications where latency and location and the ability to control those things with really fine-grained capability matters. And we can talk a little more about what those apps are in a bit.Corey: And I think that's probably an interesting place to dive into at the moment because whenever we talk about the idea of new ways of building things that are aimed at decentralization, first, people at this point automatically have a bit of an aversion to, “Wait, are you talking about some of the Web3 nonsense?” It's one of those look around the poker table and see if you can spot the sucker, and if you can't, it's you. Because there are interesting aspects to that entire market, let's be clear, but it also seems to be occluded by so much of the grift and nonsense and spam and the rest that, again, sort of characterize the early internet as well. The idea though, of decentralizing out of the cloud is deeply compelling just to anyone who's really ever had to deal with the egress charges, or even the data transfer charges inside of one of the cloud providers. The counterpoint is it feels that historically, you either get to pay the tax and go all-in on a cloud provider and get all the higher-level niceties, or otherwise, you wind up deciding you're going to have to more or less go back to physical data centers, give or take, and other than the very baseline primitives that you get to work with of VMs and block storage and maybe a load balancer, you're building it all yourself from scratch. It seems like you're positioning this as setting up for a third option. I'd be very interested to hear it.Chetan: Yeah. And a quick comment on decentralization: good; not so sure about the Web3 pieces around it. We tend to talk about computer science and not the ideology of distributing data. 
There are political reasons, there are ideological reasons around data and sovereignty and individual human rights, and things like that. There are people far smarter than me who should explain that.
I fall personally into the Nicholas Weaver school of skepticism about Web3 and blockchain and those types of things. And for readers who are not familiar with Nicholas Weaver, please go online. He teaches at UC Berkeley and is just one of the finest minds of our time. And I think he's broken down some very good reasons why we should be skeptical about, sort of, Web3 and, you know, things like that. Anyway, that's a digression.
Coming back to what we're talking about, yes, it is a new paradigm, but that's the challenge, which is I don't want to introduce a new paradigm. I want to provide a continuum. So what we've built is a platform that looks and feels very much like Lambdas, and a poly-model database. I hate the word multi. It's a pretty dumb word, so I've started to substitute ‘multi' with ‘poly' everywhere, wherever I can find it.
So it's not multi-cloud; it's poly-cloud. And it's not multi-model; it's poly-model. Because what we want is a world where developers have the ability to use the best paradigm for solving problems. And it turns out when we build applications that deal with data, data doesn't just come in one form, it comes in many different forms, it's polymorphic, and so you need a data platform that's also, you know, polyglot and poly-model to be able to handle that. So that's one part of the problem, which is, you know, we're trying to provide a platform that provides continuity by looking like a key-value store like Redis. It looks like a document database—
Corey: Or the best database in the world, Route 53 TXT records. But please, keep going.
Chetan: Well, we've got that too, so [laugh] you know? And then we've got a streaming graph engine built into it that kind of looks and behaves like a graph database, like Neo4j, for example.
And, you know, it's got columnar capabilities as well. So it's sort of a really interesting data platform that is not open-source; it's proprietary because it's designed to solve these problems of being able to distribute data, put it in hundreds of locations, keep it all in sync, but it looks like a conventional NoSQL database. And it speaks PostgreSQL, so if you know PostgreSQL, you can program it, you know, pretty easily.
What it's also doing is taking away the responsibility for engineers and developers to understand how to deal with very arcane problems like conflict resolution in data. I made a change in Mumbai; you made a change in Tokyo; who wins? Our systems in the cloud—you know, DynamoDB, and things like that—they have very crude answers for this: something called last writer wins. We've done a lot of work to build a protocol that brings you ACID-like consistency in these types of problems and makes it easy to reason with state change when you've got an application that's potentially running in 100 locations and each of those places is modifying the same record, for example.
And then the second part of it is it's a converged platform. So it doesn't just provide data; it provides a compute layer that's deeply integrated directly with the data layer itself. So think of it as Lambdas running, like, stored procedures inside the database. That's really what it is. We've built a very, very specialized compute engine that exposes containers and functions as stored procedures directly on the database.
And so they run inside the context of the database and so you can build apps in Python, Go, your favorite language; it compiles down into a [unintelligible 00:15:02] kernel that actually runs inside the database among all these different polyglot interfaces that we have. And the third thing that we do is we provide an ability for you to have very fine-grained control on your data.
Because today, data's become a political tool; it's become something that nation-states care a lot about.
Corey: Oh, do they ever.
Chetan: Exactly. And [unintelligible 00:15:24] regulated. So here's the problem. You're an enterprise architect and your application is going to be consumed in 15 countries; there are 13 different frameworks to deal with. What do you do? Well, you spin up 13 different versions, one for each country, and you know, build 13 different teams, and have 13 zero-day attacks and all that kind of craziness, right?
Well, data protection is actually one of the most important parts of the edge because, with something like Macrometa, you can build an app once, and we'll provide all the necessary localization for any region processing, data protection with things like tokenization of data so you can exfiltrate data securely without violating potentially PII sensitive data exfiltration laws within countries, things like that; i.e., it's solving some really hard problems by providing an opinionated platform that does these three things. And I'll summarize it thus, Corey, and we can kind of dig into each piece. Our platform is called the Global Data Network. It's not a global database; it's a global data network. It looks like a frickin database, but it's actually a global network available in 175 cities around the world.
Corey: The challenge, of course, is where does the data actually live at rest, and—this is why people care about—well, there are two reasons people care about that; one is the data residency locality stuff, which has always, honestly for me, felt a little bit like a bit of a cloud provider shakedown. Yeah, build a data center here or you don't get any of the business of anything that falls under our regulation. The other is, what does the egress cost of that look like?
Because yeah, I can build a whole multicenter data store on top of AWS, for example, but at minimum, we're talking two cents a gigabyte of transfer, even within a region in some cases, and many times that externally.
Chetan: Yeah, that's the real shakedown: the egress costs [laugh] more than the other example that you talked about over there. But it's a reality of how cloud pricing works and things like that. What we have built is a network that is completely independent of the cloud providers. We're built on top of five different service providers. Some of them are cloud providers, some of them are telecom providers, some of them are CDNs.
And so we're building our global data network on top of routes and capacity provided by transfer providers who have different economics than the cloud providers do. So our cost for egress falls somewhere between two and five cents, for example, depending on which edge locations, which countries, and things that you're going to use over there. We've got a pretty generous egress fee where, you know, for certain thresholds, there's no egress charge at all, but over certain thresholds, we start to charge between two and five cents. But even if you were to take it at the higher end of that spectrum, five cents per gigabyte for transfer, the amount of value our platform brings in architecture and reduction in complexity and the ability to build apps that are frankly, mind-boggling—one of my customers is a SaaS company in marketing that uses us to inject offers while people are on their website, you know, browsing. Literally, you hit their website, you do a few things, and then boom, there's a customized offer for them.
In banking that's used, for example, you know, you're making your minimum payments on your credit card, but you have a good payment history and you've got a decent credit score, well, let's give you an offer to give you a short-term loan, for example.
So those types of new applications, you know, are really at this intersection where you need low latency, you need in-region processing, and you also need to comply with data regulation. So when you're building a high-value revenue-generating app like that, egress cost, even at five cents, right, tends to be very, very cheap, and the smallest part of, you know, the complexity of building them.
Corey: One of the things that I think we see a lot of is that the tone of this industry is set by the big players, and they have done a reasonable job, by and large, of making anything that isn't running in their blessed environments, let me be direct, sound kind of shitty, where it's like, “Oh, do you want to be smart and run things in AWS?”—or GCP? Or Azure, I guess—“Or do you want to be foolish and try and build it yourself out of popsicle sticks and twine?” And, yeah, on some level, if I'm trying to treat everything like it's AWS and run a crappy analog version of DynamoDB, for example, I'm not going to have a great experience, but if I also start from a perspective of not using things that are higher up the stack offerings, that experience starts to look a lot more reasonable as we start expanding out. But it still does present to a lot of us as, well, we're just going to run things in a VM somewhere and treat them just like we did back in 2005. What's changed in that perspective?
Chetan: Yeah, you know, I can't talk for others but for us, we provide a high-level Platform-as-a-Service, and that platform, the global data network, has three pieces to it. First piece is—and none of this will translate into anything that AWS or GCP has because this is the edge, Corey; it's completely different, right? So the global data network that we have is composed of three technology components. The first one is something that we call the global data mesh. And this is Pub/Sub and event processing on steroids.
We have the ability to connect data sources across all kinds of boundaries; you've got some data in Germany and you've got some data in New York. How do you put these things together and get them streaming so that you can start to do interesting things with correlating this data, for example?
And you might have to get across not just physical boundaries, like, they're sitting in different systems in different data centers; they might be logical boundaries, like, hey, I need to collaborate with data from my supply chain partner and we need to be able to do something that's dynamic in real-time, you know, to solve a business problem. So the global data mesh is a way to very quickly connect data wherever it might be in legacy systems, in flat files, in streaming databases, in data warehouses, what have you—you know, we have 500-plus types of connectors—but most importantly, it's not just getting the data streaming, it's then turning it into an API and making that data fungible. Because the minute you put an API on it and it's become fungible, now that data has actually got a lot of value. And so the data mesh is a way to very quickly connect things up and put an API on it. And that API can now be consumed by front-ends, it can be consumed by other microservices, things like that.
Which brings me to the second piece, which is edge compute. So we've built a compute runtime that is Docker compatible, so it runs containers, it's also Lambda compatible, so it runs functions. Let me rephrase that; it's not Lambda-compatible, it's Lambda-like. So no, you can't take your Lambda and dump it on us and it won't just work. You have to do some things to make it work on us.
Corey: But so many of those things are so deeply integrated to the ecosystem that they're operating within, and—
Chetan: Yeah.
Corey: That, on the one hand, is presented by cloud providers as, “Oh, yes. This shows how wonderful these things are.” In practice, talk to customers.
“Yeah, we're using it as spackle between the different cloud services that don't talk to one another despite being made by the same company.”
Chetan: [laugh] right.
Corey: It's fun.
Chetan: Yeah. So the second piece, edge compute, allows you now to build microservices that are stateful, i.e., they have data that they interact with locally, and schedule them along with the data on our network of 175 regions around the world. So you can build distributed applications now.
Now, your microservice back-end for your banking application or for your HR SaaS application or e-commerce application is not running in us-east-1 in Virginia; it's running literally in 15, 18, 25 cities where your end-users are, potentially. And to take an industrial IoT case, for example, you might be ingesting data from the electricity grid in 15, 18 different cities around the world; you can do all of that locally now. So that's what edge functions do: they flip the cloud model around because instead of sending data to where the compute is in the cloud, you're actually bringing compute to where the data is originating, or the data is being consumed, such as through a mobile app. So that's the second piece.
And the third piece is global data protection, which is hey, now I've got a distributed infrastructure; how do I comply with all the different privacy and regulatory frameworks that are out there? How do I keep data secure in each region? How do I potentially share data between regions in such a way that, you know, I don't break the model of compliance globally and create a billion-dollar headache for my CIO and CEO and CFO, you know? So that's the third piece of capabilities that this provides.
All of this is presented as a set of serverless APIs. So you simply plug these APIs into your existing applications. Some of your applications work great in the cloud. Maybe there are just parts of that app that should be on our edge.
And that's usually where most customers start; they take a single web service or two that's not doing so great in the cloud because it's too far away; it has data sensitivity, location sensitivity, time sensitivity, and so they use us as a way to just deal with that on the edge.
And there are other applications where it's completely what I call edge native, i.e., no dependency on the cloud; it comes and runs completely distributed across our network and consumes primarily the edge's infrastructure, and just maybe sends some data back to the cloud for long-term storage or long-term analytics.
Corey: And ingest does remain free. The long-term analytics, of course, means that once that data is there, good luck convincing a customer to move it because that gets really expensive.
Chetan: Exactly, exactly. It's a speciation—as I like to say—of the cloud, into a fast tier where interactions happen, i.e., the edge. So systems of record are still in the cloud; we still have our transactional systems over there, our databases, data warehouses.
And those are great for historical types of data, as you just mentioned, but for things that are operational in nature, that are interactive in nature, where you really need to deal with them because they're time-sensitive, they're depleting value in seconds or milliseconds, they're location sensitive, there's a lot of noise in the data and you need to get to just those bits of data that actually matter, throw the rest away, for example—which is what you do with a lot of telemetry in cybersecurity, for example, right—those are all the things that require a new kind of a platform, not a system of record, a system of interaction, and that's what the global data network is, the GDN. And these three primitives, the data mesh, edge compute, and data protection, are the way that our APIs are shaped to help our enterprise customers solve these problems.
So to put it another way, imagine ten years from now what DynamoDB and global tables with a really fast Lambda and Kinesis with actually Event Processing built directly into Kinesis might be like. That's Macrometa today, available in 175 cities.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for modern infrastructure and applications at every scale. Datadog enables teams to see everything: dashboarding, alerting, application performance monitoring, infrastructure monitoring, UX monitoring, security monitoring, dog logos, and log management, in one tightly integrated platform. With 600-plus out-of-the-box integrations with technologies including all major cloud providers, databases, and web servers, Datadog allows you to aggregate all your data into one platform for seamless correlation, allowing teams to troubleshoot and collaborate together in one place, preventing downtime and enhancing performance and reliability. Get started with a free 14-day trial by visiting datadoghq.com/screaminginthecloud, and get a free t-shirt after installing the agent.
Corey: I think it's also worth pointing out that it's easy for me to fall into a trap that I wonder if some of our listeners do as well, which is, I live in, basically, downtown San Francisco. I have gigabit internet connectivity here, to the point where when it goes out, it is suspicious and more than a little bit frightening because my ISP—Sonic.net—is amazing and deserves every bit of praise that you never hear any ISP ever get. But when I travel, it's a very different experience.
When I go to oh, I don't know, the conference center at re:Invent last year and find that the internet is patchy at best, or downtown San Francisco on Verizon today, I discover that the internet is almost non-existent, and suddenly applications that I had grown accustomed to just working suddenly didn't.
And there's a lot more people who live far away from these data center regions and tier one backbones directly to same than don't. So I think that there's a lot of mistaken ideas around exactly what the lower bandwidth experience of the internet is today. And that is something that feels inadvertently classist if that makes sense. Or geographically bigoted?
Chetan: Yeah. No, I think those two points are very well articulated. I wish I could articulate it that well. But yes, if you can afford 5G, some of those things get better. But again, 5G is not everywhere yet. It will be, but 5G can in many ways democratize at least one part of it, which is provide an overlap network at the edge, where if you left home and you switched networks, on to a wireless, you can still get the same quality of service that you're used to getting from Sonic, for example. So I think it can solve some of those things in the future. But the second part of it—what did you call it? What bigoted?
Corey: Geographically bigoted. And again, that's maybe a bit of a strong term, but it's easy to forget that you can't get around the speed of light. I would say that the most poignant example of that I had was when I was—in the before times—giving a keynote in Australia. So ah, I know what I'll do, I'll spin up an EC2 instance for development purposes—because that's how I do my development—in Australia. And then I would just pay my provider for cellular access for my iPad and that was great.
And I found the internet was slow as molasses for everything I did. Like, how do people even live here? Well, turns out that my provider would backhaul traffic to the United States.
So to log into my session, I would wind up having to connect with a local provider, backhaul to the US, then connect back out from there to Australia across the entire Pacific Ocean, talk to the server, get the response, which would follow that return path. It's, yeah, turns out that doing laps around the world is not the most efficient way of transferring any data whatsoever, let alone in sizable amounts.
Chetan: And that's why we decided to call our platform the global data network, Corey. In fact, it's really for a very simple reason, which is that we have our own network underneath all of this and we stop this whole ping-pong effect of data going around and help create deterministic guarantees around latency, around location, around performance. We're trying to democratize latency and these types of problems in a way that programmers shouldn't have to worry about all this stuff. You write your code, you push publish, it runs on a network, and it all gets there with a guarantee that 95% of all your requests will happen within 50 milliseconds round-trip time, from any device, you know, in these population centers around the world.
So yeah, it's a big deal. It's sort of one of our je ne sais quoi pieces in our mission and charter, which is to just democratize latency and access, and sort of get away from this geographical nonsense of, you know, how networks work and it will dynamically switch topology and just make everything slow, you know, in a very non-deterministic way.
Corey: One last topic that I want to ask you about—because I'm near certain, given your position, you will have an opinion on this—what's your take on, I guess, the carbon footprint of clouds these days? Because a lot of people have been talking about it; there has been a lot of noise made about it, justifiably so. I'm curious to get your take.
Chetan: Yeah, you know, it feels like we're in the '30s and the '40s of the carbon movement when it comes to clouds today, right?
Maybe there's some early awareness of the problem, but you know, frankly, there's very little we can do other than just sort of put a wet finger in the air, compute some carbon offset and plant some trees. I think these are good building blocks; they're not necessarily the best ways to solve this problem, ultimately. But one of the things I care deeply about and you know, my company cares a lot about is helping make developers more aware of what kind of carbon footprint their code tangibly has on the environment. And so we've started two things inside the company. We've started a foundation that we call the Carbon Conscious Computing Consortium—the four C's. We're going to announce that publicly next year, we're going to invite folks to come and join us and be a part of it.
The second thing that we're doing is we're building a completely open-source, carbon-conscious computing platform that is built on real data that we're collecting about, to start with, how Macrometa's platform emits carbon in response to different types of things you build on it. So for example, you wrote a query that hits our database and queries, you know, I don't know, 20 billion objects inside of our database. It'll tell you exactly how many micrograms or how many milligrams of carbon—it's an estimate; not exactly. I got to learn to throttle myself down. It's an estimate, you know, you can't really measure these things exactly because the cost of carbon is different in different places, you know, there are different technologies, et cetera.
Gives you a good decent estimate, something that reliably tells you, “Hey, you know that query that you have over there, that piece of SQL? That's probably going to do this many micrograms of carbon at this scale.” You know, if this query was called a million times every hour, this is how much it costs. A million times a day, this is how much it costs and things like that.
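[Editor's sketch] The scaling Chetan describes, from a per-query estimate to a total at a given call rate, is simple arithmetic. The Python below is only an illustration under stated assumptions: the 250-microgram figure, the function names, and the budget check are all hypothetical and do not come from Macrometa's tooling or any real API.

```python
# Hypothetical illustration: the per-call figure used below is invented,
# not a measurement from any real platform.
MICROGRAMS_PER_GRAM = 1_000_000


def total_grams(micrograms_per_call: float, calls: int) -> float:
    """Scale a per-call carbon estimate (micrograms) to grams for `calls` calls."""
    return micrograms_per_call * calls / MICROGRAMS_PER_GRAM


def within_budget(micrograms_per_call: float, calls: int, budget_grams: float) -> bool:
    """Return True if the estimated total stays at or under the carbon budget."""
    return total_grams(micrograms_per_call, calls) <= budget_grams


if __name__ == "__main__":
    estimate_ug = 250.0  # assumed estimate for one query, in micrograms

    # "A million times every hour": hourly total, then a full day at that rate.
    hourly = total_grams(estimate_ug, 1_000_000)
    print(f"1M calls/hour: {hourly} g per hour, {hourly * 24} g per day")

    # "A million times a day": one day's total.
    print(f"1M calls/day: {total_grams(estimate_ug, 1_000_000)} g per day")

    # A budget check of the kind described later in the conversation.
    print("within 300 g budget:", within_budget(estimate_ug, 1_000_000, 300.0))
```

Because the per-call number is itself only an estimate, as Chetan stresses, any total computed this way inherits that uncertainty.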
But the most important thing that I feel passionate about is that when we give developers visibility, they do good things.
I mean, when we give them good debugging tools, the code gets better, the code gets faster, the code gets more efficient. And Corey, you're in the business of helping people save money; when we give them good visibility into how much their code costs to run, they make the code more efficient. So we're doing the same thing with carbon, we know there's a cost to run your code, whether it's a function, a container, a query, what have you, every operation has a carbon cost. And we're on a mission to measure that and provide accurate tooling directly in our platform so that along with your debug lines, right, where you've got all these print statements that are spitting up stuff about what's happening there, we can also print out, you know, what did it cost in carbon.
And you can set budgets. You can basically say, “Hey, I want my application to consume this much of carbon.” And down the road, we'll have AI and ML models that will help us optimize your code to be able to fit within those carbon budgets, for example. I'm not a big fan of planting—you know, I love planting trees, but don't get me wrong, we live in California and those trees get burned down.
And I was reading this heartbreaking story about how we returned back into the atmosphere a giant amount of carbon because the forest reserve that had been planted, you know, that was capturing carbon, you know, essentially got burned down in a forest fire.
So, you know, we're trying to just basically say, let's try and reduce the amount of carbon, you know, that we can potentially create by having better tooling.
Corey: That would be amazing, and I think it also requires something that I guess acts almost as an exchange where there's a centralized voice that can make sure that, well, one, the provider is being honest, and two, being able to ensure you're doing an apples-to-apples comparison and not just discounting a whole lot of negative externalities. Because, yes, we're talking about carbon released into the environment. Okay, great. What about water effects from what's happening where your data centers are located? That can have significant climate impact as well. It's about trying to avoid the picking and choosing. It's a hard, hard problem, but I'm unconvinced that there's anything more critical in the entire ecosystem right now to worry about.
Chetan: So as a startup, we care very deeply about starting with the carbon part. And I agree, Corey, it's a multi-dimensional problem; there's lots of tentacles. The hydrocarbon industry goes very deeply into all parts of our lives. I'm a startup, what do I know? I can't solve all of those things, but I wanted to start with the philosophy that if we provide developers with the right tooling, they'll have the right incentives then to write better code. And as we open-source more of what we learn and, you know, our tooling, others will do the same. And I think in ten years, we might have better answers. But someone's got to start somewhere, and this is where we'd like to start.
Corey: I really want to thank you for taking as much time as you have for going through what you're up to and how you view the world. If people want to learn more, where's the best place to find you?
Chetan: Yes, so two things on that front. Go to www.macrometa.com—M-A-C-R-O-M-E-T-A dot com—and that's our website. And you can come and experience the full power of the platform.
We've got a playground where you can come, open an account and build anything you want for free, and you can try and learn. You just can't run it in production because we've got a giant network, as I said, of 175 cities around the world. But there are tiers available for you to purchase and build and run apps. Like I think about 80 different customers, some of the biggest ones in the world, some of the biggest telecom customers, retail, E-Tail customers, [unintelligible 00:34:28] tiny startups are building some interesting things on.
And the second thing I want to talk about is November 7th through 11th of 2022, just a couple of weeks—or maybe by the time this recording comes out, a week from now—is developer week at Macrometa. And we're going to be announcing some really interesting new capabilities, some new features like real-time complex event processing with low, ultra-low latency, data connectors, a search feature that allows you to build search directly on top of your applications without needing to spin up a giant Elastic Cloud Search cluster, or providing search locally and regionally so that, you know, you can have search running in 25 cities that are instant to search rather than sending all your search requests back in one location. There's all kinds of very cool things happening over there.
And we're also announcing a partnership with the original, the OG of the edge, one of the largest, most impressive, interesting CDN players that has become a partner for us as well. And then we're also announcing some very interesting experimental work where you as a developer can build apps directly on the 5G telecom cloud as well. And then you'll hear from some interesting companies that are building apps that are edge-native, that are impossible to build in the cloud because they take advantage of these three things that we talked about: geography, latency, and data protection in some very, very powerful ways.
So you'll hear actual customer case studies from real customers in the flesh, not anonymous BS, no marchitecture. It's a week-long of technical talk by developers, for developers. And so, you know, come and join the fun and let's learn all about the edge together, and let's go build something together that's impossible to do today.
Corey: And we will, of course, put links to that in the [show notes 00:36:06]. Thank you so much for being so generous with your time. I appreciate it.
Chetan: My pleasure, Corey. Like I said, you're one of my heroes. I've always loved your work. The Snark-as-a-Service is a trillion-dollar market cap company. If you're ever interested in taking that public, I know some investors that I'd happily put you in touch with. But—
Corey: Sadly, so many of those investors lack senses of humor.
Chetan: [laugh]. That is true. That is true [laugh].
Corey: [laugh]. [sigh].
Chetan: Well, thank you. Thanks again for having me.
Corey: Thank you. Chetan Venkatesh, CEO and co-founder at Macrometa. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment about why we should build everything on the cloud provider that you work for and then attempt to challenge Chetan for the title of Edgelord.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.
Datadog introduced a number of new products and enhancements at their annual user conference last week. Expectations were high coming into the event, as Datadog often holds back new releases for several months in order to roll out a parade of goodies. Once again, they stepped up the pace, highlighting 18 separate product announcements versus 10 at Dash a year ago. The breadth and scope of their product reach continues to expand.
About Kevin
Kevin Miller is currently the global General Manager for Amazon Simple Storage Service (S3), an object storage service that offers industry-leading scalability, data availability, security, and performance. Prior to this role, Kevin has had multiple leadership roles within AWS, including as the General Manager for Amazon S3 Glacier, Director of Engineering for AWS Virtual Private Cloud, and engineering leader for AWS Virtual Private Network and AWS Direct Connect. Kevin was also Technical Advisor to the Senior Vice President for AWS Utility Computing. Kevin is a graduate of Carnegie Mellon University with a Bachelor of Science in Computer Science.
Links Referenced:
snark.cloud/shirt: https://snark.cloud/shirt
aws.amazon.com/s3: https://aws.amazon.com/s3
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for modern infrastructure and applications at every scale. Datadog enables teams to see everything: dashboarding, alerting, application performance monitoring, infrastructure monitoring, UX monitoring, security monitoring, dog logos, and log management, in one tightly integrated platform. With 600-plus out-of-the-box integrations with technologies including all major cloud providers, databases, and web servers, Datadog allows you to aggregate all your data into one platform for seamless correlation, allowing teams to troubleshoot and collaborate together in one place, preventing downtime and enhancing performance and reliability.
Get started with a free 14-day trial by visiting datadoghq.com/screaminginthecloud, and get a free t-shirt after installing the agent.Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's GO M-O-M-E-N-T-O dot co slash screaming.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Right now, as I record this, we have just kicked off our annual charity t-shirt fundraiser. This year's shirt showcases S3 as the eighth wonder of the world. And here to either defend or argue the point—we're not quite sure yet—is Kevin Miller, AWS's vice president and general manager for Amazon S3. Kevin, thank you for agreeing to suffer the slings and arrows that are no doubt going to be interpreted, misinterpreted, et cetera, for the next half hour or so.Kevin: Oh, Corey, thanks for having me. And happy to do that, and really flattered for you to be thinking about S3 in this way. So more than happy to chat with you.Corey: It's absolutely one of those services that is foundational to the cloud. It was the first AWS service that was put into general availability, although the beta folks are going to argue back and forth about no, no, that was SQS instead. I feel like now that Mai-Lan handles both SQS and S3 as part of her portfolio, she is now the final arbiter of that. I'm sure that's an argument for a future day. But it's impossible to imagine cloud without S3.Kevin: I definitely think that's true. It's hard to imagine cloud, actually, with many of our foundational services, including SQS, of course, but we are—yes, we were the first generally available service with S3. 
And pretty happy with our anniversary being Pi Day, 3/14.Corey: I'm also curious, your own personal trajectory has been not necessarily what folks would expect. You were the general manager of Amazon Glacier, and now you're the general manager and vice president of S3. So, I've got to ask, because there are conflicting reports on this depending upon what angle you look at, are Glacier and S3 the same thing?Kevin: Yes, I was the general manager for S3 Glacier prior to coming over to S3 proper, and the answer is no, they are not the same thing. We certainly have a number of technologies where we're able to use those technologies both on S3 and Glacier, but there are certainly a number of things that are very distinct about Glacier and give us that ability to hit the ultra-low price points that we do for Glacier Deep Archive being as low as $1 per terabyte-month. And so, that definitely—there's a lot of actual ingenuity up and down the stack, from hardware to software, everywhere in between, to really achieve that with Glacier. But then there's other spots where S3 and Glacier have very similar needs, and then, of course, today many customers use Glacier through S3 as a storage class in S3, and so that's a great way to do that. So, there's definitely a lot of shared code, but certainly, when you get into it, there's [unintelligible 00:04:59] to both of them.Corey: I ran a number of obnoxiously detailed financial analyses, and they all came away with, unless you have a very specific very nuanced understanding of your data lifecycle and/or it is less than 30 or 60 days depending upon a variety of different things, the default S3 storage class you should be using for virtually anything is Intelligent Tiering. That is my purely economic analysis of it. Do you agree with that? Disagree with that? 
And again, I understand that all of these storage classes are like your children, and I am inviting you to tell me which one of them is your favorite, but I'm absolutely prepared to do that.
Kevin: Well, we love Intelligent Tiering because it is very simple; customers are able to automatically save money using Intelligent Tiering for data that's not being frequently accessed. And actually, since we launched it a few years ago, we've already saved customers more than $250 million using Intelligent Tiering. So, I would say today, it is our default recommendation in almost every case. I think that the cases where we would recommend another storage class as the primary storage class tend to be specific to the use case where—and particularly for use cases where customers really have a good understanding of the access patterns. And we see some customers where, for a certain dataset, they know that it's going to be heavily accessed for a fixed period of time, or this data is actually for archival, it'll never be accessed, or very rarely if ever accessed, just maybe in an emergency.
And in those kinds of use cases, I think actually, customers are probably best to choose one of the specific storage classes where they're, sort of, paying that lower cost from day one. But again, I would say for the vast majority of cases that we see, the data access patterns are unpredictable and customers like the flexibility of being able to very quickly retrieve the data if they decide they need to use it. But in many cases, they'll save a lot of money as the data is not being accessed, and so, Intelligent Tiering is a great choice for those cases.
Corey: I would take it a step further and say that even when customers believe that they are going to be doing a deeper analysis and they have a better understanding of their data flow patterns than Intelligent Tiering would, in practice, I see that they rarely do anything about it. 
It's one of those things where they're like, “Oh, yeah, we're going to set up our own lifecycle policies real soon now,” whereas, just switch it over to Intelligent Tiering and never think about it again. People's time is worth so much more than the infrastructure they're working on in almost every case. It doesn't seem to make a whole lot of sense unless you have a very intentioned, very urgent reason to go and do that stuff by hand in most cases.Kevin: Yeah, that's right. I think I agree with you, Corey. And certainly, that is the recommendation we lead with customers.Corey: In previous years, our charity t-shirt has focused on other areas of AWS, and one of them was based upon a joke that I've been telling for a while now, which is that the best database in the world is Route 53 and storing TXT records inside of it. I don't know if I ever mentioned this to you or not, but the first iteration of that joke was featuring around S3. The challenge that I had with it is that S3 Select is absolutely a thing where you can query S3 with SQL which I don't see people doing anymore because Athena is the easier, more, shall we say, well-articulated version of all of that. And no, no, that joke doesn't work because it's actually true. You can use S3 as a database. Does that statement fill you with dread? Regret? Am I misunderstanding something? Or are you effectively running a giant subversive database?Kevin: Well, I think that certainly when most customers think about a database, they think about a collection of technology that's applied for given problems, and so I wouldn't count S3 as providing the whole range of functionality that would really make up a database. But I think that certainly a lot of the primitives and S3 Select as a great example of a primitive are available in S3. And we're looking at adding, you know, additional primitives going forward to make it possible to, you know, to build a database around S3. 
And as you see, other AWS services have done that in many ways. For example, obviously with Amazon Redshift having a lot of capability now to just directly access and use data in S3 and make that a super seamless so that you can then run data warehousing type queries on top of S3 and on top of your other datasets.So, I certainly think it's a great building block. And one other thing I would actually just say that you may not know, Corey, is that one of the things over the last couple of years we've been doing a lot more with S3 is actually working to directly contribute improvements to open-source connector software that uses S3, to make available automatically some of the performance improvements that can be achieved either using both the AWS SDK, and also using things like S3 Select. So, we started with a few of those things with Select; you're going to see more of that coming, most likely. And some of that, again, the idea there as you may not even necessarily know you're using Select, but when we can identify that it will improve performance, we're looking to be able to contribute those kinds of improvements directly—or we are contributing those directly to those open-source packages. So, one thing I would definitely recommend customers and developers do is have a capability of sort of keeping that software up-to-date because although it might seem like those are sort of one-and-done kind of software integrations, there's actually almost continuous improvement now going on, and around things like that capability, and then others we come out with.Corey: What surprised me is just how broadly S3 has been adopted by a wide variety of different clients' software packages out there. Back when I was running production environments in anger, I distinctly remember in one Ubuntu environment, we wound up installing a specific package that was designed to teach apt how to retrieve packages and its updates from S3, which was awesome. 
I don't see that anymore, just because it seems that it is so easy to do it now, just with the native features that S3 offers, as well as an awful lot of software under the hood has learned to directly recognize S3 as its own thing, and can react accordingly.Kevin: And just do the right thing. Exactly. No, we certainly see a lot of that. So that's, you know—I mean, obviously making that simple for end customers to use and achieve what they're trying to do, that's the whole goal.Corey: It's always odd to me when I'm talking to one of my clients who is looking to understand and optimize their AWS bill to see outliers in either direction when it comes to S3 itself. When they're driving large S3 bills as in a majority of their spend, it's, okay, that is very interesting. Let's dive into that. But almost more interesting to me is when it is effectively not being used at all. When, oh, we're doing everything with EBS volumes or EFS.And again, those are fine services. I don't have any particular problem with them anymore, but the problem I have is that the cloud long ago took what amounts to an economic vote. There's a tax savings for storing data in an object store the way that you—and by extension, most of your competitors—wind up pricing this, versus the idea of on a volume basis where you have to pre-provision things, you don't get any form of durability that extends beyond the availability zone boundary. It just becomes an awful lot of, “Well, you could do it this way. But it gets really expensive really quickly.”It just feels wild to me that there is that level of variance between S3 just sort of raw storage basis, economically, as well as then just the, frankly, ridiculous levels of durability and availability that you offer on top of that. How did you get there? Was the service just mispriced at the beginning? Like oh, we dropped to zero and probably should have put that in there somewhere.Kevin: Well, no, I wouldn't call it mispriced. 
I think that S3 came about when we took a—we spent a lot of time looking at the architecture for storage systems, and knowing that we wanted a system that would provide the durability that comes with having three completely independent data centers and the elasticity and capability where, you know, customers don't have to provision the amount of storage they want, they can simply put data and the system keeps growing. And they can also delete data and stop paying for that storage when they're not using it. And so, just all of that investment and sort of looking at that architecture holistically led us down the path to where we are with S3.
And we've definitely talked about this. In fact, in Peter's keynote at re:Invent last year, we talked a little bit about how the system is designed under the hood, and one of the things you realize is that S3 gets a lot of the benefits that we do by just the overall scale. The fact that it is—I think the stat is that at this point more than 10,000 customers have data that's stored on more than a million hard drives in S3. And the way you get that scale and capability is through massive parallelization. Where customers that are, you know, I would say building more traditional architectures, those are inherently typically much more siloed architectures with a relatively small scale overall, and it ends up with a lot of resource that's provisioned at small scale in sort of small chunks with each resource, so that you never get to that scale where you can start to take advantage of the sum being greater than the parts.
And so, I think that's what the recognition was when we started out building S3. And then, of course, we offer that as an API on top of that, where customers can consume whatever they want. 
That is, I think, where S3, at the scale it operates, is able to do certain things, including on the economics, that are very difficult or even impossible to do at a much smaller scale.Corey: One of the more egregious clown-shoe statements that I hear from time to time has been when people will come to me and say, “We've built a competitor to S3.” And my response is always one of those, “Oh, this should be good.” Because when people say that, they generally tend to be focusing on one or maybe two dimensions that doesn't work for a particular use case as well as it could. “Okay, what was your story around why this should be compared to S3?” “Well, it's an object store. It has full S3 API compatibility.” “Does it really because I have to say, there are times where I'm not entirely convinced that S3 itself has full compatibility with the way that its API has been documented.”And there's an awful lot of magic that goes into this too. “Okay, great. You're running an S3 competitor. Great. How many buildings does it live in?” Like, “Well, we have a problem with the s at the end of that word.” It's, “Okay, great. If it fits on my desk, it is not a viable S3 competitor. If it fits in a single zip code, it is probably not a viable S3 competitor.” Now, can it be an object store? Absolutely. Does it provide a new interface to some existing data someone might have? Sure why not. But I think that, oh, it's S3 compatible, is something that gets tossed around far too lightly by folks who don't really understand what it is that drives S3 and makes it special.Kevin: Yeah, I mean, I would say certainly, there's a number of other implementations of the S3 API, and frankly we're flattered that customers recognize and our competitors and others recognize the simplicity of the API and go about implementing it. 
But to your point, I think that there's a lot more; it's not just about the API, it's really around everything surrounding S3 from, as you mentioned, the fact that the data in S3 is stored in three independent availability zones, all of which that are separated by kilometers from each other, and the resilience, the automatic failover, and the ability to withstand an unlikely impact to one of those facilities, as well as the scalability, and you know, the fact that we put a lot of time and effort into making sure that the service continues scaling with our customers need. And so, I think there's a lot more that goes into what is S3. And oftentimes just in a straight-up comparison, it's sort of purely based on just the APIs and generally a small set of APIs, in addition to those intangibles around—or not intangibles, but all of the ‘-ilities,' right, the elasticity and the durability, and so forth that I just talked about. In addition to all that also, you know, certainly what we're seeing for customers is as they get into the petabyte and tens of petabytes, hundreds of petabytes scale, their need for the services that we provide to manage that storage, whether it's lifecycle and replication, or things like our batch operations to help update and to maintain all the storage, those become really essential to customers wrapping their arms around it, as well as visibility, things like Storage Lens to understand, what storage do I have? Who's using it? How is it being used?And those are all things that we provide to help customers manage at scale. And certainly, you know, oftentimes when I see claims around S3 compatibility, a lot of those advanced features are nowhere to be seen.Corey: I also want to call out that a few years ago, Mai-Lan got on stage and talked about how, to my recollection, you folks have effectively rebuilt S3 under the hood into I think it was 235 distinct microservices at the time. There will not be a quiz on numbers later, I'm assuming. 
But what was wild to me about that is having done that for services that are orders of magnitude less complex, it absolutely is like changing the engine on a car without ever slowing down on the highway. Customers didn't know that any of this was happening until she got on stage and announced it. That is wild to me. I would have said before this happened that there was no way that would have been possible except it clearly was. I have to ask, how did you do that in the broad sense?Kevin: Well, it's true. A lot of the underlying infrastructure that's been part of S3, both hardware and software is, you know, you wouldn't—if someone from S3 in 2006 came and looked at the system today, they would probably be very disoriented in terms of understanding what was there because so much of it has changed. To answer your question, the long and short of it is a lot of testing. In fact, a lot of novel testing most recently, particularly with the use of formal logic and what we call automated reasoning. It's also something we've talked a fair bit about in re:Invent.And that is essentially where you prove the correctness of certain algorithms. And we've used that to spot some very interesting, the one-in-a-trillion type cases that S3 scale happens regularly, that you have to be ready for and you have to know how the system reacts, even in all those cases. I mean, I think one of our engineers did some calculations that, you know, the number of potential states for S3, sort of, exceeds the number of atoms in the universe or something so crazy. But yet, using methods like automated reasoning, we can test that state space, we can understand what the system will do, and have a lot of confidence as we begin to swap, you know, pieces of the system.And of course, nothing in S3 scale happens instantly. 
It's all, you know, I would say that for a typical engineering effort within S3, there's a certain amount of effort, obviously, in making the change or in preparing the new software, writing the new software and testing it, but there's almost an equal amount of time that goes into, okay, and what is the process for migrating from System A to System B, and that happens over a timescale of months, if not years, in some cases. And so, there's just a lot of diligence that goes into not just the new systems, but also the process of, you know, literally, how do I swap that engine on the system. So, you know, it's a lot of really hard working engineers that spent a lot of time working through these details every day.Corey: I still view S3 through the lens of it is one of the easiest ways in the world to wind up building a static web server because you basically stuff the website files into a bucket and then you check a box. So, it feels on some level though, that it is about as accurate as saying that S3 is a database. It can be used or misused or pressed into service in a whole bunch of different use cases. What have you seen from customers that has, I guess, taught you something you didn't expect to learn about your own service?Kevin: Oh, I'd say we have those [laugh] meetings pretty regularly when customers build their workloads and have unique patterns to it, whether it's the type of data they're retrieving and the access pattern on the data. You know, for example, some customers will make heavy use of our ability to do [ranged gets 00:22:47] on files and [unintelligible 00:22:48] objects. And that's pretty good capability, but that can be one where that's very much dependent on the type of file, right, certain files have structure, as far as you know, a header or footer, and that data is being accessed in a certain order. Oftentimes, those may also be multi-part objects, and so making use of the multi-part features to upload different chunks of a file in parallel. 
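As a rough illustration of the ranged GETs and multi-part chunking mentioned here, the sketch below splits an object of known size into fixed-size byte ranges and formats each one as an HTTP `Range` header value, the form a ranged GET request carries. The function names, the 8 MiB part size, and the 20 MiB object size are assumptions chosen for illustration and do not come from the episode; the code only computes ranges and makes no calls to S3.

```python
from typing import List, Tuple


def split_byte_ranges(object_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte offsets that together cover the object."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        # The last range may be shorter than part_size.
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Hypothetical sizes: a 20 MiB object fetched in 8 MiB pieces.
    for start, end in split_byte_ranges(20 * 1024 * 1024, 8 * 1024 * 1024):
        print(range_header(start, end))
```

Each range is independent of the others, which is what allows a client to request or upload the pieces in parallel.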
And you know, also certainly when customers get into things like our batch operations capability where they can literally write a Lambda function and do what they want, you know, we've seen some pretty interesting use cases where customers are running large-scale operations across, you know, billions, sometimes tens of billions of objects, and this can be pretty interesting as far as what they're able to do with them.
So, for something that is, in some sense, as simple and basic as a GET and PUT API, just all the capability around it ends up being pretty interesting as far as how customers apply it and the different workloads they run on it.
Corey: So, if you squint hard enough, what I'm hearing you tell me is that I can view all of this as, “Oh, yeah. S3 is also compute.” And that feels like a fast track to getting a question wrong on one of the certification exams. But I have to ask, from your point of view, is S3 storage? And whether it's yes or no, what gets you excited about the space that it's in?
Kevin: Yeah well, I would say S3 is not compute, but we have some great compute services that are very well integrated with S3, which excites me, as well as things like S3 Object Lambda, where we actually handle that integration with Lambda. So, you're writing Lambda functions, we're executing them on the GET path. And so, that's a pretty exciting feature for me. But you know, to sort of take a step back, what excites me is I think that customers around the world, in every industry, are really starting to recognize the value of data and data at large scale. 
You know, I think that actually many customers in the world have terabytes or more of data that sort of flows through their fingers every day that they don't even realize.And so, as customers realize what data they have, and they can capture and then start to analyze and make ultimately make better business decisions that really help drive their top line or help them reduce costs, improve costs on whether it's manufacturing or, you know, other things that they're doing. That's what really excites me is seeing those customers take the raw capability and then apply it to really just to transform how they not just how their business works, but even how they think about the business. Because in many cases, transformation is not just a technical transformation, it's people and cultural transformation inside these organizations. And that's pretty cool to see as it unfolds.Corey: One of the more interesting things that I've seen customers misunderstand, on some level, has been a number of S3 releases that focus around, “Oh, this is for your data lake.” And I've asked customers about that. “So, what's your data lake strategy?” “Well, we don't have one of those.” “You have, like, eight petabytes and climbing in S3? What do you call that?” It's like, “Oh, yeah, that's just a bunch of buckets we dump things into. Some are logs of our assets and the rest.” It's—Kevin: Right.Corey: Yeah, it feels like no one thinks of themselves as having anything remotely resembling a structured place for all of the data that accumulates at a company.Kevin: Mm-hm.Corey: There is an evolution of people learning that oh, yeah, this is in fact, what it is that we're doing, and this thing that they're talking about does apply to us. But it almost feels like a customer communication challenge, just because, I don't know about you, but with my legacy AWS account, I have dozens of buckets in there that I don't remember what the heck they're for. 
Fortunately, you folks don't charge by the bucket, so I can smile, nod, remain blissfully ignorant, but it does make me wonder from time to time.Kevin: Yeah, no, I think that what you hear there is actually pretty consistent with what the reality is for a lot of customers, which is in distributed organizations, I think that's bound to happen, you have different teams that are working to solve problems, and they are collecting data to analyze, they're creating result datasets and they're storing those datasets. And then, of course, priorities can shift, and you know, and there's not necessarily the day-to-day management around data that we might think would be expected. I feel [we 00:26:56] sort of drew an architecture on a whiteboard. And so, I think that's the reality we are in. And we will be in, largely forever.I mean, I think that at a smaller-scale, that's been happening for years. So, I think that, one, I think that there's a lot of capability just being in the cloud. At the very least, you can now start to wrap your arms around it, right, where used to be that it wasn't even possible to understand what all that data was because there's no way to centrally inventory it well. In AWS with S3, with inventory reports, you can get a list of all your storage and we are going to continue to add capability to help customers get their arms around what they have, first off; understand how it's being used—that's where things like Storage Lens really play a big role in understanding exactly what data is being accessed and not. 
We're definitely listening to customers carefully around this, and I think when you think about broader data management story, I think that's a place that we're spending a lot of time thinking right now about how do we help customers get their arms around it, make sure that they know what's the categorization of certain data, do I have some PII lurking here that I need to be very mindful of?And then how do I get to a world where I'm—you know, I won't say that it's ever going to look like the perfect whiteboard picture you might draw on the wall. I don't think that's really ever achievable, but I think certainly getting to a point where customers have a real solid understanding of what data they have and that the right controls are in place around all that data, yeah, I think that's directionally where I see us heading.Corey: As you look around how far the service has come, it feels like, on some level, that there were some, I guess, I don't want to say missteps, but things that you learned as you went along. Like, back when the service was in beta, for example, there was no per-request charge. To my understanding that was changed, in part because people were trying to use it as a file system, and wow, that suddenly caused a tremendous amount of load on some of the underlying systems. You originally launched with a BitTorrent endpoint as an option so that people could download through peer-to-peer approaches for large datasets and turned out that wasn't really the way the internet evolved, either. And I'm curious, if you were to have to somehow build this off from scratch, are there any other significant changes you would make in how the service was presented to customers in how people talked about it in the early days? 
Effectively given a mulligan, what would you do differently?Kevin: Well, I don't know, Corey, I mean, just given where it's grown to in macro terms, you know, I definitely would be worried taking a mulligan, you know, that I [laugh] would change the sort of the overarching trajectory. Certainly, I think there's a few features here and there where, for whatever reason, it was exciting at the time and really spoke to what customers at the time were thinking, but over time, you know, sort of quickly those needs move to something a little bit different. And, you know, like you said things like the BitTorrent support is one where, at some level, it seems like a great technical architecture for the internet, but certainly not something that we've seen dominate in the way things are done. Instead, you know, we've largely kind of have a world where there's a lot of caching layers, but it still ends up being largely client-server kind of connections. So, I don't think I would do a—I certainly wouldn't do a mulligan on any of the major functionality, and I think, you know, there's a few things in the details where obviously, we've learned what really works in the end. I think we learned that we wanted bucket names to really strictly conform to rules for DNS encoding. So, that was the change that was made at some point. And we would tweak that, but no major changes, certainly.Corey: One subject of some debate while we were designing this year's charity t-shirt—which, incidentally, if you're listening to this, you can pick up for yourself at snark.cloud/shirt—was the is S3 itself dependent upon S3? Because we know that every other service out there is as well, but it is interesting to come up with an idea of, “Oh, yeah. 
We're going to launch a whole new isolated region of S3 without S3 to lean on.” That feels like it's an almost impossible bootstrapping problem.Kevin: Well, S3 is not dependent on S3 to come up, and it's certainly a critical dependency tree that we look at and we track and make sure that we'd like to have an acyclic graph as we look at dependencies.Corey: That is such a sophisticated way to say what I learned the hard way when I was significantly younger and working in production environments: don't put the DNS servers needed to boot the hypervisor into VMs that require a working hypervisor. It's one of those oh, yeah, in hindsight, that makes perfect sense, but you learn it right after that knowledge really would have been useful.Kevin: Yeah, absolutely. And one of the terms we use for that, as well as is the idea of static stability, or that's one of the techniques that can really help with isolating a dependency is what we call static stability. We actually have an article about that in the Amazon Builder Library, which there's actually a bunch of really good articles in there from very experienced operations-focused engineers in AWS. So, static stability is one of those key techniques, but other techniques—I mean, just pure minimization of dependencies is one. And so, we were very, very thoughtful about that, particularly for that core layer.I mean, you know, when you talk about S3 with 200-plus microservices, or 235-plus microservices, I would say not all of those services are critical for every single request. Certainly, a small subset of those are required for every request, and then other services actually help manage and scale the kind of that inner core of services. And so, we look at dependencies on a service by service basis to really make sure that inner core is as minimized as possible. 
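To make the "acyclic graph" point concrete, here is a minimal sketch of checking a service dependency map for cycles with a depth-first search. The service names reuse Corey's DNS-and-hypervisor example, and the dictionary layout is a hypothetical one picked for illustration; nothing here reflects how AWS actually tracks S3's dependencies.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if the graph is acyclic."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we have looped back to it.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for service in deps:
        if state.get(service, UNVISITED) == UNVISITED:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Hypothetical map: the hypervisor needs DNS, DNS runs in a VM, the VM needs the hypervisor.
    broken = {"hypervisor": ["dns"], "dns": ["vm"], "vm": ["hypervisor"]}
    print(find_cycle(broken))
```

A check like this only flags a loop; breaking it still means deciding which service must be able to start without the others.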
And then the outer layers can start to take some dependencies once you have that basic functionality up.Corey: I really want to thank you for being as generous with your time as you have been. If people want to learn more about you and about S3 itself, where should they go—after buying a t-shirt, of course.Kevin: Well, certainly buy the t-shirt. First, I love the t-shirts and the charity that you work with to do that. Obviously, for S3, it's aws.amazon.com/s3. And you can actually learn more about me. I have some YouTube videos, so you can search for me on YouTube and kind of get a sense of myself.Corey: We will put links to that into the show notes, of course. Thank you so much for being so generous with your time. I appreciate it.Kevin: Absolutely. Yeah. Glad to spend some time. Thanks for the questions, Corey.Corey: Kevin Miller, vice president and general manager for Amazon S3. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, ignorant comment talking about how your S3 compatible service is going to blow everyone's socks off when it fails.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.
About Victor
Victor is an independent Senior Cloud Infrastructure Architect working mainly on Amazon Web Services (AWS), designing secure, scalable, reliable, and cost-effective cloud architectures for large-scale, mission-critical distributed systems. He also has long experience in Cloud Operations, Security Advisory, Security Hardening (DevSecOps), Modern Applications Design, Microservices and Serverless, Infrastructure Refactoring, and Cost Saving (FinOps).
Links Referenced:
Zoph: https://zoph.io/
unusd.cloud: https://unusd.cloud
Twitter: https://twitter.com/zoph
LinkedIn: https://www.linkedin.com/in/grenuv/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500 plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third party services in a single pane of glass.
Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14 day trial and get a complimentary T-shirt when you install the agent.
To learn more, visit datadoghq.com/screaminginthecloud. That's www.datadoghq.com/screaminginthecloud.
Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. 
I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming That's GO M-O-M-E-N-T-O dot co slash screamingCorey: Welcome to Screaming in the Cloud. I'm Corey Quinn. One of the best parts about running a podcast like this and trolling the internet of AWS things is every once in a while, I get to learn something radically different than what I expected. For a long time, there's been this sort of persona or brand in the AWS space, specifically the security side of it, going by Zoph—that's Z-O-P-H—and I just assumed it was a collective or a whole bunch of people working on things, and it turns out that nope, it is just one person. And that one person is my guest today. Victor Grenu is an independent AWS architect. Victor, thank you for joining me.Victor: Hey, Corey, thank you for having me. It's a pleasure to be here.Corey: So, I want to start by diving into the thing that first really put you on my radar, though I didn't realize it was you at the time. You have what can only be described as an army of Twitter bots around the AWS ecosystem. And I don't even know that I'm necessarily following all of them, but what are these bots and what do they do?Victor: Yeah. I have a few bots on Twitter that I push some notification, some tweets, when things happen on AWS security space, especially when the AWS managed policies are updated from AWS. And it comes from an initial project from Scott Piper. He was running a Git command on his own laptop to push the history of AWS managed policy. And it told me that I can automate this thing using a deployment pipeline and so on, and to tweet every time a new change is detected from AWS. 
So, the idea is to monitor every change on these policies.Corey: It's kind of wild because I built a number of somewhat similar Twitter bots, only instead of trying to make them into something useful, I'd make them into something more than a little bit horrifying and extraordinarily obnoxious. Like there's a Cloud Boomer Twitter account that winds up tweeting every time Azure tweets something only it quote-tweets them in all caps and says something insulting. I have an AWS releases bot called AWS Cwoud—so that's C-W-O-U-D—and that winds up converting it to OwO speak. It's like, “Yay a new auto-scawowing growp.” That sort of thing is obnoxious and offensive, but it makes me laugh.Yours, on the other hand, are things that I have notifications turned on for just because when they announce something, it's generally fairly important. The first one that I discovered was your IAM changes bot. And I found some terrifying things coming out of that from time to time. What's the data source for that? Because I'm just grabbing other people's Twitter feeds or RSS feeds; you're clearly going deeper than that.Victor: Yeah, the data source is the official AWS managed policy. In fact, I run AWS CLI in the background and I'm doing just a list policy, the list policy command, and with this list I'm doing git of each policy that is returned, so I can enter it in a git repository to get the full history of the time. And I also craft a list of deprecated policy, and I also run, like, a dog-food initiative, the policy analysis, validation analysis from AWS tools to validate the consistency and the accuracy of the own policies. So, there is a policy validation with their own tool. [laugh].Corey: You would think that wouldn't turn up anything because their policy validator effectively acts as a linter, so if it throws an error, of course, you wouldn't wind up pushing that. 
And yet, somehow the fact that you have bothered to hook that up and have findings from it indicates that that's not how the real world works.Victor: Yeah, there is some, let's say, some false positive because we are running the policy validation with their own linter then own policies, but this is something that is documented from AWS. So, there is an official page where you can find why the linter is not working on each policy and why. There is a an explanation for each findings. I thinking of [unintelligible 00:05:05] managed policy, which is too long, and policy analyzer is crashing because the policy is too long.Corey: Excellent. It's odd to me that you have gone down this path because it's easy enough to look at this and assume that, oh, this must just be something you do for fun or as an aspect of your day job. So, I did a little digging into what your day job is, and this rings very familiar to me: you are an independent AWS consultant, only you're based out of Paris, whereas I was doing this from San Francisco, due to an escalatingly poor series of life choices on my part. What do you focus on in the AWS consulting world?Victor: Yeah. I'm running an AWS consulting boutique in Paris and I'm working for a large customer in France. And I'm doing mostly infrastructure stuff, infrastructure design for cloud-native application, and I'm also doing some security audits and [unintelligible 00:06:07] mediation for my customer.Corey: It seems to me that there's a definite divide as far as how people find the AWS consulting experience to be. And I'm not trying to cast judgment here, but the stories that I hear tend to fall into one of two categories. One of them is the story that you have, where you're doing this independently, you've been on your own for a while working specifically on this, and then there's the stories of, “Oh, yeah, I work for a 500 person consultancy and we do everything as long as they'll pay us money. If they've got money, we'll do it. 
Why not?”And it always seems to me—not to be overly judgy—but the independent consultants just seem happier about it because for better or worse, we get to choose what we focus on in a way that I don't think you do at a larger company.Victor: Yeah. It's the same in France or in Europe; there is a lot of consulting firms. But with the pandemic and with the market where we are working, in the cloud, in the cloud-native solution and so on, that there is a lot of demands. And the natural path is to start by working for a consulting firm and then when you are ready, when you have many AWS certification, when you have the experience of the customer, when you have a network of well-known customer, and you gain trust from your customer, I think it's natural to go by yourself, to be independent and to choose your own project and your own customer.Corey: I'm curious to get your take on what your perception of being an AWS consultant is when you're based in Paris versus, in my case, being based in the West Coast of the United States. And I know that's a bit of a strange question, but even when I travel, for example, over to the East Coast, suddenly, my own newsletter sends out three hours later in the day than I expect it to and that throws me for a loop. The AWS announcements don't come out at two or three in the afternoon; they come out at dinnertime. And for you, it must be in the middle of the night when a lot of those things wind up dropping. The AWS stuff, not my newsletter. I imagine you're not excitedly waiting on tenterhooks to see what this week's issue of Last Week in AWS talks about like I am.But I'm curious is that even beyond that, how do you experience the market? From what you're perceiving people in the United States talking about as AWS consultants versus what you see in Paris?Victor: It's difficult, but in fact, I don't have so much information about the independent in the US. I know that there is a lot, but I think it's more common in Europe. 
And yeah, it's an advantage to whoever ten-hour time [unintelligible 00:08:56] from the US because a lot of stuff happen on the Pacific time, on the Seattle timezone, on San Francisco timezone. So, for example, for this podcast, my Monday is over right now, so, so yeah, I have some advantage in time, but yeah.Corey: This is potentially an odd question for you. But I find an awful lot of the AWS documentation to be challenging, we'll call it. I don't always understand exactly what it's trying to tell me, and it's not at all clear that the person writing the documentation about a service in some cases has ever used the service. And in everything I just said, there is no language barrier. This documentation was written—theoretically—in English and I, most days, can stumble through a sentence in English and almost no other language. You obviously speak French as a first language. Given that you live in Paris, it seems to be a relatively common affliction. How do you find interacting with AWS in French goes? Or is it just a complete nonstarter, and it all has to happen in English for you?Victor: No, in fact, the consultants in Europe, I think—in fact, in my part, I'm using my laptop in English, I'm using my phone in English, I'm using the AWS console in English, and so on. So, the documentation for me is a switch on English first because for the other language, there is sometimes some automated translation that is very dangerous sometimes, so we all keep the documentation and the materials in English.Corey: It's wild to me just looking at how challenging so much of the stuff is. Having to then work in a second language on top of that, it just seems almost insurmountable to me. It's good they have automated translation for a lot of this stuff, but that falls down in often hilariously disastrous ways, sometimes. 
It's wild to me that even taking most programming languages that folks have ever heard of, even if you program and speak no English, which happens in a large part of the world, you're still using if statements even if the term ‘if' doesn't mean anything to you localized in your language. It really is, in many respects, an English-centric industry.Victor: Yeah. Completely. Even in French for our large French customer, I'm writing the PowerPoint presentation in English, some emails are in English, even if all the folks in the thread are French. So yeah.Corey: One other area that I wanted to explore with you a bit is that you are very clearly focused on security as a primary area of interest. Does that manifest in the work that you do as well? Do you find that your consulting engagements tend to have a high degree of focus on security?Victor: Yeah. In my design, when I'm doing some AWS architecture, my main objective is to design some security architecture and security patterns that apply best practices and least privilege. But often, I'm working for engagement on security audits, for startups, for internal customer, for diverse company, and then doing some accommodation after all. And to run my audit, I'm using some open-source tooling, some custom scripts, and so on. I have a methodology that I'm running for each customer. And the goal is to sometime to prepare some certification, PCI DSS or so on, or maybe to ensure that the best practice are correctly applied on a workload or before go-live or, yeah.Corey: One of the weird things about this to me is that I've said for a long time that cost and security tend to be inextricably linked, as far as being a sort of trailing reactive afterthought for an awful lot of companies. They care about both of those things right after they failed to adequately care about those things. 
At least in the cloud economic space, it's only money as opposed to, “Oops, we accidentally lost our customers' data.” So, I always found that I find myself drifting in a security direction if I don't stop myself, just based upon a lot of the cost work I do. Conversely, it seems that you have come from the security side and you find yourself drifting in a costing direction.
Your side project is a SaaS offering called unusd.cloud, that's U-N-U-S-D dot cloud. And when you first mentioned this to me, my immediate reaction was, “Oh, great. Another SaaS platform for costing. Let's tear this one apart, too.” Except I actually like what you're building. Tell me about it.
Victor: Yeah, and unusd.cloud is a side project for me and I have been working on it for, let's say, one year. It was a project that I've deployed for some of my customers on their local accounts, and it was very useful. And so, I was thinking that it could be a SaaS project. So, I've worked at [unintelligible 00:14:21] so yeah, a few months on shifting the product to SaaS [unintelligible 00:14:27].
The product aims to detect the waste on AWS accounts in all AWS regions, and it scans all your AWS accounts and all your regions, and it tries to detect unused EC2, RDS, Glue [unintelligible 00:14:45], SageMaker, and so on, and unattached EBS and so on. I don't craft a new dashboard, a new Cost Explorer, and so on. It's just cost awareness, it's just a notification on email or Slack or Microsoft Teams. And you just add your AWS account on the project and you schedule, let's say, once a day, and it scans, and it sends you a cost awareness, a [unintelligible 00:15:17] detection, and you can act by turning off what is not used.
Corey: What I like about this is it cuts at the number one rule of cloud economics, which is turn that shit off if you're not using it. You wouldn't think that I would need to say that except that everyone seems to be missing that, on some level. And it's easy to do. 
When you need to spin something up and it's not there, you're very highly incentivized to spin that thing up. When you're not using it, you have to remember that thing exists, otherwise it just sort of sits there forever and doesn't do anything.
It just costs money and doesn't generate any value in return for that. What you got right is you've also eviscerated my most common complaint about tools that claim to do this, which is you build in either an explicit rule of ignore this resource or ignore resources with the following tags. The benefit there is that you're not constantly giving me useless advice, like, “Oh, yeah, turn off this idle thing.” It's, yeah, that's there for a reason, maybe it's my dev box, maybe it's my backup site, maybe it's the entire DR environment that I'm going to need at little notice. It solves for that problem beautifully. And though a lot of tools out there claim to do stuff like this, most of them really fail to deliver on that promise.
Victor: Yeah, I just want to keep it simple. I don't want to add an additional console and so on. And you are correct. You can apply a simple tag on your asset, let's say an EC2 instance, you apply the tag in use and the value of, and then the alerting is disabled for this asset. And the detection is based on the CPU [unintelligible 00:17:01] and the network out metrics, so when the instance is not used in the last seven days, with a low CPU every [unintelligible 00:17:10] and low network out, it comes as a suspect. [laugh].
[midroll 00:17:17]
Corey: One thing that I like about what you've done, but also have some reservations about, is that you have not done what so many of these tools do, which is, “Oh, just give us all the access in your account. It'll be fine. You can trust us. Don't you want to save money?” And yeah, but I also still want to have a company left when all is said and done.
You are very specific on what it is that you're allowed to access, and it's great. 
I would argue, on some level, it's almost too restrictive. For example, you have the ability to look at EC2, Glue, IAM—just to look at account aliases, great—RDS, Redshift, and SageMaker. And all of these are simply list and describe. There's no gets in there other than in Cost Explorer, which makes sense. You're not able to go rummaging through my data and see what's there. But that also bounds you, on some level, to being able to look only at particular types of resources. Is that accurate or are you using a lot of the CloudWatch stuff and Cost Explorer stuff to see other areas?Victor: In fact, it's the least privilege and read-only permission because I don't want too much question for the security team. So, it's full read-only permission. And I've only added the detection that I'm currently supports. Then if in some weeks, in some months, I'm adding a new detection, let's say for Snapshot, for example, I will need to update, so I will ask my customer to update their template. There is a mechanisms inside the project to tell them that the template is obsolete, but it's not a breaking change.So, the detection will continue, but without the new detection, the new snapshot detection, let's say. So yeah, it's least privilege, and all I need is the get-metric-statistics from CloudWatch to detect unused assets. And also checking [unintelligible 00:19:16] Elastic IP or [unintelligible 00:19:19] EBS volume. So, there is no CloudWatching in this detection.Corey: Also, to be clear, I am not suggesting that what you have done is at all a mistake, even if you bound it to those resources right now. 
But just because everyone loves to talk about these exciting, amazing, high-level services that AWS has put up there, for example, oh, what about DocumentDB or all these other—you know, Amazon Basics MongoDB; same thing—or all of these other things that they wind up offering, but you take a look at where customers are spending money and where they're surprised to be spending money, it's EC2, it's a bit of RDS, occasionally it's S3, but that's a lot harder to detect automatically whether that data is unused. It's, “You haven't been using this data very much.” It's, “Well, you see how the bucket is labeled ‘Archive Backups' or ‘Regulatory Logs?'” imagine that. What a ridiculous concept.Yeah. Whereas an idle EC2 instance sort of can wind up being useful on this. I am curious whether you encounter in the wild in your customer base, folks who are having idle-looking EC2 instances, but are in fact, for example, using a whole bunch of RAM, which you can't tell from the outside without custom CloudWatch agents.Victor: Yeah, I'm not detecting this behavior for larger usage of RAM, for example, or for maybe there is some custom application that is low in CPU and don't talk to any other services using the network, but with this detection, with the current state of the detection, I'm covering large majority of waste because what I see from my customer is that there is some teams, some data scientists or data teams who are experimenting a lot with SageMaker with Glue, with Endpoint and so on. And this is very expensive at the end of the day because they don't turn off the light at the end of the day, on Friday evening. 
So, what I'm trying to solve here is to notify the team—so on Slack—when they forgot to turn off the most common waste on AWS, so EC2, RDS, Redshift.
Corey: I just now wound up installing it while we've been talking on my dedicated shitposting account, and sure enough, it already spat out a single instance it found, which yeah was running an EC2 instance on the East Coast when I was just there, so that I had a DNS server that was a little bit more local. Okay, great. And it's a T4g.micro, so it's not exactly a whole lot of money, but it does exactly what it says on the tin. It didn't wind up nailing the other instances I have in that account that I'm using for a variety of different things, which is good.
And it further didn't wind up falling into the trap that so many things do, which is the, “Oh, it's costing you zero and your spend this month is zero because this account is where I dump all of my AWS credit codes.” So many things say, “Oh, well, it's not costing you anything, so what's the problem?” And then that's how you accidentally lose $100,000 in Activate credits because someone left something running way too long. It does a lot of the right things that I would hope and expect it to do, and the fact that you don't do that is kind of amazing.
Victor: Yeah. It was a need from my customer and an opportunity. It's a small bet for me because I'm trying to do some small bets, you know, the small bets approach, so the idea is to try a new thing. It's also an excuse for me to learn something new because building a SaaS is challenging.
Corey: One thing that I am curious about, in this account, I'm also running the controller for my home WiFi environment. And that's not huge. It's T3.small, but it is still something out there that it sits there because I need it to exist. But it's relatively bored.
If I go back and look over the last week of CloudWatch metrics, for example, it doesn't look like it's usually busy. 
I'm sure there's some network traffic in and out as it updates itself and whatnot, but the CPU peaks out at a little under 2% used. It didn't warn on this and it got it right. I'm just curious as to how you did that. What is it looking for to determine whether this instance is unused or not?
Victor: It's the magic [laugh]. There is some intelligence artif—no, I'm just kidding. It's just statistics. And I'm getting two metrics, the CPU average from the last seven days and the network out. And I'm getting the average on those metrics and I'm doing some assumption that this EC2, this specific EC2 is not used because of these metrics, this server average.
Corey: Yeah, it is wild to me just that this is working as well as it is. It's just… like, it does exactly what I would expect it to do. It's clear that—and this is going to sound weird, but I'm going to say it anyway—that this was built from someone who was looking to answer the question themselves and not from the perspective of, “Well, we need to build a product and we have access to all of this data from the API. How can we slice and dice it and add some value as we go?” I really liked the approach that you've taken on this. I don't say that often or lightly, particularly when it comes to cloud costing stuff, but this is something I'll be using in some of my own nonsense.
Victor: Thanks. I appreciate it.
Corey: So, I really want to thank you for taking as much time as you have to talk about who you are and what you're up to. If people want to learn more, where can they find you?
Victor: Mainly on Twitter, my handle is @zoph [laugh]. And, you know, on LinkedIn or on my company website, at zoph.io.
Corey: And we will, of course, put links to that in the show notes. Thank you so much for your time today. I really appreciate it.
Victor: Thank you, Corey, for having me. It was a pleasure to chat with you.
Corey: Victor Grenu, independent AWS architect. 
I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that is going to cost you an absolute arm and a leg because invariably, you're going to forget to turn it off when you're done.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.
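The detection Victor describes in this episode (seven-day averages of CPU and network out, plus a tag that silences alerts for a resource) can be sketched roughly as follows. This is a hedged illustration, not unusd.cloud's actual code: the thresholds, the `in_use` tag key, and the `InstanceMetrics` and `is_suspect` names are all assumptions made for the example. In a real setup the datapoints would presumably come from CloudWatch's GetMetricStatistics for the `CPUUtilization` and `NetworkOut` metrics in the `AWS/EC2` namespace, which is the read-only call Victor mentions.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List

# Assumed values for illustration only; the episode does not state real thresholds.
CPU_THRESHOLD_PERCENT = 1.0
NETWORK_OUT_THRESHOLD_BYTES = 5_000_000
OPT_OUT_TAG = "in_use"  # hypothetical tag key that disables alerting for a resource


@dataclass
class InstanceMetrics:
    instance_id: str
    cpu_percent: List[float]        # datapoints covering the last seven days
    network_out_bytes: List[float]  # datapoints covering the last seven days
    tags: Dict[str, str] = field(default_factory=dict)


def is_suspect(m: InstanceMetrics) -> bool:
    """Flag an instance as probably unused when both averages are low."""
    if OPT_OUT_TAG in m.tags:
        return False  # the owner has said this resource is idle on purpose
    if not m.cpu_percent or not m.network_out_bytes:
        return False  # no data: do not guess
    return (
        mean(m.cpu_percent) < CPU_THRESHOLD_PERCENT
        and mean(m.network_out_bytes) < NETWORK_OUT_THRESHOLD_BYTES
    )


idle = InstanceMetrics("i-idle", [0.2, 0.3, 0.1], [1_000.0, 2_000.0, 500.0])
dr_box = InstanceMetrics("i-dr", [0.2, 0.3, 0.1], [1_000.0, 2_000.0, 500.0], {"in_use": "true"})
busy = InstanceMetrics("i-busy", [35.0, 60.0, 42.0], [9e8, 7e8, 8e8])
```

With these sample values only `idle` is flagged. Requiring both metrics to be low is what keeps a quiet but network-active box, like the WiFi controller Corey mentions, off the list, provided its traffic sits above whatever threshold is chosen.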
Our anchors begin today's show covering Snap's Q3 revenue miss with Platformer News Founder Casey Newton and Big Technology newsletter author Alex Kantrowitz. Then, Canaccord Genuity analyst Kingsley Crane joins with his outlook for Datadog after upgrading the software company to a buy rating, and CNBC's Steve Kovach takes a deep dive into the impact of currency headwinds on big tech names posting results. Next, we share highlights from our Deirdre Bosa's interview on the international growth of F1 racing with Mercedes-AMG Petronas Team Principal Toto Wolff and CrowdStrike CEO George Kurtz. Later, we dive back into Snap with UBS analyst Lloyd Walmsley, and our Julia Boorstin recaps her conversation with Flexport Co-CEO Ryan Petersen on decreasing supply chain challenges.
Who are the people you need in your company for scaling from Seed Stage to Series B?
Today's guest is part of the reason Datadog (1bn valuation), Drift (1bn valuation) and Dooly ($300m) scaled so fast!
And now, Michelle Pietsch, Founding Partner at Minot Light Consulting, shares her best practices & insights with us.
She chats with Sammy about her approach to building successful go-to-market strategies, and aligning teams to one goal.
Here are 5 things you'll learn today:
1. How start-ups succeed in the Seed Stage with founder-led sales
2. Whom to hire while scaling from Seed Stage to Series B
3. Best practice hiring processes to identify the right fit in a new sales development rep
4. How to achieve a 30% conversion rate through webinars and podcasts
5. What metrics Michelle uses to track Minot Light Consulting's success, and how she unifies her team around revenue
About Michelle:
Over the last 10+ years, Michelle has spent her time at early-stage start-ups like Datadog, Drift, and Dooly, building sales and GTM functions from the ground up. She enjoys building in the rapid-growth sales SaaS world, and watching the organizations succeed, whether it's another round of funding or IPO. Over that time, she has built teams from 3 – 100, and grown revenue from $200K – $55M+. She enjoys taking the time to understand the GTM motion and putting the right processes in place from the top of the funnel down to Customer Success and Account Management, once the deals are closed.
Find Michelle on LinkedIn: https://www.linkedin.com/in/michelleheaney/
Michelle's business book recommendation: Start with Why: How Great Leaders Inspire Everyone to Take Action – Simon Sinek https://amzn.to/3M9VFqx
Michelle's favorite podcast: Product-Led Podcast - https://productled.com/podcast
__________
About Minot Light Consulting:
Minot Light Consulting has great expertise in building GTM strategies for start-up and early stage companies to have efficient growth. Modern Go-to-market teams need a transformation from siloed Sales, Marketing and Customer Success teams to one team united around revenue. With their 30+ years of combined experience working across all parts of rapidly scaling early stage companies, Minot Light Consulting has the answers to grow revenue faster and build strong teams and operating strategies.
Website: https://www.minotlightconsulting.com/
Industry: Business consulting and corporate services
Company size: 10
Headquarters: Boston, MA
Founded: 2022
__________
About the host Sammy:
Sammy and SAWOO enable you to drive recruiting & employer branding via your hiring managers on LinkedIn. We can help you:
- Speed up hiring
- Reduce hiring costs
- Create an authentic Employer Brand
- Create a Talent Pool that you can tap into any time
Get in touch with Sammy on LinkedIn: https://www.linkedin.com/in/sammygebele/
__________
Past guests on the GROW B2B FASTER show include: Justin Welsh, Ian Koniak, Jamal Reimer, Mike Troiano, John Kaplan, Greg Alexander, and many more.
Mobile devices have become a prime target for malicious actors and ICE is using zero trust to significantly improve threat detection and data protection. In this episode, ICE CISO Rob Thorne also highlights the importance of applying zero trust principles to enterprise mobility and how cyber hygiene activities are helping to propel the agency on its path to zero trust. This episode is sponsored by DataDog.
Benjamin Wilms (@MrBWilms, co-founder/CEO of @Steadybit) talks about the importance of resilience for SREs, DevOps, and developers through chaos engineering platforms
SHOW: 661
CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw
CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"
SHOW SPONSORS:
Datadog Synthetic Monitoring: Frontend and Backend Modern Monitoring
Ensure frontend issues don't impair user experience by detecting user-facing issues with API and browser tests with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
Granulate, an Intel company - Autonomous, continuous, workload optimization
gProfiler from Granulate - Production profiling, made easy
CDN77 - Content Delivery Network Optimized for Video
85% of users stop watching a video because of stalling and rebuffering. Rely on CDN77 to deliver a seamless online experience to your audience. Ask for a free trial with no duration or traffic limits.
SHOW NOTES:
Steadybit (homepage)
Steadybit wants developers involved in Chaos engineering before production (TechCrunch)
Topic 1 - Benjamin, give everyone a quick introduction.
Topic 2 - Let's start with the concept of chaos engineering. In its simplest form, chaos engineering intentionally takes down parts of a test or production environment (typically after software has shipped) randomly so teams, typically SREs/ops/dev, are forced to make the applications more resilient over time. It's not a matter of if systems will go down, it's a matter of when. This makes the systems better over time. Benjamin, you have a consulting background in this area that ultimately led to founding Steadybit. What were the limitations to this approach?
Topic 3 - What you're talking about is a more proactive approach to downtime. I'll call this resilience engineering and it requires a shift in mindset in an organization. How do you get developers onboard to embrace the need? Are we asking developers to share responsibility for outages with the SRE organization?
Topic 4 - On the surface, the obvious benefit is reduced downtime. That can be hard to quantify in business value. Outages can be measured, a lack of outages is harder to quantify. Does this become an issue in convincing an organization to embrace this methodology?
Topic 5 - When you say we are going to move chaos engineering into the CI/CD pipeline, what does that mean? Is this code that is added? Testing simulations that have to be passed? Real time failures of databases or nodes or simulated? What are the common use cases?
FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
About Richard
Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.
Links Referenced:
Grafana Labs: https://grafana.com/
Twitter: https://twitter.com/TwitchiH
Richard Hartmann list of talks: https://github.com/richih/talks
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. 
This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.Corey: This episode is brought to us in part by our friends at Datadog. Datadog's SaaS monitoring and security platform that enables full stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500 plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third party services in a single pane of glass.Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14 day trial and get a complimentary T-shirt when you install the agent.To learn more, visit datadoghq/screaminginthecloud to get. That's www.datadoghq/screaminginthecloudCorey: Welcome to Screaming in the Cloud, I'm Corey Quinn. There are an awful lot of people who are incredibly good at understanding the ins and outs and the intricacies of the observability world. But they didn't have time to come on the show today. Instead, I am talking to my dear friend of two decades now, Richard Hartmann, better known on the internet as RichiH, who is the Director of Community at Grafana Labs, here to suffer—in a somewhat atypical departure for the theme of this show—personal attacks for once. Richie, thank you for joining me.Richard: And thank you for agreeing on personal attacks.Corey: Exactly. It was one of your riders. Like, there have to be the personal attacks back and forth or you refuse to appear on the show. You've been on before. 
In fact, the last time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to?You're still at Grafana Labs. And in many cases, I would point out that, wow, you've been there for many years; that seems to be an atypical thing, which is an American tech industry perspective because every time you and I talk about this, you look at folks who—wow, you were only at that company for five years. What's wrong with you—you tend to take the longer view and I tend to have the fast twitch, time to go ahead and leave jobs because it's been more than 20 minutes approach. I see that you're continuing to live what you preach, though. How's it been?Richard: Yeah, so there's a little bit of Covid brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center. But the last two-and-a-half years didn't really happen for many people, myself included. So, I guess [laugh] that includes you.Corey: No, no you're right. You've only been at Grafana Labs a couple of years. One would think I would check the notes for shooting my mouth off. But then, one wouldn't know me.Richard: What notes? Anyway, I've been around Prometheus and Grafana Since 2015. But it's like, real, full-time everything is 2020. There was something in between. Since 2018, I contracted to do vulnerability handling and everything for Grafana Labs because they had something and they didn't know how to deal with it.But no, full time is 2020. But as to the space in the [unintelligible 00:02:45] of itself, it's maybe a little bit German of me, but trying to understand the real world and trying to get an overview of systems and how they actually work, and if they are working correctly and as intended, and if not, how they're not working as intended, and how to fix this is something which has always been super important to me, in part because I just want to understand the world. 
And this is a really, really good way to automate understanding of the world. So, it's basically a work-saving mechanism. And that's why I've been sticking to it for so long, I guess.
Corey: Back in the early days of monitoring systems—so we called it monitoring back then because, you know, using simple words that lack nuance was sort of de rigueur back then—we wound up effectively having tools. Nagios is the one that springs to mind, and it was terrible in all the ways you would expect a tool written in janky Perl in the early-2000s to be. But it told you what was going on. It tried to do a thing, generally reach a server or query it about things, and when things fell out of certain specs, it screamed its head off, which meant that when you had things like the core switch melting down—thinking of one very particular incident—you didn't get a Nagios alert; you got 4000 Nagios alerts. But start to finish, you could wrap your head rather fully around what Nagios did and why it did the sometimes strange things that it did.
These days, when you take a look at Prometheus, which we hear a lot about, particularly in the Kubernetes space, and Grafana, which is often mentioned in the same breath, it's never been quite clear to me exactly where those start and stop. It always feels like it's a component in a larger system to tell you what's going on rather than a one-stop shop that's going to, you know, shriek its head off when something breaks in the middle of the night. Is that the right way to think about it? The wrong way to think about it?
Richard: It's a way to think about it. So personally, I use the terms monitoring and observability pretty much interchangeably. Observability is a relatively well-defined term, even though most people won't agree. But if you look back into the '70s into control theory where the term is coming from, it is the measure of how much you're able to determine the internal state of a system by looking at its inputs and its outputs. 
Depending on the definition, some people don't include the inputs, but that is the OG definition as far as I'm aware.
And from this, there flow a lot of things. This question of—or this interpretation of the difference between telling that, yes, something's broken versus why something's broken. Or if you can't ask new questions on the fly, it's not observability. Like all of those things are fundamentally mapped to this definition of, I need enough data to determine the internal state of whatever system I have just by looking at what is coming in, what is going out. And that is at the core the thing. Now, obviously, it's become a buzzword, which is oftentimes the fate of successful things. So, it's become a buzzword, and you end up with cargo culting.
Corey: I would argue periodically, that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors. Which is tongue-in-cheek, but she has opinions, made nonetheless, shall I say, frustrating by the fact that she is invariably correct in those opinions, which just somehow makes it so much worse. It would be easy to dismiss things she says if she weren't always right. And the world is changing, especially as we get into the world of distributed systems.
“Is the server that runs the app working or not working?” loses meaning when we're talking about distributed systems, when we're talking about containers running on top of Kubernetes, which turns every outage into a murder mystery. We start having distributed applications composed of microservices, so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue related to the request coming into a completely separate microservice? 
And it seems that for those types of applications, the answer has been tracing for a long time now, where originally that was something that felt like it was sprung fully formed from the forehead of some god known as one of the hyperscalers, but now is available to basically everyone, in theory.
In practice, it seems that instrumenting applications is still one of the hardest parts of all of this. I tried hooking up one of my own applications to be observed via OTEL, the OpenTelemetry project, and it turns out that right now, OTEL and AWS Lambda have an intersection point that makes everything extremely difficult to work with. It's not there yet; it's not baked yet. And someday, I hope that changes because I would love to interchangeably just throw metrics and traces and logs to all the different observability tools and see which ones work, which ones don't, but that still feels very far away from current state of the art.
Richard: Before we go there, maybe one thing which I don't fully agree with. You said that previously, you were told if a service is up or down, that's the thing which you cared about, and I don't think that's what people actually cared about. At that time, also, what they fundamentally cared about: is the user-facing service up, or down, or impacted? Is it slow? Does it return errors every X percent for requests, something like this?
Corey: Is the site up? And—you're right, I was hand-waving over a whole bunch of things. It was, “Okay. First, the web server is returning a page, yes or no? Great. Can I ping the server?” Okay, well, there are ways a server can crash and still leave enough of the TCP/IP stack up that it can respond to pings and do little else.
And then you start adding things to it. But the Nagios thing that I always wanted to add—and had to—was, is the disk full? And that was annoying. And, on some level, like, why should I care in the modern era how much stuff is on the disk because storage is cheap and free and plentiful? 
The problem is, after the third outage in a month because the disk filled up, you start to not have a good answer for well, why aren't you monitoring whether the disk is full?And that was the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to midsize applications, which is what I'm talking about, the only things that people would let me touch. I wasn't running hyperscale stuff where you have a fleet of 10,000 web servers and, “Is the server up?” Yeah, in that scenario, no one cares. But when we're talking about the database server and the two application servers and the four web servers talking to them, you think about it more in terms of pets than you do cattle.Richard: Yes, absolutely. Yet, I think that was a mistake back then, and I tried to do it differently, as a specific example with the disk. And I'm absolutely agreeing that previous generation tools limit you in how you can actually work with your data. In particular, once you're with metrics where you can do actual math on the data, it doesn't matter if the disk is almost full. It matters if that disk is going to be full within X amount of time.If that disk is 98% full and it sits there at 98% for ten years and provides the service, no one cares. The thing is, will it actually run out in the next two hours, in the next five hours, what have you. Depending on this, is this currently or imminently a customer-impacting or user-impacting then yes, alert on it, raise hell, wake people, make them fix it, as opposed to this thing can be dealt with during business hours on the next workday. And you don't have to wake anyone up.Corey: Yeah. The big filer with massive amounts of storage has crossed the 70% line. Okay, now it's time to start thinking about that, what do you want to do? Maybe it's time to order another shelf of discs for it, which is going to take some time. 
That's a radically different scenario than the 20 gigabyte root volume on your server just started filling up dramatically; the rate of change is such that'll be full in 20 minutes.Yeah, one of those is something you want to wake people up for. Generally speaking, you don't want to wake people up for what is fundamentally a longer-term strategic business problem. That can be sorted out in the light of day versus, “[laugh] we're not going to be making money in two hours, so if I don't wake up and fix this now.” That's the kind of thing you generally want to be woken up for. Well, let's be honest, you don't want that to happen at all, but if it does happen, you kind of want to know in advance rather than after the fact.Richard: You're literally describing linear predict from Prometheus, which is precisely for this, where I can look back over X amount of time and make a linear prediction because everything else breaks down at scale, blah, blah, blah, to detail. But the thing is, I can draw a line with my pencil by hand on my data and I can predict when is this thing going to it. Which is obviously precisely correct if I have a TLS certificate. It's a little bit more hand-wavy when it's a disk. But still, you can look into the future and you say, “What will be happening if current trends for the last X amount of time continue in Y amount of time.” And that's precisely a thing where you get this more powerful ability of doing math with your data.Corey: See, when you say it like that, it sounds like it actually is a whole term of art, where you're focusing on an in-depth field, where salaries are astronomical. Whereas the tools that I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin that I was grunting and pointing: “This is gonna fill up.” And that is how I thought about it. 
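[Editor's note: the Prometheus function Richard is describing is spelled `predict_linear`; it fits a simple linear regression over a window of samples and extrapolates it forward. The Python below is a minimal sketch of that idea under stated assumptions, not Prometheus's implementation. The names `linear_forecast` and `should_page` and the sample numbers are hypothetical, and the sketch extrapolates from the timestamp of the last sample as a simplification.]

```python
def linear_forecast(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return the value it predicts seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must not all share one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


def should_page(free_bytes_samples, horizon_seconds):
    """Page only if free space is forecast to reach zero within the horizon."""
    return linear_forecast(free_bytes_samples, horizon_seconds) <= 0


# Hypothetical data: free bytes sampled once a minute.
filling_fast = [(0, 1000), (60, 900), (120, 800)]
sitting_still = [(0, 20), (60, 20), (120, 20)]

# Losing 100 bytes a minute: ten minutes out the forecast is about -200,
# so the disk is predicted to be full and someone should be woken up.
page_now = should_page(filling_fast, 600)

# Almost full but flat: the forecast stays at 20, so it can wait.
page_later = should_page(sitting_still, 600)
```

[The point of alerting on the forecast rather than on a fixed percentage is the one made in the conversation: a disk that sits at 98% for years never pages anyone, while one that will be full in twenty minutes does.]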
And this is the challenge where it's easy to think about these things in narrow, defined contexts like that, but at scale, things break.Like the idea of anomaly detection. Well, okay, great if normally, the CPU and these things are super bored and suddenly it gets really busy, that's atypical. Maybe we should look into it, assuming that it has a challenge. The problem is, that is a lot harder than it sounds because there are so many factors that factor into it. And as soon as you have something, quote-unquote, “Intelligent,” making decisions on this, it doesn't take too many false positives before you start ignoring everything it has to say, and missing legitimate things. It's this weird and obnoxious conflation of both hard technical problems and human psychology.Richard: And the breaking up of old service boundaries. Of course, when you say microservices, and such, fundamentally, functionally a microservice or nanoservice, picoservice—but the pendulum is already swinging back to larger units of complexity—but it fundamentally does not make any difference if I have a monolith on some mainframe or if I have a bunch of microservices. Yes, I can scale differently, I can scale horizontally a lot more easily, vertically, it's a little bit harder, blah, blah, blah, but fundamentally, the logic and the complexity, which is being packaged is fundamentally the same. More users, everything, but it is fundamentally the same. What's happening again, and again, is I'm breaking up those old boundaries, which means the old tools which have assumptions built in about certain aspects of how I can actually get an overview of a system just start breaking down, when my complexity unit or my service or what have I, is usually congruent with a physical piece, of hardware or several services are congruent with that piece of hardware, it absolutely makes sense to think about things in terms of this one physical server. 
The fact that you have different considerations in cloud, and microservices, and blah, blah, blah, is not inherently that it is more complex.On the contrary, it is fundamentally the same thing. It scales with users' everything, but it is fundamentally the same thing, but I have different boundaries of where I put interfaces onto my complexity, which basically allow me to hide all of this complexity from the downstream users.Corey: That's part of the challenge that I think we're grappling with across this entire industry from start to finish. Where we originally looked at these things and could reason about it because it's the computer and I know how those things work. Well, kind of, but okay, sure. But then we start layering levels of complexity on top of layers of complexity on top of layers of complexity, and suddenly, when things stop working the way that we expect, it can be very challenging to unpack and understand why. One of the ways I got into this whole space was understanding, to some degree, of how system calls work, of how the kernel wound up interacting with userspace, about how Linux systems worked from start to finish. And these days, that isn't particularly necessary most of the time for the care and feeding of applications.The challenge is when things start breaking, suddenly having that in my back pocket to pull out could be extremely handy. But I don't think it's nearly as central as it once was and I don't know that I would necessarily advise someone new to this space to spend a few years as a systems person, digging into a lot of those aspects. And this is why you need to know what inodes are and how they work. Not really, not anymore. It's not front and center the way that it once was, in most environments, at least in the world that I live in. Agree? Disagree?Richard: Agreed. But it's very much unsurprising. 
You probably can't tell me how to precisely grow sugar cane or corn, you can't tell me how to refine the sugar out of it, but you can absolutely bake a cake. But you will not be able to tell me even a third of—and I'm—for the record, I'm also not able to tell you even a third about the supply chain which just goes from I have a field and some seeds and I need to have a package of refined sugar—you're absolutely unable to do any of this. The thing is, you've been part of the previous generation of infrastructure where you know how this underlying infrastructure works, so you have more ability to reason about this, but it's not needed for cloud services nearly as much.
You need different types of skill sets, but that doesn't mean the old skill set is completely useless, at least not as of right now. It's much more a case of you need fewer of those people and you need them in different places because those things have become infrastructure. Which is basically the cloud play, where a lot of this is just becoming infrastructure more and more.
Corey: Oh, yeah. Back then I distinctly remember my elders looking down their noses at me because I didn't know assembly, and how could I possibly consider myself a competent systems admin if I didn't at least have a working knowledge of assembly? Or at least C, which I, over time, learned enough about to know that I didn't want to be a C programmer. And you're right, this is the value of cloud and going back to those days getting a web server up and running just to compile Apache's httpd took a week and an in-depth knowledge of GCC flags.
And then in time, oh, great. We're going to have RPMs or debs. Great, okay, then in time, you have apt, if you're in the deb land because I know you are a Debian developer, but over in Red Hat land, we had yum and other tools. And then in time, it became oh, we can just use something like Puppet or Chef to wind up ensuring that thing is installed. And then oh, just docker run. 
And now it's a checkbox in a web console for S3.These things get easier with time and step by step by step we're standing on the shoulders of giants. Even in the last ten years of my career, I used to have a great challenge question that I would interview people with of, “Do you know what TinyURL is? It takes a short URL and then expands it to a longer one. Great, on the whiteboard, tell me how you would implement that.” And you could go up one side and down the other, and then you could add constraints, multiple data centers, now one goes offline, how do you not lose data? Et cetera, et cetera.But these days, there are so many ways to do that using cloud services that it almost becomes trivial. It's okay, multiple data centers, API Gateway, a Lambda, and a global DynamoDB table. Now, what? “Well, now it gets slow. Why is it getting slow?”“Well, in that scenario, probably because of something underlying the cloud provider.” “And so now, you lose an entire AWS region. How do you handle that?” “Seems to me when that happens, the entire internet's kind of broken. Do people really need longer URLs?”And that is a valid answer, in many cases. The question doesn't really work without a whole bunch of additional constraints that make it sound fake. And that's not a weakness. That is the fact that computers and cloud services have never been as accessible as they are now. And that's a win for everyone.Richard: There's one aspect of accessibility which is actually decreasing—or two. A, you need to pay for them on an ongoing basis. And B, you need an internet connection which is suitably fast, low latency, what have you. And those are things which actually do make things harder for a variety of reasons. If I look at our back-end systems—as in Grafana—all of them have single binary modes where you literally compile everything into a single binary and you can run it on your laptop because if you're stuck on a plane, you can't do any work on it. 
That kind of is not the best of situations.And if you have a huge CI/CD pipeline, everything in this cloud and fine and dandy, but your internet breaks. Yeah, so I do agree that it is becoming generally more accessible. I disagree that it is becoming more accessible along all possible axes.Corey: I would agree. There is a silver lining to that as well, where yes, they are fraught and dangerous and I would preface this with a whole bunch of warnings, but from a cost perspective, all of the cloud providers do have a free tier offering where you can kick the tires on a lot of these things in return for no money. Surprisingly, the best one of those is Oracle Cloud where they have an unlimited free tier, use whatever you want in this subset of services, and you will never be charged a dime. As opposed to the AWS model of free tier where well, okay, it suddenly got very popular or you misconfigured something, and surprise, you now owe us enough money to buy Belize. That doesn't usually lead to a great customer experience.But you're right, you can't get away from needing an internet connection of at least some level of stability and throughput in order for a lot of these things to work. The stuff you would do locally on a Raspberry Pi, for example, if your budget constrained and want to get something out here, or your laptop. Great, that's not going to work in the same way as a full-on cloud service will.Richard: It's not free unless you have hard guarantees that you're not going to ever pay anything. It's fine to send warning, it's fine to switch the thing off, it's fine to have you hit random hard and soft quotas. It is not a free service if you can't guarantee that it is free.Corey: I agree with you. I think that there needs to be a free offering where, “Well, okay, you want us to suddenly stop serving traffic to the world?” “Yes. 
When the alternative is you have to start charging me through the nose, yes I want you to stop serving traffic.” That is definitionally what it says on the tin.And as an independent learner, that is what I want. Conversely, if I'm an enterprise, yeah, I don't care about money; we're running our Superbowl ad right now, so whatever you do, don't stop serving traffic. Charge us all the money. And there's been a lot of hand wringing about, well, how do we figure out which direction to go in? And it's, have you considered asking the customer?So, on a scale of one to bank, how serious is this account going to be [laugh]? Like, what are your big concerns: never charge me or never go down? Because we can build for either of those. Just let's make sure that all of those expectations are aligned. Because if you guess you're going to get it wrong and then no one's going to like you.Richard: I would argue this. All those services from all cloud providers actually build to address both of those. It's a deliberate choice not to offer certain aspects.Corey: Absolutely. When I talk to AWS, like, “Yeah, but there is an eventual consistency challenge in the billing system where it takes”—as anyone who's looked at the billing system can see—“Multiple days, sometimes for usage data to show up. So, how would we be able to stop things if the usage starts climbing?” To which my relatively direct responses, that sounds like a huge problem. I don't know how you'd fix that, but I do know that if suddenly you decide, as a matter of policy, to okay, if you're in the free tier, we will not charge you, or even we will not charge you more than $20 a month.So, you build yourself some headroom, great. And anything that people are able to spin up, well, you're just going to have to eat the cost as a provider. I somehow suspect that would get fixed super quickly if that were the constraint. 
The fact that it isn't is a conscious choice.Richard: Absolutely.Corey: And the reason I'm so passionate about this, about the free space, is not because I want to get a bunch of things for free. I assure you I do not. I mean, I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely, “It's too expensive.” It's that the billing dimension is hard to predict or doesn't align with a customer's experience or prices a service out of a bunch of use cases where it'll be great. But very rarely do I just sit here shaking my fist and saying, “It costs too much.”The problem is when you scare the living crap out of a student with a surprise bill that's more than their entire college tuition, even if you waive it a week or so later, do you think they're ever going to be as excited as they once were to go and use cloud services and build things for themselves and see what's possible? I mean, you and I met on IRC 20 years ago because back in those days, the failure mode and the risk financially was extremely low. It's yeah, the biggest concern that I had back then when I was doing some of my Linux experimentation is if I typed the wrong thing, I'm going to break my laptop. And yeah, that happened once or twice, and I've learned not to make those same kinds of mistakes, or put guardrails in so the blast radius was smaller, or use a remote system instead. Yeah, someone else's computer that I can destroy. Wonderful. But that was on we live and we learn as we were coming up. There was never an opportunity for us, to my understanding, to wind up accidentally running up an $8 million charge.Richard: Absolutely. And psychological safety is one of the most important things in what most people do. We are social animals. Without this psychological safety, you're not going to have long-term, self-sustaining groups. You will not make someone really excited about it. There's two basic ways to sell: trust or force. Those are the two ones. 
There's none else.
Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's GO M-O-M-E-N-T-O dot co slash screaming.
Corey: Yeah. And it also looks ridiculous. I was talking to someone somewhat recently who's used to spending four bucks a month on their AWS bill for some S3 stuff. Great. Good for them. That's awesome. Their credentials got compromised. Yes, that is on them to some extent. Okay, great.
But now after six days, they were told that they owed $360,000 to AWS. And I don't know how, as a cloud company, you can sit there and ask a student to do that. That is not a realistic thing. They are what is known, in the United States at least, in the world of civil litigation as quote-unquote, “Judgment proof,” which means, great, you could wind up finding that someone owes you $20 billion. Most of the time, they don't have that, so you're not able to recoup it. Yeah, the judgment feels good, but you're never going to see it.
That's the problem with something like that. It's yeah, I would declare bankruptcy long before, as a student, I wound up paying that kind of money. And I don't hear any stories about them releasing the collection agency hounds against people in that scenario. But I couldn't guarantee that. I would never urge someone to ignore that bill and see what happens.
And it's such an off-putting thing that, from my perspective, is beneath the company. And let's be clear, I see this behavior at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS, but they are the 800-pound gorilla in the space, and that's important. 
Or, as I just want to mention right now, like, as I—because I was about to give you crap for this, too, but if I go to grafana.com, it says, and I quote, “Play around with the Grafana Stack. Experience Grafana for yourself, no registration or installation needed.”
Good. I was about to yell at you if it's, “Oh, just give us your credit card and go ahead and start spinning things up and we won't charge you. Honest.” Even your free account does not require a credit card; you're doing it right. That tells me that I'm not going to get a giant surprise bill.
Richard: You have no idea how much thought and work went into our free offering. There was a lot of math involved.
Corey: None of this is easy, I want to be very clear on that. Pricing is one of the hardest things to get right, especially in cloud. And it also, when you get it right, it doesn't look like it was that hard for you to do. But I fix [sigh] people's AWS bills for a living and still, five or six years in, one of the hardest things I still wrestle with is pricing engagements. It's incredibly nuanced, incredibly challenging, and at least for services in the cloud space where you're doing usage-based billing, that becomes a problem.
But glancing at your pricing page, you do hit the two things that are incredibly important to me. The first one is use something for free. As an added bonus, you can use it forever. And I can get started with it right now. Great. When I go and look at your pricing page or I want to use your product and it tells me to ‘click here to contact us,’ that tells me it's an enterprise sales cycle, it's got to be really expensive, and I'm not solving my problem tonight.
Whereas the other side of it, the enterprise offering needs to be ‘contact us’ and you do that, that speaks to the enterprise procurement people who don't know how to sign a check that doesn't have two commas in it, and they want to have custom terms and all the rest, and they're prepared to pay for that. 
If you don't have that, you look too small-time. It doesn't matter what price you put on it; if you wind up offering your enterprise tier at some large number, it's, yeah, for some companies, that's a small number. You don't necessarily want to back yourself in, depending upon what the specific needs are. You've gotten that right.

Every common criticism that I have about pricing, you folks have gotten right. And I definitely can pick up on your fingerprints on a lot of this. Because it sounds like a weird thing to say of, “Well, he's the Director of Community, why would he weigh in on pricing?” It's, “I don't think you understand what community is when you ask that question.”

Richard: Yes, I fully agree. It's super important to get pricing right, or to get many things right. And usually the things which just feel naturally correct are the ones which took the most effort and the most time and everything. And yes, I was in those conversations, or part of them, and the one thing which was always clear is when we say it's free, it must be free. When we say it is forever free, it must be forever free. No games, no lies, do what you say and say what you do. Basically.

We have things where initially you get certain pro features and you can keep paying and you can keep using them, or after X amount of time they go away. Things like these are built in because that's what people want. They want to play around with the whole thing and see, hey, is this actually providing me value? Do I want to pay for this feature which is nice or this and that plugin or what have you? And yeah, you're also absolutely right that once you leave these constraints of basically self-serve cloud, you are talking about bespoke deals, but you're also talking about okay, let's sit down, let's actually understand what your business is: what are your business problems? What are you going to solve today? 
What are you trying to solve tomorrow?Let us find a way of actually supporting you and invest into a mutual partnership and not just grab the money and run. We have extremely low churn for, I would say, pretty good reasons. Because this thing about our users, our customers being successful, we do take it extremely seriously.Corey: It's one of those areas that I just can't shake the feeling is underappreciated industry-wide. And the reason I say that this is your fingerprints on it is because if this had been wrong, you have a lot of… we'll call them idiosyncrasies, where there are certain things you absolutely will not stand for, and misleading people and tricking them into paying money is high on that list. One of the reasons we're friends. So yeah, but I say I see your fingerprints on this, it's yeah, if this hadn't been worked out the way that it is, you would not still be there. One other thing that I wanted to call out about, well, I guess it's a confluence of pricing and logging in the rest, I look at your free tier, and it offers up to 50 gigabytes of ingest a month.And it's easy for me to sit here and compare that to other services, other tools, and other logging stories, and then I have to stop and think for a minute that yeah, discs have gotten way bigger, and internet connections have gotten way faster, and even the logs have gotten way wordier. I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark examples of what that looks like? Because it's been long enough since I've been playing in these waters that I can't really contextualize it anymore.Richard: Lord of the Rings is roughly five megabytes. It's actually less. So, we're talking literally 10,000 Lord of the Rings, which you can just shove in us and we're just storing this for you. Which also tells you that you're not going to be reading any of this. Or some of it, yes, but not all of it. 
You need better tooling and you need proper tooling.And some of this is more modern. Some of this is where we actually pushed the state of the art. But I'm also biased. But I, for myself, do claim that we did push the state of the art here. But at the same time you come back to those absolute fundamentals of how humans deal with data.If you look back basically as far as we have writing—literally 6000 years ago, is the oldest writing—humans have always dealt with information with the state of the world in very specific ways. A, is it important enough to even write it down, to even persist it in whatever persistence mechanisms I have at my disposal? If yes, write a detailed account or record a detailed account of whatever the thing is. But it turns out, this is expensive and it's not what you need. So, over time, you optimize towards only taking down key events and only noting key events. Maybe with their interconnections, but fundamentally, the key events.As your data grows, as you have more stuff, as this still is important to your business and keeps being more important to—or doesn't even need to be a business; can be social, can be whatever—whatever thing it is, it becomes expensive, again, to retain all of those key events. So, you turn them into numbers and you can do actual math on them. And that's this path which you've seen again, and again, and again, and again, throughout humanity's history. Literally, as long as we have written records, this has played out again, and again, and again, and again, for every single field which humans actually cared about. At different times, like, power networks are way ahead of this, but fundamentally power networks work on metrics, but for transient load spike, and everything, they have logs built into their power measurement devices, but those are only far in between. Of course, the main thing is just metrics, time-series. 
And you see this again, and again.

You were also a sysadmin: in the internet-related world, all switches have been metrics-based or metrics-first for basically forever, for 20, 30 years. But that stands to reason. Of course, the internet is running roughly 20 years ahead of the cloud scale-wise, because obviously you need the internet, as otherwise you wouldn't have a cloud. So all of those growing pains, and why metrics are all of a sudden the thing (or have been for a few years now), come down to people who were writing software and providing their own software services hitting the scaling limitations that internet service providers hit two or three decades ago. But fundamentally, you have this complete system. Basically profiles or distributed tracing, depending on how you view distributed tracing.

You can also argue that distributed tracing is key events which are linked to each other. Logs sit firmly in the key event thing, and then you turn this into numbers and that is metrics. And that's basically it. You have extremes at the ends where, depending on your circumstances, you can have valid engineering trade-offs of where you invest the most, but fundamentally, that is why those always appear again in humanity's dealing with data, and observability is no different.

Corey: I take a look at last month's AWS bill. Mine is pretty well optimized. It's a bit over 500 bucks. And right around 150 of that is various forms of logging and detecting change in the environment. And on the one hand, I sit here, and I think, “Oh, I should optimize that,” because the value of those logs to me is zero.

Except that whenever I have to go in and diagnose something or respond to an incident or have some forensic exploration, they then are worth an awful lot. And I am prepared to pay 150 bucks a month for that because the potential value of having that when the time comes is going to be extraordinarily useful. 
And it basically just feels like a tax on top of what it is that I'm doing. The same thing happens with application observability where, yeah, when you just want the big substantial stuff, yeah, until you're trying to diagnose something. But in some cases, yeah, okay, then crank up the verbosity and then look for it.But if you're trying to figure it out after an event that isn't likely or hopefully won't recur, you're going to wish that you spent a little bit more on collecting data out of it. You're always going to be wrong, you're always going to be unhappy, on some level.Richard: Ish. You could absolutely be optimizing this. I mean, for $500, it's probably not worth your time unless you take it as an exercise, but outside of due diligence where you need specific logs tied to—or specific events tied to specific times, I would argue that a lot of the problems with logs is just dealing with it wrong. You have this one extreme of full-text indexing everything, and you have this other extreme of a data lake—which is just a euphemism of never looking at the data again—to keep storage vendors happy. There is an in between.Again, I'm biased, but like for example, with Loki, you have those same label sets as you have on your metrics with Prometheus, and you have literally the same, which means you only index that part and you only extract on ingestion time. If you don't have structured logs yet, only put the metadata about whatever you care about extracted and put it into your label set and store this, and that's the only thing you index. But it goes further than just this. You can also turn those logs into metrics.And to me this is a path of optimization. Where previously I logged this and that error. Okay, fine, but it's just a log line telling me it's HTTP 500. No one cares that this is at this precise time. 
Log levels are also basically an anti-pattern because they're just trying to deal with the amount of data which I have, and try and get a handle on this on that level whereas it would be much easier if I just counted every time I have an HTTP 500, I just up my counter by one. And again, and again, and again.And all of a sudden, I have literally—and I did the math on this—over 99.8% of the data which I have to store just goes away. It's just magic the way—and we're only talking about the first time I'm hitting this logline. The second time I'm hitting this logline is functionally free if I turn this into metrics. It becomes cheap enough that one of the mantras which I have, if you need to onboard your developers on modern observability, blah, blah, blah, blah, blah, the whole bells and whistles, usually people have logs, like that's what they have, unless they were from ISPs or power companies, or so; there they usually start with metrics.But most users, which I see both with my Grafana and with my Prometheus [unintelligible 00:38:46] tend to start with logs. They have issues with those logs because they're basically unstructured and useless and you need to first make them useful to some extent. But then you can leverage on this and instead of having a debug statement, just put a counter. Every single time you think, “Hey, maybe I should put a debug statement,” just put a counter instead. In two months time, see if it was worth it or if you delete that line and just remove that counter.It's so much cheaper, you can just throw this on and just have it run for a week or a month or whatever timeframe and done. But it goes beyond this because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my alerts on those metrics. I can actually persist those metrics and can more aggressively throw my logs away. 
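
As a rough illustration of the "put a counter instead of a log line" idea Richard describes, here is a minimal sketch in plain Python. It uses only the standard library rather than a real metrics client, and the names `handle_request`, `http_responses_total`, and `log_lines` are invented for this example; it makes no claim about how Prometheus or Loki store data.

```python
from collections import Counter

# Hypothetical in-process metric store: one integer per status code.
# A real service would use a metrics client library; this is only a sketch.
http_responses_total = Counter()

# What the log-based approach has to retain: one entry per event.
log_lines = []


def handle_request(status: int) -> None:
    """Record one finished request both ways so the two can be compared."""
    # Log-based approach: storage grows by one line for every event.
    log_lines.append(f"request finished status={status}")
    # Metric-based approach: bump a counter; repeat events add no new entries.
    http_responses_total[status] += 1


# Simulate a burst of server errors followed by one successful request.
for _ in range(1000):
    handle_request(500)
handle_request(200)

errors = http_responses_total[500]
print(f"log lines kept: {len(log_lines)}, counter entries kept: {len(http_responses_total)}")
```

The log list keeps growing with every request, while the counter only ever holds one number per status code, which is the shape of the saving described above.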
But also, I have this transition made a lot easier where I don't have this huge lift, where this day in three months is to be cut over and we're going to release the new version of this and that software and it's not going to have that, it's going to have 80% less logs and everything will be great and then you missed the first maintenance window or someone is ill or what have you, and then the next Big Friday is coming so you can't actually deploy there. I mean Black Friday. But we can also talk about deploying on Fridays.But the thing is, you have this huge thing, whereas if you have this as a continuous improvement process, I can just look at, this is the log which is coming out. I turn this into a number, I start emitting metrics directly, and I see that those numbers match. And so, I can just start—I build new stuff, I put it into a new data format, I actually emit the new data format directly from my code instrumentation, and only then do I start removing the instrumentation for the logs. And that allows me to, with full confidence, with psychological safety, just move a lot more quickly, deliver much more quickly, and also cut down on my costs more quickly because I'm just using more efficient data types.Corey: I really want to thank you for spending as much time as you have. If people want to learn more about how you view the world and figure out what other personal attacks they can throw your way, where's the best place for them to find you?Richard: Personal attacks, probably Twitter. It's, like, the go-to place for this kind of thing. For actually tracking, I stopped maintaining my own website. Maybe I'll do again, but if you go on github.com/ritchieh/talks, you'll find a reasonably up-to-date list of all the talks, interviews, presentations, panels, what have you, which I did over the last whatever amount of time. [laugh].Corey: And we will, of course, put links to that in the [show notes 00:41:23]. Thanks again for your time. 
It's always appreciated.

Richard: And thank you.

Corey: Richard Hartmann, Director of Community at Grafana Labs. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment. And then when someone else comes along with an insulting comment they want to add, we'll just increment the counter by one.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
Technical Debt has a branding problem, and in a shifting economy, it becomes increasingly important to have a focus on projects that drive positive ROI.

SHOW: 660

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW SPONSORS:
Datadog Security Solution: Modern Monitoring and Security
Start investigating security threats before it affects your customers with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
CloudZero - Cloud Cost Intelligence for Engineering Teams
CDN77 - CDN Focused on VOD and Security
CDN77 - ask for a free trial with no duration or traffic limits.

SHOW NOTES:

WE GLAMORIZE INNOVATION BECAUSE ROI IS HARD TO MEASURE
There is an outward perspective that everyone wants to work on something new, because it gets a lot of attention. But there are plenty of opportunities to focus on foundational, stabilizing capabilities.

SOMETIMES GROWTH COMES FROM BEING EFFICIENT
Think of things in terms of quarters, or six-months, or twelve-months
Where can you measure improvement?
Where can you leave things alone, but still have room to improve?
How can you link to improvement in other areas?
How can you have an incremental improvement mindset?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
Michael Cade (@michaelcade1) talks about learning in public and his creation of "90 Days of DevOps" to enable others by teaching the process and principles of DevOps from the ground up.

SHOW: 659

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW SPONSORS:
Granulate, an Intel company - Autonomous, continuous, workload optimization
gMaestro from Granulate - Kubernetes cost optimization, made easy
CDN77 - Content Delivery Network Optimized for Video
85% of users stop watching a video because of stalling and rebuffering. Rely on CDN77 to deliver a seamless online experience to your audience. Ask for a free trial with no duration or traffic limits.
Datadog Monitoring: Modern Monitoring and Analytics
Start monitoring your infrastructure, applications, logs and security in one place with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.

SHOW NOTES:
90 Days of DevOps

Topic 1 - We've known each other for a long time. For those out there not familiar, give us a quick introduction.

Topic 2 - Before we dig into the 90 Days of DevOps project, tell everyone about your career journey at Veeam and now Kasten. You've made the transition from infrastructure technologist to DevOps and cloud native technologies. How did that come about?

Topic 3 - You started the project on Jan 1, 2022. Did you have this all mapped out before you started or did you create the content and topics as you went along?

Topic 4 - Tell everyone a little bit about what they can learn if they complete the program. What made you choose these topics? What I like best about the program is the wide variety of topics. Everything from conceptual items (OSI model) to hands on with Terraform, Kubernetes, Jenkins, etc.

Topic 5 - What stuck out to you going through the process? What was most impactful to you in the journey? I remember following along from afar on Twitter while you were doing this. You struck a nerve in the community somewhere along the way (18k stars on GitHub). When did you think this might get bigger than you anticipated?

Topic 6 - What's next? Do you have plans to continue the project or create another project? What are folks out there asking for? Anything you would change in the current program?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
Click here to sign up for the free course on engaging your prospects in a hyper-personalized way.

This week on Dans l'Arène: Clément Regazzoni, Account Executive @ Salesloft & co-founder of the SDR Tribes community.

Salesloft is the provider of the leading sales engagement platform, which helps sellers and sales teams generate more revenue. Salesloft's Modern Revenue Workspace™ is the one place where sellers can execute all of their digital selling tasks, communicate with buyers, understand what they need to do next, and get the coaching and insights they need to win their deals.

By the numbers, Salesloft is:
2,000+ customers
500+ employees
60,000+ followers on LinkedIn
Offices in 6 different cities
The 7th fastest-growing technology company in North America

For its part, SDR Tribes is a Slack community for sharing and mutual support that brings together 800+ French-speaking salespeople from leading start-ups and companies (Datadog, AB Tasty, Alan, Partoo, LinkedIn, Salesforce…). Its members exchange best practices for sales prospecting, as well as tools, job offers, and a variety of articles about the sales profession.

On the menu:
⚔️ Pulling off a cold call with a CEO
⚔️ 1 tip for advancing in your SDR career
⚔️ 2 tips for performing well and staying motivated in sales

Find your way around the episode:
2'40'': The number one advantage of working in sales
4'22'': Clément's realization about cold calling
6'15'': The difference between SDR & BDR
8'00'': The crucial role a salesperson plays in their company's image
9'22'': 2 tips for performing well in sales
11'45'': The importance of rituals at work
14'03'': The best cold call of Clément's career
18'50'': 1 tip for any BDR/SDR who wants to become an Account Executive
21'40'': Networking among salespeople through SDR Tribes

Thanks to our sponsor: Salesloft

To support the podcast:
1. Subscribe to Dans l'Arène so you don't miss upcoming episodes!
2. Leave 5 stars on Apple Podcast and Spotify to help other startup founders discover the podcast.
3. Also come and discover A-Team, the podcast for managing better day to day.
About Steve:
Steve Rice is Principal Product Manager for AWS AppConfig. He is surprisingly passionate about feature flags and continuous configuration. He lives in the Washington DC area with his wife, 3 kids, and 2 incontinent dogs.

Links Referenced:
AWS AppConfig: https://go.aws/awsappconfig

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.

Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH. Basically you're SSHing the same way you manage access to your app. 
What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive?

Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This is a promoted guest episode. What does that mean? Well, it means that some people don't just want me to sit here and throw slings and arrows their way, they would prefer to send me a guest specifically, and they do pay for that privilege, which I appreciate. Paying me is absolutely a behavior I wish to endorse.

Today's victim who has decided to contribute to slash sponsor my ongoing ridiculous nonsense is, of all companies, AWS. And today I'm talking to Steve Rice, who's the principal product manager on AWS AppConfig. Steve, thank you for joining me.

Steve: Hey, Corey, great to see you. Thanks for having me. Looking forward to a conversation.

Corey: As am I. Now, AppConfig does something super interesting, which I'm not aware of any other service or sub-service doing. You are under the umbrella of AWS Systems Manager, but you're not going to market with Systems Manager AppConfig. You're just AWS AppConfig. Why?

Steve: So, AppConfig is part of AWS Systems Manager. Systems Manager has, I think, 17 different features associated with it. Some of them have an individual name that is associated with Systems Manager, some of them don't. We just happen to be one that doesn't. AppConfig is a service that's been around for a while internally before it was launched externally a couple years ago, so I'd say that's probably the origin of the name and the service. I can tell you more about the origin of the service if you're curious.

Corey: Oh, I absolutely am. 
But I just want to take a bit of a detour here and point out that I make fun of the sub-service names in Systems Manager an awful lot, like Systems Manager Session Manager and Systems Manager Change Manager. And part of the reason I do that is not just because it's funny, but because almost everything I found so far within the Systems Manager umbrella is pretty awesome. It aligns with how I tend to think about the world in a bunch of different ways. I have yet to see anything lurking within the Systems Manager umbrella that has led to a tee-hee-hee bill surprise level that rivals, you know, the GDP of Guam. So, I'm a big fan of the entire suite of services. But yes, how did AppConfig get its name?Steve: [laugh]. So, AppConfig started about six years ago, now, internally. So, we actually were part of the region services department inside of Amazon, which is in charge of launching new services around the world. We found that a centralized tool for configuration associated with each service launching was really helpful. So, a service might be launching in a new region and have to enable and disable things as it moved along.And so, the tool was sort of built for that, turning on and off things as the region developed and was ready to launch publicly; then the regions launch publicly. It turned out that our internal customers, which are a lot of AWS services and then some Amazon services as well, started to use us beyond launching new regions, and started to use us for feature flagging. Again, turning on and off capabilities, launching things safely. And so, it became massively popular; we were actually a top 30 service internally in terms of usage. And two years ago, we thought we really should launch this externally and let our customers benefit from some of the goodness that we put in there, and some of—those all come from the mistakes we've made internally. And so, it became AppConfig. 
In terms of the name itself, we specialize in application configuration, so that's kind of a mouthful, so we just changed it to AppConfig.Corey: Earlier this year, there was a vulnerability reported around I believe it was AWS Glue, but please don't quote me on that. And as part of its excellent response that AWS put out, they said that from the time that it was disclosed to them, they had patched the service and rolled it out to every AWS region in which Glue existed in a little under 29 hours, which at scale is absolutely magic fast. That is superhero speed and then some because you generally don't just throw something over the wall, regardless of how small it is when we're talking about something at the scale of AWS. I mean, look at who your customers are; mistakes will show. This also got me thinking that when you have Adam, or previously Andy, on stage giving a keynote announcement and then they mention something on stage, like, “Congratulations. It's now a very complicated service with 14 adjectives in his name because someone's paid by the syllable. Great.”Suddenly, the marketing pages are up, the APIs are working, it's showing up in the console, and it occurs to me only somewhat recently to think about all of the moving parts that go on behind this. That is far faster than even the improved speed of CloudFront distribution updates. There's very clearly something going on there. So, I've got to ask, is that you?Steve: Yes, a lot of that is us. I can't take credit for a hundred percent of what you're talking about, but that's how we are used. We're essentially used as a feature-flagging service. And I can talk generically about feature flagging. Feature flagging allows you to push code out to production, but it's hidden behind a configuration switch: a feature toggle or a feature flag. And that code can be sitting out there, nobody can access it until somebody flips that toggle. Now, the smart way to do it is to flip that toggle on for a small set of users. 
Maybe it's just internal users, maybe it's 1% of your users. And so, the feature's available, you can...

Corey: It's your best slash worst customers [laugh] in that 1%, in some cases.

Steve: Yeah, you want to stress test the system with them and you want to be able to look and see what's going to break before it breaks for everybody. So, you release it to a small cohort, you measure your operations, you measure your application health, you measure your reputational concerns, and then if everything goes well, then you maybe bump it up to 2%, and then 10%, and then 20%. So, feature flags allow you to slowly release features, and you know what you're releasing by the time it's at a hundred percent. It's tempting for teams to want to, like, have everybody access it at the same time; you've been working hard on this feature for a long time. But again, that's kind of an anti-pattern. You want to make sure that on production, it behaves the way you expect it to behave.

Corey: I have to ask what is the fundamental difference between feature flags and/or dynamic configuration. Because to my mind, one of them is a means of achieving the other, but I could also see very easily using the terms interchangeably. Given that in some of our conversations, you have corrected me, which, first, how dare you? Secondly, okay, there's probably a reason here. What is that point of distinction?

Steve: Yeah. Typically, for those who are not eating, sleeping, and breathing dynamic configuration (which I do, and most people are not obsessed with this kind of thing), feature flags is kind of a shorthand for dynamic configuration. It allows you to turn on and off things without pushing out any new code. 
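
The gradual rollout Steve walks through (1%, then 2%, 10%, 20%) can be sketched with a stable hash of the user ID. This is a minimal illustration in Python under stated assumptions, not a description of how AWS AppConfig implements targeting; the function names and the `new-checkout` flag are hypothetical.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into 100 buckets.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets."""
    return rollout_bucket(flag_name, user_id) < percent


# Hypothetical user population used only to exercise the functions.
users = [f"user-{i}" for i in range(10_000)]

# Raising the percentage only ever adds users; nobody who already has the
# feature loses it, because each user's bucket never changes.
at_1 = {u for u in users if is_enabled("new-checkout", u, 1)}
at_10 = {u for u in users if is_enabled("new-checkout", u, 10)}
print(f"{len(at_1)} users at 1%, {len(at_10)} users at 10%")
```

Hashing the flag name together with the user ID is a design choice made for this sketch: it keeps the same users from always landing in the earliest cohort for every flag.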
So, your application code's running, it's pulling its configuration data, say every five seconds, every ten seconds, something like that, and when that configuration data changes, then that app changes its behavior, again, without a code push or without restarting the app.

So, dynamic configuration is maybe a superset of feature flags. Typically, when people think feature flags, they're thinking of, “Oh, I'm going to release a new feature, so it's almost like an on-off switch.” But we see customers using feature flags, and we use this internally, for things like throttling limits. Let's say you want to be able to throttle TPS (transactions per second). Or let's say you want to throttle the number of simultaneous background tasks, and say, you know, I just really don't want this creeping above 50; bad things can start to happen.

But in a period of stress, you might want to actually bring that number down. Well, you can push out these changes with dynamic configuration (which is, again, any type of configuration, not just an on-off switch); you can push this out and adjust the behavior and see what happens. Again, I'd recommend pushing it out to 1% of your users, and then 10%. But it allows you to have these dials and switches to do that. And, again, generically, that's dynamic configuration. It's not as fun a term as feature flags; feature flags is sort of a good mental picture, so I do use them interchangeably, but if you're really into the whole world of this dynamic configuration, then you probably will care about the difference.

Corey: Which makes a fair bit of sense. It's the question of what are you talking about high level versus what are you talking about implementation detail-wise.

Steve: Yep. Yep.

Corey: And on some level, I used to get... well, we'll call it angsty (because I can't think of a better adjective right now) about how AWS was reluctant to disclose implementation details behind what it did. 
And in the fullness of time, it's made a lot more sense to me, specifically through a lens of, you want to be able to have the freedom to change how something works under the hood. And if you've made no particular guarantee about the implementation detail, you can do that without potentially worrying about breaking a whole bunch of customer expectations that you've inadvertently set. And that makes an awful lot of sense.The idea of rolling out changes to your infrastructure has evolved over the last decade. Once upon a time you'd have EC2 instances, and great, you want to go ahead and make a change there—or this actually predates EC2 instances. Virtual machines in a data center or heaven forbid, bare metal servers, you're not going to deploy a whole new server because there's a new version of the code out, so you separate out your infrastructure from the code that it runs. And that worked out well. And increasingly, we started to see ways of okay, if we want to change the behavior of the application, we'll just push out new environment variables to that thing and restart the service so it winds up consuming those.And that's great. You've rolled it out throughout your fleet. With containers, which is sort of the next logical step, well, okay, this stuff gets baked in, we'll just restart containers with a new version of code because that takes less than a second each and you're fine. And then Lambda functions, it's okay, we'll just change the deployment option and the next invocation will wind up taking the brand new environment variables passed out to it. How do feature flags feature into those, I guess, three evolving methods of running applications in anger, by which I mean, of course, production?Steve: [laugh]. Good question. And I think you really articulated that well.Corey: Well, thank you. I should hope so. I'm a storyteller. At least I fancy myself one.Steve: [laugh]. Yes, you are. 
Really what you talked about is the evolution of you know, at the beginning, people were—well, first of all, people probably were embedding their variables deep in their code and then they realized, “Oh, I want to change this,” and now you have to find where in my code that is. And so, it became a pattern. Why don't we separate everything that's configuration data into its own file? But it'll get compiled at build time and sent out all at once.
There was kind of this breakthrough that was, why don't we actually separate out the deployment of this? We can separate the deployment of code from the deployment of configuration data, and have the code be reading that configuration data on a regular interval, as I already said. So now, as the environments have changed—like you said, containers and Lambda—that ability to make tweaks at microsecond intervals is more important and more powerful. So, there certainly is still value in having things like environment variables that get read at startup. We call that static configuration as opposed to dynamic configuration.
And that's a very important element in the world of containers that you talked about. Containers are a bit ephemeral, and so they kind of come and go, and you can restart things, or you might spin up new containers that are slightly different config and have them operate in a certain way. And again, Lambda takes that to the next level.
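A rough sketch of the distinction Steve draws here, in Python: a static value read once from an environment variable at startup, next to a dynamic value that the running process re-reads on an interval. The `DynamicConfig` class, the `fetch` callable, and the `MAX_BACKGROUND_TASKS` and `max_background_tasks` names are all illustrative assumptions, not part of AppConfig or any AWS SDK; a real application would plug in whatever client actually retrieves its configuration.

```python
import json
import os
import time
from typing import Callable, Optional


class DynamicConfig:
    """Re-reads configuration through `fetch` at most once per interval."""

    def __init__(self, fetch: Callable[[], str], interval_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                # hypothetical: returns a JSON object as text
        self._interval = interval_seconds
        self._clock = clock                # injectable so the behavior can be tested
        self._values: dict = {}
        self._last_fetch: Optional[float] = None  # None means "never fetched"

    def get(self, key: str, default=None):
        now = self._clock()
        if self._last_fetch is None or now - self._last_fetch >= self._interval:
            try:
                self._values = json.loads(self._fetch())
            except (OSError, ValueError):
                pass  # fetch or parse failed: keep serving the last known values
            self._last_fetch = now
        return self._values.get(key, default)


# Static configuration: read once at startup; changing it requires a restart.
MAX_BACKGROUND_TASKS_STATIC = int(os.environ.get("MAX_BACKGROUND_TASKS", "50"))
```

With this shape, code that calls `cfg.get("max_background_tasks", 50)` before each unit of work picks up a lowered limit within one polling interval, without a code push or a restart, while the static value stays fixed for the life of the process.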
I'm really excited where people are going to take feature flags to the next level because already today we have people just fine-tuning to very targeted small subsets, different configuration data, different feature flag data, and it allows them to do this at a scale we've never seen before: turning this on, seeing how it reacts, seeing how the application behaves, and then being able to roll that out to all of your audience.
Now, you got to be careful, you really don't want to have completely different configurations out there and have 10 different, or you know, 100 different configurations out there. That makes it really tough to debug. So, you want to think of this as I want to roll this out gradually over time, but eventually, you want to have this sort of state where everything is somewhat consistent.
Corey: That, on some level, speaks to a level of operational maturity that my current deployment adventures generally don't have. A common reference I make is to my lasttweetinaws.com Twitter threading app. And anyone can visit it, use it however they want.
And it uses a Route 53 latency record to figure out, ah, which is the closest region to you because I've deployed it to 20 different regions. Now, if this were a paid service, or I had people using this in large volume and I had to worry about that sort of thing, I would probably approach something that is very close to what you describe. In practice, I pick a devoted region that I deploy something to, and cool, that's sort of my canary where I get things working the way I would expect. And when that works the way I want it to I then just push it to everything else automatically.
Given that I've put significant effort into getting deployments down to approximately two minutes to deploy to everything, it feels like that's a reasonable amount of time to push something out.
Whereas if I were, I don't know, running a bank, for example, I would probably have an incredibly heavy process around things that make changes to things like payment or whatnot. Because despite the lies, we all like to tell both to ourselves and in public, anything that touches payments does go through waterfall, not agile iterative development because that mistake tends to show up on your customer's credit card bills, and then they're also angry. I think that there's a certain point of maturity you need to be at as either an organization or possibly as a software technology stack before something like feature flags even becomes available to you. Would you agree with that, or is this something everyone should use?
Steve: I would agree with that. Definitely, a small team that has communication flowing between the two probably won't get as much value out of a gradual release process because everybody kind of knows what's going on inside of the team. Once your team scales, or maybe your audience scales, that's when it matters more. You really don't want to have something blow up with your users. You really don't want to have people getting paged in the middle of the night because of a change that was made. And so, feature flags do help with that.
So typically, the journey we see is people start off in a maybe very small startup. They're releasing features at a very fast pace. They grow and they start to build their own feature flagging solution—again, companies I've been at previously have done that—and you start using feature flags and you see the power of it. Oh, my gosh, this is great. I can release something when I want without doing a big code push. I can just do a small little change, and if something goes wrong, I can roll it back instantly.
That's really handy.
And so, the basics of feature flagging might be a homegrown solution that you all have built. If you really lean into that and start to use it more, then you probably want to look at a third-party solution because there's so many features out there that you might want. A lot of them are around safeguards that make sure that releasing a new feature is safe. You know, again, pushing out a new feature to everybody could be similar to pushing out untested code to production. You don't want to do that, so you need to have, you know, some checks and balances in your release process of your feature flags, and that's what a lot of third parties do.
It really depends—to get back to your question about who needs feature flags—it depends on your audience size. You know, if you have enough audience out there to want to do a small rollout to a small set first and then have everybody hit it, that's great. Also, if you just have, you know, one or two developers, then feature flags are probably something that you're just kind of, you're doing yourself, you're pushing out this thing anyway on your own, but you don't need it coordinated across your team.
Corey: I think that there's also a bit of—how to frame this—misunderstanding on someone's part about where AppConfig starts and where it stops. When it was first announced, feature flags were one of the things that it did. And that was talked about on stage, I believe at re:Invent, but please don't quote me on that, when it wound up getting announced. And then in the fullness of time, there was another announcement of AppConfig now supports feature flags, which I'm sitting there and I had to go back to my old notes. Like, did I hallucinate this? Which again, would not be the first time I'd imagine such a thing.
But no, it was originally how the service was described, but now it's extra feature flags, almost like someone would, I don't know, flip on a feature-flag toggle for the service and now it does a different thing. What changed? What was it that was misunderstood about the service initially versus what it became?
Steve: Yeah, I wouldn't say it was a misunderstanding. I think what happened was we launched it, guessing what our customers were going to use it as. We had done plenty of research on that, and as I mentioned before we had—
Corey: Please tell me someone used it as a database. Or am I the only nutter that does stuff like that?
Steve: We have seen that before. We have seen something like that before.
Corey: Excellent. Excellent, excellent. I approve.
Steve: And so, we had done our due diligence ahead of time about how we thought people were going to use it. We were right about a lot of it. I mentioned before that we have a lot of usage internally, so you know, that was kind of maybe cheating even for us to be able to sort of see how this is going to evolve. What we did announce, I guess it was last November, was an opinionated version of feature flags. So, we had people using us for feature flags, but they were building their own structure, their own JSON, and there was not a dedicated console experience for feature flags.
What we announced last November was an opinionated version that structured the JSON in a way that we think is the right way, and that afforded us the ability to have a smooth console experience. So, if we know what the structure of the JSON is, we can have things like toggles and validations in there that really specifically look at some of the data points. So, that's really what happened. We're just making it easier for our customers to use us for feature flags.
We still have some customers that are kind of building their own solution, but we're seeing a lot of them move over to our opinionated version.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500 plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third party services in a single pane of glass.
Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14 day trial and get a complimentary T-shirt when you install the agent.
To learn more, visit datadoghq/screaminginthecloud. That's www.datadoghq/screaminginthecloud
Corey: Part of the problem I have when I look at what it is you folks do, and your use cases, and how you structure it is, it's similar in some respects to how folks perceive things like FIS, the fault injection service, or chaos engineering, as is commonly known, which is, “We can't even get the service to stay up on its own for any [unintelligible 00:18:35] period of time. What do you mean, now let's intentionally degrade it and make it work?” There needs to be a certain level of operational stability or operational maturity. When you're still building a service before it's up and running, feature flags seem awfully premature because there's no one depending on it. You can change configuration however your little heart desires. In most cases. I'm sure at certain points of scale of development teams, you have a communications problem internally, but it's not aimed at me trying to get something working at 2 a.m. in the middle of the night.
Whereas by the time folks are ready for what you're doing, they clearly have that level of operational maturity established. So, I have to guess on some level, that your typical adopter of AppConfig feature flags isn't in fact, someone who is, “Well, we're ready for feature flags; let's go,” but rather someone who's come up with something else as a stopgap as they've been iterating forward. Usually something homebuilt. And it might very well be you have the exact same biggest competitor that I do in my consulting work, which is of course, Microsoft Excel as people try to build their own thing that works in their own way.
Steve: Yeah, so definitely a very common customer of ours is somebody that is using a homegrown solution for turning on and off things. And they really feel like I'm using the heck out of these feature flags. I'm using them on a daily or weekly basis. I would like to have some enhancements to how my feature flags work, but I have limited resources and I'm not sure that my resources should be building enhancements to a feature-flagging service, but instead, I'd rather have them focusing on something, you know, directly for our customers, some of the core features of whatever your company does. And so, that's when people sort of look around externally and say, “Oh, let me see if there's some other third-party service or something built into AWS like AWS AppConfig that can meet those needs.”
And so absolutely, the workflows get more sophisticated, the ability to move forward faster becomes more important, and do so in a safe way. I used to work at a cybersecurity company and we would kind of joke that the security budget of the company is relatively low until something bad happens, and then it's, you know, whatever you need to spend on it.
It's not quite the same with feature flags, but you do see when somebody has a problem on production, and they want to be able to turn something off right away or make an adjustment right away, then the ability to do that in a measured way becomes incredibly important. And so, that's when, again, you'll see customers starting to feel like they're outgrowing their homegrown solution and moving to something that's a third-party solution.
Corey: Honestly, I feel like so many tools exist in this space, where, “Oh, yeah, you should definitely use this tool.” And most people will use that tool. The second time. Because the first time, it's one of those, “How hard could that be? I can build something like that in a weekend.” Which is sort of the rallying cry of doomed engineers who are bad at scoping.
And by the time that they figure out why, they have to backtrack significantly. There's a whole bunch of stuff that I have built that people look at and say, “Wow, that's a really great design. What inspired you to do that?” And the absolute honest answer to all of it is simply, “Yeah, I worked in roles where the first time, I did it the way you would think I would do it, and it didn't go well.” Experience is what you get when you didn't get what you wanted, and this is one of those areas where it tends to manifest in reasonable ways.
Steve: Absolutely, absolutely.
Corey: So, give me an example here, if you don't mind, about how feature flags can improve the day-to-day experience of an engineering team or an engineer themselves. Because we've been down this path enough, in some cases, to know the failure modes, but for folks who haven't been there and are trying to shave a little bit off of their journey of, “I'm going to learn from my own mistakes.” Eh, learn from someone else's. What are the benefits that accrue and are felt immediately?
Steve: Yeah. So, we kind of have a policy that the very first commit of any new feature ought to be the feature flag.
That's that sort of on-off switch that you want to put there so that you can start to deploy your code and not have a long-lived branch in your source code. But you can have your code there, it reads whether that configuration is on or off. You start with it off.
And so, it really helps just while developing these things about keeping your branches short. And you can push the mainline to production, as long as the feature flag is off and the feature is hidden, which is great. So, that helps with the mess of doing big code merges. The other part is around the launch of a feature.
So, you talked about Andy Jassy being on stage to launch a new feature. Sort of the old way of doing this, Corey, was that you would need to look at your pipelines and see how long it might take for you to push out your code with any sort of code change in it. And let's say that was an hour-and-a-half process and let's say your CEO is on stage at eight o'clock on a Friday. And as much as you like to say, “Oh, I'm never pushing out code on a Friday,” sometimes you have to. The old way—
Corey: Yeah, that week, yes you are, whether you want to or not.
Steve: [laugh]. Exactly, exactly. The old way was this idea that I'm going to time my release, and it takes an hour-and-a-half; I'm going to push it out, and I'll do my best, but hopefully, when the CEO raises her arm or his arm up and points to a screen, everything's lit up. Well, let's say you're doing that and something goes wrong and you have to start over again. Well, oh, my goodness, we're 15 minutes behind, can you accelerate things?
And then you start to pull away some of these blockers to accelerate your pipeline or you start editing it right in the console of your application, which is generally not a good idea right before a really big launch.
So, the new way is, I'm going to have that code already out there on a Wednesday [laugh] before this big thing on a Friday, but it's hidden behind this feature flag, I've already turned it on and off for internals, and it's just waiting there. And so, then when the CEO points to the big screen, you can just flip that one small little configuration change—and that can be almost instantaneous—and people can access it. So, that just reduces the amount of stress, reduces the amount of risk in pushing out your code.
Another thing is—we've heard this from customers—customers are increasing the number of deploys that they can do per week by a very large percentage because they're deploying with confidence. They know that I can push out this code and it's off by default, then I can turn it on whenever I feel like it, and then I can turn it off if something goes wrong. So, if you're into CI/CD, you can actually just move a lot faster with a number of pushes to production each week, which again, I think really helps engineers on their day-to-day lives. The final thing I'm going to talk about is that let's say you did push out something, and for whatever reason, that following weekend, something's going wrong. The old way was oop, you're going to get a page, I'm going to have to get on my computer and go and debug things and fix things, and then push out a new code change.
And this could be late on a Saturday evening when you're out with friends. If there's a feature flag there that can turn it off and if this feature is not critical to the operation of your product, you can actually just go in and flip that feature flag off until the next morning or maybe even Monday morning.
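The pattern Steve describes across these examples (ship the code dark, turn it on at launch, turn it off if something goes wrong) can be sketched in a few lines of Python. The flag name `new_checkout`, the `{"enabled": ...}` shape of the flag data, and both function names are assumptions made for illustration; they are not taken from the episode or from any specific flagging service.

```python
from typing import Mapping


def is_enabled(flags: Mapping[str, dict], name: str) -> bool:
    """A flag counts as on only if it exists and is explicitly enabled."""
    return bool(flags.get(name, {}).get("enabled", False))


def render_checkout(flags: Mapping[str, dict]) -> str:
    # The new code path is deployed but hidden; the old path stays the default.
    if is_enabled(flags, "new_checkout"):
        return "new checkout flow"
    return "existing checkout flow"


# Wednesday: the code is in production, but the flag is absent, so nothing changes.
print(render_checkout({}))
# Friday launch: one configuration change exposes the feature.
print(render_checkout({"new_checkout": {"enabled": True}}))
# Saturday night incident: flip it off again, no code push required.
print(render_checkout({"new_checkout": {"enabled": False}}))
```

Treating a missing flag as off is the design choice that makes "first commit is the feature flag" safe: code can merge to the mainline before the flag data even exists.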
So, in theory, you kind of get your free time back when you are implementing feature flags. So, I think those are the big benefits for engineers in using feature flags.
Corey: And the best way to figure out whether someone is speaking from a position of experience or is simply a raving zealot when they're in a position where they are incentivized to advocate for a particular way of doing things or a particular product, as—let's be clear—you are in that position, is to ask a form of the following question. Let's turn it around for a second. In what scenarios would you absolutely not want to use feature flags? What problems arise? When do you take a look at a situation and say, “Oh, yeah, feature flags will make things worse, instead of better. Don't do it.”
Steve: I'm not sure I would necessarily say don't do it—maybe I am that zealot—but you got to do it carefully.
Corey: [laugh].
Steve: You really got to do things carefully because as I said before, flipping on a feature flag for everybody is similar to pushing out untested code to production. So, you want to do that in a measured way. So, you need to make sure that you do a couple of things. One, there should be some way to measure what the system behavior is for a small set of users with that feature flag flipped to on first. And it could be some canaries that you're using for that.
You can also—there are other mechanisms you can use to do that: set up cohorts and beta testers and those kinds of things. But I would say the gradual rollout and the targeted rollout of a feature flag is critical. You know, again, it sounds easy, “I'll just turn it on later,” but you ideally don't want to do that. The second thing you want to do is, if you can, is there some sort of validation that the feature flag is what you expect?
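One common way to implement the gradual rollout mentioned above is to hash each user into a stable bucket and compare that bucket against the current rollout percentage. The Python sketch below shows that general technique under stated assumptions; it is not a description of how AppConfig selects targets, and the `bucket` and `in_rollout` functions and the flag and user names are hypothetical.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """True if this user falls inside the first `percent` of buckets."""
    return bucket(flag_name, user_id) < percent * 100


# Widening the percentage only ever adds users; nobody flips back and forth.
users = [f"user-{i}" for i in range(1000)]
at_1 = {u for u in users if in_rollout("new_checkout", u, 1)}
at_10 = {u for u in users if in_rollout("new_checkout", u, 10)}
print(at_1 <= at_10)
```

Because the bucket depends only on the flag name and the user ID, a user who sees the feature at 1% still sees it at 10%, which keeps the measured cohort consistent as the rollout widens. Including the flag name in the hash means different flags get different early cohorts.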
So, I was talking about on-off feature flags; there are things, as when I was talking about dynamic configuration, that are things like throttling limits, that you actually want to make sure that you put in some other safeguards that say, “I never want my TPS to go above 1200 and never want to set it below 800,” for whatever reason, for example. Well, you want to have some sort of validation of that data before the feature flag gets pushed out. Inside Amazon, we actually have the policy that every single flag needs to have some sort of validation around it so that we don't accidentally fat-finger something out before it goes out there. And we have fat-fingered things.
Corey: Typing the wrong thing into a command structure into a tool? “Who would ever do something like that?” He says, remembering times he's taken production down himself, exactly that way.
Steve: Exactly, exactly, yeah. And we've done it at Amazon and AWS, for sure. And so yeah, if you have some sort of structure or process to validate that—because oftentimes, what you're doing is you're trying to remediate something in production. Stress levels are high, it is especially easy to fat-finger there. So, that check-and-balance of a validation is important.
And then ideally, you have something to automatically roll back whatever change that you made, very quickly. So AppConfig, for example, hooks up to CloudWatch alarms. If an alarm goes off, we're actually going to roll back instantly whatever that feature flag was to its previous state so that you don't even need to really worry about validating against your CloudWatch. It'll just automatically do that against whatever alarms you have.
Corey: One of the interesting parts about working at Amazon and seeing things in Amazonian scale is that one in a million events happen thousands of times every second for you folks. What lessons have you learned by deploying feature flags at that kind of scale?
Because one of my problems and challenges with deploying feature flags myself is that in some cases, we're talking about three to five users a day for some of these things. That's not really enough usage to get insights into various cohort analyses or A/B tests.
Steve: Yeah. As I mentioned before, we build these things as features into our product. So, I just talked about the CloudWatch alarms. That wasn't there originally. Originally, you know, if something went wrong, you would observe a CloudWatch alarm and then you decide what to do, and one of those things might be that I'm going to roll back my configuration.
So, a lot of the mistakes that we made that caused alarms to go off necessitated us building some automatic mechanisms. And you know, a human being can only react so fast, but an automated system there is going to be able to roll things back very, very quickly. So, that came from some specific mistakes that we had made inside of AWS. The validation that I was talking about as well. We have a couple of ways of validating things.
You might want to do a syntactic validation, which really you're validating—as I was saying—the range between 100 and 1000, but you also might want to have sort of a functional validation, or we call it a semantic validation so that you can make sure that, for example, if you're switching to a new database, that you're going to flip over to your new database, you can have a validation there that says, “This database is ready, I can write to this table, it's truly ready for me to switch.” Instead of just updating some config data, you're actually going to be validating that the new target is ready for you. So, those are a couple of things that we've learned from some of the mistakes we made.
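The safeguards Steve lists (a syntactic range check, a semantic readiness check, and an automatic rollback when an alarm fires) can be combined in one small Python sketch. Everything here is hypothetical: `tps_in_range`, `target_ready`, `deploy`, and the `max_tps` key are illustrative names, the 800 to 1200 bounds are borrowed from his earlier throttling example, and `alarm_firing` is a plain callable standing in for a real alarm such as one from CloudWatch. It does not reproduce AppConfig's validator or rollback interfaces.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]
Validator = Callable[[Config], None]  # raises ValueError on a bad config


def tps_in_range(config: Config) -> None:
    """Syntactic check: the throttle must stay inside hard bounds."""
    tps = config.get("max_tps")
    if not isinstance(tps, int) or not 800 <= tps <= 1200:
        raise ValueError(f"max_tps must be an integer in [800, 1200], got {tps!r}")


def target_ready(probe: Callable[[], bool]) -> Validator:
    """Semantic check: ask the new dependency whether it is really usable."""
    def validate(config: Config) -> None:
        if not probe():
            raise ValueError("new target is not ready to take traffic")
    return validate


def deploy(current: Config, proposed: Config, validators: List[Validator],
           alarm_firing: Callable[[], bool]) -> Config:
    """Return the config that should be live after this deployment attempt."""
    for validate in validators:
        validate(proposed)   # reject a bad value before anything is pushed out
    if alarm_firing():       # consulted once the proposed config is live
        return current       # automatic rollback to the previous state
    return proposed
```

In this sketch a fat-fingered `{"max_tps": 9000}` never leaves the validation step, and a value that passes validation but trips the alarm is replaced by the previous config without a human in the loop.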
And again, not saying we aren't making mistakes still, but we always look at these things inside of AWS and figure out how we can benefit from them and how our customers, more importantly, can benefit from these mistakes.
Corey: I would say that I agree. I think that you have threaded the needle of not talking smack about your own product, while also presenting it as not the global panacea that everyone should roll out, willy-nilly. That's a good balance to strike. And frankly, I'd also say it's probably a good point to park the episode. If people want to learn more about AppConfig, how you view these challenges, or even potentially want to get started using it themselves, what should they do?
Steve: We have an informational page at go.aws/awsappconfig. That will tell you the high-level overview. You can search for our documentation and we have a lot of blog posts to help you get started there.
Corey: And links to that will, of course, go into the show notes. Thank you so much for suffering my slings, arrows, and other assorted nonsense on this. I really appreciate your taking the time.
Steve: Corey, thank you for the time. It's always a pleasure to talk to you. Really appreciate your insights.
Corey: You're too kind. Steve Rice, principal product manager for AWS AppConfig. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment. But before you do, just try clearing your cookies and downloading the episode again. You might be in the 3% cohort for an A/B test, and you want to listen to the good one instead.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.