Podcasts about The Duckbill Group

  • 24 podcasts
  • 350 episodes
  • 30m average duration
  • 5 new episodes weekly
  • Latest: Dec 2, 2022


Latest podcast episodes about The Duckbill Group

Screaming in the Cloud
The Rapid Rise of Vector Databases with Ram Sriharsha

Dec 2, 2022 · 31:41


About Ram
Dr. Ram Sriharsha held engineering, product management, and VP roles at the likes of Yahoo, Databricks, and Splunk. At Yahoo, he was both a principal software engineer and then research scientist; at Databricks, he was the product and engineering lead for the unified analytics platform for genomics; and, in his three years at Splunk, he played multiple roles including Sr Principal Scientist, VP Engineering, and Distinguished Engineer.

Links Referenced:
  • Pinecone: https://www.pinecone.io/
  • XKCD comic: https://www.explainxkcd.com/wiki/index.php/1425:_Tasks

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Chronosphere. Tired of observability costs going up every year without getting additional value? Or being locked into a vendor due to proprietary data collection, querying, and visualization? Modern-day, containerized environments require a new kind of observability technology that accounts for the massive increase in scale and attendant cost of data. With Chronosphere, choose where and how your data is routed and stored, query it easily, and get better context and control. 100% open-source compatibility means that no matter what your setup is, they can help. Learn how Chronosphere provides complete and real-time insight into ECS, EKS, and your microservices, wherever they may be, at snark.cloud/chronosphere. That's snark.cloud/chronosphere.

Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don't. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn't care enough about backups. If you're tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M, for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest episode is brought to us by our friends at Pinecone, and they have given their VP of Engineering and R&D over to suffer my various slings and arrows, Ram Sriharsha. Ram, thank you for joining me.

Ram: Corey, great to be here. Thanks for having me.

Corey: So, I was immediately intrigued when I wound up seeing your website, pinecone.io, because it says right at the top—at least as of this recording—in bold text, "The Vector Database." And if there's one thing that I love, it is using things that are not designed to be databases as databases, or inappropriately referring to things—be they JSON files or senior engineers—as databases as well. What is a vector database?

Ram: That's a great question. And we do use this term correctly, I think. You can think of customers of Pinecone as having all the data management problems that they have with traditional databases; the main difference is twofold. One is there is a new data type, which is vectors. Vectors, you can think of them as arrays of floats, floating point numbers, and there is a new pattern of use cases, which is search. And what you're trying to do in vector search is you're looking for the nearest, the closest vectors to a given query. So, these two things fundamentally put a lot of stress on traditional databases. So, it's not like you can take a traditional database and make it into a vector database. That is why we coined this term vector database and we are building a new type of vector database. But fundamentally, it has all the database challenges on a new type of data and a new query pattern.

Corey: Can you give me an example of what, I guess, an idealized use case would be, of what the data set might look like and what sort of problem a vector database would solve?

Ram: A very great question. So, one interesting thing is there's many, many use cases. I'll just pick the most natural one, which is text search. So, if you're familiar with Elastic or any other traditional text search engines, you have pieces of text, you index them, and the indexing that you do is traditionally an inverted index, and then you search over this text. And what this sort of search engine does is it matches for keywords. So, if it finds a keyword match between your query and your corpus, it's going to retrieve the relevant documents. And this is what we call text search, right, or keyword search.

You can do something similar with technologies like Pinecone, but what you do here is instead of searching over text, you're searching over vectors. Now, where do these vectors come from? They come from taking deep-learning models, running your text through them, and these generate these things called vector embeddings. And now, you're taking a query as well, running it through deep-learning models, generating these query embeddings, and looking for the closest record embeddings in your corpus that are similar to the query embeddings. This notion of proximity in this space of vectors tells you something about semantic similarity between the query and the text. So suddenly, you're going beyond keyword search into semantic similarity.

An example is if you had a whole lot of text data, and maybe you were looking for 'soda,' and you were doing keyword search. Keyword search will only match on variations of soda. It will never match 'Coca-Cola' because Coca-Cola and soda have nothing to do with each other.

Corey: Or Pepsi, or pop, as they say in the American Midwest.

Ram: Exactly.

Corey: Yeah.

Ram: Exactly. However, semantic search engines can actually match the two because they're matching for intent, right? If they find in this piece of text enough intent to suggest that soda and Coca-Cola or Pepsi or pop are related to each other, they will actually match those and score them higher. And you're very likely to retrieve those sort of candidates that traditional search engines simply cannot. So, this is a canonical example, what's called semantic search, and it's known to be done better by these vector search engines. There are also other examples in, say, image search. If you're looking for near-duplicate images, you can't even do this today without a technology like vector search.
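To make the query pattern concrete, here is a minimal brute-force sketch in Python: score every stored vector against a query embedding and keep the top matches. The IDs and vector values are invented for illustration; a real vector database replaces this linear scan with an approximate index.

```python
import numpy as np

# Toy corpus of vector embeddings: arrays of floats, one per record.
# Values are invented; real embeddings come from a deep-learning model.
corpus = {
    "doc-1": np.array([0.90, 0.10, 0.05]),
    "doc-2": np.array([0.20, 0.80, 0.10]),
    "doc-3": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Proximity in vector space: higher means more semantically similar.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query: np.ndarray, k: int = 2):
    # Brute-force nearest-neighbor search: score everything, take the top k.
    scored = [(doc_id, cosine_similarity(query, v)) for doc_id, v in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

query = np.array([0.85, 0.20, 0.05])  # the query embedding
print(nearest(query))                 # "doc-1" scores highest: the closest vector
```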
Corey: What is the, I guess, translation or conversion process of an existing dataset into something that a vector database could use? Because you mentioned an array of floats was the natural vector datatype. I don't think I've ever seen even the most arcane markdown implementation that expected people to wind up writing in arrays of floats. What does that look like? How do you wind up, I guess, internalizing or ingesting existing bodies of text for your example use case?

Ram: Yeah, this is a very great question. This used to be a very hard problem, and what has happened over the last several years in deep-learning literature, as well as in deep learning as a field itself, is that there have been these large, publicly trained models—examples would be OpenAI, examples would be the models that are available in Hugging Face like Cohere—and a large number of these companies have come forward with very well-trained models through which you can pass pieces of text and get these vectors. So, you no longer have to actually train these sort of models, you don't have to really have the expertise to deeply figure out how to take pieces of text and build these embedding models. What you can do is just take a stock model. If you're familiar with OpenAI, you can just go to OpenAI's homepage and pick a model that works for you, Hugging Face models, and so on. There's a lot of literature to help you do this.

Sophisticated customers can also do something called fine-tuning, which is built on top of these models, to fine-tune for their use cases. The technology is out there already, there's a lot of documentation available. Even Pinecone's website has plenty of documentation to do this. Customers of Pinecone do this [unintelligible 00:07:45], which is they take pieces of text, run them through either these pre-trained models or through fine-tuned models, get the series of floats which represent them, the vector embeddings, and then send it to us. So, that's the workflow. The workflow is basically a machine-learning pipeline that either takes a pre-trained model and passes these pieces of text or images or what have you through it, or actually has a fine-tuning step in it.

Corey: Is that ingest process something that not only benefits from but also requires the use of a GPU or something similar to that, to wind up doing the in-depth, very specific type of expensive math for data ingestion?

Ram: Yes, very often these run on GPUs. Sometimes, depending on budget, you may have compressed models or smaller models that run on CPUs, but most often they do run on GPUs. Most often, we actually find people make just API calls to services that do this for them. So, very often, people are actually not deploying these GPU models themselves; they are maybe making a call to Hugging Face's service, or to OpenAI's service, and so on. And by the way, these companies also democratized this quite a bit. It was much, much harder to do this before they came around.
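As a rough illustration of the workflow Ram outlines (pass text through a stock pre-trained model, get arrays of floats back, then hand them to the database), here is a sketch assuming the sentence-transformers library and its all-MiniLM-L6-v2 checkpoint. The model choice and document strings are assumptions; hosted embedding APIs such as OpenAI's or Cohere's slot into the same place, and the (id, vector) record shape is the general form a vector database ingests, not any specific client's exact signature.

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

# A stock pre-trained model: no training expertise required on our side.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Coca-Cola announced a new bottle design this week.",
    "Pepsi sales rose sharply across the Midwest.",
    "The mountain hiking trail closes for winter.",
]

# Each document becomes a vector embedding (384 floats for this model).
embeddings = model.encode(corpus)

# The records a vector database ingests are (id, vector) pairs, often with
# metadata attached; this shape is illustrative, not a specific SDK call.
records = [(f"doc-{i}", emb.tolist()) for i, emb in enumerate(embeddings)]

# Keyword search would never match "soda" to any of these documents;
# embedding the query and ranking by proximity captures the intent.
query_embedding = model.encode(["soda"])
scores = util.cos_sim(query_embedding, embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))  # likely one of the soft-drink sentences
```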
Corey: Oh, yeah. I mean, I'm reminded of the old XKCD comic from years ago, which was, "Okay, I want to give you a picture. And I want you to tell me it was taken within the boundaries of a national park." Like, "Sure. Easy enough. Geolocation information is attached. It'll take me two hours." "Cool. And I also want you to tell me if it's a picture of a bird." "Okay, that'll take five years and a research team."

And sure enough, now we can basically do that. The future is now, and it's kind of wild to see that unfolding in a human-perceivable timespan on these things. But I guess my question now is, so that is what a vector database does. What does Pinecone specifically do? It turns out that as much as I wish it were otherwise, not a lot of companies are founded on, "Well, we have this really neat technology, so we're just going to be here, well, in a foundational sense to wind up ensuring the uptake of that technology." No, no, there's usually a monetization model in there somewhere. Where does Pinecone start, where does it stop, and how does it differentiate itself from typical vector databases? If such a thing could be said to exist yet.

Ram: Such a thing doesn't exist yet. We were the first vector database, so in a sense, building this infrastructure, scaling it, and making it easy for people to operate it in a SaaS fashion is our primary core product offering. On top of that, we very recently started also enabling people who actually have raw text to not just be able to get value from these vector search engines and so on, but also be able to take advantage of traditional what we call keyword search or sparse retrieval, and do a combined search better, in Pinecone. So, there's value-add on top of this that we do, but I would say the core of it is building a SaaS managed platform that allows people to easily store this data, scale it, and query it in a way that's very hands-off and doesn't require a lot of tuning or operational burden on their side. This is, like, our core value proposition.

Corey: Got it. There's something to be said for making something accessible when previously it had only really been available to people who completed the Hello World tutorial—which generally resembled a doctorate at Berkeley or Waterloo or somewhere else—and turning it into something that's fundamentally, click the button. Where on that, I guess, spectrum of evolution do you find that Pinecone is today?

Ram: Yeah. So, you know, prior to Pinecone, we didn't really have this notion of a vector database. For several years, we've had libraries that are really good that you can pre-train on your embeddings, generate this thing called an index, and then you can search over that index. There is still a lot of work to be done even to deploy that and scale it and operate it in production and so on. Even that was not being, kind of, offered as a managed service before.

What Pinecone does, which is novel, is you no longer have to have this pre-training done by somebody, you no longer have to worry about when to retrain your indexes, what to do when you have new data, what to do when there are deletions, updates, and the usual data management operations. You can just think of this as, like, a database that you just throw your data in. It does all the right things for you; you just worry about querying. This has never existed before, right? It's not even like we are trying to make the operational part of something easier. It is that we are offering something that hasn't existed before, and at the same time, making it operationally simple.

So, we're solving two problems, which is we're building a better database that hasn't existed before. So, if you really had these sort of data management problems and you wanted to build an index that was fresh, that you didn't have to super manually tune for your own use cases, that simply couldn't have been done before.
But at the same time, we are doing all of this in a cloud-native fashion; it's easy for you to just operate and not worry about it.

Corey: You've said that this hasn't really been done before, but this does sound like it is more than passingly familiar specifically to the idea of nearest neighbor search, which has been around since the '70s in a bunch of different ways. So, how is it different? And let me of course ask my follow-up to that right now: why is this even an interesting problem to start exploring?

Ram: This is a great question. First of all, nearest neighbor search is one of the oldest forms of machine learning. It's been known for decades. There's a lot of literature out there, there are a lot of great libraries, as I mentioned in passing before. All of these problems have primarily focused on static corpuses. So basically, you have a set of some amount of data, you want to create an index out of it, and you want to query it. A lot of literature has focused on this problem.

Even there, once you go from a small number of dimensions to a large number of dimensions, things become computationally far more challenging. So, traditional nearest neighbor search actually doesn't scale very well. What do I mean by a large number of dimensions? Today, deep-learning models that produce image representations typically operate in 2048 dimensions of photos [unintelligible 00:13:38] dimensions. Some of the OpenAI models are even 10,000 dimensional and above. So, these are very, very large dimensions. Most of the literature prior to maybe even less than ten years back has focused on less than ten dimensions. So, it's like a scale apart in dealing with small-dimensional data versus large-dimensional data.

But even as of a couple of years back, there hasn't been enough, if any, focus on what happens when your data rapidly evolves. For example, what happens when people add new data? What happens if people delete some data? What happens if your vectors get updated? These aren't just theoretical problems; they happen all the time. Customers of ours face this all the time. In fact, the classic example is in recommendation systems, where user preferences change all the time, right, and you want to adapt to that, which means your user vectors change constantly. When even these sort of things change constantly, you want your index to reflect it, because you want your queries to catch on to the most recent data. [unintelligible 00:14:33] have to reflect the recency of your data. This is a solved problem for traditional databases. Relational databases are great at solving this problem. A lot of work has been done for decades to solve this problem really well. This is a fundamentally hard problem for vector databases, and that's one of the core focus areas [unintelligible 00:14:48] painful.

Another problem that is hard for these sort of databases is simple things like filtering. For example, you have a corpus of, say, product images, and you want to only look at images that maybe are for the Fall shopping line, right? Seems like a very natural query. Again, databases have known and solved this problem for many, many years. The moment you do nearest neighbor search with these sort of constraints, it's a hard problem. It's just that nearest neighbor search, and lots of research in this area, has simply not focused on what happens when those sort of techniques are combined with data management challenges, filtering, and all the traditional challenges of a database. So, when you start doing that, you enter a very novel area to begin with.
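Ram's Fall-shopping-line example reduces, in brute-force form, to applying the metadata constraint before ranking by proximity. Everything here (IDs, the metadata key, the vectors) is invented for illustration; the hard part he describes is doing this efficiently inside an approximate index, which a linear scan like this sidesteps entirely.

```python
import numpy as np

# Invented product records: an embedding plus metadata to filter on.
products = [
    {"id": "img-1", "line": "fall",   "vec": np.array([0.8, 0.1])},
    {"id": "img-2", "line": "summer", "vec": np.array([0.9, 0.2])},
    {"id": "img-3", "line": "fall",   "vec": np.array([0.1, 0.9])},
]

def filtered_search(query: np.ndarray, line: str, k: int = 1):
    # Apply the metadata constraint first, then rank survivors by proximity
    # (dot product as the score, assuming roughly normalized vectors).
    candidates = [p for p in products if p["line"] == line]
    scored = [(p["id"], float(query @ p["vec"])) for p in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# img-2 is actually closest to this query, but the filter excludes it,
# so the nearest item from the fall line wins instead.
print(filtered_search(np.array([0.85, 0.20]), line="fall"))  # [('img-1', ...)]
```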
Corey: This episode is sponsored in part by our friends at Redis, the company behind the incredibly popular open-source database. If you're tired of managing open-source Redis on your own, or if you are looking to go beyond just caching and unlocking your data's full potential, these folks have you covered. Redis Enterprise is the go-to managed Redis service that allows you to reimagine how your geo-distributed applications process, deliver, and store data. To learn more from the experts in Redis how to be real-time, right now, from anywhere, visit snark.cloud/redis. That's snark dot cloud slash R-E-D-I-S.

Corey: So, where's this space going, I guess, is sort of the dangerous but inevitable question I have to ask. Because whenever you talk to someone who is involved in a very early stage of what is potentially a transformative idea, it's almost indistinguishable from someone who is whatever the polite term for being wrapped around their own axle is, in a technological sense. It's almost a form of reverse Schneier's Law of anyone can create an encryption algorithm that they themselves cannot break. So, given the possibility that this may come back to bite us in the future if it turns out that this is not the revelation that you see it as, where do you see the future of this going?

Ram: Really great question. The way I think about it is—and the reason why I keep going back to databases and these sort of ideas is—we have a really great way to deal with structured data and structured queries, right? This is the evolution of the last maybe 40, 50 years: to come up with relational databases, come up with SQL engines, come up with scalable ways of running structured queries on large amounts of data. What I feel like this sort of technology does is it takes it to the next level, which is you can actually ask unstructured questions on unstructured data, right?

So, even the couple of examples we just talked about: doing near-duplicate detection of images, that's a very unstructured question. What does it even mean to say that two images are nearly duplicates of each other? I couldn't even phrase it as kind of a concrete thing. I certainly cannot write a SQL statement for it; I cannot even phrase it properly. With these sort of technologies, with the vector embeddings, with deep learning and so on, you can actually mathematically phrase it, right? The mathematical phrasing is very simple once you have the right representation that understands your image as a vector. Two images are nearly duplicate if they are close enough in the space of vectors. Suddenly you've taken a problem that was even hard to express, let alone compute, and made it precise to express, precise to compute. This is going to happen not just for images, not just for semantic search; it's going to happen for all sorts of unstructured data, whether it's time series, whether it's anomaly detection, whether it's security analytics, and so on. I actually think that fundamentally, a lot of fields are going to get disrupted by this sort of way of thinking about things. We are just scratching the surface here with semantic search, in my opinion.
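The near-duplicate definition Ram gives, two images are near-duplicates if their embeddings are close enough in vector space, is literally a one-line threshold test. A sketch, with the threshold value and toy vectors invented for illustration:

```python
import numpy as np

def near_duplicates(a: np.ndarray, b: np.ndarray, threshold: float = 0.95) -> bool:
    # The once-unphraseable question made precise: close enough in vector space?
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= threshold  # threshold is an assumption, tuned per use case

# Real inputs would be high-dimensional image embeddings (e.g., the
# 2048-dimensional representations mentioned above); these are toys.
print(near_duplicates(np.array([0.70, 0.70, 0.10]),
                      np.array([0.69, 0.71, 0.12])))  # True
```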
Corey: What is, I guess, your barometer for success? I mean, if I could take a very cynical point of view on this, it's, "Oh, well, whenever there's a managed vector database offering from AWS." They'll probably call it Amazon Basics Vector or something like that. Well, that is a—it used to be a snarky observation that, "Oh, we're not competing, we're just validating their market." Lately, with some of their competitive database offerings, there's a lot more truth to that than I suspect AWS would like. Their offerings are nowhere near as robust as what they pretend to be competing against. How far away do you think we are from the larger cloud providers starting to say, "Ah, we got the sense there was money in here, so we're launching an entire service around this?"

Ram: Yeah. I mean, this is a—first of all, this is a great question. There's always something that any innovator or disrupter has to be constantly thinking about, especially these days. I would say that having a multi-year head start in the use cases, in thinking about how this system should even look, what sort of use cases should it [unintelligible 00:19:34], what the operating points for the [unintelligible 00:19:37] database even look like, and how to build something that's cloud-native and scalable, is very hard to replicate. Meaning if you look at what we have already done and kind of tried to base the architecture off that, you're probably already a couple of years behind us in terms of just where we are at, right? Not just in the architecture, but also in the use cases and where this is evolving.

That said, I think Snowflake is a great example of this, which is: Snowflake needn't have existed if Redshift had done a phenomenal job of being cloud-native, right, and kind of done that before Snowflake did it. In hindsight, it seems like it's obvious, but when Snowflake did this, it wasn't obvious that that's where everything was headed. And Snowflake built something that's very technologically innovative, in the sense that it's even now hard to replicate. Plus, it takes a long time to replicate something like that. I think that's where we are at. If Pinecone does its job really well and if we simply execute efficiently, it's very hard to replicate that. So, I'm not super worried about cloud providers in this space, to be honest; I'm more worried about our execution.

Corey: If it helps anything, I'm not very deep into your specific area of the world, obviously, but I am optimistic when I hear people say things like that. Whenever I find folks who are relatively early along in their technological journey being very concerned about, oh, the large cloud provider is going to come crashing in, it feels on some level like their perspective is that they have one weird trick, and they were able to crack that, but they have no defensive moat because once someone else figures out the trick, well, okay, now we're done. The idea of sustained and lasting innovation in a space, I think, is the more defensible position to take, with the counterargument, of course, that that's a lot harder to find.

Ram: Absolutely. And I think for technologies like this, that's the only solution, which is, if you really want to avoid being disrupted by cloud providers, I think that's the way to go.

Corey: I want to talk a little bit about your own background. Before you wound up as the VP of R&D over at Pinecone, you were in a bunch of similar… I guess, similar-styled roles—if we'll call it that—at Yahoo, Databricks, and Splunk. I'm curious as to what your experience in those companies wound up impressing on you that made you say, "Ah, that's great and all, but you know what's next? That's right, vector databases." And off you went to Pinecone. What did you see?

Ram: So, first of all, in some way or the other, I have been involved in machine learning and systems and the intersection of these two for maybe the last decade and a half. So, it's always been something, like, in between the two, and that's been personally exciting to me. So, I'm kind of very excited by trying to think about new types of databases, new types of data platforms that really leverage machine learning and data. This has been personally exciting to me. I obviously learned very different things from different companies.

I would say that Yahoo was just the learning in cloud to begin with, because prior to joining Yahoo, I wasn't familiar with Silicon Valley cloud companies at that scale, and Yahoo is a big company and there's a lot to learn from there. It was also my first introduction to Hadoop, Spark, and even machine learning, where I really got into machine learning at scale, in online advertising and areas like that, which was a massive scale. And I got into that at Yahoo, and it was personally exciting to me because there are very few opportunities where you can work on machine learning at that scale, right?

Databricks was very exciting to me because it was an earlier-stage company than I had been at before. Extremely well run, and I learned a lot from Databricks: just the team, the culture, the focus on innovation, and the focus on product thinking. I joined Databricks as a product manager. I hadn't worn the product manager hat before that, so it was very much a learning experience for me, and I think I learned from some of the best in that area. And even at Pinecone, I carry that forward, which is to think about how my learnings at Databricks inform how we should be thinking about products at Pinecone, and so on. So, if I had to pick one company I learned a lot from, I would say it's Databricks. The most [unintelligible 00:23:50].

Corey: I would also like to point out, normally when people say, "Oh, the one company I've learned the most from," and they pick one of them out of their history, it's invariably the most recent one, but you left there in 2018—

Ram: Yeah.

Corey: —then went to go spend the next three years over at Splunk, where you were a Senior Principal Scientist, a Senior Director and Head of Machine Learning, and then you decided, okay, that's enough hard work. You're going to do something easier and be the VP of Engineering, which is just wild at a company of that scale.

Ram: Yeah. At Splunk, I learned a lot about management. I think managing large teams, managing multiple different teams, while working on very different areas is something I learned at Splunk. You know, I was at this point in my career where I was right around trying to start my own company. Basically, I was at a point where I'd taken enough learnings and I really wanted to do something myself.

That's when Edo—you know, the CEO of Pinecone—and I started talking. And we had worked together for many years; we started working together at Yahoo. We kept in touch with each other. And we started talking about the sort of problems that I was excited about working on, and then I came to realize what he was working on and what Pinecone was doing. And we thought it was a very good fit for the two of us to work together. So, that is kind of how it happened. It sort of happened by chance, as many things do in Silicon Valley, where a lot of things just happen by network and chance.
That's what happened in my case. I was just thinking of starting my own company at the time when a chance encounter with Edo led me to Pinecone.

Corey: It feels, from my admittedly uninformed perspective, that a lot of what you're doing right now in the vector database area follows the trajectory of machine learning, in that for a long time, the only people really excited about it were either sci-fi authors or folks who had trouble explaining it to someone without a degree in higher math. And then it turned into—a couple of big stories from the mid-2010s stick out at me from when people were trying to sell this to me in a variety of different ways. One of them was, "Oh, yeah, if you're a giant credit card processing company and trying to detect fraud with this kind of transaction volume—" it's, yeah, there are maybe three companies in the world that fall into that exact category. The other was WeWork, where they did a lot of computer vision work. And they used this to determine that at certain times of day there was congestion in certain parts of the buildings, and that this was best addressed by hiring a second barista. Which distilled down to, "Wait a minute, you're telling me that you spent how much money on machine learning and advanced analyses and data scientists and the rest, and figured out that people like to drink coffee in the morning?" Like, that is a little on the ridiculous side.

Now, I think that it is past the time for skepticism around machine learning when you can go to a website and type in a description of something and it paints a picture of the thing you just described. Or you can show it a picture and it describes what is in that picture fairly accurately. At this point, the only people who are skeptics, from my position on this, seem to be holding out for some sort of either next-generation miracle or are just being bloody-minded. Do you think that there's a tipping point for vector search where it's going to become blindingly obvious to, if not the mass market, at least the more run-of-the-mill, more prosaic level of engineer that hasn't specialized in this?

Ram: Yeah. It's already, frankly, started happening. So, two years back, I wouldn't have suspected this fast of an adoption for this new of a technology from this varied a number of use cases. I just wouldn't have suspected it because, you know, I still thought it's going to take some time for this field to mature and, kind of, everybody to really start taking advantage of this. This has happened much faster than even I assumed.

So, to some extent, it's already happening. A lot of it is because the barrier to entry is quite low right now, right? So, it's very easy and cost-effective for people to create these embeddings. There is a lot of documentation out there, things are getting easier and easier, day by day. Some of it is by Pinecone itself, by a lot of work we do. Some of it is by, like, companies that I mentioned before who are building better and better models, making it easier and easier for people to take these machine-learning models and use them without having to even fine-tune anything.

And as technologies like Pinecone really mature and dramatically become cost-effective, the barrier to entry is very low. So, what we tend to see people do—it's not so much about confidence in this new technology; it is connecting something simple that I need this sort of value out of, and finding the least critical path or the simplest way to get going on this sort of technology. And as long as we can make that barrier to entry very small and make this cost-effective and easy for people to explore, this is going to start exploding. And that's what we are seeing. And a lot of Pinecone's focus has been on ease of use, on simplicity in the zero-to-one journey, for precisely this reason. Because not only do we strongly believe in the value of this technology, it's becoming more and more obvious to the broader community as well. The remaining work to be done is just the ease of use and making things cost-effective. And cost-effectiveness is also what we focus on a lot. Like, this technology can be even more cost-effective than it is today.

Corey: I think that it is one of those never-mistaken ideas to wind up making something more accessible to folks than keeping it in a relatively rarefied environment. We take a look throughout the history of computing in general and cloud in particular, where formerly very hard things have largely been reduced down to click the button. Yes, yes, and then get yelled at because you haven't done infrastructure-as-code, but click the button is still possible. I feel like this is on that trendline based upon what you're saying.

Ram: Absolutely. And the more we can do here, both Pinecone and the broader community, I think the better and the faster the adoption of this sort of technology is going to be.

Corey: I really want to thank you for spending so much time talking me through what it is you folks are working on. If people want to learn more, where's the best place for them to go to find you?

Ram: Pinecone.io. Our website has a ton of information about Pinecone, as well as a lot of standard documentation. We have a free tier as well where you can play around with small data sets, really get a feel for vector search. It's completely free. And you can reach me at Ram at Pinecone. I'm always happy to answer any questions. Once again, thanks so much for having me.

Corey: Of course. I will put links to all of that in the show notes. This promoted guest episode is brought to us by our friends at Pinecone. Ram Sriharsha is their VP of Engineering and R&D. And I'm Cloud Economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that I will never read because the search on your podcast platform is broken because it's not using a vector database.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
The Complexities of AWS Cost Optimization with Rick Ochs

Dec 1, 2022 · 46:56


About Rick
Rick is the Product Leader of the AWS Optimization team. He previously led the cloud optimization product organization at Turbonomic, and before that was the Microsoft Azure Resource Optimization program owner.

Links Referenced:
  • AWS: https://console.aws.amazon.com
  • LinkedIn: https://www.linkedin.com/in/rick-ochs-06469833/
  • Twitter: https://twitter.com/rickyo1138

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Chronosphere. Tired of observability costs going up every year without getting additional value? Or being locked into a vendor due to proprietary data collection, querying, and visualization? Modern-day, containerized environments require a new kind of observability technology that accounts for the massive increase in scale and attendant cost of data. With Chronosphere, choose where and how your data is routed and stored, query it easily, and get better context and control. 100% open-source compatibility means that no matter what your setup is, they can help. Learn how Chronosphere provides complete and real-time insight into ECS, EKS, and your microservices, wherever they may be, at snark.cloud/chronosphere. That's snark.cloud/chronosphere.

Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don't. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn't care enough about backups. If you're tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M, for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. For those of you who've been listening to this show for a while, the theme has probably emerged, and that is that one of the key values of this show is to give the guest a chance to tell their story. It doesn't beat the guests up about how they approach things, it doesn't call them out for being completely wrong on things, because honestly, I'm pretty good at choosing guests, and I don't bring people on that are, you know, walking trash fires. And that is certainly not a concern for this episode. But this might devolve into a screaming loud argument, despite my best efforts. Today, I'm joined by Rick Ochs, Principal Product Manager at AWS. Rick, thank you for coming back on the show. The last time we spoke, you were not here; you were at, I believe, Turbonomic.

Rick: Yeah, that's right. Thanks for having me on the show, Corey. I'm really excited to talk to you about optimization and my current role and what we're doing.

Corey: Well, let's start at the beginning. Principal Product Manager. It sounds like one of those corporate titles that can mean a different thing in every company or every team that you're talking to. What is your area of responsibility? Where do you start and where do you stop?

Rick: Awesome. So, I am the product manager lead for all of the AWS Optimization team. So, I lead the product team. That includes several other product managers that focus in on Compute Optimizer, Cost Explorer, right-sizing recommendations, as well as Reservation and Savings Plan purchase recommendations.

Corey: In other words, you are the person who effectively oversees all of the AWS cost optimization tooling and approaches to same?

Rick: Yeah.

Corey: Give or take. I mean, you could argue that, oh, every team winds up focusing on helping customers save money. I could fight that argument just as effectively. But you effectively start and stop with respect to helping customers save money or understand where the money is going on their AWS bill.

Rick: I think that's a fair statement. And I also agree with your comment that I think a lot of service teams do think through those use cases and provide capabilities, you know? There's, like, S3 Storage Lens. You know, there's all sorts of other products that do offer optimization capabilities as well, but as far as, like, the unified purpose of my team, it is unilaterally focused on how do we help customers safely reduce their spend and not hurt their business at the same time.

Corey: Safely being the key word. For those who are unaware of my day job, I am a partial owner of The Duckbill Group, a consultancy where we fix exactly one problem: the horrifying AWS bill. This is all that I've been doing for the last six years, so I have some opinions on AWS bill reduction as well. So, this is going to be a fun episode for the two of us to wind up, mmm, more or less smacking each other around, but politely, because we are both professionals. So, let's start with a very high level. How does AWS think about AWS bills from a customer perspective? You talk about optimizing it, but what does that mean to you?

Rick: Yeah. So, I mean, there's a lot of ways to think about it, especially depending on who I'm talking to, you know, where they sit in our organization. I would say I think about optimization in four major themes. The first is how do you scale correctly, whether that's right-sizing or architecting things to scale in and out. The second thing I would say is, how do you do pricing and discounting, whether that's Reservation management, Savings Plan management, coverage, how do you handle the expenditures of prepayments and things like that.

Then I would say suspension. What that means is: turn the lights off when you leave the room. We have a lot of customers that do this, and I think there's a lot of opportunity for more. Turning EC2 instances off when they're not needed, if they're non-production workloads or other, sort of, stateful services that charge by the hour, I think there's a lot of opportunity there.

And then the last of the four methods is cleanup. And I think it's maybe one of the lowest-hanging fruit, but essentially, are you done using this thing? Delete it. And there's a whole opportunity of cleaning up, you know, IP addresses, unattached EBS volumes, sort of these resources that hang around in AWS accounts and get lost and forgotten as well. So, those are the four kind of major thematic strategies for how to optimize a cloud environment that we think about and spend a lot of time working on.
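Rick's suspension and cleanup themes are mechanical enough to sketch with boto3. This is an illustration rather than AWS's own tooling: the Environment=dev tag convention and the region are assumptions, and anything that stops instances or deletes volumes deserves a dry run first.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Suspension: turn the lights off when you leave the room. Find running
# instances tagged as non-production (the tag scheme here is hypothetical).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["dev"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)

# Cleanup: unattached EBS volumes keep billing until deleted. A status of
# "available" means the volume is not attached to any instance.
unattached = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for volume in unattached:
    print("deletion candidate:", volume["VolumeId"])
    # ec2.delete_volume(VolumeId=volume["VolumeId"])  # destructive; left commented out
```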
Corey: I feel like there's—or at least the way that I approach these things—a number of different levels you can look at AWS billing constructs on. The way that I tend to structure most of my engagements when I'm working with clients is we come in and, step one: cool. Why do you care about the AWS bill? It's a weird question to ask because most of the engineering folks look at me like I've just grown a second head. Like, "So, why do you care about your AWS bill?" Like, "What? Why do you? You run a company doing this?"

It's no, no, no, it's not that I'm being rhetorical, and I'm not trying to be clever somehow and pretend that I don't understand all the nuances around this, but why does your business care about lowering the AWS bill? Because very often, the answer is they kind of don't. What they care about from a business perspective is being able to accurately attribute costs for the service or good that they provide, being able to predict what that spend is going to be, and also, yes, a sense of being good stewards of the money that has been entrusted to them via investors, public markets, or the budget allocation process of their companies, and making sure that they're not doing foolish things with it. And that makes an awful lot of sense. It is rare at the corporate level that the stated number one concern is make the bills lower. Because at that point, well, easy enough. Let's just turn off everything you're running in production. You'll save a lot of money on your AWS bill. You won't be in business anymore, but you'll be saving a lot of money on the AWS bill. The answer is always deceptively nuanced and complicated.

At least, that's how I see it. Let's also be clear that I talk with a relatively narrow subset of the AWS customer totality. The things that I do are very much intentionally things that do not scale. Definitionally, everything that you do has to scale. How do you wind up approaching this in ways that will work for customers spending billions versus independent learners who are paying for this out of their own personal pocket?

Rick: It's not easy [laugh], let me just preface that. The team we have is incredible, and we spend so much time thinking about scale and the different personas that engage with our products and what their experience is when they interact with a bill or the AWS platform at large. There are also a couple of different personas here, right? We have a persona that focuses in on the cloud cost, the cloud bill, the finance—whether that's an organization that has created a FinOps organization, or has a Cloud Center of Excellence—versus an engineering team that maybe has started to go towards decentralized IT and has some accountability for the spend they attribute to their AWS bill. And so, these different personas interact with us in really different ways, whether that's Cost Explorer or downloading the CUR and taking a look at the bill.

And one thing that I always kind of imagine is somebody putting a headlamp on and going into the caves in the depths of their AWS bill and kind of, like, spelunking through their bill sometimes, right? And so, you have these FinOps folks and billing people that are deeply interested in making sure that the spend they do have meets their business goals, meaning this is providing high value to our company, it's providing high value to our customers, we're spending on the right things, and we're spending the right amount on the right things. Versus the engineering organization that's like, "Hey, how do we configure these resources? What types of instances should we be focused on using? What services should we be building on top of that maybe are more flexible for our business needs?"

And so, there are really, like, two major personas that our organization spends a lot of time wrapping our heads around. Because they're really different, very different approaches to how we think about cost. Because you're right, if you just wanted to lower your AWS bill, it's really easy. Just size everything to a t2.nano and you're done and move on [laugh], right? But you're [crosstalk 00:08:53]—

Corey: Aw, t3 or t4.nano, depending upon whether regional availability is going to save you less. I'm still better at this. Let's not kid ourselves. I kid. Mostly.

Rick: For sure. So, t4.nano, absolutely.

Corey: T4g. Remember, now the way forward is everything has an explicit letter designator to define which processor company made the CPU that underpins the instance itself, because that's a level of abstraction we certainly wouldn't want the cloud provider to take away from us any.

Rick: Absolutely. And actually, the performance differences of those different processor models can be pretty incredible [laugh]. So, there are huge decisions behind all of that as well.

Corey: Oh, yeah. There are so many factors that factor into all these things. It's gotten to a point, and you see this usually with lawyers and very senior engineers, where the answer to almost everything is, "It depends." There are always going to be edge cases. Easy example: if you check a box and enable an S3 gateway endpoint inside of a private subnet, suddenly, you're not passing traffic through a 4.5-cent-per-gigabyte managed NAT gateway; it's being sent over that endpoint for no additional cost whatsoever. Check the box, save a bunch of money. But there are scenarios where you don't want to do it, so always double-checking and talking to customers about this is critically important. Because the first time you make a recommendation that does not work for their constraints, you lose trust. And make a few of those, and it looks like you're more or less just making naive recommendations that don't add any value, and they learn to ignore you. So, down the road, when you make a really high-value, great recommendation for them, they stop paying attention.

Rick: Absolutely. And we have that really high bar for recommendation accuracy, especially with right-sizing; that's such a key one. Although I guess Savings Plan purchase recommendations can be critical as well. If a customer over-commits on the amount of Savings Plan purchase they need to make, right, that's a really big problem for them.

So, recommendation accuracy must be above reproach. Essentially, if a customer takes a recommendation and it breaks an application, they're probably never going to take another right-sizing recommendation again [laugh]. And so, this bar of trust must be exceptionally high. That's also why, out of the box, the Compute Optimizer recommendations can be a little bit mild, a little tame, because the first order of business is do no harm; focus on the performance requirements of the application first, because we have to make sure that the reason you build these workloads in AWS is served.

Now, ideally, we do that without overspending and without overprovisioning the capacity of these workloads, right? And so, for example, when we make these right-sizing recommendations from Compute Optimizer, we're taking a look at the utilization of CPU, memory, disk, network throughput, and IOPS, and we're vending these recommendations to customers.
And when you take that recommendation, you must still have great application performance for your business to be served, right? It's such a crucial part of how we optimize and run long-term. Because optimization is not a one-time Band-Aid; it's an ongoing behavior, so it's really critical for that accuracy to be exceptionally high so we can build business process on top of it as well.

Corey: Let me ask you this. How do you contextualize what the right approach to optimization is? There are certain tools that you have—by 'you,' I mean, of course, as an organization—repeatedly gone back to, and different approaches that don't seem to deviate all that much from year to year, and customer to customer. How do you think about the general things that apply universally?

Rick: So, we know that EC2 is a very popular service for us. We know that sizing EC2 is difficult. We think about that optimization pillar of scaling. It's an obvious area for us to help customers. We run into this sort of industry-wide experience where whenever somebody picks the size of a resource, they're going to pick one generally larger than they need. It's almost like asking a new employee at your company, "Hey, pick your laptop. We have a 16-gig model or a 32-gig model. Which one do you want?" The person [laugh] making the decision on hardware capacity is always going to pick the 32-gig laptop, right? And so, we have this sort of human nature in IT of, we don't want to get called at two in the morning for performance issues, we don't want our apps to fall over, we want them to run really well, so we're going to size things very conservatively and we're going to oversize things.

So, we can help customers by providing those recommendations to say: you can size things in a different way, using math and analytics based on the utilization patterns, and we can provide and pick different instance types. There are hundreds and hundreds of instance types in all of these regions across the globe. How do you know which is the right one for every single resource you have? It's a very, very hard problem to solve, and it's not something that is lucrative to solve one by one if you have 100 EC2 instances. Trying to pick the correct size for each and every one can take hours and hours of IT engineering resources: looking at utilization graphs, looking at all of these types available, looking at what the performance difference is between processor models and providers of those processors, asking whether there are application compatibility constraints to consider. The complexity is astronomical.

And then not only that: as soon as you make that sizing decision, one week later it's out of date and you need a different size. So [laugh], you didn't really solve the problem. So, we have to programmatically use data science and math to say, "Based on these utilization values, these are the sizes that would make sense for your business, that would have the lowest cost and the highest performance together at the same time." And it's super important that we provide this capability from a technology standpoint, because it would cost so much money to try to solve that problem by hand that the savings you would achieve might not be meaningful.
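Compute Optimizer vends those recommendations through an API as well as the console. A sketch of pulling them with boto3: it assumes Compute Optimizer is already enrolled for the account, pagination is elided, and the response field names follow the GetEC2InstanceRecommendations shape as I understand it, so treat them as approximate.

```python
import boto3

co = boto3.client("compute-optimizer", region_name="us-east-1")

response = co.get_ec2_instance_recommendations()

for rec in response.get("instanceRecommendations", []):
    finding = rec.get("finding")  # e.g., OVER_PROVISIONED, UNDER_PROVISIONED, OPTIMIZED
    options = rec.get("recommendationOptions", [])
    if finding == "OVER_PROVISIONED" and options:
        # Options are ranked; each carries a suggested instance type and a
        # performance-risk score, the "do no harm" signal discussed above.
        best = options[0]
        print(rec.get("currentInstanceType"), "->", best.get("instanceType"),
              "| performance risk:", best.get("performanceRisk"))
```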
Then at the same time… you know, that's really from an engineering perspective, but when we talk to the FinOps and the finance folks, the conversations are more about Reservations and Savings Plans. How do we correctly apply Savings Plans and Reservations across a high percentage of our portfolio to reduce the costs on those workloads, but not so much that dynamic capacity levels in our organization mean we all of a sudden have a bunch of unused Reservations or Savings Plans?

And so, with a lot of organizations that engage with us and that we have conversations with, we start with the Reservation and Savings Plan conversation, because it's much easier to click a few buttons and buy a Savings Plan than to go institute an entire right-sizing campaign across multiple engineering teams. That can be very difficult, a much higher bar. So, some companies are ready to dive into the engineering task of sizing; some are not there yet. They're maybe a little earlier in their FinOps journey, or in building optimization technology stacks, or in achieving higher value out of their cloud environments, so starting with kind of the low-hanging fruit makes sense. It can vary depending on the company: size of company, technical aptitudes, skill sets, all sorts of things like that.

And so, those finance-focused teams are definitely spending more time looking at and studying what the best practices are for purchasing Savings Plans, covering my environment, getting the most out of my dollar that way. Then they don't have to engage the engineering teams; they can kind of take a nice chunk off the top of their bill and sort of have something to show for that amount of effort. So, there are a lot of different approaches to starting in optimization.

Corey: My philosophy runs somewhat counter to this, because everything you're saying does work globally, it's safe, it's non-threatening, and also, really, on some level, feels like it is an approach that can be driven forward by finance or business. Whereas my worldview is that cost and architecture in cloud are one and the same. And there are architectural consequences of cost decisions, and vice versa, that can be adjusted and addressed. Like, one of my favorite party tricks—although I admit, it's a weird party—is I can look at the exploded PDF view of a customer's AWS bill and describe their architecture to them. And people have questioned that a few times, and now I have a testimonial on my client website that mentions, "It was weird how he was able to do this." Yeah, it's real, I can do it. And it's not a skill I would recommend cultivating for most people. But it does also mean that I think I'm onto something here, in that there's always context that needs to be applied.

It feels like there's an entire ecosystem of product companies out there trying to build what amounts to a better Cost Explorer that also is not free the way that Cost Explorer is. So, the challenge I see there is they all tend to look more or less the same; there is very little differentiation in that space. And in the fullness of time, Cost Explorer does—ideally—get better. How do you think about it?

Rick: Absolutely. If you're looking at ways to understand your bill, there's obviously Cost Explorer and the CUR; a very common approach is to take the CUR and put a BI front-end on top of it. That's a common experience. A lot of companies that have chops in that space will do that themselves instead of purchasing a third-party product that does bill breakdown and dissemination.
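The headlamp-and-spelunking view of the bill is also available programmatically. A minimal Cost Explorer query via boto3, grouping one month's unblended cost by service; the dates are placeholders, and the account needs Cost Explorer enabled for the call to succeed.

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is a global API

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-11-01", "End": "2022-12-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print each service's share of the month, largest first.
for period in response["ResultsByTime"]:
    groups = sorted(
        period["Groups"],
        key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )
    for group in groups:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{group['Keys'][0]}: ${amount:,.2f}")
```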
There's also the cross-charge, show-back, organizational breakdown and boundaries, because you have these super large organizations that have fiefdoms. You know, HR IT and sales IT and [laugh], you know, product IT—you have all these different IT departments that are fiefdoms within your AWS bill and construct, whether they have different AWS accounts or, say, different AWS organizations sometimes, right? It can get extremely complicated. And some organizations require the ability to break down their bill based on those organizational boundaries. Maybe tagging works, maybe it doesn't. Maybe they do that by using a third-party product that lets them set custom scopes on their resources based on organizational boundaries. That's a common approach as well.

We do also have our first-party solutions that can do that, like the CUDOS dashboard as well. That's something that's really popular and highly used across our customer base. It allows you to have kind of a dashboard and customizable view of your AWS costs and, kind of, split it up based on tag, organizational value, account name, and things like that as well. So, you mentioned that you feel like the architectural and cost problem is the same problem. I really don't disagree with that at all. I think what it comes down to is some organizations are prepared to tackle the architectural elements of cost and some are not.

And it really comes down to: how does the customer view their bill? Is it somebody in the finance organization looking at the bill? Is it somebody in the engineering organization looking at the bill? Ideally, it would be both. Ideally, you would have some of those skill sets that overlap, or you would have an organization that does focus in on FinOps or cloud operations as it relates to cost. But then at the same time, there are organizations that are like, "Hey, we need to go to cloud. Our CIO told us go to cloud. We don't want to pay the lease renewal on this building." There's a lot of reasons why customers move to cloud, a lot of great reasons, right? Three major reasons you move to cloud: agility, [crosstalk 00:20:11]—

Corey: And several terrible ones.

Rick: Yeah, [laugh] and some not-so-great ones, too. So, there are so many different dynamics that get exposed when customers engage with us, and they might or might not be ready to engage on the architectural element of how to build hyperscale systems. So, many of these customers are bringing legacy workloads and applications to the cloud, and something like a re-architecture to use stateless resources or something like Spot is just not possible for them. So, how can they take 20% off the top of their bill? Savings Plans or Reservations are kind of that easy, low-hanging-fruit answer to just say, "We know these are fairly static environments that don't change a whole lot, that are going to exist for some amount of time." They're legacy, you know; we can't turn them off. It doesn't make sense to rewrite these applications because they just don't change, they don't have high business value, or something like that.

And so, the architecture part of that conversation doesn't always come into play. Should it? Yes. The long-term maturity and approach for cloud optimization does absolutely account for architecture: thinking strategically about how you do scaling, what services you're using, are you going down the Kubernetes path—which I know you're going to laugh about—but, you know, how do you take these applications and componentize them? What services are you using to do that?
How do you get that long-term scale and manageability out of those environments? Like you said at the beginning, the complexity is staggering and there's no one unified answer. That's why there are so many different entrance paths into, “How do I optimize my AWS bill?” There's no one answer, and every customer I talk to has a different comfort level and appetite. And some of them have tried suspension, some of them have gone heavy down Savings Plans, some of them want to dabble in right-sizing. So, every customer is different and we want to provide those capabilities for all of those different customers that have different appetites or comfort levels with each of these approaches.Corey: This episode is sponsored in part by our friends at Redis, the company behind the incredibly popular open source database. If you're tired of managing open source Redis on your own, or if you are looking to go beyond just caching and unlocking your data's full potential, these folks have you covered. Redis Enterprise is the go-to managed Redis service that allows you to reimagine how your geo-distributed applications process, deliver, and store data. To learn more from the experts in Redis how to be real-time, right now, from anywhere, visit redis.com/duckbill. That's R - E - D - I - S dot com slash duckbill.Corey: And I think that's very fair. I think that it is not necessarily a bad thing that you wind up presenting a lot of these options to customers. But there are some rough edges. An example of this is something I encountered myself somewhat recently and put on Twitter—because I have those kinds of problems—where originally, I remember this, you were able to buy hourly Savings Plans, which again, Savings Plans are great; no knock there. I do wish that they applied to more services rather than, “Oh, SageMaker is going to do its own Savings Pla”—no; stop keeping me from moving from something I have to manage myself on EC2 to something you manage for me by making that move cost money. You nailed it with Fargate. You nailed it with Lambda. Please just have one unified Savings Plan thing. But I digress. But you had a limit, once upon a time, of $1,000 per hour. Now, it's $5,000 per hour, which I believe, on a three-year all-upfront basis, means you will cheerfully add a $130 million purchase to your shopping cart. And I kept adding a bunch of them and then had a little over a billion dollars a single button click away from being charged to my account. Let me begin with what's up with that?Rick: [laugh]. Thank you for the tweet, by the way, Corey.Corey: Always thrilled to ruin your month, Rick. You know that.Rick: Yeah. Fantastic. We took that tweet—you know, it was tongue in cheek, but also it was a serious opportunity for us to ask a question of what does happen? And it's something we did ask internally and have some fun conversations about. I can tell you that if you clicked purchase, it would have been declined [laugh]. So, you would have not been—Corey: Yeah, American Express would have had a problem with that. But the question is, would you have attempted to charge American Express, or would something internally have gone, “This has a few too many commas for us to wind up presenting it to the card issuer with a straight face?”Rick: [laugh]. Right. So, it wouldn't have gone through and I can tell you that, you know, if your account was on a PO-based configuration, you know, it would have gone to the account team. 
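For scale, the arithmetic behind that shopping cart is easy to check (a rough sketch, assuming a full three-year term at the raised $5,000-per-hour cap):

# Back-of-the-envelope check on the figure above; ignores leap days.
hourly_commitment = 5_000            # USD per hour, the raised cap
hours = 24 * 365 * 3                 # a three-year term
print(f"${hourly_commitment * hours:,}")   # $131,400,000, roughly the $130M cited
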
And it would have gone through our standard process for having a conversation with our customer there. That being said, we are—it's an awesome opportunity for us to examine what is that shopping cart experience. We did increase the limit, you're right. And we increased the limit for a lot of reasons that we sat down and worked through, but at the same time, there's always an opportunity for improvement of our product and experience. We want to make sure that it's really easy and lightweight to use our products, especially purchasing Savings Plans. Savings Plans are already kind of fraught with mental concern and the risk of purchasing something so expensive and large that has a big impact on your AWS bill, so we don't really want to add any more friction necessarily to the process, but we do want to build in awareness and make sure customers understand, “Hey, you're purchasing this. This has a pretty big impact.” And so, we're also looking at other ways we can kind of improve the Savings Plan shopping cart experience to ensure customers don't put themselves in a position where you have to unwind or make phone calls and say, “Oops.” Right? We [laugh] want to avoid those sorts of situations for our customers. So, we are looking at quite a few additional improvements to that experience as well that I'm really excited about but really can't share here, so stay tuned.Corey: I am looking forward to it. I will say the counterpoint to that is having worked with customers who do make large eight-figure purchases at once, there's a psychology element that plays into it. Everyone is very scared to click the button on the ‘Buy It Now' thing or the ‘Approve It.' So, what I've often found is at that scale, one, you can reduce what you're buying by half of it, and then see how that treats you and then continue to iterate forward rather than doing it all at once, or reach out to your account team and have them orchestrate the buy. In previous engagements, I had a customer do this religiously and at one point, the concierge team bought the wrong thing in the wrong region, and from my perspective, I would much rather have AWS apologize for that and fix it on their end, than have us, on the customer side, going, “Oh crap, oh crap. Please be nice to us.” Not that I doubt you would do it, but that's not the nervous conversation I want to have in quite the same way. It just seems odd to me that someone would want to make that scale of purchase without ever talking to a human. I mean, I get it. I'm as antisocial as they come some days, but for that kind of money, I kind of just want another human being to validate that I'm not making a giant mistake.Rick: We love that. That's such a tremendous opportunity for us to engage and discuss with an organization that's going to make a large commitment, that here's the impact, here's how we can help. How does it align to our strategy? We also do recommend, from a strategic perspective, those more incremental purchases. I think it creates a better experience long-term when you don't have a single Savings Plan that's going to expire on a specific day and all of a sudden increase your entire bill by a significant percentage. So, making staggered monthly purchases makes a lot of sense. And it also works better for incremental growth, right? 
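One hypothetical way to implement the staggered pattern Rick describes is the Savings Plans API's queued purchases; this is a sketch under assumptions, not AWS guidance, and the offering ID and commitment values are placeholders:

import uuid
from datetime import datetime, timedelta, timezone

import boto3

sp = boto3.client("savingsplans")

offering_id = "EXAMPLE-OFFERING-ID"   # placeholder; look one up via
                                      # describe_savings_plans_offerings()
start = datetime.now(timezone.utc) + timedelta(days=1)

# Queue six small commitments a month apart instead of one large one,
# so no single expiration date creates a cliff.
for month in range(6):
    sp.create_savings_plan(
        savingsPlanOfferingId=offering_id,
        commitment="1.00",                          # USD/hour per slice
        clientToken=str(uuid.uuid4()),              # idempotency token
        purchaseTime=start + timedelta(days=30 * month),
    )
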
If your organization is growing 5% month-over-month or year-over-year or something like that, you can purchase those incremental Savings Plans that sort of stack up on top of each other and then you don't have that risk of a cliff one day where one super-large SP expires and boom, you have to scramble and repurchase within minutes because every minute that goes by is an additional expense, right? That's not a great experience. And so that's, really, a large part of why those staggered purchase experiences make a lot of sense. That being said, a lot of companies do their math and their finance in different ways. And single large purchases make sense to go through their process and their rigor as well. So, we try to support both types of purchasing patterns.Corey: I think that is an underappreciated aspect of cloud cost savings and cloud cost optimization, where it is much more about humans than it is about math. I see this most notably when I'm helping customers negotiate their AWS contracts with AWS, where there are often perspectives such as, “Well, we feel like we really got screwed over last time, so we want to stick it to them and make them give us a bigger percentage discount on something.” And it's like, look, you can do that, but I would much rather, if it were me, go for something that moves the needle on your actual business and empowers you to move faster, more effectively, and leads to an outcome that is a positive for everyone, versus the, “Well, we're just going to be difficult on this one point because they were difficult on something last time.” But ego is a thing. Human psychology is never going to have an API for it. And again, customers get to decide their own destiny in some cases.Rick: I completely agree. I've actually experienced that. So, this is the third company I've been working at on cloud optimization. I spent several years at Microsoft running an optimization program. I went to Turbonomic for several years, building out the right-sizing and Savings Plan and Reservation purchase capabilities there, and now I'm here at AWS. And through all of these journeys and experiences working with companies to help optimize their cloud spend, I can tell you that the psychological needle—moving the needle is significantly harder than the technology stack of sizing something correctly or deleting something that's unused. We can solve the technology part. We can build great products that identify opportunities to save money. There's still this psychological component: IT, for the last several decades, has gone through this maturity curve of if it's not broken, don't touch it. Five-nines, Six Sigma, all of these methods of IT sort of rationalizing do no harm, don't touch anything, everything must be up. And it even kind of goes back several decades, back when, if you rebooted a physical server, the motherboard capacitors would pop, right? So, there's even this anti—or this stigma against even rebooting servers sometimes. The cloud really does away with a lot of that stuff because we have live migration and we have all of these, sort of, stateless designs and capabilities, but we still carry along with us this mentality of don't touch it; it might fall over. And we have to really get past that. And that means the trust piece; we go back to the trust conversation, where we talk about how the recommendations must be incredibly accurate. 
You're risking your job, in some cases. If you are a DevOps engineer, and your commitments on your yearly goals are uptime, latency, response time, load time, these sorts of things, these operational metrics, KPIs that you use, you don't want to take a downsize recommendation. It has a severe risk of harming your job and your bonus.Corey: “These instances are idle. Turn them off.” It's like, yeah, these instances are the backup site, or the DR environment, or—Rick: Exactly.Corey: —something that takes very bursty but occasional traffic. And yeah, I know it costs us some money, but here are the revenue figures for having that thing available. Like, “Oh, yeah. Maybe we should shut up and not make dumb recommendations around things,” is the human response, but computers don't have that context.Rick: Absolutely. And so, the accuracy and trust component has to be the highest bar we meet for any optimization activity or behavior. We have to circumvent or supersede the human aversion, the risk aversion, that IT is built on, right?Corey: Oh, absolutely. And let's be clear, we see this all the time where I'm talking to customers and they have been burned before because we tried to save money and then we took a production outage as a side effect of a change that we made, and now we're not allowed to try to save money anymore. And there's a hidden truth in there, which is auto-scaling is something that a lot of customers talk about, but very few have instrumented true auto-scaling because they interpret it as: we can scale up to meet demand. Because yeah, if you don't do that you're dropping customers on the floor. Well, what about scaling back down again? And the answer there is like, yeah, that's not really a priority because it's just money. We're not disappointing customers or damaging brand reputation, and we're still able to take people's money when that happens. It's only money; we can fix it later. Covid shined a real light on a lot of this stuff just because there are customers that we've spoken to whose user traffic dropped off a cliff while infrastructure spend remained constant day over day. And yeah, they believed, genuinely, that they were auto-scaling. The most interesting lies are the ones that customers tell themselves, but the bill speaks. So, getting a lot of modernization traction from things like that was really neat to watch. But customers I don't think necessarily intuitively understand most aspects of their bill because it is a multidisciplinary problem. It's engineering, it's finance, it's accounting—which is not the same thing as finance—and you need all three of those constituencies to be able to communicate effectively using a shared and common language. It feels like we're marriage counseling between engineering and finance most weeks.Rick: Absolutely, we are. And it's important we get it right, that the data is accurate, that the recommendations we provide are trustworthy. If the finance team gets their hands on the savings potential they see out of right-sizing, takes it to engineering, and then engineering comes back and says, “No, no, no, we can't actually do that. We can't actually size those,” right, we have problems. And they're cultural, they're transformational. Organizations' appetite for these things varies greatly and so it's important that we address that problem from all of those angles. And it's not easy to do.Corey: How big do you find the optimization problem is when you talk to customers? How focused are they on it? I have my answers, but that's the scale of anecdata. 
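To make Corey's scaling-down point above concrete: in the EC2 Auto Scaling API, a target-tracking policy only saves money if scale-in stays enabled. A minimal sketch, with the group name as a placeholder:

import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="example-web-asg",   # placeholder
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
        # The part many teams skip: leaving scale-in enabled is what
        # brings capacity (and spend) back down after the spike.
        "DisableScaleIn": False,
    },
)
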
I want to hear your actual answer.Rick: Yeah. So, we talk with a lot of customers that are very interested in optimization. And we're very interested in helping them on the journey towards having an optimal estate. There are so many nuances and barriers, most of them psychological, like we already talked about. I think there's this opportunity for us to do better at exposing the potential of what an optimal AWS estate would look like from a dollar and savings perspective. And so, I think it's kind of not well understood, and I think it's one of the biggest barriers to companies really attacking the optimization problem with more vigor: if they knew that the potential savings they could achieve out of their AWS environment would really align their spend much more closely with the business value they get, I think everybody would go bonkers. And so, I'm really excited about us making progress on exposing that capability, or the total savings potential and amount. It's something we're looking into doing in a much more obvious way. And we're really excited about customers doing that on AWS, where they know they can trust AWS to get the best value for their cloud spend, that it's a long-term good bet because the resources that they're using on AWS are all focused on giving business value. And that's the whole key. How can we align the dollars to the business value, right? And I think optimization is that connection between those two concepts.Corey: Companies are generally not going to greenlight a project whose sole job is to save money unless there's something very urgent going on. What will happen is as they iterate forward on the next generation of services or a migration of a service from one thing to another, they will make design decisions that benefit those optimizations. There's low-hanging fruit we can find, usually of the form, “Turn that thing off,” or, “Configure this thing slightly differently,” that doesn't take a lot of engineering effort to put in place. But, on some level, it is not worth the engineering effort it takes to do an optimization project. We've all met those engineers—speaking as one of them myself—who, left to our own devices, will spend two months just knocking a few hundred bucks a month off of our AWS developer environment. We steal more than office supplies. I'm not entirely sure what the business value of doing that is, in most cases. For me, yes, okay, things that work in small environments work very well in large environments, generally speaking, so I learned how to save 80 cents here and that's a few million bucks a month somewhere else. Most folks don't have that benefit happening, so it's a question of meeting them where they are.Rick: Absolutely. And I think the scale component is huge, which you just touched on. When you're talking about a hundred EC2 instances versus a thousand, optimization becomes kind of a different component of how you manage that AWS environment. And while, for single-decision recommendations to scale an individual server, the dollar amount might be different, the percentages are just about the same when you look at what it is to be sized correctly, what it is to be configured correctly. And so, it really does come down to priority. And so, it's really important to support all of those companies of all different sizes and industries because they will have different experiences on AWS. And some will have more sensitivity to cost than others, but all of them want to get great business value out of their AWS spend. 
And so, as long as we're meeting that need and we're supporting our customers to make sure they understand the commitment we have to ensuring that their AWS spend is valuable, it is meaningful, right, they're not spending money on things that are not adding value, that's really important to us.Corey: I do want to have, as the last topic of discussion here, how AWS views optimization, where there have been a number of repeated statements that “helping customers optimize their cloud spend is extremely important to us.” And I'm trying to figure out where that falls on the spectrum from, “It's the thing we say because they make us say it, but no, we're here to milk them like cows,” all the way on over to, “No, no, we passionately believe in this at every level, top to bottom, across the company. We are just bad at it.” So, I'm trying to understand how that winds up being expressed from your lived experience having solved this problem first outside, and then inside.Rick: Yeah. So, it's kind of like part of my personal story. It's the main reason I joined AWS. And, you know, when you go through the interview loops and you talk to the leaders of an organization you're thinking about joining, they always stop at the end of the interview and ask, “Do you have any questions for us?” And I asked that question to pretty much every single person I interviewed with. Like, “What is AWS's appetite for helping customers save money?” Because, like, from a business perspective, it kind of is a little bit wonky, right? But the answers were varied, and all of them were customer-obsessed and passionate. And I got this sense that my personal passion for helping companies have better efficiency of their IT resources was an absolute primary goal of AWS and a big element of Amazon's leadership principle of customer obsession. Now, I'm not a spokesperson, so [laugh] we'll see, but we are deeply interested in making sure our customers have a great long-term experience and a high-trust relationship. And so, when I asked these questions in these interviews, the answers were all about, “We have to do the right thing for the customer. It's imperative. It's also in our DNA. It's one of the most important leadership principles we have, to be customer-obsessed.” And it is the primary reason why I joined: because of that answer to that question. Because it's so important that we achieve a better efficiency for our IT resources, not just for, like, AWS, but for our planet. If we can reduce consumption patterns and usage across the planet for how we use data centers and all the power that goes into them, we can talk about meaningful reductions of greenhouse gas emissions and of the cost and energy needed to run IT business applications. And not only that, but since almost all new technology that's developed in the world seems to come out of a data center these days, we have a real opportunity to make a material impact on how much resource we use to build and use these things. And I think we owe it to the planet, to humanity, and I think Amazon takes that really seriously. And I'm really excited to be here because of that.Corey: As I recall—and feel free to make sure that this comment never sees the light of day—you asked me, before interviewing for the role and then deciding to accept it, what I thought about you working there and whether I would recommend it or not. And I think my answer was fairly nuanced. 
And you're working there now and we still are on speaking terms, so people can probably guess what my comments took the shape of, generally speaking. So, I'm going to have to ask now; it's been, what, a year since you joined?Rick: Almost. I think it's been about eight months.Corey: Time during a pandemic is always strange. But I have to ask, did I steer you wrong?Rick: No. Definitely not. I'm very happy to be here. The opportunity to help such a broad range of companies get more value out of technology—and it's not just cost, right, like we talked about. It's actually not about the dollar number going down on a bill. It's about getting more value and moving the needle on how do we efficiently use technology to solve business needs. And that's been my career goal for a really long time; I've been working on optimization for, like, seven or eight, I don't know, maybe even nine years now. And it's like this strange passion for me, this combination of my dad teaching me how to be a really good steward of money and a great budget manager, and then my passion for technology. So, it's this really cool combination of, like, childhood life skills that really came together for me to create a career that I'm really passionate about. And this move to AWS has been such a tremendous way to supercharge my ability to scale my personal mission, and really align it to AWS's broader mission of helping companies achieve more with cloud platforms, right? And so, it's been a really nice eight months. It's been wild. Learning AWS culture has been wild. It's a sharply diverging culture from where I've been in the past, but it's also really cool to experience the leadership principles in action. They're not just things we put on a website; they're actually things people talk about every day [laugh]. And so, that journey has been humbling and a great learning opportunity as well.Corey: If people want to learn more, where's the best place to find you?Rick: Oh, yeah. Contact me on LinkedIn or Twitter. My Twitter account is @rickyo1138. Let me know if you get the 1138 reference. That's a fun one.Corey: THX 1138. Who doesn't?Rick: Yeah, there you go. And it's hidden in almost every single George Lucas movie as well. You can contact me on any of those social media platforms and I'd be happy to engage with anybody that's interested in optimization, cloud technology, bills, anything like that. Or even not [laugh]. Anything else, either.Corey: Thank you so much for being so generous with your time. I really appreciate it.Rick: My pleasure, Corey. It was wonderful talking to you.Corey: Rick Ochs, Principal Product Manager at AWS. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment, rightly pointing out that while AWS is great and all, Azure is far more cost-effective for your workloads because, given their lax security, it is trivially easy to just run your workloads in someone else's account.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. 
Stay humble.

AWS Morning Brief
The Releases are Coming Fast and Furious Now

AWS Morning Brief

Play Episode Listen Later Nov 30, 2022 4:08


Links: Last Week in AWS Community Slack VPC Lattice AWS Supply Chain OpenSearch Serverless AWS Verified Access Stay Up To Date with re:Quinnvent Sign up for the re:Quinnvent Newsletter Check out the re:Quinnvent playlist on YouTube If you're on site: Join Corey for a Nature Walk through the Expo Hall beginning at the Fortinet booth today (11/29/22) at 1pm PST, or for drinks at Atomic Liquors tonight at 8:15pm PST. Tomorrow evening is re:Play; if you see Corey there, please say hello! Help the show Share your feedback Subscribe wherever you get your podcasts Buy our merch What's Corey up to? Follow Corey on Twitter (@quinnypig) See our recent work at the Duckbill Group Apply to work with Corey and the Duckbill Group to help lower your AWS bill

Screaming in the Cloud
Crafting a Modern Data Protection Strategy with Sam Nicholls

Screaming in the Cloud

Play Episode Listen Later Nov 30, 2022 37:26


About Sam
Sam Nicholls: Veeam's Director of Public Cloud Product Marketing, with 10+ years of sales, alliance management, and product marketing experience in IT. Sam has evolved from his on-premises storage days and is now laser-focused on spreading the word about cloud-native backup and recovery, packing in thousands of viewers on his webinars, blogs, and webpages.Links Referenced: Veeam AWS Backup: https://www.veeam.com/aws-backup.html Veeam: https://veeam.com TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by our friends at Chronosphere. Tired of observability costs going up every year without getting additional value? Or being locked into a vendor due to proprietary data collection, querying, and visualization? Modern-day, containerized environments require a new kind of observability technology that accounts for the massive increase in scale and attendant cost of data. With Chronosphere, choose where and how your data is routed and stored, query it easily, and get better context and control. 100% open-source compatibility means that no matter what your setup is, they can help. Learn how Chronosphere provides complete and real-time insight into ECS, EKS, and your microservices, wherever they may be, at snark.cloud/chronosphere. That's snark.cloud/chronosphere.Corey: This episode is brought to us by our friends at Pinecone. They believe that all anyone really wants is to be understood, and that includes your users. AI models combined with the Pinecone vector database let your applications understand and act on what your users want… without making them spell it out. Make your search application find results by meaning instead of just keywords, your personalization system make picks based on relevance instead of just tags, and your security applications match threats by resemblance instead of just regular expressions. Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable. Thanks to my friends at Pinecone for sponsoring this episode. Visit Pinecone.io to understand more.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest episode is brought to us by and sponsored by our friends over at Veeam. And as a part of that, they have thrown one of their own to the proverbial lion. My guest today is Sam Nicholls, Director of Public Cloud over at Veeam. Sam, thank you for joining me.Sam: Hey. Thanks for having me, Corey, and thanks to everyone joining and listening in. I do know that I've been thrown into the lion's den, and I am [laugh] hopefully well-prepared to answer anything and everything that Corey throws my way. Fingers crossed. [laugh].Corey: I don't think there's too much room for criticizing here, to be direct. I mean, Veeam is a company that is solidly and thoroughly built around a problem that absolutely no one cares about. I mean, what could possibly be wrong with that? You do backups, which no one ever cares about. Restores, on the other hand, people care very much about restores. And that's when they learn, “Oh, I really should have cared about backups at any point prior to 20 minutes ago.”Sam: Yeah, it's a great point. It's kind of like taxes and insurance. 
It's almost like, you know, something that you have to do that you don't necessarily want to do, but when push comes to shove, and something's burning down, a file has been deleted, someone's made their way into your account and is, you know, making a right mess in there, that's when you really, kind of, care about what you mentioned, which is the recovery piece, the speed of recovery, the reliability of recovery.Corey: It's been over a decade, and I'm still sore about losing my email archives from 2006 to 2009. There's no way to get it back. I ran my own mail server; it was an iPhone setting that said, “Oh, yeah, automatically delete everything in your trash folder—or archive folder—after 30 days.” It was just a weird default setting back in that era. I didn't realize it was doing that. Yeah, painful stuff. And we learned the hard way in some of these cases. Not that I really have much need for email from that era of my life, but every once in a while it still bugs me. Which speaks to the point that the people who are the most fanatical about backing things up are the people who have been burned by not having a backup. And I'm fortunate in that it wasn't someone else's data with which I had been entrusted that really cemented that lesson for me.Sam: Yeah, yeah. It's a good point. I can remember a few years ago, my wife migrated a very aging, polycarbonate white Mac to one of the shiny new aluminum ones and thought everything was good—Corey: As the white polycarbonate Mac becomes yellow, then yeah, all right, you know, it's time to replace it. Yeah. So yeah, so she wiped the drive, and what happened?Sam: That was her moment where she learned the value and importance of backup, and now she backs everything up. I fortunately have never gone through it. But I'm employed by a backup vendor and that's why I care about it. But it's incredibly important to have, of course.Corey: Oh, yes. My spouse has many wonderful qualities, but one that drives me slightly nuts is she's something of a digital packrat, where the hard drives on her laptop will periodically fill up. And I used to take the approach of, “Oh, you can be more efficient,” and the rest. And I realized, no, telling other people they're doing it wrong is generally poor practice, whereas just buying bigger drives is way easier. Let's go ahead and do that. It's a small price to pay for domestic tranquility. And there's a lesson in that. We can map that almost perfectly to the corporate world that you folks tend to operate in. You're not doing home backup, last time I checked; you are doing public cloud backup. Actually, I should ask that. Where do you folks start and where do you stop?Sam: Yeah, no, it's a great question. You know, we started over 15 years ago when virtualization, specifically VMware vSphere, was really the up-and-coming thing, and, you know, a lot of folks out there were trying to utilize agents to protect their vSphere instances, just like they were doing with physical Windows and Linux boxes. And, you know, it kind of got the job done, but was it the best way of doing it? No. And that's kind of why Veeam was pioneered; it was this agentless backup, image-based backup for vSphere. And, of course, you know, in the last 15 years, we've seen lots of transitions; of course, we're here at Screaming in the Cloud, with you, Corey, so AWS, as well as a number of other public cloud vendors, we can help protect, as well as a number of SaaS applications like Microsoft 365, and metadata and data within Salesforce. 
So, Veeam's really kind of come a long way from just virtual machines to really taking a global look at the entirety of modern environments, and how can we best protect each and every single one of those without trying to take a square peg and fit it in a round hole?Corey: It's a good question and a common one. We wind up with an awful lot of folks who are confused by the proliferation of data. And I'm one of them, let's be very clear here. It comes down to a problem where backups are a multifaceted, deep problem, and I don't think that people necessarily think of it that way. But I take a look at all of the different, even AWS services that I use for my various nonsense, and which ones can be used to store data? Well, all of them. Some of them, you have to hold it in a particularly wrong sort of way, but they all store data. And in various contexts, a lot of that data becomes very important. So, what service am I using, in which account am I using it, and in what region am I using it? And you wind up with data sprawl, where it's a tremendous amount of data that you can generally only track down by looking at your bills at the end of the month. Okay, so what am I being charged, and for what service? That seems like a good place to start, but where is it getting backed up? How do you think about that? So, some people, I think, tend to ignore the problem, which we're seeing less and less, but other folks tend to go to the opposite extreme and we're just going to back up absolutely everything, and we're going to keep that data for the rest of our natural lives. It feels to me that there's probably an answer that is more appropriate somewhere nestled between those two extremes.Sam: Yeah, snapshot sprawl is a real thing, and it gets very, very expensive very, very quickly. You know, your snapshots of EC2 instances are stored on those attached EBS volumes. Five cents per gig per month doesn't sound like a lot, but when you're dealing with thousands of snapshots for thousands of machines, it gets out of hand very, very quickly. And you don't know when to delete them. Like you say, folks are just retaining them forever and dealing with this unfortunate bill shock. So, you know, where to start is automating the lifecycle of a snapshot, right, from its creation—how often do we want to be creating them—to the retention—how long do we want to keep these for—and where do we want to keep them, because there are other storage services outside of just EBS volumes. And then, of course, the ultimate: deletion. And that's important even from a compliance perspective as well, right? You've got to retain data for a specific number of years, I think healthcare is like seven years, but then you've—Corey: And then not a day more.Sam: Yeah, and then not a day more because that puts you out of compliance, too. So, policy-based automation is your friend and we see a number of folks building these policies out: gold, silver, bronze tiers based on criticality of data and compliance, and really just kind of letting the machine do the rest. And you can focus on not babysitting backup.Corey: What was it that led to the rise of snapshots? Because back in my very early days, there was no such thing. We wound up using a bunch of servers stuffed in a rack somewhere and virtualization was not really in play, so we had file systems on physical disks. And how do you back that up? 
Well, you have an agent of some sort that basically looks at all the files and, according to some ruleset that it has, copies them off somewhere else. It was slow, it was fraught, it had a whole bunch of logic that was pushed out to the very edge, and forget about restoring that data in a timely fashion or even validating a lot of those backups worked other than via checksum. And God help you if you had data that was constantly in a state of flux, where anything changing during the backup run would leave your backups in an inconsistent state. That, on some level, seems to have largely been solved by snapshots. But what's your take on it? You're a lot closer to this part of the world than I am.Sam: Yeah, snapshots, I think folks have turned to snapshots for the speed, the lack of impact that they have on production performance, and again, just the ease of accessibility. We have access to all different kinds of snapshots for EC2, RDS, EFS throughout the entirety of our AWS environment. So, I think the snapshots are kind of like the default go-to for folks. They can help deliver those very, very quick RPOs, especially in, for example, databases, like you were saying, that change very, very quickly, where we all of a sudden are stranded with a crash-consistent backup or snapshot versus an application-consistent snapshot. And then they're also very, very quick to recover from. So, snapshots are very, very appealing, but they absolutely do have their limitations. And I think, you know, it's not one or the other; it's that they've got to go hand-in-hand with something else. And typically, that is an image-based backup that is stored in a separate location to the snapshot because that snapshot is not independent of the disk that it is protecting.Corey: One of the challenges with snapshots is most of them are created in a copy-on-write sense. It takes basically an instant, frozen point in time. Back once upon a time when we ran MySQL databases on top of the NetApp Filer—which worked surprisingly well—we would have a script that would automatically quiesce the database so that it would be in a consistent state, snapshot the filer, and then un-quiesce it, which took less than a second, start to finish. And that was awesome, but then you had this snapshot type of thing. It wasn't super portable, it needed to reference a previous snapshot in some cases, and AWS takes the same approach, where the first snapshot captures every block, then subsequent snapshots wind up only taking up as much space as there have been changes since the previous snapshot. So, large quantities of data that generally don't get accessed a whole lot have remarkably small subsequent snapshot sizes. But that's not at all obvious from the outside looking at these things. They're not the most portable thing in the world. But it's definitely the direction that the industry has trended in. So, rather than having a cron job fire off an AWS API call to take snapshots of my volumes, as sort of the baseline approach that we all started with, what is the value proposition that you folks bring? And please don't say it's, “Well, cron jobs are hard and we have a friendlier interface for that.”Sam: [laugh]. 
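Before Sam's answer, it's worth noting that even Corey's cron-job baseline has a managed successor in AWS itself: an Amazon Data Lifecycle Manager policy covering the creation, retention, and deletion steps Sam described earlier. A hedged sketch; the role ARN and tag values are placeholders:

import boto3

dlm = boto3.client("dlm")

dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
    Description="gold tier: daily snapshots, 14-day retention",
    State="ENABLED",
    PolicyDetails={
        "ResourceTypes": ["VOLUME"],
        # Placeholder tag implementing a gold/silver/bronze tiering scheme.
        "TargetTags": [{"Key": "backup-tier", "Value": "gold"}],
        "Schedules": [{
            "Name": "daily",
            "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
            "RetainRule": {"Count": 14},   # deletion is part of the policy, too
        }],
    },
)
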
I think it's really starting to look at the proliferation of those snapshots, understanding what they're good at, and what they are good for within your environment—as previously mentioned, low RPOs, low RTOs, how quickly can I take a backup, how frequently can I take a backup, and more importantly, how quickly can I restore—but then looking at their limitations. So, I mentioned that they were not independent of that disk, so that certainly does introduce a single point of failure as well as being not so secure. We've kind of touched on the cost component of that as well. So, what Veeam can come in and do is then take an image-based backup of those snapshots, right—so you've got your initial snapshot and then your incremental ones—we'll take the backup from that snapshot, and then we'll start to store that elsewhere. And that is likely going to be in a different account. We can look at the Well-Architected Framework, AWS deeming accounts as a security boundary, so having that cross-account function is critically important so you don't have that single point of failure. Locking down with IAM roles is also incredibly important so we haven't just got a big wide open door between the two. But that data is then stored in a separate account—potentially in a separate region, maybe in the same region—in Amazon S3 storage. And S3 has the wonderful benefit of being still relatively performant, so we can have quick recoveries, but it is much, much cheaper. You're dealing with 2.3 cents per gig per month, instead of—Corey: To start, and it goes down from there with sizeable volumes.Sam: Absolutely, yeah. You can go down to S3 Glacier, where you're looking at, I forget how many points and zeros and nines it is, but it's fractions of a cent per gig per month, but it's going to take you a couple of days to recover that da—Corey: Even infrequent access cuts that in half.Sam: Oh yeah.Corey: And let's be clear, these are snapshot backups; you probably should not be accessing them on a consistent, sustained basis.Sam: Well, exactly. And this is where it's kind of almost like having your cake and eating it as well. Compliance or regulatory mandates or corporate mandates are saying you must keep this data for this length of time. Keeping that—you know, let's just say it's three years' worth of snapshots—on EBS volumes is going to be incredibly expensive. What's the likelihood of you needing to recover something from two years—actually, even two months ago? It's very, very small. So, the performance part of S3 is, you don't need to take it as much into consideration. Can you recover? Yes. Is it going to take a little bit longer? Absolutely. But it's going to help you meet those retention requirements while keeping your backup bill low, avoiding that bill shock, right, spending tens and tens of thousands every single month on snapshots. This is what I mean by kind of having your cake and eating it.Corey: I somewhat recently have had a client where EBS snapshots are one of the driving costs behind their bill. It is one of their largest single line items. And I want to be very clear here, because if you're one of those people listening to this and thinking, “Well, hang on. Wait, are they telling stories about us, even though they're not naming us by name?” Yeah, there were three of you in the last quarter. So, at that point, it becomes clear it is not about something that one individual company has done and more about an overall driving trend. 
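For concreteness, the storage numbers in that exchange pencil out roughly like this (illustrative list prices that vary by region and change over time):

gb = 50_000  # say, 50 TB of retained backups; an arbitrary example size

prices_per_gb_month = {
    "EBS snapshots": 0.05,
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,            # about half of Standard, as noted
    "S3 Glacier Deep Archive": 0.00099,  # the fractions-of-a-cent tier
}

for tier, price in prices_per_gb_month.items():
    print(f"{tier:>24}: ${gb * price:>10,.2f}/month")
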
I am personalizing it a little bit by referring to it as one company when there were three of you. This is a narrative device, not me breaking confidentiality. Disclaimer over. Now, when you talk to people about, “So, tell me why you've got 80 times more snapshots than you do EBS volumes?” The answer is, “Well, we wanted to back things up and we needed to get hourly backups to a point, then daily backups, then monthly, and so on and so forth. And when this was set up, there wasn't a great way to do this natively and we don't always necessarily know what we need versus what we don't. And the cost of us backing this up, well, you can see it on the bill. The cost of us deleting too much and needing it as soon as we do? Well, that cost is almost incalculable. So, this is the safe way to go.” And they're not wrong in anything that they're saying. But the world has definitely evolved since then.Sam: Yeah, yeah. It's a really great point. Again, it just folds back into my whole having your cake and eating it conversation. Yes, you need to retain data; it gives you that kind of nice, warm, cozy feeling, a nice blanket on a winter's day, knowing that, irrespective of what happens, you're going to have something to recover from. But the question is, does that need to be living on an EBS volume as a snapshot? Why can't it be living on much, much more cost-effective storage that's still going to give you the warm and fuzzies, but is going to make your finance team much, much happier? [laugh].Corey: One of the inherent challenges I think people have is that snapshots by themselves are almost worthless, in that I have an EBS snapshot, it is sitting there now, it's costing me an undetermined amount of money because it's not exactly clear on a per-snapshot basis exactly how large it is, and okay, great. Well, I'm looking for a file that was not modified since X date, as it was at that time. Well, great, you're going to have to take that snapshot, restore it to a volume and then go exploring by hand. Oh, it was the wrong one. Great. Try it again with a different one. And after, like, the fifth or sixth in a row, you start doing a binary search approach on this thing. But it's expensive, it's time-consuming, it takes forever, and it's not a fun user experience at all. Part of the problem is it seems that historically, backup systems have no context or no contextual awareness whatsoever around what is actually contained within that backup.Sam: Yeah, yeah. I mean, you kind of highlighted two of the steps. It's more like a ten-step process to do, you know, granular file or folder-level recovery from a snapshot, right? You've got to, like you say, determine the point in time when, you know, you knew the last time that it was around; then you're going to have to determine the volume size, the region, the OS; you're going to have to create an EBS volume of the same size and region from that snapshot; create the EC2 instance with the same OS; connect the two together; boot the EC2 instance; mount the volume; search for the files to restore; download them manually; at which point you have your file back. It's not back in the machine where it was, it's now been downloaded locally to whatever machine you're accessing that from. And then you've got to tear it all down. And that is, again, like you say, predicated on the fact that you knew exactly that that was the right time. It might not be, and then you have to start from scratch from a different point in time. 
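Condensed into API calls, the manual dance Sam just enumerated looks roughly like this; IDs are placeholders, and the mounting and searching still happen by hand inside the instance:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Guess a point in time, then materialize that snapshot as a volume.
vol = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",   # placeholder guess
    AvailabilityZone="us-east-1a",         # must match the rescue instance's AZ
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# Bolt it onto a rescue instance, then mount, search, and download by hand.
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",      # placeholder rescue instance
    Device="/dev/sdf",
)
# ...and tear it all down afterward; if the guess was wrong, start over.
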
So, backup tooling from backup vendors that have been doing this for many, many years knew about this problem long, long ago, and really seeks to not only automate the entirety of that process but make the whole e-discovery, the search, the location of those files, much, much easier. I don't necessarily want to do a vendor pitch, but I will say with Veeam, we have explorer-like functionality, whereby it's just a simple web browser. Once that machine is all spun up again, automatic process, you can just search for your individual file, folder, locate it, you can download it locally, you can inject it back into the instance where it was through Amazon Kinesis or AWS Kinesis—I forget the right terminology for it; some of it's AWS, some of it's Amazon. But by-the-by, the whole recovery process, especially from a file or folder level, is much more pain-free, but also much faster. And that's ultimately what people care about: how reliable is my backup? How quickly can I get stuff back online? Because the time that I'm down is costing me an indescribable amount of time or money.Corey: This episode is sponsored in part by our friends at Redis, the company behind the incredibly popular open source database. If you're tired of managing open source Redis on your own, or if you are looking to go beyond just caching and unlocking your data's full potential, these folks have you covered. Redis Enterprise is the go-to managed Redis service that allows you to reimagine how your geo-distributed applications process, deliver, and store data. To learn more from the experts in Redis how to be real-time, right now, from anywhere, visit redis.com/duckbill. That's R - E - D - I - S dot com slash duckbill.Corey: Right, the idea of RPO versus RTO: recovery point objective and recovery time objective. With an RPO, it's: great, disaster strikes right now; how long is it acceptable for it to have been since the last time we backed up data to a restorable point? Sometimes it's measured in minutes, sometimes it's measured in fractions of a second. It really depends on what we're talking about. Payments databases, that needs to be—the RPO basically asymptotically approaches zero. The RTO is: okay, how long is it acceptable before we have that data restored and are back up and running? And that is almost always a longer time, but not always. And there's a different series of trade-offs that go into that. But both of those also presuppose that you've already dealt with the existential question of is it possible for us to recover this data. And that's where I know that you are obviously—you have a position on this that is informed by where you work, but I don't, and I will call this out as what I see in the industry: AWS Backup is compelling to me except for one fatal flaw that it has, and that is it starts and stops with AWS. I am not a proponent of multi-cloud. Lord knows I've gotten flak for that position a bunch of times, but the one area where it makes absolute sense to me is backups. Have your data in a rehydrate-the-business level state backed up somewhere that is not your primary cloud provider, because you're otherwise single-point-of-failure-ing through a company, through the payment instrument you have on file with that company, in the blast radius of someone who can successfully impersonate you to that vendor. There has to be a gap of some sort for the truly business-critical data. Yes, egress to other providers is expensive, but you know what also is expensive? Irrevocably losing the data that powers your business. 
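Short of a second provider, the weaker single-cloud version of the gap Corey describes can at least be sketched: copy the snapshot into another region under a customer-managed key, then share it with a separate, locked-down recovery account. IDs and the key alias are placeholders:

import boto3

ec2_west = boto3.client("ec2", region_name="us-west-2")

copy = ec2_west.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId="snap-0123456789abcdef0",   # placeholder
    Description="DR copy of prod data volume",
    Encrypted=True,
    KmsKeyId="alias/dr-backup-key",   # customer-managed key; the default
                                      # aws/ebs key cannot be shared cross-account
)

ec2_west.modify_snapshot_attribute(
    SnapshotId=copy["SnapshotId"],
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=["210987654321"],         # placeholder recovery account
)
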
Is it likely? No, but I would much rather do it than have to justify why I'm not doing it.Sam: Yeah. It wasn't likely that I was going to win that $2 billion or $2.1 billion on the Powerball, but [laugh] I still play [laugh]. But I understand your standpoint on multi-cloud and I read your newsletters and understand where you're coming from, but I think the reality is that we do live in at least a hybrid cloud world, if not multi-cloud. The number of organizations that are sole-sourced on a single cloud and nothing else is a relatively small, single-digit percentage. It's around 80-some percent that are hybrid, and the remainder of them are your favorite: multi-cloud. But again, having something that is one hundred percent sole-sourced on a single platform or a single vendor does expose you to a certain degree of risk. So, having the ability to do cross-platform backups, recoveries, migrations, for whatever reason, matters, right? Because it might not just be a disaster like you'd mentioned; it might also just be… I don't know, the company has been taken over and all of a sudden, the preference is now towards another cloud provider, and I want you to refactor and re-architect everything for this other cloud provider. If all that data is locked into one platform, that's going to make your job very, very difficult. So, we mentioned at the beginning of the call, Veeam is capable of protecting a vast number of heterogeneous workloads on different platforms, in different environments, on-premises, in multiple different clouds, but the other key piece is that we always use the same backup file format. And why that's key is because it enables portability. If I have backups of EC2 instances that are stored in S3, I could copy those onto on-premises disk, I could copy those into Azure, I could do the same with my Azure VMs and store those on S3, or again, on-premises disk, and any other endless combination that goes with that. And it's really kind of centered around, like, control and ownership of your data. We are not prescriptive by any means. Like, you do what is best for your organization. We just want to provide you with the toolset that enables you to do that without steering you in one direction or the other with fee structures, disparate feature sets, whatever it might be.Corey: One of the big challenges that I keep seeing across the board is just a lack of awareness of what the data that matters is. And I'm one of them, let's be very clear here. It's where you see people backing up endless fleets of web server instances that are auto-scaled into existence and then removed, but you can create those things at will; why do you care about the actual data that's on these things? It winds up almost as a library management problem, on some level. And in that scenario, snapshots are almost certainly the wrong answer. One thing that I saw previously that really changed my way of thinking about this was back many years ago when I was working at a startup that had just started using GitHub and they were paying for a third-party service that wound up backing up Git repos. Today, that makes a lot more sense because you have a bunch of other stuff on GitHub that goes well beyond the stuff contained within Git, but at the time, it was silly. It was, why do that? Every Git clone is a full copy of the entire repository history. Just grab it off some developer's laptop somewhere. It's like, “Really? 
You want to bet the company, slash your job, slash everyone else's job on that being feasible and doable, or do you want to spend the 39 bucks a month or whatever it was to wind up getting that out the door now so we don't have to think about it, and they validate that it works?” And that was really a shift in my way of thinking because, yeah, backing up things can get expensive when you have multiple copies of the data living in different places, but what's really expensive is not having a company anymore.Sam: Yeah, yeah, absolutely. We can tie it back to my insurance dynamic earlier where, you know, it's something that you know that you have to have, but you don't necessarily want to pay for it. Well, you know, just like with insurance, there are multiple different ways to go about recovering your data, and it's only in crunch time that you really care about what it is that you've been paying for, right, when it comes to backup? Could you get your backup through a git clone? Absolutely. Could you get your data back—how long is that going to take you? How painful is that going to be? What's going to be the impact to the business while you're trying to figure that out, versus, like you say, the 39 bucks a month, a year, or whatever it might be, to have something purpose-built for that that is going to make the recovery process as quick and painless as possible and just get things back up online?Corey: I am not a big fan of the fear, uncertainty, and doubt approach, but I do practice what I preach here, in that, yeah, there is a real fear of data loss. It's not, “People are coming to get you, so you absolutely have to buy whatever it is I'm selling,” but it is something you absolutely have to think about. My core consulting proposition is that I optimize the AWS bill. And sometimes that means spending more. Okay, that one S3 bucket is extremely important to you and you say you can't sustain the loss of it, ever, so One Zone is not an option. Where is it being backed up? Oh, it's not? Yeah, I suggest you spend more money and back that thing up if it's as irreplaceable as you say. It's about doing the right thing.Sam: Yeah, yeah, it's interesting, and it's going to be hard for you to prove the value of doing that when you are driving their bill up while you're trying to bring it down. But again, you have to look at something that's not itemized on that bill, which is going to be the impact of downtime. I'm not going to pretend to try and recall the exact figures because it also varies depending on your business, your industry, the size, but the impact of downtime is massive financially. Tens of thousands of dollars per hour for small organizations, millions and millions of dollars per hour for much larger organizations. The backup component of that is relatively small in comparison, so it's worth having something that is purpose-built, that is going to protect your data and help mitigate that impact of downtime. Because that's ultimately what you're trying to protect against. It is the recovery piece that you're buying that is the most important piece. And like you, I would say, at least be cognizant of it and evaluate your options: what you can live with and what you can live without.Corey: That's the big burning question that I think a lot of people do not have a good answer to. And when you don't have an answer, you either back up everything or nothing. And I'm not a big fan of doing either of those things blindly.Sam: Yeah, absolutely. 
And I think this is why we see varying backup options as well, you know? You're not going to try and apply the same data protection policies to each and every single workload within your environment because they've all got different types of workload criticality. And like you say, some of them might not even need to be backed up at all, just because they don't have data that needs to be protected. So, you need something that is going to be flexible enough to apply across the entirety of your environment, protect it with the right policy, in terms of how frequently do you protect it, where do you store it, how often, or when are you eventually going to delete that, and apply that on a workload-by-workload basis. And this is where the joy of things like tags comes into play as well.Corey: One last thing I want to bring up is that I'm a big fan of watching for companies saying the quiet part out loud. And one area in which they do this—because they're forced to by brevity—is in the title tag of their website. I pull up veeam.com and I hover over the tab in my browser, and it says, “Veeam Software: Modern Data Protection.” And I want to call that out because you're not framing it explicitly as backup. So, the last topic I want to get into is the idea of security. Because I think it is not fully appreciated on a lived-experience basis—although people will of course agree to this when they're having ivory tower whiteboard discussions—that every place your data lives is a potential place for a security breach to happen. So, you want to have your data living in a bunch of places, ideally, for backup and resiliency purposes. But you also want it to be completely unworkable or illegible to anyone who is not authorized to have access to it. How do you balance those trade-offs yourself, given that what you're fundamentally saying is, “Trust us with your Holy of Holies when it comes to things that power your entire business?” I mean, I can barely get some companies to agree to show me their AWS bill, let alone this is the data that contains all of this stuff to destroy our company.Sam: Yeah. Yeah, it's a great question. Before I explicitly answer that piece, I will just say that modern data protection does absolutely have a security component to it, and I think that backup absolutely needs to be a—I'm going to say this in air quotes—a “first-class citizen” of any security strategy. I think when people think about security, their mind goes to the preventative, like how do we keep these bad people out? This is going to be a bit of the FUD that you love, but ultimately, the bad guys on the outside have an infinite number of attempts to get into your environment and only have to be right once to get in and start wreaking havoc. You, on the other hand, as the good guy with your cape and whatnot, have got to be right each and every single one of those times. And we as humans are fallible, right? None of us are perfect, and it's incredibly difficult to defend against these ever-evolving, more complex attacks. So, backup: if someone does get in, having a clean, verifiable, recoverable backup is really going to be the only thing that is going to save your organization, should that actually happen. And what's key to a secure backup? I would say separation, isolation of backup data from the production data. I would say utilizing things like immutability: in AWS, we've got Amazon S3 Object Lock, so it's that write-once, read-many state for whatever retention period that you put on it. 
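A minimal sketch of the S3 Object Lock setup Sam is describing, with the bucket name as a placeholder and an arbitrary retention window; in compliance mode, nobody in the account, root included, can shorten or remove the lock:

import boto3

s3 = boto3.client("s3")   # assumes us-east-1; other regions need a location constraint

s3.create_bucket(
    Bucket="example-backup-vault",       # placeholder
    ObjectLockEnabledForBucket=True,     # can only be set at creation time
)

s3.put_object_lock_configuration(
    Bucket="example-backup-vault",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
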
So, the data that they're seeking to encrypt, whether it's in production or in their backup, they cannot encrypt it. And then the other piece that I think is coming more and more into play—and is almost table stakes—is encryption, right? And we can utilize things like AWS KMS for that encryption.But that's there to help defend against the exfiltration attempts. Because these bad guys are realizing, "Hey, people aren't paying me my ransom because they're just recovering from a clean backup, so now I'm going to take that backup data, I'm going to leak the personally identifiable information, trade secrets, or whatever on the internet, and that's going to put them in breach of compliance and give them a hefty fine that way unless they pay me my ransom." So encryption, so they can't read that data. Not only can they not change it, but they also can't read it, which is equally important. So, I would say those are the three big things for me on what's needed for backup to make sure it is clean and recoverable.Corey: I think that is one of those areas where people need to put additional levels of thought in. I think that if you have access to the production environment and have full administrative rights throughout it, you should definitionally not—at least with that account and ideally not you at all personally—have access to alter the backups. Full stop. I would say, on some level, there should not be the ability to alter backups for some particular workloads, the idea being that if you get hit with a ransomware infection, it's pretty bad, let's be clear, but if you can get all of your data back, it's more of an annoyance than it is, again, the existential business crisis that becomes something that redefines you as a company—if you still are a company.Sam: Yeah. Yeah, I mean, we can turn to a number of organizations. Code Spaces always springs to mind for me; I love Code Spaces. It was kind of one of those precursors to—Corey: It's amazing.Sam: Yeah, but they were running on AWS and they had everything, production and backups, all stored in one account. The attackers got into the account. "We're going to delete your data if you don't pay us this ransom." They were like, "Well, we're not paying you the ransom. We got backups." Well, they deleted those, too. And, you know, unfortunately, Code Spaces isn't around anymore. But it really kind of goes to show just the importance of at least logically separating your data across different accounts and not having that god-like access to absolutely everything.Corey: Yeah, when you talked about Code Spaces, I was in [unintelligible 00:32:29] talking about GitHub Codespaces specifically, where they have their developer workstations in the cloud. They're still very much around, at least last time I saw, unless you know something I don't.Sam: Precursor to that. I can send you the link—Corey: Oh oh—Sam: You can share it with the listeners.Corey: Oh, yes, please do. I'd love to see that.Sam: Yeah. Yeah, absolutely.Corey: And it's been a long and strange time in this industry. Speaking of links for the show notes, I appreciate you spending so much time with me. Where can people go to learn more?Sam: Yeah, absolutely. I think veeam.com is kind of the first place that people gravitate towards. Me personally, I'm kind of like a hands-on learning kind of guy, so we always make free product available.And then you can find that on the AWS Marketplace. Simply search ‘Veeam' through there. A number of free products; we don't put time limits on them, we don't put feature limitations. 
You can back up ten instances, including your VPCs, which we actually didn't talk about today, but I do think is important. But I won't waste any more time on that.Corey: Oh, configuration of these things is critically important. If you don't know how everything was structured and built out, you're basically trying to re-architect from first principles based upon archaeology.Sam: Yeah [laugh], that's a real pain. So, we can help protect those VPCs and we actually don't put any limitations on the number of VPCs that you can protect; it's always free. So, if you're going to use it for anything, use it for that. But hands-on, marketplace, if you want more documentation, want to learn more, want to speak to someone, veeam.com is the place to go.Corey: And we will, of course, include that in the show notes. Thank you so much for taking so much time to speak with me today. It's appreciated.Sam: Thank you, Corey, and thanks for all the listeners tuning in today.Corey: Sam Nicholls, Director of Public Cloud at Veeam. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry insulting comment that takes you two hours to type out but then you lose it because you forgot to back it up.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

AWS Morning Brief
The Releases of re:Invent are in Full Swing


Nov 29, 2022 • 5:57


Links:
- Last Week in AWS Community Slack
- Amazon ECS Service Connect
- Amazon RDS Optimized Reads and Writes
- Fully Managed Blue/Green Deployments in Aurora and RDS
- Protect Sensitive Data with CloudWatch Logs
- Amazon CloudWatch Cross-Account Observability
Stay Up To Date with re:Quinnvent:
- Sign up for the re:Quinnvent Newsletter
- Check out the re:Quinnvent playlist on YouTube
If you're on site:
- Join Corey for a Nature Walk through the Expo Hall beginning at the Fortinet booth tomorrow (11/29/22) at 1pm PST, or
- For drinks at Atomic Liquors tomorrow evening at 8:15pm PST
Help the show:
- Share your feedback
- Subscribe wherever you get your podcasts
- Buy our merch
What's Corey up to?
- Follow Corey on Twitter (@quinnypig)
- See our recent work at the Duckbill Group
- Apply to work with Corey and the Duckbill Group to help lower your AWS bill

Screaming in the Cloud
The Need for Speed in Time-Series Data with Brian Mullen


Nov 29, 2022 • 32:55


About Brian
Brian is an accomplished dealmaker with experience ranging from developer platforms to mobile services. Before InfluxData, Brian led business development at Twilio. Joining at just thirty-five employees, he built over 150 partnerships globally from the company's infancy through its IPO in 2016. He led the company's international expansion, hiring its first teams in Europe, Asia, and Latin America. Prior to Twilio, Brian was VP of Business Development at Clearwire and held management roles at Amp'd Mobile, Kivera, and PlaceWare.Links Referenced: InfluxData: https://www.influxdata.com/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don't. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn't care enough about backups. If you're tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.Corey: This episode is brought to us by our friends at Pinecone. They believe that all anyone really wants is to be understood, and that includes your users. AI models combined with the Pinecone vector database let your applications understand and act on what your users want… without making them spell it out.Make your search application find results by meaning instead of just keywords, your personalization system make picks based on relevance instead of just tags, and your security applications match threats by resemblance instead of just regular expressions. Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable. Thanks to my friends at Pinecone for sponsoring this episode. Visit Pinecone.io to understand more.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's been a year, which means it's once again time to have a promoted guest episode brought to us by our friends at InfluxData. Joining me for a second time is Brian Mullen, CMO over at InfluxData. Brian, thank you for agreeing to do this a second time. You're braver than most.Brian: Thanks, Corey. I'm happy to be here. Second time is the charm.Corey: So, it's been an interesting year to put it mildly, and I tend to have the attention span of a goldfish on most days, so for those who are similarly flighty, let's start at the very top. What is an InfluxDB slash InfluxData slash Influx—when you're not sure which one to use, just shorten it and call it good—and why might someone need it?Brian: Sure. So, InfluxDB is what most people understand our product as, a pretty popular open-source product, been out for quite a while. And then our company, InfluxData, is the company behind InfluxDB. And InfluxDB is where developers build IoT, real-time analytics, and cloud applications, typically all based on time series. 
It's a time-series data platform specifically built to handle time-series data, which we think about as any type of data that is stamped in time in some way.It could be metrics, like, taken every one second, every two seconds, every three seconds, or some kind of event that occurs and is stamped in time in some way. So, our product and platform is really specialized to handle that technical problem.Corey: When last we spoke, I contextualized that in the realm of an IoT sensor that winds up reporting its device ID and its temperature at a given timestamp. That is sort of baseline stuff that I think aligns with what we're talking about. But over the past year, I started to see it in a bit of a different light, specifically viewing logs as time-series data, which hadn't occurred to me until relatively recently. And it makes perfect sense, on some level. It's weird to contextualize what Influx does as being a logging database, but there's absolutely no reason it couldn't be.Brian: Yeah, it certainly could. So typically, we see the world of time-series data in kind of two big realms. One is, as you mentioned—you know, think of it as the hardware or physical realm: devices and sensors. These are things that are going to show up in a connected car, in a factory deployment, in a renewable energy, you know, wind farm. And those are real devices and pieces of hardware that are out in the physical world, collecting data and emitting, you know, time series every one second, or five seconds, or ten minutes, or whatever it might be.But it also, as you mentioned, applies to, call it the virtual world, which is really all of the software and infrastructure that is being stood up to run applications and services. And so, in that world, it could be the same—it's just a different type of source, but it is really kind of the same technical problem. It's still time-series data being stamped every, you know, one second, every five seconds, in some cases, every millisecond, but it is coming from a source that is actually in the infrastructure. Could be, you know, virtual machines, it could be containers, it could be microservices running within those containers. And so, all of those things together, both in the physical world and this infrastructure world, are all emitting time-series data.Corey: When you take a look at the broader ecosystem, what is it that you see that has been the most misunderstood about time-series data as a whole? For example, when I saw AWS talking about a lot of things that they did in the realm of data lakes, I talked to clients of mine about this and their response is, "Well, that'd be great, genius, if we had a data lake." It's, "What do you think those petabytes of nonsense in S3 are?" "Oh, those are the logs and the assets and a bunch of other nonsense." "Yeah, that's what other people are calling a data lake." "Oh." Do you see a similar lights-go-on moment when you talk to clients and prospective clients about what it is that they're doing that they just hadn't considered to be time-series data previously?Brian: Yeah. In fact, that's exactly what we see with many of our customers: they didn't realize that all of a sudden, they are now handling a pretty sizable time-series workload. And if you kind of take a step back and look at a couple of pretty obvious but sometimes unrecognized trends in technology, the first is cloud applications in general are expanding, both horizontally and vertically. 
So, that means, like, the workloads that are being run in the Netflixes of the world, or all the different infrastructure that's being spun up in the cloud to run these various, you know, applications and services, those workloads are getting bigger and bigger; those companies and their subscriber bases, and the amount of data they're generating, are getting bigger and bigger. They're also expanding horizontally by region and geography.So Netflix, for example, running not just in the US, but in every continent and probably every cloud region around the world. So, that's happening in the cloud world, and then also, in the IoT world, there's this massive growth of connected devices, both net-new devices that are being developed—kind of, you know, the next Peloton or the next climate control unit that goes in an apartment or house—and also these longtime legacy devices that have been on the factory floor for a couple of decades, but now are being kind of modernized and coming online. So, if you look at all of that growth of the data sources now being built up in the cloud and you look at all that growth of these connected devices, both new and existing, that are kind of coming online, there's a huge, now exponential, growth in the sources of data. And all of these sources are emitting time-series data. You can just think about a connected car—not even a self-driving car, just a connected car, your everyday, kind of, 2022 model—and nearly every element of the car is emitting time-series data: its engine components, you know, your tires, like, what the climate inside of the car is, statuses of the engine itself, and it's all doing that in real-time, so every one second, every five seconds, whatever.So, I think in general, people just don't realize they're already dealing with a substantial workload of time series. And in most cases, unless they're using something like Influx, they're probably not, you know, especially tuned to handle it from a technology perspective.Corey: So, it's been a year. What has changed over on your side of the world since the last time we spoke? It seems that, well, things continue and they're up and to the right. Well, sure, generally speaking, you're clearly still in business. Good job, always appreciative of your custom, as well as the fact that oh, good, even in a world where it seems like there's a macro recession in progress, that there are still companies out there that continue to persist and in some cases, dare I say, even thrive? What have you folks been up to?Brian: Yeah, it's been a big year. So first, we've seen quite a bit of expansion across the use cases. So, we've seen even further expansion in IoT, kind of expanding into consumer, industrial, and now sustainability and clean energy, and that pairs with what we've seen on FinTech and cryptocurrency, gaming and entertainment applications, network telemetry, including some of the biggest names in telecom, and then a little bit more on the cloud side with cloud services, infrastructure, and dev tools and APIs. So, quite a bit more broad set of use cases we're now seeing across the platform. And the second thing—you might have seen it in the last month or so—is a pretty big announcement we had of our new storage engine.So, this was just announced earlier this month in November and was previously introduced to our community as what we called IOx, which is how it was known in open-source. 
And think of this really as a rebuilt and reimagined storage engine, which is built on that open-source project, InfluxDB IOx, that allows us to deliver faster queries, and now—pretty exciting—for the first time, unlimited time series, or cardinality as it's known in the space. And then also we introduced SQL for writing queries and BI tool support. And this is the first time we're introducing SQL, which is the world's most popular data programming language, to our platform, enabling developers to query via the API in addition to our languages Flux and InfluxQL.Corey: A long time ago, it really seems that the cloud took a vote, for lack of a better term, and decided that when it comes to storage, object store is the way forward. It was a bit of a reimagining from how we all considered using storage previously, but the economics are at a minimum ten to one in favor of object store, the latency is far better, the durability is off the charts better, you don't have to deal—at least in AWS-land—with the concept of availability zones and the rest; just from an economic and performance perspective, provided the use case embraces it, there's really no substitute.Brian: Yeah, I mean, the way we think about storage is, you know, obviously, it varies quite a bit from customer to customer with our use cases. Especially in IoT, we see some use cases where customers want to have data around for months and in some cases, years. So, it's a pretty substantial data set you're often looking at. And sometimes those customers want to downsample those; they don't necessarily need, in summary looking backward, every single piece of minutia that they may need in real-time. So, you really—we're in this kind of world where we're dealing with both high-fidelity—usually in-the-moment—data and lower fidelity, where people can downsample and have a little bit more of a summarized view of what happened.So, pretty unique for us, and we have to kind of design the product in a way that is able to balance both of those because that's what, you know, the customer use cases demand. It's a super hard problem to solve. One of the reasons that you have a product like InfluxDB, which is specialized to handle this kind of thing, is so that you can actually manage that balance in your application or service, setting your retention policy, et cetera.Corey: That's always been something that seemed a little on the odd side to me when I'm looking at a variety of different observability tools, where it seems that one of the key dimensions that they all tend to, I guess, operate on and price on is retention period. And I get it; you might not necessarily want to have your load balancer logs from 2012 readily available and be paying for the privilege, but it does seem that given the dramatic fall of archival storage pricing, on some level, people do want to be able to retain that data just on the off chance that it will be useful. Maybe that's my internal digital packrat chiming in at this point, but I do believe strongly that there is a correlation between how recent the data is and how useful it is, for a variety of different use cases. But that's also not a global truth. How do you view the divide? And what do you actually see people saying they want versus what they're actually using?Brian: It's a really good question and not a simple problem to solve. So, first of all, I would say it probably really depends on the use case and the extent to which that use case is touching real-world applications and services. 
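To make that concrete, here is a minimal write-then-query sketch using the influxdb-client Python package. The connection details, bucket, and measurement names are invented for illustration; the query uses Flux, the platform's existing language, and the SQL in the trailing comment only illustrates the newly announced support, whose client-side specifics may differ. The aggregateWindow() call is also the building block for the downsampling trade-off discussed here: rolling high-fidelity points up into coarser summaries.

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Connection details and names below are placeholders.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# One time-stamped reading: a measurement, a tag identifying the source,
# and a field observed at that instant. With no explicit timestamp, the
# server stamps the point on arrival.
write_api = client.write_api(write_options=SYNCHRONOUS)
write_api.write(
    bucket="telematics",
    record=Point("engine").tag("vehicle_id", "car-42").field("rpm", 2150.0),
)

# Read it back, downsampled to one-minute means over the last hour.
flux = (
    'from(bucket: "telematics")'
    " |> range(start: -1h)"
    ' |> filter(fn: (r) => r._measurement == "engine" and r._field == "rpm")'
    " |> aggregateWindow(every: 1m, fn: mean)"
)
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())

# A rough equivalent under the newly announced SQL support (the exact
# client mechanics for SQL may differ):
#   SELECT date_bin(INTERVAL '1 minute', time) AS minute, avg(rpm)
#   FROM engine
#   WHERE time >= now() - INTERVAL '1 hour'
#   GROUP BY minute;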
So, in a pure observability setting where you're looking at, perhaps, more of a, kind of, operational view of infrastructure monitoring, you want to understand kind of what happened and when, and those tend to be a little bit more focused on real-time and recent data. So, for example, you, of course, want to know exactly what's happening in the moment, zero in on whatever anomaly and kind of surrounding data there is; perhaps that means you're digging into something that happened in, you know, fairly recent time. So, those do tend to be—not all of them, but they do tend to be—a little bit more real-time and recent-oriented.I think it's a little bit different when we look at IoT. Those generally tend to be longer timeframes that people are dealing with. They're physical, out-in-the-field devices; you know, many times those devices are kind of coming online and offline, depending on the connectivity, depending on the environment. You can imagine a connected smart agriculture setup—I mean, those are a pretty wide array of devices out in, you know, who knows what kind of climate and environment—so they tend to be a little bit longer in retention policy, kind of, being able to dig into the data, what's happening. The time frame that people are dealing with is just, in general, much longer in some of those situations.Corey: One story that I've heard a fair bit about observability data and event data is that they inevitably compose down into metrics rather than events or traces or logs, and I have a hard time getting there because I can definitely see a bunch of log entries showing the web server's return codes: okay, here's the number of 500 errors and the number of different types of successes that we wind up seeing in the app. Yeah, all right, how many per minute, per second, per hour, whatever it is that makes sense that you can look at aberrations there. But in the development process at least, I find that having detailed log messages tell me about things I didn't see and need to understand or to continue building the dumb thing that I'm in the process of putting out. It feels like once something is productionalized and running, that its behavior is a lot more well understood, and at that point, metrics really seem to take over. How do you see it, given that you fundamentally live at that intersection where one can become the other?Brian: Yeah, we are right at that intersection and our answer probably would be both. Metrics are super important to understand and have that regular cadence and be kind of measuring that state over time, but you can miss things depending on how frequently those metrics are coming in. And increasingly, when you have the amount of data that you're dealing with coming from these various sources, the measurement is getting smaller and smaller. So, unless you have, you know, perfect metrics coming in every half-second, or, you know, in some sub-partition of that, in milliseconds, you're likely to miss something. And so, events are really key to understand those things that pop up and then maybe come back down that, in a pure metric setting with your regular interval, you would have just completely missed. So, we see in most of our use cases that a balance of the two is kind of the most effective. And from a product perspective, that's how we think about solving the problem, addressing both.Corey: One of the things that I struggled with is it seems that—again, my approach to this is relatively outmoded. 
I was a systems administrator back when that title was not considered disparaging by a good portion of the technical community the way that it is today. Even though the job is the same, we call them something different now. Great. Okay, whatever; smile, nod, and accept the larger paycheck.But my way of thinking about things is: okay, you have the logs, they live on the server itself. And maybe if you want to be fancy, you wind up putting them on a centralized rsyslog cluster or whatnot. Yes, you might send them as well to some other processing system for visibility or a third-party monitoring system, but the canonical truth slash source of logs tends to live locally. That said, I got out of running production infrastructure before this idea of ephemeral containers or serverless functions really became a thing. Do you find that these days you are the source of truth slash custodian of record for these log entries, or do you find that you are more of a secondary source for better visibility and analysis, but not what they're going to bust out when the auditor comes calling in three years?Brian: I think, again, it—[laugh] I feel like I'm answering the same way [crosstalk 00:15:53]Corey: Yeah, oh, and of course, let's be clear, use cases are going to vary wildly. This is not advice on anyone's approach to compliance and the rest [laugh]. I don't want to get myself in trouble here.Brian: Exactly. Well, you know, we kind of think about it in terms of profiles. And we see a couple of different profiles of customers using InfluxDB. So, the first—and this was kind of what we saw most often early on, and we still see quite a bit of them—is kind of more of that operator profile. And these are folks who are building some sort of monitoring, kind of, source of truth—something that's internally facing—to monitor applications or services, perhaps ones that other teams within their company built.And so that's, kind of like, a little bit more of your kind of pure operator. Yes, they're building up in the stack themselves, but it's to pay attention to essentially something that another team built. And then what we've seen more recently, especially as we've moved more prominently into the cloud and offered a usage-based service with, you know, APIs and endpoints people can hit, we see more people come into it from a builder's perspective. And similar in some ways, except that they're still building kind of a, you know, a source of truth for handling this kind of data. But they're also building the applications and services themselves that are taken out to market and are in the hands of customers.And so, it's a little bit different mindset. Typically, there's, you know, a little bit more comfort with using one of many services to kind of, you know, be part of the thing that they're building. And so, we've seen a little bit more comfort from that type of profile, using our service running in the cloud, using the API, and not worrying too much about the kind of, you know, underlying setup of the implementation.Corey: Love how serverless helps you scale big and ship fast, but hate debugging your serverless apps? With Lumigo's serverless observability, it's fast and easy (and maybe a little fun, too). End-to-end distributed tracing gives developers full clarity into their most complex serverless and containerized applications, connecting every service from AWS Lambda and Amazon ECS to DynamoDB, API Gateways, Step Functions and more. Try Lumigo free and debug 3x faster, reduce error rate and speed up development. 
Visit snark.cloud/lumigo. That's snark.cloud/L-U-M-I-G-O.Corey: So, I've been on record a lot saying that the best database is TXT records stuffed into Route 53, which works super well as a gag—let's be clear, don't actually build something on top of this, that's a disaster waiting to happen. I don't want to destroy anyone's career as I do this. But you do have a much more viable competitive threat on the landscape. And that is quite simply using the open-source version of InfluxDB. What is the tipping point where, "Huh, I can run this myself," turns into, "But I shouldn't. I should instead give money to other people to run it for me."Because having been an engineer, where I believe I'm the world's greatest everything, when it comes to my environment—a fact provably untrue, but that hubris never quite goes away entirely—at what point am I basically being negligent not to start dealing with you in a more formalized business context?Brian: First of all, let me say that we have many customers, many developers out there who are running open-source and it works perfectly for them. The workload is just right, the deployment makes sense. And so, there are many production workloads running on open-source. But typically, the kind of big turning point for people is on scale, and the overall performance related to that. And so, that's typically when they come and look at one of the two commercial offers.So, to start, open-source is a great place to, you know, kind of begin the journey, check it out, do that level of experimentation and kind of proof of concept. We also have 60,000-plus developers using our introductory cloud service, which is a free service. You simply sign up and can begin immediately putting data into the platform and building queries, and you don't have to worry about any of the setup and running servers to deploy software. So, both of those, the open-source and our cloud product, are excellent ways to get started. And then when it comes time to really think about building in production and moving up in scale, we have our two commercial offers.And the first of those is InfluxDB Cloud, which is our cloud-native offering, fully managed by InfluxData. We run this not only in AWS but also in Google Cloud and Microsoft Azure. It's a usage-based service, which means you pay exactly for what you use, and the three components that people pay for are data in, number of queries, and the amount of data you store in storage. We also have, for those who are interested in actually managing it themselves, InfluxDB Enterprise, which is a software subscription-based model, and it is self-managed by the customer in their environment. Now, that environment could be their own private cloud; it also could be on-premises in their own data center.And so, lots of folks who are a little bit more oriented to kind of managing software themselves rather than using a service gear toward that. But both those commercial offers, InfluxDB Cloud and InfluxDB Enterprise, are really designed for, you know, massive scale. In the case of Cloud, I mentioned earlier with the new storage engine, you can hit unlimited cardinality, which means you have no limit on the number of time series you can put into the platform, which is a pretty big game-changing concept. And so, that means however many time-series sources you have and however many series they're emitting, you can run that without a problem without any sort of upper limit in our cloud product. 
Over on the enterprise side with our self-managed product, that means you can deploy a cluster of whatever size you want. It could be a two-by-four, it could be a four-by-eight, or something even larger. And so, it gives people that are managing in their own private cloud or in a data center environment really their own options to kind of construct exactly what they need for their particular use case.Corey: Does your object storage layer make it easier to dynamically change clusters on the fly? I mean, historically, running things in a pre-provisioned cluster with EBS volumes or local disk was, "Oh, great. You want to resize something? Well, we're going to be either taking an outage or we're going to be building up something, migrating data live, and there's going to be a knife-switch cutover at some point that makes things relatively unfortunate." It seems that once you abstract the storage layer away from anything resembling an instance, you would be able to get away from some of those architectural constraints.Brian: Yeah, that's really the promise, and what is delivered in our cloud product is that you no longer, as a developer, have to think about that if you're using that product. You don't have to think about how big the cluster is going to be, you don't have to think about these kinds of disaster scenarios. It is all kind of pre-architected in the service. And so, the things that we really want to deliver to people go beyond the elimination of that concern for what the underlying infrastructure looks like and how it's operating. With infrastructure concerns kind of out of the way, what we want to deliver on are kind of the things that people care most about: real-time query speed.So, now with this new storage engine, you can query data across any time series within milliseconds—100-times-faster queries against high-cardinality data that were previously impossible. And we also have unlimited time-series volume. Again, any total number of time series you have, which is known as cardinality, is now able to run without a problem in the platform. And then we're also opening up the aperture a bit for developers with SQL language support. And so, this is just a whole new world of flexibility for developers to begin building on the platform. And again, this is all in the way that people are using the product without having to worry about the underlying infrastructure.Corey: For most companies—and this does not apply to you—their core competency is not running time-series databases and the infrastructure attendant thereof, so it seems like it is absolutely a great candidate for, "You know, we really could make this someone else's problem and let us instead focus on the differentiated thing that we are doing or building or complaining about."Brian: Yeah, that's a true statement. Typically what happens with time-series data is that people, first of all, don't realize they have it, and then when they realize they have time-series data, you know, the first thing they'll do is look around and say, "Well, what do I have here?" You know, I have this relational database over here or this document database over here, maybe even this, kind of, search database over here; maybe that thing can handle time series. And in a light manner, it probably does the job. 
But like I said, the sources of data and just the volume of time series are expanding, really across all these different use cases, exponentially.And so, pretty quickly, people realize that thing that may be able to handle time series in some minor manner is quickly no longer able to do it. They're just not purpose-built for it. And so, that's where really they come to a product like Influx to really handle this specific problem. We're built specifically for this purpose, and so, as the time-series workload expands and kind of hits that tipping point, you really need a specialized tool.Corey: Last question, before I turn you loose to prepare for re:Invent, of course—well, I guess we'll ask you a little bit about that afterwards, but first, we can talk a lot theoretically about what your product could or might theoretically do. What are you actually seeing? Other than the stereotypical use cases we've talked about, what have you seen people using it for that surprised you?Brian: Yeah, some of it is—it's just really interesting how it connects to, you know, things you see every day and/or use every day. I mean, chances are, many people listening have probably used InfluxDB and, you know, perhaps didn't know it. You know, if anyone has been to a home that has Tesla Powerwalls—Tesla is a customer of ours—then they've seen InfluxDB in action. Tesla's pulling time-series data from these connected Powerwalls that are in solar-powered homes, and they monitor things like health and availability and performance of those solar panels and the battery setup, et cetera. And they're collecting this at the edge and then sending that back into the hub where InfluxDB is running on their back end.So, if you've ever seen this deployed, that's InfluxDB running behind the scenes. Same goes for Nest—I'm sure many people have a Nest thermostat in their house. Nest monitors the infrastructure that actually powers that collection of IoT data. So, you think of this as InfluxDB running behind the scenes to monitor what infrastructure is standing up that back-end Nest service. And this includes their use of Kubernetes and other software infrastructure that's run in their platform for collecting, managing, transforming, and analyzing all of this aggregate device data that's out there.Another one, especially for those of us that streamed our minds out during the pandemic, is Disney+: entertainment, streaming, and delivery of that to applications and to devices in the home. And so, you know, this hugely popular Disney+ streaming service is essentially a global content delivery network for distributing all these, you know, movies and video series to all the users worldwide. And they monitor the movement and performance of that video content through this global CDN using InfluxDB. So, those are a few where you probably walk by something like this multiple times a week, or in the case of Disney+, probably watch once a day. And it's great to see InfluxDB kind of working behind the scenes there.Corey: It's one of those things where it's, I guess we'll call it plumbing, for lack of a better term. It's not the sort of thing that people are going to put front-and-center into any product or service that they wind up providing, you know, except for you folks. 
Instead, it's the thing that empowers a capability behind that product or service that is often taken for granted, just because until you understand the dizzying complexity, particularly at scale, of what these things have to do under the hood, it just—well yeah, of course, it works that way. Why shouldn't it? That's an expectation I have of the product because it's always had that. Yeah, but this is how it gets there.Brian: Our thesis really is that data is best understood through the lens of time. And as this data is expanding exponentially, time becomes increasingly the, kind of, common element, the common component that you're using to kind of view what happened. That could be what's running through a telecom network, what's happening with the devices that are connected to that network, the movement of data through that network and when, what's happening with subscribers and content pushing through a CDN on a streaming service, what's happening with climate and home data in hundreds of thousands, if not millions, of homes through a common device like a Nest thermostat. All of these things attach to some real-world collection of data, and as long as that's happening, there's going to be a place for time-series data and tools that are optimized to handle it.Corey: So, my last question—for real this time—we are recording this the week before re:Invent 2022. What do you hope to see, what do you expect to see, what do you fear to see?Brian: No fears. Even though it's Vegas, no fears.Corey: I do have the super-spreader event fear, but that's a separate—Brian: [laugh].Corey: That's a separate issue. Neither one of us are deep into the epidemiology weeds, to my understanding. But yeah, let's just bound this to tech, let's be clear.Brian: Yeah, so first of all, we're really excited to go there. We'll have a pretty big presence. We have a few different locations where you can meet us. We'll have a booth on the main show floor, and we'll be in the marketplace pavilion—as I mentioned, InfluxDB Cloud is offered across the marketplaces of each of the clouds, AWS, obviously in this case, but also in Azure and Google. But we'll be there in the AWS Marketplace pavilion, showcasing the new engine and a lot of the pretty exciting new use cases that we've been seeing.And we'll have our full team there, so if you're looking to kind of learn more about InfluxDB, or you've checked it out recently and want to understand kind of what the new capability is, we'll have many folks from our technical teams there, from our development team, some of our field folks like the SEs, and some of the product managers will be there as well. So, we'll have a pretty great collection of experts on InfluxDB to answer any questions and walk people through, you know, demonstrations and use cases.Corey: I look forward to it. I will be doing my traditional Wednesday afternoon tour through the expo halls and nature walk, so if you're listening to this and it's before Wednesday afternoon, come and find me. I am kicking off and ending at the [unintelligible 00:29:15] booth, but I will make it a point to come by the Influx booth and give you folks a hard time because that's what I do.Brian: We love it. Please. You know, being on the tour is—on the walking tour is excellent. We'll be mentally prepared. We'll have some comebacks ready for you.Corey: Therapists are standing by on both sides.Brian: Yes, exactly. Anyway, we're really looking forward to it. This will be my third year on your walking tour. 
So, the nature walk is one of my favorite parts of AWS re:Invent.Corey: Well, I appreciate that. Thank you. And thank you for your time today. I will let you get back to your no doubt frenzied preparations. At least they are on my side.Brian: We will. Thanks so much for having me and really excited to do it.Corey: Brian Mullen, CMO at InfluxData, I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that you naively believe will be stored as a TXT record in a DNS server somewhere rather than what is almost certainly a time-series database.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

AWS Morning Brief
Pre:Invent Edition


Nov 28, 2022 • 7:23


Links:
- Tiered storage for MSK
- Lambda telemetry API
- Resource Explorer launched
- GP3 comes to RDS
- Amazon Time Sync is now available as a public NTP service
- Zurich Region
- Spain Region
- Hyderabad Region
- Faster Glacier restores
- Multiple MFA devices
- Finch
- AWS Fault Isolation Boundaries whitepaper
Stay Up To Date with re:Quinnvent:
- Sign up for the re:Quinnvent Newsletter
- Check out the re:Quinnvent playlist on YouTube
Help the show:
- Share your feedback
- Subscribe wherever you get your podcasts
- Buy our merch
What's Corey up to?
- Follow Corey on Twitter (@quinnypig)
- See our recent work at the Duckbill Group
- Apply to work with Corey and the Duckbill Group to help lower your AWS bill

Screaming in the Cloud
Couchbase and the Evolving World of Databases with Perry Krug


Nov 28, 2022 • 34:21


About Perry
Perry Krug currently leads the Shared Services team, which is focused on building tools and managing infrastructure and data to increase the productivity of Couchbase's Sales and Field organisations. Perry has been with Couchbase for over 12 years and has served in many customer-facing technical roles, helping hundreds of customers understand, deploy, and maintain Couchbase's NoSQL database technology. He has been working with high-performance caching and database systems for over 15 years.Links Referenced: Couchbase: https://www.couchbase.com/ Perry's LinkedIn: https://www.linkedin.com/in/perrykrug/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is brought to us by our friends at Pinecone. They believe that all anyone really wants is to be understood, and that includes your users. AI models combined with the Pinecone vector database let your applications understand and act on what your users want… without making them spell it out. Make your search application find results by meaning instead of just keywords, your personalization system make picks based on relevance instead of just tags, and your security applications match threats by resemblance instead of just regular expressions. Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable. Thanks to my friends at Pinecone for sponsoring this episode. Visit Pinecone.io to understand more.Corey: InfluxDB is the smart data platform for time series. It's built from the ground-up to handle the massive volumes and countless sources of time-stamped data produced by sensors, applications, and systems. You probably think of these as logs.InfluxDB is programmable and performant, has a common API across the platform, and handles high-granularity data—at scale and with high availability. Use InfluxDB to build real-time applications for analytics, IoT, and cloud-native services, all in less time and with less code. So go ahead—turn your apps up to 11 and start your journey to Awesome for free at InfluxData.com/screaminginthecloudCorey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's episode is a promoted guest episode brought to us by our friends at Couchbase. Now, I want to start off by saying that this week is AWS re:Invent. And there is Last Week in AWS swag available at their booth. More on that to come throughout the next half hour or so of conversation. But let's get right into it. My guest today is Perry Krug, Director of Shared Services over at Couchbase. Perry, thanks for joining me.Perry: Hey, Corey, thank you so much for having me. It's a pleasure.Corey: So, we're recording this before re:Invent, so the fact that we both have, you know, personality and haven't lost our voices yet should probably be a bit of a giveaway on this. But I want to start at the very beginning because unlike people who are academically successful, I tend to suck at doing the homework, across the board. Couchbase has been around for a long time. We've seen the company do a bunch of different things, most importantly and notably, sponsoring my ridiculous nonsense for which I thank you. But let's start at the beginning. 
What is Couchbase?Perry: Yeah, you're very welcome, Corey. And again, it's a pleasure to be here. So, Couchbase is an enterprise database company at the very top level. We make database software and we distribute that to our customers. We have two flavors, two ways of getting your hands on it.One is the kind of legacy, what we call self-managed, where you, the user, the customer, downloads the software, installs it yourself, sets it up, manages the cluster, monitoring, scaling, all of that. And that's, you know, a big part of our business. Over the last few years we've identified—and certainly others in the industry have as well—the desire for users to access database and other technology in a hosted, Software-as-a-Service, pay-as-you-go, cloud-native, buzzword, et cetera, et cetera, vehicle. And so, we've released Couchbase Capella, which is our fully managed, fully hosted database-as-a-service, running in—currently—Amazon and Google, soon to be Azure as well. And it wraps and extends our core Couchbase Server product into a, as I mentioned, hosted and managed platform that our users can now come to and consume as developers and build their applications while leaving all of the operational and administration—monitoring, managing failover, expansion, all of that—to us as the experts.Corey: So, you folks are a non-relational database, NoSQL in the common parlance, which is odd because they call it NoSQL, yet they keep making more of them, so I feel like that's sort of the Hollywood model where, okay, that was so good, we're going to do it again. Where did NoSQL come from? Because back when I was learning databases, when dinosaurs roamed the earth, it was all about relational models, like we're going to use a relational database because when the only tool you have is an axe, every problem looks like hours of fun. What gave rise to this, I guess, Cambrian explosion that we've seen of NoSQL options that proliferate o'er the land?Perry: Yeah, a really, really good question, and I like the axe-throwing metaphor. So sure, 20, 30, 40 years ago now, as digital applications needed a place to store their data, the world invented relational databases. And those were used and continue to be used very well for what they were designed for, for data that follows a very strict structure that doesn't need to be served at significant scale, does not need to be replicated geographically, does not need to handle data coming in from different sources and those sources changing their formats of things all the time. And so, I'm probably as old as you are and been around when the dinosaurs were there. We remember this term called ‘Web 2.0.' Kids, you're going to have to go look that up in the dictionary or TikTok it or something.But Web 2.0 really was the turning point when websites became web applications. And suddenly, there was the introduction of MySpace and Facebook and Amazon and Google and LinkedIn, and a number of others, and they realized that relational databases were not going to meet their needs, whether it be performance, whether it be flexibility, whether it be changing of data models, whether it be introducing new features at a rapid pace. They tried; they stretched them, they added a bunch of different databases together, and it really was not going to be a viable solution. 
So, 10, maybe 15, years ago, you started to see the rise of these tech giants—although we didn't call them tech giants back then, but they were the precursors to today's—invent their own new databases.So, Amazon had theirs, Google had theirs, LinkedIn, and a number of others. These companies had reached a level of scale and reached a level of user base, had reached a level of data requirement, had reached a level of expectation with their customers. These customers, us, the users, us consumers, we expect things to be fast, we expect them to be always available. We expect Facebook to give us our news feed in milliseconds. We expect Google to give us our website or our search results immediately, with more and more information coming along with them.And so, it was these companies that hit those requirements first. The only solution for them was to start from scratch and rewrite their own databases. Fast forward five, six, seven years, and we as consumers turned around and said, "Look, I really like the way Facebook does things. I really like the way Google does things. I really like the way Amazon does things. Bank of America, can you do the same? IRS, can you do the same? Health care vendor number one, two, three, and four, government body, can you all give me the same experience? I want my taxi to tell me exactly where it's going to take me from one place to another, I want it to give me a receipt immediately after I finish my ride. Actually, I want to be able to change my payment method after I paid for that ride because I used the wrong one."All of these are expectations that we as consumers have taken from the tech giants—Apple, LinkedIn, Facebook—and turned around to nearly every other service that we interact with on a daily basis. And all of a sudden, the requirements that Facebook had, that Google had, that no other company had, you know, outside of the top five, suddenly were needed by every single industry, nearly every single company, in order to be competitive in their markets.Corey: And there's no way to scale relational to get to a point where it can wind up handling those types of workloads efficiently?Perry: Correct, correct. And it's not just that the technology cannot do it—everything is technically feasible—but the cost, both financially and time-to-market-wise, in order to do that in a relational database was untenable. It either cost too much money, or it cost too much developer time, or cost too much of everybody's time to try to shoehorn something into it. And then you have the rise of cloud and containers, which relational databases, you know, never even had the inkling of a thought that they might need to be able to handle someday. And so, these requirements that consumers have placed on everything else that they interact with really led to the rise of NoSQL as a commodity or as a database for the masses.LinkedIn is not in the business of developing a database and then selling it to everybody else to use as a database, right? They built it for themselves, they made their service better. And so, what you see is some of those founding fathers created databases, but then had no desire to sell them to others. 
And then after that followed the rise of companies like Couchbase and a number of others who said, "Look, we think we can provide those capabilities, we think we can meet those requirements for everybody." And thereby rose the plethora of NoSQL databases because everybody had a little bit of a different approach to it.If you ask ten people what NoSQL is about, you're going to get eleven or twelve different answers. But you can kind of distill that into two categories. One is performance and operations. So, I need it to be faster, I need it to be scalable, I need it to be replicated geographically. And that's what NoSQL is to me. And that's the right answer.And so, you have things like Cassandra and Redis that are meant to be fast and scalable and replicated. You ask another group and they're going to tell you, "No, no, no. NoSQL needs to be flexible. I need to get rid of the rigid database schemas, I need to bring JSON or other data formats in and munge all this data together and create something cool and new out of it." And thereby you have the rise of things like MongoDB, who focused nearly exclusively on the developer experience of working with data.And for a long time, those two were in opposite camps, where you have the databases that did performance and the databases that did flexibility. I'm not here to say that Couchbase is the ultimate kitchen sink for everything, but we've certainly tried to approach both of those challenges together so that you can have something that scales and performs and can be flexible enough in data model. And everybody else is trying to do the same thing, right? But all these databases are competing for that same nirvana of the best of both worlds.Corey: And it almost feels like there's a convergence play in place where everything now is trying to go away from the idea of, "Oh, yeah, we started off as a purpose-built database, but you can use this for everything." And I don't necessarily know that is going to be the path that a lot of companies want to go down. Where do you view Couchbase as, I guess, falling down? In other words, what workloads is Couchbase inappropriate for?Perry: Yeah, that's a good question. And my [crosstalk 00:10:35]—Corey: Anyone who can't answer that one is a zealot, and that's one of those "okay, let's be very careful and not take our eyes off you for one second" situations, while smiling and backing away slowly.Perry: Let's cut to commercial. No, I mean, there certainly are workloads that, you know, in the past, we've not been good for that we've made improvements to address. There are workloads that we do not address well today that we will try to address in the future, and there are workloads that we may never see as fitting in our wheelhouse. The biggest category that comes to mind is that Couchbase is not an archival database. We are not meant to have data put in us that you don't care about, that you just need to keep around but don't ever need to access.And there are systems that do that well, and they do it at a solid total cost of ownership. And Couchbase is meant for operational data. It's meant for data that needs to be interacted with, read and/or written, at scale and at reasonable performance to serve a user-facing or system-facing application. And we call ourselves a general-purpose database. Mongo and others call themselves that as well. 
Oracle calls itself a general-purpose database, and yet, not everybody uses Oracle for everything.So, there are reasons that you—Corey: Who could afford that?Perry: Who could? Exactly. It comes down to cost, ultimately. So, I'm not here to say that Couchbase does everything. We like to think we're trying to target and strive towards 80%, right? If we can do 80% of an application or an organization's workloads, there is certainly room for 20% of other workloads, other applications, other requirements that can be met or need to be met by purpose-built databases.But if you rewind four or five years, there was this big push towards polyglot persistence. It's a buzzword that came and kind of has gone out of fashion, but it presented the idea that everybody is going to use 15 different databases and everybody is going to pick the right one for exactly the workload and they're going to somehow stitch them all together. And that really hasn't come to fruition either. So, I think there's some balance, where it's not one to rule them all, but it's also not 15 for every company. Some organizations just have a set of requirements that they want to be met and our database can do that.Corey: Let's continue our tour of the competitive landscape here now that we've handled the relational side of the world. The best database, as anyone who's listened to this show knows, is of course, Amazon's Route 53 TXT records stuffed into DNS, especially in NoSQL land. Clearly, you're all fighting for second place after that. How do you stack up against the idea of legitimately using that approach? And for those who are not in on the joke, please don't do this. It is not the right answer. But I'm curious to get your take as to why DNS TXT records are an inappropriate NoSQL option.Perry: Well, it's a joke, right? And let's be clear about that. But—Corey: I have to say that because otherwise, someone tries it in production. I've gotten that wrong a few times, historically, so now I put a disclaimer in because yeah, it's only funny so long as people are in on the joke. If not, and I lead someone down the primrose path to disaster, I feel bad. So, let's be very clear. We're kidding.Perry: And I'm laughing. I'm laughing here behind the camera. I am. I am.Corey: Yeah.Perry: So, the element of truth that I think Couchbase is in a position, or I'm in a position to kind of talk about is, 12 years ago, when Couchbase started, we were a key-value database and that's where we saw the best part of the market in those days, and where we were able to achieve the best scale and replication and performance, and fairly quickly realized that simple key-value, though extremely valuable and easy to manage, was not broad enough in the requirements it could meet. And that's where we set our sights on and identified the larger, kind of, document database group, which is really just a derivative of key-value, where still everything is a key and a value; it's just now a document that you can reason about, that you can create an index on, that you can query, that you can run full-text search on, you can do much more with the data. So, at our core, we are still a key-value database. When that value is JSON, we become a document database. And so, if Route 53 decided that they wanted to enter into the document database market, they would need to be adding things that allowed you to introspect and ask questions of the data within that text, which you can't, right?Corey: Well, not with that attitude. 
Corey: Well, not with that attitude. But yeah, I agree with you.
Perry: [laugh].
Corey: Moving up the stack, let's talk about a much more fearsome competitor here that I'm certain you see in an awful lot of deals that you wind up closing: specifically, your own open-source product. You historically have wound up selling software into environments, what I believe you've referred to as your legacy offering, where it's the hosted version of your commercial software. And now, of course, you also have Capella, your cloud-hosted version. But open-source looks surprisingly compelling for an awful lot of use cases and an awful lot of folks. What's the distinction?
Perry: Sure. Just to correct the distinction a little bit: we have Couchbase Server, which we provide as what we call self-managed, where you can download it and install it yourself. Now, you could do that with the open-source version or you could do that with our Enterprise Edition. What we've then done is wrapped that Enterprise Edition in a hosted model, and that's Capella. So, the open-source version is something we've long been supporters of; it's been a core part of our go-to-market for the last 12 or 13 years or so, and we still see it as a strong offering for organizations that don't need the added features, the added capabilities, don't need the support of the experts that wrote the software behind them.
Certainly, we contribute to and support our community through our forums and Discord and other channels, but that's very different from two o'clock in the morning, something's not working, and I need a ticket to track. We don't do that for our Community Edition. So, we see lots of users downloading that, picking it up, building it into their applications—especially applications that are in their infancy, or with organizations that simply can't afford the added cost and therefore don't get the added benefit. We're not here to gouge and carve out every dollar that we can, but if you need the benefit that we can provide, we think there's value in that, and that's the business we're trying to run.
Corey: Oh, absolutely. It doesn't work when you're trying to charge a license fee for something that someone is doing as a spare-time project for funsies, just to learn the technology. It's like, and then you show up: “That'll be $700. Surprise.”
Yeah, that's sort of the AWS billing model approach, where—it's not a viable onramp for most folks. So, the open-source direction down there makes sense. Counterpoint: if you're running a bank on top of it, “Well, we're running it ourselves and really hoping for the best. I mean, we have access to the code and all.” Great, but there are times you absolutely want some of the best minds in the world, with respect to that particular product, able to help troubleshoot so the ATMs start working again before people riot in the streets.
Perry: Yeah, yeah. And ultimately, it's a question of core competencies. Are you an organization that wants to be in the database development market? Great, by all means, we'd love to support you in that.
If you want to focus on doing what you do best, be it a bank or an e-commerce website, you worry about your application, you let us worry about the database, and everybody gets along very well.
Corey: There's definitely something to be said for outsourcing some of the pain, some of the challenge, around an awful lot of it.
Perry: There's a natural progression to the cloud for that, and to Software-as-a-Service and database-as-a-service, where you're now outsourcing even more by running on our hosting platform. No longer do you have to download the binary and install it yourself, no longer do you have to set up the cluster and watch it in case it has a blip or a statistic goes up too far. We're taking care of that for you. So yes, you're paying for that service, but you're getting the value of not having to be a database manager, let alone a database developer, for them.
Corey: Love how serverless helps you scale big and ship fast, but hate debugging your serverless apps? With Lumigo's serverless observability, it's fast and easy (and maybe a little fun, too). End-to-end distributed tracing gives developers full clarity into their most complex serverless and containerized applications, connecting every service from AWS Lambda and Amazon ECS to DynamoDB, API Gateways, Step Functions and more. Try Lumigo free and debug 3x faster, reduce error rate and speed up development. Visit snark.cloud/lumigo. That's snark.cloud/L-U-M-I-G-O.
Corey: What is the point of distinction between Couchbase Server and Couchbase Capella? To be clear, your self-hosted versus managed cloud offerings. When is one appropriate versus the other?
Perry: Well, I'm supposed to say that Capella is always the appropriate choice, but there are currently a number of situations where Capella is not available in particular regions or cloud providers, and so downloading and running the software yourself, certainly in your own—yes, there are people who still run their own data centers. I know it's taboo and we don't like to talk about that, but there are people who have on-premises environments, and so Couchbase Capella is not available for them. But Couchbase Server is the original Couchbase database and it is the core of Couchbase Capella. So, ‘wrapping' doesn't give it enough credit; we use Couchbase Server to power Couchbase Capella.
And so, there's an enormous amount of value added around the core database, but ultimately, it's what's behind the scenes of Couchbase Capella. Which I think is a nice benefit, in that when an application connects to either one, it gets the same experience. You can point an application at one versus the other, and because it's the same database running behind the scenes, the behavior, the data model, the query language, the APIs are all the same. So, it adds a nice level of flexibility for customers that are either moving from one to another or have to have some sort of hybrid approach, which we see in the market today.
Corey: Let's talk economics for a second. I can see scenarios where, especially in a high-volume environment where you're sending tremendous amounts of data back and forth, as soon as it crosses an availability zone boundary or a region boundary, or God forbid, goes out to the internet via standard egress fees over in AWS-land, there's a radically different economic model that comes into play, as opposed to having something in the same availability zone, in the same subnet, where all traffic back and forth is free.
Do you see that in your customer base—that that is a model driving people towards self-hosting?
Perry: No. And I'd say no because Capella allows you to peer and run your application in the same availability zone as the database. And so, as long as that's an option for you—that we have our offering in the right region, in the right AZ, and you can put your application there—then that's not an issue. We did have a customer not too long ago that didn't set that up correctly—they thought they did—and we noticed some high data transfer charges. Again, the benefit of running a hosted service: we detected that for them and were able to turn around and say, “Hmm, you might want to change this to over there so that we all save some money in doing so.”
If we were not there watching it, they might not have noticed that themselves if they were running it self-managed; they might not have known what to do about it. And so, there's a benefit to working with us and using that hosted platform: we can keep an eye out, and we can apply all of our learnings and best practices and bug fixes, and we give that to everybody, rather than each person having to stumble across those hurdles themselves.
Corey: That's one of those fun, weird corner-case trivia things about AWS data transfer. When you're transferring data within the same region between availability zones, it costs a penny on the sending side and a penny on the receiving side. Everything else is one side or the other that winds up getting the charge. And what makes this especially fun is that when it shows up on your bill, if you transfer a petabyte, it shows as cross-AZ data transfer: two petabytes.
Perry: Two. Yeah.
Corey: So, it double-counts so they can bill for it appropriately, but it leads to some really weird hunting-it-down exercises, like, “Okay, well, we found half of it, but where's the other half hiding?” It's always obnoxious to trace this stuff down. The fact that you see it on your bill, well, that's testament to the fact that yeah, they're using the service. Good for them and good for you. Being able to track it down on a per-customer basis, though—that does speak to your level of insight into what exactly is going on in your environment and where. As someone who does this for a living, let me confirm that is absolutely non-trivial.
Perry: No, definitely not trivial. And you know, over the last four or five years, we've learned an enormous amount about how cloud providers work, how AWS works—but guess what, Google does it completely differently. And Azure does it—
Corey: Yep.
Perry: —completely differently. And so, on the surface level, they're all just cloud providers and they give you a VM and you put some stuff on it, but integrating with the APIs, integrating with the different systems and naming of things, and then understanding the intricacies of the ins and outs—and yeah, these cloud providers have their own bugs as well, and sometimes you stumble across those for them. It's been a significant learning exercise that I think we're all better off for, having Couchbase gone through it for you.
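To put numbers on that double-count, a quick back-of-the-envelope sketch; the $0.01/GB per-direction rate is the commonly cited intra-region figure and worth verifying against current AWS pricing before relying on it:

```python
# Rough arithmetic for the cross-AZ double-count described above.
# $0.01/GB is charged on the sending side and $0.01/GB on the receiving
# side, so one petabyte moved between AZs shows up on the bill as two.
RATE_PER_GB = 0.01        # USD per direction (assumed; verify current pricing)
GB_PER_PB = 1024 ** 2     # 1,048,576 GB per PB; AWS unit conventions vary

transferred_pb = 1
billed_gb = transferred_pb * GB_PER_PB * 2   # sender + receiver line items
cost = billed_gb * RATE_PER_GB
print(f"{transferred_pb} PB cross-AZ -> billed as {billed_gb:,} GB -> ${cost:,.2f}")
# 1 PB cross-AZ -> billed as 2,097,152 GB -> $20,971.52
```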
Corey: Let's get this a little bit more germane for this week, for those of you who are listening to this during re:Invent. You folks are clearly here at the show—it's funny to talk about ‘here,' even though when we're recording this, it is not near here; we're actually home and enjoying ourselves, but welcome to temporal dislocation; here we are—here at the show, you folks are, among other things, being kind enough to pass out the Last Week in AWS swag from your booth, which, thank you. So, that is obviously the primary reason that you're at the show. What are the other reasons? What are the secondary reasons that you decided to come here?
Perry: Yeah [laugh]. Well, I guess I have to think about this now, since you already called out the primary reason.
Corey: Exactly. Wait, we can have more than one reason for things? My God.
Perry: Can we? Can we? AWS has long been a huge partner of ours, even before Capella itself was released. I remember sometime, you know, five years or so ago, some 30% of our customers were running Couchbase inside of AWS, and some of our largest were some of your largest at times, like Viber, the messaging platform. And so, we've always had a very strong relationship with AWS, and the better we can present ourselves to your customers, and the more our customers can feel that we are jointly supporting them, the better. And so, you know, coming to re:Invent is a testament to that long-standing and very solid partnership, and also it's meant to get more exposure for us, to make it clear that Couchbase runs very well on AWS.
Corey: It's one of those areas where when someone says, “Oh yeah, this is a great service offering, but it doesn't run super well on AWS,” it's like, “Okay, so are you bad at computers, or is what you have built so broken and Byzantine that it has to live somewhere else?” Or occasionally, the use case is absolutely not supported by AWS. Not to beat them up some more on their egress fees, but I'm absolutely about to: if you're building a video streaming site, you don't want it living in AWS. It won't run super well there. Well, it'll run well; it'll just run extortionately expensively, and that means that it's a non-starter.
Perry: Yeah, why do you think Netflix raises their fees?
Corey: Netflix, to their credit, has been really rather public about this, where they do all of their egress via their Open Connect custom-built CDN appliances that they drop all over the place. They don't stream a single byte from AWS, and we know this from the outside because they are clearly still solvent.
Perry: [laugh].
Corey: I've done the math on that. If I had been streaming at on-demand egress prices for one month of my Netflix usage, I would have wound up spending four times my subscription fee just on their raw costs for data transfer. And I have it on good authority that data transfer is not the only bill in the entire company; they also have to pay for people and content and the analytics engine and whatnot. And it's kind of a weird, strange world.
Perry: Real estate.
Corey: Yeah. Because it's one of those strange stories: they are absolutely a showcase customer for AWS. They've been a marquee customer trotted out year after year to talk about what they're doing. But if you attempt to replicate their business purely on top of AWS, it will not work. Full stop. The economics preclude that happening.
What is your philosophy these days on what historically has felt like an existential threat to most vendors that I've spoken to in a variety of ways: what if Amazon decides to enter your market? I'd ask you the same thing.
Do you have fears that they're going to wind up effectively taking your open-source offering and turning it into Amazon Basics Couchbase, for lack of a better term? Is that something that is on your threat radar, or is that not really something you concern yourselves about?
Perry: So, I mean, there's no arguing, there's no illusion that Amazon and Google and Microsoft are significant competitors in the database space, along with Oracle and IBM and Mongo and a handful of others.
Corey: Anything's a database if you hold it wrong.
Perry: This is true. This specific point about open-source is something that we have addressed in the same way that others have addressed it: by choosing and changing our license model so that it precludes cloud providers from using the open-source database to produce their own service on the back of it. Let me be clear: it does not impact our existing open-source users or anybody that wants to use the Community Edition or download the software, the source code, and build it themselves. It's only targeted at Amazon, because they have a track record of doing that to things like Elastic and Redis and Mongo, all of whom have made moves similar to Couchbase's to prevent that through the licensing of the open-source code.
Corey: So, one of the things I do see at re:Invent every year is—and I believe wholeheartedly this comes historically from a lot of AWS's requirements for vendors on the show floor that have become public through a variety of different ways—where, for a long time, you were not allowed to mention multi-cloud or reference the fact that you work on any other cloud provider there. So, there's been a theme of: this is why, for whatever it is we sell or claim to sell or hope one day to sell, AWS is the absolute best place for you to run it, full stop. And in some cases, that's absolutely true, because people build primarily for a certain cloud provider and then, when they find customers in other places, they learn to run it over there too. If I'm approaching this from the perspective of I have a database problem—because looking at my philosophy on databases, it's hard to imagine I don't have database problems—then is my experience going to be better or even materially different between any of the cloud providers if I become a Couchbase Capella customer?
Perry: I'd like to say no. We've done our best to abstract and to leverage the best of all of the cloud providers underneath, to provide Couchbase in the best form that they will allow us to. And as far as I can see, there's no difference amongst them. Your application and what you do with the data may be better suited to one provider or another, but it's always been Couchbase's philosophy—sort of, say, strategy—to make our software available wherever our customers and users want to consume it. And that goes everywhere from physical hardware running in a data center, to virtual machines on top of that, containers, cloud, different cloud providers, different regions, different availability zones, all the way through to edge and other infrastructures. We're not in a position to say, “If you want Couchbase, you should use AWS.” We're in a position to say, “If you are using AWS, you can have Couchbase.”
Corey: I really want to thank you for being so generous with your time, and of course, your sponsorship dollars, which are deeply appreciated. Once again, swag is available at the Couchbase booth this week at re:Invent.
If people want to learn more, and if for some unfathomable reason they're not at re:Invent—probably because they make good life choices—where can they go to find you?
Perry: couchbase.com. That'll be the best place to land. That takes you to our documentation, our resources, our getting-help and contact pages, and directly into Capella if you want to sign in or log in. I would go there.
Corey: And we will, of course, put links to that in the show notes. Thank you so much for your time. I really appreciate it.
Perry: Corey, it's been a pleasure. Thank you for your questions and banter, and I really appreciate the opportunity to come and share some time with you.
Corey: We'll have to have you back in the near future. Perry Krug, Director of Shared Services at Couchbase. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment berating me for being nowhere near musical enough when referencing [singing] Couchbase Capella.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.

AWS Morning Brief
The Feudal Lords of Amazon: AWS' Infinite Service Launches and Counterproductive Culture

AWS Morning Brief

Play Episode Listen Later Nov 23, 2022 8:51


Want to give your ears a break and read this as an article? You're looking for this link: https://www.lastweekinaws.com/blog/the-feudal-lords-of-amazon/
Want to watch the full dramatic reenactment of this podcast? Watch the YouTube video here: https://youtu.be/g1guW6tiR50
Never miss an episode: join the Last Week in AWS newsletter, or subscribe wherever you get your podcasts.
Help the show: leave a review, share your feedback, or buy our merch at https://store.lastweekinaws.com
What's Corey up to? Follow Corey on Twitter (@quinnypig), see our recent work at the Duckbill Group, or apply to work with Corey and the Duckbill Group to help lower your AWS bill.

Screaming in the Cloud
The Art and Science of Database Innovation with Andi Gutmans

Screaming in the Cloud

Play Episode Listen Later Nov 23, 2022 37:07


About Andi
Andi Gutmans is the General Manager and Vice President for Databases at Google. Andi's focus is on building, managing and scaling the most innovative database services to deliver the industry's leading data platform for businesses. Before joining Google, Andi was VP Analytics at AWS running services such as Amazon Redshift. Before his tenure at AWS, Andi served as CEO and co-founder of Zend Technologies, the commercial backer of open-source PHP.
Andi has over 20 years of experience as an open-source contributor and leader. He co-authored open-source PHP. He is an emeritus member of the Apache Software Foundation and served on the Eclipse Foundation's board of directors. He holds a bachelor's degree in Computer Science from the Technion, Israel Institute of Technology.
Links Referenced: LinkedIn: https://www.linkedin.com/in/andigutmans/ Twitter: https://twitter.com/andigutmans
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig secures your cloud from source to run. They believe, as do I, that DevOps and security are inextricably linked. If you wanna learn more about how they view this, check out their blog; it's definitely worth the read. To learn more about how they are absolutely getting it right from where I sit, visit Sysdig.com and tell them that I sent you. That's S Y S D I G.com. And my thanks to them for their continued support of this ridiculous nonsense.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is brought to us by our friends at Google Cloud, and in so doing, they have gotten a guest to appear on this show that I have been low-key trying to get here for a number of years. Andi Gutmans is VP and GM of Databases at Google Cloud. Andi, thank you for joining me.
Andi: Corey, thanks so much for having me.
Corey: I have to begin with the obvious. Given that one of my personal passion projects is misusing every cloud service I possibly can as a database, where do you start and where do you stop as far as saying, “Yes, that's a database,” so it rolls up to me, and, “No, that's not a database, so someone else can deal with the nonsense?”
Andi: I'm in charge of the operational databases, so that includes both the managed third-party databases, such as MySQL, Postgres, SQL Server, and then also the cloud-first databases, such as Spanner, Big Table, Firestore, and AlloyDB. So, I suggest that's where you start, because those are all awesome services. And then what doesn't fall underneath, kind of, that purview are things like BigQuery, which is an analytics, you know, data warehouse, and other analytics engines. And of course, there's always folks who bring in their favorite—maybe lesser-known or less popular—database and self-manage it on GCE, on Compute.
Corey: Before you wound up at Google Cloud, you spent roughly four years at AWS as VP of Analytics, which is, again, one of those very hazy types of things. Where does it start? Where does it stop? It's not at all clear from the outside.
But even before that, you were, I guess, something of a legendary figure, which I know is always a weird thing for people to hear. You were at least partially responsible for the Zend Framework in the PHP world—and I didn't realize what the heck that was, despite supporting it in production at a couple of jobs, until after I, for better or worse, was no longer trusted to support production environments anymore. Which, honestly, if you can get out, I'm a big proponent of doing that. You sleep so much better without a pager. How did you go from programming languages all the way on over to databases? It just seems like a very odd mix.
Andi: Yeah. No, that's a great question. So, I was one of the core developers of PHP, and you know, I had been in the PHP community for quite some time. I also helped ideate the Zend Framework. And Zend Technologies, the company that, you know, I co-founded, was kind of the company behind PHP.
So, like Red Hat supports Linux commercially, we supported PHP. And I was very much focused on developers, programming languages, frameworks, IDEs, and that was, you know, really exciting. I had also done quite a bit of work on interoperability with databases, right, because behind every application there's a database, and so a lot of what we focused on was great connectivity to MySQL, to Postgres, to other databases, and I got to kind of learn the database world from the outside, from the application builders. We sold our company in, I think it was 2015, and so I had to kind of figure out what's next. And so, one option would have been, hey, stay in programming languages, but what I learned over the many years that I worked with application developers is that there's a huge amount of value in data.
And frankly, I'm a very curious person; I always like to learn, so there was this opportunity to join Amazon, to join the non-relational database side, and take myself completely out of my comfort zone. And actually, I joined AWS to help build the graph database Amazon Neptune, which was even more out of my comfort zone than probably even a relational database. So, I kind of like to do different things, and so I joined and I had to learn, you know, how to build a database pretty much from the ground up. I mean, of course, I didn't do the coding, but I had to learn enough to be dangerous, and so I worked on a bunch of non-relational databases there such as, you know, Neptune, Redis, Elasticsearch, DynamoDB Accelerator. And then there was the opportunity for me to actually move over from non-relational databases to analytics, which was another way to get myself out of my comfort zone.
And so, I moved to run the analytics space, which included services like Redshift, like EMR, Athena, you name it. So, that was just a great experience for me where I got to work with a lot of awesome people and learn a lot. And then the opportunity arose to join Google and actually run the Google transactional databases, including their relational databases. And by the way, I actually have two jobs. One job is running Spanner and Big Table for Google itself—meaning, you know, search, ads, and YouTube and everything run on these databases—and then the second job is actually running external-facing databases for external customers.
Corey: How alike are those two? Is it effectively the exact same thing, just with different API endpoints? Are they two completely separate universes?
It's always unclear from the outside, when looking at large companies that effectively eat versions of their own dog food, where their internal usage of these things starts and stops.
Andi: So, great question. So, Cloud Spanner and Cloud Big Table do actually use the internal Spanner and Big Table. So, at the core, it's exactly the same engine, the same runtime, same storage, and everything. However, you know, internally, the way we built the database APIs was kind of good for scrappy, you know, Google engineers—folks who are, you know, okay with learning how to fit into the Google ecosystem—but when we needed to make this work for enterprise customers, we needed cleaner APIs, we needed authentication that was external, right, and so on, so forth. So, think of it as we had to add an additional set of APIs on top of it, and management, right, to really make these engines accessible to the external world.
So, it's running the same engine under the hood, but it is a different set of APIs, and a big part of our focus is continuing to expose to enterprise customers all the goodness that we have in the internal system. So, it's really about taking these very, very unique, differentiated databases and democratizing access to them for anyone who wants it.
Corey: I'm curious to get your position on the idea that seems to be playing itself out—I guess, a battle that's been playing itself out in a number of different customer conversations. And that is, I guess, the theoretical decision between: do we go towards general-purpose databases and more or less treat every problem as a nail in search of a hammer, or do you decide that every workload gets its own custom database that aligns best with that particular workload? There are trade-offs in either direction, but I'm curious where you land on that, given that you tend to see a lot more of it than I do.
Andi: No, that's a great question. And you know, just for the viewers who maybe aren't aware, there's kind of two extreme points of view, right? There's one point of view that says: purpose-built for everything, like, every specific pattern gets a bespoke database—it's kind of a best-of-breed approach. The problem with that approach is it becomes extremely complex for customers, right? Extremely complex to decide what to use, and they might need to use multiple databases for the same application, and so that can be a bit daunting as a customer. And frankly, there's kind of a law of diminishing returns at some point.
Corey: Absolutely. I don't know what the DBA role of the future is, but I don't think anyone really wants it to be, “Oh, yeah. We're deciding which one of these three dozen managed database services is the exact right fit for each and every individual workload.” I mean, at some point it feels like certain cloud providers believe that not only should every workload have its own database, but almost every workload should have its own database service. At some point, you're allowed to say no and stop building these completely—what feel to me like Byzantine, esoteric—database engines that don't seem to have broad applicability to a whole lot of problems.
Andi: Exactly, exactly.
And maybe the other extreme is what folks often talk about as multi-model, where you say, like, “Hey, I'm going to have a single storage engine and then map onto that the relational model, the document model, the graph model, and so on.” I think what we tend to see is, if you go too generic, you also start having performance issues, and you may not be getting the right level of capabilities and trade-offs around consistency, and replication, and so on. So, I would say at Google, we're taking a very pragmatic approach, where we're saying, “You know what? We're not going to solve all of our customers' problems with a single database, but we're also not going to have two dozen.” Right?
So, we're basically saying, “Hey, let's understand the main characteristics of the workloads that our customers need to address, and build the best services around those.” You know, obviously, over time, we continue to enhance what we have to fit additional models. And then frankly, we have a really awesome partner ecosystem on Google Cloud, where if someone really wants a very specialized database, you know, we also have great partners that they can use on Google Cloud and get great support and, you know, get the rest of the benefits of the platform.
Corey: I'm very curious to get your take on a pattern that I've seen alluded to by basically every vendor out there, except the couple of very obvious ones for whom it does not serve their particular vested interests, which is that there's a recurring narrative that customers are demanding open-source databases for their workloads. And when you hear that—at least for people who came up the way that I did, spending entirely too much time on Freenode, back when that was not a deeply problematic statement in and of itself—it sounds like, yes, we're open-source, I guess, zealots is probably the best terminology, and yeah, businesses are demanding to participate in the open-source ecosystem. Here in reality, what I see is not ideological purity or anything like that, and much more to do with, “Yeah, we don't like having a single commercial vendor for our databases that basically plays the insert-quarter-to-continue dance whenever we're trying to wind up doing something new. We want the ability to not have licensing constraints around when, where, how, and how quickly we can run databases.” That's what I hear when customers are actually talking about open-source versus proprietary databases. Is that what you see, or do you think that plays out differently? Because let's be clear, you do have a number of database services that you offer that are not open-source, but are also absolutely not tied to weird licensing restrictions either.
Andi: That's a great question, and I think for years now, customers have been in a difficult spot, because the legacy proprietary database vendors, you know, knew how sticky the database is, and so as a result, you know, the prices often went up, and it was not easy for customers to kind of manage costs and agility and so on. But I would say that's always been somewhat of a concern. I think what I'm seeing changing and happening differently now is, as customers are moving into the cloud and they want to run hybrid cloud, they want to run multi-cloud, they need to prove to their regulator that they can do a stressed exit—right, open-source is not just about reducing cost, it's really about flexibility and kind of being in control of when and where you can run the workloads.
So, I think what we're really seeing now is a significant surge of customers who are trying to get off legacy proprietary databases and really kind of move to open APIs, right, because they need that freedom. And that freedom is far more important to them than even the cost element.
And what's really interesting is, you know, a lot of these are the decision-makers in these enterprises, not just the technical folks. Like, to your point, it's not just open-source advocates, right? It's really the business people who understand they need the flexibility. And by the way, even the regulators are asking them to show that they can flexibly move their workloads as they need to. So, we're seeing a huge interest there, and, as you said, like, some of our services, you know, are open-source-based services, and some of them are not.
Like, take Spanner as an example: it is heavily tied to how we build our infrastructure and how we build our systems. Like, I would say it's almost impossible to open-source Spanner. But what we've done is we've basically embraced open APIs and made sure that if a customer uses these systems, we're giving them control of when and where they want to run their workloads. So, for example, Big Table has an HBase API; Spanner now has a Postgres interface. So, our goal is really to give customers as much flexibility and also not lock them into Google Cloud. Like, we want them to be able to move out of Google Cloud so they have control of their destiny.
Corey: I'm curious to know what you see happening in the real world, because I can sit here and come up with a bunch of very well-thought-out logical reasons to go towards or away from certain patterns, but I spent years building things myself. I know how it works: you grab the closest thing handy and throw it in, and we all know that there is nothing so permanent as a temporary fix. Like, that thing is load-bearing and you'll retire with that thing still in place. In the idealized world, I don't think that I would want to take a dependency on something like—easy example—Spanner or AlloyDB, because despite the fact that they have Postgres-squeal—yes, that's how I pronounce it—compatibility, the capabilities of what they're able to do under the hood far exceed and outstrip whatever you're going to be able to build yourself or get anywhere else. So, there's a dataflow architectural dependency lock-in, despite the fact that it is, at least on its face, Postgres-compatible. Counterpoint: does that actually matter to customers, in what you are seeing?
Andi: I think it's a great question. I'll give you a couple of data points. I mean, first of all, even if you take a completely open-source product, right, running it in different clouds, different on-premises environments, and so on, fundamentally, you will have some differences in performance characteristics, availability characteristics, and so on. So, the truth is, even if you use open-source, right, you're not going to get a hundred percent of the same characteristics wherever you run it. But that said, you still have the freedom of movement, and with, I would say, not a huge amount of engineering investment, right, you can make sure you can run that workload elsewhere.
I kind of think of Spanner in a similar way, where yes, I mean, you're going to get all those benefits of Spanner that you can't get anywhere else, like unlimited scale, global consistency, right, no maintenance downtime, five-nines availability—like, you can't really get that anywhere else.
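As a concrete illustration of that open-API posture, here's a hedged sketch of what talking to Spanner's PostgreSQL interface can look like with a stock Postgres driver. It assumes Google's PGAdapter proxy is running locally and forwarding to a PostgreSQL-dialect Spanner database; the host, port, database, and table names are all placeholders, and the exact connection ergonomics depend on your PGAdapter setup:

```python
import psycopg2  # a stock Postgres driver; nothing Spanner-specific on the client

# Assumes PGAdapter is listening on localhost:5432 and forwarding to a
# PostgreSQL-dialect Spanner database. Every name here is a placeholder.
conn = psycopg2.connect(host="localhost", port=5432, dbname="example-db")
conn.autocommit = True

with conn.cursor() as cur:
    # Ordinary SQL over the Postgres wire protocol; the table is hypothetical.
    cur.execute("SELECT singer_id, full_name FROM singers LIMIT 5")
    for singer_id, full_name in cur.fetchall():
        print(singer_id, full_name)

conn.close()
```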
That said, not every application necessarily needs it. And you still have that option, right: if you need to, or want to—or we're not giving you a reasonable price or reasonable price-performance, or we're starting to neglect you as a customer, which of course we wouldn't, but let's just say hypothetically that could happen—you still have a way to basically go and run this elsewhere. Now, I'd also want to talk about some of the upsides something like Spanner gives you. Because you talked about how you want to be able to just grab a few things, build something quickly, and then, you know, you don't want to be stuck.
The counterpoint to that is, with Spanner, you can start really, really small. And then let's say you're a gaming studio, you know, you're building ten titles hoping that one of them is going to take off. So, you can build ten of those, you know, with very minimal spend on Spanner, and if one takes off overnight, it's really only the database where you don't have to go and re-architect the application; it's going to scale as big as you need it to. And so, it does enable a lot of this innovation and a lot of cost management as you try to get to that overnight success.
Corey: Yeah, overnight success. I always love that approach. It's one of those, “Yeah, I became an overnight success after only ten short years.” It becomes this idea people believe happens in fits and starts, but then you see, I guess, on some level, the other side of it, where it's a lot of showing up and doing the work. I have to confess, I didn't do a whole lot of admin work in my production years that touched databases, because I have an aura and I'm unlucky, and it turns out that when you blow away some web servers, everyone can laugh and we'll reprovision stateless things.
Get too close to the data warehouse, for example, and you don't really have a company left anymore. And of course, in the world of finance that I came out of, transactional integrity is also very much a thing. A question that I had [centers 00:17:51] really around one of the predictions you gave recently at Google Cloud Next, which is that your prediction for the future is that transactional and analytical workloads will converge from a database perspective. What's that based on?
Andi: You know, I think we're really moving from a world where customers are trying to make real-time decisions, right? If there's model drift, from an AI and ML perspective, they want to be able to retrain their models as quickly as possible. So, everything is fast moving into streaming. And I think what you're starting to see is, you know, customers don't have that time to wait for analyzing their transactional data. Like, in the past, you'd do a batch job, you know, once a day or once an hour, to move the data from your transactional system to your analytical system, but that's just not how always-on businesses run anymore, and they want to have those real-time insights.
So, I do think that what you're going to see is transactional systems more and more building analytical capabilities, analytical systems building more transactional capabilities, and then ultimately, cloud platform providers like us helping fill that gap and really making data movement seamless across transactional, analytical, and even AI and ML workloads. And so, that's an area that I think is a big opportunity. I also think that Google is best positioned to solve that problem.
Corey: Forget everything you know about SSH and try Tailscale.
Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.
Basically you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive? Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale
Corey: On some level, I've found, at least in my own work, that once I wind up using a database for something, I'm inclined to try and stuff as many other things into that database as I possibly can, just because getting a whole second data store, taking a dependency on it for any given workload, tends to be a little bit on the, I guess, challenging side. Easy example of this: I've talked about it previously in various places, but I was talking to one of your colleagues, [Sarah Ellis 00:19:48], who wound up at one point making a joke that I, of course, took way too far. Long story short, I built a Twitter bot on top of Google Cloud Functions that, every time the Azure brand account tweets, simply quote-tweets it, translates their tweet into all caps, and then puts a boomer-style statement in front of it if there's room. This account is @cloudboomer.
Now, the hard part that I had while doing this is that everything stateless works super well. Where do I wind up storing the ID of the last tweet that it saw on its previous run? And I was fourth-and-inches from just saying, “Well, I'm already using Twitter, so why don't we use Twitter as a database?” Because everything's a database if you're either good enough or bad enough at programming. And instead, I decided, okay, we'll try this Firebase thing first.
And I don't know if it's Firestore, or Datastore, or whatever it's called these days, but once I wrapped my head around it: incredibly effective, very fast to get up and running, and I feel like I made at least a good decision, for once in my life, involving something touching databases. But it's hard. I feel like I'm consistently drawn toward the thing I'm already using as a default database. I can't shake the feeling that that's the wrong direction.
Andi: I don't think it's necessarily wrong. I mean, I think, you know, with Firebase and Firestore, that combination is just extremely easy and quick for building awesome mobile applications. And actually, you can build mobile applications without a middle tier, which is probably what attracted you to that. So, we just see, you know, a huge amount of developers and applications there. We have over 4 million databases in Firestore, with just developers building these applications, especially mobile-first applications. So, I think, you know, if you can get your job done and get it done effectively, absolutely stick with them.
And by the way, one thing a lot of people don't know about Firestore is that it's actually running on Spanner infrastructure, so Firestore has the same five-nines availability, no maintenance downtime, and so on, that Spanner has, and the same kind of ability to scale.
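For anyone wanting the bot's state pattern spelled out, here's a minimal sketch with the google-cloud-firestore client; the collection, document, and field names are invented for illustration, not taken from the actual bot:

```python
from google.cloud import firestore

# Tiny "where do I keep the last tweet ID?" pattern from the story above.
# Collection/document/field names are illustrative, not from the bot itself.
db = firestore.Client()
state = db.collection("bot-state").document("cloudboomer")

snapshot = state.get()
data = snapshot.to_dict() if snapshot.exists else {}
last_seen_id = data.get("last_tweet_id")  # None on the very first run

# ... fetch tweets newer than last_seen_id, shout at Azure in all caps ...
newest_id = "1595830000000000000"  # placeholder for the newest tweet handled

# Persist for the next stateless invocation (e.g., a Cloud Function run).
state.set({"last_tweet_id": newest_id})
```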
So, it's not just that it's quick; it will actually scale as much as you need it to, and be as available as you need it to. So, that's on that piece. I think, though, to the same point, you know, there are other databases where we're trying to make sure we also extend their usage beyond what they've traditionally done. So, you know, for example, we announced AlloyDB, which I kind of call Postgres on steroids: we added analytical capabilities to this transactional database so that, as customers do have more data in their transactional database, as opposed to having to go somewhere else to analyze it, they can actually do real-time analytics within that same database, and it can actually do up to 100 times faster analytics than open-source Postgres.
So, I would say both Firestore and AlloyDB are kind of good examples of: if it works for you, right, we'll also continue to make investments so the number of use cases you can use these databases for continues to expand over time.
Corey: One of the weird things that I noticed just looking around this entire ecosystem of databases—and you've been in this space long enough to, presumably, have seen the same type of evolution—back when I was transiting between different companies a fair bit, sometimes because I was consulting and other times because I'm one of the greatest in the world at getting myself fired from jobs based upon my personality, I found that the default standard was always, “Oh, whatever the database is going to be, it started off as MySQL and then eventually pivots into something else when that starts falling down.” These days, I can't shake the feeling that almost everywhere I look, Postgres is the answer instead. What changed? What did I miss in the ecosystem that's driving that renaissance, for lack of a better term?
Andi: That's a great question. And, you know, I have been involved in—I'm going to date myself a bit—but in PHP since 1997, pretty much, and one of the things we kind of did is build a really good connector to MySQL—and you know, I don't know if you remember, before MySQL there was mSQL, and the MySQL API actually came from mSQL—and we bundled the MySQL driver with PHP. And so, kind of, that LAMP stack really took off. And kind of to your point, you know, the default in the web, right, was that you're going to start with MySQL, because it was super easy to use, just fun to use.
By the way, I actually wrote—co-authored—the tab completion in the MySQL client. So, like, a lot of these kinds of, you know, fun, simple ways of using MySQL were there, and frankly, it was super fast, right? And so, kind of, those fast reads and everything—it just was great for web and for content. And at the time, Postgres kind of came across more like a science project. Like, the folks who were using Postgres were kind of the outliers, right, you know, the less pragmatic folks.
I think what's changed over the past—how many years has it been now, 25 years? I'm definitely dating myself—is a few things: one, MySQL is still awesome, but it didn't kind of go in the direction of really trying to catch up with the legacy proprietary databases on features and functions. Part of that may just be that, from a roadmap perspective, that's not where the owner wanted it to go. So, MySQL today is still great, but it didn't go in that direction. In parallel, right, customers wanted to move more to open-source.
And so, what they found is that the thing that actually looks and smells more like the legacy proprietary databases is actually Postgres—plus you saw an increase of investment in the Postgres ecosystem, and also a very liberal license.
So, you have lots of other databases, including commercial ones, that have been built off the Postgres core. And so, I think you are today in a place where, for mainstream enterprise, Postgres is it, because that is the thing that has all the features that the enterprise customer is used to. MySQL is still very popular, especially in, like, content and web and mobile applications, but I would say that Postgres has really become kind of that de facto standard API that's replacing the legacy proprietary databases.
Corey: I've been on the record way too much as saying, with some justification, that the best database in the world, that should be used for everything, is Route 53—specifically, TXT records. It's a key-value store, and then anyone who's deep enough into DNS or databases generally gets a slightly greenish tinge and feels ill. That is my simultaneous best and worst database. I'm curious as to what your most controversial opinion is about the worst database in the world that you've ever seen.
Andi: This is the worst database? Or—
Corey: Yeah. What is the worst database that you've ever seen? I know, at some level, since you manage all things database, I'm asking you to pick your least favorite child, but here we are.
Andi: Oh, that's a really good question. No, I would say probably the “worst database,” double quotes, is just the file system, right? When folks are basically using the file system as a regular database. And that can work for, you know, really simple apps, but as apps get more complicated, that's not going to work. So, I've definitely seen some of that.
I would say the most awesome database that is also file-system-based, kind of embedded, I think was actually SQLite, you know? And SQLite is actually still very, very popular. I think it sits on pretty much every mobile device on the planet. So, I actually think it's awesome, but it's, you know, it's not a database server. It's kind of an embedded database, but it's something that I, you know, I've always been pretty excited about. And, you know, there's stuff [unintelligible 00:27:43] kind of new, interesting databases emerging that are also embedded—like DuckDB is quite interesting. You know, it's kind of the SQLite for analytics.
Corey: We've been using it for a few things around bill analysis ourselves. It's impressive. I've also got to say, people think that we had something to do with it because we're The Duckbill Group, and it's DuckDB. “Have you done anything with this?” And the answer is always, “Would you trust me with a database? I didn't think so.” So no, it's just a weird coincidence. But I liked that a lot.
It's also counterintuitive from where I sit, because I'm old enough to remember when Microsoft was teasing the idea of WinFS, a future file system that fundamentally was a database—I believe it was an index or journal for all of that—and I don't believe anything ever came of it. But ugh, that felt like a really weird alternate world we could have lived in.
Andi: Yeah. Well, that's a good point. And by the way, you know, if I actually take a step back, right: I kind of half-jokingly said, you know, file system—and obviously, you know, all the popular databases persist on the file system.
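For readers curious about that "SQLite for analytics" aside, a small sketch of DuckDB's embedded flavor; the CSV file and its columns are invented stand-ins for something like a billing export:

```python
import duckdb

# Embedded like SQLite, but column-oriented and built for analytics.
con = duckdb.connect()  # in-memory database; no server process anywhere

# File name and columns are hypothetical; imagine a cost-and-usage export.
rows = con.execute("""
    SELECT service, SUM(unblended_cost) AS total
    FROM read_csv_auto('billing_export.csv')
    GROUP BY service
    ORDER BY total DESC
    LIMIT 5
""").fetchall()

for service, total in rows:
    print(f"{service}: ${total:,.2f}")
```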
But if you look at what's different in cloud-first databases—right, like, if you look at legacy proprietary databases, the typical setup is: write to the local disk and then do asynchronous replication, with some kind of bounded replication lag, to somewhere else, to a different region, or so on. If you actually start to look at what the cloud-first databases look like, they actually write the data in multiple data centers at the same time.
And so, kind of, joke aside, as you start to think about, “Hey, how do I build the next generation of applications, and how do I really make sure I get the resiliency and the durability that the cloud can offer?”—it really does take a new architecture. And so, that's where things like, you know, Spanner and Big Table and, kind of, AlloyDB come in: databases truly architected for the cloud. That's where they actually think very differently about durability and replication, and what it really takes to provide the highest level of availability and durability.
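A toy illustration (mine, not Andi's) of the two write paths he contrasts: acknowledging a write once the local disk has it, versus acknowledging only after a majority of replicas in different zones have it:

```python
# Toy contrast, not real database code. Lists stand in for storage in
# three different zones or data centers.

def legacy_write(local_disk, replicas, record):
    """Classic setup: durable on one machine, replicas catch up with lag."""
    local_disk.append(record)
    # Replication is asynchronous; the caller is NOT waiting on these.
    pending = [(replica, record) for replica in replicas]
    return True, pending  # acknowledged before any replica has the data

def cloud_first_write(replicas, record, quorum=2):
    """Cloud-first setup: ack only after a majority of zones has the record."""
    acks = 0
    for replica in replicas:  # each append models a cross-DC round trip
        replica.append(record)
        acks += 1
    return acks >= quorum

zone_a, zone_b, zone_c = [], [], []
acked, pending = legacy_write(zone_a, [zone_b, zone_c], "row-1")
print(acked, "with", len(pending), "replica writes still in flight")
print(cloud_first_write([zone_a, zone_b, zone_c], "row-2"))  # True only with a majority
```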
We're basically saying you can hold the stick at both ends: you can basically start small and then if that application doesn't need to scale, does need to grow, you're not reengineering your application and you're not taking any downtime for reprovisioning. So, I think that's—if I had to give folks, kind of, advice, I say, “Look, what's done is done. You have workloads on MySQL, Postgres, and so on. That's great.”Like, they're awesome databases, keep on using them. But if you're truly building a new app, and you're hoping that app is going to be successful at some point, whether it's, like you said, all overnight successes take at least ten years, at least you built in on something like Spanner, you don't actually have to think about that anymore or worry about it, right? It will scale when you need it to scale and you're not going to have to take any downtime for it to scale. So, that's how we see a lot of these industries that have these potential spikes, like gaming, retail, also some use cases in financial services, they basically gravitate towards these databases.Corey: I really want to thank you for taking so much time out of your day to talk with me about databases and your perspective on them, especially given my profound level of ignorance around so many of them. If people want to learn more about how you view these things, where's the best place to find you?Andi: Follow me on LinkedIn. I tend to post quite a bit on LinkedIn, I still post a bit on Twitter, but frankly, I've moved more of my activity to LinkedIn now. I find it's—Corey: That is such a good decision. I envy you.Andi: It's a more curated [laugh], you know, audience and so on. And then also, you know, we just had Google Cloud Next. I recorded a session there that kind of talks about database and just some of the things that are new in database-land at Google Cloud. So, that's another thing that if folks more interested to get more information, that may be something that could be appealing to you.Corey: We will, of course, put links to all of this in the [show notes 00:34:03]. Thank you so much for your time. I really appreciate it.Andi: Great. Corey, thanks so much for having me.Corey: Andi Gutmans, VP and GM of Databases at Google Cloud. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment, then I'm going to collect all of those angry, insulting comments and use them as a database.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Security for Speed and Scale with Ashish Rajan

Screaming in the Cloud

Play Episode Listen Later Nov 22, 2022 35:24


About Ashish
Ashish has over 13 years of experience in the cybersecurity industry, with the last seven focused primarily on helping enterprises manage security risk at scale in a cloud-first world; in his last role, he was the CISO of a global cloud-first tech company. Ashish is also a keynote speaker, host of the widely popular Cloud Security Podcast, and a SANS trainer for Cloud Security & DevSecOps. Ashish currently works at Snyk as a Principal Cloud Security Advocate. He is a frequent contributor on topics related to public cloud transformation, cloud security, DevSecOps, security leadership, future tech, and the associated security challenges for practitioners and CISOs.
Links Referenced: Cloud Security Podcast: https://cloudsecuritypodcast.tv/ Personal website: https://www.ashishrajan.com/ LinkedIn: https://www.linkedin.com/in/ashishrajan/ Twitter: https://twitter.com/hashishrajan Cloud Security Podcast YouTube: https://www.youtube.com/c/CloudSecurityPodcast Cloud Security Podcast LinkedIn: https://www.linkedin.com/company/cloud-security-podcast/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at Thinkst Canary. Most folks find out way too late that they've been breached. Thinkst Canary changes this. Deploy canaries and canary tokens in minutes, and then forget about them. Attackers tip their hand by touching them, giving you one alert, when it matters. With zero administrative overhead and almost no false positives, Canaries are deployed and loved on all seven continents. Check out what people are saying at canary.love today.
Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don't. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn't care enough about backups. If you're tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M, for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is brought to us once again by our friends at Snyk. Snyk does amazing things in the world of cloud security and terrible things with the English language because, despite raising a whole boatload of money, they still stubbornly refuse to buy a vowel in their name. I'm joined today by Principal Cloud Security Advocate from Snyk, Ashish Rajan. Ashish, thank you for joining me.
Corey: Your history is fascinating to me because you've been around for a while on a podcast of your own, the Cloud Security Podcast. But until relatively recently, you were a CISO. As has become relatively accepted in the industry, the primary job of the CISO is to get themselves fired, and then, “Well, great. What's next?” Well, failing upward is really the way to go wherever possible, so now you are at Snyk, helping the rest of us fix our security.
That's my headcanon on all of that anyway, which I'm sure bears scant, if any, resemblance to reality, what's your version?Ashish: [laugh]. Oh, well, fortunately, I wasn't fired. And I think I definitely find that it's a great way to look at the CISO job to walk towards the path where you're no longer required because then I think you've definitely done your job. I moved into the media space because we got an opportunity to go full-time. I spoke about this offline, but an incident inspired us to go full-time into the space, so that's what made me leave my CISO job and go full-time into democratizing cloud security as much as possible for anyone and everyone. So far, every day, almost now, so it's almost like I dream about cloud security as well now.Corey: Yeah, I dream of cloud security too, but my dreams are of a better world in which people didn't tell me how much they really care about security in emails that demonstrate how much they failed to care about security until it was too little too late. I was in security myself for a while and got out of it because I was tired of being miserable all the time. But I feel that there's a deep spiritual alignment between people who care about cost and people who care about security when it comes to cloud—or business in general—because you can spend infinite money on those things, but it doesn't really get your business further. It's like paying for fire insurance. It isn't going to get you to your next milestone, whereas shipping faster, being more effective at launching a feature into markets, that can multiply revenue. That's what companies are optimized around. It's, “Oh, right. We have to do the security stuff,” or, “We have to fix the AWS billing piece.” It feels, on some level, like it's a backburner project most of the time and it's certainly invested in that way. What's your take on that?Ashish: I tend to disagree with that, for a couple reasons.Corey: Excellent. I love arguments.Ashish: I feel this in a healthy way as well. A, I love the analogy of spiritual animals where they are cost optimization as well as the risk aversion as well. I think where I normally stand—and this is what I had to unlearn after doing years of cybersecurity—was that initially, we always used to be—when I say ‘we,' I mean cybersecurity folks—we always used to be like police officers. Is that every time there's an incident, it turns into a crime scene, and suddenly we're all like, “Pew, pew, pew,” with trying to get all the evidence together, let's make this isolated as much—as isolated as possible from the rest of the environment, and let's try and resolve this.I feel like in Cloud has asked people to become more collaborative, which is a good problem to have. It also encourages that, I don't know how many people know this, but the reason we have brakes in our cars is not because we can slow down the car; it's so that we can go faster. And I feel security is the same thing. The guardrails we talk about, the risks that you're trying to avert, the reason you're trying to have security is not to slow down but to go faster. 
Say, for example, in an ideal world, to quote what you were saying earlier: if we were to do the right kind of encryption—I'm just going to use the most basic example—if we just do encryption, right, and just ensure that as a guardrail, the entire company needs to have encryption at rest, encryption in transit, period, nothing else, no one cares about anything else.But if you just lay that out as a framework and this is our guardrail, no one breaks this, and whoever does, hey, we—you know, slap on the wrist and come back onto the actual track, but keep going forward. That just means any project that comes in that meets [unintelligible 00:04:58] criteria keeps going forward, as many times as we want to go into production. Doesn't matter. So, that is the new world of security that we are being asked to move towards. Amazon re:Invent is coming in; there will be another, I don't know, three, four hundred services that will be released. How many people, irrespective of security, would actually know all of those services? They would not. So, [crosstalk 00:05:20]—Corey: Oh, we've long since passed the point where I can convincingly talk about AWS services that don't really exist and not get called out on it by Amazon employees. No one keeps them all in their head. Except me, because I'm sad.Ashish: Oh no, but I think you're right, though. I can't remember who it was—maybe Andrew Vogel or someone—but they made up a service which didn't exist, and it became, like, a thing on Twitter. Everyone—Corey: Ah, AWS's Infinidash. I want to say that was Joe Nash out of Twilio at the time. I don't recall offhand if I'm right on that, but that's how it feels. Yeah, it was certainly not me. People said that was my idea. Nope, nope, I just basically amplified it to a huge audience.But yeah, it was a brilliant idea, just because it's a fake service so everyone could tell stories about it. And amazing product feedback, if you look at it through the right lens of how people view your company and your releases when they get this perfect, platonic ideal of what it is you might put out there: what do people say about it?Ashish: Yeah. I think, to your point, I will use that as an example as well: there will always be a service which we will be told about for the first time, which we will not know. So, going back to the unlearning part, as a security team, we just have to understand that we can't use the old ways of, hey, I want to have all the controls possible, cover all that is possible. We can't say, I need to have a better understanding of all the cloud services because I've done, I don't know, 15 years of cloud—there is no one that has 10, 15 years of cloud unless you're, I don't know, an Amazon employee yourself. Most people these days still have five to six years of experience and they're still learning.Even the cloud engineering folks or the DevOps folks, they're all still learning and the tooling is continuing to evolve. So yeah, I definitely find that security in this cloud world is a lot more collaborative, and it's being looked at as the same function a brake has in a car: to help you go faster, not to just slam the brakes every time, like, "oh, my God, is the situation isolated," and to police people.Corey: One of the points I find that is so aligned between security and cost—and you alluded to it a minute ago—is the idea of helping companies go faster safely. To that end, guardrails have to be at least as easy as just going off and doing it cow-person style.
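To make the guardrail idea concrete: the encryption-everywhere rule Ashish describes is the kind of check that can run automatically against infrastructure code before anything ships. What follows is a minimal sketch of what such a check could look like, a hypothetical script rather than any vendor's product. It assumes a CloudFormation template written as plain YAML (PyYAML's safe loader won't parse intrinsic tags like !Ref); BucketEncryption is the real CloudFormation property for S3 server-side encryption, while the script name and structure are invented for illustration.

# guardrail_check.py - a toy "encryption by default" guardrail (hypothetical, for illustration).
# Flags any S3 bucket in a CloudFormation template that lacks server-side encryption.
import sys
import yaml  # pip install pyyaml

def unencrypted_buckets(template_path):
    with open(template_path) as f:
        template = yaml.safe_load(f) or {}
    findings = []
    for name, resource in (template.get("Resources") or {}).items():
        if resource.get("Type") != "AWS::S3::Bucket":
            continue
        properties = resource.get("Properties") or {}
        # BucketEncryption is the CloudFormation property holding SSE configuration.
        if "BucketEncryption" not in properties:
            findings.append(name)
    return findings

if __name__ == "__main__":
    violations = unencrypted_buckets(sys.argv[1])
    for name in violations:
        print(f"Guardrail violation: bucket '{name}' has no BucketEncryption configured")
    sys.exit(1 if violations else 0)

Wired into CI, a check like this becomes the slap on the wrist before production rather than after, and it only earns adoption if it runs automatically and stays out of everyone's way.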
Because if it's not, it's more work in any way, shape, or form, people won't do it. People will not tag their resources by hand, people will not go through and use the dedicated account structure you've got that gets in their way and screams at them every time they try to use one of the native features built into the platform. It has to get out of their way and make things easier, not worse, or people fight it, they go around it, and you're never going to get buy-in.Ashish: Do you feel like cost is something that a lot more people pay a lot more attention to because, you know, that creeps into your budget? Like, as people who've been leaders before, and this was the conversation, they would just go, “Well, I only have, I don't know, 100,000 to spend this quarter,” or, “This year,” and they are the ones who—are some of them, I remember—I used to have this manager, once, a CTO would always be conscious about the spend. It's almost like if you overspend, where do you get the money from? There's no money to bring in extra. Like, no. There's a set money that people plan for any year for a budget. And to your point about if you're not keeping an eye on how are we spending this in the AWS context because very easy to spend the entire money in one day, or in the cloud context. So, I wonder if that is also a big driver for people to feel costs above security? Where do you stand on that?Corey: When it comes to cost, one of the nice things about it—and this is going to sound sarcastic, but I swear to you it's not—it's only money.Ashish: Mmm.Corey: Think about that for a second because it's true. Okay, we wound up screwing up and misconfiguring something and overspending. Well, there are ways around that. You can call AWS, you can get credits, you can get concessions made for mistakes, you can sign larger contracts and get a big pile of proof of concept credit et cetera, et cetera. There are ways to make that up, whereas with security, it's there are no do-overs on security breaches.Ashish: No, that's a good point. I mean, you can always get more money, use a credit card, worst case scenario, but you can't do the same for—there's a security breach and suddenly now—hopefully, you don't have to call New York Times and say, “Can you undo that article that you just have posted that told you it was a mistake. We rewinded what we did.”Corey: I'm curious to know what your take is these days on the state of the cloud security community. And the reason I bring that up is, well, I started about a year-and-a-half ago now doing a podcast every Thursday. Which is Last Week in AWS: Security Edition because everything else I found in the industry that when I went looking was aimed explicitly at either—driven by the InfoSec community, which is toxic and a whole bunch of assumed knowledge already built in that looks an awful lot like gatekeeping, which is the reason I got out of InfoSec in the first place, or alternately was completely vendor-captured, where, okay, great, we're going to go ahead and do a whole bunch of interesting content and it's all brought to you by this company and strangely, all of the content is directly align with doing some pretty weird things that you wouldn't do unless you're trying to build a business case for that company's product. And it just feels hopelessly compromised. I wanted to find something that was aimed at people who had to care about security but didn't have security as part of their job title. 
Think DevOps types and you're getting warmer.That's what I wound up setting out to build. And when all was said and done, I wasn't super thrilled with, honestly, how alone it still felt. You've been doing this for a while, and you're doing a great job at it, don't get me wrong, but there is the question that—and I understand they're sponsoring this episode, but the nice thing about promoted guest episodes is that they can buy my attention, not my opinion. How do you retain creative control of your podcast while working for a security vendor?Ashish: So, that's a good question. So, Snyk by themselves have never asked us to change any piece of content; we have been working with them for the past few months now. The reason we kind of came along with Snyk was the alignment. And we were talking about this earlier: I totally believe that DevSecOps and cloud security are ultimately going to come together one day. That may not be today, that may not be tomorrow, that may not be in 2022, or maybe 2023, but there will be a future where these two will sit together.And the developer-first security mentality that they had, in this context from a cloud perspective—developers being the cloud engineers, the DevOps people, as you called out, the reason you went in that direction—I definitely wanted to work with them. And ultimately, there would never be enough people in security to solve the problem. That is the harsh reality. There would never be enough people. So, whether it's cloud security or not—for people who were at AWS re:Inforce, the first 15 minutes by Steve Schmidt, CSO of Amazon, was: get a security guardian program.So, I've been talking about it, everyone else is talking about it right now, and Amazon has become the first CSP to even talk about this publicly as well, that we should have security guardians. Which, by the way—I don't know why, but you can still call it what it is; it is technically DevSecOps, what you're trying to do—they spoke about a security champion program as part of the keynote that they were running. Nothing to do with cloud security, but the idea being: how much of this workload can we share? As a security team—for people who may be from a security background listening to this—how much can we elevate the risk in front of the right people, who are the decision-makers? That is our role.We help them with the governance, we help with managing it, but we don't know how to solve the risk or close off a risk, or close off a vulnerability, because you might be the best person: you work in that application every day—you know the bandages that are put in, you know all the holes that are there—so the best threat model can be performed by the person who works on it day-to-day, not a security person who spent, like, an hour with you once a week because that's the only time they could manage. So, going back to the Snyk part, that's the mission that we've had with the podcast; we want to democratize cloud security and build a community around neutral information. There is no biased information. And I agree with what you said as well, where a lot of the podcasts outside of what we were finding were more focused on, "Hey, this is how you use AWS. This is how you use Azure. This is how you use GCP."But none of them were unbiased in their opinion. Because in real life—let's just say, even if I use the AWS example, because we are coming close to AWS re:Invent—they don't have all the answers from a security perspective.
They don't have all the answers from an infrastructure perspective or cloud-native perspective. So, there are some times—or even most times—people are making a call where they're going outside of it. So, unbiased information is definitely required and it is not there enough.So, I'm glad that at least people like yourself are joining, and you know, creating the world where more people are trying to be relatable to DevOps people as well as the security folks. Because it's hard for a security person to be a developer, but it's easy for a developer or an engineer to understand security. And the simplest example I use is when people walk out of their house, they lock the door. They're already doing security. This is the same thing we're asking when we talk about security in the cloud or in the [unintelligible 00:14:49] as well. Everyone is, it just it hasn't been pointed out in the right way.Corey: I'm curious as to what it is that gets you up in the morning. Now, I know you work in security, but you're also not a CISO anymore, so I'm not asking what gets you up at 2 a.m. because we know what happens in the security space, then. There's a reason that my area of business focus is strictly a business hours problem. But I'd love to know what it is about cloud security as a whole that gets you excited.Ashish: I think it's an opportunity for people to get into the space without the—you know, you said gatekeeper earlier, those gatekeepers who used to have that 25 years experience in cybersecurity, 15 years experience in cybersecurity, Cloud has challenged that norm. Now, none of that experience helps you do AWS services better. It definitely helps you with the foundational pieces, definitely helps you do identity, networking, all of that, but you still have to learn something completely new, a new way of working, which allows for a lot of people who earlier was struggling to get into cybersecurity, now they have an opening. That's what excites me about cloud security, that it has opened up a door which is beyond your CCNA, CISSP, and whatever else certification that people want to get. By the way, I don't have a CISSP, so I can totally throw CISSP under the bus.But I definitely find that cloud security excites me every morning because it has shown me light where, to what you said, it was always a gated community. Although that's a very huge generalization. There's a lot of nice people in cybersecurity who want to mentor and help people get in. But Cloud security has pushed through that door, made it even wider than it was before.Corey: I think there's a lot to be said for the concept of sending the elevator back down. I really have remarkably little patience for people who take the perspective of, “Well, I got mine so screw everyone else.” The next generation should have it easier than we did, figuring out where we land in the ecosystem, where we live in the space. And there are folks who do a tremendous job of this, but there are also areas where I think there is significant need for improvement. I'm curious to know what you see as lacking in the community ecosystem for folks who are just dipping their toes into the water of cloud security.Ashish: I think that one, there's misinformation as well. The first one being, if you have never done IT before you can get into cloud security, and you know, you will do a great job. I think that is definitely a mistake to just accept the fact if Amazon re:Invent tells you do all these certifications, or Azure does the same, or GCP does the same. 
If I'll be really honest—and I feel like I can be honest, this is a safe space—that for people who are listening in, if you're coming to the space for the first time, whether it's cloud or cloud security, if you haven't had much exposure to the foundational pieces of it, it would be a really hard call. You would know all the AWS services, you will know all the Azure services because you have your certification, but if I was to ask you, “Hey, help me build an application. What would be the architecture look like so it can scale?”“So, right now we are a small pizza-size ten-people team”—I'm going to use the Amazon term there—“But we want to grow into a Facebook tomorrow, so please build me an architecture that can scale.” And if you regurgitate what Amazon has told you, or Azure has told you, or GCP has told you, I can definitely see that you would struggle in the industry because that's not how, say every application is built. Because the cloud service provider would ask you to drink the Kool-Aid and say they can solve all your problems, even though they don't have all the servers in the world. So, that's the first misinformation.The other one too, for people who are transitioning, who used to be in IT or in cybersecurity and trying to get into the cloud security space, the challenge over there is that outside of Amazon, Google, and Microsoft, there is not a lot of formal education which is unbiased. It is a great way to learn AWS security on how amazing AWS is from AWS people, the same way Microsoft will be [unintelligible 00:19:10], however, when it comes down to actual formal education, like the kind that you and I are trying to provide through a podcast, me with the Cloud Security Podcast, you with Last Week in AWS in the Security Edition, that kind of unbiased formal education, like free education, like what you and I are doing does definitely exist and I guess I'm glad we have company, that you and I both exist in this space, but formal education is very limited. It's always behind, say an expensive paid wall sometimes, and rightly so because it's information that would be helpful. So yeah, those two things. Corey: This episode is sponsored in part by our friends at Uptycs. Attackers don't think in silos, so why would you have siloed solutions protecting cloud, containers, and laptops distinctly? Meet Uptycs - the first unified solution prioritizes risk across your modern attack surface—all from a single platform, UI, and data model. Stop by booth 3352 at AWS re:Invent in Las Vegas to see for yourself and visit uptycs.com. That's U-P-T-Y-C-S.com. Corey: One of the problems that I have with the way a lot of cloud security stuff is situated is that you need to have something running to care about the security of. Yeah, I can spin up a VM in the free tier of most of these environments, and okay, “How do I secure a single Linux box?” Okay, yes, there are a lot of things you can learn there, but it's very far from a holistic point of view. You need to have the infrastructure running at reasonable scale first, in order to really get an effective lab that isn't contrived.Now, Snyk is a security company. I absolutely understand and have no problem with the fact that you charge your customers money in order to get security outcomes that are better than they would have otherwise. I do not get why AWS and GCP charge extra for security. 
And I really don't get why Azure charges extra for security and then doesn't deliver security by dropping the ball on it, which is neither here nor there.Ashish: [laugh].Corey: It feels like there's an economic form of gatekeeping, where you must spend at least this much money—or work for someone who does—in order to get exposure to security the way that grownups think about it. Because otherwise, all right, I hit my own web server, I have ten lines in the logs. Now, how do I wind up doing an analysis run to figure out what happened? I pull it up on my screen and I look at it. You need a point of scale before anything that the modern world revolves around doesn't seem ludicrous.Ashish: That's a good point. Also because we don't talk about the responsibility that the cloud service provider has themselves for security, like the encryption example that I used earlier, as a guardrail, it doesn't take much for them to enable by default. But how many do that by default? I feel foolish sometimes to tell people that, “Hey, you should have encryption enabled on your storage which is addressed, or in transit.”It should be—like, we have services like Let's Encrypt and other services, which are trying to make this easily available to everyone so everyone can do SSL or HTTPS. And also, same goes for encryption. It's free and given the choice that you can go customer-based keys or your own key or whatever, but it should be something that should be default. We don't have to remind people, especially if you're the providers of the service. I agree with you on the, you know, very basic principle of why do I pay extra for security, when you should have already covered this for me as part of the service.Because hey, technically, aren't you also responsible in this conversation? But the way I see shared responsibility is that—someone on the podcast mentioned it and I think it's true—shared responsibility means no one's responsible. And this is the kind of world we're living in because of that.Corey: Shared responsibility has always been an odd concept to me because AWS is where I first encountered it and they, from my perspective, turn what fits into a tweet into a 45-minute dog-and-pony show around, “Ah, this is how it works. This is the part we're responsible for. This is the part where the customer responsibility is. Now, let's have a mind-numbingly boring conversation around it.” Whereas, yeah, there's a compression algorithm here. Basically, if the cloud gets breached, it is overwhelmingly likely that you misconfigured something on your end, not the provider doing it, unless it's Azure, which is neither here nor there, once again.The problem with that modeling, once you get a little bit more business sophistication than I had the first time I made the observation, is that you can't sit down with a CISO at a company that just suffered a data breach and have your conversation be, “Doesn't it suck to be you—[singing] duh, duh—because you messed up. That's it.” You need that dog-and-pony show of being able to go in-depth and nuance because otherwise, you're basically calling out your customer, which you can't really do. Which I feel occludes a lot of clarity for folks who are not in that position who want to understand these things a bit better.Ashish: You're right, Corey. 
I think definitely I don't want to be in a place where we're definitely just educating people on this, but I also want to call out that we are in a world where it is true that Amazon, Azure, Google Cloud, they all have vulnerabilities as well. Thanks to research by all these amazing people on the internet from different companies out there, they've identified that, hey, these are not pristine environments that you can go into. Azure, AWS, Google Cloud, they themselves have vulnerabilities, and sometimes some of those vulnerabilities cannot be fixed until the customer intervenes and upgrades their services. We do live in a world where there is not enough education about this as well, so I'm glad you brought this up because for people who are listening in, I mean, I was one of those people who would always say, “When was the last time you heard Amazon had a breach?” Or, “Microsoft had a breach?” Or, “Google Cloud had a breach?”That was the idea when people were just buying into the concept of cloud and did not trust cloud. Every cybersecurity person that I would talk to they're like, “Why would you trust cloud? Doesn't make sense.” But this is, like, seven, eight years ago. Fast-forward to today, it's almost default, “Why would you not go into cloud?”So, for people who tend to forget that part, I guess, there is definitely a journey that people came through. With the same example of multi-factor authentication, it was never a, “Hey, let's enable password and multi-factor authentication.” It took a few stages to get there. Same with this as well. We're at that stage where now cloud service providers are showing the kinks in the armor, and now people are questioning, “I should update my risk matrix for what if there's actually a breach in AWS?”Now, Capital One is a great example where the Amazon employee who was sentenced, she did something which has—never even [unintelligible 00:25:32] on before, opened up the door for that [unintelligible 00:25:36] CISO being potentially sentenced. There was another one. Because it became more primetime news, now people are starting to understand, oh, wait. This is not the same as it used to be. Cloud security breaches have evolved as well.And just sticking to the Uber point, when Uber has that recent breach where they were talking about, “Hey, so many data records were gone,” what a lot of people did not talk about in that same message, it also mentioned the fact that, hey, they also got access to the AWS console of Uber. Now, that to me, is my risk metrics has already gone higher than where it was before because it just not your data, but potentially your production, your pre-prod, any development work that you were doing for, I don't know, self-driving cars or whatever that Uber [unintelligible 00:26:18] is doing, all that is out on the internet. But who was talking about all of that? That's a much worse a breach than what was portrayed on the internet. I don't know, what do you think?Corey: When it comes to trusting providers, where I sit is that I think, given their scale, they need to be a lot more transparent than they have been historically. However, I also believe that if you do not trust that these companies are telling you the truth about what they're doing, how they're doing it, what their controls are, then you should not be using them as a customer, full stop. 
This idea of confidential computing drives me nuts because so much of it is, "Well, what if we assume our cloud provider is lying to us about all of these things?" Like, hypothetically, there's nothing stopping them from building an exact clone of their entire control plane that they redirect your requests to that does something completely different under the hood. "Oh, yeah, of course, we're encrypting it with that special KMS key." No, they're not. Or, "Yeah, sure, we're going to put that into this region." Nope, it goes right back to Virginia. If you believe that's what's going on and that they're willing to do that, you can't be in cloud.Ashish: Yeah, a hundred percent. I think foundational trust needs to exist, and I don't think the cloud service providers themselves do a great job of building that trust. And maybe that's where the drift comes in, because the business has decided they're going to cloud. The cybersecurity people are trying to be more aware and asking the question, "Hey, why do we trust it so blindly? I don't have a pen test report from Amazon saying they have tested the service."Yes, I do have a certificate saying it's PCI compliant, but how do I know—to what you said—they haven't cloned our services? Fortunately, businesses are getting smarter. Like, Walmart would never have their resources in AWS because they don't trust them. It's a business risk if suddenly they decide to go into that space. But the other way around: Microsoft may decide tomorrow that they want to start their own Walmart. Then what do you do?So, I don't know how many people actually consider that as a real business risk, especially because there's a word that was floating around the internet called supercloud. And the idea behind this was—oh, I can already see your reaction [laugh].Corey: Yeah, don't get me started on that whole mess.Ashish: [laugh]. Oh no, I'm the same. I'm like, "What? What now?" Like, "What are you—" So, one thing I took away which I thought was still valuable was the fact that if you look at the cloud service providers, they're all like octopuses; they all have tentacles everywhere.Like, if you look at the Amazons of the world, they're not only a bookstore: they have a grocery store, they have a delivery service. So, they are into a lot of industries; the same way, Google Cloud, Microsoft, they're all in multiple industries. And they still have enough money to choose to go into an industry that they had never been into before because of the access that they would get with all this information that they have, potentially—assuming that they [unintelligible 00:29:14] information. Now, "shared responsibility," quote-unquote, says they should not do it, but there is nothing stopping them from actually starting a Walmart tomorrow if they wanted to.Corey: So, because a podcast and a day job aren't enough, what are you going to be doing in the near future given that, as we record this, re:Invent is nigh?Ashish: Yeah. So, podcasting and being in the YouTube space has definitely opened up the creative mindset for me. And I think for my producer as well. We're doing all these exciting projects. We have something called Cloud Security Villains that is coming up for AWS re:Invent, and it's going to be released on our YouTube channel as well as my social media.And we'll have merchandise for it across re:Invent as well. And I'm just super excited about the possibility that media as a space provides for everyone.
So, for people who are listening in and thinking, I don't know, I don't want to write for a blog or an email newsletter or whatever the thing may be, I just want to put it out there that I used to be excited about AWS re:Invent just to understand, hey, hopefully, they will release a new security service. Now, I get excited about these events because I get to meet the community, help them, share what they have learned on the internet, and sound smarter [laugh] as a result of that as well, and get interviewed by people like yourself. But I definitely find that at the moment, with AWS re:Invent coming in, one of the things that is exciting for me is the release of the Cloud Security Villains, which I think will be an exciting project, especially—hint, hint—for people who are into comic books: you will definitely enjoy it, and I think your kids will as well. So, just in time for Christmas.Corey: We will definitely keep an eye out for that and put a link to that in the show notes. I really want to thank you for being so generous with your time. If people want to learn more about what you're up to, where's the best place for them to find you?Ashish: I think I'm fortunate enough to be at that stage where, normally, if people Google me—and it's simply Ashish Rajan—they will definitely find me [laugh]. It'll be really hard for them not to find me on the internet. But if you are looking for a source of unbiased cloud security knowledge, you can definitely hit up cloudsecuritypodcast.tv or our YouTube and LinkedIn channels.We go live-stream every week with a new guest talking about cloud security, which could be companies like LinkedIn, Twilio, to name a few that have come on the show already, and a lot more that have come in and been generous with their time and shared how they do what they do. And we're fortunate that we get ranked top 100 in the US, UK, as well as Australia. I'm really fortunate for that. So, we're doing something right, so hopefully, you get some value out of it as well when you come and find me.Corey: And we will, of course, put links to all of that in the show notes. Thank you so much for being so generous with your time. I really appreciate it.Ashish: Thank you, Corey, for having me. I really appreciate this a lot. I enjoyed the conversation.Corey: As did I. Ashish Rajan, Principal Cloud Security Advocate at Snyk, who is sponsoring this promoted guest episode. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment pointing out that not every CISO gets fired; some of them successfully manage to blame the intern.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Snyk and the Complex World of Vulnerability Intelligence with Clinton Herget

Screaming in the Cloud

Play Episode Listen Later Nov 17, 2022 38:39


About ClintonClinton Herget is Field CTO at Snyk, the leader in Developer Security. He focuses on helping Snyk's strategic customers on their journey to DevSecOps maturity. A seasoned technologist, Clinton spent his 20-year career prior to Snyk as a web software developer, DevOps consultant, cloud solutions architect, and engineering director. Clinton is passionate about empowering software engineers to do their best work in the chaotic cloud-native world, and is a frequent conference speaker, developer advocate, and technical thought leader.Links Referenced: Snyk: https://snyk.io/ duckbillgroup.com: https://duckbillgroup.com TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is brought to us in part by our friends at Pinecone. They believe that all anyone really wants is to be understood, and that includes your users. AI models combined with the Pinecone vector database let your applications understand and act on what your users want… without making them spell it out.Make your search application find results by meaning instead of just keywords, your personalization system make picks based on relevance instead of just tags, and your security applications match threats by resemblance instead of just regular expressions. Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable. Thanks to my friends at Pinecone for sponsoring this episode. Visit Pinecone.io to understand more.Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don't. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn't care enough about backups. If you're tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M, for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. One of the fun things about establishing traditions is that the first time you do it, you don't really know that that's what's happening. Almost exactly a year ago, I sat down for a previous promoted guest episode much like this one, with Clinton Herget at Snyk—or Synic; however you want to pronounce that. He is apparently a scarecrow of some sort because when last we spoke, he was a principal solutions engineer, but like any good scarecrow, he was outstanding in his field, and now, as a result, is a Field CTO. Clinton, thanks for coming back, and let me start by congratulating you on the promotion. Or consoling you, depending upon how good or bad it is.Clinton: You know, Corey, a little bit of column A, a little bit of column B. But very glad to be here again, and frankly, I think it's because you insist on mispronouncing Snyk as Synic, and so you get me again.Corey: Yeah, you could add a couple of new letters to it and just call the company [Synack 00:01:27]. Now, it's a hard pivot to a networking company.
So, there's always options.Clinton: I acknowledge what you did there, Corey.Corey: I like that quite a bit. I wasn't sure you'd get it.Clinton: I'm a nerd going way, way back, so we'll have to go pretty deep in the stack for you to stump me on some of this stuff.Corey: As we did with the, “I wasn't sure you'd get it.” See that one sailed right past you. And I win. Chalk another one up for me and the networking pun wars. Great, we'll loop back for that later.Clinton: I don't even know where I am right now.Corey: [laugh]. So, let's go back to a question that one would think that I'd already established a year ago, but I have the attention span of basically a goldfish, let's not kid ourselves. So, as I'm visiting the Snyk website, I find that it says different words than it did a year ago, which is generally a sign that is positive; when nothing's been updated including the copyright date, things are going really well or really badly. One wonders. But no, now you're talking about Snyk Cloud, you're talking about several other offerings as well, and my understanding of what it is you folks do no longer appears to be completely accurate. So, let me be direct. What the hell do you folks do over there?Clinton: It's a really great question. Glad you asked me on a year later to answer it. I would say at a very high level, what we do hasn't changed. However, I think the industry has certainly come a long way in the past couple years and our job is to adapt to that Snyk—again, pronounced like a pair of sneakers are sneaking around—it's a developer security platform. So, we focus on enabling the people who build applications—which as of today, means modern applications built in the cloud—to have better visibility, and ultimately a better chance of mitigating the risk that goes into those applications when it matters most, which is actually in their workflow.Now, you're exactly right. Things have certainly expanded in that remit because the job of a software engineer is very different, I think this year than it even was last year, and that's continually evolving over time. As a developer now, I'm doing a lot more than I was doing a few years ago. And one of the things I'm doing is building infrastructure in the cloud, I'm writing YAML files, I'm writing CloudFormation templates to deploy things out to AWS. And what happens in the cloud has a lot to do with the risk to my organization associated with those applications that I'm building.So, I'd love to talk a little bit more about why we decided to make that move, but I don't think that represents a watering down of what we're trying to do at Snyk. I think it recognizes that developer security vision fundamentally can't exist without some understanding of what's happening in the cloud.Corey: One of the things that always scares me is—and sets the spidey sense tingling—is when I see a company who has a product, and I'm familiar—ish—with what they do. And then they take their product name and slap the word cloud at the end, which is almost always codes to, “Okay, so we took the thing that we sold in boxes in data centers, and now we're making a shitty hosted version available because it turns out you rubes will absolutely pay a subscription for it.” Yeah, I don't get the sense that at all is what you're doing. 
In fact, I don't believe that you're offering a hosted managed service at the moment, are you?Clinton: No, the cloud part, that fundamentally refers to a new product, an offering that looks at the security or potentially the risks being introduced into cloud infrastructure, by now the engineers who were doing it who are writing infrastructure as code. We previously had an infrastructure-as-code security product, and that served alongside our static analysis tool which is Snyk Code, our open-source tool, our container scanner, recognizing that the kinds of vulnerabilities you can potentially introduce in writing cloud infrastructure are not only bad to the organization on their own—I mean, nobody wants to create an S3 bucket that's wide open to the world—but also, those misconfigurations can increase the blast radius of other kinds of vulnerabilities in the stack. So, I think what it does is it recognizes that, as you and I think your listeners well know, Corey, there's no such thing as the cloud, right? The cloud is just a bunch of fancy software designed to abstract away from the fact that you're running stuff on somebody else's computer, right?Corey: Unfortunately, in this case, the fact that you're calling it Snyk Cloud does not mean that you're doing what so many other companies in that same space do it would have led to a really short interview because I have no faith that it's the right path forward, especially for you folks, where it's, “Oh, you want to be secure? You've got to host your stuff on our stuff instead. That's why we called it cloud.” That's the direction that I've seen a lot of folks try and pivot in, and I always find it disastrous. It's, “Yeah, well, at Snyk if we run your code or your shitty applications here in our environment, it's going to be safer than if you run it yourself on something untested like AWS.” And yeah, those stories hold absolutely no water. And may I just say, I'm gratified that's not what you're doing?Clinton: Absolutely not. No, I would say we have no interest in running anyone's applications. We do want to scan them though, right? We do want to give the developers insight into the potential misconfigurations, the risks, the vulnerabilities that you're introducing. What sets Snyk apart, I think, from others in that application security testing space is we focus on the experience of the developer, rather than just being another tool that runs and generates a bunch of PDFs and then throws them back to say, “Here's everything you did wrong.”We want to say to developers, “Here's what you could do better. Here's how that default in a CloudFormation template that leads to your bucket being, you know, wide open on the internet could be changed. Here's the remediation that you could introduce.” And if we do that at the right moment, which is inside that developer workflow, inside the IDE, on their local machine, before that gets deployed, there's a much greater chance that remediation is going to be implemented and it's going to happen much more cheaply, right? Because you no longer have to do the round trip all the way out to the cloud and back.So, the cloud part of it fundamentally means completing that story, recognizing that once things do get deployed, there's a lot of valuable context that's happening out there that a developer can really take advantage of. They can say, “Wait a minute. 
Not only do I have a Log4Shell vulnerability, right, in one of my open-source dependencies, but that artifact, that application is actually getting deployed to a VPC that has ingress from the internet," right? So, not only do I have remote code execution in my application, but it's being put in an enclave that actually allows it to be exploited. You can only know that if you're actually looking at what's really happening in the cloud, right?So, not only does Snyk Cloud allow us to provide an additional layer of security by looking at what's misconfigured in that cloud environment and help your developers make remediations by saying, "Here's the actual IaC file that caused that infrastructure to come into existence," but we can also say: here's how that affects the risk of other kinds of vulnerabilities at different layers in the stack, right? Because it's all software; it's all connected. Very rarely does a vulnerability translate one-to-one into risk, right? They're compound, because modern software is compound. And I think what developers lack is the tooling that fits into their workflow, that understands what it means to be a software engineer, and that actually helps them make better choices rather than punishing them after the fact for guessing and making bad ones.Corey: That sounds awesome at a very high level. It is very aligned with how executives and decision-makers think about a lot of these things. Let's get down to brass tacks for a second. Assume that I am the type of developer that I am in real life, by which I mean shitty. What am I going to wind up attempting to do that Snyk will flag and, in other words, protect me from myself and warn me that I'm about to commit a dumb?Clinton: First of all, I would say, look, there's no such thing as a non-shitty developer, right? And I built software for 20 years and I decided that's really hard. What's a lot easier is talking about building software for a living. So, that's what I do now. But fundamentally, the reason I'm at Snyk is I want to help people who are in the kinds of jobs that I had for a very long time: which is to say, you have a tremendous amount of anxiety because you recognize that the success of the organization rests on your shoulders, and you're making hundreds, if not thousands, of decisions every day without the right context to understand fully how the results of those decisions are going to affect the organization that you work for.So, I think every developer in the world has to deal with this constant cognitive dissonance of saying, "I don't know that this is right, but I have to do it anyway because I need to clear that ticket because that release needs to get into production." And it becomes really easy to short-sightedly do things like pull an open-source dependency without checking whether it has any CVEs associated with it, because that's the version that's easiest to implement with your code that already exists. So, that's one piece. Snyk Open Source is designed to traverse that entire tree of dependencies in open-source all the way down, all the hundreds and thousands of packages that you're pulling in, to say, not only, here's a vulnerability that you should really know is going to end up in your application when it's built, but also, here's what you can do about it, right?
Here's the upgrade you can make, here's the minimum viable change that actually gets you out of this problem, and to do so when it's in the right context, which is in you know, as you're making that decision for the first time, right, inside your developer environment.That also applies to things like container vulnerabilities, right? I have even less visibility into what's happening inside a container than I do inside my application. Because I know, say, I'm using an Ubuntu or a Red Hat base image. I have no idea, what are all the Linux packages that are on it, let alone what are the vulnerabilities associated with them, right? So, being able to detect, I've got a version of OpenSSL 3.0 that has a potentially serious vulnerability associated with it before I've actually deployed that container out into the cloud very much helps me as a developer.Because I'm limiting the rework or the refactoring I would have to do by otherwise assuming I'm making a safe choice or guessing at it, and then only finding out after I've written a bunch more code that relies on that decision, that I have to go back and change it, and then rewrite all of the things that I wrote on top of it, right? So, it's the identifying the layer in the stack where that risk could be introduced, and then also seeing how it's affected by all of those other layers because modern software is inherently complex. And that complexity is what drives both the risk associated with it, and also things like efficiency, which I know your audience is, for good reason, very concerned about.Corey: I'm going to challenge you on aspect of this because on the tin, the way you describe it, it sounds like, “Oh, I already have something that does that. It's the GitHub Dependabot story where it winds up sending me a litany of complaints every week.” And we are talking, if I did nothing other than read this email in that day, that would be a tremendously efficient processing of that entire thing because so much of it is stuff that is ancient and archived, and specific aspects of the vulnerabilities are just not relevant. And you talk about the OpenSSL 3.0 issues that just recently came out.I have no doubt that somewhere in the most recent email I've gotten from that thing, it's buried two-thirds of the way down, like all the complaints like the dishwasher isn't loaded, you forgot to take the trash out, that baby needs a change, the kitchen is on fire, and the vacuuming, and the r—wait, wait. What was that thing about the kitchen? Seems like one of those things is not like the others. And it just gets lost in the noise. Now, I will admit to putting my thumb a little bit on the scale here because I've used Snyk before myself and I know that you don't do that. How do you avoid that trap?Clinton: Great question. And I think really, the key to the story here is, developers need to be able to prioritize, and in order to prioritize effectively, you need to understand the context of what happens to that application after it gets deployed. And so, this is a key part of why getting the data out of the cloud and bringing it back into the code is so important. So, for example, take an OpenSSL vulnerability. Do you have it on a container image you're using, right? So, that's question number one.Question two is, is there actually a way that code can be accessed from the outside? Is it included or is it called? Is the method activated by some other package that you have running on that container? Is that container image actually used in a production deployment? 
Or does it just go sit in a registry and no one ever touches it?What are the conditions required to make that vulnerability exploitable? You look at something like Spring Shell, for example: yes, you need a certain version of spring-beans in a JAR file somewhere, but you also need to be running a certain version of Tomcat, and you need to be packaging those JARs inside a WAR in a certain way.Corey: Exactly. I have a whole bunch of Lambda functions that provide the pipeline system that I use to build my newsletter every week, and I get screaming concerns about issues in, for example, a version of the markdown parser that I've subverted. Yeah, sure. I get that, on some level, if I were just giving it random untrusted input from the internet and random ad hoc users, but I'm not. It's just me when I write things for that particular Lambda function.And I'm not going to be actively attempting to subvert the thing that I built myself and no one else should have access to. And looking through the details of some of these things, it doesn't even apply to the way that I'm calling the libraries, so it's just noise, for lack of a better term. It is not something that basically ever needs to be adjusted or fixed.Clinton: Exactly. And I think cutting through that noise is so key to creating developer trust in any kind of tool that's scanning an asset and providing you with what, in theory, is a list of actionable steps, right? I need to be able to understand what the thing is, first of all. There's a lot of tools that do that, right, and we tend to mock them by saying things like, "Oh, it's just another PDF generator. It's just another thousand pages that you're never going to read."So, getting the information in the right place is a big part of it, but so is filtering out all of the noise by saying: we looked at not just one layer of the stack, but multiple layers, right? We know that you're using this open-source dependency, and we also know that the method that contains the vulnerability is actively called by your application in your first-party code because we ran our static analysis tool against that. Furthermore, we know—because we looked at your cloud context, we connected to your AWS API; we're big partners with AWS and very proud of that relationship—we can tell that there's inbound internet access available to that service, right? So, you start to build a compound case that maybe this is something that should be prioritized, right? Because there's a way into the asset from the outside world, there's a way into the vulnerable functions through the labyrinthine, you know, spaghetti of my code to get there, and the conditions required to exploit it actually exist in the wild.But you can't just run a single tool; you can't just run Dependabot to get that prioritization. You actually have to look at the entire holistic application context, which includes not just your dependencies, but what's happening in the container, what's happening in your first-party, your proprietary code, what's happening in your IaC, and, I think most importantly for modern applications, what's actually happening in the cloud once it gets deployed, right? And that's sort of the holy grail of completing that loop to bring the right context back from the cloud into code to understand what change needs to be made, and where, and most importantly, why. Because it's a priority that actually translates into organizational risk, to get a developer to pay attention, right?
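The compound case Clinton is building here, where severity alone is noise until it is combined with reachability, deployment, and exposure, can be sketched as a toy scoring model. Everything in it is hypothetical for illustration: the field names, the weights, and the example findings are invented, and none of it is Snyk's actual data model or scoring.

# prioritize.py - a toy model of compound risk prioritization (illustrative only;
# field names, weights, and findings are hypothetical, not Snyk's actual scoring).
from dataclasses import dataclass

@dataclass
class Finding:
    package: str
    cvss: float            # base severity, 0-10
    reachable: bool        # static analysis says the vulnerable method is actually called
    deployed: bool         # the artifact actually runs somewhere
    internet_facing: bool  # cloud context shows ingress from the internet

def priority(finding: Finding) -> float:
    # Severity is only a starting point; context multiplies it up or down.
    score = finding.cvss
    score *= 1.0 if finding.reachable else 0.2        # unreachable code is mostly noise
    score *= 1.0 if finding.deployed else 0.1         # images rotting in a registry can wait
    score *= 1.5 if finding.internet_facing else 1.0  # internet exposure raises the stakes
    return score

findings = [
    Finding("spring-beans", 9.8, True, True, True),        # all conditions line up
    Finding("markdown-parser", 7.5, False, True, False),   # Corey's newsletter case: noise
    Finding("openssl-in-unused-image", 8.0, True, False, False),
]
for f in sorted(findings, key=priority, reverse=True):
    print(f"{priority(f):5.1f}  {f.package}")

The particular numbers don't matter; the point is that a finding only rises to the top when several layers of context agree, which is exactly why a single-layer scanner can't produce the same prioritization.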
I mean, that, I think, is the key to any security concern: how do you get engineering mindshare and trust that this is actually what you should be paying attention to, and not a bunch of rework that doesn't actually make your software more secure?Corey: One of the challenges that I see across the board is that—well, let's back up a bit here. I have in previous episodes talked in some depth about my position that when it comes to the security of various cloud providers, Google is number one, and AWS is number two. Azure is a distant third because it figures out which Crayons taste the best; I don't know. But the reason is not because of any inherent attribute of their security models, but rather that Google massively simplifies an awful lot of what happens. It automatically assumes that resources in the same project should be able to talk to one another, so I don't have to painstakingly configure that.In AWS-land, all of this must be done explicitly; no one has time for that, so we over-scope permissions massively and never go back and rein them in. It's a configuration vulnerability more than an underlying inherent weakness of the platform. Because complexity is the enemy of security in many respects. If you can't fit it all in your head to reason about it, how can you understand the security ramifications of it? AWS offers a tremendous number of security services. Many of them, when taken in some totality of their pricing, cost more than any breach they could be expected to prevent. Adding more stuff that adds more complexity in the form of Snyk sounds like it's the exact opposite of what I would want to do. Change my mind.Clinton: I would love to. I would say, fundamentally, I think you and I—and by 'I,' I mean Snyk and, you know, Corey Quinn Enterprises Limited—I think we fundamentally have the same enemy here, right, which is the cyclomatic complexity of software, right, which is how many different pathways the bits have to travel down to reach the same endpoint, right, the same goal. The more pathways there are, the more risk is introduced into your software, and the more inefficiency is introduced, right? And then, I know you'd love to talk about how many different ways there are to run a container on AWS, right? It's either 30 or 400 or eleventy-million.I think you're exactly right that that complexity is great for, first of all, selling cloud resources, but also, I think, for innovating, right, for building new kinds of technology on top of that platform. The cost that comes along with that is a lack of visibility. And I think we are just now, as we approach the end of 2022 here, coming to recognize that, fundamentally, the complexity of modern software is beyond the ability of a single engineer to understand. And that is really important from a security perspective, from a cost control perspective, especially because software now creates its own infrastructure, right? You can't now just secure the artifact and secure the perimeter that it gets deployed into and say, "I've done my job.
Nobody can breach the perimeter and there's no vulnerabilities in the thing because we scanned it and that thing is immutable forever because it's pets, not cattle." Where I think the complexity story comes in is to recognize, like, "Hey, I'm deploying this based on a quickstart or CloudFormation template that is making certain assumptions that make my job easier," right, in a very similar way that choosing an open-source dependency makes my job easier as a developer because I don't have to write all of that code myself. But what it does mean is I lack the visibility into, well, hold on. How many different pathways are there for getting things done inside this dependency? How many other dependencies are brought on board? In the same way that when I create an EKS cluster, for example, from a CloudFormation template, what is it creating in the background? How many VPCs are involved? What are the subnets, right? How are they connected to each other? Where are the potential ingress points? So, I think fundamentally, getting visibility into that complexity is step number one, but understanding those pathways and how they could potentially translate into risk is critically important. But that prioritization has to involve looking at the software holistically and not just individual layers, right? I think we lose when we say, "We ran a static analysis tool and an open-source dependency scanner and a container scanner and a cloud config checker, and they all came up green, therefore the software doesn't have any risks," right? That ignores the fundamental complexity in that all of these layers are connected together. And from an adversary's perspective, if my job is to go in and exploit software that's hosted in the cloud, I absolutely do not see the application model that way. I see it as inherently complex, and that's a good thing for me because it means I can rely on the fact that those engineers had tremendous anxiety, were making a lot of guesses, and were crossing their fingers hoping something would work and not be exploitable by me, right? So, the only way I think we get around that is to recognize that our engineers are critical stakeholders in that security process, and you fundamentally lack that visibility if you don't do your scanning until after the fact, if you take that traditional audit-based approach that assumes a very waterfall, legacy approach to building software. We have to recognize that, hey, we're all on this infinite-loop race track now. We're deploying every three-and-a-half seconds, everything's automated, it's all built at scale, but the ability to do that inherently implies all of this additional complexity that ultimately will, you know, end up haunting me, right, if I don't do anything about it to make my engineers stakeholders in, you know, what actually gets deployed and what risks it brings on board.Corey: This episode is sponsored in part by our friends at Uptycs. Attackers don't think in silos, so why would you have siloed solutions protecting cloud, containers, and laptops distinctly? Meet Uptycs - the first unified solution that prioritizes risk across your modern attack surface—all from a single platform, UI, and data model. Stop by booth 3352 at AWS re:Invent in Las Vegas to see for yourself and visit uptycs.com. That's U-P-T-Y-C-S.com. My thanks to them for sponsoring my ridiculous nonsense.Corey: When I wind up hearing you talk about this—I'm going to divert us a little bit because you're dancing around something that it took me a long time to learn.
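Before that diversion, one hedged way to get the post-deployment visibility Clinton describes on AWS is simply to ask CloudFormation what a stack actually created; the stack name here is hypothetical:

```python
# Hedged sketch (the stack name is hypothetical): enumerate what a
# CloudFormation stack actually created, so a quickstart stops being a black box.
import boto3

cfn = boto3.client("cloudformation")
resources = cfn.describe_stack_resources(StackName="my-eks-quickstart")
for r in resources["StackResources"]:
    print(f'{r["ResourceType"]:40} {r["LogicalResourceId"]}')
# Output on an EKS quickstart typically surfaces the hidden sprawl:
# AWS::EC2::VPC, several AWS::EC2::Subnet entries, AWS::EC2::InternetGateway,
# AWS::EC2::SecurityGroup, AWS::EKS::Cluster, and more.
```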
When I first started fixing AWS bills for a living, I thought that it would be mostly math, by which I mean arithmetic. That's the great secret of cloud economics. It's addition, subtraction, and occasionally multiplication and division. No, turns out it's much more psychology than it is math. You're talking in many aspects about, I guess, what I'd call the psychology of a modern cloud engineer and how they think about these things. It's not a technology problem. It's a people problem, isn't it?Clinton: Oh, absolutely. I think it's the people that create the technology. And I think the longer you persist in what we would call the legacy viewpoint, right, not recognizing what the cloud is—which is fundamentally just software all the way down, right? It is abstraction layers that allow you to ignore the fact that you're running stuff on somebody else's computer—once you recognize that, you realize, oh, if it's all software, then the problems that it introduces are software problems that need software solutions, which means that it must involve activity by the people who write software, right? So, now that you're in that developer world, it unlocks, I think, a lot of potential to say, well, why don't developers tend to trust the security tools they've been provided with, right? I think a lot of it comes down to the question you asked earlier in terms of the noise, the lack of understanding of how those pieces are connected together, or the lack of context, or not, frankly, even caring about looking beyond the single-point solution of the problem that solution was designed to solve. But more importantly than that, not recognizing what it's like to build modern software, right, all of the decisions that have to be made on a daily basis with very limited information, right? I might not even understand where that container image I'm building is going in the universe, let alone what's being built on top of it and how much critical customer data is being touched by the database that that container now has the credentials to access, right? So, I think in order to change anything, we have to back way up and say, problems in the cloud are software problems and we have to treat them that way. Because if we don't, if we continue to represent the cloud as some evolution of the old environment where you just have this perimeter that's pre-existing infrastructure that you're deploying things onto, and there's a guy with a neckbeard in the basement who is unplugging cables from a switch and plugging them back in and that's how networking problems are solved, I think you miss the idea that all of these abstraction layers introduce the very complexity that needs to be solved back in the build space. But that requires visibility into what actually happens when it gets deployed. The way I tend to think of it is, there's this firewall in place. Everybody wants to say, you know, we're doing DevOps or we're doing DevSecOps, right? And that's a lie a hundred percent of the time, right? No one is actually, I think, adhering completely to those principles.Corey: That's why one of the core tenets of ClickOps is lying about doing anything in the console.Clinton: Absolutely, right? And that's why shadow IT becomes more and more prevalent the deeper you get into modern development, not less and less prevalent, because it's fundamentally hard to recognize the entirety of the potential implications, right, of a decision that you're making.
So, it's a lot easier to just go in the console and say, "Okay, I'm going to deploy one EC2 to do this. I'm going to get it right at some point." And that's why every application that's ever been produced by human hands has a comment in it that says something like, "I don't know why this works but it does. Please don't change it." And then three years later, because that developer has moved on to another job, someone else comes along and looks at that comment and says, "That should really work. I'm going to change it." And they do, and everything fails, and they have to go back and fix it the original way and then add another comment saying, "Hey, this person above me, they were right. Please don't change this line." I think every engineer listening right now knows exactly where that weak spot is in the applications that they've written and they're terrified of that. And I think any tool that's designed to help developers fundamentally has to get into the mindset, get into the psychology of what that is like: of not fundamentally being able to understand what those applications are doing all of the time, but having to write code against them anyway, right? And that's what leads to, I think, the fear that you're going to get woken up because your pager is going to go off at 3 a.m. because the building is literally on fire and it's because of code that you wrote. We have to solve that problem and it has to be those people whose psychology we get into to understand, how are you working and how can we make your life better, right? And I really do think it comes with the noise reduction, the understanding of complexity, and really just being humble and saying, like, "We get that this job is really hard and that the only way it gets better is to begin admitting that to each other."Corey: I really wish that there were a better way to articulate a lot of these things. This is the reason that I started doing a security newsletter; it's because cost and security are deeply aligned in a few ways. One of them is that you care about them a lot right after you failed to care about them sufficiently, but the other is that you've got to build guardrails in such a way that doing the right thing is easier than doing it the wrong way, or you're never going to gain any traction.Clinton: I think that's absolutely right. And you use the key term there, which is guardrails. And I think that's where, in their heart of hearts, that's where every security professional wants to be, right? They want to be defining policy, they want to be understanding the risk posture of the organization and nudging it in a better direction, right? They want to be talking up to the board, to the executive team, and creating confidence in that risk posture, rather than talking down or off to the side—depending on how that org chart looks—to the engineers and saying, "Fix this, fix that, and then fix this other thing." A, B, and C, right? I think the problem is that everyone in a security role at an organization of any size, at this point, is doing 90% of the latter and only about 10% of the former, right? They're acting as gatekeepers, not as guardrails. They're not defining policy; they're spending all of their time creating Jira tickets and all of their time tracking down who owns the piece of code that got deployed to this pod on EKS that's throwing all these errors on my console, and how can I get the person to make a decision to actually take an action that stops these notifications from happening, right?
So, all they're doing is throwing footballs down the field without knowing if there's a receiver there, right, and I think that takes away from the job that our security analysts really should be doing, which is creating those guardrails, which is having confidence that the policy they set is readily understood by the developers making decisions, and that's happening in an automated way without them having to create friction by bothering people all the time. I don't think security people want to be [laugh] hated by the development teams that they work with, but they are. And the reason they are is, I think, fundamentally, we lack the tooling, we lack—Corey: They are the barrier method.Clinton: Exactly. And we lack the processes to get the right intelligence in a way that's consumable by the engineers when they're doing their job, and not after the fact, which is typically when the security people have done their jobs.Corey: It's sad but true. I wish that there were a better way to address these things, and yet here we are.Clinton: If only there were a better way to address these things.Corey: [laugh].Clinton: Look, I wouldn't be here at Snyk if I didn't think there were a better way, and I wouldn't be coming on shows like yours to talk to the engineering communities, right, people who have walked the walk, right, who have built those Terraform files that contain these misconfigurations, not because they're bad people or because they're lazy, or because they don't do their jobs well, but because they lacked the visibility, they didn't have the understanding that that default is actually insecure. Because how would I know that otherwise, right? I'm building software; I don't see myself as an expert on infrastructure, right, or on Linux packages or on cyclomatic complexity or on any of these other things. I'm just trying to stay in my lane and do my job. It's not my fault that the software has become too complex for me to understand, right? But my management doesn't understand that and so I constantly have white knuckles worrying that, you know, the next breach is going to be my fault. So, I think the way forward really has to be, how do we make our developers stakeholders in the risk being introduced by the software they write to the organization? And that means everything we've been talking about: it means prioritization; it means understanding how the different layers of the stack affect each other, especially the cloud pieces; it means an extensible platform that lets me write code against it to inject my own reasoning, right? The piece that we haven't talked about here is that risk calculation doesn't just involve technical aspects; there's also business intelligence that's involved, right? What are my critical applications, right, what actually causes me to lose significant amounts of money if those services go offline? We at Snyk can't tell that. We can't run a scanner to say these are your crown jewel services that can't ever go down, but you can know that as an organization. So, where we're going with the platform is opening up the extensible process, creating APIs for you to be able to affect that risk triage, right, so that, as the creators of guardrails, as the security team, you are saying, "Here's how we want our developers to prioritize.
Here are all of the factors that go into that decision-making." And then you can be confident that in their environment, back over in developer-land, when I'm looking at IntelliJ or, you know, at my local command line, I am seeing the guardrails that my security team has set for me and I am confident that I'm fixing the right thing, and frankly, I'm grateful because I'm fixing it at the right time and I'm doing it in such a way, and with a toolset, that actually is helping me fix it rather than just telling me I've done something wrong, right, because everything we do at Snyk focuses on identifying the solution, not necessarily identifying the problem. It's great to know that I've got an unencrypted S3 bucket, but it's a whole lot better if you give me the line of code and tell me exactly where I have to copy and paste it so I can go on to the next thing, rather than spending an hour trying to figure out, you know, where I put that line and what I actually have to change it to, right? I often say that the most valuable currency for a developer, for a software engineer, it's not money, it's not time, it's not compute power or anything like that; it's the right context, right? I actually have to understand what the implications are of the decision that I'm making, and I need that to be in my own environment, not after the fact, because what creates friction within an organization is when I could have known earlier and I could have known better, but instead I had to guess, I had to write a bunch of code that relies on the thing that was wrong, and now I have to redo it all for no good reason other than the tooling just hadn't adapted to the way modern software is built.Corey: So, one last question before we wind up calling it a day here. We are now heavily into what I will term pre:Invent, where we're starting to see a whole bunch of announcements come out of the AWS universe in preparation for what I'm calling Crappy Cloud Hanukkah this year because I'm spending eight nights in Las Vegas. What are you doing these days with AWS specifically? I know I keep seeing your name in conjunction with their announcements, so there's something going on over there.Clinton: Absolutely. No, we're extremely excited about the partnership between Snyk and AWS. Our vulnerability intelligence is utilized as one of the data sources for Amazon Inspector, particularly around open-source packages. We're doing a lot of work around things like the code suite, building Snyk into CodePipeline, for example, to give developers using that code suite earlier visibility into those vulnerabilities. And really, I think the story kind of expands from there, right? So, we're moving forward with Amazon, recognizing that it is, you know, sort of the de facto standard. When we say cloud, very often we mean AWS. So, we're going to have a tremendous presence at re:Invent this year; I'm going to be there as well. I think we're actually going to have a bunch of handouts with your face on them, is my understanding. So, please stop by the booth; we would love to talk to folks, especially because we've now released the Snyk Cloud product and really completed that story. So, anything we can do to talk about how that additional context of the cloud helps engineers, because it's all software all the way down, those are absolutely conversations we want to be having.Corey: Excellent. And we will, of course, put links to all of these things in the [show notes 00:35:00] so people can simply click, and there they are.
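To ground the unencrypted S3 bucket example from a moment ago, here is a sketch of what that copy-paste fix can look like using boto3; the bucket name and the choice of aws:kms are assumptions for illustration:

```python
# Illustrative only (bucket name is hypothetical): the "here's the exact fix"
# experience for an unencrypted S3 bucket, as the boto3 call that turns on
# default server-side encryption for every new object.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-unencrypted-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```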
Thank you so much for taking all this time to speak with me. I appreciate it.Clinton: All right. Thank you so much, Corey. Hope to do it again next year.Corey: Clinton Herget, Field CTO at Snyk. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment telling me that I'm being completely unfair to Azure, along with your favorite tasting color of Crayon.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

AWS Morning Brief
How To Learn Something New: Kubernetes The Much Harder Way

AWS Morning Brief

Play Episode Listen Later Nov 16, 2022 8:27


Want to give your ears a break and read this as an article? You're looking for this link: https://www.lastweekinaws.com/blog/How-To-Learn-Something-New-Kubernetes-the-Much-Harder-Way
Want to watch the full dramatic reenactment of this podcast? Watch the YouTube video here: https://youtu.be/bpp5tpgU6CE
Never miss an episode: Join the Last Week in AWS newsletter; Subscribe wherever you get your podcasts
Help the show: Leave a review; Share your feedback; Subscribe wherever you get your podcasts; Buy our merch at https://store.lastweekinaws.com
What's Corey up to? Follow Corey on Twitter (@quinnypig); See our recent work at the Duckbill Group; Apply to work with Corey and the Duckbill Group to help lower your AWS bill

Screaming in the Cloud
The Non-Magical Approach to Cloud-Based Development with Chen Goldberg

Screaming in the Cloud

Play Episode Listen Later Nov 15, 2022 40:13


About ChenChen Goldberg is GM and Vice President of Engineering at Google Cloud, where she leads the Cloud Runtimes (CR) product area, helping customers deliver greater value, effortlessly. The CR portfolio includes both Serverless and Kubernetes based platforms on Google Cloud, private cloud and other public clouds. Chen is a strong advocate for customer empathy, building products and solutions that matter. Chen has been core to Google Cloud's open core vision since she joined the company six years ago. During that time, she has led her team to focus on helping development teams increase their agility and modernize workloads. Prior to joining Google, Chen wore different hats in the tech industry including leadership positions in IT organizations, SI teams and SW product development, contributing to Chen's broad enterprise perspective. She enjoys mentoring IT talent both in and outside of Google. Chen lives in Mountain View, California, with her husband and three kids. Outside of work she enjoys hiking and baking.Links Referenced: Twitter: https://twitter.com/GoldbergChen LinkedIn: https://www.linkedin.com/in/goldbergchen/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.Basically you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive? Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale.Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. When I get bored and the power goes out, I find myself staring at the ceiling, figuring out how best to pick fights with people on the internet about Kubernetes. Because, well, I'm basically sad and have a growing collection of personality issues. My guest today is probably one of the best people to have those arguments with. Chen Goldberg is the General Manager of Cloud Runtimes and VP of Engineering at Google Cloud. Chen, thank you for joining me today.Chen: Thank you so much, Corey, for having me.Corey: So, Google has been doing a lot of very interesting things in the cloud, and the more astute listener will realize that interesting is not always necessarily a compliment. But from where I sit, I am deeply vested in the idea of a future where we do not have a cloud monoculture. As I've often said, I want, "What cloud should I build something on in five to ten years?" to be a hard question to answer, and not just because everything is terrible.
I think that Google Cloud is absolutely a bright light in the cloud ecosystem and has been for a while, particularly with this emphasis around developer experience. All of that said, Google Cloud is sort of a big, unknowable place, at least from the outside. What is your area of responsibility? Where do you start? Where do you stop? In other words, what can I blame you for?Chen: Oh, you can blame me for a lot of things if you want to. I [laugh] might not agree with that, but that's—Corey: We strive for accuracy in these things, though.Chen: But that's fine. Well, first of all, I joined Google about seven years ago to lead the Kubernetes and GKE team, and ever since, I've continued in the same area. So, evolving, of course, Kubernetes and Google Kubernetes Engine, and leading our hybrid and multi-cloud strategy as well with technologies like Anthos. And now I'm responsible for the entire container runtime, which includes Kubernetes and the serverless solutions.Corey: A while back, I, in fairly typical sarcastic form, wound up doing a whole inadvertent start of a meme where I joked about there being 17 ways to run containers on AWS. And then as that caught on, I wound up listing out 17 services you could use to do that. A few months went past and then I published a sequel of 17 more services you can use to run Kubernetes. And while that was admittedly tongue-in-cheek, it does lead to an interesting question that's ecosystem-wide. If I look at Google Cloud, I have Cloud Run, I have GKE, I have GCE if I want to do some work myself. It feels like more and more services are supporting Docker in a variety of different ways. How should customers and/or people like me—though, I am sort of a customer as well since I do pay you folks every month—how should we think about containers and services in which to run them?Chen: First of all, I think there's a lot of credit that needs to go to Docker for making containers approachable. And so, Google has been running containers forever. Everything within Google is running on containers, even our VMs, even our cloud is running on containers, but what Docker did was create a packaging mechanism to improve developer velocity. So, that on its own is great. And one of the things, by the way, that I love about Google Cloud's approach to containers and Docker is that, yes, you can take your Docker container and run it anywhere. And it's actually really important to ensure what we call interoperability, or a low barrier to entry to a new technology. So, I can take my Docker container, I can move it from one platform to another, and so on. So, that's just to start with on containers. Between the different solutions: so first of all, I'm all about managed services. You are right, there are many ways to run Kubernetes. I'm taking a lot of pride—Corey: The best way is always to have someone else run it for you. Problem solved. Great, the best kind of problems are always someone else's.Chen: Yes. And I'm taking a lot of pride in what our team is doing with Kubernetes. I mean, we've been working on that for so long. And it's something that, you know, we coined that term, I think back in 2016: there is a success disaster, but there's also what we call sustainable success. So, thinking about how to set ourselves up for success and scale. Very proud of that service. That said, not everybody, and not all your workloads, needs the flexibility that Kubernetes and all its ecosystem give you. So, if you start with containers your first time, you should start with Cloud Run.
It's the easiest way to run your containers. That's one. If you are already in love with Kubernetes, we won't take it away from you. Start with GKE. Okay [laugh]? Go all-in. Okay, we are all in loving Kubernetes as well. But what my team and I are working on is to make sure that those will work really well together. And we actually see a lot of customers do that.Corey: I'd like to go back a little bit in history to the rise of Docker. I agree with you it was transformative, but containers had been around in various forms—depending upon how you want to define it—dating back to the '70s with logical partitions on mainframes. Well, is that a container? Is it not? Well, sort of. We'll assume yes for the sake of argument. The revelation that I found from Docker was the developer experience, start to finish. Suddenly, it was a couple commands and you were just working, where previously it had taken tremendous amounts of time and energy to get containers working in that same context. And I don't even know today whether or not the right way to contextualize containers is as sort of a lite version of a VM, as a packaging format, or as any number of other things that you could reasonably call it. How do you think about containers?Chen: So, I'm going to do, first of all, a small [unintelligible 00:06:31]. I actually started my career as a system mainframe engineer—Corey: Hmm.Chen: And I will share that when, you know, I learned Kubernetes, I was like, "Huh, we already have done all of that, in orchestration, in workload management on mainframes," just as an aside. The way I think about containers is as two things: one, it is a packaging of an application, but the other thing, which is also critical, is the decoupling between your application and the OS. So, having that kind of abstraction allows you to be portable and move it between environments. So, those are the two things that come to mind when I think about containers. And what technologies like Kubernetes and serverless give on top of that is that manageability, making sure that we take care of everything else that is needed for you to run your application.Corey: I've been, how do I put this, getting some grief over the past few years, in the best ways possible, around an almost off-the-cuff prediction that I made, which was that in five years, which is now a lot closer to two, basically, nobody is going to care about Kubernetes. And I could have phrased that slightly more directly because people think I was trying to say, "Oh, Kubernetes is just hype. It's going to go away. Nobody's going to worry about it anymore." And I think that is a wildly inaccurate prediction. My argument is that people are not going to have to think about it in the same way that they are today. Today, if I go out and want to go back to my days of running production services in anger—and by 'anger,' I of course mean in production—then it would be difficult for me to find a role that did not at least touch upon Kubernetes. But people who can work with that technology effectively are in high demand and they tend to be expensive, not to mention then thinking about all of the intricacies and complexities that Kubernetes brings to the foreground; that is what doesn't feel sustainable to me. The idea that it's going to have to collapse down into something else is, by necessity, going to have to emerge. How are you seeing that play out? And also, feel free to disagree with the prediction.
I am thrilled to wind up being told that I'm wrong; it's how I learn the most.Chen: I don't know if I agree with the time horizon of when that will happen, but I will actually think it's a failure on us if that doesn't become the truth, that the majority of people will not need to know about Kubernetes and its internals. And you know, we keep saying that, like, hey, we need to make it more, like, boring, and easy, and I've just said, like, "Hey, you should use managed." And we have lots of customers that say that they're just using GKE and it scales on their behalf and they don't need to do anything for that, and it's just like magic. But from a technology perspective, there is still a way to go until we can make that disappear. And there will be two things that will push us in that direction. One is—you mentioned that as well—the talent shortage is real. All the customers that I speak with, even if they can find those great people that are experts, there are actually more interesting things for them to work on, okay? You don't need to take, like, all the people in your organization and put them on building the infrastructure. You don't care about that. You want to build innovation and promote your business. So, that's one. The second thing is that I do expect that the technology will continue to evolve and our managed solutions will be better and better. So hopefully, with these two things happening together, people will not care that what's under the hood is Kubernetes. Or maybe not even, right? I don't know exactly how things will evolve.Corey: From where I sit, the early criticisms I had about Docker, which I guess translate pretty well to Kubernetes, are that they solve a few extraordinarily painful problems. In the case of Docker, it was, "Well, it works on my machine," as a grumpy sysadmin, the way I used to be; the only real response we had to that was, "Well. Time to back up your email, Skippy, because your laptop is going into production, then." Now, you can effectively have a high-fidelity copy of production basically anywhere, and we've solved the problem of making your Mac laptop look like a Linux server. Great, okay, awesome. With Kubernetes, it also feels, on some level, like it solves for very large-scale Google-type of problems where you want to run things across at least a certain point of scale. It feels like even today, it suffers from not having an easy Hello World-style application to deploy on top of it. Using it for WordPress, or some other form of blogging software, for example, is stupendous overkill as far as the Hello World story tends to go. Increasingly as a result, it feels like it's great for the large-scale enterprise-y applications, but the getting started story of how do I have a service I could reasonably run in production? How do I contextualize that, in the world of Kubernetes? How do you respond to that type of perspective?Chen: We'll start with maybe a short story. I started my career in the Israeli army. I was head of a department in one of the lead technology units and I was responsible for building a PaaS. In essence, it was 20-plus years ago, so we didn't really call it a PaaS, but that's what it was. And then at some point, it was amazing: developers were very productive, we got innovation again and again. And then there was some new innovation, just at the beginning of the web [laugh], at some point. And it was actually—so, two things I noticed back then.
One, it was really hard to evolve the platform to allow new technologies and innovation, and the second thing, from a developer perspective, it was like a black box. So, the developers—the other development teams—couldn't really troubleshoot the environment; they were not empowered to make decisions or [unintelligible 00:12:29] in the platform. And you know, when we just started with Kubernetes—by the way, in the beginning, it only supported 100 nodes, and then 1,000 nodes—okay, it was actually not for scale; it actually solved those two problems, which I'm—this is where I spend most of my time. So, the first one, we don't want magic, okay? To be clear on, like, what's happening, I want to make sure that things are consistent and I can get the right observability. So, that's one. The second thing is that we invested so much in the extensibility of the environment that, I wouldn't say it's easy, but it's doable to evolve Kubernetes. You can change the models, you can extend it, you can—there is an ecosystem. And you know, when we were building it, I remember I used to tell my team, there won't be a Kubernetes 2.0. Which, for a developer, is [laugh] frightening. But if you think about it and you prepare for that, you're like, "Huh. Okay, what does that mean for how I build my APIs? What does that mean for how we build a system?" So, that was one. The second thing I keep telling my team: "Please don't get too attached to your code, because if it's still there in 5, 10 years, we did something wrong." And you can see areas within Kubernetes, again, all the extensions. I'm very proud of all the interfaces that we've built, but let's take networking. This keeps evolving all the time in the API and the surface area, which allows us to introduce new technologies. I love it. So, those are the two things that have nothing to do with scale, are unique to Kubernetes, and I think are very empowering and critical for its success.Corey: One thing that you said that resonates most deeply with me is the idea that you don't want there to be magic, where I just hand it to this thing and it runs it as if by magic. Because, again, we've all run things in anger in production, and what happens when the magic breaks? When you're sitting around scratching your head with no idea how it starts or how it stops, that is scary. I mean, I recently wound up re-implementing Google Cloud Distinguished Engineer Kelsey Hightower's "Kubernetes the Hard Way" because he gave a terrific tutorial that I ran through in about 45 minutes on top of Google Cloud. It's like, "All right, how do I make this harder?" And the answer is to do it on AWS, re-implement it there. And my experiment there can be found at kubernetesthemuchharderway.com because I have a vanity domain problem. And it taught me an awful lot, but one of the challenges I had as I went through that process was, at one point, the nodes were not registering with the controller. And I ran out of time that day and turned everything off—because surprise bills are kind of what I spend my time worrying about—turned it on the next morning to continue, and then it just worked. And that was sort of the spidey-sense-tingling moment of, "Okay, something wasn't working and now it is, and I don't understand why. But I just rebooted it and it started working." Which is terrifying in the context of a production service.
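As a side note, the node-registration check from that experiment can be scripted; a minimal sketch using the official Kubernetes Python client, assuming a working kubeconfig, looks like this:

```python
# Minimal sketch, assuming a working kubeconfig: list nodes and their Ready
# condition via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()
nodes = v1.list_node().items
if not nodes:
    print("No nodes registered: check kubelet logs, bootstrap credentials, "
          "and API server reachability from the workers.")
for node in nodes:
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"{node.metadata.name}: Ready={ready}")
```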
It was understandable—kind of—and I think that's the sort of thing that you understand a lot better the more you work with it in production, but a counterargument to that is—and I've talked about it on this show before—for this podcast, I wind up having sponsors from time to time who want to give me fairly complicated links to go check them out, so I have the snark.cloud URL redirector. That's running as a production service on top of Google Cloud Run. It took me half an hour to get that thing up and running; I haven't had to think about it since, aside from a three-second latency that was driving me nuts and turned out to be a sleep hidden in the code, which I can't really fault Google Cloud Run for so much as my crappy nonsense. But it just works. It's clearly running atop Kubernetes, but I don't have to think about it. That feels like the future. It feels like it's a glimpse of a world to come, we're just starting to dip our toes into. That, at least to me, feels like a lot more of the abstractions being collapsed into something easily understandable.Chen: [unintelligible 00:16:30], I'm happy you say that. When talking with customers, and we're showing, like, you know, yes, they're all in on Kubernetes and talking about Cloud Run and serverless, I feel there is that confidence level that they need to overcome. And that's why it's really important for us in Google Cloud to make sure that you can mix and match. Because sometimes, you know, for a big retail customer of ours, some of their teams, it's really important for them to use a Kubernetes-based platform because they have their workloads also running on-prem and they want to use the same playbooks, for example, right? How do I address issues, how do I troubleshoot, and so on? So, that's one set of things. But some want cloud-only, as simple as possible. So, can I use both of them and still have a similar developer experience, and so on? So, I do think that we'll see more of that in the coming years. And as the technology evolves, then we'll have more and more, of course, serverless solutions. By the way, it doesn't end there. Like, we see also, you know, databases and machine learning, and like, there are so many more managed services that are making things easy. And that's what excites me. I mean, that's what's awesome about what we're doing in cloud. We are building platforms that enable innovation.Corey: I think that there's an awful lot of power behind unlocking innovation from a customer perspective. The idea that I can use a cloud provider to wind up doing an experiment to build something in the course of an evening, and if it works, great, I can continue to scale up without having to replace, you know, the crappy Raspberry Pi-level hardware in my spare room with serious enterprise servers in a data center somewhere. The on-ramp and the capability and the lack of long-term commitments are absolutely magical. What I'm also seeing contributing to that is the de facto standard that's emerged where most things these days support Docker, for better or worse. There are many open-source tools that I see where, "Oh, how do I get this up and running?" "Well, you can go over the river and through the woods and way past grandmother's house to build this from source, or run this Dockerfile." I feel like that is the direction the rest of the world is going. And as much fun as it is to sit on the sidelines and snark, I'm finding a lot more capability stories emerging across the board.
Does that resonate with what you're seeing, given that you are inherently working at very large scale, given the [laugh] nature of where you work?Chen: I do see that. And I actually want to double down on the open standards, which I think is also something that is happening. At the beginning, we talked about how I want it to be a very hard question when I choose a cloud provider. But innovation doesn't only come from cloud providers; there are a lot of companies and a lot of innovation happening that are building new technologies on top of those cloud providers, and I don't think this is going to stop. Innovation is going to come from many places, and it's going to be very exciting. And by the way, things are moving super fast in our space. So, the investment in open standards is critical for our industry. So, Docker is one example. Google is, in [unintelligible 00:19:46] speaking, investing a lot in building those open standards. So, we have Docker, we have things like, of course, Kubernetes, but we are also investing in open standards of security, so we are working with other partners around [unintelligible 00:19:58], defining how you can secure the software supply chain, which is also critical for innovation. So, all of those things that reduce the barrier to entry are something that I'm personally passionate about.Corey: Scaling containers and scaling Kubernetes is hard, but a whole 'nother level of difficulty is scaling humans. You've been at Google for, as you said, seven years and you did not start as a VP there. Getting promoted from Senior Director to VP at Google is a, shall we say, heavy lift. You also mentioned that you previously started with, I believe, it was a seven-person team at one point. How have you been able to do that? Because I can see a world in which, "Oh, we just write some code and we can scale the computers pretty easily," I've never found a way to do that for people.Chen: So yes, I started actually—well, not 7, but the team was 30 people [laugh]. And you can imagine how surprised I was when I joined Google Cloud, with Kubernetes and GKE, and it was a pretty small team, at the beginning of those days. But the team was already actually on the edge of burning out. You know, pings on Slack, the GitHub issues, there were so many things happening 24/7. And the thing was, everybody was just doing everything. And one of the things I did in my second month on the team—I did an off-site, right, all managers; that's what we do; we do off-sites—and I brought the team in—the leadership team—to talk about our team values. And in the beginning, they were a little bit pissed, I would say: "Okay, Chen. What's going on? You're wasting two days of our lives to talk about those things. Why are we not doing other things?" And I was like, "You know guys, this is really important. Let's talk about what's important for us." It was amazing; it worked. By the way, that work is still the foundation of the culture in the team. We talked about the three values that we care about and what that would look like. And the reason it's important is that when you scale teams, the key thing is actually to scale decision-making. So, how do you scale decision-making? I think there are two things there. One is what you're trying to achieve. So, people should know and understand the vision and know where we want to get to. But the second thing is, how do we work? What's important for us? How do we prioritize? How do we make trade-offs?
And when you have both the what we're trying to do and the how, you build that team culture. And when you have that, I find that you're set up more for success for scaling the team. Because then the storyteller is not just the leader or the manager. The entire team is a storyteller of how things are working in this team, how do we work, what you're trying to achieve, and so on. So, that's something that has been critical. So, that's just, you know, the methodology of how I think it's right to scale teams. Specifically with Kubernetes, there were more issues that we needed to work on. For example, building or [recoding 00:23:05] different functions. It cannot be just engineering doing everything. So, hiring the first product managers and information engineers and marketing people, oh my God. Yes, you have to have marketing people because there are so many events. And so, that was one thing, just, you know, from people and skills. And the second thing is that it was an open-source project and a product, but what I was personally doing—with the team—is bringing some product engineering practices into the open-source. So, can we say, for example, that we are going to focus on user experience this next release? And we're not going to do all the rest. And I remember, my team was, like, worried about, like, "Hey, what about that, and what about this, and we have—" you know, they were juggling everything together. And I remember telling them, "Imagine that everything is on the floor. All the balls are on the floor. I know they're on the floor, you know they're on the floor. It's okay. Let's just make sure that every time we pick something up, it never falls again." And that idea is a principle that then evolved to 'No Heroics,' and it evolved to 'Sustainable Success.' But building things towards sustainable success is a principle which has been very helpful for us.Corey: This episode is sponsored in part by our friends at Uptycs. Attackers don't think in silos, so why would you have siloed solutions protecting cloud, containers, and laptops distinctly? Meet Uptycs - the first unified solution that prioritizes risk across your modern attack surface—all from a single platform, UI, and data model. Stop by booth 3352 at AWS re:Invent in Las Vegas to see for yourself and visit uptycs.com. That's U-P-T-Y-C-S.com. My thanks to them for sponsoring my ridiculous nonsense.Corey: When I take a look back, it's very odd to me to see the current reality that is Google, where you're talking about empathy, and No Heroics, and the rest; that is not the reputation that Google enjoyed back when a lot of this stuff got started. It was always, oh, engineers should be extraordinarily bright and gifted, and therefore it felt at the time like their customers should be as well. There was almost an arrogance built into, well, if you wrote your code more like Google does, then maybe your code wouldn't be so terrible in the cloud. And somewhat cynically, I thought for a while that, oh, Kubernetes is Google's attempt to wind up making the rest of the world write software in a way that's more Google-y. I don't think that observation has aged very well. I think it's solved a tremendous number of problems for folks. But the complexity has absolutely been high throughout most of Kubernetes' life. I would argue, on some level, that it feels like it's become successful almost in spite of that, rather than because of it. But I'm curious to get your take.
Why do you believe that Kubernetes has been as successful as it clearly has?Chen: [unintelligible 00:25:34] two things. One, about empathy. So yes, Google engineers are brilliant and amazing and all great. And our customers are amazing, and brilliant, as well. And going back to the point before: everyone has their job where they need to be successful, and we, as you say, need to make things simpler and enable innovation. And our customers are driving innovation on top of our platform. So, that's the way I think about it. And yes, it's not as simple as it can be—probably—yet, but since the early days of Kubernetes, we have been investing a lot in what we call empathy, with the customer empathy workshop, for example. So, I partnered with Kelsey Hightower—and you mentioned yourself trying to start a cluster. The first time we did a workshop with my entire team, so then it was like 50 people [laugh], their task was to spin up a cluster without using any scripts that we had internally. And unfortunately, not many folks succeeded in this task. And out of that came the—what do you call it—an OKR, which was our goal for that quarter: that you are able to spin up a cluster in three commands and troubleshoot if something goes wrong. Okay, that came out of that workshop. So, I do think that there is a lot of foundation in that empathetic engineering, and the open-source community helped our Google teams to be more empathetic and understand the different use cases that they are trying to solve. And that actually brings me to why I think Kubernetes is so successful. People might be surprised, but the amount of investment we're making on orchestration or placement of containers within Kubernetes is actually pretty small. And it's been very small for the last seven years. Where do we invest time? One, as I mentioned before, is on what we call the API machinery. So, Kubernetes has introduced a way that is really suitable for cloud-native technologies: the idea of the reconciliation loop, meaning that the way Kubernetes is—Kubernetes is, like, a powerful automation machine, which can automate, of course, workload placement, but can automate other things. Think about it this way: the Kubernetes API machinery is observing what the current state is, comparing it to the desired state, and working towards it. Think about, like, a thermostat, which is a different automation versus the 'if this, then that,' where you need to anticipate different events. So, this idea about the API machinery and the way that you can extend it made it possible for different teams to use that mechanism to automate other things in that space. So, that has been one very powerful mechanism of Kubernetes. And that enabled a lot of innovation; even if you think about things like Istio, as an example, that's how it started, by leveraging that kind of mechanism, to separate storage and so on. So, there are a lot of operators; the way people are managing their databases or stateful workloads on top of Kubernetes, they're extending this mechanism. So, that's one thing that I think is key and built that ecosystem. The second thing: I am very proud of the community of Kubernetes.Corey: Oh, it's a phenomenal community success story.Chen: It's not easy to build a community, definitely not in open-source. I feel that the idea of values, you know, that I was talking about within my team was actually a big deal for us as we were building the community: how we treat each other, how do we help people start?
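A toy version of the reconciliation loop Chen describes, with a simulated replica count standing in for real cluster state read from the API server, might look like this:

```python
# A toy reconciliation loop: observe current state, diff against desired state,
# act to converge. Real controllers watch the API server; this simulates it.
import random

desired_replicas = 3
cluster = {"replicas": 1}  # simulated cluster state

def reconcile_once() -> None:
    current = cluster["replicas"]
    if current < desired_replicas:
        cluster["replicas"] += 1   # "create a pod"
    elif current > desired_replicas:
        cluster["replicas"] -= 1   # "delete a pod"

for step in range(6):
    reconcile_once()
    print(f"step {step}: replicas={cluster['replicas']} (desired {desired_replicas})")
    if random.random() < 0.3:      # random drift, e.g. a node dies
        cluster["replicas"] = max(0, cluster["replicas"] - 1)
# Like a thermostat, the loop never needs to know why state drifted; it only
# compares observed state to desired state and nudges toward convergence.
```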
You know, and we were talking before, like, am I going to talk about DEI and inclusivity, and so on. One of the things that I love about Kubernetes is that it's a new technology. There is actually—[unintelligible 00:29:39] no, even today, there is no one with ten years' experience in Kubernetes. And if anyone says they have that, then they are lying.Corey: Time machine. Yes.Chen: That creates an opportunity for a lot of people to become experts in this technology. And by having it in open-source and making everything available, you can actually do it from your living room sofa. That excites me, you know, the idea that you can become an expert in this new technology and you can get involved, and you'll get people that will mentor you and help you through your first PR. And there are some roles within the community where you can start, you know, dipping your toes in the water. It's exciting. So, that makes me really happy, and I know that this community has changed the trajectory of many people's careers, which I love.Corey: I think that's probably one of the most impressive things that it's done. One last question I have for you is that we've talked a fair bit about the history and how we see it progressing through the view toward the somewhat recent past. What do you see coming in the future? What does the future of Kubernetes look like to you?Chen: Continuing to be more and more boring. The promise of hybrid and multi-cloud, for example, is only possible with technologies like Kubernetes. So, I do think that, as a technology, it will continue to be important by ensuring portability and interoperability of workloads. I see a lot of edge use cases. If you think about it, the edge is just lagging a bit behind, like, the innovation that we've seen in the cloud. Can we bring that innovation to the edge? This will require more development within the Kubernetes community as well. And that really actually excites me. I think there's a lot of things that we're going to see there. And by the way, you've seen it also at KubeCon. I mean, there were some announcements in that space. In Google Cloud, we just announced before, like, with customers like Wendy's and Rite Aid as well. So, taking advantage of this technology to allow innovation everywhere. But beyond that, my hope is that we'll continue to hide the complexity. And our challenge will be to not make it a black box. Because that will be, in my opinion, a failure pattern that doesn't help those kinds of platforms. So, that will be the challenge. Can we scope the project and ensure that we have the right observability? And from a use case perspective, I do think edge is super interesting.Corey: I would agree. There are a lot of workloads out there that are simply never going to be hosted in a cloud provider region, for a variety of reasons of varying validity, but it is the truth. I think that the focus on addressing customers where they are has been an emerging best practice for cloud providers and I'm thrilled to see Google leading the charge on that.Chen: Yeah. And you just reminded me, the other thing that we see also more and more is definitely AI and ML workloads running on Kubernetes, which is part of that, right? So, Google Cloud is investing a lot in making AI/ML easy. And I don't know if many people know, but, like, even Vertex AI, our own platform, is running on GKE.
So, that's part of seeing how we make sure that platform is suitable for these kinds of workloads and really helps customers do the heavy lifting. So, that's another set of workloads that are very relevant at the edge. And one of our customers—MLB, for example—two things are interesting there. The first one: I think a lot of people sometimes say, "Okay, I'm going to move to the cloud and I want to know everything right now, how that will evolve." And one of the things that's been really exciting in working with MLB for the last four years is the journey and the iterations. So, they started somewhat, like, at one phase and then they saw what's possible, and then moved to the next one, and so on. So, that's one. The other thing is that, really, they have so much ML running at the stadium with Google Cloud technology, which is very exciting.Corey: I'm looking forward to seeing how this continues to evolve and progress, particularly in light of the recent correction we're seeing in the market where a lot of hype-driven ideas are being stress tested, maybe not in the way we might have hoped that they would, but it'll be really interesting to see what shakes out as far as things that deliver business value and are clear wins for customers versus a lot of the speculative stories that we've been hearing for a while now. Maybe I'm totally wrong on this, and this is going to be a temporary bump in the road, and we'll see no abatement in the ongoing excitement around so many of these emerging technologies, but I'm curious to see how it plays out. But that's the beautiful part about getting to be a pundit—or whatever it is people call me these days that's at least polite enough to say on a podcast—is that when I'm right, people think I'm a visionary, and when I'm wrong, people don't generally hold that against you. It seems like futurist is the easiest job in the world because if you predict and get it wrong, no one remembers. Predict and get it right, you look like a genius.Chen: So, first of all, I'm optimistic. So usually, my predictions are positive. I will say that, you know, what we are seeing, also what I'm hearing from our customers: technology is not for the sake of technology. Actually, nobody cares [laugh]. Even today. Okay, so nothing needs to change for, like, nobody would c—even today, nobody cares about Kubernetes. They need to care, unfortunately, but what I'm hearing from our customers is, "How do we create new experiences? How do we make things easy?" Talent shortage is not just with tech people. It's also with people working in the warehouse or working in the store. Can we use technology to help inventory management? There are so many amazing things. So, when there is a real business opportunity, things are so much simpler. People have the right incentives to make it work. Because one thing we didn't talk about—right, we talked about all these new technologies and we talked about scaling teams and so on—a lot of the time, the challenge is not the technology. A lot of the time, the challenge is the process. A lot of the time, the challenge is the skills, is the culture; there are so many things. But when you have something—going back to what I said before—how you unite teams: when there's a clear goal, a clear vision that everybody's excited about, they will make it work. So, I think this is where having a purpose for the innovation is critical for any successful project.Corey: I think and I hope that you're right. I really want to thank you for spending as much time with me as you have.
If people want to learn more, where's the best place for them to find you?Chen: So, first of all, on Twitter. I'm there, or on LinkedIn. I will say that I'm happy to connect with folks. Generally speaking, at some point in my career, I recognized that I have a voice that can help people, and experience that can also help people build their careers. I'm happy to share that and [unintelligible 00:36:54] folks both in the company and outside of it.Corey: I think that's one of the obligations on a lot of us, once we've gotten to a certain position in our careers, to send the ladder back down, for lack of a better term. I've never appreciated the perspective of, "Well, screw everyone else. I got mine." The whole point is that the next generation should have it easier than we did.Chen: Yeah, definitely.Corey: Chen Goldberg, General Manager of Cloud Runtimes and VP of Engineering at Google. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry rant of a comment talking about how LPARs on mainframes are absolutely not containers, making sure it's at least far too big to fit in a reasonably-sized Docker container.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.