Podcasts about CloudWatch

  • 50 PODCASTS
  • 128 EPISODES
  • 37m AVG DURATION
  • 1 MONTHLY NEW EPISODE
  • Feb 6, 2025 LATEST

POPULARITY (trend chart, 2017–2024)


Best podcasts about CloudWatch

Latest podcast episodes about CloudWatch

The Cloud Pod
290: Open AI to Operator: There is a DeepSeek Outside the Door

The Cloud Pod

Feb 6, 2025 · 70:12


Welcome to episode 290 of The Cloud Pod – where the forecast is always cloudy! It's a full house this week – and a good thing too, since there's a lot of news! Justin, Jonathan, Ryan, and Matthew are all in the house to bring you news on DeepSeek, OpenVox, CloudWatch, and more.

Titles we almost went with this week:
  • The Cloud Pod wonders if Azure is still hung over from New Year's
  • Stratoshark sends The Cloud Pod to the stratosphere
  • Cutting-Edge Chinese "Reasoning" Model Rivals OpenAI… and it's FREE?!
  • Wireshark turns 27, Cloud Pod hosts feel old
  • Operator: DeepSeek is here to kill OpenAI
  • Time for a deepthink on buying all that Nvidia stock
  • AWS Token Service finally goes cloud native
  • The Cloud Pod wonders if OpenAI's Operator can order its own $200 subscription

A big thanks to this week's sponsor: We're sponsorless! Want to get your brand, company, or service in front of a very enthusiastic group of cloud news seekers? You've come to the right place! Send us an email or hit us up on our Slack channel for more info.

AI Is Going Great – Or How ML Makes All Its Money

01:29 Introducing the GenAI Platform: Simplifying AI Development for All
If you're struggling to find AI GPU capacity, DigitalOcean is pleased to announce that the DigitalOcean GenAI Platform is now available to everyone. The platform aims to democratize AI development, empowering everyone – from solo developers to large teams – to leverage the transformative potential of generative AI. On the GenAI Platform you can build scalable AI agents, seamlessly integrate them with workflows, leverage guardrails, and optimize efficiency. Some of the use cases they are highlighting are chatbots, e-commerce assistance, support automation, business insights, AI-driven CRMs, personalized learning, and interactive tools.

02:23 Jonathan – "Inference cost is really the big driver there. So once you build something, that part's done, but it's nice to see somebody focusing on delivering it as a service rather than, you know, $50-an-hour compute for training models. This is right where they need to be."

04:21 OpenAI: Introducing Operator
We have thoughts about the name of this service… OpenAI is releasing the preview version of their agent that can use a web browser to perform tasks for you. The new version is available to OpenAI Pro users. OpenAI says it's currently a research preview, meaning it has limitations and will evolve based on your feedback. Operator can handle various browser tasks such as filling out forms, ordering groceries, and even creating memes.

The Six Five with Patrick Moorhead and Daniel Newman
Unlocking Cloud Efficiency: AWS Reveals AI-Driven Operations - Six Five at AWS re:Invent

The Six Five with Patrick Moorhead and Daniel Newman

Dec 31, 2024 · 18:00


The cloud ops landscape is evolving at lightning speed ⚡ from simple web servers to serverless functions and now, the explosion of AI workloads.

Datacenter Technical Deep Dives
Deep Dive: Automating the vBrownBag with AWS Serverless

Datacenter Technical Deep Dives

Sep 19, 2024


In this episode of the vBrownBag, Damian does a deeper dive into the Meatgrinder, showing how the different AWS services interact, how the process logs to CloudWatch, and more!
00:00 Intro
1:20 The AWS services that power the Meatgrinder
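The episode itself is a walkthrough rather than published source code, but the logging pattern Damian describes (a Lambda step in the pipeline writing its progress to CloudWatch) commonly looks like the sketch below. The event shape and names such as VideoJob are invented for illustration; the one real behavior it relies on is that Lambda ships anything written to stdout/stderr to the function's CloudWatch Logs log group.

```typescript
// Hypothetical sketch of a pipeline step; "VideoJob" and its fields are
// illustrative, not from the episode. In AWS Lambda, console output is
// forwarded to the function's CloudWatch Logs log group automatically.
import type { Context } from "aws-lambda";

interface VideoJob {
  videoId: string;
  title: string;
}

export const handler = async (event: VideoJob, context: Context) => {
  // Structured (JSON) log lines are easy to query later with CloudWatch Logs Insights.
  console.log(JSON.stringify({
    level: "info",
    msg: "processing video",
    videoId: event.videoId,
    requestId: context.awsRequestId,
  }));

  try {
    // ... call the other AWS services in the pipeline here ...
  } catch (err) {
    console.error(JSON.stringify({ level: "error", msg: String(err), videoId: event.videoId }));
    throw err; // rethrow so Lambda records the failed invocation
  }
};
```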

cloudonaut
#090 AWS Testing Awesomeness

cloudonaut

Jun 13, 2024 · 29:02


Andreas and Michael Wittig were pretty jazzed about writing unit tests using mocks for the AWS SDK v3 in JavaScript. They broke down Amazon's new GuardDuty malware protection for S3 and how it compares to their own product bucketAV. The duo also covered testing Terraform modules and using aws-nuke to clean up leftover resources from failed tests. They gave their two cents on some recent AWS service announcements too - CloudWatch, Fargate, CloudFormation and more!
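For readers who want to try the mocking approach Andreas and Michael describe, below is a minimal sketch using the community aws-sdk-client-mock package with Jest. The loadConfig function and bucket name are invented for illustration; this is one common way to unit-test AWS SDK v3 code in JavaScript/TypeScript, not necessarily the exact setup from the episode.

```typescript
// Unit-testing code that uses the AWS SDK v3 without touching AWS.
// Run under Jest; names like loadConfig and "demo-bucket" are illustrative.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { mockClient } from "aws-sdk-client-mock";
import { sdkStreamMixin } from "@smithy/util-stream";
import { Readable } from "node:stream";

const s3Mock = mockClient(S3Client);

// Code under test: reads a config object from S3.
async function loadConfig(s3: S3Client, bucket: string): Promise<string> {
  const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: "config.json" }));
  return res.Body!.transformToString();
}

beforeEach(() => s3Mock.reset());

test("loadConfig returns the object body", async () => {
  // Wrap a Node stream so it matches the SDK's response Body type.
  const body = sdkStreamMixin(Readable.from(['{"enabled":true}']));
  s3Mock.on(GetObjectCommand, { Bucket: "demo-bucket" }).resolves({ Body: body });

  const config = await loadConfig(new S3Client({}), "demo-bucket");
  expect(config).toBe('{"enabled":true}');
});
```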

AWS Morning Brief
Get Billed For a G6

AWS Morning Brief

Apr 8, 2024 · 6:00


AWS Morning Brief for the week of April 8, 2024, with Corey Quinn.

Links:
  • Amazon GuardDuty EC2 Runtime Monitoring is now generally available
  • Amazon SageMaker Canvas announces new pricing for training tabular models
  • Introducing AWS CodeConnections, formerly known as AWS CodeStar Connections
  • Amazon CloudWatch now supports cross-account anomaly detection
  • Amazon EKS extended support for Kubernetes versions now generally available
  • Announcing AWS Deadline Cloud
  • AWS Console Mobile Application adds support for CloudWatch custom dashboards
  • AWS Lambda adds support for Ruby 3.3
  • Announcing general availability of Amazon EC2 G6 instances
  • Announcing per-second billing for EC2 Red Hat Enterprise Linux (RHEL)-based instances
  • Why AWS Supports Valkey
  • AWS Activate credits now accepted for third-party models on Amazon Bedrock
  • DoorDash saves millions annually using Amazon S3 Storage Lens
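As a hedged illustration of the cross-account anomaly detection item above: CloudWatch anomaly detection alarms are built by pairing a metric query with an ANOMALY_DETECTION_BAND metric-math expression. The sketch below shows the single-account shape of that call with AWS SDK v3; the metric, dimensions, and band width of 2 standard deviations are placeholders, and the cross-account case additionally assumes CloudWatch cross-account observability sharing is already configured.

```typescript
// Sketch: an alarm that fires when a metric leaves its anomaly band.
// All names/values are illustrative, not from the announcement.
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

async function createAnomalyAlarm() {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: "orders-latency-anomaly",
    ComparisonOperator: "GreaterThanUpperThreshold",
    EvaluationPeriods: 3,
    ThresholdMetricId: "band", // compare m1 against the band defined below
    Metrics: [
      {
        Id: "m1",
        ReturnData: true,
        MetricStat: {
          Metric: {
            Namespace: "AWS/ApplicationELB",
            MetricName: "TargetResponseTime",
            Dimensions: [{ Name: "LoadBalancer", Value: "app/demo/123" }],
          },
          Period: 300,
          Stat: "p99",
        },
      },
      {
        Id: "band",
        ReturnData: true,
        // Metric-math anomaly band; 2 = width in standard deviations.
        Expression: "ANOMALY_DETECTION_BAND(m1, 2)",
      },
    ],
  }));
}

createAnomalyAlarm().catch(console.error);
```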

What's new in Cloud FinOps?
WNiCF - March 2024 - News

What's new in Cloud FinOps?

Apr 5, 2024 · 40:48 (transcription available)


In this episode, Frank and Steve discuss various news and updates in the cloud industry. They cover topics such as new AMD instances in Azure, holographic stickers, and more.

AWS Bites
120. Lambda Best Practices

AWS Bites

Apr 4, 2024 · 26:22


In this episode, we discuss best practices for working with AWS Lambda. We cover how Lambda functions work under the hood, including cold starts and warm starts. We then explore different invocation types - synchronous, asynchronous, and event-based. For each, we share tips on performance, cost optimization, and monitoring. Other topics include function structure, logging, instrumentation, and security. Throughout the episode, we aim to provide a solid mental model for serverless development and share our experiences to help you build efficient and robust Lambda applications.
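The episode is audio-only, but the cold-start/warm-start structure it describes usually looks like the sketch below: expensive initialization lives outside the handler, so it runs once per execution environment (cold start) and is reused on subsequent warm invocations. The DynamoDB table and event shape are invented for illustration.

```typescript
// Sketch of the handler structure discussed above; names are illustrative.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

// Cold start: executed once per execution environment, then reused.
const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = process.env.TABLE_NAME ?? "orders";

// Warm start: only this function runs on subsequent invocations.
export const handler = async (event: { orderId: string }) => {
  const res = await client.send(
    new GetCommand({ TableName: TABLE, Key: { id: event.orderId } })
  );
  // Structured logs keep CloudWatch output queryable and costs predictable.
  console.log(JSON.stringify({ msg: "fetched order", orderId: event.orderId, found: !!res.Item }));
  return res.Item ?? null;
};
```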

Screaming in the Cloud
The Hidden Costs of Cloud Computing with Jack Ellis

Screaming in the Cloud

Feb 27, 2024 · 35:17


On this week's episode of Screaming in the Cloud, Corey Quinn is joined by Jack Ellis. He is the technical co-founder of Fathom Analytics, a privacy-first alternative to Google Analytics. Corey and Jack talk in depth about a wide variety of AWS services, which ones have a habit of subtly hiking the monthly bill, and why Jack has moved towards working with consultants instead of hiring a costly DevOps team. This episode is truly a deep dive into everything AWS- and billing-related, led by one of the best in the industry. Tune in.

Show Highlights:
(00:00) - Introduction and Background
(00:31) - The Birth of Fathom Analytics
(03:35) - The Surprising Cost Drivers: Lambda and CloudWatch
(05:27) - The New Infrastructure Plan: CloudFront and WAF Logs
(08:10) - The Unexpected Costs of CloudWatch and NAT Gateways
(10:37) - The Importance of Efficient Data Movement
(12:54) - The Hidden Costs of S3 Versioning
(14:33) - The Benefits of AWS Compute Optimizer
(17:38) - The Implications of AWS's New IPv4 Address Charges
(18:57) - Considering On-Premise Data Centers
(21:05) - The Economics of Cloud vs On-Premise
(24:05) - The Role of Consultants in Cloud Management
(31:05) - The Future of Cloud Management
(33:20) - Closing Thoughts and Contact Information

About Jack Ellis
Technical co-founder of Fathom Analytics, the simple, privacy-first alternative to Google Analytics.

Links:
Twitter: @JackEllis
Website: https://usefathom.com/
Blog Post: An alterNAT Future: We Now Have a NAT Gateway Replacement
Sponsor: Oso - osohq.com
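One concrete mitigation for the "hidden costs of S3 versioning" highlight (not taken from the episode, just a common remedy) is a lifecycle rule that expires noncurrent object versions, so a versioned bucket doesn't accumulate paid storage forever. A sketch with AWS SDK v3; the bucket name and 30-day window are placeholders.

```typescript
// Sketch: cap how long old object versions linger in a versioned bucket.
// Bucket name and retention window are illustrative.
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function capVersionStorage(bucket: string) {
  await s3.send(new PutBucketLifecycleConfigurationCommand({
    Bucket: bucket,
    LifecycleConfiguration: {
      Rules: [
        {
          ID: "expire-noncurrent-versions",
          Status: "Enabled",
          Filter: { Prefix: "" }, // apply to the whole bucket
          // Delete versions 30 days after they stop being the current one.
          NoncurrentVersionExpiration: { NoncurrentDays: 30 },
        },
      ],
    },
  }));
}

capVersionStorage("demo-bucket").catch(console.error);
```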

Cables2Clouds
Ep 22 - AWS re:Invent 2023 Recap

Cables2Clouds

Dec 13, 2023 · 44:40 (transcription available)


Strap in for a discussion on the big reveals from the AWS re:Invent conference, where AI stole the limelight. We, your hosts – Chris Miles, Alex Perkins, and Tim McConnaughy – dissect the announcements and how this tech giant is embedding AI in its cloud operations. And it's not just about the big names: we also delve into the AI-powered Amazon Q, a promising tool that might just revolutionize network troubleshooting.

But wait, there's more! Picture a constellation of satellites granting private access to AWS. Now imagine AWS Outposts on the moon. Sounds like science fiction, right? Well, we're here to tell you it's closer to reality than you think. We're excited to share our thoughts on these trail-blazing developments in satellite connectivity and the latest updates to CloudWatch. This episode is a must-listen for anyone keen on staying ahead in the rapidly evolving world of cloud computing.

As we wrap up, we'll take a nostalgic trip down memory lane. Remember the early days of online shopping and the excitement surrounding limited product releases? We'll chat about how that frenzy has evolved, influenced by innovations in cloud technology. We'll also discuss the keynotes from the AWS re:Invent conference, focusing on Adam Selipsky's opening speech and Werner Vogels' emphasis on cost in functional design. Don't miss out on our take on these industry-shaping keynotes and what they signal for the future of cloud computing. Buckle up for a thrilling journey through the universe of AI and cloud technology!

Check out the Fortnightly Cloud Networking News
Visit our website and subscribe: https://www.cables2clouds.com/
Follow us on Twitter: https://twitter.com/cables2clouds
Follow us on YouTube: https://www.youtube.com/@cables2clouds/
Follow us on TikTok: https://www.tiktok.com/@cables2clouds
Merch Store: https://store.cables2clouds.com/
Join the Discord Study group: https://artofneteng.com/iaatj
Art of Network Engineering (AONE): https://artofnetworkengineering.com

Screaming in the Cloud
Chronosphere on Crafting a Cloud-Native Observability Strategy with Rachel Dines

Screaming in the Cloud

Nov 28, 2023 · 29:41


Rachel Dines, Head of Product and Technical Marketing at Chronosphere, joins Corey on Screaming in the Cloud to discuss why creating a cloud-native observability strategy is so critical, and the challenges that come with both defining and accomplishing that strategy to fit your current and future observability needs. Rachel explains how Chronosphere is taking an open-source approach to observability, and why it's more important than ever to acknowledge that the stakes and costs are much higher when it comes to observability in the cloud.

About Rachel
Rachel leads product and technical marketing for Chronosphere. Previously, Rachel wore lots of marketing hats at CloudHealth (acquired by VMware), and before that, she led product marketing for cloud-integrated storage at NetApp. She also spent many years as an analyst at Forrester Research. Outside of work, Rachel tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston, where she's based.

Links Referenced:
Chronosphere: https://chronosphere.io/
LinkedIn: https://www.linkedin.com/in/rdines/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's featured guest episode is brought to us by our friends at Chronosphere, and they have also brought us Rachel Dines, their Head of Product and Solutions Marketing. Rachel, great to talk to you again.

Rachel: Hi, Corey. Yeah, great to talk to you, too.

Corey: Watching your trajectory has been really interesting, just because starting off, when we first started, I guess, learning who each other were, you were working at CloudHealth, which has since become VMware. And I was trying to figure out, huh, the cloud runs on money. How about that? It feels like it was a thousand years ago, but neither one of us is quite that old.

Rachel: It does feel like several lifetimes ago. You were just this snarky guy with a few followers on Twitter, and I was trying to figure out what you were doing mucking around with my customers [laugh]. Then [laugh] we kind of both figured out what we're doing, right?

Corey: So, speaking of that iterative process, today, you are at Chronosphere, which is an observability company. We would have called it a monitoring company five years ago, but now that's become an insult after the observability war dust has settled. So, I want to talk to you about something that I've been kicking around for a while because I feel like there's a gap somewhere. Let's say that I build a crappy web app—because all of my web apps inherently are crappy—and it makes money through some mystical form of alchemy. And I have a bunch of users, and I eventually realize, huh, I should probably have a better observability story than waiting for the phone to ring and a customer telling me it's broken. So, I start instrumenting various aspects of it that seem to make sense. Maybe I go too low level, like looking at all the disks on every server to tell me if they're getting full or not, like they're ancient servers. Maybe I just have a Pingdom equivalent of is the website up enough to respond to a packet? And as I wind up experiencing different failure modes and getting yelled at by different constituencies—in my own career trajectory, my own boss—you start instrumenting for all those different kinds of breakages, you start aggregating the logs somewhere and the volume gets bigger and bigger with time. But it feels like it's sort of a reactive process as you stumble through that entire environment. And I know it's not just me because I've seen this unfold in similar ways in a bunch of different companies. It feels to me, very strongly, like it is something that happens to you, rather than something you set about from day one with a strategy in mind. What's your take on an effective way to think about strategy when it comes to observability?

Rachel: You just nailed it. That's exactly the kind of progression that we so often see. And that's what I really was excited to talk with you about today—

Corey: Oh, thank God. I was worried for a minute there that you'd be like, "What the hell are you talking about? Are you just, like, some sort of crap engineer?" And, "Yes, but it's mean of you to say it." But yeah, what I'm trying to figure out is there some magic that I just was never connecting? Because it always feels like you're in trouble because the site's always broken, and oh, like, if the disk fills up, yeah, oh, now we're going to start monitoring to make sure the disk doesn't fill up. Then you wind up getting barraged with alerts, and no one wins, and it's an uncomfortable period of time.

Rachel: Uncomfortable period of time. That is one very polite way to put it. I mean, I will say, it is very rare to find a company that actually sits down and thinks, "This is our observability strategy. This is what we want to get out of observability." Like, you can think about a strategy in, like, the old-school sense, and you know, as an industry analyst, so I'm going to have to go back to, like, my roots at Forrester with thinking about, like, the people, and the process, and the technology. But really what the bigger component here is like, what's the business impact? What do you want to get out of your observability platform? What are you trying to achieve? And a lot of the time, people have thought, "Oh, observability strategy. Great, I'm just going to buy a tool. That's it. Like, that's my strategy." And I hate to break it to you, but buying tools is not a strategy. I'm not going to say, like, buy this tool. I'm not even going to say, "Buy Chronosphere." That's not a strategy. Well, you should buy Chronosphere. But that's not a strategy.

Corey: Of course. I'm going to throw the money by the wheelbarrow at various observability vendors, and hope it solves my problem. But if that solved the problem—I've got to be direct—I've never spoken to those customers.

Rachel: Exactly. I mean, that's why this space is such a great one to come in and be very disruptive in. And I think, back in the days when we were running in data centers, maybe even before virtual machines, you could probably get away with not having a monitoring strategy—I'm not going to call it observability; that's not what we called it back then—you could get away with not having a strategy because what was the worst that was going to happen, right? There was a finite amount that your monitoring bill could be, a finite amount that your customer impact could be. Like, you're paying the penny slots, right? We're not on the penny slots anymore. We're at the $50 craps table, and it's Las Vegas, and if you lose the game, you're going to have to run down the street without your shirt. Like, the game and the stakes have changed, and we're still pretending like we're playing penny slots, and we're not anymore.

Corey: That's a good way of framing it. I mean, I still remember some of my biggest observability challenges were building highly available rsyslog clusters so that you could bounce a member and not lose any log data because some of that was transactionally important. And we've gone beyond that to a stupendous degree, but it still feels like you don't wind up building this into the application from day one. More's the pity because if you did, and did that intelligently, that opens up a whole world of possibilities. I dream of that changing where one day, whenever you start to build an app, oh, and we just push the button and automatically instrument with OTel, so you instrument the thing once everywhere it makes sense to do it, and then you can do your vendor selection and, as you said, make those decisions later in time. But these days, we're not there.

Rachel: Well, I mean, and there's also the question of just the legacy environment and the tech debt. Even if you wanted to, the—actually I was having a beer yesterday with a friend who's a VP of Engineering, and he's got his new environment that they're building with observability instrumented from the start. How beautiful. They've got OTel, they're going to have tracing. And then he's got his legacy environment, which is a hot mess. So, you know, there's always going to be this bridge of the old and the new. But this is where it comes back to: no matter where you're at, you can stop and think, like, "What are we doing and why?" What is the cost of this? And not just cost in dollars, which I know you and I could talk about very deeply for a long period of time, but like, the opportunity costs. Developers are working on stuff when they could be working on something that's more valuable. Or like the cost of making people work round the clock, trying to troubleshoot issues when there could be an easier way. So, I think it's like stepping back and thinking about cost in terms of dollars and cents, time, opportunity, and then also impact, and starting to make some decisions about what you're going to do in the future that's different. Once again, you might be stuck with some legacy stuff that you can't really change that much, but [laugh] you got to be realistic about where you're at.

Corey: I think that that is a… it's a hard lesson to be very direct, in that, companies need to learn it the hard way, for better or worse. Honestly, this is one of the things that I always noticed in startup land, where you had a whole bunch of, frankly, relatively early-career engineers in their early-20s, if not younger. But then the ops person was always significantly older because the thing you actually want to hear from your ops person, regardless of how you slice it, is, "Oh, yeah, I've seen this kind of problem before. Here's how we fixed it." Or even better, "Here's the thing we're doing, and I know how that's going to become a problem. Let's fix it before it does." It's the, "What are you buying by bringing that person in?" "Experience, mostly."

Rachel: Yeah, that's an interesting point you make, and it kind of leads me down this little bit of a side note, but a really interesting antipattern that I've been seeing in a lot of companies is that more seasoned ops person, they're the one who everyone calls when something goes wrong. Like, they're the one who, like, "Oh, my God, I don't know how to fix it. This is a big hairy problem," I call that one ops person, or I call that very experienced person. That experienced person then becomes this huge bottleneck into solving problems that people don't really—they might even be the only one who knows how to use the observability tool. So, if we can't find a way to democratize our observability tooling a little bit more so, like, just day-to-day engineers, like, more junior engineers, newer ones, people who are still ramping, can actually use the tool and be successful, we're going to have a big problem when these ops people walk out the door, maybe they retire, maybe they just get sick of it. We have these massive bottlenecks in organizations, whether it's ops or DevOps or whatever, that I see often exacerbated by observability tools. Just a side note.

Corey: Yeah. On some level, it feels like a lot of these things can be fixed with tooling. And I'm not going to say that tools aren't important. You ever tried to implement observability by hand? It doesn't work. There have to be computers somewhere in the loop, if nothing else. And then it just seems to devolve into a giant swamp of different companies, doing different things, taking different approaches. And, on some level, whenever you read the marketing or hear the stories any of these companies tell, you also have to normalize it by translating from whatever marketing language they've got into something that comports with the reality of your own environment and seeing if they align. And that feels like it is so much easier said than done.

Rachel: This is a noisy space, that is for sure. And you know, I think we could go out to ten people right now and ask those ten people to define observability, and we would come back with ten different definitions. And then if you throw a marketing person in the mix, right—guilty as charged, and I know you're a marketing person, too, Corey, so you got to take some of the blame—it gets mucky, right? But like I said a minute ago, the answer is not tools. Tools can be part of the strategy, but if you're just thinking, "I'm going to buy a tool and that's going to solve my problem," you're going to end up like this company I was talking to recently that has 25 different observability tools. And not only do they have 25 different observability tools, what's worse is they have 25 different definitions for their SLOs and 25 different names for the same metric. And to be honest, it's just a mess. I'm not saying, like, go be Draconian and, you know, tell all the engineers, like, "You can only use this tool [unintelligible 00:10:34] use that tool," you got to figure out this kind of balance of, like, hands-on, hands-off, you know? How much do you centralize, how much do you push and standardize? Otherwise, you end up with just a huge mess.

Corey: On some level, it feels like it was easier back in the days of building it yourself with Nagios because there's only one answer, and it sucks, unless you want to start going down the world of HP OpenView. Which step one: hire a 50-person team to manage OpenView. Okay, that's not going to solve my problem either. So, let's get a little more specific. How does Chronosphere approach this? Because historically, when I've spoken to folks at Chronosphere, there isn't that much of a day one story, in that, "I'm going to build a crappy web app. Let's instrument it for Chronosphere." There's a certain, "You must be at least this tall to ride," implicit expectation built into the product just based upon its origins. And I'm not saying that doesn't make sense, but it also means there's really no such thing as a greenfield build out for you either.

Rachel: Well, yes and no. I mean, I think there's no green fields out there because everyone's doing something for observability, or monitoring, or whatever you want to call it, right? Whether they've got Nagios, whether they've got the Dog, whether they've got something else in there, they have some way of introspecting their systems, right? So, one of the things that Chronosphere is built on, that I actually think this is part of something—a way you might think about building out an observability strategy as well, is this concept of control and open-source compatibility. So, we only can collect data via open-source standards. You have to send this data via Prometheus, via OpenTelemetry, it could be older standards, like, you know, statsd, Graphite, but we don't have any proprietary instrumentation. And if I was making a recommendation to somebody building out their observability strategy right now, I would say open, open, open, all day long because that gives you a huge amount of flexibility in the future. Because guess what? You know, you might put together an observability strategy that seems like it makes sense for right now—actually, I was talking to a B2B SaaS company that told me that they made a choice a couple of years ago on an observability tool. It seemed like the right choice at the time. They were growing so fast, they very quickly realized it was a terrible choice. But now, it's going to be really hard for them to migrate because it's all based on proprietary standards. Now, of course, a few years ago, they didn't have the luxury of OpenTelemetry and all of these, but now that we have this, we can use these to kind of future-proof our mistakes. So, that's one big area that, once again, is both my recommendation and happens to be our approach at Chronosphere.

Corey: I think that that's a fair way of viewing it. It's a constant challenge, too, just because increasingly—you mentioned the Dog earlier, for example—I will say that for years, I have been asked whether or not at The Duckbill Group, we look at Azure bills or GCP bills. Nope, we are pure AWS. Recently, we started to hear that same inquiry specifically around Datadog, to the point where it has become a board-level concern at very large companies. And that is a challenge, on some level. I don't deviate from my typical path of I fix AWS bills, and that's enough impossible problems for one lifetime, but there is a strong sense of you want to record as much as possible for a variety of excellent reasons, but there's an implicit cost to doing that, and in many cases, the cost of observability becomes a massive contributor to the overall cost. Netflix has said in talks before that they're effectively an observability company that also happens to stream movies, just because it takes so much effort, engineering, and raw computing resources in order to get that data and do something actionable with it. It's a hard problem.

Rachel: It's a huge problem, and it's a big part of why I work at Chronosphere, to be honest. Because when I was—you know, towards the tail end at my previous company in cloud cost management, I had a lot of customers coming to me saying, "Hey, when are you going to tackle our Dog or our New Relic or whatever?" Similar to the experience you're having now, Corey, this was happening to me three, four years ago. And I noticed that there is definitely a correlation between people who are having these really big challenges with their observability bills and people that were adopting, like Kubernetes, and microservices and cloud-native. And it was around that time that I met the Chronosphere team, which is exactly what we do, right? We focus on observability for these cloud-native environments where observability data just goes, like, wild. We see 10X, 20X as much observability data and that's what's driving up these costs. And yeah, it is becoming a board-level concern. I mean, and coming back to the concept of strategy, like if observability is the second or third most expensive item in your engineering bill—like, obviously, cloud infrastructure, number one—number two and number three is probably observability. How can you not have a strategy for that? How can this be something the board asks you about, and you're like, "What are we trying to get out of this? What's our purpose?" "Uhhhh… troubleshooting?"

Corey: Right because it turns into business metrics as well. It's not just about is the site up or not. There's a—like, one of the things that always drove me nuts not just in the observability space, but even in cloud costing is where, okay, your costs have gone up this week so you get a frowny face, or it's in red, like traffic light coloring. Cool, but for a lot of architectures and a lot of customers, that's because you're doing a lot more volume. That translates directly into increased revenues, increased things you care about. You don't have the position or the context to say, "That's good," or, "That's bad." It simply is. And you can start deriving business insight from that. And I think that is the real observability story that I think has largely gone untold at tech conferences, at least.

Rachel: It's so right. I mean, spending more on something is not inherently bad if you're getting more value out of it. And it's definitely a challenge on the cloud cost management side. "My costs are going up, but my revenue is going up a lot faster, so I'm okay." And I think some of the plays, like you know, we put observability in this box of, like, it's for low-level troubleshooting, but really, if you step back and think about it, there's a lot of larger, bigger picture initiatives that observability can contribute to in an org, like digital transformation. I know that's a buzzword, but, like that is a legit thing that a lot of CTOs are out there thinking about. Like, how do we, you know, get out of the tech debt world, and how do we get into cloud-native? Maybe it's developer efficiency. God, there's a lot of people talking about developer efficiency. Last week at KubeCon, that was one of the big, big topics. I mean, and yeah, what [laugh] what about cost savings? To me, we've put observability in a smaller box, and it needs to bust out. And I see this also in our customer base, you know? Customers like DoorDash use observability, not just to look at their infrastructure and their applications, but also to look at their business. At any given minute, they know how many Dashers are on the road, how many orders are being placed, cut by geos, down to the—actually down to the second, and they can use that to make decisions.

Corey: This is one of those things that I always found a little strange coming from the world of running systems in large [unintelligible 00:17:28] environments to fixing AWS bills. There's nothing that even resembles a fast, reactive response in the world of AWS billing. You wind up with a runaway bill, they're going to resolve that over a period of weeks, on Seattle business hours. If you wind up spinning something up that creates a whole bunch of very expensive drivers behind your bill, it's going to take three days, in most cases, before that starts showing up anywhere that you can reasonably expect to get at it. The idea of near real time is a lie unless you want to start instrumenting everything that you're doing to trap the calls and then run cost extrapolation from there. That's hard to do. Observability is a very different story, where latencies start to matter, where being able to get leading indicators of certain events—be it technical or business—start to be very important. But it seems like it's so hard to wind up getting there from where most people are. Because I know we like to talk dismissively about the past, but let's face it, conference-ware is the stuff we're the proudest of. The reality is the burning dumpster of regret in our data centers that still also drives giant piles of revenue, so you can't turn it off, nor would you want to, but you feel bad about it as a result. It just feels like it's such a big leap.

Rachel: It is a big leap. And I think the very first step I would say is trying to get to this point of clarity and being honest with yourself about where you're at and where you want to be. And sometimes not making a choice is a choice, right, as well. So, sticking with the status quo is making a choice. And so, like, as we get into things like the holiday season right now, and I know there's going to be people that are on-call 24/7 during the holidays, potentially, to keep something that's just duct-taped together barely up and running, I'm making a choice; you're making a choice to do that. So, I think that's like the first step is the kind of… at least acknowledging where you're at, where you want to be, and if you're not going to make a change, just understanding the cost and being realistic about it.

Corey: Yeah, being realistic, I think, is one of the hardest challenges because it's easy to wind up going for the aspirational story of, "In the future when everything's great." Like, "Okay, cool. I appreciate the need to plant that flag on the hill somewhere. What's the next step? What can we get done by the end of this week that materially improves us from where we started the week?" And I think that with the aspirational conference-ware stories, it's hard to break that down into things that are actionable, that don't feel like they're going to be an interminable slog across your entire existing environment.

Rachel: No, I get it. And for things like, you know, instrumenting and adding tracing and adding OTel, a lot of the time, the return that you get on that investment is… it's not quite like, "I put a dollar in, I get a dollar out," I mean, something like tracing, you can't get to 60% instrumentation and get 60% of the value. You need to be able to get to, like, 80, 90%, and then you'll get a huge amount of value. So, it's sort of like you're trudging up this hill, you're charging up this hill, and then finally you get to the plateau, and it's beautiful. But that hill is steep, and it's long, and it's not pretty. And I don't know what to say other than there's a plateau near the top. And those companies that do this well really get a ton of value out of it. And that's the dream, that we want to help customers get up that hill. But yeah, I'm not going to lie, the hill can be steep.

Corey: One thing that I find interesting is there's almost a bimodal distribution in companies that I talk to. On the one side, you have companies like, I don't know, a Chronosphere is a good example of this. Presumably you have a cloud bill somewhere and the majority of your cloud spend will be on what amounts to a single application, probably in your case called, I don't know, Chronosphere. It shares the name of the company. The other side of that distribution is the large enterprise conglomerates where they're spending, I don't know, $400 million a year on cloud, but their largest workload is 3 million bucks, and it's just a very long tail of a whole bunch of different workloads, applications, teams, et cetera. So, what I'm curious about from the Chronosphere perspective—or the product you have, not the 'you' in this metaphor, which gets confusing—is, it feels easier to instrument a Chronosphere-like company that has a primary workload that is the massive driver of most things and get that instrumented and start getting an observability story around that than it does to try and go to a giant company and, "Okay, 1500 teams need to all implement this thing that are all going in different directions." How do you see it playing out among your customer base, if that bimodal distribution holds up in your world?

Rachel: It does and it doesn't. So, first of all, for a lot of our customers, we often start with metrics. And starting with metrics means Prometheus. And Prometheus has hundreds of exporters. It is basically built into Kubernetes. So, if you're running Kubernetes, getting Prometheus metrics out, actually not a very big lift. So, we find that we start with Prometheus, we start with getting metrics in, and we can get a lot—I mean, customers—we have a lot of customers that use us just for metrics, and they get a massive amount of value. But then once they're ready, they can start instrumenting for OTel and start getting traces in as well. And yeah, in large organizations, it does tend to be one team, one application, one service, one department that kind of goes at it and gets all that instrumented. But I've even seen very large organizations, when they get their act together and decide, like, "No, we're doing this," they can get OTel instrumented fairly quickly. So, I guess it's, like, a lining up. It's more of a people issue than a technical issue a lot of the time. Like, getting everyone lined up and making sure that like, yes, we all agree. We're on board. We're going to do this. But it's usually, like, it's a start small, and it doesn't have to be all or nothing. We also just recently added the ability to ingest events, which is actually a really beautiful thing, and it's very, very straightforward. It basically just—we connect to your existing other DevOps tools, so whether it's, like, a Buildkite, or a GitHub, or, like, a LaunchDarkly, and then anytime something happens in one of those tools, that gets registered as an event in Chronosphere. And then we overlay those events over your alerts. So, when an alert fires, then the first thing I do is I go look at the alert page, and it says, "Hey, someone did a deploy five minutes ago," or, "There was a feature flag flipped three minutes ago," I solved the problem right then. I don't think of this as—there's not an all or nothing nature to any of this stuff. Yes, tracing is a little bit of a—you know, like I said, it's one of those things where you have to make a lot of investment before you get a big reward, but that's not the case in all areas of observability.

Corey: Yeah. I would agree. Do you find that there's a significant easy, early win when customers start adopting Chronosphere? Because one of the problems that I've found, especially with things that are holistic, and as you talk about tracing, well, you need to get to a certain point of coverage before you see value. But human psychology being what it is, you kind of want to be able to demonstrate, oh, see, the Meantime To Dopamine needs to come down, to borrow an old phrase. Do you find that there are some easy wins that start to help people to see the light? Because otherwise, it just feels like a whole bunch of work for no discernible benefit to them.

Rachel: Yeah, at least for the Chronosphere customer base, one of the areas where we're seeing a lot of traction this year is in optimizing the costs, like, coming back to the cost story of their overall observability bill. So, we have this concept of the control plane in our product where all the data that we ingest hits the control plane. At that point, that customer can look at the data, analyze it, and decide this is useful, this is not useful. And actually, not just decide that, but we show them what's useful, what's not useful. What's being used, what's high cardinality, but—and high cost, but maybe no one's touched it. And then we can make decisions around aggregating it, dropping it, combining it, doing all sorts of fancy things, changing the—you know, downsampling it. We can do this, on the trace side, we can do it both head-based and tail-based. On the metrics side, it's as it hits the control plane and then streams out. And then they only pay for the data that we store. So typically, customers are—they come on board and immediately reduce their observability dataset by 60%. Like, that's just straight up, that's the average. And we've seen some customers get really aggressive, get up to, like, in the 90s, where they realize we're only using 10% of this data. Let's get rid of the rest of it. We're not going to pay for it. So, paying a lot less helps in a lot of ways. It also helps companies get more coverage of their observability. It also helps customers get more coverage of their overall stack. So, I was talking recently with an autonomous vehicle driving company that recently came to us from the Dog, and they had made some really tough choices and were no longer monitoring their pre-prod environments at all because they just couldn't afford to do it anymore. It's like, well, now they can, and we're still saving the money.

Corey: I think that there's also the downstream effect of the money saving to that, for example, I don't fix observability bills directly. But, "Huh, why is your CloudWatch bill through the roof?" Or data egress charges in some cases? It's oh because your observability vendor is pounding the crap out of those endpoints and pulling all your log data across the internet, et cetera. And that tends to mean, oh, yeah, it's not just the first-order effect; it's the second and third and fourth-order effects this winds up having. It becomes almost a holistic challenge. I think that trying to put observability in its own bucket, on some level—when you're looking at it from a cost perspective—starts to be a, I guess, a structure that makes less and less sense in the fullness of time.

Rachel: Yeah, I would agree with that. I think that just looking at the bill from your vendor is one very small piece of the overall cost you're incurring. I mean, all of the things you mentioned, the egress, the CloudWatch, the other services, it's impacting, what about the people?

Corey: Yeah, it sure is great that your team works for free.

Rachel: [laugh]. Exactly, right? I know, and it makes me think a little bit about that viral story about that particular company with a certain vendor that had a $65 million per year observability bill. And that impacted not just them, but, like, it showed up in both vendors' financial filings. Like, how did you get there? How did you get to that point? And I think this all comes back to the value in the ROI equation. Yes, we can all sit in our armchairs and be like, "Well, that was dumb," but I know there are very smart people out there that just got into a bad situation by kicking the can down the road on not thinking about the strategy.

Corey: Absolutely. I really want to thank you for taking the time to speak with me about, I guess, the bigger picture questions rather than the nuts and bolts of a product. I like understanding the overall view that drives a lot of these things. I don't feel I get to have enough of those conversations some weeks, so thank you for humoring me. If people want to learn more, where's the best place for them to go?

Rachel: So, they should definitely check out the Chronosphere website. Brand new beautiful spankin' new website: chronosphere.io. And you can also find me on LinkedIn. I'm not really on the Twitters so much anymore, but I'd love to chat with you on LinkedIn and hear what you have to say.

Corey: And we will, of course, put links to all of that in the [show notes 00:28:26]. Thank you so much for taking the time to speak with me. It's appreciated.

Rachel: Thank you, Corey. Always fun.

Corey: Rachel Dines, Head of Product and Solutions Marketing at Chronosphere. This has been a featured guest episode brought to us by our friends at Chronosphere, and I'm Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment that I will one day read once I finish building my highly available rsyslog system to consume it with.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
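The "open standards first" instrumentation Rachel advocates in this conversation might look like the following minimal Node sketch using the prom-client library: the service exposes a /metrics endpoint that any Prometheus-compatible backend (Chronosphere included, per the episode) can scrape. The metric name, labels, and port are illustrative, and this is one possible starting point rather than anything prescribed in the episode.

```typescript
// Minimal Prometheus instrumentation sketch for a Node service.
// Metric names and the port are illustrative placeholders.
import http from "node:http";
import client from "prom-client";

client.collectDefaultMetrics(); // process CPU, memory, event loop lag, etc.

const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Count of HTTP requests",
  labelNames: ["route", "status"],
});

http.createServer(async (req, res) => {
  if (req.url === "/metrics") {
    // Prometheus (or any compatible scraper) pulls from this endpoint.
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
    return;
  }
  httpRequests.inc({ route: req.url ?? "/", status: "200" });
  res.end("ok");
}).listen(8080);
```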

Real World Serverless with theburningmonk
#90: He built a hotel booking system that costs $0.82/month to run

Real World Serverless with theburningmonk

Nov 22, 2023 · 48:35


In this episode, I spoke with Hieu Do, who is a Solution Architect at FPT Software in Vietnam. As a side project, he built a hotel booking system for a friend using entirely serverless technologies. He shared his experience and learnings from building this system and how he was able to keep the cost down. After running the system for a year, it costs an average of only $0.82 per month and serves over 3,000 users per day.

There are some useful tips that everyone can apply, such as reducing the cost of API Gateway by cutting out the OPTIONS requests associated with CORS, or looking out for CloudWatch costs when using third-party services that poll CloudWatch data, such as New Relic.

Links from the episode:
• Hieu's article on running a serverless hotel booking system for a year
• Puppeteer
• Chrome Lambda Layer
• Vendia serverless-express framework
• Hieu's LinkedIn profile

For more stories about real-world use of serverless technologies, please subscribe to the channel and follow me on X as @theburningmonk. And if you're hungry for more insights, best practices, and invaluable tips on building serverless apps, make sure to subscribe to our free newsletter and elevate your serverless game! https://theburningmonk.com/subscribe

Opening theme song: Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0
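Hieu's exact code isn't in the show notes, but the OPTIONS-reduction tip rests on standard CORS behavior: if preflight responses carry a long Access-Control-Max-Age, browsers cache the preflight result instead of re-sending OPTIONS before every request, which cuts billable API Gateway calls. Below is a hedged sketch of a Lambda proxy handler doing this; the allowed origin, methods, and headers are invented placeholders.

```typescript
// Sketch: answer CORS preflights with a long max-age so browsers cache
// them. Origin and header values are illustrative, not from the episode.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "https://example-hotel.com",
  "Access-Control-Allow-Methods": "GET,POST,OPTIONS",
  "Access-Control-Allow-Headers": "Content-Type,Authorization",
  // Ask browsers to cache the preflight for 24h (browsers may cap this lower).
  "Access-Control-Max-Age": "86400",
};

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  if (event.httpMethod === "OPTIONS") {
    // Preflight: no body needed, just the cached-able CORS headers.
    return { statusCode: 204, headers: CORS_HEADERS, body: "" };
  }
  return {
    statusCode: 200,
    headers: CORS_HEADERS,
    body: JSON.stringify({ ok: true }),
  };
};
```

Note that browsers cap how long they honor Max-Age (Chromium caps it at two hours, for example), so this reduces preflights rather than eliminating them; handling OPTIONS in the API Gateway configuration itself, or avoiding preflight-triggering headers entirely, are variations on the same idea.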

Software Sessions
David Copeland on Medium Sized Decisions (RubyConf 2023)

Software Sessions

Nov 17, 2023 · 48:33


David was the chief software architect and director of engineering at Stitch Fix. He's also the author of a number of books, including Sustainable Web Development with Ruby on Rails and, most recently, Ruby on Rails Background Jobs with Sidekiq. He talks about how he made decisions while working with a medium-sized team (~200 developers) at Stitch Fix. The audio quality for the first 19 minutes is not great, but the correct microphones turn on right after that. Recorded at RubyConf 2023 in San Diego.

A few topics covered:
• Ruby's origins at Stitch Fix
• Thoughts on Go
• Choosing technology and cloud services
• Moving off Heroku
• Building a platform team
• Where Ruby and Rails fit in today
• The role of books and how different people learn
• Large Language Models' effects on technical content

Related Links:
• David's Blog
• Mastodon
• Transcript

You can help correct transcripts on GitHub.

Intro [00:00:00] Jeremy: Today, I want to share another conversation from RubyConf San Diego. This time it's with David Copeland. He was a chief software architect and director of engineering at Stitch Fix. And at the start of the conversation, you're going to hear about why he decided to write the book, Sustainable Web Development with Ruby on Rails. Unfortunately, you're also going to notice the sound quality isn't too good. We had some technical difficulties. But once you hit the 20 minute mark of the recording, the mics are going to kick in. It's going to sound way better. So I hope you stick with it. Enjoy. Ruby at Stitch Fix [00:00:35] David: Stitch Fix was a Rails shop. I had done a lot of Rails and learned a lot of things that worked and didn't work, at least in that situation. And so I started writing them down and I was like, I should probably make this more than just a document that I keep, you know, privately on my computer. Uh, so that's, you know, kind of, kind of where the genesis of that came from and just tried to, write everything down that I thought what worked, what didn't work. Uh, if you're in a situation like me. Working on a product, with a medium sized, uh, team, then I think the lessons in there will be useful, at least some of them. Um, and I've been trying to keep it up over, over the years. I think the first version came out a couple years ago, so I've been trying to make sure it's always up to date with the latest stuff and, and Rails and based on my experience and all that. [00:01:20] Jeremy: So it's interesting that you mention, medium sized team because, during the, the keynote, just a few moments ago, Matz the creator of Ruby was talking about how like, Oh, Rails is really suitable for this, this one person team, right? Small, small team. And, uh, he was like, you're not Google. So like, don't worry about, right. Can you scale to that level? Yeah. Um, and, and I wonder like when you talk about medium size or medium scale, like what are, what are we talking? [00:01:49] David: I think probably under 200 developers, I would say. because when I left Stitch Fix, it was closing in on that number of developers. And so it becomes, you know, hard to... You can kind of know who everybody is, or at least the names sound familiar of everybody. But beyond that, it's just, it's just really hard. But a lot of it was like, I don't have experience at like a thousand developer company. I have no idea what that's like, but I definitely know that Rails can work for like... 200 ish people how you can make it work basically. yeah.
[00:02:21] Jeremy: The decision to use Rails, I'm assuming that was made before you joined? [00:02:26] David: Yeah, the, um, the CTO of Stitch Fix, he had come in to clean up a mess made by contractors, as often happens. They had used Django, which is like the Python version of Rails. And he, the CTO, he was more familiar with Rails. So the first two developers he hired, also familiar with Rails. There wasn't a lot to maintain with the Django app, so they were like, let's just start fresh, fresh with Rails. yeah, but it's funny because a lot of the code in that Rails app was, like, transliterated from Python. So you could, it would, it looked like the strangest Ruby code in the world because it was basically, there was no test. So they were like, let's just write the Ruby version of this Python just so we know it works. but obviously that didn't, didn't last forever, so. [00:03:07] Jeremy: So, so what's an example of a, of a tell? Where you're looking at the code and you're like, oh, this is clearly, it came from Python. [00:03:15] David: You'd see like, very, very explicit, right? Like Python, there's a lot of like single line things. very like, this sounds like a dig, but it's very simple looking code. Like, like I don't know Python, but I was able to change this Django app. And I had to, I could look at it and you can figure out immediately how it works. Cause there's. Not much to it. There's nothing fancy. So, like, this, this Ruby code, there was nothing fancy. You'd be like, well, maybe they should have memoized that, or maybe they should have taken that into another class, or you could have done this with a hash or something like that. So there was, like, none of that. It was just, like, really basic, plain code like you would see in any beginning programming language kind of thing. Which is at least nice. You can understand it. but you probably wouldn't have written it that way at first in Ruby. Thoughts on Go [00:04:05] Jeremy: Yeah, that's, that's interesting because, uh, people sometimes talk about the Go programming language and how it looks, I don't know if simple is the right word, but it's something where you look at the code and even if you don't necessarily understand Go, it's relatively straightforward. Yeah. I wonder what your thoughts are on that being a strength versus that being, like, [00:04:25] David: Yeah, so at Stitch Fix at one point we had a pro, we were moving off of Heroku and we were going to, basically build a deployment platform using ECS on AWS. And so the deployment platform was a Rails app and we built a command line tool using Ruby. And it was fine, but it was a very complicated command line tool and it was very slow. And so one of the developers was like, I'm going to rewrite it in Go. I was like, ugh, you know, because I just was not a big fan. So he rewrote it in Go. It was a bazillion times faster. And then I was like, okay, I'm going to add, I'll add a feature to it. It was extremely easy. Like, it's just like what you said. I looked at it, like, I don't know anything about Go. I know what is happening here. I can copy and paste this and change things and make it work for what I want to do. And it did work. And it was, it was pretty easy. so there's that, I mean, aesthetically it's pretty ugly and it's, I, I. I can't really defend that as a real reason to not use it, but it is kind of gross. I did do Go, I did a small project in Go after Stitch Fix, and there's this vibe in Go about like, don't create abstractions. 
I don't know where I got that from, but every Go I look at, I'm like we should make an abstraction for this, but it's just not the vibe. They just don't like doing that. They like it all written out. And I see the value because you can look at the code and know what it does and you don't have to chase abstractions anywhere. But. I felt like I was copying and pasting a lot of, a lot of things. Um, so I don't know. I mean, the, the team at Stitch Fix that did this like command line app in go, they're the platform team. And so their job isn't to write like web apps all day, every day. There's kind of in and out of all kinds of things. They have to try to figure out something that they don't understand quickly to debug a problem. And so I can see the value of something like go if that's your job, right? You want to go in and see what the issue is. Figure it out and be done and you're not going to necessarily develop deep expertise and whatever that thing is that you're kind of jumping into. Day to day though, I don't know. I think it would make me kind of sad. (laughs) [00:06:18] Jeremy: So, so when you say it would make you kind of sad, I mean, what, what about it? Is it, I mean, you mentioned that there's a lot of copy and pasting, so maybe there's code duplication, but are there specific things where you're like, oh, I just don't? [00:06:31] David: Yeah, so I had done a lot of Java in my past life and it felt very much like that. Where like, like the Go library for making an HTTP call for like, I want to call some web service. It's got every feature you could ever want. Everything is tweakable. You can really, you can see why it's designed that way. To dial in some performance issue or solve some really esoteric thing. It's there. But the problem is if you just want to get an JSON, it's just like huge production. And I felt like that's all I really want to do and it's just not making it very easy. And it just felt very, very cumbersome. I think that having to declare types also is a little bit of a weird mindset because, I mean, I like to make types in Ruby, I like to make classes, but I also like to just use hashes and stuff to figure it out. And then maybe I'll make a class if I figure it out, but Go, you can't. You have to have a class, you have to have a type, you have to think all that ahead of time, and it just, I'm not used to working that way, so it felt, I mean, I guess I could get used to it, but I just didn't warm up to that sort of style of working, so it just felt like I was just kind of fighting with the vibe of the language, kind of. Yeah, [00:07:40] Jeremy: so it's more of the vibe or the feel where you're writing it and you're like this seems a little too... Explicit. I feel like I have to be too verbose. It just doesn't feel natural for me to write this. [00:07:53] David: Right, it's not optimized for what in my mind is the obvious case. And maybe that's not the obvious case for the people that write Go programs. But for me, like, I just want to like get this endpoint and get the JSON back as a map. Not any easier than any other case, right? Whereas like in Ruby, right? And you can, I think if you include net HTTP, you can just type get. And it will just return whatever that is. Like, that's amazing. It's optimized for what I think is a very common use case. So it makes me feel really productive. It makes me feel pretty good. And if that doesn't work out long term, I can always use something more complicated. 
But I'm not required to dig into the NetHttp library just to do what in my mind is something very simple. [00:08:37] Jeremy: Yeah, I think that's something I've noticed myself in working with Ruby. I mean, you have the standard library that's very... Comprehensive and the API surface is such that, like you said there, when you're trying to do common tasks, a lot of times they have a call you make and it kind of does the thing you expected or hoped for. [00:08:56] David: Yeah, yeah. It's kind of, I mean, it's that whole optimized for programmer happiness thing. Like it does. That is the vibe of Ruby and it seems like that is still the way things are. And, you know, I, I suppose if I had a different mindset, I mean, because I work with developers who did not like using Ruby or Rails. They loved using Go or Java. And I, I guess there's probably some psychological analysis we could do about their background and history and mindset that makes that make sense. But, to me, I don't know. It's, it's nice when it's pleasant. And Ruby seems pleasant. (laughs) Choosing Technology [00:09:27] Jeremy: as a... Software Architect, or as a CTO, when, when you're choosing technology, what are some of the things you look at in terms of, you know? [00:09:38] David: Yeah, I mean, I think, like, it's a weird criteria, but I think what is something that the team is capable of executing with? Because, like, most, right, most programming languages all kind of do the same thing. Like, you can kind of get most stuff done in most common popular programming languages. So, it's probably not... It's not true that if you pick the wrong language, you can't build the app. Like, that's probably not really the case. At least for like a web app or something. so it's more like, what is the team that's here to do it? What are they comfortable and capable of doing? I worked on a project with... It was a mix of like junior engineers who knew JavaScript, and then some senior engineers from Google. And for whatever reason someone had chosen a Rails app and none of them were comfortable or really yet competent with doing Ruby on Rails and they just all hated it and like it didn't work very well. Um, and so even though, yes, Rails is a good choice for doing stuff for that team at that moment. Not a good choice. Right. So I think you have to go in and like, what, what are we going to be able to execute on so that when the business wants us to do something, we just do it. And we don't complain and we don't say, Oh, well we can't because this technology that we chose, blah, blah, blah. Like you don't ever want to say that if possible. So I think that's. That's kind of the, the top thing. I think second would be how widely supported is it? Like you don't want to be the cutting edge user that's finding all the bugs in something really. Like you want to use something that's stable. Postgres, MySQL, like those work, those are fine. The bugs have been sorted out for most common use cases. Some super fancy edge database, I don't know if I'd want to be doing, doing that you know? Choosing cloud services [00:11:15] Jeremy: How do you feel about the cloud specific services and databases? Like are you comfortable saying like, oh, I'm going to use... Google Cloud, BigQuery. Yeah. [00:11:27] David: That sort of thing. 
I think it would kind of fall under the same criteria that I was just, just saying like, so with AWS it's interesting 'cause when we moved from Heroku to AWS, like EC2, RDS (their database thing), uh, S3, those have been around for years, probably those are gonna work, but they always introduce new things. Like we, we use RabbitMQ and AWS came out with some, I forget what it was, it was a queuing service similar to Rabbit. We were like, Oh, maybe we should switch to that. But it was clear that they weren't really ready to support it. So, yeah, we didn't, we didn't switch to that. So I, you gotta try to read the tea leaves of the provider to see, are they committed to, to supporting this thing, or is it there to get some enterprise client to move into the cloud? And then the idea is to move off of that transitional thing into what they do support. And it's hard to get a clear answer from them too. So it takes a little bit of research to figure out, are they going to support this or not? Because that's what you don't want: to move everything into some very proprietary cloud system and have them sunset it and say, Oh yeah, now you've got to switch again. Uh, that kind of sucks. So, it's a little trickier. [00:12:41] Jeremy: And what kind of questions or research do you do? Is it purely a function of this thing has existed for X number of years so I feel okay? [00:12:52] David: I mean, it's kind of similar to looking at like some gem you're going to add to your project, right? So you'll, you'll look at how often does it change? Is it being updated? Uh, what is the documentation? Does it look like someone really cared about the documentation? Does the documentation look updated? Are there issues with it that are being addressed or, or not? Um, so those are good signals. I think, talking to other practitioners too can be good. Like if you've got someone who's experienced, you can say, hey, do you know anybody? Back channeling through, like, everybody knows somebody that works at AWS, you can probably try to get something there. At Stitch Fix, we had an enterprise support contract, and so your account manager will sometimes give you good information if you ask. Again, it's a, they're not going to come out and say, don't use this product that we have, but they might communicate that in a subtle way. So you have to triangulate from all these sources to try to, to try to figure out what, what you want to do. [00:13:50] Jeremy: Yeah, it kind of makes me wish that there was a, a site like, maybe not quite like, Can I Use, right? Can I Use, you can see like, oh, can I use this in my browser? Is there, uh, like an AWS or a Google Cloud, can I trust this? Can I trust this? Yeah. Is this, is this solid or not? [00:14:04] David: Right, totally. It's like, there's that, that site where you, it has all the Apple products and it says whether or not you should buy it because one may or may not be coming out or they may be getting rid of it. Like, yeah, that would... For cloud services, that would be, that would be nice. [00:14:16] Jeremy: Yeah, yeah. That's like the Mac Buyer's Guide. And then we, we need the, uh, the technology. Yeah. Maybe not buyers. Cloud Provider Buyer's Guide, yeah. I guess we are buyers. [00:14:25] David: Yeah, yeah, totally, totally. [00:14:27] Jeremy: It's interesting that you, you mentioned how you want to see that, okay, this thing is mature. I think it's going to stick around, because I interviewed someone who worked on, I believe it was the CloudWatch team. Okay.
Daniel Vassallo, yeah. So he left AWS, uh, after I think about 10 years, and then he wrote a book called, uh, The Good Parts of AWS. Oh! And, if you read his book, most of the services he says to use are the ones that are, like, old. Yeah. He's, he's basically saying, like, S3, you know you're good. Yeah. Right? But then all these, if you look at the AWS webpage, they have who knows, I don't know how many hundreds of services. Yeah. He's, he's kind of like, I worked there and I would not use, you know, all these new services. 'Cause I myself, I don't trust [00:15:14] David: it yet. Right. And so, and they're working there? Yeah, they're working there. Yeah. No. One of the VPs at Stitch Fix had worked on Google Cloud and so when we were doing this transition from Heroku, he was like, we are not using Google Cloud. I was like, really? He's like, AWS is far ahead of the game. Do not use Google Cloud. I was like, all right, I don't need any more info. You work there. You said don't. I'm gonna believe you. So [00:15:36] Jeremy: what, what was his, did he have like a core point? [00:15:39] David: Um, so he never really had anything bad to say about Google per se. Like I think he enjoyed his time there and I think he thought highly of who he worked with and what he worked on and that sort of thing. But his, where he was coming from was like, AWS was so far ahead of Google on anything that we would use, he was like, there's, there's really no advantage to, to doing it. AWS is a known quantity, right? It's probably still the case. It's like, you know, you've heard the nobody ever got fired for using IBM or using Microsoft or whatever the thing is. Like, I think that's, that was kind of the vibe. And he was like, moving all of our infrastructure right before we're going to go public, this is a serious business. We should just use something that we know will work. And he was like, I know this will work. I'm not confident about Google, uh, for our use case. So we shouldn't, we shouldn't risk it. So I was like, okay, I trust you because I didn't know anything about any of that stuff at the time. I knew Heroku and that was it. So, yeah. [00:16:34] Jeremy: I don't know if it's good or bad, but like you said, AWS seems to be the default choice. Yeah. And I mean, there's people who use Azure. I assume it's mostly, primarily, Microsoft. Yeah. And then there's Google Cloud. It's not really clear why you would pick it, unless there was a specific service or something that only they had. [00:16:55] David: Yeah, yeah. Or you're invested in Google, you know, you want to keep everything there. I mean, I don't know. I haven't really been at that level to make that kind of decision, and I would probably choose AWS for the reasons discussed, but, yeah. Moving off Heroku [00:17:10] Jeremy: And then, so at Stitch Fix, you said you moved off of Heroku [00:17:16] David: yeah. Yeah, so we were heavy into Heroku. I think that we were told that at one point we had the biggest Heroku Postgres database on their platform. Not a good place to be, right? You never want to be the biggest customer, usually. But the problem we were facing was essentially we were going to go public. And to do that, you're under all the scrutiny about many things, including the IT systems and the security around there. So, like, by default, a Postgres, a Heroku Postgres database is, like, on the internet. It's only secured by the password. All their services are on the internet. So, not, not ideal.
They were developing their private cloud service at that time. And so that would have given us, in theory, on paper, it would have solved all of our problems. And we liked Heroku and we liked the developer experience. It was great. But... Heroku Private Spaces, it was still early. There were a lot of limitations that, when they explained why those limitations existed, seemed reasonable. And if we had started from scratch on Heroku Private Spaces, it probably would have worked great, but we hadn't. So we just couldn't make it work. So we were like, okay, we're going to have to move to AWS so that everything can be basically off the internet. Like our public website needs to be on the internet and that's kind of it. So we need to, so that basically was the, was the impetus for that. But it's too bad because I love Heroku. It was great. I mean, they were, they were a great partner. They were great. I think if Stitch Fix had started life a year later... Private Spaces now, it's, it's way different than it was then, 'cause it's been, it's a mature product now, so we could have easily done that, but you know, the timing didn't work out, unfortunately. [00:18:50] Jeremy: And that was a compliance thing too? [00:18:53] David: Yeah. And compliance is weird cause they don't tell you what to do, but they give you some parameters that you need to meet. And so one of them is like how you control access. So, so going public, the compliance is around the financial data and ensuring that the financial data is accurate. So a lot of the systems at Stitch Fix were storing the financial data. We, you know, the warehouse management system was custom made. Uh, all the credit card processing was all done, like it was all in some databases that we had running in Heroku. And so those needed to be subject to stricter security than we could achieve with just a single password that we just had to remember to rotate when someone like left the team. So that was, you know, the kind of, the kind of impetus for, for all of that. [00:19:35] Jeremy: When you were using Heroku, Salesforce would have already owned it then. Did you, did you get any sense that you weren't really sure about the future of the platform while you're on it, or, [00:19:45] David: At that time, no, it seemed like they were still innovating. So like, Heroku has a Redis product now. They didn't at the time; we wish that they did. They told us they're working on it, but it wasn't ready. We didn't like using the third parties. Kafka was not a thing. We very much were interested in that. We would have totally used it if it was there. So they were still, like, doing bigger innovations then than it seems like they are now. I don't know. It's weird. Like they're still there. They still make money, I assume, for Salesforce. So it doesn't feel like they're going away, but they're not innovating at the pace that they were kind of back in the day. [00:20:20] Jeremy: It used to feel like when somebody's asking, I want to host a Rails app, then you would say like, well, use Heroku because it's basically the easiest to get started. It's a known quantity and it's, it's expensive, but it seemed for, for most people, it was worth it. And then now if I talk to people, it's like... not what people suggest anymore. [00:20:40] David: Yeah, because there's, there's actual competitors. It's crazy to me that there was no competitors for years, and now there's like, Render and Fly.io seem to be the two popular alternatives. Um, I doubt they're any cheaper, honestly, but...
You get a sense, right, that they're still innovating, still building those platforms, and they can build with, you know, all of the knowledge of what has come before them, and do things differently that might, that might help. So, I still use Heroku for personal things just because I know it, and I, you know, sometimes you don't feel like learning a new thing when you just want to get something done, but, yeah, I, I don't know, if we were starting again, I don't know, maybe I'd look into those things. They, they seem like they're getting pretty mature, and Heroku's resting on its laurels, still. [00:21:26] Jeremy: I guess I never quite got the mindset, right? Where you, you have a platform that's doing really well and people really like it and you acquire it and then it just... It seems like you would want to keep it rolling, right? (laughs) [00:21:38] David: Yeah, it's, it is wild, I mean, I guess... Why did you, what was Salesforce thinking they were going to get? Uh, who knows, maybe the person at Salesforce that really wanted to purchase it isn't there. And so no one at Salesforce cares about it. I mean, there's all these weird company politics that like, who knows what's going on and you could speculate all day. What's interesting is like, there's definitely some people in the Ruby community who work there and still are working there. And that's like a little bit of a canary for me. I'm like, all right, well, if that person's still working there, and that person seems like they're on the level and, and seems pretty good, they're still working there, it, it's gotta be still a cool place to be or still doing something, something good. But, yeah, I don't know. I would, I would love to know what was going on in all the Salesforce meetings about acquiring that, how to manage it. What are their plans for it? I would love to know that stuff. [00:22:29] Jeremy: Maybe you had some experience with this at Stitch Fix, but I've heard with Heroku some of their support staff, at least in the past, they would, to some extent, actually help you troubleshoot, like, what's going on with your app. Like, if your app is, like, using a whole bunch of memory, and you're out of memory, um, they would actually kind of look into that for you, which is interesting, because it's like, that's almost more like a services thing than it is just a platform. [00:22:50] David: Yeah. I mean, they, their support, you would get, you would get escalated to like an engineer sometimes, like who worked on that stuff, and they would help figure out what the problem was. Like you got the sense that everybody there really wanted the platform to be good and that they were all sort of motivated to make sure that everybody, you know, did well and used the platform. And they also were good at, like, a thing that trips everybody up about Heroku is that your app restarts every day. And if you don't know anything about anything, you might think that is stupid. Why, why would I want that? That's annoying. And I definitely went through that and I complained to them a lot. And I'm like, if you only could not restart. And they very patiently and politely explained to me why it needed to do that, that they weren't going to remove it, and how to think about my app given that reality, right? Which is great because like, what company does that, right? From the engineers that are working on it, like, no, nobody does that. So, yeah, no, I haven't escalated anything to support at Heroku in quite some time, so I don't know if it's still like that.
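(A footnote on the daily-restart point David raises: Heroku cycles dynos roughly every 24 hours and sends a SIGTERM before doing so, so the usual way to "think about your app given that reality" is to trap the signal and wind down cleanly. A minimal sketch, with a hypothetical worker loop:)

    # Heroku sends SIGTERM before cycling a dyno (a SIGKILL follows shortly
    # after), so flag the shutdown and let in-flight work finish.
    $shutting_down = false
    Signal.trap('TERM') { $shutting_down = true }

    # 'process_next_job' is a made-up stand-in for the app's real work.
    process_next_job until $shutting_down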
I hope it is, but I'm not really, not really sure. Building a platform team [00:23:55] Jeremy: Yeah, that, uh, that reminds me a little bit of, I think it's Rackspace? There's, there's, like, another hosting provider that was pretty popular before, and they... used to be famous for that type of support, where like your, your app's having issues and somebody's actually, uh, SSHing into your box and trying to figure out like, okay, what's going on? Which if, if that's happening, then I, I can totally see where the, the price is justified. But if the support is kind of like dropping off to where it's just, they don't do that kind of thing, then yeah, I can see why it's not so much of a, yeah, [00:24:27] David: We used to think of Heroku as like they were the platform team before we had our own platform team, and they, they acted like it, which was great. [00:24:35] Jeremy: Yeah, I don't have, um, experience with Render, but I, I, I did talk to someone from there, and it does seem like they're, they're trying to fill that role, um, so, yeah, hopefully they and, and other companies, I guess like Vercel and things like that, um, they're, they're all trying to fill that space, [00:24:55] David: Yeah, cause, cause building our own internal platform, I mean it was the right thing to do, but it's, it's a, you can't just, you have to have a team on it, it's complicated, getting all the stuff in AWS to work the way you want it to work, to have it be kind of like Heroku, like it's not trivial. If I'm a one person company, I don't want to be messing around with that particularly. I want to just have it, you know, push it up and have it go, and I'm willing to pay for that. So it seems logical that there would be competitors in that space. I'm glad there are. Hopefully that'll light a fire under, under everybody. [00:25:26] Jeremy: So in your case, it sounds like you moved to having your own platform team and stuff like that, uh, partly because of the compliance thing where you're like, we need our, we need to be isolated from the internet. We're going to go to AWS. If you didn't have that requirement, do you still think like that would have been the time to, to have your own platform team and manage that all yourself? [00:25:46] David: I don't know. We, we were thinking an issue that we were running into when we got bigger, um, was that, I mean, Heroku, it, it's obviously not as flexible as AWS, but it is still very flexible. And so we had a lot of internal documentation about, this is how you use Heroku to do X, Y, and Z. This is how you set up a Stitch Fix app for Heroku. Like there was just the way that we wanted it to be used to sort of just make it all manageable. And so we were considering having a team spun up to sort of add some tooling around that to sort of make that a little bit easier for everybody. So I think there may have been something around there. I don't know if it would have been called a platform team. Maybe we'd call, we thought about calling it like developer happiness, or, you know, you've got developer experience, or something. We, we probably would have had something there, but I do wonder how easy it would have been to fund that team with developers if we hadn't had these sort of business constraints around there. Yeah, um, I don't know. You get to a certain size, you need some kind of manageability and consistency no matter what you're using underneath. So you've got to have, somebody has to own it to make sure that it's, that it's happening.
[00:26:50] Jeremy: So even at your, your architect level, you still think it would have been a challenge to, to come to the executive team and go like, I need funding to build this team. [00:27:00] David: You know, certainly it's a challenge because everybody, you know, right? Nobody wants to put developers in anything, right? They are, they are a commodity, and I mean, that is kind of the job of like, you know, the staff engineer or the architect at a company: you don't have, you don't have the power to put anybody on anything. You, you have the power to schedule a meeting with a VP or the CTO and they will listen to you. And that's basically, you've got to use that power to convince them of what you want done. And they're all reasonable people, but they're balancing 20 other priorities. So it would, I would have had to, it would have been a harder case to make that, hey, I want to take three engineers and have them write tooling to make Heroku easier to use. What? Heroku is not easy to use. Why aren't, you know, so you really, I would, it would be a little bit more of a stretch to walk them through it. I think a case could be made, but, definitely would take some more, more convincing than, than what was needed in our case. [00:27:53] Jeremy: Yeah. And I guess if you're able to contrast that with, you were saying, Oh, I need three people to help me make Heroku easier. Your actual platform team on AWS, I imagine was much larger, right? [00:28:03] David: Initially it was, there was, it was three people that did the initial move over. And so by the time we went public, we'd been on this new system for, I don't know, six to nine months. I can't remember exactly. And so at that time the platform team was four or five people, and I, I mean, so percentage wise, right, the engineering team was maybe almost 200, 150, 200. So percentage wise, maybe a little small, I don't know. But it kind of gets back to the power of, like, Rails and the one-person framework. Like everything we did was very much the same. And so the Rails app that managed the deployment was very simple. The, the command line app, even the Go one with all of its verbosity, was very, very simple. So it was pretty easy for that small team to manage. But, yeah, so it was sort of like, for redundancy, we probably needed more than three or four people, because you know, somebody goes out sick or takes a vacation. That's a significant part of the team. But in terms of like just managing the complexity and building it and maintaining it, like it worked pretty well with, you know, four or five people. Where Rails fits in vs other technology [00:29:09] Jeremy: So during the keynote today, they were talking about how companies like GitHub and Shopify and so on, they're, they're using Rails and they're, they're successful and they're fairly large. But I think the thing that was sort of unsaid was the fact that these companies, while they use Rails, they use a lot of other technology as well. And, and, and kind of increasing amounts as well. So, I wonder from your perspective, either from your experience at Stitch Fix or maybe going forward, what is the role that, that Ruby and Rails plays? Like, where does it make sense for that to be used versus like, okay, we need to go and build something in Java or, you know, or Go, that sort of thing? [00:29:51] David: Right. I mean, I think for like your standard database backed web app, it's obviously great.
Especially if your sort of mindset is bought into server-side rendering, it's going to be great at that. So like internal tools, like the customer service dashboard or... you know, something for like somebody who works at a company to use. Like, it's really great because you can go super fast. You're not going to be under a lot of performance constraints. So you kind of don't even have to think about it. Don't even have to solve it. You can, but you don't have to. Where it wouldn't work, I guess, you know, if you have really strict performance requirements, you know, like a, a Go version of some API server is going to use like percentages of what, of what Rails would use. If that's meaningful, if what you're spending on memory or compute is, is meaningful, then, then yeah, that, that becomes worthy of consideration. I guess if you're, you know, if you're making a mobile app, you probably need to make a mobile app and use those platforms. I mean, I guess you can wrap a Rails app sort of, but you're still making, you still need to make a mobile app that does something. Yeah. And then, you know, interestingly, the data science part of Stitch Fix was not part of the engineering team. They were kind of a separate org. I think Ruby and Rails was probably the only thing they didn't use over there. Like all the ML stuff, everything is either Java or Scala or Python. They use all that stuff. And so, yeah, if you want to do AI and ML with Ruby, you, it's, it's hard cause there's just not a lot there. You really probably should use Python. It'll make your life easier. So yeah, those would be some of the considerations, I guess. [00:31:31] Jeremy: Yeah, so I guess in the case of, ML, Python, certainly, just because of the, the ecosystem. For maybe making a command line application, maybe Go, um, Go or Rust, perhaps, [00:31:44] David: Right. Cause you just get a single binary. Like the problem, I mean, I wrote this book on Ruby command line apps, and the biggest problem is like, how do I get the Ruby VM to be anywhere so that it can then run my like awesome scripts? Like that's kind of a huge pain. (laughs) So [00:31:59] Jeremy: and then you said, like, if it's very performance sensitive, which I am kind of curious, in, in your experience with the companies you've worked at, when you're taking on a project like that, do you know up front where you're like, oh, the CPU and memory usage is going to be a problem, or is it like you build it and you're like, oh, this isn't working, so now I know? [00:32:18] David: Yeah, I mean, I, I don't have a ton of great experience there at Stitch Fix. The biggest expense the company had was the inventory. So like the, the cost of AWS was just de minimis compared to all that. So nobody ever came and said, hey, you've got to like really save costs on, on that stuff, cause it just didn't really matter. At the, the mental health startup I was at, it was too early. But again, the labor costs just far, far exceeded the amount of money I was spending on, on, um, you know, compute and infrastructure and stuff like that. So, not knowing anything, I would probably just sort of wait and see if it's a problem. But I suppose you always take into account, like, what am I actually building? And like, what does this business have to scale to, to make it worthwhile? And therefore you can kind of do a little bit of planning ahead there. But, I dunno, I think it would kind of have to depend.
[00:33:07] Jeremy: There's a sort of, I guess you could call it a meme, where people say like, oh, it's, it's not, it's not Rails that's slow, it's the, the database that's slow. And, uh, I wonder, is that, is that accurate in your experience, or, [00:33:20] David: I mean, most of the stuff that we had that was slow was the database, because like, it's really easy to write a crappy query in Rails if you're not, if you're not careful, and then it's really easy to design a database that doesn't have any indexes if you're not careful. Like, you, you kind of need to know that. But of course, those are easy to fix too, because you just add the index, especially if it's before the database gets too big, where adding indexes is problematic. But, I think those are just easy performance mistakes to make. Uh, especially with Rails because you're not, I mean, a lot of the Rails developers at Stitch Fix did not know SQL at all. I mean, they had to learn it eventually, but they didn't know it at all. So they're not even knowing that what they're writing could possibly be problematic. It's just, you're writing it the Rails way and it just kind of works. And at a small scale, it does. And it doesn't matter until, until one day it does. [00:34:06] Jeremy: And then in, in the context of, let's say, using ActiveRecord and instantiating the objects, or, uh, the time it takes to render templates, those kinds of things, uh, at least in your experience, that wasn't so much of an issue. [00:34:20] David: No, and it was always, I mean, whenever we looked at why something was slow, it was always the database, and like, you know, you're iterating over some active records and then, and then, you know, you're going into there and you're just following this object graph. A lot of the, a lot of the software at Stitch Fix was like internal stuff, and it was visualizing complicated data out of the database. And so if you didn't think about it, you would just start dereferencing and following those relationships, and you have this just massive view, and like the HTML is fine. It's just that to render this div, you're digging into some ActiveRecord super deep. And so, you know, those were usually the, the, the problems that we would see, and they're usually easy enough to fix by making an index, or sometimes you do some caching or something like that. And that solved most of the, most of the issues. [00:35:09] The different ways people learn [00:35:09] Jeremy: So you're also the author of the book, Sustainable Web Development with Ruby on Rails. And when you talk to people about like how they learn things, a lot of them are going on YouTube, they're going on, uh, you know, looking for blogs and things like that. And so as an author, what do you think the role is of, of books now? Yeah, [00:35:29] David: I have thought about this a lot, because I, when I first got started, I'm pretty old, so books were all you had, really. Um, so they seem very normal and natural to me, but... does someone want to sit down and read a 400 page technical book? I don't know. So Dave Thomas, who runs Pragmatic Bookshelf, he was on a podcast and was asked the same question, and basically his answer, which is my answer, is like, a long form book is where you can really lay out your thinking, really clarify what you mean, really take the time to develop sometimes nuanced examples or nuanced takes on something that are pretty hard to do in a short form video or in a blog post.
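(A concrete sketch of the two database mistakes David describes in the exchange above, the N+1 query and the missing index, using hypothetical Rails models:)

    # Hypothetical domain: a Client has many Shipments.

    # N+1: one query to load clients, then one COUNT query per client.
    Client.all.each do |client|
      puts client.shipments.count
    end

    # Fix: eager-load the association so it's two queries total.
    Client.includes(:shipments).each do |client|
      puts client.shipments.size   # uses the preloaded records, no extra query
    end

    # Missing index: filtering on an unindexed column scans the whole table.
    # The fix is a one-line migration:
    class AddIndexToShipmentsOnStatus < ActiveRecord::Migration[7.0]
      def change
        add_index :shipments, :status
      end
    end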
Because the expectation is, you know, someone sends you an hour long YouTube video, you're probably not going to watch that. A two minute YouTube video, sure, but you can't, you can't get into so much, kind of, nuanced detail. And so I thought that was, was right. And that was kind of my motivation for writing. I've got some thoughts. They're too detailed. It's, it's too much setup for a blog post. There's too much of a nuanced element to like, really get across. So I need to like, write more. And that means that someone's going to have to read more to kind of get to it. But hopefully it'll be, it'll be valuable. One of the sessions that we're doing later today is Ruby content creators, where it's going to be me and Noel Rappin and Dave Thomas representing the old school dudes that write books, and probably a bunch of other people that do, you know, podcasts, videos. It'd be interesting to see, I really want to know, how do people learn stuff? Because if no one reads books to learn things, then there's not a lot of point in doing it. But if there is value, then, you know, it should be good and should be accessible to people. So, that's why I do it. But I definitely recognize maybe I'm too old and, uh, I'm not hip with the kids or, or whatever, whatever the case is. I don't know. [00:37:20] Jeremy: It's tricky because, I think it depends on where you are in the process of learning that thing. Because, let's say, you know a fair amount about the technology already. And you look at a book, in a lot of cases it's, it's sort of like taking you from nothing to something. And so you're like, well, maybe half of this isn't relevant to me, but then if I don't read it, then I'm probably missing a lot still. And so you're in this weird in-between zone. Another thing is that a lot of times when people are trying to learn something, they have a specific problem. And, um, I guess with, with books, it's, you kind of don't know for sure if the thing you're looking for is going to be in the book. [00:38:13] David: I mean, so my, so my book, I would not say it's for a beginner. It's not a book to learn how to do Rails. It's like, you already kind of know Rails and you want to like learn some comprehensive practices. That's what my book is for. And so sometimes people will ask me, I don't know Rails, should I get your book? And I'm like, no, you should not. But then you have the opposite thing, where like, Agile Web Development with Rails is like the beginner version. And some people are like, oh, it's being updated for Rails 7, should I get it? I'm like, probably not, because how to go from zero to Rails hasn't changed a lot in years. There's not that much that's going to be new. But, how do you know that, right? Hopefully the table of contents tells you. I mean, the first book I wrote with Pragmatic, they basically were like, the table of contents is the only thing the reader, potential reader, is going to have to have any idea what's in the book. So, you need to write the table of contents with that in mind, which may not be how you'd write the subsections of a book, but since you know that it's going to serve these dual purposes of organizing the book, but also being promotional material that people can read, you've got to keep that in mind, because otherwise, how does anybody, like you said, how does anybody know what's, what's going to be in there? And they're not cheap. I mean, these books are 50 bucks sometimes, and that's a lot of money for people in the U.S. For people outside the U.S.,
that's a ton of money. So you want to make sure that they know what they're getting and don't feel ripped off. [00:39:33] Jeremy: Yeah, I think the other challenge is, at least what I've heard, is that... when people see a video course, for whatever reason, they, they set, like, a higher value on it. They go, like, oh, this video course is, like, 200 dollars, and it seems like a lot of money, but for some people it's, like, okay, I can do that. But then if you say, like, oh, this, this book I've been researching for five years, uh, I want to sell it for a hundred bucks, people are going to be like, no. No way. [00:40:00] David: Yeah. Right. A hundred bucks for a book. There's no way. That's a, that's a lot. Yeah. I mean, producing video, I've thought about doing video content, but it seems so labor intensive. Um, and it's kind of like, it's sort of like a performance. Like I was mentioning before we started that I used to play in bands, and like, there's a lot that goes into making even a mediocre performance. And so I feel like, you know, video content is the same way. So I get that it, like, it does cost more to produce, but are you getting more information out of it? I, that, I don't know, like maybe not, but who knows? I mean, people learn things in different ways. So, [00:40:35] Jeremy: It's just like this perception thing, I think. And, uh, I'm not sure why that is. Um, [00:40:40] David: Yeah, maybe it's newer, right? Maybe books feel older, so they're easier to make, and video seems newer. I mean, I don't know. I would love to talk to engineers who are like... young, out of college, a few years into their career, to see what their perception of this stuff is. Cause I mean, there was no, I mean, like I said, I read books cause that's all there was. There was no, no videos. You, you go to a conference and you read a book, and that was, that was all you had. So I get it. A whole video, it's fancier. It's newer. Yeah, I don't know. I would love to hear a wide variety of takes on it to see what's actually the, the future, you know? [00:41:15] Jeremy: Sure, yeah. I mean, I think it probably can't just be one or the other, right? Like, I think there are... benefits of each way. Like, if you have the book, you can read it at your own pace without having to, like, scroll through the video, and you can easily copy and paste the, the code segments, [00:41:35] David: Search it. Go back and forth. [00:41:36] Jeremy: yeah, search it. So, I think there's a place for it, but yeah, I think it would be very interesting, like you said, to, to see, like, how are people learning, [00:41:45] David: Right. Right. Yeah. Well, it's the same with blogs and podcasts. Like I, a lot of podcasters I think used to be bloggers, and they realized that like they can get out what they need by doing a podcast. And it's way easier because it's more conversational. You don't have to do a bunch of research. You don't have to do a bunch of editing. As long as you're semi coherent, you can just have a conversation with somebody and sort of get at some sort of thing that you want to talk about or have an opinion about. And so you, you, you see a lot more podcasts and a lot less blogs out there because of that. So it's, that's kind of like, the creators I think are kind of driving that a little bit. Yeah. So I don't know. [00:42:22] Jeremy: Yeah, I mean, I can, I can say for myself, the thing about podcasts is that it's something that I can listen to while I'm doing something else.
And so you sort of passively can hopefully pick something up out of that conversation, but... like, I think it's maybe not so good at the details, right? Like, if you're talking code, you can talk about it over voice, but can you really visualize it? Yeah, yeah, yeah. I think if you sit down and you try to implement something somebody talked about, you're gonna be like, I don't know what's happening. [00:42:51] David: Yeah. [00:42:52] Jeremy: So, uh, so, so I think there's like these, these different roles, I think, almost. So like, maybe, you know, the podcast is for you to maybe get some ideas or get some familiarity with a thing, and then when you're ready to go deeper, you can go look at a blog post or read a book. I think video kind of straddles those two, where sometimes video is good if you want to just see the general concept of a thing, and have somebody explain it to you, maybe do some visuals. That's really good. But then it can also be kind of detailed, where, especially like the people who stream their process, right, you can see them, oh, let's, let's build this thing together. You can ask me questions, you can see how I think. I think that can be really powerful. At the same time, like you said, it can be hard to say, like, you know, I look at some of the streams and it's like, oh, this is a three hour stream and like, well, I mean, I'm interested. I'm interested, but yeah, it's hard enough for me to sit through a, uh, a three hour movie, [00:43:52] David: Well, then that, and that gets into like, I mean, we're, you know, we're at a conference, and they, they're doing something a little, like, there are conference talks at this conference, but there's also, like, sort of less defined activities that aren't a conference talk. And I think that could be a reaction to some of this too. It's like, I could watch a conference talk on, on video. How different is that going to be than being there in person? Maybe it's not that different. Maybe, maybe I don't need to like travel across the country to go do something that I could see on video. So there's gotta be something here that, that, that meets that need that I can't meet any other way. So it's all these different, like, I would like to think that's how it is, right? All this media all has a part to play and it's all going to kind of continue and thrive, and it's not going to be like, oh, remember books? Like maybe, but hopefully not. Hopefully it's like, like what you're saying. Like it's all kind of serving different purposes that all kind of work together. Yeah. [00:44:43] Jeremy: I hope that's the case, because, um, I don't want to have to scroll through too many videos. [00:44:48] David: Yeah. The video's not for me. Large Language Models [00:44:50] Jeremy: I, I like, I actually do find it helpful, like, like I said, for the high level thing, or just to see someone's thought process, but it's like, if you want to know a thing, and you have a short amount of time, maybe not the best. Um, of course, now you have all the large language model stuff where you like, you feed the video in, like, hey, tell, tell, tell me, uh, what this video is about and give me the code snippets and all that stuff. I don't know how well it works, but it seems [00:45:14] David: It's gotta get better. Cause you go to a support site and they're like, here's how to fix your problem, and it's a video. And I'm like, can you just tell me? But I'd never thought about asking the AI to just look at the video and tell me. So yeah, it's not bad.
[00:45:25] Jeremy: I think that's probably where we're going. So it's, uh, it's a little weird to think about, but, [00:45:29] David: yeah, yeah. I was just updating, uh, you know, like I said, I try to keep the book updated when new versions of Rails come out, so I'm getting ready to update it for Rails 7.1, and in Amazon's Kindle Direct Publishing, their sort of backend for where you, you know, publish like a Kindle book and stuff, they added a new question: was AI used in the production of this thing or not? And if you answer yes, they want you to say how much. And I don't know what they're gonna do with that exactly, but I thought it was pretty interesting, cause I would be very disappointed to pay $50 for a book that the AI wrote, right? So it's good that they're asking that? Yeah. [00:46:02] Jeremy: I think the problem Amazon is facing is where people wholesale have the AI write the book, and the person either doesn't review it at all, or maybe looks at a little, a little bit. And, I mean, the, the large language model stuff is very impressive, but if you have it generate a technical book for you, it's not going to be good. [00:46:22] David: Yeah. And I guess, cause, cause like Amazon, I mean, think about like Amazon scale, like they're not looking at the book at all. Like I, I can go click a button and have my book available and no person's going to look at it. They might scan it or something, maybe looking for bad words. I don't know, but there's no curation process there. So I could, yeah, I could see where they could have that, that kind of problem. And like you as the, as the buyer, you don't necessarily... if you want a book on something really esoteric, there are a lot of topics I wish there was a book on that there isn't, and if someone generated one and put it on Amazon, I could see a lot of people buying it, not realizing what they're getting, and feeling ripped off when it was not good. [00:47:00] Jeremy: Yeah, I mean, I, I don't know if it's an issue with the, the technical stuff. It probably is. But I, I know they've definitely had problems where, with fiction, they have people just generating hundreds, thousands of books, submitting them all, just flooding it. [00:47:13] David: Seeing what happens. [00:47:14] Jeremy: And, um, I think that's probably... that's probably the main reason why they ask you, cause they want you to say like, uh, yeah, you said it wasn't. And so now we can remove your book. [00:47:24] David: right. Right. Yeah. Yeah. [00:47:26] Jeremy: I mean, it's, it's not quite the same, but it's similar to, I don't know what Stack Overflow's policy is now, but, when the large language model stuff started getting big, they had a lot of people answering the questions that were just pasting the question into the model [00:47:41] David: Which because they got it from [00:47:42] Jeremy: and then [00:47:43] David: The GPT model got it from Stack Overflow. [00:47:45] Jeremy: and then pasting the answer into Stack Overflow and the person is not checking it. Right. So it's like, could be right, could not be right. Um, cause, cause to me, it's like, if, if you generate it, if you generate the answer and the answer is right, and you checked it, I'm okay with that. [00:48:00] David: Yeah. Yeah. [00:48:01] Jeremy: but if you're just like, I, I need some karma, so I'm gonna, I'm gonna answer these questions with, with this bot, I mean, then maybe [00:48:08] David: I could have done that. You're not adding anything. Yeah, yeah.
[00:48:11] Jeremy: It's gonna be a weird, weird world, I think. [00:48:12] David: Yeah, no kidding. No kidding. [00:48:15] Jeremy: That's a, a good place to end it on, but is there anything else you want to mention? [00:48:19] David: No, I think we covered it all. Just, yeah, you can find me online. I'm davetron5000 on ruby.social, on Mastodon. I occasionally post on Twitter, but not that much anymore. So Mastodon's a place to go. [00:48:31] Jeremy: David, thank you so much [00:48:32] David: All right. Well, thanks for having me.

Screaming in the Cloud
Making an Affordable Event Data Solution with Seif Lotfy

Screaming in the Cloud

Play Episode Listen Later Oct 19, 2023 27:49


Seif Lotfy, Co-Founder and CTO at Axiom, joins Corey on Screaming in the Cloud to discuss how and why Axiom has taken a low-cost approach to event data. Seif describes the events that led to him helping co-found a company, and explains why the team wrote all their code from scratch. Corey and Seif discuss their views on AWS pricing, and Seif shares his views on why AWS doesn't have to compete on price. Seif also reveals some of the exciting new products and features that Axiom is currently working on. About Seif: Seif is the bubbly Co-founder and CTO of Axiom, where he has helped build the next generation of logging, tracing, and metrics. His background is at Xamarin and Deutsche Telekom, and he is the kind of deep technical nerd that geeks out on white papers about emerging technology and then goes to see what he can build. Links Referenced: Axiom: https://axiom.co/ Twitter: https://twitter.com/seiflotfy Transcript Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest episode is brought to us by my friends, and soon to be yours, over at Axiom. Today I'm talking with Seif Lotfy, who's the co-founder and CTO of Axiom. Seif, how are you? Seif: Hey, Corey, I am very good, thank you. It's pretty late here, but it's worth it. I'm excited to be on this interview. How are you today? Corey: I'm not dead yet. It's weird, I see you at a bunch of different conferences, and I keep forgetting that you do in fact live half a world away. Is the entire company based in Europe? And where are you folks? Where do you start and where do you stop geographically? Let's start there. We over—everyone dives right into product. No, no, no. I want to know where in the world people sit because apparently, that's the most important thing about a company in 2023. Seif: Unless you ask Zoom because they're undoing whatever they did. We're from New Zealand, all the way to San Francisco, and everything in between. So, we have people in Egypt and Nigeria, all around Europe, all around the US… and UK, if you don't consider it Europe anymore. Corey: Yeah, it really depends. There's a lot of unfortunate naming that needs to get changed in the wake of that. Seif: [laugh]. Corey: But enough about geopolitics. Let's talk about industry politics. I've been a fan of Axiom for a while and I was somewhat surprised to realize how long it had been around because I only heard about you folks a couple of years back. What is it you folks do? Because I know how I think about what you're up to, but you've also gone through some messaging iteration, and it is a near certainty that I am behind the times. Seif: Well, at this point, we just define ourselves as the best home for event data. So, Axiom is the best home for event data. We try to deal with everything that is event-based, so time-series. So, we can talk metrics, logs, traces, et cetera. And right now predominantly serving engineering and security. And we're trying to be—or we are—the first cloud-native time-series platform to provide streaming search, reporting, and monitoring capabilities. And we're built from the ground up, by the way.
Like, we didn't actually—we're not using Parquet [unintelligible 00:02:36] thing. We built completely everything from the ground up. Corey: When I first started talking to you folks a few years back, there were two points to me that really stood out, and I know at least one of them still holds true. The first is that at the time, you were primarily talking about log data. Just send all your logs over to Axiom. The end. And that was a simple message that was simple enough that I could understand it, frankly. Because back when I was slinging servers around and you know breaking half of them, logs were effectively how we kept track of what was going on, where. These days, it feels like everything has been repainted with a very broad brush called observability, and the takeaway from most company pitches has been, you must be smarter than you are to understand what it is that we're up to. And in some cases, you scratch below the surface and realize, no, they have no idea what they're talking about either and they're really hoping you don't call them on that. Seif: It's packaging. Corey: Yeah. It is packaging and that's important. Seif: It's literally packaging. If you look at it, traces and logs, these are events. There's a timestamp and just data with it. It's a timestamp and data with it, right? Even metrics is all the way to that point. And a good example, now everybody's jumping on [OTel 00:03:46]. For me, OTel is nothing else but a different structure for time series, for different types of time series, and that can be used differently, right? Or at least not used differently, but you can leverage it differently. Corey: And the other thing that you did that was interesting and is a lot, I think, more sustainable as far as [moats 00:04:04] go, rather than things that can be changed on a billboard or whatnot, is your economic position. And your pricing has changed around somewhat, but I ran a number of analyses on the costs that you were passing on to customers, and my takeaway was that it was a little bit more expensive to store data for logs in Axiom than it was to store it in S3, but not by much. And it just blew away the price point of everything else focused around logs, including AWS; you're paying 50 cents a gigabyte to ingest CloudWatch logs data over there. Other companies are charging multiples of that, and Cisco recently bought Splunk for $28 billion because it was cheaper than paying their annual Splunk bill. How did you get to that price point? Is it just a matter of everyone else being greedy or have you done something different? Seif: We looked at it from the perspective of… so there's the three L's of logging. I forgot the name of the person at Netflix who talked about that, but basically, it's low cost, low latency, large scale, right? And you will never be able to fulfill all three of them. And we decided to work on low cost and large scale. And in terms of low latency, we won't be as low as others like ClickHouse, but we are low enough. Like, we're fast enough. The idea is to be fast enough because in most cases, I don't want to compete on milliseconds. I think if the user can see his data in two seconds, he's happy. Or three seconds, he's happy. I'm not going to be, like, one to two seconds and make the cost exponentially higher because I'm one second faster than the other.
And that's, I think, the way we approached this from day one. And from day one, we also started utilizing the idea of the existence of Open—Object Storage; we have our own compressions, our own encodings, et cetera, from day one, too, and we still stick to that. That's why we never converted to other existing things like Parquet. Also because we are Schema-On-Read, which Parquet doesn't really allow you to do. But other than that, it's… from day one, we wanted to save costs by also making coordination free. So, ingest has to be coordination free, right, because then we don't run a shitty Kafka, like, honestly a lot—a lot of the [logs 00:06:19] companies who run a Kafka in front of it, the Kafka tax reflects in what they—the bill that you're paying for them. Corey: What I found fun about your pricing model is it gets to a point that for any reasonable workload, how much to log or what to log or sample or keep everything is no longer an investment decision; it's just go ahead and handle it. And that was originally what you wound up building out. Increasingly, it seems like you're not just the place to send all the logs to, which to be honest, I was excited enough about that. That was replacing one of the projects I did a couple of times myself, which is building highly available, fault-tolerant rsyslog clusters in data centers. Okay, great, you've gotten that unlocked, the economics are great, I don't have to worry about that anymore. And then you started adding interesting things on top of it, analyzing things, replaying events that happen to other players, et cetera, et cetera. It almost feels like you're not just a storage depot, but you also can forward certain things on under a variety of different rules or guises and format them as whatever on the other side is expecting them to be. So, there's a story about integrating with other observability vendors, for example, and only sending the stuff that's germane and relevant to them since everyone loves to charge by ingest. Seif: Yeah. So, we did this one thing called endpoints, number one. Endpoints was a beginning where we said, "Let's let people send us data using whatever API they like using, let's say Elasticsearch, Datadog, Honeycomb, Loki, whatever, and we will just take that data and multiplex it back to them." So, that's how part of it started. This allows us to see, like, how—allows customers to see how we compare to others, but then we took it a bit further and now, it's still in closed invite-only, but we have Pipelines—codenamed Pipelines—which allows you to send data to us and we will keep it as a source of truth, then, given specific rules, we can ship it anywhere, to a different destination, right, and this allows you just to, on the fly, send specific filtered things out to, I don't know, a different vendor or even to S3, or you could send it to Splunk. But at the same time, you can—because we have all your data, you can go back in the past if an incident happens and replay that completely into a different product. Corey: I would say that there's a definite approach to observability, from the perspective of every company tends to visualize stuff a little bit differently. And one of the promises of OTel that I'm seeing as it grows is the idea of, oh, I can send different parts of what I'm seeing off to different providers. But the instrumentation story for OTel is still very much emerging.
Logs are kind of eternal, and the only real change we've seen to logs over the past decade or so has been, instead of just being plain text where their positional parameters would define what was what—if it's in this column, it's an IP address and if it's in this column, it's a return code, and that just wound up being ridiculous—now you see them having schemas; they are structured in a variety of different ways. Which, okay, it's a little harder to wind up just cat'ing a file together and piping it to grep, but there are trade-offs that make it worth it, in my experience. This is one of those transitional products that not only is great once you get to where you're going, from my playing with it, but also it meets you where you already are to get started, because everything you've got is emitting logs somewhere, whether you know it or not. Seif: Yes. And that's why we picked up on OTel, right? Like, one of the first things, we now support… we have an OTel endpoint natively bec—or as a first-class citizen because we wanted to build this experience around OTel in general. Whether we like it or not, and there's more reasons to like it, OTel is a standard that's going to stay and it's going to move us forward. I think OTel will have the same effect, if not bigger, as [unintelligible 00:10:11] back in the day, but now it just went away from metrics, just went to metrics, logs, and traces. Traces is, for me, very interesting because I think OTel is the first one to push it in a standard way. There were several attempts to make standardized [logs 00:10:25], but I think traces was something that OTel really pushed into a proper standard that we can follow. It annoys me that everybody uses different bits and pieces of it and adds something to it, but I think it's also because it's not that mature yet, so people are trying to figure out how to deliver the best experience and package it in a way that it's actually interesting for a user. Corey: What I have found is that there's a lot in this space that is just simply noise. Whenever I spend a protracted time period working on basically anything and I'm still confused by the way people talk about that thing, months or years later, I'm starting to get the realization that maybe I'm not the problem here. And I'm not—I don't mean this to be insulting, but one of the things I've loved about you folks is I've always understood what you're saying. Now, you can hear that as, "Oh, you mean we talk like simpletons?" No, it means what you're talking about resonates with at least a subset of the people who have the problem you solve. That's not nothing. Seif: Yes. We've tried really hard because one of the things we've tried to do is actually bring observability to people who are not always busy with it or it's not part of their day to day. So, we try to bring it to [Vercel 00:11:37] developers, right, by doing a Vercel integration. And all of a sudden, now they have their logs, and they have metrics, and they have some traces. So, all of a sudden, they're doing the observability work. Or they have actual observability for their Vercel-based, [unintelligible 00:11:54]-based product. And we try to meet the people where they are, so we try to—instead of actually telling people, "You should send us data"—I mean, that's what they do now—we try to find, okay, what product are you using and how can we grab data from there and send it to us to make your life easier? You see that we did that with Vercel, we did that with Cloudflare.
AWS, we have extensions, Lambda extensions, et cetera, but we're doing it for more things. For Netlify, it's a one-click integration, too, and that's what we're trying to do to actually make the experience and the journey easier. Corey: I want to change gears a little bit because something that we spent a fair bit of time talking about—it's why we became friends, I would think anyway—is that we have a shared appreciation for several things. One of which, most notable to anyone around us, is whenever we hang out, we greet each other effusively and then immediately begin complaining about costs of cloud services. What is your take on the way that clouds charge for things? And I know it's a bit of a leading question, but it's core and foundational to how you think about Axiom, as well as how you serve customers. Seif: They're ripping us off. I'm sorry [laugh]. They just—the amount of money they make, like, it's crazy. I would love to know what margins they have. That's a big question I've always had. I'm like, what are the margins they have at AWS right now? Corey: Across the board, it's something around 30 to 40%, last time I looked at it. Seif: That's a lot, too. Corey: Well, that's also across the board of everything, to be clear. It is very clear that some services are subsidized by other services. As it should be. If you start charging me per IAM call, we're done. Seif: And also, I mean, the machine learning stuff. Like, they won't be doing that much on top of it right now, right, [else nobody 00:13:32] will be using it. Corey: But data transfer? Yeah, there's a significant upcharge on that. But I hear you. I would moderate it a bit. I don't think that I would say that it's necessarily an intentional ripoff. My problem with most cloud services that they offer is not usually that they're too expensive—though there are exceptions to that—but rather that the dimensions are unpredictable in advance. So, you run something for a while and see what it costs. From where I sit, if a customer uses your service and then at the end of usage is surprised by how much it cost them, you've kind of screwed up. Seif: Look, if they can make egress free—like, you saw how Cloudflare just made the egress of R2 free? Because I am still stuck with AWS because, let's face it, for me, it is still my favorite cloud, right? Cloudflare is my next favorite because of all the features they're trying to develop and the pace they're picking up, the pace at which they're trying to catch up. But again, one of the biggest things I liked is R2, and R2 egress is free. Now, that's interesting, right? But I never saw anything coming back from AWS on S3 for that, you know. I think Amazon is so comfortable because, from a product perspective, they're simple, they have the tools, et cetera. And the UI is not the flashiest one, but you know what you're doing, right? The CLI is not the flashiest one, but you know what you're doing. It is so cool that they don't really need to compete with others yet. And I think they're still dominantly the biggest cloud out there. I think you know more than me about that, but [unintelligible 00:14:57], like, I think they are the biggest one right now in terms of data volume. Like, how many customers are using them, and even in terms of profiles of people using them, it's very, so much. I know, like, a lot of the Microsoft Azure people who are using it are using it because they come from enterprises that have always been Microsoft… very Microsoft friendly.
And eventually, Microsoft also came up in Europe in all these different weird ways. But I feel sometimes ripped off by AWS because I see Cloudflare trying to reduce the prices and AWS just looking, like, “Yeah, you're not a threat to us so we'll keep our prices as they are.”

Corey: I have it on good authority from folks who know that there are reasons behind the economic structures of both of those companies based—in terms of the primary direction the traffic flows and the rest. But across the board, they've done such a poor job of articulating this that, frankly, I think the confusion is on them to clear up, not us.

Seif: True. True. And the reason I picked R2 and S3 to compare there and not look at Workers and Lambdas is because I look at it as R2 is S3-compatible from an API perspective, right? So, they're giving me something that I already use. Everything else I'm using, I'm using inside Amazon, so it's in a VPC, but just the idea. Let me dream. Let me dream that S3 egress will be free at some point.

Corey: I can dream.

Seif: That's like Christmas. It's better than Christmas.

Corey: What I'm surprised about is how reasonable your pricing is in turn. You wind up charging on the basis of ingest, which is basically the only thing that really makes sense for how your company is structured. But it's predictable in advance, the free tier is, what, 500 gigs a month of ingestion, and before people think, “Oh, that doesn't sound like a lot,” I encourage you to just go back and think how much data that really is in the context of logs for any toy project. Like, “Well, our production environment spits out way more than that.” Yes, and by the word production that you just used, you probably shouldn't be using a free trial of anything as your critical path observability tooling. Become a customer, not a user. I'm a big believer in that philosophy, personally. For all of my toy projects that are ridiculous, this is ample.

Seif: People always tend to overestimate how many logs they're going to be sending. So, there's one thing. You said it right: people who already have something going on, they already know how many logs they'll be sending around. But then eventually they're sending too much, and that's why we're back here and they're talking to us. Like, “We want to try your tool, but you know, we'll be sending more than that.” So, if you don't like our pricing, go find something else because I think we are the cheapest out there right now. We're the most competitive, the cheapest out there right now.

Corey: If there is one that is less expensive, I'm unaware of it.

Seif: [laugh].

Corey: And I've been looking, let's be clear. That's not just me saying, “Well, nothing has skittered across my desk.” No, no, no, I pay attention to this space.

Seif: Hey, where's—Corey, we're friends. Loyalty.

Corey: Exactly.

Seif: If you find something, you tell me.

Corey: Oh, if I find something, I'll tell everyone.

Seif: No, no, no, you tell me first and you tell me in a nice way so I can reduce the prices on my site [laugh].

Corey: This is how we start a price war, industry-wide, and I would love to see it.

Seif: [laugh]. But there's enough channels that we share at this point across different Slacks and messaging apps that you should be able to ping me if you find one. Also, get me the name of the CEO and the CTO while you're at it.

Corey: And where they live. Yes, yes, of course. The dire implications will be awesome.

Seif: That was you, not me.
That was your suggestion.

Corey: Exactly.

Seif: I will not—[laugh].

Corey: Before we turn into a bit of an old thud and blunder, let's talk about something else that I'm curious about here. You've been working on Axiom for something like seven years now. You come from a world of databases and events and the like. Why start a company in the model of Axiom? Even back then, when I looked around, my big problem with the entire observability space could never have been described as, “You know what we need? More companies that do exactly this.” What was it that you saw that made you say, “Yeah, we're going to start a company because that sounds easy.”

Seif: So, I'll be very clear. Like, I'm not going to, like, sugarcoat this. We kind of got in a position where it [forced counterweighted 00:19:10]. And [laugh] by that I mean, we came from a company where we were dealing with logs. Like, we actually wrote an event crash analytics tool for a company, but then we ended up wanting to use stuff like Datadog, but we didn't have the budget for that because Datadog was killing us.

So, we ended up hosting our own Elasticsearch. And it cost us more to maintain our Elasticsearch cluster for the logs than to actually maintain our own little infrastructure for the crash events, when we were getting, like, 1 billion crashes a month at that point. So eventually, we just—that was the first burn. And then you had alert fatigue and then you had consolidating events and timestamps and whatnot. The whole thing just seemed very messy.

So, after the company got sold, we started off by saying, “Okay, let's go work on a new self-hosted version of the [unintelligible 00:20:05] where we do metrics and logs.” And then that didn't go as well as we thought it would, but we ended up—because from day one, we were self-hosted, so we wanted to keep costs low, we were working on making it stateless and work against object store. And this is kind of how we started. We realized, oh, we can host this and make it scale, and it won't cost us that much.

So, we did that. And that started gaining more attention. But the reason we started this was we wanted to build a self-hosted version of Datadog that is not costly, and we ended up doing a Software as a Service. I mean, you can still come and self-host, but you'll have to pay money for it, like, proper money for that. But we do a SaaS version of this and instead of trying to be a self-hosted Datadog, we are now trying to compete—or we are competing with Datadog.

Corey: Is the technology that you've built this on top of actually that different from everything else out there, or is this effectively what you see in a lot of places: “Oh, yeah, we're just going to manage Elasticsearch for you because that's annoying.” Do you have anything that distinguishes you from, I guess, the rest of the field?

Seif: Yeah. So, very bluntly, like, I think Scuba was the first thing that started standing out, and then Honeycomb came onto the scene and they started building something based on Scuba, the [unintelligible 00:21:23] principles of Scuba. Then one of the authors of the actual Scuba reached out to me when I told him I'm trying to build something, and he gave me some ideas, and I started building that. And from day one, I said, “Okay, everything in S3. All queries have to be serverless.”

So, all the queries run on functions. There's no real disks. It's just all on S3 right now.
And the biggest issue—achievement—that let us lower our cost was to get rid of Kafka, and have—let's say, behind the scenes we have our own coordination-free mechanism, but the idea is not to actually have to use Kafka at all and thus reduce the costs incredibly. In terms of technology, no, we don't use Elasticsearch.

We wrote everything from the ground up, from scratch, even the query language. Like, we have our own query language that's modeled after Kusto—KQL by Microsoft—so everything we have is built absolutely from the ground up. And no Elastic. I'm not using Elastic anymore. Elastic is a horror for me. Absolute horror.

Corey: People love the API, but no, I've never met anyone who likes managing Elasticsearch or OpenSearch, or whatever we're calling your particular flavor of it. It is a colossal pain, it is subject to significant trade-offs, regardless of how you work with it, and Amazon's managed offering doesn't make it better; it makes it worse in a bunch of ways.

Seif: And the green status of Elasticsearch is a myth. You'll only see it once: the first time you start that cluster is when the Elasticsearch cluster is green. After that, it's just orange, or red. And you know what? I'm happy when it's orange. Elasticsearch kept me up for so long. And we actually had a very interesting situation where we had Elasticsearch running on Azure, on Windows machines, and I would have server [unintelligible 00:23:10]. And I'd have to log in every day—you remember, what's it called—RP… RP Something. What was it called?

Corey: RDP? Remote Desktop Protocol, or something else?

Seif: Yeah, yeah. Where you have to log in, like, you actually have a visual thing, and you have to go in and—

Corey: Yep.

Seif: And visually go in and say, “Please don't restart.” Every day, I'd have to do that. Please don't restart, please don't restart. And also a lot of weird issues, and also at that point, Azure would decide to disconnect the pod and want to try to bring in a new pod, and all these weird things were happening back then. So, eventually, we ended up with a [unintelligible 00:23:39] decision. I'm talking 2013, '14, so it was back in the day when Elasticsearch was very young. And so, that was just a bad start for me.

Corey: I will say that Azure is the most cost-effective cloud because their security is so clown shoes, you can just run whatever you want in someone else's account and it's free to you. Problem solved.

Seif: Don't tell people how we save costs, okay?

Corey: [laugh]. I love that.

Seif: [laugh]. Don't tell people how we do this. Like, Corey, come on [laugh], you're exposing me here. Let me tell you one thing, though. Elasticsearch is the reason I literally used a shock collar or a shock bracelet on myself every time it went down—which was almost every day—instead of having PagerDuty, like, ring my phone.

And, you know, I'd wake up and my partner back then would wake up. I bought a Bluetooth collar off of Alibaba that would tase me every time I'd get a notification, regardless of the notification. So, some things were false alarms, but I got tased for at least two, three weeks before I gave up. Every night I'd wake up, like, to a full discharge.

Corey: I would never hook myself up to a shocker tied to outages, even if I owned a company. There are pleasant ways to wake up, unpleasant ways to wake up, and even worse. So, you're getting shocked for some—so someone else can wind up effectively driving the future of the business.
You're, more or less, the monkey that gets shocked awake to go ahead and fix the thing that just broke.

Seif: [laugh]. Well, the fix to that was moving from Azure to AWS without telling anybody. That got us in a lot of trouble. Again, that wasn't my company.

Corey: They didn't notice that you did this, or it caused a lot of trouble because suddenly nothing worked where they thought it would work?

Seif: They—no, no, everything worked fine on AWS. That's how my love story began. But they didn't notice for, like, six months.

Corey: That's kind of amazing.

Seif: [laugh]. That was specta—we rewrote everything from C# to Node.js and moved everything away from Elasticsearch, started using Redshift, Redis and—you name it. We went AWS all the way and they didn't even notice. We took the budget from another department to start filling that in.

But we cut the costs from $100,000 down to, like, 40, and then eventually down to $30,000 a month.

Corey: More than a little wild.

Seif: Oh, God, yeah. Good times, good times. Next time, just ask me to tell you the full story about this. I can't go into details on this podcast. I'll get in a lot—I think I'll get in trouble. I didn't sign anything though.

Corey: Those are the best stories. But no, I hear you. I absolutely hear you. Seif, I really want to thank you for taking the time to speak with me. If people want to learn more, where should they go?

Seif: So, axiom.co—not dot com. Dot C-O. That's where they learn more about Axiom. And other than that, I think I have a Twitter somewhere. And if you know how to write my name, you'll—it's just one word—find me on Twitter.

Corey: We will put that all in the [show notes 00:26:33]. Thank you so much for taking the time to speak with me. I really appreciate it.

Seif: Dude, that was awesome. Thank you, man.

Corey: Seif Lotfy, co-founder and CTO of Axiom, who has brought this promoted guest episode our way. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that one of these days, I will get around to aggregating in some horrifying custom homebrew logging system, probably built on top of rsyslog.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

cloudonaut
#081 AWS JavaScript SDK v3 + CloudWatch Dashboard Custom Widgets

cloudonaut

Play Episode Listen Later Sep 30, 2023 29:36


Andreas and Michael Wittig have been building on AWS since 2009. Follow their journey of developing products like bucketAV, marbot, and HyperEnv, and learn from practice.

AWS Morning Brief
SageMaker Podcast HealthOmics

AWS Morning Brief

Play Episode Listen Later Aug 21, 2023 6:21


AWS Morning Brief for the week of August 21, 2023 with Corey Quinn. Links:

Corey is performing a live Q&A next month; submit your questions here!
Amazon Polly launches new Gulf Arabic male NTTS voice
AWS HealthOmics supports cross-account sharing of omics analytics stores
New – Amazon EC2 M7a General Purpose Instances Powered by 4th Gen AMD EPYC Processors
Amazon OpenSearch Serverless expands support for larger workloads and collections
Reduce Lambda cold start times: migrate to AWS SDK for JavaScript v3
Architecting for Resilience in the cloud for critical railway systems
How Amazon Shopping uses Amazon Rekognition Content Moderation to review harmful images in product reviews
Zero-shot text classification with Amazon SageMaker JumpStart
Build a multi-account access notification system with Amazon EventBridge
Getting Started with CloudWatch agent and collectd
Cost considerations and common options for AWS Network Firewall log management
Addressing gender inequity in the technology industry

ASecuritySite Podcast
Bill Buchanan - A Bluffer's Guide To Encryption In The Cloud: Top 100

ASecuritySite Podcast

Play Episode Listen Later Aug 21, 2023 20:55


In cybersecurity, the teaching of Cloud security is often weak. So, here are my Top 100 things about encryption in the Cloud. I've focused on AWS, but Azure is likely to also be applicable.

- Keys are created in AWS KMS (Key Management Service). In Azure, this is named Key Vault.
- The cost of using a key in KMS is around $1/month (prorated hourly). When a key is disabled, it is not charged.
- With AWS KMS, we use a shared customer HSM (Hardware Security Module), and with AWS CloudHSM it is dedicated to one customer.
- For data at rest, with file storage, we can integrate encryption with Amazon EBS (Elastic Block Storage) and Amazon S3.
- Amazon EBS drives are encrypted with AES-256 in XTS mode.
- For AWS-managed keys, a unique key is used for every object within S3 buckets.
- Amazon S3 uses server-side encryption to store encrypted data. The customer can use client-side encryption to encrypt data before it is stored in the AWS infrastructure.
- AWS uses 256-bit Advanced Encryption Standard Galois/Counter Mode (AES-GCM) for its symmetric key encryption.
- In AWS S3, by default, all objects are encrypted.
- For data at rest, for databases, we can integrate encryption with Amazon RDS (AWS's relational database service) and Amazon Redshift (AWS's data warehousing).
- For data at rest, we can integrate encryption into ElastiCache (AWS's content caching service), AWS Lambda (AWS's serverless computing service), and Amazon SageMaker (AWS's machine learning service).
- Keys are tokenized and have an ARN (Amazon Resource Name) and an alias. An example ARN for a key is arn:aws:kms:us-east-1:103269750866:key/de30e8e6-c753-4a2c-881a-53c761242644, and an example alias is "Bill's Key". Both of these should be unique in the user's account.
- To refer to a KMS key, we can use its key ID, its key ARN, its alias name, or its alias ARN.
- You can link keys to other AWS accounts. For this, we specify the account in the form "arn:aws:iam::[AWS ID]:root", where AWS ID is the ID of the other AWS account.
- To enhance security, we can use AWS CloudHSM (Hardware Security Module). For simpler and less costly solutions, we typically use AWS KMS (Key Management Service).
- For CloudHSM, we pay per hour, but for KMS, we just pay for the usage of the keys.
- The application of the keys is restricted to defined services.
- Key identifiers and policies are defined with JSON key-value pairs.
- Each key should have a unique GUID, such as "de30e8e6-c753-4a2c-881a-53c761242644".
- Users and roles are identified with an ARN, such as "arn:aws:iam::222222:root".
- With the usage of keys, we have Key Administrative Permission and Key Usage policies.
- There is an explicit denial in a policy if there is not a specific allow defined in the policy.
- For key permissions, we have fields of "Sid" (the descriptive name of the policy), "Effect" (typically "Allow"), "Principal" (the ARN of the user/group), "Action" (such as Create, Disable and Delete) and "Resource". A wildcard ("*") allows or disallows all.
- To give a user of "root" access to everything with a key, we could use: {"Sid": "Enable IAM User Permissions", "Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::22222222:root"}, "Action": "kms:*", "Resource": "*"} (see the Boto3 sketch after this list).
- The main operations within KMS are to encrypt/decrypt data, sign/verify signatures, export data keys, and generate/verify MACs (Message Authentication Codes).
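As a minimal Boto3 sketch of creating a key with such a policy attached (the account ID, alias, and description here are hypothetical, for illustration only):

import json
import boto3

kms = boto3.client("kms")

# Key policy granting the (hypothetical) account root full access to the key.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Enable IAM User Permissions",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
        "Action": "kms:*",
        "Resource": "*",
    }],
}

# Create a symmetric KMS key with the policy, then attach an alias to it.
key = kms.create_key(Policy=json.dumps(policy), Description="Demo key")
key_id = key["KeyMetadata"]["KeyId"]
kms.create_alias(AliasName="alias/BillsKey", TargetKeyId=key_id)

The alias can then be passed as the KeyId argument to later encrypt, sign, or data-key calls.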
- Keys are either AWS managed (such as for the Lambda service) or customer managed (created and managed by the customer). Custom key stores are where the customer has complete control over the keys.
- The main uses of keys are for EC2 (compute), EBS (Elastic Block Storage) and S3 (storage).
- AES symmetric keys or an RSA key pair are used to encrypt and decrypt.
- RSA uses 2K, 3K or 4K keys, with either "RSA PKCS1 v1.5" or "RSA PSS" padding. RSA PKCS1 v1.5 padding is susceptible to Bleichenbacher's attack, so it should only be used for legacy applications; for all others, we should use RSA PSS.
- For RSA, we can use a hashing method of SHA-256, SHA-384 or SHA-512. In RSA, we encrypt with the public key and decrypt with the private key.
- For signatures, we can use either RSA or ECC signing. For RSA, we have 2K, 3K, or 4K keys, whereas ECC signing uses NIST P256, NIST P384, NIST P521, and SECG P256k1 (as used in Bitcoin and Ethereum).
- For MACs (Message Authentication Codes), Bob and Alice have the same shared secret key and can authenticate the hashed version of a message. In KMS, we can have HMAC-224, HMAC-256, HMAC-384 and HMAC-512.
- KMS uses hardware security modules (HSMs) with FIPS 140-2, which cannot be accessed by AWS employees (or any other customer).
- Keys will never appear on an AWS disk or backup; they exist only in the memory of the HSM and are only loaded when used.
- Encryption keys can be restricted to one region of the world (unless otherwise defined by the user).
- With symmetric keys, the key never appears outside the HSM, and for asymmetric keys (public key encryption), the private key stays inside the HSM, and only the public key is exported outside.
- AWS CloudWatch shows how and when the encryption keys are being used.
- The minimum time that can be set for a key to be deleted is seven days (and up to 30 days maximum).
- An organisation can also create its own HSM with a CloudHSM cluster. When a key is then created in KMS, it is stored in the cluster.
- The usage of encryption keys should be limited to a minimal set of service requirements. If possible, separate key managers and key users.
- With a key management (KEY_ADMINISTRATOR) role, we typically have the rights to create, revoke, put, get, list and disable keys. The key management role will typically not be able to encrypt and decrypt.
- For a key user (KEY_WORKER) role, we cannot create or delete keys and typically focus on tasks such as encrypting and decrypting.
- Have a rule of minimum access rights, and simplify user access by defining key administration and usage roles. Users are then added to these roles.
- Avoid manual updates to keys and use key rotation. The system keeps track of keys that are rotated and can use previously defined ones. The default time to rotate keys is once every year (see the Boto3 sketch after this list).
- Key rotation shows up in the CloudWatch and CloudTrail logs.
- KMS complies with PCI DSS Level 1, FIPS 140-2, FedRAMP, and HIPAA. AWS KMS is matched to FIPS 140-2 Level 2, while AWS CloudHSM uses FIPS 140-2 Level 3 validated HSMs.
- AWS CloudHSM costs around $1.45 per hour to run, and the costs end when it is disabled or deleted.
- The CloudHSM is backed up every 24 hours, and we can cluster the HSMs into a single logical HSM. CloudHSM can be replicated across AWS regions.
- AWS KMS is limited to the popular encryption methods, whereas CloudHSM can implement a wider range of methods. For example, CloudHSM can support methods such as 3DES with AWS Payment Cryptography.
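To make the rotation point concrete, here is a short Boto3 sketch (the key ID is hypothetical) that switches on annual rotation for a customer-managed key and reads the status back:

import boto3

kms = boto3.client("kms")
key_id = "de30e8e6-c753-4a2c-881a-53c761242644"  # hypothetical key ID

# Enable automatic rotation (yearly by default for customer-managed keys).
kms.enable_key_rotation(KeyId=key_id)

# Confirm that rotation is now switched on for the key.
status = kms.get_key_rotation_status(KeyId=key_id)
print(status["KeyRotationEnabled"])  # True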
- This complies with payment card industry (PCI) standards, such as PCI PIN, PCI P2PE, and PCI DSS.
- In the CloudHSM for payments, we can generate CVV, CVV2 and ARQC values, and sensitive details never exist outside the HSM in an unprotected form.
- With CloudHSM, we have a command-line interface where we can issue commands, named the CloudHSM CLI.
- Within the CloudHSM CLI, we can use the genSymKey command to generate a symmetric key within the HSM, where -t is a key type (31 is AES), -s is a key size (32 bytes) and -l is the label: genSymKey -t 31 -s 32 -l aes256
- With genSymKey, the key types are: 16 (Generic Secret), 18 (RC4), 21 (Triple DES), and 31 (AES).
- Within the CloudHSM CLI, we can use the genRSAKeyPair command to generate an RSA key pair, where -m is the modulus and -e is the public exponent: genRSAKeyPair -m 2048 -e 65537 -l mykey
- AWS CloudHSM is integrated with AWS CloudTrail, where we can track a user, role, or AWS service within AWS CloudHSM.
- With AWS Payment Cryptography, 2KEY TDES is two-key Triple DES and has a 112-bit equivalent key size.
- The PIN Encryption Key (PEK) is used to encrypt PIN values and uses a 2KEY TDES key. This can store PINs in a secure way, and then decrypt them when required.
- S3 buckets can be encrypted either with Amazon S3-managed keys (SSE-S3) or AWS Key Management Service (AWS KMS) keys. There is no cost to use SSE-S3 keys.
- For symmetric key encryption, AWS uses envelope encryption, where a random data key is used to encrypt the data, and then that data key is encrypted with the user's key (see the Boto3 sketch after this list). AWS should not be able to access the key used for the encryption.
- The default when creating an encryption key is for it to be usable only in a single region, but this can be changed to multi-region, where the key will be replicated across more than one region.
- In AWS, a region is a geographical area, which is split into isolated locations. US-East-1 (N. Virginia) and US-East-2 (Ohio) are different regions, while us-east-1a, us-east-1b and us-east-1c are in the same region. A single-region key in the US-East-1 region would be available across us-east-1a, us-east-1b and us-east-1c, and not in us-east-2a, us-east-2b and us-east-2c.
- When creating a key, you can either create it in KMS, import a key (BYOK — bring your own key), create it in AWS CloudHSM, or create it in an external key store (HYOK — hold your own key).
- For keys stored on-premise, we can use an external key store (XKS) — this can be defined as Hold Your Own Key (HYOK), where no entity in AWS will be able to read any of the encrypted data.
- You can BYOK (bring your own key) with KMS, and import keys. KMS will keep a copy of this key.
- With XKS, we need a proxy URI endpoint, with the proxy credentials of an access key ID and a secret access key.
- To export keys from AWS CloudHSM, we can encrypt them with an AES key. This is known as key wrapping, as defined in RFC 5649 (for padding with zeros) or RFC 3394 (without padding). A strong password should always be used for key wrapping.
- AWS encryption operations can be conducted either from the command line or within an API, such as with Python, Node.js or Golang.
- With KMS, the maximum data size is 4,096 bytes for a symmetric key, 190 bytes for RSA 2048 OAEP SHA-256, 318 bytes for RSA 3072 OAEP SHA-256, and 446 bytes for RSA 4096 OAEP SHA-256.
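As a hedged sketch of the envelope-encryption flow, assuming Boto3 and the cryptography package are installed and using a hypothetical key alias: KMS returns a plaintext data key plus an encrypted copy; the plaintext key encrypts the data locally and is then discarded, and only the encrypted copy is stored alongside the ciphertext:

import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")

# Ask KMS for a fresh 256-bit data key under a (hypothetical) master key.
resp = kms.generate_data_key(KeyId="alias/MySymKey", KeySpec="AES_256")
plaintext_key, encrypted_key = resp["Plaintext"], resp["CiphertextBlob"]

# Encrypt locally with AES-GCM; persist nonce + ciphertext + encrypted_key.
nonce = os.urandom(12)
ciphertext = AESGCM(plaintext_key).encrypt(nonce, b"secret data", None)
del plaintext_key  # the plaintext data key is never persisted

# Later: unwrap the data key via KMS, then decrypt the data locally.
restored_key = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]
print(AESGCM(restored_key).decrypt(nonce, ciphertext, None))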
- An example command to encrypt a file 1.txt with symmetric key encryption is: aws kms encrypt --key-id alias/MySymKey --plaintext fileb://1.txt --query CiphertextBlob --output text > 1.out
- To decrypt a file with symmetric key encryption, an example with 1.enc is: aws kms decrypt --key-id alias/BillsNewKey --output text --query Plaintext --ciphertext-blob fileb://1.enc > 2.out
- In Python, to integrate with KMS, we use the Boto3 library.
- The standard output of encrypted content is in byte format. If we need a text version of the ciphertext, we typically use Base64 format. The base64 command can be used to convert byte format into Base64, such as with: base64 -i 1.out > 1.enc
- The xxd command in the command line allows the ciphertext to be dumped to a hex output, which can then be edited. We can then convert it back to a binary output with xxd -r.
- An example piece of Python code for encrypting a plaintext message with a symmetric key is: ciphertext = kms_client.encrypt(KeyId=alias, Plaintext=bytes(secret, encoding='utf8'))
- An example piece of Python code to decrypt some ciphertext (in Base64 format) is: plain_text = kms_client.decrypt(KeyId=alias, CiphertextBlob=bytes(base64.b64decode(ciphertext)))
- To generate an HMAC signature for a message in the command line, we have the form of: aws kms generate-mac --key-id alias/MyHMACKey --message fileb://1.txt --mac-algorithm HMAC_SHA_256 --query Mac > 4.out
- To verify an HMAC signature for a message in the command line, we have the form of: aws kms verify-mac --key-id alias/MyHMACKey --message fileb://1.txt --mac-algorithm HMAC_SHA_256 --mac fileb://4.mac
- To create an ECDSA signature in the command line, we have the form of: aws kms sign --key-id alias/MyPublicKeyForSigning --message fileb://1.txt --signing-algorithm ECDSA_SHA_256 --query Signature > 1.out
- To verify an ECDSA signature in the command line, we have the form of: aws kms verify --key-id alias/MyPublicKeyForSigning --message fileb://1.txt --signature fileb://1.sig --signing-algorithm ECDSA_SHA_256 (see the Boto3 sketch after this list)
- To encrypt data using RSA in the command line, we have the form of: aws kms encrypt --key-id alias/PublicKeyForDemo --plaintext fileb://1.txt --query CiphertextBlob --output text --encryption-algorithm RSAES_OAEP_SHA_1 > 1.out
- To decrypt data using RSA in the command line, we have the form of: aws kms decrypt --key-id alias/PublicKeyForDemo --output text --query Plaintext --ciphertext-blob fileb://1.enc --encryption-algorithm RSAES_OAEP_SHA_1 > 2.out
- To sign data using RSA in the command line, we have the form of: aws kms sign --key-id alias/MyRSAKey --message fileb://1.txt --signing-algorithm RSASSA_PSS_SHA_256 --query Signature --output text > 1.out
- To verify data using RSA in the command line, we have the form of: aws kms verify --key-id alias/MyRSAKey --message fileb://1.txt --signature fileb://1.sig --signing-algorithm RSASSA_PSS_SHA_256
- You cannot encrypt data with Elliptic Curve keys; only RSA and AES can do that. Elliptic Curve keys are used to sign data.
- If you delete an encryption key, you will not be able to decrypt any ciphertext that uses it.
- We can store our secrets, such as application passwords, in Secrets Manager. For a secret named "my-secret-passphrase" with a secret string of "Qwerty123", we can have: aws secretsmanager create-secret --name my-secret-passphrase --secret-string Qwerty123
- In China Regions, along with RSA and ECDSA, you can use SM2 KMS signing keys, and SM2PKE to encrypt data with asymmetric key encryption.
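The same ECDSA sign/verify flow is available from Python. A minimal Boto3 sketch, assuming a hypothetical signing key alias; the private key never leaves the HSM:

import boto3

kms = boto3.client("kms")
message = b"hello world"
key = "alias/MyPublicKeyForSigning"  # hypothetical asymmetric signing key

# Sign the raw message with ECDSA P-256 inside KMS.
sig = kms.sign(
    KeyId=key,
    Message=message,
    MessageType="RAW",
    SigningAlgorithm="ECDSA_SHA_256",
)["Signature"]

# Verify inside KMS; returns SignatureValid=True on success.
ok = kms.verify(
    KeyId=key,
    Message=message,
    MessageType="RAW",
    Signature=sig,
    SigningAlgorithm="ECDSA_SHA_256",
)["SignatureValid"]
print(ok)  # True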
Find out more here: https://asecuritysite.com/aws

The Cloud Pod
217: The Cloud Pod Whispers Its Secrets to Azure Open AI

The Cloud Pod

Play Episode Listen Later Jul 8, 2023 39:53


Welcome to the newest episode of The Cloud Pod podcast - where the forecast is always cloudy! Today your hosts Justin, Jonathan, and Matt discuss all things cloud and AI, as well as some really interesting forays into quantum computing, changes to Google Domains, Google accusing Microsoft of cloud monopoly shenanigans, and the fact that Azure wants all your industry secrets. Also, FinOps and all the logs you could hope for. Are your secrets safe? Better tune in and find out!  Titles we almost went with this week:

The Cloud Pod Adds Domains to the Killed by Google List
The Cloud Pod Whispers Its Secrets to Azure OpenAI
The Cloud Pod Accuses the Cloud of Being a Monopoly
The Cloud Pod Does Not Pass Go and Does Not Collect $200

A big thanks to this week's sponsor: Foghorn Consulting provides top-notch cloud and DevOps engineers to the world's most innovative companies. Initiatives stalled because you have trouble hiring? Foghorn can be burning down your DevOps and Cloud backlogs as soon as next week.

The Cloud Pod
216: The Cloud Pod is Feeling Elevated Enough to Record the Podcast

The Cloud Pod

Play Episode Listen Later Jun 30, 2023 30:53


Welcome to the newest episode of The Cloud Pod podcast - where the forecast is always cloudy! Today your hosts are Jonathan and Matt as we discuss all things cloud and AI, including Temporary Elevated Access Management (or TEAM, since we REALLY like acronyms today), FTP servers, SQL servers and all the other servers, as well as pipelines, whether or not the government should regulate AI (spoiler alert: the AI companies don't think so), and some updates to security at Amazon and Google.  Titles we almost went with this week:

The Cloud Pod's FTP server now with post-quantum keys support
The Cloud Pod can now TEAM into your account, but only temporarily
The Cloud Pod dusts off their old floppy drive
The Cloud Pod dusts off their old SQL server disks
The Cloud Pod is feeling temporarily elevated to do a podcast
The Cloud Pod promises that AI will not take over the world
The Cloud Pod duels with keys
The Cloud Pod is feeling temporarily elevated.

A big thanks to this week's sponsor: Foghorn Consulting provides top-notch cloud and DevOps engineers to the world's most innovative companies. Initiatives stalled because you have trouble hiring? Foghorn can be burning down your DevOps and Cloud backlogs as soon as next week.

Screaming in the Cloud
OpsLevel and The Need for a Developer Portal with Kenneth Rose

Screaming in the Cloud

Play Episode Listen Later Jun 15, 2023 36:34


Kenneth Rose, CTO at OpsLevel, joins Corey on Screaming in the Cloud to discuss how OpsLevel is helping developer teams to scale effectively. Kenneth reveals what a developer portal is, how he thinks about the functionality of a developer portal, and the problems a developer portal solves for large developer teams. Corey and Kenneth discuss how to drive adoption of a developer portal, and Kenneth explains why it's so necessary to have executive buy-in throughout that process. Kenneth also discusses how using their own portal internally, along with seeking out customer feedback, has allowed OpsLevel to make impactful innovations.

About Ken
Kenneth (Ken) Rose is the CTO and Co-Founder of OpsLevel. Ken has spent over 15 years scaling engineering teams as an early engineer at PagerDuty and Shopify. Having in-the-trenches experience has allowed Ken a unique perspective on how some of the best teams are built and scaled, and lends this viewpoint to building products for OpsLevel, a service ownership platform built to turn chaos into consistency for engineering leaders.

Links Referenced:
OpsLevel: https://www.opslevel.com/
LinkedIn: https://www.linkedin.com/company/opslevel/
Twitter: https://twitter.com/OpsLevelHQ

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. About, oh I don't know, two years ago and change, I wound up writing a blog post titled, “Developer Portals are An Anti-Pattern,” and I haven't really spent a lot of time thinking about them since. This promoted guest episode is brought to us by our friends at OpsLevel, and they have sent their CTO and co-founder Ken Rose, presumably in an attempt to change my perspective on these things. Let's find out. Ken, thank you for agreeing to, well, run the gauntlet, for lack of a better term.

Ken: Hey, Corey. Thanks again for having me. I've heard, you know, heard and listened to your show a bunch, and I'm really excited to be here today.

Corey: Let's begin with defining our terms. I'm curious to know what a developer portal is. “What would you say a developer portal means to you?” Like it's a college entrance essay.

Ken: Right? Definitely. You know, really, a developer portal is this consolidated place for developers to come to, especially in large organizations, to be able to get their jobs done more easily, right? A large challenge that developers have in large organizations is there's just a lot to do and a lot to take care of. So, a developer portal is a place for developers to be able to better own, manage, and run the services they're responsible for that run in production, and they can do that through easy access to self-service tooling.

Corey: I guess, on some level, this turns into one of those alignment charts of, like, what is a database and, like, how prescriptive you want to be. It's like, well is a senior engineer a database because you can query them and they have information? Would you consider, for example, Kubernetes to be a developer platform, and/or would the AWS console?

Ken: Yeah, that's actually an interesting question, right?
So, I think there's actually two—we're going to get really niggly here—there's developer platform and developer portal, right? And the word portal for me is something that sits above a developer platform. I don't know if you remember, like, the late-90s, early-2000s, like, portals were all the rage. Like, Yahoo and AltaVista were, like, search portals; they were trying, at the time, to consolidate all this information on a much smaller internet to make it easy to access. A developer portal is sort of the same thing, but custom-built for developers and trying to consolidate a lot of the tooling that exists.

Now, in terms of the AWS console? Yeah, maybe. Like, it has a suite of tools and a suite of offerings. It doesn't do a lot on the, well, how do I quickly find out what's running in production and who is responsible for it? I don't know, unless AWS shipped, like, their, you know, three-hundredth new offering in the last week that I haven't, you know, kept on top of.

But you know, there's definitely some spectrum in terms of what goes into a developer portal. For me, there's kind of three main things you need. You do need some kind of a catalog, like, what's out there and who owns it; you need some kind of a way to measure, like, how good are those services, like, how well built are they; and then you need some access to self-service tooling. And that last part is where, like, the Kubernetes or AWS could be, you know, sort of a dev portal as well.

Corey: My experience with developer portals—there was a time when I loved it. RightScale was what I used—at some depth—back in, I want to say 2010, 2011, because the EC2 console was clearly not built or designed by anyone who had not built EC2 themselves with their bare hands and sweat of their brow. And in time, the EC2 console got better, where it wasn't written in hieroglyphics, as best we could tell, and it became ‘click button to launch instance.' And RightScale really didn't have a second act and they wound up getting acquired by our friends over at Flexera years later. And I haven't seen their developer portal in at least eight years as a direct result of this.

So, the problem, at least when I was viewing it purely in the context of AWS services, it feels like you are competing against AWS iterating forward on developer experience, which they iterate slowly, sometimes, and unevenly across their breadth of services, but it does feel like at some level by building an internal portal, you are, first, trying to out-innovate AWS, in some ways, and two, you are inherently making the trade-off of not using recent features and enhancements that have not themselves been incorporated into the portal. That's where the, I guess, the genesis of my opposition to the developer portal approach comes from. Is that philosophy valid these days? Not as much. Because I can see an argument for it shifting.

Ken: Yeah, I think it's slightly different. I think of a developer portal as, again, something that sort of sits on top of AWS or Google Cloud or whatever cloud provider you use, right? You gave an example, for example, with RightScale and EC2. So, provisioning instances is one part of the activity you have to do as a developer. Now, in most modern organizations, you have, like, your product developers that ship features. They don't actually care about provisioning instances themselves.
There's another group, called the platform engineers or platform group, that are responsible for building automation and tooling to help spin up instances and create CI/CD pipelines and get everything you need set up.

And they might use AWS under the covers to do that, but the automation built on top and making that accessible to developers, that's really what a developer portal can provide. In addition, it also provides links to operational tooling that you need, technical documentation—it's everything you need as a developer to do your job, in one place. And though AWS bills itself as that, I think of them as more—they have a lot of platform offerings, right, they have a lot of infra offerings—but they still haven't been able to, I think, customize that. Unless you're an organization that has kind of gone all-in on AWS and doesn't build any of your own tooling, that's where a developer portal helps. It really helps by consolidating all that information in one place, by making that information discoverable for every developer so they have less… less cognitive load, right? We've asked developers to kind of do too much. We've asked them to shift left, and well, how do we make that information more accessible?

Regarding the point of, you know, AWS adds new features or new capabilities all the time and, like, well, you have this dev portal, that's sort of your interface for how to get things done. Like, how will you use those? A dev portal doesn't stop you from doing that, right? So, my mental model is, if I'm a developer, and I want to spin up a new service, I can just press a button inside of my dev portal in my company and do that. And I have a service that is built according to the latest standards, it has a CI/CD pipeline, it already has a—you know, it's registered in PagerDuty, it's registered in Datadog, it has all the various bits.

And then there's something else that I want to do that isn't really on the golden path because maybe this is some new service or some experiment—nothing stops us from doing that. Like, you still can use all those tools from AWS, you know, kind of raw. And if those prove to be valuable for the rest of the organization, great. They can make their way into the dev portal; they can actually become a source of leverage. But if they're not, then they can also just sit there on the vine. Like, not everything that AWS ever produces will be used by every company.

Corey: Many years ago, I got a pair of Cisco certifications because a recession was hitting and I needed to be better at networking. And taking those certifications, in those days before Cisco became the sad corporate dragon with no friends we all know today, they were highly germane and relevant. But I distinctly remember, even now, 15 years later, that there was this entire philosophy of pretend-the-entire-world-is-Cisco-only, which in networking is absolutely never true. It feels like a lot of the AWS designs and patterns tend to assume, oh yeah, you're going to use AWS services for everything. I have never yet found that to be true, other than when I'm just trying to be obstinate.

And hell is interoperability between a bunch of different things.
Yes, I may want to spin up an EC2 instance and an AWS load balancer and some S3 storage or whatnot, but I'm also going to want to monitor it with PagerDuty, I'm going to want to have a CDN that isn't CloudFront because most CDNs these days don't hate you in quite the same economic ways and are simpler to work with, et cetera, et cetera, et cetera. So, there's definitely a story wherein I've found that the interoperability of tying these things together is helpful. How do you avoid falling down the trap of, oh, everyone should be multi-cloud, single pane of glass, et cetera, et cetera? In practice that always seems to turn to custard.

Ken: Yeah, I think multi-cloud and single pane of glass are actually two different things. So multi-cloud, like, I agree with you to some sense. Like, pick a cloud and go with it, unless you have really good business reasons to go multi-cloud. And sometimes you do. Like, years ago, I worked at PagerDuty; they were multi-cloud for a reliability reason, that hey, if one cloud provider goes down, you don't want [crosstalk 00:08:40]—

Corey: They were an example I used all the time for that story—

Ken: Right.

Corey: —specifically, the thing that woke you up was homed in a bunch of different places, whereas the marketing site, the onboarding flow, the periphery stuff around it was not, because it didn't need to be.

Ken: Exactly.

Corey: Like, the core business need of wake-you-up was very much multi-cloud because once upon a time, it wasn't, and it went down with the rest of us-east-1 and people weren't woken up to be told their site was on fire.

Ken: A hundred percent. And on the kind of, like, application side, even then, pick a cloud and go with it, unless there's a really compelling business reason for your business to go multi-cloud. Maybe there's something around credits or compliance or availability, right? There might be reasons, but you have to be articulate about whether they're right for you.

Now, single pane of glass, I think that's different, right? I do think that's something that, ultimately, is a net boon for developers. In any large organization, there is a myriad of internal tools that have been built. And it's like, well, how do I provision a new topic in the Kafka cluster? How do I actually get access to the AWS console? How do I spin up a new service, right? How do I kind of do these things?

And if I'm a developer, I just want to ship features. Like, that's what I'm incented to do, that's what I'm optimizing for. All this other stuff I have to do as part of my job, but I don't want to have to become, like, a Kubernetes guru to be able to do it, right? So, what a developer portal is trying to do is be that single pane of glass, bringing all these common tools and responsibilities that you have as a developer into one place. They're easy to search for, they're easy to find, they're easy to query, they're easy to use.

Corey: I should probably have asked this earlier on, but let's disambiguate for a little bit here. Because when I'm setting up to use a new service or product and kicking the tires on it, no two explorations really look the same. Whereas at most responsible, mature companies that are building services that are going to see production use, they've standardized around a number of different approaches. What does your target customer look like? Is there a certain point of scale, a certain level of complexity, a certain maturity of process?

Ken: Absolutely.
So, a tool like OpsLevel or a developer portal really only makes sense when you hit some critical mass in terms of the number of services you have running in production, or the number of developers that you have. So, when you hit 20, 30, 50 developers or 20, 30, 50 services, an important part of a developer portal is this catalog of what's out there. Once you kind of hit the Dunbar number of services, like, when you have more than you can keep in your head, that's when you start to need tooling like this. If you look at our customer base, they're all, you know, kind of medium- to large-sized companies. If you're a startup with, like, ten people, OpsLevel is probably not right for you. We use OpsLevel internally at OpsLevel, and you know, like, we're still a small company. We make it work for us because we know how to get the most out of it, but, like, it's not the perfect fit because it's not really meant for, you know, smaller companies.

Corey: Oh, I hear you. I think I'm probably… I have a better AWS bill analytics system running internally here at The Duckbill Group than some banks do. So, I hear you on that front.

Ken: I believe it.

Corey: But it also implies to me that there's no OpsLevel prospect or customer deployment that has ever been greenfield. It's always you're building on existing things, there's already infrastructure in place, vendors have been selected across the board. You aren't starting a company on day one saying, all right, time to spin up our AWS account and we're also going to wind up signing up for OpsLevel, from the sound of it.

Ken: Correct—

Corey: Accurate? Inaccurate?

Ken: I think that's actually accurate. Like, a lot of the problems we solve are problems that come as you start to scale both your product and your engineering team. And it's the problems of complexity.

Corey: What do those painful problems look like? In other words, what is someone sitting at home right now listening to this, or driving to work debating whether they want to ram a bridge abutment or go into the office depending on their mental state today—what painful problem do they have that OpsLevel is designed to fix?

Ken: Yeah, for sure. So, let's help people self-select. Here's my mental model for any [unintelligible 00:12:25]. There are product developers, platform developers, and engineering leaders. Product developers, if you're asking questions like, “I just got paged for this service. I don't know what this does.” Or, “It's upstream from here. Where do I find the technical documentation?” Or, “I think I have to do something with the payment service. Where do I find the API for that?”

You know, when you get to that scale, a developer portal can help you. If you're a platform engineer and you have questions like, “Okay, we've got to migrate. We're migrating, I don't know, from Datadog to Honeycomb, right? We've got to get these fifty or a hundred or thousands of services and all these different owners to, like, switch to some new tool.” Or, “Hey, we've done all this work to ship the golden path. Like, how do we actually measure the adoption of all this work that we're doing and whether it's actually valuable?” Right?

Like, we want everybody to be on a certain set of CI tooling or a certain minimum version of some library or framework. How do we do that? How do we measure that? OpsLevel is for you, right? We have a whole bunch of stuff around maturity.

And if you're an engineering leader, ultimately, the questions you care about are, like, “How fast are my developers working?
I have this massive team; we've made this massive investment in hiring all these humans to write software and bring value for our customers. How can we be more efficient as a business in terms of that value delivery?” And that's where OpsLevel can help as well.

Corey: Guardrails, whether they be economic, regulatory, or otherwise, have to make it easier than doing things incorrectly because one of the miracle aspects of cloud also turns into a bit of a problem, which is that shadow IT is only ever a corporate credit card away. Make it too difficult to comply with corporate policies and people won't. And they're good actors; they're trying to get work done. They're not trying to make people's lives harder, but they don't want to spend six weeks provisioning an EC2 cluster. So, there's always that weird trade-off.

Now, it feels—and please correct me if I'm wrong—once someone has rolled out OpsLevel at their organization, where it really shines is spinning up a new service where, okay, great, you're going to spin up the automatic observability portion of it, you're going to spin up the underlying infrastructure in certain ways that comply with our policies, it's going to build the CI/CD pipelines around it, you're going to wind up having the various cost instrumentation rolled out to it. But for services that are already excellent within the environment, is there an OpsLevel story for them?

Ken: Oh, absolutely. So, I look at it as, like, the first problem OpsLevel helps solve is the catalog and what's out there and who owns it. So, not even getting developers to spin up new services that are kind of on the golden path, but just understanding the taxonomy of what are the services we have. How do those services compose into higher-level things like systems or domains? What's the whole set of infrastructure we have?

Like, I have 50 AWS accounts, maybe a handful of GCP ones, also some Azure. I have all this infrastructure that—like, how do I start to get a handle on what's out there in prod and who's responsible for it? And that helps you get in front of compliance risks, security risks. That's really the starting point for OpsLevel: building that catalog. And we have a bunch of integrations that kind of slurp all this data to automatically assemble that catalog, or YAML as well if that's your thing. But the starting point is building that catalog and figuring out this assignment of, like, okay, this service and this human—or, sorry, team—like, they're paired together.

Corey: A number of offerings in this space—and honestly, my exposure to it is bounded simultaneously to things that are ten years old and no one uses anymore, and a bunch of things I found on GitHub—the challenge that both of those sets of products tend to have is that they assume certain things to be true about a given environment: that they're using Terraform to manage everything, or they're always going to be using CloudFormation, or everyone there knows Python or something else like that. What are the prerequisites to get started with OpsLevel?

Ken: Yeah, so we've worked pretty hard to build just a ton of integrations. I would say integrations are just a continuing thing we have going on in the background. Like, when we started, we only supported GitHub. Now, we support all the gits, you know, like GitHub, GitLab, Bitbucket, Azure DevOps, like, we're building [unintelligible 00:16:19]. There's just a whole, like, long tail of integrations.
The same with vulnerability management tooling, right? And the reason we do that is because there's just this huge vendor footprint, and people, you know, want OpsLevel to work for them. Now, the other thing we try to do is we also build APIs. So, anything we have as, like, a core integration, we also have kind of like an underlying API for, so that there's, no matter what you have an escape hatch. If like, you're using some tool that we don't support or you have some homegrown thing, there's always a way to try to be able to integrate that into OpsLevel.Corey: When people think about developer portals, the most common one that pops to mind is Backstage, which Spotify wound up building, internally, championing, open-sourcing, and I believe, on some level, turned into a product because if there's one thing people want, it's to have their podcast music company become a SaaS vendor, which is weird to me. But the criticisms that I've seen about and across the board have rung relatively true, including from people internal at Spotify who have used the thing, which is, well first is underestimating the amount of effort that is necessary to maintain Backstage itself, that the build versus buy discussion is always harder to bu—engineers love to build, but they shouldn't be building things outside of their core competency half the time, and the other is driving adoption within the org where you can have the most amazing developer portal in the known universe, but if people don't use it, it may as well not exist and doing the carrot and stick approach often doesn't work. I think you have a pretty good answer that I need not even ask you to elaborate on, “Well, how do we avoid having to maintain this ourselves,” since you have a company that does this, but how do you find companies are driving adoption successfully once they have deployed OpsLevel?Ken: Yeah, that's a great question. So, absolutely. Like, I think the biggest thing you need first, is kind of cultural buy-in and that this is a tool that we want to invest in, right? I think one of the reasons Spotify was successful with Backstage and I think it was System Z before that was that they had this kind of flywheel of, like, they saw that their developers were getting, you know better faster, working happier, by using this type of tooling, by reducing the cognitive load. The way that we approach it is sort of similar, right?We want to make sure that there is executive buy-in that, like, everybody agrees this is, like, a problem that's worth solving. The first step we do is trying to build out that catalog again and helping assign ownership. And that helps people understand, like, hey, these are the services I'm responsible for. Oh, look, and now here's this other context that I didn't have before. And then helping organizations, you know, what—it depends on the problem we're trying to solve, but whether it's rolling out self-serve automation to help developers, like, reduce what was before a ton of cognitive load or if it's helping platform teams define what good looks like so they can start to level up the overall health of what's running in production, we kind of work on different problems, but it's picking one problem and then you know, kind of working with the customers and driving it forward.Corey: On some level, I think that this is going to be looked down upon inherently just by automatic reflex of folks with infrastructure engineering backgrounds. It's taken me some time to learn to overcome my own negative reaction to it. 
Because it's—I'm here to build things, and I want to build things out in such a way that it's portable and reusable without having to be tied to a particular vendor, and move on. And it took me a long time to realize that what that instinct was whispering in my ear was, in fact, no, you should be your own cloud provider. If that's really what I want to do, I probably should just brush up on, you know, computer science trivia from 20 years ago and then go see if I can pass Google's SRE interview.

I'm not here to build the things that just provision infrastructure from scratch at every company I wind up landing at. It feels like there's more important, impactful work that I can do. And let's be clear, people are never going to follow guardrails themselves when they have to do a bunch of manual steps. It has to be something that is done for them. And I don't know how you necessarily get there without having some form of blueprint or something like that, provided for them with something that is self-service because otherwise, it's not going to work.

Ken: I a hundred percent agree, by the way, Corey. Like, the take that, like, automation is the only way to drive a lot of this forward is true, right? If for every single thing you're trying—like, we have a concept called a rubric and it's basically how you measure the service health. And you can—it's very customizable, you have different dimensions. But if, for any check that's on your rubric, it requires manual effort from all your developers, that is going to be harder than something you can just automate away.

So, vulnerability management is a great example. If you tell developers, “Hey, you have to go upgrade this library,” okay, some percentage [unintelligible 00:20:47]. If you give developers, “Here's a pull request that's already been done and has a test passing and now you just need to merge it,” you're going to have a much better adoption rate with that. Similarly with, like, applying templates, being able to [up-level 00:20:57], you know, kind of apply the latest version of a template to an existing service—those types of capabilities, anything where you can automate what the fixes are, absolutely you're going to get better adoption.

Corey: As you take a look at your existing reference customers—which is something I always look for on vendor websites because, like, oh, we have many customers who will absolutely not admit to being customers, it's like, that sounds like something that's easy to say—you have actual names tied to these things. Not just companies, but also individuals. If you were to sit down and ask your existing customer base, “So, why did you wind up implementing OpsLevel and what has the value that's been delivered to you been since that implementation?” What do they say?

Ken: Definitely. I actually had to check our website because we, you know, land new customers and put new logos on it. I was like, “Oh, I wonder what the current set is out right now?”

Corey: I have the exact same challenge. Like, oh, we have some mutual customers. And it's okay. I don't know if I can mention them by name because I haven't checked our own list of testimonials [unintelligible 00:21:51] lately, because say the wrong thing and that's how you wind up being sued and not having a company anymore.

Ken: Yeah. So, I don't—I definitely, you know, want to stay [on side 00:22:00] on that part, but in terms of, like, a sample reference customer, a lot of the folks that we initially worked with are the platform teams, right?
They're the teams that care about what's out there, and they need to know who's responsible for it because they're trying to drive some kind of cross-cutting change across the entire, you know, production footprint. And so, the first thing that generally people will say is—and I love this quote. This came—I won't name them, but like, it's in one of our case studies.
It was like, “I had, like, 50 different attempts at making a spreadsheet and they're all, like, in the graveyard, like, to be able to capture what's out there and who's responsible for it.” And just OpsLevel helping automate that has been one of the biggest values that they've gotten. The second point, then, is being able to drive maturity and measure how well those services are being built. And again, it's sort of this interesting thing where we start with the platform teams. And then sometime later security teams find out about OpsLevel, and they're like, “Oh, this is a tool I can use to, like, get developers to do stuff? Like, I've been trying to get developers to do stuff for the longest time.”
And they—I file Jira tickets and they just sit there and nothing gets done. But when it becomes part of this, like, overall health score that you're trying to increase across the board, yeah, it's just a way to kind of drive action.
Corey: I think that there's a dichotomy of companies that emerge. And I tend to see the world through a lens of AWS bills, so let's go down that path. I feel like there are some companies presumably like OpsLevel, whereas if I—assuming you're running on top of AWS—if I were to pull your AWS bill, I would see upwards of 80% of your spend is going to be on this application called OpsLevel, the service that you provide to people. As opposed to the other side of the world, which is large enterprises, where they're spending hundreds of millions of dollars a year, but the largest application they have is a million-and-a-half a year in spend because, just, they have thousands of these things scattered everywhere. That latter case is where I tend to see more platform teams, where I start to see a lot of managing a whole bunch of relatively small workloads. And developer platforms really seem to be where a lot of solutions lead, whereas in the “80% of our workload is one application” case, you don't feel the need for that as much. Is that accurate? Am I misunderstanding some aspect of it?
Ken: No, a hundred percent, you've hit the nail on the head. Like, okay, think about the typical, like, microservices adoption journey. Like, you started with, you know, some small company—like us—you started with a monolith. Ah, maybe you built out a second app—
Corey: Then you read on Hacker News and realize, “Oh, if we want to hire people, we've got to be doing what all the cool kids are up to.”
Ken: Right. We got to microservice all the things—but that's actually—you know, microservices should come later, right, as a response to you needing to scale your org and scale your—
Corey: As someone who started building some application with microservices, I could not agree more.
Ken: A hundred percent. So, it's as you're starting to take steps to having just more moving parts in your production infrastructure, right? If you have one moving part, unless it's like a really large moving part that you can internally break down, like, kind of this majestic monolith where you do have kind of like individual domains that are owned by different teams, but really the problem we're trying to solve, it's more about, like, who owns what.
Now, if that's a single atomic unit, great, but can you decompose that? But if you just have, like, one small application, kind of like the whole team is owning everything, again, a developer portal is probably not the right tool for you. It really is a tool that you need as you start to scale your engineering work and as you start to scale the number of moving parts in your production infrastructure.
Corey: I used to think of that in terms of boring companies versus innovative ones and I don't think that's accurate. I think it is a question of maturity and where companies are headed. On some level, as OpsLevel starts growing and becomes larger and larger in different ways and starts doing acquisitions and launching into other areas, at some point, you don't have just one product offering, you have a multitude of them. At which point having something like that is going to be critical. But I have to ask, given that you are sort of not exactly your target customer profile, what have the sharp edges been on using it for your use case?
Ken: Yeah. So, we actually have an internal Slack channel we call OpsLevel on OpsLevel. And finding those sharp edges actually has been really useful for us. You know, all the good stuff, dogfooding, and it makes your own product better. Okay, so we have our main app, we also do have a bunch of smaller things and it's like, oh yeah, you know, we have, like, I don't know, various hack-day things that go on, it's important we kind of wind those down for, you know, compliance, we have our marketing site, we have, like, our Terraform. Like, so there's, like, stuff. It's not, like, hundreds or thousands of things, but there's more than just the main app. The second thing, though, is it's really on the maturity piece that we really try to get a lot of value out of our own product, right? Helping—we have our own platform team. They're also trying to drive certain initiatives with our product developers.
There is that usual tension of, our, like, our own product developers are like, “I want to ship features.” What's this security thing I have to go take care of right now? But OpsLevel itself, like, helps reflect that. We had an operational review today and it was like, “Oh, this one service is actually now”—we have platinum as a level. It's in gold instead of platinum. It's like, “Why?” “Oh, there's this thing that came up. We got to go fix that.” “Great. Let's go actually go fix that so we're back into platinum.”
Corey: Do you find that there's often a choice you have to make internally, where you could make the product more effective for your specific use case, but that also diverges from where your typical customer needs or wants the product to go?
Ken: No, I think a lot of the things we find for our use case are, like, they're more small paper cuts, right? They're just as we're using it, it's like, “Hey, like, as I'm using this, I want to see the report for this particular check. Why do I have to click six times to get it?” You know, like, “Wouldn't it be great if we had a button?” Right?
And so, it's those types of, like, small innovations that kind of come up. And those ultimately lead to, you know, a better product for our customers. We also work really closely with our customers, and developers are not shy about telling you what they don't like about your product.
And I say this with love, like, a lot of our customers give us phenomenal feedback just on how our product can be better and we try to internalize that and, you know, roll that feedback into the product.
Corey: You have a number of integrations of different SaaS providers, infrastructure providers, et cetera, that you wind up working with. I imagine that given your scale and scope and whatnot, those offerings are dictated by what customers say, “Hey, we're using this thing. Are you going to support that or are you not going to maintain our business?” Which is a great way to wind up financing a lot of product development and figuring out what matters to people. My question for you is, if you look across the totality of your user base, what are the most popularly used integrations, if you can say?
Ken: Yeah, for sure. I think right now—I could actually dive in to pull the numbers—GitHub and GitLab—or… I think GitHub, like, has slightly more adoption across our customer base. At least with our customers, almost nobody uses Bitbucket. I mean, we have, like, a small number, but, like, it's… I think, single-digit percentage. A lot of people use PagerDuty, which, you know, hey, I'm an ex-PagerDuty person [crosstalk 00:28:24] and I'm glad to see that.
Corey: I have a free tier PagerDuty account that will automatically page me for my home automation stuff. Specifically, if, you know, the fire alarm goes off. Like, yeah, okay, there are certain things I want to be woken up for, but it's a very short list.
Ken: Yeah, it's funny, the running default message when we'd test PagerDuty was, “The server is on fire.” [unintelligible 00:28:44] be like, “The house is on fire.” Like, you know, go get that taken care of. There's one other tool that's used a lot. Datadog actually is used a ton across our entire customer base, despite its… we're also Data—we're a Datadog partner, we're a Datadog customer, you know? It's not cheap, but it's a good product for, you know, monitoring and logs and there are [crosstalk 00:29:01]—
Corey: Other than cloud infrastructure providers, the number one most common source of inquiries I get is Datadog optimization. It has now risen to a board-level concern in many cases because observability is expensive. That's a sign of success, on some level. Meanwhile, I'm sitting here, like, Date-a-dog? Oh, my God, that's disgusting. It's like Tinder for Pets. Which it turns out is not at all what they do.
Ken: Nice.
Corey: Yeah.
[audio break 00:29:23]—optimizing their Slack integrations, their GitHub integration, et cetera. Or are they starting with the spinning up the servers piece of it?
Ken: A lot of the time—and again, that first problem they're trying to solve is just get me a handle on everything we have running in production. You know, if you have multiple AWS accounts, multiple Kubernetes clusters, dozens or even hundreds of teams, God help you if you're going to try to, like, build a list manually to consolidate all that information. That's really the first part is, like, integrate Kubernetes, integrate your CI/CD pipelines, integrate Git, integrate your cloud account—like, we'll integrate with everything and we'll try to build that map of, like, here's everything that's out there, and start to try to assign it to, like, here's the people that we think might be responsible in terms of owning the software. That's generally the starting point.
Corey: Which makes an awesome amount of sense.
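As a rough sketch of what that kind of integration does under the hood (this is illustrative, not OpsLevel's actual implementation, and the "owner" tag name is an assumed convention), a minimal inventory pass over one AWS account with boto3 might look like:

    # Hypothetical catalog-seeding pass over one AWS account.
    # Illustrative only; the "owner" tag name is an assumed convention.
    import boto3

    def discover_resources(region="us-east-1"):
        tagging = boto3.client("resourcegroupstaggingapi", region_name=region)
        catalog = {}
        for page in tagging.get_paginator("get_resources").paginate():
            for resource in page["ResourceTagMappingList"]:
                arn = resource["ResourceARN"]
                tags = {t["Key"]: t["Value"] for t in resource.get("Tags", [])}
                # Group by an ownership tag when present; everything else
                # lands in an "unassigned" bucket for a human to triage.
                catalog.setdefault(tags.get("owner", "unassigned"), []).append(arn)
        return catalog

    if __name__ == "__main__":
        for owner, arns in discover_resources().items():
            print(f"{owner}: {len(arns)} resources")

A real integration does far more (multiple accounts, Kubernetes, CI/CD, Git), but the shape is the same: enumerate what exists, then attach a best guess at ownership.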
I think going at it from the infrastructure first perspective is where I've seen most developer platforms founder. And to be fair, the job is easier now than it was years ago because it used to be that you were being out-innovated by AWS constantly. Innovation has slowed down there. And you know that because of how much they say the pace of innovation has only sped up.
And whenever AWS says something in a marketing context, they're insecure about it. I've learned this through the fullness of time observing that company. And these days, most customers do not use the majority of features available for any given service. They have solidified to a point where you can responsibly build on top of these things. Now, it seems that the problem is all the ‘yes, and' stuff that gets built on top of it.
Ken: Yeah. Do you have an example, actually, like, one of the kinds of, like, ‘yes, and' tools that you're thinking about?
Corey: Oh, absolutely. We have a bunch of AWS environment stuff, so we should configure CloudWatch to look at all these things from an observability perspective. No, you should not. You should set up Datadog. And the first time someone does that by hand, they enable all of the observability and the rest and suddenly get charged approximately the GDP of Guam.
And okay, maybe we shouldn't do that because then you have the downstream impact of that on your CloudWatch bill. So okay, how do we optimize this for the observability piece directly tied to that? How do we make sure that we get woken up when the site is down, or preferably before that, but not every time, basically, an EBS volume starts to get a little bit toasty? You have to start dialing this stuff in. And once you've found a lot of those aspects, being able to templatize that and roll that out on an ongoing basis and having the integrations all work together feels like it's the right problem to be solving.
Ken: Yeah, absolutely. And the group that I think is responsible for that kind of—because it's a set of problems you described—is really, like, platform teams. Sometimes service owners for, like, how should we get paged, but really, what you're describing are these kind of cross-cutting engineering concerns that platform teams are uniquely poised to help solve in an [unintelligible 00:32:03] organization, right? I was thinking about what you said earlier. Like, nobody just wants to rebuild the same info over and over, but it's sort of like, it's not just building an [unintelligible 00:32:09]; it's kind of like solving this, like, how do we ship? Can we actually run stuff in prod? And not just run it, but get observability and ensure that we're woken up for it and, like, what's that total end-to-end look like from, like, developers writing code to running software in production that's serving traffic? And solving all the problems [unintelligible 00:32:24], that's what I think of as platform engineering.
Corey: So, my last question before we wind up wrapping this episode comes down to, I am very adept at two different programming languages, and those are brute force and enthusiasm. What implementation language is most of what you find yourself working with? And why is it invariably going to be YAML?
Ken: Yeah, that's a great question. So, I think, in terms of implementing OpsLevel and implementing a service catalog, we support YAML. Like, you know, there's this very common workflow: you just drop a YAML spec, basically, in your repo, if you're a service owner. And that, we can support that.
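For concreteness, the kind of repo-level spec Ken is describing tends to look roughly like the sketch below; the field names are invented for illustration and are not OpsLevel's actual schema.

    # Hypothetical service descriptor checked into a service's repo.
    # Field names are illustrative, not OpsLevel's actual schema.
    service:
      name: payments-api
      owner: team-payments          # maps to a team in the catalog
      tier: tier-1                  # criticality
      lifecycle: production
      links:
        runbook: https://wiki.example.com/payments/runbook
        dashboard: https://metrics.example.com/d/payments
      dependencies:
        - ledger-service
        - fraud-scoring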
I don't think that's a great take, though. Like, we have other integrations. Again, if the problem you're trying to solve is I want to build a catalog of everything that's out there, asking each of your developers, hey, can you please all write YAML files that, like, describe the services you own and drop them into this repo? You've inverted this, like, database that essentially you're trying to build—like, what's out there—and stored it in Git, potentially across several hundreds or thousands of repos. You put a lot of toil now on individual product developers to go write and maintain these files. And if you ever had to, like, make a blanket update to these files, there's no atomic way to kind of do that, right?
So, I look at YAML as, like, I get it, you know? Like, we use YAML for all the things in DevOps, so why not the service catalog as well, but I think it's toil. Like, there are easier ways to build a catalog. By, kind of, just integrating. Like, hook up AWS, hook up GitHub, hook up Kubernetes, hook up your CI/CD pipeline, hook up all these different sources that have information about what's running in prod, and let the software, let the tool, automatically infer what's actually running as opposed to requiring humans to manually enter data.
Corey: I find that there are remarkably few technical holy wars that I cannot unify both sides on by nominating something far worse. Like, the VI versus Emacs stuff, the tabs versus spaces, and of course, the JSON versus YAML folks. My JSON versus YAML answer is XML: God's language. I find that as soon as I suggest that, people care a hell of a lot less about the differences between JSON and YAML because their job is now to kill the apostate, which is me.
Ken: Right. Yeah. I remember XML, like, oh, man, 2002. SOAP. I remember SOAP as a protocol. That was a thing.
Corey: Some of the earliest S3 API calls were done in SOAP, and I think they finally just used it to wash their mouths out when all was said and done.
Ken: Nice. Yeah.
Corey: I really want to thank you for taking the time to do your level best to attempt to convert me, and I would argue in many respects, you have succeeded. I'm thinking about this differently than I did half an hour ago. If people want to learn more, where's the best place for them to find you?
Ken: Absolutely. So, you can always check out our website, opslevel.com. We're also fairly active on LinkedIn. If Twitter hasn't imploded by the time this episode launches, then they can also check us out at twitter.com/OpsLevelHQ. We're always posting just different content on, like, how to be successful with service maturity, DevOps, developer productivity, so that, you know, ultimately, you can ship out to customers faster.
Corey: And we will, of course, put links to that in the [show notes 00:35:23]. Thank you so much for taking the time, not just to speak with me, but also for sponsoring this episode. It is appreciated.
Ken: Cheers.
Corey: Ken Rose, CTO and co-founder at OpsLevel. I'm Cloud Economist Corey Quinn and this has been a promoted guest episode of Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment which, upon further reflection, you could have posted to all of the podcast platforms if only you had the right developer platform to pull it off.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Screaming in the Cloud
Honeycomb on Observability as Developer Self-Care with Brooke Sargent

Screaming in the Cloud

Play Episode Listen Later May 25, 2023 33:09


Brooke Sargent, Software Engineer at Honeycomb, joins Corey on Screaming in the Cloud to discuss how she fell into the world of observability by adopting Honeycomb. Brooke explains how observability was new to her in her former role, but she quickly found it to enable faster learning and even a form of self care for herself as a developer. Corey and Brooke discuss the differences of working at a large company where observability is a new idea, versus an observability company like Honeycomb. Brooke also reveals the importance of helping people reach a personal understanding of what observability can do for them when trying to introduce it to a company for the first time. About BrookeBrooke Sargent is a Software Engineer at Honeycomb, working on APIs and integrations in the developer ecosystem. She previously worked on IoT devices at Procter and Gamble in both engineering and engineering management roles, which is where she discovered an interest in observability and the impact it can have on engineering teams.Links Referenced: Honeycomb: https://www.honeycomb.io/ Twitter: https://twitter.com/codegirlbrooke TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest episode—which is another way of saying sponsored episode—is brought to us by our friends at Honeycomb. And today's guest is new to me. Brooke Sargent is a software engineer at Honeycomb. Welcome to the show, Brooke.Brooke: Hey, Corey, thanks so much for having me.Corey: So, you were part of I guess I would call it the new wave of Honeycomb employees, which is no slight to you, but I remember when Honeycomb was just getting launched right around the same time that I was starting my own company and I still think of it as basically a six-person company versus, you know, a couple of new people floating around. Yeah, turns out, last I checked, you were, what, north of 100 employees and doing an awful lot of really interesting stuff.Brooke: Yeah, we regularly have, I think, upwards of 100 in our all-hands meeting, so definitely growing in size. I started about a year ago and at that point, we had multiple new people joining pretty much every week. So yeah, a lot of new people.Corey: What was it that drove you to Honeycomb? Before this, you spent a bit of time over Procter and Gamble. You were an engineering manager and now you're going—you went from IC to management and now you're IC again. There's a school of thought that I vehemently disagree with, that that's a demotion. I think they are orthogonal skill sets to my mind, but I'm curious to hear your journey through your story.Brooke: Yeah, absolutely. So yeah, I worked at Procter and Gamble, which is a big Cincinnati company. That's where I live and I was there for around four years. And I worked in both engineering and engineering management roles there. I enjoy both types of roles.What really drove me to Honeycomb is, my time at Procter and Gamble, I spent probably about a year-and-a-half, really diving into observability and setting up an observability practice on the team that I was on, which was working on connected devices, connected toothbrushes, that sort of thing. 
So, I set up an observability practice there and I just saw so much benefit to the engineering team culture and the way that junior and apprentice engineers on the team were able to learn from it, that it really caught my attention. And Honeycomb is what we were using and I kind of just wanted to spend all of my time working on observability-type stuff.
Corey: When you say software engineer, my mind immediately shortcuts to a somewhat outdated definition of what that term means. It usually means application developer, to my mind, whereas I come from the world of operations, historically sysadmins, which it still is except now, with better titles, you get more money. But that's functionally what SRE and DevOps and all the rest of the terms still currently are, which is, if it plugs into the wall, congratulations. It's your problem now to go ahead and fix that thing immediately. Were you on the application development side of the fence? Were you focusing on the SRE side of the world or something else entirely?
Brooke: Yeah, so I was writing Go code in that role at P&G, but also doing what I call, like, AWS pipe-connecting, so a little bit of both writing application code but also definitely thinking about the architecture aspects and lining those up appropriately using a lot of AWS serverless and managed services. At Honeycomb, I definitely find myself—I'm on the APIs and partnerships team—I find myself definitely writing a lot more code and focusing a lot more on code because we have a separate platform team that is focusing on the AWS aspects.
Corey: One thing that I find interesting is that it is odd, in many cases, to see, first, a strong focus on observability coming from the software engineer side of the world. And again, this might be a legacy of where I was spending a lot of my career, but it always felt like getting the application developers to instrument whatever it was that they were building felt in many ways like it was pulling teeth. And in many further cases, it seemed that they didn't really understand the value of having that visibility or that perspective into what's going on in their environment until immediately after, when they really wished they had it, but didn't. It's similar to, no one is as zealous about backups as someone who's just suffered a data loss. Same operating theory. What was it that made you, coming from the software engineering side, give a toss about the idea of observability?
Brooke: Yeah, so working on the IoT—I was working on, like, the cloud side of things, so in Internet of Things, you're keeping a mobile application, firmware, and cloud synced up. So, I was focused on the cloud aspect of that triangle. And we got pretty close to launching this greenfield IoT cloud that we were working on for P&G, like, we were probably a few months from the initial go-live date, as they like to call it, and we didn't have any observability. We were just kind of sending things to CloudWatch logs. And it was pretty painful to figure out when something went wrong, from, like, you know, hearing from a peer on a mobile app team or the firmware team that they sent us some data and they're not seeing it reflected in the cloud that is, like, syncing it up.
Figuring out where that went wrong just using CloudWatch logs was pretty difficult, and so was syncing up the requests that they were talking about to the specific CloudWatch log that had the information that we needed—if we had even logged the right thing.
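The pain being described, matching a report from the mobile or firmware team to the right cloud-side log line, is what structured logs with a shared correlation ID address. A minimal sketch (the field names, and the assumption that clients pass a request ID along, are illustrative):

    # Minimal structured-logging sketch: emit JSON with a correlation ID so
    # a request reported by another team can be matched to the exact
    # cloud-side log entries. Field names are illustrative.
    import json
    import logging
    import time

    logger = logging.getLogger("sync-service")
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_event(event, request_id, **fields):
        logger.info(json.dumps({
            "event": event,
            "request_id": request_id,  # shared across mobile, firmware, cloud
            "timestamp": time.time(),
            **fields,
        }))

    def handle_sync(request_id, device_id, payload):
        log_event("sync.received", request_id, device_id=device_id,
                  payload_bytes=len(payload))
        # ... validate and store the payload ...
        log_event("sync.stored", request_id, device_id=device_id)

With JSON logs like these, a CloudWatch Logs Insights query such as filter request_id = "abc-123" can pull back every hop of a single request, which is exactly the lookup that was painful here.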
And I was getting a little worried about the fact that people were going to be going into stores and buying these toothbrushes and we might not have visibility into what could be going wrong, or even be able to be proactive about what is going wrong. So, then I started researching observability. I had seen people talking about it as a best practice thing that you should think about when you're building a system, but I just hadn't had the experience with it yet. So, I experimented with Honeycomb a bit and ended up really liking their approach to observability. It fit my mental model and made a lot of sense. And so, I went full-steam ahead with implementing it.
Corey: I feel like what you just said is very key: the idea of finding an observability solution that keys into the mental model that someone's operating with. I found that a lot of observability talk sailed right past me because it did not align with that, until someone said, “Oh yeah, and then here's events.” “Well, what do you mean by event?” It distills down to logs. And oh, if you start viewing everything as a log event, then yeah, that suddenly makes a lot of sense, and that made it click for me in a way that, honestly, is a little embarrassing that it didn't before then.
But I come from a world before containers and immutable infrastructure and certainly before the black boxes that are managed serverless products, where I'm used to, oh, something's not working on this Linux box. Well, I have root, so let's go ahead and fix that and see what's going on. A lot of those tools don't work, either at scale or in ephemeral environments or in scenarios where you just don't have access to the environment. So, there's this idea that if you're trying to diagnose something that happened and the container that it happened on stopped existing 20 minutes ago, your telemetry game has got to be on point or you're just guessing at that point. That is something where I think I did myself a bit of a disservice by getting out of hands-on-keyboard operations roles before that type of architecture really became widespread.
Brooke: Yeah, that makes a lot of sense. On the team that I was on, we were using a lot of AWS Lambda and similarly, tracking things down could be a little bit challenging. And emitting telemetry data also has some quirks [laugh] with Lambda.
Corey: There certainly are. It's also one of those areas where, on some level, being stubborn about adopting it works to your benefit. Because when Lambda first came out, it was a platform that was almost entirely identified by its constraints. And Amazon didn't do a terrific job, at least in the way that I tend to learn, of articulating what those constraints are. So, you learn by experimenting and smacking face-first into a lot of those things.
What the hell do you mean you can't write to the file? Oh, it's a read-only file system. [except /tmp 00:08:39]. What do you mean, it's only half a gigabyte? Oh, that's the constraint there. Well, what do you mean, it automatically stops after—I think back at that point it was five or ten minutes; it's 15 these days. But—
Brooke: Right.
Corey: —I guess it's their own creative approach to solving the halting problem from computer science classes, where after 15 minutes, your code will stop executing, whether you want it to or not. They're effectively evolving these things as we go and once you break your understanding in a few key ways, at least from where I was coming from, it made a lot more sense.
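Those constraints are easy to demonstrate. Inside a Lambda function, the filesystem is read-only except for /tmp (512 MB by default, though it is configurable higher these days), so scratch work has to go there. A toy Python handler:

    # Toy Lambda handler illustrating the read-only filesystem constraint:
    # writes anywhere other than /tmp raise an error, so scratch work goes there.
    import os

    def handler(event, context):
        scratch = "/tmp/scratch.txt"   # the only writable path
        with open(scratch, "w") as f:
            f.write("scratch data")
        size = os.path.getsize(scratch)
        # /tmp can persist across warm invocations of the same sandbox,
        # so clean up anything you don't want lingering.
        os.remove(scratch)
        return {"wrote_bytes": size}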
But ugh, that was a rough couple of weeks for me.
Brooke: Yeah [laugh]. Agreed.
Corey: So, a topic that you have found personally inspiring is that observability empowers junior engineers in a bunch of ways. And I do want to get into that, but beforehand, I am curious as to the modern-day path for SREs because it doesn't feel to me like there is a good answer for, “What does a junior SRE look like?” Because the answer is, “Oh, they don't.” It goes back to the old sysadmin school of thought, which is that, oh, you basically learn by having experience. I've lost count of the number of startups I've encountered where you have a bunch of early-20-something engineers but the SRE folks are all generally a decade into what they've been doing because the number-one thing you want to hear from someone in that role is, “Oh, the last time I saw it, here's what it was.” What is the observability story these days for junior engineers?
Brooke: So, with SRE I agr—like, that's a conversation that I've had a lot of times on different teams that I've been on, is just, can a junior SRE exist? And I think that they can.
Corey: I mean, they have to because otherwise, it's, well, where does the SRE come from? Oh, they spring—
Brooke: [laugh].
Corey: —fully formed from the forehead of some God out of mythology. It doesn't usually work that way.
Brooke: Right. But you definitely need a team that is ready to support a junior SRE. You need a robust team that is interested in teaching and mentoring. And not all teams are like that, so making sure that you have a team culture that is receptive to taking on a junior SRE is step number one. And then I think that the act of having an observability practice on a team is very empowering to somebody who is new to the industry.
Myself, I came from a self-taught background, learning to code. I actually have a music degree; I didn't go to school for computer science. And when I finally found my way to observability, it made so many, kind of, light bulbs go off, of just giving me more visuals to go from, “I think this is happening,” to, “I know this is happening.” And then when I started mentoring juniors and apprentices and putting that same observability data in front of them, I noticed them learning so much faster.
Corey: I am curious in that you went from implementing a lot of these things and being in a management role of mentoring folks on observability concepts to working for an observability vendor, which is… I guess I would call Honeycomb the observability vendor. They were the first to really reframe a lot of how we considered what used to be called monitoring and now it's called observability, or as I think of it, hipster monitoring.
Brooke: [laugh].
Corey: But I am curious as to when you look at this, my business partner wrote a book for O'Reilly, Practical Monitoring, and he loved it so much that by the end of that book, he got out of the observability monitoring space entirely and came to work on AWS bills with me. Did you find that going to Honeycomb has changed your perspective on observability drastically?
Brooke: I had definitely consumed a lot of Honeycomb's blog posts, like, that's one of the things that I had loved about the company is they put out a lot of interesting stuff, not just about observability but about operating healthy teams, and like you mentioned, like, a pendulum between engineering management and being an IC, and just interesting concepts within our industry overall as, like, software engineers and SREs.
So, I knew a lot of the thought leadership that the company put out, and that was very helpful. It was a big change going from an enterprise like Procter and Gamble to a startup observability company like Honeycomb, just—and also, going from a company that very much believes in in-person work to remote-first work at Honeycomb, now. So, there were a lot of, like, cultural changes, but I think I kind of knew what I was getting myself into as far as the perspective that the company takes on observability.
Corey: That is always the big, somewhat awkward question because if the answer goes a certain way, it becomes a real embarrassment, but I'm confident enough, having worked with Honeycomb as a recurring sponsor and having helped out on the AWS bill side of the world since you were a reference client on both sides of that business, I want to be very clear that I don't think I'm throwing you under a bus on this one. But do you find that the reality, now that you've been there for a year, has matched the external advertising and the ethos of the story they tell about Honeycomb from the outside?
Brooke: I definitely think it matches up. One thing that is just different about working inside of a company like Honeycomb versus working at a company that doesn't have any observability at all yet, is that there are a lot of abstraction layers in our codebase and things like that. So, me being a software engineer and writing code at Honeycomb compared to P&G, I don't have to think about observability as much because everybody in the company is thinking about observability and had thought about it before I joined, and had put in a lot of thought into how to make sure that we consistently have the telemetry data that we need to solve problems, versus me thinking about this stuff on the daily at P&G.
Corey: Something I've heard from former employees of a bunch of different observability companies has a recurring theme to it, and it's that it's hard to leave. Because when you're at an observability company, everything is built with an eye toward observability. And there's always the dogfooding story of, we instrument absolutely everything we have with everything that we sell to customers. Now, in practice, when you leave and go to a different company, that is almost never going to be true, if for no other reason than based on simple economics. Turning on every facet of every observability tool that a given company sells becomes extraordinarily expensive and is an investment decision, so companies say yes to some, no to others. Do you think you're going to have that problem if and when you decide it's time to move on to your next role, assuming, of course, that it's not at a competing observability company?
Brooke: I'm sure there will be some challenges if I decide to move on from working for observability platforms in the future. The one that I think would be the most challenging is joining a team where people just don't understand the value of observability and don't want to invest, like, the time and effort into actually instrumenting their code, and don't see why they need to do it, versus just, like, they haven't gotten there yet or they haven't had enough people hired to do it just yet.
But if people are actively, like, kind of against the idea of instrumenting your code, I think that would be really challenging to kind of shift to, especially after, over the last two-and-a-half years or so, being so used to having this, like, extra sense when I'm debugging problems and dealing with outages.
Corey: I will say, it was a little surreal the first time I wound up taking a look at Honeycomb's environment—because I do believe that cost and architecture are fundamentally the same thing when it comes to cloud—and you had clear lines of visibility into what was going on in your AWS bill by way of Honeycomb as a product. And that's awesome. I haven't seen anyone else do that yet, and I don't know that it would necessarily work as well because, as you said, there, everyone's thinking about it through this same shared vision, whereas in a number of other companies, it flat out does not work that way. There are certain unknowns and questions. And from the outside, and when you first start down this path, it feels like a ridiculous thing to do, until you get to a point of seeing the payoff, and yeah, this makes an awful lot of sense.
I don't know that it would, for example, work as a generic solution for us to roll out to our various clients and say, oh, let's instrument your environment with this and see what's going on, because first, we don't have that level of ability to make change in customer environments. We are read-only for some very good reasons. And further, it also seems like, “Step one: change your entire philosophy around these sorts of things so we can help you spend less on AWS,” is a bit of a tall order.
Brooke: Yeah, agreed. And yeah, on previous teams that I've been on, I definitely—and I think it's fair, absolutely fair—that there were things where, especially using AWS serverless services, I was trying to get as much insight as possible into adding some of these services to our traces, like, AppSync was one example where I could not for the life of me figure out how to get AppSync API requests onto my Honeycomb trace. And I spent a lot of time trying to figure it out. And I had team members that would just be, like, you know, “Let's timebox this; let's not, like, sink all of our time into it.” And so, I think as observability evolves, hopefully carving out those patterns continues to get easier so that engineers don't have to spend all of their time, kind of, carving out those patterns.
Corey: It feels like that's the hard part, is the shift in perspective. Instrumenting a given tool into an environment is not the heavy lift compared to appreciating the value of it. Do you find that that was an easy thing for you to overcome, back when you were at Procter and Gamble, as far as people having already bought in, on some level, to observability from having seen it in some kinds of scenarios where it absolutely saved folks' bacon? Or was it the problem of, first you have to educate people about the painful problem that they have before they realize it is in fact, A, painful, and B, a problem, and then C, that you have something to sell them that will solve that? Because that pattern is a very hard sales motion to execute in most spaces. But you were doing it from the customer side first.
Brooke: Yeah. Yeah, doing it from the customer side, I was able to get buy-in on the team that I was on, and I should also say, like, the team that I was on was considered an innovation team.
We were in a separate building from, like, the corporate building and things like that, which I'm sure played into some of those cultural aspects and dynamics. But trying to educate people outside of our team and trying to build an observability practice within this big enterprise company was definitely very challenging, and it was a lot of spending time sharing information and talking to people about their stack and what languages and tools they're using and how this could help them. I think until people have had that, kind of, magical moment of using observability data to solve a problem for themselves, it can be very hard to really make them understand the value.
Corey: That was always my approach, because it feels like observability is a significant and sizable investment in infrastructure alone, let alone mental overhead, the teams to manage these things, et cetera, et cetera. And until you have a challenge that observability can solve, it feels like it is pure cost, similar to backups, where it's just a whole bunch of expense for no benefit until suddenly, one day, you're very glad you had it. Now, the world is littered with stories that are very clear about what happens when you don't have backups. Most people have a personal story around that, but it feels like it's less straightforward to point at a visceral story where not having observability really hobbled someone or something.
It feels like—because with the benefit of perfect hindsight, oh yeah, like a disk filled up and we didn't know about that. Like, “Ah, if we just had the right check, we would have caught that early on.” Yeah, coulda, woulda, shoulda, but it was a cascading failure that wasn't picked up until seven levels downstream. Do you think that that's the situation these days or am I misunderstanding how people are starting to conceive of this stuff?
Brooke: Yeah. I mean, I definitely have a couple of stories of even once I was on the journey to observability adoption—which I call it a journey because you don't just—kind of—snap your fingers and have observability—I started with one service, instrumenting that, and just, like, gradually, over sprints, would instrument more services and pull more team members in to do that as well. But when we were in that process of instrumenting services, there was one service which was our auth service—which maybe should have been the first one that we instrumented—that a code change was made and it was erroring every time somebody tried to sign up in the app. And if we had observability instrumentation in place for that service, it wouldn't have taken us, like, the four or five hours to find the problem of the one line of code that we had changed; we would have been able to see more clearly what error was happening and what line of code it was happening on and probably fix it within an hour.
And we had a similar issue with a Redshift database that we were running more on the metrics side of things. We were using it to send analytics data to other people in the company and that Redshift database just got maxed out at a certain point. The CPU utilization was at, like, 98% and people in the company were very upset and [laugh] having a very bad time querying their analytics data.
Corey: It's a terrific sales pitch for Snowflake, to be very direct, because you hear that story kind of a lot.
Brooke: Yeah, it was not a fun time.
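The auth-service story a few paragraphs back is the canonical case for span-level instrumentation: an exception recorded on a span points at the failing operation directly, instead of leaving you to grep logs for hours. A minimal sketch with the OpenTelemetry Python API (illustrative, not the actual P&G code; create_account stands in for whatever the real handler called, and the SDK still has to be configured to export somewhere):

    # Minimal OpenTelemetry sketch: wrap the signup path in a span and record
    # any exception on it, so the failing operation surfaces in the trace view.
    # Illustrative only; create_account is a stand-in for the real logic.
    from opentelemetry import trace
    from opentelemetry.trace import Status, StatusCode

    tracer = trace.get_tracer("auth-service")

    def sign_up(email: str):
        with tracer.start_as_current_span("auth.sign_up") as span:
            span.set_attribute("user.email_domain", email.split("@")[-1])
            try:
                create_account(email)  # assumed application function
            except Exception as exc:
                span.record_exception(exc)  # stack trace lands on the span
                span.set_status(Status(StatusCode.ERROR))
                raise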
But at that point, we started sending Redshift metrics data over to Honeycomb as well, so that we could keep a better pulse on what exactly was happening with that database.
Corey: So, here's sort of the acid test: people tend to build software when they're starting out greenfield in ways that emphasize their perspective on the world. For example, when I'm building something new, doesn't matter if it's tiny or just for a one-off shitposting approach, and it touches anything involving AWS, first thing I do out of the gate is I wind up setting tags so that I can do cost allocation work on it; someday, I'm going to wonder how much this thing cost. That is, I guess, my own level of brokenness.
Brooke: [laugh].
Corey: When you start building something at work from scratch, I guess this is part ‘you,' part ‘all of Honeycomb,' do you begin from that greenfield approach of Hello World of instrumenting it for observability, even if it's not explicitly an observability-focused workload? Or is it something that you wind up retrofitting with observability insights later, once it hits a certain point of viability?
Brooke: Yeah. So, if I'm at the stage of just kind of trying things out locally on my laptop, kind of outside of, like, the central repo for the company, I might not do observability data because I'm just kind of learning and trying things out on my laptop. Once I pull it into our central repo, there is some observability data that I am going to get, just in the way that we kind of have our services set up. And as I'm going through writing code to do whatever new feature I'm trying to do, I'm thinking about what things, when this breaks—not if it breaks; when it breaks [laugh]—am I going to want to know about in the future. And I'll add those things, kind of, on the fly just to make things easier on myself, and that's just kind of how my brain works at this point of thinking about my future self, which is, kind of like, the same definition of self-care. So, I think of observability as self-care for developers.
But later on, when we're closer to actually launching a thing, I might take another pass at just, like, okay, let's once again take a look at the error paths and how this thing can break and make sure that we have enough information at those points of error to know what is happening within a trace view of this request.
Corey: My two programming languages that I rely on the most are enthusiasm and brute force, and I understand this is not a traditional software engineering approach. But I've always found that getting observability in involved a retrofit, on some level. And it always was frustrating to me just because it felt like it was so much effort in various ways that I've just always kicked myself: I should have done this early on. But I've been on the other side of that, and it's like, should I instrument this with good observability? No, that sounds like work. I want to see if this thing actually works at all or not, first.
And I don't know what side of the fence is the correct one to be on, but I always find that I'm on the wrong one. Like, I don't know if it's, like, one of those, there's two approaches and neither one works. I do see in client environments where observability is always, always, always something that has to be retrofit into what it is that they're doing. Does it stay that way once companies get past a certain point?
Does observability adoption among teams just become something that is ingrained into them or do people have to consistently relearn that same lesson, in your experience?
Brooke: I think it depends, kind of, on the size of your company. If you are a small company with a, you know, smaller engineering organization where it's not, I won't say easy, but easier to get kind of full team buy-in on points of view and decisions and things like that, it becomes more built-in. If you're in a really big company like the one that I came from, I think it is continuously, like, educating people and trying to show the value of, like, why we are doing this—coming back to that why—and, like, the magical moment of, like, stories of problems that have been solved because of the instrumentation that was in place. So, I guess, like most things, it's an ‘it depends.' But the larger that your company becomes, I think the harder it gets to keep everybody on the same page.
Corey: I am curious, in that I tend to see the world through AWS bills, which is a sad, pathetic way to live that I don't recommend to basically anyone, but I do see the industry, or at least my client base, forming a bit of a bimodal distribution. On one side, you have companies like Honeycomb, including, you know, Honeycomb, where the majority of your AWS spend is driven by the application that is Honeycomb, you know, the SaaS thing you sell to people to solve their problems. The other side of the world are companies that look a lot more like Procter and Gamble, presumably, where—because I think of, oh, what does Procter and Gamble do? And the answer is, a lot. They're basically the definition of conglomerate in some ways.
So, you look at a bill at a big company like that and it might be hundreds of millions of dollars, but the largest individual workload is going to be a couple million at best. So, it feels very much like it's this incredibly diffuse selection of applications. And in those environments, you have to start thinking a lot more about centralization things you can do, for example, for savings plan purchases and whatnot, whereas at Honeycomb-like companies, you can start looking at, oh, well, you have this single application that's the lion's share of everything. We can go very deep into architecture and start looking at micro-optimizations here that will have a larger impact. Having been an engineer at both types of companies, do you find that there's a different internal philosophy, or is it that when you're working in a larger company on a specific project, that specific project becomes your entire professional universe?
Brooke: Yeah, definitely at P&G, for the most part, IoT was kind of the center of my universe. But one philosophy that I noticed as being different—and I think this is from being an enterprise versus a startup—is just the way that thinking about cost and architecture choices, kind of, happened. So, at P&G, like I said, we were using a lot of Lambda, and pretty much any chance we got, we used a serverless or managed offering from AWS. And I think a big part of that reasoning was because, like I said earlier, P&G is very interested in in-person work. So, everybody that we hired had to be located in Cincinnati.
And it became hard to hire people who had Go and Terraform experience because a lot of people in the Midwest are much more comfortable in .NET and Java; there's just a lot more jobs using those technologies.
So, we had a lot of trouble hiring and would choose—because P&G had a lot of money to spend—to give AWS that money because we had trouble finding engineers to hire, whereas Honeycomb really does not seem to have trouble hiring engineers. They hire remote employees, and lots of people are interested in working at Honeycomb, and they also do not have the bank account [laugh] that Procter and Gamble has, so just thinking about cost and architecture is kind of a different beast. So, at Honeycomb, we are building a lot more services versus just always choosing a serverless or easy, like, AWS managed option to think about it less.
Corey: Yeah, at some level, it's an unfair question, just because it comes down to, in the fullness of time, even Honeycomb turns into something that looks a lot more like Procter and Gamble. Because, okay, you have the Honeycomb application. That's great, but as the company continues to grow and offer different things to different styles of customers, you start seeing a diffusion where, yeah, everything's still observability focused, but I can see a future where it becomes a bunch of different subcomponents. You make acquisitions of other companies that wind up being treated as separate environments and the rest. And in the fullness of time, I can absolutely see that that is the path that a lot of companies go down.
So, it might also just be that I'm looking at this through a perspective lens of… just company stage, as opposed to what the internal story of the company is. I mean, Procter and Gamble's, what, a century old give or take? Whereas Honeycomb is an ancient tech company, by which I mean it's over 18 months old.
Brooke: Yeah, P&G was founded in 1837. So that's—
Corey: Almost 200 years old. Wonderful.
Brooke: —quite old [laugh]. Yeah [laugh].
Corey: And for some reason, they did not choose to build their first technical backbone on top of AWS back then. I don't understand why, for the life of me.
Brooke: [laugh]. Yeah, but totally agree on your point that the kind of difference in thinking about cost and architecture definitely comes from company stage rather than necessarily the industry.
Corey: I really want to thank you for taking the time out of your day to talk with me about what you're up to and how you view these things. If people want to learn more, what's the best place for them to find you?
Brooke: Yeah, so I think the main place that I still sometimes am is Twitter: @codegirlbrooke is my username. But I'm only there sometimes, now [laugh].
Corey: I feel like that's a problem a lot of us are facing right now. Like, I'm more active on Bluesky these days, but it's still invite-only and it feels like it's too much of a weird flex to wind up moving people to just yet. I'm hoping that changes soon, but we'll see how it plays out. We'll, of course, put links to that in the [show notes 00:31:53]. I really want to thank you for taking the time out of your day to talk with me.
Brooke: Yeah, thanks so much for chatting with me. It was a good time.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Screaming in the Cloud
Cutting Costs in Cloud with Everett Berry

Screaming in the Cloud

Play Episode Listen Later May 9, 2023 31:59


Everett Berry, Growth and Open Source at Vantage, joins Corey on Screaming in the Cloud to discuss the complex world of cloud costs. Everett describes how Vantage takes a broad approach to understanding and cutting cloud costs across a number of different providers, and reveals which providers he feels generate large costs quickly. Everett also explains some of his best practices for cutting costs on cloud providers, and explores what he feels the impact of AI will be on cloud providers. Corey and Everett also discuss the pros and cons of AWS savings plans, why AWS can't be counted out when it comes to AI, and why there seems to be such a delay in upgrading instances despite the cost savings.

About Everett
Everett is the maintainer of ec2instances.info at Vantage. He also writes about cloud infrastructure and analyzes cloud spend. Prior to Vantage, Everett was a developer advocate at Arctype, a collaborative SQL client acquired by ClickHouse. Before that, Everett was cofounder and CTO of Perceive, a computer vision company. In his spare time, he enjoys playing golf, reading sci-fi, and scrolling Twitter.

Links Referenced:
Vantage: https://www.vantage.sh/
Vantage Cloud Cost Report: https://www.vantage.sh/cloud-cost-report
Everett Berry Twitter: https://twitter.com/retttx
Vantage Twitter: https://twitter.com/JoinVantage

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: LANs of the late '90s and early 2000s were a magical place to learn about computers, hang out with your friends, and do cool stuff like share files, run websites & game servers, and occasionally bring the whole thing down with some ill-conceived software or network configuration. That's not how things are done anymore, but what if we could have a '90s-style LAN experience along with the best parts of the 21st-century internet? (Most of which are very hard to find these days.) Tailscale thinks we can, and I'm inclined to agree. With Tailscale I can use trusted identity providers like Google, or Okta, or GitHub to authenticate users, and automatically generate & rotate keys to authenticate devices I've added to my network. I can also share access to those devices with friends and teammates, or tag devices to give my team broader access. And that's the magic of it, your data is protected by the simple yet powerful social dynamics of small groups that you trust.
Try now - it's free forever for personal use. I've been using it for almost two years personally, and am moderately annoyed that they haven't attempted to charge me for what's become an essential-to-my-workflow service.
Corey: Have you listened to the new season of Traceroute yet? Traceroute is a tech podcast that peels back the layers of the stack to tell the real, human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform, or learn more about Traceroute at origins.dev. My thanks to them for sponsoring this ridiculous podcast.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This seems like an opportune moment to take a step back and look at the overall trend in cloud—specifically AWS—spending.
And who better to do that with than this week's guest, Everett Berry, who does growth and open source over at Vantage. And they've just released the Vantage Cloud Cost Report for Q1 of 2023. Everett, thank you for joining me.
Everett: Thanks for having me, Corey.
Corey: I enjoy playing slap and tickle with AWS bills because I am broken in exactly that kind of way where this is the thing I'm going to do with my time and energy and career. It's rare to find people who are, I guess, similarly afflicted. So, it's great to wind up talking to you, first off.
Everett: Yeah, great to be with you as well. Last Week in AWS and, in particular, your Twitter account are things that we follow religiously at Vantage.
Corey: Uh-oh [laugh]. So, I want to be clear because I'm sure someone's thinking it out there, that, wait, Vantage does cloud cost optimization as a service? Isn't that what I do? Aren't we competitors? And the answer that I have to that is, not by any definition that I've ever seen that was even halfway sensible.
If SaaS could do the kind of bespoke consulting engagements that I do, we would not sell bespoke consulting engagements because it's easier to click button: receive software. And I also will point out that we tend to work once customers are at a certain point of scale that in many cases is a bit prohibitive for folks who are just now trying to understand what the heck's going on the first time finance has some very pointed questions about the AWS bill. That's how I see it from my perspective, anyway. Agree? Disagree?
Everett: Yeah, I agree with that. I think the product solution, the system of record that companies need when they're dealing with cloud costs, ends up being a different service than the one that you guys provide. And I think actually the two work in concert very well, where you establish a cloud cost optimization practice, and then you keep it in place via software and via sort of the various reporting tools that Vantage provides. So, I completely agree with you. In fact, in the hundreds of customers and deals that Vantage has worked on, I don't think we have ever come up against Duckbill Group. So, that tells you everything you need to know in that regard.
Corey: Yeah. And what's interesting about this is that you have a different scale of visibility into the environment. We wind up dealing with a certain profile, or a couple of profiles, in our customer base. We work with dozens of companies a year; you work with hundreds. And that's bigger numbers, of course, but also in many cases at different segments of the industry.
I also am somewhat fond of saying that Vantage is more focused on going broad in ways where we tend to focus on going exclusively deep. We do AWS; the end. You folks do a number of different cloud providers, you do Datadog cost visibility. I've lost track of all the different services that you wind up tracking costs for.
Everett: Yeah, that's right. We just launched our 11th provider, which was OpenAI, and for the first time in this report, we're actually breaking out data among the different clouds and we're comparing services across AWS, Google, and Azure. And I think it's a bit of a milestone for us because we started on AWS, where I think the cost problem is the most acute, if you will, and we've hit a point across Azure and Google where we actually have enough data to say some interesting things about how those clouds work.
But in general, we have this term, single pane of glass, which is the idea that you use 5, 6, 7 services, and you want to bundle all those costs into one report.Corey: Yeah. And that is something that we see in many cases where customers are taking a more holistic look at things. But, on some level, when people ask me, “Oh, do you focus on Google bills, too,” or Azure bills in the early days, it was, “Well, not yet. Let's take a look.” And what I was seeing was, they're spending, you know, millions or hundreds of millions, in some cases, on AWS, and oh, yeah, here's, like, a $300,000 thing we're running over on GCP as a proof-of-concept or some bizdev thing. And it's… yeah, why don't we focus on the big numbers first? The true secret of cloud economics is, you know, big numbers first rather than alphabetical, but don't tell anyone I told you that.Everett: It's pretty interesting you say that because, you know, in this graph where we break down costs across providers, you can really see that effect on Google and Azure. So, for example, the number three spending category on Google is BigQuery and I think many people would say BigQuery is kind of the jewel of the Google Cloud empire. Similarly for Azure, we actually found Databricks showing up as a top-ten service. Compare that to AWS where you just see a very routine, you know, compute, database, storage, monitoring, bandwidth, down the line. AWS still is the king of costs, if you will, in terms of, like, just running classic compute workloads. And the other services are a little bit more bespoke, which has been something interesting to see play out in our data.Corey: One thing that I've heard that's fascinating to me is that I've now heard from multiple Fortune 500 companies where the Datadog bill is now a board-level concern, given the size and scale of it. And for fun, once I modeled out all the instance-based pricing models that they have for the suite of services they offer, and at the time it was three or $400 a month per instance to run everything that they've got, which, you know, when you look at the instances that I have, costing, you know, 15, 20 bucks a month, in some cases, hmm, seems a little out of whack. And I can absolutely see that turning into an unbounded growth problem in kind of the same way. I just… I don't need to conquer the world. I'm not VC-backed. I am perfectly content at the scale that I'm at—Everett: [laugh].Corey: —with the focus on the problems that I'm focused on.Everett: Yeah, Datadog has been fascinating. It's been one of our fastest-growing providers of sort of the ‘others' category that we've launched. And I think the thing with Datadog that is interesting is you have this phrase, cloud costs are all about cloud architecture, and I think that's more true on Datadog than a lot of other services because if you have a model where you have, you know, thousands of hosts, and then you add on one of Datadog's 20 services, which charges per host, suddenly your cloud bill has grown exponentially compared to probably the thing that you were after. And a similar thing happens—actually, my favorite Datadog cost recommendation is, when you have multiple endpoints, and you have sort of multiple query parameters for those endpoints, you end up in this cardinality situation where suddenly Datadog is tracking, again, like, an exponentially increasing number of data points, which it's then charging to you on a usage-based model.
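To put rough numbers on the cardinality effect Everett describes, here is a back-of-the-envelope sketch in Python. The tag names and counts are invented for illustration, and this is not Datadog's actual billing logic:

```python
# Each unique combination of tag values becomes its own tracked (and
# billed) metric series; one high-cardinality tag multiplies the total.
endpoints = 50        # distinct API endpoints (assumed)
status_codes = 5      # status-code tag values (assumed)
user_ids = 10_000     # e.g., a raw query parameter used as a tag (assumed)

series_without_user_tag = endpoints * status_codes             # 250
series_with_user_tag = endpoints * status_codes * user_ids     # 2,500,000

print(f"without the user tag: {series_without_user_tag:,} series")
print(f"with the user tag:    {series_with_user_tag:,} series")
# Dropping the one high-cardinality tag cuts the series count 10,000-fold.
```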
And so, Datadog and AWS are great partners and I think it's no surprise because the two of them actually sort of go hand-in-hand in terms of the way that they… I don't want to say take ad—Corey: Extract revenue?Everett: Yeah, extract revenue. That's a good term. And, you know, you might say a similar thing about Snowflake, possibly, and the way that they do things. Like oh, the, you know, warehouse has to be on for one minute, minimum, no matter how long the query runs, and various architectural decisions that these folks make that if you were building a cost-optimized version of the service, you would probably go in the other direction.Corey: One thing that I'm also seeing, too, is that I can look at the AWS bill—and just billing data alone—and then say, “Okay, you're using Datadog, aren't you?” Like, “How did you know that?” Like, well, first, most people are; secondly, CloudWatch is your number two largest service spend right now. And it's the downstream effect of hammering all the endpoints with all of the systems. And is that data you're actually using? Probably not, in some cases. It's, everyone turns on all the Datadog integrations the first time and then goes back and resets and never does it again.Everett: Yeah, I think we have this set of advice that we give Datadog folks and a lot of it is just, like, turn down the ingestion volume on your logs. Most likely, logs from 30 days ago that are correlated with some new services that you spun up—like you just talked about—are potentially not relevant anymore for the kind of day-to-day cadence that you want to get into with your cloud spending. So yeah, I mean, I imagine when you're talking to customers, they're bringing up sort of like this interesting distinction where you may end up in a meeting room with the actual engineering team looking at the actual YAML configuration of the Datadog script, just to get a sense of, like, well, what are the buttons I can press here? And so, that's… yeah, I mean, that's one reason cloud costs are a pretty interesting world is, on the surface level, you may end up buying some RIs or savings plans, but then when you really get into saving money, you end up actually changing the knobs on the services that you're talking about.Corey: That's always a fun thing when we talk to people in our sales process. It's this sort of, “Are you just going to come in and tell us to buy savings plans or reserved instances?” Because the answer to that used to be, “No, that's ridiculous. That's not what we do.” But then we get into environments and find they haven't bought any of those things in 18 months.Everett: [laugh].Corey: —and it's well… okay, that's step two. Step one is what are you using you shouldn't be? Like, basically measure first, then cut, as opposed to going the other direction and then having to back your way into stuff. Doesn't go well.Everett: Yeah. One of the things that you were discussing last year that I thought was pretty interesting was the gp3 volumes that are now available for RDS and how those volumes, while they offer a nice discount and a nice bump in price-to-performance on EC2, actually don't offer any of that on RDS except for specific workloads. And so, I think that's the kind of thing where, as you're working with folks, as Vantage is working with people, the discussion ends up in these sort of nuanced niche areas, and that's why I think, like, these reports, hopefully, are helping people get a sense of, like, well, what's normal in my architecture or where am I sort of out of bounds?
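One way to get that “what's normal” view on the AWS side is the Cost Explorer API. A minimal boto3 sketch, assuming credentials that allow ce:GetCostAndUsage; the date range is illustrative:

```python
import boto3

# Cost Explorer lives behind a us-east-1 endpoint regardless of where
# the workloads themselves run.
ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-01-01", "End": "2023-04-01"},  # Q1 2023
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for month in resp["ResultsByTime"]:
    print(month["TimePeriod"]["Start"])
    top = sorted(
        month["Groups"],
        key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )[:5]
    for group in top:  # the five biggest service-level line items
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"  {group['Keys'][0]}: ${amount:,.2f}")
```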
Oh, the fact that I'm spending most of my bill on NAT gateways and bandwidth egress? Well, that's not normal. That would be something that would be not typical of what your normal AWS user is doing.Corey: Right. There's always the question of, “Am I normal?” It's one of the first things people love to ask. And it comes in different forms. But it's benchmarking. It's, okay, how much should it cost us to service a thousand monthly active users? It's like, there's no good way to say that across the board for everyone.Everett: Yeah. I like the model of getting into the actual unit costs. I have this sort of vision in my head of, you know, if I'm Uber and I'm reporting metrics to the public stock market, I'm actually reporting a cost to serve a rider, a cost to deliver an Uber Eats meal, in terms of my cloud spend. And that sort of data is just ridiculously hard to get to today. I think it's what we're working towards with Vantage and I think it's something that with these Cloud Cost Reports, we're hoping to get into over time, where we're actually helping companies think about, well, okay, within my cloud spend, it's not just what I'm spending on these different services, there's also an idea of how much of my cost to deliver my service should be realized by my cloud spending.Corey: And then people have the uncomfortable realization that wait, my bill is less a function of the number of customers I have and more of the number of engineers I've hired. What's going on with that?Everett: [laugh]. Yeah, it is interesting to me just how many people end up being involved in this problem at the company. But to your earlier point, the cloud spending discussion has really ramped up over the past year. And I think, hopefully, we are going to be able to converge on a place where we are realizing the promise of the cloud, if you will, which is that it's actually cheaper. And I think what these reports show so far is, like, we've still got a long ways to go for that.Corey: One thing that I think is opportune about the timing of this recording is that as of last week, Amazon wound up announcing their earnings. And Andy Jassy has started getting on the earnings calls, which is how you know it's bad because the CEO of Amazon never deigned to show up on those things before. And he said that a lot of AWS employees are focused and spending their time on helping customers lower their AWS bills. And I'm listening to this going, “Oh, they must be talking to different customers than the ones that I'm talking to.” Are you seeing a lot of Amazonian involvement in reducing AWS bills? Because I'm not and I'm wondering where these people are hiding.Everett: So, we do see one thing, which is reps pushing savings plans on customers, which in general, is great. It's kind of good for everybody: it locks people into longer-term spend on Amazon, it gets them a lower rate, and savings plans have some interesting functionality where they can be automatically applied to the area where they offer the most discount. And so, those things are all positive. I will say with Vantage, we're a cloud cost optimization company, of course, and so when folks talk to us, they often already have talked to their AWS rep.
And the classic scenario is that the rep passes over a large spreadsheet of options and ways to reduce costs, but for the company, that spreadsheet may end up being quite a ways away from the point where they actually realize cost savings.And ultimately, the people that are working on cloud cost optimization for Amazon are account reps who are comped by how much cloud spending their accounts are using on Amazon. And so, at the end of the day, some of the, I would say, most hard-hitting optimizations that you work on, that we work on, end up hitting areas where they do actually reduce the bill, which ends up being not in the account manager's favor. And so, it's a real chicken-and-egg game, except that savings plans are one area where I think everybody can kind of work together.Corey: I have found that… in fairness, there is some defense for Amazon in this, but their cost-cutting approach has been: rightsize instances, buy some savings plans, and we are completely out of ideas. Wait, can you switch to Graviton and/or move to serverless? And I used to make fun of them for this, but honestly that is some of the only advice that works across the board, irrespective, in most cases, of what a customer is doing. Everything else is nuanced and it depends.That's why in some cases, I find that I'm advising customers to spend more money on certain things. Like, the reason that I don't charge percentage of savings in part is because otherwise I'm incentivized to say things like, “Backups? What are you, some kind of coward? Get rid of them.” And that doesn't seem like it's going to be in the customer's interest every time. And as soon as you start down that path, it starts getting a little weird.But people have asked me, what if my customers reach out to their account teams instead of talking to us? And it's, we do bespoke consulting engagements; I do not believe that we have ever had a client who did not first reach out to their account team. If the account teams were capable of doing this at the level that worked for customers, I would have to be doing something else with my business. It is not something that we are seeing hit customers in a way that is effective, and certainly not at scale. You said—and you were right on this—that there's an element here of account managers doing this stuff, there's an [unintelligible 00:15:54] incentive issue in part, but it's also, quality is extraordinarily uneven when it comes to these things because it is its own niche and a lot of people focus in different areas in different ways.Everett: Yeah. And to the areas that you brought up in terms of general advice that's given, we actually have some data on this in this report. In particular, Graviton: this is something we've been tracking the whole time we've been doing these reports, which is the past three quarters, and we actually are seeing Graviton adoption start to increase more rapidly than it was before. And so, for this last quarter, Q1, we're seeing 5% of our costs that we're measuring on EC2 coming from Graviton, which is up from, I want to say, 2% the previous quarter, and, like, less than 1% the quarter before. The previous quarter, we also reported that Lambda costs are now majority on ARM among the Vantage customer base.And that one makes some sense to me just because in most cases with Lambda, it's a flip of a switch.
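That “flip of a switch” is close to literal: a Lambda function's architecture travels with its code package. A hedged boto3 sketch, where the function name and zip path are placeholders, and which assumes a pure-Python package with no x86-only native dependencies:

```python
import boto3

lam = boto3.client("lambda")

# Re-upload the existing deployment package, retargeting it to 64-bit
# ARM (Graviton-based Lambda).
with open("deployment.zip", "rb") as f:
    lam.update_function_code(
        FunctionName="my-function",   # placeholder function name
        ZipFile=f.read(),
        Architectures=["arm64"],
    )
```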
And then to your archival point on backups, something that we report on in this one is intelligent tiering, which we saw, like, really make an impact for folks towards the end of last year; the numbers for that were flat quarter over quarter. And so, what I mean by that is, we reported that I think, like, two-thirds of our S3 costs are still in the standard storage tier, which is the most expensive tier. And folks have enabled S3 intelligent tiering, which moves your data to progressively cheaper tiers, but we haven't seen that increase this quarter. So, it's the same number as it was last quarter.And I think it speaks to what you're talking about with a ceiling on some cost optimization techniques, where it's like, you're not just going to get rid of all your backups; you're not just going to get rid of your, you know, Amazon WorkSpaces archived desktop snapshots that you need for some HIPAA compliance reason. Those things have an upper limit and so that's where, when the AWS rep comes in, it's like, as they go through the list of top spending categories, the recommendations they can give start to provide diminishing returns.Corey: I also think this is sort of a law of large numbers issue. When you start seeing a drop off in the growth rate of large cloud providers, like, there's a problem, in that there are only so many exabyte-scale workloads that can be moved inside of a given quarter into the cloud. You're not going to see the same unbounded infinite growth that you would expect mathematically. And people lose their minds when they start to see those things pointed out, but the blame that, oh, that's caused by cost optimization efforts, with respect, bullshit it is. I have seen customers devote significant efforts to reducing their AWS bills and it takes massive amounts of work and even then they don't always succeed in getting there.It gets better, but they still wind up a year later, having spent more on a month-by-month basis than they did when they started. Sure, they understand it better and it's organic growth that's driving it and they've solved the low hanging fruit problem, but there is a challenge in acting as a boundary for what is, in effect, an unbounded growth problem.Everett: Yeah. And speaking to growth, I thought Microsoft had the most interesting take on where things could happen next quarter, and that, of course, is AI. And so, they attributed, I think it was, 1% of their guidance of 26 or 27% growth for Q2 cloud revenue to AI. And I think Amazon is really trying to be in the room for those discussions when a large enterprise is talking about AI workloads because it's one of the few remaining cloud workloads that, if it's not in the cloud already, is generating potentially massive amounts of growth for these guys.And so, I'm not really sure if I believe the 1% number. I think Microsoft may be having some fun with the fact that, of course, OpenAI is paying them for acting as a cloud provider for ChatGPT and for their API, but I do think that AWS, although they were maybe a little slow to the game, did, to their credit, launch a number of AI services, and I'm excited to see if that contributes to the costs that we're measuring next quarter. We did measure, for the first time, a sudden increase on those new [Inf1 00:20:17] EC2 instances, which are optimized for machine learning.
And I think if AWS can have success moving customers to those the way they have with Graviton, then that's going to be a very healthy area of growth for them.Corey: I'll also say that it's pretty clear to me that Amazon does not know what it's doing in its world of machine-learning-powered services. I use Azure for the [unintelligible 00:20:44] clients I built originally for Twitter, then for Mastodon—I'm sure Bluesky is coming—but the problem that I'm seeing there is across the board, start to finish, that there is no cohesive story from the AWS side of “here's a picture, tell me what's in it, and if it's words, describe it to me.” That's a single API call when we go to Azure. And the more that Amazon talks about something, I find, the less effective they're being in that space. And they will not stop talking about machine learning. Yes, they have instances that are powered by GPUs; that's awesome. But they're an infrastructure provider and moving up the stack is not in their DNA. But that's where all the interest and excitement and discussion is going to be increasingly in the AI space. Good luck.Everett: I think it might be something similar to what you've talked about before with all the options to run containers on AWS. I think they today have a bit of a grab bag of services and they may actually be looking forward to the fact that there are these truly foundational models which let you do a number of tasks, and so they may not need to rely so much on, you know, Amazon Polly and Amazon Rekognition and sort of these task-specific services, which to date, I'm not really sure of the takeoff rates on those. We have this cloud costs leaderboard and I don't think you would find them in the top 50 of AWS services. But we'll see what happens with that.AWS, I think, ends up being surprisingly good at sticking with it. I think our view is that they probably have the most customer spend on Kubernetes of any major cloud, even though you might say Google at first had the lead on Kubernetes and maybe should have done more with GKE. But to date, I would kind of agree with your take on AI services and I think Azure is… it's Azure's to lose for the moment.Corey: I would agree. I think the future of the cloud is largely Azure's to lose and it has been for a while, just because they get user experience, they get how to talk to enterprises. I just… I wish they would get security a little bit more effectively, and, failing that, communicate with their customers about security more effectively. But it's hard for a leopard to change its spots. Microsoft, though, has demonstrated an ability to change their nature multiple times, in ways that I would have bet were impossible. So, I just want to see them do it again. It's about time.Everett: Yeah, it's been interesting building on Azure for the past year or so. I wrote a post recently about, kind of, accessing billing data across the different providers and it's interesting in that every cloud provider is unique in the way that it simply provides an external endpoint for downloading your billing data, but Azure is probably one of the easiest integrations; it's just a REST API. However, behind that REST API are, like, years and years of different ways to pay Microsoft: are you on a pay-as-you-go plan, are you on an Azure enterprise plan?
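For what it's worth, the Azure endpoint Everett describes really is a single REST call. A sketch using the requests library, where the subscription scope, API version, and token acquisition are all assumptions that vary by account setup:

```python
import requests

# Azure Cost Management query: month-to-date actual cost, summed daily.
# The scope is a placeholder subscription; the bearer token would come
# from Azure AD (for example, via the azure-identity package).
scope = "/subscriptions/00000000-0000-0000-0000-000000000000"
url = (
    "https://management.azure.com"
    f"{scope}/providers/Microsoft.CostManagement/query"
    "?api-version=2023-03-01"
)
body = {
    "type": "ActualCost",
    "timeframe": "MonthToDate",
    "dataset": {
        "granularity": "Daily",
        "aggregation": {"totalCost": {"name": "Cost", "function": "Sum"}},
    },
}

resp = requests.post(url, json=body, headers={"Authorization": "Bearer <token>"})
resp.raise_for_status()
for row in resp.json()["properties"]["rows"]:
    print(row)  # one [cost, date, currency] row per day in the window
```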
So, there's all this sort of organizational complexity hidden behind Azure and I think sometimes it rears its ugly head in a way that stringing together services on Amazon may not, even if that's still a bear in and of itself, if you will.Corey: Any other surprises that you found in the Cloud Cost Report? I mean, looking through it, it seems directionally aligned with what I see in my environments with customers. Like for example, you're not going to see Kubernetes showing up as a line item on any of these things just because—Everett: Yeah.Corey: That is indistinguishable from a billing perspective when we're looking at EC2 spend versus control plane spend. I don't tend to [find 00:24:04] too much that's shocking me. My numbers are, of course, different percentage-wise, but surprise, surprise, different companies doing different things doing different percentages, I'm sure only AWS knows for sure.Everett: Yeah, I think the biggest surprise was just the—and this could very well just be kind of measurement method, but I really expected to see AI services driving more costs, whether it was GPU instances or AI-specific services—which we actually didn't report on at all, just because they weren't material—or just any indication that AI was a real driver of cloud spending. But I think what you see instead is sort of the same old folks at the top, and if you look at the breakdown of services across providers, that's, you know, compute, database, storage, bandwidth, monitoring. And if you look at our percentage of AI costs as a percentage of EC2 costs, it's relatively flat, quarter over quarter. So, I would have thought that would have shown up in some way in our data and we really didn't see it.Corey: It feels like there's a law of large numbers thing. Everyone's talking about it. It's very hype right now—Everett: Yeah.Corey: But it's also—you talk to these companies, like, “Okay, we have four exabytes of data that we're storing and we have a couple hundred thousand instances at any given point in time, so yeah, we're going to start spending $100,000 a month on our AI adventures and experiments.” It's like, that's just noise and froth in the bill, comparatively.Everett: Exactly, yeah. And so, that's why I think Microsoft's thought about AI driving a lot of growth in the coming quarters is interesting; we'll see how that plays out, basically. The one other thing I would point to—and this is probably not surprising, maybe, for you, having been in the infrastructure world and seeing a lot of this—is just the length of time it takes companies to upgrade their instance cycles. We're clocking in at almost three years since the C6 series instances were released, and we're just now seeing C6 and R6 start to edge above 10% of our compute usage. I actually wonder if that's just the stranglehold that Intel has on cloud computing workloads, because it was only last year around re:Invent that the C6in, the Intel version of the C6 series instances, was released. So, I do think in general, there's supposed to be a price-to-performance benefit of upgrading your instances, and so sometimes it surprises me to see how long it takes companies to get around to doing that.Corey: Generation 6 to 7 is also 6% more expensive in my sampling.Everett: Right. That's right. I think Amazon has some work to do to actually make that price-to-performance argument, sort of the way that we were discussing with gp2 versus gp3 volumes.
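On the EBS side of that gp2-versus-gp3 discussion, the migration itself is a one-line API call. A boto3 sketch with a placeholder volume ID; per the conversation, the same move on RDS does not carry the same price-to-performance win for every workload:

```python
import boto3

ec2 = boto3.client("ec2")

# Convert an existing gp2 volume to gp3 in place. gp3 decouples IOPS
# and throughput from volume size and lists roughly 20% cheaper per GB.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
    VolumeType="gp3",
)
```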
But yeah, I mean, other than that, I think, in general, my view is that we're past the worst of it, if you will, for cloud spending. Q4 was sort of a real letdown, I think, in terms of the data we had and the earnings that these cloud providers had, and I think Q1 is actually everyone looking forward to perhaps what we call out at the beginning of the report, which is a return to normal spend patterns across the cloud.Corey: I think that it's going to be an interesting case. One thing that I'm seeing that might very well explain some of the reluctance to upgrade EC2 instances has been that a lot of those EC2 instances are databases. And once those things are up and running and working, people are hesitant to do too much with them. One of the [unintelligible 00:27:29] roads that I've seen of their savings plan approach is that you can migrate EC2 spend to Fargate or to Lambda—and that's great—but not RDS. You're effectively leaving a giant pile of money on the table if you've made a three-year purchase commitment on these things. So, all right, we're not going to be in any rush to migrate to those things, which I think is AWS getting in its own way.Everett: That's exactly right. When we encounter customers that have a large amount of database spend, the most cost-effective option is almost always basically bare-metal EC2, even with the overhead of managing the backup, restore, and scalability of those things. So, in some ways, that's a good thing because it means that you can then take advantage of the, kind of, heavy committed use options on EC2, but of course, in other ways, it's a bit of a letdown because, in the ideal case, RDS would scale with the level of workloads and the economics would make more sense, but it seems that is really not the case.Corey: I really want to thank you for taking the time to come on the show and talk to me. I'll include a link in the [show notes 00:28:37] to the Cost Report. One thing I appreciate is the fact that it doesn't have one of those gates in front of it of: your email address, and what country you're in, and how can our salespeople best bother you. It's just, here's a link to the PDF. The end. So, thanks for that; it's appreciated. Where else can people go to find you?Everett: So, I'm on Twitter talking about cloud infrastructure and AI. I'm at @retttx, that's R-E-T-T-T-X. And then of course, Vantage also did quick hot-takes on this report with a series of graphs and explainers in a Twitter thread, and that's @JoinVantage.Corey: And we will, of course, put links to that in the [show notes 00:29:15]. Thank you so much for your time. I appreciate it.Everett: Thanks, Corey. Great to chat.Corey: Everett Berry, Growth and Open Source at Vantage. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that will increase its vitriol generation over generation by approximately 6%.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Screaming in the Cloud
Operating in the Kubernetes Cloud on Amazon EKS with Eswar Bala

Screaming in the Cloud

Play Episode Listen Later May 5, 2023 34:29


Eswar Bala, Director of Amazon EKS at AWS, joins Corey on Screaming in the Cloud to discuss how and why AWS built a Kubernetes solution, and what customers are looking for out of Amazon EKS. Eswar reveals the concerns he sees from customers about the cost of Kubernetes, as well as the reasons customers adopt EKS over ECS. Eswar gives his reasoning on why he feels Kubernetes is here to stay and not just hype, as well as how AWS is working to reduce the complexity of Kubernetes. Corey and Eswar also explore the competitive landscape of Amazon EKS, and the new product offering from Amazon called Karpenter.About EswarEswar Bala is a Director of Engineering at Amazon and is responsible for Engineering, Operations, and Product strategy for Amazon Elastic Kubernetes Service (EKS). Eswar leads the Amazon EKS and EKS Anywhere teams that build, operate, and contribute to the services customers and partners use to deploy and operate Kubernetes and Kubernetes applications securely and at scale. With a 20+ year career in software, spanning multimedia, networking, and container domains, he has built greenfield teams and launched new products multiple times.Links Referenced: Amazon EKS: https://aws.amazon.com/eks/ kubernetesthemuchharderway.com: https://kubernetesthemuchharderway.com kubernetestheeasyway.com: https://kubernetestheeasyway.com EKS documentation: https://docs.aws.amazon.com/eks/ EKS newsletter: https://eks.news/ EKS GitHub: https://github.com/aws/eks-distro TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: It's easy to **BEEP** up on AWS. Especially when you're managing your cloud environment on your own!Mission Cloud un **BEEP**s your apps and servers. Whatever you need in AWS, we can do it. Head to missioncloud.com for the AWS expertise you need. Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. Today's promoted guest episode is brought to us by our friends at Amazon. Now, Amazon is many things: they sell underpants, they sell books, they sell books about underpants, and underpants featuring pictures of books, but they also have a minor cloud computing problem. In fact, some people would call them a cloud computing company with a gift shop that's attached. Now, the problem with wanting to work at a cloud company is that their interviews are super challenging to pass.If you want to work there, but can't pass the technical interview for a long time, the way to solve that has been, “Ah, we're going to run Kubernetes so we get to LARP as if we worked at a cloud company but don't.” Eswar Bala is the Director of Engineering for Amazon EKS and is going to basically suffer my slings and arrows about one of the most complicated, and I would say overwrought, best practices that we're seeing industry-wide. Eswar, thank you for agreeing to subject yourself to this nonsense.Eswar: Hey, Corey, thanks for having me here.Corey: [laugh]. So, I'm a little bit unfair to Kubernetes because I wanted to make fun of it and ignore it. But then I started seeing it in every company that I deal with in one form or another.
So yes, I can still sit here and shake my fist at the tide, but it's turned into, “Old Man Yells at Cloud,” which I'm thrilled to embrace, but everyone's using it. So, EKS has recently crossed, I believe, the five-year mark since it was initially launched. What is EKS other than Amazon's own flavor of Kubernetes?Eswar: You know, the best way I can define EKS is, EKS is just Kubernetes. Not Amazon's version of Kubernetes. It's just Kubernetes that we get from the community and offer to customers to make it easier for them to consume. So, EKS. I've been with EKS from the very beginning when we thought about offering a managed Kubernetes service in 2017.And at that point, the goal was to bring Kubernetes to enterprise customers. So, we had many customers telling us that they wanted us to make their life easier by offering a managed version of Kubernetes that they were actually beginning to [erupt 00:02:42] at that time period, right? So, my goal was to figure out what that service would look like and which customer base we should be targeting the service towards.Corey: Kelsey Hightower has a fantastic learning tool out there in a GitHub repo called, “Kubernetes the Hard Way,” where he talks you through building the entire thing, start to finish. I wound up forking it and doing that on top of AWS, and you can find that at kubernetesthemuchharderway.com. And that was fun.And I went through the process and my response at the end was, “Why on earth would anyone ever do this more than once?” And we got that sorted out, but now it's—customers aren't really running these things from scratch. It's like the Linux from Scratch project. Great learning tool; probably don't run this in production in the same way that you might otherwise because there are better ways to solve for the problems that you will have to solve yourself when you're building these things from scratch. So, as I look across the ecosystem, it feels like EKS stands in the place of the heavy, undifferentiated lifting of running the Kubernetes control plane so customers functionally don't have to. Is that an effective summation of this?Eswar: That is precisely right. And I'm glad you mentioned, “Kubernetes the Hard Way”; I was a big fan of that when it came out. And anyone who did that tutorial, and also your tutorial, “Kubernetes the Harder Way,” would walk away thinking, “Why would I pick this technology when it's super complicated to set up?” But then you see that customers love Kubernetes and you see that reflected in the adoption, even in 2016, 2017 timeframes.And the reason is, it made life easier for application developers in terms of offering web services that they wanted to offer to their customer base. And because of all the features that Kubernetes brought on, application lifecycle management, service discovery, and then it evolved to support various application architectures, right, in terms of stateless services, stateful applications, and even daemon sets, right, like for running your logging and metrics agents. And these are powerful features, at the end of the day, and that's what drove Kubernetes. And because it's super hard to get going to begin with and then to operate, the day-two operator experience is super complicated.
And that exactly was our opportunity when we started in 2017. And when we started, there was a question of, okay, should we really build a service when you have an existing service like ECS in place? And by the way, like, I did work in ECS before I started working in EKS from the beginning.So, the answer then was that it was about giving customers what they want. And there was space for many container orchestration systems, right; ECS was the AWS service at that point in time. And our thinking was, how do we give customers what they wanted? They wanted a Kubernetes solution. Let's go build that. But we built it in a way that we remove the undifferentiated heavy lifting of managing Kubernetes.Corey: One of the weird things that I find is that everyone's using Kubernetes, but I don't see it in the way that I contextualize the AWS universe, which of course, is on the bill. That's right. If you don't charge for something in AWS-land, and preferably a fair bit, I don't tend to know it exists. Like, “What's an IAM and what might that possibly do?” Always a reassuring thing to hear from someone who's often called an expert in this space. But you know, if it doesn't cost money, why do I pay attention to it?The control plane is what EKS charges for, unless you're running a bunch of Fargate-managed pods and containers to wind up handling those things. So, it mostly just shows up as an addendum to the actual big, meaty portions of the bill. It just looks like a bunch of EC2 instances with some really weird behavior patterns, particularly with regard to auto-scaling and crosstalk between all of those various nodes. So, it's a little bit of a murder mystery, figuring out, “So, what's going on in this environment? Do you folks use containers at all?” And the entire Kubernetes shop is looking at me like, “Are you simple?”No, it's just I tend to disregard the lies that customers tell, mostly to themselves, because everyone has this idea of what's going on in their environment, but the bill speaks. It's always been a little bit of an investigation to get to the bottom of anything that involves Kubernetes at significant points of scale.Eswar: Yeah, you're right. Like if you look at EKS, right, like, we started with managing the control plane to begin with. And managing the control plane is a drop in the bucket when you actually look at the costs in terms of operating a Kubernetes cluster or running a Kubernetes cluster. When you look at how our customers use and where they spend most of their cost, it's about where their applications run; it's actually the Kubernetes data plane, and the amount of compute and memory that the applications end up using ends up driving 90% of the cost. And beyond that is the storage, beyond that is the networking cost, right, and then after that is the actual control plane cost. So, the problem right now is figuring out, how do we optimize our costs for the application to run on?Corey: On some level, it requires a little bit of understanding of what's going on under the hood. There have been a number of cost optimization efforts that have been made in the Kubernetes space, but they tend to focus around stuff that I find relatively, well, I call it banal because it basically is. You're looking at the idea of, okay, what size instances should you be running, and how well can you fill them and make sure that all the resources per node wind up being taken advantage of? But that's also something that, I guess from my perspective, isn't really the interesting architectural point of view.
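For reference, the “banal” arithmetic in question, as a toy sketch with invented node and pod sizes:

```python
import math

# Toy bin-packing check: given per-pod resource requests, how many
# nodes does the workload need, and how full do those nodes end up?
node_cpu, node_mem_gib = 4.0, 16.0               # assumed node shape
pod_cpu, pod_mem_gib, pod_count = 0.5, 1.0, 12   # assumed pod requests

nodes_for_cpu = math.ceil(pod_count * pod_cpu / node_cpu)
nodes_for_mem = math.ceil(pod_count * pod_mem_gib / node_mem_gib)
nodes = max(nodes_for_cpu, nodes_for_mem)        # CPU binds here: 2 nodes

cpu_util = pod_count * pod_cpu / (nodes * node_cpu)
mem_util = pod_count * pod_mem_gib / (nodes * node_mem_gib)
print(f"nodes needed: {nodes}")                  # 2
print(f"CPU utilization:    {cpu_util:.0%}")     # 75%
print(f"memory utilization: {mem_util:.0%}")     # 38%, stranded headroom
```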
Whether or not you're running a bunch of small instances or a few big ones or some combination of the two, that doesn't really move the needle on any architectural shift, whereas ingesting a petabyte a month of data and passing 50 petabytes back and forth between availability zones, that's where it starts to get really interesting as far as tracking that stuff down.But what I don't see is a whole lot of energy or effort being put into that. And I mean, industry-wide, to be clear. I'm not attempting to call out Amazon specifically on this. That's [laugh] not the direction I'm taking this in. For once. I know, I'm still me. But it seems to be just an industry-wide issue, where zone affinity for Kubernetes has been a very low priority item, even on project roadmaps on the Kubernetes project.Eswar: Yeah, Kubernetes does provide the ability for customers to restrict their workloads within a particular [unintelligible 00:09:20], right? Like, there are constraints that you can place on your pod specs that end up driving applications towards a particular AZ if they want, right? You're right, it's still left to the customers to configure. Just because there's a configuration available doesn't mean the customers use it. If it's not defaulted, most of the time, it's not picked up.That's where it's important for service providers—like EKS—to offer the ability to not only provide the visibility by means of reporting that's available using tools like Kubecost and Amazon Billing Explorer but also provide insights and recommendations on what customers can do. I agree that there's a gap today. For example, in EKS, in terms of that. Like, we're slowly closing that gap and it's something that we're actively exploring. How do we provide insights across all the resources customers end up using from within a cluster? That includes not just compute and memory, but also storage and networking, right? And that's where we are actually moving towards at this point.Corey: That's part of the weird problem I've found is that, on some level, you get to play almost data center archaeologist when you start exploring what's going on in these environments. I found one of the only reliable ways to get answers to some of this stuff has been the oral tradition of, “Okay, this Kubernetes cluster just starts hurling massive data quantities at 3 a.m. every day. What's causing that?” And it leads to, “Oh, no no, have you talked to the data science team,” like, “Oh, you have a data science team. A common AWS billing mistake.” And exploring down that particular path sometimes pays dividends. But there's no holistic way to solve that globally. Today. I'm optimistic about tomorrow, though.Eswar: Correct. And that's where we are spending our efforts right now. For example, we recently launched our partnership with Kubecost, and Kubecost is now available as an add-on from the Marketplace that you can easily install and provision on EKS clusters, for example. And that is a start. And Kubecost is amazing in terms of features, in terms of the insight it offers, right, looking into the compute, the memory, and the optimizations and insights it provides you.And we are also working with the AWS Cost and Usage Reporting team to provide a native AWS solution for the cost reporting and the insights aspect as well in EKS. And it's something that we are going to be working really closely on to solve the networking gaps in the near future.Corey: What are you seeing as far as customer concerns go, with regard to cost and Kubernetes?
I see some things, but let's be very clear here: I have a certain subset of the market that I spend an inordinate amount of time speaking to and I always worry that what I'm seeing is not holistically what's going on in the broader market. What are you seeing customers concerned about?Eswar: Well, let's start from the fundamentals here, right? Customers really want to get to market faster with whatever services and applications they want to offer. And they want it to be cheaper to operate. And if they're adopting EKS, they want it cheaper to operate in Kubernetes in the cloud. They also want high performance, they also want scalability, and they want security and isolation.There's so many parameters that they have to deal with before they put their service on the market and continue to operate. And there's a fundamental tension here, right? Like they want cost efficiency, but they also want to be available in the market quicker and they want performance and availability. Developers have uptime SLOs and SLAs to consider, and they want the maximum possible resources. And on the other side, you've got financial leaders and the business leaders who want to look at the spending and worry about, like, okay, are we allocating our capital wisely? And are we allocating where it makes sense? And are we doing it in a manner that there's very little wastage and aligned with our customer use, for example? And this is where the actual problems arise from [unintelligible 00:13:00].Corey: I want to be very clear that for a long time, one of the most expensive parts about running Kubernetes has not been the infrastructure itself. It's been the people to run this responsibly, where it's the day two, day three experience where for an awful lot of companies it's like, oh, we're moving to Kubernetes because, I don't know, we read it in an in-flight magazine or something and all the cool kids are doing it, which honestly during the pandemic is why suddenly everyone started making better IT choices, because their execs were not being exposed to airport ads. I digress. The point, though, is that as customers are figuring this stuff out and playing around with it, it's not sustainable that every company that wants to run Kubernetes can afford a crack SRE team that is individually incredibly expensive and collectively staggeringly so. It seems that the real cost is the complexity tied to it.And EKS has been great in that it abstracts an awful lot of the control plane complexity away. But I still can't shake the feeling that running Kubernetes is mind-bogglingly complicated. Please argue with me and tell me I'm wrong.Eswar: No, you're right. It's still complicated. And it's a journey towards reducing the complexity. When we launched EKS, we launched only with managing the control plane to begin with. And that's where we started, but customers had the complexity of managing the worker nodes.
And it's a journey where we continue to reduce the complexity of making it easier for customers to adopt Kubernetes. And once you cross that chasm—and we are still trying to cross it—once you cross it, you have the problem of, okay so, adopting Kubernetes is easy. Now, we have to operate it, right, which means that we need to provide better reporting tools, not just for costs, but also for operations. Like, how easy it is for customers to get to the application level metrics and how easy it is for customers to troubleshoot issues, how easy for customers to actually upgrade to newer versions of Kubernetes. All of these challenges come out beyond day one, right? And those are initiatives that we have in flight to make it easier for customers [unintelligible 00:15:39].Corey: So, one of the things I see when I start going deep into the Kubernetes ecosystem is, well, Kubernetes will go ahead and run the containers for me, but now I need to know what's going on in various areas around it. One of the big booms in the observability space, in many cases, has come from the fact that you now need to diagnose something in a container you can't log into and incidentally stopped existing 20 minutes for you got the alert about the issue, so you'd better hope your telemetry is up to snuff. Now, yes, that does act as a bit of a complexity burden, but on the other side of it, we don't have to worry about things like failed hard drives taking systems down anymore. That has successfully been abstracted away by Kubernetes, or you know, your cloud provider, but that's neither here nor there these days. What are you seeing as far as, effectively, the sidecar pattern, for example of, “Oh, you have too many containers and need to manage them? Have you considered running more containers?” Sounds like something a container salesman might say.Eswar: So, running containers demands that you have really solid observability tooling, things that you're able to troubleshoot—successfully—debug without the need to log into the containers itself. In fact, that's an anti-pattern, right? You really don't want a container to have the ability to SSH into a particular container, for example. And to be successful at it demands that you publish your metrics and you publish your logs. All of these are things that a developer needs to worry about today in order to adopt containers, for example.And it's on the service providers to actually make it easier for the developers not to worry about these. And all of these are available automatically when you adopt a Kubernetes service. For example, in EKS, we are working with our managed Prometheus service teams inside Amazon, right—and also CloudWatch teams—to easily enable metrics and logging for customers without having to do a lot of heavy lifting.Corey: Let's talk a little bit about the competitive landscape here. One of my biggest competitors in optimizing AWS bills is Microsoft Excel, specifically, people are going to go ahead and run it themselves because, “Eh, hiring someone who's really good at this, that sounds expensive. We can screw it up for half the cost.” Which is great. 
It seems to me that one of your biggest competitors is people running their own control plane, on some level.I don't tend to accept the narrative of, “Oh, EKS is expensive. That winds up being, what, 35 bucks or 70 bucks or whatever it is per control plane per cluster on a monthly basis.” Okay, yes, that's expensive if you're trying to stay completely within a free tier perhaps, but if you're running anything that's even slightly revenue-generating or a for-profit company, you will spend far more than that just on people's time. I have no problems—for once—with the EKS pricing model, start to finish. Good work on that. You've successfully nailed it. But are you seeing significant pushback from the industry of, “Nope, we're going to run our own Kubernetes management system instead because we enjoy pain, corporately speaking.”Eswar: Actually, we are in a good spot there, right? Like, at this point, customers who choose to run Kubernetes on AWS by themselves and not adopt EKS just fall into one main category, so—or two main categories: number one, they have an existing technical stack built on running Kubernetes themselves and they'd rather maintain that than move to EKS. Or they demand certain custom configurations of the Kubernetes control plane that EKS doesn't support. And those are the only two reasons why we see customers not moving into EKS and preferring to run their own Kubernetes clusters on AWS.[midroll 00:19:46]Corey: It really does seem, on some level, like there's going to be a… I don't want to say reckoning because that makes it sound vaguely ominous and that's not the direction that I intend for things to go in, but there has to be some form of collapsing of the complexity that is inherent to all of this because the entire industry has always done that. An analogy that I fall back on because I've seen this enough times to have the scars to show for it is that in the '90s, running a web server took about a week of spare time and an in-depth knowledge of GCC compiler flags. And then it evolved to ah, I could just unzip a tarball of precompiled stuff, and then RPM or Deb became a thing. And then Yum, or something else, or I guess apt over in the Debian land to wind up wrapping around that. And then you had things like Puppet where it was ensure installed. And now it's Docker Run.And today, it's a checkbox in the S3 console that proceeds to yell at you because you're making a website public. But that's neither here nor there. Things don't get harder with time. But I've been surprised by how I haven't yet seen that sort of geometric complexity collapsing around Kubernetes to make it easier to work with. Is that coming or are we going to have to wait for the next cycle of things?Eswar: Let me think. I actually don't have a good answer to that, Corey.Corey: That's good, at least, because if you did, I'd worry that I was just missing something obvious. That's kind of the entire reason I ask. Like, “Oh, good. I get to talk to smart people and see what they're picking up on that I'm absolutely missing.” I was hoping you had an answer, but I guess it's cold comfort that you don't have one off the top of your head. But man, is it confusing.Eswar: Yeah. So, there are some discussions in the community out there, right? Like, is Kubernetes the right layer to interact with? And there's some tooling built on top of Kubernetes, for example, Knative, which tries to provide a serverless layer on top of Kubernetes.
There are also attempts at abstracting Kubernetes completely and providing tooling that just completely removes any sort of Kubernetes API out of the picture, maybe a specific CI/CD-based solution that takes it from the source and deploys the service without even showing you that there's Kubernetes underneath, right?All of these are evolutions that are being tested out there in the community. Time will tell whether these end up sticking. But what's clear here is the gravity around Kubernetes. All sorts of tooling gets built on top of Kubernetes: all the operators, all sorts of open-source initiatives that are built to run on Kubernetes. For example, Spark; for example, Cassandra. So many of these big, large-scale, open-source solutions are now built to run really well on Kubernetes. And that is the gravity that's pushing Kubernetes at this point.Corey: I'm curious to get your take on one other space, one I would consider interestingly competitive. Now, because I have a domain problem, if you go to kubernetestheeasyway.com, you'll wind up on the ECS marketing page. That's right, the worst competition in the world: the people who work down the hall from you.If someone's considering using ECS, Elastic Container Service, versus EKS, Elastic Kubernetes Service, what is the deciding factor when a customer's making that determination? And to be clear, I'm not convinced there's a right or wrong answer. But I am curious to get your take, given that you have a vested interest, but also presumably don't want to talk complete smack about your colleagues. But feel free to surprise me.Eswar: Hey, I love ECS, by the way. Like I said, I started my life in AWS in ECS. So look, ECS is a hugely successful container orchestration service. I know we talk a lot about Kubernetes, I know there's a lot of discussions around Kubernetes, but I would make it a point that, like, ECS is a hugely successful service. Now, what determines which way customers go?If customers are… if the customer's tech stack is entirely on AWS, right, they use a lot of AWS services and they want an easy way to get started in the container world that has really tight integration with other AWS services without them having to configure a lot, ECS is the way, right? And customers have actually seen terrific success adopting ECS for that particular use case. Whereas EKS customers, they start with, “Okay, I want an open-source solution. I really love Kubernetes. I lo—or, I have tooling that I really like in the open-source land that really works well with Kubernetes. I'm going to go that way.” And those kinds of customers end up picking EKS.Corey: I feel like, on some level, Kubernetes has become the default API across a wide variety of environments. AWS obviously, but also on-prem and other providers. It seems like even the traditional VPS companies out there that offer just rent-a-server in the cloud somewhere are all also offering, “Oh, and we have a Kubernetes service as well.” I wound up backing a Kickstarter project that runs a Kubernetes cluster with a shared backplane across a variety of Raspberries Pi, for example. And it seems to be almost everywhere you look.Do you think that there's some validity to that approach of, effectively, whatever it is that we're going to wind up running in the future, it's going to be done on top of Kubernetes, or do you think that that's mostly hype-driven these days?
It's becoming the de facto container orchestration API. And with all the tooling, open-source tooling that's continuing to build on top of Kubernetes, CNCF tooling ecosystem that's actually spawned to actually support Kubernetes at option, all of this is solid proof that Kubernetes is here to stay and is a really strong, powerful API for customers to adopt.Corey: So, four years ago, I had a prediction on Twitter, and I said, “In five years, nobody will care about Kubernetes.” And it was in February, I believe, and every year, I wind up updating an incrementing a link to it, like, “Four years to go,” “Three years to go,” and I believe it expires next year. And I have to say, I didn't really expect when I made that prediction for it to outlive Twitter, but yet, here we are, which is neither here nor there. But I'm curious to get your take on this. But before I wind up just letting you savage the naive interpretation of that, my impression has been that it will not be that Kubernetes has gone away. That is ridiculous. It is clearly in enough places that even if they decided to rip it out now, it would take them ten years, but rather than it's going to slip below the surface level of awareness.Once upon a time, there was a whole bunch of energy and drama and debate around the Linux virtual memory management subsystem. And today, there's, like, a dozen people on the planet who really have to care about that, but for the rest of us, it doesn't matter anymore. We are so far past having to care about that having any meaningful impact in our day-to-day work that it's just, it's the part of the iceberg that's below the waterline. I think that's where Kubernetes is heading. Do you agree or disagree? And what do you think about the timeline?Eswar: I agree with you; that's a perfect analogy. It's going to go the way of Linux, right? It's here to stay; it just going to get abstracted out if any of the abstraction efforts are going to stick around. And that's where we're testing the waters there. There are many, many open-source initiatives there trying to abstract Kubernetes. All of these are yet to gain ground, but there's some reasonable efforts being made.And if they are successful, they just end up being a layer on top of Kubernetes. Many of the customers, many of the developers, don't have to worry about Kubernetes at that point, but a certain subset of us in the tech world will need to do a deal with Kubernetes, and most likely teams like mine that end up managing and operating their Kubernetes clusters.Corey: So, one last question I have for you is that if there's one thing that AWS loves, it's misspelling things. And you have an open-source offering called Karpenter spelled with a K that is an extending of that tradition. What does Karpenter do and why would someone use it?Eswar: Thank you for that. Karpenter is one of my favorite launches in the last one year.Corey: Presumably because you're terrible at the spelling bee back when you were a kid. But please tell me more.Eswar: [laugh]. So Karpenter, is an open-source flexible and high performance cluster auto-scaling solution. So basically, when your cluster needs more capacity to support your workloads, Karpenter automatically scales the capacity as needed. For people that know the Kubernetes space well, there's an existing component called Cluster Autoscaler that fills this space today. And it's our take on okay, so what if we could reimagine the capacity management solution available in Kubernetes? And can we do something better? 
Especially for cases where we expect terrific performance at scale to enable cost efficiency and optimization use cases for our customers, and most importantly, provide a way for customers not to pre-plan a lot of capacity to begin with.

Corey: This is something we see a lot, in the sense of very bursty workloads where, okay, you've got a steady-state load. Cool. Buy a bunch of savings plans, get things set up the way you want them, and call it a day. But when it's bursty, there are challenges with it. Folks love using Spot, but in the event of a sudden capacity shortfall, the question is, can we spin up capacity to backfill it within the two minutes of warning that we get? And if the answer is no, then it becomes a bit of a non-starter. Customers have had to build an awful lot of those things around EC2 instances that handle a lot of that logic for them in ways that are tuned specifically for their use cases. I'm encouraged to see there's a Kubernetes story around this that starts to remove some of that challenge from the customer side.

Eswar: Yeah. So, the burstiness is where complexity comes in [here 00:29:42], right? Like, many customers for steady state, they know what their capacity requirements are, they set up the capacity, they can also reason out what is the effective capacity needed for good utilization for economical reasons, and they can actually pre-plan that and set it up. But once burstiness comes in, which inevitably does it at [unintelligible 00:30:05] applications, customers worry about, “Okay, am I going to get the capacity that I need in time that I need to be able to service my customers? And am I confident in it?” If I'm not confident, I'm going to actually allocate capacity beforehand, assuming that I'm going to actually get the burst that I needed. Which means you're paying for resources that you're not using at the moment. And the burstiness might happen and then you're on the hook to actually reduce the capacity for it once the peak subsides at the end of the [day 00:30:36]. And this is a challenging situation. And this is one of the use cases that we targeted Karpenter towards.

Corey: I find the idea that you're open-sourcing this fascinating for two reasons. One, it does show a willingness to engage with the community that… again, it's difficult. When you're a big company, people love to wind up taking issue with almost anything that you do. But for another, it also puts it out in the open, on some level, where, especially when you're talking about cost optimization and decisions that affect cost, it's all out in public. So, people can look at this and think, “Wait a minute, it's not—what is this line of code that means if it's toward the end of the month, crank it up because we might need to hit our numbers.” Like, there's nothing like that in there. At least I'm assuming. I'm trusting that other people have read this code because honestly, that seems like a job for people who are better at that than I am. But that does tend to breed a certain element of trust.

Eswar: Right. It's one of the first things that we thought about when we said okay, so we have some ideas here to actually improve the capacity management solution for Kubernetes. Okay, should we do it out in the open? And the answer was a resounding yes, right?
I think there's a good story here that actually enables not just AWS to offer these ideas out there, right—we want to bring them to all sorts of Kubernetes customers. And one of the first things we did is to architecturally figure out all the core business logic of Karpenter, which is, okay, how to schedule better, how quickly to scale, what are the best instance types to pick for this workload. All of that business logic was abstracted out from the actual cloud provider implementation. And the cloud provider implementation is super simple. It's just creating instances, deleting instances, and describing instances. And it's something that we baked in from the get-go so it's easier for other cloud providers to come in and add their support to it. And we as a community actually can take these ideas forward in a much faster way than just AWS doing it.

Corey: I really want to thank you for taking the time to speak with me today about all these things. If people want to learn more, where's the best place for them to find you?

Eswar: The best place to learn about EKS, right, as EKS evolves, is our documentation; we have an EKS newsletter that you can go subscribe to, and you can also find us on GitHub where we share our product roadmap. So, those are great places to learn about how EKS is evolving and also to share your feedback.

Corey: Which is always great to hear, as opposed to, you know, in the AWS Console, where we live, waiting for you to stumble upon us, which, yeah. No, it's good you have a lot of different places for people to engage with you. And we'll put links to that, of course, in the [show notes 00:33:17]. Thank you so much for being so generous with your time. I appreciate it.

Eswar: Corey, really appreciate you having me.

Corey: Eswar Bala, Director of Engineering for Amazon EKS. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice telling me why, when it comes to tracking Kubernetes costs, Microsoft Excel is in fact the superior experience.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
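Karpenter itself is written in Go, but the split Eswar describes, with the core scheduling and scaling logic abstracted away from a cloud provider layer that only creates, deletes, and describes instances, is easy to sketch. The Python below is a rough illustration of that boundary only; the class and method names are invented for this example and are not Karpenter's actual interface.

# Illustrative sketch of the provider boundary Eswar describes: all the
# scheduling smarts live above this interface, while each cloud only has
# to implement create/delete/describe. Names here are invented, not
# Karpenter's real (Go) API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class NodeRequest:
    cpu_millicores: int   # total CPU the pending pods need
    memory_mib: int       # total memory the pending pods need
    capacity_type: str    # e.g. "spot" or "on-demand"


@dataclass
class Node:
    node_id: str
    instance_type: str
    ready: bool


class CloudProvider(ABC):
    """The thin layer a cloud has to implement."""

    @abstractmethod
    def create(self, request: NodeRequest) -> Node: ...

    @abstractmethod
    def delete(self, node_id: str) -> None: ...

    @abstractmethod
    def describe(self) -> list[Node]: ...


def reconcile(pending: list[NodeRequest], provider: CloudProvider) -> None:
    """Core loop: provision capacity for unsatisfied requests using only
    the three provider verbs. Real Karpenter also bin-packs pods, picks
    instance types, and consolidates nodes; this only shows the shape of
    the split between business logic and cloud implementation."""
    existing = provider.describe()
    print(f"{len(existing)} nodes up; {len(pending)} unsatisfied requests")
    for request in pending:
        node = provider.create(request)
        print(f"launched {node.instance_type} as {node.node_id}")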

The Cloud Pod
200: Now you can make bad cloud decisions like running EKS on SNOW

The Cloud Pod

Play Episode Listen Later Feb 22, 2023 50:18


EKS on Snow Devices
On this episode of The Cloud Pod, the team highlights the new Graviton3-based images for users of AWS, new ways provided by Google to pay for its cloud services, the new partnership between Azure and the FinOps Foundation, as well as Oracle's new cloud banking and the automation of CCOE. A big thanks to this week's sponsor, Foghorn Consulting, which provides full-stack cloud solutions with a focus on strategy, planning and execution for enterprises seeking to take advantage of the transformative capabilities of AWS, Google Cloud and Azure.

The Cloud Pod
TCP Talks: Applying and Maximizing Observability with Christine Yen

The Cloud Pod

Play Episode Listen Later Feb 14, 2023 26:37


Applying and Maximizing Observability
In this episode, Christine talks about her company, Honeycomb, which runs on AWS, with the goal of promoting observability for clients interested in the performance of their code or those trying to identify problem areas that need to be corrected. Christine Yen is the Co-Founder and CEO of Honeycomb. Before founding Honeycomb, she built analytics products at Parse/Facebook and loved writing software to separate signal from noise. Christine delights in being a developer in a room full of ops folks. Outside of work, Christine is kept busy by her two dogs and wants your sci-fi and fantasy book recommendations.
Notes
Honeycomb is an observability platform that helps customers understand why their code is behaving differently from what they expected. The inspiration behind this software came after Christine's previous company was acquired by Facebook and they realized how software made it very easy to identify problems in large amounts of data within a short time. This encouraged them to build the tool and make it available to all engineers. If the first wave of DevOps was Ops people learning how to automate their work, the second wave would be helping developers learn to operate their code. Honeycomb is designed intentionally to ensure that all types of engineers can make sense of the tool. Honeycomb has always come up with ways for customers to use AWS products and get the data reflected in Honeycomb to be manipulated. Over the last few months, they have made it possible for clients to plug into CloudWatch Logs and CloudWatch metrics, and redirect data directly from AWS products into Honeycomb instead. Clients can also use Honeycomb to extract data based on what their applications are doing. This applies to performance optimization, experimentation, or any situation where a company wants to try code to see how it performs in production. The focus remains on the application layer. Before Honeycomb, no one was using observability in this context. The pricing of Honeycomb is based on the volume of data, which makes it predictable and understandable, unlike pricing that scales with the fidelity of the data, which can be quite expensive. Among the challenges within the observability space is the question of how to help new engineers learn from the seasoned engineers on the team through the paper trails those seasoned engineers leave behind. This is a problem that can only be solved by enabling teams to orient new engineers on their systems without having to create another question as part of the code. Building an AI approach in Honeycomb may not be suitable because of the context involved: training effective machine learning models relies on a vast amount of easily classifiable data, and this does not apply in the world of software; every engineering team's systems are different from every other engineering team's systems. Honeycomb is interested in using AI to build these models in order to help users know what questions to ask. With Honeycomb, usage patterns are much more dependent on the curiosity and proficiency of the engineering team; while some engineers who are used to getting answers directly may just leave the software, those who have a culture of asking questions will benefit more from it.

cloudonaut
#65 [Hot off the Cloud] Year in Review + CloudWatch Metrics Insights + SaaS Free Trial

cloudonaut

Play Episode Listen Later Dec 20, 2022 30:44


Two brothers discussing all things AWS every week. Hosted by Andreas and Michael Wittig, presented by cloudonaut.

cloudonaut
#61 [Hot off the Cloud] re:Invent + Cross-Account CloudWatch + AuthZ Verified Permissions + ELB Resilience

cloudonaut

Play Episode Listen Later Nov 29, 2022 26:04


Two brothers discussing all things AWS every week. Hosted by Andreas and Michael Wittig, presented by cloudonaut.

Screaming in the Cloud
A Cloud Economist is Born - The AlterNAT Origin Story

Screaming in the Cloud

Play Episode Listen Later Nov 9, 2022 34:45


About Ben
Ben Whaley is a staff software engineer at Chime. Ben is co-author of the UNIX and Linux System Administration Handbook, the de facto standard text on Linux administration, and is the author of two educational videos: Linux Web Operations and Linux System Administration. He has been an AWS Community Hero since 2014. Ben has held Red Hat Certified Engineer (RHCE) and Certified Information Systems Security Professional (CISSP) certifications. He earned a B.S. in Computer Science from the University of Colorado, Boulder.
Links Referenced:
Chime Financial: https://www.chime.com/
alternat.cloud: https://alternat.cloud
Twitter: https://twitter.com/iamthewhaley
LinkedIn: https://www.linkedin.com/in/benwhaley/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH. Basically, you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive? Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn and this is an episode unlike any other that has yet been released on this august podcast. Let's begin by introducing my first-time guest somehow because apparently an invitation got lost in the mail somewhere. Ben Whaley is a staff software engineer at Chime Financial and has been an AWS Community Hero since Andy Jassy was basically in diapers, to my level of understanding. Ben, welcome to the show.

Ben: Corey, so good to be here. Thanks for having me on.

Corey: I'm embarrassed that you haven't been on the show before. You're one of those people that slipped through the cracks and somehow I was very bad at following up slash hounding you into finally agreeing to be here. But you certainly waited until you had something auspicious to talk about.

Ben: Well, you know, I'm the one that really should be embarrassed here. You did extend the invitation and I guess I just didn't feel like I had something to drop. But I think today we have something that will interest most of the listeners without a doubt.

Corey: So, folks who have listened to this podcast before, or read my newsletter, or follow me on Twitter, or have shared an elevator with me, or at any point have passed me on the street, have heard me complain about the Managed NAT Gateway and its egregious data processing fee of four-and-a-half cents per gigabyte. And I have complained about this for small customers because they're in the free tier; why is this thing charging them 32 bucks a month?
And I have complained about this on behalf of large customers who are paying the GDP of the nation of Belize in data processing fees as they wind up shoving very large workloads to and fro, which is, I think, part of the prerequisite requirements for having a data warehouse. And you are no different than the rest of these people who have those challenges, with the singular exception that you have done something about it, and what you have done is so, in retrospect, blindingly obvious that I am embarrassed the rest of us never thought of it.

Ben: It's interesting because when you are doing engineering, it's often the simplest solution that is the best. I've seen this repeatedly. And it's a little surprising that it didn't come up before, but I think it's, in some way, just a matter of timing. But what we came up with—and is this the right time to get into it, do you want to just kind of name the solution here?

Corey: Oh, by all means. I'm not going to steal your thunder. Please, tell us what you have wrought.

Ben: We're calling it AlterNAT and it's an alternative high-availability NAT solution. As everybody knows, NAT Gateway is sort of the default choice; it certainly is what AWS pushes everybody towards. But there is, in fact, a legacy solution: NAT instances. These were around long before NAT Gateway made an appearance. And like I said, they're considered legacy, but with the help of lots of modern AWS innovations and technologies like Lambdas and auto-scaling groups with max instance lifetimes and the latest generation of networking-enhanced instances, it turns out that we can maybe not quite get as effective as a NAT Gateway, but we can save a lot of money and skip those data processing charges entirely by having a NAT instance solution with a failover NAT Gateway, which I think is kind of the key point behind the solution. So, are you interested in diving into the technical details?

Corey: That is very much the missing piece right there. You're right. What we used to use was NAT instances. That was the thing that we used because we didn't really have another option. And they had an interface in the public subnet where they lived and an interface hanging out in the private subnet, and they had to be configured to wind up passing traffic to and fro. Well, okay, that's great and all, but isn't that kind of brittle and dangerous? I basically have a single instance as a single point of failure, and those were the early days when individual instances did not have the level of availability and durability they do now. Yeah, it's kind of awful, but here you go. I mean, the most galling part of the Managed NAT Gateway service is not that it's expensive; it's that it's expensive, but also incredibly good at what it does. You don't have to think about this whole problem anymore, and as of recently, it also supports IPv4 to IPv6 translation as well. It's not that the service is bad. It's that the service is stonkingly expensive, particularly at scale. And everything that we've seen before is either, oh, run your own NAT instances, or bend your knee and pay your money. And a number of folks have come up with different options where this is ridiculous. Just go ahead and run your own NAT instances. Yeah, but what happens when I have to take it down for maintenance or replace it? It's like, well, I guess you're not going to the internet today.
This has the, in hindsight, obvious solution of, well, we just—we run the Managed NAT Gateway, because the 32 bucks a month in instance-hour charges don't actually matter at any point of scale when you're doing this, but you wind up using the NAT instance for day-in, day-out traffic, and the failover mode is simply that you'll use the expensive Managed NAT Gateway until the instance is healthy again and then automatically change the route table back and forth.

Ben: Yep. That's exactly it. So, the auto-scaling NAT instance solution has been around for a long time, well before even NAT Gateway was released. You could have NAT instances in an auto-scaling group where the size of the group was one, and if the NAT instance failed, it would just replace itself. But this left a period in which you'd have no internet connectivity, you know, when the NAT instance was swapped out. So, the solution here is that when auto-scaling terminates an instance, it fails over the route table to a standby NAT Gateway, rerouting the traffic. So, there's never a point at which there's no internet connectivity, right? The NAT instance is running, processing traffic, gets terminated after a certain period of time, configurable: 14 days, 30 days, whatever makes sense for your security strategy; could be never, right? You could choose that you want to have your own maintenance window in which to do it.

Corey: And let's face it, this thing is more or less sitting there as a network traffic router, for lack of a better term. There is no need to ever log into the thing and make changes to it until and unless there's a vulnerability that you can exploit via somehow just talking to the TCP stack when nothing's actually listening on the host.

Ben: You know, you can run your own AMI that has been pared down to almost nothing, and that instance doesn't do much. It's using just a Linux kernel to sit on two networks and pass traffic back and forth. It has a translation table that kind of keeps track of the state of connections, and so you don't need to have any service running. To manage the system, we have SSM so you can use Session Manager to log in, but frankly, you can just disable that. You almost never even need to get a shell. And that is, in fact, an option we have in the solution: to disable SSM entirely.

Corey: One of the things I love about this approach is that it is turnkey. You throw this thing in there and it's good to go. And in the event that the instance becomes unhealthy, great, it fails traffic over to the Managed NAT Gateway while it terminates the old node and replaces it with a healthy one and then fails traffic back. Now, I do need to ask, what is the story of network connections during that failover and failback scenario?

Ben: Right, that's the primary drawback, I would say, of the solution: any established TCP connections that are on the NAT instance at the time of a route change will be lost. So, say you have—

Corey: TCP now terminates on the floor.

Ben: Pretty much. The connections are dropped. If you have an open SSH connection from a host in the private network to a host on the internet and the instance fails over to the NAT Gateway, the NAT Gateway doesn't have the translation table that the NAT instance had. And not to mention, the public IP address also changes, because you have an Elastic IP assigned to the NAT instance and a different Elastic IP assigned to the NAT Gateway, and so because that upstream IP is different, the remote host is, like, tracking the wrong IP.
So, those connections, they're going to be lost. So, there are some use cases where this may not be suitable. We do have some ideas on how you might mitigate that, for example, with the use of a maintenance window to schedule the replacement, or replacing less often so it doesn't have to affect your workflow as much, but frankly, for many use cases, my belief is that it's actually fine. In our use case at Chime, we found that it's completely fine and we didn't actually experience any errors or failures. But there might be some use cases that are more sensitive or less resilient to failure in the first place.

Corey: I would also point out that a lot of how software is going to behave is going to be a reflection of the era in which it was moved to cloud. Back in the early days of EC2, you had no real sense of reliability around any individual instance, so everything was written in a very defensive manner. These days, with instances automatically being able to flow among different hardware so we don't get instance interrupt notifications the way we once did on a semi-constant basis, it more or less presents as bulletproof, so a lot of people are writing software that's a bit more brittle. But it's always been a best practice that when a connection fails, okay, what happens at failure? Do you just give up and throw your hands in the air and shriek for help or do you attempt to retry a few times, ideally backing off exponentially? In this scenario, those retries will work. So, it's a question of how well have you built your software. Okay, let's say that you made the worst decisions imaginable, and okay, if that connection dies, the entire workload dies. Okay, you have the option to refactor it to be a little bit better behaved, or alternately, you can keep paying the Managed NAT Gateway tax of four-and-a-half cents per gigabyte in perpetuity forever. I'm not going to tell you what decision to make, but I know which one I'm making.

Ben: Yeah, exactly. The cost savings potential of it far outweighs the potential maintenance troubles, I guess, that you could encounter. But the fact is, if you're relying on Managed NAT Gateway and paying the price for doing so, it's not as if there's no chance for connection failure. NAT Gateway could also fail. I will admit that I think it's an extremely robust and resilient solution. I've been really impressed with it, especially so after having worked on this project, but it doesn't mean it can't fail. And beyond that, upstream of the NAT Gateway, something could in fact go wrong. Like, internet connections are unreliable, kind of by design. So, if your system is not resilient to connection failures, like, there's a problem to solve there anyway; you're kind of relying on hope. So, it's kind of a forcing function in some ways to build architectural best practices, in my view.

Corey: I can't stress enough that I have zero problem with the capabilities and the stability of the Managed NAT Gateway solution. My complaints about it start and stop entirely with the price.
Back when you first showed me the blog post that is releasing at the same time as this podcast—and you can visit that at alternat.cloud—you sent me an early draft of this, and what I loved the most was that your math was off because of an incomplete understanding of the gloriousness that is just how egregious the NAT Gateway charges are. Your initial analysis said, “All right, if you're throwing half a terabyte out to the internet, this has the potential of cutting the bill by”—I think it was $10,000 or something like that. It's, “Oh no, no. It has the potential to cut the bill by an entire twenty-two-and-a-half thousand dollars.” Because this processing fee does not replace any egress fees whatsoever. It's purely additive. If you forget to have a free S3 Gateway endpoint in a private subnet, every time you put something into or take something out of S3, you're paying four-and-a-half cents per gigabyte on that, despite the fact that there's no internet transit work and it's not crossing availability zones. It is simply a four-and-a-half cent fee to retrieve something that has only cost you—at most—2.3 cents per month to store in the first place. Flip that switch, that becomes completely free.

Ben: Yeah. I'm not embarrassed at all to talk about the lack of education I had around this topic. The fact is I'm an engineer primarily and I came across the cost stuff because it kind of seemed like a problem that needed to be solved within my organization. And if you don't mind, I might just linger on this point and kind of think back a few months. I looked at the AWS bill and I saw this egregious ‘EC2 Other' category. It was taking up the majority of our bill. Like, the single biggest line item was EC2 Other. And I was like, “What could this be?”

Corey: I want to wind up flagging that just because that bears repeating, because I often get people pushing back of, “Well, how bad—it's one Managed NAT Gateway. How much could it possibly cost? $10?” No, it is the majority of your monthly bill. I cannot stress that enough. And that's not because the people who work there are doing anything that they should not be doing or didn't understand all the nuances of this. It's because for the security posture that is required for what you do—you are at Chime Financial, let's be clear here—putting everything in public subnets was not really a possibility for you folks.

Ben: Yeah. And not only that, but there are plenty of services that have to be on private subnets. For example, AWS Glue services must run in private VPC subnets if you want them to be able to talk to other systems in your VPC; like, they cannot live in a public subnet. So essentially, if you want to talk to the internet from those jobs, you're forced into some kind of NAT solution. So, I dug into the EC2 Other category and I started trying to figure out what was going on there.

There's no way—natively—to look at what traffic is transiting the NAT Gateway. There's not an interface that shows you what's going on or what the biggest talkers over that network are. Instead, you have to have flow logs enabled and have to parse those flow logs. So, I dug into that.

Corey: Well, you're missing a step first, because in a lot of environments, people have more than one of these things, so you first get to do the scavenger hunt of, okay, I have a whole bunch of Managed NAT Gateways, and I need to go diving into CloudWatch metrics and figure out which are the heavy talkers.
It's usually one or two followed by a whole bunch of small stuff, but not always, so figuring out which VPC you're even talking about is a necessary prerequisite.

Ben: Yeah, exactly. The data around it is almost missing entirely. Once you come to the conclusion that it is a particular NAT Gateway—like, that's a set of problems to solve on its own—but first, you have to go to the flow logs, you have to figure out what are the biggest upstream IPs that it's talking to. Once you have the IP, it still isn't apparent what that host is. In our case, we had all sorts of outside parties that we were talking to a lot, and it's a matter of sorting by volume and figuring out, well, this IP, what is the reverse IP? Who is potentially the host there? I actually had some wrong answers at first. I set up VPC endpoints to S3 and DynamoDB and SQS because those were some top talkers, and that was a nice way to gain some security and some resilience and save some money. And then I found, well, Datadog; that's another top talker for us, so I ended up creating a nice private link to Datadog, which they offer for free, by the way, which is more than I can say for some other vendors. But then I found some outside parties where there wasn't a nice private link solution available to us, and yet it was by far the largest volume. So, that's what kind of started me down this track: analyzing the NAT Gateway myself by looking at VPC flow logs. Like, it's shocking that there isn't a better way to find that traffic.

Corey: It's worse than that, because VPC flow logs tell you where the traffic is going and in what volumes, sure, on an IP address and port basis, but okay, now you have a Kubernetes cluster that spans two availability zones. Okay, great. What is actually passing through that? So, you have one big application that just seems awfully chatty, you have multiple workloads running on the thing. What's the expensive thing talking back and forth? The only way that you can reliably get the answer to that, I found, is to talk to people about what those workloads are actually doing, and failing that, you're going code spelunking.

Ben: Yep. You're exactly right about that. In our case, it ended up being apparent because we have a set of subnets where only one particular project runs. And when I saw the source IP, I could immediately figure that part out. But if it's a K8s cluster in the private subnets, yeah, how are you going to find it out? You're going to have to ask everybody that has workloads running there.

Corey: And we're talking about, in some cases, millions of dollars a month. Yeah, it starts to feel a little bit predatory as far as how it's priced and the amount of work you have to put in to track this stuff down. I've done this a handful of times myself, and it's always painful unless you discover something pretty early on, like, oh, it's talking to S3, because that's pretty obvious when you see that. It's, yeah, flip switch and this entire engagement just paid for itself a hundred times over. Now, let's see what else we can discover. That is always one of those fun moments because, first, customers are super grateful to learn that, oh, my God, I flipped that switch and I'm saving a whole bunch of money. Because it starts with gratitude. “Thank you so much. This is great.” And it doesn't take a whole lot of time for that to alchemize into anger of, “Wait.
You mean, I've been being ridden like a pony for this long and no one bothered to mention that if I click a button, this whole thing just goes away?” And when you mention this to your AWS account team, like, they're solicitous, but they either have to present as, “I didn't know that existed either,” which is not a good look, or, “Yeah, you caught us,” which is worse. There's no positive story on this. It just feels like a tax on not knowing trivia about AWS. I think that's what really winds me up about it so much.

Ben: Yeah, I think you're right on about that as well. My misunderstanding about the NAT pricing was that data processing is additive to data transfer. I expected when I replaced NAT Gateway with NAT instance that I would be substituting data transfer costs for NAT Gateway costs, NAT Gateway data processing costs. But in fact, NAT Gateway incurs both data processing and data transfer. NAT instances only incur data transfer costs. And so, this is a big difference between the two solutions. Not only that, but if you're in the same region, if you're egressing out of your, say, us-east-1 region and talking to another hosted service also within us-east-1—never leaving the AWS network—you don't actually even incur data transfer costs. So, if you're using a NAT Gateway, you're paying data processing.

Corey: To be clear, you do, but it is—cross-AZ in most cases—billed at one penny egressing, and on the other side, that hosted service generally pays one penny ingressing as well. Don't feel bad about that one. That was extraordinarily unclear, and the only reason I know the answer to that is that I got tired of getting stonewalled by people that later turned out didn't know the answer, so I ran a series of experiments designed explicitly to find this out.

Ben: Right. As opposed to the five cents to nine cents that is data transfer to the internet. Add that to data processing on a NAT Gateway and you're paying between nine-and-a-half cents and thirteen-and-a-half cents for every gigabyte egressed. And this is a phenomenal cost. And at any kind of volume, if you're doing terabytes to petabytes, this becomes a significant portion of your bill. And this is why people hate the NAT Gateway so much.

Corey: I am going to short-circuit an angry comment I can already see coming on this, where people are going to say, “Well, yes. But at multi-petabyte scale, nobody's paying on-demand retail price.” And they're right. Most people who are transmitting that kind of data have a specific discount rate applied to what they're doing that varies depending upon usage and use case. Sure, great. But I'm more concerned with the people who are sitting around dreaming up ideas for a company where I want to wind up doing some sort of streaming service. I talked to one of those companies very early on in my tenure as a consultant around the billing piece, and they wanted me to check their napkin math because they thought that at their numbers, when they wound up scaling up, if their projections were right, they were going to be spending $65,000 a minute, and what did they not understand? And the answer was, well, you didn't understand this other thing, so it's going to be more than that, but no, you're directionally correct. So, that idea that started off on a napkin, of course, they didn't build it on top of AWS; they went elsewhere. And last time I checked, they'd raised well over a quarter-billion dollars in funding.
So, that's a business that AWS would love to have on a variety of different levels, but they're never going to even be considered, because by the time someone is at scale, they either have built this somewhere else or they went broke trying.

Ben: Yep, absolutely. And we might just make the point there that while you can get discounts on data transfer, you really can't—or it's very rare—to get discounts on data processing for the NAT Gateway. So, any kind of savings you can get on data transfer would apply to a NAT instance solution, saving you four-and-a-half cents per gigabyte inbound and outbound over the NAT Gateway equivalent solution. So, you're paying a lot for the benefit of a fully-managed service there. A very robust, nicely engineered, fully-managed service, as we've already acknowledged, but an extremely expensive solution for what it is, which is really just a proxy in the end. It doesn't add any value to you.

Corey: The only way to make that more expensive would be to route it through something like Splunk or whatnot. And Splunk does an awful lot for what they charge per gigabyte, but it just feels like it's rent-seeking in some of the worst ways possible. And what I love about this is that you've solved the problem in a way that is open-source, and you have already released it in Terraform code. I think one of the first to-dos on this for someone is going to be, okay, now also make it CloudFormation and also make it CDK so you can drop it in however you want. And anyone can use this. I think the biggest mistake people might make in glancing at this is, well, I'm looking at the hourly charge for the NAT Gateways and that's 32-and-a-half bucks a month, and the instances that you recommend are hundreds of dollars a month for the big network-optimized stuff. Yeah, if you care about the hourly rate of either of those two things, this is not for you. That is not the problem that it solves. If you're an independent learner annoyed about the $30 charge you got for a Managed NAT Gateway, don't do this. This will only add to your billing concerns. Where it really shines is once you're at, I would say, probably about ten terabytes a month, give or take, in Managed NAT Gateway data processing; that's where it starts to make sense to consider this. The breakeven is around six or so, but there is value to not having to think about things. Once you get to that level of spend, though, it's worth devoting a little bit of infrastructure time to something like this.

Ben: Yeah, that's effectively correct. The total cost of running the solution, like, all-in: there's eight Elastic IPs, four NAT Gateways—say you're in four zones; could be less if you're in fewer zones—like, n NAT Gateways, n NAT instances, depending on how many zones you're in, and I think that's about it. And I said right in the documentation, if any of those baseline fees are a material number for your use case, then this is probably not the right solution. Because we're talking about saving thousands of dollars. Any of these small numbers for NAT Gateway hourly costs, NAT instance hourly costs—that shouldn't be a factor, basically.

Corey: Yeah, it's like when I used to worry about costing my customers a few tens of dollars in Cost Explorer or CloudWatch or request fees against S3 for their Cost and Usage Reports. It's, yeah, that does actually have a cost, there's no real way around it, but look at the savings they're realizing by going through that.
Yeah, they're not going to come back complaining about their five-figure consulting engagement costing an additional $25 in AWS charges after lowering their bill by a third. So, there's definitely a difference as far as how those things tend to be perceived. But it's easy to miss the big stuff when chasing after the little stuff like that. This is part of the problem I have with an awful lot of cost tooling out there. They completely ignore cost components like this and focus only on the things that are easy to query via API, of, oh, we're going to cost-optimize your Kubernetes cluster, where they think about compute and RAM. And, okay, that's great, but you're completely ignoring all the data transfer because there's still no great way to get at that programmatically. And it really is missing the forest for the trees.

Ben: I think this is key to any cost reduction project or program that you're undertaking. When you look at a bill, look for the biggest spend items first and work your way down from there, just because of the impact you can have. And that's exactly what I did in this project. I saw that ‘EC2 Other' slash NAT Gateway was the big item and I started brainstorming ways that we could go about addressing that. And now that we've reduced this cost to effectively… nothing, extremely low compared to what it was, we have other new line items on our bill that we can start optimizing, and I have my next targets in mind. But in any cost project, start with the big things.

Corey: You have come a long way around to answer a question I get asked a lot, which is, “How do I become a cloud economist?” And my answer is, you don't. It's something that happens to you. And it appears to be happening to you, too. My favorite part about the solution that you built, incidentally, is that it is being released under the auspices of your employer, Chime Financial, which is immune to being acquired by Amazon just to kill this thing and shut it up. Because Amazon already has something shitty called Chime. They don't need to wind up launching something else or acquiring something else and ruining it because they have a Slack competitor of sorts called Amazon Chime. There's no way they could acquire you; [unintelligible 00:27:45] going to get lost in the hallways.

Ben: Well, I have confidence that Chime will be a good steward of the project. Chime's goal and mission as a company is to help everyone achieve financial peace of mind, and we take that really seriously. We even apply it to ourselves, and that was kind of the impetus behind developing this in the first place. You mentioned earlier we have Terraform support already, and you're exactly right. I'd love to have CDK, CloudFormation, and Pulumi support, and other kinds of contributions are more than welcome from the community. So, if anybody feels like participating, if they see a feature that's missing, let's make this project the best that it can be. I suspect we can save many companies hundreds of thousands or millions of dollars. And this really feels like the right direction to go in.

Corey: This is easily a multi-billion dollar savings opportunity, globally.

Ben: That's huge. I would be flabbergasted if that was the outcome of this.

Corey: The hardest part is reaching these people and getting them on board with the idea of handling this. And again, I think there's a lot of opportunity for the project to evolve in the sense of different settings depending upon risk tolerance.
I can easily see a scenario where, in the event of a disruption to the NAT instance, it fails over to the Managed NAT Gateway, but failback becomes manual so you don't have a flapping route table back and forth or a [hold 00:29:05] downtime or something like that. Because again, in that scenario, the failure mode is just, well, you're paying four-and-a-half cents per gigabyte for a while until you wind up figuring out what's going on, as opposed to the failure mode of disrupting connections on an ongoing basis, and for some workloads, that's not tenable. This is absolutely, for the common case, the right path forward.

Ben: Absolutely. I think it's an enterprise-grade solution, and the more knobs and dials that we add to tweak to make it more robust or adaptable to different kinds of use cases, the better. The best outcome here would actually be that the entire solution becomes irrelevant because AWS fixes the NAT Gateway pricing. If that happens, I will consider the project a great success.

Corey: I will be doing backflips like you wouldn't believe. I would sing their praises day in, day out. I'm not saying reduce it to nothing, even. I'm not saying it adds no value. I would change the way that it's priced because, honestly, the fact that I can run an EC2 instance and be charged $0 on a per-gigabyte basis—yeah, I would pay a premium on an hourly charge based upon traffic volumes, but don't meter per gigabyte. That's where it breaks down.

Ben: Absolutely. And why is it additive to data transfer, also? Like, I remember first starting to use VPC when it was launched and reading about the NAT instance requirement and thinking, “Wait a minute. I have to pay this extra management and hourly fee just so my private hosts can reach the internet? That seems kind of janky.” And Amazon established a norm here, because Azure and GCP both have their own equivalent of this now. This is a business choice. This is not a technical choice. They could just run this under the hood and not charge anybody for it or build in the cost, and it wouldn't be this thing we have to think about.

Corey: I almost hate to say it, but Oracle Cloud does, for free.

Ben: Do they?

Corey: It can be done. This is a business decision. It is not a technical capability issue where, well, it does incur cost to run these things. I understand that, and I'm not asking for things for free. I very rarely say that this is overpriced when I'm talking about AWS billing issues. I'm talking about it being unpredictable, I'm talking about it being impossible to see in advance, but the fact that it costs too much money is rarely my complaint. In this case, it costs too much money. Make it cost less.

Ben: If I'm not mistaken, GCP's equivalent solution is the exact same price. It's also four-and-a-half cents per gigabyte. So, that shows you that there's business games being played here. Like, Amazon could get ahead and do right by the customer by dropping this to a much more reasonable price.

Corey: I really want to thank you, both for taking the time to speak with me and for building this glorious, glorious thing. Where can we find it? And where can we find you?

Ben: alternat.cloud is going to be the place to visit. It's on Chime's GitHub, which will be released by the time this podcast comes out. As for me, if you want to connect, I'm on Twitter; @iamthewhaley is my handle. And of course, I'm on LinkedIn.

Corey: Links to all of that will be in the podcast notes. Ben, thank you so much for your time and your hard work.

Ben: This was fun.
Thanks, Corey.

Corey: Ben Whaley, staff software engineer at Chime Financial, and AWS Community Hero. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry rant of a comment that I will charge you not only four-and-a-half cents per word to read, but four-and-a-half cents to reply, because I am experimenting myself with being a rent-seeking schmuck.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
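The failover mechanism described in this episode (repoint the private route table's default route at a standby NAT Gateway when the NAT instance is terminated, then point it back at the replacement instance's network interface) can be sketched with boto3. This is a simplified sketch, not AlterNAT's actual implementation, which lives at alternat.cloud; the resource IDs below are placeholders, and in a real deployment these handlers would be Lambda functions driven by the auto-scaling group's lifecycle hooks.

# Minimal sketch of the route-table failover discussed above. Not the
# actual AlterNAT code (see alternat.cloud); the IDs are placeholders,
# and in practice these functions would run as Lambdas triggered by the
# NAT instance auto-scaling group's lifecycle hooks.
import boto3

ec2 = boto3.client("ec2")

PRIVATE_ROUTE_TABLE_ID = "rtb-0123456789abcdef0"   # placeholder
STANDBY_NAT_GATEWAY_ID = "nat-0123456789abcdef0"   # placeholder


def fail_over_to_nat_gateway() -> None:
    """On NAT instance termination: repoint the default route at the
    standby NAT Gateway so connectivity never fully drops (established
    TCP connections are still lost, as discussed in the episode)."""
    ec2.replace_route(
        RouteTableId=PRIVATE_ROUTE_TABLE_ID,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=STANDBY_NAT_GATEWAY_ID,
    )


def fail_back_to_nat_instance(instance_id: str) -> None:
    """Once a replacement NAT instance is healthy: route through its ENI
    again, so traffic stops incurring NAT Gateway data processing fees."""
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["NetworkInterfaces"]
    ec2.replace_route(
        RouteTableId=PRIVATE_ROUTE_TABLE_ID,
        DestinationCidrBlock="0.0.0.0/0",
        NetworkInterfaceId=enis[0]["NetworkInterfaceId"],
    )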

Screaming in the Cloud
The Man Behind the Curtain at Zoph with Victor Grenu

Screaming in the Cloud

Play Episode Listen Later Oct 25, 2022 28:28


About Victor
Victor is an Independent Senior Cloud Infrastructure Architect working mainly on Amazon Web Services (AWS), designing secure, scalable, reliable, and cost-effective cloud architectures and dealing with large-scale and mission-critical distributed systems. He also has long experience in Cloud Operations, Security Advisory, Security Hardening (DevSecOps), Modern Application Design, Microservices and Serverless, Infrastructure Refactoring, and Cost Saving (FinOps).
Links Referenced:
Zoph: https://zoph.io/
unusd.cloud: https://unusd.cloud
Twitter: https://twitter.com/zoph
LinkedIn: https://www.linkedin.com/in/grenuv/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500-plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third-party services in a single pane of glass. Combine these with drag-and-drop dashboards and machine-learning-based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14-day trial and get a complimentary T-shirt when you install the agent. To learn more, visit datadoghq.com/screaminginthecloud. That's www.datadoghq.com/screaminginthecloud.

Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's GO M-O-M-E-N-T-O dot co slash screaming.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. One of the best parts about running a podcast like this and trolling the internet of AWS things is every once in a while, I get to learn something radically different than what I expected. For a long time, there's been this sort of persona or brand in the AWS space, specifically the security side of it, going by Zoph—that's Z-O-P-H—and I just assumed it was a collective or a whole bunch of people working on things, and it turns out that nope, it is just one person. And that one person is my guest today. Victor Grenu is an independent AWS architect. Victor, thank you for joining me.

Victor: Hey, Corey, thank you for having me. It's a pleasure to be here.

Corey: So, I want to start by diving into the thing that first really put you on my radar, though I didn't realize it was you at the time. You have what can only be described as an army of Twitter bots around the AWS ecosystem. And I don't even know that I'm necessarily following all of them, but what are these bots and what do they do?

Victor: Yeah.
I have a few bots on Twitter where I push some notifications, some tweets, when things happen in the AWS security space, especially when the AWS managed policies are updated by AWS. And it comes from an initial project from Scott Piper. He was running a Git command on his own laptop to push the history of AWS managed policies, and I thought that I could automate this thing using a deployment pipeline and so on, and tweet every time a new change is detected from AWS. So, the idea is to monitor every change on these policies.

Corey: It's kind of wild because I built a number of somewhat similar Twitter bots, only instead of trying to make them into something useful, I'd make them into something more than a little bit horrifying and extraordinarily obnoxious. Like there's a Cloud Boomer Twitter account that winds up tweeting every time Azure tweets something, only it quote-tweets them in all caps and says something insulting. I have an AWS releases bot called AWS Cwoud—so that's C-W-O-U-D—and that winds up converting it to OwO speak. It's like, “Yay a new auto-scawowing growp.” That sort of thing is obnoxious and offensive, but it makes me laugh. Yours, on the other hand, are things that I have notifications turned on for just because when they announce something, it's generally fairly important. The first one that I discovered was your IAM changes bot. And I found some terrifying things coming out of that from time to time. What's the data source for that? Because I'm just grabbing other people's Twitter feeds or RSS feeds; you're clearly going deeper than that.

Victor: Yeah, the data source is the official AWS managed policies. In fact, I run the AWS CLI in the background and I'm doing just a list-policies, the list-policies command, and with this list I'm doing a git commit of each policy that is returned, so I can enter it in a git repository to get the full history over time. And I also craft a list of deprecated policies, and I also run, like, a dog-food initiative, the policy validation analysis from AWS tools, to validate the consistency and the accuracy of their own policies. So, there is a policy validation with their own tool. [laugh].

Corey: You would think that wouldn't turn up anything because their policy validator effectively acts as a linter, so if it throws an error, of course, you wouldn't wind up pushing that. And yet, somehow the fact that you have bothered to hook that up and have findings from it indicates that that's not how the real world works.

Victor: Yeah, there are some, let's say, some false positives, because we are running the policy validation with their own linter on their own policies, but this is something that is documented by AWS. So, there is an official page where you can find why the linter is not working on each policy and why. There is an explanation for each finding. I'm thinking of the [unintelligible 00:05:05] managed policy, which is too long, and the policy analyzer is crashing because the policy is too long.

Corey: Excellent. It's odd to me that you have gone down this path because it's easy enough to look at this and assume that, oh, this must just be something you do for fun or as an aspect of your day job. So, I did a little digging into what your day job is, and this rings very familiar to me: you are an independent AWS consultant, only you're based out of Paris, whereas I was doing this from San Francisco, due to an escalatingly poor series of life choices on my part. What do you focus on in the AWS consulting world?

Victor: Yeah.
I'm running an AWS consulting boutique in Paris and I'm working for large customers in France. And I'm doing mostly infrastructure stuff, infrastructure design for cloud-native applications, and I'm also doing some security audits and [unintelligible 00:06:07] mediation for my customers.

Corey: It seems to me that there's a definite divide as far as how people find the AWS consulting experience to be. And I'm not trying to cast judgment here, but the stories that I hear tend to fall into one of two categories. One of them is the story that you have, where you're doing this independently, you've been on your own for a while working specifically on this, and then there's the stories of, “Oh, yeah, I work for a 500-person consultancy and we do everything as long as they'll pay us money. If they've got money, we'll do it. Why not?” And it always seems to me—not to be overly judgy—but the independent consultants just seem happier about it because, for better or worse, we get to choose what we focus on in a way that I don't think you do at a larger company.

Victor: Yeah. It's the same in France or in Europe; there are a lot of consulting firms. But with the pandemic and with the market where we are working, in the cloud, in cloud-native solutions and so on, there is a lot of demand. And the natural path is to start by working for a consulting firm, and then when you are ready, when you have many AWS certifications, when you have the experience of the customer, when you have a network of well-known customers, and you gain trust from your customers, I think it's natural to go by yourself, to be independent and to choose your own projects and your own customers.

Corey: I'm curious to get your take on what your perception of being an AWS consultant is when you're based in Paris versus, in my case, being based on the West Coast of the United States. And I know that's a bit of a strange question, but even when I travel, for example, over to the East Coast, suddenly, my own newsletter sends out three hours later in the day than I expect it to and that throws me for a loop. The AWS announcements don't come out at two or three in the afternoon; they come out at dinnertime. And for you, it must be in the middle of the night when a lot of those things wind up dropping. The AWS stuff, not my newsletter. I imagine you're not excitedly waiting on tenterhooks to see what this week's issue of Last Week in AWS talks about like I am. But I'm curious, even beyond that, how do you experience the market, from what you're perceiving people in the United States talking about as AWS consultants versus what you see in Paris?

Victor: It's difficult, but in fact, I don't have so much information about the independents in the US. I know that there are a lot, but I think it's more common in Europe. And yeah, it's an advantage to have a ten-hour time [unintelligible 00:08:56] from the US, because a lot of stuff happens on Pacific time, on the Seattle timezone, on the San Francisco timezone. So, for example, for this podcast, my Monday is over right now, so yeah, I have some advantage in time, but yeah.

Corey: This is potentially an odd question for you. But I find an awful lot of the AWS documentation to be challenging, we'll call it. I don't always understand exactly what it's trying to tell me, and it's not at all clear that the person writing the documentation about a service in some cases has ever used the service. And in everything I just said, there is no language barrier.
This documentation was written—theoretically—in English and I, most days, can stumble through a sentence in English and almost no other language. You obviously speak French as a first language. Given that you live in Paris, it seems to be a relatively common affliction. How do you find interacting with AWS in French goes? Or is it just a complete nonstarter, and it all has to happen in English for you?

Victor: No, in fact, the consultants in Europe, I think—in fact, for my part, I'm using my laptop in English, I'm using my phone in English, I'm using the AWS console in English, and so on. So, the documentation for me is a switch to English first, because for the other languages, there is sometimes some automated translation that is very dangerous, so we all keep the documentation and the materials in English.

Corey: It's wild to me just looking at how challenging so much of the stuff is. Having to then work in a second language on top of that, it just seems almost insurmountable to me. It's good they have automated translation for a lot of this stuff, but that falls down in often hilariously disastrous ways, sometimes. It's wild to me that even taking most programming languages that folks have ever heard of, even if you program and speak no English, which happens in a large part of the world, you're still using if statements even if the term ‘if' doesn't mean anything to you localized in your language. It really is, in many respects, an English-centric industry.

Victor: Yeah. Completely. Even in French, for our large French customers, I'm writing the PowerPoint presentations in English; some emails are in English, even if all the folks in the thread are French. So yeah.

Corey: One other area that I wanted to explore with you a bit is that you are very clearly focused on security as a primary area of interest. Does that manifest in the work that you do as well? Do you find that your consulting engagements tend to have a high degree of focus on security?

Victor: Yeah. In my designs, when I'm doing some AWS architecture, my main objective is to design security architectures and security patterns that apply best practices and least privilege. But often, I'm working on engagements for security audits, for startups, for internal customers, for diverse companies, and then doing some remediation afterward. And to run my audits, I'm using some open-source tooling, some custom scripts, and so on. I have a methodology that I'm running for each customer. And the goal is sometimes to prepare for some certification, PCI DSS or so on, or maybe to ensure that the best practices are correctly applied on a workload before go-live, yeah.

Corey: One of the weird things about this to me is that I've said for a long time that cost and security tend to be inextricably linked, as far as being a sort of trailing reactive afterthought for an awful lot of companies. They care about both of those things right after they failed to adequately care about those things. At least in the cloud economic space, it's only money, as opposed to, “Oops, we accidentally lost our customers' data.” So, I find myself drifting in a security direction if I don't stop myself, just based upon a lot of the cost work I do. Conversely, it seems that you have come from the security side and you find yourself drifting in a costing direction. Your side project is a SaaS offering called unusd.cloud, that's U-N-U-S-D dot cloud. And when you first mentioned this to me, my immediate reaction was, “Oh, great.
Another SaaS platform for costing. Let's tear this one apart, too.” Except I actually like what you're building. Tell me about it.Victor: Yeah, unusd.cloud is a side project for me and I was working on it since, let's say, one year. It was a project that I've deployed for some of my customer on their local account, and it was very useful. And so, I was thinking that it could be a SaaS project. So, I've worked at [unintelligible 00:14:21] so yeah, a few months on shifting the product to a SaaS [unintelligible 00:14:27].The product aim to detect the waste on AWS accounts on all AWS regions, and it scan all your AWS accounts and all your regions, and it try to detect unused EC2, RDS, Glue [unintelligible 00:14:45], SageMaker, and so on, and unattached EBS and so on. I don't craft a new dashboard, a new Cost Explorer, and so on. It's just cost awareness, it's just a notification on email or Slack or Microsoft Teams. And you just add your AWS account on the project and you schedule it, let's say, once a day, and it scan, and it send you a cost awareness, a [unintelligible 00:15:17] detection, and you can act by turning off what is not used.Corey: What I like about this is it cuts at the number one rule of cloud economics, which is turn that shit off if you're not using it. You wouldn't think that I would need to say that except that everyone seems to be missing that, on some level. And it's easy to do. When you need to spin something up and it's not there, you're very highly incentivized to spin that thing up. When you're not using it, you have to remember that thing exists, otherwise it just sort of sits there forever and doesn't do anything.It just costs money and doesn't generate any value in return for that. What you got right is you've also eviscerated my most common complaint about tools that claim to do this, which is you build in either an explicit rule of ignore this resource or ignore resources with the following tags. The benefit there is that you're not constantly giving me useless advice, like, “Oh, yeah, turn off this idle thing.” It's, yeah, that's there for a reason, maybe it's my dev box, maybe it's my backup site, maybe it's the entire DR environment that I'm going to need at little notice. It solves for that problem beautifully. And though a lot of tools out there claim to do stuff like this, most of them really fail to deliver on that promise.Victor: Yeah, I just want to keep it simple. I don't want to add an additional console and so on. And you are correct. You can apply a simple tag on your asset, let's say an EC2 instance: you apply the tag in use and a value, and then the alerting is disabled for this asset. And the detection is based on the CPU [unintelligible 00:17:01] and the network health metrics, so when the instance is not used in the last seven days, with a low CPU every [unintelligible 00:17:10] and low network out, it comes up as a suspect. [laugh].[midroll 00:17:17]Corey: One thing that I like about what you've done, but also have some reservations about, is that you have not done what so many of these tools do, which is, “Oh, just give us all the access in your account. It'll be fine. You can trust us. Don't you want to save money?” And yeah, but I also still want to have a company left when all is said and done.You are very specific on what it is that you're allowed to access, and it's great. I would argue, on some level, it's almost too restrictive.
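A rough sketch of the seven-day check Victor describes above: low average CPU plus low network out marks a running instance as a suspect, and an in use tag opts it out of alerting entirely. The thresholds and tag key here are assumptions for illustration, not unusd.cloud's actual values.

```python
# Hedged sketch of the unused-EC2 heuristic described above; the thresholds
# and the tag key are assumptions, not the product's actual values.
from datetime import datetime, timedelta, timezone
import boto3

CPU_THRESHOLD_PCT = 2.0            # assumed cutoff for "low CPU"
NETWORK_OUT_THRESHOLD_BYTES = 5e6  # assumed cutoff for "low network out"

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

def seven_day_average(metric: str, instance_id: str) -> float:
    """Average one CloudWatch metric over the last seven days."""
    datapoints = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(days=7),
        EndTime=now,
        Period=86400,               # one datapoint per day
        Statistics=["Average"],
    )["Datapoints"]
    if not datapoints:
        return 0.0
    return sum(p["Average"] for p in datapoints) / len(datapoints)

for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if "in use" in tags:    # explicit opt-out disables the alert
                continue
            iid = instance["InstanceId"]
            if (seven_day_average("CPUUtilization", iid) < CPU_THRESHOLD_PCT
                    and seven_day_average("NetworkOut", iid) < NETWORK_OUT_THRESHOLD_BYTES):
                print(f"suspect: {iid} looks unused over the last seven days")
```

In a scheduled once-a-day run, the print would become the Slack, Teams, or email notification Victor mentions.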
For example, you have the ability to look at EC2, Glue, IAM—just to look at account aliases, great—RDS, Redshift, and SageMaker. And all of these are simply list and describe. There's no gets in there other than in Cost Explorer, which makes sense. You're not able to go rummaging through my data and see what's there. But that also bounds you, on some level, to being able to look only at particular types of resources. Is that accurate or are you using a lot of the CloudWatch stuff and Cost Explorer stuff to see other areas?Victor: In fact, it's the least privilege and read-only permission because I don't want too much question for the security team. So, it's full read-only permission. And I've only added the detection that I'm currently supports. Then if in some weeks, in some months, I'm adding a new detection, let's say for Snapshot, for example, I will need to update, so I will ask my customer to update their template. There is a mechanisms inside the project to tell them that the template is obsolete, but it's not a breaking change.So, the detection will continue, but without the new detection, the new snapshot detection, let's say. So yeah, it's least privilege, and all I need is the get-metric-statistics from CloudWatch to detect unused assets. And also checking [unintelligible 00:19:16] Elastic IP or [unintelligible 00:19:19] EBS volume. So, there is no CloudWatching in this detection.Corey: Also, to be clear, I am not suggesting that what you have done is at all a mistake, even if you bound it to those resources right now. But just because everyone loves to talk about these exciting, amazing, high-level services that AWS has put up there, for example, oh, what about DocumentDB or all these other—you know, Amazon Basics MongoDB; same thing—or all of these other things that they wind up offering, but you take a look at where customers are spending money and where they're surprised to be spending money, it's EC2, it's a bit of RDS, occasionally it's S3, but that's a lot harder to detect automatically whether that data is unused. It's, “You haven't been using this data very much.” It's, “Well, you see how the bucket is labeled ‘Archive Backups' or ‘Regulatory Logs?'” imagine that. What a ridiculous concept.Yeah. Whereas an idle EC2 instance sort of can wind up being useful on this. I am curious whether you encounter in the wild in your customer base, folks who are having idle-looking EC2 instances, but are in fact, for example, using a whole bunch of RAM, which you can't tell from the outside without custom CloudWatch agents.Victor: Yeah, I'm not detecting this behavior for larger usage of RAM, for example, or for maybe there is some custom application that is low in CPU and don't talk to any other services using the network, but with this detection, with the current state of the detection, I'm covering large majority of waste because what I see from my customer is that there is some teams, some data scientists or data teams who are experimenting a lot with SageMaker with Glue, with Endpoint and so on. And this is very expensive at the end of the day because they don't turn off the light at the end of the day, on Friday evening. 
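Pulling together the permissions named in this exchange (list and describe on the supported services, CloudWatch's get-metric-statistics, and Cost Explorer reads), the least-privilege policy Victor describes might look roughly like the following. This is a sketch assembled from the conversation, not unusd.cloud's published template.

```python
# Hedged sketch of a read-only, least-privilege policy for the detection
# described above; the action list is inferred from the conversation.
import json
import boto3

POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "UnusedDetectionReadOnly",
        "Effect": "Allow",
        "Action": [
            "ec2:Describe*",
            "rds:Describe*",
            "redshift:Describe*",
            "glue:List*",
            "sagemaker:List*",
            "iam:ListAccountAliases",          # account alias lookup only
            "cloudwatch:GetMetricStatistics",  # the one metrics call needed
            "ce:GetCostAndUsage",              # Cost Explorer read
        ],
        "Resource": "*",
    }],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="unused-detection-readonly",  # assumed name
    PolicyDocument=json.dumps(POLICY),
)
```

Adding a new detection, say for snapshots, would mean adding the matching read-only action and asking customers to update the template, which fits the non-breaking upgrade path Victor describes.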
So, what I'm trying to solve here is to notify the team—so on Slack—when they forgot to turn off the most common waste on AWS, so EC2, RDS, Redshift.Corey: I just now wound up installing it while we've been talking on my dedicated shitposting account, and sure enough, it already spat out a single instance it found, which, yeah: I was running an EC2 instance on the East Coast when I was just there, so that I had a DNS server that was a little bit more local. Okay, great. And it's a T4g.micro, so it's not exactly a whole lot of money, but it does exactly what it says on the tin. It didn't wind up nailing the other instances I have in that account that I'm using for a variety of different things, which is good.And it further didn't wind up falling into the trap that so many things do, which is the, “Oh, it's costing you zero and your spend this month is zero because this account is where I dump all of my AWS credit codes.” So, many things say, “Oh, well, it's not costing you anything, so what's the problem?” And then that's how you accidentally lose $100,000 in Activate credits because someone left something running way too long. It does a lot of the right things that I would hope and expect it to do, and the fact that you don't do that is kind of amazing.Victor: Yeah. It was a need from my customer and an opportunity. It's a small bet for me because I'm trying to do some small bets, you know, the small bets approach, so the idea is to try a new thing. It's also an excuse for me to learn something new because building a SaaS is challenging.Corey: One thing that I am curious about: in this account, I'm also running the controller for my home WiFi environment. And that's not huge. It's a T3.small, but it is still something out there that sits there because I need it to exist. But it's relatively bored.If I go back and look over the last week of CloudWatch metrics, for example, it doesn't look like it's usually busy. I'm sure there's some network traffic in and out as it updates itself and whatnot, but the CPU peaks out at a little under 2% used. It didn't warn on this and it got it right. I'm just curious as to how you did that. What is it looking for to determine whether this instance is unused or not?Victor: It's the magic [laugh]. There is some intelligence artif—no, I'm just kidding. It's just statistics. And I'm getting two metrics, the CPU average from the last seven days and the network out. And I'm getting the average on those metrics and I'm doing some assumption that this EC2, this specific EC2, is not used because of these metrics, this CPU average.Corey: Yeah, it is wild to me just that this is working as well as it is. It's just… like, it does exactly what I would expect it to do. It's clear that—and this is going to sound weird, but I'm going to say it anyway—that this was built by someone who was looking to answer the question themselves and not from the perspective of, “Well, we need to build a product and we have access to all of this data from the API. How can we slice and dice it and add some value as we go?” I really liked the approach that you've taken on this. I don't say that often or lightly, particularly when it comes to cloud costing stuff, but this is something I'll be using in some of my own nonsense.Victor: Thanks. I appreciate it.Corey: So, I really want to thank you for taking as much time as you have to talk about who you are and what you're up to. If people want to learn more, where can they find you?Victor: Mainly on Twitter, my handle is @zoph [laugh].
And, you know, on LinkedIn or on my company website, as zoph.io.Corey: And we will, of course, put links to that in the [show notes 00:25:23]. Thank you so much for your time today. I really appreciate it.Victor: Thank you, Corey, for having me. It was a pleasure to chat with you.Corey: Victor Grenu, independent AWS architect. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that is going to cost you an absolute arm and a leg because invariably, you're going to forget to turn it off when you're done.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Dynamic Configuration Through AWS AppConfig with Steve Rice

Screaming in the Cloud

Play Episode Listen Later Oct 11, 2022 35:54


About Steve:Steve Rice is Principal Product Manager for AWS AppConfig. He is surprisingly passionate about feature flags and continuous configuration. He lives in the Washington DC area with his wife, 3 kids, and 2 incontinent dogs.Links Referenced:AWS AppConfig: https://go.aws/awsappconfig TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH. Basically, you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive? Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This is a promoted guest episode. What does that mean? Well, it means that some people don't just want me to sit here and throw slings and arrows their way, they would prefer to send me a guest specifically, and they do pay for that privilege, which I appreciate. Paying me is absolutely a behavior I wish to endorse.Today's victim who has decided to contribute to slash sponsor my ongoing ridiculous nonsense is, of all companies, AWS. And today I'm talking to Steve Rice, who's the principal product manager on AWS AppConfig. Steve, thank you for joining me.Steve: Hey, Corey, great to see you. Thanks for having me. Looking forward to a conversation.Corey: As am I. Now, AppConfig does something super interesting, which I'm not aware of any other service or sub-service doing. You are under the umbrella of AWS Systems Manager, but you're not going to market with Systems Manager AppConfig. You're just AWS AppConfig. Why?Steve: So, AppConfig is part of AWS Systems Manager. Systems Manager has, I think, 17 different features associated with it.
Some of them have an individual name that is associated with Systems Manager, some of them don't. We just happen to be one that doesn't. AppConfig is a service that's been around for a while internally before it was launched externally a couple years ago, so I'd say that's probably the origin of the name and the service. I can tell you more about the origin of the service if you're curious.Corey: Oh, I absolutely am. But I just want to take a bit of a detour here and point out that I make fun of the sub-service names in Systems Manager an awful lot, like Systems Manager Session Manager and Systems Manager Change Manager. And part of the reason I do that is not just because it's funny, but because almost everything I found so far within the Systems Manager umbrella is pretty awesome. It aligns with how I tend to think about the world in a bunch of different ways. I have yet to see anything lurking within the Systems Manager umbrella that has led to a tee-hee-hee bill surprise level that rivals, you know, the GDP of Guam. So, I'm a big fan of the entire suite of services. But yes, how did AppConfig get its name?Steve: [laugh]. So, AppConfig started about six years ago, now, internally. So, we actually were part of the region services department inside of Amazon, which is in charge of launching new services around the world. We found that a centralized tool for configuration associated with each service launching was really helpful. So, a service might be launching in a new region and have to enable and disable things as it moved along.And so, the tool was sort of built for that, turning on and off things as the region developed and was ready to launch publicly; then the regions launch publicly. It turned out that our internal customers, which are a lot of AWS services and then some Amazon services as well, started to use us beyond launching new regions, and started to use us for feature flagging. Again, turning on and off capabilities, launching things safely. And so, it became massively popular; we were actually a top 30 service internally in terms of usage. And two years ago, we thought we really should launch this externally and let our customers benefit from some of the goodness that we put in there, and some of—those all come from the mistakes we've made internally. And so, it became AppConfig. In terms of the name itself, we specialize in application configuration, so that's kind of a mouthful, so we just changed it to AppConfig.Corey: Earlier this year, there was a vulnerability reported around I believe it was AWS Glue, but please don't quote me on that. And as part of its excellent response that AWS put out, they said that from the time that it was disclosed to them, they had patched the service and rolled it out to every AWS region in which Glue existed in a little under 29 hours, which at scale is absolutely magic fast. That is superhero speed and then some because you generally don't just throw something over the wall, regardless of how small it is when we're talking about something at the scale of AWS. I mean, look at who your customers are; mistakes will show. This also got me thinking that when you have Adam, or previously Andy, on stage giving a keynote announcement and then they mention something on stage, like, “Congratulations. It's now a very complicated service with 14 adjectives in his name because someone's paid by the syllable. 
Great.”Suddenly, the marketing pages are up, the APIs are working, it's showing up in the console, and it occurs to me only somewhat recently to think about all of the moving parts that go on behind this. That is far faster than even the improved speed of CloudFront distribution updates. There's very clearly something going on there. So, I've got to ask, is that you?Steve: Yes, a lot of that is us. I can't take credit for a hundred percent of what you're talking about, but that's how we are used. We're essentially used as a feature-flagging service. And I can talk generically about feature flagging. Feature flagging allows you to push code out to production, but it's hidden behind a configuration switch: a feature toggle or a feature flag. And that code can be sitting out there, nobody can access it until somebody flips that toggle. Now, the smart way to do it is to flip that toggle on for a small set of users. Maybe it's just internal users, maybe it's 1% of your users. And so, the features available, you can—Corey: It's your best slash worst customers [laugh] in that 1%, in some cases.Steve: Yeah, you want to stress test the system with them and you want to be able to look and see what's going to break before it breaks for everybody. So, you release us to a small cohort, you measure your operations, you measure your application health, you measure your reputational concerns, and then if everything goes well, then you maybe bump it up to 2%, and then 10%, and then 20%. So, feature flags allow you to slowly release features, and you know what you're releasing by the time it's at a hundred percent. It's tempting for teams to want to, like, have everybody access it at the same time; you've been working hard on this feature for a long time. But again, that's kind of an anti-pattern. You want to make sure that on production, it behaves the way you expect it to behave.Corey: I have to ask what is the fundamental difference between feature flags and/or dynamic configuration. Because to my mind, one of them is a means of achieving the other, but I could also see very easily using the terms interchangeably. Given that in some of our conversations, you have corrected me which, first, how dare you? Secondly, okay, there's probably a reason here. What is that point of distinction?Steve: Yeah. Typically for those that are not eat, sleep, and breathing dynamic configuration—which I do—and most people are not obsessed with this kind of thing, feature flags is kind of a shorthand for dynamic configuration. It allows you to turn on and off things without pushing out any new code. So, your application code's running, it's pulling its configuration data, say every five seconds, every ten seconds, something like that, and when that configuration data changes, then that app changes its behavior, again, without a code push or without restarting the app.So, dynamic configuration is maybe a superset of feature flags. Typically, when people think feature flags, they're thinking of, “Oh, I'm going to release a new feature, so it's almost like an on-off switch.” But we see customers using feature flags—and we use this internally—for things like throttling limits. Let's say you want to be able to throttle TPS transactions per second. Or let's say you want to throttle the number of simultaneous background tasks, and say, you know, I just really don't want this creeping above 50; bad things can start to happen.But in a period of stress, you might want to actually bring that number down. 
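The gradual rollout Steve describes (1%, then 2%, then 10%, then 20%) is commonly implemented with deterministic bucketing, so the same user stays in or out of the cohort as the percentage grows. A minimal sketch, with the hashing scheme and names assumed:

```python
# Deterministic percentage rollout: hash the user into a stable bucket
# 0-99 and admit them while their bucket is below the rollout percentage.
# The flag name and user ID are assumptions for illustration.
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Start at 1% and later bump the percentage, via configuration rather
# than code, to 2, 10, 20, and eventually 100.
if in_rollout("user-123", "new-checkout-flow", percent=1):
    ...  # new feature path
else:
    ...  # existing behavior
```

The percentage itself is exactly the kind of dial that dynamic configuration is meant to turn at runtime.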
Well, you can push out these changes with dynamic configuration—which is, again, any type of configuration, not just an on-off switch—you can push this out and adjust the behavior and see what happens. Again, I'd recommend pushing it out to 1% of your users, and then 10%. But it allows you to have these dials and switches to do that. And, again, generically, that's dynamic configuration. It's not as fun to term as feature flags; feature flags is sort of a good mental picture, so I do use them interchangeably, but if you're really into the whole world of this dynamic configuration, then you probably will care about the difference.Corey: Which makes a fair bit of sense. It's the question of what are you talking about high level versus what are you talking about implementation detail-wise.Steve: Yep. Yep.Corey: And on some level, I used to get… well, we'll call it angsty—because I can't think of a better adjective right now—about how AWS was reluctant to disclose implementation details behind what it did. And in the fullness of time, it's made a lot more sense to me, specifically through a lens of, you want to be able to have the freedom to change how something works under the hood. And if you've made no particular guarantee about the implementation detail, you can do that without potentially worrying about breaking a whole bunch of customer expectations that you've inadvertently set. And that makes an awful lot of sense.The idea of rolling out changes to your infrastructure has evolved over the last decade. Once upon a time you'd have EC2 instances, and great, you want to go ahead and make a change there—or this actually predates EC2 instances. Virtual machines in a data center or heaven forbid, bare metal servers, you're not going to deploy a whole new server because there's a new version of the code out, so you separate out your infrastructure from the code that it runs. And that worked out well. And increasingly, we started to see ways of okay, if we want to change the behavior of the application, we'll just push out new environment variables to that thing and restart the service so it winds up consuming those.And that's great. You've rolled it out throughout your fleet. With containers, which is sort of the next logical step, well, okay, this stuff gets baked in, we'll just restart containers with a new version of code because that takes less than a second each and you're fine. And then Lambda functions, it's okay, we'll just change the deployment option and the next invocation will wind up taking the brand new environment variables passed out to it. How do feature flags feature into those, I guess, three evolving methods of running applications in anger, by which I mean, of course, production?Steve: [laugh]. Good question. And I think you really articulated that well.Corey: Well, thank you. I should hope so. I'm a storyteller. At least I fancy myself one.Steve: [laugh]. Yes, you are. Really what you talked about is the evolution of you know, at the beginning, people were—well, first of all, people probably were embedding their variables deep in their code and then they realized, “Oh, I want to change this,” and now you have to find where in my code that is. And so, it became a pattern. Why don't we separate everything that's a configuration data into its own file? But it'll get compiled at build time and sent out all at once.There was kind of this breakthrough that was, why don't we actually separate out the deployment of this? 
We can separate the deployment from code from the deployment of configuration data, and have the code be reading that configuration data on a regular interval, as I already said. So now, as the environments have changed—like you said, containers and Lambda—that ability to make tweaks at microsecond intervals is more important and more powerful. So, there certainly is still value in having things like environment variables that get read at startup. We call that static configuration as opposed to dynamic configuration.And that's a very important element in the world of containers that you talked about. Containers are a bit ephemeral, and so they kind of come and go, and you can restart things, or you might spin up new containers that are slightly different config and have them operate in a certain way. And again, Lambda takes that to the next level. I'm really excited where people are going to take feature flags to the next level because already today we have people just fine-tuning to very targeted small subsets, different configuration data, different feature flag data, and allows them to do this like at we've never seen before scale of turning this on, seeing how it reacts, seeing how the application behaves, and then being able to roll that out to all of your audience.Now, you got to be careful, you really don't want to have completely different configurations out there and have 10 different, or you know, 100 different configurations out there. That makes it really tough to debug. So, you want to think of this as I want to roll this out gradually over time, but eventually, you want to have this sort of state where everything is somewhat consistent.Corey: That, on some level, speaks to a level of operational maturity that my current deployment adventures generally don't have. A common reference I make is to my lasttweetinaws.com Twitter threading app. And anyone can visit it, use it however they want.And it uses a Route 53 latency record to figure out, ah, which is the closest region to you because I've deployed it to 20 different regions. Now, if this were a paid service, or I had people using this in large volume and I had to worry about that sort of thing, I would probably approach something that is very close to what you describe. In practice, I pick a devoted region that I deploy something to, and cool, that's sort of my canary where I get things working the way I would expect. And when that works the way I want it to I then just push it to everything else automatically. Given that I've put significant effort into getting deployments down to approximately two minutes to deploy to everything, it feels like that's a reasonable amount of time to push something out.Whereas if I were, I don't know, running a bank, for example, I would probably have an incredibly heavy process around things that make changes to things like payment or whatnot. Because despite the lies, we all like to tell both to ourselves and in public, anything that touches payments does go through waterfall, not agile iterative development because that mistake tends to show up on your customer's credit card bills, and then they're also angry. I think that there's a certain point of maturity you need to be at as either an organization or possibly as a software technology stack before something like feature flags even becomes available to you. Would you agree with that, or is this something everyone should use?Steve: I would agree with that. 
Definitely, a small team that has communication flowing between the two probably won't get as much value out of a gradual release process because everybody kind of knows what's going on inside of the team. Once your team scales, or maybe your audience scales, that's when it matters more. You really don't want to have something blow up with your users. You really don't want to have people getting paged in the middle of the night because of a change that was made. And so, feature flags do help with that.So typically, the journey we see is people start off in a maybe very small startup. They're releasing features at a very fast pace. They grow and they start to build their own feature flagging solution—again, at companies I've been at previously have done that—and you start using feature flags and you see the power of it. Oh, my gosh, this is great. I can release something when I want without doing a big code push. I can just do a small little change, and if something goes wrong, I can roll it back instantly. That's really handy.And so, the basics of feature flagging might be a homegrown solution that you all have built. If you really lean into that and start to use it more, then you probably want to look at a third-party solution because there's so many features out there that you might want. A lot of them are around safeguards that makes sure that releasing a new feature is safe. You know, again, pushing out a new feature to everybody could be similar to pushing out untested code to production. You don't want to do that, so you need to have, you know, some checks and balances in your release process of your feature flags, and that's what a lot of third parties do.It really depends—to get back to your question about who needs feature flags—it depends on your audience size. You know, if you have enough audience out there to want to do a small rollout to a small set first and then have everybody hit it, that's great. Also, if you just have, you know, one or two developers, then feature flags are probably something that you're just kind of, you're doing yourself, you're pushing out this thing anyway on your own, but you don't need it coordinated across your team.Corey: I think that there's also a bit of—how to frame this—misunderstanding on someone's part about where AppConfig starts and where it stops. When it was first announced, feature flags were one of the things that it did. And that was talked about on stage, I believe in re:Invent, but please don't quote me on that, when it wound up getting announced. And then in the fullness of time, there was another announcement of AppConfig now supports feature flags, which I'm sitting there and I had to go back to my old notes. Like, did I hallucinate this? Which again, would not be the first time I'd imagine such a thing. But no, it was originally how the service was described, but now it's extra feature flags, almost like someone would, I don't know, flip on a feature-flag toggle for the service and now it does a different thing. What changed? What was it that was misunderstood about the service initially versus what it became?Steve: Yeah, I wouldn't say it was a misunderstanding. I think what happened was we launched it, guessing what our customers were going to use it as. We had done plenty of research on that, and as I mentioned before we had—Corey: Please tell me someone used it as a database. Or am I the only nutter that does stuff like that?Steve: We have seen that before. We have seen something like that before.Corey: Excellent. 
Excellent, excellent. I approve.Steve: And so, we had done our due diligence ahead of time about how we thought people were going to use it. We were right about a lot of it. I mentioned before that we have a lot of usage internally, so you know, that was kind of maybe cheating even for us to be able to sort of see how this is going to evolve. What we did announce, I guess it was last November, was an opinionated version of feature flags. So, we had people using us for feature flags, but they were building their own structure, their own JSON, and there was not a dedicated console experience for feature flags.What we announced last November was an opinionated version that structured the JSON in a way that we think is the right way, and that afforded us the ability to have a smooth console experience. So, if we know what the structure of the JSON is, we can have things like toggles and validations in there that really specifically look at some of the data points. So, that's really what happened. We're just making it easier for our customers to use us for feature flags. We still have some customers that are kind of building their own solution, but we're seeing a lot of them move over to our opinionated version.Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500-plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third-party services in a single pane of glass.Combine these with drag-and-drop dashboards and machine-learning-based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14-day trial and get a complimentary T-shirt when you install the agent. To learn more, visit datadoghq.com/screaminginthecloud to get started. That's www.datadoghq.com/screaminginthecloud.Corey: Part of the problem I have when I look at what it is you folks do, and your use cases, and how you structure it is, it's similar in some respects to how folks perceive things like FIS, the fault injection service, or chaos engineering, as it's commonly known, which is, “We can't even get the service to stay up on its own for any [unintelligible 00:18:35] period of time. What do you mean, now let's intentionally degrade it and make it work?” There needs to be a certain level of operational stability or operational maturity. When you're still building a service before it's up and running, feature flags seem awfully premature because there's no one depending on it. You can change configuration however your little heart desires. In most cases. I'm sure at certain points of scale of development teams, you have a communications problem internally, but it's not aimed at me trying to get something working at 2 a.m. in the middle of the night.Whereas by the time folks are ready for what you're doing, they clearly have that level of operational maturity established. So, I have to guess on some level, that your typical adopter of AppConfig feature flags isn't in fact, someone who is, “Well, we're ready for feature flags; let's go,” but rather someone who's come up with something else as a stopgap as they've been iterating forward. Usually something homebuilt.
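The homegrown stopgap in question often starts as little more than a JSON file read at startup. A deliberately naive sketch of that pattern, with the file name and flag assumed, not anyone's actual implementation:

```python
# Naive homegrown feature flags: a JSON file read once at startup.
# Fine for a small team; the pain arrives when you need gradual rollout,
# validation, and instant rollback. File name and flag are assumptions.
import json

with open("flags.json") as f:       # e.g. {"new-report-ui": false}
    FLAGS = json.load(f)

def is_enabled(name: str) -> bool:
    """Unknown flags default to off, so new code ships dark."""
    return bool(FLAGS.get(name, False))

if is_enabled("new-report-ui"):
    ...  # new code path
else:
    ...  # existing behavior
```

Because the file is read once, changing a flag still means a redeploy or a restart, which is exactly the limitation that pushes teams toward a dedicated service.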
And it might very well be you have the exact same biggest competitor that I do in my consulting work, which is, of course, Microsoft Excel, as people try to build their own thing that works in their own way.Steve: Yeah, so definitely a very common customer of ours is somebody that is using a homegrown solution for turning on and off things. And they really feel like I'm using the heck out of these feature flags. I'm using them on a daily or weekly basis. I would like to have some enhancements to how my feature flags work, but I have limited resources and I'm not sure that my resources should be building enhancements to a feature-flagging service, but instead, I'd rather have them focusing on something, you know, directly for our customers, some of the core features of whatever your company does. And so, that's when people sort of look around externally and say, “Oh, let me see if there's some other third-party service or something built into AWS like AWS AppConfig that can meet those needs.”And so absolutely, the workflows get more sophisticated, the ability to move forward faster becomes more important, and do so in a safe way. I used to work at a cybersecurity company and we would kind of joke that the security budget of the company is relatively low until something bad happens, and then it's, you know, whatever you need to spend on it. It's not quite the same with feature flags, but you do see when somebody has a problem on production, and they want to be able to turn something off right away or make an adjustment right away, then the ability to do that in a measured way becomes incredibly important. And so, that's when, again, you'll see customers starting to feel like they're outgrowing their homegrown solution and moving to something that's a third-party solution.Corey: Honestly, I feel like so many tools exist in this space, where, “Oh, yeah, you should definitely use this tool.” And most people will use that tool. The second time. Because the first time, it's one of those, “How hard could that be? I can build something like that in a weekend.” Which is sort of the rallying cry of doomed engineers who are bad at scoping.And by the time that they figure out why, they have to backtrack significantly. There's a whole bunch of stuff that I have built that people look at and say, “Wow, that's a really great design. What inspired you to do that?” And the absolute honest answer to all of it is simply, “Yeah, I worked in roles where, the first time, I did it the way you would think I would do it, and it didn't go well.” Experience is what you get when you didn't get what you wanted, and this is one of those areas where it tends to manifest in reasonable ways.Steve: Absolutely, absolutely.Corey: So, give me an example here, if you don't mind, about how feature flags can improve the day-to-day experience of an engineering team or an engineer themselves. Because we've been down this path enough, in some cases, to know the failure modes, but for folks who haven't been there and are trying to shave a little bit off of their journey of, “I'm going to learn from my own mistakes.” Eh, learn from someone else's. What are the benefits that accrue and are felt immediately?Steve: Yeah. So, we kind of have a policy that the very first commit of any new feature ought to be the feature flag. That's that sort of on-off switch that you want to put there so that you can start to deploy your code and not have a long-lived branch in your source code.
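What that first-commit flag might look like in practice, sketched against the AppConfig data-plane API. The application, environment, and profile names are assumptions, as is the exact shape of the returned flag JSON:

```python
# Hedged sketch: poll AppConfig for flag data and ship the new feature
# dark. Identifiers and the flag JSON shape are assumed for illustration.
import json
import boto3

appconfigdata = boto3.client("appconfigdata")

# Open a configuration session once at startup.
token = appconfigdata.start_configuration_session(
    ApplicationIdentifier="my-app",          # assumed
    EnvironmentIdentifier="production",      # assumed
    ConfigurationProfileIdentifier="flags",  # assumed
    RequiredMinimumPollIntervalInSeconds=15,
)["InitialConfigurationToken"]

_cached = {}

def get_flags() -> dict:
    """Fetch the latest flags; AppConfig returns an empty body when
    nothing has changed since the previous poll."""
    global token, _cached
    resp = appconfigdata.get_latest_configuration(ConfigurationToken=token)
    token = resp["NextPollConfigurationToken"]
    body = resp["Configuration"].read()
    if body:                     # empty body means "no change"
        _cached = json.loads(body)
    return _cached

# The feature lands with the very first commit, defaulting to off.
if get_flags().get("new-checkout-flow", {}).get("enabled", False):
    ...  # new code path, dark until the flag flips
else:
    ...  # existing behavior
```

Polling on an interval, as sketched here, is what lets the toggle flip without a deploy or a restart.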
But you can have your code there, it reads whether that configuration is on or off. You start with it off.And so, it really helps just while developing these things about keeping your branches short. And you can push the mainline, as long as the feature flag is off and the feature is hidden to production, which is great. So, that helps with the mess of doing big code merges. The other part is around the launch of a feature.So, you talked about Andy Jassy being on stage to launch a new feature. Sort of the old way of doing this, Corey, was that you would need to look at your pipelines and see how long it might take for you to push out your code with any sort of code change in it. And let's say that was an hour-and-a-half process and let's say your CEO is on stage at eight o'clock on a Friday. And as much as you like to say it, “Oh, I'm never pushing out code on a Friday,” sometimes you have to. The old way—Corey: Yeah, that week, yes you are, whether you want to or not.Steve: [laugh]. Exactly, exactly. The old way was this idea that I'm going to time my release, and it takes an hour-and-a-half; I'm going to push it out, and I'll do my best, but hopefully, when the CEO raises her arm or his arm up and points to a screen that everything's lit up. Well, let's say you're doing that and something goes wrong and you have to start over again. Well, oh, my goodness, we're 15 minutes behind, can you accelerate things? And then you start to pull away some of these blockers to accelerate your pipeline or you start editing it right in the console of your application, which is generally not a good idea right before a really big launch.So, the new way is, I'm going to have that code already out there on a Wednesday [laugh] before this big thing on a Friday, but it's hidden behind this feature flag, I've already turned it on and off for internals, and it's just waiting there. And so, then when the CEO points to the big screen, you can just flip that one small little configuration change—and that can be almost instantaneous—and people can access it. So, that just reduces the amount of stress, reduces the amount of risk in pushing out your code.Another thing is—we've heard this from customers—customers are increasing the number of deploys that they can do per week by a very large percentage because they're deploying with confidence. They know that I can push out this code and it's off by default, then I can turn it on whenever I feel like it, and then I can turn it off if something goes wrong. So, if you're into CI/CD, you can actually just move a lot faster with a number of pushes to production each week, which again, I think really helps engineers on their day-to-day lives. The final thing I'm going to talk about is that let's say you did push out something, and for whatever reason, that following weekend, something's going wrong. The old way was oop, you're going to get a page, I'm going to have to get on my computer and go and debug things and fix things, and then push out a new code change.And this could be late on a Saturday evening when you're out with friends. If there's a feature flag there that can turn it off and if this feature is not critical to the operation of your product, you can actually just go in and flip that feature flag off until the next morning or maybe even Monday morning. So, in theory, you kind of get your free time back when you are implementing feature flags. 
So, I think those are the big benefits for engineers in using feature flags.Corey: And the best way to figure out whether someone is speaking from a position of experience or is simply a raving zealot when they're in a position where they are incentivized to advocate for a particular way of doing things or a particular product, as—let's be clear—you are in that position, is to ask a form of the following question. Let's turn it around for a second. In what scenarios would you absolutely not want to use feature flags? What problems arise? When do you take a look at a situation and say, “Oh, yeah, feature flags will make things worse, instead of better. Don't do it.”Steve: I'm not sure I wouldn't necessarily don't do it—maybe I am that zealot—but you got to do it carefully.Corey: [laugh].Steve: You really got to do things carefully because as I said before, flipping on a feature flag for everybody is similar to pushing out untested code to production. So, you want to do that in a measured way. So, you need to make sure that you do a couple of things. One, there should be some way to measure what the system behavior is for a small set of users with that feature flag flipped to on first. And it could be some canaries that you're using for that.You can also—there's other mechanisms you can do that to: set up cohorts and beta testers and those kinds of things. But I would say the gradual rollout and the targeted rollout of a feature flag is critical. You know, again, it sounds easy, “I'll just turn it on later,” but you ideally don't want to do that. The second thing you want to do is, if you can, is there some sort of validation that the feature flag is what you expect? So, I was talking about on-off feature flags; there are things, as when I was talking about dynamic configuration, that are things like throttling limits, that you actually want to make sure that you put in some other safeguards that say, “I never want my TPS to go above 1200 and never want to set it below 800,” for whatever reason, for example. Well, you want to have some sort of validation of that data before the feature flag gets pushed out. Inside Amazon, we actually have the policy that every single flag needs to have some sort of validation around it so that we don't accidentally fat-finger something out before it goes out there. And we have fat-fingered things.Corey: Typing the wrong thing into a command structure into a tool? “Who would ever do something like that?” He says, remembering times he's taken production down himself, exactly that way.Steve: Exactly, exactly, yeah. And we've done it at Amazon and AWS, for sure. And so yeah, if you have some sort of structure or process to validate that—because oftentimes, what you're doing is you're trying to remediate something in production. Stress levels are high, it is especially easy to fat-finger there. So, that check-and-balance of a validation is important.And then ideally, you have something to automatically roll back whatever change that you made, very quickly. So AppConfig, for example, hooks up to CloudWatch alarms. If an alarm goes off, we're actually going to roll back instantly whatever that feature flag was to its previous state so that you don't even need to really worry about validating against your CloudWatch. It'll just automatically do that against whatever alarms you have.Corey: One of the interesting parts about working at Amazon and seeing things in Amazonian scale is that one in a million events happen thousands of times every second for you folks. 
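The syntactic validation Steve describes maps naturally onto a JSON Schema validator attached to an AppConfig configuration profile, so a throttle value outside his 800 to 1200 range can never be deployed. A sketch with assumed IDs and names:

```python
# Hedged sketch: attach a JSON Schema validator so out-of-range throttle
# values are rejected at deployment time. IDs and names are assumptions.
import json
import boto3

appconfig = boto3.client("appconfig")

schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "tps_limit": {"type": "integer", "minimum": 800, "maximum": 1200}
    },
    "required": ["tps_limit"],
}

appconfig.create_configuration_profile(
    ApplicationId="abc1234",        # assumed application ID
    Name="throttle-settings",
    LocationUri="hosted",
    Validators=[{"Type": "JSON_SCHEMA", "Content": json.dumps(schema)}],
)
```

The CloudWatch-alarm rollback he mentions is configured separately, as monitors on the AppConfig environment, so a bad deployment reverts without a human in the loop.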
What lessons have you learned by deploying feature flags at that kind of scale? Because one of my problems and challenges with deploying feature flags myself is that in some cases, we're talking about three to five users a day for some of these things. That's not really enough usage to get insights into various cohort analyses or A/B tests.Steve: Yeah. As I mentioned before, we build these things as features into our product. So, I just talked about the CloudWatch alarms. That wasn't there originally. Originally, you know, if something went wrong, you would observe a CloudWatch alarm and then you decide what to do, and one of those things might be that I'm going to roll back my configuration.So, a lot of the mistakes that we made that caused alarms to go off necessitated us building some automatic mechanisms. And you know, a human being can only react so fast, but an automated system there is going to be able to roll things back very, very quickly. So, that came from some specific mistakes that we had made inside of AWS. The validation that I was talking about as well. We have a couple of ways of validating things.You might want to do a syntactic validation, which really you're validating—as I was saying—the range between 100 and 1000, but you also might want to have sort of a functional validation, or we call it a semantic validation so that you can make sure that, for example, if you're switching to a new database, that you're going to flip over to your new database, you can have a validation there that says, “This database is ready, I can write to this table, it's truly ready for me to switch.” Instead of just updating some config data, you're actually going to be validating that the new target is ready for you. So, those are a couple of things that we've learned from some of the mistakes we made. And again, not saying we aren't making mistakes still, but we always look at these things inside of AWS and figure out how we can benefit from them and how our customers, more importantly, can benefit from these mistakes.Corey: I would say that I agree. I think that you have threaded the needle of not talking smack about your own product, while also presenting it as not the global panacea that everyone should roll out, willy-nilly. That's a good balance to strike. And frankly, I'd also say it's probably a good point to park the episode. If people want to learn more about AppConfig, how you view these challenges, or even potentially want to get started using it themselves, what should they do?Steve: We have an informational page at go.aws/awsappconfig. That will tell you the high-level overview. You can search for our documentation and we have a lot of blog posts to help you get started there.Corey: And links to that will, of course, go into the [show notes 00:31:21]. Thank you so much for suffering my slings, arrows, and other assorted nonsense on this. I really appreciate your taking the time.Steve: Corey thank you for the time. It's always a pleasure to talk to you. Really appreciate your insights.Corey: You're too kind. Steve Rice, principal product manager for AWS AppConfig. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment. But before you do, just try clearing your cookies and downloading the episode again. 
You might be in the 3% cohort for an A/B test, and you [want to 00:32:01] listen to the good one instead.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

The Cloud Pod
184: The CloudPod Explicitly trusts itself

The Cloud Pod

Play Episode Listen Later Oct 7, 2022 52:22


On The Cloud Pod this week, AWS announces an update to IAM role trust policy behavior, Easily Collect Vehicle Data and Send to the Cloud with new AWS IoT FleetWise, now generally available, Get a head start with no-cost learning challenges before Google Next ‘22. Thank you to our sponsor, Foghorn Consulting, which provides top notch cloud and DevOps engineers to the world's most innovative companies. Initiatives stalled because you're having trouble hiring? Foghorn can be burning down your DevOps and Cloud backlogs as soon as next week. Episode Highlights ⏰ AWS announces an update to IAM role trust policy behavior. ⏰ Easily Collect Vehicle Data and Send to the Cloud with new AWS IoT FleetWise, now generally available. ⏰ Get a head start with no-cost learning challenges before Google Next ‘22. General News:

Screaming in the Cloud
The Ever-Changing World of Cloud Native Observability with Ian Smith

Screaming in the Cloud

Play Episode Listen Later Sep 13, 2022 41:58


About IanIan Smith is Field CTO at Chronosphere where he works across sales, marketing, engineering and product to deliver better insights and outcomes to observability teams supporting high-scale cloud-native environments. Previously, he worked with observability teams across the software industry in pre-sales roles at New Relic, Wavefront, PagerDuty and Lightstep.Links Referenced: Chronosphere: https://chronosphere.io Last Tweet in AWS: lasttweetinaws.com TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, I find that something I'm working on aligns perfectly with a person that I wind up basically convincing to appear on this show. Today's promoted guest is Ian Smith, who's Field CTO at Chronosphere. Ian, thank you for joining me.Ian: Thanks, Corey. Great to be here.Corey: So, the coincidental aspect of what I'm referring to is that Chronosphere is, despite the name, not something that works on bending time, but rather an observability company. Is that directionally accurate?Ian: That's true. Although you could argue it probably bends a little bit of engineering time. But we can talk about that later.Corey: [laugh]. So, observability is one of those areas that I think is suffering from too many definitions, if that makes sense. At first, I couldn't make sense of what it was that people actually meant when they said observability. It sort of clarified, to me at least, when I realized that there were an awful lot of, well, let's be direct and call them ‘legacy monitoring companies' that just chose to take what they were already doing and define that as, “Oh, this is observability.” I don't know that I necessarily agree with that. I know a lot of folks in the industry vehemently disagree.You've been in a lot of places that have positioned you reasonably well to have opinions on this sort of question. To my understanding, you were at interesting places, such as LightStep, New Relic, Wavefront, and PagerDuty, which I guess technically might count as observability in a very strange way. How do you view observability and what it is?Ian: Yeah. Well, a lot of definitions, as you said, common ones, they talk about the three pillars, they talk really about data types. For me, it's about outcomes. I think observability is really this transition from the yesteryear of monitoring where things were much simpler and you, sort of, knew all of the questions, you were able to define your dashboards, you were able to define your alerts and that was really the gist of it. And going into this brave new world where there's a lot of unknown things, you're having to ask a lot of sort of unique questions, particularly during a particular incident, and so being able to ask those questions in an ad hoc fashion layers on top of what we've traditionally done with monitoring. So, observability is sort of that more flexible, more dynamic kind of environment that you have to deal with.Corey: This has always been something that, for me, has been relatively academic. Back when I was running production environments, things tended to be a lot more static, where, “Oh, there's a problem with the database.
I will SSH into the database server.” Or, “Hmm, we're having a weird problem with the web tier. Well, there are ten or 20 or 200 web servers. Great, I can aggregate all of their logs to Syslog, and worst case, I can log in and poke around.”Now, with a more ephemeral style of environment where you have Kubernetes or whatnot scheduling containers into place, when something has problems you can't attach to a running container very easily, and by the time you see an error, that container hasn't existed for three hours. And that becomes a problem. Then you've got the Lambda universe, which is a whole ‘nother world of pain, where it becomes very challenging, at least for me, to reason using the old-style approaches about what's actually going on in your environment.Ian: Yeah, I think there's that and there's also the added complexity of oftentimes you'll see performance or behavioral changes based on even more narrow pathways, right? One particular user is having a problem and the traffic is spread across many containers. Is it making all of these containers perform badly? Not necessarily, but their user experience is being affected. It's very common in, say, B2B scenarios for you to want to understand the experience of one particular user or the aggregate experience of users at a particular company, particular customer, for example.There's just more complexity. There's more complexity of the infrastructure and just the technical layer that you're talking about, but there's also more complexity in just the way that we're handling use cases and trying to provide value with all of this software to the myriad of customers in different industries that software now serves.Corey: From where I sit, I tend to have a little bit of trouble disambiguating, I guess, the three baseline data types that I see talked about again and again in observability. You have logs, which I think I can mostly wrap my head around. That seems to be the baseline story of, “Oh, great. Your application puts out logs. Of course, it's in its own unique, beautiful format. Why wouldn't it be?” In an ideal scenario, they're structured. Things are never ideal, so great. You're basically tailing log files in some cases. Great. I can reason about those.Metrics always seem to be a little bit of a step beyond that. It's okay, I have a whole bunch of log lines that are spitting out every 500 error that my app is throwing—and given my terrible code, it throws a lot—but I can then ideally count the number of times that appears, and that winds up incrementing a counter, similar to the way that we used to see with StatsD, for example, and Collectd. Is that directionally correct? As far as the way I reason about, well so far, logs and metrics?Ian: I think at a really basic level, yes. I think that, as we've been talking about, sort of greater complexity starts coming in when you have—particularly metrics in today's world of containers—Prometheus—you mentioned StatsD—Prometheus has become sort of like the standard for expressing those things, so you get situations where you have incredibly high cardinality, so cardinality being the interplay between all the different dimensions. So, you might have my container as a label, but also the type of endpoint running on that container as a label, then maybe I want to track my customer organizations and maybe I have 5000 of those. I have 3000 containers, and so on and so forth.
And you get this massive explosion, almost multiplicatively. For those in the audience who really live and breathe cardinality, there's probably someone screaming about, well, it's not truly multiplicative in every sense of the word, but, you know, it's close enough from an approximation standpoint. As you get this massive explosion of data, it obviously has a cost implication, but it also has, I think, a really big implication for the core reason why you have metrics in the first place, which you alluded to: so a human being can reason about it, right? You don't want to go and look at 5000 log lines; you want to know that out of those 5000 log lines, 4000 are errors and 1000 are OKs. It's very easy for human beings to reason about that from a numbers perspective. When your metrics start to explode out into thousands or millions of data points and unique time series, more numbers for you to track, then you're sort of losing that original goal of metrics.

Corey: I think I mostly have wrapped my head around the concept. But then that brings us to traces, and that tends to be, I think, one of the hardest things for me to grasp, just because most of the apps I build, for obvious reasons—namely, I'm bad at programming and most of these are proof-of-concept type of things rather than anything that's large scale running in production—the difference between a trace and logs tends to get very muddled for me. But the idea being that as you have a customer session or a request that talks to different microservices, how do you collate across different systems all of the outputs of that request into a single place so you can see timing information, understand the flow that user took through your application? Is that, again, directionally correct? Have I completely missed the plot here? Which is, again, eminently possible. You are the expert.

Ian: No, I think that's sort of the fundamental premise or expected value of tracing, for sure. We have something that's akin to a set of logs; they have a common identifier, a trace ID, that tells us that all of these logs essentially belong to the same request. But importantly, there's relationship information. And this is the difference between tracing and just having logs with a trace ID attached to them. So, for example, if you have Service A calling Service B and Service C—that's a relatively simple thing—you could use time to try to figure this out. But what if there are things happening in Service B at the same time there are things happening in Service C and D, and so on and so forth? So, one of the things that tracing brings to the table is it tells you what is currently happening and what called it. So, oh, I know that I'm Service D. I was actually called by Service B, and I'm not just relying on timestamps to try and figure out that connection. So, you have that information, and ultimately, the data model allows you to fully reflect what's happening with the request, particularly in complex environments. And I think this is where, you know, tracing needs to be looked at not as a tool for “just because I'm operating in a modern environment, I'm using some Kubernetes, or I'm using Lambda”; it needs to be used in a scenario where you really have trouble grasping, from a conceptual standpoint, what is happening with the request, because you need to actually fully document it. As opposed to: I have a few—let's say three—Lambda functions. I maybe have some key metrics about them; I have a little bit of logging.
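To ground the parent-child relationship Ian just walked through, here is a minimal OpenTelemetry sketch in Python. In a real system each service would run separately and propagate context over the network; the nesting here is just a compact way to show what gets recorded:

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout so the relationships are visible.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo")

# Service A calls Service B, which calls Service D. Each child span records
# its parent's span ID and shares the trace ID, so a backend can reconstruct
# the call graph without guessing from timestamps.
with tracer.start_as_current_span("service-a.handle-request"):
    with tracer.start_as_current_span("service-b.process"):
        with tracer.start_as_current_span("service-d.query"):
            pass  # the printed spans show trace_id and parent_id fields
```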
You probably do not need to use tracing to solve, sort of, basic performance problems with those. So, you can get yourself into a place where you're over-engineering, you're spending a lot of time with tracing instrumentation and tracing tooling, and I think the core of observability is using the right tool, the right data, for the job. But that's also what makes it really difficult, because you essentially need to have this, you know, huge set of experience or knowledge about the different data, the different tooling, your architecture, and the data you have available, to be able to reason about that and make confident decisions, particularly when you're under a time crunch, which everyone is familiar with: the, sort of, PagerDuty-style experience of my phone is going off and I have a customer-facing incident. Where is my problem? What do I need to do? Which dashboard do I need to look at? Which tool do I need to investigate? And that's where I think the observability industry has ended up not serving the outcomes of its customers.

Corey: I had a, well, I wouldn't say it's a genius plan, but it was a passing fancy: I've built this online, freely available Twitter client for authoring Twitter threads—because that's what I do instead of having a social life—and it's available at lasttweetinaws.com. I've used that as a testbed for a few things. It's now deployed to roughly 20 AWS regions simultaneously, and this means that I have a bit of a problem as far as how to figure out not even what's wrong or what's broken with this, but who's even using it? Because I know people are. I see invocations all over the planet that are not me. And sometimes it appears to just be random things crawling the internet—fine, whatever—but then I see people logging in and doing stuff with it. I'd kind of like to log and see who's using it, just so I can get information like, is there anyone I should talk to about what it could be doing differently? I love getting user experience reports on this stuff. And I figured, ah, this is a perfect little toy application. It runs in a single Lambda function, so it's not that complicated. I could instrument this with OpenTelemetry, which then, at least according to the instructions on the tin, means I could send different types of data to different observability tools without having to re-instrument this thing every time I want to kick the tires on something else. That was the promise. And this led to three weeks of pain, because it appears that for all of the promise that it has, OpenTelemetry, particularly in a Lambda environment, is nowhere near ready for being able to carry a workload like this. Am I just foolish on this? Am I stating an unfortunate reality that you've noticed in the OpenTelemetry space? Or—let's be clear here, you do work for a company with opinions on these things—is OpenTelemetry the wrong approach?

Ian: I think OpenTelemetry is absolutely the right approach. To me, the promise of OpenTelemetry for the individual is, “Hey, I can go and instrument this thing, as you said, and I can go and send the data wherever I want.” The sort of larger view of that is, “Well, I'm no longer beholden to a vendor,”—including the ones that I've worked for, including the one that I work for now—“for the definition of the data. I am able to control that, I'm able to choose that, I'm able to enhance that, and any effort I put into it, it's mine.
I own that.” Whereas previously, if you picked, say, an APM vendor, and you said, “Oh, I want to have some additional aspects of my information provided; I want to track my customer, or I want to track a particular new metric of how many dollars I'm transacting,” that effort really went to support the value of that individual solution. It's not going to support your outcomes, which is: I want to be able to use this data wherever I want, wherever it's most valuable. So, the core premise of OpenTelemetry, I think, is great. I think it's a massive undertaking to be able to do this for at least three different data types, right? Defining an API across a whole bunch of different languages, across three different data types, and then creating implementations for those. Because the implementations are the thing that people want, right? You are hoping for the ability to, say, drop in something—maybe one line of code, or preferably just, like, attach a dependency, let's say in Java-land, at runtime—and be able to have the information flow through and have it complete. And this is the premise of, you know, vendors I've worked with in the past, like New Relic. That was what New Relic was built on: the ability to drop in an agent and get visibility immediately. So, having that out-of-the-box visibility is obviously a goal of OpenTelemetry where it makes sense—in Go, it's very difficult to attach things at runtime, for example—but then saying, well, beyond whatever is provided—let's say your gRPC connections, database, all these things—now I want to go and instrument; I want to add some additional value. As you said, maybe you want to track something like, I want to have in my traces the email address of whoever it is, or the Twitter handle of whoever it is, so I can then go and analyze that stuff later. You want to be able to inject that piece of information or that instrumentation and then decide: well, where is it best utilized? Is it best utilized in some tooling from AWS? Is it best utilized in something that you've built yourself? Is it best utilized in an open-source project? Is it best utilized in one of the many observability vendors? Or, as is becoming more common, do I want to shove everything in a data lake and run, sort of, analysis asynchronously, overlaying observability data for essentially business purposes? All of those things are served by having a very robust, open-source standard and a simple-to-implement way of collecting a really good baseline of data, and then making it easy for you to enhance that while still owning it. Essentially, it's your IP, right? The instrumentation is your IP, whereas in the old world of proprietary agents and proprietary APIs, you were basically building that IP, but it was tied to that other vendor that you were investing in.

Corey: One thing that I was consistently annoyed by in my days of running production infrastructures at places like, you know, large banks, for example: one of the problems I kept running into is that there's this idea that, “Oh, you want to use our tool. Just instrument your applications with our libraries or our instrumentation standards.” And it felt like I was constantly doing and redoing a lot of instrumentation for different aspects. It's not that we were replacing one vendor with another; it's that in an observability toolchain, there are remarkably few one-size-fits-all stories.
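As a sketch of Ian's "your instrumentation is your IP" point: the enrichment lives in your code, and the backend is just exporter configuration. The endpoint and attribute names below are placeholders, not anyone's real setup:

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The only vendor-specific line: where the OTLP data goes. Swap the endpoint
# (or set OTEL_EXPORTER_OTLP_ENDPOINT) and nothing below needs to change.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="collector.example.com:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("charge-card") as span:
    # Business enrichment like this is the effort you keep when you switch
    # backends, instead of redoing it per proprietary agent.
    span.set_attribute("user.twitter_handle", "@example")  # illustrative name
    span.set_attribute("transaction.amount_usd", 42.50)    # illustrative name
```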
It feels increasingly like everyone's trying to sell me a multifunction printer, which does one thing well, and a few other things just well enough to technically say they do them, but badly enough that I get irritated every single time. And having 15 different instrumentation packages in an application—that's got security ramifications, for one (see: large bank), and for another, it became this increasingly irritating and obnoxious process where it felt like I was spending more time on the care and feeding of the instrumentation than I was on the application itself. That's the goal—that's, I guess, the ideal light at the end of the tunnel for me in what OpenTelemetry is promising. Instrument once, and then you're just adjusting configuration as far as where to send it.

Ian: That's correct. The organizations—and, you know, I keep in touch with a lot of companies that I've worked with—that have really invested heavily in OpenTelemetry in the last two years are definitely getting to the point now where they're generating the data once, they're using, say, pieces of the OpenTelemetry pipeline, they're extending it themselves, and then they're able to shove that data in a bunch of different places. Maybe they're putting it in a data lake for, as I said, business analysis purposes or forecasting. They may even be putting the data into two different systems for incident and analysis purposes, but you're not having that duplication of effort. Also, potentially, that performance impact, right, of having two different instrumentation packages lined up with each other.

Corey: There is a recurring theme that I've noticed in the observability space that annoys me to no end. And that is—I don't know if it's coming from investor pressure, from folks never being satisfied with what they have, or what it is—but there are so many startups that I have seen and worked with in varying aspects of the observability space that I think, “This is awesome. I love the thing that they do.” And invariably, every time, they start getting more and more features bolted onto them. Hey, you love this whole thing that winds up just basically doing a tail -f on a log file, so it just streams your logs from the application and you can look for certain patterns? I love this thing. It's great. Oh, what's this? Now, it's trying to also be the thing that alerts me and wakes me up in the middle of the night. No. That's what PagerDuty does. I want PagerDuty to do that thing, and I want other things—I want you just to be the log analysis thing and the way that I contextualize logs. And it feels like they keep bolting things on and bolting things on, where everything is more or less trying to evolve into becoming its own version of Datadog. What's up with that?

Ian: Yeah, the sort of dreaded platform play. I—[laugh]—I was at New Relic when there were essentially two products that they sold. And then by the time I left, I think there were seven different products being sold, which is kind of a crazy, crazy thing when you think about it. And I think Datadog has definitely exceeded that now. And I definitely see many, many vendors in the market—and even open-source solutions—sort of presenting themselves as, like, this integrated experience. But to your point, even before, about your experience at these banks, it oftentimes becomes sort of a tick-a-box feature approach of, “Hey, I can do this thing, so buy more. And here's a shared navigation panel.” But are they really integrated?
Like, are you getting real value out of it? One of the things that I do in my role is I get to work with our internal product teams very closely, particularly around new initiatives like tracing functionality, and the constant sort of conversation is, “What is the outcome? What is the value?” It's not about the feature; it's not about having a list of 19 different features. It's, “What is the user able to do with this?” And so, for example, there are lots of platforms that have metrics, logs, and tracing. The new one-upmanship is saying, “Well, we have events as well. And we have incident response. And we have security. And all these things sort of tie together, so it's one invoice.” And constantly I talk to customers, and I ask them, “Hey, what are the outcomes that you're getting when you've invested so heavily in one vendor?” And oftentimes, the response is, “Well, I only need to deal with one vendor.” Okay, but that's not an outcome. [laugh]. That's like the business having a single invoice.

Corey: Yeah, that is something that's already attainable today. If you want to just have one vendor with a whole bunch of crappy offerings, that's what AWS is for. They have AmazonBasics versions of everything you might want to use in production. Oh, you want to go ahead and use MongoDB? Well, use AmazonBasics MongoDB—but they call it DocumentDB, because of course they do. And so on, and so forth. There are a bunch of examples of this, but those companies are still in business and doing very well, because people often want the genuine article. If everyone was trying to do just everything to check a box for procurement, great. AWS has already beaten you at that game, it seems.

Ian: I do think that, you know, people are hoping for that greater value and those greater outcomes, so being able to actually provide differentiation in that market I don't think is terribly difficult, right? There are still huge gaps in, let's say, root cause analysis during investigation time. There are huge issues with vendors who don't think beyond, sort of, just the one individual who's looking at a particular dashboard or looking at whatever analysis tool there is. So, getting those things actually tied together—it's not just, “Oh, we have metrics, and logs, and traces together,” but even if you say we have metrics and tracing, how do you move between metrics and tracing? One of the goals in the way that we're developing product at Chronosphere is that if you are alerted to an incident—you as an engineer; it doesn't matter whether you are massively sophisticated, you're a lead architect who has been with the company forever and you know everything, or you're someone who's just come out of onboarding and it's your first time on call—you should not have to think, “Is this a tracing problem, or a metrics problem, or a logging problem?” And this is one of those things that I mentioned before, of requiring that really heavy level of knowledge and understanding about the observability space and your data and your architecture to be effective.
And so, with the observability teams and all of the engineers that I speak with on a regular basis, you get this sort of circumstance where—well, I guess, let's talk about a real outcome and a real pain point, because people are like, “Okay, yeah, this is all fine; it's all coming from a vendor who has a particular agenda.” But the thing that constantly resonates is this: for large organizations that are moving fast—you know, big startups, unicorns, or even more traditional enterprises that are trying to undergo, like, a rapid transformation and go really cloud-native and make sure their engineers are moving quickly—a common question I will talk about with them is, who are the three people in your organization who always get escalated to? And it's usually, you know, between two and five people—

Corey: And you can almost pick those perso—you say that and you can—at least anyone who's worked in environments or through incidents like this more than a few times will already have thought of specific people in specific companies. And they almost always fall into some very predictable archetypes. But please, continue.

Ian: Yeah. And when people think about these people, they always jump to mind. And one of the things I ask about is, “Okay, so when you did your last innovation around observability”—it's not necessarily buying a new thing; maybe it was, like, introducing a new data type, or you were doing some big investment in improving instrumentation—“what changed about their experience?” And oftentimes, the most that can come out is, “Oh, they have access to more data.” Okay, that's not great. It's like, “What changed about their experience? Are they still getting woken up at 3 am? Are they constantly getting pinged all the time?” At one of the vendors that I worked at, when the platform would go down, there were three engineers in the company who were capable of generating a list of the customers who were actually impacted by the damage. And so, every single incident, one of those three engineers got paged into the incident. And it became borderline intolerable for them, because nothing changed. And it got worse, you know? The platform got bigger and more complicated, and so there were more incidents, and they were the ones having to generate that list. But from a business level, from an observability-outcomes perspective, if you zoom all the way up, it's like, “Oh, were we able to generate the list of customers?” “Yes.” And this is where I think the observability industry has sort of gotten stuck—you know, at least one of the ways—is that, “Oh, can you do it?” “Yes.” “But is it effective?” “No.” And by effective, I mean those three engineers become the focal point for an organization. And when I say three—you know, two to five—it doesn't matter whether you're talking about a team of a hundred or you're talking about a team of a thousand. It's always the same number of people. And as you get bigger and bigger, it becomes more and more of a problem. So, does the tooling actually make a difference to them? And you might ask, “Well, what do you expect from the tooling? What do you expect it to do for them?” Is it that you give them deeper analysis tools? Is it, you know, that you do AIOps? No. The answer is: how do you take the capabilities that those people have, and how do you spread them across a larger population of engineers? And that, I think, is one of those key outcomes of observability that no one, whether it be on the open-source or the vendor side, is really paying a lot of attention to.
It's always about, like, “Oh, we can just shove more data in. By the way, we've got petabyte scale, and we can deal with, you know, 2 billion active time series,” and all these other sorts of vanity measures. But we've gotten really far away from the outcomes. It's like, “Am I getting a return on investment from my observability tooling?” And I think tracing is this—as you've said, it can be difficult to reason about, right? And people are not sure. They're feeling, “Well, I'm in a microservices environment; I'm in cloud-native; I need tracing because my older APM tools appear to be failing me. I'm just going to go and wriggle my way through implementing OpenTelemetry”—which has significant engineering costs. I'm not saying it's not worth it, but there is a significant engineering cost—“and then I don't know what to expect, so I'm going to go and send my data somewhere and see whether we can achieve those outcomes. And I do a pilot, and my most sophisticated engineers are in the pilot. And they're able to solve the problems. Okay, I'm going to go buy that thing.” But I've just transferred my problems. My engineers have gone from solving problems in maybe logs—grepping through petabytes' worth of logs—to using some sort of complex proprietary query language to go through tens of petabytes of trace data, but I actually haven't solved any problem. I've just moved it around, and probably just cost myself a lot, both in terms of engineering time and real dollars spent as well.

Corey: One of the challenges that I'm seeing across the board is that with observability, once you start to see what it is and its potential for certain applications—certainly not all; I want to hedge that a little bit—it's clear that there is definite and distinct value versus other ways of doing things. The problem is that the value often becomes apparent only after you've already done it and can see what that other side looks like. But let's be honest here. Instrumenting an application is going to take some significant level of investment, in many cases. How do you wind up viewing any return on investment against the very real cost, if only in people's time, of going ahead and instrumenting for observability in complex environments?

Ian: So, I think that you have to look at the fundamentals, right? You have to look at—pretend we knew nothing about tracing. Pretend that we had just invented logging, and you needed to start small. It's like, I'm not going to go and log everything about every application that I've had forever. What I need to do is find the points where that logging is going to be the most useful, most impactful, across the broadest audience possible. And one of the useful things about tracing is that because it's built in, and primarily for, distributed environments, you can look at, for example, the biggest intersection of requests. A lot of people have things like API gateways, or they have parts of a monolith which are still handling a lot of request routing; those tend to be areas to start digging into. And I would say that, just like for anyone who's used Prometheus or decided to move away from Prometheus, no one's ever gone and evaluated a Prometheus solution without having some sort of Prometheus data, right?
You don't go, “Hey, I'm going to evaluate a replacement for Prometheus or my StatsD without having any data, and I'm simultaneously going to generate my data and evaluate the solution at the same time.” It doesn't make any sense. With tracing, you have decent open-source projects out there that allow you to visualize individual traces and understand, sort of, the basic value you should be getting out of this data. So, it's a good starting point to go, “Okay, can I reason about a single request? Can I go and look at my request end-to-end, even in a relatively small slice of my environment, and can I see the potential for this? And can I think about the things that I need to be able to solve with many traces?” Once you start developing these ideas, then you can have a better idea of, “Well, where do I go and invest more in instrumentation? Look, databases never appear to be a problem, so I'm not going to focus on database instrumentation. The real problem is my external dependencies. The Facebook API is the one that everyone loves to use. I need to go instrument that.” And then you start to get more clarity. Tracing has this interesting network effect. You can basically just follow the breadcrumbs. Where is my biggest problem here? Where are my errors coming from? Is there anything else further down the call chain? And you can sort of take that exploratory approach rather than doing everything up front. But it is important to do something before you start trying to evaluate what your end state is—end state obviously being a sort of nebulous term in today's world, but: where do I want to be in two years' time? I would like to have a solution. Maybe it's an open-source solution, maybe it's a vendor solution, maybe it's one of those platform solutions we talked about, but how do I get there? It's really going to be: I need to take an iterative approach, and I need to be very clear about the value and outcomes. There's no point in putting a whole bunch of instrumentation effort into things that are just working fine, right? You want to go and focus your time and attention on the actual problems. And also, you don't want to go and burn out individual engineers. The observability team's purpose in life is probably not to just write instrumentation or just deploy OpenTelemetry, because then we get back into the land where engineers themselves know nothing about the monitoring or observability they're doing, and it just becomes a checkbox of, “I dropped in an agent. Oh, when it comes time for me to actually deal with an incident, I don't know anything about the data, and the data is insufficient.” So, a level of ownership, supported by the observability team, is really important. On that return on investment, though: it's not just the instrumentation effort. There's product training, and there are some very hard costs. People oftentimes think, “Well, I have the ability to pay a vendor; that's really the only cost that I have.” There are things like egress costs, particularly at high volumes of data. There are the infrastructure costs. A lot of the time, there will be elements you need to run in your own environment; those can be very costly as well, and ultimately, they're sort of icebergs in this overall ROI conversation. The other side of it—you know, return and investment—is the return: there's a lot of difficulty in reasoning about, as you said, what the value of this is going to be if I go through all this effort.
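As one illustration of the "follow the breadcrumbs" approach, here is what instrumenting a single suspect external dependency might look like. The URL and attribute names are invented for the example, and a configured TracerProvider is assumed:

```python
import urllib.request
from opentelemetry import trace

tracer = trace.get_tracer("api-gateway")

def fetch_profile(user_id: str) -> bytes:
    # Wrap only the dependency the traces point at; the databases that
    # "never appear to be a problem" can stay uninstrumented for now.
    with tracer.start_as_current_span("external.social-api.get_profile") as span:
        span.set_attribute("user.id", user_id)               # illustrative
        url = f"https://api.example.com/profiles/{user_id}"  # placeholder URL
        with urllib.request.urlopen(url) as resp:
            span.set_attribute("http.status_code", resp.status)
            return resp.read()
```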
Everyone knows, sort of, the meme or archetype of, “Hey, here are three options; pick two, because there's always going to be a trade-off.” Particularly for observability, it's become an element of: I need to pick between performance, data fidelity, or cost. Pick two. And when I say data fidelity—particularly in tracing—I'm talking about the ability to not sample, right? If you have edge cases, if you have narrow use cases and ways you need to look at your data, and you heavily sample, you lose data fidelity. But oftentimes, cost is the reason why you do that. And then, obviously, performance suffers as you start to get bigger and bigger datasets. So, there are a lot of different things you need to balance on that return. As you said, oftentimes you don't get to understand the magnitude of those until you've got the full data set in and you're trying to do this, sort of, for real. But being prepared and iterative as you go through this effort, and not saying, “Okay, well, I'm just going to buy everything from one vendor because I'm going to assume that's going to solve my problem,” is probably the undercurrent there.

Corey: As I take a look across the entire ecosystem, I can't shake the feeling—and my apologies in advance if this is an observation, I guess, that winds up throwing a stone directly at you folks—

Ian: Oh, please.

Corey: But I see that there's a strong observability community out there that is absolutely aligned with the things I care about and the things I want to do, and then there's a bunch of SaaS vendors, where it seems that they are, in many cases, yes, advancing the state of the art—I am not suggesting for a second that money is making observability worse. But I do think that when the tool you sell is a hammer, then every problem starts to look like a nail—or, in my case, like my thumb. Do you think that there's a chance that SaaS vendors are in some ways making this entire space worse?

Ian: As we've sort of gone into more cloud-native scenarios, and people are building things specifically to take advantage of cloud—from a complexity standpoint, from a scaling standpoint—you start to get, like, vertical issues happening. So, you have things like: we're going to charge on a per-container basis; we're going to charge on a per-host basis; we're going to charge based off the number of gigabytes that you send us. These are sort of more horizontal pricing models, and the way the SaaS vendors have delivered this is they've made it pretty opaque, right? Everyone has experience with, or has heard horror stories about, overages from observability vendors—massive spikes. I've worked with customers who have accidentally used some features and been billed a quarter million dollars on a monthly basis for accidental overages from a SaaS vendor. And these are all terrible things. But we've gotten used to this. Like, we've just accepted it, right, because everyone is operating this way. And I really do believe that the move to SaaS was one of those things. Like, “Oh, well, you're throwing us more data, and we're charging you more for it.” As a vendor—

Corey: Which sort of erodes your own value proposition that you're bringing to the table. I mean, I don't mean to be sitting over here shaking my fist yelling, “Oh, I could build a better version in a weekend,” except that I absolutely know how to build a highly available Rsyslog cluster. I've done it a handful of times already, and the technology is still there.
Compare and contrast that with, at scale, the fact that I'm paying 50 cents per gigabyte ingested to CloudWatch Logs, or a multiple of that for a lot of other vendors. It's not that much harder for me to scale that fleet out and pay a much smaller marginal cost.

Ian: And so, I think the reaction that we're seeing in the market—we're starting to see the rise of, sort of, a secondary class of vendor. And by secondary, I don't mean that they're lesser; I mean that they're, sort of, specifically trying to address problems of the primary vendors, right? Everyone's aware of vendors who are attempting to reduce—well, let's take the example you gave on logs, right? There are vendors out there whose express purpose is to reduce the cost of your logging observability. They just sit in the middle; they are a middleman, right? Essentially: hey, use our tool, and even though you're going to pay us a whole bunch of money, it's going to generate an overall return that is greater than if you had just continued pumping all of your logs over to your existing vendor. So, that's great. What we think really needs to happen—and one of the things we're doing at Chronosphere (unfortunate plug)—is we're actually building those capabilities into the solution, so it's actually end-to-end. And by end-to-end, I mean a solution where I can ingest my data, I can preprocess my data, I can store it, query it, visualize it, all those things, aligned with open-source standards, but I have control over that data, and I understand what's going on, particularly with my cost and my usage. I don't just get a bill at the end of the month going, “Hey, guess what? You've spent an additional $200,000.” Instead, I can know in real time what is happening with my usage. And I can attribute it: it's this team over here, and it's because they added this particular label. And here's a way for you, right now, to address that and cap it so it doesn't cost you anything, and it doesn't have a blast radius of, you know, maybe degraded performance or degraded fidelity of the data. That, though, is diametrically opposed to the way that most vendors are set up. And unfortunately, the open-source projects tend to take a lot of their cues, at least recently, from what's happening in the vendor space. One of the ways that you can think about it is as a sort of speed-of-light problem. Everyone knows that there's basic, fundamental latency; everyone knows how fast disk is; everyone knows that you can't just make your computations happen magically; there's a cost to running things horizontally. But a lot of the way that the vendors have presented efficiency to the market is, “Oh, we're just going to incrementally get faster as AWS gets faster. We're going to incrementally get better as compression gets better.” And of course, you can't go and fit a petabyte's worth of data into a kilobyte, unless you're really just doing some sort of weird dictionary stuff, so you're dealing with some fundamental constraints, and the vendors just go, “I'm sorry, you know, we can't violate the speed of light.” But what you can do is start taking a look at how the data is valuable, and start giving people controls on how to make it more valuable. So, one of the things that we do at Chronosphere is we allow you to reshape Prometheus metrics, right?
You go and express Prometheus metrics—let's say it's a business metric about how many transactions you're doing as a business—and you don't need that on a per-container basis, particularly if you're running 100,000 containers globally. When you go and take a look at that number on a dashboard, or you alert on it, what is it? It's one number, one time series. Maybe you break it out per region. You have five regions; you don't need 100,000 data points every minute behind that. It's very expensive, it's not very performant, and as we talked about earlier, it's very hard to reason about as a human being. So, by giving you the tools to go and condense that data down and make it more actionable and more valuable, you get performance, you get cost reduction, and you get the value that you ultimately need out of the data. And it's one of the reasons why, I guess, I work at Chronosphere—which I'm hoping is the last observability [laugh] venture I ever work for.

Corey: Yeah, for me, a lot of the data that I see in my logs—which is where a lot of this stuff starts, and how I still contextualize these things—is nonsense that I don't care about and will never care about. I don't care about load balancer health checks. I don't particularly care about 200 results for the favicon when people visit the site. I care about other things, but just weed out the crap, especially when I'm paying by the pound—or at least by the gigabyte—in order to get that data into something. Yeah. It becomes obnoxious and difficult to filter out.

Ian: Yeah. And the vendors just haven't done any of that, because why would they, right? If you went and reduced the amount of log—

Corey: Put engineering effort into something that reduces how much I can charge you? That sounds like lunacy. Yeah.

Ian: Exactly. Their business models are entirely based off it. So, if you went and reduced everyone's logging bill by 30%—or everyone's logging volume by 30%, and reduced the bills by 30%—it's not going to be a great time if you're a publicly traded company who has built your entire business model on an essentially very SaaS, volume-driven—and, in my eyes, relatively exploitative—pricing and billing model.

Corey: Ian, I want to thank you for taking so much time out of your day to talk to me about this. If people want to learn more, where can they find you? I mean, you are a Field CTO, so clearly you're outstanding in your field. But, assuming that people don't want to go to farm country, where's the best place to find you?

Ian: Yeah. Well, it'll be a bunch of different conferences. I'll be at KubeCon this year. But chronosphere.io is the company website. I've had the opportunity to talk to a lot of different customers—not from a hard-sell perspective, but, you know, conversations like this about what the real problems are that you're having, and what the things are that you sort of wish that you could do. One of the favorite things that I get to ask people is, “If you could wave a magic wand, what would you love to be able to do with your observability solution?” That's, A, a really great part, but oftentimes, B, being able to say, “Well, actually, that thing you want to do—I think I have a way to accomplish that,” is a really rewarding part of this particular role.

Corey: And we will, of course, put links to that in the show notes. Thank you so much for being so generous with your time. I appreciate it.

Ian: Thanks, Corey. It's great to be here.

Corey: Ian Smith, Field CTO at Chronosphere on this promoted guest episode.
I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment, which is going to be super easy in your case, because it's just one of the things that the omnibus observability platform that your company sells offers as part of its full suite of things you've never used.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Cloud Security and Cost with Anton Chuvakin

Screaming in the Cloud

Play Episode Listen Later Aug 2, 2022 35:47


About Anton
Dr. Anton Chuvakin is now involved with security solution strategy at Google Cloud, where he arrived via the Chronicle Security (an Alphabet company) acquisition in July 2019. Anton was, until recently, a Research Vice President and Distinguished Analyst on Gartner for Technical Professionals' (GTP) Security and Risk Management Strategies team. (See chuvakin.org for more.)

Links Referenced:
Google Cloud: https://cloud.google.com/
Cloud Security Podcast: https://cloud.withgoogle.com/cloudsecurity/podcast/
Twitter: https://twitter.com/anton_chuvakin
Medium blog: https://medium.com/anton.chuvakin

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friend EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL: on-premises, private cloud, and they just announced a fully-managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor, because they're too busy launching another half-dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money; they can even help you migrate legacy applications—including Oracle—to the cloud. To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.

Corey: Let's face it, on-call firefighting at 2am is stressful! So, there's good news and there's bad news. The bad news is that you probably can't prevent incidents from happening, but the good news is that incident.io makes incidents less stressful and a lot more valuable. incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues, and learn from incident insights to improve site reliability and fix your vulnerabilities. Try incident.io: recover faster and sleep more.

Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. My guest today is Anton Chuvakin, who is a Security Strategy Something at Google Cloud. And I absolutely love the title, given, honestly, how anti-corporate it is in so many different ways. Anton, first, thank you for joining me.

Anton: Sure. Thanks for inviting me.

Corey: So, you wound up working somewhere else—according to LinkedIn—for two months, which in LinkedIn time is about 20 minutes, because their date math is always weird. And then you wound up—according to LinkedIn, of course—leaving and going to Google. Now, that was an acquisition, if I'm not mistaken, correct?

Anton: That's correct, yes. And it kind of explains that timing, and a little bit of the title story, because my original title was Head of Security Solution Strategy, and it was for a startup called Chronicle. And within actually three weeks, if I recall correctly, I was acquired into Google.
So, the title really made little sense at Google, so I kind of go with, like, random titles that include the word security, and occasionally strategy if I feel generous.

Corey: It's pretty clear the fastest way to get hired at Google, given their famous interview process, is to just get acquired. Like, “I'm going to start a company and raise it to, like, a little bit of prominence, and then do an acquihire, because that will be faster than going through the loop, and ideally, there will be less algorithm solving on whiteboards.” But I have to ask, did you have to solve algorithms on whiteboards for your role?

Anton: Actually, no, but it did come close to that for some other people who were seen as non-technical and had to join technical roles. I think they were forced to solve coding questions and stuff, but I was somehow grandfathered into a technical role. I don't know exactly how it happened.

Corey: Yeah, somehow you wound up in a technical role. Let's be clear, you are Doctor Anton Chuvakin, and you have written multiple books, you were a Research VP at Gartner for many years, and once upon a time, that was sort of a punchline in the circles I hung out with—and then I figured out what Gartner actually does. And okay, that actually is something fairly impressive, let's be clear here. Even as someone who categorically defines himself as not an analyst, I find myself increasingly having a lot of respect for the folks who are actually analysts and the laborious amount of work that they do that remarkably few people understand.

Anton: That's correct. And I don't want to boost my ego too much—it's kind of big enough already, obviously—but I actually made it all the way to Distinguished Analyst, which is the next rank after VP.

Corey: Ah, my apologies. I did not realize it. This [challenges 00:02:53] the internal structure.

Anton: [laugh]. Yeah.

Corey: It's like, “Oh, I went from Senior to Staff,” or Staff to Senior, because I'm external; I don't know the direction these things go in. It almost feels like a half-step away from, oh, I went from [SDE3 to SDE4 00:03:02]. It's like, what do those things mean? Nobody knows. Great.

Anton: And what's the top? Is it 17 or is it 113? [laugh].

Corey: Exactly. It's like, oh okay, so you're Research VP—or various kinds of VPs—the real question is, how many people have to die before you're the president? And it turns out that that's not how companies think. Who knew?

Anton: That's correct. And I think Gartner was a lot of hard work. And it's the type of work that a lot of people actually don't understand. Some people understand it wrong, and some people understand it wrong, kind of, for corrupt reasons. So, for example, a lot of the Gartner machinery involves soaking up insight from the outside world, organizing it, packaging it, writing it, and then giving it as advice to other people. So, there's nothing offensive about that, because there is a lot of insight in the outside world, and somebody needs to be a sponge slash filter slash enrichment facility for that insight. And that, to me, is a good analyst firm, like Gartner.

Corey: Yeah. It's a very interesting world. But you historically have been doing a lot of, well—I don't even know how to properly describe it, because Gartner's clientele historically has not been startups, because, let's face it, Gartner is relatively expensive. And let's be clear, you're at Google Cloud now, which is a different kind of expensive, but in a way that works for startups, so good for you; gold star.
But what was interesting there is that the majority of the Gartner clientele that I've spoken to tend to be big-E Enterprise, which runs legacy businesses—which is a condescending engineering term for ‘it makes money.' And they had the temerity to start their company more than 15 years ago, so they built data centers and did things in a data center environment, and now they're moving in a cloudy direction. Your emphasis has always been on security, so my question to start with all this is: where do you see security vendors fitting in? Because when I walk the RSA expo hall and find myself growing increasingly depressed, it seems like an awful lot of what vendors are selling looks very little removed from, “We took a box, now we shoved it into a virtual machine, and here you go; it's in your cloud environment. Please pay us money.” The end. And it feels, if I'm looking at this from a pure cloud-native, how-I-would-build-things-in-the-cloud-from-scratch perspective, to be the wrong design. Where do you stand on it?

Anton: So, this has been one of the agonizing questions. So, I'm going to kind of ignore some of the context. Of course, I'll come back to it later, but I want to kind of frame it—

Corey: I love ignoring context. My favorite thing; it's what makes me a decent engineer some days.

Anton: So, the frame was this. One of the more agonizing questions for me as an analyst was: a client calls me and says, “We want to do X.” Deep in my heart, I know that X is absolutely wrong; however, given their circumstances and how they came to decide to do X, X is perhaps the only thing they can logically do. So, do you tell them, “Don't do X; X is bad,” or do you tell them, “Here's how you do X in a manner that aligns with your goals, that's possible, that's whatever”? So, cloud comes up a lot in this case. Somebody comes and says, “I want to put my on-premise security information management tool, or SIM, in the cloud.” And deep in my heart, I say, “No, get a cloud-native tool.” But I tell them, “Okay, actually, here's how you do it in a less painful manner.” So, this is always hard. Do you tell them they're on the wrong path, or do you help them tread their own path with the least pain? So, as an analyst, I agonized over that. This was almost like a moral decision. What do I tell them?

Corey: It makes sense. It's a microcosm of the architect's dilemma, on some level, because if you ask a typical Google-style interview whiteboard question—one of my favorites in years past was ‘build a URL shortener'—great, you can scale it out and turn it into different things and design things on the whiteboard, and that's great. Most mid-level people can wind up building a passable design for most things in a cloud sense, when you're starting from scratch. That's not hard. The problem is that the real world is messy and doesn't fit on a whiteboard. And when you're talking about taking a thing that exists in a certain state—for whatever reason, that's the state that it's in—and migrating it to a new environment or a new way of operating, there are so many assumptions that have to break, and in most cases, you don't get the luxury of just taking the thing down for 18 months so you can rework it. And even then, it's never as easy as people think it is, so it's going to be 36. Great. You have to wind up meeting people where they are as they're contextualizing these things. And I always feel like the first step of the cloud migration has been to improve your data center environment at the cost of worsening your cloud environment.
And that's okay. We don't all need to be the absolute vanguard of how everything should be built, pushing the bleeding edge. You're an insurance company, for God's sake. Maybe that's not where you want to spend your innovation energies.

Anton: Yeah. And that's why I tend to lean towards helping them get out of this situation, or maybe building a five-step roadmap of how to become a little bit more cloud-native, rather than telling them, “You're wrong. You should just rewrite the app in a cloud-native way.” That advice almost never actually works in the real world. So, I see a lot of the security people move their security stacks to the cloud. And if I see this, deep in my heart I say, “Holy cow. What do you mean, you want to IDS every packet between cloud instances? You want to capture every packet on cloud instances? Why? It's all encrypted anyway.” But I don't say that. I say, “Okay, I see how this is the first step for you. Let's describe the next seven steps.”

Corey: The problem I keep smacking into is that very often the folks who are pushing a lot of these solutions are, yes, meeting customers where they are, and that makes an awful lot of sense; I'm not saying that there's anything inherently wrong about that. The challenge is, it also feels, on the high end, when those customers start to evolve and transform, that those vendors act as a drag. Because if you wind up going in a full-on cloud-native approach, in the fullness of time, there's an entire swath of security vendors that do not have anything left to sell you.

Anton: Yes, that is correct. And I think that—I had a fight with an EDR (Endpoint Detection and Response) vendor one day when they said, “Oh, we're going to be XDR and we'll do cloud.” And I told them, “You do realize that in a true cloud-native environment, there's no E? There is no endpoint the way you understand it. There is no OS. There is no server. And 99% of your IP is in working on clients and servers. How are you going to secure a cloud, again?” And I got some kind of rambling answer from them, but the point is that you're right: I do see a lot of vendors that meet clients where they are during their first step in the cloud, and then they may become a drag, or the customer has to switch to a cloud-native vendor—or use both sometimes, and pay into two mouths. Well, shove money into two pockets.

Corey: Well, first, I just want to interject for a second here, because when I was walking the RSA expo floor, there were something like 15 different vendors that were trying to sell me XDR. Not a single one of them bothered to expand the acronym—

Anton: Just 15? You missed half of them.

Corey: Well, yeah—

Anton: Holy cow.

Corey: As far as I know, XDR is a cable. It's an audio thing, right? I already have a bunch of those for my microphone. What's the deal here? Like, “I believe that's XLR.” It's like, “I believe you should expand your acronyms.” What is XDR?

Anton: So, this is where I'm going to be very self-serving and point to a blog that I've written that says we don't know what XDR is. And I'm going to—

Corey: Well, but rather than a spiritual meaning, I'm going to ask, what does the acronym stand for? I don't actually know the answer to that.

Anton: Extended Detection and Response.

Corey: Ah.

Anton: Extended Detection and Response. But the word ‘extended' is extended by everybody in different directions. There are multiple camps of opinion. Gartner argues with Forrester.
If they ever had a pillow fight, it would look really ugly, because they just don't agree on what XDR is. Many vendors don't agree with many other vendors, so at this point, if you cornered me and said, “Anton, commit to a definition of XDR,” I would not. I would just say, “TBD. Wait two years.” We don't have a consensus definition of XDR at this point. And RSA notwithstanding—30 booths with XDR on their big signs—still, sorry, I don't have it.

Corey: The problem that I keep running into, again and again and again, has been pretty consistently that there are vendors willing to help customers in a very certain position, and for those customers, those vendors are spot-on the right thing to do.

Anton: Mmm, yep.

Corey: But then they try to expand, and instead of realizing that the market has moved on and that the market they're serving is inherently limited and, long-term, is going to be in decline, they instead start trying to fight the tide and saying, “Oh, no, no, no, no. Those new cloud things, can't trust them.” And they start out with the FUD—the Fear, Uncertainty, and Doubt—marketing model, where, “You can't trust those newfangled cloud things. You should have everything on-prem,” ignoring entirely the fact that in their existing data centers, half the time the security team forgets to lock the door.

Anton: Yeah, yeah.

Corey: It just feels like there is so much conflict of interest in the space. I mean, that's the reason I started my Thursday Last Week in AWS newsletter that does security round-ups: just because everything else I found was largely either community-driven, where it understood that it was an InfoSec community thing—and the InfoSec community is generally toxic—or it was vendor-captured. And I wanted a round-up of things that I had to care about running an infrastructure, where security is not in my job title, even if the word itself is or is not there. It's—I have a job to do that isn't security full time; what do I need to know? And that felt like an underserved market, and I feel like there's no equivalent of that in the world of the emerging cloud security space.

Anton: Yes, I think so. But it has a high chance of also being kind of captured by legacy vendors. So, when I was at Gartner, there were a lot of acronyms being made that started with a C: Cloud. There was CSPM, there was CWPP, and after I left, they coined CNAPP, with a double P at the end: Cloud-Native Application Protection Platform. And you know, in my time at Gartner, five-letter acronyms were definitely not very popular. Like, you shouldn't do a five-letter acronym if you can help it. So, my point is that a lot of these vendors are more on the legacy side. They are not born in the cloud. They were born in the 1990s. Some are born in the cloud, but it's a mix. So, the same acronym may apply to a vendor that's from 2019, or—wait for it—1989.

Corey: That is… well, I'd say on the one hand, it's terrifying, but on the other, it's not that far removed from the founding of Google.

Anton: True, true. Well, '89—kind of, it's another ten years. I think that if you're from the '90s, maybe you're okay, but if you're from the '80s… you really need to have superpowers of adaptation. Again, it's possible. Funny aside: at Gartner, I met somebody who was an analyst for 32 years. So, he was, I think, at Gartner for 32 years. And how do you keep your knowledge current if you are always in an ivory tower? The point is that this person did do that, because he had a unique ability to absorb knowledge from the outside world.
You can adapt; it's just hard.

Corey: It always is. I'm going to pivot a bit and put you in a little bit of a hot seat here. Not intentionally so. But it is something that I've been really kicking around for a while. And I'm going to basically focus on Google, because that's where you work. Yeah, I want you to go and mouth off about other cloud companies. Yeah, that's—

Anton: [laugh]. No.

Corey: Going to go super well, and no one will have a problem with that. No, it's… we'll pick on Google for a minute, because Google Cloud offers a whole bunch of services. I think it's directionally the right number of services, because there are areas that you folks do not view as a core competency, and you actually—imagine that—partner with third parties to wind up delivering something great rather than building this shitty knockoff version that no one actually wants. Ahem. I might be subtweeting someone here with this, only out loud.

Anton: [laugh].

Corey: The thing that resonates with me, though, is that you do charge for a variety of security services. My perspective, by and large, is that the cloud vendors should not be viewing security as a profit center, but rather as something that comes baked into the platform that winds up being amortized into the cost of everything else, just because otherwise you wind up with such a perverse set of incentives.

Anton: Mm-hm.

Corey: Does that sound ridiculous, or is that something that aligns with your way of thinking? I'm willing to take criticism that I'm wrong on this, too.

Anton: Yeah. It's not that. It's… I almost start to see some kind of a magic quadrant in my mind that kind of categorizes some things—

Corey: Careful, that's trademarked.

Anton: Uhh, okay. So, some kind of vis—

Corey: It's a mystical quadrilateral.

Anton: Some kind of visual depiction, perhaps including four parts—not quadrants, mind you—that is focused on things that should be paid for and aren't, things that should be paid for and are paid for, and whatever else. So, the point is that if you're charging for encryption—like, basic encryption—you're probably making a mistake. And we don't, and other people, I think, don't as well. If you're charging for logging, then it's probably also wrong—because charging for log retention, keeping logs, perhaps is okay, because ultimately you're spending resources on this—but charging for logging, to me, is kind of in the vile territory. But how about charging for a tool that helps you secure your on-premise environment? That's fair game, right?

Corey: Right. If it's something you're taking to another provider, I think that's absolutely fair. But the idea—and again, I'm okay with the reality of, “Okay, here's our object storage costs for things, and by the way, when you wind up logging things, yeah, we'll charge you directionally what it costs to store that in an object store”—that's great. But I don't have the Google Cloud price list shoved into my head, and I know that over in AWS-land, CloudWatch Logs charges 50 cents per gigabyte for ingest. And the defense is, “Well, that's a lot less expensive than most other logging vendors out there.” It's, yeah, but it's still horrifying, and at scale, it makes me want to do some terrifying things like I used to, which is build out a cluster of Rsyslog boxes and wind up having everything logged to those, because I don't have an unbounded growth problem. This gets worse with audit logs, because there's no alternative available for those.
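To put that 50-cents-per-gigabyte figure in perspective, a back-of-the-envelope sketch; the daily log volume here is invented purely for illustration:

```python
# Hypothetical fleet shipping 2 TB of logs per day at the $0.50/GB
# ingestion price quoted above. The volume is an assumption, not a quote.
GB_PER_DAY = 2_000
PRICE_PER_GB_INGESTED = 0.50

monthly_ingest_cost = GB_PER_DAY * PRICE_PER_GB_INGESTED * 30
print(f"~${monthly_ingest_cost:,.0f}/month before storage or queries")  # ~$30,000
```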
And when companies start charging for that, either on a data plane or a management plane level, that starts to get really, really murky because you can get visibility into what happened and reconstruct things after the fact, but only if you pay. And that bugs me.Anton: That would bug me as well. And I think these are things that I would very clearly push into the box of this is security that you should not charge for. But authentication is free. But, like, deeper analysis of authentication patterns, perhaps costs money. This to me is in the fair game territory because you may have logs, you may have reports, but what if you want some kind of fancy ML that analyzes the logs and gives you some insights? I don't think that's offensive to charge for that.Corey: I come bearing ill tidings. Developers are responsible for more than ever these days. Not just the code that they write, but also the containers and the cloud infrastructure that their apps run on. Because serverless means it's still somebody's problem. And a big part of that responsibility is app security from code to cloud. And that's where our friend Snyk comes in. Snyk is a frictionless security platform that meets developers where they are - Finding and fixing vulnerabilities right from the CLI, IDEs, Repos, and Pipelines. Snyk integrates seamlessly with AWS offerings like code pipeline, EKS, ECR, and more! As well as things you're actually likely to be using. Deploy on AWS, secure with Snyk. Learn more at Snyk.co/scream That's S-N-Y-K.co/screamCorey: I think it comes down to what you're doing with it. Like, the baseline primitives, the things that no one else is going to be in a position to do because honestly, if I can get logging and audit data out of your control plane, you have a different kind of security problem, and—Anton: [laugh].Corey: That is a giant screaming fire in the building, as it should be. The other side of it, though, is that if we take a look at how much all of this stuff can cost, and if you start charging for things that are competitive to other log analytics tools, great because at that point, we're talking about options. I mean, I'd like to see, in an ideal world, that you don't charge massive amounts of money for egress but ingress is free. I'd like to see that normalized a bit.But yeah, okay, great. Here's the data; now I can run whatever analytics tools I want on it and then you're effectively competing on a level playing field, as opposed to, like, okay, this other analytics tool is better, but it'll cost me over ten times as much to migrate to it, so is it ten times better? Probably not; few things are, so I guess I'm sticking with the stuff that you're offering. It feels like the cloud provider security tools never quite hit the same sweet spot that third-party vendors tend to as far as usability, being able to display things in a way that aligns with various stakeholders at those companies. But it still feels like a cash grab and I have to imagine without having insight into internal costing structures, that the security services themselves are not a significant revenue driver for any of the cloud companies. And the rare times where they are is almost certainly some horrifying misconfiguration that should be fixed.Anton: That's fair, but so to me, it still fits into the bucket of some things you shouldn't charge for and most people don't. There is a bucket of things that you should not charge for, but some people do. 
And there's a bucket of things where it's absolutely fair to charge for I don't know the amount I'm not a pricing person, but I also seen things that are very clearly have cost to a provider, have value to a client, have margins, so it's very clear it's a product; it's not just a feature of the cloud to be more secure. But you're right if somebody positions as, “I got cloud. Hey, give me secure cloud. It costs double.” I'd be really offended because, like, what is your first cloud is, like, broken and insecure? Yeah. Replace insecure with broken. Why are you selling broken to me?Corey: Right. You tried to spin up a service in Google Cloud, it's like, “Great. Do you want the secure version or the shitty one?”Anton: Yeah, exactly.Corey: Guess which one of those costs more. It's… yeah, in the fullness of time, of course, the shitty one cost more because you find out about security breaches on the front page of The New York Times, and no one's happy, except maybe The Times. But the problem that you hit is that I don't know how to fix that. I think there's an opportunity there for some provider—any provider, please—to be a trendsetter, and, “Yeah, we don't charge for security services on our own stuff just because it'd be believed that should be something that is baked in.” Like, that becomes the narrative of the secure cloud.Anton: What about tiers? What about some kind of a good, better, best, or bronze, gold, platinum, where you have reasonable security, but if you want superior security, you pay money? How do you feel, what's your gut feel on this approach? Like, I can't think of example—log analysis. You're going to get some analytics and you're going to get fancy ML. Fancy ML costs money; yay, nay?Corey: You're bringing up an actually really interesting point because I think I'm conflating too many personas at once. Right now, just pulling up last months bill on Google Cloud, it fits in the free tier, but my Cloud Run bill was 13 cents for the month because that's what runs my snark.cloud URL shortener. And it's great. And I wound up with—I think my virtual machine costs dozen times that much. I don't care.Over in AWS-land, I was building out a serverless nonsense thing, my Last Tweet In AWS client, and that cost a few pennies a month all told, plus a whopping 50 cents for a DNS zone. Whatever. But because I was deploying it to all regions and the way that configural evaluations work, my config bill for that was 16 bucks. Now, I don't actually care about the dollar figures on this. I assure you, you could put zeros on the end of that for days and it doesn't really move the needle on my business until you get to a very certain number there, and then suddenly, I care a lot.Anton: [laugh]. Yeah.Corey: And large enterprises, this is expected because even the sheer cost of people's time to go through these things is valuable. What I'm thinking of is almost a hobby-level side project instead, where I'm a student, and I'm learning this in a dorm room or in a bootcamp or in my off hours, or I'm a career switcher and I'm doing this on my own dime out of hours. And I wind up getting smacked with the bill for security services that, for a company, don't even slightly matter. But for me, they matter, so I'm not going to enable them. 
And when I transition into the workforce and go somewhere, I'm going to continue to work the same way that I did when I was an independent learner, like, having a wildly generous free tier for small-scale accounts, like, even taking a perspective until you wind up costing, I don't know, five, ten—whatever it is—thousand dollars a month, none of the security stuff is going to be billable for you because it's it is not aimed at you and we want you comfortable with and using these things.This is a whole deep dive into the weeds of economics and price-driven behavior and all kinds of other nonsense, but every time I wind up seeing that, like, in my actual production account over at AWS land for The Duckbill Group, all things wrapped up, it's something like 1100 bucks a month. And over a third of it is monitoring, audit, and observability services, and a few security things as well. And on the one hand, I'm sitting here going, “I don't see that kind of value coming from it.” Now, the day there's an incident and I have to look into this, yeah, it's absolutely going to be worth having, but it's insurance. But it feels like a disproportionate percentage of it. And maybe I'm just sitting here whining and grousing and I sound like a freeloader who doesn't want to pay for things, but it's one of those areas where I would gladly pay more for a just having this be part of the cost and not complain at all about it.Anton: Well, if somebody sells me a thing that costs $1, and then they say, “Want to make it secure?” I say yes, but I'm already suspicious, and they say, “Then it's going to be 16 bucks.” I'd really freak out because, like, there are certain percentages, certain ratios of the actual thing plus security or a secure version of it; 16x is not the answer expect. 30%, probably still not the answer I expect, frankly. I don't know. This is, like, an ROI question [crosstalk 00:23:46]—Corey: Let's also be clear; my usage pattern is really weird. You take a look at most large companies at significant scale, their cloud environments from a billing perspective look an awful lot like a crap ton of instances—or possibly containers running—and smattering of other things. Yeah, you also database and storage being the other two tiers and because of… reasons data transfer loves to show up too, but by and large, everything else was more or less a rounding error. I have remarkably few of those things, just given the weird way that I use services inappropriately, but that is the nature of me, so don't necessarily take that as being gospel. Like, “Oh, you'll spend a third of your bill.”Like, I've talked to analyst types previously—not you, of course—who will hear a story like this and that suddenly winds up as a headline in some report somewhere. And it's, “Yeah, if your entire compute is based on Lambda functions and you get no traffic, yeah, you're going to see some weird distortions in your bill. Welcome to the conversation.” But it's a problem that I think is going to have to be addressed at some point, especially we talked about earlier, those vendors who are catering to customers who are not born in the cloud, and they start to see their business erode as the cloud-native way of doing things continues to accelerate, I feel like we're in for a time where they're going to be coming at the cloud providers and smacking them for this way harder than I am with my, “As a customer, wouldn't it be nice to have this?” They're going to turn this into something monstrous. And that's what it takes, that's what it takes. 
But… yeah.Anton: It will take more time than than we think, I think because again, back in the Gartner days, I loved to make predictions. And sometimes—I've learned that predictions end up coming true if you're good, but much later.Corey: I'm learning that myself. I'm about two years away from the end of it because three years ago, I said five years from now, nobody will care about Kubernetes. And I didn't mean it was going to go away, but I meant that it would slip below the surface level of awareness to point where most people didn't have to think about it in the same way. And I know it's going to happen because it's too complex now and it's going to be something that just gets handled in the same way that Linux kernels do today, but I think I was aggressive on the timeline. And to be clear, I've been misquoted as, “Oh, I don't think Kubernetes is going to be relevant.”It is, it's just going to not be something that you need to spend the quarter million bucks an engineer on to run in production safely.Anton: Yeah.Corey: So, we'll see. I'm curious. One other question I had for you while I've got you here is you run a podcast of your own: the Cloud Security Podcast if I'm not mistaken, which is—Anton: Sadly, you are not. [laugh].Corey: —the Cloud Se—yeah. Interesting name on that one, yeah. It's like what the Cloud Podcast was taken?Anton: Essentially, we had a really cool name [Weather Insecurity 00:26:14]. But the naming team here said, you must be descriptive as everybody else at Google, and we ended up with the name, Cloud Security Podcast. Very, very original.Corey: Naming is challenging. I still maintain that the company is renamed Alphabet, just so it could appear before Amazon in the yellow pages, but I don't know how accurate that one actually is. Yeah, to be clear, I'm not dunking on your personal fun podcast, for those without context. This is a corporate Google Cloud podcast and if you want to make the argument that I'm punching down by making fun of Google, please, I welcome that debate.Anton: [laugh]. Yes.Corey: I can't acquire companies as a shortcut to hire people. Yet. I'm sure it'll happen someday, but I can aspire to that level of budgetary control. So, what are you up to these days? You spent seven years at Gartner and now you're doing a lot of cloud security… I'll call it storytelling, and I want to be clear that I mean that as a compliment, not the, “Oh, you just tell stories rather than build things?”Anton: [laugh].Corey: Yeah, it turns out that you have to give people a reason to care about what you've built or you don't have your job for very long. What are you talking about these days? What narratives are you looking at going forward?Anton: So, one of the things that I've been obsessed with lately is a lot of people from more traditional companies come in in the cloud with their traditional on-premise knowledge, and they're trying to do cloud the on-premise way. On our podcast, we do dedicate quite some airtime to people who do cloud as if it were a rented data center, and sometimes we say, the opposite is called—we don't say cloud-native, I think; we say you're doing the cloud the cloudy way. So, if you do cloud, the cloudy way, you're probably doing it right. But if you're doing the cloud is rented data center, when you copy a security stack, you lift and shift your IDS, and your network capture devices, and your firewalls, and your SIM, you maybe are okay, as a first step. 
People here used to be a little bit more enraged about it, but to me, we meet customers where they are, but we need to journey with them.Because if all you do is copy your stack—security stack—from a data center to the cloud, you are losing effectiveness, you're spending money, and you're making other mistakes. I sometimes joke that you copy mistakes, not just practices. Why copy on-prem mistakes to the cloud? So, that's been bugging me quite a bit and I'm trying to tell stories to guide people out of a situation. Not away, but out.Corey: A lot of people don't go for the idea of the lift and shift migration and they say that it's a terrible pattern and it causes all kinds of problems. And they're right. The counterpoint is that it's basically the second-worst approach and everything else seems to tie itself for first place. I don't mean to sound like I'm trying to pick a fight on these things, but we're going to rebuild an application while we move it. Great.Then it doesn't work or worse works intermittently and you have no idea whether it's the rewrite, the cloud provider, or something else you haven't considered. It just sounds like a recipe for disaster.Anton: For sure. And so, imagine that you're moving the app, you're doing cut-and-paste to the cloud of the application, and then you cut-and-paste security, and then you end up with sizeable storage costs, possibly egress costs, possibly mistakes you used to make beyond five firewalls, now you make this mistake straight on the edge. Well, not on the edge edge, but on the edge of the public internet. So, some of the mistakes do become worse when you copy them from the data center to the cloud. So, we do need to, kind of, help people to get out of the situation but not by telling them don't do it because they will do it. We need to tell them what step B; what's step 1.5 out of this?Corey: And cost doesn't drive it and security doesn't drive it. Those are trailing functions. It has to be a capability story. It has to be about improving feature velocity or it does not get done. I have learned this the painful way.Anton: Whatever 10x cost if you do something in the data center-ish way in the cloud, and you're ten times more expensive, cost will drive it.Corey: To an extent, yes. However, the problem is that companies are looking at this from the perspective of okay, we can cut our costs by 90% if we make these changes. Okay, great. It cuts the cloud infrastructure cost that way. What is the engineering time, what is the opportunity cost that they gets baked into that, and what are the other strategic priorities that team has been tasked with this year? It has to go along for the ride with a redesign that unlocks additional capability because a pure cost savings play is something I have almost never found to be an argument that carries the day.There are always exceptions, to be clear, but the general case I found is that when companies get really focused on cost-cutting, rather than expanding into new markets, on some level, it feels like they are not in the best of health, corporately speaking. I mean, there's a reason I'm talking about cost optimization for what I do and not cost-cutting.It's not about lowering the bill to zero at all cost. “Cool. Turn everything off. Your bill drops to zero.” “Oh, you don't have a company anymore? Okay, so there's a constraint. Let's talk more about that.” Companies are optimized to increase revenue as opposed to reduce costs. 
And engineers are always more expensive than the cloud provider resources they're using, unless you've done something horrifying.Anton: And some people did, by replicating their mistakes for their inefficient data centers straight into the cloud, occasionally, yeah. But you're right, yeah. It costs the—we had the same pattern of Gartner. It's like, it's not about doing cheaper in the cloud.Corey: I really want to thank you for spending so much time talking to me. If people want to learn more about what you're up to, how you view the world, and what you're up to next, where's the best place for them to find you?Anton: At this point, it's probably easiest to find me on Twitter. I was about to say Podcast, I was about to say my Medium blog, but frankly, all of it kind of goes into Twitter at some point. And so, I think I am twitter.com/anton_chuvakin, if I recall correctly. Sorry, I haven't really—Corey: You are indeed. It's always great; it's one of those that you have a sizable audience, and you're like, “What is my Twitter handle, again? That's a good question. I don't know.” And it's your name. Great. Cool. “So, you're going to spell that for you, too, while you're at it?” We will, of course, put a link to that in the [show notes 00:32:09]. I really want to thank you for being so generous with your time. I appreciate it.Anton: Perfect. Thank you. It was fun.Corey: Anton Chuvakin, Security Strategy Something at Google Cloud. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment because people are doing it wrong, but also tell me which legacy vendor you work for.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.
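Corey's 50-cents-per-gigabyte CloudWatch Logs ingestion figure from this episode is easy to sanity-check with arithmetic. In the sketch below, the daily log volume and the $0.03/GB-month storage rate are illustrative assumptions, not quoted prices; only the $0.50/GB ingest rate comes from the conversation. The point it makes is the one from the episode: at these ratios, ingestion, not retention, dominates the bill.

```java
// Back-of-the-envelope check of the "ingestion dwarfs storage" claim.
// $0.50/GB ingested is the CloudWatch Logs figure quoted in the episode;
// the volume and storage rate below are assumed round numbers.
public class LogCostSketch {
    public static void main(String[] args) {
        double gbPerDay = 50.0;          // hypothetical log volume
        double ingestPerGb = 0.50;       // quoted in the conversation
        double storagePerGbMonth = 0.03; // assumption for retained storage

        // First month only, ignoring accumulation of prior months' logs.
        double monthlyIngest = gbPerDay * 30 * ingestPerGb;
        double monthlyStorage = gbPerDay * 30 * storagePerGbMonth;

        System.out.printf("ingest: $%.2f/mo, storage: $%.2f/mo (%.0fx)%n",
                monthlyIngest, monthlyStorage, monthlyIngest / monthlyStorage);
    }
}
```

With these assumed numbers, ingesting the logs costs roughly 17 times as much as storing them for the month, which is why the "we'll charge you directionally what storage costs" framing and the actual ingestion pricing feel so far apart.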

Screaming in the Cloud
Technical Lineage and Careers in Tech with Sheeri Cabral

Screaming in the Cloud

Play Episode Listen Later Jul 12, 2022 35:50


About Sheeri
After almost 2 decades as a database administrator and award-winning thought leader, Sheeri Cabral pivoted to technical product management. Her super power of “new customer” empathy informs her presentations and explanations. Sheeri has developed unique insights into working together and planning, having survived numerous reorganizations, “best practices”, and efficiency models. Her experience is the result of having worked at everything from scrappy startups such as Guardium – later bought by IBM – to influential tech companies like Mozilla and MongoDB, to large established organizations like Salesforce.
Links Referenced:
Collibra: https://www.collibra.com
WildAid GitHub: https://github.com/wildaid
Twitter: https://twitter.com/sheeri
Personal Blog: https://sheeri.org
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored by our friends at Fortinet. Fortinet's partnership with AWS is a better-together combination that ensures your workloads on AWS are protected by best-in-class security solutions powered by comprehensive threat intelligence and more than 20 years of cybersecurity experience. Integrations with key AWS services simplify security management, ensure full visibility across environments, and provide broad protection across your workloads and applications. Visit them at AWS re:Inforce to see the latest trends in cybersecurity on July 25-26 at the Boston Convention Center. Just go over to the Fortinet booth and tell them Corey Quinn sent you and watch for the flinch. My thanks again to my friends at Fortinet.
Corey: Let's face it, on-call firefighting at 2am is stressful! So there's good news and there's bad news. The bad news is that you probably can't prevent incidents from happening, but the good news is that incident.io makes incidents less stressful and a lot more valuable. incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues and learn from incident insights to improve site reliability and fix your vulnerabilities. Try incident.io, recover faster and sleep more.
Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. My guest today is Sheeri Cabral, who's a Senior Product Manager of ETL lineage at Collibra. And that is an awful lot of words that I understand approximately none of, except maybe manager. But we'll get there. The origin story has very little to do with that. I was following Sheeri on Twitter for a long time and really enjoyed the conversations that we had back and forth. And over time, I started to realize that there were a lot of things that didn't necessarily line up. And one of the more interesting and burning questions I had is, what is it you do, exactly? Because you're all over the map. First, thank you for taking the time to speak with me today. And what is it you'd say it is you do here? To quote a somewhat bizarre and aged movie now.
Sheeri: Well, since your listeners are technical, I do like to match what I say with the audience. First of all, hi. Thanks for having me. I'm Sheeri Cabral. I am a product manager for technical and ETL tools, and I can break that down for this technical audience. If it's not a technical audience—like if I'm at a party, and people ask what I do—I'll say, “I'm a product manager for a technical data tool.” And if they ask what a product manager does, I'll say I help make sure that, you know, we deliver a product the customer wants. So, you know, ETL tools are tools that extract, transform, and load your data from one place to another.
Corey: Like AWS Glue, but for some of them, reportedly, you don't have to pay AWS by the gigabyte-second.
Sheeri: Correct. Correct. We actually have an AWS Glue technical lineage tool in beta right now. So, the technical lineage is how data flows from one place to another. So, when you're extracting, possibly transforming, and loading your data from one place to another, you're moving it around; you want to see where it goes. Why do you want to see where it goes? Glad you asked. You didn't really ask. Do you care? Do you want to know why it's important?
Corey: Oh, I absolutely do. Because it's—again, people who are, like, “What do you do?” “Oh, it's boring, and you won't care.” It's like when people aren't even excited themselves about what they work on, it's always a strange dynamic. There's a sense that people aren't really invested in what they do. I'm not saying you have to have this overwhelming passion and do this in your spare time, necessarily, but you should, at least in an ideal world, like what you do enough to light up a bit when you talk about it. You very clearly do. I'm not wanting to stop you. Please continue.
Sheeri: I do. I love data and I love helping people. So, technical lineage does a few things. For example, a DBA—which I used to be, a DBA—can use technical lineage to predict the impact of a schema update or migration, right? So, if I'm going to change the name of this column, what uses it downstream? What's going to be affected? What scripts do I need to change? Because if the name changes, you know, then I need to not get errors everywhere. [See the lineage sketch following this transcript.] And from a data governance perspective—Collibra is a data governance tool—it helps organizations see if, you know, you have private data in a source, does it remain private throughout its journey, right? So, you can take a column like email address or government ID number and see where it's used down the line, right? GDPR compliance, CCPA compliance. The CCPA is a little newer; people might not know that acronym. It's the California Consumer Privacy Act. I forget what GDPR stands for, but it's another privacy act. It also can help the business see where data comes from, so if you have technical lineage all the way down to your reports, then you know whether or not you can trust the data, right? So, you have a report and it shows salary ranges for job titles. So, where did the data come from? Did it come from a survey? Did it come from job sites? Or did it come from a government source like the IRS, right? So, now you know, like, what you get to trust the most.
Corey: Wait, you can do that without a blockchain? I kid, I kid, I kid. Please don't make me talk about blockchains. No, it's important. The provenance of data, being able to establish an almost chain-of-custody-style approach for a lot of these things, is extraordinarily important.
Sheeri: Yep.
Corey: I was always a little hazy on the whole idea of ETL until I started, you know, working with large-volume AWS bills. And it turns out that, “Well, why do you have to wind up moving and transforming all of these things?” “Oh, because in its raw form, it's complete nonsense. That's why. Thank you for asking.” It becomes a problem—
Sheeri: [laugh]. Oh, I thought you were going to say because AWS has 14 different products for things, so you have to move it from one product to the other to use the features.
Corey: And two of them are good. It's a wild experience.
Sheeri: [laugh].
Corey: But this is also something of a new career for you. You were a DBA for a long time. You're also incredibly engaging, you have a personality, you're extraordinarily creative, and that—if I can slander an entire profession for a second—does not feel like it is a common DBA trait. It's right up there with an overly creative accountant. When your accountant does stand-up comedy, you're watching and you're laughing and thinking, “I am going to federal prison.” It's one of those weird things that doesn't quite gel, if we're speaking purely in terms of stereotypes. What has your career been like?
Sheeri: I was a nerd growing up. So, to kind of say, like, I have a personality—like, my personality is very nerdish. And I get along with other nerdy people and we have a lot of fun, but when I was younger, like, when I was, I don't know, seven or eight, one of the things I really loved to do is—I had a penny collection, you know, like you do—and I loved to sort it by date. So, in the States anyway, we have these pennies that have the date that they were minted on it. And so, I would organize—and I probably had, like, five bucks' worth of pennies. So, you're talking about 500 pennies, and I would sort them and I'd be like, “Oh, this is 1969. This was 1971.” And then when I was done, I wanted to sort things more, so I would start to, like, sort them in order of how shiny the pennies were. So, I think that from an early age, it was clear that I wanted to be a DBA, from that sorting of my data and ordering it, but I never really had a, like, “Oh, I want to be this when I grow up.” I kind of had a stint when I was in, like, middle school where I was like, maybe I'll be a creative writer, and I wasn't as creative a writer as I wanted to be, so I was like, “Ah, whatever.” And I ended up actually coming to computer science completely through random circumstance. I wanted to do neuroscience because I thought it was completely fascinating how the brain works and how, like, you and I are, like, 99.999—we're, like, five-nines the same except for, like, a couple of genetic, whatever. But, like, how our brain wiring, right, how the neurons, how the electricity flows through it—
Corey: Yeah, it feels like, I want to store a whole bunch of data, that's okay. I'll remember it. I'll keep it in my head. And you're, like, rolling up the sleeves and grabbing, like, the combination software package off the shelf and a scalpel. Like, “Not yet, but you're about to.” You're right, there is an interesting point of commonality on this. It comes down to almost data organization and the—
Sheeri: Yeah.
Corey: —relationship between data nodes, if that's a fair assessment.
Sheeri: Yeah. Well, so what happened was, so I went to university, and in order to take introductory neuroscience, I had to take, like, chemistry, organic chemistry, biology; I was basically doing a pre-med track. And so, in the beginning of my junior year, I went to go take introductory neuroscience and I got a D-minus. And a D-minus level doesn't even count for the major. And I'm like, “Well, I want to graduate in three semesters.” And I had this—I got all my requirements done, except for the pesky little major thing. So, I was already starting to take, like, computer science, you know, basic courses, and so I kind of went whole-hog, all-in, did four or five computer science courses a semester, and got my degree in computer science. Because it was like math, so it kind of came a little easy to me. So taking, you know, logic courses and, you know, linear algebra courses was like, “Yeah, that's great.” And then it was the year 2000 when I got my bachelor's, the turn of the century. And my university offered a fifth-year master's degree program. And I said, I don't know who's going to look at me and say, conscious bias, unconscious bias, “She's a woman, she can't do computer science,” so, like, let me just get this master's degree. I, like, filled out a one-page form, I didn't have to take a GRE. And it was the year 2000. You were around back then. You know what it was like. The jobs were like—they were handing jobs out like candy. I literally had a friend who was like, “My company”—that he founded. He's like, just come, you know, it's Monday in May—“Just start, you will just bring your resume the first day and we'll put it on file.” And I was like, no, no, I have this great opportunity to get a master's degree in one year at 25% off the cost because I got a tuition reduction or whatever for being in the program. I was like, “What could possibly go wrong in one year?” And what happened was his company didn't exist the next year, and, like, everyone was in a hiring freeze in 2001. So, it was the best decision I ever made without really knowing, because I would have had a job for six months, been laid off with everyone else at the end of 2000, and… and that's it. So, that's how I became a DBA: I, you know, got a master's degree in computer science, really wanted to use databases. There weren't any database jobs in 2001, but I did get a job as a sysadmin, which we now call SREs.
Corey: Well, for some of the younger folks in the audience, I do want to call out the fact that regardless of how they think we all rode dinosaurs to school, databases did absolutely exist back in that era. There's a reason that Oracle is as large as it is of a company. And it's not because people just love doing business with them, but their technology was head and shoulders above everything else for a long time, to the point where people worked with them in spite of their reputation, not because of it. These days, it seems like in the database universe, you have an explosion of different options and different ways that are great at different things. The best, of course, is Route 53 or other DNS TXT records. Everything else is competing for second place on that. But no matter what it is you're after, there are options available. This was not the case back then. It was like, you had a few options, all of them with serious drawbacks, but you had to pick your poison.
Sheeri: Yeah. In fact, I learned on Postgres in university because, you know, that was freely available. And you know, you're like, “Well, why not MySQL? Isn't that kind of easier to learn?” It's like, yeah, but I went to college from '96 to 2001. MySQL 1.0 or whatever was released in '95. By the time I graduated, it was six years old.
Corey: And academia is not usually the early adopter of a lot of emerging technologies like that. That's not a dig on them at all, because otherwise, you wind up with a major that doesn't exist by the time that the first crop of students graduates.
Sheeri: Right. And they didn't have, you know, transactions. They didn't have—they barely had replication, you know? So, it wasn't a full-fledged database at the time. And then I became a MySQL DBA. But yeah, as a systems administrator, you know, we did websites, right? We did what web—are they called web administrators now? What are they called? Web admins? Webmasters?
Corey: Web admins, I think. They became subsumed into sysadmins, by and large, and now we call them DevOps, or SRE, which means the exact same thing except you get paid 60% more and your primary job is arguing about which one of those you're not.
Sheeri: Right. Right. Like, we were still separated from network operations, but database stuff and, you know, website stuff, it's stuff we all did, back when your [laugh] webmail was Horde based on PHP and you had a database behind it. And yeah, it was fun times.
Corey: I worked at a whole bunch of companies in that era. And that's basically where I formed my early opinion of a bunch of DBA-leaning sysadmins. Like, the DBA in a lot of these companies, it was, I don't want to say toxic, but there's a reason that if I were to say, “I'm writing a memoir about a career track in tech called The Legend of Surly McBastard,” people are going to say, “Oh, is it about the DBA?” There's a reason behind this. It always felt like there was a sense of elitism and a sense of, “Well, that's not my job, so you do your job, but if anything goes even slightly wrong, it's certainly not my fault.” And to be fair, all of these fields have evolved significantly since then, but a lot of those biases that started early in our career are difficult to shake, particularly when they're unconscious.
Sheeri: They are. I never ran into that person. Like, I never ran into anyone who—like, a developer who treated me poorly because the last DBA was a jerk and whatever, but I heard a lot of stories, especially with things like granting access. In fact, I remember, my first job as an actual DBA, and not as a sysadmin that also did the DBA stuff, was at an online gay dating site, and the CTO rage-quit. Literally yelled, stormed out of the office, slammed the door, and never came back. And a couple of weeks later, you know, we found out that the customer service guys, who were in-house—and they were all guys, so I say guys, although we also referred to them as ladies because it was an online gay dating site.
Corey: Gals works well too, in those scenarios. “Oh, guys is unisex.” “Cool. So's ‘gals' by that theory. So gals, how we doing?” And people get very offended by that, and suddenly, yeah, maybe ‘folks' is not a terrible direction to go in. I digress. Please continue.
Sheeri: When they hired me, they were like, are you sure you're okay with this? I'm like, “I get it. There's, like, half-naked men posters on the wall. That's fine.” But when they would call, they'd be, like, “Ladies, let's go to our meeting.” And I'm like, “Do you want me also?” Because I had to ask, because that was when ladies actually might not have included me, because they meant, you know.
Corey: I did a brief stint myself as the director of TechOps at Grindr. That was a wild experience in a variety of different ways. It's over a decade ago, but it was still this… it was a very interesting experience in a bunch of ways. And still, to this day, it remains the single biggest source of InfoSec nightmares that kept me awake at night. Just because when I'm working at a bank—which I've also done—it's only money, which sounds ridiculous to say, especially if you're in a regulated profession, but here in reality where I'm talking about it, I'm dealing instead with, “Cool, if this data leaks, people will die.” Most of what I do is not life or death, but that was, and that weighed very heavily on me.
Sheeri: Yeah, there's a reason I don't work for a bank or a hospital. You know, I make mistakes. I'm human, right?
Corey: There's a reason I don't work on databases, for that exact same reason. Please, continue.
Sheeri: Yeah. So, the CTO rage-quit. A couple of weeks later, the head of customer service comes to me and he's like, “Can we have his spot as an admin for customer service?” And I'm like, “What do you mean?” He's like, “Well, he told us we had, like, ten slots of permission and he was one of them, so we could have, like, nine people.” And, like, I went and looked, and they put permission in the htaccess file. So, this former CTO had just wielded his power to be like, “Nope, can't do that. Sorry, limitations,” when there weren't any. I'm like, “You could have a hundred. You want every customer service person to be an admin? Whatever. Here you go.” So, I did hear stories about that. And yeah, that's not the kind of DBA I was.
Corey: No, it's the more senior you get, the less you want to have admin rights on things. But when I leave a job, like, the number one thing I want you to do is revoke my credentials. Not—
Sheeri: Please.
Corey: Because I'm going to do anything nefarious; because I don't want to get blamed for it. Because we have a long-standing tradition in tech at a lot of places of, “Okay, something just broke. Whose fault is it? Well, who's the most recent person to leave the company? Let's blame them because they're not here to refute the character assassination and they're not going to be angling for a raise here; the rest of us are, so let's see who we can throw under the bus that can't defend themselves.” Never a great plan.
Sheeri: Yeah. So yeah, I mean, you know, my theory in life is I like helping. So, I liked helping developers as a DBA. I would often run workshops to be like, here's how to do an EXPLAIN and find your explain plan and see if you have indexes and why isn't the database doing what you think it's supposed to do? And so, I like helping customers as a product manager, right? So…
Corey: I am very interested in watching how people start drifting in a variety of different directions. You're doing product management now, and it's an ETL lineage product; it is not something that is directly aligned with your previous positioning in the market. And those career transitions are always very interesting to me because there's often a mistaken belief by people partway through their career, realizing they're doing something they don't want to do. They want to go work in a different field, and there's this pervasive belief that, “Oh, time for me to go back to square one and take an entry-level job.” No, you have a career. You have experience. Find the orthogonal move. Often, if that's challenging because it's too far apart, you find the half-step job that blends the thing you do now with something a lot closer, and then a year or two later, you complete the transition into that thing. But starting over from scratch, it's, why would you do that? I can't quite wrap my head around jumping off the corporate ladder to go climb another one. You very clearly have done a lateral move in that direction into a career field that is surprisingly distant, at least in my view. How'd that happen?
Sheeri: Yeah, so after being on call for 18 years or so, [laugh] I decided—no, I had a baby, actually. I had a baby. He was great. And then I had another one. But after the first baby, I went back to work, and I was on call again. And you know, I had a good maternity leave or whatever, but you know, I had a newborn who was six, eight months old and I was getting paged. And I was like, you know, this is more exhausting than having a newborn. Like, having a baby who sleeps three hours at a time, like, in three-hour chunks, was less exhausting than being on call. Because when you have a baby, first of all, it's very rare that when they wake up crying in the middle of the night it's an emergency, right? Like, they have to go to the hospital, right? Very rare. Thankfully, I never had to do it. But basically, like, as much as I had no brain cells, and sometimes I couldn't even go through this list, right: they need to be fed; they need to be comforted; they're tired, and they're crying because they're tired, right—you can't make them go to sleep, but you're like, just go to sleep—what is it—or their diaper needs changing, right? There's, like, four things. When you get that beep of that pager in the middle of the night, it could be anything. It could be logs filling up disk space, and you're like, “Alright, I'll rotate the logs and be done with it.” You know? It could be something you need snoozed.
Corey: “Issue closed. Status: I no longer give a shit what it is.” At some point, it's one of those things where—
Sheeri: Replication lag.
Corey: Right.
Sheeri: Not actionable.
Corey: Don't get me started down that particular path. Yeah. This is the area where DBAs and my sysadmin roots started to overlap a bit. Like, the DBA was great at data analysis, the table structure, and the rest, but the backups of the thing, of course, fell to the sysadmin group. And replication lag, it's, “Okay.” “It's doing some work in the middle of the night; that's normal, and the network is fine. And why are you waking me up with things that are not actionable? Stop it.” I'm yelling at the computer at that point, not the person—
Sheeri: Right, right.
Corey: —to be very clear. But at some point, it's, don't wake me up with trivial nonsense. If I'm getting woken up in the middle of the night, it had better be a disaster. My entire business now is built around a problem that's business-hours only for that explicit reason. It's the not wanting to deal with that. And I don't envy that, but product management. That's a strange one.
Sheeri: Yeah, so what happened was, I was unhappy at my job at the time, and I was like, “I need a new job.” So, I went to, like, the MySQL Slack instance because that was 2018, 2019—very end of 2018, beginning of 2019. And I said, “I need something new.” Like, maybe a data architect, or maybe, like, a data analyst, or data scientist, which was pretty cool. And I was looking at data scientist jobs, and I was an expert MySQL DBA, and it took a long time for me to be able to say, “I'm an expert,” without feeling like, oh, you're just ballooning yourself up. And I was like, “No, I'm literally a world-renowned expert DBA.” Like, I just have to say it and get comfortable with it. And so, you know, I wasn't making a junior data scientist's salary. [laugh]. I am the sole breadwinner for my household, so at that point, I had one kid and a husband, and I was like, how do I support this family on a junior data scientist's salary when I live in the city of Boston? So, I needed something that could pay a little bit more. And a former—I won't even say coworker, but colleague in the MySQL world, because it was the MySQL Slack, after all—said, “I think you should come to MongoDB, be a product manager like me.”
Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate. Is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other; which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at honeycomb.io/screaminginthecloud. Observability: it's more than just hipster monitoring.
Corey: If I've ever said, “Hey, you should come work with me and do anything like me,” people will have the blood drain from their face. Like, “What did you just say to me? That's terrible.” Yeah, it turns out that I'm very hard to explain slash predict, in some ways. It's always fun. It's always wild to go down that particular path, but, you know, here we are.
Sheeri: Yeah. But I had the same question everybody else does, which was, what's a product manager? What does a product manager do? And he gave me a list of things a product manager does, and there was some stuff that I had the skills for, like, you have to talk to customers and listen to them. Well, I've done consulting. I can get yelled at; that's fine. You can tell me things are terrible and I have to fix it. I've done that. No problem with that. Then there are things like, you have to give presentations about features; okay, I can do that. I've done presentations. You know, I started the Boston MySQL Meetup group and ran it for ten years until I had a kid and foisted it off on somebody else. And then the things that I didn't have the skills in, like running a beta program, were like, “Ooh, that sounds fascinating. Tell me more.” So, I was like, “Yeah, let's do it.” And I talked to some folks; they were looking for a technical product manager for MongoDB's sharding product. And they had been looking for someone, like, insanely technical for a while, and they found me; I'm insanely technical. And so, that was great. And so, for a year, I did that at MongoDB. One of the nice things about them is that they invest in people, right? So, my manager left, the team was like, we really can't support someone who doesn't have the product management skills that we need yet, because, you know, I wasn't a master in a year, believe it or not. And so, they were like, “Why don't you find another department?” I was like, “Okay.” And I ended up finding a place in engineering communications, doing, like, you know, some keynote demos, doing some other projects and stuff. And then after—that was a kind of a year-long project, and after that ended, I ended up doing product management for developer relations at MongoDB. Also, this was during the pandemic, right? So, this is 2019 until '21; beginning of 2019 to end of 2020, so it was, you know, three full years. You know, I kind of woke up from the pandemic fog and I was like, “What am I doing? Do I really want to be a content product manager?” And I was like, “I want to get back to databases.” One of the interesting things I learned, actually, in looking for a job—because I did it a couple of times at MongoDB because I changed departments, and I was also looking externally when I did that—I had the idea when I became a product manager, I was like, “This is great because now I'm a product manager for databases, and so I'm kind of leveraging that database skill, and then I'll learn the product manager stuff. And then I can be a product manager for any technical product, right?”
Corey: I like the idea. On some level, it feels like the product managers likeliest to succeed at least have a grounding or baseline in the area that they're in. This gets into the age-old debate of how important is industry-specific experience? Very often you'll see a bunch of job ads just put that in as a matter of course. And for some roles, yeah, it's extremely important. For other roles it's—for example, I don't know, hypothetically, you're looking for someone to fix the AWS bill. It doesn't necessarily matter whether you're a services company, a product company, or a VC-backed company whose primary output is losing money; it doesn't matter, because it's a bounded problem space and that does not transform much from company to company. Same story with sysadmin types, to be very direct. But the product stuff does seem to get into that industry-specific stuff.
Sheeri: Yeah, and especially with tech stuff, you have to understand what your customer is saying when they're saying, “I have a problem doing X and Y,” right? The interesting part of my folly in that was that part of the time that I was looking was during the pandemic, when, you know, everyone was like, “Oh, my God, it's a seller's market. If you're looking for a job, employers are chomping at the bit for you.” And I had trouble finding something because so many people were also looking for jobs that if I went to look for something, for example, as a storage product manager, right—now, databases and storage solutions have a lot in common; databases are storage solutions, in fact; but file systems and databases have much in common—all that they needed was one person with file system experience that had more experience than I did in storage solutions, right? And they were going to choose them over me. So, it was an interesting kind of wake-up call for me that, like, yeah, probably data and databases are going to be my niche. And that's okay, because that is literally why they pay me the literal big bucks. If I'm going to go to a niche where I don't have 20 years of experience, they shouldn't pay me as big a bucks, right?
Corey: Yeah, depending on what you're doing, sure. I don't necessarily believe in the idea that, well, you're new to this particular type of role, so we're going to pay you a lot less. From my perspective, it's always been, like, there's a value in having a person in a role. The value to the company is X, and, “Well, I have an excuse now to pay you less for that,” has never resonated with me. It's, if you're not, I guess, worth—if the value added isn't worth being paid what the stated rate for a position is, you are probably not going to find success in that role and the role has to change. That has always been my baseline operating philosophy. Not to yell at people on this, but it's, uh, I am very tired of watching companies more or less dunk on people from a position of power.
Sheeri: Yeah. And I mean, you can even take the power out of that and take, like, location-based pay. And yes, I understand the cost of living is different in different places, but why do people get paid differently if the value is the same? Like, if I want to get a promotion, right, my company is going to be like, “Well, show me how you've added value. And we only pay your value. We don't pay because—you know, you don't just automatically get promoted after seven years, right? You have to show the value,” and whatever. Which is, I believe, correct, right? And yet, there are seniority things; there are this-many-years-of-experience things. And you know, there's the old caveat of, do you have ten years of experience, or do you have two years of experience five times?
Corey: That is the big problem: there has to be a sense of movement that pushes people forward. You're not the first person that I've had on the show and talked to about a 20-year career. But often, I do wind up talking to folks as I move through the world where they basically have one year of experience repeated 20 times. And as the industry continues to evolve and move on and skill sets don't keep current, in some cases, it feels like they have lost touch, on some level. And they're talking about the world that was and still is in some circles, but it's a market in long-term decline, as opposed to keeping abreast of what is functionally a booming industry.
Sheeri: Their skills have depreciated because they haven't learned more skills.
Corey: Yeah. Tech across the board is a field where I feel like you have to constantly be learning. And there's a bit of an evolve-or-die dinosaur approach. And I do have some fallbacks on this. If I ever decide I am tired of learning and keeping up with AWS, all I have to do is go and work in an environment that uses GovCloud because that's, like, AWS five years ago. And that buys me the five years to find something else to be doing until GovCloud catches up with the modern day of when I decided to make that decision. That's a little insulting and also very accurate for those who have found themselves in that environment. But I digress.
Sheeri: No, and I find it, too, with myself. Like, I got to the point with MySQL where I was like, okay, great. I know MySQL back and forth. Do I want to learn all this other stuff? Literally just today, I was looking at my DMs on Twitter, and somebody DMed me in May, saying, “Hi, ma'am. I am a DBA, and how can I use below services: Lambda, Step Functions, DynamoDB, AWS Session Manager, and CloudWatch?” And I was like, “You know, I don't know. I have not ever used any of those technologies. And I haven't evolved my DBA skills because it's been, you know, six years since I was a DBA.” No—six years, four or five? I can't do math.
Corey: Yeah. Which you would think would be a limiting factor for a DBA, but apparently not. One last question that [laugh] I want to ask you before we wind up calling this a show. You've done an awful lot across the board. As you look at all of it, what is it you would say that you're the most proud of?
Sheeri: Oh, great question. What I'm most proud of is my work with WildAid. So, when I was at MongoDB—I referenced a job with engineering communications—they hired me to be a product manager because they wanted to do a collaboration with a not-for-profit and make a reference application. So, make an application using MongoDB technology, and make it something that was going to be used, but people can also see it. So, we made this open-source project called o-fish. And you know, we can give GitHub links: it's github.com/wildaid, and it has—that's the organization's GitHub, which we created, so it only has the o-fish projects in it. But it is a mobile and web app where governments who patrol waters—patrol, like, marine protected areas, which are like national parks but in the water, right? So, they are these, you know, wildlife preserves in the water—and they make sure that people aren't doing things they shouldn't do: they're not throwing trash in the ocean, they're not taking turtles out of the Galapagos Island area, you know, things like that. And they need software to track that and do that, because at the time, they were literally writing, you know, with pencil on paper, and, you know, had stacks and stacks of this paper to do data entry. And MongoDB had just bought the Realm database and had just integrated it, and so there were, you know, some great features about offline syncing that you didn't have to do; it did all the foundational plumbing for you. And then the reason, though, that I'm proud of that project is not just because it's pretty freaking cool that, you know, it's doing something that actually makes a difference in the world and helps fight climate change and all that kind of stuff. The reason I was proud of it is I was the sole product manager. It was the first time that I'd really had sole ownership of a product, and so all the mistakes were my own and the credit was my own, too. And so, it was really just a great learning experience and it turned out really well.
Corey: There's a lot to be said for pitching in and helping out with good causes in a way that your skill set winds up benefitting. I found that I was a lot happier with a lot of the volunteer stuff that I did when, instead of licking envelopes, it started being things that I had a bit of proficiency in. “Hey, can I fix your AWS bill?” It turns out that has some value to certain nonprofits. You have to be at a certain scale before it makes sense; otherwise it's just easier to maybe not do it that way, but there's a lot of value to doing something that puts good back into the world. I wish more people did that.
Sheeri: Yeah. And it's something to do in your off-time that you know is helping. It might feel like work, it might not feel like work, but it gives you a sense of accomplishment at the end of the day. I remember my first job, one of the interview questions was—no, it wasn't. [laugh]. It wasn't an interview question until after I was hired, and they asked me the question, and then they made it an interview question. And the question was, what video games do you play? And I said, “I don't play video games. I spend all day at work staring at a computer screen. Why would I go home and spend another 12 hours till three in the morning, right—five in the morning—playing video games?” And they were like, we clearly need to change our interview questions. This was, again, back when the dinosaurs roamed the earth. So, people are culturally sensitive now.
Corey: These days, people ask me, “What's your favorite video game?” My answer is, “Twitter.”
Sheeri: Right. [laugh]. Exactly. It's like whack-a-mole—
Corey: Yeah.
Sheeri: —you know? So, for me, having a tangible hobby—like, I do a lot of art: I knit, I paint, I carve stamps, I spin wool into yarn. I know that's not a metaphor for storytelling; that is, I literally spin wool into yarn. And having something tangible, where you work on something and you're like, “Look. It was nothing and now it's this,” is so satisfying.
Corey: I really want to thank you for taking the time to speak with me today about where you've been, where you are, and where you're going, as well as helping me put a little bit more of a human angle on Twitter, which is intensely dehumanizing at times. It turns out that 280 characters is not the best way to express the entirety of what makes someone a person. You need to use a multi-tweet thread for that. If people want to learn more about you, where can they find you?
Sheeri: Oh, they can find me on Twitter. I'm @sheeri—S-H-E-E-R-I—on Twitter. And I've started to write a little bit more on my blog at sheeri.org. So hopefully, I'll continue that since I've now told people to go there.
Corey: I really want to thank you again for being so generous with your time. I appreciate it.
Sheeri: Thanks to you, Corey, too. You take the time to interview people, so I appreciate it.
Corey: I do my best. Sheeri Cabral, Senior Product Manager of ETL lineage at Collibra. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice or smash the like and subscribe buttons on the YouTubes, whereas if you've hated it, do exactly the same thing—like and subscribe, hit those buttons, five-star review—but also leave a ridiculous comment where we will then use an ETL pipeline to transform it into something that isn't complete bullshit.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.

airhacks.fm podcast with adam bien
Idempotency, Secrets, Dependency Injection and AWS Lambda

airhacks.fm podcast with adam bien

Play Episode Listen Later Jul 8, 2022 54:27


An airhacks.fm conversation with Mark Sailes (@MarkSailes3) about: AWS Lambda Powertools for Java, fetching and caching secrets, default caching retention period for short-lived secrets, the limits of the clouds, the consistency of DynamoDB, DynamoDB connectivity with AWS Lambda, RDS Proxy, "Proprietary Cloud Native Managed Service", SQS, SNS, Kinesis, how the fan-out pattern is implemented with SQS and SNS; EventBridge could replace SQS and SNS for the implementation of fan-out patterns, AWS Lambda Powertools for Java Idempotency, using the request as the idempotency key, transactions, network errors and idempotency, idempotency by design, partial batch failures handling, bisect in SQS, SQS Large Message Handling, message offloading to S3, S3 transactional writes, dependency injection of Lambda resources, secret injection with AWS Lambda, JSON logging and AWS Lambda Powertools for Java Logging, converting JSON to metrics in CloudWatch, Mark Sailes on Twitter: @MarkSailes3, Mark's blog: mark-sailes.medium.com
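The episode covers the Java flavor of the Powertools idempotency utility; the Python flavor of the same library makes for a compact illustration of the "using the request as the idempotency key" idea. A minimal sketch, assuming a pre-created DynamoDB table (the table name and handler body are placeholders, not from the episode):

```python
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    idempotent,
)

# DynamoDB table that stores idempotency records (placeholder name).
persistence_layer = DynamoDBPersistenceLayer(table_name="IdempotencyTable")

# By default the full event payload is hashed into the idempotency key,
# so a retry with the same request returns the stored result instead of
# running the handler a second time.
@idempotent(persistence_store=persistence_layer)
def handler(event, context):
    # Do the side-effecting work exactly once (payment, email, etc.).
    return {"statusCode": 200}
```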

The Cloud Pod
166: The Cloud Pod Eagerly Awaits the Microsoft Pay Increase

The Cloud Pod

Play Episode Listen Later May 27, 2022 61:47


On The Cloud Pod this week, the team struggles with scheduling to get everyone in the same room for just one week. Plus, Microsoft increases pay for talent retention while changing licensing for European Cloud Providers, Google Cloud introduces AlloyDB for PostgreSQL, and AWS announces EC2 support for NitroTPM. A big thanks to this week's sponsor, Foghorn Consulting, which provides full-stack cloud solutions with a focus on strategy, planning and execution for enterprises seeking to take advantage of the transformative capabilities of AWS, Google Cloud and Azure. This week's highlights

Screaming in the Cloud
Serverless Should be Simple with Tomasz Łakomy

Screaming in the Cloud

Play Episode Listen Later May 10, 2022 38:43


About Tomasz
Tomasz is a Frontend Engineer at Stedi, Co-Founder/Head of React at Cloudash, egghead.io instructor with over 200 lessons published, a tech speaker, an AWS Community Hero and a lifelong learner.
Links Referenced:
- Cloudash: https://cloudash.dev/
- Twitter: https://twitter.com/tlakomy
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate. Is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other; which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at honeycomb.io/screaminginthecloud. Observability: it's more than just hipster monitoring.
Corey: This episode is sponsored in part by our friends at ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch as they're calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you've come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you're using Elasticsearch, consider not running Elasticsearch. They're also available now in the AWS marketplace if you'd prefer not to go direct and have half of whatever you pay them count towards your EDB commitment. Discover what companies like Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm, yet again.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's always a pleasure to talk to people who ask the bold questions. One of those great bold questions is, what if CloudWatch's web page didn't suck? It's a good question. It's one I ask myself all the time.
And then I stumbled across a product that wound up solving this for me, and I'm a happy customer. To be clear, they're not sponsoring anything that I do, nor should they. It's one of those bootstrapped, exciting software projects called Cloudash. Today, I'm joined by the Head of React at Cloudash, Tomasz Łakomy. Tomasz, thank you for joining me.
Tomasz: It's a pleasure to be here.
Corey: So, where did this entire idea come from? Because I sit and I get upset every time I have to go into the CloudWatch dashboard because first, something's broken. In an ideal scenario, I don't have to care about monitoring or observability or anything like that. But then it's quickly overshadowed by the fact that this interface is terrible.
And the reason I know it's terrible is that every time I'm in there, I feel dumb.My belief is—for the longest time, I thought that was a problem with me. But no, invariably, when you wind up working with something and consistently finding it a bad—you don't know enough to solve for it, it's not you. It is, in fact, the signs of a poorly designed experience, start to finish. “You should be smarter to use this tool,” is very rarely correct. And there are a bunch of observability tools and monitoring tools for serverless things that have made sense over the years and made this easier, but one of the most—and please don't take this the wrong way—stripped down, bare essentials of just the facts, style of presentation is Cloudash. It's why I continue to pay for it every month with a smile on my face. How did you get here from there?Tomasz: Yeah that's a good question. I would say that. Cloudash was born out of desire for simple things to be simple. So, as you mentioned, Cloudash is basically the monitoring and troubleshooting tool for serverless applications, made for serverless developers because I am very much into serverless space, as is Maciej Winnicki, who is the another half of Cloudash team. And, you know, the whole premise of serverless was things are going to be simpler, right?So, you know, you have a bunch of code, you're going to dump it into a Lambda function, and that's it. You don't have to care about servers, you don't have to care about, you know, provisioning stuff, you don't have to care about maintenance, and so on. And that is not exactly true because why PagerDuty still continues to be [unintelligible 00:02:56] business even in serverless spaces. So, you will get paged every now and then. The problem is—what we kind of found is once you have an incident—you know, PagerDuty always tends to call it in the middle of the night; it's never, like, 11 a.m. during the workday; it's always the middle of the night.Corey: And no one's ever happy when it calls them either. It's, “Ah, hell.” Whatever it rings, it's yeah, the original Call of Duty. PagerDuty hooked up to Nagios. I am old enough to remember those days.Tomasz: [unintelligible 00:03:24] then business, like, imagine paying for something that's going to wake you up in the middle of the night. It doesn't make sense. In any case—Corey: “So, why do you pay for that product? Because it's really going to piss me off.” “Okay, well… does that sound like a good business to you? Well, AWS seems to think so. No one's happy working with that stuff.” “Fair. Fair enough.”Tomasz: So, in any case, like we've established an [unintelligible 00:03:43]. So you wake up, you go to AWS console because you saw a notification that this-and-this API has, you know, this threshold was above it, something was above the threshold. And then you go to the CloudWatch console. And then you see, okay, those are the logs, those are the metrics. I'm going to copy this request ID. I'm going to go over here. I'm going to go to X-Ray.And again, it's 3 a.m. so you don't exactly remember what do you investigate; you have, like, ten minutes. And this is a problem. Like, we've kind of identified that it's not simple to do these kinds of things, too—it's not simple to open something and have an understanding, okay, what exactly is happening in my serverless app at this very moment? Like, what's going on?So, we've built that. So, Cloudash is a desktop app; it lives on your machine, which is a single pane of glass. 
It's a single pane of glass view into your serverless system. So, if you are using CloudFormation in order to provision something, when you open Cloudash, you're going to see, you know, all of the metrics, all the Lambda functions, all of the API Gateways that you have provisioned. As of yesterday, API Gateway is no longer cool because they did launch the direct integration, so you have—you can call Lambda functions with [crosstalk 00:04:57]—Corey: Yeah, it's the one they released, and then rolled back and somehow never said a word—because that's an AWS messaging story, and then some—right around re:Invent last year. And another quarter goes by and out it goes.Tomasz: It's out yesterday.Corey: Yeah, it's terrific. I love that thing. The only downside to it is, ah, you have to use one of their—you have to use their domain; no custom domain support. Really? Well, you can hook up CloudFront to it, but the pricing model that way makes it more expensive than API Gateway.Okay, so I could use Cloudflare in front of it, and then it becomes free, so I bought a domain just for that purpose. That's right, my serverl—my direct Lambda URLs now live behind the glorious domain of cheapass.cloud because of course. They are. It's a day-one product from AWS, so of course, it's not feature-complete.But one of the things I like about the serverless model, and it's also a challenge when it comes to troubleshooting stuff is that it's very much set it and forget it style because serverless in many cases, at least the way that I tend to use it, is back-office stuff, its back-end things, it's processing on things that are not necessarily always direct front and center. So, these things can run on their own for years until finally, you find a strange bug in a new use case, or you want to go and change something. And then it's how the hell did this ever work? And it's still working, kind of, but what fool built this? Of course, it was me; it's always me.But what happened here? You're basically excavating your own legacy code, trying to understand what's going on. And so, you're already upset then. Cloudash makes this easier to find the things, to navigate through a whole bunch of different accounts. And there are a bunch of decisions that you made while building the app that are so clearly correct, that I get actively annoyed when others don't because oh, it looks at your AWS configuration file in your user home directory. Great, awesome. It's a desktop app, but it still consults that file. Yay, integration between ClickOps and the terminal. Wonderful.But ah, use SSO for a lot of stuff, so that's going to fix your little red wagon. I click on that app, and suddenly, bam, a browser opens asking me to log in and authenticate, allow the request. It works, and then suddenly, it goes back to doing exactly what you'd expect it to. It's really nice. The affordances behind this are glorious.Tomasz: Like I said, one of our kind of design goals when building Cloudash was to make simple things simple again. The whole purpose is to make sure that you can get into the root cause of an issue within, like, five minutes, if not less. And this is kind of the app that you're going to tend to open whenever that—as I said, because some of the systems can be around for, like, ages, literally without any incident whatsoever, then the data is going to change because somebody [unintelligible 00:07:30] got that the year is 2020 and off you go, we have an incident.But what's important about Cloudash is that we don't send logs anywhere. 
And that's kind of important because you don't pay for [PUT 00:07:42] metric API because we are not sending those logs anywhere. If you install Cloudash on your machine, we are not going to get your logs from the last ten years, put them in into a system, charge you for that, just so you are able to, you know, find out what happened in this particular hour, like, two weeks ago. We genuinely don't care about your logs; we have enough of our own logs at work to, you know, to analyze, to investigate, and so on; we are not storing them anywhere.In fact, you know, whatever happens on your machine stays on the machine. And that is partially why this is a desktop app. Because we don't want to handle your credentials. We don't—absolutely, we don't want you to give us any of your credentials or access keys, you know, whatever. We don't want that.So, that is why you install Cloudash, it's going to run on your machine, it's going to use your local credentials. So, it's… effectively, you could say that this is a much more streamlined and much more laser-focused browser or like, an eye into AWS systems, which live on the serverless side of things.Corey: I got to deal with it in a bit of an interesting way, recently. I have a detector in my company's production AWS org, to detect when ClickOps is afoot. Now, I'm a big proponent of ClickOps, but I also want to know what's going on, so I have a whole thing that [runs detects 00:09:04] when people are doing things in the console versus via API. And it alerts on certain subsets of them. I had to build a special case for the user agent string coming out of Cloudash because no, no, this is an app, this is not technically ClickOps—it is also read-only, which is neither here nor there, to my understanding.But it was, “Oh yeah, this is effectively an Electron app.” It just wraps, effectively, a browser and presents that as an application. And cool. From my perspective, that's an implementation detail. It feels like a native app—because it is—and I can suddenly see the things I care about in a way that is much more straightforward without having to have four different browser tabs open where, okay, here's the CloudTrail log for this thing, here's the metrics next to it. Oh, those are two separate windows already, and so on and so forth. It just makes hunting down to the obnoxious problems so much nicer.It's also, you're one of those rare products where if I don't use it for a month, I don't get the bill at the end of the month and think, “Ooh, that's going to—did I waste the money?” It's no, nice. I had a whole month where I didn't have to mess with this. It's great.Tomasz: Exactly. I feel like, you know, it's one of those systems where, as you said, we send you an email at the end of every month that we're going to charge you X dollars for the month—by the way, we have fixed pricing and then you can cancel anytime—and it's like one of those things that, you know, I didn't have to open this up for a month. This is awesome because I didn't have any incidents. But I know whenever again, PagerDuty is going to decide, “Hey, dude, wake up. You know, if slept for three hours. That is definitely long enough,” then you know that; you know, this app is there and you can use that.We very much care about, you know, building this stuff, not only for our customers, but we also use that on a daily basis. 
In fact, I… every single time that I have to—I want to investigate something in, like, our serverless systems at Stedi because everything that we do at work, at Stedi, since this incident serverless paradigm. So, I tend to open Cloudash, like, 95% of the time whenever I want to investigate something. And whenever I am not able to do something in Cloudash, this goes, like, straight to the top of our, you know, issue lists or backlog or whatever you want to call it. Because we want to make this product, not only awesome, you know, for customers to buy a [unintelligible 00:11:22] or whatever, but we also want to be able to use that on a daily basis.And so far, I think we've kind of succeeded. But then again, we have quite a long way to go because we have more ideas, than we have the time, definitely, so we have to kind of prioritize what exactly we're going to build. So, [unintelligible 00:11:39] integrations with alarms. So, for instance, we want to be able to see the alarms directly in the Cloudash UI. Secondly, integration with logs insights, and many other ideas. I could probably talk for hours about what we want to build.Corey: I also want to point out that this is still your side gig. You are by day a front-end engineer over at Stedi, which has a borderline disturbing number of engineers with side gigs, generally in the serverless space, doing interesting things like this. Dynobase is another example, a DynamoDB desktop client; very similar in some respects. I pay for that too. Honestly, for a company in Stedi's space, which is designed as basically a giant API for deep, large enterprise business stuff, there's an awful lot of stuff for small-scale coming out of that.Like, I wind up throwing a disturbing amount of money in the general direction of Stedi for not being their customer. But there's something about the culture that you folks have built over there that's just phenomenal.Tomasz: Yeah. For the record, you know, having a side gig is another part of interview process at Stedi. You don't have to have [laugh] a side project, but yeah, you're absolutely right, you know, the amount of kind of side projects, and you know, some of those are monetized, as you mentioned, you know, Cloudash and Dynobase and others. Some of those—because for instance, you talked to Aidan, I think a couple of weeks ago about his shenanigans, whenever you know, AWS is going to announce something he gets in and try to [unintelligible 00:13:06] this in the most amusing ways possible. Yeah, I mean, I could probably talk for ages about why Stedi is by far the best company I've ever worked at, but I'm going to say this: that this is the most talented group of people I've ever met, and myself, honestly.And, you know, the fact that I think we are the second largest, kind of, group of AWS experts outside of AWS because the density of AWS Heroes, or ex-AWS employees, or people who have been doing cloud stuff for years, is frankly, massive, I tend to learn something new about cloud every single day. And not only because of the Last Week in AWS but also from our Slack.Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of “Hello, World” demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And—let me be clear here—it's actually free. 
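The ClickOps detector Corey mentioned a moment ago is straightforward to sketch: route CloudTrail management events through an EventBridge rule to a small Lambda that inspects the userAgent field. The console marker strings and the Cloudash allow-list below are assumptions for illustration, since real console user-agent values vary by service:

```python
# Sketch of a ClickOps detector fed by an EventBridge rule that forwards
# CloudTrail management events. Anything whose user agent looks like the
# AWS console gets flagged; known read-only tools are special-cased.
CONSOLE_MARKERS = ("console.amazonaws.com", "signin.amazonaws.com")  # assumed values
ALLOWED_AGENTS = ("Cloudash",)  # read-only tools to ignore, per the episode

def handler(event, context):
    detail = event.get("detail", {})
    agent = detail.get("userAgent", "")
    is_console = any(marker in agent for marker in CONSOLE_MARKERS)
    is_allowed = any(tool in agent for tool in ALLOWED_AGENTS)
    if is_console and not is_allowed:
        actor = detail.get("userIdentity", {}).get("arn", "unknown")
        print(f"ClickOps detected: {detail.get('eventName')} by {actor}")
```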
There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free? This is actually free, no asterisk. Start now. Visit snark.cloud/oci-free that's snark.cloud/oci-free.Corey: There's something to be said for having colleagues that you learn from. I have never enjoyed environments where I did not actively feel like the dumbest person in the room. That's why I love what I do now. I inherently am. I have to talk about so many different things, that whenever I talk to a subject matter expert, it is a certainty that they know more about the thing than I do, with the admitted and depressing exception of course of the AWS bill because it turns out the reason I had to start becoming the expert in that was because there weren't any. And here we are now.I want to talk as well about some of—your interaction outside of work with AWS. For example, you've been an Egghead instructor for a while with over 200 lessons that you published. You're an AWS Community Hero, which means you have the notable distinction of volunteering for a for-profit company—good work—no, the community is very important. It's helping each other make sense of the nonsense coming out of there. You've been involved within the ecosystem for a very long time. What is it about, I guess—the thing I'm wondering about myself sometimes—what is it about the AWS universe that drew you in, and what keeps you here?Tomasz: So, give you some context, I've started, you know, learning about the cloud and AWS back in early-2019. So, fun fact: Maciej Winnicki—again, the co-founder of Cloudash—was my manager at the time. So, we were—I mean, the company I used to work for at the time, OLX Group, we are in the middle of cloud transformation, so to speak. So, going from, you know, on-premises to AWS. And I was, you know, hired as a senior front-end engineer doing, you know, all kinds of front-end stuff, but I wanted to grow, I wanted to learn more.So, the idea was, okay, maybe you can get AWS Certified because, you know, it's one of those corporate goals that you have to have something to put that checkbox next to it. So, you know, getting certified, there you go, you have a checkbox. And off you go. So, I started, you know, diving in, and I saw this whole ocean of things that, you know, I was not entirely aware of. To be fair, at the time I knew about this S3, I knew that you can put a file in an S3 bucket and then you can access it from the internet. This is, like, the [unintelligible 00:16:02] idea of my AWS experiences.Corey: Ideally, intentionally, but one wonders sometimes.Tomasz: Yeah, exactly. That is why you always put stuff as public, right? Because you didn't have to worry about who [unintelligible 00:16:12] [laugh] public [unintelligible 00:16:15]. No, I'm kidding, of course. But still, I think what's [unintelligible 00:16:20] to AWS is what—because it is this endless ocean of things to learn and things to play with, and, you know, things to teach.I do enjoy teaching. 
As you said, I have quite a lot of, you know, content, videos, blog posts, conference talks, and a bunch of other stuff, and I do that for two reasons. You know, first of all, I tend to learn the best by teaching, so it helps me very much, kind of like, solidify my own knowledge. Whenever I record—like, I have two courses about CDK, you know, when I was recording those, I definitely—that kind of solidify my, you know, ideas about CDK, I get to play with all those technologies.And secondly, you know, it's helpful for others. And, you know, people have opinions about certificates, and so on and so forth, but I think that for somebody who's trying to get into either the tech industry or, you know, cloud stuff in general, being certified helps massively. And I've heard stories about people who are basically managed to double or triple their salaries by going into tech, you know, with some of those certificates. That is why I strongly believe, by the way, that those certificates should be free. Like, if you can pass the exam, you shouldn't have to worry about this $150 of the fee.Corey: I wrote a blog post a while back, “The Dumbest Dollars a Cloud Provider Can Make,” and it's charging for training and certification because if someone's going to invest that kind of time in learning your platform, you're going to try and make $150 bucks off them? Which in some cases, is going to put people off from even beginning that process. “What cloud provider I'm not going to build a project on?” Obviously, the one I know how to work with and have a familiarity with, in almost every case. And the things you learn in your spare time as an independent learner when you get a job, you tend to think about your work the same way. It matters. It's an early on-ramp that pays off down the road and the term of years.I used to be very anti-cert personally because it felt like I was jumping through hoops, and paying, in some cases, for the privilege. I had a CCNA for a while from Cisco. There were a couple of smaller companies, SaltStack, for example, that I got various certifications from at different times. And that was sort of cheating because I helped write the software, but that's neither here nor there. It's the—and I do have a standing AWS cert that I get a different one every time—mine is about to expire—because it gets me access to lounges at physical events, which is the dumbest of all reasons to get certs, but here you go. I view it as the $150 lounge pass with a really weird entrance questionnaire.But in my case it certs don't add anything to what I do. I am not the common case. I am not early in my career. Because as you progress through your career, things—there needs to be a piece of paper that says you know things, and early on degree or certifications are great at that. In the time it becomes your own list of experience on your resume or CV or LinkedIn or God knows what. Polywork if you're doing it the right way these days.And it shows a history of projects that are similar in scope and scale and impact to the kinds of problems that your prospective employer is going to have to solve themselves. Because the best answer to hear—especially in the ops world—when there's a problem is, “Oh, I've seen this before. Here's how you fix it.” As opposed to, “Well, I don't know. Let me do some research.”There's value to that. And I don't begrudge anyone getting certs… to a point. At least that's where I sit on it. At some point when you have 25 certs, it's when you actually do any work? 
Because it's taking the tests and learning all of these things, which in many ways does boil down to trivia, it stands in counterbalance to a lot of these things.Tomasz: Yeah. I mean, I definitely, totally agree. I remember, you know, going from zero to—maybe not Hero; I'm not talking about AWS Hero—but going from zero to be certified, there was the Solutions Architect Associate. I think it took me, like, 200 hours. I am not the, you know, the brightest, you know, the sharpest tool in the shed, so it probably took me, kind of, somewhat more.I think it's doable in, like, 100 hours, but I tend to over-prepare for stuff, so I didn't actually take the actual exam until I was able to pass the sample exams with, like, 90% pass, just to be extra sure that I'm actually going to pass it. But still, I think that, you know, at some point, you probably should focus on, you know, getting into the actual stuff because I hold two certificates, you know, one of those is going to expire, and I'm not entirely sure if I want to go through the process again. But still, if AWS were to introduce, like, a serverless specialty exam, I would be more than happy to have that. I genuinely enjoy, kind of, serverless, and you know, the fact that I would be able to solidify my knowledge, I have this kind of established path of the things that I should learn about in order to get this particular certificate, I think this could be interesting. But I am not probably going to chase all the 12 certificates.Maybe if AWS IQ was available in Poland, maybe that would change because I do know that with IQ, those certs do matter. But as of [unintelligible 00:21:26] now, I'm quite happy with my certs that I have right now.Corey: Part of the problem, too, is the more you work with these things, the harder it becomes to pass the exams, which sounds weird and counterintuitive, but let me use myself as an example. When I got the cloud practitioner cert, which I believe has lapsed since then, and I got one of the new associate-level betas—I'll keep moving up the stack until I start failing exams. But I got a question wrong on the cloud practitioner because it was, “How long does it take to restore an RDS database from a snapshot backup?” And I gave the honest answer of what I've seen rather than what it says in the book, and that honest answer can be measured in days or hours. Yeah.And no, that's not the correct answer. Yeah, but it is the real one. Similarly, a lot of the questions get around trivia, syntax of which of these is the correct argument, and which ones did we make up? It's, I can explain in some level of detail, virtually every one of AWS has 300 some-odd services to you. Ask me about any of them, I could tell you what it is, how it works, how it's supposed to work and make a dumb joke about it. Fine, whatever.You'll forgive me if I went down that path, instead of memorizing what is the actual syntax of this YAML construct inside of a CloudFormation template? Yeah, I can get the answer to that question in the real world, with about ten seconds of Googling and we move on. That's the way most of us learn. It's not cramming trivia into our heads. There's something broken about the way that we do certifications, and tech interviews in many cases as well.I look back at some of the questions I used to ask people for Linux sysadmin-style jobs, and I don't remember the answer to a lot of these things. I could definitely get back into it, but if I went through one of these interviews now, I wouldn't get the job. 
One would argue I shouldn't because of my personality, but that's neither here nor there.Tomasz: [laugh]. I mean, that's why you use CDK, so you'd have to remember random YAML comments. And if you [unintelligible 00:23:26] you don't have YAML anymore. [unintelligible 00:23:27].Corey: Yes, you're quite the CDK fanboy, apparently.Tomasz: I do like CDK, yes. I don't like, you know, mental overhead, I don't like context switching, and the way we kind of work at Stedi is everything is written in TypeScript. So, I am a front-end engineer, so I do stuff in the front-end line in TypeScript, all of our Lambda functions are written in TypeScript, and our [unintelligible 00:23:48] is written in TypeScript. So, I can, you know, open up my Visual Studio Code and jump between all of those files, and the language stays the same, the syntax stays the same, the tools stay the same. And I think this is one of the benefits of CDK that is kind of hard to replicate otherwise.And, you know, people have many opinions about the best to deploy infrastructure in the cloud, you know? The best infrastructure-as-code tool is the one that you use at work or in your private projects, right? Because some people enjoy ClickOps like you do; people—Corey: Oh yeah.Tomasz: Enjoy CloudFormation by hand, which I don't; people are very much into Terraform or Serverless Framework. I'm very much into CDK.Corey: Or the SAM CLI, like, three or four more, and I use—Tomasz: Oh, yeah. [unintelligible 00:24:33]—Corey: —all of these things in various ways in some of my [monstrous 00:24:35] projects to keep up on all these things. I did an exploration with the CDK. Incidentally, I think you just answered why I don't like it.Tomasz: Because?Corey: Because it is very clear that TypeScript is a first-class citizen with the CDK. My language of choice is shitty bash because, grumpy old sysadmin; it happens. And increasingly, that is switching over to terrible Python because I'm very bad at that. And the problem that I run into as I was experimenting with this is, it feels like the Python support is not fully baked, most people who are using the CDK are using a flavor of JavaScript and, let's be very clear here, the every time I have tried to explore front-end, I have come away more confused than I was when I started, part of me really thinks I should be learning some JavaScript just because of its versatility and utility to a whole bunch of different problems. But it does not work the way I think, on some level, that it should because of my own biases and experiences. So, if you're not a JavaScript person, I think that you have a much rockier road with the CDK.Tomasz: I agree. Like I said, I tend to talk about my own experiences and my kind of thoughts about stuff. I'm not going to say that, you know, this tool or that tool is the best tool ever because nothing like that exists. Apart from jQuery, which is the best thing that ever happened to the web since, you know, baked bread, honestly. But you are right about CDK, to the best of my knowledge, kind of, all the other languages that are supported by CDK are effectively transpiled down from TypeScript. So it's, like, first of all, it is written in TypeScript, and then kind of the Python, all of the other languages… kind of come second.You know, and afterwards, I tend to enjoy CDK because as I said, I use TypeScript on a daily basis. And you know, with regards to front-end, you mentioned that you are, every single time you is that you end up being more confused. It never goes away. 
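For anyone who hasn't seen the Python flavor of CDK that Corey is describing, a minimal stack looks roughly like this, assuming CDK v2 (aws-cdk-lib) and placeholder names and paths:

```python
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_lambda as lambda_
from constructs import Construct

class ApiStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A single Lambda function whose code lives in ./src (placeholder path).
        lambda_.Function(
            self,
            "Handler",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="index.handler",
            code=lambda_.Code.from_asset("src"),
            timeout=Duration.seconds(30),
        )

app = App()
ApiStack(app, "ApiStack")
app.synth()
```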
I've been doing front-end stuff for years, and it's, you know, kind of exactly the same. Fun story, I actually joined Cloudash because, well, Maciej started working on Cloudash alone, and after quite some time, he was so frustrated with the modern front-end landscape that he asked me, “Dude, you need to help me. Like, I genuinely need some help. I am tired of React. I am tired of React hooks. This is way too complex. I want to go back to doing back-end stuff. I want to go back, you know, thinking about how we're going to integrate with all those APIs. I don't want to do UI stuff anymore.”Which was kind of like an interesting shift because I remember at the very beginning of my career, where people were talking about front-end—you know, “Front-end is not real programming. Front-end is, you know, it's easy, it's simple. I can learn CSS in an hour.” And the amount of people who say that CSS is easy, and are good at CSS is exactly zero. Literally, nobody who's actually good at CSS says that, you know, CSS, or front-end, or anything like that is easy because it's not. It's incredibly complex. It's getting probably more and more complex because the expectations of our front-end UIs [unintelligible 00:27:44].Corey: It's challenging, it is difficult, and one of the things I find most admirable about you is not even your technical achievements, it's the fact that you're teaching other people to do this. In fact, this gets to the last point I want to cover on our conversation today. When I was bouncing topic ideas off of you, one of the points you brought up that I'm like, “Oh, we're keeping that and saving that for the end,” is why—to your words—why speaking at tech events gets easier, but never easy. Let's dive into that. Tell me more about it.Tomasz: Basically, I've accidentally kickstarted my career by speaking at meetups which later turned into conferences, which later turned into me publishing courses online, which later turned into me becoming an AWS Hero, and here we are, you know, talking to each other. I do enjoy, you know, going out in public and speaking and being on stage. I think, you know, if somebody has, kind of, the heart, the ability to do that, I do strongly recommend, you know, giving it a shot, not only to give, like, an honestly life-changing experience because the first time you go in front of hundreds of people, this is definitely, you know, something that's going to shake you, while at the same time acknowledging that this is absolutely, definitely not for everyone. But if you are able to do that, I think this is definitely worth your time. But as you said—by quoting me—that it gets easier, so every single time you go on stage, talk at a meetup or at a conference or online conferences—which I'm not exactly a fan of, for the record—it's—Corey: It's too much like work, too much like meetings. There's nothing different about it.Tomasz: Yeah, exactly. Like, there's no journey. There's no adventure in online conferences. I know that, of course, you know, given all of that, you know, we had to kind of switch to online conferences for quite some time where I think we are pretending that Covid is not a thing anymore, so we, you know, we're effectively going back, but kind of the point I wanted to make is that I am a somewhat experienced public speaker—I'd like to say that because I've been doing that for years—but I've been, you know, talking to people who actually get paid to speak at the conferences, to actually kind of do that for a living, and they all say the same thing. 
It gets simpler, it gets easier, but it's never freaking easy, you know, to go out there, and you know, to share whatever you've learned.Corey: I'm one of those people. I am a paid public speaker fairly often, even ignoring the podcast side, and I've spoken on conference stages a couple hundred times at least. And it does get easier but never easy. That's a great way of framing it. You… I get nervous before every talk I give.There are I think two talks I've given that I did not have an adrenaline hit and nervous energy before I went onstage, and both of those were duds. Because I think that it's part of the process, at least for me. And it's like, “Oh, how do you wind up not being scared for before you go on stage?” You don't. You really don't.But if that appeals to you and you enjoy the adrenaline rush of the rest, do it. If you're one of those people who've used public speaking as, “I would prefer death over that,” people are more scared of public speaking their death, in some cases, great. There are so many ways to build audiences and to reach people that fine, if you don't like doing it on stage, don't force yourself to. I'd say try it once; see how it feels meetups are great for this.Tomasz: Yeah. Meetups are basically the best way to get started. I'm yet to meet a meetup, either, you know, offline or online, who is not looking for speakers. It's always quite the opposite, you know? I was, you know, co-organizing a meetup in my city here in Poznań, Poland, and the story always goes like this: “Okay, we have a date. We have a venue. Where are the speakers?” And then you know, the tumbleweed is going to roll across the road and, “Oh, crap, we don't have any speakers.” So, we're going to try to find some, reach out to people. “Hey, I know that you did this fantastic project at your workplace. Come to us, talk about this.” “No, I don't want to. You know, I'm not an expert. I am, you know, I have on the 50 years of experience as an engineer. This is not enough.” Like I said, I do strongly recommend it, but as you said, if you're more scared of public speaking than, like, literally dying, maybe this is not for you.Corey: Yeah. It comes down to stretching your limits, finding yourself interesting. I find that there are lots of great engineers out there. The ones that I find myself drawn to are the ones who aren't just great at building something, but at storytelling around the thing that they are built of, yes, you build something awesome, but you have to convince me to care about it. You have to show me the thing that got you excited about this.And if you can't inspire that excitement in other people, okay. Are you really excited about it? Or what is the story here? And again, it's a different skill set. It is not for everyone, but it is absolutely a significant career accelerator if it's leveraged right.Tomasz: [crosstalk 00:32:45].Corey: [crosstalk 00:32:46] on it.Tomasz: Yeah, absolutely. I think that we don't talk enough about, kind of, the overlap between engineering and marketing. In the good sense of marketing, not the shady kind of marketing. The kind of marketing that you do for yourself in order to elevate yourself, your projects, your successes to others. 
Because, you know, try as you might, but if you are kind of like sitting in the corner of an office, you know, just jamming on your keyboard 40 hours per week, you're not exactly likely to be promoted because nobody's going to actively reach out to you to find out about your, you know, recent successes and so on.Which at the same time, I'm not saying that you should go @channel in Slack every single time you push a commit to the main branch, but there's definitely, you know, a way of being, kind of, kind to yourself by letting others know that, “Okay, I'm here. I do exist, I have, you know, those particular skills that you may be interested about. And I'm able to tell a story which is, you know, convincing.” So it's, you know, you can tell a story on stage, but you can also tell your story to your customers by building a future that they're going to use. [unintelligible 00:33:50].Corey: I really want to thank you for taking the time to speak with me today. If people want to learn more, where's the best place to find you?Tomasz: So, the best place to find me is on Twitter. So, my Twitter handle is @tlakomy. So, it's T-L-A-K-O-M-Y. I'm assuming this is going to be in the [show notes 00:34:06] as well.Corey: Oh, it absolutely is. You beat me to it.Tomasz: [laugh]. So, you can find Cloudash at cloudash.dev. You can probably also find my email, but don't email me because I'm terrible, absolutely terrible at email, so the best way to kind of reach out to me is via my Twitter DMs. I'm slightly less bad at those.Corey: Excellent. And we will, of course, put links to that in the [show notes 00:34:29]. Thank you so much for being so generous with your time. I appreciate it.Tomasz: Thank you. Thank you for having me.Corey: Tomasz Łakomy, Head of React at Cloudash. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, and if you're on the YouTubes, smash the like and subscribe button, as the kids say. Whereas if you've hated this episode, please do the exact same thing—five-star reviews smash the buttons—but this time also leave an insulting and angry comment written in the form of a CloudWatch log entry that no one is ever able to find in the native interface.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

AWS Bites
35. How can you become a Logs Ninja with CloudWatch?

AWS Bites

Play Episode Listen Later May 5, 2022 31:54


In the age of distributed systems, we produce tons and tons of logs. This is especially true on AWS when using CloudWatch Logs. So how do we make sense of all these logs, and how can we find useful information in them? In this episode we talk all about logs on AWS, and we discuss the main CloudWatch concepts for logs, like Log Groups and Log Streams. We discuss how you can consume logs, how this used to be a big pain point with AWS CloudWatch Logs, and how things are now a lot better thanks to a relatively new feature called Logs Insights. Finally, we discuss some best practices that you should consider when thinking about logs for your distributed cloud applications. In this episode, we mentioned the following resources: - Our previous episode on CloudWatch alarms: https://www.youtube.com/watch?v=rk4QMJf6R4U - Analyzing log data with CloudWatch Logs Insights: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html - CloudWatch Logs Insights query syntax: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html - Pino logger for Node.js: https://getpino.io This episode is also available on YouTube: https://www.youtube.com/AWSBites You can listen to AWS Bites wherever you get your podcasts: - Apple Podcasts: https://podcasts.apple.com/us/podcast/aws-bites/id1585489017 - Spotify: https://open.spotify.com/show/3Lh7PzqBFV6yt5WsTAmO5q - Google: https://podcasts.google.com/feed/aHR0cHM6Ly9hbmNob3IuZm0vcy82YTMzMTJhMC9wb2RjYXN0L3Jzcw== - Breaker: https://www.breaker.audio/aws-bites - RSS: https://anchor.fm/s/6a3312a0/podcast/rss Do you have any AWS questions you would like us to address? Leave a comment here or connect with us on Twitter: - https://twitter.com/eoins - https://twitter.com/loige #aws #logs #cloudwatch
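Logs Insights queries can also be run outside the console. A small boto3 sketch, assuming a placeholder log group name, that pulls the most recent error lines from the last hour:

```python
import time

import boto3

logs = boto3.client("logs")
now = int(time.time())

# Start a Logs Insights query over the last hour of a log group.
query_id = logs.start_query(
    logGroupName="/aws/lambda/my-function",  # placeholder
    startTime=now - 3600,
    endTime=now,
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc "
        "| limit 20"
    ),
)["queryId"]

# Poll until the query finishes, then print each result row.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```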

AWS Bites
34. How to get the most out of CloudWatch Alarms?

AWS Bites

Play Episode Listen Later Apr 28, 2022 26:34


CloudWatch is a great service for metrics. You get tons of metrics out of the box, and you can also create your own custom ones. One of the most important things you can do with metrics is to create alarms, so how do we get the most out of CloudWatch alarms? In this episode we share our insights and cover the different types of alarms that exist, how to create an alarm, what to do when an alarm is triggered, a few examples of useful alarms, and some of the drawbacks of CloudWatch alarms and how to overcome them. In this episode, we mentioned the following resources: - Our previous episode on CloudWatch metrics: https://www.youtube.com/watch?v=vwo2jXfyooQ - SLIC Watch, a Serverless Framework plugin that generates sensible alarms and dashboards automatically: https://fth.link/slic-watch This episode is also available on YouTube: https://www.youtube.com/AWSBites You can listen to AWS Bites wherever you get your podcasts: - Apple Podcasts: https://podcasts.apple.com/us/podcast/aws-bites/id1585489017 - Spotify: https://open.spotify.com/show/3Lh7PzqBFV6yt5WsTAmO5q - Google: https://podcasts.google.com/feed/aHR0cHM6Ly9hbmNob3IuZm0vcy82YTMzMTJhMC9wb2RjYXN0L3Jzcw== - Breaker: https://www.breaker.audio/aws-bites - RSS: https://anchor.fm/s/6a3312a0/podcast/rss Do you have any AWS questions you would like us to address? Leave a comment here or connect with us on Twitter: - https://twitter.com/eoins - https://twitter.com/loige #aws #alarms #cloudwatch
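As a concrete example of the kind of alarm discussed here, creating one against the built-in Lambda error metric is a single API call. A boto3 sketch, where the function name and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the function reports one or more errors in a five-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",  # no invocations is not an error
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)
```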

AWS Bites
33. What can you do with CloudWatch metrics?

AWS Bites

Play Episode Listen Later Apr 21, 2022 33:01


CloudWatch is the main observability tool in AWS, and it offers a wide range of features: logs, metrics, dashboards, alarms, and even events (recently moved into EventBridge). In this episode we are going to focus on CloudWatch metrics. We are going to discuss the characteristics of metrics in CloudWatch (namespaces, dimensions, units, and more), which metrics you get out of the box, how to create your own, and how to access and explore metrics. Finally, we will compare CloudWatch to other providers like Datadog, New Relic, Honeycomb, and Grafana + Prometheus, and try to assess whether CloudWatch is enough or if you need to use other third-party services. In this episode, we mentioned the following resources: - How to send Gzipped requests with boto3 (which uses the PutMetricData API as an example): https://loige.co/how-to-send-gzipped-requests-with-boto3 - CloudWatch service quotas: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_limits.html - CloudWatch metric streams for Datadog: https://www.datadoghq.com/blog/amazon-cloudwatch-metric-streams-datadog/ This episode is also available on YouTube: https://www.youtube.com/AWSBites You can listen to AWS Bites wherever you get your podcasts: - Apple Podcasts: https://podcasts.apple.com/us/podcast/aws-bites/id1585489017 - Spotify: https://open.spotify.com/show/3Lh7PzqBFV6yt5WsTAmO5q - Google: https://podcasts.google.com/feed/aHR0cHM6Ly9hbmNob3IuZm0vcy82YTMzMTJhMC9wb2RjYXN0L3Jzcw== - Breaker: https://www.breaker.audio/aws-bites - RSS: https://anchor.fm/s/6a3312a0/podcast/rss Do you have any AWS questions you would like us to address? Leave a comment here or connect with us on Twitter: - https://twitter.com/eoins - https://twitter.com/loige
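To make the namespace, dimension, and unit vocabulary concrete: publishing a custom metric is one call to the PutMetricData API. A boto3 sketch with placeholder names:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One data point for a custom metric, in a custom namespace,
# qualified by an Environment dimension.
cloudwatch.put_metric_data(
    Namespace="MyApp",  # custom namespaces must not start with "AWS/"
    MetricData=[
        {
            "MetricName": "OrdersProcessed",
            "Dimensions": [{"Name": "Environment", "Value": "production"}],
            "Unit": "Count",
            "Value": 42,
        }
    ],
)
```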

Screaming in the Cloud
Creating “Quinntainers” with Casey Lee

Screaming in the Cloud

Play Episode Listen Later Apr 20, 2022 46:16


About Casey
Casey spends his days leveraging AWS to help organizations improve the speed at which they deliver software. With a background in software development, he has spent the past 20 years architecting, building, and supporting software systems for organizations ranging from startups to Fortune 500 enterprises.
Links Referenced:
- “17 Ways to Run Containers in AWS”: https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws/
- “17 More Ways to Run Containers on AWS”: https://www.lastweekinaws.com/blog/17-more-ways-to-run-containers-on-aws/
- kubernetestheeasyway.com: https://kubernetestheeasyway.com
- snark.cloud/quinntainers: https://snark.cloud/quinntainers
- ECS Chargeback: https://github.com/gaggle-net/ecs-chargeback
- twitter.com/nektos: https://twitter.com/nektos
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O. It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I've been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn't limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up screening all of their talent on English ability, as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday as your team. It's an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That's R-E-V-E-L-O dot I-O slash screaming.
Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is someone that I had the pleasure of meeting at re:Invent last year, but we'll get to that story in a minute.
Casey Lee is the CTO with a company called Gaggle, which is—as they frame it—saving lives. Now, that seems to be a relatively common position that an awful lot of different tech companies take. “We're saving lives here.” It's, “You show banner ads and some of them are attack platforms for JavaScript malware. Let's be serious here.” Casey, thank you for joining me, and what makes the statement that Gaggle saves lives not patently ridiculous?Casey: Sure. Thanks, Corey. Thanks for having me on the show. So Gaggle, we're ed-tech company. We sell software to school districts, and school districts use our software to help protect their students while the students use the school-issued Google or Microsoft accounts.So, we're looking for signs of bullying, harassment, self-harm, and potentially suicide from K-12 students while they're using these platforms. They will take the thoughts, concerns, emotions they're struggling with and write them in their school-issued accounts. We detect that and then we notify the school districts, and they get the students the help they need before they can do any permanent damage to themselves. We protect about 6 million students throughout the US. We ingest a lot of content.Last school year, over 6 billion files, about the equal number of emails ingested. We're looking for concerning content and then we have humans review the stuff that our machine learning algorithms detect and flag. About 40 million items had to go in front of humans last year, resulted in about 20,000 what we call PSSes. These are Possible Student Situations where students are talking about harming themselves or harming others. And that resulted in what we like to track as lives saved. 1400 incidents last school year where a student was dealing with suicide ideation, they were planning to take their own lives. We detect that and get them help within minutes before they can act on that. That's what Gaggle has been doing. We're using tech, solving tech problems, and also saving lives as we do it.Corey: It's easy to lob a criticism at some of the things you're alluding to, the idea of oh, you're using machine learning on student data for young kids, yadda, yadda, yadda. Look at the outcome, look at the privacy controls you have in place, and look at the outcomes you're driving to. Now, I don't necessarily trust the number of school administrations not to become heavy-handed and overbearing with it, but let's be clear, that's not the intent. That is not what the success stories you have alluded to. I've got to say I'm a fan, so thanks for doing what you're doing. I don't say that very often to people who work in tech companies.Casey: Cool. Thanks, Corey.Corey: But let's rewind a bit because you and I had passed like ships in the night on Twitter for a while, but last year at re:Invent something odd happened. First, my business partner procrastinated at getting his ticket—that's not the odd part; he does that a lot—but then suddenly ticket sales slammed shut and none were to be had anywhere. You reached out with a, “Hey, I have a spare ticket because someone can't go. Let me get it to you.” And I said, “Terrific. Let me pay you for the ticket and take you to dinner.”You said, “Yes on the dinner, but I'd rather you just look at my AWS bill and don't worry about the cost of the ticket.” “All right,” said I. I know a deal when I see one. We grabbed dinner at the Venetian. I said, “Bust out your laptop.” And you said, “Oh, I was kidding.” And I said, “Great. I wasn't. 
Bust it out.”And you went from laughing to taking notes in about the usual time that happens when I start looking at these things. But how was your recollection of that? I always tend to romanticize some of these things. Like, “And then everyone's restaurant just turned, stopped, and clapped the entire time.” Maybe that part didn't happen.Casey: Everything was right up until the clapping part. That was a really cool experience. I appreciate you walking through that with me. Yeah, we've got lots of opportunity to save on our AWS bill here at Gaggle, and in that little bit of time that we had together, I think I walked away with no more than a dozen ideas for where to shave some costs. The most obvious one, the first thing that you keyed in on, is we had RIs coming due that weren't really well-optimized and you steered me towards savings plans. We put that in place and we're able to apply those savings plans not just to our EC2 instances but also to our serverless spend as well.So, that was a very worthwhile and cost-effective dinner for us. The thing that was most surprising though, Corey, was your approach. Your approach to how to review our bill was not what I thought at all.Corey: Well, what did you expect my approach was going to be? Because this always is of interest to me. Like, do you expect me to, like, whip a portable machine learning rig out of my backpack full of GPUs or something?Casey: I didn't know if you had, like, some secret tool you were going to hit, or if nothing else, I thought you were going to go for the Cost Explorer. I spend a lot of time in Cost Explorer, that's my go-to tool, and you wanted nothing to do with Cost Exp—I think I was actually pulling up Cost Explorer for you and you said, “I'm not interested. Take me to the bills.” So, we went right to the billing dashboard, you started opening up the invoices, and I thought to myself, “I don't remember the last time I looked at an AWS invoice.” I just, it's noise; it's not something that I pay attention to.And I learned something, that you get a real quick view of both the cost and the usage. And that's what you were keyed in on, right? And you were looking at things relative to each other. “Okay, I have no idea about Gaggle or what they do, but normally, for a company that's spending x amount of dollars in EC2, why is your data transfer cost the way it is? Is that high or low?” So, you're looking for kind of relative numbers, but it was really cool watching you slice and dice that bill through the dashboard there.Corey: There are a few things I tie together there. Part of it is that this is sort of a surprising thing that people don't think about but start with big numbers first, rather than going alphabetically because I don't really care about your $6 Alexa for Business spend. I care a bit more about the $6 million, or whatever it happens to be at EC2—I'm pulling numbers completely out of the ether, let's be clear; I don't recall what the exact magnitude of your bill is and it's not relevant to the conversation.And then you see that and it's like, “Huh. Okay, you're spending $6 million on EC2. Why are you spending 400 bucks on S3? Seems to me that those two should be a little closer aligned. What's the deal here? Oh, God, you're using eight petabytes of EBS volumes. Oh, dear.”And just, it tends to lead to interesting stuff. Break it down by region, service, and use case—or usage type, rather—is what shows up on those exploded bills, and that's where I tend to start. 
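For readers who would rather script that first pass than page through invoices the way Corey does, a rough sketch against the Cost Explorer API gets to the same “big numbers first” view—the dates here are placeholders, and this is an approximation of the technique, not his actual workflow:

```python
import boto3

# A rough "big numbers first" pass: one month of unblended cost,
# grouped by service, sorted descending so the EC2-sized line items
# surface before the $6 Alexa for Business charges.
ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-03-01", "End": "2022-04-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = resp["ResultsByTime"][0]["Groups"]
groups.sort(key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
            reverse=True)
for g in groups:
    amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{g['Keys'][0]:<45} ${amount:>12,.2f}")
```

The same call, grouped by REGION or USAGE_TYPE instead of SERVICE, reproduces the other cuts Corey describes taking through an exploded bill.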
It also is one of the easiest things to wind up having someone throw into a PDF and email my way if I'm not doing it in a restaurant with, you know, people standing around clapping.Casey: [laugh]. Right.Corey: I also want to highlight that you've been using AWS for a long time. You're a Container Hero; you are not bad at understanding the nuances and depths of AWS, so praise from you around this stuff is something I value very highly. This stuff is not intuitive, it is deeply nuanced, and you have a business outcome you are working towards that invariably is not oriented day in, day out around, “How do I get these services for less money than I'm currently paying?” But that is how I see the world and I tend to live in a very different space just based on the nature of what I do. It's sort of a case study in the advantage of specialization. But I know remarkably little about containers, which is how we wound up reconnecting about a week or so before we did this recording.Casey: Yeah. I saw your tweet; you were trying to run some workload—container workload—and I could hear the frustration on the other end of Twitter when you were shaking your fist at—Corey: I should not tweet angrily, and I did in this case. And, eh, every time I do I regret it. But it played well with the people, so that does help. I believe my exact comment was, “‘me: I've got this container. Run it, please.' ‘Google Cloud: Run. You got it, boss.' AWS has 17 ways to run containers and they all suck.”And that's painting with an overly broad brush, let's be clear, but that was at the tail end of two or three days of work trying to solve a very specific, very common business problem, and I was just beating my head off a wall again and again and again. And it took less than half an hour from start to finish with Google Cloud Run and I didn't have to think about it anymore. And it's one of those moments where you look at this and realize that the future is here, we just don't see it in certain ways. And you took exception to this. So please, let's dive in because 280 characters of text after half a bottle of wine is not the best context to have a nuanced discussion that leaves friendships intact the following morning.Casey: Nice. Well, I just want to make sure I understand the use case first because I was trying to read between the lines on what you needed, but let me take a guess. My guess is you've got your source code in GitHub, you have a Dockerfile, and you want to be able to take that repo from GitHub and just have it continuously deployed somewhere in Run. And you don't want to have headaches with it; you just want to push more changes up to GitHub, Docker Build runs and updates some service somewhere. Am I right so far?Corey: Ish, but think a little further up the stack. It was in service of this show. So, this show, as people who are listening to this are probably aware by this point, periodically has sponsors, which we love: We thank them for participating in the ongoing support of this show, which empowers conversations like this. Sometimes a sponsor will come to us with, “Oh, and here's the URL we want to give people.” And it's, “First, you misspelled your company name from the common English word; there are three sublevels within the domain, and then you have a complex UTM tagging tracking co—yeah, you realize people are driving to work when they're listening to this?”So, I built a link shortener a while back, snark.cloud, because—is it the shortest thing in the world? 
Not really, but it's easily understandable when I say that, and people hear it for what it is. And that's been running for a long time as an S3 bucket full of redirects, behind CloudFront. So, I wind up adding a zero-byte object with a redirect parameter on it, and it just works.
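A minimal sketch of that setup with boto3—the bucket name, slug, and target below are hypothetical—showing how the redirect lives in object metadata rather than in a body:

```python
import boto3

s3 = boto3.client("s3")

def add_redirect(bucket: str, slug: str, target: str) -> None:
    """Create a zero-byte object whose only job is to redirect.

    When the bucket is served as an S3 static website (here, behind
    CloudFront), S3 answers requests for the key with a 301 to
    WebsiteRedirectLocation instead of serving the empty body.
    """
    s3.put_object(
        Bucket=bucket,                   # hypothetical bucket name
        Key=slug,                        # e.g., "quinntainers"
        Body=b"",                        # zero bytes on purpose
        WebsiteRedirectLocation=target,  # the long URL
    )

# Hypothetical usage:
# add_redirect("snark-cloud-redirects", "quinntainers",
#              "https://github.com/example/quinntainers")
```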
Now, the challenge that I have here as a business is that I am increasingly prolific these days. So, anything that I am not directly required to be doing, I probably shouldn't necessarily be the one to do. And care and feeding of those redirect links is a prime example of this. So, I went hunting, and the things that I was looking for were, obviously: do the redirect. Now, if you pull up GitHub, there are hundreds of solutions here.There are AWS blog posts. One that I really liked and almost got working was Eric Johnson's three-part blog post on how to do it serverlessly, with API Gateway and DynamoDB, no Lambdas required. I really liked aspects of what that was, but it was complex, I kept smacking into weird challenges as I went, and front-end is just baffling to me. Because I needed a front-end app for people to be able to use here, and I needed to be able to secure that, because it turns out that if anyone who stumbles across the URL can redirect things to other places, well, you've just empowered a whole bunch of spam email, and you're going to find that service abused, and everyone starts blocking it, and then you have trouble. Nothing survives the first encounter with jerks.And I was getting more and more frustrated, and then, with a few creative search terms, I found something on GitHub by a Twitter engineer who used to work at Google Cloud. And it doesn't build any kind of custom web app as a client. Instead, as a database, it uses not S3 objects, not Route 53—the ideal database—but a Google Sheet, which sounds ridiculous, but every business user here knows how to use that.Casey: Sure.Corey: And it looks for the two columns. The first one is the slug after the snark.cloud, and the second is the long URL. And it has a TTL of five seconds on cache, so make a change to that spreadsheet and, five seconds later, it's live. Everyone gets it, I don't have to build anything new, I just put it somewhere the relevant people can access it, I gave them a tutorial and a giant warning on it, and everyone gets that. And it just works well. It was, “Click here to deploy. Follow the steps.”And the documentation was a little, eh, okay, I had to undo it once and redo it again. Getting the domain ported over took a bit of time, and there were some weird SSL errors as the certificates were set up, but once all of that was done, it just worked. And I tested the heck out of it, and cold starts are relatively low, and the entire thing fits within the free tier. And it is reminiscent of the magic that I first saw when I started working with some of the cloud providers' services, years ago. It's been a long time since I had that level of delight with something, especially after three days of frustration. It's one of the, “This is a great service. Why are people not shouting about this from the rooftops?” That was my perspective. And I put it out on Twitter and oh, Lord, did I get comments. What was your take on it?Casey: Well, so my take was, when you're evaluating a platform to use for running your applications, how fast it can get you to Hello World is not necessarily the best way to go. I just assumed you're wrong. I assumed of the 17 ways AWS has to run containers, Corey just doesn't understand. And so I went after it. And I said, “Okay, let me see if I can find a way that solves his use case, as I understand it, through a quick tweet.”And so I tried App Runner; I saw that App Runner does not meet your needs because you have to somehow get your Docker image pushed up to a repo. App Runner can take an image that's already been pushed up and deployed for you, or it can build from source, but neither of those was the way I understood your use case.Corey: Having used App Runner before via the Copilot CLI, it is the closest, as best I can tell, to achieving what I want. But also let's be clear that I don't believe there's a free tier; there needs to be a load balancer in front of it, so you're starting with 15 bucks a month for this thing. Which is not the end of the world. Had I known at the beginning that all of this was going to be there, I would have just signed up for a bit.ly account and called it good. But here we are.Casey: Yeah. I tried Copilot. Copilot is a great developer experience, but it also is just pulling together tons of—I mean, just trying to do a Copilot service deploy, VPCs are being created and tons of IAM roles are being created, code pipelines, there's just so much going on. I was like 20 minutes into it, and I said, “Yeah, this is not fitting the bill for what Corey was looking for.” Plus, it doesn't solve the use case as I understood it, which is you don't want to worry about builds, you just want to push code and have new Docker images get built for you.Corey: Well, honestly, let's be clear here, once it's up and running, I don't want to ever have to touch the silly thing again.Casey: Right.Corey: And so far that has been the case, after I forked the repo and made a couple of changes to it that I wanted to see. One of them was to render the entire thing case insensitive because I get that one wrong a lot, and the other is I wanted to change the permanent 301 redirect to a temporary 302 redirect because occasionally, sponsors will want to change where it goes in the fullness of time. And that is just fine, but I want to be able to support that and not have to deal with old cached data. So, getting that up and running was a bit of a challenge. But the way that it worked was following the instructions in the GitHub repo.The developer environment that spun up in Google's Cloud Shell was just spectacular. It prompted me for a few things and it told me step by step what to do. This is the sort of thing I could have given a basically non-technical user, and they would have had success with it.Casey: So, I tried it as well. I said, “Well, okay, if I'm going to respond to Corey here and challenge him on this, I need to try Cloud Run.” I had no experience with Cloud Run. I had a small example repo that loosely mapped what I understood you were trying to do. Within five minutes, I had Cloud Run working.And I was surprised that anytime I pushed a new change, within 45 seconds the change was built and deployed. So, here's my conclusion, Corey. Google Cloud Run is great for your use case, and AWS doesn't have the perfect answer. But here's my challenge to you. 
I think that you just proved why there are 17 different ways to run containers on AWS: because there are that many different types of users that have different needs, and you just happen to be number 18 that hasn't gotten the right attention yet from AWS.Corey: Well, let's be clear, like, my gag about 17 ways to run containers on AWS was largely a joke, and it went around the internet three times. So, I wrote a list of them in the blog post “17 Ways to Run Containers in AWS” and people liked it. And then a few months later, I wrote “17 More Ways to Run Containers on AWS” listing 17 additional services that all run containers.And my favorite email that I think I've ever received in feedback was from a salty AWS employee, saying that one of them didn't really count because of some esoteric reason. And it turns out that when I'm trying to make a point of “you have a sarcastic number of ways to run containers,” pointing out that, well, one of them isn't quite valid doesn't really shatter the argument, let's be very clear here. So, I appreciate the feedback, I always do. And it's partially snark, but there is an element of truth to it in that customers don't want to run containers, by and large. That is what they do in service of a business goal.And they want their application to run, which in turn serves the business goal that continues to abstract out into, “Remain a going concern via the current position the company stakes out.” In your case, it is saving lives; in my case, it is fixing horrifying AWS bills and making fun of Amazon at the same time, and in most other places, there are somewhat more prosaic answers to that. But containers are simply an implementation detail, to some extent—to my way of thinking—of getting to that point. An important one [unintelligible 00:18:20], let's be clear; I was very anti-container for a long time. I wrote a talk, “Heresy in the Church of Docker” that then was accepted at ContainerCon. It's like, “Oh, boy, I'm not going to leave here alive.”And the honest answer is, many years later, that Kubernetes solves almost all the criticisms that I had, with the downside of, well, first, you have to learn Kubernetes, and that continues to be mind-bogglingly complex from where I sit. There's a reason that I've registered kubernetestheeasyway.com and repointed it to ECS, Amazon's container service that does not require you to cosplay as a cloud provider yourself. But even ECS has a number of challenges to it, I want to be very clear here. There are no silver bullets in this.And you're completely correct in that I have a large, complex environment, and the application is nuanced, and I'm willing to invest a few weeks in setting up the baseline underlying infrastructure on AWS with some of these services, ideally not all of them at once because that's something a lunatic would do, but getting them up and running. The other side of it, though, is that if I am trying to evaluate a cloud provider's handling of containers and how this stuff works, the reason that everyone starts with a Hello World-style example is that it delivers, ideally, a short mean time to dopamine. There's a reason that Hello World doesn't have 18 different dependencies across a bunch of different databases and message queues and all the other complicated parts of running a modern application. Because you just want to see how it works out of the gate. 
And if getting that baseline empty container that just returns the string ‘Hello World' is that complicated and requires that much work, my takeaway is not that this user experience is going to get better once I make the application itself more complicated.So, I find that off-putting. My approach has always been find something that I can get the easy, minimum viable thing up and running on, and then, as I expand, know that you'll be there to catch me as my needs intensify and become ever more complex. But if I can't get the baseline thing up and running, I'm unlikely to be super enthused about continuing to beat my head against the wall like, “Well, I'll just make it more complex. That'll solve the problem.” Because it often does not. That's my position.Casey: Yeah, I agree that dopamine hit is valuable in getting people attached to and wanting to invest in whatever tech stack they're using. The challenge is your second part of that. Your second part is will it grow with me and scale with me and support the complex edge cases that I have? And the problem I've seen is a lot of organizations will start with something that's very easy to get started with and then quickly outgrow it, and then come up with all sorts of weird Rube Goldberg-type solutions. Because they jumped all in before seeing—I've got kind of an example of that.I'm happy to announce that there's now 18 ways to run containers on AWS. Because, in the spirit of AWS customer obsession, I hear your use case, and I've created an open-source project that I want to share called Quinntainers—Corey: Oh, no.Casey: —and it solves—yes. Quinntainers is live and is ready for the world. So, now we've got 18 ways to run containers. And if you have Corey's use case of, “Hey, here's my container. Run it for me,” now we've got one command that you can run to get things going for you. I can share a link for you and you could check it out. This is a [unintelligible 00:21:38]—Corey: Oh, we're putting that in the [show notes 00:21:37], for sure. In fact, if you go to snark.cloud/quinntainers, you'll find it.Casey: You'll find it. There you go. The idea here was this: There is a real use case that you had, and I looked at it: AWS does not have an out-of-the-box simple solution for you. I agree with that. And Google Cloud Run does.Well, the answer from AWS would have been, “Well, then here, we need to make that solution.” And so that's what this was: a way to demonstrate that it is a solvable problem. AWS has all the right primitives, just that use case hadn't been covered. So, how does Quinntainers work? Real straightforward: It's a command-line—it's an NPM tool.You just run [npx 00:22:17] quinntainer, it sets up a GitHub Actions role in your AWS account, it then creates a GitHub Actions workflow in your repo, and then uses the Quinntainer GitHub action—a reusable action—that creates the image for you; every time you push to the branch, it pushes it up to ECR, and then automatically pushes up that new version of the image to App Runner for you. So, now it's using App Runner under the covers, but it's providing that nice developer experience that you are getting out of Cloud Run. Look, is Quinntainer really the right way to go to run containers? No, I'm not making that point at all. But the point is it is a—Corey: It might very well be.Casey: Well, if you want to show a good Hello World experience, Quinntainer's the best because within 30 seconds, your app is now set up to continuously deliver containers into AWS for your very specific use case. 
The problem is, it's not going to grow for you. I mean, it was something I did over the weekend just for fun; it's not something that would ever be worthy of hitching up a real production workload to. So, the point there is, you can build frameworks and tools that are very good at getting that initial dopamine hit, but then are not necessarily going to be there for you as you mature and get more complex.Corey: And yet, I've tilted a couple of times at the windmill of integrating GitHub Actions in anything remotely resembling a programmatic way with AWS services, as far as instance roles go. Are you using permanent credentials for this as stored secrets or are you doing the [OIDC 00:23:50] handoff?Casey: OIDC. So, what happens is the tool creates the IAM role for you with the trust policy on GitHub's OIDC provider, sets all that up for you in your account, locks it down so that just your repo and your main branch is able to assume the role, and the role is set up just to allow deployments to App Runner and the ECR repository. And then that's it. At that point, it's out of your way. And you just git push, and a couple minutes later, your updates are now running in App Runner for you.
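A rough sketch of that trust relationship with boto3—the account ID, repo, and role name are hypothetical placeholders, and the actual Quinntainers tool may structure this differently:

```python
import json
import boto3

iam = boto3.client("iam")

ACCOUNT_ID = "123456789012"           # hypothetical
REPO = "example-org/example-repo"     # hypothetical; the role is locked to it

# Trust only GitHub's OIDC provider, and only tokens minted for
# this repo's main branch.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Federated": f"arn:aws:iam::{ACCOUNT_ID}:oidc-provider/"
                         "token.actions.githubusercontent.com"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
            },
            "StringLike": {
                "token.actions.githubusercontent.com:sub":
                    f"repo:{REPO}:ref:refs/heads/main"
            },
        },
    }],
}

iam.create_role(
    RoleName="github-actions-deploy",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
# A permissions policy scoped to ECR pushes and App Runner deployments
# would then be attached with iam.put_role_policy(...).
```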
Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors, without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU-, memory-, or storage-optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices, and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. Screaming in the Cloud listeners can try Vultr for free today with $150 in credit when they visit getvultr.com/screaming. That's G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast.Corey: Don't undersell what you've just built. Is this what I would use for a large-scale production deployment? Obviously not. But it has streamlined and made incredibly accessible things that previously have been very complex for folks to get up and running. One of the most disturbing themes behind some of the feedback I got was, at one point I said, “Well, have you tried running a Docker container on Lambda?” Because now it supports containers as a packaging format. And I said no because I spent a few weeks getting Lambda up and running back when it first came out and I've basically been copying and pasting what I got working ever since, the way most of us do.And the response was, “Oh, that explains a lot.” With the implication being that I'm just a fool. Maybe, but let's be clear, I am never the only person in the room who doesn't know how to do something; I'm just loud about what I don't know. And the failure mode of a bad user experience is that a customer feels dumb. And that's not okay because this stuff is complicated, and when a user has a bad time, it's a bug.I learned that in 2012 from Jordan Sissel, the creator of Logstash. He has been an inspiration to me for the last ten years. And that's something I try to live by: if a user has a bad time, something needs to get fixed. Maybe it's the tool itself, maybe it's the documentation, maybe it's the way that GitHub repo's readme is structured in a way that just makes it accessible.Because I am not a trailblazer in most things, nor do I intend to be. I'm not the world's best engineer by a landslide. Just look at my code and you'd dispute the fact that I'm an engineer at all. But “if it's bad and it works, how bad is it?” is sort of the other side of it.So, my problem is that there needs to be a couple of things. Ignore for a second the aspect of making it the right answer to get something out of the door. The fact that I want to take this container and just run it, and you and I both reach for App Runner as the default AWS service that does this, is because I've been swimming in the AWS waters a while and you're a frickin' AWS Container Hero, where it is expected that you know what most of these things do. For someone who shows up on the containers webpage—which, by the way, lists, I believe, 15 ways to run containers on mobile and 19 ways to run containers on non-mobile, which is just fascinating in its own right—it's overwhelming, it's confusing, and it's not something that makes it abundantly clear what the golden path is. First, get it up and working, get it running, then you can add nuance and flavor and the rest, and I think that's something that's gotten overlooked in our mad rush to pretend that we're all Google engineers, circa 2012.Casey: Mmm. I think people get stressed out when they try to run containers in AWS because they think, “What is that golden path?” You said golden path. And my advice to people is there is no golden path. And the great thing about AWS is they do continue to invest in the solutions they come up with. I'm still bitter about Google Reader.Corey: As am I.Casey: Yeah. I put so much time into getting my perfect set of RSS feeds and then I had to find somewhere else to—with AWS, the different offerings that are available for running containers, those are there intentionally, it's not by accident. They're there to solve specific problems, so the trick is finding what works best for you, and don't feel like one is better than another or is going to get more attention than the others. And they each have different use cases.And I approach it this way. I've seen a couple of different people do some great flowcharts—I think Forrest did one, Vlad did one—on ways to make the decision on how to run your containers. And I break it down to three questions. I ask people first of all, where are you going to run these workloads? If someone says, “It has to be in the data center,” okay, cool, then ECS Anywhere or EKS Anywhere, and we'll figure out if Kubernetes is needed.If they have specific requirements—so if they say, “No, we can run in the cloud, but we need privileged mode for containers,” or, “We need EBS volumes,” or, “We want really small container sizes,” like, less than a quarter vCPU or less than half a gig of RAM—or if you have custom log requirements, Fargate is not going to work for you, so you're going to run on EC2. Otherwise, run it on Fargate. But that's the first question: Figure out where you're going to run your containers. That leads to the second question: What's your control plane?But those are different, sort of related but different questions. And I only see six options there. 
That's App Runner for your control plane, LightSail for your control plane, ROSA if you're invested in OpenShift already, EKS if you either have momentum in Kubernetes or you have a bunch of engineers that have a bunch of experience with Kubernetes—if you don't have either, don't choose it—or ECS. The last option is Elastic Beanstalk, but let's leave that as a—if you're not currently invested in Elastic Beanstalk, don't start today. But I look at those as, okay, so—first question, where am I going to run my containers? Second question, what do I want to use for my control plane? And there's different pros and cons of each of those.And then the third question, how do I want to manage them? What tools do I want to use for managing deployment? All those other tools like Copilot or App2Container or Proton, those aren't my control plane; those aren't where I run my containers; that's how I manage, deploy, and orchestrate all the different containers. So, I look at it as those three questions. But I don't know, what do you think of that, Corey?Corey: I think you're onto something. I think that is a terrific way of exploring that question. I would argue that setting up a framework like that—one or very similar—is what the AWS containers page should be, just coming from the perspective of what is the neophyte customer experience. On some level, you almost need a slider of “choose your level of experience,” ranging from, “What's a container?” To, “I named my kid Kubernetes because I make terrible life decisions,” and anywhere in between.
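One way to encode Casey's three questions as a rough heuristic—a loose paraphrase of his flowchart, not an official AWS decision tree:

```python
def where_to_run(on_prem: bool, needs_privileged: bool, needs_ebs: bool,
                 tiny_tasks: bool, custom_logging: bool) -> str:
    """Question 1: where do the containers run?"""
    if on_prem:
        return "ECS Anywhere / EKS Anywhere"
    # Fargate can't do privileged mode, EBS volumes, tasks below its
    # minimum size, or fully custom log plumbing.
    if needs_privileged or needs_ebs or tiny_tasks or custom_logging:
        return "EC2-backed"
    return "Fargate"

def control_plane(console_user: bool, wants_scale_to_zero: bool,
                  invested_in_openshift: bool, kubernetes_momentum: bool) -> str:
    """Question 2: which of the six-ish control plane options?"""
    if console_user:
        return "LightSail"
    if wants_scale_to_zero:
        return "App Runner"
    if invested_in_openshift:
        return "ROSA"
    if kubernetes_momentum:
        return "EKS"
    return "ECS"  # Elastic Beanstalk only if you're already invested in it

# Question 3—management tooling such as Copilot, App2Container, or
# Proton—layers on top of whatever the first two answers turn out to be.
```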
Casey: Sure. Yeah, well, and I think that really dictates the control plane level. So, for example, LightSail, where does LightSail fit? To me, the value of LightSail is the simplicity. I'm looking at monthly pricing: seven bucks a month for a container. I don't know how [unintelligible 00:30:23] works, but I can think in terms of monthly pricing. And it's tailored towards a console user, someone who just wants to click in, point to an image. That's a very specific user; there's thousands of customers that are very happy with that experience, and they use it. App Runner presents that scale to zero. That's one of the big selling points I see with App Runner. Likewise with Google Cloud Run. I've got that scale to zero. I can't do that with ECS, or EKS, or any of the other platforms. So, if you've got something that has a ton of idle time, I'd really be looking at those. I would argue—I think I did the math—Google Cloud Run is about 30% more expensive than App Runner.Corey: Yeah, if you disregard the free tier—have it running persistently at all times throughout the month to drop out the cold starts—it would cost something like 40-some-odd bucks a month or something like that. Don't quote me on it. And to be clear, I wound up doing this very congratulatory and complimentary tweet about them on, I think it was, Thursday, and then they immediately apparently took one look at this and said, “Holy shit. Corey's saying nice things about us. What do we do? What do we do?” Panic.And the next morning, they raised prices on a bunch of cloud offerings. Whew, that'll fix it. Like—Casey: [laugh].Corey: Di-, did you miss the direction you're going on here? No, that's the exact opposite of what you should be doing. But here we are. Interestingly enough, to tie our two conversation threads together, when I look at an AWS bill, unless you're using Fargate, I can't tell whether you're using Kubernetes or not, because EKS is a small charge—and in almost every case that's just for the control plane—or Fargate under it.Everything else just manifests as EC2 spend from the perspective of the cloud provider. If you're running a Kubernetes cluster, it is a single-tenant application that can have some very funky behaviors, like cross-AZ chatter back and forth, because there's no internal mechanism to say talk to the free thing rather than the two-cents-a-gigabyte thing. It winds up spinning up and down in a bunch of different ways, and the behavior patterns, because of how placement works, are not necessarily deterministic, depending upon workload. And that becomes something that people find odd when it's, “Okay, we look at our bill for a week, what can you say?”“Well, first question. Are you running Kubernetes at all?” And they're like, “Who invited these clowns?” Understand, we're not prying into your workloads for a variety of excellent legal and contractual reasons, here. We are looking at how they behave, and for specific workloads, once we have a conversation with the engineering team, yeah, we're going to dive in, but it is not at all intuitive from the outside to make any determination whether you're running containers, or whether you're running VMs that you just haven't done anything with in 20 years, or what exactly is going on. And that's just an artifact of the billing system.Casey: We ran into this challenge at Gaggle. We don't use EKS, we use ECS, but we have some shared clusters, lots of EC2 spend, and it's hard to figure out which team is creating the services that are running that up. We actually ended up creating a tool—we open-sourced it—ECS Chargeback, and what it does is it looks at the CPU and memory reservations for each task definition, prorates the overall charge of the ECS cluster, and then creates metrics in Datadog to give us a breakdown of cost per ECS service. And it also measures what we like to refer to as waste, right? Because if you're reserving four gigs of memory, but your utilization never goes over two gigs, we're paying for that reservation, but you're underutilizing.So, we're able to also show which services have the highest degree of waste, not just utilization, so it helps us go after it.
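A toy version of the proration and waste math Casey describes—the real gaggle-net/ecs-chargeback tool is more involved (it also weighs CPU and reports to Datadog), and these numbers are made up:

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    reserved_memory_gb: float  # from the task definition
    peak_memory_gb: float      # observed utilization

def chargeback(cluster_monthly_cost: float, services: list[Service]) -> None:
    """Prorate the cluster bill by each service's share of reservations
    (memory only, for brevity) and flag waste: reserved but unused capacity."""
    total_reserved = sum(s.reserved_memory_gb for s in services)
    for s in services:
        share = s.reserved_memory_gb / total_reserved
        cost = share * cluster_monthly_cost
        waste = max(0.0, s.reserved_memory_gb - s.peak_memory_gb) / s.reserved_memory_gb
        print(f"{s.name}: ${cost:,.2f}/mo, {waste:.0%} of reservation unused")

# Casey's example: reserving 4 GB while never using more than 2 is 50% waste.
chargeback(10_000, [
    Service("api", reserved_memory_gb=4, peak_memory_gb=2),
    Service("worker", reserved_memory_gb=8, peak_memory_gb=7),
])
```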
But this is a hard problem. I'd be curious, how do you approach these shared ECS resources and slicing and dicing those bills?Corey: Everyone has a different approach, too. There is no unifiable, correct answer. A previous show guest, Peter Hamilton, over at Remind, had done something very similar and open-sourced a bunch of these things. Understanding what your spend is, is important on this, and it comes down to getting at the actual business concern, because in some cases, effectively dead reckoning is enough. You take a look at the cluster that is really hard to attribute because it's a shared service. Great. It is 5% of your bill.First pass, why don't we just agree that it is a third for Service A, two-thirds for Service B, and we'll call it mostly good at that point? That can be enough in a lot of cases. With scale, [laugh] you're just sort of hand-waving over many millions of dollars a year there. How about we get into some more depth? And then you start instrumenting and reporting to something, be it CloudWatch, be it Datadog, be it something else, and understanding what the use case is.In some cases, customers have broken apart shared clusters for that specific reason. I don't think that's necessarily the best approach from an engineering perspective, but again, this is not purely an engineering decision. It comes down to serving the business need. And if you're taking partial credits on that cluster—for an R&D tax credit, for example—you want that position to be extraordinarily defensible, and spending a few extra dollars to ensure that it is, is the right business decision. I mean, again, we're pure advisory; we advise customers on what we would do in their position, but people often mistake that to be we're going to go for the lowest possible price—bad idea—or that we're going to wind up doing this from a purely engineering-centric point of view.It's, be aware that in almost every case, with some very notable weird exceptions, the AWS bill costs significantly less than the payroll expense that you have of people working on the AWS environment in various ways. People are more expensive, so the idea of, well, you can save a whole bunch of engineering effort by spending a bit more on your cloud—yeah, let's go ahead and do that.Casey: Yeah, good point.Corey: The real mark of someone who's senior enough is that their answer to almost any question is, “It depends.” And I feel I've fallen into that trap as well. Much as I'd love to sit here and say, “Oh, it's really simple. You do X, Y, and Z.” Yeah… honestly, my answer, the simple answer, is I think that we orchestrate a cyber-bullying campaign against AWS through the AWS wishlist hashtag, we get people to harass their account managers with repeated requests for, “Hey, could you go ahead and [dip 00:36:19] that thing in—give that a plus-one for me in whatever internal system you're using?”Just because this is a problem we're seeing more and more. Given that it's an unbounded growth problem, we're going to see it more and more for the foreseeable future. So, I wish I had a better answer for you, but yeah, “that stuff's super hard” is honest, but it's also not the most useful answer for most of us.Casey: I'd love feedback from anyone, from you or your team, on that tool that we created. I can share a link after the fact. ECS Chargeback is what we call it.Corey: Excellent. I will follow up with you separately on that. That is always worth diving into. I'm curious to see new and exciting approaches to this. Just be aware that we have an obnoxious talent sometimes for seeing these things and, “Well, what about”—and asking about some weird corner edge case that either invalidates the entire thing, or you're like, “Who on earth would ever have a problem like that?” And the answer is always, “The next customer.”Casey: Yeah.Corey: For a bounded problem space like the AWS bill, every time I think I've seen it all, I just have to talk to one more customer.Casey: Mmm. Cool.Corey: In fact, the way that we approached your teardown in the restaurant is how we launched our first-pass approach. Because the value in something like that is different than the value of a six-to-eight-week-long, deep-dive engagement into every nook and cranny. And—Casey: Yeah, for sure. 
It was valuable to us.Corey: Yeah, having someone come in to just spend a day with your team, diving into it up one side and down the other, it seems like a weird thing—like, “How much good could you possibly do in a day?” And the answer in some cases is a lot. We had Honeycomb say that in a couple of days of something like this, we wound up blowing 10% off their entire operating budget for the company; it led to an increased valuation. Liz Fong-Jones says—on multiple occasions—that the company would not be what it was without our efforts on their bill, which is just incredibly gratifying to hear. It's easy to get lost in the idea of, well, it's the AWS bill. It's just making big companies spend a little bit less to another big company. And that's not exactly, you know, saving the lives of K through 12 students here.Casey: It's opening up opportunities.Corey: Yeah. It's about optimizing for the win for everyone. Because now AWS gets a lot more money from Honeycomb than they would if Honeycomb had not continued on their trajectory. It's, you can charge customers a lot right now, or you can charge them a little bit over time and grow with them in a partnership context. I've always opted for the second model rather than the first.Casey: Right on.Corey: But here we are. I want to thank you for taking so much time out of, well, several days now to argue with me on Twitter, which is always appreciated, particularly when it's, you know, constructive—thanks for that—Casey: Yeah.Corey: For helping me get my business partner to re:Invent, although then he got me that horrible puzzle of 1000 pieces for the Cloud-Native Computing Foundation landscape and now I don't ever want to see him again—so you know, that happens—and of course, spending the time to write Quinntainers, which is going to be at snark.cloud/quinntainers as soon as we're done with this recording. Then I'm going to kick the tires and send some pull requests.Casey: Right on. Yeah, thanks for having me. I appreciate you starting the conversation. I would just conclude with: I think that, yes, there are a lot of ways to run containers in AWS; don't let it stress you out. They're there with intention, they're there by design. Understand them.I would also encourage people to go a little deeper, especially if you've got a significantly large workload. You've got to get your hands dirty. As a matter of fact, there's a hands-on lab that a company called Liatrio does. They call it their Night Lab; it's a one-day, free, hands-on lab where you run legacy monolithic Java applications on Kubernetes. It gives you first-hand experience and gets all the way up into observability and doing things like canary deployments. It's a great, great lab.But you've got to do something like that to really get your hands dirty and understand how these things work. So, don't sweat it; there's not one right way. There's a way that will probably work best for each user, and just take the time and understand the ways to make sure you're applying the one that's going to give you the most runway for your workload.Corey: I will definitely dig into that myself. But I think you're right—I think you have nailed a point that is, again, a nuanced one and challenging to put in a rage tweet. But the services don't exist in a vacuum. They're not there because, despite the joke, someone wants to get promoted. 
It's because there are customer needs going on there, and this is another way of meeting those needs.I think there could be better guidance, but I also understand that there are a lot of nuanced perspectives here and that… hell is someone else's workflow—Casey: [laugh].Corey: —and there's always value in broadening your perspective a bit on those things. If people want to learn more about you and how you see the world, where's the best place to find you?Casey: Probably on Twitter: twitter.com/nektos, N-E-K-T-O-S.Corey: That might be the first time Twitter has been described as a best place for anything. But—Casey: [laugh].Corey: Thank you once again, for your time. It is always appreciated.Casey: Thanks, Corey.Corey: Casey Lee, CTO at Gaggle and AWS Container Hero. And apparently writing code in anger to invalidate my points, which is always appreciated. Please do more of that, folks. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or in the YouTube comments, which is always a great place to go reading, whereas if you've hated this podcast, please leave a five-star review in the usual places and an angry comment telling me that I'm completely wrong, and then launch your own open-source tool to point out exactly what I've gotten wrong this time.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Cribl Sharpens the Security Edge with Clint Sharp

Screaming in the Cloud

Play Episode Listen Later Mar 22, 2022 37:35


About Clint
Clint is the CEO and a co-founder at Cribl, a company focused on making observability viable for any organization, giving customers visibility and control over their data while maximizing value from existing tools. Prior to co-founding Cribl, Clint spent two decades leading product management and IT operations at technology and software companies, including Splunk and Cricket Communications. As a former practitioner, he has deep expertise in network issues, database administration, and security operations.

Links:
Cribl: https://cribl.io/
Docs.cribl.io: https://docs.cribl.io
Sandbox.cribl.io: https://sandbox.cribl.io

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Today's episode is brought to you in part by our friends at MinIO, the high-performance, Kubernetes-native object store that's built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that's exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100-megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io/download, and see for yourself. That's min.io/download, and be sure to tell them that I sent you.Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They've also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I have a repeat guest joining me on this promoted episode. Clint Sharp is the CEO and co-founder of Cribl. Clint, thanks for joining me.Clint: Hey, Corey, nice to be back.Corey: I was super excited when you gave me the premise for this recording because you said you had some news to talk about, and I was really excited that oh, great, they're finally going to buy a vowel so that people look at their name and understand how to pronounce it. And no, that's nowhere near forward-looking enough. Instead it's some, I guess, I don't know, some product announcement or something. But you know, hope springs eternal. What have you got for us today?Clint: Well, one of the reasons I love talking to your audience is because product announcements actually matter to this audience. 
It's super interesting. As you get into starting a company, you're such, like, a product person, you're like, “Oh, I have this new set of things that's really going to make your life better.” And then you go out to, like, the general media, and you're like, “Hey, I have this product.” And they're like, “I don't care. What product? Do you have a funding announcement? Do you have something big in the market that—you know, do you have a new executive? Do you”—it's like, “No, but, like, these features, like these things, that we—the way we make our lives better for our customers. Isn't that interesting?” “No.”Corey: Real depressing once you—“Do you have a security breach to announce?” It's, “No. God no. Why would I wind up being that excited about it?” “Well, I don't know. I'd be that excited about it.” And yeah, the stuff that mainstream media wants to write about in the context of tech companies is exactly the sort of thing that tech companies absolutely do not want to be written about for. But fortunately, that is neither here nor there.Clint: Yeah, they want the thing that gets the clicks.Corey: Exactly. You built a product that absolutely resonates in its target market and outside of that market. It's one of those, what is that thing, again? If you could give us a light refresher on what Cribl is and does, you'll probably do a better job of it than I will. We hope.Clint: We'd love to. Yeah, so we are an observability company, fundamentally. I think one of the interesting things to talk about when it comes to observability is that observability and security are merging. And so I like to say observability and include security people. If you're a security person, and you don't feel included by the word observability, sorry.We also include you; you're under our tent here. So, we sell to technology professionals, we help make their lives better. And we do that today through a flagship product called LogStream—which, as part of this announcement, we're actually renaming to Stream. In some ways, we're dropping logs—and we are a pipeline company. So, we help you take all of your existing agents, all of your existing data that's moving, and we help you process that data in the stream to control costs and to send it multiple places.And it sounds kind of silly, but one of the biggest problems that we end up solving for a lot of our enterprises is, “Hey, I've got, like, this old Syslog feed coming off of my firewalls”—like, you remember those things, right? Palo Alto firewalls, ASA firewalls—“I actually need to get that thing to multiple places because, hey, I want to get that data into another security solution. I want to get that data into a data lake. How do I do that?” Well, in today's world, that actually turns out to be sort of a neglected set of features. Like, for the vendors who provide you logging solutions, being able to reshape that data, filter that data, and control costs wasn't necessarily at the top of the priority list.It wasn't nefarious. It wasn't like people are like, “Oh, I'm going to make sure that they can't process this data before it comes into my solution.” It's more just, like, “I'll get around to it eventually.” And the eventually never actually comes. And so our streaming product helps people do that today.And the big announcement that we're making this week is that we're extending that same processing technology down to the endpoint with a new product we're calling Cribl Edge. And so we're taking our existing best-in-class management technology, and we're turning it into an agent. 
And that seems kind of interesting because… I think everybody sort of assumed that the agent is dead. Okay, well, we've been building agents for a decade or two decades. Isn't everything exactly the same as it was before?But we really saw kind of a dearth of innovation in that area in terms of being able to manage your agents, being able to understand what data is available to be collected, being able to auto-discover the data that needs to be collected, turning those agents into interactive troubleshooting experiences so that we can, kind of, replicate the ability to zoom into a remote endpoint and replicate that Linux command-line experience that we're not supposed to be getting anymore because we're not supposed to SSH into boxes anymore. Well, how do I replicate that? How do I see how much disk is on this given endpoint if I can't SSH into that box? And so Cribl Edge is a rethink about making this rich, interactive experience on top of all of these agents that become this really massive distributed system where we can process data all the way out at where the data is being emitted.And so that means that now we don't nec—if you want to process that data in the stream, okay, great, but if you want to process that data at its origination point, we can actually provide you cheaper cost because now you're using a lot of that capacity that's sitting out there on your endpoints that isn't really being used today anyway—the average utilization of a Kubernetes cluster is like 30%—Corey: It's that high. I'm sort of surprised.Clint: Right? I know. So, Datadog puts out the survey every year, which I think is really interesting, and that's a number that always surprises me: people are already paying for this capacity, right? It's sitting there, it's on their AWS bill already, and with that average utilization, a lot of the stuff that we're doing in other clusters, or while we're moving that data, can actually just be done right there where the data is being emitted. And also, if we're doing things like filtering, we can lower egress charges; there's lots of really, really good goodness that we can do by pushing that processing further, closer to its origination point.
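As an illustration of the sort of edge filtering Clint is describing—a generic sketch, not Cribl's actual configuration or processing language—dropping noise before it leaves the node is what saves the egress and the downstream ingest:

```python
import json

# Hypothetical noise patterns; in practice these would be configurable.
DROP_SUBSTRINGS = ("health-check", "heartbeat")

def filter_events(raw_lines):
    """Yield only the events worth shipping off the node."""
    for line in raw_lines:
        event = json.loads(line)
        if event.get("level") == "DEBUG":
            continue   # debug noise stays local
        if any(s in event.get("msg", "") for s in DROP_SUBSTRINGS):
            continue   # load-balancer chatter, etc.
        yield event

# Everything that survives the filter goes to one or more upstream
# destinations; everything dropped here never incurs egress charges.
```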
Corey: You know, the timing of this episode is somewhat apt because as of the time that we're recording this, I spent most of yesterday troubleshooting and fixing my home wireless network, which is a whole Ubiquiti-managed thing. And the controller was one of their all-in-one box things that kept more or less power cycling for no apparent reason. How do I figure out why it's doing that? Well, I'm used to, these days, doing everything in a cloud environment where you can instrument things pretty easily, where things start and where things stop is well understood. Finally, I just gave up and used a controller that's sitting on an EC2 instance somewhere, and now great, now I can get useful telemetry out of it because now it's stuff I know how to deal with.It also turns out that, surprise, my EC2 instance is not magically restarting itself due to heat issues. What a concept. So, I have a newfound appreciation for the fact that oh, yeah, not everything lives in a cloud provider's regions. Who knew? This is a revelation that I think is going to be somewhat surprising for folks who've been building startups and believe that anything that's older than 18 months doesn't exist.But there's a lot of data centers out there, there are a lot of agents living all kinds of different places. And workloads continue to surprise me even now, just looking at my own client base. It's a very diverse world when we're talking about whether things are on-prem or whether they're in cloud environments.Clint: Well, also, there's a lot of agents on every endpoint, period, just due to the fact that the security guys want an agent, the observability guys want an agent, the logging people want an agent. And then suddenly, I'm, you know, I'm looking at every endpoint—cloud, on-prem, whatever—and there's 8, 10 agents sitting there. And so I think a lot of the opportunity that we saw was, we can unify the data collection for metric-type data. So, we have some really cool defaults. [unintelligible 00:07:30] this is one of the things where I think people don't focus much on, kind of, the end-user experience. Like, let's have reasonable defaults.Let's have the thing turn on and, actually, most people's needs are met without tweaking any knobs or buttons, and no diving into YAML files and looking at documentation and trying to figure out exactly the way I need to configure this thing. Let's collect metric data, let's collect log data, let's do it all from one central place with one agent that can send that data to multiple places.And I can send it to Grafana Cloud, if I want to; I can send it to Logz.io, I can send it to Splunk, I can send it to Elasticsearch, I can send it to AWS's new Elasticsearch-y thing that we don't know what they're going to call yet after the lawsuit. Any of those can be done right from the endpoint from, like, a rich graphical experience. And I think there's really a desire now for people not to have to jump into these configuration files—for a lot of these users, this is a part-time job—and so, hey, if I need to go set up data collection, do I want to learn about this detailed YAML file configuration that I'm only going to do once or twice, or should I be able to do it in an easy, intuitive way, where I can just sit down in front of the product, get my job done, and move on without having to go learn some sort of new configuration language?Corey: Once upon a time, I saw an early, circa 2012, 2013 talk from Jordan Sissel, who is the creator of Logstash, and he talked a lot about how challenging it was to wind up parsing all of the variety of log files out there. Even something as relatively straightforward—wink, wink, nudge, nudge—as timestamps was an absolute monstrosity. And a lot of people have been talking in recent years about OpenTelemetry being the lingua franca that everything speaks, so that it is the wave of the future, but I've got to level with you: looking around, it feels like these people are living in a very different reality than the one I appear to have stumbled into, because the conversations people are having about how great it is sound amazing, but nothing that I'm looking at—granted, from a very particular point of view—seems to be embracing it or supporting it. Is that just because I'm hanging out in the wrong places, or is it still a great idea whose time has yet to come, or something else?Clint: So, I think a couple things. One is every conversation I have about OpenTelemetry is always, “Will be.” It's always in the future. And there's certainly a lot of interest. We see this from customer after customer; they're very interested in OpenTelemetry and what the OpenTelemetry strategy is, but as an example, OpenTelemetry logging is not yet a finalized specification; they believe that they're still six months to a year out. 
It seems to be perpetually six months to a year out there. They are finalized for metrics and they are finalized for tracing. Where we see OpenTelemetry tends to be with companies like Honeycomb, companies like Datadog with their tracing product, or Lightstep. So, for tracing, we see OpenTelemetry adoption. But tracing adoption is also not that high either, relative to just general metrics or logs. Corey: Yeah, the tracing implementations that I've seen—for example, Epsagon did this super well, where it would take a look at your Lambda functions built into an application, and ah, we're going to go ahead and instrument this automatically using layers or extensions for you. And life was good because suddenly you got very detailed breakdowns of exactly how data was flowing in the course of a transaction through 15 Lambda functions. Great. With everything else I've seen, it's, “Oh, you have to instrument all these things by hand.” Let me shortcut that for you: that means no one's going to do it. They never are. Anytime you have to do that undifferentiated heavy lifting of making sure that you put the finicky code just so into your application's logic, it's shorthand for: it's only going to happen when you have no other choice. And I think that trying to surface that burden to the developer, instead of building it into the platform so they don't have to think about it, is inherently the wrong move. Clint: I think there's a strong belief in Silicon Valley that—similar to, like, Hollywood—the biggest export Silicon Valley is going to have is culture. And so that's going to be this culture of, like, developers supporting their stuff in production. I'm telling you, I sell to banks and governments and telcos, and I don't see that culture prevailing. I see an application developed by Accenture that's operated by Tata. That's a lot of inertia to overcome and a lot of regulation to overcome as well, and so, like, we can say that, hey, separation of duties isn't really a thing and developers should be able to support all their own stuff in production. I don't see that happening. It may happen. It'll certainly happen more than zero. And tracing is predicated on the whole idea that the developer is scratching their own itch. Like, I am in production and troubleshooting this, and so I need this high-fidelity trace-level information to understand what's going on with this one user's experience. But that doesn't tend to be how things are actually troubleshot in the enterprise. And so I think that, more than anything, is the headwind that's slowing down distributed tracing adoption. It's because you're putting the onus of solving the problem on a developer who never ends up using the distributed tracing solution to begin with because there's another operations department over there that's actually operating the thing on a day-to-day basis. Corey: Having come from one of those operations departments myself, the way that I would always fix things was—you know, in the era that I was operating in, it made sense—you'd SSH into a box and kick the tires, poke around, see what's going on, look at the logs locally, look at the behaviors, the way you'd expect to. These days, that is considered a screamingly bad anti-pattern and it's something that companies try their damnedest to avoid doing at all. When did that change? And what is the replacement for that?
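Editor's note: the hand-instrumentation burden described above looks roughly like this. A minimal OpenTelemetry-for-Python sketch follows; the service and span names are illustrative, and every interesting code path needs its own copy of the span wrapper, which is exactly why it rarely happens by choice.

    # pip install opentelemetry-api opentelemetry-sdk
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # One-time wiring that every service needs some variant of.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("checkout-service")  # illustrative name

    def charge_card(order_id: str) -> None:
        # ...and a span wrapped, by hand, around each code path worth tracing.
        with tracer.start_as_current_span("charge_card") as span:
            span.set_attribute("order.id", order_id)
            # business logic goes here

Auto-instrumentation via Lambda layers or agent extensions exists precisely to make this boilerplate someone else's problem.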
Because every time I asked people for the sorts of data that I would get from that sort of exploration when they're trying to track something down, I'm more or less met with blank stares.Clint: Yeah. Well, I think that's a huge hole and one of the things that we're actually trying to do with our new product. And I think the… how do I replicate that Linux command line experience? So, for example, something as simple, like, we'd like to think that these nodes are all ephemeral, but there's still a disk, whether it's virtual or not; that thing sometimes fills up, so how do I even do the simple thing like df -kh and see how much disk is there if I don't already have all the metrics collected that I needed, or I need to go dive deep into an application and understand what that application is doing or seeing, what files it's opening, or what log files it's writing even?Let's give some good examples. Like, how do I even know what files an application is running? Actually, all that information is all there; we can go discover that. And so some of the things that we're doing with Edge is trying to make this rich, interactive experience where you can actually teleport into the end node and see all the processes that are running and get a view that looks like top and be able to see how much disk is there and how much disk is being consumed. And really kind of replicating that whole troubleshooting experience that we used to get from the Linux command line, but now instead, it's a tightly controlled experience where you're not actually getting an arbitrary shell, where I could do anything that could give me root level access, or exploit holes in various pieces of software, but really trying to replicate getting you that high fidelity information because you don't need any of that information until you need it.And I think that's part of the problem that's hard with shipping all this data to some centralized platform and getting every metric and every log and moving all that data is the data is worthless until it isn't worthless anymore. And so why do we even move it? Why don't we provide a better experience for getting at the data at the time that we need to be able to get at the data. Or the other thing that we get to change fundamentally is if we have the edge available to us, we have way more capacity. I can store a lot of information in a few kilobytes of RAM on every node, but if I bring thousands of nodes into one central place, now I need a massive amount of RAM and a massive amount of cardinality when really what I need is the ability to actually go interrogate what's running out there.Corey: The thing that frustrates me the most is the way that I go back and find my old debug statements, which is, you know, I print out whatever it is that the current status is and so I can figure out where something's breaking.Clint: [Got here 00:15:08].Corey: Yeah. I do it within AWS Lambda functions, and that's great. And I go back and I remove them later when I notice how expensive CloudWatch logs are getting because at 50 cents per gigabyte of ingest on those things, and you have that Lambda function firing off a fair bit, that starts to add up when you've been excessively wordy with your print statements. It sounds ridiculous, but okay, then you're storing it somewhere. 
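Editor's note: the df -kh example in this exchange translates directly into the kind of data an edge agent can hand back on demand. Here is a sketch using only the Python standard library; the report shape is made up, and the paths surveyed will vary by host.

    import json
    import shutil
    import socket

    def disk_report(paths=("/",)) -> str:
        """Collect df-style stats an agent could expose without anyone SSHing in."""
        report = {"host": socket.gethostname(), "disks": {}}
        for path in paths:
            usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
            report["disks"][path] = {
                "total_gb": round(usage.total / 1e9, 1),
                "used_gb": round(usage.used / 1e9, 1),
                "free_gb": round(usage.free / 1e9, 1),
            }
        return json.dumps(report, indent=2)

    if __name__ == "__main__":
        print(disk_report())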
If I want to take that log data and have something else consume it, that's nine cents a gigabyte to get it out of AWS and then you're going to want to move it again from wherever it is over there—potentially to a third system, because why not?—and it seems like the entire purpose of this log data is to sit there and be moved around because every time it gets moved, it winds up somehow costing me yet more money. Why do we do this?Clint: I mean, it's a great question because one of the things that I think we decided 15 years ago was that the reason to move this data was because that data may go poof. So, it was on a, you know, back in my day, it was an HP DL360 1U rackmount server that I threw in there, and it had raid zero discs and so if that thing went dead, well, we didn't care, we'd replace it with another one. But if we wanted to find out why it went dead, we wanted to make sure that the data had moved before the thing went dead. But now that DL360 is a VM.Corey: Yeah, or a container that is going to be gone in 20 minutes. So yeah, you don't want to store it locally on that container. But discs are also a fair bit more durable than they once were, as well. And S3 talks about its 11 nines of durability. That's great and all but most of my application logs don't need that. So, I'm still trying to figure out where we went wrong.Clint: Well, I think it was right for the time. And I think now that we have durable storage at the edge where that blob storage has already replicated three times and we can reattach—if that box crashes, we can reattach new compute to that same block storage. Actually, AWS has some cool features now, you can actually attach multiple VMs to the same block store. So, we could actually even have logs being written by one VM, but processed by another VM. And so there are new primitives available to us in the cloud, which we should be going back and re-questioning all of the things that we did ten to 15 years ago and all the practices that we had because they may not be relevant anymore, but we just never stopped to ask why.Corey: Yeah, multi-attach was rolled out with their IO2 volumes, which are spendy but great. And they do warn you that you need a file system that actively supports that and applications that are aware of it. But cool, they have specific use cases that they're clearly imagining this for. But ten years ago, we were building things out, and, “Ooh, EBS, how do I wind up attaching that from multiple instances?” The answer was, “Ohh, don't do that.”And that shaped all of our perspectives on these things. Now suddenly, you can. Is that, “Ohh don't do that,” gut visceral reaction still valid? People don't tend to go back and re-examine the why behind certain best practices until long after those best practices are now actively harmful.Clint: And that's really what we're trying to do is to say, hey, should we move log data anymore if it's at a durable place at the edge? Should we move metric data at all? Like, hey, we have these big TSDBs that have huge cardinality challenges, but if I just had all that information sitting in RAM at the original endpoint, I can store a lot of information and barely even touch the free RAM that's already sitting out there at that endpoint. So, how to get out that data? Like, how to make that a rich user experience so that we can query it?We have to build some software to do this, but we can start to question from first principles, hey, things are different now. 
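Editor's note: to put rough numbers on the log-shuffling described above, here is the arithmetic using the per-gigabyte prices quoted in the conversation ($0.50 CloudWatch Logs ingest, $0.09 egress). The daily volume is a made-up example, and current AWS pricing should be verified.

    # Back-of-envelope cost of chatty Lambda print statements.
    INGEST_PER_GB = 0.50  # CloudWatch Logs ingestion, as quoted above
    EGRESS_PER_GB = 0.09  # moving it out of AWS to the next system, as quoted

    gb_per_day = 5        # assumption: a wordy function logging ~5 GB/day
    monthly_gb = gb_per_day * 30

    print(f"{monthly_gb} GB/month: "
          f"${monthly_gb * INGEST_PER_GB:.2f} to ingest, "
          f"${monthly_gb * EGRESS_PER_GB:.2f} more to move it out")
    # -> 150 GB/month: $75.00 to ingest, $13.50 more to move it out

And that is before storage, before the second system's ingest fee, and before the third system the data inevitably moves to.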
Maybe we can actually revisit a lot of these architectural assumptions, drive cost down, and give more capability than we actually had before, fundamentally cheaper. And that's kind of what Cribl does: we're looking at software to say, “Man, like, let's question everything and let's go back to first principles.” “Why do we want this information?” “Well, I need to troubleshoot stuff.” “Okay, well, if I need to troubleshoot stuff, well, how do I do that?” “Well, today we move it, but do we have to? Do we have to move that data?” “No, we could probably give you an experience where you can dive right into that endpoint and get really, really high fidelity data without having to pay to move that and store it forever.” Because also, like, telemetry information is basically worthless after 24 hours; if I'm moving that and paying to store it, then I'm paying for something I'm never going to read back. Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they're all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high performance cloud compute at a price that—while, sure, they claim it's better than AWS pricing—when they say that, they mean it is less money. Sure, I don't dispute that, but what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting vultr.com/screaming, and you'll receive $100 in credit. That's V-U-L-T-R dot com slash screaming. Corey: And worse, you wind up figuring out, okay, I'm going to store all that data going back to 2012, and it's petabytes upon petabytes. And great, how do I actually search for a thing? Well, I have to use some other expensive compute that's going to start diving through all of that, because the way I set up my partitioning, it isn't aligned with anything looking at, like, recency or based upon time period, so every time I want to look at what happened 20 minutes ago, I'm also scanning what happened 20 years ago. And that just gets incredibly expensive, not just to maintain but to query and the rest. Now, to be clear, yes, this is an anti-pattern. It isn't how things should be set up. But how should they be set up? And is the collective answer to that right now actually what's best, or is it still harkening back to old patterns that no longer apply? Clint: Well, the future is here, it's just unevenly distributed. So, you know, I think an important point about us, or how we think about building software, is this customer-first attitude and fundamentally bringing them choice. Because the reality is that doing things the old way may be the right decision for you.
You may have compliance requirements to say—there's a lot of financial services institutions, for example, like, they have to keep every byte of data written on any endpoint for seven years. And so we have to accommodate their requirements. Like, is that the right requirement? Well, I don't know. The regulator wrote it that way, so therefore, I have to do it. Whether it's the right thing or the wrong thing for the business, I have no choice. And their decisions are just as right as the person who says this data is worthless and should all just be thrown away. We really want to be able to go and say, like, hey, what decision is right? We're going to give you the option to do it this way, we're going to give you the option to do it this way. Now, the hard part—and this is when it comes down to, like, marketing—is that you want to have this really simple message, like, “This is the one true path.” And a lot of vendors are this way: “There's this new, wonderful, right, true path that we are going to take you on—follow along behind me.” But the reality is, enterprise worlds are gritty and ugly, and they're full of old technology and new technology. And they need to be able to support getting data off the mainframe the same way as they're doing a brand new containerized microservices application. In fact, that brand new containerized microservices application is probably talking to the mainframe through some API. And so all of that has to work at once. Corey: Oh, yeah. And it's, “All of our payment data is in our PCI environment, and PCI needs to have every byte logged.” Great. Why is three-quarters of your infrastructure considered the PCI environment? Maybe you can constrain that at some point and suddenly save a whole bunch of effort, time, money, and regulatory drag on this. But as you go through that journey, you need to not only have a tool that will work when you get there but a tool that will work where you are today. And a lot of companies miss that mark, too. It's, “Oh, once you modernize and become the serverless success story of the decade, then our product is going to be right for you.” “Great. We'll send you a postcard if we ever get there and then you can follow up with us.” Alternately, it's, well, “Yeah, this is how we are today, but we have visions of a brighter tomorrow.” You've got to be able to meet people where they are at any point of that journey. One of the things I've always respected about Cribl has been the way that you very fluidly tell both sides of that story. Clint: And it's not their fault. Corey: Yeah. Clint: Most of the people who pick a job, they pick the job because, like—look, I live in Kansas City, Missouri, and there's this data processing company that works primarily on mainframes, it's right down the road. And they gave me a job and it pays me $150,000 a year, and I got a big house and things are great. And I'm a sysadmin sitting there. I don't get to play with the new technology. Like, that customer is just as applicable a customer; we want to help them exactly the same as the new Silicon Valley hip kid who's working at, you know, a venture-backed startup, doing everything natively in the cloud.
Those are all right decisions, depending on where you happen to find yourself, and we want to support you with our products, no matter where you find yourself on the technology spectrum. Corey: Speaking of old and new, and the trends of the industry, when you first set up this recording, you mentioned, “Oh, yeah, we should make it a point to maybe talk about the acquisition,” at which point I sprayed coffee across my iMac. Thanks for that. Turns out it wasn't your acquisition we were talking about so much as it is the—at the time we record this—yet-to-close rumored acquisition of Splunk by Cisco. Clint: I think it's both interesting and positive for some people, and sad for others. I think Cisco is obviously a phenomenal company. They run the networking world. The fact that they've been moving into observability—they bought companies like AppDynamics, and—we were talking about Epsagon before the show—they bought Epsagon too; ServiceNow just bought Lightstep recently. There's a lot of acquisitions in this space. I think that when it comes to something like Splunk, Splunk is a fast-growing company compared to Cisco. And so for them, this is something that they think that they can put into their distribution channel, and what Cisco knows how to do is sell things—they're very good at putting things through their existing sales force and really amplifying the sales of that particular thing that they have just acquired. That being said, I think for a company that was as innovative as Splunk, I do find it a bit sad to think that it's going to become part of this much larger behemoth and probably not really driving the observability and security industry forward anymore, because I don't think anybody really looks at Cisco as a company that's driving things—not to slam them or anything, but I don't really see them as driving the industry forward.
It just feels like once a company is bought by Cisco, their velocity peters out, a lot of their staff leaves, and what you see is what you get. And I don't know if that's accurate—maybe I'm just not looking in the right places—but every time I talk to folks in the industry about this, I get a lot of knowing nods that are tied to it. So, whether that's true or not, that is very clearly, at least in some corners of the market, the active perception. Clint: There's a very real fact that if you look even at very large companies, innovation is driven from a core set of a handful of people. And when those people start to leave, the innovation really stops. It's those people who think about things back from first principles—like, why are we doing things? What can we do differently?—and they're the type of drivers that drive change. So, Frank Slootman wrote a book recently called Amp It Up that I've been reading over the last weekend, and he has this article that was on LinkedIn a while back called “Drivers vs. Passengers,” and he's always looking for drivers. And those drivers tend to not find themselves as happy in bigger companies, and they tend to head for the exits. And so then you end up with a lot of the passenger type of people, the people who will carry it forward, who'll continue to scale it; the business will continue to grow at whatever rate it's going to grow, but you're probably not going to see a lot of the net-new stuff. And I'll put it in comparison to a company like Datadog, who I have a vast amount of respect for—I think they're an incredibly innovative company, and I think they continue to innovate. Still driven by the founders, the people who created the original product are still there driving the vision, driving forward innovation. And what tends to move the envelope is the people who have the moral authority inside of an even larger organization to say, “Get behind me. We're going in this direction. We're going to go take that hill. We're going to go make things better for our customers.” And when you start to lose those handful of really critical contributors, that's where you start to see the innovation dry up.
But if you're the board of directors and you're looking around and saying, like, hey, look, I don't see another billion-dollar line of bus—at this scale, right, if you're at Splunk scale, right? I don't see another billion-dollar line of business sitting here. We could probably go acquire it, we could try to add it in, but you know, in the case of something like a Splunk, I think part of—you know, they're looking for a new CEO right now, so now they have to go find a new leader who's going to come in, re-energize, and, kind of, reboot that. But those are the options that they're considering, right? They're like, “Do I find a new CEO who's going to reinvigorate things and be able to attract the type of talent that's going to lead us to the next billion-dollar line of business that we can either build inside or we can acquire and bring in-house? Or is the right path for me just to say, ‘Okay, well, you know, somebody like Cisco's interested?'” Or the other path that you may see them go down is something like Silver Lake—Silver Lake put a billion dollars into the company last year. And so they may be looking at it and saying, “Okay, well, we really need to do some restructuring here and we want to do it outside the eyes of the public market. We want to be able to change the pricing model, and we want to be able to really do this without having to worry about the stock price's massive volatility because we're making big changes.” And so I would say there's probably two big options they're considering. Like, do we sell to Cisco, do we sell to Silver Lake, or do we really take another run at this? And those are difficult decisions for the stewards of the business, and I think it's a different decision if you're the steward of the business that created the business versus the steward of the business for whom this is, “I've been here for five years and I may be here for five years more.” For somebody like me, a company like Cribl is literally the thing I plan to leave on this earth. Corey: Yeah. Do you have that sense of personal attachment to it? On some level, The Duckbill Group, that's exactly what I'm staring at, where—great, someone wants to buy the Last Week in AWS media side of the house. Great. Okay. What is that, really, beyond me? Because so much of it's been shaped by my personality. There's an audience, sure, but it's a skeptical audience, one that doesn't generally tend to respond well to mass-market, generic advertisements, so monetizing that is not going to go super well. “All right, we're going to start doing data mining on people.” Well, that's explicitly against the terms of service people signed up for, so good luck with that. So much starts becoming bizarre and strange when you start looking at building something with the idea of, oh, in three years, I'm going to unload this puppy and make it someone else's problem. The argument is that by building something with an eye toward selling it, you build a better-structured business, but it also means you potentially make trade-offs that are best not made. I'm not sure there's a right answer here. Clint: In my spare time, I do some investments, angel investments, and that sort of thing, and that's always a red flag for me when I meet a founder who's like, “In three to five years, I plan to sell it to these people.” If you don't have a vision for how you're fundamentally going to alter the marketplace and our perception of everything else, you're not dreaming big enough. And that, to me, doesn't look like a great investment.
It doesn't look like the—how do you attract employees in that way? Like, “Okay, our goal is to work really hard for the next three years so that we will be attractive to this other bigger thing.” They may be thinking it on the inside as an available option, but if you think that's your default option when starting a company, I don't think you're going to end up with an outcome that is truly what you're hoping for. Corey: Oh, yeah. In my case, the only acquisition story I see is some large company buying us just largely to shut me up. But—Clint: [laugh]. Corey: —that turns out to be kind of expensive, so all right. I also don't think it'd serve any of them nearly as well as they think it would. Clint: Well, you'll just become somebody else on Twitter. [laugh]. Corey: Yeah, “Time to change my name again. Here we go.” So, if people want to go and learn more about Cribl Edge, where can they do that? Clint: Yeah, cribl.io. And then if you're more of a technical person, and you'd like to understand the specifics, docs.cribl.io. That's where I always go when I'm checking out a vendor; just skip past the main page and go straight to the docs. So, check that out. And then also, if you're wanting to play with the product, we make education available online, called Sandboxes, at sandbox.cribl.io, where you can go spin up your own version of the product, walk through some interactive tutorials, and get a view on how it might work for you. Corey: Such a great pattern, at least for the way that I think about these things. You can have flashy videos, you can have great screenshots, you can have documentation that is the finest thing on this earth, but let me play with it; let me kick the tires on it, even with a sample data set. Because until I can do that, I'm not really going to understand where the product starts and where it stops. That is the right answer from where I sit. Again, I understand that everyone's different, not everyone thinks like I do—thankfully—but for me, that's the best way I've ever learned something. Clint: I love to get my hands on the product, and in fact, I'm always a little bit suspicious of any company when I go to their webpage and I can't either sign up for the product or get to the documentation, and I have to talk to somebody in order to learn. That pretty much means I'm immediately going to the next person in that market to go look for somebody who will let me. Corey: [laugh]. Thank you again for taking so much time to speak with me. I appreciate it. As always, it's a pleasure. Clint: Thanks, Corey. Always enjoy talking to you. Corey: Clint Sharp, CEO and co-founder of Cribl. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment. And when you hit submit, be sure to follow it up with exactly how many distinct and disparate logging systems that obnoxious comment had to pass through on your end of things. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
The Hari Seldon of Third Party Tooling with Aidan Steele

Screaming in the Cloud

Play Episode Listen Later Mar 16, 2022 33:39


About AidanAidan is an AWS enthusiast, due in no small part to sharing initials with the cloud. He's been writing software for over 20 years and getting paid to do it for the last 10. He's still not sure what he wants to be when he grows up.Links: Stedi: https://www.stedi.com/ GitHub: https://github.com/aidansteele Blog posts: https://awsteele.com/ Ipv6-ghost-ship: https://github.com/aidansteele/ipv6-ghost-ship Twitter: https://twitter.com/__steele TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.Corey: Today's episode is brought to you in part by our friends at MinIO the high-performance Kubernetes native object store that's built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. It's getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that's exactly what MinIO offers. With superb read speeds in excess of 360 gigs and 100 megabyte binary that doesn't eat all the data you've gotten on the system, it's exactly what you've been looking for. Check it out today at min.io/download, and see for yourself. That's min.io/download, and be sure to tell them that I sent you.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by someone who is honestly, feels like they're after my own heart. Aidan Steele by day is a serverless engineer at Stedi, but by night, he is an absolute treasure and a delight because not only does he write awesome third-party tooling and blog posts and whatnot around the AWS ecosystem, but he turns them into the most glorious, intricate, and technical shit posts that I think I've ever seen. Aidan, thank you for joining me.Aidan: Hi, Corey, thanks for having me. It's an honor to be here. Hopefully, we get to talk some AWS, and maybe also talk some nonsense as well.Corey: I would argue that in many ways, those things are one in the same. And one of the things I always appreciated about how you approach things is, you definitely seem to share that particular ethos with me. And there's been a lot of interesting content coming out from you in recent days. 
The thing that really wound up showing up on my radar in a big way was back at the start of January—2022, for those listening to this in the glorious future—about using IPv6 to use multi-factor auth, which it is so… I don't even have the adjectives to throw at this because, first it is ridiculous, two, it is effective, and three, it is just who thinks like that? What is this and what did you—what monstrosity have you built?Aidan: So, what did I end up calling it? I think it was ipv6-ghost-ship. And I think I called it that because I'd recently watched, oh, what was that series that was recently on Apple TV? Uh, the Isaac Asimov—Corey: If it's not Paw Patrol, I have no idea what it is because I have a four-year-old who is very insistent about these things. It is not so much a TV show as it is a way of life. My life is terrible. Please put me out of my misery.Aidan: Well, at least it's not Bluey. That's the one I usually hear about. That's Australia's greatest export. But it was one of the plot devices was a ship that would teleport around the place, and you could never predict where it was next. And so no one could access it. And I thought, “Oh, what about if I use the IPv6 address space?”Corey: Oh, Foundation?Aidan: That's the one. Foundation. That's how the name came about. The idea, honestly, it was because I saw—when was it?—sometime last year, AWS added support for those IP address prefixes. IPv4 prefixes were small; very useful and important, but IPv6 with more than 2 trillion IP addresses, per instance, I thought there's got to be fun to be had there.Corey: 281 trillion, I believe is the—Aidan: 281 trillion.Corey: Yeah. It is sarcastically large space. And that also has effectively, I would say in InfoSec sense, killed port scanning, the idea I'm going to scan the IP range and see what's there, just because that takes such a tremendous amount of time. Now here, in reality, you also wind up with people using compromised resources, and yeah, it turns out, I can absolutely scan trillions upon trillions of IP addresses as long as I'm using your AWS account and associated credit card in which to do it. But here in the real world, it is not an easily discoverable problem space.Aidan: Yeah. I made it as a novelty, really. I was looking for a reason to learn more about IPv6 and subnetting because it's the term I'd heard, a thing I didn't really understand, and the way I learn things is by trying to build them, realizing I have no idea what I'm doing, googling the error messages, reluctantly looking at the documentation, and then repeating until I've built something. And yeah, and then I built it, published it, and seemed to be pretty popular. It struck a chord. People retweeted it. It tickled your fancy. I think it spoke something in all of us who are trying not to take our jobs too seriously, you know, know we can have a little fun with this ludicrous tech that we get to play with.Corey: The idea being, you take the multi-factor auth code that your thing generates, and that is the last series of octets for the IP address you wind up going towards and that is such a large problem space that you're not going to find it in time, so whatever it is automatically connect to that particular IP address because that's the only one that's going to be listening for a 30 to 60-second span for the connection to be established. It is a great idea because SSH doesn't support this stuff natively. There's no good two-factor auth approach for this. And I love it. 
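Editor's note: the mapping Corey describes is simple enough to sketch. This is not the actual ipv6-ghost-ship code, just the idea, assuming a /80 IPv6 prefix on the instance and the pyotp library for TOTP codes.

    # pip install pyotp — sketch only; see github.com/aidansteele/ipv6-ghost-ship
    import pyotp

    PREFIX = "2001:db8:1234:5678:9abc"  # documentation prefix standing in for the /80
    secret = pyotp.random_base32()      # shared secret enrolled in your MFA app

    def current_listen_address(secret: str) -> str:
        """Embed the rotating 6-digit TOTP code in the low bits of the address."""
        code = int(pyotp.TOTP(secret).now())  # e.g. 492016, rotates every 30 seconds
        return (f"{PREFIX}:{(code >> 32) & 0xFFFF:x}"
                f":{(code >> 16) & 0xFFFF:x}:{code & 0xFFFF:x}")

    # Only this address, out of the 281 trillion in the prefix, is listening right now.
    print(current_listen_address(secret))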
I'd be scared to death to run this in production for something that actually matters. And we also start caring a lot more about how accurate the clocks on those instances are, all of a sudden. But, oh, I just love the concept so much because it hits on the ethos of—I think—what so much of the cloud does, where these really are fundamental building blocks that we can use to build incredible, awe-inspiring things that are globe-spanning, and also ridiculousness. And there's so much value in being able to do the same thing, sometimes at the same time. Aidan: Yeah, it's interesting, you mentioned, like, never using it in prod, and I guess when I was building it, I thought, you know, that would be apparent. Like, “Yes, this is very neat, but surely no one's going to use it.” And I did see someone raise an issue on the GitHub project which was talking about clock skew. And I mentioned—Corey: Here at the bank where I'm running this in production, we're—Aidan: [laugh]. Corey: —having some trouble with the clock. Yeah, it's—Aidan: You know, I mentioned that the underlying 2FA library did account for clock skew, 30 seconds either way, but it made me realize, I might need to put a disclaimer on the project. While the code is probably reasonably sound, I personally wouldn't run it in production, and it was more meant to be a piece of performance art or something to tickle one's fancy and move on, not to roll it out. But I don't know, different strokes for different folks. Corey: I have gotten a lot better about calling out my ridiculous shitpost things when I do them. And the thing that really drove that home for me was talking about using DNS TXT records to store information about what server a virtual machine lives on—or container or whatnot—thus using Route 53 as a database. And that was a great gag, and then someone did a Reddit post of “This seems like a really good idea, so I'm going to start doing it, and I'm having these questions.” And at that point, it's like, “Okay, I've got to break character at that point.” And it's, yeah, “Hi. That's my joke. Don't do it because X, Y, and Z are your failure modes; there are better tools for it. So yeah, there are ways you can do this with DNS, but it's not generally a great idea, and there are some risk factors to it. And okay, A, B, and C are the things you don't want to do, so let's instead do it in a halfway intelligent way, because it's only funny if everyone's laughing. Otherwise, we fall into this trap of people take you seriously and they feel bad as a result when it doesn't work in production. So, calling it out as ‘this is a joke' tends to put a lot of that aside. It also keeps people from feeling left out. Aidan: Yeah. I realized that because the next novelty project I did a few days later—not sure if you caught it—was a Rick Roll over ICMPv6 packets, where if you had run ping6 against a certain IP range, it would return the lyrics to music's greatest treasure. So, I think that was hopefully a bit more self-evident that this should never be taken seriously. Who knows, I'm sure someone will find a use for it in prod. Corey: And I was looking through this, this is great. I love some of the stuff that you're doing because it's just fantastic. And I started digging a bit more into things you had done. And at that point, it was whoa, whoa, whoa, wait a minute.
Back in 2020, you found an example of an issue with AWS's security model where CloudTrail would just start—if asked nicely—spewing other people's credential sets and CloudTrail events and whatnot into your account. And, A, that's kind of a problem. B, it was something that didn't make that big of a splash when it came out—I don't even think I linked to it at the time—and, C, it was an example—after the recent revelations around CloudFormation and Glue that the fine folks at Orca Security found—that this wasn't a one-off, because you'd done this a year beforehand. We now have an established track record of cross-account data sharing and, potentially, exploits, and I'm looking at this and, I've got to level with you, I felt incredibly naive because I had assumed that since we hadn't heard of this stuff in any real big sense, it simply didn't happen. So, when we heard about Azure—obviously, it's because Azure is complete clown shoes, and the excellent people at AWS would never make these sorts of mistakes. Except we now have evidence that they absolutely did, and didn't talk about it publicly. And I've got to level with you: I feel more than a little bit foolish, betrayed, naive for all this. What's your take on it? Aidan: Yeah, so just to clarify, it wasn't actually in your account. It was the new AWS custom resource execution model, where you would upload a Lambda function that would run in an Amazon-managed account. And so that immediately set off my spidey sense because executing code in someone else's account seems fraught with peril. And so—Corey: Yeah, you can do all kinds of horrifying things there, like, use it to run containers. Aidan: Yeah. [laugh]. Thankfully, I didn't do anything that egregious. I stayed inside the Lambda function, but I look—I poked around at what credentials it had, and it would use CloudWatch to reinvoke itself, and CloudWatch kept recording CloudTrail. And I won't go into all the details, but it ended up being that you could see credentials being recorded in CloudTrail in that account, and I could, sort of, funnel them out of there. When I found this, I was a little scared, and I don't think I'd reported an issue to AWS before, so I didn't want to go too far and do anything that could be considered malicious. So, I didn't actively seek out other people's credentials. Corey: Yeah, as a general rule, it's best once you discover things like that to do the right thing and report it, not proceed to, you know, inadvertently commit felonies. Aidan: Yeah. Especially because it was my first time, I felt better safe than sorry. So, I didn't see other credentials, but I had no reason to believe that I wouldn't see them if I kept looking. I reported it to Amazon. Their security team was incredibly professional, made me feel very comfortable reporting it, and let me know when, you know, they'd remediated it, which was a matter of days later. But afterwards, it left me feeling a little surprised because I was able to publish about it, and a few people responded, you know, the sorts of people who pay close attention to the industry, but Amazon didn't publish anything as far as I was aware. And it changed the way I felt about AWS security because, like you, I sort of felt that AWS more or less had a pretty perfect track record. They would have advisories about possible Xen exploits, and so on. But they'd never published anything about potential for compromise.
And it makes me wonder how many of the things might have been reported in the past where either the third-party researcher either didn't end up publishing, or they published and it just disappeared into the blogosphere, and I hadn't seen it.Corey: They have a big earn trust principle over there, and I think that they always focus on the trust portion of it, but I think what got overlooked is the earn. When people are giving you trust that you haven't earned, on some level, the right thing to do is to call it out and be transparent around these things. Yes, I know, Wall Street's going to be annoyed and headlines, et cetera, et cetera, but I had always had the impression that had there been a cross-account vulnerability or a breach of some sort, they would communicate this and they would have their executives go on a speaking tour about it to explain how defense-in-depth mitigated some of it, and/or lessons learned, and/or what else we can learn. But it turns out that wasn't was happening at all. And I feel like they have been given trust that was unearned and now I am not happy with it.I suddenly have a lot more of a, I guess, skeptical position toward them as a result, and I have very little tolerance left for what has previously been a staple of the AWS security discussions, which is an executive getting on stage for a while and droning on about the shared responsibility model with the very strong implication that “Oh, yeah, we're fine. It's all on your side of the fence that things are going to break.” Yeah, turns out, that's not so true. Just you know, about the things on your side of the fence in a way that you don't about the things that are on theirs.Aidan: Yeah, it's an interesting one. Like, I think about it and I think, “Well, they never made an explicit promise that they would publish these things,” so, on one hand, I say to myself, “Oh, maybe that's on me for making that assumption.” But, I don't know, I feel like the way we felt was justified. Maybe naive in hindsight, but then, you know, I guess… I'm still not sure how to feel because of, like, I think about recent issues and how a couple of AWS Distinguished Engineers jumped on Twitter, and to their credit were extremely proactive in engaging with the community.But is that enough? It might be enough for say, to set my mind at ease or your mind at ease because we are, [laugh] to put it mildly, highly engaged, perhaps a little too engaged in the AWS space, but Twitter's very ephemeral. Very few of AWS's customers—Corey: Yeah, I can't link to tweets by distinguished engineers to present to an executive leadership team as an official statement from Amazon. I just can't.Aidan: Yeah. Yeah.Corey: And so the lesson we can take from this is okay, so “Well, we never actually said this.” “So, let me get this straight. You're content to basically let people assume whatever they want until they ask you an explicit question around these things. Really? Is that the lesson you want me to take from this? Because I have a whole bunch of very explicit questions that I will be asking you going forward, if that is in fact, your position. And you are not going to like the fact that I'm asking these questions.”Even if the answer is a hard no, people who did not have this context are going to wonder why are people asking those questions? It's a massive footgun here for them if that is the position that they intend to have. I want to be clear as well; this is also a messaging problem. 
It is not in any way, a condemnation of their excellent folks working on the security implementation themselves. This stuff is hard and those people are all-stars. I want to be very clear on this. It is purely around the messaging and positioning of the security posture.Aidan: Yeah, yeah. That's a good clarification because like you, my understanding that the service teams are doing a really stellar, above-average job, industry-wide, and the AWS Security Response Teams, I have absolute faith in them. It is a matter of messaging. And I guess what particularly brings it to front-of-mind is, it was earlier this month, or maybe it was last month, I received an email from a company called Sourcegraph. They do code search.I'm not even a customer of theirs yet, you know? I'm on a free trial, and I got an email that—I'm paraphrasing here—was something to the effect of, we discovered that it was possible for your code to appear in other customers' code search results. It was discovered by one of our own engineers. We found that the circumstances hadn't cropped up, but we wanted to tell you that it was possible. It didn't happen, and we're working on making sure it won't happen again.And I think about how radically different that is where they didn't have a third-party researcher forcing their hand; they could have very easily swept under the rug, but they were so proactive that, honestly, that's probably what's going to tipped me over to the edge into me becoming a customer. I mean, other than them having a great product. But yeah, it's a big contrast. It's how I like to see other companies work, especially Amazon.Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They've also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.Corey: The two companies that I can think of that have had security problems have been CircleCI and Travis CI. Circle had an incredibly transparent early-on blog post, they engaged with customers on the forums, and they did super well. Travis basically denied, stonewalled for ages, and now the only people who use Travis are there because they haven't found a good way to get off of it yet. It is effectively DOA. And I don't think those two things are unrelated.Aidan: Yeah. No, that's a great point. Because you know, I've been in this industry long enough. You have to know that humans write code and humans make mistakes—I know I've made more than my fair share—and I'm not going to write off the company for making a mistake. It's entirely in their response. And yeah, you're right. That's why Circle is still a trustworthy business that should earn people's business and why Travis—why I recommend everyone move away from.Corey: Yeah, I like Orca Security as a company and as a product, but at the moment, I am not their customer. I am AWS's customer. So, why the hell am I hearing it from Orca and not AWS when this happens?Aidan: Yeah, yeah. It's… not great. On one hand, I'm glad I'm not in charge of finding a solution to this because I don't have the skills or the expertise to manage that communication. 
Because, like I think you said in the past, there's a lot of different audiences that they have to communicate with. They have to communicate with the stock market, they have to communicate with execs, they have to communicate with developers, and each of those audiences demands a different level of detail, a different focus. And it's tricky. And how do you manage that? But, I don't know, I feel like you have an obligation to when people place that level of trust in you. Corey: It's just a matter of doing right by your customers, on some level. Aidan: Yeah. Corey: How long have you been working on AWS-side environments? Clearly, this is not like, “Well, it's year two,” because if so I'm going to feel remarkably behind. Aidan: [laugh]. So, I've been writing code in some capacity or another for 20 years. It took about five years to get anyone to pay me to do so. But yeah, I guess the start of my professional career—and by ‘professional,' I want to use it in the strictest terms, meaning getting paid money; not that I [laugh] am necessarily a professional—coincided with the launch of AWS. So, I don't have experience with the before times of data centers, never had to think about Direct Connect, but it means I have been using AWS since sometime in 2008. I was just looking at my bill earlier, and I saw that my first bill was for $70. It was—I was using a C1xLarge, which was 80 cents an hour, and it had eight-core CPUs. And to put that in context at the time—Corey: Eight vCPUs, technically I believe. Aidan: And it basically is—Corey: —or were they using the ECU model back then? Aidan: Yeah, no, that was vCPUs. But to me, that was extraordinary. You know, I was somewhere just after high school. It was—the Netflix Prize was around. If you're not sure what that was, it was Netflix had this open competition where they said anyone who could improve upon their movie recommendation algorithm could win a million dollars. And obviously being a teenager, I had a massive ego and [laugh] no self-doubt, so I thought I could win this, but I just don't have enough CPUs or RAM on my laptop. And so when EC2 launched, and I could pay 80 cents an hour rather than signing up for a 12-month contract with a colocation company, it was just a dream come true. I was able to run my terrible algorithms, but I could run them eight times faster. Unfortunately and obviously, I didn't win because it turns out, I'm not a world-class statistician. But—Corey: Common mistake. I make that mistake myself all the time. Aidan: [laugh]. Yeah. I mean, you know, I think I was probably 19 at the time, so I had—my ego did make me think I was one, but it turned out not to be so. But I think what really blew my mind was that me, a nobody, could create an account with Amazon and get access to these incredibly powerful machines for less than a dollar. And so I was hooked. Since then, I've worked at companies that are AWS customers. I've worked at places that have zero EC2 servers, worked at places that have had thousands, and places in between. And it's got to a point, actually, where, I guess, my career is so entwined with AWS that, one, my initials are actually AWS, but also—and this might sound ridiculous, and it's probably just a sign of my privilege—that I wouldn't consider working somewhere that used another cloud. Not—Corey: No, I think that's absolutely the right approach. Aidan: Yeah. Corey: I had a Twitter thread on this somewhat recently, and I'm going to turn it into a blog post because I got some pushback.
If I were looking at doing something and I was coming into the industry right now, my first choice would be Google Cloud because its developer experience is excellent. But I'm not coming to this without any experience. I have spent a decade or so learning not just how it works, but also how it breaks, understanding the failure modes and what they're going to look like, and what it's good at and what it's not. That's the valuable stuff for running things in a serious way. Aidan: Yeah. It's an interesting one. And I mean, for better or worse, AWS is big. I'm sure you will know much better than I do the exact numbers, but if a junior developer came to me and said, “Which cloud should I learn, or should I learn all of them?”—I mean, you're right, Google Cloud does have a better developer experience, especially for new developers, but when I think about the sheer number of jobs that are available for developers, I feel like I would be doing them a disservice by not suggesting AWS, at least in Australia. It seems they've got such a huge footprint that you'll always be able to find a job working as an AWS-familiar engineer. It seems like that would be less the case with Google Cloud or Azure. Corey: Again, I am not sitting here suggesting that anyone should, “Oh, clouds are insecure. We're going to run our own stuff in our own data centers.” That is ridiculous in this era. They are still going to do a better job of security than any of us will individually, let's be clear here. And it empowers and unlocks an awful lot of stuff. But with their privileged position as these hyperscale providers that are the default choice for building things, I think, comes a significant level of responsibility that I am displeased to discover they've been abdicating. And I don't love that. Aidan: Yeah, it's an interesting one, right, because, like you're saying, they have access and the expertise that people doing it themselves will never match. So, you know, I'm never going to hesitate to recommend people use AWS on account of security because your company's security posture will almost always be better for using AWS and following their guidelines, and so on. But yeah, like you say, with great power comes significant responsibility to earn trust and retain that trust by admitting and publicizing when mistakes are made. Corey: One last topic I want to get into with you is one that you and I have talked about very briefly elsewhere, that I feel like you and I are both relatively up-to-date on AWS intricacies. I think that we are both better than the average bear working with the platform. But I know that I feel this way, and I suspect you do too, that VPCs have gotten confusing as hell. Is that just me? Am I a secret moron that no one bothered to ever tell me this, and I should update my own self-awareness? Aidan: [laugh]. Yeah, it's… I mean, that's been the story of my career with AWS. When I started, VPCs didn't exist. It was EC2 Classic—well, I guess at the time, it was just EC2—and it was simple. You launched an instance and you had an IP address. And then along came VPCs, and I think at the time, I thought something to the effect of “This seems like needless complexity. I'm not going to bother learning this. It will never be relevant.” In the end, that wasn't true. I worked in much larger deployments where VPCs made fantastic sense and made a lot of things possible, but I still didn't go into the weeds. Since then, AWS has announced that EC2 Classic will be retired; an end of an era.
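Editor's note: to make the jump in VPC complexity concrete, here is a hedged boto3 sketch of the ceremony behind one private subnet with internet egress, the scenario Aidan describes next. All CIDRs and names are placeholders, error handling is omitted, and a NAT gateway accrues charges while it exists.

    # pip install boto3 — the ceremony behind "I just want one private subnet."
    import boto3

    ec2 = boto3.client("ec2")

    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
    public = ec2.create_subnet(VpcId=vpc, CidrBlock="10.0.0.0/24")["Subnet"]["SubnetId"]
    private = ec2.create_subnet(VpcId=vpc, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

    # Public side: an internet gateway so the NAT gateway has a way out.
    igw = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw, VpcId=vpc)

    # The NAT gateway needs an Elastic IP and lives in the public subnet.
    eip = ec2.allocate_address(Domain="vpc")["AllocationId"]
    nat = ec2.create_nat_gateway(SubnetId=public, AllocationId=eip)["NatGateway"]["NatGatewayId"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat])

    # Private side: its own route table, defaulting out through the NAT gateway.
    rtb = ec2.create_route_table(VpcId=vpc)["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=rtb, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat)
    ec2.associate_route_table(RouteTableId=rtb, SubnetId=private)
    # ...and that is before the public route table, security groups, or NACLs.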
I'm not personally still running anything in EC2 Classic, and I think they've done an incredible job of maintaining support for this long, but VPC complexity has certainly been growing year on year since then. I was recently using the AWS console—like we all do and no one ever admits to—to edit a VPC subnet route table. And I clicked the drop-down box for a target, and I was overwhelmed by the number of options. There were NAT gateways, internet gateways, carrier gateways, I think there was a thing called a Wavelength gateway, ENIs, and… [laugh] I think I was surprised because I just scrolled through the list and thought, "Wow, that is a lot of different options. Why is that?"

Especially because it's not so relevant to me. But I realized a big thing AWS has been doing lately is trying to make themselves available to people who haven't used the cloud yet. And those people have these complicated networking needs, and it seems like AWS is trying to—reasonably successfully—make anything possible. But with that comes, you know, additional complexity.

Corey: I appreciate that the capacity is there, but there has to be an abstraction model for getting rid of some of this complexity because otherwise, the failure mode is you wind up with this amazingly capable thing that can build marvels, but you also need to basically have a PhD in some of these things to wind up tying it all together. And if you bring someone else in to do it, then you have no idea how to run the thing. You're effectively a golden retriever trying to fly a space shuttle.

Aidan: Yeah. It's interesting; clearly, they must be acutely aware of this, because they have default VPCs, and for many use cases, that's all people should need. But as soon as you want, say, a private subnet, then you need to either modify that default VPC or create a new one, and it's sort of going from 0 to 100 complexity extremely quickly, because, you know, you need to create route tables and everyone's favorite NAT gateways, and it feels like the on-ramp needs to be not so steep. I'm not sure what the solution is; I hope they find one.

Corey: As do I. I really want to thank you for taking the time to speak with me about so many of these things. If people want to learn more about what you're up to, where's the best place to find you?

Aidan: Twitter's the best place. On Twitter, my username is @__Steele, which is S-T-E-E-L-E. From there, that's where I'll either—I'll at least speculate on the latest releases or link to some of the silly things I put on GitHub. Sometimes they're not-so-silly things. But yeah, that's where I can be found. And I'd love to chat to anyone about AWS. It's something I can geek out about all day, every day.

Corey: And we will certainly include links to that in the [show notes 00:29:50]. Thank you so much for taking the time to speak with me today. I really appreciate it.

Aidan: Well, thank you so much for having me. It's been an absolute delight.

Corey: Aidan Steele, serverless engineer at Stedi, and shitposter extraordinaire. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an immediate request to correct the record about what I'm not fully understanding about AWS's piss-weak security communications.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
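A concrete illustration of the "0 to 100" complexity jump discussed in the episode above: a default VPC needs no setup at all, but a single private subnet takes a surprising number of API calls. The following is a minimal boto3 sketch; the VPC and subnet IDs and the CIDR range are hypothetical placeholders, and error handling is omitted.

```python
import boto3

ec2 = boto3.client("ec2")

VPC_ID = "vpc-0123456789abcdef0"                # hypothetical existing VPC
PUBLIC_SUBNET_ID = "subnet-0aaaaaaaaaaaaaaa0"   # hypothetical public subnet

# 1. The NAT gateway needs an Elastic IP and must live in a *public* subnet.
eip = ec2.allocate_address(Domain="vpc")
natgw = ec2.create_nat_gateway(
    SubnetId=PUBLIC_SUBNET_ID,
    AllocationId=eip["AllocationId"],
)["NatGateway"]
ec2.get_waiter("nat_gateway_available").wait(
    NatGatewayIds=[natgw["NatGatewayId"]]
)

# 2. Create the private subnet itself.
private_subnet = ec2.create_subnet(
    VpcId=VPC_ID, CidrBlock="10.0.2.0/24"
)["Subnet"]

# 3. A private subnet needs its own route table, with the default route
#    pointing at the NAT gateway instead of an internet gateway.
rtb = ec2.create_route_table(VpcId=VPC_ID)["RouteTable"]
ec2.create_route(
    RouteTableId=rtb["RouteTableId"],
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=natgw["NatGatewayId"],
)
ec2.associate_route_table(
    RouteTableId=rtb["RouteTableId"],
    SubnetId=private_subnet["SubnetId"],
)
```

Four resources and an Elastic IP, just to let one subnet reach the internet without being reachable from it, which is roughly the on-ramp steepness the episode is complaining about.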

TestGuild News Show
Books for Testers and Zen TGNS34

TestGuild News Show

Play Episode Listen Later Mar 15, 2022 9:56


Hey, do you want to know what software testing book to read next? Do you know there are ways you can actually lower your AWS CloudWatch service bill? And have you seen the latest postmortem on Log4j? Find out the answers to these and all other full-pipeline DevOps, automation, performance testing, and security testing questions in this episode of the TestGuild News Show for the week of March 13th. So grab your favorite cup of coffee or tea, and let's do this.

Links to News Mentioned in this Episode:
00:54 Blink https://links.testguild.com/8sG9d
01:50 Ponicode https://links.testguild.com/tdMpJ
03:06 Elastic https://links.testguild.com/SvZce
04:02 Books for Testers: https://links.testguild.com/JKuv1
04:58 Software Tester Guide: https://links.testguild.com/WjbSz
05:50 Zendesk: https://links.testguild.com/cvsEa
06:42 CloudWatch: https://links.testguild.com/NBObJ
07:24 Observability: https://links.testguild.com/fx4Aa
08:17 Log4j: https://links.testguild.com/WJDzE

AWS Bites
05. What are our favourite AWS services and why?

AWS Bites

Play Episode Listen Later Oct 8, 2021 9:47


In this episode, Eoin and Luciano talk about their favourite AWS services and why they like them. Spoiler: we talk about EventBridge, Lambda, CloudFormation, CDK, S3, and ECS/Fargate, plus some honorable mentions like CloudWatch and IAM. In this episode we mentioned the following resources: SLIC Watch: https://github.com/fourTheorem/slic-watch/ This episode is also available on YouTube: https://www.youtube.com/channel/UC3WPpke5mzsbiij1zGFbeEA

You can listen to AWS Bites wherever you get your podcasts:
- Apple Podcasts: https://podcasts.apple.com/us/podcast/aws-bites/id1585489017
- Spotify: https://open.spotify.com/show/3Lh7PzqBFV6yt5WsTAmO5q
- Google: https://podcasts.google.com/feed/aHR0cHM6Ly9hbmNob3IuZm0vcy82YTMzMTJhMC9wb2RjYXN0L3Jzcw==
- Breaker: https://www.breaker.audio/aws-bites
- RSS: https://anchor.fm/s/6a3312a0/podcast/rss

Do you have any AWS questions you would like us to address? Leave a comment here or connect with us on Twitter:
- https://twitter.com/eoins
- https://twitter.com/loige

The Downtime Project
Kinesis Hits the Thread Limit

The Downtime Project

Play Episode Listen Later May 25, 2021 44:42


During a routine addition of some servers to the Kinesis front-end cluster in US-East-1 in November 2020, AWS ran into an OS limit on the maximum number of threads. That resulted in a multi-hour outage that affected a number of other AWS services, including ECS, EKS, Cognito, and CloudWatch. We probably won't do […]
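For context on the failure mode: Linux caps the total number of threads a system can run, and AWS's published postmortem attributed the outage to the front-end fleet hitting such a limit as servers were added (each front-end server created threads for its peers). A quick Linux-only sketch of where those limits live; the exact limit the Kinesis fleet hit isn't stated in the episode blurb.

```python
# Inspect the Linux kernel's thread-count ceilings (Linux-only paths).
with open("/proc/sys/kernel/threads-max") as f:
    print("system-wide max threads:", f.read().strip())

with open("/proc/sys/kernel/pid_max") as f:
    # Every thread consumes a PID, so pid_max is an effective cap too.
    print("max PIDs:", f.read().strip())
```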

cloudonaut
#20 End-user monitoring of your website with CloudWatch Synthetics

cloudonaut

Play Episode Listen Later Jun 4, 2020 34:49


There are countless reasons why your website is not working as your users expect. From a technical point of view, you can monitor your load balancers, your web servers, and your database. But what if that external script that you embed is breaking your site? Expired TLS certificate? Something wrong with DNS? How can you test that your website works for real users? In this episode, we introduce CloudWatch Synthetics as a solution to monitor your website from a user perspective.
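The kind of outside-in check a Synthetics canary runs can be approximated in plain Python: fetch the page the way a user would, then inspect the TLS certificate's expiry. This is only a sketch of the idea, not the Synthetics API itself (real canaries run on a managed Lambda runtime), and the URL and hostname are placeholders.

```python
import ssl
import time
import socket
import urllib.request

URL = "https://example.com/"   # placeholder site to monitor
HOST = "example.com"

# 1. Does the page load, and with what status code?
with urllib.request.urlopen(URL, timeout=10) as resp:
    assert resp.status == 200, f"unexpected status {resp.status}"

# 2. Is the TLS certificate close to expiry?
ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

expires = ssl.cert_time_to_seconds(cert["notAfter"])
days_left = int((expires - time.time()) // 86400)
assert days_left > 14, f"certificate expires in {days_left} days"
print(f"OK: status 200, certificate valid for {days_left} more days")
```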

cloudonaut
#16 CloudWatch Metrics & Alarms reloaded

cloudonaut

Play Episode Listen Later Mar 26, 2020 43:08


Amazon CloudWatch has improved significantly over the years, so it's time to look at its monitoring capabilities again. CloudWatch is an excellent starting point for implementing enhanced monitoring on AWS. In this episode, Michael demonstrates what you can do with CloudWatch metrics and alarms. Metrics provide a time-series database for telemetry (e.g., CPU utilization of an EC2 instance). Alarms watch a metric and trigger actions if a threshold is reached.
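As a taste of what the episode walks through, here is a minimal boto3 sketch that publishes a custom metric and puts an alarm on it. The namespace, metric name, threshold, and SNS topic ARN are all hypothetical placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one data point to a custom time-series metric.
cloudwatch.put_metric_data(
    Namespace="MyApp",                   # placeholder namespace
    MetricData=[{
        "MetricName": "QueueDepth",      # placeholder metric
        "Value": 42.0,
        "Unit": "Count",
    }],
)

# Alarm when the 5-minute average stays above a threshold for three
# periods; the action is a hypothetical SNS topic that pages on-call.
cloudwatch.put_metric_alarm(
    AlarmName="my-app-queue-depth-high",
    Namespace="MyApp",
    MetricName="QueueDepth",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=100.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:on-call"],
)
```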

Mobycast
Automate all the things - Updating container secrets using CloudWatch Events + Lambda

Mobycast

Play Episode Listen Later Mar 4, 2020 68:15


In this episode, we cover the following topics:
- Developing a system for automatically updating containers when secrets are updated is a two-part solution. First, we need to be notified when secrets are updated. Then, we need to trigger an action to update the ECS service.
- CloudWatch Events can be used to receive notifications when secrets are updated. We explain CloudWatch Events and its primary components: events, rules, and targets.
- Event patterns are used to filter for the specific events that the rule cares about. We discuss how to write event patterns and the rules of matching events.
- The event data structure will be different for each type of emitter. We detail a handy tip for determining the event structure of an emitter.
- We discuss EventBridge and how it relates to CloudWatch Events.
- We explain how to create CloudWatch Events rules for capturing update events emitted by both Systems Manager Parameter Store and AWS Secrets Manager.
- AWS Lambda can be leveraged as a target of CloudWatch Events rules. We explain how to develop a Lambda function that invokes the ECS API to recycle all containers (see the sketch after these notes).
- We finish up by showing how this works for a common use case: using the automatic credential rotation feature of AWS Secrets Manager with a containerized app running on ECS that connects to an RDS database.

Detailed Show Notes
Want the complete episode outline with detailed notes? Sign up here: https://mobycast.fm/show-notes/

Support Mobycast
https://glow.fm/mobycast

End Song
Night Sea Journey by Derek Russo

More Info
For a full transcription of this episode, please visit the episode webpage.

We'd love to hear from you! You can reach us at:
Web: https://mobycast.fm
Voicemail: 844-818-0993
Email: ask@mobycast.fm
Twitter: https://twitter.com/hashtag/mobycast
Reddit: https://reddit.com/r/mobycast
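A sketch of the two halves described in the notes above, in boto3. The event pattern is an assumption about how Secrets Manager update events arrive (as API calls recorded via CloudTrail); as the episode's tip suggests, capture a real event and inspect its structure before relying on a pattern. The cluster and service names are placeholders.

```python
import json
import boto3

events = boto3.client("events")
ecs = boto3.client("ecs")

# Part 1: a rule that matches secret-update events. This pattern is an
# assumption; verify the real event structure emitted in your account.
events.put_rule(
    Name="on-secret-updated",
    EventPattern=json.dumps({
        "source": ["aws.secretsmanager"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["secretsmanager.amazonaws.com"],
            "eventName": ["PutSecretValue", "UpdateSecret"],
        },
    }),
)

# Part 2: a Lambda handler that recycles the ECS service's containers.
def handler(event, context):
    # forceNewDeployment makes ECS start replacement tasks (which read
    # the new secret values) and drain the old ones.
    ecs.update_service(
        cluster="my-cluster",    # placeholder cluster name
        service="my-service",    # placeholder service name
        forceNewDeployment=True,
    )
```

Wiring the function up as the rule's target (events.put_targets plus a resource-based permission on the Lambda function) is omitted for brevity.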

Mobycast
Serverless Containers with ECS Fargate - Part 3

Mobycast

Play Episode Listen Later Nov 20, 2019 58:34


Support Mobycast
https://glow.fm/mobycast

Show Details
In this episode, we cover the following topics:

Container networking
- ECS networking mode configures the Docker networking mode to use for the containers in the task and is specified as part of the task definition. Valid values:
  - none: Containers do not have external connectivity, and port mappings can't be specified in the container definition.
  - bridge: Utilizes Docker's built-in virtual network, which runs inside each container instance. Containers on an instance are connected to each other using the docker0 bridge, and they use this bridge to communicate with endpoints outside the instance via the primary ENI of the instance they are running on. Containers share the networking properties of the primary ENI, including the firewall rules and IP addressing, and are addressed by the combination of the primary ENI's IP address and the host port to which they are mapped. Cons: you cannot address these containers with the IP address allocated by Docker (it comes from a pool of locally scoped addresses), and you cannot enforce finely grained network ACLs and firewall rules.
  - host: Bypasses Docker's built-in virtual network and maps container ports directly to the EC2 instance's NIC. You can't run multiple instantiations of the same task on a single container instance when port mappings are used.
  - awsvpc: Each task is allocated its own ENI and IP address, so multiple applications (including multiple copies of the same app) can run on the same port number without conflict. You must specify a NetworkConfiguration when you create a service or run a task with the task definition.
- The default networking mode is bridge.
- The host and awsvpc network modes offer the highest networking performance because they use the Amazon EC2 network stack instead of the virtualized network stack provided by bridge mode. They cannot take advantage of dynamic host port mappings: exposed container ports are mapped directly to the corresponding host port (host) or to the attached elastic network interface port (awsvpc).

Task networking (aka awsvpc mode networking)
- Benefits:
  - Each task has its own attached ENI, with a primary private IP address and an internal DNS hostname.
  - Simplifies container networking: no host port is specified; the container port is what is used by the task ENI. Container ports must be unique within a single task definition.
  - Gives more control over how tasks communicate, both with other tasks and with other services in the VPC. Containers that belong to the same task share a network namespace and communicate with each other over the localhost interface (e.g., curl 127.0.0.1:8080).
  - Lets you take advantage of VPC Flow Logs.
  - Better security through the use of security groups: you can assign different security groups to each task, which gives you more fine-grained security.
- Limitations:
  - The number of ENIs that can be attached to an EC2 instance is fairly small. E.g., a c5.large may have up to 3 ENIs attached to it: 1 primary, and 2 for task networking. Therefore, you can only host 2 tasks using awsvpc mode networking on a c5.large.
  - However, you can increase ENI density using VPC trunking.

VPC trunking
- Allows for overcoming ENI density limits by multiplexing data over a shared communication link.
- How it works: two ENIs are attached to the instance, a primary ENI and a trunk ENI. Note that enabling trunking consumes an additional IP address per instance. Your account, IAM user, or role must opt in to the awsvpcTrunking account setting.
- Benefits: up to 5x-17x more ENIs per instance. E.g., with trunking, a c5.large goes from 3 to 12 ENIs: 1 primary, 1 trunk, and 10 for task networking.

Migrating a container from EC2 to Fargate

IAM roles
- Roles created automatically by ECS:
  - The Amazon ECS service-linked IAM role, AWSServiceRoleForECS, gives permission to attach the ENI to the instance.
  - The Task Execution IAM Role (ecsTaskExecutionRole) is needed for pulling images from ECR and pushing logs to CloudWatch.
- Create a task-based IAM role, required because we don't have an ecsInstanceRole anymore:
  - Create an IAM policy that gives the minimal privileges needed by the task. Remember the two categories of policies: AWS managed and customer managed. We are going to create a new customer-managed policy that contains only the permissions our app needs: KMS Decrypt and S3 GETs from a specific bucket. IAM -> Policies -> Create Policy -> JSON. (See the IAM policy sketch below.)
  - Create a role based on the "Elastic Container Service Task" service role. This service role gives ECS permission to use STS to assume the role (sts:AssumeRole) and perform actions on its behalf. IAM -> Roles -> Create Role; for "Select type of trusted entity", choose AWS Service, then choose "Elastic Container Service" and the "Elastic Container Service Task" use case. Next, attach the IAM policy we created to the role and save.

Task definition file changes
- Task-level parameters:
  - Add FARGATE to requiredCompatibilities.
  - Use awsvpc as the network mode.
  - Specify cpu and memory limits at the task level.
  - Specify the Task Execution IAM Role (executionRoleArn), which allows the task to pull images from ECR and send logs to CloudWatch Logs.
  - Specify the task-based IAM role (taskRoleArn), needed to give the task permission to perform AWS API calls (such as S3 reads).
- Container-level parameters: only specify containerPort; do not specify hostPort. (See the task definition sketch below.)

Create ECS service
- Choose the cluster.
- Specify networking: VPC and subnets.
- Create a security group for this task. The security group is attached to the ENI; allow inbound port 80 traffic and auto-assign a public IP.
- Attach to an existing application load balancer: specify the production listener (port/protocol) and create a new target group. When creating a target group, you specify a target type: instance, IP, or Lambda function. For awsvpc mode (and therefore Fargate), you must use the IP target type.
- Specify the path pattern for the ALB listener and the health check path. Note: you cannot specify host-based routing through the console; you can update that after creating the service through the ALB console.
- Update security groups: the security group for the ALB must allow outbound port 80 to the security group we attached to our ENI, and the security group for RDS must allow inbound port 3306 from the security group for our ENI.
- Create a Route 53 record: an ALIAS pointing to our ALB.

Log integration with Sumo Logic
- Update the task to send logs to stdout/stderr; do not log to a file.
- Configure containers for CloudWatch logging (the "awslogs" driver).
- Create a Lambda function that subscribes to the CloudWatch log group, converts from the CloudWatch format to the Sumo format, and POSTs the data to a Sumo HTTP source. A dead-letter queue is recommended to handle/retry failed sends to Sumo.

Links
- Introducing Cloud Native Networking for Amazon ECS Containers
- Task Networking in AWS Fargate
- Under the Hood: Task Networking for Amazon ECS
- Task Networking with the awsvpc Network Mode
- ECS Task Definition - Network Mode
- How many tasks using the awsvpc network mode can be launched
- Optimizing Amazon ECS task density using awsvpc network mode
- Migrating Your Amazon ECS Containers to AWS Fargate
- Sumo Logic - Collect Logs from AWS Fargate

End Song
Drifter - Roy England
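The IAM policy example referenced in the notes above might look like the following, sketched as a boto3 call rather than the episode's exact JSON. The KMS key and S3 bucket ARNs are placeholders for whatever your app actually needs.

```python
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {   # decrypt secrets encrypted with one specific KMS key
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",
        },
        {   # read objects from one specific bucket
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-app-config/*",
        },
    ],
}

iam.create_policy(
    PolicyName="my-app-task-policy",
    PolicyDocument=json.dumps(policy_document),
)
```

And a sketch of the task definition changes listed above: FARGATE compatibility, awsvpc networking, task-level cpu/memory, both IAM roles, and a containerPort with no hostPort. All names, ARNs, and the image URI are placeholders, not the episode's actual values.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",      # task-level limits; Fargate takes these as strings
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    taskRoleArn="arn:aws:iam::123456789012:role/my-app-task-role",
    containerDefinitions=[{
        "name": "my-app",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "logConfiguration": {
            "logDriver": "awslogs",   # ship stdout/stderr to CloudWatch Logs
            "options": {
                "awslogs-group": "/ecs/my-app",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "my-app",
            },
        },
    }],
)
```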
More Info
For a full transcription of this episode, please visit the episode webpage.

We'd love to hear from you! You can reach us at:
Web: https://mobycast.fm
Voicemail: 844-818-0993
Email: ask@mobycast.fm
Twitter: https://twitter.com/hashtag/mobycast
Reddit: https://reddit.com/r/mobycast