Podcasts about Observability

  • 285PODCASTS
  • 741EPISODES
  • 41mAVG DURATION
  • 1DAILY NEW EPISODE
  • Dec 2, 2022LATEST

POPULARITY

20152016201720182019202020212022

Categories



Best podcasts about Observability

Show all podcasts related to observability

Latest podcast episodes about Observability

Machine Learning and AI applications
#33 Product observability and telemetry in the realm of AI

Machine Learning and AI applications

Play Episode Listen Later Dec 2, 2022 55:30


In this third podcast of our season 4, I had a detailed conversation with a guest from a start-up in the context of product observability, feature usage and telemetry in the realm of AI. Our guest, John Cutler(from Amplitude), talks about the importance of being product aware for analysts rather than just being analytics driven usage for the product experts. Stay tuned for more interesting conversations from the (XTrawAI.com) podcast series on machine learning and AI applications. --- Send in a voice message: https://anchor.fm/raghu-banda/message

Cribl: The Stream Life
How to Get Funding for Observability and Security Projects

Cribl: The Stream Life

Play Episode Listen Later Nov 30, 2022 22:43


In this episode of The Stream Life Podcast, I chat with Cribl's Ed Bailey about the challenges IT professionals have as they prepare their annual budgets. Resources On-demand Webinar: How To Get Funding For Observability And Security Projects Join the Cribl Community Begin your Cribl certification path Connect with Bradley on Twitter Connect with Ed on Twitter If you want to get every episode of the Stream Life podcast automatically, you can subscribe on Apple Podcasts, Spotify, Pocket Casts, Overcast, RSS, or wherever you get your podcasts.

Cloud Talk
Episode 126: Full-Stack Observability: Bringing Developers and the Business Together

Cloud Talk

Play Episode Listen Later Nov 30, 2022 30:57


Join Joe Byrne, CTO of AppDynamics, as he and Jeff DeVerter discuss the impact of connecting business KPIs to the quality of the developer's code and the operations of the underlying systems. Special Guest: Joe Byrne.

The Data Engineering Show
Making Observability a Key Business Driver

The Data Engineering Show

Play Episode Listen Later Nov 29, 2022 48:59


80% of the code that you write doesn't work on the first try. And that's fine. But knowing which 80% is not working and which 20% is working is the actual challenge. After 10 years at Facebook, managing and scaling the Seattle site to over 6000 engineers(!) Vijaye Raji founded Statsig to make observability automated and real-time. How is the semantic layer managed? How was the Statsig team able to build an observability product that handles real-time ever-changing metadata? What are Vijaye's main takeaways from engineering at Facebook? Tune in.

Open Source Startup Podcast
E64: Open Source Data Observability with Elementary Data

Open Source Startup Podcast

Play Episode Listen Later Nov 28, 2022 38:51


Maayan Salom is Co-Founder of Elementary Data, the open source data observability platform which allows users to monitor their data warehouse directly from dbt. Their project, also called Elementary, is built for analytics engineers and today has almost 1K GitHub stars and a rapidly growing community of almost 600 users. The company has raised from leading Israel and US-based venture firms as well as a number of high-profile angel investors. In this episode, we discuss having a culture of experimentation, building a community alongside other communities (ie. dbt), using your community for product feedback, the hustle involved in early GTM, learnings from building for a fast-growing community & more!

Machine Learning and AI applications
#32 Leveraging ML observability in the context of AI

Machine Learning and AI applications

Play Episode Listen Later Nov 25, 2022 52:10


In this second podcast of our season 4, I had an engaging conversation with 2 of our guests from an AI start-up on ML platforms. Our guests Amber Roberts and Dat Ngo, talk about how ML observability is important in the realm of building and monitoring ML models for enterprises and consumer businesses. Stay tuned for more interesting conversations from the (XTrawAI.com) podcast series on machine learning and AI applications. --- Send in a voice message: https://anchor.fm/raghu-banda/message

The New Stack Podcast
Chronosphere Nudges Observability Standards Toward Maturity

The New Stack Podcast

Play Episode Listen Later Nov 23, 2022 15:31


DETROIT — Rob Skillington's grandfather was a civil engineer, working in an industry that, in over a century, developed processes and know-how that enabled the creation of buildings, bridges and road. “A lot of those processes matured to a point where they could reliably build these things,” said Skillington, co-founder and chief technology officer at Chronosphere, an observability platform. “And I think about observability as that same maturity of engineering practice. When  it comes to building software that actually is useful in the world, it is this process that helps you actually achieve the deployment and operation of these large scale systems that we use every day.” Skillington spoke about the evolution of observability, and his company's recent donation of an open source project to Prometheus, in this episode of The New Stack Makers podcast. Heather Joslyn, features editor of TNS, hosted the conversation. This On the Road edition of The New Stack Makers was recorded at KubeCon + CloudNativeCon North America, in the Motor City. The episode was sponsored by Chronosphere.A Donation to the Prometheus ProjectHelping observability practices grow as mature and reliable as civil engineering rules that help build sturdy skyscrapers is a tough task, Skillington suggested. In the cloud era, he said, “you have to really prepare the software for a whole set of runtime environments. And so the challenges around that is really about making it consistent, well understood and robust.” At KubeCon in late October, Chronosphere and PromLabs (founded by Julius Volz, creator of Prometheus) announced that they had donated their open source project PromLens to the Prometheus project, the open source monitoring and alerts primitive. The donation is a way of placing a bet on a tool that integrates well with Kubernetes. “There's this real yearning for essentially a standard that can be built upon by everyone in the industry, when it comes to these core primitives, essentially,” Skillington said. “And Prometheus is one of those primitives. We want to continue to solidify that as a primitive that stands the test of time.” “We can't build a self-driving car if we're always building a different car,” he added. PromLens builds Prometheus queries in a sort of integrated development environment (IDE), Skillington said. It also makes it easier for more people in an organization to create queries and understand the meaning and seriousness of alerts. The PromLens tool breaks queries into a visual format, and allows users to edit them through a UI. “Basically, it's kind of like a What You See Is What You Get editor, or  WYSIWYG editor, for Prometheus queries,” Skillington said. “Some of our customers have tens of thousands of these alerts to find in PromQL, which is the query language for Prometheus,” he noted. “Having a tool like an integrated development environment — where you can really understand these complex queries and iterate faster on, setting these up and getting back to your day job — is incredibly important.” Check out the full episode for more on PromLens and the current state of observability.

Enginears
Elastic and portfolio of products (ElasticSearch, ELK, Observability) | Enginears Podcast

Enginears

Play Episode Listen Later Nov 23, 2022 35:40


Introduction into Developer Advocate roleUnderstanding Elastic's product line   - Enterprise Search   - Observability   - SecurityJourney of Elastic into an Enterprise product   - Elasticsearch / practical examples      - Full-text search      - Querying      - Tokenisers      - Search as you type      - Reverse querying (percolators)Observability   - Request tracing   - Track usage volume, and integrate anomaly detection to find patterns and trends   - Log browsing and searching   - Error stack traces   - RUM browser and user behaviour informationPrefer to watch your podcasts? Check us out on YouTube: https://bit.ly/3dCMPRb You can also find us on Twitter at: http://bit.ly/36cTJYr

Elixir Mix
Understanding Observability in Elixir with Dave Lucia - EMx 195

Elixir Mix

Play Episode Listen Later Nov 23, 2022 55:18


Dave Lucia is a CTO at a media company called Bitfo, which builds high-quality educational content in the cryptocurrency space. He has been an Elixir Developer for about 6 years. He is the author of “Elixir Observability: OpenTelemetry, Lightstep, Honeycomb”. He joins the show to talk about how they were able to build their system and other websites like DeFi Rate and ethereumprice.About this Episode Observability OpenTelemetry OpenTracing Analyzing and Making Data useful Tools used for tracing and metrics Sponsors Chuck's Resume Template Developer Book Club starting with Clean Architecture by Robert C. Martin Become a Top 1% Dev with a Top End Devs Membership Links Elixir Observability: OpenTelemetry, Lightstep, Honeycomb Bitfo DeFi Rate ethereumprice Dave Lucia's Blog GitHub: davydog187 Twitter: @davydog187 Picks Allen - Distributed Services with Go Dave - Software Unscripted Dave - bitfo/timescale Dave - bitfo/ectorange Sascha - ex_union

Secrets of Data Analytics Leaders
The Blending Disciplines Of Data Observability, DataOps, And FinOps - Audio Blog

Secrets of Data Analytics Leaders

Play Episode Listen Later Nov 21, 2022 12:32


Data observability provides intelligence about data quality and data pipeline performance, contributing to the disciplines of DataOps and FinOps. Vendors such as DataKitchen, DataOps.live, Informatica, and Unravel offer solutions to help enterprises address these overlapping disciplines. Published at: https://www.eckerson.com/articles/the-blending-disciplines-of-data-observability-dataops-and-finops

This Week in Enterprise Tech (Video HD)
TWiET 520: Catchpoint and Release - Apple to use US-made chips, Amazon layoffs, observability with Catchpoint

This Week in Enterprise Tech (Video HD)

Play Episode Listen Later Nov 19, 2022 69:15


Apple to use US-made chips in 2024, vulnerabilities in space, network observability with Catchpoint, and more. Zero-trust initiatives stall, as cyberattack costs rocket to $1M per incident Tim Cook says Apple will use chips made in America from 2024 Spacecraft vulnerable to failure, thanks to aerospace networking bug Amazon to cut around 10,000 corporate and tech jobs, making largest layoffs in its history ISP deploys fiber service with a wrinkle - the users themselves own each network Catchpoint CMO Gerardo Dada talks about network monitoring and observability Hosts: Curt Franklin and Brian Chee Guest: Gerardo Dada Download or subscribe to this show at https://twit.tv/shows/this-week-in-enterprise-tech. Get episodes ad-free with Club TWiT at https://twit.tv/clubtwit Sponsors: canary.tools/twit - use code: TWIT Code Comments nureva.com/twit

This Week in Enterprise Tech (MP3)
TWiET 520: Catchpoint and Release - Apple to use US-made chips, Amazon layoffs, observability with Catchpoint

This Week in Enterprise Tech (MP3)

Play Episode Listen Later Nov 19, 2022 68:56


Apple to use US-made chips in 2024, vulnerabilities in space, network observability with Catchpoint, and more. Zero-trust initiatives stall, as cyberattack costs rocket to $1M per incident Tim Cook says Apple will use chips made in America from 2024 Spacecraft vulnerable to failure, thanks to aerospace networking bug Amazon to cut around 10,000 corporate and tech jobs, making largest layoffs in its history ISP deploys fiber service with a wrinkle - the users themselves own each network Catchpoint CMO Gerardo Dada talks about network monitoring and observability Hosts: Curt Franklin and Brian Chee Guest: Gerardo Dada Download or subscribe to this show at https://twit.tv/shows/this-week-in-enterprise-tech. Get episodes ad-free with Club TWiT at https://twit.tv/clubtwit Sponsors: canary.tools/twit - use code: TWIT Code Comments nureva.com/twit

Software at Scale
Software at Scale 52 - Building Build Systems with Benjy Weinberger

Software at Scale

Play Episode Listen Later Nov 17, 2022 62:57


Benjy Weinberger is the co-founder of Toolchain, a build tool platform. He is one of the creators of the original Pants, an in-house Twitter build system focused on Scala, and was the VP of Infrastructure at Foursquare. Toolchain now focuses on Pants 2, a revamped build system.Apple Podcasts | Spotify | Google PodcastsIn this episode, we go back to the basics, and discuss the technical details of scalable build systems, like Pants, Bazel and Buck. A common challenge with these build systems is that it is extremely hard to migrate to them, and have them interoperate with open source tools that are built differently. Benjy's team redesigned Pants with an initial hyper-focus on Python to fix these shortcomings, in an attempt to create a third generation of build tools - one that easily interoperates with differently built packages, but still fast and scalable.Machine-generated Transcript[0:00] Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Benji Weinberger, previously a software engineer at Google and Twitter, VP of Infrastructure at Foursquare, and now the founder and CEO of Toolchain.Thank you for joining us.Thanks for having me. It's great to be here. Yes. Right from the beginning, I saw that you worked at Google in 2002, which is forever ago, like 20 years ago at this point.What was that experience like? What kind of change did you see as you worked there for a few years?[0:37] As you can imagine, it was absolutely fascinating. And I should mention that while I was at Google from 2002, but that was not my first job.I have been a software engineer for over 25 years. And so there were five years before that where I worked at a couple of companies.One was, and I was living in Israel at the time. So my first job out of college was at Check Point, which was a big successful network security company. And then I worked for a small startup.And then I moved to California and started working at Google. And so I had the experience that I think many people had in those days, and many people still do, of the work you're doing is fascinating, but the tools you're given to do it with as a software engineer are not great.This, I'd had five years of experience of sort of struggling with builds being slow, builds being flaky with everything requiring a lot of effort. There was almost a hazing,ritual quality to it. Like, this is what makes you a great software engineer is struggling through the mud and through the quicksand with this like awful substandard tooling. And,We are not users, we are not people for whom products are meant, right?We make products for other people. Then I got to Google.[2:03] And Google, when I joined, it was actually struggling with a very massive, very slow make file that took forever to parse, let alone run.But the difference was that I had not seen anywhere else was that Google paid a lot of attention to this problem and Google devoted a lot of resources to solving it.And Google was the first place I'd worked and I still I think in many ways the gold standard of developers are first class participants in the business and deserve the best products and the best tools and we will if there's nothing out there for them to use, we will build it in house and we will put a lot of energy into that.And so it was for me, specifically as an engineer.[2:53] A big part of watching that growth from in the sort of early to late 2000s was. The growth of engineering process and best practices and the tools to enforce it and the thing i personally am passionate about is building ci but i'm also talking about.Code review tools and all the tooling around source code management and revision control and just everything to do with engineering process.It really was an object lesson and so very, very fascinating and really inspired a big chunk of the rest of my career.I've heard all sorts of things like Python scripts that had to generate make files and finally they move the Python to your first version of Blaze. So it's like, it's a fascinating history.[3:48] Maybe can you tell us one example of something that was like paradigm changing that you saw, like something that created like a magnitude, like order of magnitude difference,in your experience there and maybe your first aha moment on this is how good like developer tools can be?[4:09] Sure. I think I had been used to using make basically up till that point. And Google again was, as you mentioned, using make and really squeezing everything it was possible to squeeze out of that lemon and then some.[4:25] But when the very early versions of what became blaze which was that big internal build system which inspired basil which is the open source variant of that today. Hey one thing that really struck me was the integration with the revision controls system which was and i think still is performance.I imagine many listeners are very familiar with Git. Perforce is very different. I can only partly remember all of the intricacies of it, because it's been so long since I've used it.But one interesting aspect of it was you could do partial checkouts. It really was designed for giant code bases.There was this concept of partial checkouts where you could check out just the bits of the code that you needed. But of course, then the question is, how do you know what those bits are?But of course the build system knows because the build system knows about dependencies. And so there was this integration, this back and forth between the, um.[5:32] Perforce client and the build system that was very creative and very effective.And allowed you to only have locally on your machine, the code that you actually needed to work on the piece of the codebase you're working on,basically the files you cared about and all of their transitive dependencies. And that to me was a very creative solution to a problem that involved some lateral thinking about how,seemingly completely unrelated parts of the tool chain could interact. And that's kind of been that made me realize, oh, there's a lot of creative thought at work here and I love it.[6:17] Yeah, no, I think that makes sense. Like I interned there way back in 2016. And I was just fascinated by, I remember by mistake, I ran like a grep across the code base and it just took forever. And that's when I realized, you know, none of this stuff is local.First of all, like half the source code is not even checked out to my machine.And my poor grep command is trying to check that out. But also how seamlessly it would work most of the times behind the scenes.Did you have any experience or did you start working on developer tools then? Or is that just what inspired you towards thinking about developer tools?I did not work on the developer tools at Google. worked on ads and search and sort of Google products, but I was a big user of the developer tools.Exception which was that I made some contributions to the.[7:21] Protocol buffer compiler which i think many people may be familiar with and that is. You know if i very deep part of the toolchain that is very integrated into everything there and so that gave me.Some experience with what it's like to hack on a tool that's everyone in every engineer is using and it's the sort of very deep part of their workflow.But it wasn't until after google when i went to twitter.[7:56] I noticed that the in my time of google my is there the rest of the industry had not. What's up and suddenly i was sort of stressed ten years into the past and was back to using very slow very clunky flaky.Tools that were not designed for the tasks we were trying to use them for. And so that made me realize, wait a minute, I spent eight years using these great tools.They don't exist outside of these giant companies. I mean, I sort of assumed that maybe, you know, Microsoft and Amazon and some other giants probably have similar internal tools, but there's something out there for everyone else.And so that's when I started hacking on that problem more directly was at Twitter together with John, who is now my co-founder at Toolchain, who was actually ahead of me and ahead ofthe game at Twitter and already begun working on some solutions and I joined him in that.Could you maybe describe some of the problems you ran into? Like were the bills just taking forever or was there something else?[9:09] So there were...[9:13] A big part of the problem was that Twitter at the time, the codebase I was interested in and that John was interested in was using Scala. Scala is a fascinating, very rich language.[9:30] Its compiler is very slow. And we were in a situation where, you know, you'd make some small change to a file and then builds would take just,10 minutes, 20 minutes, 40 minutes. The iteration time on your desktop was incredibly slow.And then CI times, where there was CI in place, were also incredibly slow because of this huge amount of repetitive or near repetitive work. And this is because the build tools,etc. were pretty naive about understanding what work actually needs to be done given a set of changes.There's been a ton of work specifically on SBT since then.[10:22] It has incremental compilation and things like that, but nonetheless, that still doesn't really scale well to large corporate codebases that are what people often refer to as monorepos.If you don't want to fragment your codebase with all of the immense problems that that brings, you end up needing tooling that can handle that situation.Some of the biggest challenges are, how do I do less than recompile the entire codebase every time. How can tooling help me be smart about what is the correct minimal amount of work to do.[11:05] To make compiling and testing as fast as it can be?[11:12] And I should mention that I dabbled in this problem at Twitter with John. It was when I went to Foursquare that I really got into it because Foursquare similarly had this big Scala codebase with a very similar problem of incredibly slow builds.[11:29] The interim solution there was to just upgrade everybody's laptops with more RAM and try and brute force the problem. It was very obvious to everyone there, tons of,force-creation pattern still has lots of very, very smart engineers.And it was very obvious to them that this was not a permanent solution and we were casting around for...[11:54] You know what can be smart about scala builds and i remember this thing that i had hacked on twitter and. I reached out to twitter and ask them to open source it so we could use it and collaborate on it wasn't obviously some secret sauce and that is how the very first version of the pants open source build system came to be.I was very much designed around scarlet did eventually.Support other languages. And we hacked on it a lot at Foursquare to get it to...[12:32] To get the codebase into a state where we could build it sensibly. So the one big challenge is build speed, build performance.The other big one is managing dependencies, keeping your codebase sane as it scales.Everything to do with How can I audit internal dependencies?How do I make sure that it is very, very easy to accidentally create all sorts of dependency tangles and cycles and create a code base whose dependency structure is unintelligible, really,hard to work with and actually impacts performance negatively, right?If you have a big tangle of dependencies, you're more likely to invalidate a large chunk of your code base with a small change.And so tooling that allows you to reason about the dependencies in your code base and.[13:24] Make it more tractable was the other big problem that we were trying to solve. Mm-hmm. No, I think that makes sense.I'm guessing you already have a good understanding of other build systems like Bazel and Buck.Maybe could you walk us through what are the difference for PANs, Veevan? What is the major design differences? And even maybe before that, like, how was Pants designed?And is it something similar to like creating a dependency graph? You need to explicitly include your dependencies.Is there something else that's going on?[14:07] Maybe just a primer. Yeah. Absolutely. So I should mention, I was careful to mention, you mentioned Pants V1.The version of Pants that we use today and base our entire technology stack around is what we very unimaginatively call Pants V2, which we launched two years ago almost to the day.That is radically different from Pants V1, from Buck, from Bazel. It is quite a departure in ways that we can talk about later.One thing that I would say Panacea V1 and Buck and Bazel have in common is that they were designed around the use cases of a single organization. is a.[14:56] Open source variant or inspired by blaze its design was very much inspired by. Here's how google does engineering and a buck similarly for facebook and pansy one frankly very similar for.[15:11] Twitter and we sort of because Foursquare also contributed a lot to it, we sort of nudged it in that direction quite a bit. But it's still very much if you did engineering in this one company's specific image, then this might be a good tool for you.But you had to be very much in that lane.But what these systems all look like is, and the way they are different from much earlier systems is.[15:46] They're designed to work in large scalable code bases that have many moving parts and share a lot of code and that builds a lot of different deployables, different, say, binaries or DockerDocker images or AWS lambdas or cloud functions or whatever it is you're deploying, Python distributions, Java files, whatever it is you're building, typically you have many of them in this code base.Could be lots of microservices, could be just lots of different things that you're deploying.And they live in the same repo because you want that unity. You want to be able to share code easily. you don't want to introduce dependency hell problems in your own code. It's bad enough that we have dependency hell problems third-party code.[16:34] And so these systems are all if you squint at them from thirty thousand feet today all very similar in that they make that the problem of. Managing and building and testing and packaging in a code base like that much more tractable and the way they do this is by applying information about the dependencies in your code base.So the important ingredient there is that these systems understand the find the relatively fine grained dependencies in your code base.And they can use that information to reason about work that needs to happen. So a naive build system, you'd say, run all the tests in the repo or in this part of the repo.So a naive system would literally just do that, and first they would compile all the code.[17:23] But a scalable build system like these would say, well, you've asked me to run these tests, but some of them have already been cached and these others, okay, haven't.So I need to look at these ones I actually need to run. So let me see what needs to be done before I can run them.Oh, so these source files need to be compiled, but some of those already in cache and then these other ones I need to compile. But I can apply concurrency because there are multiple cores on this machine.So I can know through dependency analysis which compile jobs can run concurrently and which cannot. And then when it actually comes time to run the tests, again, I can apply that sort of concurrency logic.[18:03] And so these systems, what they have in common is that they use dependency information to make your building testing packaging more tractable in a large code base.They allow you to not have to do the thing that unfortunately many organizations find themselves doing, which is fragmenting the code base into lots of different bits andsaying, well, every little team or sub team works in its own code base and they consume each other's code through, um, so it was third party dependencies in which case you are introducing a dependency versioning hell problem.Yeah. And I think that's also what I've seen that makes the migration to a tool like this hard. Cause if you have an existing code base that doesn't lay out dependencies explicitly.[18:56] That migration becomes challenging. If you already have an import cycle, for example.[19:01] Bazel is not going to work with you. You need to clean that up or you need to create one large target where the benefits of using a tool like Bazel just goes away. And I think that's a key,bit, which is so fascinating because it's the same thing over several years. And I'm hoping that,it sounds like newer tools like Go, at least, they force you to not have circular dependencies and they force you to keep your code base clean so that it's easy to migrate to like a scalable build system.[19:33] Yes exactly so it's funny that is the exact observation that let us to pans to see to so they said pans to be one like base like buck was very much inspired by and developed for the needs of a single company and other companies were using it a little bit.But it also suffered from any of the problems you just mentioned with pans to for the first time by this time i left for square and i started to chain with the exact mission of every company every team of any size should have this kind of tooling should have this ability this revolutionary ability to make the code base is fast and tractable at any scale.And that made me realize.We have to design for that we have to design for not for. What a single company's code base looks like but we have to design.To support thousands of code bases of all sorts of different challenges and sizes and shapes and languages and frameworks so.We actually had to sit down and figure out what does it mean to make a tool.Like this assistant like this adoptable over and over again thousands of times you mentioned.[20:48] Correctly, that it is very hard to adopt one of those earlier tools because you have to first make your codebase conform to whatever it is that tool expects, and then you have to write huge amounts of manual metadata to describe all of the dependencies in your,the structure and dependencies of your codebase in these so-called build files.If anyone ever sees this written down, it's usually build with all capital letters, like it's yelling at you and that those files typically are huge and contain a huge amount of information your.[21:27] I'm describing your code base to the tool with pans be to eat very different approaches first of all we said this needs to handle code bases as they are so if you have circular dependencies it should handle them if you have. I'm going to handle them gracefully and automatically and if you have multiple conflicting external dependencies in different parts of your code base this is pretty common right like you need this version of whatever.Hadoop or NumPy or whatever it is in this part of the code base, and you have a different conflicting version in this other part of the code base, it should be able to handle that.If you have all sorts of dependency tangles and criss-crossing and all sorts of things that are unpleasant, and better not to have, but you have them, the tool should handle that.It should help you remove them if you want to, but it should not let those get in the way of adopting it.It needs to handle real-world code bases. The second thing is it should not require you to write all this crazy amount of metadata.And so with Panzer V2, we leaned in very hard on dependency inference, which means you don't write these crazy build files.You write like very tiny ones that just sort of say, you know, here is some code in this language for the build tool to pay attention to.[22:44] But you don't have to edit the added dependencies to them and edit them every time you change dependencies.Instead, the system infers dependencies by static analysis. So it looks at your, and it does this at runtime.So you, you know, almost all your dependencies, 99% of the time, the dependencies are obvious from import statements.[23:05] And there are occasional and you can obviously customize this because sometimes there are runtime dependencies that have to be inferred from like a string. So from a json file or whatever is so there are various ways to customize this and of course you can always override it manually.If you have to be generally speaking ninety.Seven percent of the boilerplate that used to going to build files in those old systems including pans v1 no. You know not claiming we did not make the same choice but we goes away with pans v2 for exactly the reason that you mentioned these tools,because they were designed to be adopted once by a captive audience that has no choice in the matter.And it was designed for how that code base that adopting code base already is. is these tools are very hard to adopt.They are massive, sometimes multi-year projects outside of that organization. And we wanted to build something that you could adopt in days to weeks and would be very easy,to customize to your code base and would not require these massive wholesale changes or huge amounts of metadata.And I think we've achieved that. Yeah, I've always wondered like, why couldn't constructing the build file be a part of the build. In many ways, I know it's expensive to do that every time. So just like.[24:28] Parts of the build that are expensive, you cache it and then you redo it when things change.And it sounds like you've done exactly that with BANs V2.[24:37] We have done exactly that. The results are cached on a profile basis. So the very first time you run something, then dependency inference can take some time. And we are looking at ways to to speed that up.I mean, like no software system has ever done, right? Like it's extremely rare to declare something finished. So we are obviously always looking at ways to speed things up.But yeah, we have done exactly what you mentioned. We don't, I should mention, we don't generate the dependencies into build for, we don't edit build files and then you check them in.We do that a little bit. So I mentioned you do still with PANSTL V2, you need these little tiny build files that just say, here is some code.They typically can literally be one line sometimes, almost like a marker file just to say, here is some code for you to pay attention to.We're even working on getting rid of those.We do have a little script that generates those one time just to help you onboard.But...[25:41] The dependencies really are just generated a runtime as on demand as needed and used a runtime so we don't have this problem of. Trying to automatically add or edit a otherwise human authored file that is then checked in like this generating and checking in files is.Problematic in many ways, especially when those files also have to take human written edits.So we just do away with all of that and the dependency inference is at runtime, on demand, as needed, sort of lazily done, and the information is cached. So both cached in memory in the surpassed V2 has this daemon that runs and caches a huge amount of state in memory.And the results of running dependency inference are also cached on disk. So they survive a daemon restart, etc.I think that makes sense to me. My next question is going to be around why would I want to use panthv2 for a smaller code base, right? Like, usually with the smaller codebase, I'm not running into a ton of problems around the build.[26:55] I guess, do you notice these inflection points that people run into? It's like, okay, my current build setup is not enough. What's the smallest codebase that you've seen that you think could benefit? Or is it like any codebase in the world? And I should start with,a better build system rather than just Python setup.py or whatever.I think the dividing line is, will this code base ever be used for more than one thing?[27:24] So if you have a, let's take the Python example, if literally all this code base will ever do is build this one distribution and a top level setup pie is all I need. And that is, you know, this,sometimes you see this with open source projects and the code base is going to remain relatively small, say it's only ever going to be a few thousand lines and the tests, even if I runthe tests from scratch every single time, it takes under five minutes, then you're probably fine.But I think two things I would look at are, am I going to be building multiple things in this code base in the future, or certainly if I'm doing it now.And that is much more common with corporate code bases. You have to ask yourself, okay, my team is growing, more and more people are cooperating on this code base.I want to be able to deploy multiple microservices. I want to be able to deploy multiple cloud functions.I want to be able to deploy multiple distributions or third-party artifacts.I want to be able to.[28:41] You know, multiple sort of data science jobs, whatever it is that you're building. If you want, if you ever think you might have more than one, now's the time to think about,okay, how do I structure the code base and what tooling allows me to do this effectively?And then the other thing to look at is build times. If you're using compiled languages, then obviously compilation, in all cases testing, if you start to see like, I can already see that that tests are taking five minutes, 10 minutes, 15 minutes, 20 minutes.Surely, I want some technology that allows me to speed that up through caching, through concurrency, through fine-grained invalidation, namely, don't even attempt to do work that isn't necessary for the result that was asked for.Then it's probably time to start thinking about tools like this, because the earlier you adopt it, the easier it is to adopt.So if you wait until you've got a tangle of multiple setup pies in the repo and it's unclear how you manage them and how you keep their dependencies synchronized,so there aren't version conflicts across these different projects, specifically with Python,this is an interesting problem.I would say with other languages, there is more because of the compilation step in jvm languages or go you.[30:10] Encounter the need for a build system much much earlier a bill system of some kind and then you will ask yourself what kind with python because you can get a bite for a while just running. What are the play gate and pie test and i directly and all everything is all together in a single virtual and.But the Python tooling, as mighty as it is, mostly is not designed for larger code bases with multiple, that deploy multiple things and have multiple different sets of.[30:52] Internal and external dependencies the tooling generally implicitly assume sort of one top level set up i want top level. Hi project dot com all you know how are you configuring things and so especially using python let's say for jango flask apps or for data scienceand your code base is growing and you've hired a bunch of data scientists and there's more and more code going in there. With Python, you need to start thinking about what tooling allows me to scale this code base. No, I think I mostly resonate with that. The first question that comes to my mind is,let's talk specifically about the deployment problem. If you're deployed to multiple AWS lambdas or cloud functions or whatever, the first thought that would come to my mind isis I can use separate Docker images that can let me easily produce this container image that I can ship independently.Would you say that's not enough? I totally get that for the build time problem.A Docker image is not going to solve anything. But how about the deployment step?[32:02] So again, with deployments, I think there are two ways where a tool like this can really speed things up.One is only build the things that actually need to be redeployed. And because the tool understands dependencies and can do change analysis, it can figure that out.So one of the things that HansB2 does is it integrates with Git.And so it natively understands how to figure out Git diffs. So you can say something like, show me all the whatever, lambdas, let's say, that are affected by changes between these two branches.[32:46] And it knows and it understands it can say, well, these files changed and you know, we, I understand the transitive dependencies of those files.So I can see what actually needs to be deployed. And, you know, many cases, many things will not need to be redeployed because they haven't changed.The other thing is there's a lot of performance improvements and process improvements around building those images. So, for example, we have for Python specifically, we have an executable format called PEX,which stands for Python executable, which is a single file that embeds all of your Python code that is needed for your deployable and all of its external requirements, transitive external requirements, all bundled up into this single sort of self-executing file.This allows you to do things like if you have to deploy 50 of these, you can basically have a single docker image.[33:52] The different then on top of that you add one layer for each of these fifty and the only difference in that layer is the presence of this pecs file. Where is without all this typically what you would do is.You have fifty docker images each one of which contains a in each one of which you have to build a virtual and which means running.[34:15] Pip as part of building the image, and that gets slow and repetitive, and you have to do it 50 times.We have a lot of ways to speed up. Even if you are deploying 50 different Docker images, we have ways of speeding that up quite dramatically.Because again, of things like dependency analysis, the PECS format, and the ability to build incrementally.Yeah, I think I remember that at Dropbox, we came up with our own, like, par format to basically bundle up a Python binary with, I think par stood for Python archive. I'm notentirely sure. But it did something remarkably similar to solve exactly this problem. It just takes so long, especially if you have a large Python code base. I think that makes sense to me. The other thing that one might ask is, with Python, you don't really have,too long of a build time, is what you would guess, because there's nothing to build. Maybe myPy takes some time to do some static analysis, and, of course, your tests can take forever,and you don't want to rerun them. But there isn't that much of a build time that you have to think about. Would you say that you agree with this, or there's some issues that end,up happening on real-world code basis.[35:37] Well that's a good question the word builds means different things to different people and we recently taken to using the time see i more. Because i think that is clear to people what that means but when i say build or see i mean it in the law in in the extended sense everything you do to go from.Human written source code to a verified.Test did. deployable artifact and so it's true that for python there's no compilation step although arguably. Running my pie is really important and now that i'm really in the habit of using.My pie i will probably never not use it on python code ever again but so that are.[36:28] Sort of build-ish steps for Python such as type checking, such as running code generators like Thrift or Protobuf.And obviously a big, big one is running, resolving third-party dependencies such as running PIP or poetry or whatever it is you're using. So those are all build steps.But with Python, really the big, big, big thing is testing and packaging and primarily testing.And so with Python, you have to be even more rigorous about unit testing than you do with other languages because you don't have a compiler that is catching whole classes of bugs.So and again, MyPy and type checking does really help with that. And so when I build to me includes, build in the large sense includes running tests,includes packaging and includes everything, all the quality control that you run typically in CI or on your desktop in order to go say, well, I've made some edits and here's the proof that these edits are good and I can merge them or deploy them.[37:35] I think that makes sense to me. And like, I certainly saw it with the limited number of testing, the limited amount of type checking you can do with Python, like MyPy is definitelyimproving on this. You just need to unit test a lot to get the same amount of confidence in your own code and then unit tests are not cheap. The biggest question that comes tomy mind is that is BANs V2 focused on Python? Because I have a TypeScript code base at my workplace and I would love to replace the TypeScript compiler with something that was slightly smarter and could tell me, you know what, you don't need to run every unit test every change.[38:16] Great question so when we launched a pass me to which was two years ago. The.We focused on python and that was the initial language we launched with because you had to start somewhere and in the city ten years in between the very scarlet centric work we were doing on pansy one. And the launch of hands be to something really major happened in the industry which was the python skyrocketed in popularity sky python went from.Mostly the little scripting language around the edges of your quote unquote real code, I can use python like fancy bash to people are building massive multi billion dollar businesses entirely on python code bases and there are a few things that drove this one was.I would say the biggest one probably was the python became the. Language of choice for data science and we have strong support for those use cases. There was another was the,Django and Flask became very popular for writing web apps more and more people were used there were more in Intricate DevOps use cases and Python is very popular for DevOps for various good reasons. So.[39:28] Python became super popular. So that was the first thing we supported in pants v2, but we've since added support for or Go, Java, Scala, Kotlin, Shell.Definitely what we don't have yet is JavaScript TypeScript. We are looking at that very closely right now, because that is the very obvious next thing we want to add.Actually, if any listeners have strong opinions about what that should look like, we would love to hear from them or from you on our Slack channels or on our GitHub discussions where we are having some lively discussions about exactly this because the JavaScript.[40:09] And TypeScript ecosystem is already very rich with tools and we want to provide only value add, right? We don't want to say, you have to, oh, you know, here's another paradigm you have to adopt.And here's, you know, you have to replace, you've just done replacing whatever this with this, you know, NPM with yarn. And now you have to do this thing. And now we're, we don't want to beanother flavor of the month. We only want to do the work that uses those tools, leverages the existing ecosystem but adds value. This is what we do with Python and this is one of the reasons why our Python support is very, very strong, much stronger than any other comparable tool out there is.[40:49] A lot of leaning in on the existing Python tool ecosystem but orchestrating them in a way that brings rigor and speed to your builds.And I haven't used the word we a lot. And I just kind of want to clarify who we is here.So there is tool chain, the company, and we're working on, um, uh, SAS and commercial, um, solutions around pants, which we can talk about in a bit.But there is a very robust open source community around pants that is not. tightly held by Toolchain, the company in a way that some other companies open source projects are.So we have a lot of contributors and maintainers on Pants V2 who are not working at Toolchain, but are using Pants in their own companies and their own organizations.And so we have a very wide range of use cases and opinions that are brought to bear. And this is very important because, as I mentioned earlier,we are not trying to design a system for one use case, for one company or a team's use case.We are trying, you know, we are working on a system we want.[42:05] Adoption for over and over and over again at a wide variety of companies. And so it's very important for us to have the contributions and the input from a wide variety of teams and companiesand people. And it's very fortunate that we now do. I mean, on that note, the thing that comes to my mind is another benefit of your scalable build system like Vance or Bazel or Buck is that youYou don't have to learn various different commands when you are spelunking through the code base, whether it's like a Go code base or like a Java code base or TypeScript code base.You just have to run pants build X, Y, Z, and it can construct the appropriate artifacts for you. At least that was my experience with Bazel.Is that something that you are interested in or is that something that pants V2 does kind of act as this meta layer for various other build systems or is it much more specific and knowledgeable about languages itself?[43:09] It's, I think your intuition is correct. The idea is we want you to be able to do something like pants test or pants test, you know, give it a path to a directory and it understands what that means.Oh, this directory contains Python code. Therefore, I should run PyTest in this way. And oh, Oh, it also contains some JavaScript code, so I should run the JavaScript test in this way.And it basically provides a conceptual layer above all the individual tools that gives you this uniformity across frameworks, across languages.One way to think about this is.[43:52] The tools are all very imperative. say you have to run them with a whole set of flags and inputs and you have to know how to use each one separately. So it's like having just the blades of a Swiss Army knife withno actual Swiss Army knife. A tool like Pants will say, okay, we will encapsulate all of that complexity into a much more simple command line interface. So you can do, like I said,test or pants lint or pants format and it understands, oh, you asked me to format your code. I see that you have the black and I sort configured as formatters. So I will run them. And I happen to know that formatting, because formatting can change the source files,I have to run them sequentially. But when you ask for lint, it's not changing the source files. So I know that I can run them multiple lint as concurrently, that sort of logic. And And different tools have different ways of being configured or of telling you what they want to do, but we...[44:58] Can't be to sort of encapsulate all that away from you and so you get this uniform simple command line interface that abstract away a lot of the specifics of these tools and let you run simple commands and the reason this is important is that. This extra layer of indirection is partly what allows pants to apply things like cashing.And invalidation and concurrency because what you're saying is.[45:25] Hey, the way to think about it is not, I am telling pants to run tests. It is I am telling pants that I want the results of tests, which is a subtle difference.But pants then has the ability to say, well, I don't actually need to run pi test on all these tests because I have results from some of them already cached. So I will return them from cache.So that layer of indirection not only simplifies the UI, but provides the point where you can apply things like caching and concurrency.Yeah, I think every programmer wants to work with declarative tools. I think SQL is one of those things where you don't have to know how the database works. If SQL were somewhat easier, that dream would be fulfilled. But I think we're all getting there.I guess the next question that I have is, what benefit do I get by using the tool chain, like SaaS product versus Pants V2?When I think about build systems, I think about local development, I think about CI.[46:29] Why would I want to use the SaaS product? That's a great question.So Pants does a huge amount of heavy lifting, but in the end it is restricted to the resources is on the machine on which it's running. So when I talk about cash, I'm talking about the local cash on that machine. When I talk about concurrency, I'm talking about using,the cores on your machine. So maybe your CI machine has four cores and your laptop has eight cores. So that's the amount of concurrency you get, which is not nothing at all, which is great.[47:04] Thanks for watching![47:04] You know as i mentioned i worked at google for many years and then other companies where distributed systems were saying like i come from a distributed systems background and it really. Here is a problem.All of a piece of work taking a long time because of. Single machine resource constraints the obvious answer here is distributed distributed the work user distributed system and so that's what tool chain offers essentially.[47:30] You configure Pants to point to the toolchain system, which is currently SAS.And we will have some news soon about some on-prem solutions.And now the cache that I mentioned is not just did this test run with these exact inputs before on my machine by me me while I was iterating, but has anyone in my organization or any CI run this test before,with these exact inputs?So imagine a very common situation where you come in in the morning and you pull all the changes that have happened since you last pulled.Those changes presumably passed CI, right? And the CI populated the cache.So now when I run tests, I can get cache hits from the CI machine.[48:29] Now pretty much, yeah. And then with concurrency, again, so let's say, you know, post cache, there are still 200 tests that need to be run.I could run them eight at a time on my machine or the CI machine could run them, you know, say, four at a time on four cores, or I could run 50 or 100 at a time on a cluster of machines.That's where, again, as your code base gets bigger and bigger, that's where some massive, massive speedups come in.The other aspects of the... I should mention that the remote execution that I just mentioned is something we're about to launch. It is not available today. The remote caching is.The other aspects are things like observability. So when you run builds on your laptop or CI, they're ephemeral.Like the output gets lost in the scroll back.And it's just a wall of text that gets lost with them.[49:39] Toolchain all of that information is captured and stored in structured form so you have. Hey the ability to see past bills and see build behavior over time and drill death search builds and drill down into individual builds and see well.How often does this test fail and you know when did this get slow all this kind of information and so you get.This more enterprise level.Observability into a very core piece of developer productivity, which is the iteration time.The time it takes to run tests and build deployables and parcel the quality control checks so that you can merge and deploy code directly relates to time to release.It directly relates to some of the core metrics of developer productivity. How long is it going to take to get this thing out the door?And so having the ability to both speed that up dramatically through distributing the work and having observability into what work is going on, that is what toolchain provides,on top of the already, if I may say, pretty robust open source offering.[51:01] So yeah, that's kind of it.[51:07] Pants on its own gives you a lot of advantages, but it runs standalone. Plugging it into a larger distributed system really unleashes the full power of Pants as a client to that system.[51:21] No, I think what I'm seeing is this interesting convergence. There's several companies trying to do this for Bazel, like BuildBuddy and Edgeflow. So, and it really sounds like the build system of the future, like 10 years from now.[51:36] No one will really be developing on their local machines anymore. Like there's GitHub code spaces on one side. It's like you're doing all your development remotely.[51:46] I've always found it somewhat odd that development that happens locally and whatever scripts you need to run to provision your CI machine to run the same set of testsare so different sometimes that you can never tell why something's passing locally and failing in in CI or vice versa. And there really should just be this one execution layer that can say, you know what, I'm going to build at a certain commit or run at a certain commit.And that's shared between the local user and the CI user. And your CI script is something as simple as pants build slash slash dot dot dot. And it builds the whole code base for,you. So yeah, I certainly feel like the industry is moving in that direction. I'm curious whether You think that's the same.Do you have an even stronger vision of how folks will be developing 10 years from now? What do you think it's going to look like?Oh, no, I think you're absolutely right. I think if anything, you're underselling it. I think this is how all development should be and will be in the future for multiple reasons.One is performance.[52:51] Two is the problem of different platforms. And so today, big thorny problem is I want to, you know, I want to,I'm developing on my Mac book, but the production, so I'm running, when I run tests locally and when I run anything locally, it's running on my Mac book, but that's not our deployable, right?Typically your deploy platform is some flavor of Linux. So...[53:17] With the distributed system approach you can run the work in. Containers that exactly match your production environments you don't even have to care about can this run.On will my test pass on mac os do i need ci the runs on mac os just to make sure the developers can. past test on Mac OS and that is somehow correlated with success on the production environment.You can cut away a whole suite of those problems, which today, frankly, I had mentioned earlier, you can get cache hits on your desktop from remote, from CI populating the cache.That is hampered by differences in platform.Is hampered by other differences in local setup that we are working to mitigate. But imagine a world in which build logic is not actually running on your MacBook, or if it is,it's running in a container that exactly matches the container that you're targeting.It cuts away a whole suite of problems around platform differences and allows you to focus because on just a platform you're actually going to deploy too.[54:35] And the...[54:42] And just the speed and the performance of being able to work and deploy and the visibility that it gives you into the productivity and the operational work of your development team,I really think this absolutely is the future.There is something very strange about how in the last 15 years or so, so many business functions have had the distributed systems treatment applied to them.Function is now that there are these massive valuable companies providing systems that support sales and systems that support marketing and systems that support HR and systems supportoperations and systems support product management and systems that support every business function,and there need to be more of these that support engineering as a business function.[55:48] And so i absolutely think the idea that i need a really powerful laptop so that my running tests can take thirty minutes instead of forty minutes when in reality it should take three minutes is. That's not the future right the future is to as it has been for so many other systems to the web the laptop is that i can take anywhere is.Particularly in these work from home times, is a work from anywhere times, is just a portal into the system that is doing the actual work.[56:27] Yeah. And there's all these improvements across the stack, right? When I see companies like Versel, they're like, what if you use Next.js, we provide the best developer platform forthat and we want to provide caching. Then there's like the lower level systems with build systems, of course, like bands and Bazel and all that. And at each layer, we're kindof trying to abstract the problem out. So to me, it still feels like there is a lot of innovation to be done. And I'm also going to be really curious to know, you know, there'sgoing to be like a few winners of this space, or if it's going to be pretty broken up. And like everyone's using different tools. It's going to be fascinating, like either way.Yeah, that's really hard to know. I think one thing you mentioned that I think is really important is you said your CI should be as simple as just pants build colon, colon, or whatever.That's our syntax would be sort of pants test lint or whatever.I think that's really important. So.[57:30] Today one of the big problems with see i. Which is still growing right now home market is still growing is more more teams realize the value and importance of automated.Very aggressive automated quality control. But configuring CI is really, really complicated. Every CI provider have their own configuration language,and you have to reason about caching, and you have to manually construct cache keys to the extent,that caching is even possible or useful.There's just a lot of figuring out how to configure and set up CI, And even then it's just doing the naive thing.[58:18] So there are a couple of interesting companies, Dagger and Earthly, or interesting technologies around simplifying that, but again, you still have to manually,so they are providing a, I would say, better config and more uniform config language that allows you to, for example, run build steps in containers.And that's not nothing at all.[58:43] Um, but you still manually creating a lot of, uh, configuration to run these very coarse grained large scale, long running build steps, you know, I thinkthe future is something like my entire CI config post cloning the repo is basically pants build colon, colon, because the system does the configuration for you.[59:09] It figures out what that means in a very fast, very fine grained way and does not require you to manually decide on workflows and steps and jobs and how they all fit together.And if I want to speed this thing up, then I have to manually partition the work somehow and write extra config to implement that partitioning.That is the future, I think, is rather than there's the CI layer, say, which would be the CI providers proprietary config or theodagger and then underneath that there is the buildtool, which would be Bazel or Pants V2 or whatever it is you're using, could still be we make for many companies today or Maven or Gradle or whatever, I really think the future is the integration of those two layers.In the same way that I referenced much, much earlier in our conversation, how one thing that stood out to me at Google was that they had the insight to integrate the version control layer and the build tool to provide really effective functionality there.I think the build tool being the thing that knows about your dependencies.[1:00:29] Can take over many of the jobs of the c i configuration layer in a really smart really fast. Where is the future where essentially more and more of how do i set up and configure and run c i is delegated to the thing that knows about your dependencies and knows about cashing and knows about concurrency and is able,to make smarter decisions than you can in a YAML config file.[1:01:02] Yeah, I'm excited for the time that me as a platform engineer has to spend less than 5% of my time thinking about CI and CD and I can focus on other things like improving our data models rather than mucking with the YAML and Terraform configs. Well, yeah.Yeah. Yeah. Today you have to, we're still a little bit in that state because we are engineers and because we, the tools that we use are themselves made out of software. There's,a strong impulse to tinker and there's a strong impulse sake. Well, I want to solve this problem myself or I want to hack on it or I should be able to hack on it. And that's, you should be able to hack on it for sure. But we do deserve more tooling that requires less hacking,and more things and paradigms that have tested and have survived a lot of tire kicking.[1:02:00] Will we always need to hack on them a little bit? Yes, absolutely, because of the nature of what we do. I think there's a lot of interesting things still to happen in this space.Yeah, I think we should end on that happy note as we go back to our day jobs mucking with YAML. Well, thanks so much for being a guest. I think this was a great conversation and I hope to have you again for the show sometime.Would love that. Thanks for having me. It was fascinating. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Software Engineering Unlocked
Collaborative debugging with Fiberplane

Software Engineering Unlocked

Play Episode Listen Later Nov 16, 2022 43:51


​This episode is sponsored by Fiberplane. Your platform for collaborative debugging notebooks!Episode Resources:Try Fiberplane hereFiberplane websiteFiberplane DocsNP-hard Ventures About Micha Hernandez van LeuffenMicha Hernandez van Leuffen is the founder and CEO of Fiberplane. He previously founded Wercker, a container-native CI/CD platform that was acquired by Oracle. Micha has dedicated his career to improving the workflows of developers. Read the whole episode (Transcript)[If you want, you can help make the transcript better, and improve the podcast's accessibility via Github. I'm happy to lend a hand to help you get started with pull requests, and open source work.][00:00:00] Michaela: Hello and welcome to the Software Engineering Unlocked Podcast. I'm host, Dr. McKayla, and today I have the pleasure to talk to Micha Hernandez van Leuffen. He is the founder and CEO of Fiberplane. He previously was the founder of Wercker, a container native CI/CD platform that was acquired by Oracle. Micha has dedicated his career to improving the workflow of developers, so he and I have a lot to talk about today.I'm really, really happy that he's here today and he's also sponsoring today's episode. Welcome to the show. I'm happy that you're here, Micha.[00:00:36] Micha: Thank you for having me. Excited to be on the show.[00:00:38] Michaela: Yeah, I'm really, really excited. So, Micha, I wanted to start really from the beginning. So you are the CEO of Fiberplane and you are the founder of Wercker, which you already sold.So, can you tell me a little bit about how you actually started to this entrepreneur journey of yours and what brought you to the developer experience area.[00:01:03] Micha: Yeah, sure thing. So I have a background in computer science and I did my so, I'm originally from Amsterdam, but I did my thesis at USF.And the topic was autonomous resource provision using software containers. This was all before Docker was a thing, you know, the container format that we now know and love. And I sort of got excited by that field of, of so containers and decided to start a company around it. That company was Worker, so container native CI/CD platform.So we helped developers build tests and deploy their applications to the cloud. We went, I would say, so we went through various iterations of the platform. You know, eventually, you know, we started off with Lxc as a container format and then eventually ended up, you know, having to, to platform on Docker.And Kubernetes. But, you know, it was quite a, quite a journey. So that company eventually got acquired by Oracle to bolster their cloud native strategy. And then, you know, spent a couple years in a Bay area as a VP of software development focusing on their cloud native efforts.Tried to do a little bit of open source there as well, and then, you know, move back to Europe. And so sort of started thinking about what's. Did some angel investing. We're still doing some angel investing as well actually in the sort of same arena. So developer tools, infrastructure building blocks for tomorrow.So I run a, a small precede seat fund with to other friends of mine. But then also started, you know, thinking about what to build next. And you know, we can get into that, but sort of from our experience at running work or this sort of large distributed. Sort of fiber plane was, was born.[00:02:26] Michaela: Cool. Yeah. And so how, how was the acquisition for you? I, from the time I'm, you said you were studying at the university, but then did you write out of university, you know, start worker or maybe already while[00:02:40] Micha: you were Yeah. More or less studying? Yeah. Yeah, more or less just out of university. So it was around 20, 20 12, 20 13.And then, you know, expanded the team. Of course we got an office in San Francisco and, and London. And then 2017 we got acquired by Whirlpool. Oh,[00:02:56] Michaela: very cool. Wow. Cool. So, and you were the, you were the founder of that and also probably cto, CEO. At, at the beginning you were one person shop, or was this, or have this idea and I get some funding and I already, you know, have a team when I'm starting out, or was it more bootstrapped way?How, how was that?[00:03:16] Micha: Yeah, yeah. We both gates, both fiber plane and, and, and worker. We got some funding early on. Then eventually got a CTO. For worker was one of the co-founders of, of OpenStack. So also, you know, very early in the, in that sort of, mm-hmm. container and, and cloud infrastructure journey.And then if for fiber plane, Yeah. There, there's no cto. I'm. I'm both CEO and cto, I guess[00:03:38] Michaela: at the same time. Yeah. Cool, cool. Can you tell me a little bit about fiber plane? What is fiber plane? You know, what does, what does it has to do with containers and with developer experience? What, what kind of of a product is it?[00:03:51] Micha: Yeah, sure thing. So, so guess coming back to the worker days, right? So we, we, you know, we're running this distributed system cic cd, so we were also running users arbitrary code. You know, any, any sort of job could happen on the platform on top of Kubernetes, inside of containers. So one of the things that, you know, stuck with me was it was very hard to always sort of debug the system, like figure out what's really going on when we had some kind of issue.You know, we've going back and forth between metrics, logs, traces, trying to figure out what is the root cause of an issue. So sort of that, that was sort of one thing. So we're thinking a lot about, you know, surely there must be a better way to, to, to help you on this, on this journey. . The other thing that I started thinking about a lot was sort of just challenge the assumption of the dashboard, mm-hmm.So if you think about it, like a lot of the monitoring observability tools are modeled after the dashboard, like sort of cockpit like view of your infrastructure. But I'd say that those are great for the known knowns. So dashboard is great. You set it up in advance, you know exactly what's gonna go wrong.These are the things to monitor. These are the things, you know, to keep tabs on. But then reality hits and you know, the thing that you're looking at, at the dashboard is not necessarily a thing that's. Going wrong. Right? So started thinking a lot about you know, what, what is a better form factor to support that sort of more investigative explorative debugging of your infrastructure.And not to say that dashboards don't have their place, right? It's like still that sort of cockpit view of your infrastructure. I think that's a, a good thing to have. But for debugging, you might wanna sort of more explorative a form factor that also gives you actionable intelligence. I think the other thing that you see a lot with dashboards, like everybody's monitoring everything and now you get a lot of signal and a lot of inputs, but not necessarily the actionable intelligence to figure out what's going on.So that's sort of the other piece where it, then the other, like, the third like I would say is collaboration sort of thing that stuck with. Was also like we've come to enjoy tools like Notion, you know, Google Docs obviously. You know, in the design space we got Figma where collaboration is built in from the get go and it is found that it was kind of odd how in the developer tools and then sort of specifically DevOps.We don't really have sort of these collab collaboration not really built in. Right. If you think about it you know, the status quo of, of you and I debugging an issue is we get on, you know, we get on a. You share your screen you open some dashboard and we started talking over it or something.Right. And so it's, and it's, you know, I guess sort of covid accelerated his thinking a bit, but you know, of everybody going remote you know, how can you make that experience more collaborative?[00:06:22] Michaela: Mm-hmm. . So it's in the incident space, it's in the monitoring space, and you want to bring more collaboration.So how does it work? Yeah,[00:06:32] Micha: yeah, yeah, exactly. So what's your solution now? Yeah. Now I've explained sort of the in inception. Yeah. But yeah, but what is it? What is it? Right. So it's, it's it's a notebook form factor. So very much inspired by data science, right? Like rc, like Jupiter. Yeah, we can Jupyter Notebooks.Yeah. Think of, think of that form factor. Mm-hmm. . We don't use Jupyter or anything like that. We've written everything from scratch. But it's a sort of, yeah, a notebook form factor and you know, built in with collaboration. So you can add, mention people like you would on Slack. You can leave, you know, comments or discussions and all and all that.But where it gets interesting, we've got these things called providers, which are effectively plugins. So they're web assembly bundles, which we can sort of dive into into that as well. But they're providers that connect to your infrastructure, right? So we have, for instance, a provider for Elastic Search for your logs.We have a provider for Prometheus for your. And it allows you to connect to these observability systems and kind of pull 'em together into one form, factor the notebook, and then, you know, start collaborating around that. Mm-hmm. . So, you know, imagine if Notion and Datadog would have a baby . Yeah.That's kind what you get. Yeah.[00:07:41] Michaela: That's cool. So I can imagine that. Let's. I'm on call and hopefully I'm not alone. A call. You are also on call, right? Yeah, and so we would open a fiber plane notebook.[00:07:52] Micha: Hopefully we're in the same time zone and we don't need to like wake up in the middle of that. Yeah.[00:07:57] Michaela: Hopefully. Yes. And then we want to understand. How the system is behaving. And so we are pulling in observers. These are data sources. Yeah. More or less. Right. And then we can do some transformation with those data. Data sources or[00:08:12] Micha: Yeah, yeah. That, yeah, exactly. That, that might be the case. The other thing that we integrate with is, for instance, PagerDuty.So an alert goes off indeed we are on call, but an alert goes off and we have this PagerDuty integration. And subsequently a notebook is created for us already. Mm-hmm. . Okay. Maybe, maybe even with, you know, some, some charts and logs that are already related to the service that might be down.Okay. So depend, So depending on the alert, obviously you're, as you know, you're as good as how you've instrumented your alerts. But say we've written some good alerts, we now have a notebook ready to go. Based off a template. So that's another thing that we, that we have as well, which is this template mechanism.And now, you know, we're ready to, to, to go in, get in into things and start debugging. So we might have a checklist, you know you look at the metrics, I'll look at the logs, sort of this action plan. We pull in that data we start a discussion around it. Mm-hmm. , hopefully we come, we come to the, to the, you know, the root cause of, of our issue.[00:09:11] Michaela: Okay. And so this discussion and this pulling in data, this happens all in the notebook. Can you explain me a little bit more, and also our listeners Exactly. When we are on this, you know, on this call now, having a fiber plane notebook in front of our, what do we see, right? How does that, how does the tool look?[00:09:28] Micha: It's, it's very similar to, I would say, like a Notion Page or a Google Doc page. Mm-hmm. . So we've got like different, different headings. The other thing that we have is, so you might have a title for a notebook, right? You know, the billing, the billing API is now. The other thing that we have is sort of this, this time range.So maybe usually when there's an issue, you know, we've seen this behavior over the last three hours, so we can sort of have that time range locked into place. So we only want to see our. For the last three hours. And that means that any chart that we plot or any log that we pull in will adhere to that global timeframe.So that's what we see. Mm-hmm. . We have support for labels, so, you know, obviously big fans of Kubernetes and, and Promeus. So we, you know, labels are. A first class primitive on the platform. So you're able to sort of populate the notebook with the labels that might maybe be related to our service.Right? So it's a US East one, which is our region. It might, you know, say service is the billing. It might be, you know, environment is production. And the status of our incident is, mm-hmm. ongoing, stuff like that. So we have, we've got, go ahead.[00:10:34] Michaela: Cool. Yeah. And, and so is it then from top to bottom we are writing and we are investigating and we are writing out down the questions that we have and the investigation.Yeah, exactly. We do.[00:10:44] Micha: Yeah. Yeah. And so, so[00:10:45] Michaela: we might have, Is it an Yeah. Is it Yes,[00:10:48] Micha: we our work? Yeah. Yeah. It's sort of Exactly. And I think in the most ideal use case, right. And I do it most ideal scenario, you're kind of like writing your postmortem as you go along. That's what[00:10:59] Michaela: I, I was thinking exactly that.Right. And then maybe next time I'm on call again and I get PagerDuty and something is down, it's again, billing. Can I search in the fiber plane notebooks to find, you know, what we did last time and then[00:11:13] Micha: Exactly. So you'll, you'll search, jump to the conclusion . Yeah. Yeah, exactly. Hopefully, hopefully if you, if you experience, you know, the same issue multiple times at some point, we'll, we'll, you know, do a little commit on GitHub and we, we fix our, fix our, Yeah.Do. But yeah, indeed, so you can search Yeah. Cool. On the notebooks and see if you've, you know, ran into similar issues. So that's, you know, it's great for building up this, this system of record, right. This knowledge base of mm-hmm. . Mm-hmm. . Of infrastructure issues and, and incidents. And it's also great for onboarding, right?If a new person joins, like, this is our process. These are some of examples that we've run into you know, have a look. And now you've got a sense for you know, how we, how we handle things and some of the issues that we've investigated. Yeah. Cool. One more thing on the, on the product. So the other, so, you know, sort of explained the, the notebook form factor.We've got these providers, right, that pull in data. From different, different data sources like Elastic Search or, or Prometheus. The other thing that we have is a command line interface which is called fp. Mm-hmm . And apart from, you know, being able to create notebooks from your terminal and you know, even invite people from the terminal, all this sort of usual stuff that you would, you know, expect from interacting with an API, with, with a product like this, there's two other things that we do.So one is a command called FP Run. And it allows you to, if you are typing a command like cube, ctl logs for a specific pod, you can pipe that the output of that command to a notebook. And why that is useful is of course, you know, when we're de debugging this issue, you and I, and you're start typing things in your terminal.I have no idea what you just did. And this is a way sort of to capture that. So you're piping these these, these outputs from your, the stuff that you're typing into your into your. into the notebook. And the cool thing is, you know, in, on your laptop you just, you know, sort of see text, right?Monospace output. But for certain outputs such as the cube CTL logs command, we actually know the structure of the data and we're actually capable of formatting that in the notebook in the structured manner that you can start filtering on, on the logs and you know, select certain columns and sort of highlight even certain loglines for prosperity that you say, Hey, these are the culprits, these are the things that you need to take into, into consideration next.So we have this sort of command line interface companion, and the other thing that it does, you're actually capable of running a long, like, sort of same use case as it just, I mentioned, but like a long running recording, like you actually record your entire shell session session as you're debugging this thing and all the output gets piped into the notebook.Cool. Cool. And[00:13:46] Michaela: so I have two questions for fiber plane. One. Is the software engineer the right person to interact with you know, fiber plane or is it the site reliability engineer that's really designed, you know, or the tool is designed[00:14:03] Micha: for? Yeah, it's, it's, that's an excellent question. So I think one, one site reliability engineers, you kind of see in more larger organizations, right, where you start splitting up your teams.I will say, I think at the end of the day, right, is if you're an engineer, you've built the service, now you need to maintain it now you need to operate it like it's, it's your baby, right? You need to, you probably know best how that system behaves than anybody else. So indeed I would say that, yeah, the target group is, you know, developers.Mm-hmm. .[00:14:40] Michaela: And so the other question that I had around fiber plane is also. When we are on this call and we are writing in this notebook, how does the whole scenario look like? Are we still on a call, like, do we have Zoom or, you know, Google meet open, or are we really in the, in the fiber plane document just writing, Or are we sitting next to each other?You know, what, what's the traditional, Is there a traditional scenario or is this all possible with fiber plane? How would you recommend using[00:15:09] Micha: it? Yeah, Yeah. Yeah. Not a, not a great question. Right. I think back in the day, it would be that, you know, we maybe sit in the same office and I scoot over and we start looking at a, at a screen, right?And start typing together. Mm-hmm. . The reality is, of course, we're all doing remote work now, and we might not be in the same room. So I do think people will still use a Zoom call or a Google meet you know, as a companion to talk over stuff. I think, you know, people will still communicate in Slack and sort of start chatting back and forth.But I think what we hope to achieve with fiber plane is like the pasting of screenshots, right? Well, if you take a screenshot of some kind of chart in your dashboard and you put it in Slack and you know, somebody yells, Oh, that's not the, that's not the thing that you should be looking at. You should, you know, like all that sort of slack glue That, you know, it's our, our goal to do away with that.[00:15:59] Michaela: Yeah. And, and the slack blue is also very problematic for the search. At least I'm never able to find it again. Right. It's like is in the dark, super in[00:16:07] Micha: the dark area. Yeah. Super ephemeral. Yeah. Yeah. You can't, can't go back in time easily. And, and you know, how did we solve this last time? So again, like building up that system of record, I think.[00:16:17] Michaela: Yeah. Very cool. And so how long are you now working on fiber plane already?[00:16:23] Micha: So we've been working on it for about two years now. Which is a, is a, is a long time. I think as a sort of, you know, one of the things that we've, I guess, sort of discovered along the way that we're kind of like building two startups at the same time, Right?We're doing a notion or like a, a rich text, collaborative rich text editing experience, which is kind of like a startup on its own. Mm-hmm. . And we're building sort of this infrastructure product. So it's, you know, it's taken quite some time and, and energy to, to get the product to where it is now.[00:16:54] Michaela: Yeah. And do you have already users? Is it like can people that listen today, can they hop on fiber plane already or.[00:17:02] Micha: It's, it's in it's been in private beta, mm-hmm. , but I think by the time this gets aired it's will be in public beta and people can sign up and take it for a spin. And, you know, we would love to get feedback on, on our roadmap, right?And Okay. People can suggest what other types of providers we need to support, what are types of integrations we, you know, would love to, to have that convers.[00:17:23] Michaela: Cool. Yeah. So is there, There is the provider side. Is there something else that you want feedback on that you are exploring[00:17:30] Micha: maybe. . Yeah. Yeah. So we've got the providers that's one thing.We've got sort of our templating stack. Mm-hmm. So curious to sort of see how people sort of start codifying their knowledge, right? What's, what, what kind of processes people have to debug their infrastructure and sort of run their incidents or write their postmortems. So curious to see what people come up with there.Other types of integrations. Right? So we have as I said, sort of PagerDuty what other type of, sort of alert, alert to notebook or other types of external systems that we need to plug in with. I would love to get some feedback on that as well. Yeah,[00:18:04] Michaela: I think I had page Bailey over on the podcast.She's from GitHub and she was she was also, they were releasing something with copilot and you know, For data scientists, some, some spaces here. And she also said like, well, we really need input from the users, right? So try it out, you know, tell us how it's working. I think it's so valuable, right, to see not only like you have your vision and obviously.It's going one way, but then if you have your users, sometimes they take your product and they use it in a very different, you know, way than you anticipate it, which can be very informative. Right. I dunno. You have done two startups already. Have you seen that? And how do you react to it? Do you instrument the data a little bit?How do you realize that people are using your product in a different. . Yeah.[00:18:50] Micha: So, so obviously we have metrics and analytics on sort of usage patterns of the, of the product. But I think, I think that data is excellent, right? But also qualitative data is, mm-hmm. , especially at this stage is probably even better, right?Where you can get somebody on a call and, you know, tell us about your use case. Tell, tell us about the problem that you're trying to solve here and how can we be, be helpful in like what types of integrations should we support? I think sort of the difference between. Worker, I would say, and, and fiber plane is that, you know, worker was a pretty confined piece of surface area, right?Cic, c d the whole goal is so you either have a, you know, a green check mark next to your build or a red check mark next to your build. Like it either, you know, failed or passed. And we need to sort of do that fast for you, get, get that result quick. Mm-hmm. . And with fiber plane, it's a more. I think that the interesting thing here is like, it's a, it's a more explorative and a sort of rich design space, right?It's this notebook, which already you, you know, you can start typing and text and images and headings and check checklists and whatnot, right? It's a very open form factor and design space. And then of course, with the integrations, it can even, you know, be richer. So I'm very curious into your point, right what direction people will pull the product into.Cause you can take it into all sorts. Use cases and scenarios. Yeah,[00:20:05] Michaela: exactly. And I think as a founder also, or as the design team, product team, it's it's also a little bit of a balancing act, right? So how far, you know, let me, are we going with what the user are doing with our product and where are we setting some boundaries that they can't do everything right?So there's also often the talk about opinionated products, right? That you can actually do one thing and on one thing only, and we have an opinion on, you know, how. Supposed to use our product. And you know, we try to, if we see people deviate from that, we try to put an end to it. And then there's the other way where you say, Well, you know, if you take fiber plane and you do X with it and we haven't thought about this maybe, you know, we are okay with it.Or maybe we even support that path, right.[00:20:48] Micha: Yeah, I think, I think we're more on indeed on the, on the ladder, right? I think what we've sort of, we talk about this a lot internally, sort of everything is a building block. You know, you've, we've got the notebook, you've got these different cell types, you've got providers, you've got templates.Mm-hmm. You've got the command line interface. So like for us, like everything is a building block and we, we actually want to retain that flexibility. Not be too prescriptive. Cause maybe you have a, a if you think about sort of the, the incident debugging or, or investigating your infrastructure, like you might have a certain process, I might have a completely different process and we need to be able to facilitate, you know, these different workflows.So, you know, thus far sort of our, our thinking around the product has been everything is a building block. And it should be this sort of flexible form factor that people can pull into into different scenarios and use. I mean,[00:21:36] Michaela: we have infrastructure as code, right? And we have like security as code.Maybe we have debugging as code. Maybe, you know, this is what's coming next. Can, can you envision that, that it's going in this direction? Because while we have building blocks, maybe right now it's not you know, programming language for debugging, but it could go a little bit into the distraction, right?No code coding for debugging.[00:22:02] Micha: Yeah, we've actually, we've, we've had some of, of that sort of discussion internally as well. If you think about the templates right. To, to some extent that is a, you know, we use J Sonet as a, as a sort of language, but we sort of codified them in a certain way and you can, you could argue that the templates is, you know, sort of a programming language for at least, you know, that debugging process, right?Yeah, exactly. Right. Yeah. And. And, and we, you can take that even further and make it kind of like statically typed and make it adhere to, you know, certain rules and maybe even have control flow. So I think that, that there's, there's a piece there. And then maybe, you know, obviously we have you know, some YAML configuration on how you set up your providers, right?Like how to connect to your infrastructure. So there's some, you know, observability as code in mm-hmm. in that realm. Yeah. Yeah. I think that'll be an interesting part of the journey, right? Like to figure out can we, and some even.[00:22:55] Michaela: Yeah, in some parts should be well, don't repeat yourself, right?Like, for example, pulling in these providers, configuring that, you know, I get the right data. This would actually be something that I'm, you know, pulling in again. And probably that's what your templates do, right? So you say billing, oh, and then check, check, check, check, check. I have, you know, all my signals here and they're configured in a way that it's useful.And then for this investigation, hopefully, One at a type thing, right? So I'm investigating, and as we, as we talked before once I realized what's going on, hopefully in my postmortem I'm going to, you know, make sure that this is not happening again. So this code probably is not going to be reused that often.Maybe some, you know, some ideas from it, but hopefully we won't reproduce the same sect completely exact thing again.[00:23:44] Micha: Yeah. Cool. Yeah, that's, that's a super great point. And I think coming back, sort of the early part of the conversation around dashboards, right? I think thus far what we've sort of experienced as, you know, engineers ourselves, like, I think, I think we probably had sort of a phase around information gathering.Like all these dashboards are great for information gathering, but now with Kubernetes and containers and microservices like the, the, the number. Services that we're running and the complexity has increased. So I think, I think there's sort of an opportunity for more exactly what you're describing. So it's more about action, right?Mm-hmm. , what? What are we doing? We want to have the information, we want actionable intelligence that informs us what to do.[00:24:20] Michaela: Yeah, yeah, exactly. Because now I'm looking at this dashboard and I'm seeing the signals. But then everything else is outside of, you know, this realm, right? So what actions do I take?Do I go, go to the console? Do I restart that service? You know, or, you know, whatever I'm doing. And, and it's also vanishing, right? So I'm doing it, but then. Who can see it, What I did. Right? Yeah, exactly. And so now we are capturing this, which is very nice, and then we can learn from it Right. Postmortems as well.Yeah. So I looked a little bit through your blog and and, and your Twitter, and you were also talking about blameless postmortems. So how do you think about psychological safety? How should people. In an organization look at on call and incident management to really make it sure that we are ending the blame game.Right. You probably have some thoughts about that as well, because you're working in this area.[00:25:19] Micha: Yeah. I, I think it's important to, and you like not have put any blame on any person. Right. It, it is a, and I guess sort of, you know, that's also why we're building this product. It is a collaborative process to debug an issue or resolve an incident.Like, and what you want to achieve is to put the entire team in the best possible position to solve the issue at hand and and, you know, a support structure around it. So, you know, coming back to the product, like being able to, to open discussions. Point people in the, in the, in the right direction.[00:25:52] Michaela: So maybe also if it's easier to find a problem to root cause it, and, you know, incidents become no issue or at least a lesser issue. So maybe the blame game is not that important. Can, can we say it that way?[00:26:08] Micha: I think so. Yeah. Yeah, yeah. If, if, you know, if the process becomes repeatable and we codify that and we collaborate on it and we build up that, again, that system of record and knowledge base I think that, you know, puts us in a safer position to, to solve the next one.That's[00:26:25] Michaela: true. Yeah. Another thing that I was thinking of when I looked through, you know, fiber plane and what it does is KS engineering and I thought like what KS engineering is where you try to prevent not only the knowns, but also the unknowns, right? So really think about, you know, what, what could go wrong and then, you know, make a fallback so that your system is reliable.Or, you know, if this database goes down that not the whole system goes down, but only a part of it and so on. Do you think that KS engineers can act. Source or, you know, use those notebooks that you're creating as input for knowing, you know, what we should actually look at and, Yeah.[00:27:02] Micha: Well, I think it, well, one thing I think it'd be a great provider yeah.integrating with, with, with, you know, one or many of the, the chaos engineering services out there. I think it's a great way to train your team, right? You, we plug in some K engineering provider. The, the provider communicates with your infrastructure and such, pulling out wires from from, you know, your, your system.And then now go ahead and start, you know, debugging this issue and mm-hmm. and you know, use different templates and you can, you know, sort of trial all sorts of different issues. I think it'd be super fun. Yeah.[00:27:37] Michaela: Yeah. So Micha, one thing that I also saw is that some of your of fiber plane is open source.So what's your vision for open sourcing that are, you know, are some parts being open source? Can people help with the building fiber plane?[00:27:51] Micha: Yeah, great question. So right now what we've open sourced is a project called fp bind Gen. So this is actually of SDK bindings, generat. For how you would create full stack web assembly plugins.So this is what we use to build our own elastic search and our Prometheus plugins. So we've, we've open sourced that. It's on GitHub we've already got some, quite some feedback on it. So, but would love some more. And then going forward we'll be open sourcing sort of our templating stack the proxy.Which sort of sets which you install inside your cluster and sort of sets up the secure connections between the providers and your infrastructure and then the fabric plane managed service. And then the command line interface that I mentioned will also be open source. So expect more to hear from us on the open source front.[00:28:36] Michaela: Yeah, Cool. I think that's so important, especially for developer tooling, that people can also really get it into their hands and then help, you know, shape the, or make the best product for their, for their environments that they have. I think this is such a success strategy.[00:28:50] Micha: Yeah, exactly. And you know, we, as I said, we would love to get feedback on the, on the providers and the, the plugin model, but maybe even, you know, once we open source the the, the provider stack would be great if people maybe come up with crazy ideas.Right? You can think of any type of provider that you could surface data inside of, inside of the notebook. Yeah. Doesn't need to be observability or like monitoring data. Like could be. Yeah.[00:29:14] Michaela: Cool. Yeah, I'm super excited. What, you know, what will come out of that. Yeah. So I want to come back a little bit to your founding story because I know a lot of people are interested in developer tools and, you know, and, and Startup founding as well.And you did it twice already, right? And maybe several more times in your life, I dunno. But right now we know of two instances. Yeah. There, there. So, and and also for fiber plane, you already got funding, right? Several million dollars. And so how do you do. How do you do it out of Europe is also some of my questions that I have because I think it's a little bit a different game here in Europe than it's in Silicon Valley.Yeah. It doesn't look like, you know, opportunities around the corner everywhere. I, I have been studying in the Netherlands, so I know that actually Netherlands is really a good place, I think for, for tech startups and, you know, also a little bit out of the universities I saw there like You know, you get a little bit of help and, and, and funding and things like this, but still, I would assume it's harder than in Silicon Valley.So how did you make it work? How did you get funding? You also said that worker had some funding at the beginning. Yeah.[00:30:26] Micha: Yeah. It's a good question. Well, how did we do the second time around, to be honest, Because it's the second time. Yeah. It was a bit easier. I mean, it's never, It's, Yeah. Yeah. It's obviously, you know, never as easy.But it was definitely easier. I do think in Europe, if I also compare it to the worker days to where we are now, Like I do think the funding climate and sort of the, the, the, the thinking around startups has improved a lot, right? There's there's more funding out there, there's more feess. I think more importantly though, what we've seen is that now.Sort of the European unicorns have exited or gone ipo. And we have actually more operators inside of Europe that have experience in either founding a startup are able to sort of start doing angel investing or have worked at multiple startups and we have just more operating experience you know, versus honestly like bankers, right?That That, you know, help you out or are, are investing in you? So actually the, the, the funds that funder does were Crane Venture Partners which is actually a seed fund out of London that's actually focused on developer tools and infrastructure. So I would highly recommend, you know, talking to them.If you're thinking about, you know, building a developer tool company and you need some funding, of course my own fund is also focused on developer tool. So shameless plug there on MP Hard Ventures. You can just Google that and find me. And then we have North Zone, which is a, you know, very like multi-stage fund.Also out of, well actually quite different geographies and Notion Capital out of out of London as well. Okay. We've got some have several micro VCs, several things. Yeah. We have somebody funded West Coast Alana Anderson was doing with base case capitals investing in a lot of infrastructure and enterprise startups and Max Cloud from System one in Berlin.Is another one. So yeah, we have a good crew of, you know, a diff different experience and sort of different stage type of funding as well.[00:32:19] Michaela: Yeah. This was my next question that I had for you. It's probably not only about the money, you said experience, right? It's also about the knowledge that people have, right.How to do things. Probably, yeah. The people that they know, right? So that they can Yeah. To be Yeah, exactly. Can consider the right people have the right network and so.[00:32:36] Micha: Yeah, I think, I think the most, yeah, it's is, is introductions, but it's also. You know, if you, if you think about the, the funds that actually do developer tools, right?So they, in their portfolio, they, they've seen, you know, startups trying over and over to tackle some kind of go to market issue or trying to build an open source, mm-hmm. company, right? So they have some, some pattern matching and some, some knowledge about, you know, what to do and what, what not to do.Of course, it's all advice, but it's good to sort of have some people in your corner that have at least seen this, these types of companies being built. Over and over again. Right. That's, and then, and then other VCs have more experience in, you know, more, more like how to build up or scale up a sales organization and thinking about how to run a SaaS company.So yeah. Different experience from different, different funds.[00:33:20] Michaela: And so now you listed quite a lot of different investors. Do you reach out to each one of them or do you have like a whole group meeting and they're all in there and you ask them for advice? , how does it[00:33:33] Micha: Yeah. No, it's, it's sort of one on one chats, right?Either over, over chat or, you know, we meet up for coffee or, or or breakfast, mm-hmm. . But yeah, we try to do that on a, on a regular cadence. And then of course, when, you know, something exciting happens, such as our launch know, we try to group them together and get them all on the same page around the same time.Or of course if an issue arises, Right, which could also be the case. Yeah. And then sort of all hands on deck and everybody in the same room or zoom.[00:34:01] Michaela: And what about your biggest struggle on your, on your entrepreneurial journey, maybe now with fiber plane or maybe with Worker? Did you ever think that, you know, worker, when you started it, did you think that somebody is going to buy this and.This is going to be huge.[00:34:16] Micha: Yeah. Yeah. I think, I think the ambition was always there. Mm-hmm. . And, but, and, and sort of that drive to just make better developer tools. I think that sort of, that, you know, that's been true for all the companies or all too. Yeah, that's,[00:34:30] Michaela: Yeah. And what[00:34:32] Micha: I struggle. Yeah. Yeah. So I think, I think as I think for fiber plane now, it's not necessarily a struggle, it's just the real, which this mission of this flexible form factor, just the fact that we're doing sort of two startups at the same time has been sort of mm-hmm. An interesting thing to to build now, right? You're doing this rich, collaborative, rich tech editor and trying to build this infrastructure oriented company, and I think that's been yeah, just an interesting experience with building out a team.You know, the technology and the product that we.[00:35:01] Michaela: Yeah. Yeah. So maybe can you tell me a little bit more about again, if people want to hop over to Fiber plane now and try it out how does that work? Do you have to, you know is there a sign up? Is there a waiting list? I mean, you said probably when this airs there is a public beat, but still do you have to, you know, what do you have to reach out to you, you give me a demo or I just fill in my credentials and I'm off togo.[00:35:25] Micha: you can just sign, sign up with Google and then you're off to the races. And then of course, if you want a demo and sort of get some more, more more help or onboarding we're happy to help you and get on a call and walk you through it. But yeah. Okay, cool. Try playing com. Is there[00:35:40] Michaela: also a, Yeah, is there a video or something that we can look[00:35:44] Micha: at?Yes. The, the website and there's a video.[00:35:50] Michaela: Okay. I will link that so that people can go Yeah. And it will explain everything to them. Right. What about pricing? Whatever pricing? Yeah. You have already some idea around pricing. Yeah.[00:36:01] Micha: We've got some ideas on how to charge, but I think right now for us, it's important to get the product market fit, mm-hmm.and as such, you know, get, get the feedback. From these companies and these teams using the product. So we'll introduce pricing at a later stage. So for now it's, it's free to use, mm-hmm. . And you just give us your time and your feedback, and then Yeah, we're grateful.[00:36:20] Michaela: Yeah. And what about my data?Is it safe with you? Like, do you have some visibility into my data or do I send it over to[00:36:29] Micha: you? Yeah, so we actually so the way the, the providers work the plugins, so they actually get activated through a proxy. So we install a proxy inside of your cluster. The proxy sets up a secure bidirectional tunnel from your infrastructure to the fiber plane managed service.And then we do, for that specific query, we do store the data that's related to that query. So of a result, we do store that in the notebook. And yeah, we probably will come up with sort of more enterprisey ideas around how to self host[00:36:59] Michaela: it, Right? Or something[00:37:01] Micha: as an example. Yeah, yeah, yeah. But again, we'd love to get some feedback on that.[00:37:07] Michaela: How that works. Right? Yeah. Okay, cool. So yeah, that sounds really good. I think you, at least my questions, , you could answer them all, but maybe my listeners have questions and then they can send them to you. I think you will be, Yeah. Quite happy, right?[00:37:22] Micha: A hundred percent. At mes on Twitter, m i e s and at fiber, playing on Twitter, fiber playing.com.Sign up, take it for spin, shoot us a message. Yeah, sounds.[00:37:33] Michaela: Yeah. Yeah, it sounds super interesting. I hope that a lot of my listeners will do that, and I will link everything in my show notes that we, you know, talked about your, your Twitter handle and everything so that people can reach you. And I hope you get a lot of questions and people give it a spin and give it a try and send you their use cases,And yeah. I hope you all the best with your product. Thank you so much for being on my show today Micha. And yeah. Thank you. Bye.[00:37:59] Micha: Thank. Thank you for having me.[00:38:01] Michaela: Yeah, it was really great. Bye bye[00:38:04] Micha: bye.[00:38:06] Michaela: This was another episode of the Software Engineering Unlocked Podcast. If you enjoyed the episode, please help me spread the word about the podcast.Send episode to a friend via email, Twitter, LinkedIn. Well, whatever messaging system you use, or give it a positive review on your favorite podcasting platform such as Spotify or iTune. This would mean really a lot to me. So thank you for listening. Don't forget to subscribe and I will talk to you in two weeks.

The Cloudcast
Cloud Cost Management

The Cloudcast

Play Episode Listen Later Nov 16, 2022 23:33


Kayla Taylor (Sr. Product Manager @datadoghq) talks about taking control of cloud costs and managing costs along with observability and troubleshooting.SHOW: 669CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotwCHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"SHOW SPONSORS:Datadog Monitoring: Modern Monitoring and AnalyticsStart monitoring your infrastructure, applications, logs and security in one place with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.CDN77 - CDN Focused on VOD and SecurityCDN77 - ask for a free trial with no duration or traffic limits. SHOW NOTES:Gain visibility and control of your cloud spend with Datadog Cloud Cost ManagementDatadog introduces Cloud Cost ManagementTopic 1 - Welcome to the show. Let's talk about your background, and what you focus on at Datadog.Topic 2 - Cloud cost management has become a very active focus area since the economy has been slowing down. How are you seeing customers, or engineering teams thinking about cost differently than before? Topic 3 - Since Datadog provides so much visibility into active applications (APM, Observability, Troubleshooting), do you take a similar approach to Cost Management? Topic 4 - Accountants understand costs, but not technology. Engineers understand technology, but do you find that they understand costs? How do you help them conceptualize costs and be able to make changes that impact their applications? Topic 5 - What are some of the unique things that Datadog does with the new Cloud Cost Management offering? How does it tie into the other aspects of Datadog?Topic 6 - The service is new, but what are some of the surprising things you've observed about how companies are using this insight into their cloud costs?FEEDBACK?Email: show at the cloudcast dot netTwitter: @thecloudcastnet

B2B Tech Talk with Ingram Micro
The New IT of Observability

B2B Tech Talk with Ingram Micro

Play Episode Listen Later Nov 15, 2022 24:42 Transcription Available


Applications evolve rapidly and are creating unexplored levels of complexity and an extremely dynamic environment.  The founders of Instana realized legacy APM tools were going to struggle to handle emerging cloud-based containerized microservice applications.   The new observability approach goes far beyond old capabilities, maximizes efficiency and reduces costs.  Shelby Skrhak dives into the all-encompassing ideal of observability with Chris Farrell, vice president of Value Services at IBM, and Manoj Khabe, VP, Enterprise Management Solutions and Strategy at Converge Technology Solutions Corp:   The need for a new way of monitoring ever more complex applications How organizations can use observability to increase efficiency and cut costs The challenges of ensuring optimal performance and avoiding catastrophic events  To join the discussion, follow us on Twitter @IngramTechSol #B2BTechTalk  Listen to this episode and more like it by subscribing to B2B Tech Talk on Spotify,Apple Podcasts or Stitcher. Or, tune in on our website.

The New Stack Podcast
OpenTelemetry Properly Explained and Demoed

The New Stack Podcast

Play Episode Listen Later Nov 15, 2022 18:16


OpenTelemetry project offers vendor-neutral integration points that help organizations obtain the raw materials — the "telemetry" — that fuel modern observability tools, and with minimal effort at integration time. But what does OpenTelemetry mean for those who use their favorite observability tools but don't exactly understand how it can help them? How might OpenTelemetry be relevant to the folks who are new to Kuberentes (the majority of KubeCon attendees during the past years) and those who are just getting started with observability?  Austin Parker, head of developer relations, Lightstep and Morgan McLean, director of product management, Splunk, discuss during this podcast at KubeCon + CloudNativeCon 2022 how the OpenTelemetry project has created demo services to help cloud native community members better understand cloud native development practices and test out OpenTelemetry, as well as Kubernetes, observability software, etc.  At this conjecture in DevOps history, there has been considerable hype around observability for developers and operations teams, and more recently, much attention has been given to helping combine the different observability solutions out there in use through a single interface, and to that end, OpenTelemetry has emerged as a key standard.  DevOps teams today need OpenTelemetry since they typically work with a lot of different data sources for observability processes, Parker said. “If you want observability, you need to transform and send that data out to any number of open source or commercial solutions and you need a lingua franca to to be consistent. Every time I have a host, or an IP address, or any kind of metadata, consistency is key and that's what OpenTelemetry provides.” Additionally, as a developer or an operator, OpenTelemetry serves to instrument your system for observability, McLean said. “OpenTelemetry does that through the power of the community working together to define those standards and to provide the components needed to extract that data among hundreds of thousands of different combinations of software and hardware and infrastructure that people are using,” McLean said. Observability and OpenTelemetry, while conceptually straightforward, do require a learning curve to use. To that end, the OpenTelemetry project has released a demo to help. It is intended to both better understand cloud native development practices and to test out OpenTelemetry, as well as Kubernetes, observability software, etc.,the project's creators say. OpenTelemetry Demo v1.0 general release is available on GitHub and on the OpenTelemetry site. The demo helps with learning how to add instrumentation to an application to gather metrics, logs and traces for observability. There is heavy instruction for open source projects like Prometheus for Kubernetes and Jaeger for distributed tracing. How to acquaint yourself with tools such as Grafana to create dashboards are shown. The demo also extends to scenarios in which failures are created and OpenTelemetry data is used for troubleshooting and remediation. The demo was designed for the beginner or the intermediate level user, and can be set up to run on Docker or Kubernetes in about five minutes.  “The demo is a great way for people to get started,” Parker said. “We've also seen a lot of great uptake from our commercial partners as well who have said ‘we'll use this to demo our platform.'”

Cloud Security Podcast by Google
EP96 Cloud Security Observability for Detection and Response

Cloud Security Podcast by Google

Play Episode Listen Later Nov 14, 2022 32:32


Guest: Jeff Bollinger,  Director of Incident Response and Detection Engineering @ Linkedin  Topics: Observability sounds cool (please define it for us BTW), but relating it to security has been “hand-wavy” at best. What is your opinion on the relevance of observability data for security use cases? What use cases are those, apart from saving the data for IR just in case? How can we best approach observability in the cloud, particularly around network communications, so that we improve security as a result? Are there other areas of cloud where observability might be more relevant? Does the massive shift to TLS 1.3 impact this? If the Internet is shifting towards an end-user/device centric model with everything as a service (SaaS), how does security monitoring even work anymore?  Does it mean the end of both endpoint and network eras and the arrival of the application security monitoring era? Can we do deep monitoring of complex applications and app clusters for abuse or should we just focus on identity and profiling? Resources: “Instrumenting Modern Application Stack for Detection and Response” (ep34) “Crafting the InfoSec Playbook: Security Monitoring and Incident Response Master Plan” by Jeff Bollinger, Brandon Enright, Matthew Valites (book) RFC 7258  Pervasive Monitoring Is an Attack RFC 8890 Internet is for end users “(Re)building Threat Detection and Incident Response at LinkedIn” “Martian Chronicles“ by Ray Bradberry (because migrating to cloud is like flying to Mars)

Infinite Machine Learning
Data observability, creating open source software, being a founder, being an author | Andy Petrella, cofounder of Kensu

Infinite Machine Learning

Play Episode Listen Later Nov 14, 2022 38:20


Andy Petrella is the cofounder of Kensu, a data observability solution that offers companies the opportunity to monitor their data usage. He is the creator of Spark Notebook, an open-source software for data engineers and scientists using Spark and Scala. He is the author of the book Fundamentals of Data Observability. He has built a fantastic career in data over the last 16 years. In this episode, we cover a range of topics including: - data observability - being a founder - being an author - creating open source software - Kensu's unique approach "Data Observability Driven Development" -------- Where to find Prateek Joshi: Newsletter: https://prateekjoshi.substack.com Website: http://prateekj.com LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 

SolarWinds TechPod
It's All Tech to Me: Adjusting to an IT Career

SolarWinds TechPod

Play Episode Listen Later Nov 8, 2022 51:30


In today's ever-expanding job market, many people find themselves on a career path they never originally planned for - IT. So how do those "non-technical" people become "technical"? Join TechPod hosts Sean and Chris as they dissect their own experiences working in tech, learning how to grapple with unknowns, Observability, and the importance of diversity and unique thought in the tech world.  © 2022 SolarWinds Worldwide, LLC. All rights reserved

Tiny DevOps
Paul Cothenet — Observations on observability

Tiny DevOps

Play Episode Listen Later Nov 8, 2022 48:40


Paul Cothenet of Patch.io joins me this time to discuss war stories implementing observabillity at two small startups.In this episode…- How to choose an obervabillity tool/platform- Why AWS doesn't provide the best observability platform- Teaching the team to use observability- How to convince stakeholders that observability is valuable- What would you miss the most if your observability platform was no longer available?- The business value of a good observability solution- Making observability metrics easy for management to use- What does it all cost?- Advice for getting startedResourcesRands Leadership Slack: https://randsinrepose.com/welcome-to-rands-leadership-slack/GuestPaul CothenetTwitter: @paulcothenetCompany, and jobs: patch.ioWatch this episode on YouTube.

Infinite Machine Learning
Explainable AI, model monitoring, observability | Krishna Gade, CEO of Fiddler AI

Infinite Machine Learning

Play Episode Listen Later Nov 7, 2022 47:15


Krishna Gade is the cofounder and CEO of Fiddler AI. They provide an explainable monitoring solution for production AI systems. Fiddler has been featured on Forbes AI 50 startups in 2020 and was awarded Technology Pioneer by World Economic Forum. He has previously held engineering leadership roles at Microsoft, Twitter, Pinterest, and Meta.  In this episode, we cover a range of topics including: - model monitoring - explainability - what they are building at Fiddler - how they got their first 5 customers - key takeaways on company building - how to maintain product discipline in enterprise SaaS -------- Where to find Prateek Joshi: Newsletter: https://prateekjoshi.substack.com Website: http://prateekj.com LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 

Modern Digital Business
Cloud-Native Observability with Bruno Kurtic, Sumo Logic

Modern Digital Business

Play Episode Listen Later Nov 7, 2022 35:10 Transcription Available


Operating a modern digital business means building and operating large, highly-scaled applications that are more and more cloud-native in their architecture and implementation. Observability is critical in maintaining the highly scaled, highly available, highly adaptive nature of these modern cloud-native applications. You just can't keep a large, complex, modern application operating without having a solid, modern observability platform as part of your system. And ideally, in today's cloud-native market, you want an observability platform that is based on cloud-native technologies.Sumo Logic is a leader in cloud-native observability. They not only focus on providing analytics for cloud-native applications, but they themselves also operate on a cloud-native platform. And while Sumo Logic provides tools for improving application reliability and availability, what really sets them apart is their focus on security and compliance in a cloud-native environment.Bruno Kurtic, founding Chief Strategy Officer for Sumo Logic, today on Modern Digital Business.Useful LinksBruno Kurtic | LinkedInCloud Log Management, Monitoring, SIEM Tools | Sumo LogicArchitecting for Scale, 2nd Edition, O'Reilly Media About LeeLee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O'Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.Take a look at Lee's many books, courses, and articles by going to leeatchison.com. Looking to modernize your application organization?Check out Architecting for Scale. Currently in it's second edition, this book, written by Lee Atchison, and published by O'Reilly Media, will help you build high scale, highly available web applications, or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com or other retailers. Don't Miss Out!Subscribe here to catch each new episode as it becomes available.Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books and courses from Lee. Don't worry, we won't send you spam and you can unsubscribe at any time. Mentioned in this episode:LinkedIn Learning CoursesAre you looking to become an architect? Or perhaps are you looking to learn how to drive your organization towards better utilization of the cloud? Are you you looking for ways to help you utilize a Cloud Center of Excellence in your...

Data Transforming Business
Sifflet: Observability and the Future of Data Engineering

Data Transforming Business

Play Episode Listen Later Nov 3, 2022 22:32


Data engineering is how companies design and build systems for collecting and using their data at scale. A critical practice as companies are pressured to make more and more data-driven decisions, data engineering is essential for enterprises to get ahead.  In this episode of the EM360 Podcast, Editor https://em360tech.com/user/3673 (Matt Harris) speaks to https://www.linkedin.com/in/salma-bakouk-2999711b6?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAADJVC1UBRDqdneOVftZFsyUXKwa2JZVptAE&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3Bzi1fFov0SOWW2R4dOI1ocw%3D%3D (Salma Bakouk), Co-Founder and CEO at https://em360tech.com/solution-providers/sifflet (Sifflet), to discuss; Trends shaping the data engineering industry Caring about data reliability and observability What the future brings

EM360 Podcast
Sifflet: Observability and the Future of Data Engineering

EM360 Podcast

Play Episode Listen Later Nov 3, 2022 22:32


Data engineering is how companies design and build systems for collecting and using their data at scale.A critical practice as companies are pressured to make more and more data-driven decisions, data engineering is essential for enterprises to get ahead. In this episode of the EM360 Podcast, Editor Matt Harris speaks to Salma Bakouk, Co-Founder and CEO at Sifflet, to discuss;Trends shaping the data engineering industryCaring about data reliability and observabilityWhat the future brings

Feds At The Edge by FedInsider
Ep. 77 Hybrid Cloud Observability

Feds At The Edge by FedInsider

Play Episode Listen Later Nov 1, 2022 38:21


When you read the current literature on cloud systems management, one key factor is what is called “observability.”  With so many moving parts, one tends to focus on a specific group of indicators and miss the overall activity. When Vivek Kundra started to talk about “Cloud First” back in 2009. He had no idea the size and complexity of clouds that would evolve in his desire to reduce cost and increase flexibility for federal projects. SolarWinds has been a leader in system observability for decades. Today, we have a person who has successfully used Solar Winds on a variety of systems and relates some of the best practices for gaining this elusive observability. Scott Pross has seen systems managers look at specific aspects of federal systems in silos. For example, a manager may have data coming in on servers, or even the network itself. Another set of metrics may give information on applications. Managing virtual environments has evolved into a category in and of itself. When a problem arises, a troubleshooter may microfocus on an area, but not realize how it impacts the entire ecosystem. Scott suggests that using solutions from SolarWinds can give a systems analyst a view from 40,000 feet instead of ten feet off the ground. Another consideration is how to monitor activity in a cloud when its architecture is changing. The change may be for compliance reasons, expanding applications, or something as basic as running out of room and having to adapt to an influx of data. Leaders of federal agencies are not interested in extremely detailed observations about logs.  They want to know when their application will be available for citizens or employees. Addressing this issue, Scott details how, using SolarWinds, he can assemble dashboards to give leaders a better understanding of what has gone wrong to enable them to make data-based decisions.  

.NET Rocks!
Observability in Production with Alayshia Knighten

.NET Rocks!

Play Episode Listen Later Oct 27, 2022 48:39


What can observability do for you? While at NDC in Oslo, Carl and Richard chatted with Alayshia Knighten about her work with honeycomb and helping people understand what's happening with their applications in production. Alayshia talks about instrumenting applications to provide insight into behavior in real-time - by leveraging existing tools to provide data and reporting. The conversation digs into how sysadmins and developers see applications differently, and how standard telemetry systems make it easier for everyone to be on the same page!

And There You Have IT!
Achieve Better Product Delivery with Better Observability

And There You Have IT!

Play Episode Listen Later Oct 26, 2022 31:50


Listen to our podcast to learn how observability provides organizations with continuous and complete telemetry, enabling faster, automated problem identification and resolution. 

This Week in Enterprise Tech (Video HD)
TWiET 516: Single Pain In The Glass - Patching-as-a-Service, critical VMware flaw, OpenTelemetry observability

This Week in Enterprise Tech (Video HD)

Play Episode Listen Later Oct 21, 2022 68:40


Critical VMware flaw, Patching-as-a-Service, OpenTelemetry observability, and more. Starlink signals can be reverse-engineered to work like GPS—whether SpaceX likes it or not A list of common passwords accounts for nearly all cyberattacks The FTC is looking at fixing appliance repair, but it needs to go beyond manuals Hackers exploit critical VMware flaw to drop ransomware, miners Patching-as-a-Service offers benefits and challenges OpenTelemetry Co-Founder Morgan McLean talks about monitoring and observability with their collection of free and open-source tools, APIs, and SDKs. Hosts: Louis Maresca, Brian Chee, and Curt Franklin Guest: Morgan McLean Download or subscribe to this show at https://twit.tv/shows/this-week-in-enterprise-tech. Get episodes ad-free with Club TWiT at https://twit.tv/clubtwit Sponsors: canary.tools/twit - use code: TWIT nordlayer.com/twit

This Week in Enterprise Tech (MP3)
TWiET 516: Single Pain In The Glass - Patching-as-a-Service, critical VMware flaw, OpenTelemetry observability

This Week in Enterprise Tech (MP3)

Play Episode Listen Later Oct 21, 2022 68:21


Critical VMware flaw, Patching-as-a-Service, OpenTelemetry observability, and more. Starlink signals can be reverse-engineered to work like GPS—whether SpaceX likes it or not A list of common passwords accounts for nearly all cyberattacks The FTC is looking at fixing appliance repair, but it needs to go beyond manuals Hackers exploit critical VMware flaw to drop ransomware, miners Patching-as-a-Service offers benefits and challenges OpenTelemetry Co-Founder Morgan McLean talks about monitoring and observability with their collection of free and open-source tools, APIs, and SDKs. Hosts: Louis Maresca, Brian Chee, and Curt Franklin Guest: Morgan McLean Download or subscribe to this show at https://twit.tv/shows/this-week-in-enterprise-tech. Get episodes ad-free with Club TWiT at https://twit.tv/clubtwit Sponsors: canary.tools/twit - use code: TWIT nordlayer.com/twit

The Soda Podcast
S1 Ep 07 In Conversation With Ramón Medrano: The Meet and Greet

The Soda Podcast

Play Episode Listen Later Oct 21, 2022 6:57


Meet Ramón Medrano, Senior Staff Site Reliability Engineer at Google. Inspired by the book, Site Reliability Engineering: How Google Runs Production Systems, our host Maarten will explore and discuss how the practice of using software to automate IT infrastructure tasks such as system management and application monitoring - can be adopted and applied to the practice of data management. Hear Maarten and Ramón get the conversation started with introductions. More about our host, Maarten Masschelein Read the transcript of this episode Learn more about the chosen charity, Open Arms Connect with us on social media: Twitter, LinkedIn, Facebook From Soda, the provider of data reliability tools and observability platform to enable data teams to find, analyze, and resolve data issues.

Podcast – Software Engineering Daily
Modern Application Observability with Berkay Mollamustafaoglu

Podcast – Software Engineering Daily

Play Episode Listen Later Oct 20, 2022 50:03


Observability is a critical aspect of modern digital applications. You can’t operate an application at scale that satisfies your customer needs without understanding how the application is currently performing, whether it’s understanding the current operating needs of the application, adjusting resource usage, detecting issues before they become serious or solving an ongoing technical issue as The post Modern Application Observability with Berkay Mollamustafaoglu appeared first on Software Engineering Daily.

Infinite Machine Learning
AI observability, being a founder, measuring the ROI of a product | Alessya Visnjic, CEO of WhyLabs

Infinite Machine Learning

Play Episode Listen Later Oct 20, 2022 51:31


Alessya Visnjic is the cofounder and CEO of WhyLabs, which is an AI and data observability platform. She was previously a CTO-in-residence at the Allen Institute. Prior to that, she spent 9 years at Amazon building various machine learning products. In this episode, we cover a range of topics including: - AI observability - Being a founder - Getting the first 5 customers - How does an AI model fail - How to validate the problem as a founder - How to measure the ROI of your product --------Where to find Prateek Joshi: Newsletter: https://prateekjoshi.substack.com Website: http://prateekj.com LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 

Software Engineering Radio - The Podcast for Professional Software Developers

Andy Dang, Head of Engineering at WhyLabs discusses observability and data ops for AI/ML applications and how that differs from traditional observability. SE Radio host Akshay Manchale speaks with Andy about running an AI/ML model in production and how...

On Cloud
Observability isn't a bolt-on; it's an integral part of the cloud ecosystem

On Cloud

Play Episode Listen Later Oct 19, 2022 26:44


Observability is a huge help to organizations in solving systems performance issues, but some still see it as a bolt-on. In this episode David Linthicum talks with Dynatrace's Michael Allen and Deloitte's Jay McDonald about why that isn't so. The trio also discusses why companies should view Observability as a holistic process that can be built as code into every facet of their cloud ecosystem—from infrastructure, to apps, to security, to governance, to compliance, and more.

Day 2 Cloud
Day Two Cloud 168: Get Kubernetes Observability With AppDynamics Cloud (Sponsored)

Day 2 Cloud

Play Episode Listen Later Oct 19, 2022 45:13


Today's Day Two Cloud podcast, sponsored by AppDynamics, explores how AppDynamics Cloud brings observability to your Kubernetes deployments by ingesting and visualizing all metrics, events, log and trace data from across your cloud and on-prem landscapes. The post Day Two Cloud 168: Get Kubernetes Observability With AppDynamics Cloud (Sponsored) appeared first on Packet Pushers.

Packet Pushers - Fat Pipe
Day Two Cloud 168: Get Kubernetes Observability With AppDynamics Cloud (Sponsored)

Packet Pushers - Fat Pipe

Play Episode Listen Later Oct 19, 2022 45:13


Today's Day Two Cloud podcast, sponsored by AppDynamics, explores how AppDynamics Cloud brings observability to your Kubernetes deployments by ingesting and visualizing all metrics, events, log and trace data from across your cloud and on-prem landscapes. The post Day Two Cloud 168: Get Kubernetes Observability With AppDynamics Cloud (Sponsored) appeared first on Packet Pushers.

Packet Pushers - Full Podcast Feed
Day Two Cloud 168: Get Kubernetes Observability With AppDynamics Cloud (Sponsored)

Packet Pushers - Full Podcast Feed

Play Episode Listen Later Oct 19, 2022 45:13


Today's Day Two Cloud podcast, sponsored by AppDynamics, explores how AppDynamics Cloud brings observability to your Kubernetes deployments by ingesting and visualizing all metrics, events, log and trace data from across your cloud and on-prem landscapes. The post Day Two Cloud 168: Get Kubernetes Observability With AppDynamics Cloud (Sponsored) appeared first on Packet Pushers.

Screaming in the Cloud
The Evolution of Cloud Services with Richard Hartmann

Screaming in the Cloud

Play Episode Listen Later Oct 18, 2022 45:26


About RichardRichard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendess, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.Links Referenced: Grafana Labs: https://grafana.com/ Twitter: https://twitter.com/TwitchiH Richard Hartmann list of talks: https://github.com/richih/talks TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.Corey: This episode is brought to us in part by our friends at Datadog. Datadog's SaaS monitoring and security platform that enables full stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500 plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third party services in a single pane of glass.Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14 day trial and get a complimentary T-shirt when you install the agent.To learn more, visit datadoghq/screaminginthecloud to get. That's www.datadoghq/screaminginthecloudCorey: Welcome to Screaming in the Cloud, I'm Corey Quinn. There are an awful lot of people who are incredibly good at understanding the ins and outs and the intricacies of the observability world. But they didn't have time to come on the show today. Instead, I am talking to my dear friend of two decades now, Richard Hartmann, better known on the internet as RichiH, who is the Director of Community at Grafana Labs, here to suffer—in a somewhat atypical departure for the theme of this show—personal attacks for once. Richie, thank you for joining me.Richard: And thank you for agreeing on personal attacks.Corey: Exactly. It was one of your riders. Like, there have to be the personal attacks back and forth or you refuse to appear on the show. You've been on before. In fact, the last time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to?You're still at Grafana Labs. And in many cases, I would point out that, wow, you've been there for many years; that seems to be an atypical thing, which is an American tech industry perspective because every time you and I talk about this, you look at folks who—wow, you were only at that company for five years. What's wrong with you—you tend to take the longer view and I tend to have the fast twitch, time to go ahead and leave jobs because it's been more than 20 minutes approach. I see that you're continuing to live what you preach, though. How's it been?Richard: Yeah, so there's a little bit of Covid brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center. But the last two-and-a-half years didn't really happen for many people, myself included. So, I guess [laugh] that includes you.Corey: No, no you're right. You've only been at Grafana Labs a couple of years. One would think I would check the notes for shooting my mouth off. But then, one wouldn't know me.Richard: What notes? Anyway, I've been around Prometheus and Grafana Since 2015. But it's like, real, full-time everything is 2020. There was something in between. Since 2018, I contracted to do vulnerability handling and everything for Grafana Labs because they had something and they didn't know how to deal with it.But no, full time is 2020. But as to the space in the [unintelligible 00:02:45] of itself, it's maybe a little bit German of me, but trying to understand the real world and trying to get an overview of systems and how they actually work, and if they are working correctly and as intended, and if not, how they're not working as intended, and how to fix this is something which has always been super important to me, in part because I just want to understand the world. And this is a really, really good way to automate understanding of the world. So, it's basically a work-saving mechanism. And that's why I've been sticking to it for so long, I guess.Corey: Back in the early days of monitoring systems—so we called it monitoring back then because, you know, are using simple words that lack nuance was sort of de rigueur back then—we wound up effectively having tools. Nagios is the one that springs to mind, and it was terrible in all the ways you would expect a tool written in janky Perl in the early-2000s to be. But it told you what was going on. It tried to do a thing, generally reach a server or query it about things, and when things fell out of certain specs, it screamed its head off, which meant that when you had things like the core switch melting down—thinking of one very particular incident—you didn't get a Nagios alert; you got 4000 Nagios alerts. But start to finish, you could wrap your head rather fully around what Nagios did and why it did the sometimes strange things that it did.These days, when you take a look at Prometheus, which we hear a lot about, particularly in the Kubernetes space and Grafana, which is often mentioned in the same breath, it's never been quite clear to me exactly where those start and stop. It always feels like it's a component in a larger system to tell you what's going on rather than a one-stop shop that's going to, you know, shriek its head off when something breaks in the middle of the night. Is that the right way to think about it? The wrong way to think about it?Richard: It's a way to think about it. So personally, I use the terms monitoring and observability pretty much interchangeably. Observability is a relatively well-defined term, even though most people won't agree. But if you look back into the '70s into control theory where the term is coming from, it is the measure of how much you're able to determine the internal state of a system by looking at its inputs and its outputs. Depending on the definition, some people don't include the inputs, but that is the OG definition as far as I'm aware.And from this, there flow a lot of things. This question of—or this interpretation of the difference between telling that, yes, something's broken versus why something's broken. Or if you can't ask new questions on the fly, it's not observability. Like all of those things are fundamentally mapped to this definition of, I need enough data to determine the internal state of whatever system I have just by looking at what is coming in, what is going out. And that is at the core the thing. Now, obviously, it's become a buzzword, which is oftentimes the fate of successful things. So, it's become a buzzword, and you end up with cargo culting.Corey: I would argue periodically, that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors. Which is tongue and cheek, but she has opinions, made, nonetheless shall I say, frustrating by the fact that she is invariably correct in those opinions, which just somehow makes it so much worse. It would be easy to dismiss things she says if she weren't always right. And the world is changing, especially as we get into the world of distributed systems.Is the server that runs the app working or not working loses meaning when we're talking about distributed systems, when we're talking about containers running on top of Kubernetes, which turns every outage into a murder mystery. We start having distributed applications composed of microservices, so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue related to the request coming into a completely separate microservice? And it seems that for those types of applications, the answer has been tracing for a long time now, where originally that was something that felt like it was sprung, fully-formed from the forehead of some God known as one of the hyperscalers, but now is available to basically everyone, in theory.In practice, it seems that instrumenting applications still one of the hardest parts of all of this. I tried hooking up one of my own applications to be observed via OTEL, the open telemetry project, and it turns out that right now, OTEL and AWS Lambda have an intersection point that makes everything extremely difficult to work with. It's not there yet; it's not baked yet. And someday, I hope that changes because I would love to interchangeably just throw metrics and traces and logs to all the different observability tools and see which ones work, which ones don't, but that still feels very far away from current state of the art.Richard: Before we go there, maybe one thing which I don't fully agree with. You said that previously, you were told if a service up or down, that's the thing which you cared about, and I don't think that's what people actually cared about. At that time, also, what they fundamentally cared about: is the user-facing service up, or down, or impacted? Is it slow? Does it return errors every X percent for requests, something like this?Corey: Is the site up? And—you're right, I was hand-waving over a whole bunch of things. It was, “Okay. First, the web server is returning a page, yes or no? Great. Can I ping the server?” Okay, well, there are ways of server can crash and still leave enough of the TCP/IP stack up or it can respond to pings and do little else.And then you start adding things to it. But the Nagios thing that I always wanted to add—and had to—was, is the disk full? And that was annoying. And, on some level, like, why should I care in the modern era how much stuff is on the disk because storage is cheap and free and plentiful? The problem is, after the third outage in a month because the disk filled up, you start to not have a good answer for well, why aren't you monitoring whether the disk is full?And that was the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to midsize applications, which is what I'm talking about, the only things that people would let me touch. I wasn't running hyperscale stuff where you have a fleet of 10,000 web servers and, “Is the server up?” Yeah, in that scenario, no one cares. But when we're talking about the database server and the two application servers and the four web servers talking to them, you think about it more in terms of pets than you do cattle.Richard: Yes, absolutely. Yet, I think that was a mistake back then, and I tried to do it differently, as a specific example with the disk. And I'm absolutely agreeing that previous generation tools limit you in how you can actually work with your data. In particular, once you're with metrics where you can do actual math on the data, it doesn't matter if the disk is almost full. It matters if that disk is going to be full within X amount of time.If that disk is 98% full and it sits there at 98% for ten years and provides the service, no one cares. The thing is, will it actually run out in the next two hours, in the next five hours, what have you. Depending on this, is this currently or imminently a customer-impacting or user-impacting then yes, alert on it, raise hell, wake people, make them fix it, as opposed to this thing can be dealt with during business hours on the next workday. And you don't have to wake anyone up.Corey: Yeah. The big filer with massive amounts of storage has crossed the 70% line. Okay, now it's time to start thinking about that, what do you want to do? Maybe it's time to order another shelf of discs for it, which is going to take some time. That's a radically different scenario than the 20 gigabyte root volume on your server just started filling up dramatically; the rate of change is such that'll be full in 20 minutes.Yeah, one of those is something you want to wake people up for. Generally speaking, you don't want to wake people up for what is fundamentally a longer-term strategic business problem. That can be sorted out in the light of day versus, “[laugh] we're not going to be making money in two hours, so if I don't wake up and fix this now.” That's the kind of thing you generally want to be woken up for. Well, let's be honest, you don't want that to happen at all, but if it does happen, you kind of want to know in advance rather than after the fact.Richard: You're literally describing linear predict from Prometheus, which is precisely for this, where I can look back over X amount of time and make a linear prediction because everything else breaks down at scale, blah, blah, blah, to detail. But the thing is, I can draw a line with my pencil by hand on my data and I can predict when is this thing going to it. Which is obviously precisely correct if I have a TLS certificate. It's a little bit more hand-wavy when it's a disk. But still, you can look into the future and you say, “What will be happening if current trends for the last X amount of time continue in Y amount of time.” And that's precisely a thing where you get this more powerful ability of doing math with your data.Corey: See, when you say it like that, it sounds like it actually is a whole term of art, where you're focusing on an in-depth field, where salaries are astronomical. Whereas the tools that I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin that I was grunting and pointing: “This is gonna fill up.” And that is how I thought about it. And this is the challenge where it's easy to think about these things in narrow, defined contexts like that, but at scale, things break.Like the idea of anomaly detection. Well, okay, great if normally, the CPU and these things are super bored and suddenly it gets really busy, that's atypical. Maybe we should look into it, assuming that it has a challenge. The problem is, that is a lot harder than it sounds because there are so many factors that factor into it. And as soon as you have something, quote-unquote, “Intelligent,” making decisions on this, it doesn't take too many false positives before you start ignoring everything it has to say, and missing legitimate things. It's this weird and obnoxious conflation of both hard technical problems and human psychology.Richard: And the breaking up of old service boundaries. Of course, when you say microservices, and such, fundamentally, functionally a microservice or nanoservice, picoservice—but the pendulum is already swinging back to larger units of complexity—but it fundamentally does not make any difference if I have a monolith on some mainframe or if I have a bunch of microservices. Yes, I can scale differently, I can scale horizontally a lot more easily, vertically, it's a little bit harder, blah, blah, blah, but fundamentally, the logic and the complexity, which is being packaged is fundamentally the same. More users, everything, but it is fundamentally the same. What's happening again, and again, is I'm breaking up those old boundaries, which means the old tools which have assumptions built in about certain aspects of how I can actually get an overview of a system just start breaking down, when my complexity unit or my service or what have I, is usually congruent with a physical piece, of hardware or several services are congruent with that piece of hardware, it absolutely makes sense to think about things in terms of this one physical server. The fact that you have different considerations in cloud, and microservices, and blah, blah, blah, is not inherently that it is more complex.On the contrary, it is fundamentally the same thing. It scales with users' everything, but it is fundamentally the same thing, but I have different boundaries of where I put interfaces onto my complexity, which basically allow me to hide all of this complexity from the downstream users.Corey: That's part of the challenge that I think we're grappling with across this entire industry from start to finish. Where we originally looked at these things and could reason about it because it's the computer and I know how those things work. Well, kind of, but okay, sure. But then we start layering levels of complexity on top of layers of complexity on top of layers of complexity, and suddenly, when things stop working the way that we expect, it can be very challenging to unpack and understand why. One of the ways I got into this whole space was understanding, to some degree, of how system calls work, of how the kernel wound up interacting with userspace, about how Linux systems worked from start to finish. And these days, that isn't particularly necessary most of the time for the care and feeding of applications.The challenge is when things start breaking, suddenly having that in my back pocket to pull out could be extremely handy. But I don't think it's nearly as central as it once was and I don't know that I would necessarily advise someone new to this space to spend a few years as a systems person, digging into a lot of those aspects. And this is why you need to know what inodes are and how they work. Not really, not anymore. It's not front and center the way that it once was, in most environments, at least in the world that I live in. Agree? Disagree?Richard: Agreed. But it's very much unsurprising. You probably can't tell me how to precisely grow sugar cane or corn, you can't tell me how to refine the sugar out of it, but you can absolutely bake a cake. But you will not be able to tell me even a third of—and I'm—for the record, I'm also not able to tell you even a third about the supply chain which just goes from I have a field and some seeds and I need to have a package of refined sugar—you're absolutely enabled to do any of this. The thing is, you've been part of the previous generation of infrastructure where you know how this underlying infrastructure works, so you have more ability to reason about this, but it's not needed for cloud services nearly as much.You need different types of skill sets, but that doesn't mean the old skill set is completely useless, at least not as of right now. It's much more a case of you need fewer of those people and you need them in different places because those things have become infrastructure. Which is basically the cloud play, where a lot of this is just becoming infrastructure more and more.Corey: Oh, yeah. Back then I distinctly remember my elders looking down their noses at me because I didn't know assembly, and how could I possibly consider myself a competent systems admin if I didn't at least have a working knowledge of assembly? Or at least C, which I, over time, learned enough about to know that I didn't want to be a C programmer. And you're right, this is the value of cloud and going back to those days getting a web server up and running just to compile Apache's httpd took a week and an in-depth knowledge of GCC flags.And then in time, oh, great. We're going to have rpm or debs. Great, okay, then in time, you have apt, if you're in the dev land because I know you are a Debian developer, but over in Red Hat land, we had yum and other tools. And then in time, it became oh, we can just use something like Puppet or Chef to wind up ensuring that thing is installed. And then oh, just docker run. And now it's a checkbox in a web console for S3.These things get easier with time and step by step by step we're standing on the shoulders of giants. Even in the last ten years of my career, I used to have a great challenge question that I would interview people with of, “Do you know what TinyURL is? It takes a short URL and then expands it to a longer one. Great, on the whiteboard, tell me how you would implement that.” And you could go up one side and down the other, and then you could add constraints, multiple data centers, now one goes offline, how do you not lose data? Et cetera, et cetera.But these days, there are so many ways to do that using cloud services that it almost becomes trivial. It's okay, multiple data centers, API Gateway, a Lambda, and a global DynamoDB table. Now, what? “Well, now it gets slow. Why is it getting slow?”“Well, in that scenario, probably because of something underlying the cloud provider.” “And so now, you lose an entire AWS region. How do you handle that?” “Seems to me when that happens, the entire internet's kind of broken. Do people really need longer URLs?”And that is a valid answer, in many cases. The question doesn't really work without a whole bunch of additional constraints that make it sound fake. And that's not a weakness. That is the fact that computers and cloud services have never been as accessible as they are now. And that's a win for everyone.Richard: There's one aspect of accessibility which is actually decreasing—or two. A, you need to pay for them on an ongoing basis. And B, you need an internet connection which is suitably fast, low latency, what have you. And those are things which actually do make things harder for a variety of reasons. If I look at our back-end systems—as in Grafana—all of them have single binary modes where you literally compile everything into a single binary and you can run it on your laptop because if you're stuck on a plane, you can't do any work on it. That kind of is not the best of situations.And if you have a huge CI/CD pipeline, everything in this cloud and fine and dandy, but your internet breaks. Yeah, so I do agree that it is becoming generally more accessible. I disagree that it is becoming more accessible along all possible axes.Corey: I would agree. There is a silver lining to that as well, where yes, they are fraught and dangerous and I would preface this with a whole bunch of warnings, but from a cost perspective, all of the cloud providers do have a free tier offering where you can kick the tires on a lot of these things in return for no money. Surprisingly, the best one of those is Oracle Cloud where they have an unlimited free tier, use whatever you want in this subset of services, and you will never be charged a dime. As opposed to the AWS model of free tier where well, okay, it suddenly got very popular or you misconfigured something, and surprise, you now owe us enough money to buy Belize. That doesn't usually lead to a great customer experience.But you're right, you can't get away from needing an internet connection of at least some level of stability and throughput in order for a lot of these things to work. The stuff you would do locally on a Raspberry Pi, for example, if your budget constrained and want to get something out here, or your laptop. Great, that's not going to work in the same way as a full-on cloud service will.Richard: It's not free unless you have hard guarantees that you're not going to ever pay anything. It's fine to send warning, it's fine to switch the thing off, it's fine to have you hit random hard and soft quotas. It is not a free service if you can't guarantee that it is free.Corey: I agree with you. I think that there needs to be a free offering where, “Well, okay, you want us to suddenly stop serving traffic to the world?” “Yes. When the alternative is you have to start charging me through the nose, yes I want you to stop serving traffic.” That is definitionally what it says on the tin.And as an independent learner, that is what I want. Conversely, if I'm an enterprise, yeah, I don't care about money; we're running our Superbowl ad right now, so whatever you do, don't stop serving traffic. Charge us all the money. And there's been a lot of hand wringing about, well, how do we figure out which direction to go in? And it's, have you considered asking the customer?So, on a scale of one to bank, how serious is this account going to be [laugh]? Like, what are your big concerns: never charge me or never go down? Because we can build for either of those. Just let's make sure that all of those expectations are aligned. Because if you guess you're going to get it wrong and then no one's going to like you.Richard: I would argue this. All those services from all cloud providers actually build to address both of those. It's a deliberate choice not to offer certain aspects.Corey: Absolutely. When I talk to AWS, like, “Yeah, but there is an eventual consistency challenge in the billing system where it takes”—as anyone who's looked at the billing system can see—“Multiple days, sometimes for usage data to show up. So, how would we be able to stop things if the usage starts climbing?” To which my relatively direct responses, that sounds like a huge problem. I don't know how you'd fix that, but I do know that if suddenly you decide, as a matter of policy, to okay, if you're in the free tier, we will not charge you, or even we will not charge you more than $20 a month.So, you build yourself some headroom, great. And anything that people are able to spin up, well, you're just going to have to eat the cost as a provider. I somehow suspect that would get fixed super quickly if that were the constraint. The fact that it isn't is a conscious choice.Richard: Absolutely.Corey: And the reason I'm so passionate about this, about the free space, is not because I want to get a bunch of things for free. I assure you I do not. I mean, I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely, “It's too expensive.” It's that the billing dimension is hard to predict or doesn't align with a customer's experience or prices a service out of a bunch of use cases where it'll be great. But very rarely do I just sit here shaking my fist and saying, “It costs too much.”The problem is when you scare the living crap out of a student with a surprise bill that's more than their entire college tuition, even if you waive it a week or so later, do you think they're ever going to be as excited as they once were to go and use cloud services and build things for themselves and see what's possible? I mean, you and I met on IRC 20 years ago because back in those days, the failure mode and the risk financially was extremely low. It's yeah, the biggest concern that I had back then when I was doing some of my Linux experimentation is if I typed the wrong thing, I'm going to break my laptop. And yeah, that happened once or twice, and I've learned not to make those same kinds of mistakes, or put guardrails in so the blast radius was smaller, or use a remote system instead. Yeah, someone else's computer that I can destroy. Wonderful. But that was on we live and we learn as we were coming up. There was never an opportunity for us, to my understanding, to wind up accidentally running up an $8 million charge.Richard: Absolutely. And psychological safety is one of the most important things in what most people do. We are social animals. Without this psychological safety, you're not going to have long-term, self-sustaining groups. You will not make someone really excited about it. There's two basic ways to sell: trust or force. Those are the two ones. There's none else.Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomemento.co/screaming That's GO M-O-M-E-N-T-O dot co slash screamingCorey: Yeah. And it also looks ridiculous. I was talking to someone somewhat recently who's used to spending four bucks a month on their AWS bill for some S3 stuff. Great. Good for them. That's awesome. Their credentials got compromised. Yes, that is on them to some extent. Okay, great.But now after six days, they were told that they owed $360,000 to AWS. And I don't know how, as a cloud company, you can sit there and ask a student to do that. That is not a realistic thing. They are what is known, in the United States at least, in the world of civil litigation as quote-unquote, “Judgment proof,” which means, great, you could wind up finding that someone owes you $20 billion. Most of the time, they don't have that, so you're not able to recoup it. Yeah, the judgment feels good, but you're never going to see it.That's the problem with something like that. It's yeah, I would declare bankruptcy long before, as a student, I wound up paying that kind of money. And I don't hear any stories about them releasing the collection agency hounds against people in that scenario. But I couldn't guarantee that. I would never urge someone to ignore that bill and see what happens.And it's such an off-putting thing that, from my perspective, is beneath of the company. And let's be clear, I see this behavior at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS, but they are the 800-pound gorilla in the space, and that's important. Or as I just to mention right now, like, as I—because I was about to give you crap for this, too, but if I go to grafana.com, it says, and I quote, “Play around with the Grafana Stack. Experience Grafana for yourself, no registration or installation needed.”Good. I was about to yell at you if it's, “Oh, just give us your credit card and go ahead and start spinning things up and we won't charge you. Honest.” Even your free account does not require a credit card; you're doing it right. That tells me that I'm not going to get a giant surprise bill.Richard: You have no idea how much thought and work went into our free offering. There was a lot of math involved.Corey: None of this is easy, I want to be very clear on that. Pricing is one of the hardest things to get right, especially in cloud. And it also, when you get it right, it doesn't look like it was that hard for you to do. But I fix [sigh] I people's AWS bills for a living and still, five or six years in, one of the hardest things I still wrestle with is pricing engagements. It's incredibly nuanced, incredibly challenging, and at least for services in the cloud space where you're doing usage-based billing, that becomes a problem.But glancing at your pricing page, you do hit the two things that are incredibly important to me. The first one is use something for free. As an added bonus, you can use it forever. And I can get started with it right now. Great, when I go and look at your pricing page or I want to use your product and it tells me to ‘click here to contact us.' That tells me it's an enterprise sales cycle, it's got to be really expensive, and I'm not solving my problem tonight.Whereas the other side of it, the enterprise offering needs to be ‘contact us' and you do that, that speaks to the enterprise procurement people who don't know how to sign a check that doesn't have to commas in it, and they want to have custom terms and all the rest, and they're prepared to pay for that. If you don't have that, you look to small-time. When it doesn't matter what price you put on it, you wind up offering your enterprise tier at some large number, it's yeah, for some companies, that's a small number. You don't necessarily want to back yourself in, depending upon what the specific needs are. You've gotten that right.Every common criticism that I have about pricing, you folks have gotten right. And I definitely can pick up on your fingerprints on a lot of this. Because it sounds like a weird thing to say of, “Well, he's the Director of Community, why would he weigh in on pricing?” It's, “I don't think you understand what community is when you ask that question.”Richard: Yes, I fully agree. It's super important to get pricing right, or to get many things right. And usually the things which just feel naturally correct are the ones which took the most effort and the most time and everything. And yes, at least from the—like, I was in those conversations or part of them, and the one thing which was always clear is when we say it's free, it must be free. When we say it is forever free, it must be forever free. No games, no lies, do what you say and say what you do. Basically.We have things where initially you get certain pro features and you can keep paying and you can keep using them, or after X amount of time they go away. Things like these are built in because that's what people want. They want to play around with the whole thing and see, hey, is this actually providing me value? Do I want to pay for this feature which is nice or this and that plugin or what have you? And yeah, you're also absolutely right that once you leave these constraints of basically self-serve cloud, you are talking about bespoke deals, but you're also talking about okay, let's sit down, let's actually understand what your business is: what are your business problems? What are you going to solve today? What are you trying to solve tomorrow?Let us find a way of actually supporting you and invest into a mutual partnership and not just grab the money and run. We have extremely low churn for, I would say, pretty good reasons. Because this thing about our users, our customers being successful, we do take it extremely seriously.Corey: It's one of those areas that I just can't shake the feeling is underappreciated industry-wide. And the reason I say that this is your fingerprints on it is because if this had been wrong, you have a lot of… we'll call them idiosyncrasies, where there are certain things you absolutely will not stand for, and misleading people and tricking them into paying money is high on that list. One of the reasons we're friends. So yeah, but I say I see your fingerprints on this, it's yeah, if this hadn't been worked out the way that it is, you would not still be there. One other thing that I wanted to call out about, well, I guess it's a confluence of pricing and logging in the rest, I look at your free tier, and it offers up to 50 gigabytes of ingest a month.And it's easy for me to sit here and compare that to other services, other tools, and other logging stories, and then I have to stop and think for a minute that yeah, discs have gotten way bigger, and internet connections have gotten way faster, and even the logs have gotten way wordier. I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark examples of what that looks like? Because it's been long enough since I've been playing in these waters that I can't really contextualize it anymore.Richard: Lord of the Rings is roughly five megabytes. It's actually less. So, we're talking literally 10,000 Lord of the Rings, which you can just shove in us and we're just storing this for you. Which also tells you that you're not going to be reading any of this. Or some of it, yes, but not all of it. You need better tooling and you need proper tooling.And some of this is more modern. Some of this is where we actually pushed the state of the art. But I'm also biased. But I, for myself, do claim that we did push the state of the art here. But at the same time you come back to those absolute fundamentals of how humans deal with data.If you look back basically as far as we have writing—literally 6000 years ago, is the oldest writing—humans have always dealt with information with the state of the world in very specific ways. A, is it important enough to even write it down, to even persist it in whatever persistence mechanisms I have at my disposal? If yes, write a detailed account or record a detailed account of whatever the thing is. But it turns out, this is expensive and it's not what you need. So, over time, you optimize towards only taking down key events and only noting key events. Maybe with their interconnections, but fundamentally, the key events.As your data grows, as you have more stuff, as this still is important to your business and keeps being more important to—or doesn't even need to be a business; can be social, can be whatever—whatever thing it is, it becomes expensive, again, to retain all of those key events. So, you turn them into numbers and you can do actual math on them. And that's this path which you've seen again, and again, and again, and again, throughout humanity's history. Literally, as long as we have written records, this has played out again, and again, and again, and again, for every single field which humans actually cared about. At different times, like, power networks are way ahead of this, but fundamentally power networks work on metrics, but for transient load spike, and everything, they have logs built into their power measurement devices, but those are only far in between. Of course, the main thing is just metrics, time-series. And you see this again, and again.You also were sysadmin in internet-related all switches have been metrics-based or metrics-first for basically forever, for 20, 30 years. But that stands to reason. Of course the internet is running at by roughly 20 years scale-wise in front of the cloud because obviously you need the internet because as you wouldn't be having a cloud. So, all of those growing pains why metrics are all of a sudden the thing, “Or have been for a few years now,” is basically, of course, people who were writing software, providing their own software services, hit the scaling limitations which you hit for Internet service providers two decades, three decades ago. But fundamentally, you have this complete system. Basically profiles or distributed tracing depending on how you view distributed tracing.You can also argue that distributed tracing is key events which are linked to each other. Logs sit firmly in the key event thing and then you turn this into numbers and that is metrics. And that's basically it. You have extremes at the and where you can have valid, depending on your circumstances, engineering trade-offs of where you invest the most, but fundamentally, that is why those always appear again in humanity's dealing with data, and observability is no different.Corey: I take a look at last month's AWS bill. Mine is pretty well optimized. It's a bit over 500 bucks. And right around 150 of that is various forms of logging and detecting change in the environment. And on the one hand, I sit here, and I think, “Oh, I should optimize that,” because the value of those logs to me is zero.Except that whenever I have to go in and diagnose something or respond to an incident or have some forensic exploration, they then are worth an awful lot. And I am prepared to pay 150 bucks a month for that because the potential value of having that when the time comes is going to be extraordinarily useful. And it basically just feels like a tax on top of what it is that I'm doing. The same thing happens with application observability where, yeah, when you just want the big substantial stuff, yeah, until you're trying to diagnose something. But in some cases, yeah, okay, then crank up the verbosity and then look for it.But if you're trying to figure it out after an event that isn't likely or hopefully won't recur, you're going to wish that you spent a little bit more on collecting data out of it. You're always going to be wrong, you're always going to be unhappy, on some level.Richard: Ish. You could absolutely be optimizing this. I mean, for $500, it's probably not worth your time unless you take it as an exercise, but outside of due diligence where you need specific logs tied to—or specific events tied to specific times, I would argue that a lot of the problems with logs is just dealing with it wrong. You have this one extreme of full-text indexing everything, and you have this other extreme of a data lake—which is just a euphemism of never looking at the data again—to keep storage vendors happy. There is an in between.Again, I'm biased, but like for example, with Loki, you have those same label sets as you have on your metrics with Prometheus, and you have literally the same, which means you only index that part and you only extract on ingestion time. If you don't have structured logs yet, only put the metadata about whatever you care about extracted and put it into your label set and store this, and that's the only thing you index. But it goes further than just this. You can also turn those logs into metrics.And to me this is a path of optimization. Where previously I logged this and that error. Okay, fine, but it's just a log line telling me it's HTTP 500. No one cares that this is at this precise time. Log levels are also basically an anti-pattern because they're just trying to deal with the amount of data which I have, and try and get a handle on this on that level whereas it would be much easier if I just counted every time I have an HTTP 500, I just up my counter by one. And again, and again, and again.And all of a sudden, I have literally—and I did the math on this—over 99.8% of the data which I have to store just goes away. It's just magic the way—and we're only talking about the first time I'm hitting this logline. The second time I'm hitting this logline is functionally free if I turn this into metrics. It becomes cheap enough that one of the mantras which I have, if you need to onboard your developers on modern observability, blah, blah, blah, blah, blah, the whole bells and whistles, usually people have logs, like that's what they have, unless they were from ISPs or power companies, or so; there they usually start with metrics.But most users, which I see both with my Grafana and with my Prometheus [unintelligible 00:38:46] tend to start with logs. They have issues with those logs because they're basically unstructured and useless and you need to first make them useful to some extent. But then you can leverage on this and instead of having a debug statement, just put a counter. Every single time you think, “Hey, maybe I should put a debug statement,” just put a counter instead. In two months time, see if it was worth it or if you delete that line and just remove that counter.It's so much cheaper, you can just throw this on and just have it run for a week or a month or whatever timeframe and done. But it goes beyond this because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my alerts on those metrics. I can actually persist those metrics and can more aggressively throw my logs away. But also, I have this transition made a lot easier where I don't have this huge lift, where this day in three months is to be cut over and we're going to release the new version of this and that software and it's not going to have that, it's going to have 80% less logs and everything will be great and then you missed the first maintenance window or someone is ill or what have you, and then the next Big Friday is coming so you can't actually deploy there. I mean Black Friday. But we can also talk about deploying on Fridays.But the thing is, you have this huge thing, whereas if you have this as a continuous improvement process, I can just look at, this is the log which is coming out. I turn this into a number, I start emitting metrics directly, and I see that those numbers match. And so, I can just start—I build new stuff, I put it into a new data format, I actually emit the new data format directly from my code instrumentation, and only then do I start removing the instrumentation for the logs. And that allows me to, with full confidence, with psychological safety, just move a lot more quickly, deliver much more quickly, and also cut down on my costs more quickly because I'm just using more efficient data types.Corey: I really want to thank you for spending as much time as you have. If people want to learn more about how you view the world and figure out what other personal attacks they can throw your way, where's the best place for them to find you?Richard: Personal attacks, probably Twitter. It's, like, the go-to place for this kind of thing. For actually tracking, I stopped maintaining my own website. Maybe I'll do again, but if you go on github.com/ritchieh/talks, you'll find a reasonably up-to-date list of all the talks, interviews, presentations, panels, what have you, which I did over the last whatever amount of time. [laugh].Corey: And we will, of course, put links to that in the [show notes 00:41:23]. Thanks again for your time. It's always appreciated.Richard: And thank you.Corey: Richard Hartmann, Director of Community at Grafana Labs. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment. And then when someone else comes along with an insulting comment they want to add, we'll just increment the counter by one.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

The Tech Blog Writer Podcast
2140: SolarWinds Hybrid Cloud Observability

The Tech Blog Writer Podcast

Play Episode Listen Later Oct 14, 2022 29:14


SolarWinds is number one in network management software beating out IBM, Cisco and others, and serves over 300,000 customers including 498 of the Fortune 500. Its products have become ubiquitous because they are easier to use, more powerful and scalable and more affordable than other providers, including its recently launched Hybrid Cloud Observability Platform. This is the first platform that the company has built from the ground up since the SUNBURST supply chain incident that targeted SolarWinds and other technology companies last year. The platform provides customers with an integrated, full-stack solution that's able to detect productivity and security anomalies, identify issues and take automated remediation actions to maximize productivity, prevent security issues and reduce costs. The company's Head Geeks, Chrystal Taylor and Sascha Giese join me on Tech Talks Daily to talk about what observability can do for network administrators and learn more about SolarWinds' new Hybrid Cloud Observability Platform.

Screaming in the Cloud
Raising Awareness on Cloud-Native Threats with Michael Clark

Screaming in the Cloud

Play Episode Listen Later Oct 13, 2022 38:44


About MichaelMichael is the Director of Threat Research at Sysdig, managing a team of experts tasked with discovering and defending against novel security threats. Michael has more than 20 years of industry experience in many different roles, including incident response, threat intelligence, offensive security research, and software development at companies like Rapid7, ThreatQuotient, and Mantech. Prior to joining Sysdig, Michael worked as a Gartner analyst, advising enterprise clients on security operations topics.Links Referenced: Sysdig: https://sysdig.com/ “2022 Sysdig Cloud-Native Threat Report”: https://sysdig.com/threatreport TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Something interesting about this particular promoted guest episode that is brought to us by our friends at Sysdig is that when they reached out to set this up, one of the first things out of their mouth was, “We don't want to sell anything,” which is novel. And I said, “Tell me more,” because I was also slightly skeptical. But based upon the conversations that I've had, and what I've seen, they were being honest. So, my guest today—surprising as though it may be—is Mike Clark, Director of Threat Research at Sysdig. Mike, how are you doing?Michael: I'm doing great. Thanks for having me. How are you doing?Corey: Not dead yet. So, we take what we can get sometimes. You folks have just come out with the “2022 Sysdig Cloud-Native Threat Report”, which on one hand, it feels like it's kind of a wordy title, on the other it actually encompasses everything that it is, and you need every single word of that report. At a very high level, what is that thing?Michael: Sure. So, this is our first threat report we've ever done, and it's kind of a rite of passage, I think for any security company in the space; you have to have a threat report. And the cloud-native part, Sysdig specializes in cloud and containers, so we really wanted to focus in on those areas when we were making this threat report, which talks about, you know, some of the common threats and attacks we were seeing over the past year, and we just wanted to let people know what they are and how they protect themselves.Corey: One thing that I've found about a variety of threat reports is that they tend to excel at living in the fear, uncertainty, and doubt space. And invariably, they paint a very dire picture of the internet about become cascading down. And then at the end, there's always a, “But there is hope. Click here to set up a meeting with us.” It's basically a very thinly- veiled cover around what is fundamentally a fear, uncertainty, and doubt-driven marketing strategy, and then it tries to turn into a sales pitch.This does absolutely none of that. So, I have to ask, did you set out to intentionally make something that added value in that way and have contributed to the body of knowledge, or is it because it's your inaugural report; you didn't realize you were supposed to turn it into a terrible sales pitch.Michael: We definitely went into that on purpose. There's a lot of ways to fix things, especially these days with all the different technologies, so we can easily talk about the solutions without going into specific products. And that's kind of way we went about it. There's a lot of ways to fix each of the things we mentioned in the report. And hopefully, the person reading it finds a good way to do it.Corey: I'd like to unpack a fair bit of what's in the report. And let's be clear, I don't intend to read this report into a microphone; that is generally not a great way of conveying information that I have found. But I want to highlight a few things that leapt out to me that I find interesting. Before I do that, I'm curious to know, most people who write reports, especially ones of this quality, are not sitting there cogitating in their office by themselves, and they set pen to paper and emerge four days later with the finished treatise. There's a team involved, there's more than one person that weighs in. Who was behind this?Michael: Yeah, it was a pretty big team effort across several departments. But mostly, it came to the Sysdig threat research team. It's about ten people right now. It's grown quite a bit through the past year. And, you know, it's made up of all sorts of backgrounds and expertise.So, we have machine learning people, data scientists, data engineers, former pen-testers and red team, a lot of blue team people, people from the NSA, people from other government agencies as well. And we're also a global research team, so we have people in Europe and North America working on all of this. So, we try to get perspectives on how these threats are viewed by multiple areas, not just Silicon Valley, and express fixes that appeal to them, too.Corey: Your executive summary on this report starts off with a cloud adversary analysis of TeamTNT. And my initial throwaway joke on that, it was going to be, “Oh, when you start off talking about any entity that isn't you folks, they must have gotten the platinum sponsorship package.” But then I read the rest of that paragraph and I realized that wait a minute, this is actually interesting and germane to something that I see an awful lot. Specifically, they are—and please correct me if I'm wrong on any of this; you are definitionally the expert whereas I am, obviously the peanut gallery—but you talk about TeamTNT as being a threat actor that focuses on targeting the cloud via cryptojacking, which is a fanciful word for, “Okay, I've gotten access to your cloud environment; what am I going to do with it? Mine Bitcoin and other various cryptocurrencies.” Is that generally accurate or have I missed the boat somewhere fierce on that? Which is entirely possible.Michael: That's pretty accurate. We also think it just one person, actually, and they are very prolific. So, they were pretty hard to get that platinum support package because they are everywhere. And even though it's one person, they can do a lot of damage, especially with all the automation people can make now, one person can appear like a dozen.Corey: There was an old t-shirt that basically encompassed everything that was wrong with the culture of the sysadmin world back in the naughts, that said, “Go away, or I will replace you with a very small shell script.” But, on some level, you can get a surprising amount of work done on computers, just with things like for loops and whatnot. What I found interesting was that you have put numbers and data behind something that I've always taken for granted and just implicitly assumed that everyone knew. This is a common failure mode that we all have. We all have blind spots where we assume the things that we spend our time on is easy and the stuff that other people are good at and you're not good at, those are the hard things.It has always been intuitively obvious to me as a cloud economist, that when you wind up spending $10,000 in cloud resources to mine cryptocurrency, it does not generate $10,000 of cryptocurrency on the other end. In fact, the line I've been using for years is that it's totally economical to mine Bitcoin in the cloud; the only trick is you have to do it in someone else's account. And you've taken that joke and turned it into data. Something that you found was that in one case, that you were able to attribute $8,100 of cryptocurrency that were generated by stealing $430,000 of cloud resources to do it. And oh, my God, we now have a number and a ratio, and I can talk intelligently and sound four times smarter. So, ignoring anything else in this entire report, congratulations, you have successfully turned this into what is beginning to become a talking point of mine. Value unlocked. Good work. Tell me more.Michael: Oh, thank you. Cryptomining is kind of like viruses in the old on-prem environment. Normally it just cleaned up and never thought of again; the antivirus software does its thing, life goes on. And I think cryptominers are kind of treated like that. Oh, there's a miner; let's rebuild the instance or bring a new container online or something like that.So, it's often considered a nuisance rather than a serious threat. It also doesn't have the, you know, the dangerous ransomware connotation to it. So, a lot of people generally just think of as a nuisance, as I said. So, what we wanted to show was, it's not really a nuisance and it can cost you a lot of money if you don't take it seriously. And what we found was for every dollar that they make, it costs you $53. And, you know, as you mentioned, it really puts it into view of what it could cost you by not taking it seriously. And that number can scale very quickly, just like your cloud environment can scale very quickly.Corey: They say this cloud scales infinitely and that is not true. First, tried it; didn't work. Secondly, it scales, but there is an inherent limit, which is your budget, on some level. I promise they can add hard drives to S3 faster than you can stuff data into it. I've checked.One thing that I've seen recently was—speaking of S3—I had someone reach out in what I will charitably refer to as a blind panic because they were using AWS to do something. Their bill was largely $4 a month in S3 charges. Very reasonable. That carries us surprisingly far. And then they had a credential leak and they had a threat actor spin up all the Lambda functions in all of the regions, and it went from $4 a month to $60,000 a day and it wasn't caught for six days.And then AWS as they tend to do, very straight-faced, says, “Yeah, we would like our $360,000, please.” At which point, people start panicking because a lot of the people who experience this are not themselves sophisticated customers; they're students, they're learning how this stuff works. And when I'm paying $4 a month for something, it is logical and intuitive for me to think that, well, if I wind up being sloppy with their credentials, they could run that bill up to possibly $25 a month and that wouldn't be great, so I should keep an eye on it. Yeah, you dropped a whole bunch of zeros off the end of that. Here you go. And as AWS spins up more and more regions and as they spin up more and more services, the ability to exploit this becomes greater and greater. This problem is not getting better, it is only getting worse, by a lot.Michael: Oh, yeah, absolutely. And I feel really bad for those students who do have that happen to them. I've heard on occasion that the cloud providers will forgive some debts, but there's no guarantee of that happening, from breaches. And you know, the more that breaches happen, the less likely they are going to forgive it because they still to pay for it; someone's paying for it in the end. And if you don't improve and fix your environment and it keeps happening, one day, they're just going to stick you with the bill.Corey: To my understanding, they've always done the right thing when I've highlighted something to them. I don't have intimate visibility into it and of course, they have a threat model themselves of, okay, I'm going to spin up a bunch of stuff, mine cryptocurrency for a month—cry and scream and pretend I got hacked because fraud is very much a thing, there is a financial incentive attached to this—and they mostly seem to get it right. But the danger that I see for the cloud provider is not that they're going to stop being nice and giving money away, but assume you're a student who just winds up getting more than your entire college tuition as a surprise bill for this month from a cloud provider. Even assuming at the end of that everything gets wiped and you don't owe anything. I don't know about you, but I've never used that cloud provider again because I've just gotten a firsthand lesson in exactly what those risks are, it's bad for the brand.Michael: Yeah, it really does scare people off of that. Now, some cloud providers try to offer more proactive protections against this, try to shut down instances really quick. And you know, you can take advantage of limits and other things, but they don't make that really easy to do. And setting those up is critical for everybody.Corey: The one cloud provider that I've seen get this right, of all things, has been Oracle Cloud, where they have an always free tier. Until you affirmatively upgrade your account to chargeable, they will not charge you a penny. And I have experimented with this extensively, and they're right, they will not charge you a penny. They do have warnings plastered on the site, as they should, that until you upgrade your account, do understand that if you exceed a threshold, we will stop serving traffic, we will stop servicing your workload. And yeah, for a student learner, that's absolutely what I want. For a big enterprise gearing up for a giant Superbowl commercial or whatnot, it's, “Yeah, don't care what it costs, just make sure you continue serving traffic. We don't get a redo on this.” And without understanding exactly which profile of given customer falls into, whenever the cloud provider tries to make an assumption and a default in either direction, they're wrong.Michael: Yeah, I'm surprised that Oracle Cloud of all clouds. It's good to hear that they actually have a free tier. Now, we've seen attackers have used free tiers quite a bit. It all depends on how people set it up. And it's actually a little outside the threat report, but the CI/CD pipelines in DevOps, anywhere there's free compute, attackers will try to get their miners in because it's all about scale and not quality.Corey: Well, that is something I'd be curious to know. Because you talk about focusing specifically on cloud and containers as a company, which puts you in a position to be authoritative on this. That Lambda story that I mentioned about, surprise $60,000 a day in cryptomining, what struck me about that and caught me by surprise was not what I think would catch most people who didn't swim in this world by surprise of, “You can spend that much?” In my case, what I'm wondering about is, well hang on a minute. I did an article a year or two ago, “17 Ways to Run Containers On AWS” and listed 17 AWS services that you could use to run containers.And a few months later, I wrote another article called “17 More Ways to Run Containers On AWS.” And people thought I was belaboring the point and making a silly joke, and on some level, of course I was. But I was also highlighting very clearly that every one of those containers running in a service could be mining cryptocurrency. So, if you get access to someone else's AWS account, when you see those breaches happen, are people using just the one or two services they have things ready to go for, or are they proliferating as many containers as they can through every service that borderline supports it?Michael: From what we've seen, they usually just go after a compute, like EC2 for example, as it's most well understood, it gets the job done, it's very easy to use, and then get your miner set up. So, if they happen to compromise your credentials versus the other method that cryptominers or cryptojackers do is exploitation, then they'll try to spread throughout their all their EC2 they can and spin up as much as they can. But the other interesting thing is if they get into your system, maybe via an exploit or some other misconfiguration, they'll look for the IAM metadata service as soon as they get in, to try to get your IAM credentials and see if they can leverage them to also spin up things through the API. So, they'll spin up on the thing they compromised and then actively look for other ways to get even more.Corey: Restricting the permissions that anything has in your cloud environment is important. I mean, from my perspective, if I were to have my account breached, yes, they're going to cost me a giant pile of money, but I know the magic incantations to say to AWS and worst case, everyone has a pet or something they don't want to see unfortunate things happen to, so they'll waive my fee; that's fine. The bigger concern I've got—in seriousness—I think most companies do is the data. It is the access to things in the account. In my case, I have a number of my clients' AWS bills, given that that is what they pay me to work on.And I'm not trying to undersell the value of security here, but on the plus side that helps me sleep at night, that's only money. There are datasets that are far more damaging and valuable about that. The worst sleep I ever had in my career came during a very brief stint I had about 12 years ago when I was the director of TechOps at Grindr, the gay dating site. At that scenario, if that data had been breached, people could very well have died. They live in countries where that winds up not being something that is allowed, or their family now winds up shunning them and whatnot. And that's the stuff that keeps me up at night. Compared to that, it's, “Well, you cost us some money and embarrassed a company.” It doesn't really rank on the same scale to me.Michael: Yeah. I guess the interesting part is, data requires a lot of work to do something with for a lot of attackers. Like, it may be opportunistic and come across interesting data, but they need to do something with it, there's a lot more risk once they start trying to sell the data, or like you said, if it turns into something very unfortunate, then there's a lot more risk from law enforcement coming after them. Whereas with cryptomining, there's very little risk from being chased down by the authorities. Like you said, people, they rebuild things and ask AWS for credit, or whoever, and move on with their lives. So, that's one reason I think cryptomining is so popular among threat actors right now. It's just the low risk compared to other ways of doing things.Corey: It feels like it's a nuisance. One thing that I was dreading when I got this copy of the report was that there was going to be what I see so often, which is let's talk about ransomware in the cloud, where people talk about encrypting data in S3 buckets and sneakily polluting the backups that go into different accounts and how your air -gapping and the rest. And I don't see that in the wild. I see that in the fear-driven marketing from companies that have a thing that they say will fix that, but in practice, when you hear about ransomware attacks, it's much more frequently that it is their corporate network, it is on-premises environments, it is servers, perhaps running in AWS, but they're being treated like servers would be on-prem, and that is what winds up getting encrypted. I just don't see the attacks that everyone is warning about. But again, I am not primarily in the security space. What do you see in that area?Michael: You're absolutely right. Like we don't see that at all, either. It's certainly theoretically possible and it may have happened, but there just doesn't seem to be that appetite to do that. Now, the reasoning? I'm not a hundred percent sure why, but I think it's easier to make money with cryptomining, even with the crypto markets the way they are. It's essentially free money, no expenses on your part.So, maybe they're not looking because again, that requires more effort to understand especially if it's not targeted—what data is important. And then it's not exactly the same method to do the attack. There's versioning, there's all this other hoops you have to jump through to do an extortion attack with buckets and things like that.Corey: Oh, it's high risk and feels dirty, too. Whereas if you're just, I guess, on some level, psychologically, if you're just going to spin up a bunch of coin mining somewhere and then some company finds it and turns it off, whatever. You're not, as in some cases, shaking down a children's hospital. Like that's one of those great, I can't imagine how you deal with that as a human being, but I guess it takes all types. This doesn't get us to sort of the second tentpole of the report that you've put together, specifically around the idea of supply chain attacks against containers. There have been such a tremendous number of think pieces—thought pieces, whatever they're called these days—talking about a software bill of materials and supply chain threats. Break it down for me. What are you seeing?Michael: Sure. So, containers are very fun because, you know, you can define things as code about what gets put on it, and they become so popular that sharing sites have popped up, like Docker Hub and other public registries, where you can easily share your container, it has everything built, set up, so other people can use it. But you know, attackers have kind of taken notice of this, too. Where anything's easy, an attacker will be. So, we've seen a lot of malicious containers be uploaded to these systems.A lot of times, they're just hoping for a developer or user to come along and use them because your Docker Hub does have the official designation, so while they can try to pretend to be like Ubuntu, they won't be the official. But instead, they may try to see theirs and links and things like that to entice people to use theirs instead. And then when they do, it's already pre-loaded with a miner or, you know, other malware. So, we see quite a bit of these containers in Docker Hub. And they're disguised as many different popular packages.They don't stand up to too much scrutiny, but enough that, you know, a casual looker, even Docker file may not see it. So yeah, we see a lot of—and embedded credentials and other big part that we see in these containers. That could be an organizational issue, like just a leaked credential, but you can put malicious credentials into Docker files, to0, like, say an SSH private key that, you know, if they start this up, the attacker can now just log—SSH in. Or other API keys or other AWS changing commands you can put in there. You can put really anything in there, and wherever you load it, it's going to run. So, you have to be really careful.[midroll 00:22:15]Corey: Years ago, I gave a talk at the conference circuit called, “Terrible Ideas in Git” that purported to teach people how to get worked through hilarious examples of misadventure. And the demos that I did on that were, well, this was fun and great, but it was really annoying resetting them every time I gave the talk, so I stuffed them all into a Docker image and then pushed that up to Docker Hub. Great. It was awesome. I didn't publicize it and talk about it, but I also just left it as an open repository there because what are you going to do? It's just a few directories in the route that have very specific contrived scenarios with Git, set up and ready to go.There's nothing sensitive there. And the thing is called, “Terrible Ideas.” And I just kept watching the download numbers continue to increment week over week, and I took it down because it's, I don't know what people are going to do with that. Like, you see something on there and it says, “Terrible Ideas.” For all I know, some bank is like, “And that's what we're running in production now.” So, who knows?But the idea o—not that there was necessarily anything wrong with that, but the fact that there's this theoretical possibility someone could use that or put the wrong string in if I give an example, and then wind up running something that is fairly compromisable in a serious environment was just something I didn't want to be a part of. And you see that again, and again, and again. This idea of what Docker unlocks is amazing, but there's such a tremendous risk to it. I mean, I've never understood 15 years ago, how you're going to go and spin up a Linux server on top of EC2 and just grab a community AMI and use that. It's yeah, I used to take provisioning hardware very seriously to make sure that I wasn't inadvertently using something compromised. Here, it's like, “Oh, just grab whatever seems plausible from the catalog and go ahead and run that.” But it feels like there's so much of that, turtles all the way down.Michael: Yeah. And I mean, even if you've looked at the Docker file, with all the dependencies of the things you download, it really gets to be difficult. So, I mean, to protect yourself, it really becomes about, like, you know, you can do the static scanning of it, looking for bad strings in it or bad version numbers for vulnerabilities, but it really comes down to runtime analysis. So, when you start to Docker container, you really need the tools to have visibility to what's going on in the container. That's the only real way to know if it's safe or not in the end because you can't eyeball it and really see all that, and there could be a binary assortment of layers, too, that'll get run and things like that.Corey: Hell is other people's workflows, as I'm sure everyone's experienced themselves, but one of mine has always been that if I'm doing something as a proof of concept to build it up on a developer box—and I do keep my developer environments for these sorts of things isolated—I will absolutely go and grab something that is plausible- looking from Docker Hub as I go down that process. But when it comes time to wind up putting it into a production environment, okay, now we're going to build our own resources. Yeah, I'm sure the Postgres container or whatever it is that you're using is probably fine, but just so I can sleep at night, I'm going to take the public Docker file they have, and I'm going to go ahead and build that myself. And I feel better about doing that rather than trusting some rando user out there and whatever it is that they've put up there. Which on the one hand feels like a somewhat responsible thing to do, but on the other, it feels like I'm only fooling myself because some rando putting things up there is kind of what the entire open-source world is, to a point.Michael: Yeah, that's very true. At some point, you have to trust some product or some foundation to have done the right thing. But what's also true about containers is they're attacked and use for attacks, but they're also used to conduct attacks quite a bit. And we saw a lot of that with the Russian-Ukrainian conflict this year. Containers were released that were preloaded with denial-of-service software that automatically collected target lists from, I think, GitHub they were hosted on.So, all a user to get involved had to do was really just get the container and run it. That's it. And now they're participating in this cyberwar kind of activity. And they could also use this to put on a botnet or if they compromise an organization, they could spin up at all these instances with that Docker container on it. And now that company is implicated in that cyber war. So, they can also be used for evil.Corey: This gets to the third point of your report: “Geopolitical conflict influences attacker behaviors.” Something that happened in the early days of the Russian invasion was that a bunch of open-source maintainers would wind up either disabling what their software did or subverting it into something actively harmful if it detected it was running in the Russian language and/or in a Russian timezone. And I understand the desire to do that, truly I do. I am no Russian apologist. Let's be clear.But the counterpoint to that as well is that, well, to make a reference I made earlier, Russia has children's hospitals, too, and you don't necessarily know the impact of fallout like that, not to mention that you have completely made it untenable to use anything you're doing for a regulated industry or anyone else who gets caught in that and discovers that is now in their production environment. It really sets a lot of stuff back. I've never been a believer in that particular form of vigilantism, for lack of a better term. I'm not sure that I have a better answer, let's be clear. I just, I always knew that, on some level, the risk of opening that Pandora's box were significant.Michael: Yeah. Even if you're doing it for the right reasons. It still erodes trust.Corey: Yeah.Michael: Especially it erodes trust throughout open-source. Like, not just the one project because you'll start thinking, “Oh, how many other projects might do this?” And—Corey: Wait, maybe those dirty hippies did something in our—like, I don't know, they've let those people anywhere near this operating system Linux thing that we use? I don't think they would have done that. Red Hat seems trustworthy and reliable. And it's yo, [laugh] someone needs to crack open a history book, on some level. It's a sticky situation.I do want to call out something here that it might be easy to get the wrong idea from the summary that we just gave. Very few things wind up raising my hackles quite like companies using tragedy to wind up shilling whatever it is they're trying to sell. And I'll admit when I first got this report, and I saw, “Oh, you're talking about geopolitical conflict, great.” I'm not super proud of this, but I was prepared to read you the riot act, more or less when I inevitably got to that. And I never did. Nothing in this entire report even hints in that direction.Michael: Was it you never got to it, or, uh—Corey: Oh, no. I've read the whole thing, let's be clear. You're not using that to sell things in the way that I was afraid you were. And simultaneously I want to say—I want to just point that out because that is laudable. At the same time, I am deeply and bitterly resentful that that even is laudable. That should be the common state.Capitalizing on tragedy is just not something that ever leaves any customer feeling good about one of their vendors, and you've stayed away from that. I just want to call that out is doing the right thing.Michael: Thank you. Yeah, it was actually a big topic about how we should broach this. But we have a good data point on right after it started, there was a huge spike in denial-of-service installs. And that we have a bunch of data collection technology, honeypots and other things, and we saw the day after cryptomining started going down and denial-of-service installs started going up. So, it was just interesting how that community changed their behaviors, at least for a time, to participate in whatever you want to call it, the hacktivism.Over time, though, it kind of has gone back to the norm where maybe they've gotten bored or something or, you know, run out of funds, but they're starting cryptomining again. But these events can cause big changes in the hacktivism community. And like I mentioned, it's very easy to get involved. We saw over 150,000 downloads of those pre-canned denial-of-service containers, so it's definitely something that a lot of people participated in.Corey: It's a truism that war drives innovation and different ways of thinking about things. It's a driver of progress, which says something deeply troubling about us. But it's also clear that it serves as a driver for change, even in this space, where we start to see different applications of things, we see different threat patterns start to emerge. And one thing I do want to call out here that I think often gets overlooked in the larger ecosystem and industry as a whole is, “Well, no one's going to bother to hack my nonsense. I don't have anything interesting for them to look at.”And it's, on some level, an awful lot of people running tools like this aren't sophisticated enough themselves to determine that. And combined with your first point in the report as well that, well, you have an AWS account, don't you? Congratulations. You suddenly have enormous piles of money—from their perspective—sitting there relatively unguarded. Yay. Security has now become everyone's problem, once again.Michael: Right. And it's just easier now. It means, it was always everyone's problem, but now it's even easier for attackers to leverage almost everybody. Like before, you had to get something on your PC. You had to download something. Now, your search of GitHub can find API keys, and then that's it, you know? Things like that will make it game over or your account gets compromised and big bills get run up. And yeah, it's very easy for all that to happen.Corey: Ugh. I do want to ask at some point, and I know you asked me not to do it, but I'm going to do it anyway because I have this sneaking suspicion that given that you've spent this much time on studying this problem space, that you probably, as a company, have some answers around how to address the pain that lives in these problems. What exactly, at a high level, is it that Sysdig does? Like, how would you describe that in an elevator without sabotaging the elevator for 45 minutes to explain it in depth to someone?Michael: So, I would describe it as threat detection and response for cloud containers and workloads in general. And all the other kind of acronyms for cloud, like CSPM, CIEM.Corey: They're inventing new and exciting acronyms all the time. And I honestly at this point, I want to have almost an acronym challenge of, “Is this a cybersecurity acronym or is it an audio cable? Which is it?” Because it winds up going down that path, super easily. I was at RSA walking the expo floor and I had I think 15 different companies I counted pitching XDR, without a single one bothering to explain what that meant. Okay, I guess it's just the thing we've all decided we need. It feels like security people selling to security people, on some level.Michael: I was a Gartner analyst.Corey: Yeah. Oh… that would do it then. Terrific. So, it's partially your fault, then?Michael: No. I was going to say, don't know what it means either.Corey: Yeah.Michael: So, I have no idea [laugh]. I couldn't tell you.Corey: I'm only half kidding when I say in many cases, from the vendor perspective, it seems like what it means is whatever it is they're trying to shoehorn the thing that they built into filling. It's kind of like observability. Observability means what we've been doing for ten years already, just repurposed to catch the next hype wave.Michael: Yeah. The only thing I really understand is: detection and response is a very clear detect things and respond to things. So, that's a lot of what we do.Corey: It's got to beat the default detection mechanism for an awful lot of companies who in years past have found out that they have gotten breached in the headline of The New York Times. Like it's always fun when that, “Wait, what? What? That's u—what? How did we not know this was coming?”It's when a third party tells you that you've been breached, it's never as positive—not that it's a positive experience anyway—than discovering yourself internally. And this stuff is complicated, the entire space is fraught, and it always feels like no matter how far you go, you could always go further, but left to its inevitable conclusion, you'll burn through the entire company budget purely on security without advancing the other things that company does.Michael: Yeah.Corey: It's a balance.Michael: It's tough because it's a lot to know in the security discipline, so you have to balance how much you're spending and how much your people actually know and can use the things you've spent money on.Corey: I really want to thank you for taking the time to go through the findings of the report for me. I had skimmed it before we spoke, but talking to you about this in significantly more depth, every time I start going to cite something from it, I find myself coming away more impressed. This is now actively going on my calendar to see what the 2023 version looks like. Congratulations, you've gotten me hooked. If people want to download a copy of the report for themselves, where should they go to do that?Michael: They could just go to sysdig.com/threatreport. There's no email blocking or gating, so you just download it.Corey: I'm sure someone in your marketing team is twitching at that. Like, why can't we wind up using this as a lead magnet? But ugh. I look at this and my default is, oh, wow, you definitely understand your target market. Because we all hate that stuff. Every mandatory field you put on those things makes it less likely I'm going to download something here. Click it and have a copy that's awesome.Michael: Yep. And thank you for having me. It's a lot of fun.Corey: No, thank you for coming. Thanks for taking so much time to go through this, and thanks for keeping it to the high road, which I did not expect to discover because no one ever seems to. Thanks again for your time. I really appreciate it.Michael: Thanks. Have a great day.Corey: Mike Clark, Director of Threat Research at Sysdig. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment pointing out that I didn't disclose the biggest security risk at all to your AWS bill, an AWS Solutions Architect who is working on commission.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

A Bootiful Podcast
Google mad scientist Josh Suereth on Observability with OpenTelemetry, building better build tools, and so much more

A Bootiful Podcast

Play Episode Listen Later Oct 13, 2022 92:59


Hi, Spring fans! In this installment Josh Long (@starbuxman) looks at the latest and greatest in Spring Boot 3 AOT, then talks to Google's Josh Suereth (@jsuereth) about observability with OpenTelemetry, building better build tools, and so much more.

SolarWinds TechPod
Hybrid IT...It's Trendy?

SolarWinds TechPod

Play Episode Listen Later Oct 13, 2022 43:18


The IT Trends report is released by SolarWinds every year with predictions on what will be the most popular trends in IT during the upcoming months. As we enter the second half of 2022, hosts Sean Sebring and Chris Bowie unpack the SolarWinds 2022 IT Trends report with Professor Dr. Sally Eaves. The three discuss findings from the report, getting hybrid IT right, managing complexity, Observability, and even... time travel? 

The New Stack Podcast
KubeCon+CloudNativeCon 2022 Rolls into Detroit

The New Stack Podcast

Play Episode Listen Later Oct 13, 2022 27:01


It's that time of the year again, when cloud native enthusiasts and professionals assemble to discuss all things Kubernetes. KubeCon+CloudNativeCon 2023 is being held later this month in Detroit, October 24-28. In this latest edition of The New Stack Makers podcast, we spoke with Priyanka Sharma, general manager of the Cloud Native Computing Foundation — which organizes KubeCon —and CERN computer engineer and KubeCon co-chair Ricardo Rocha. For this show, we discussed what we can expect from the upcoming event. This year, there will be a focus on Kubernetes in the enterprise, Sharma said. "We are reaching a point where Kubernetes is becoming the de facto standard when it comes to container orchestration. And there's a reason for it. It's not just about Kubernetes. Kubernetes spawned the cloud native ecosystem and the heart of the cloud native movement is building fast, resiliently observable software that meets customer needs. So ultimately, it's making you a better provider to your customers, no matter what kind of business you are." Of this year's topics, security will be a big theme, Rocha said. Technologies such as Falco and Cilium will be discussed. Linux ke