So you think Distributed Tracing is the new thing? Well, it's not! But it's never been as exciting as it is today! In this episode we combine 50 years of Distributed Tracing experience across our guests and hosts. We invited Christoph Neumueller and Thomas Rothschaedl, who have seen the early days of agent-based instrumentation, how global standards like the W3C Trace Context allowed tracing to connect large enterprise systems, and how OpenTelemetry is commoditizing data collection across all tech stacks. Tune in and learn about the difference between spans and traces, why collecting the data is only part of the story, how to combat the challenge of dealing with too much data, and how traces relate and connect to logs, metrics and events.
Links we discussed:
YouTube with Christoph: LINK WILL FOLLOW ONCE VIDEO IS POSTED
Christoph's LinkedIn: https://www.linkedin.com/in/christophneumueller/
Thomas's LinkedIn: https://www.linkedin.com/in/rothschaedl/
In episode 79 of o11ycast, Hamel Husain joins the o11ycast crew to discuss the challenges of monitoring AI systems, why off-the-shelf metrics can be misleading, and how error analysis is the key to making AI models more reliable. Plus, insights into how Honeycomb built its Query Assistant and what teams should prioritize when working with AI observability.
Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B's new workshop and RSVP here!

We're happy to announce that today's guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left.

If you're a Python developer, it's very likely that you've heard of Pydantic. Every month, it's downloaded >300,000,000 times, making it one of the top 25 PyPI packages. OpenAI uses it in its SDK for structured outputs, it's at the core of FastAPI, and if you've followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”.

Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project into a full-stack AI engineer platform, with Logfire, their observability platform, and PydanticAI, their new agent framework.

Logfire: bringing OTel to AI

OpenTelemetry recently merged Semantic Conventions for LLM workloads, which provides standard definitions to track performance, like gen_ai.server.time_per_output_token. In Sam's view, at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platforms got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps. If you're interested in the technical details, Logfire migrated away from ClickHouse to DataFusion for their backend. We spent some time on the importance of picking open source tools you understand and that you can actually contribute to upstream, rather than the more popular ones; listen in at ~43:19 for that part.

Agents are the killer app for graphs

Pydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it's very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc.

They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed.

“We were compelled enough by graphs once we got them right that our agent implementation [...]
is now actually a graph under the hood.”

Why Graphs?

* More natural for complex or multi-step AI workflows.
* Easy to visualize and debug with mermaid diagrams.
* Potential for distributed runs, or “waiting days” between steps in certain flows.

In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously.

Full Video Episode

Like and subscribe!

Chapters

* 00:00:00 Introductions
* 00:00:24 Origins of Pydantic
* 00:05:28 Pydantic's AI moment
* 00:08:05 Why build a new agents framework?
* 00:10:17 Overview of Pydantic AI
* 00:12:33 Becoming a believer in graphs
* 00:24:02 God Model vs Compound AI Systems
* 00:28:13 Why not build an LLM gateway?
* 00:31:39 Programmatic testing vs live evals
* 00:35:51 Using OpenTelemetry for AI traces
* 00:43:19 Why they don't use ClickHouse
* 00:48:34 Competing in the observability space
* 00:50:41 Licensing decisions for Pydantic and Logfire
* 00:51:48 Building Pydantic.run
* 00:55:24 Marimo and the future of Jupyter notebooks
* 00:57:44 London's AI scene

Show Notes

* Sam Colvin
* Pydantic
* Pydantic AI
* Logfire
* Pydantic.run
* Zod
* E2B
* Arize
* Langsmith
* Marimo
* Prefect
* GLA (Google Generative Language API)
* OpenTelemetry
* Jason Liu
* Sebastian Ramirez
* Bogomil Balkansky
* Hood Chatham
* Jeremy Howard
* Andrew Lamb

Transcript

Alessio [00:00:03]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:12]: Good morning. And today we're very excited to have Sam Colvin join us from Pydantic AI. Welcome. Sam, I heard that Pydantic is all we need. Is that true?

Samuel [00:00:24]: I would say you might need Pydantic AI and Logfire as well, but it gets you a long way, that's for sure.

Swyx [00:00:29]: Pydantic almost basically needs no introduction. It's almost 300 million downloads in December. And obviously, in the previous podcasts and discussions we've had with Jason Liu, he's been a big fan and promoter of Pydantic and AI.

Samuel [00:00:45]: Yeah, it's weird, because obviously I didn't create Pydantic originally for uses in AI; it predates LLMs. But it's like we've been lucky that it's been picked up by that community and used so widely.

Swyx [00:00:58]: Actually, maybe we'll hear it right from you: what is Pydantic, and maybe a little bit of the origin story?

Samuel [00:01:04]: The best name for it, which is not quite right, is a validation library. And we get some tension around that name because it doesn't just do validation, it will do coercion by default. We now have strict mode, so you can disable that coercion. But by default, if you say you want an integer field and you get in a string of 1, 2, 3, it will convert it to 123, and a bunch of other sensible conversions. And as you can imagine, the semantics around it, exactly when you convert and when you don't, are complicated, but because of that, it's more than just validation. Back in 2017, when I first started it, the different thing it was doing was using type hints to define your schema. That was controversial at the time. It was genuinely disapproved of by some people. I think the success of Pydantic and libraries like FastAPI that build on top of it means that today that's no longer controversial in Python. And indeed, lots of other people have copied that route. But yeah, it's a data validation library. It uses type hints for the most part and obviously does all the other stuff you want, like serialization on top of that. But yeah, that's the core.
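To make that concrete, here is a minimal sketch of the coercion, strict-mode, and JSON-schema behaviour Samuel describes, using the Pydantic v2 API; the model and field names are illustrative.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Order(BaseModel):
    # Type hints define the schema, as discussed above.
    id: int
    note: str = ""


# Lax mode (the default): the string "123" is coerced to the int 123.
print(Order(id="123").id)  # -> 123

# Pydantic also generates JSON schema, one source of truth for tools.
print(Order.model_json_schema()["properties"])


class StrictOrder(BaseModel):
    # Strict mode disables coercion entirely.
    model_config = ConfigDict(strict=True)
    id: int


try:
    StrictOrder(id="123")  # same input now fails validation
except ValidationError as exc:
    print(exc.error_count())  # -> 1
```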
Alessio [00:02:06]: Do you have any fun stories on how JSON schemas ended up being kind of like the structured-output standard for LLMs? And were you involved in any of these discussions? Because I know OpenAI was, you know, one of the early adopters. So did they reach out to you? Was there kind of like a structured-output conversation in open source that people were having, or was it just random?

Samuel [00:02:26]: No, very much not. So I originally didn't implement JSON schema inside Pydantic, and then Sebastian, Sebastian Ramirez of FastAPI, came along, and like the first I ever heard of him was over a weekend when I got like 50 emails from him as he was committing to Pydantic, adding JSON schema long pre version one. So the reason it was added was for OpenAPI, which is obviously closely akin to JSON schema. And then, yeah, I don't know why it was JSON schema that got picked up and used by OpenAI. It was obviously very convenient for us, because it meant that not only can you do the validation, but because Pydantic will generate you the JSON schema, it can be one source of truth for structured outputs and tools.

Swyx [00:03:09]: Before we dive in further on the AI side of things, something I'm mildly curious about: obviously, there's Zod in JavaScript land. Every now and then there is a new sort of in-vogue validation library that takes over for quite a few years, and then maybe something else comes along. Is Pydantic done, like the core of Pydantic?

Samuel [00:03:30]: I've just come off a call where we were redesigning some of the internal bits. There will be a v3 at some point, which will not break people's code half as much as v2 — as in, v2 was the massive rewrite into Rust, but also fixing all the stuff that was broken back from like version zero point something that we didn't fix in v1 because it was a side project. We have plans to move to basically storing the data in Rust types after validation. Not completely — we're still working to design the Pythonic version of it, in order for it to be able to convert into Python types. So then if you were doing validation and then serialization, you would never have to go via a Python type. We reckon that can give us another three to five times speed-up. That's probably the biggest thing. Also, changing how easy it is to basically extend Pydantic and define how particular types, like for example NumPy arrays, are validated and serialized. But there's also other stuff going on. For example, jiter, the JSON library in Rust that does the JSON parsing, has a SIMD implementation at the moment only for AMD64. So we can add that; we need to go and add SIMD for other instruction sets. So there's a bunch more we can do on performance. I don't think we're going to go and revolutionize Pydantic, but it's going to continue to get faster and continue, hopefully, to allow people to do more advanced things. We might add a binary format like CBOR for serialization, for when you just want to put the data into a database and probably load it again with Pydantic. So there are some things that will come along, but for the most part, it should just get faster and cleaner.

Alessio [00:05:04]: From a focus perspective, I guess, as a founder too, how did you think about the AI interest rising?
And then how do you kind of prioritize, okay, this is worth going into more, and we'll talk about Pydantic AI and all of that. What was maybe your early experience with LLMs, and when did you figure out, okay, this is something we should take seriously and focus more resources on?

Samuel [00:05:28]: I'll answer that, but I'll answer what I think is a kind of parallel question first, which is: Pydantic's weird, because Pydantic existed, obviously, before I was starting a company. I was working on it in my spare time, and then at the beginning of '22 I started working on the rewrite in Rust. And I worked on it full-time for a year and a half, and then once we started the company, people came and joined. And it was a weird project, because that would never get signed off inside a startup. Like, "we're going to go off and three engineers are going to work full-on for a year in Python and Rust, writing like 30,000 lines of Rust, just to release a free, open-source Python library." The result of that has been excellent for us as a company, right? As in, it's made us remain entirely relevant. And it's like, Pydantic is not just used in the SDKs of all of the AI libraries, but — I can't say which one, but one of the big foundational model companies, when they upgraded from Pydantic v1 to v2, their number one internal model performance metric, time to first token, went down by 20%. So you think about all of the actual AI going on inside, and yet at least 20% of the CPU, or at least of the latency, inside requests was actually Pydantic, which shows how widely it's used. So we've benefited from doing that work, although it would never have made financial sense in most companies. In answer to your question about how we prioritize AI: the honest truth is we've spent a lot of the last year and a half building good general-purpose observability inside Logfire and making Pydantic good for general-purpose use cases. And the AI has kind of come to us. Not that we want to get away from it, but the appetite, both in Pydantic and in Logfire, to go and build with AI is enormous, because it kind of makes sense, right? If you're starting a new greenfield project in Python today, what's the chance that you're using GenAI? 80%, let's say, globally — obviously it's like a hundred percent in California, but even worldwide, it's probably 80%. And so everyone needs that stuff. And there's so much yet to be figured out, so much space to do things better in the ecosystem, in a way that going and implementing a database that's better than Postgres is like a Sisyphean task, whereas building tools that are better for GenAI than some of the stuff that's about now is not very difficult — putting the actual models themselves to one side.

Alessio [00:07:40]: And then at the same time, you released Pydantic AI recently, which is, you know, an agent framework. And early on, I would say everybody — you know, Langchain and the like — gave Pydantic kind of first-class support; a lot of these frameworks were trying to use you to be better. What was the decision behind "we should do our own framework"? Were there any design decisions that you disagreed with, any workloads that you think people didn't support well?

Samuel [00:08:05]: It wasn't so much design and workflow, although I think there were some things we've done differently. Yeah.
I think, looking in general at the ecosystem of agent frameworks, the engineering quality is far below that of the rest of the Python ecosystem. There's a bunch of stuff that we have learned how to do over the last 20 years of building Python libraries and writing Python code that seems to be abandoned by people when they build agent frameworks. Now I can kind of respect that, particularly in the very first agent frameworks, like Langchain, where they were literally figuring out how to go and do this stuff. It's completely understandable that you would basically skip some stuff.

Samuel [00:08:42]: I'm shocked by the quality of some of the agent frameworks that have come out recently from well-respected names, which just seems to be opportunism, and I have little time for that. But the early ones — I think they were just figuring out how to do stuff, and just as lots of people have learned from Pydantic, we were able to learn a bit from them. I think the gap we saw, and the thing we were frustrated by, was the production readiness. And that means things like type checking, even if type checking makes it hard. Pydantic AI, I will put my hand up now and say, has a lot of generics, and it's probably easier to use it if you've written a bit of Rust and you really understand generics. We're not claiming that makes it the easiest thing to use in all cases; we think it makes it good for production applications in big systems where type checking is a no-brainer in Python. But there are also a bunch of things we've learned from maintaining Pydantic over the years that we've gone and done. So every single example in Pydantic AI's documentation is run in Python as part of the tests, and every single print output within an example is checked during tests. So it will always be up to date. And then a bunch of things that, like I say, are standard best practice within the rest of the Python ecosystem but are surprisingly not followed by some AI libraries: coverage, linting, type checking, et cetera, et cetera. I think these are no-brainers, but weirdly they're not followed by some of the other libraries.

Alessio [00:10:04]: And can you just give an overview of the framework itself? I think there's kind of like the LLM-calling frameworks, there are the multi-agent frameworks, there's the workflow frameworks. Like, what does Pydantic AI do?

Samuel [00:10:17]: I glaze over a bit when I hear all of the different sorts of frameworks. But I will tell you: when I built Pydantic, when I built Logfire, and when I built Pydantic AI, my methodology was not to go and research and review all of the other things. I kind of work out what I want and I go and build it, and then feedback comes and we adjust. So the fundamental building block of Pydantic AI is agents. The exact definition of agents and how you want to define them is obviously ambiguous, and our things are probably sort of agent-lite — not that we would want to go and rename them to agent-lite — but the point is you probably put them together to build something that most people would call an agent. So an agent in our case has, you know, things like a prompt — a system prompt — and some tools, and a structured return type if you want it. That covers the vast majority of cases.
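A hedged sketch of that agent container, based on the Pydantic AI API as described in this conversation and the docs of the time; the model string, tool, and result type below are illustrative, and parameter names like result_type may have shifted in later releases.

```python
from pydantic import BaseModel
from pydantic_ai import Agent


class Answer(BaseModel):
    city: str
    confidence: float


# An agent bundles a model, a system prompt, tools, and a structured
# return type, as Samuel describes.
agent = Agent(
    "openai:gpt-4o",
    system_prompt="Answer geography questions briefly.",
    result_type=Answer,
)


@agent.tool_plain
def country_capital(country: str) -> str:
    """Toy lookup tool the model can call (illustrative)."""
    return {"France": "Paris", "Japan": "Tokyo"}.get(country, "unknown")


result = agent.run_sync("What is the capital of France?")
print(result.data)  # validated Answer instance (assumed result shape)
```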
There are situations where you want to go further, and the most complex workflows are where you want graphs. And I resisted graphs for quite a while. I was sort of of the opinion you didn't need them and you could use standard Python flow control to do all of that stuff. I had a few arguments with people, but I basically came around to: yeah, I can totally see why graphs are useful. But then we have the problem that by default they're not type safe, because if you have an add-edge method where you give the names of two different edges, there's no type checking, right? Even if you go and do some — and not all the graph libraries are AI-specific; there's a graph library called... which does like a basic runtime type checking, ironically using Pydantic, to try and make up for the fact that fundamentally graphs are not type safe. Well, I like Pydantic, but that's not a real solution, to have to go and run the code to see if it's safe. There's a reason that static type checking is so powerful. And so, from a lot of iteration, we eventually came up with a system of using normally dataclasses to define nodes, where you return the next node you want to call, and where we're able to go and introspect the return type of a node to basically build the graph. And so the graph is, yeah, inherently type safe. And once we got that right, I'm incredibly excited about graphs. I think there are masses of use cases for them, both in GenAI and other development. But also, software's all going to have to interact with GenAI, right? It's going to be like the web. There'll no longer be a web department in a company; it's just that all the developers are building for the web, building with databases. The same is going to be true for GenAI.

Alessio [00:12:33]: Yeah. I see on your docs, you call an agent a container that contains a system prompt, function tools, structured result, dependency type, model, and then model settings. Are the graphs, in your mind, different agents? Are they different prompts for the same agent? What are the structures in your mind?

Samuel [00:12:52]: So we were compelled enough by graphs once we got them right that we actually merged the PR this morning. That means our agent implementation, without changing its API at all, is now actually a graph under the hood, as it is built using our graph library. So graphs are basically a lower-level tool that allows you to build these complex workflows. Our agents are technically one of the many graphs you could go and build. And we just happened to build that one for you because it's a very common, commonplace one. But obviously there are cases where you need more complex workflows, where the current agent assumptions don't work. And that's where you can then go and use graphs to build more complex things.

Swyx [00:13:29]: You said you were cynical about graphs. What changed your mind specifically?

Samuel [00:13:33]: I guess people kept giving me examples of things that they wanted to use graphs for. And my "yeah, but you could do that in standard flow control in Python" became a less and less compelling argument to me, because I've maintained those systems that end up with spaghetti code. And I could see the appeal of this structured way of defining the workflow of my code. And it's really neat that just from your code, just from your type hints, you can get out a mermaid diagram that defines exactly what can go and happen.

Swyx [00:14:00]: Right. Yeah. You do have a very neat implementation of sort of inferring the graph from type hints, I guess, is what I would call it. I think the question always is — I have gone back and forth. I used to work at Temporal, where we would actually spend a lot of time complaining about graph-based workflow solutions like AWS Step Functions. And we would actually say that we were better because you could use normal control flow that you already knew and worked with. Yours, I guess, is like a little bit of a nice compromise. It looks like normal Pythonic code, but you just have to keep in mind what the type hints actually mean, and that's what we do with the quote-unquote magic that the graph construction does.
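A hedged sketch of that type-hint-driven graph construction with pydantic_graph; the node names are invented, and details like the result shape and the mermaid helper are assumptions based on this discussion.

```python
from __future__ import annotations
from dataclasses import dataclass

from pydantic_graph import BaseNode, End, Graph, GraphRunContext


@dataclass
class Draft(BaseNode):
    topic: str

    async def run(self, ctx: GraphRunContext) -> Review:
        # The return annotation, not an add_edge() call, defines this edge.
        return Review(text=f"A draft about {self.topic}")


@dataclass
class Review(BaseNode[None, None, str]):
    text: str

    async def run(self, ctx: GraphRunContext) -> End[str]:
        return End(self.text.upper())


# The graph's shape is introspected from the type hints above, so a
# wrong edge is a static type-checking error, not a runtime surprise.
graph = Graph(nodes=(Draft, Review))
result = graph.run_sync(Draft(topic="graphs"))  # assumed result shape
print(graph.mermaid_code(start_node=Draft))     # type hints -> mermaid
```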
Samuel [00:14:42]: Yeah, exactly. And if you look at the internal logic of actually running a graph, it's incredibly simple. It's basically: call a node, get a node back, call that node, get a node back, call that node. If you get an End, you're done. We will soon add support for, well, basically storage, so that you can store the state between each node that's run. And then the idea is you can then distribute the graph and run it across computers. And also — the other bit that's really valuable — across time. Because it's all very well if you look at lots of the graph examples that, say, Claude will give you. If it gives you an example, it gives you this lovely enormous mermaid chart of the workflow for, for example, managing returns if you're an e-commerce company. But what you realize is that some of those lines are literally one function calling another function, and some of those lines are "wait six days for the customer to print their piece of paper and put it in the post." And if you're writing your demo project or your proof of concept, that's fine, because you can just say, "and now we call this function." But when you're building in real life, that doesn't work. And now, how do we manage that concept, to basically be able to start somewhere else in our code? Well, this graph implementation makes it incredibly easy, because you just pass the node that is the start point for carrying on the graph, and it continues to run. So it's things like that where I was like, yeah, I can just imagine how things I've done in the past would be fundamentally easier to understand if we had done them with graphs.

Swyx [00:16:07]: You say imagine, but can Pydantic AI actually resume, you know, six days later, like you said? Or is this just a theoretical thing we'll get someday?

Samuel [00:16:16]: I think it's basically Q&A. So there's an AI that's asking the user a question, and effectively you then call the CLI again to continue the conversation. And it basically instantiates the node and calls the graph with that node again. Now, we don't have the logic yet for effectively storing state in the database between individual nodes — that we're going to add soon. But the rest of it is basically there.

Swyx [00:16:37]: It does make me think that not only are you competing with Langchain now, and obviously Instructor; now you're going into sort of the more orchestrated things, like Airflow, Prefect, Dagster, those guys.

Samuel [00:16:52]: Yeah, I mean, we're good friends with the Prefect guys, and Temporal have the same investors as us. And I'm sure that my investor Bogomil would not be too happy if I was like, oh yeah, by the way, as well as trying to take on Datadog, we're also going off and trying to take on Temporal and everyone else doing that. Obviously, we're not doing all of the infrastructure of deploying that right yet, at least.
We're, you know, we're just building a Python library. And what's crazy about our graph implementation is, sure, there's a bit of magic in introspecting the return type, you know, extracting things from unions, stuff like that. But the actual calls, as I say, are literally "call a function, get back a thing, and call that." It's incredibly simple and therefore easy to maintain. The question is: how useful is it? Well, I don't know yet. I think we have to go and find out. We've had a slew of people joining our Slack over the last few days and saying, "tell me how good Pydantic AI is; how good is Pydantic AI versus Langchain?" And I refuse to answer. That's your job to go and find out, not mine. We built a thing. I'm compelled by it, but I'm obviously biased. The ecosystem will work out what the useful tools are.

Swyx [00:17:52]: Bogomil was my board member when I was at Temporal. And I think, just generally, also having been a workflow engine investor and participant in this space: it's a big space. Everyone needs different functions. The one thing that I would say about yours is, as a library, you don't have that much control over the infrastructure. I do like the idea that each new agent, or whatever unit of work you call it, should spin up within its own sort of isolated boundary. Whereas in yours, I think, everything runs in the same process. But you ideally want each one to sort of spin out its own little container of things.

Samuel [00:18:30]: I agree with you a hundred percent. And we will. It would work now, right? As in theory: as long as you can serialize the calls to the next node, all of the different containers basically just have to have the same code. I mean, I'm super excited about Cloudflare Workers running Python and being able to install dependencies. And if Cloudflare could only give me my invitation to the private beta of that, we would be exploring it right now, because I'm super excited about that as a compute level for some of this stuff — exactly what you're saying, basically: you can run everything as an individual worker function and distribute it. And it's resilient to failure, et cetera, et cetera.

Swyx [00:19:08]: And it spins up like a thousand instances simultaneously. You know, you want it to be sort of truly serverless at once. Actually, I know we have some Cloudflare friends who are listening, so hopefully they'll get you to the front of the line.

Samuel [00:19:19]: Especially since I was in Cloudflare's office last week shouting at them about other things that frustrate me. I have a love-hate relationship with Cloudflare. Their tech is awesome, but because I use it the whole time, I then get frustrated. So, yeah, I'm sure I will get there soon.

Swyx [00:19:32]: There's a side tangent on Cloudflare. Is Python supported in full? I actually wasn't fully aware of what the status of that is.

Samuel [00:19:39]: Yeah. So Pyodide, which is Python running inside the browser, is supported now by Cloudflare. They're having some struggles working out how to manage, ironically, dependencies that have binaries — in particular, Pydantic. Because with these workers, where you can have thousands of them on a given metal machine, you don't want to have a difference; you basically want to be able to have shared memory for all the different Pydantic installations, effectively. That's the thing they're working out.
But Hood, who's my friend and the primary maintainer of Pyodide, works for Cloudflare, and that's basically what he's doing: working out how to get Python running on Cloudflare's network.

Swyx [00:20:19]: I mean, the nice thing is that your binary is really written in Rust, right? Which also compiles to WebAssembly. So maybe there's a way that you'd build — you'd have just a different build of Pydantic, and that ships with whatever your distro for Cloudflare Workers is.

Samuel [00:20:36]: Yes, that's exactly it. So Pyodide has builds for Pydantic Core and for things like NumPy and basically all of the popular binary libraries. And it's doing exactly that, right? It's using Rust compiled to WebAssembly, and then you're calling that shared library from Python. And it's unbelievably complicated, but it works. Okay.

Swyx [00:20:57]: Staying on graphs a little bit more, and then I wanted to go to some of the other features that you have in Pydantic AI. I see in your docs there are sort of four levels of agents. There's single agents, there's agent delegation, programmatic agent hand-off — that seems to be what OpenAI Swarm would be like — and then the last one, graph-based control flow. Would you say that those are sort of the mental hierarchy of how these things go?

Samuel [00:21:21]: Yeah, roughly. Okay.

Swyx [00:21:22]: You had some expression around OpenAI Swarm.

Samuel [00:21:25]: Well, and indeed, OpenAI have got in touch with me and — maybe I'm not supposed to say this — basically said that Pydantic AI looks like what Swarm would become if it was production ready. So, yeah, which makes sense. In fact, it was specifically wanting to give people the same feeling that they were getting from Swarm that led us to go and implement graphs. Because my "just call the next agent with Python code" was not a satisfactory answer to people. So it was like, okay, we've got to go and have a better answer for that. That's what led us to graphs. Yeah.

Swyx [00:21:56]: I mean, it's a minimal viable graph in some sense. What are the shapes of graphs that people should know? So the way that I would phrase this is: I think Anthropic did a very good public service, and a kind of surprisingly influential blog post, I would say, when they wrote Building Effective Agents. We actually have the authors coming to speak at my conference in New York, which I think you're giving a workshop at.

Samuel [00:22:24]: I'm trying to work it out. But yes, I think so.

Swyx [00:22:26]: Tell me if you're not. Yeah, I mean, that was the first, I think, authoritative view of what kinds of graphs exist in agents, and of "let's give each of them a name so that everyone is on the same page." So I'm just kind of curious if you have community names or top-five patterns of graphs.

Samuel [00:22:44]: I don't have top-five patterns of graphs. I would love to see what people are building with them. But it's only been a couple of weeks. And part of the point is that because they're relatively unopinionated about what you can go and do with them — you can go and do lots of things with them — they don't have the structure to go and have specific names, as much as perhaps some other systems do.
I think what our agents are — which have a name, and I can't remember what it is — is basically this system of: decide what tool to call, go back to the center, decide what tool to call, go back to the center, and then exit. That's one form of graph which, as I say, our agents are effectively one implementation of, which is why under the hood they are now using graphs. And it'll be interesting to see over the next few years whether we end up with these predefined graph names or graph structures, or whether it's just "yep, I built a graph," or whether graphs just turn out not to match people's mental image of what they want and die away. We'll see.

Swyx [00:23:38]: I think there is always appeal. Every developer eventually gets graph religion and goes, oh yeah, everything's a graph. And then they probably over-rotate and go too far into graphs. And then they have to learn a whole bunch of DSLs. And then they're like, actually, I didn't need that, I need this. And they scale back a little bit.

Samuel [00:23:55]: I'm at the beginning of that process. I'm currently a graph maximalist, although I haven't actually put any into production yet. But yeah.

Swyx [00:24:02]: This has a lot of philosophical connections with other work coming out of UC Berkeley on compound AI systems. I don't know if you know of or care about that. This is the Gartner world of things, where they need some kind of industry terminology to sell it to enterprises. I don't know if you know about any of that.

Samuel [00:24:24]: I haven't. I probably should. I should probably do it because I should probably get better at selling to enterprises. But no, not right now.

Swyx [00:24:29]: The argument is really that instead of putting everything in one model, you have more control, and maybe more observability, if you break everything out into composing little models and chaining them together. And obviously, then you need an orchestration framework to do that. Yeah.

Samuel [00:24:47]: And it makes complete sense. And one of the things we've seen with agents is that they work well when they work well. But when they don't — even if you have the observability through Logfire, so that you can see what was going on — if you don't have a nice hook point to say "hang on, this has all gone wrong," you have a relatively blunt instrument of basically erroring when you exceed some kind of limit. What you need to be able to do is effectively iterate through these runs, so that you can have your own control flow, where you're like, okay, we've gone too far. And that's where one of the neat things about our graph implementation comes in: you can basically call next in a loop, rather than just running the full graph. And therefore you have this opportunity to break out of it. But yeah, basically, it's the same point, which is: if you have too big a unit of work — to some extent whether or not it involves GenAI, but it's obviously particularly problematic in GenAI — you only find out afterwards, when you've spent quite a lot of time and/or money, that it's gone off and done the wrong thing.
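A hedged sketch of that escape hatch; the graph.next stepping call below is an assumed API based on Samuel's description, not a confirmed signature.

```python
from pydantic_graph import End


async def run_with_budget(graph, start_node, max_steps: int = 20):
    """Step a graph node-by-node instead of running it to completion,
    so application code can bail out before spending more time/money.
    `graph.next(...)` is an assumed stepping API, not confirmed."""
    node = start_node
    for _ in range(max_steps):
        node = await graph.next(node)  # run one node, get the next back
        if isinstance(node, End):
            return node.data           # the graph finished within budget
    raise RuntimeError("graph run exceeded its step budget")
```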
Swyx [00:25:39]: Oh, one last thing on this. We're not going to resolve it here, but I'll drop this and then we can move on to the next thing. This is the common way that we developers talk about this. And then the machine learning researchers look at us and laugh and say, "that's cute." And then they just train a bigger model and they wipe us out in the next training run. So I think there's a certain amount of: we are fighting the bitter lesson here. We're fighting AGI. And, you know, when AGI arrives, this will all go away. Obviously, on Latent Space, we don't really discuss that, because I think AGI is kind of this hand-wavy concept that isn't super relevant. But I think we have to respect it. For example, you could do a chain of thought with graphs, and you could manually orchestrate a nice little graph that does: reflect, think about whether you need more inference-time compute — you know, that's the hot term now — and then think again and, you know, scale that up. Or you could train Strawberry and DeepSeek R1. Right?

Samuel [00:26:32]: I saw someone saying recently, oh, they were really optimistic about agents because models are getting faster exponentially. And it took a certain amount of self-control not to point out that it wasn't exponential. But my main point was: if models are getting faster as quickly as you say they are, then we don't need agents, and we don't really need any of these abstraction layers. We can just give our model, you know, access to the Internet, cross our fingers, and hope for the best. Agents, agent frameworks, graphs — all of this stuff is basically making up for the fact that right now the models are not that clever. In the same way that if you're running a customer service business and you have loads of people sitting answering telephones: the less well trained they are, the less you trust them, the more you need to give them a script to go through. You know, if you're running a bank and you have lots of customer service people who you don't trust that much, then you tell them exactly what to say. If you're doing high-net-worth banking, you just employ people who you think are going to be charming to other rich people and send them off to go and have coffee with people. Right? And the same is true of models. The more intelligent they are, the less we need to structure what they go and do and constrain the routes which they take.

Swyx [00:27:42]: Yeah. Agree with that. So I'm happy to move on. So, the other parts of Pydantic AI that are worth commenting on — and this is like my last rant, I promise. Obviously, every framework needs to do its sort of model adapter layer, which is, oh, you can easily swap from OpenAI to Claude to Grok. You also have, which I didn't know about until I saw it in your docs, Google GLA, which is the Generative Language API. I assume that's AI Studio?

Samuel [00:28:13]: Yes. Google don't have good names for it. So Vertex is very clear. That seems to be the API that some of the things use, although it returns 503 about 20% of the time. So... Vertex? No. Vertex, fine. But the... Oh, oh. GLA. Yeah. Yeah.

Swyx [00:28:28]: I agree with that.

Samuel [00:28:29]: So we have, again, another example of where I think we go the extra mile in terms of engineering: on every commit — at least every commit to main — we run tests against the live models. Not lots of tests, but a handful of them. And we had a point last week where, yeah, GLA is a little bit better now, but GLA was failing every single run; one of its tests would fail. And I think we might even have commented that one out at the moment. So all of the models fail more often than you might expect, but that one seems to be particularly likely to fail.
But Vertex is the same API, and much more reliable.

Swyx [00:29:01]: My rant here is that, you know, versions of this appear in Langchain, and every single framework has to have its own little version of that. I would put it to you — and, you know, this can be agree-to-disagree — that this is not needed in Pydantic AI. I would much rather you adopt a layer like LiteLLM or — what's the other one in JavaScript — Portkey. That's their job. They focus on that one thing and they normalize APIs for you. All new models are automatically added, and you don't have to duplicate this inside of your framework. So for example, if I wanted to use DeepSeek, I'm out of luck, because Pydantic AI doesn't have DeepSeek yet.

Samuel [00:29:38]: Yeah, it does.

Swyx [00:29:39]: Oh, it does. Okay, I'm sorry. But you know what I mean? Should this live in your code, or should it live in a layer that's kind of your API gateway — a defined piece of infrastructure that people have?

Samuel [00:29:49]: I think if a company who are well known, who are respected by everyone, had come along and done this at the right time — maybe we should have done it a year and a half ago — and said "we're going to be the universal AI layer," that would have been a credible thing to do. I've heard varying reports of LiteLLM, is the truth. And it didn't seem to have exactly the type safety that we needed. Also, as I understand it — and again, I haven't looked into it in great detail — part of their business model is proxying the request through their own system to do the generalization. That would be an enormous put-off to an awful lot of people. Honestly, the truth is I don't think it is that much work unifying the models. I get where you're coming from; I kind of see your point. I think the truth is that everyone is centralizing around OpenAI's API — OpenAI's API is the one to do. So DeepSeek supports that, Grok supports that, Ollama also does it. I mean, if there is that universal library right now, it's more or less the OpenAI SDK. And it's very high quality. It's well type-checked. It uses Pydantic, so I'm biased. But I mean, I think it's pretty well respected anyway.

Swyx [00:30:57]: There are different ways to do this. Because also, it's not just about normalizing the APIs. You have to do secret management and all that stuff.

Samuel [00:31:05]: Yeah. And there are also Vertex and Bedrock, which, to one extent or another, effectively host multiple models, but they don't unify the API. They do unify the auth, as I understand it — although we're halfway through doing Bedrock, so I don't know it that well. But they're kind of weird hybrids, because they support multiple models, but, like I say, the auth is centralized.

Swyx [00:31:28]: Yeah, I'm surprised they don't unify the API. That seems like something that I would do. You know, we can discuss all this all day. There are a lot of APIs. I agree.

Samuel [00:31:36]: It would be nice if there was a universal one that we didn't have to go and build.
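As a hedged illustration of the adapter layer being debated here, and of the TestModel mocking that comes up next: in Pydantic AI the provider is selected via a model string, so swapping is a one-line change. The model identifiers below are illustrative.

```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

# Same agent code, different providers; only the model string changes.
agent = Agent("openai:gpt-4o", system_prompt="Be terse.")
# agent = Agent("google-gla:gemini-1.5-flash", ...)   # the GLA backend
# agent = Agent("groq:llama-3.3-70b-versatile", ...)  # another provider

# For unit tests, a stand-in model avoids real (flaky, paid) API calls.
with agent.override(model=TestModel()):
    result = agent.run_sync("ping")  # no network traffic in this block
```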
Alessio [00:31:39]: And I guess the other side of, you know, routing models and picking models is evals. How do you actually figure out which one you should be using? I know you have good support here. First of all, you have very good support for mocking in unit tests, which is something that a lot of other frameworks don't do. You know, my favorite Ruby library is VCR, because it just lets me store the HTTP requests and replay them. That part I'll kind of skip. I think you have this TestModel, where, just through Python, you try and figure out what the model might respond without actually calling the model. And then you have the FunctionModel, where people can kind of customize outputs. Any other fun stories maybe from there? Or is it just what-you-see-is-what-you-get, so to speak?

Samuel [00:32:18]: On those two, I think what you see is what you get. On the evals, I think: watch this space. It's something that, again, I was somewhat cynical about for some time, and I still have my cynicism about some of it. Well, it's unfortunate that so many different things are called evals; it would be nice if we could agree what they are and what they're not. But look, I think it's a really important space. I think it's something that we're going to be working on soon, both in Pydantic AI and in Logfire, to try and support better, because it's an unsolved problem.

Alessio [00:32:45]: Yeah, you do say in your docs that anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.

Samuel [00:32:52]: We'll delete that sentence when we tell people how to do their evals.

Alessio [00:32:56]: Exactly. I was like, we need a snapshot of this today. And so let's talk about evals. There are kind of the vibe evals, which is what you do when you're building, right? Because you cannot really test it that many times to get statistical significance. And then there are the production evals. So you also have Logfire, which is kind of like your observability product, which I tried before — it's very nice. What are some of the learnings you've had from building an observability tool for LLMs? And, as people think about evals, what are the right things to measure? What is the right number of samples that you need to actually start making decisions?

Samuel [00:33:33]: I'm not the best person to answer that, is the truth. So I'm not going to come in here and tell you that I think I know the answer on the exact number. I mean, we can do some back-of-the-envelope statistics calculations to work out that having 30 examples probably gets you most of the statistical value of having 200, for, by definition, 15% of the work. But the exact "how many examples do you need" question is much harder to answer, because it's deep within how models operate. In terms of Logfire: one of the reasons we built Logfire the way we have — we allow you to write SQL directly against your data, and we're trying to build the powerful fundamentals of observability — is precisely because we know we don't know the answers. And so allowing people to go and innovate on how they consume that stuff and how they process it is, we think, valuable. Because even if we come along and offer you an evals framework on top of Logfire, it won't be right in all regards. And we want people to be able to go and innovate; being able to write their own SQL, connected to the API, and effectively query the data like it's a database, allows people to innovate on that stuff. And that's what allows us to do it as well. I mean, we do a bunch of testing of what's possible by basically writing SQL directly against Logfire, as any user could. I think the other really interesting bit that's going on in observability is that OpenTelemetry is centralizing around semantic attributes for GenAI. It's a relatively new project.
A lot of it's still being added at the moment. But basically the idea is that they unify how both SDKs and agent frameworks send observability data to any OpenTelemetry endpoint. And having that unification allows us to go and basically compare different libraries, compare different models, much better. That stuff's at a very early stage of development. One of the things we're going to be working on pretty soon is — I suspect Pydantic AI will be the first agent framework that implements those semantic attributes properly. Because, again, we control it, and we can say this is important for observability, whereas most of the other agent frameworks are not maintained by people who are trying to do observability. With the exception of Langchain, where they have the observability platform but chose not to go down the OpenTelemetry route. So they're plowing their own furrow, and, you know, they're even further away from standardization.

Alessio [00:35:51]: Can you maybe just give a quick overview of how OTel ties into AI workflows? There's kind of the question of: is a trace, or a span, an LLM call? Is it the agent? Is it the broader thing you're tracking? How should people think about it?

Samuel [00:36:06]: Yeah, so they have a PR, which I think may have now been merged, from someone at IBM, talking about remote agents and trying to support this concept of remote agents within GenAI. I'm not particularly compelled by that, because I don't think that's by any means the common use case. But I suppose it's fine for it to be there. The majority of the stuff in OTel is basically defining how you would instrument a given call to an LLM. So, for the actual LLM call: what data you would send to your telemetry provider, and how you would structure it. Apart from this slightly odd stuff on remote agents, most of the agent-level consideration is not yet implemented — is not yet decided, effectively. And so there's a bit of ambiguity. Obviously, what's good about OTel is you can, in the end, send whatever attributes you like. But yeah, there's quite a lot of churn in that space, in exactly how we store the data. I think one of the most interesting things, though, is that if you think about observability: traditionally, sure, everyone would say our observability data is very important, we must keep it safe. But actually, companies work very hard to basically not have anything that sensitive in their observability data. So if you're a doctor in a hospital and you search for a drug for an STI, the SQL might be sent to the observability provider, but none of the parameters would. It wouldn't have the patient number or their name or the drug. With GenAI, that distinction doesn't exist, because it's all just mixed up in the text. If you have that same patient asking an LLM what drug they should take, or how to stop smoking, you can't extract the PII and not send it to the observability platform. So the sensitivity of the data that's going to end up in observability platforms is going to be a basically different order of magnitude to what you would normally send to Datadog. Of course, you can make a mistake and send someone's password or their card number to Datadog, but that would be seen as a mistake. Whereas in GenAI, a lot of data is going to be sent. And I think that's why companies like Langsmith are trying hard to offer observability on-prem, because there are a bunch of companies who are happy for Datadog to be cloud-hosted, but want self-hosting for this observability stuff with GenAI.
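For a sense of what instrumenting "a given call to an LLM" looks like, here is a hedged sketch using the OpenTelemetry Python API; the gen_ai.* attribute names follow the draft conventions discussed here and have churned since, so treat them as illustrative.

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-llm-app")

with tracer.start_as_current_span("chat gpt-4o") as span:
    # Draft GenAI semantic-convention attributes (names still in flux).
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... the actual LLM call would happen here ...
    span.set_attribute("gen_ai.usage.prompt_tokens", 42)
    span.set_attribute("gen_ai.usage.completion_tokens", 7)
```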
Alessio [00:38:09]: And are you doing any of that today? Because I know in each of the spans you have the number of tokens, you have the context — you're just storing everything. And then are you going to offer self-hosting for the platform, basically?

Samuel [00:38:23]: Yeah. So we have scrubbing roughly equivalent to what the other observability platforms have. So if we see "password" as the key, we won't send the value. But like I said, that doesn't really work in GenAI. So we're accepting that we're going to have to store a lot of data, and then we'll offer self-hosting for those people who can afford it and who need it.

Alessio [00:38:42]: And then this is, I think, the first time that most of the workload's performance depends on a third party. You know, if you're looking at Datadog data, usually it's your app that is driving the latency and the memory usage and all of that. Here you're going to have spans that maybe take a long time to perform because the GLA API is not working, or because OpenAI is kind of overwhelmed. Do you do anything there, since the provider is almost the same across customers? Are you trying to surface these things for people and say, hey, this was a very slow span, but actually all customers using OpenAI right now are seeing the same thing, so maybe don't worry about it?

Samuel [00:39:20]: Not yet. We do a few things that people don't generally do in OTel. So we send information at the beginning of a span, as well as when it finishes. By default, OTel only sends you data when the span finishes. So if you think about a request which might take like 20 seconds: even if some of the intermediate spans finished earlier, you can't place them on the page until you get the top-level span. And so if you're using standard OTel, you can't show anything until those requests are finished. When those requests take a few hundred milliseconds, that doesn't really matter. But when you're doing GenAI calls, or when you're running a batch job that might take 30 minutes, that latency of not being able to see the span is crippling to understanding your application. And so we do a bunch of slightly complex stuff to basically send data about a span as it starts, which is closely related. Yeah.

Alessio [00:40:09]: Any thoughts on all the other people trying to build on top of OpenTelemetry in different languages, too? There's the OpenLLMetry project, which doesn't really roll off the tongue. How do you see the future of these kinds of tools? Is everybody going to have to build their own? Why does everybody want to build their own open-source observability thing to then sell?

Samuel [00:40:29]: I mean, we are not going off and trying to instrument the likes of the OpenAI SDK with the new semantic attributes, because at some point that's going to happen anyway, it's going to live inside OTel, and we might help with it. But we're a tiny team. We don't have time to go and do all of that work. So OpenLLMetry — interesting project.
But I suspect eventually most of that instrumentation of the big SDKs will live, like I say, inside the main OpenTelemetry repo, I suppose. What happens at the agent-framework level — what data you basically need at the framework level to get the context — is kind of unclear. I don't think we know the answer yet. But — I guess this is kind of semi-public — I was on the OpenTelemetry call last week talking about GenAI. And there was someone from Arize talking about the challenges they have trying to get OpenTelemetry data out of Langchain, where it's not natively implemented. And obviously they're having quite a tough time. And I was realizing — I hadn't really realized this before — how lucky we are to primarily be talking about our own agent framework, where we have the control, rather than trying to go and instrument other people's.

Swyx [00:41:36]: Sorry, I actually didn't know about this semantic conventions thing. It looks like, yeah, it's merged into main OTel. What should people know about this? I had never heard of it before.

Samuel [00:41:45]: Yeah, I think it looks like a great start. I think there are some unknowns around how you send the messages that go back and forth, which is kind of the most important part — it's the most important thing of all. And that is being moved out of attributes and into OTel events. OTel events, in turn, are moving from being on a span to being their own top-level API where you send data. So there's a bunch of churn still going on. I'm impressed by how fast the OTel community is moving on this project. I guess they, like everyone else, get that this is important and that it's something people are crying out for instrumentation of. So I'm kind of pleasantly surprised at how fast they're moving, but it makes sense.

Swyx [00:42:25]: I'm just kind of browsing through the specification. I can already see that this basically bakes in whatever the previous paradigm was. So now they have gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens — and obviously now we have reasoning tokens as well — and then only one form of sampling, which is top-p. You're basically baking in, or sort of reifying, things that you think are important today, but it's not a super foolproof way of doing this for the future. Yeah.

Samuel [00:42:54]: I mean, that's what's neat about OTel: you can always go and send another attribute, and that's fine. It's just that there are a bunch that are agreed on. But I would say, you know, to come back to your previous point about whether or not we should be relying on one centralized abstraction layer: this stuff is moving so fast that if you start relying on someone else's standard, you risk basically falling behind, because you're relying on someone else to keep things up to date.

Swyx [00:43:14]: Or you fall behind because you've got other things going on.

Samuel [00:43:17]: Yeah, yeah. That's fair.

Swyx [00:43:19]: Any other observations just about building Logfire, actually? Let's just talk about this. So you announced Logfire. I was kind of only familiar with Logfire because of your Series A announcement; I actually thought you were making a separate company. I remember some amount of confusion with you when that came out. So, to be clear: it's Pydantic Logfire, and the company is one company that has kind of two products, an open source thing and an observability thing, correct? Yeah. I was just kind of curious, like, any learnings building Logfire?
Swyx [00:43:19]: Any other observations about building Logfire, actually? Let's just talk about this. So you announced Logfire. I was only familiar with Logfire because of your Series A announcement; I actually thought you were making a separate company. I remember some amount of confusion with you when that came out. So to be clear, it's Pydantic Logfire, and it's one company that has two products: an open-source thing and an observability thing, correct? Yeah. I was just curious, any learnings building Logfire? So the classic question is: do you use ClickHouse? Is that the standard persistence layer? Any learnings doing that?

Samuel [00:43:54]: We don't use ClickHouse. We started building our database with ClickHouse, moved off ClickHouse onto Timescale, which is a Postgres extension for analytical databases. Wow. And then moved off Timescale onto DataFusion. And we're basically now building, well, it's DataFusion, but it's kind of our own database. Bogomil is not entirely happy that we went through three databases before we chose one, I'll say that. But we've got to the right one in the end. I think we could have realized sooner that Timescale wasn't right, and the same for ClickHouse. They both taught us a lot, and we're in a great place now. But yeah, it's been a real journey on the database in particular.

Swyx [00:44:28]: Okay. So, as a database nerd, I have to double-click on this, right? ClickHouse is supposed to be the ideal backend for anything like this. And then moving from ClickHouse to Timescale is another counterintuitive move that I didn't expect, because Timescale is an extension on top of Postgres, not super meant for high-volume logging. But yeah, tell us about those decisions.

Samuel [00:44:50]: So at the time, ClickHouse did not have good support for JSON. I was speaking to someone yesterday, said ClickHouse doesn't have good support for JSON, and got roundly stepped on, because apparently it does now. So they've obviously gone and built proper JSON support. But back when we were trying to use it, I guess a year ago or a bit more, everything had to be a map, and maps are a pain for looking up JSON-type data. And all these attributes, everything you're talking about in terms of the GenAI stuff: you can choose to make them top-level columns if you want, but the simplest thing is just to put them all into a big JSON pile. And that was a problem with ClickHouse. Also, ClickHouse had some really ugly edge cases. By default, or at least until I complained about it a lot, ClickHouse thought that two nanoseconds was longer than one second, because it compared intervals just by the number, not the unit. I complained about that a lot, and they changed it to raise an error saying you have to use the same unit. Then I complained a bit more, and as I understand it, they now convert between units. But stuff like that was really painful when a lot of what you're doing is comparing the durations of spans. Also things like: you can't subtract two datetimes to get an interval; you have to use the dateSub function. The fundamental thing is that because we want our end users to write SQL, the quality of the SQL, how easy it is to write, matters way more to us than if you're building a platform on top where your own developers write the SQL, and once it's written and working, you don't mind too much. So I think that's one of the fundamental differences. The other problem I have with both ClickHouse and, in fact, Timescale is the ultimate architecture, the Snowflake-style architecture of binary data in object store, queried with some kind of cache nearby. They both have it, but it's closed source, and you only get it if you use their hosted versions.
And so even if we had got through all the problems with Timescale or ClickHouse, we would have ended up with them wanting to take their 80% margin, which would basically leave us less space for our own margin. Whereas DataFusion is properly open source; all of that same tooling is open source. And for us, as a team with a lot of Rust expertise, DataFusion, which is implemented in Rust, is something we can literally dive into and change. So, for example, I found some slowdowns in DataFusion's string comparison kernel for doing string-contains, and it's just Rust code, so I could go and rewrite the string comparison kernel to be faster. Or, for example, DataFusion, when we started using it, didn't have JSON support. Obviously, as I've said, that's something we needed, and I was able to go and implement it in a weekend using the JSON parser that we built for Pydantic Core. So for us, DataFusion is the perfect mixture: a toolbox to build a database with, not a database. And we can implement stuff on top of it in a way that would be much harder in Postgres or in ClickHouse. I mean, ClickHouse would be easier because it's C++, relatively modern C++, but as a team of people who are not C++ experts, that's much scarier than DataFusion for us.

Swyx [00:47:47]: Yeah, that's a beautiful rant.

Alessio [00:47:49]: That's funny. Most people don't think they have agency over these projects. They're kind of like, oh, I should use this or I should use that. They don't ask, what should I pick so that I contribute the most back to it? But you obviously have an open-source-first mindset, so that makes a lot of sense.

Samuel [00:48:05]: I think if we were a better startup, faster moving and just headlong determined to get in front of customers as fast as possible, we should have just started with ClickHouse. I hope that long term we're in a better place for having worked with DataFusion. We're quite engaged now with the DataFusion community; Andrew Lamb, who maintains DataFusion, is an advisor to us. We're in a really good place now. But yeah, it's definitely slowed us down relative to just building on ClickHouse and moving as fast as we can.
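For a feel of what "a toolbox to build a database with" looks like from the outside, here is a toy session using DataFusion's Python bindings (the team works in the Rust crate itself; the `datafusion` PyPI package wraps it). The Parquet file and column names are hypothetical.

```python
from datafusion import SessionContext

ctx = SessionContext()
# Hypothetical Parquet export of spans with start/end timestamps.
ctx.register_parquet("spans", "spans.parquet")

# End users write plain SQL; note that subtracting two timestamps just
# works, the exact ergonomic pain point described with ClickHouse above.
df = ctx.sql("""
    SELECT trace_id, span_name, end_time - start_time AS duration
    FROM spans
    ORDER BY duration DESC
    LIMIT 10
""")
print(df.to_pandas())
```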
Swyx [00:48:34]: OK, we're about to zoom out and do pydantic.run and all the other stuff. But my last question on Logfire is really this: at some point you run out of community goodwill, the "oh, I use Pydantic, I love Pydantic, I'm going to use Logfire" effect. Then you start entering the territory of the Datadogs, the Sentrys and the Honeycombs. Yeah. So where are you really going to spike here? What's the differentiator?

Samuel [00:48:59]: I wasn't writing code in 2001, but I'm assuming there were people talking about web observability, and then web observability stopped being a thing, not because the web stopped being a thing, but because all observability had to do web. If you were talking to people in 2010 or 2012, they would have talked about cloud observability. Now that's not a term, because all observability is cloud-first. The same is going to happen to GenAI. And so whether you're trying to compete with Datadog or with Arize and LangSmith, you've got to do first-class, general-purpose observability with first-class support for AI. And as far as I know, we're the only people really trying to do that. I mean, I think Datadog is starting in that direction, and to be honest, I think Datadog is a much scarier company to compete with than the AI-specific observability platforms. Because in my opinion, and I've also heard this from lots of customers, AI-specific observability where you don't see everything else going on in your app is not actually that useful. Our hope is that we can build the first general-purpose observability platform with first-class support for AI, and that we have an open-source heritage of putting developer experience first that other companies haven't had. For all that I'm a fan of Datadog and what they've done: search "Datadog logging Python" and just try, as a non-observability expert, to get something up and running with Datadog and Python. It's not trivial, right? That's something Sentry has done amazingly well. But there's enormous space in most of observability to do DX better.

Alessio [00:50:27]: Since you mentioned Sentry, I'm curious how you thought about licensing and all of that. Obviously you're MIT-licensed; you don't have any rolling license like Sentry has, where you can only use the open-source version once it's a year old. Was that a hard decision?

Samuel [00:50:41]: So to be clear, Logfire is closed source. Pydantic and Pydantic AI are MIT-licensed and properly open source, and then Logfire, for now, is completely closed source. And in fact, the struggles Sentry have had with licensing, and the weird pushback the community gives when they take something that's closed source and make it source-available, just meant we avoided that whole subject. I think the other way to look at it is that in terms of either headcount or revenue or dollars in the bank, we're up there with the most prolific open-source companies, like I say, per head. And so we didn't feel morally obligated to make Logfire open source. We have Pydantic; Pydantic is a foundational library in Python. That, and now Pydantic AI, are our contribution to open source. And then Logfire is openly for-profit, right? As in, we're not claiming otherwise. We're not trying to walk a line where it's nominally open source but we make it hard to deploy so you probably want to pay us. We're trying to be straight that it's something you pay for. We could change that at some point in the future, but it's not an immediate plan.

Alessio [00:51:48]: All right. So the first one I saw, this new, I don't know if it's a product you're building, the pydantic.run thing, which is a Python browser sandbox. What was the inspiration behind that? We talk a lot about code interpreters for LLMs. I'm an investor in a company called E2B, which is a code sandbox as a service for remote execution. Yeah. What's the pydantic.run story?

Samuel [00:52:09]: So pydantic.run is, again, completely open source. I have no interest in making it into a product. We just needed a sandbox to be able to demo Logfire in particular, but also Pydantic AI. So it doesn't have it yet, but I'm going to add basically a proxy to OpenAI and the other models, so that you can run Pydantic AI in the browser, see how it works, tweak the prompt, et cetera, et cetera. And we'll have some kind of limit per day on what you can spend, or what the spend is. The other thing we wanted to b
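On the developer-experience point discussed above, Logfire's own pitch is a short setup path. The snippet below is a sketch from memory of its public Python SDK (configure once, then log and trace); treat the exact call signatures as an assumption and check the current docs.

```python
import logfire

# Assumed API surface: configure() reads credentials from the environment.
logfire.configure()

# A span plus a log line, with structured attributes baked in.
with logfire.span("checkout {order_id}", order_id=1234):
    logfire.info("charging card")
```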
Çalar Saat continues to take the country's pulse, bringing viewers news gathered from all corners of Turkey with a sincere and honest approach to journalism. Turkey's leading morning news program, Çalar Saat, is on NOW! Follow us on social media: Facebook: https://www.facebook.com/nowhaber Twitter: http://www.twitter.com/NOWhaber Instagram: https://www.instagram.com/nowhaber.tr/ Podcast: https://anchor.fm/now-haber
In the horrific fire that broke out at a hotel in a ski resort in Kartalkaya, 78 people lost their lives, 36 of them children. Turkey has been mourning the victims for days. Six prosecutors have launched an investigation to uncover the cause of the fire and any possible negligence. The owner of the hotel has been arrested. Why did this disaster, which wiped out whole families, happen? How is building safety against fire regulated and inspected in Turkey? We spoke with Prof. Afşin Ahmet Kaya, faculty member at 19 Mayıs University and disaster management expert. At the microphone: Gökçe Göksu and Elmas Topcu. By Gökçe Göksu.
The fire in Bolu Kartalkaya, in which 78 citizens lost their lives, made us as a country feel grief down to our bones and drowned us in tears. The air is thick with "if only"s. But those who are gone will not come back. Going forward, we will try to draw attention to what needs to be done in public administration.
On Haftalık, Fatma İnce and Mert Büyükkarabacak take stock: ✔️ What did Trump do first? ✔️ Rare metal elements discovered in China ✔️ Israel, Gaza and Syria ✔️ The situation of Alevis in Syria ✔️ The revocation of the concession granted to Russia at Tesla ✔️ The Grand Kartal hotel disaster ✔️ The blockade of the CHP ✔️ The Central Bank's interest rate decision ✔️ The arrest of Ümit Özdağ ✔️ The İmralı meeting
NOW Ana Haber, the address for reliable, impartial, quality news, where you can find the hottest and most striking developments of the day, meets its viewers with the presentation of experienced journalist Selçuk Tepeli. Delivered with an interactive presentation far removed from run-of-the-mill bulletins, NOW Ana Haber is on NOW every day at 19:00! Follow us on social media: X: https://twitter.com/nowhaber Facebook: https://www.facebook.com/nowhaber.tr Instagram: https://www.instagram.com/nowhaber.tr/ Podcast: https://anchor.fm/now-haber
In the fire at the Grand Kartal Hotel at the Kartalkaya Ski Resort in Bolu, which welcomes tens of thousands of people every year, 76 people lost their lives and dozens were injured. Tragedy struck the homes of perhaps hundreds of families. This became one of the deadliest hotel fires on record in Turkey's history. At least 30 hours have passed since the fire, and many questions remain unanswered. Senem Görür Yücel will take stock live on air with her expert guests. Göksel Göksu will report impressions from Bolu-Kartalkaya. Learn more about your ad choices. Visit megaphone.fm/adchoices
Travel agent and hotel operator Serhad Uslan was the guest on Turizm Gündemi, the program prepared and presented by Çiğdem Aydoğdu.
For the past 10 years Anton has been working at Booking.com, one of the leading digital travel companies, based out of Amsterdam. The journey that started as a System Administrator role has led Anton to become an Engineering Manager for Site Reliability, where over the past 3 years he led the rollout and adoption of OpenTelemetry as the standard for getting observability into new cloud-native deployments.
Tune in and learn how Anton saw R&D grow from 300 to 2000, why they replaced their home-grown Perl-based observability framework with OpenTelemetry, how they tackle adoption challenges, and how they extend and contribute back to the open source community.
Links we discussed:
Anton's LinkedIn Profile: https://www.linkedin.com/in/antontimofieiev/
Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/anton-timofieiev
OpenTelemetry: https://opentelemetry.io/
We sit down with Ruşen Yücesoylu Karakaya, president of the Şampiyon Melekleri Yaşatma Derneği, the association founded by the families of the 35 children from Northern Cyprus (KKTC) who lost their lives at the İsias Hotel in Adıyaman in the February 6 earthquake.
AWS Morning Brief for the week of December 2, with Corey Quinn. Links:
Amazon CloudWatch adds context to observability data in service consoles, accelerating analysis
Amazon Cognito introduces Managed Login to support rich branding for end user journeys
Amazon Cognito now supports passwordless authentication for low-friction and secure logins
Amazon Connect Email is now generally available
Amazon EBS announces Time-based Copy for EBS Snapshots
Amazon EC2 Auto Scaling introduces highly responsive scaling policies
Amazon EC2 Capacity Blocks now supports instant start times and extensions
Amazon ECR announces 10x increase in repository limit to 100,000
Amazon EFS now supports up to 2.5 million IOPS per file system
Amazon S3 now supports enforcement of conditional write operations for S3 general purpose buckets
Application Signals provides OTEL support via X-Ray OTLP endpoint for traces
AWS delivers enhanced root cause insights to help explain cost anomalies
Enhanced Pricing Calculator now supports discounts and purchase commitments (in preview)
AWS PrivateLink now supports cross-region connectivity
Announcing the new AWS User Notifications SDK
Announcing new feature tiers: Essentials and Plus for Amazon Cognito
Announcing Savings Plans Purchase Analyzer
Data Exports for FOCUS 1.0 is now in general availability
Introducing a new experience for AWS Systems Manager
Introducing generative AI troubleshooting for Apache Spark in AWS Glue (preview)
Understanding how certain database parameters impact scaling in Amazon Aurora Serverless v2
Analyzing your AWS Cost Explorer data with Amazon Q Developer: Now Generally Available
Your guide to AWS for Advertising & Marketing at re:Invent 2024
AWS IoT Services alignment with US Cyber Trust Mark
Streamlining AWS Organizations Cleanup Strategies
Sponsor
Wiz: wiz.io/lastweek
PromCon, the flagship yearly event of the Prometheus community, is back in Berlin, and we're here to bring you the highlights from the Prometheus ecosystem. And this year we've got some major news: Prometheus's long-awaited major release, v3.0! Join us to hear all about the revamped user interface, about Remote Write 2.0, and about Prometheus's goal to become the default backend for storing OpenTelemetry metrics, featuring native OTel support, and much more. We'll cover these and more highlights from the Prometheus ecosystem. Our guest is none other than Julius Volz, creator of Prometheus and founder of the PromCon conference. Julius created the Prometheus monitoring system at SoundCloud and led the project through open source and beyond. He now focuses on growing the Prometheus community, and helps companies use and adapt Prometheus through his company PromLabs. Before that, Julius was a Site Reliability Engineer at Google, where he gained experience monitoring at hyperscale. The episode was live-streamed on 4 September 2024 and the video is available at www.youtube.com/watch?v=iPUCU-78RD4 Check out the episode recap: https://medium.com/p/1c5edca32c87/ OpenObservability Talks episodes are released monthly, on the last Thursday of each month and are available for listening on your favorite podcast app and on YouTube. We live-stream the episodes on Twitch and YouTube Live - tune in to see us live, and chime in with your comments and questions on the live chat. https://www.youtube.com/@openobservabilitytalks https://www.twitch.tv/openobservability Show Notes: 00:00 - episode and guest intro 01:56 - Prometheus origins 07:23 - Kubernetes synergy 09:34 - Origins of PromCon and this year's event 11:44 - The idea for Prometheus 3.0 13:26 - new UI for Prometheus 20:42 - Beyond Prometheus UI into the broader UI/UX vision 23:07 - OpenTelemetry support and compatibility 37:26 - Native histograms 43:14 - Remote Write 2.0 46:53 - New governance model 48:49 - OpenMetrics is archived, merged into Prometheus 53:34 - Perses joins the CNCF sandbox 57:15 - The landscape of long-term storage for Prometheus 59:13 - Updates in Thanos project 01:00:34 - the growth of Prometheus-semi-compatible solutions 01:04:09 - Kubernetes 1.31 is released Resources: PromCon recap: https://medium.com/p/1c5edca32c87/ PromCon: https://promcon.io/2024-berlin/ Prometheus now supports OpenTelemetry: https://horovits.medium.com/83f85878e46a OpenMetrics archived, merged into Prometheus: https://horovits.medium.com/d555598d2d04 Prometheus 3.0-Beta release: https://github.com/prometheus/prometheus/releases/tag/v3.0.0-beta.0 Prometheus 3.0-Beta release blog: https://prometheus.io/blog/2024/09/11/prometheus-3-beta/ Perses project introduction: https://horovits.medium.com/f05b5324d7da Last roundup of Prometheus updates: https://horovits.medium.com/fbede9b5cc9 Last PromCon (2023) recap: https://logz.io/blog/promcon-prometheus-ecosystem-updates/ Socials: Twitter: https://twitter.com/OpenObserv YouTube: https://www.youtube.com/@openobservabilitytalks Dotan Horovits ============ Twitter: @horovits LinkedIn: www.linkedin.com/in/horovits Mastodon: @horovits@fosstodon Julius Volz ========= Twitter: https://twitter.com/juliusvolz LinkedIn: https://www.linkedin.com/in/julius-volz/ Mastodon: https://chaos.social/@juliusvolz
Hans Kristian is a Platform Engineer for NAV's Kubernetes platform, Nais, hosting Norway's welfare services. With 10 years on Kubernetes, 2000 apps, and 1000 developers across more than 100 teams, there was a need to make OpenTelemetry adoption as easy as possible.
Tune in as we hear from Hans Kristian, who is also a CNCF Ambassador and hosts Cloud Native Day Bergen, on why OpenTelemetry was chosen by the public sector, why it took much longer to adopt, which challenges they had scaling the observability backend, and how they are tackling the "noisy data problem".
Links we discussed in the episode:
Follow Hans Kristian on LinkedIn: https://www.linkedin.com/in/hansflaatten/
From 0 to 100 OTel Blog: https://nais.io/blog/posts/otel-from-0-to-100/?foo=bar
Cloud Native Day Bergen: https://2024.cloudnativebergen.dev/
Public Money, Public Code. How we open source everything we do! (https://m.youtube.com/watch?v=4v05Huy2mlw&pp=ygUkT3BlbiBzb3VyY2Ugb3BlbiBnb3Zlcm5tZW50IGZsYWF0dGVu)
State of Platform Engineering in Norway (https://m.youtube.com/watch?v=3WFZhETlS9s&pp=ygUYc3RhdGUgb2YgcGxhdGZvcm0gbm9yd2F5)
OTEL Universe and OTEL Videos join us as we kick off Florida. Meet four of their video guests through their music.
Bill Ricci with The Dutchman's Curve
Highway Jones with Rock and Roll Dress
Jinxx with Come Away
Vibe - It Don't Matter
What's on the agenda of the gdh editors? Reverse migration out of Istanbul, hotel prices, deepfakes, exorbitant price hikes and much more on Editörün Gündemi. #gdh #otel #istanbul #göç
With tourism professional Murat Serim, we discussed the worsening situation in tourism, the reasons Turks prefer the Greek islands, and whether prices at hotels and restaurants are exorbitant.
The decline in housing sales continued in June. Journalist Kurtuluş Demirbaş, who had been held in custody in Saudi Arabia, was released. The opposition's work revealed what the bill on street animals would bring. This episode contains advertising for Martı Hotel & Marinas. With sustainable initiatives delivering over 40% savings in energy and more than 80% in water consumption, Martı Hotel & Marinas is thinking about tomorrow on your behalf while you enjoy a peaceful holiday. For a nature-friendly holiday with hotel comfort, you can visit marti.com.tr. You can reach Aposto Gündem here.
We dropping H's like it's the 20'S! Wanna go stay in a potato Otel? Ow bout we go to Ave a nice Wine Ike trough te Trail? Best trails to go to on vacation!
Requesting more CPU for your database used to take 6 months of planning 20 years ago. Now it takes the execution of a Terraform script. What has stayed the same all those years is Almudena Vivanco's passion for performance engineering to keep systems optimized, ensuring that systems are available, scalable and resilient even during spike events such as the upcoming Euro Cup or any holiday specials.
Tune in and hear from Almudena, who is currently working for SCRM Lidl, on how moving to the cloud gave new justification to performance engineering. She explains the importance of connecting business with service level objectives and gives insights on how Lidl makes sure to sell 50,000 pieces of pork without breaking the cloud bank.
Here are the additional links we discussed:
Slides from Barcelona Meetup: https://docs.google.com/presentation/d/1h83V4gUyqAmIWeAAtKb4BcRvuJV-XirLk-9Xq077nbw
Video from TestCon: https://www.youtube.com/watch?v=rIP_G-YBy04
LinkedIn: https://www.linkedin.com/in/almudenavivanco/
Making observability available to everyone! This noble goal needs superhero powers in an IT world where there is so much chatter and confusion about what observability is, how to sell its value beyond a glorified troubleshooting tool, and how OpenTelemetry will disrupt the landscape.
In our latest episode we have Rainer Schuppe, an observability veteran with more than 20 years in the space who has worked for the majority of the observability vendors. He shares his observability expertise through workshops in his home town of Mallorca, teaching organizations everything from basic to strategic observability implementations.
Tune in and learn about the typical adoption and maturity path of observability within enterprises: from fixing the problem at hand, to justifying the cost of keeping it, to enabling companies to become information-driven digital organizations! Also check out his OpenTelemetry journey in his blog post series.
Here are the links we discussed today:
Observability Heroes Website: https://observability-heroes.com/
Observability Heroes Community: https://observability.mn.co/
Cloud Native Mallorca Meetup: https://www.meetup.com/cloud-native-mallorca/
OpenTelemetry: https://opentelemetry.io/
Rainer on LinkedIn: https://www.linkedin.com/in/rainerschuppe/
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/ // Abstract Dive into the challenges of scaling AI models from Minimum Viable Product (MVP) to full production. The panel emphasizes the importance of continually updating knowledge and data, citing examples like teaching AI systems nuanced concepts and handling brand name translations. User feedback's role in model training, alongside evaluation steps like human annotation and heuristic-based assessment, was highlighted. The speakers stressed the necessity of tooling for user evaluation, version control, and regular performance updates. Insights on in-house and external tools for annotation and evaluation were shared, providing a comprehensive view of the complexities involved in scaling AI models. // Bio Alex Volkov - Moderator Alex Volkov is an AI Evangelist at Weights & Biases, celebrated for his expertise in clarifying the complexities of AI and advocating for its beneficial uses. He is the founder and host of ThursdAI, a weekly newsletter and podcast that explores the latest in AI, its practical applications, open-source, and innovation. With a solid foundation as an AI startup founder and 20 years in full-stack software engineering, Alex offers a deep well of experience and insight into AI innovation. Eric Peter Product management leader and 2x founder with experience in enterprise products, data, and machine learning. Currently building tools for generative AI @Databricks. Donné Stevenson Focused on building AI-powered products that give companies the tools and expertise needed to harness the power of AI in their respective fields. Phillip Carter Phillip is on the product team at Honeycomb where he works on a bunch of different developer tooling things. He's an OpenTelemetry maintainer -- chances are if you've read the docs to learn how to use OTel, you've read his words. He's also Honeycomb's (accidental) prompt engineering expert by virtue of building and shipping products that use LLMs. In a past life, he worked on developer tools at Microsoft, helping bring the first cross-platform version of .NET into the world and grow it to 5 million active developers. When not doing computer stuff, you'll find Phillip in the mountains riding a snowboard or backpacking in the Cascades. Andrew Hoh Andrew Hoh is the President and Co-Founder of LastMile AI. Previously, he was a Group PM Manager at Meta AI, driving product for their AI Platform. Previously, he was the Product Manager for the Machine Learning Infrastructure team at Airbnb and a founding team member of Azure Cosmos DB, Microsoft Azure's distributed NoSQL database. He graduated with a BA in Computer Science from Dartmouth College. A big thank you to our Premium Sponsors, @Databricks and @baseten for their generous support! // Sign up for our Newsletter to never miss an event: https://mlops.community/join/ // Watch all the conference videos here: https://home.mlops.community/home/collections // Check out the MLOps Community podcast: https://open.spotify.com/show/7wZygk3mUUqBaRbBGB1lgh?si=242d3b9675654a69 // Read our blog: mlops.community/blog // Join an in-person local meetup near you: https://mlops.community/meetups/ // MLOps Swag/Merch: https://mlops-community.myshopify.com/ // Follow us on Twitter: https://twitter.com/mlopscommunity // Follow us on LinkedIn: https://www.linkedin.com/company/mlopscommunity/
OpenTelemetry is expanding beyond the traditional “three pillars of observability” and introduces a groundbreaking addition to its signals - Continuous Profiling. The new Profiling Special Interest Group (SIG) that was formed to lead the topic has already made significant advancements, to be featured at KubeCon Europe. Join us in this special panel episode of OpenObservability Talks as we explore the significance of this new dimension in understanding application behavior, optimizing performance, and gaining deeper insights into your systems. Our expert guests, Felix Geisendörfer and Ryan Perry, members of the OpenTelemetry Profiling SIG, share their insights into how Profiling enhances the OpenTelemetry framework, and update on the work for open specification and implementation. This special episode hosts a panel of two distinguished members of OpenTelemetry's Profile SIG, and prominent members of the observability vendor ecosystem. Felix Geisendörfer is a Senior Staff Engineer at Datadog where he works on Continuous Profiling and contributes to the Go runtime. Before that he was working at Apple, co-founded Transloadit, contributed to node.js and inspired a generation of mad scientists to program flying robots with it. Ryan Perry is Principal Product Manager at Grafana Labs. He has built a career at various startups while actively contributing to open source projects and advancing open telemetry initiatives. Most recently he built Pyroscope, an open source continuous profiling project/company, which has been acquired by Grafana Labs. The episode was live-streamed on 7 March 2024 and the video is available at https://www.youtube.com/watch?v=iGM67RT12gQ OpenObservability Talks episodes are released monthly, on the last Thursday of each month and are available for listening on your favorite podcast app and on YouTube. We live-stream the episodes on Twitch and YouTube Live - tune in to see us live, and chime in with your comments and questions on the live chat.https://www.twitch.tv/openobservabilityhttps://www.youtube.com/@openobservabilitytalks Show Notes: 00:00 - show intro 01:03 - episode and guests intro 04:02 - trends and advancements in the Profiling space 05:42 - from cost and performance into broader observability 11:27 - turning profile data into metrics 12:45 - runtime vs. full host profilers and eBPF use 18:44 - pprof JFR and other existing profile standards 21:19 - profile visualizations - from flame graphs to timeline view 22:37 - entrepreneur PoV on the profiling market 26:54 - OpenTelemetry adds profiles as a new signal 32:22 - OTel choosing a pprof extended standard 39:06 - discrete events vs. 
pre-aggregated data 41:09 - use cases for processing profiling data 44:19 - OTel Profiles reference implementation 49:11 - latest milestone and roadmap 54:44 - who's involved in OTel Profiles 56:41 - how to follow OTel Profiles and the guests 59:34 - March community events and conferences 1:00:38 - Falco and CloudEvents projects reached CNCF graduation 1:01:59 - Prometheus and Linkerd latest releases 1:03:29 - Netflix open-sources bpftop CLI for eBPF app performance monitoring 1:05:15 - show outro Resources: Continuous Profiling: A New Observability Signal (previous episode): https://logz.io/blog/continuous-profiling-new-observability-signal-in-opentelemetry/?utm_source=devrel&utm_medium=devrel OpenTelemetry extension proposal for adding Profiles: https://github.com/open-telemetry/oteps/pull/239 OTel Profile SIG notes: https://docs.google.com/document/d/19UqPPPlGE83N37MhS93uRlxsP1_wGxQ33Qv6CDHaEp0/edit#heading=h.63a4klfdbcob eBPF adoption in observability - github stats: https://www.linkedin.com/feed/update/urn:li:activity:7171044354667585537/ ProfilerPedia: https://profilerpedia.markhansen.co.nz/ Netflix releases bpftop CLI tool: https://netflixtechblog.com/announcing-bpftop-streamlining-ebpf-performance-optimization-6a727c1ae2e5 OpenTelemetry announces support of Profiles at KubeCon Paris 2024: https://opentelemetry.io/blog/2024/profiling/ Socials: Twitter: https://twitter.com/OpenObserv YouTube: https://www.youtube.com/@openobservabilitytalks Dotan Horovits ============ Twitter: https://twitter.com/horovits LinkedIn: https://www.linkedin.com/in/horovits/ Mastodon: @horovits@fosstodon.org Felix Geisendörfer =============== Twitter: https://twitter.com/felixge LinkedIn: https://www.linkedin.com/in/felixg2/ Ryan Perry ========== Twitter: https://twitter.com/rperry_ Linkedin: https://www.linkedin.com/in/ryanaperry/
In this compelling episode, we sit down with Ben Reinhart who shares his journey of transitioning from the JavaScript ecosystem, specifically migrating off of Next.js and Vercel, to Elixir and Phoenix, with Fly.io as the new host. Ben discusses his frustrations with the complexity and performance issues he faced, and how the switch to Elixir helped streamline operations and improve the efficiency of his AI-focused product at Axflow. He delves into his strategic choice for leveraging the operational simplicity and real-time features of Phoenix, while also acknowledging trade-offs such as rebuilding front-end components. Join us to explore Ben's story, learn about the features of Elixir that helped him, and discover how the move has influenced Axflow's path towards finding product-market fit, and more! Show Notes online - http://podcast.thinkingelixir.com/195 (http://podcast.thinkingelixir.com/195) Elixir Community News - Update on the phoenix_live_reload package to v1.5, containing useful tips. - https://www.elixirstreams.com/tips/stream_server_logs_to_console (https://www.elixirstreams.com/tips/stream_server_logs_to_console?utm_source=thinkingelixir&utm_medium=shownotes) – Tips on how to stream Elixir server logs to the browser console. - https://github.com/phoenixframework/phoenix_live_reload?tab=readme-ov-file#streaming-serving-logs-to-the-web-console (https://github.com/phoenixframework/phoenix_live_reload?tab=readme-ov-file#streaming-serving-logs-to-the-web-console?utm_source=thinkingelixir&utm_medium=shownotes) – Documentation on streaming Elixir server logs to the web console using phoenix_live_reload v1.5. - Advice to change the Appearance theme to "Dark" in the browser console for better readability of debug-level messages. - https://github.com/phoenixframework/phoenix_live_reload?tab=readme-ov-file#jumping-to-heex-function-definitions (https://github.com/phoenixframework/phoenix_live_reload?tab=readme-ov-file#jumping-to-heex-function-definitions?utm_source=thinkingelixir&utm_medium=shownotes) – Information on the new feature "Jumping to HEEx function definitions" in phoenix_live_reload v1.5. - https://blog.appsignal.com/2024/03/19/direct-file-uploads-to-amazon-s3-with-phoenix-liveview.html (https://blog.appsignal.com/2024/03/19/direct-file-uploads-to-amazon-s3-with-phoenix-liveview.html?utm_source=thinkingelixir&utm_medium=shownotes) – A new blog post by Joshua Plique about uploading files directly to S3 using Phoenix LiveView. - https://hexdocs.pm/phoenix_live_view/uploads-external.html (https://hexdocs.pm/phoenix_live_view/uploads-external.html?utm_source=thinkingelixir&utm_medium=shownotes) – Official Phoenix documentation on direct file uploads to external services like S3. - https://x.com/whatyouhide/status/1768345597369532660 (https://x.com/whatyouhide/status/1768345597369532660?utm_source=thinkingelixir&utm_medium=shownotes) – Andrea Leopardi working on integrating OpenTelemetry (OTel) with Sentry for the Elixir SDK. - https://github.com/getsentry/sentry-elixir/issues/538 (https://github.com/getsentry/sentry-elixir/issues/538?utm_source=thinkingelixir&utm_medium=shownotes) – A GitHub issue discussing the integration of OTel with Sentry's Elixir SDK. - https://twitter.com/TylerAYoung/status/1769741350126149857 (https://twitter.com/TylerAYoung/status/1769741350126149857?utm_source=thinkingelixir&utm_medium=shownotes) – Tyler Young's tip for keeping Elixir tests running faster and asynchronously by using the Process dictionary instead of the Application environment.
- https://saltycrackers.dev/posts/bye-bye-async-false/ (https://saltycrackers.dev/posts/bye-bye-async-false/?utm_source=thinkingelixir&utm_medium=shownotes) – An article discussing how to avoid async false in tests by using the Process dictionary. - https://github.com/jbsf2/process-tree (https://github.com/jbsf2/process-tree?utm_source=thinkingelixir&utm_medium=shownotes) – Introduction of a new Elixir library, ProcessTree, to navigate the process ancestry hierarchy and aid in better test configuration. - Advice on using the process dictionary check only in MIX_ENV=test to prevent runtime overhead in production. Do you have some Elixir news to share? Tell us at @ThinkingElixir (https://twitter.com/ThinkingElixir) or email at show@thinkingelixir.com (mailto:show@thinkingelixir.com) Discussion Resources - https://axflow.dev/ (https://axflow.dev/?utm_source=thinkingelixir&utm_medium=shownotes) - https://twitter.com/benjreinhart/status/1758616465589014531 (https://twitter.com/benjreinhart/status/1758616465589014531?utm_source=thinkingelixir&utm_medium=shownotes) - https://exercism.org/tracks/elixir (https://exercism.org/tracks/elixir?utm_source=thinkingelixir&utm_medium=shownotes) - https://www.youtube.com/watch?v=JvBT4XBdoUE (https://www.youtube.com/watch?v=JvBT4XBdoUE?utm_source=thinkingelixir&utm_medium=shownotes) - https://www.typescriptlang.org/ (https://www.typescriptlang.org/?utm_source=thinkingelixir&utm_medium=shownotes) - https://nextjs.org/ (https://nextjs.org/?utm_source=thinkingelixir&utm_medium=shownotes) - https://vercel.com/ (https://vercel.com/?utm_source=thinkingelixir&utm_medium=shownotes) - https://supabase.com/ (https://supabase.com/?utm_source=thinkingelixir&utm_medium=shownotes) - https://remix.run/ (https://remix.run/?utm_source=thinkingelixir&utm_medium=shownotes) - https://inertiajs.com/ (https://inertiajs.com/?utm_source=thinkingelixir&utm_medium=shownotes) - https://vitejs.dev/ (https://vitejs.dev/?utm_source=thinkingelixir&utm_medium=shownotes) - https://github.com/fidr/phoenix_live_react (https://github.com/fidr/phoenix_live_react?utm_source=thinkingelixir&utm_medium=shownotes) - https://github.com/geolessel/react-phoenix (https://github.com/geolessel/react-phoenix?utm_source=thinkingelixir&utm_medium=shownotes) - https://www.pinterest.com/ (https://www.pinterest.com/?utm_source=thinkingelixir&utm_medium=shownotes) - https://fly.io/docs/gpus/ (https://fly.io/docs/gpus/?utm_source=thinkingelixir&utm_medium=shownotes) Guest Information - https://twitter.com/benjreinhart (https://twitter.com/benjreinhart?utm_source=thinkingelixir&utm_medium=shownotes) – Ben on Twitter - https://twitter.com/axflow_dev (https://twitter.com/axflow_dev?utm_source=thinkingelixir&utm_medium=shownotes) – AxFlow on Twitter - https://github.com/benjreinhart/ (https://github.com/benjreinhart/?utm_source=thinkingelixir&utm_medium=shownotes) – on Github - https://benreinhart.com/ (https://benreinhart.com/?utm_source=thinkingelixir&utm_medium=shownotes) – Blog - https://axflow.dev/ (https://axflow.dev/?utm_source=thinkingelixir&utm_medium=shownotes) – AxFlow Website Find us online - Message the show - @ThinkingElixir (https://twitter.com/ThinkingElixir) - Message the show on Fediverse - @ThinkingElixir@genserver.social (https://genserver.social/ThinkingElixir) - Email the show - show@thinkingelixir.com (mailto:show@thinkingelixir.com) - Mark Ericksen - @brainlid (https://twitter.com/brainlid) - Mark Ericksen on Fediverse - @brainlid@genserver.social (https://genserver.social/brainlid) - David
Bernheisel - @bernheisel (https://twitter.com/bernheisel) - David Bernheisel on Fediverse - @dbern@genserver.social (https://genserver.social/dbern) - Cade Ward - @cadebward (https://twitter.com/cadebward) - Cade Ward on Fediverse - @cadebward@genserver.social (https://genserver.social/cadebward)
Cristina Oțel has over 12 years of experience in Learning & Development. Combining neuroscience with elements of positive psychology, she designs training and delivers workshops on Life Design, Rewriting Limiting Beliefs, Emotional Fitness, Personal Mastery, Values and Mission, Soulful Leadership and Team Synergy, and helps people create lives they love. Today we talk about rewriting limiting beliefs. "I'm not ... enough", "I'm too ...", "I'm not as ... as ...", "I can't ...", "I don't have time ...", "I never ...", "Others always ...". Do these words sound familiar? Do you use them too when you talk about yourself? The good news is that thoughts phrased this way are limiting beliefs. The even better news is that they are not final and can be rewritten, if you commit to the work this process requires. Support the show. Connect with me: Subscribe to the newsletter: https://lbs.alexandraene.com/subscribe-podcast Instagram: @alexandra_ene_ Facebook: https://www.facebook.com/obsimplificat YouTube: https://www.youtube.com/@AlexandraEne For suggestions, invitations and collaborations, write to me at: hello@alexandraene.com. ----------------------------------- WANT MORE? See the latest offers and resources: https://lbs.alexandraene.com/everything ----------------- RATE & REVIEW & SHARE *If you enjoyed the episode, would you consider leaving a review on Apple Podcast / iTunes? It takes less than 60 seconds and helps us improve the quality and interview new interesting people.
#turizm #ekonomi #2024 With tourism professional Murat Serim, we discussed expectations and scenarios for the tourism sector in 2024. We asked him whether Turkey is expensive compared to similar countries. Serim, who said that the Russians saved tourism in 2023, remarked of domestic tourism that "there are two different Turkeys". Enjoy watching!
Good morning, hello to everyone on this Wednesday morning! I'm Gamze Elvan. Güne Başlarken is back. From now on, every morning at 06:00, I will bring you the agenda of Turkey and the world, the economy, sports, in short, everything that comes to mind. So let's get started!
Rachel Dines, Head of Product and Technical Marketing at Chronosphere, joins Corey on Screaming in the Cloud to discuss why creating a cloud-native observability strategy is so critical, and the challenges that come with both defining and accomplishing that strategy to fit your current and future observability needs. Rachel explains how Chronosphere is taking an open-source approach to observability, and why it's more important than ever to acknowledge that the stakes and costs are much higher when it comes to observability in the cloud.

About Rachel
Rachel leads product and technical marketing for Chronosphere. Previously, Rachel wore lots of marketing hats at CloudHealth (acquired by VMware), and before that, she led product marketing for cloud-integrated storage at NetApp. She also spent many years as an analyst at Forrester Research. Outside of work, Rachel tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston where she's based.

Links Referenced:
Chronosphere: https://chronosphere.io/
LinkedIn: https://www.linkedin.com/in/rdines/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's featured guest episode is brought to us by our friends at Chronosphere, and they have also brought us Rachel Dines, their Head of Product and Solutions Marketing. Rachel, great to talk to you again.

Rachel: Hi, Corey. Yeah, great to talk to you, too.

Corey: Watching your trajectory has been really interesting, just because starting off, when we first started, I guess, learning who each other were, you were working at CloudHealth, which has since become VMware. And I was trying to figure out, huh, the cloud runs on money. How about that? It feels like it was a thousand years ago, but neither one of us is quite that old.

Rachel: It does feel like several lifetimes ago. You were just this snarky guy with a few followers on Twitter, and I was trying to figure out what you were doing mucking around with my customers [laugh]. Then [laugh] we kind of both figured out what we're doing, right?

Corey: So, speaking of that iterative process, today you are at Chronosphere, which is an observability company. We would have called it a monitoring company five years ago, but now that's become an insult after the observability war dust has settled. So, I want to talk to you about something that I've been kicking around for a while, because I feel like there's a gap somewhere. Let's say that I build a crappy web app, because all of my web apps inherently are crappy, and it makes money through some mystical form of alchemy. And I have a bunch of users, and I eventually realize, huh, I should probably have a better observability story than waiting for the phone to ring and a customer telling me it's broken.

So, I start instrumenting various aspects of it that seem to make sense. Maybe I go too low-level, like looking at all the disks on every server to tell me if they're getting full or not (they're ancient servers). Maybe I just have a Pingdom equivalent of, is the website up enough to respond to a packet?
And as I wind up experiencing different failure modes and getting yelled at by different constituencies, in my own career trajectory, by my own boss, you start instrumenting for all those different kinds of breakages, you start aggregating the logs somewhere, and the volume gets bigger and bigger with time. But it feels like it's sort of a reactive process as you stumble through that entire environment. And I know it's not just me, because I've seen this unfold in similar ways in a bunch of different companies. It feels to me, very strongly, like it is something that happens to you, rather than something you set about from day one with a strategy in mind. What's your take on an effective way to think about strategy when it comes to observability?

Rachel: You just nailed it. That's exactly the kind of progression that we so often see. And that's what I really was excited to talk with you about today…

Corey: Oh, thank God. I was worried for a minute there that you'd be like, "What the hell are you talking about? Are you just, like, some sort of crap engineer?" And, "Yes, but it's mean of you to say it." But yeah, what I'm trying to figure out is, is there some magic that I just was never connecting? Because it always feels like you're in trouble because the site's always broken, and oh, like, if the disk fills up, yeah, oh, now we're going to start monitoring to make sure the disk doesn't fill up. Then you wind up getting barraged with alerts, and no one wins, and it's an uncomfortable period of time.

Rachel: Uncomfortable period of time. That is one very polite way to put it. I mean, I will say it is very rare to find a company that actually sits down and thinks, "This is our observability strategy. This is what we want to get out of observability." You can think about a strategy in the old-school sense, and as a former industry analyst, I'm going to have to go back to my roots at Forrester, with thinking about the people, and the process, and the technology. But really, the bigger component is: what's the business impact? What do you want to get out of your observability platform? What are you trying to achieve? And a lot of the time, people have thought, "Oh, observability strategy. Great, I'm just going to buy a tool. That's it. That's my strategy." And I hate to break it to you, but buying tools is not a strategy. I'm not going to say, buy this tool. I'm not even going to say, "Buy Chronosphere." That's not a strategy. Well, you should buy Chronosphere. But that's not a strategy.

Corey: Of course. I'm going to throw the money by the wheelbarrow at various observability vendors, and hope it solves my problem. But if that solved the problem, and I've got to be direct here, I've never spoken to those customers.

Rachel: Exactly. I mean, that's why this space is such a great one to come in and be very disruptive in. And I think, back in the days when we were running in data centers, maybe even before virtual machines, you could probably get away with not having a monitoring strategy (I'm not going to call it observability; that's not what we called it back then), because what was the worst that was going to happen, right? There was a finite amount that your monitoring bill could be, a finite amount that your customer impact could be. Like, you're paying the penny slots, right? We're not on the penny slots anymore.
We're at the $50 craps table, and it's Las Vegas, and if you lose the game, you're going to have to run down the street without your shirt. Like, the game and the stakes have changed, and we're still pretending like we're playing penny slots, and we're not anymore.

Corey: That's a good way of framing it. I mean, I still remember some of my biggest observability challenges were building highly available rsyslog clusters so that you could bounce a member and not lose any log data, because some of that was transactionally important. And we've gone beyond that to a stupendous degree, but it still feels like you don't wind up building this into the application from day one. More's the pity, because if you did, and did that intelligently, that opens up a whole world of possibilities. I dream of that changing, where one day, whenever you start to build an app, we just push the button and automatically instrument it with OTel, so you instrument the thing once, everywhere it makes sense to do it, and then you can make your vendor selection and those sorts of decisions later in time. But these days, we're not there.

Rachel: Well, I mean, and there's also the question of just the legacy environment and the tech debt. Even if you wanted to… actually, I was having a beer yesterday with a friend who's a VP of Engineering, and he's got his new environment that they're building with observability instrumented from the start. How beautiful. They've got OTel, they're going to have tracing. And then he's got his legacy environment, which is a hot mess. So, you know, there's always going to be this bridge of the old and the new. But this is where it comes back to: no matter where you're at, you can stop and think, "What are we doing and why?" What is the cost of this? And not just cost in dollars, which I know you and I could talk about very deeply for a long period of time, but the opportunity costs. Developers are working on stuff when they could be working on something more valuable. Or the cost of making people work round the clock, trying to troubleshoot issues when there could be an easier way. So, I think it's stepping back and thinking about cost in terms of dollars and cents, time, opportunity, and then also impact, and starting to make some decisions about what you're going to do in the future that's different. Once again, you might be stuck with some legacy stuff that you can't really change that much, but [laugh] you've got to be realistic about where you're at.
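The "instrumented from the start" ideal Rachel describes can be surprisingly small in practice. Here is a minimal sketch for a Python Flask app using OpenTelemetry's auto-instrumentation package; the app and endpoint are invented for illustration, and `opentelemetry-instrumentation-flask` plus an exporter would need to be installed.

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# One line: every inbound request now produces a span automatically,
# before any hand-written instrumentation exists.
FlaskInstrumentor().instrument_app(app)

@app.route("/health")
def health():
    return "ok"
```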
Corey: I think that that is a… it's a hard lesson, to be very direct, in that companies need to learn it the hard way, for better or worse. Honestly, this is one of the things that I always noticed in startup land, where you had a whole bunch of, frankly, relatively early-career engineers in their early 20s, if not younger. But then the ops person was always significantly older, because the thing you actually want to hear from your ops person, regardless of how you slice it, is, "Oh, yeah, I've seen this kind of problem before. Here's how we fixed it." Or even better, "Here's the thing we're doing, and I know how that's going to become a problem. Let's fix it before it does." It's the, "What are you buying by bringing that person in?" "Experience, mostly."

Rachel: Yeah, that's an interesting point you make, and it kind of leads me down a little bit of a side note, but a really interesting antipattern that I've been seeing in a lot of companies is that more seasoned ops person being the one everyone calls when something goes wrong. Like, "Oh, my God, I don't know how to fix it. This is a big hairy problem," I call that one ops person, that very experienced person. That experienced person then becomes this huge bottleneck for solving problems. They might even be the only one who knows how to use the observability tool. So, if we can't find a way to democratize our observability tooling a little bit more, so that day-to-day engineers, more junior engineers, newer ones, people who are still ramping, can actually use the tool and be successful, we're going to have a big problem when these ops people walk out the door. Maybe they retire, maybe they just get sick of it. We have these massive bottlenecks in organizations, whether it's ops or DevOps or whatever, that I see often exacerbated by observability tools. Just a side note.

Corey: Yeah. On some level, it feels like a lot of these things can be fixed with tooling. And I'm not going to say that tools aren't important. You ever tried to implement observability by hand? It doesn't work. There have to be computers somewhere in the loop, if nothing else. And then it just seems to devolve into a giant swamp of different companies doing different things, taking different approaches. And whenever you read the marketing or hear the stories any of these companies tell, you also have to normalize it, translating from whatever marketing language they've got into something that comports with the reality of your own environment, and seeing if they align. And that feels like it is so much easier said than done.

Rachel: This is a noisy space, that is for sure. And you know, I think we could go out to ten people right now and ask those ten people to define observability, and we would come back with ten different definitions. And then if you throw a marketing person in the mix (guilty as charged, and I know you're a marketing person too, Corey, so you've got to take some of the blame), it gets mucky, right? But like I said a minute ago, the answer is not tools. Tools can be part of the strategy, but if you're just thinking, "I'm going to buy a tool and that's going to solve my problem," you're going to end up like this company I was talking to recently that has 25 different observability tools. And not only do they have 25 different observability tools, what's worse is they have 25 different definitions for their SLOs and 25 different names for the same metric. And to be honest, it's just a mess. I'm not saying go be Draconian and tell all the engineers, "You can only use this tool [unintelligible 00:10:34] use that tool." You've got to figure out this balance of hands-on, hands-off, you know? How much do you centralize, how much do you push and standardize? Otherwise, you end up with just a huge mess.

Corey: On some level, it feels like it was easier back in the days of building it yourself with Nagios, because there's only one answer, and it sucks, unless you want to start going down the world of HP OpenView. Which, step one: hire a 50-person team to manage OpenView.
Okay, that's not going to solve my problem either. So, let's get a little more specific. How does Chronosphere approach this? Because historically, when I've spoken to folks at Chronosphere, there isn't that much of a day-one story, in that, "I'm going to build a crappy web app. Let's instrument it for Chronosphere." There's a certain "you must be at least this tall to ride" expectation implicit in the product, just based on its origins. And I'm not saying that doesn't make sense, but it also means there's really no such thing as a greenfield build-out for you either.

Rachel: Well, yes and no. I mean, I think there are no greenfields out there, because everyone's doing something for observability, or monitoring, or whatever you want to call it, right? Whether they've got Nagios, whether they've got the Dog, whether they've got something else in there, they have some way of introspecting their systems, right? So, one of the things that Chronosphere is built on, and I actually think this could be part of how you think about building out an observability strategy as well, is this concept of control and open-source compatibility. We only collect data via open-source standards. You have to send this data via Prometheus, via OpenTelemetry; it could be older standards, like statsd or Graphite, but we don't have any proprietary instrumentation. And if I were making a recommendation to somebody building out their observability strategy right now, I would say open, open, open, all day long, because that gives you a huge amount of flexibility in the future. Because guess what? You might put together an observability strategy that seems like it makes sense for right now. Actually, I was talking to a B2B SaaS company that told me they made a choice a couple of years ago on an observability tool. It seemed like the right choice at the time. They were growing so fast, and they very quickly realized it was a terrible choice. But now it's going to be really hard for them to migrate, because it's all based on proprietary standards. Now, of course, a few years ago they didn't have the luxury of OpenTelemetry and all of this, but now that we have it, we can use these standards to future-proof against our mistakes. So, that's one big area that is both my recommendation and happens to be our approach at Chronosphere.
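The "open, open, open" recommendation has a concrete payoff: with OpenTelemetry, the only backend-specific piece of an instrumented app is the exporter endpoint. A minimal Python sketch, with a placeholder endpoint standing in for whichever vendor or collector you point it at:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Swapping vendors later means changing this endpoint (or a collector
# config), not re-instrumenting the application.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("crappy-web-app")
with tracer.start_as_current_span("handle_request"):
    pass  # application logic here
```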
It's a hard problem.

Rachel: It's a huge problem, and it's a big part of why I work at Chronosphere, to be honest. Because when I was—you know, towards the tail end at my previous company in cloud cost management, I had a lot of customers coming to me saying, "Hey, when are you going to tackle our Dog or our New Relic or whatever?" Similar to the experience you're having now, Corey, this was happening to me three, four years ago. And I noticed that there is definitely a correlation between people who are having these really big challenges with their observability bills and people that were adopting, like, Kubernetes, and microservices, and cloud-native. And it was around that time that I met the Chronosphere team—and that's exactly what we do, right? We focus on observability for these cloud-native environments where observability data just goes, like, wild.

We see 10X, 20X as much observability data, and that's what's driving up these costs. And yeah, it is becoming a board-level concern. I mean, and coming back to the concept of strategy, like, if observability is the second or third most expensive item in your engineering bill—like, obviously, cloud infrastructure is number one—number two or number three is probably observability. How can you not have a strategy for that? How can this be something the board asks you about, and you're like, "What are we trying to get out of this? What's our purpose?" "Uhhhh… troubleshooting?"

Corey: Right, because it turns into business metrics as well. It's not just about is the site up or not. There's a—like, one of the things that always drove me nuts, not just in the observability space, but even in cloud costing, is where, okay, your costs have gone up this week so you get a frowny face, or it's in red, like traffic light coloring. Cool, but for a lot of architectures and a lot of customers, that's because you're doing a lot more volume. That translates directly into increased revenues, increased things you care about. You don't have the position or the context to say, "That's good," or, "That's bad." It simply is. And you can start deriving business insight from that. And I think that is the real observability story that I think has largely gone untold at tech conferences, at least.

Rachel: It's so right. I mean, spending more on something is not inherently bad if you're getting more value out of it. And it's definitely a challenge on the cloud cost management side. "My costs are going up, but my revenue is going up a lot faster, so I'm okay." And I think part of the problem is, you know, we put observability in this box of, like, it's for low-level troubleshooting, but really, if you step back and think about it, there's a lot of larger, bigger picture initiatives that observability can contribute to in an org, like digital transformation. I know that's a buzzword, but, like, that is a legit thing that a lot of CTOs are out there thinking about. Like, how do we, you know, get out of the tech debt world, and how do we get into cloud-native?

Maybe it's developer efficiency. God, there's a lot of people talking about developer efficiency. Last week at KubeCon, that was one of the big, big topics. I mean, and yeah, what [laugh] what about cost savings? To me, we've put observability in a smaller box, and it needs to bust out.

And I see this also in our customer base, you know? Customers like DoorDash use observability, not just to look at their infrastructure and their applications, but also to look at their business.
At any given minute, they know how many Dashers are on the road, how many orders are being placed, cut by geos, down to the—actually down to the second, and they can use that to make decisions.

Corey: This is one of those things that I always found a little strange coming from the world of running systems in large [unintelligible 00:17:28] environments to fixing AWS bills. There's nothing that even resembles a fast, reactive response in the world of AWS billing. You wind up with a runaway bill, they're going to resolve that over a period of weeks, on Seattle business hours. If you wind up spinning something up that creates a whole bunch of very expensive drivers behind your bill, it's going to take three days, in most cases, before that starts showing up anywhere that you can reasonably expect to get at it. The idea of near real time is a lie unless you want to start instrumenting everything that you're doing to trap the calls and then run cost extrapolation from there. That's hard to do.

Observability is a very different story, where latencies start to matter, where being able to get leading indicators of certain events—be they technical or business—starts to be very important. But it seems like it's so hard to wind up getting there from where most people are. Because I know we like to talk dismissively about the past, but let's face it, conference-ware is the stuff we're the proudest of. The reality is the burning dumpster of regret in our data centers that still also drives giant piles of revenue, so you can't turn it off, nor would you want to, but you feel bad about it as a result. It just feels like it's such a big leap.

Rachel: It is a big leap. And I think the very first step I would say is trying to get to this point of clarity and being honest with yourself about where you're at and where you want to be. And sometimes not making a choice is a choice, right, as well. So, sticking with the status quo is making a choice. And so, like, as we get into things like the holiday season right now, and I know there's going to be people that are on-call 24/7 during the holidays, potentially, to keep something that's just duct-taped together barely up and running—I'm making a choice; you're making a choice to do that. So, I think that's, like, the first step: at least acknowledging where you're at, where you want to be, and if you're not going to make a change, just understanding the cost and being realistic about it.

Corey: Yeah, being realistic, I think, is one of the hardest challenges because it's easy to wind up going for the aspirational story of, "In the future when everything's great." Like, "Okay, cool. I appreciate the need to plant that flag on the hill somewhere. What's the next step? What can we get done by the end of this week that materially improves us from where we started the week?" And I think that with the aspirational conference-ware stories, it's hard to break that down into things that are actionable, that don't feel like they're going to be an interminable slog across your entire existing environment.

Rachel: No, I get it. And for things like, you know, instrumenting and adding tracing and adding OTel, a lot of the time, the return that you get on that investment is… it's not quite like, "I put a dollar in, I get a dollar out." I mean, with something like tracing, you can't get to 60% instrumentation and get 60% of the value. You need to be able to get to, like, 80, 90%, and then you'll get a huge amount of value.
So, it's sort of like you're trudging up this hill, you're charging up this hill, and then finally you get to the plateau, and it's beautiful. But that hill is steep, and it's long, and it's not pretty. And I don't know what to say other than there's a plateau near the top. And those companies that do this well really get a ton of value out of it. And that's the dream, that we want to help customers get up that hill. But yeah, I'm not going to lie, the hill can be steep.

Corey: One thing that I find interesting is there's almost a bimodal distribution in companies that I talk to. On the one side, you have companies like, I don't know, a Chronosphere is a good example of this. Presumably you have a cloud bill somewhere and the majority of your cloud spend will be on what amounts to a single application, probably in your case called, I don't know, Chronosphere. It shares the name of the company. The other side of that distribution is the large enterprise conglomerates where they're spending, I don't know, $400 million a year on cloud, but their largest workload is 3 million bucks, and it's just a very long tail of a whole bunch of different workloads, applications, teams, et cetera.

So, what I'm curious about from the Chronosphere perspective—or the product you have, not the 'you' in this metaphor, which gets confusing—is, it feels easier to instrument a Chronosphere-like company that has a primary workload that is the massive driver of most things and get that instrumented and start getting an observability story around that than it does to try and go to a giant company and say, "Okay, 1500 teams need to all implement this thing," when they're all going in different directions. How do you see it playing out among your customer base, if that bimodal distribution holds up in your world?

Rachel: It does and it doesn't. So, first of all, for a lot of our customers, we often start with metrics. And starting with metrics means Prometheus. And Prometheus has hundreds of exporters. It is basically built into Kubernetes. So, if you're running Kubernetes, getting Prometheus metrics out is actually not a very big lift. So, we find that we start with Prometheus, we start with getting metrics in—and we have a lot of customers that use us just for metrics, and they get a massive amount of value.

But then once they're ready, they can start instrumenting for OTel and start getting traces in as well. And yeah, in large organizations, it does tend to be one team, one application, one service, one department that kind of goes at it and gets all that instrumented. But I've even seen very large organizations, when they get their act together and decide, like, "No, we're doing this," they can get OTel instrumented fairly quickly. So, I guess it's, like, a lining-up exercise. It's more of a people issue than a technical issue a lot of the time.

Like, getting everyone lined up and making sure that, like, yes, we all agree. We're on board. We're going to do this. But it's usually, like, it's a start small, and it doesn't have to be all or nothing. We also just recently added the ability to ingest events, which is actually a really beautiful thing, and it's very, very straightforward.

It basically just—we connect to your existing other DevOps tools, so whether it's, like, a Buildkite, or a GitHub, or, like, a LaunchDarkly, and then anytime something happens in one of those tools, that gets registered as an event in Chronosphere. And then we overlay those events over your alerts.
So, when an alert fires, the first thing I do is go look at the alert page, and it says, "Hey, someone did a deploy five minutes ago," or, "There was a feature flag flipped three minutes ago," and I've solved the problem right then. I don't think of this as—there's not an all-or-nothing nature to any of this stuff. Yes, tracing is a little bit of a—you know, like I said, it's one of those things where you have to make a lot of investment before you get a big reward, but that's not the case in all areas of observability.

Corey: Yeah. I would agree. Do you find that there's a significant easy, early win when customers start adopting Chronosphere? Because one of the problems that I've found, especially with things that are holistic, and as you talk about tracing, well, you need to get to a certain point of coverage before you see value. But human psychology being what it is, you kind of want to be able to demonstrate, oh, see, the Mean Time To Dopamine needs to come down, to borrow an old phrase. Do you find that there's some easy wins that start to help people to see the light? Because otherwise, it just feels like a whole bunch of work for no discernible benefit to them.

Rachel: Yeah, at least for the Chronosphere customer base, one of the areas where we're seeing a lot of traction this year is in optimizing the costs—like, coming back to the cost story—of their overall observability bill. So, we have this concept of the control plane in our product where all the data that we ingest hits the control plane. At that point, that customer can look at the data, analyze it, and decide this is useful, this is not useful. And actually, not just decide that, but we show them what's useful, what's not useful. What's being used, what's high cardinality—and high cost—but maybe no one's touched it.

And then we can make decisions around aggregating it, dropping it, combining it, doing all sorts of fancy things, changing the—you know, downsampling it. On the trace side, we can do it both head-based and tail-based. On the metrics side, it's as it hits the control plane and then streams out. And then they only pay for the data that we store. So typically, customers come on board and immediately reduce their observability dataset by 60%. Like, that's just straight up, that's the average.

And we've seen some customers get really aggressive, get up to, like, in the 90s, where they realize we're only using 10% of this data. Let's get rid of the rest of it. We're not going to pay for it. So, paying a lot less helps in a lot of ways. It also helps customers get more coverage of their overall stack. So, I was talking recently with an autonomous vehicle company that recently came to us from the Dog, and they had made some really tough choices and were no longer monitoring their pre-prod environments at all because they just couldn't afford to do it anymore. It's like, well, now they can, and they're still saving money.

Corey: I think that there's also the downstream effect of the money saving to that; for example, I don't fix observability bills directly. But, "Huh, why is your CloudWatch bill through the roof?" Or data egress charges in some cases? It's oh, because your observability vendor is pounding the crap out of those endpoints and pulling all your log data across the internet, et cetera.
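Rachel's head-based versus tail-based distinction above is worth making concrete. The sketch below is not Chronosphere's implementation—just a minimal, generic illustration of the two decision points, with all names and thresholds invented for the example: head-based sampling commits when a trace starts, while tail-based sampling waits until the trace completes, so slow or failed traces can always be kept.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// Head-based sampling: decide when the trace starts, before anything is
// known about it. Hashing the trace ID keeps the decision consistent
// across every service that sees the same trace.
func headSample(traceID string, keepRatio float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return float64(h.Sum32())/float64(^uint32(0)) < keepRatio
}

// A finished trace, reduced to the fields a tail-based decision needs.
type trace struct {
	ID       string
	Err      bool
	Duration time.Duration
}

// Tail-based sampling: decide after the trace completes, so slow or
// failed traces are always kept while healthy ones are downsampled.
func tailSample(t trace, slow time.Duration) bool {
	return t.Err || t.Duration > slow
}

func main() {
	t := trace{ID: "abc123", Err: false, Duration: 750 * time.Millisecond}
	fmt.Println("head:", headSample(t.ID, 0.10))               // ~10% kept, chosen up front
	fmt.Println("tail:", tailSample(t, 500*time.Millisecond)) // kept: it was slow
}
```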
And that tends to mean, oh yeah, it's not just the first-order effect; it's the second and third and fourth-order effects this winds up having. It becomes almost a holistic challenge. I think that trying to put observability in its own bucket, on some level—when you're looking at it from a cost perspective—starts to be, I guess, a structure that makes less and less sense in the fullness of time.

Rachel: Yeah, I would agree with that. I think that just looking at the bill from your vendor is one very small piece of the overall cost you're incurring. I mean, all of the things you mentioned—the egress, the CloudWatch, the other services it's impacting—and what about the people?

Corey: Yeah, it sure is great that your team works for free.

Rachel: [laugh]. Exactly, right? I know, and it makes me think a little bit about that viral story about that particular company with a certain vendor that had a $65 million per year observability bill. And that impacted not just them, but, like, it showed up in both companies' financial filings. Like, how did you get there? How did you get to that point? And I think this all comes back to the value in the ROI equation. Yes, we can all sit in our armchairs and be like, "Well, that was dumb," but I know there are very smart people out there that just got into a bad situation by kicking the can down the road on not thinking about the strategy.

Corey: Absolutely. I really want to thank you for taking the time to speak with me about, I guess, the bigger picture questions rather than the nuts and bolts of a product. I like understanding the overall view that drives a lot of these things. I don't feel I get to have enough of those conversations some weeks, so thank you for humoring me. If people want to learn more, where's the best place for them to go?

Rachel: So, they should definitely check out the Chronosphere website. Brand new, beautiful, spankin' new website: chronosphere.io. And you can also find me on LinkedIn. I'm not really on the Twitters so much anymore, but I'd love to chat with you on LinkedIn and hear what you have to say.

Corey: And we will, of course, put links to all of that in the [show notes 00:28:26]. Thank you so much for taking the time to speak with me. It's appreciated.

Rachel: Thank you, Corey. Always fun.

Corey: Rachel Dines, Head of Product and Solutions Marketing at Chronosphere. This has been a featured guest episode brought to us by our friends at Chronosphere, and I'm Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment that I will one day read once I finish building my highly available rsyslog system to consume it with.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
Mike Goldsmith, Staff Software Engineer at Honeycomb, joins Corey on Screaming in the Cloud to talk about OpenTelemetry, company culture, and the pros and cons of Go vs. .NET. Corey and Mike discuss why OTel is such an important tool, while pointing out the double-edged sword of it being fully open-source and community-driven. Opening up about Honeycomb's company culture and how to find a work-life balance as a fully-remote employee, Mike points out how core values and social interaction breathe life into a company like Honeycomb.

About Mike

Mike is an open-source-focused software engineer who builds tools to help users create, shape, and deliver system and application telemetry. Mike contributes to a number of OpenTelemetry initiatives, including being a maintainer for the Go auto-instrumentation agent and the Go proto packages, and an emeritus maintainer of the .NET SDK.

Links Referenced: Honeycomb: https://www.honeycomb.io/ Twitter: https://twitter.com/Mike_Goldsmith Honeycomb blog: https://www.honeycomb.io/blog LinkedIn: https://www.linkedin.com/in/mikegoldsmith/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest episode is brought to us by our friends at Honeycomb, who I just love talking to. And we've gotten to talk to these folks a bunch of different times in a bunch of different ways. They've been a recurring sponsor of this show and my other media nonsense, they've been a reference customer for our consulting work at The Duckbill Group a couple of times now, and we just love working with them because every time we do, we learn something from it. I imagine today is going to be no exception. My guest is Mike Goldsmith, who's a staff software engineer over at Honeycomb. Mike, welcome to the show.

Mike: Hello. Thank you for having me on the show today.

Corey: So, I have been familiar with Honeycomb for a long time. And I'm still trying to break myself out of the misapprehension that, oh, they're a small, scrappy, 12-person company. You are very much not that anymore. So, we've gotten to a point now where I definitely have to ask the question: what part of the observability universe that Honeycomb encompasses do you focus on?

Mike: For myself, I'm very focused on the telemetry side, so the tools that customers deploy in their own infrastructure to collect all of that useful data, which we can then send on to Honeycomb to make use of and help identify where the problems are, where things are changing, how we can best serve that data.

Corey: You've been, I guess on some level, there's—I'm trying to make this not sound like an accusation, but I don't know if we can necessarily avoid that—you have been heavily involved in OpenTelemetry for a while, both professionally, as well as an open-source contributor in your free time because apparently you also don't know how to walk away from work when the workday is done. So, let's talk about that a little bit because I have a number of questions.
Starting at the very beginning, for those who have not gone trekking through that particular part of the wilderness-slash-swamp, what is OpenTelemetry?

Mike: So, OpenTelemetry is a vendor-agnostic set of tools that allow anybody to collect data about their system and then send it to a target back-end to make use of that data. The data—whether it's tracing data or metrics or logs—and the visualization tools that make use of it are a variety of different things, and then it's trying to take value from that. The big thing OpenTelemetry is aimed at doing is making the collection of the data, and the transit of the data to wherever you want to send it, a community-owned resource, so you don't get vendor lock-in by using one vendor, where if you want to go and try a different tool you've got to re-instrument or change your application heavily to make use of it. OpenTelemetry abstracts all that away, so all you need to know about is what you're instrumenting with, what [unintelligible 00:03:22] can make of that data, and then you can send it to one or multiple different tools to make use of that data. So, you can even compare some tools side-by-side if you wanted to.

Corey: So, given that it's an open format, from the customer side of the world, this sounds awesome. Is it envisioned that this is something that gets instrumented at the application itself, or, once I send it to another observability vendor, is it envisioned that, okay, if I send this data to Honeycomb, I can then instrument what Honeycomb sees about that and then send that onward somewhere else—maybe my ancient rsyslog server, maybe a different observability vendor that has a different emphasis? Like, how is it envisioned unfolding within the ecosystem? Like, in other words, can I build a giant ring of these things that just keep building an infinitely expensive loop?

Mike: Yeah. So ideally, you would try to pick one or a few tools that will provide the most value that you can send to, and then it could answer all of the questions for you. So, at Honeycomb, we are primarily focused on tracing because we want to do application-level information to say, this user had this interaction, this is the context of what happened, these are the things that they clicked on, this is the information that flowed through your back-end system, this is the line-item order that was generated, the email content, all of those things all linked together so we know that person did this thing, it took this amount of time, and then over a longer period of time, from the analytics point of view, you can then say, "These are the most popular things that people are doing. This is typically how long it takes." And then we can highlight outliers to say, "Okay, this person is having an issue." This individual person, we can identify them and say, "This is an issue. This is what's different about what they're doing."

So, that's quite a unique tracing tool or opportunity there. So, that lets you really drive what's happening rather than what has happened. Logs and metrics are very backward-looking, to say, "This is the thing that happened," and try to give you the context about it.
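For readers who haven't touched it, a minimal sketch of the spans-within-a-trace idea Mike describes, using the public OpenTelemetry Go SDK and its OTLP/HTTP exporter. The tracer and span names are invented for the example; the export target comes from standard environment variables, which is the lock-in escape hatch being discussed.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP over HTTP; the target is read from the standard
	// OTEL_EXPORTER_OTLP_* environment variables, so swapping backends
	// is configuration, not re-instrumentation.
	exporter, err := otlptracehttp.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// A trace covers the whole request; each span is one step within it.
	tracer := otel.Tracer("checkout")
	ctx, parent := tracer.Start(ctx, "place-order")
	_, child := tracer.Start(ctx, "charge-card")
	child.End()
	parent.End()
}
```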
Tracing tries to give you that extra layer of context to say that this thing happened and it had all of these things related to it, and why is it interesting?

Corey: It's odd to me that vendors would be putting as much energy into OpenTelemetry—or OTel, as it always seems to be abbreviated when I encounter it, so I'm using the term just so people go, "Oh, wait, that's that thing I keep seeing. What is that?" Great—but it seems odd to me that vendors would be as embracing of that technology as they have been, just because historically, I remember whenever I had an application that I was running in production in anger—which honestly, 'anger' is a great name for the production environment—whenever I was trying to instrument things, it was okay, you'd have to grab this APM tool's library and instrument there, and then something else as well, and you wound up with an order-of-operations problem of which one wrapped the other. And sometimes that caused problems. And of course, changing vendors meant you had to go and redeploy your entire application with different instrumentation and hope nothing broke. There was a lock-in story that was great for the incumbents back when that was state of the art. But even some of those incumbents are now embracing OTel. Why?

Mike: I think it's because it's showing that there's such a diverse group of tools there, rather than [unintelligible 00:06:32] being the one that you've selected a number of years ago and then they could hold on to that. The momentum slowed because the incumbents were able to move at a slower pace—they were the de facto tooling. And then once new companies and competitors came around and were open to trying to get a part of that market share, it's given everyone the opportunity to really pick the tool that is right for the job, rather than just what is perceived to be the best tool because it's the largest one or the one that most people are using. OpenTelemetry allows an organization and a tool that's providing those tools to focus on being the best at it, rather than just the biggest one.

Corey: That is, I think, a more enlightened perspective than, frankly, I expect a number of companies out there to have taken, just because it seems like lock-in seems to be the order of the day for an awful lot of companies. Like, "Okay, why are customers going to stay with us?" "Because we make it hard to leave," is… I can understand the incentive, but that only works for so long if you're not actively solving a problem that customers have. One of the challenges that I ran into, even with OTel, was back when I was last trying to instrument a distributed application—which was built entirely on Lambda—is the fact that I was doing this for an application that was built entirely on Lambda. And it felt like the right answer was to, oh, just use an OTel layer—a Lambda layer that wound up providing the functionality you cared about.

But every vendor seemed to have their own. Honeycomb had one, Lightstep had one, AWS had one, and now it's oh, dear, this is just the next evolution of that specific agent problem. How did that play out? Is that still the way it works? Are there other good reasons for this? Or is this just people trying to slap a logo on things?

Mike: Yeah, so being a fully open-source project and a community-driven project is a double-edged sword in some ways.
One, it gives the opportunity for everybody to participate, everybody can move between tools a lot more easily, and you can try and find the best fit for you. The unfortunate part around open-source-driven projects like that is that they're extremely configuration-heavy, because they can do anything; they're not opinionated, which means that if you want to have the opportunity to do everything, every possible use case is available to everyone all of the time. So, you might have a very narrow use case, say, "I want to learn about this bit of information," like, "I'm working with the [unintelligible 00:09:00] SDK. I want to talk about—I've got an [unintelligible 00:09:03] application and I want to collect data that's running in a Lambda layer." The OpenTelemetry SDK has to serve all of the [other 00:09:10] JavaScript projects across all the different instrumentations, possibly talking about auto-instrumentation, possibly talking about lots of other tools that can be built into that project, and it just leads to a very highly configurable but very complicated tool.

So, with the vendor specifics you've suggested there—like Honeycomb, or other organizations providing the layers—they're trying to simplify the usage of the SDK by making some of those assumptions for you: that you are going to be sending telemetry to Honeycomb, that you are going to be talking about an API key that is going to be in a particular format. It is easier to pass that information into the SDK so it knows how to communicate, as well as where it's going to communicate that data to.

Corey: There's a common story that I tend to find myself smacking into almost against my will, where I have found myself at the perfect intersection of a variety of different challenges, and for some reason, I have stumbled blindly and through no ill intent into 'this is terrible' territory. I wound up finally getting blocked and getting distracted by something else shiny on this project about two years ago because the problem I was getting into was, okay, I've got to start sending traces to various places and that was awesome, but now I wanted to annotate each span with a user identity that could be derived from code, and the way that it interfaced with the various Lambda layers at that point in time was, ooh, that's not going to be great. And I think there were a couple of GitHub issues opened on it as feature enhancements for a couple of layers. And then I, again, was still distracted by shiny things and never went back around to it. But I was left with the distinct impression that building something purely out of Lambda functions—and also probably popsicle sticks—is something of an edge case. Is there a particular software architecture or infrastructure architecture that OTel favors?

Mike: I don't think it favors any in particular, but it definitely suffers because, as I said earlier, the single SDK has to be available to many different use cases, which has its own challenges because then it has to deal with so many different options. But I don't think OpenTelemetry has a specific, like, use case in mind. It's definitely focused on, like—sorry, tracing is focused on application telemetry. So, it's focused on the code that you build yourself and then deploy.
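To make Mike's point concrete, here is roughly what a vendor layer or distro is saving you from typing—a hedged sketch using the OTel Go OTLP/HTTP exporter options. The endpoint and header follow Honeycomb's publicly documented OTLP conventions, but treat the specifics as assumptions and the key as a placeholder.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

func main() {
	ctx := context.Background()

	// Spelled out: where the data goes and how the API key travels.
	// Equivalently, set OTEL_EXPORTER_OTLP_ENDPOINT and
	// OTEL_EXPORTER_OTLP_HEADERS and call otlptracehttp.New(ctx) bare;
	// this is the configuration a vendor distro typically pre-bakes.
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("api.honeycomb.io"), // documented OTLP host; verify against current docs
		otlptracehttp.WithHeaders(map[string]string{
			"x-honeycomb-team": "YOUR_API_KEY", // placeholder key; header name is Honeycomb's convention
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = exporter // wire into a TracerProvider as in the earlier sketch
}
```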
There are other tools that can collect operational data, things like the OpenTelemetry Collector, which is available to sit outside of that process and say, what's going on in my system?

But yeah, I wouldn't say that there's a specific infrastructure that it's aimed at. A lot of the cloud operators and tools are trying to make sure that that information is available and OpenTelemetry SDKs are available. But yeah, at the moment, it does require some knowledge around what's best for your application if you're not in complete control of all of the infrastructure that it's running in.

Corey: It feels like, with most things that are sort of pulled into the orbit of the CNCF—and OTel is no exception to this—there's an idea that oh, well, everything is going to therefore be running in containers, on top of Kubernetes. And that might be unfair, but it also, frankly, winds up following pretty accurately what a lot of applications I'm seeing in client environments have been doing. Don't take it as a criticism. But it does seem like it is designed with an eye toward everything being microservices running on containers, scheduled with what appears, from an infrastructure perspective, to be willy-nilly abandon, and how do you wind up gathering useful information out of that without drowning in data? That seems to be, from at least my brief experience with OTel, the direction it heads in. Is that directionally correct?

Mike: Yeah, I think so. I think OpenTelemetry has a quite strong relationship with the CNCF and therefore Kubernetes. That is a use case that we see as very common with customers that we engage with, both at the prospect level and then in initial conversations; people using something like Kubernetes to do the application orchestration is very, very common. It's something that OpenTelemetry and Honeycomb are wanting to improve on as well. We want to provide a very good experience, because it is so common when we come up to it that we want to have a very good, strong opinion around, well, if you're running in Kubernetes, these are the tools and these are the right ways to use OpenTelemetry to get the best out of it.

Corey: I want to change gears a little bit. Something that's interested me about Honeycomb for a while has been its culture. Your founders have been very public about their views on a variety of different things that are not just engineering-centric, but tangential to it, like, engineering management: how not to be terrible at it. And based on a huge number of conversations I've had with folks over there, I'm inclined to agree that the stories they tell in public do align with how things go internally. Or at least if they're not, I would not expect you to admit it on the record, so either way, we'll just take that as a given.

What I'm curious about is that you are many timezones away from their very nice office here in San Francisco. What's it like working remote in a company that is not fully distributed? Which is funny; we talk about distributed applications as if they're a given, but distributed teams are still something we're wrangling with.

Mike: Yeah, it's something that I've dealt with for quite a while. For maybe seven or eight years, I've worked with a few different organizations that are not based in my timezone. There have been a couple primarily based in the San Francisco area, so Pacific Time. An eight-hour time difference for the UK is challenging, it has its own challenges, but it also has a lot of benefits, too.
So typically, I get to have a lot of focus time in the morning. That means that I can start my day, look through whatever I think is appropriate for that morning, and not get interrupted very easily.

I get a lot of time to think and plan, and I think that's helped me at, like, the tech lead level because I can really focus on something and think it through without that level of interruption that I think some people get if you're working in the same timezone or even in the same office as someone. That approachability is just not naturally there. But the other side of that is that I have a very limited amount of natural overlap with people I work with on a day-to-day basis, so it's typically meetings from 2 till 5 p.m. most days to try and make sure that I build those social relationships, I'm talking to the right people, giving status updates, planning, and that sort of thing. But it works for me. I really enjoy that balance of, like, having a lot of focus time and then having, like, dedicated time to spend with people.

And I think that's really important, as well, because a distributed team naturally means that you don't get to spend a lot of time with people and a lot of, like, one-on-one time with people, so that's something that I definitely focus on is doing a lot of social interaction as well. So, it's not just I have a meeting, we've got a stand-up, we've got 15 minutes, and then everyone goes and does their own thing. I like to make sure that we have time so we can talk, we can connect to each other, we know each other—things that [unintelligible 00:16:35] allow a space for conversations to happen that would naturally happen if you were sat next to somebody at a desk, or, like, the more traditional water cooler conversations. You hear somebody having a conversation, you go talk to them, that naturally evolves.

Corey: That was where I ran into a lot of trouble with it myself. My first outing as a manager, I had—most of the people on my team were in the same room as I was, and then we had someone who was in Europe. And as much as we tried to include this person in all of our meetings, there was an intrinsic, "Let's go get a cup of coffee," or, "Let's have a discussion and figure things out." And sometimes it's four in the afternoon, we're going to figure something out, and they have long since gone to bed or have a life, hopefully. And it was one of those areas where despite a conscious effort to avoid this problem, it was very clear that they did not have an equal voice in the team dynamic, in the team functioning, in the team culture, and in many cases, some of the decisions we ultimately reached as an outgrowth of those sidebar conversations. This led to something of an almost religious belief for me, at least for a while, that either everyone's distributed or no one is, because otherwise you wind up with the unequal access problem. But it's clearly worked for you folks. How have you gotten around that?

Mike: For Honeycomb, it was a conscious decision not long before the Covid pandemic that the team would be distributed first; the whole organization would be distributed first. So, a number of months before that happened, the intention was that anybody across the organization—which at the time was only North America-based staff—would be able to do their job outside of the office.
Because I think around the end of 2019 to the beginning of 2020, a lot of the staff were based in the San Francisco area, and that was starting to change as the business grew and wanted more staff to come in. And there were more opportunities for people outside of that area to join the business, so the business decided that if we're going to do this, if we're going to hire people outside of the local area, then we do want to make sure that, as you said, everybody has equal access, everyone has equal opportunity, they can participate, and everybody has the same opportunity to do those things. And that has definitely fed through the pandemic, and then even when the office reopened and people could go back into the office. I think there's only… maybe 25% of the company now that is even in the Pacific Time Zone. And then the office space itself is not very large considering the size of the company, so we couldn't fit everybody into our office space if we wanted to.

Corey: Yeah, that's one of the constant growing challenges, too. I understand that a lot of companies do see value in the idea of getting everyone together in a room. I know that I, for example, am a lot more effective and productive when I'm around other people. But I'm really expensive to their productivity because I am Captain Interrupter, which, you know, we have to recognize our limitations as we encounter them. But that also means that the office expense exceeds the AWS bill past a certain point of scale, and that is not a small thing. Like, I try not to take too much of a public opinion on should we be migrating everyone back to return-to-office as a mandate, yes, no, et cetera.

I can see a bunch of different perspectives on this that are nuanced and I don't think it lends itself to my usual reactionary take on the Twitters, as it were, but it's a hard problem with no easy answer to it. Frankly, I also think it's a big mistake to do full-remote only for junior employees, just because so much of learning how the workforce works is through observation. You don't learn a lot about those unspoken dynamics in any other way than observing it directly.

Mike: Yes, I fully agree. I think at the stage that Honeycomb was at when I joined, and has continued to be, a very junior person joining an organization that is fully distributed has more challenges. It has different challenges, but it has more challenges, because you can't just see something happening and know that that's the norm or that that's the expectation. You've got to push yourself into those different arenas, those different conversations, and it can be quite daunting when you're new to an organization, especially if you are not experienced in that organization or experienced in the role that you're currently occupying. Yeah, I think the fully distributed organization has its challenges, and something that we do at Honeycomb is that we intentionally, twice a year, maybe three times a year, bring together the people that do work very closely, so they have that opportunity to work together, build those social interactions like I mentioned earlier, and then do some work together as well.

And it builds a stronger trust relationship because of that, as well, because you're reinforcing the social side with the work side in a face-to-face context. And there's just, there's no direct replacement for face-to-face.
If you worked for somebody and never met them for over a year, it'd be very difficult to then just be in a room together and have a normal conversation.

Corey: It takes a lot of effort because there's so much to a company culture that is not meetings or agenda-driven or talking about the work. I mean, companies get this wrong with community all the time, where they think that a community is either a terrible option—people we can sell things to—or, more correctly, a place where users of our product or service or offering or platform can gather together to solve common challenges and share knowledge with each other. But where they fall flat often is that it also has to have a social element. Like, ohh, "having a conversation about your lives is not on topic for this community Slack team"—great, that strangles community before it can even form, in many cases. And work is no different.

Mike: Yeah, I fully agree. We see that with the Honeycomb Pollinators Slack channel. So, we use that as a primary way for community members to participate, talk to each other, share their experiences, and we can definitely see that there is a high level of social interaction alongside that. They connect because they've got a shared interest or a shared tool or a shared problem that they're trying to solve, but we do see, like, the same people reconnecting or re-communicating with each other because they have built that social connection there as well.

And I think that's something that a community like OpenTelemetry's is more welcoming to. And then you can participate in something that transcends the different organizations that you may work for as well, because you're already part of this community. So, if that community then reaches into another organization, there's an opportunity to move between organizations and then maintain a level of connection.

Corey: That seems like one of the better approaches that people can have to this stuff. It's just a—the hard part, of course, is how do you change culture? I think the easy way to do it—the only easy way to do it—is you have to build the culture from the beginning. Every time I see companies bringing in outsiders to change the corporate culture, I can't help but feel that they're setting giant piles of money on fire. Culture is one of those things that's organic, and just changing it by fiat doesn't work. If I knew how to actually change culture, I would have a much more lucrative target for my consultancy than I do today. You think AWS bills are a big problem? Everyone has a problem with company cultures.

Mike: Yeah, I fully agree. I think that culture is something that, you're right, is very organic; it naturally happens. I think the value when organizations go through, like, a retrospective—what is our culture? How would we define it? What are the core values of that, and how do we articulate them to people that might be coming into the organization?—that's very valuable, too, because those core values are very useful to communicate to people.

So, one of the bigger core values that we've got at Honeycomb is what we refer to as, "We hire adults," meaning that when somebody needs to do something, they can just go and do it. You don't have to report to somebody, you don't have to go and tell somebody, "I need a doctor's appointment," or, "I've got to go and pick up the kids from school," or something like that. You're trusted to do your job to the highest level, and if you need additional help, you can ask for it.
If somebody requires something of you, they ask for it. They do it in a humane way, and they expect to be treated like a human and an adult all of the time.

Corey: On some level, I've always found, for better or worse, that people will largely respond to how you treat them and live up or down to the expectation placed upon them. You want a bunch of cogs who are going to have to raise their hand to go to the bathroom? Okay, you can staff that way if you want, but don't be surprised when those teams don't volunteer to come up with creative solutions to things either. You can micromanage people to death.

Mike: Yeah. Yeah, definitely. I've been in organizations, like, fresh out of college, where I had to go to work at a particular place and it was very time-managed. I had inbound sales calls and things like that, and it was very, like, if you've spent more than three minutes on wrap-up from a previous call, your manager will call your phone to say, "You need to go on to the next call." And it's… you could have had a really important call or you could have had a very long call. They didn't care. They just wanted—you've had your time, now move on to the next one.

Corey: One last question I want to ask you about before we wind up calling this an episode, and it distills down to, I guess, effectively, your history, for lack of a better term. You have done an awful lot of Go maintenance work—Go meaning the language, not the imperative command, to be clear—but you also historically were the .NET SDK maintainer for something or other. Do you find those languages to be similar, or… how did that come to be? I mean, to be clear, my programming languages of choice are twofold: both brute force and enthusiasm. Most people take a slightly different path.

Mike: Yeah, I worked with .NET for a very long time. The first real organization I joined after finishing college was a .NET shop, and it just sort of stuck. I enjoyed the language. At the time—what, 12, 15 years ago—the language itself was moving pretty well, there were things being added to it, it was enjoyable to use.

Over the last maybe four or five years, I've had the opportunity to work a lot more in Go. And they are very different. So, Go is much more focused on simplicity and not hiding anything from anybody, and just being very efficient at what you can see it does. .NET and many other languages, such as Java, Ruby, JavaScript, Python, all have a level of magic to them, so if you're not part of the ecosystem or if you don't know particular really common packages that can do things for you, not knowing something about the ecosystem causes pain.

I think Go takes away some of that because if you don't know those ecosystems or if you don't know those tools, you can still solve the problem fairly quickly and fairly simply. Tools will help, but they're not required. .NET is probably on the boundary for me. It's still very easy to use, I enjoy using it, but it's not that long ago that I switched from thinking like a .NET developer—whenever I'm forming code in my head, like, how I would solve a problem, for a very long time it was in .NET and C#.

I'd probably say in the last 12 months or so, it's definitely moved more to Go, just because of the simplicity.
And it's also the tool that is most used within Honeycomb especially, so if you're talking about Go code, you've got a wider audience to bounce ideas off, to talk to, communicate, get ideas from. .NET is not a very well-used language within Honeycomb, and probably even, like… even maybe in West Coast-based organizations; it seems to be very high-level organizations that are willing to pay up for, like, Microsoft support. Go is something that a lot of developers use because it's very simple, very quick, can move quick.

Corey: I found that it was very easy for me to pick up Go to build out something ridiculous a few years back when I needed to control my video camera through its 'API,' to use the term charitably. And it just works in a way that made an awful lot of sense. But I still find myself reaching for Python or for—God help me—TypeScript if I'm doing some CDK work these days. And honestly, they all tend to achieve more or less the same outcome. It's just different approaches to—well, to be unkind—dependency management in some cases, and also the ecosystem around it and what is done for you.

I don't think there's a bad language to learn. I don't want this to be interpreted as language snobbery, but I haven't touched anything in the Microsoft ecosystem for a long time in production, so .NET was just never on my radar. But it's clear they have an absolutely massive community ecosystem built around it, and that is no small thing. I'd say it rivals Java.

Mike: Yeah, definitely. I think over the last ten years or so, the popularity of .NET as an enterprise language has grown, especially as larger-scale organizations have taken it on, and then, like, six, seven years ago, they introduced the .NET Core framework, which allowed it to run on non-Windows platforms, and that accelerated the language dramatically, so they have a consistent API that can be used on Windows, on Linux, on Mac, and that makes a huge difference for creating a larger audience for people to interact with it. And then also, with Azure becoming much more popular, they have this language for people who are typically used to Linux as the operating system that runs infrastructure, and not being forced to use Windows is probably quite a big thing for Azure as well.

Corey: I really want to thank you for taking the time to talk about what you're up to over there. If people want to learn more, where's the best place for them to go find you?

Mike: Typically, I use Twitter, so it's Mike_Goldsmith. I create blogs on the Honeycomb blog website, where I've done a few different things; I've got a new one coming up soon to talk about different ways of collecting data. So yeah, those are the two main places. LinkedIn is usual as ever, but that's a little bit more work-focused.

Corey: It does seem to be. And we will, of course, put links to all of that in the [show notes 00:31:11]. Thank you so much for being so generous with your time, and of course, thank you, Honeycomb, for sponsoring this episode of my ridiculous podcast.

Mike: Yeah, thank you very much for having me on.

Corey: Mike Goldsmith, staff software engineer at Honeycomb. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an insulting comment that we will then have instrumented across the board with a unified observability platform to keep our days eventful.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Seif Lotfy, Co-Founder and CTO at Axiom, joins Corey on Screaming in the Cloud to discuss how and why Axiom has taken a low-cost approach to event data. Seif describes the events that led to him helping co-found a company, and explains why the team wrote all their code from scratch. Corey and Seif discuss their views on AWS pricing, and Seif shares his views on why AWS doesn't have to compete on price. Seif also reveals some of the exciting new products and features that Axiom is currently working on.

About Seif

Seif is the bubbly Co-founder and CTO of Axiom, where he has helped build the next generation of logging, tracing, and metrics. His background is at Xamarin and Deutsche Telekom, and he is the kind of deep technical nerd that geeks out on white papers about emerging technology and then goes to see what he can build.

Links Referenced: Axiom: https://axiom.co/ Twitter: https://twitter.com/seiflotfy

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest episode is brought to us by my friends, and soon to be yours, over at Axiom. Today I'm talking with Seif Lotfy, who's the co-founder and CTO of Axiom. Seif, how are you?

Seif: Hey, Corey, I am very good, thank you. It's pretty late here, but it's worth it. I'm excited to be on this interview. How are you today?

Corey: I'm not dead yet. It's weird, I see you at a bunch of different conferences, and I keep forgetting that you do in fact live half a world away. Is the entire company based in Europe? And where are you folks? Where do you start and where do you stop geographically? Let's start there. Everyone dives right into product. No, no, no. I want to know where in the world people sit because apparently, that's the most important thing about a company in 2023.

Seif: Unless you ask Zoom, because they're undoing whatever they did. We're from New Zealand, all the way to San Francisco, and everything in between. So, we have people in Egypt and Nigeria, all around Europe, all around the US… and the UK, if you don't consider it Europe anymore.

Corey: Yeah, it really depends. There's a lot of unfortunate naming that needs to get changed in the wake of that.

Seif: [laugh].

Corey: But enough about geopolitics. Let's talk about industry politics. I've been a fan of Axiom for a while, and I was somewhat surprised to realize how long it had been around because I only heard about you folks a couple of years back. What is it you folks do? Because I know how I think about what you're up to, but you've also gone through some messaging iteration, and it is a near certainty that I am behind the times.

Seif: Well, at this point, we just define ourselves as the best home for event data. So, Axiom is the best home for event data. We try to deal with everything that is event-based, so time-series. So, we can talk metrics, logs, traces, et cetera. And right now we're predominantly serving engineering and security.

And we're trying to be—or we are—the first cloud-native time-series platform to provide streaming search, reporting, and monitoring capabilities. And we're built from the ground up, by the way.
Like, we didn't actually—we're not using Parquet [unintelligible 00:02:36] thing. We built everything from the ground up.

Corey: When I first started talking to you folks a few years back, there were two points to me that really stood out, and I know at least one of them still holds true. The first is that at the time, you were primarily talking about log data. Just send all your logs over to Axiom. The end. And that was a simple message that was simple enough that I could understand it, frankly.

Because back when I was slinging servers around and, you know, breaking half of them, logs were effectively how we kept track of what was going on, where. These days, it feels like everything has been repainted with a very broad brush called observability, and the takeaway from most company pitches has been, you must be smarter than you are to understand what it is that we're up to. And in some cases, you scratch below the surface and realize no, they have no idea what they're talking about either, and they're really hoping you don't call them on that.

Seif: It's packaging.

Corey: Yeah. It is packaging, and that's important.

Seif: It's literally packaging. If you look at it, traces and logs, these are events. There's a timestamp and just data with it. It's a timestamp and data with it, right? Even metrics come down to that at the end of the day.

And a good example, now everybody's jumping on [OTel 00:03:46]. For me, OTel is nothing else but a different structure for time series—for different types of time series—and that can be used differently, right? Or at least not used differently, but you can leverage it differently.

Corey: And the other thing that you did that was interesting and is a lot, I think, more sustainable as far as [moats 00:04:04] go, rather than things that can be changed on a billboard or whatnot, is your economic position. And your pricing has changed around somewhat, but I ran a number of analyses on the costs that you were passing on to customers, and my takeaway was that it was a little bit more expensive to store data for logs in Axiom than it was to store it in S3, but not by much. And it just blew away the price point of everything else focused around logs, including AWS; you're paying 50 cents a gigabyte to ingest CloudWatch logs data over there. Other companies are charging multiples of that, and Cisco recently bought Splunk for $28 billion because it was cheaper than paying their annual Splunk bill. How did you get to that price point? Is it just a matter of everyone else being greedy, or have you done something different?

Seif: We looked at it from the perspective of… so there's the three L's of logging. I forgot the name of the person at Netflix who talked about that, but basically, it's low cost, low latency, large scale, right? And you will never be able to fulfill all three of them. And we decided to work on low cost and large scale. And in terms of low latency, we won't be as low as others, like ClickHouse, but we are low enough. Like, we're fast enough.

The idea is to be fast enough because in most cases, I don't want to compete on milliseconds. I think if the user can see his data in two seconds, he's happy. Or three seconds, he's happy. I'm not going to be, like, one to two seconds and make the cost exponentially higher because I'm one second faster than the other.
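A sketch of what "a timestamp and data with it" looks like in practice when shipping an event to Axiom over plain HTTP from Go. The dataset name and token are placeholders, and while the endpoint path follows Axiom's documented ingest API, verify it against the current docs rather than treating this as authoritative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// An "event" in the sense Seif describes: a timestamp plus whatever
	// data came with it. Field names are invented for the example.
	events := []map[string]any{{
		"_time":  time.Now().Format(time.RFC3339),
		"level":  "error",
		"route":  "/checkout",
		"status": 502,
	}}

	body, _ := json.Marshal(events)
	// "my-dataset" is a placeholder; the path shape is Axiom's ingest API.
	req, _ := http.NewRequest(http.MethodPost,
		"https://api.axiom.co/v1/datasets/my-dataset/ingest",
		bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("AXIOM_TOKEN"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("ingest status:", resp.Status)
}
```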
And that's, I think, the way we approached this from day one.And from day one, we also started utilizing the idea of existence of Open—Object Storage, we have our own compressions, our own encodings, et cetera, from day one, too, and we still stick to that. That's why we never converted to other existing things like Parquet. Also because we are Schema-On-Read, which Parquet doesn't really allow you to do. But other than that, it's… from day one, we wanted to save costs by also making coordination free. So, ingest has to be coordination free, right, because then we don't run a shitty Kafka, like, honestly a lot—a lot of the [logs 00:06:19] companies who are running a Kafka in front of it, the Kafka tax reflects in what they—the bill that you're paying for them.Corey: What I found fun about your pricing model is it gets to a point that for any reasonable workload, how much to log or what to log or sample or keep everything is no longer an investment decision; it's just go ahead and handle it. And that was originally what you wound up building out. Increasingly, it seems like you're not just the place to send all the logs to, which to be honest, I was excited enough about that. That was replacing one of the projects I did a couple of times myself, which is building highly available, fault-tolerant, rsyslog clusters in data centers. Okay, great, you've gotten that unlocked, the economics are great, I don't have to worry about that anymore.And then you started adding interesting things on top of it, analyzing things, replaying events that happen to other players, et cetera, et cetera, it almost feels like you're not just a storage depot, but you also can forward certain things on under a variety of different rules or guises and format them as whatever on the other side is expecting them to be. So, there's a story about integrating with other observability vendors, for example, and only sending the stuff that's germane and relevant to them since everyone loves to charge by ingest.Seif: Yeah. So, we did this one thing called endpoints, the number one. Endpoints was a beginning where we said, "Let's let people send us data using whatever API they like using, let's say Elasticsearch, Datadog, Honeycomb, Loki, whatever, and we will just take that data and multiplex it back to them." So, that's how part of it started. This allows us to see, like, how—allows customers to see how we compared to others, but then we took it a bit further and now, it's still in closed invite-only, but we have Pipelines—codenamed Pipelines—which allows you to send data to us and we will keep it as a source of truth, then we will, given specific rules, we can then ship it anywhere to a different destination, right, and this allows you just to, on the fly, send specific filtered things out to, I don't know, a different vendor or even to S3 or you could send it to Splunk. But at the same time, you can—because we have all your data, you can go back in the past, if an incident happens and replay that completely into a different product. 
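A back-of-the-napkin sketch of the routing idea behind the endpoints and Pipelines features Seif just described. This is purely illustrative; the class and method names are invented, not Axiom's actual API:

```python
from typing import Any, Callable

Event = dict[str, Any]
Predicate = Callable[[Event], bool]
Sink = Callable[[Event], None]

class Pipeline:
    """Keep every event in a source of truth, then fan matching events
    out to downstream destinations based on per-route predicates."""

    def __init__(self) -> None:
        self.source_of_truth: list[Event] = []  # stands in for object storage
        self.routes: list[tuple[Predicate, Sink]] = []

    def add_route(self, predicate: Predicate, sink: Sink) -> None:
        self.routes.append((predicate, sink))

    def ingest(self, event: Event) -> None:
        self.source_of_truth.append(event)  # always retained, enabling replay
        for predicate, sink in self.routes:
            if predicate(event):
                sink(event)

    def replay(self, predicate: Predicate, sink: Sink) -> None:
        """After an incident, re-send historical events to a new destination."""
        for event in self.source_of_truth:
            if predicate(event):
                sink(event)

# Route only errors to an expensive per-GB vendor; keep everything locally.
pipeline = Pipeline()
pipeline.add_route(lambda e: e.get("level") == "error", lambda e: print("-> vendor:", e))
pipeline.ingest({"level": "info", "message": "healthy"})
pipeline.ingest({"level": "error", "message": "db timeout"})
```

The design point worth noting is that the source of truth retains everything, so routing mistakes are recoverable: you can replay history into a new destination after the fact, which is exactly the incident-replay scenario Seif mentions.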
Logs are kind of eternal and the only real change we've seen to logs over the past decade or so has been instead of just being plain text and their positional parameters would define what was what—if it's in this column, it's an IP address and if it's in this column, it's a return code, and that just wound up being ridiculous—now you see them having schemas; they are structured in a variety of different ways. Which, okay, it's a little harder to wind up just cat'ing a file together and piping it to grep, but there are trade-offs that make it worth it, in my experience.This is one of those transitional products that not only is great once you get to where you're going, from my playing with it, but also it meets you where you already are to get started because everything you've got is emitting logs somewhere, whether you know it or not.Seif: Yes. And that's why we picked up on OTel, right? Like, one of the first things, we now support… we have an OTel endpoint natively bec—or as a first-class citizen because we wanted to build this experience around OTel in general. Whether we like it or not, and there's more reasons to like it, OTel is a standard that's going to stay and it's going to move us forward. I think OTel will have the same effect, if not bigger, as [unintelligible 00:10:11] back in the day, but now it just went beyond metrics, to metrics, logs, and traces.Traces is, for me, very interesting because I think OTel is the first one to push it in a standard way. There were several attempts to make standardized [logs 00:10:25], but I think traces was something that OTel really pushed into a proper standard that we can follow. It annoys me that everybody uses different bits and pieces of it and adds something to it, but I think it's also because it's not that mature yet, so people are trying to figure out how to deliver the best experience and package it in a way that it's actually interesting for a user.Corey: What I have found is that there's a lot that's in this space that is just simply noise. Whenever I spend a protracted time period working on basically anything and I'm still confused by the way people talk about that thing, months or years later, I'm starting to get the realization that maybe I'm not the problem here. And I'm not—I don't mean this to be insulting, but one of the things I've loved about you folks is I've always understood what you're saying. Now, you can hear that as, "Oh, you mean we talk like simpletons?" No, it means what you're talking about resonates with at least a subset of the people who have the problem you solve. That's not nothing.Seif: Yes. We've tried really hard because one of the things we've tried to do is actually bring observability to people who are not always busy or it's not part of their day to day. So, we try to bring in [Vercel 00:11:37] developers, right, with doing a Vercel integration. And all of a sudden, now they have their logs, and they have metrics, and they have some traces. So, all of a sudden, they're doing the observability work. Or they have actual observability, for their Vercel-based, [unintelligible 00:11:54]-based product.And we try to meet the people where they are, so we try to—instead of actually telling people, "You should send us data."—I mean, that's what they do now—we try to find, okay, what product are you using and how can we grab data from there and send it to us to make your life easier? You see that we did that with Vercel, we did that with Cloudflare. 
AWS, we have extensions, Lambda extensions, et cetera, but we're doing it for more things. For Netlify, it's a one-click integration, too, and that's what we're trying to do to actually make the experience and the journey easier.Corey: I want to change gears a little bit because something that we spent a fair bit of time talking about—it's why we became friends, I would think anyway—is that we have a shared appreciation for several things. One of which, and most notable to anyone around us, is whenever we hang out, we greet each other effusively and then immediately begin complaining about costs of cloud services. What is your take on the way that clouds charge for things? And I know it's a bit of a leading question, but it's core and foundational to how you think about Axiom, as well as how you serve customers.Seif: They're ripping us off. I'm sorry [laugh]. They just—the amount of money they make, like, it's crazy. I would love to know what margins they have. That's a big question I've always had. I'm like, what are the margins they have at AWS right now?Corey: Across the board, it's something around 30 to 40%, last time I looked at it.Seif: That's a lot, too.Corey: Well, that's also across the board of everything, to be clear. It is very clear that some services are subsidized by other services. As it should be. If you start charging me per IAM call, we're done.Seif: And also, I mean, the machine learning stuff. Like, they won't be doing that much on top of it right now, right, [else nobody 00:13:32] will be using it.Corey: But data transfer? Yeah, there's a significant upcharge on that. But I hear you. I would moderate it a bit. I don't think that I would say that it's necessarily an intentional ripoff. My problem with most cloud services that they offer is not usually that they're too expensive—though there are exceptions to that—but rather that the dimensions are unpredictable in advance. So, you run something for a while and see what it costs. From where I sit, if a customer uses your service and then at the end of usage is surprised by how much it cost them, you've kind of screwed up.Seif: Look, if they can make egress free—like, you saw how Cloudflare just did the egress of R2 free? Because I am still stuck with AWS because let's face it, for me, it is still my favorite cloud, right? Cloudflare is my next favorite because of all the features they're trying to develop and the pace they're picking, the pace they're trying to catch up with. But again, one of the biggest things I liked is R2, and R2 egress is free. Now, that's interesting, right?But I never saw anything coming back from AWS on S3 for that, you know. I think Amazon is so comfortable because from a product perspective, they're simple, they have the tools, et cetera. And the UI is not the flashiest one, but you know what you're doing, right? The CLI is not the flashiest one, but you know what you're doing. It is so cool that they don't really need to compete with others yet.And I think they're still dominantly the biggest cloud out there. I think you know more than me about that, but [unintelligible 00:14:57], like, I think they are the biggest one right now in terms of data volume. Like, how many customers are using them, and even in terms of profiles of people using them, it's very, so much. I know, like, a lot of the Microsoft Azure people who are using it, are using it because they come from enterprises that have always been Microsoft… very Microsoft friendly. 
And eventually, Microsoft also came in Europe in all these different weird ways. But I feel sometimes ripped off by AWS because I see Cloudflare trying to reduce the prices and AWS just looking, like, "Yeah, you're not a threat to us so we'll keep our prices as they are."Corey: I have it on good authority from folks who know that there are reasons behind the economic structures of both of those companies based—in terms of the primary direction the traffic flows and the rest. But across the board, they've done such a poor job of articulating this that, frankly, I think the confusion is on them to clear up, not us.Seif: True. True. And the reason I picked R2 and S3 to compare there and not look at Workers and Lambdas is because I look at it as R2 is S3-compatible from an API perspective, right? So, they're giving me something that I already use. Everything else I'm using, I'm using inside Amazon, so it's in a VPC, but just the idea. Let me dream. Let me dream that S3 egress will be free at some point.Corey: I can dream.Seif: That's like Christmas. It's better than Christmas.Corey: What I'm surprised about is how reasonable your pricing is in turn. You wind up charging on the basis of ingest, which is basically the only thing that really makes sense for how your company is structured. But it's predictable in advance, the free tier is, what, 500 gigs a month of ingestion, and before people think, "Oh, that doesn't sound like a lot," I encourage you to just go back and think how much data that really is in the context of logs for any toy project. Like, "Well, our production environment spits out way more than that." Yes, and by the word production that you just used, you probably shouldn't be using a free trial of anything as your critical path observability tooling. Become a customer, not a user. I'm a big believer in that philosophy, personally. For all of my toy projects that are ridiculous, this is ample.Seif: People always tend to overestimate how many logs they're going to be sending. Like so, there's one thing. What you said is right: people who already have something going on, they already know how many logs they'll be sending around. But then eventually they're sending too much, and that's why we're back here and they're talking to us. Like, "We want to try your tool, but you know, we'll be sending more than that." So, if you don't like our pricing, go find something else because I think we are the cheapest out there right now. We're competitively the cheapest out there right now.Corey: If there is one that is less expensive, I'm unaware of it.Seif: [laugh].Corey: And I've been looking, let's be clear. That's not just me saying, "Well, nothing has skittered across my desk." No, no, no, I pay attention to this space.Seif: Hey, where's—Corey, we're friends. Loyalty.Corey: Exactly.Seif: If you find something, you tell me.Corey: Oh, if I find something, I'll tell everyone.Seif: No, no, no, you tell me first and you tell me in a nice way so I can reduce the prices on my site [laugh].Corey: This is how we start a price war, industry-wide, and I would love to see it.Seif: [laugh]. But there's enough channels that we share at this point across different Slacks and messaging apps that you should be able to ping me if you find one. Also, get me the name of the CEO and the CTO while you're at it.Corey: And where they live. Yes, yes, of course. The dire implications will be awesome.Seif: That was you, not me. 
That was your suggestion.Corey: Exactly.Seif: I will not—[laugh].Corey: Before we turn into a bit of an old thud and blunder, let's talk about something else that I'm curious about here. You've been working on Axiom for something like seven years now. You come from a world of databases and events and the like. Why start a company in the model of Axiom? Even back then, when I looked around, my big problem with the entire observability space could never have been described as, "You know what we need? More companies that do exactly this." What was it that you saw that made you say, "Yeah, we're going to start a company because that sounds easy."Seif: So, I'll be very clear. Like, I'm not going to, like, sugarcoat this. We kind of got in a position where it [forced counterweighted 00:19:10]. And [laugh] by that I mean, we came from a company where we were dealing with logs. Like, we actually wrote an event crash analytics tool for a company, but then we ended up wanting to use stuff like Datadog, but we didn't have the budget for that because Datadog was killing us.So, we ended up hosting our own Elasticsearch. And Elasticsearch, it cost us more to maintain our Elasticsearch cluster for the logs than to actually maintain our own little infrastructure for the crash events when we were getting, like, 1 billion crashes a month at this point. So eventually, we just—that was the first burn. And then you had alert fatigue and then you had consolidating events and timestamps and whatnot. The whole thing just seemed very messy.So, we started off after some company got sold, we started off by saying, "Okay, let's go work on a new self-hosted version of the [unintelligible 00:20:05] where we do metrics and logs." And then that didn't go as well as we thought it would, but we ended up—because from day one, we were working on cloud na—because we d—we cloud ho—we were self-hosted, so we wanted to keep costs low, we were working on making it stateless and work against object store. And this is kind of how we started. We realized, oh, our cost, we can host this and make it scale, and it won't cost us that much.So, we did that. And that started gaining more attention. But the reason we started this was we wanted to start a self-hosted version of Datadog that is not costly, and we ended up doing a Software as a Service. I mean, you can still come and self-host, but you'll have to pay money for it, like, proper money for that. But we do a SaaS version of this and instead of trying to be a self-hosted Datadog, we are now trying to compete—or we are competing with Datadog.Corey: Is the technology that you've built this on top of actually that different from everything else out there, or is this effectively what you see in a lot of places: "Oh, yeah, we're just going to manage Elasticsearch for you because that's annoying." Do you have anything that distinguishes you from, I guess, the rest of the field?Seif: Yeah. So, very just bluntly, like, I think Scuba was the first thing that started standing out, and then Honeycomb came into the scene and they started building something based on Scuba, the [unintelligible 00:21:23] principles of Scuba. Then one of the authors of actual Scuba reached out to me when I told him I'm trying to build something, and he gave me some ideas, and I started building that. And from day one, I said, "Okay, everything in S3. All queries have to be serverless."So, all the queries run on functions. There are no real disks. It's just all on S3 right now. 
And the biggest issue—achievement we got to lower our cost was to get rid of Kafka, and have—let's say, behind the scenes we have our own coordination-free mechanism, but the idea is not to actually have to use Kafka at all and thus reduce the costs incredibly. In terms of technology, no, we don't use Elasticsearch.We wrote everything from the ground up, from scratch, even the query language. Like, we have our own query language that's based—modeled after Kusto—KQL by Microsoft—so everything we have is built absolutely from the ground up. And no Elastic. I'm not using Elastic anymore. Elastic is a horror for me. Absolute horror.Corey: People love the API, but no, I've never met anyone who likes managing Elasticsearch or OpenSearch, or whatever we're calling your particular flavor of it. It is a colossal pain, it is subject to significant trade-offs, regardless of how you work with it, and Amazon's managed offering doesn't make it better; it makes it worse in a bunch of ways.Seif: And the green status of Elasticsearch is a myth. You'll only see it once: the first time you start that cluster, that's when the Elasticsearch cluster is green. After that, it's just orange, or red. And you know what? I'm happy when it's orange. Elasticsearch kept me up for so long. And we had actually a very interesting situation where we had Elasticsearch running on Azure, on Windows machines, and I would have server [unintelligible 00:23:10]. And I'd have to log in and every day—you remember, what's it called—RP… RP Something. What was it called?Corey: RDP? Remote Desktop Protocol, or something else?Seif: Yeah, yeah. Where you have to log in, like, you actually have visual thing, and you have to go in and—Corey: Yep.Seif: And visually go in and say, "Please don't restart." Every day, I'd have to do that. Please don't restart, please don't restart. And also a lot of weird issues, and also at that point, Azure would decide to disconnect the pod, wanted to try to bring in a new pod, and all these weird things were happening back then. So, eventually, end up with a [unintelligible 00:23:39] decision. I'm talking 2013, '14, so it was back in the day when Elasticsearch was very young. And so, that was just a bad start for me.Corey: I will say that Azure is the most cost-effective cloud because their security is so clown shoes, you can just run whatever you want in someone else's account and it's free to you. Problem solved.Seif: Don't tell people how we save costs, okay?Corey: [laugh]. I love that.Seif: [laugh]. Don't tell people how we do this. Like, Corey, come on [laugh], you're exposing me here. Let me tell you one thing, though. Elasticsearch is the reason I literally used a shock collar or a shock bracelet on myself every time it went down—which was almost every day, instead of having PagerDuty, like, ring my phone.And, you know, I'd wake up and my partner back then would wake up. I bought a Bluetooth collar off of Alibaba that would tase me every time I'd get a notification, regardless of the notification. So, some things were false alarms, but I got tased for at least two, three weeks before I gave up. Every night I'd wake up, like, to a full discharge.Corey: I would never hook myself up to a shocker tied to outages, even if I owned a company. There are pleasant ways to wake up, unpleasant ways to wake up, and even worse. So, you're getting shocked for some—so someone else can wind up effectively driving the future of the business. 
You're, more or less, the monkey that gets shocked awake to go ahead and fix the thing that just broke.Seif: [laugh]. Well, the fix to that was moving from Azure to AWS without telling anybody. That got us in a lot of trouble. Again, that wasn't my company.Corey: They didn't notice that you did this, or it caused a lot of trouble because suddenly nothing worked where they thought it would work?Seif: They—no, no, everything worked fine on AWS. That's how my love story began. But they didn't notice for, like, six months.Corey: That's kind of amazing.Seif: [laugh]. That was specta—we rewrote everything from C# to Node.js and moved everything away from Elasticsearch, started using Redshift, Redis and a—you name it. We went AWS all the way and they didn't even notice. We took the budget from another department to start filling that in.But we cut the costs from $100,000 down to, like, 40, and then eventually down to $30,000 a month.Corey: More than a little wild.Seif: Oh, God, yeah. Good times, good times. Next time, just ask me to tell you the full story about this. I can't go into details on this podcast. I'll get in a lot—I think I'll get in trouble. I didn't sign anything though.Corey: Those are the best stories. But no, I hear you. I absolutely hear you. Seif, I really want to thank you for taking the time to speak with me. If people want to learn more, where should they go?Seif: So, axiom.co—not dot com. Dot C-O. That's where they learn more about Axiom. And other than that, I think I have a Twitter somewhere. And if you know how to write my name, you'll—it's just one word and find me on Twitter.Corey: We will put that all in the [show notes 00:26:33]. Thank you so much for taking the time to speak with me. I really appreciate it.Seif: Dude, that was awesome. Thank you, man.Corey: Seif Lotfy, co-founder and CTO of Axiom, who has brought this promoted guest episode our way. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that one of these days, I will get around to aggregating in some horrifying custom homebrew logging system, probably built on top of rsyslog.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
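Before moving on, here is a rough sketch of the coordination-free, object-storage-first ingest Seif described in this episode. The key idea is that each writer flushes immutable, uniquely keyed blocks straight to object storage, so writers never have to agree on ordering and no Kafka-style broker sits in front. All names and parameters are invented for illustration; this assumes boto3 and configured AWS credentials:

```python
import gzip
import json
import time
import uuid

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

class BlockWriter:
    """Buffers events and flushes them as compressed, immutable blocks.

    Each block gets a globally unique key, so any number of writers can
    ingest in parallel without coordinating with each other or a broker.
    """

    def __init__(self, bucket: str, dataset: str, max_events: int = 10_000):
        self.bucket = bucket
        self.dataset = dataset
        self.max_events = max_events
        self.buffer: list[dict] = []

    def append(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.max_events:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # Unique key: no writer can collide with another, hence no coordination.
        key = f"{self.dataset}/{int(time.time())}-{uuid.uuid4()}.ndjson.gz"
        body = gzip.compress("\n".join(json.dumps(e) for e in self.buffer).encode())
        s3.put_object(Bucket=self.bucket, Key=key, Body=body)
        self.buffer.clear()
```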
In this episode, I had great fun chatting with Laïla Bougriâ about NServiceBus and how it can help with your distributed architecture. From its support for observability (including OTel) and dashboarding, to sagas and pub/sub. Even zombie and ghost messages! (spooky!). And as has been known to happen on this show - we did end up going on a tangent and geeking out about the awesome JetBrains Rider too!Laïla is a Solutions Architect and Software Engineer at Particular Software. She's also a Microsoft MVP, and a frequent speaker talking about topics such as dotnet, messaging, distributed systems, and software engineering in general.For a full list of show notes, or to add comments - please see the website here
Austin Parker, Community Maintainer at OpenTelemetry, joins Corey on Screaming in the Cloud to discuss OpenTelemetry's mission in the world of observability. Austin explains how the OpenTelemetry community was able to scale the OpenTelemetry project to a commercial offering, and the way OpenTelemetry is driving innovation in the data space. Corey and Austin also discuss why Austin decided to write a book on OpenTelemetry, and the book's focus on the evergreen applications of the tool. About Austin: Austin Parker is the OpenTelemetry Community Maintainer, as well as an event organizer, public speaker, author, and general bon vivant. They've been a part of OpenTelemetry since its inception in 2019.Links Referenced: OpenTelemetry: https://opentelemetry.io/ Learning OpenTelemetry early release: https://www.oreilly.com/library/view/learning-opentelemetry/9781098147174/ Page with Austin's social links: https://social.ap2.io Transcript: Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Look, I get it. Folks are being asked to do more and more. Most companies don't have a dedicated DBA because that person now has a full-time job figuring out which one of AWS's multiple managed database offerings is right for every workload. Instead, developers and engineers are being asked to support, and heck, if time allows, optimize their databases. That's where OtterTune comes in. Their AI is your database co-pilot for MySQL and PostgreSQL on Amazon RDS or Aurora. It helps improve performance by up to 4x or reduce costs by 50 percent – both of those are decent options. Go to ottertune dot com to learn more and start a free trial. That's O-T-T-E-R-T-U-N-E dot com.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's been a few hundred episodes since I had Austin Parker on to talk about the things that Austin cares about. But it's time to rectify that. Austin is the community maintainer for OpenTelemetry, which is a CNCF project. If you're unfamiliar with it, we're probably going to fix that in short order. Austin, welcome back; it's been a month of Sundays.Austin: It has been a month-and-a-half of Sundays. A whole pandemic-and-a-half.Corey: So much has happened since then. I tried to instrument something with OpenTelemetry about a year-and-a-half ago, and in defense of the project, my use case is always very strange, but it felt like—a lot of things have sharp edges, but it felt like this had so many sharp edges that you just pivot to being a chainsaw, and I would have been at least a little bit more understanding of why it hurts so very much. But I have heard from people that I trust that the experience has gotten significantly better. Before we get into the nitty-gritty of me lobbing passive-aggressive bug reports at you for you to fix in a scenario in which you can't possibly refuse me, let's start with the beginning. What is OpenTelemetry?Austin: That's a great question. Thank you for asking it. So, OpenTelemetry is an observability framework. 
It is run by the CNCF, you know, home of such wonderful award-winning technologies as Kubernetes, and you know, the second biggest source of YAML in the known universe [clears throat].Corey: On some level, it feels like that is right there with hydrogen as far as unlimited resources in our universe.Austin: It really is. And, you know, as we all know, there are two things that make, sort of, the DevOps and cloud world go around: one of them being, as you would probably know, AWS bills; and the second being YAML. But OpenTelemetry tries to kind of carve a path through this, right, because we're interested in observability. And observability, for those that don't know or have been living under a rock or not reading blogs, it's a lot of things. It's a—but we can generally sort of describe it as, like, this is how you understand what your system is doing.I like to describe it as, it's a way that we can model systems, especially complex, distributed, or decentralized software systems that are pretty commonly found in larg—you know, organizations of every shape and size, quite often running on Kubernetes, quite often running in public or private clouds. And the goal of observability is to help you, you know, model this system and understand what it's doing, which is something that, I think we can all agree, is a pretty important part of our job as software engineers. Where OpenTelemetry fits into this is as the framework that helps you get the telemetry data you need from those systems, put it into a universal format, and then ship it off to some observability back-end, you know, a Prometheus or a Datadog or whatever, in order to analyze that data and get answers to the questions you have.Corey: From where I sit, the value of OTel—or OpenTelemetry; people in software engineering love abbreviations that are impenetrable from the outside, so of course, we're going to lean into that—but what I found for my own use case is the shining value prop was that I could instrument an application with OTel—in theory—and then send whatever I wanted that was emitted in terms of telemetry, be it events, be it logs, be it metrics, et cetera, and send that to any or all of a curation of vendors on a case-by-case basis, which meant that suddenly it was the first step in, I guess, an observability pipeline, which increasingly is starting to feel like a milit—like an industrial-observability complex, where there's so many different companies out there, it seems like a good approach to use, to start, I guess, racing vendors in different areas to see which performs better. One of the challenges I've had with that when I started down that path is it felt like every vendor who was embracing OTel did it from a perspective of their implementation. Here's how to instrument it to—send it to us because we're the best, obviously. And you're a community maintainer, despite working at observability vendors yourself. You have always been one of those community-first types where you care more about the user experience than you do about this quarter for any particular employer that you have, which to be very clear, is intended as a compliment, not a terrifying warning. It's why you have this authentic air to you and why you are one of those very few voices that I trust in a space where normally I need to approach it with significant skepticism. 
How do you see the relationship between vendors and OpenTelemetry?Austin: I think the hard thing is that I know who signs my paychecks at the end of the day, right, and you always have, you know, some level of, you know, let's say bias, right? Because it is a bias to look after, you know, them who brought you to the dance. But I think you can be responsible with balancing, sort of, the needs of your employer, and the needs of the community. You know, the way I've always described this is that if you think about observability as, like, a—you know, as a market, what's the total addressable market there? It's literally everyone that uses software; it's literally every software company.Which means there's plenty of room for people to make their numbers and to buy and sell and trade and do all this sort of stuff. And by taking that approach, by taking sort of the big picture approach and saying, “Well, look, you know, there's going to be—you know, of all these people, there are going to be some of them that are going to use our stuff and there are some of them that are going to use our competitor's stuff.” And that's fine. Let's figure out where we can invest… in an OpenTelemetry, in a way that makes sense for everyone and not just, you know, our people. So, let's build things like documentation, right?You know, one of the things I'm most impressed with, with OpenTelemetry over the past, like, two years is we went from being, as a project, like, if you searched for OpenTelemetry, you would go and you would get five or six or ten different vendor pages coming up trying to tell you, like, “This is how you use it, this is how you use it.” And what we've done as a community is we've said, you know, “If you go looking for documentation, you should find our website. You should find our resources.” And we've managed to get the OpenTelemetry website to basically rank above almost everything else when people are searching for help with OpenTelemetry. And that's been really good because, one, it means that now, rather than vendors or whoever coming in and saying, like, “Well, we can do this better than you,” we can be like, “Well, look, just, you know, put your effort here, right? It's already the top result. It's already where people are coming, and we can prove that.”And two, it means that as people come in, they're going to be put into this process of community feedback, where they can go in, they can look at the docs, and they can say, “Oh, well, I had a bad experience here,” or, “How do I do this?” And we get that feedback and then we can improve the docs for everyone else by acting on that feedback, and the net result of this is that more people are using OpenTelemetry, which means there are more people kind of going into the tippy-tippy top of the funnel, right, that are able to become a customer of one of these myriad observability back ends.Corey: You touched on something very important here, when I first was exploring this—you may have been looking over my shoulder as I went through this process—my impression initially was, oh, this is a ‘CNCF project' in quotes, where—this is not true universally, of course, but there are cases where it clearly—is where this is an, effectively, vendor-captured project, not necessarily by one vendor, but by an almost consortium of them. And that was my takeaway from OpenTelemetry. It was conversations with you, among others, that led me to believe no, no, this is not in that vein. This is clearly something that is a win. 
There are just a whole bunch of vendors more-or-less falling all over themselves, trying to stake out thought leadership and imply ownership, on some level, of where these things go. But I definitely left with a sense that this is bigger than any one vendor.Austin: I would agree. I think, to even step back further, right, there's almost two different ways that I think vendors—or anyone—can approach OpenTelemetry, you know, from a market perspective, and one is to say, like, "Oh, this is socializing, kind of, the maintenance burden of instrumentation." Which is a huge cost for commercial players, right? Like, if you're a Datadog or a Splunk or whoever, you know, you have these agents that you go in and they rip telemetry out of your web servers, out of your gRPC libraries, whatever, and it costs a lot of money to pay engineers to maintain those instrumentation agents, right? And the cynical take is, oh, look at all these big companies that are kind of like pushing all that labor onto the open-source community, and you know, I'm not casting any aspersions here, like, I do think that there's an element of truth to it though because, yeah, that is a huge fixed cost.And if you look at the actual lived reality of people and you look at back when SignalFx was still a going concern, right, and they had their APM agents open-sourced, you could go into the SignalFx repo and diff, like, their [Node Express 00:10:15] instrumentation against the Datadog Node Express instrumentation, and it's almost a hundred percent the same, right? Because it's truly a commodity. There's no—there's nothing interesting about how you get that telemetry out. The interesting stuff all happens after you have the telemetry and you've sent it to some back-end, and then you can, you know, analyze it and find interesting things. So, yeah, like, it doesn't make sense for there to be five or six or eight different companies all competing to rebuild the same wheels over and over and over and over when they don't have to.I think the second thing that some people are starting to understand is that it's like, okay, let's take this a step beyond instrumentation, right? Because the goal of OpenTelemetry really is to make sure that this instrumentation is native so that you don't need a third-party agent, you don't need some other process or jar or whatever that you drop in and it instruments stuff for you. The JVM should provide this, your web framework should provide this, your RPC library should provide this, right? Like, this data should come from the code itself and be in a normalized fashion that can then be sent to any number of vendors or back ends or whatever. And that changes how—sort of, the competitive landscape a lot, I think, for observability vendors because rather than, kind of, what you have now, which is people competing on, like, well, how quickly can I throw this agent in and get set up and get a dashboard going, it really becomes more about, like, okay, how are you differentiating yourself against every other person that has access to the same data, right? 
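As a concrete illustration of the "normalized data, any back-end" point: with the OpenTelemetry SDK for Python, the vendor-specific part collapses to the exporter endpoint, while the instrumentation in application code stays untouched. A minimal sketch; the endpoint and service name are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The only vendor-specific detail is the endpoint; everything else is standard.
exporter = OTLPSpanExporter(endpoint="https://otlp.example-vendor.com:4317")

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```

Point the same exporter at an OpenTelemetry Collector instead, and the identical data can fan out to several back-ends at once.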
And you get more interesting use cases and much more interesting analysis features, and that results in more innovation in, sort of, the industry than we've seen in a very long time.Corey: For me, just from the customer side of the world, one of the biggest problems I had with observability in my career as an SRE-type for years was you would wind up building your observability pipeline around whatever vendor you had selected and that meant emphasizing the things they were good at and de-emphasizing the things that they weren't. And sometimes it's worked to your benefit; usually not. But then you always had this question when it got to things that touched on APM or whatnot—or Application Performance Monitoring—where oh, just embed our library into this. Okay, great. But a year-and-a-half ago, my exposure to this was on an application that I was running in distributed fashion on top of AWS Lambda.So great, you can either use an extension for this or you can build in the library yourself, but then there's always a question of precedence where when you have multiple things that are looking at this from different points of view, which one gets done first? Which one is going to see the others? Which one is going to enmesh the other—enclose the others in its own perspective of the world? And it just got incredibly frustrating. One of the—at least for me—bright lights of OTel was that it got away from that where all of the vendors receiving telemetry got the same view.Austin: Yeah. They all get the same view, they all get the same data, and you know, there's a pretty rich collection of tools that we're starting to develop to help you build those pipelines yourselves and really own everything from the point of generation to intermediate collection to actually outputting it to wherever you want to go. For example, a lot of really interesting work has come out of the OpenTelemetry Collector recently; one of them is this feature called Connectors. And Connectors let you take the output of certain pipelines and route them as inputs to another pipeline. And as part of that connection, you can transform stuff.So, for example, let's say you have a bunch of [spans 00:14:05] or traces coming from your API endpoints, and you don't necessarily want to keep all those traces in their raw form because maybe they aren't interesting or maybe there's just too high of a volume. So, with Connectors, you can go and you can actually convert all of those spans into metrics and export them to a metrics database. You could continue to save that span data if you want, but you have options now, right? Like, you can take that span data and put it into cold storage or put it into, like, you know, some sort of slow blob storage thing where it's not actively indexed and it's slow lookups, and then keep a metric representation of it in your alerting pipeline, use metadata exemplars or whatever to kind of connect those things back. And so, when you do suddenly see it's like, "Oh, well, there's some interesting p99 behavior," or we're hitting an alert or violating an SLO or whatever, then you can go back and say, like, "Okay, well, let's go dig through the slow da—you know, let's look at the cold data to figure out what actually happened."And those are features that, historically, you would have needed to go to a big, important vendor and say, like, "Hey, here's a bunch of money," right? 
Like, "Do this for me." Now, you have the option to kind of do all that more interesting pipeline stuff yourself and then make choices about vendors based on, like, who is making a tool that can help me with the problem that I have? Because most of the time, I don't—I feel like we tend to treat observability tools as—it depends a lot on where you sit in the org—but you've certainly seen this movement towards, like, "Well, we don't want a tool; we want a platform. We want to go to Lowe's and we want to get the 48-in-one kit that has a bunch of things in it. And we're going to pay for the 48-in-one kit, even if we only need, like, two things or three things out of it."OpenTelemetry lets you kind of step back and say, like, "Well, what if we just got, like, really high-quality tools for the two or three things we need, and then for the rest of the stuff, we can use other cheaper options?" Which is, I think, really attractive, especially in today's macroeconomic conditions, let's say.Corey: One thing I'm trying to wrap my head around because we all find when it comes to observability, in my experience, it's the parable of three blind people trying to describe an elephant by touch; depending on where you are on the elephant, you have a very different perspective. What I'm trying to wrap my head around is, what is the vision for OpenTelemetry? Is it specifically envisioned to be the agent that runs wherever the workload is, whether it's an agent on a host or a layer in a Lambda function, or a sidecar or whatnot in a Kubernetes cluster that winds up gathering and sending data out? Or is the vision something different? Because part of what you're saying aligns with my perspective on it, but other parts of it seem to—that there's a misunderstanding somewhere, and it's almost certainly on my part.Austin: I think the long-term vision is that you as a developer, you as an SRE, don't even have to think about OpenTelemetry, that when you are using your container orchestrator or you are using your API framework or you're using your Managed API Gateway, or any kind of software that you're building something with, that the telemetry data from that software is emitted in an OpenTelemetry format, right? And when you are writing your code, you know, and you're using gRPC, let's say, you could just natively expect that OpenTelemetry is kind of there in the background and it's integrated into the actual libraries themselves. And so, you can just call the OpenTelemetry API and it's part of the standard library almost, right? You add some additional metadata to a span and say, like, "Oh, this is the customer ID," or, "This is some interesting attribute that I want to track for later on," or, "I'm going to create a histogram here or counter," whatever it is, and then all that data is just kind of there, right, invisible to you unless you need it. And then when you need it, it's there for you to kind of pick up and send off somewhere to any number of back-ends or databases or whatnot that you could then use to discover problems or better model your system.That's the long-term vision, right, that it's just there, everyone uses it. It is a de facto and du jour standard. I think in the medium term, it does look a little bit more like OpenTelemetry is kind of this Swiss army knife agent that's running on—in sidecars in Kubernetes or it's running on your EC2 instance. 
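The "add some metadata to a span" step Austin describes looks roughly like this with the current OpenTelemetry Python API. The tracer name, function, and attribute keys are made up for the example:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def charge_card(customer_id: str, amount_cents: int) -> None:
    # The span is started and ended by the context manager; the attributes
    # ride along as the "interesting metadata" to slice on later.
    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        # ... actual payment logic would go here ...
```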
Until we get to the point where everyone just agrees that we're going to use OpenTelemetry protocol for the data and we're going to use all your stuff and we just natively emit it, that's going to be how long we're in that midpoint. But that's sort of the medium and long-term vision I think. Does that track?Corey: It does. And I'm trying to equate this to—like the evolution back in the Stone Age was back when I was first getting started, Nagios was the gold standard. It was kind of the original Call of Duty. And it was awful. There were a bunch of problems with it, but it also worked.And I'm not trying to dunk on the people who built that. We all stand on the shoulders of giants. It was an open-source project that was awesome doing exactly what it did, but it was a product built for a very different time. It completely had the wheels fall off as soon as you got to things that were even slightly ephemeral because it required this idea that the server needed to know where all of the things it was monitoring lived on an individual host basis, so there was this constant joy of, "Oh, we're going to add things to a cluster." Its perspective was, "What's a cluster?" Or you'd have these problems with a core switch going down and suddenly everything else would explode as well.And even setting up an on-call rotation for who got paged when was nightmarish. And a bunch of things have evolved since then, which is putting it mildly. Like, you could say that about fire, the invention of the wheel. Yeah, a lot of things have evolved since the invention of the wheel, and here we are tricking sand into thinking. But we find ourselves just—now it seems that the outcome of all of this has been instead of one option that's the de facto standard that's kind of terrible in its own ways, now, we have an entire universe of different products, many of which are best-of-breed at one very specific thing, but nothing's great at everything.It's the multifunction printer conundrum, where you find things that are great at one or two things at most, and then mediocre at best at the rest. I'm excited about the possibility for OpenTelemetry to really get to a point of best-of-breed for everything. But it also feels like the money folks are pushing for consolidation, if you believe a lot of the analyst reports around this of, "We already pay for seven different observability vendors. How about we knock it down to just one that does all of these things?" Because that would be terrible. Where do you land on that?Austin: Well, as I intu—or alluded to this earlier, I think the consolidation in the observability space, in general, is very much driven by that force you just pointed out, right? The buyers want to consolidate more and more things into single tools. And I think there's a lot of… there are reasons for that that—you know, there are good reasons for that, but I also feel like a lot of those reasons are driven by fundamentally telemetry-side concerns, right? So like, one example of this is if you were Large Business X, and you see—you are an engineering director and you get a report, that's like, "We have eight different metrics products." And you're like, "That seems like a lot. Let's just use Brand X."And Brand X will very, very happily tell you, like, "Oh, you just install our thing everywhere and you can get rid of all these other tools." And usually, there's two reasons that people pick tools, right? 
One reason is that they are forced to and then they are forced to do a bunch of integration work to get whatever the old stuff was working in the new way, but the other reason is because they tried a bunch of different things and they found the one tool that actually worked for them. And what happens invariably in these sort of consolidation stories is, you know, the new vendor comes in on a shining horse to consolidate, and you wind up instead of eight distinct metrics tools, now you have nine distinct metrics tools because there's never any bandwidth for people to go back and, you know—your Nagios example, right, Nag—people still use Nagios every day. What's the economic justification to take all those Nagios installs, if they're working, and put them into something else, right?What's the economic justification to go and take a bunch of old software that hasn't been touched for ten years that still runs and still does what it needs to do, like, where's the incentive to go and re-instrument that with OpenTelemetry or anything else? It doesn't necessarily exist, right? And that's a pretty, I think, fundamental decision point in everyone's observability journey, which is what do you do about all the old stuff? Because most of the stuff is the old stuff and the worst part is, most of the stuff that you make money off of is the old stuff as well. So, you can't ignore it, and if you're spending, you know, millions and millions of dollars on the new stuff—like, there was a story that went around a while ago, I think, Coinbase spent something like, what, $60 million on Datadog… I hope they asked for it in real money and not Bitcoin. But—Corey: Yeah, something I've noticed about all the vendors, and even Coinbase themselves, very few of them actually transact in cryptocurrency. It's always cash on the barrelhead, so to speak.Austin: Yeah, smart. But still, like, that's an absurd amount of money [laugh] for any product or service, I would argue, right? But that's just my perspective. I do think though, it goes to show you that you know, it's very easy to get into these sort of things where you're just spending over the barrel to, like, the newest vendor that's going to come in and solve all your problems for you. And just, it often doesn't work that way because most places aren't—especially large organizations—just aren't built in a way that's sort of like, "Oh, we can go through and we can just redo stuff," right? "We can just roll out a new agent through… whatever."We have mainframes [unintelligible 00:25:09], mainframes to think about, you have… in many cases, you have an awful lot of business systems that most, kind of, cloud people don't, like, think about, right, like SAP or Salesforce or ServiceNow, or whatever. And those sort of business process systems are actually responsible for quite a few things that are interesting from an observability point of view. But you don't see—I mean, hell, you don't even see OpenTelemetry going out and saying, like, "Oh, well, here's the thing to let you, you know, observe Apex applications on Salesforce," right? It's kind of an undiscovered country in a lot of ways and it's something that I think we will have to grapple with as we go forward. In the shorter term, there's a reason that OpenTelemetry mostly focuses on cloud-native applications because that's a little bit easier to actually do what we're trying to do on them and that's where the heat and light is. 
But once we get done with that, then the sky is the limit.[midroll 00:26:11]Corey: It still feels like OpenTelemetry is evolving rapidly. It's certainly not, I don't want to say it's not feature complete, which, again, what—software is never done. But it does seem like even quarter-to-quarter or month-to-month, its capabilities expand massively. Because you apparently enjoy pain, you're in the process of writing a book. I think it's in early release or early access that comes out next year, 2024. Why would you do such a thing?Austin: That's a great question. And if I ever figure out the answer I will tell you.Corey: Remember, no one wants to write a book; they want to have written the book.Austin: And the worst part is, is I have written the book and for some reason, I went back for another round. I—Corey: It's like childbirth. No one remembers exactly how horrible it was.Austin: Yeah, my partner could probably attest to that. Although I was in the room, and I don't think I'd want to do it either. So, I think the real, you know, the real reason that I decided to go and kind of write this book—and it's Learning OpenTelemetry; it's in early release right now on the O'Reilly learning platform and it'll be out in print and digital next year, I believe, we're targeting right now, early next year.But the goal is, as you pointed out so eloquently, OpenTelemetry changes a lot. And it changes month to month sometimes. So, why would someone decide—say, “Hey, I'm going to write the book about learning this?” Well, there's a very good reason for that and it is that I've looked at a lot of the other books out there on OpenTelemetry, on observability in general, and they talk a lot about, like, here's how you use the API. Here's how you use the SDK. Here's how you make a trace or a span or a log statement or whatever. And it's very technical; it's very kind of in the weeds.What I was interested in is saying, like, “Okay, let's put all that stuff aside because you don't necessarily…” I'm not saying any of that stuff's going to change. And I'm not saying that how to make a span is going to change tomorrow; it's not, but learning how to actually use something like OpenTelemetry isn't just knowing how to create a measurement or how to create a trace. It's, how do I actually use this in a production system? To my point earlier, how do I use this to get data about, you know, these quote-unquote, “Legacy systems?” How do I use this to monitor a Kubernetes cluster? What's the important parts of building these observability pipelines? If I'm maintaining a library, how should I integrate OpenTelemetry into that library for my users? And so on, and so on, and so forth.And the answers to those questions actually probably aren't going to change a ton over the next four or five years. Which is good because that makes it the perfect thing to write a book about. So, the goal of Learning OpenTelemetry is to help you learn not just how to use OpenTelemetry at an API or SDK level, but it's how to build an observability pipeline with OpenTelemetry, it's how to roll it out to an organization, it's how to convince your boss that this is what you should use, both for new and maybe picking up some legacy development. It's really meant to give you that sort of 10,000-foot view of what are the benefits of this, how does it bring value and how can you use it to build value for an observability practice in an organization?Corey: I think that's fair. 
Looking at the more quote-unquote, "Evergreen," style of content as opposed to—like, that's the reason for example, I never wind up doing tutorials on how to use an AWS service because one console change away and suddenly I have to redo the entire thing. That's a treadmill I never had much interest in getting on. One last topic I want to get into before we wind up wrapping the episode—because I almost feel obligated to sprinkle this all over everything because the analysts told me I have to—what's your take on generative AI, specifically with an eye toward observability?Austin: [sigh], gosh, I've been thinking a lot about this. And—hot take alert—as a skeptic of many technological bubbles over the past five or so years, ten years, I'm actually pretty hot on AI—generative AI, large language models, things like that—but not for the reasons that people like to kind of hold them up, right? Not so that we can all make our perfect, funny [sigh], deep dream, meme characters or whatever through Stable Diffusion or whatever ChatGPT spits out at us when we ask for a joke. I think the real win here is that this to me is, like, the biggest advance in human-computer interaction since resistive touchscreens. Actually, probably since the mouse.Corey: I would agree with that.Austin: And I don't know if anyone has tried to get someone that is, you know, over the age of 70 to use a computer at any time in their life, but mapping human language to trying to do something on an operating system or do something on a computer on the web is honestly one of the most challenging things that faces interface design, faces OS designers, faces anyone. And I think this also applies for dev tools in general, right? Like, if you think about observability, if you think about, like, well, what are the actual tasks involved in observability? It's like, well, you're making—you're asking questions. You're saying, like, "Hey, for this metric named HTTPrequestsByCode," and there's four or five dimensions, and you say, like, "Okay, well break this down for me." You know, you have to kind of know the magic words, right? You have to know the magic PromQL sequence or whatever else to plug in and to get it to graph that for you.And you as an operator have to have this very, very well developed, like, depth of knowledge in math and statistics to really kind of get a lot of—Corey: You must be at least this smart to ride on this ride.Austin: Yeah. And I think that, like that, to me is the real—the short-term win for certainly generative AI around using, like, large language models, is the ability to create human language interfaces to observability tools, that—Corey: As opposed to learning your own custom SQL dialect, which I see a fair number of times.Austin: Right. And, you know, and it's actually very funny because there was a while for the—like, one of my kind of side projects for the past [sigh] a little bit [unintelligible 00:32:31] idea of, like, well, can we make, like, a universal query language or universal query layer that you could ship your dashboards or ship your alerts or whatever. And then it's like, generative AI kind of just, you know, completely leapfrogs that, right? It just says, like, well, why would you need a query language, if we can just—if you can just ask the computer and it works, right?Corey: The most common programming language is about to become English.Austin: Which I mean, there's an awful lot of externalities there—Corey: Which is great. I want to be clear. I'm not here to gatekeep.Austin: Yeah. 
Austin: I mean, I think there's a lot of externalities there, and there's a lot—and the kind of hype-to-provable-benefit ratio is very skewed right now towards hype. That said, one of the things that is concerning to me as sort of an observability practitioner is the amount of people that are just, like, whole-hog, throwing themselves into, like, oh, we need to integrate generative AI, right? Like, we need to put AI chatbots in and we need to have ChatGPT built into our products and da-da-da-da-da. And now you kind of have this perfect storm of people that really don't ha—because they're just using these APIs to integrate gen AI stuff with, they really don't understand what it's doing because a lot of it, you know, is very complex, and I'll be the first to admit that I really don't understand what a lot of it is doing, you know, on the deep, on the foundational math side.
But if we're going to have trust in, kind of, any kind of system, we have to understand what it's doing, right? And so, the only way that we can understand what it's doing is through observability, which means it's incredibly important for organizations and companies that are building products on generative AI to, like, drop what—you know, walk—don't walk, run towards something that is going to give you observability into these language models.
Corey: Yeah. "The computer said so," is strangely dissatisfying.
Austin: Yeah. You need to have that base, you know, sort of, performance [goals and signals 00:34:31], obviously, but you also need to really understand what are the questions being asked. As an example, let's say you have something that is tokenizing questions. You really probably do want to have some sort of observability on the hot path there that lets you kind of break down common tokens, especially if you were using, like, custom dialects or, like, vectors or whatever to modify the, you know, neural network model; like, you really want to see, like, well, what's the frequency of the certain tokens that I'm getting that are hitting the vectors versus not, right? Like, where can I improve these sorts of things? Where am I getting, like, unexpected results?
And maybe even have some sort of continuous feedback mechanism that could be either analyzing the tone and tenor of end-user responses, or you can have the little, like, frowny and happy face, whatever it is—like, something that is giving you that kind of constant feedback about, like, hey, this is how people are actually interacting with it. Because I think there's way too many stories right now of people just kind of, like, saying, like, "Oh, okay. Here's some AI-powered search," and people just, like, hating it. Because people are already very primed to distrust AI, I think. And I can't blame anyone.
Corey: Well, we've had an entire lifetime of movies telling us that's going to kill us all.
Austin: Yeah.
Corey: And now you have a bunch of, also, billionaire tech owners who are basically intent on making that reality. But that's neither here nor there.
Austin: It isn't, but like I said, it's difficult. It's actually one of the first times I've been like—that I've found myself very conflicted.
Corey: Yeah, I'm a booster of this stuff; I love it, but at the same time, you have some of the ridiculous hype around it and the complete lack of attention to safety and humanity aspects of it that it's—I like the technology and I think it has a lot of promise, but I don't want to get lumped in with that set.
Austin: Exactly. Like, the technology is great.
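[Editor's note: a hedged sketch of the LLM telemetry Austin argues for, wrapping each model call in a span and attaching token counts and user feedback as attributes. The attribute names, the count_tokens helper, and llm_client are hypothetical; generative-AI semantic conventions were still evolving at the time of this episode.]

    from opentelemetry import trace

    tracer = trace.get_tracer("llm-app")  # hypothetical instrumentation name

    def answer(prompt: str) -> str:
        # One span per completion, carrying token counts on the hot path.
        with tracer.start_as_current_span("llm.completion") as span:
            span.set_attribute("llm.prompt_tokens", count_tokens(prompt))       # hypothetical helper
            response = llm_client.complete(prompt)                              # hypothetical client
            span.set_attribute("llm.completion_tokens", count_tokens(response))
            return response

    def record_feedback(score: int) -> None:
        # e.g. +1 for the happy face, -1 for the frowny face Austin mentions.
        with tracer.start_as_current_span("llm.feedback") as span:
            span.set_attribute("llm.feedback_score", score)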
Austin: The fan base is… ehh, maybe something a little different. But I do think that, for lack of a better—not to be an inevitable-ist or whatever, but I do think that there is a significant amount of, like, this is a genie you can't put back in the bottle, and it is going to have, like, wide-ranging, transformative effects on the discipline of, like, software development, software engineering, and white-collar work in general, right? Like, there's a lot of—if your job involves, like, putting numbers into Excel and making pretty spreadsheets, then ooh, that doesn't seem like something that's going to do too hot when I can just have Excel do that for me.
And I think we do need to be aware of that, right? Like, we do need to have that sort of conversation about, like… what are we actually comfortable doing here in terms of displacing human labor? When we do displace human labor, are we doing it so that we can actually give people leisure time or so that we can just cram even more work down the throats of the humans that are left?
Corey: And unfortunately, I think we might know what that answer is, at least on our current path.
Austin: That's true. But you know, I'm an optimist.
Corey: I… don't do well with disappointment. Which the show has certainly not been. I really want to thank you for taking the time to speak with me today. If people want to learn more, where's the best place for them to find you?
Austin: Welp, I—you can find me on most social media. Many, many social medias. I used to be on Twitter a lot, and we all know what happened there. The best place to figure out what's going on is to check out my bio; social.ap2.io will give you all the links to where I am. And yeah, been great talking with you.
Corey: Likewise. Thank you so much for taking the time out of your day. Austin Parker, community maintainer for OpenTelemetry. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment pointing out that actually, physicists say the vast majority of the universe is empty space, so that we can later correct you by saying ah, but it's empty whitespace. That's right. YAML wins again.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
MLOps Coffee Sessions #170 with Phillip Carter, All the Hard Stuff with LLMs in Product Development. We are now accepting talk proposals for our next LLM in Production virtual conference on October 3rd. Apply to speak here: https://go.mlops.community/NSAX1O // Abstract Delve into challenges in implementing LLMs, such as security concerns and collaborative measures against attacks. Emphasize the role of ML engineers and product managers in successful implementation. Explore identifying leading indicators and measuring ROI for impactful AI initiatives. // Bio Phillip is on the product team at Honeycomb where he works on a bunch of different developer tooling things. He's an OpenTelemetry maintainer -- chances are if you've read the docs to learn how to use OTel, you've read his words. He's also Honeycomb's (accidental) prompt engineering expert by virtue of building and shipping products that use LLMs. In a past life, he worked on developer tools at Microsoft, helping bring the first cross-platform version of .NET into the world and grow it to 5 million active developers. When not doing computer stuff, you'll find Phillip in the mountains riding a snowboard or backpacking in the Cascades. // MLOps Jobs board https://mlops.pallet.xyz/jobs // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Website: https://phillipcarter.dev/ https://www.honeycomb.io/blog/improving-llms-production-observability https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm https://phillipcarter.dev/posts/how-to-make-an-fsharp-code-fixer/ The "hard stuff" post: https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm Our follow-up on iterating on LLMs in prod: https://www.honeycomb.io/blog/improving-llms-production-observability --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Phillip on LinkedIn: https://www.linkedin.com/in/phillip-carter-4714a135/
Nedim Hazar | Hotel Rooms (1) | 29.06.2023 by Tr724
36 million generated OpenTelemetry spans per hour for GraphQL based queries – that's just one of the stats we discussed with Justin Scherer, Sr Developer and Consultant, who is leading OTel adoption and Shift-Left observability efforts at NWM. For Justin, OpenTelemetry helps commoditize data gathering in modern cloud native environments so that the backend observability platform of choice can focus on answering higher level business impacting questions.
If you are about to roll out OpenTelemetry in your organization, then take the advice from Justin, such as:
Bring business leaders early into the discussion!
Engage with the OpenTelemetry community!
Understand what your observability platform already gives you and focus on the gaps!
To learn more about OpenTelemetry check out some of the links we discussed during the podcast:
OpenTelemetry Website: https://opentelemetry.io/
IsItObservable: https://isitobservable.io/open-telemetry
Podcast: https://www.spreaker.com/user/pureperformance/adopting-open-observability-across-your-
LinkedIn Profile: https://www.linkedin.com/in/justin-scherer-198126160/
CEVHERİ GÜVEN #aliyeşildağ #tayyiperdoğan #MustafaErdoğan In his third video, Ali Yeşildağ explains how a 15,000-square-meter plot of land on the Istanbul Bosphorus was seized. Ali Yeşildağ, a key figure in the Yeşildağ family, known as the family closest to Tayyip Erdoğan, states that he himself took part in the seizure and that the leader of the gang was Tayyip Erdoğan, naming the gang's members one by one. After the land, which belonged to Mehmet Soysal of Soysal Holding, was seized, the Mandarin Oriental Bosphorus, currently Turkey's most expensive hotel, was built on it. Ali Yeşildağ also reveals, by name, that the real owner of the hotel is Erdoğan and whom they used as fronts. In this video, Ali Yeşildağ's second target is Tayyip Erdoğan's brother, Mustafa Erdoğan. The dirty dealings between Mustafa Erdoğan and Hasan Yeşildağ are laid bare. Ali Yeşildağ recounts that Tayyip Erdoğan's son Burak Erdoğan beat his uncle Mustafa Erdoğan. Yeşildağ states that Tayyip Erdoğan also beat Mustafa Erdoğan, because he had stolen Tayyip Erdoğan's bags full of money. And more...
Liz Rice, Chief Open Source Officer at Isovalent, joins Corey on Screaming in the Cloud to discuss the release of her newest book, Learning eBPF, and the exciting possibilities that come with eBPF technology. Liz explains what got her so excited about eBPF technology, and what it was like to write a book while also holding a full-time job. Corey and Liz also explore the learning curve that comes with kernel programming, and Liz illustrates why it's so important to be able to explain complex technologies in simple terminology.
About Liz
Liz Rice is Chief Open Source Officer with eBPF specialists Isovalent, creators of the Cilium cloud native networking, security and observability project. She sits on the CNCF Governing Board, and on the Board of OpenUK. She was Chair of the CNCF's Technical Oversight Committee in 2019-2022, and Co-Chair of KubeCon + CloudNativeCon in 2018. She is also the author of Container Security, and Learning eBPF, both published by O'Reilly.
She has a wealth of software development, team, and product management experience from working on network protocols and distributed systems, and in digital technology sectors such as VOD, music, and VoIP. When not writing code, or talking about it, Liz loves riding bikes in places with better weather than her native London, competing in virtual races on Zwift, and making music under the pseudonym Insider Nine.
Links Referenced:
Isovalent: https://isovalent.com/
Learning eBPF: https://www.amazon.com/Learning-eBPF-Programming-Observability-Networking/dp/1098135121
Container Security: https://www.amazon.com/Container-Security-Fundamental-Containerized-Applications/dp/1492056707/
GitHub for Learning eBPF: https://github.com/lizRice/learning-eBPF
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Our returning guest today is Liz Rice, who remains the Chief Open Source Officer with Isovalent. But Liz, thank you for returning, suspiciously closely timed to when you have a book coming out. Welcome back.
Liz: [laugh]. Thanks so much for having me. Yeah, I've just—I've only had the physical copy of the book in my hands for less than a week. It's called Learning eBPF. I mean, obviously, I'm very excited.
Corey: It's an O'Reilly book; it has some form of honeybee on the front of it as best I can tell.
Liz: Yeah, I was really pleased about that. Because eBPF has a bee as its logo, so getting a [early 00:01:17] honeybee as the O'Reilly animal on the front cover of the book was pretty pleasing, yeah.
Corey: Now, this is your second O'Reilly book, is it not?
Liz: It's my second full book. So, I'd previously written a book on Container Security. And I've done a few short reports for them as well. But this is the second, you know, full-on, you-can-buy-it-on-Amazon kind of book, yeah.
Corey: My business partner wrote Practical Monitoring for O'Reilly, and that was such an experience that he got entirely out of observability as a field and went running to AWS bills as a result. So, my question for you is, why would anyone do that more than once?
Liz: [laugh]. I really like explaining things. And I had a really good reaction to the Container Security book.
I think already, by the time I was writing that book, I was kind of interested in eBPF. And we should probably talk about what that is, but I'll come to that in a moment.
Yeah, so I've been really interested in eBPF for quite a while and I wanted to be able to do the same thing in terms of explaining it to people. A book gives you a lot more opportunity to go into more detail and show people examples and get them kind of hands-on than you can do in a, you know, 40-minute conference talk. So, I wanted to do that. I will say I have written myself a note to never do a full-size book while I have a full-time job because it's a lot [laugh].
Corey: You do have a full-time job and then some. As we mentioned, you're the Chief Open Source Officer over at Isovalent, you are on the CNCF governing board, you're on the board of OpenUK, and you've done a lot of other stuff in the open-source community as well. So, I have to ask, taking all of that together, are you just allergic to things that make money? I mean, writing the book as well on top of that. I'm told you never do it for the money piece; it's always about the love of it. But it seems like, on some level, you're taking it to an almost ludicrous level.
Liz: Yeah, I mean, I do get paid for my day job. So, there is that [laugh]. But so, yeah—
Corey: I feel like that's the only way to really write a book: to wind up doing it for—what someone else is paying you for doing it, viewing it as a marketing exercise. It pays dividends, but those dividends don't, in my experience from what I've heard everyone say, pay off as royalties on book payments.
Liz: Yeah, I mean, it's certainly, you know, not a bad thing to have that income stream, but it certainly wouldn't make you—you know, I'm not going to retire tomorrow on the royalty stream unless this podcast has loads and loads of people buy the book [laugh].
Corey: Exactly. And I'm always a fan of having such [unintelligible 00:03:58]. I will order it while we're on the call right now having this conversation because I believe in supporting the things that we want to see more of in the world. So, explain to me a little bit about what it is. Whenever you're talking about learning X in a title, I find that that's often going to be much more approachable than arcane nonsense deep-dive things.
One of the O'Reilly books that changed my understanding was Linux Kernel Internals, or Understanding the Linux Kernel. Understanding was kind of a heavy lift at that point because it got very deep very quickly, but I absolutely came away understanding what was going on a lot more effectively, even though I was so slow I needed a tow rope on some of it. When you have a book that starts with Learning, though, I imagine it assumes starting at zero with, "What's eBPF?" Is that directionally correct, or does it assume that you know a lot of things you don't?
Liz: Yeah, that's absolutely right. I mean, I think eBPF is one of these technologies that is starting to be, particularly in the cloud-native world, you know, it comes up; it's quite a hot technology. What it actually is, so it's an acronym, right? EBPF. That acronym is almost meaningless now.
So, it stands for extended Berkeley Packet Filter. But I feel like it does so much more than filtering, so we might as well forget that altogether. And it's just become a term, a name in its own right if you like. And what it really does is it lets you run custom programs in the kernel so you can change the way that the kernel behaves, dynamically.
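[Editor's note: a minimal illustration of "custom programs in the kernel," using the BCC Python front end. This assumes a Linux host with bcc installed and root privileges, and on newer kernels the syscall symbol may be named differently; it is a sketch, not the book's example code.]

    # Load a tiny eBPF program into the kernel and print a line on every clone().
    from bcc import BPF

    program = r"""
    int kprobe__sys_clone(void *ctx) {
        bpf_trace_printk("clone() called\n");
        return 0;
    }
    """

    b = BPF(text=program)  # compiles the C snippet and loads it into the kernel
    b.trace_print()        # streams the kernel trace pipe to stdout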
Liz: And that is… it's a superpower. It's enabled all sorts of really cool things that we can do with that superpower.
Corey: I just pre-ordered it as a paperback on Amazon and it shows me that it is now number one new release in Linux Networking and Systems Administration, so you're welcome. I'm sure it was me that put it over the top.
Liz: Wonderful. Thank you very much. Yeah [laugh].
Corey: Of course, of course. Writing a book is one of those things that I've always wanted to do, but never had the patience to sit there and do it, or I thought I wasn't prolific enough. But over the holidays, this past year, my wife and business partner and a few friends all chipped in to have all of the tweets that I'd sent bound into a series of leather volumes. Apparently, I've tweeted over a million words. And… yeah, oh, so I have to write a book 280 characters at a time, mostly from my phone. I should tweet less was really the takeaway that I took from a lot of that.
But that wasn't edited, that wasn't with an overall theme or a narrative flow the way that an actual book is. It just feels like a term paper on steroids. And I hated term papers. Love reading; not one to write it.
Liz: I don't know whether this should make it into the podcast, but it reminded me of something that happened to my brother-in-law, who's an artist. And he put a piece of video on YouTube. And for unknowable reasons, if you mistyped YouTube and you spelt it U-T-U-B-E, the page that you would end up at from a Google search was a YouTube video, and it was, in fact, my brother-in-law's video. And people weren't expecting to see this kind of art movie about matches burning. And he just had the worst comments—like, people were so mean in the comments. And he had millions of views because people were hitting this page by accident, and he ended up—
Corey: And he committed the cardinal sin: never read the comments. Never break that rule. As soon as you do that, it doesn't go well. I do read the comments on various podcast platforms on this show because I always tell people to insult me all they want; just make sure you leave a five-star review.
Liz: Well, he ended up publishing a book with these comments, like, one comment per page, and most of them are not-safe-for-public-consumption comments, and he just called it Feedback. It was quite something [laugh].
Corey: On some level, it feels like O'Reilly books are a little insulated from the general population when it comes to terrible nonsense comments, just because they tend to be a little bit more expensive than the typical novel you'll see in an airport bookstore, and again, even though it is approachable, Learning eBPF isn't exactly the sort of title that gets people to think that, "Ooh, this is going to be a heck of a thriller slash page-turner with a plot." "Well, I found the protagonist unrelatable," is not the sort of thing you're going to wind up seeing in the comments because people thought it was going to be something different.
Liz: I know. One day, I'm going to have to write a technical book that is also a murder mystery. I think that would be, you know, quite an achievement.
But yeah, I mean, it's definitely aimed at people who have already come across the term, want to know more, and particularly if you're the kind of person who doesn't want to just have a hand-wavy explanation that involves boxes and diagrams, but if, like me, you kind of want to feel the code, and you want to see how things work and you want to work through examples, then that's the kind of person who might—I hope—enjoy working through the book and end up with a workable mental model of how eBPF works, even though it's essentially kernel programming.
Corey: So, I keep seeing eBPF in an increasing number of areas: a bunch of observability tools, a bunch of security tools all tend to tie into it. And I've seen people do interesting things as far as cost analysis with it. The problem that I run into is that I'm not able to wind up deploying it universally, just because when I'm going into a client engagement, I am there in a purely advisory sense, given that I'm biasing these days for both SaaS companies and large banks. That latter category is likely going to have some problems if I say, "Oh, just take this thing and go ahead and deploy it to your entire fleet." If they don't have a problem with that, I have a problem with their entire business security posture. So, I don't get to be particularly prescriptive as far as what to do with it.
But if I were running my own environment, it is pretty clear by now that I would have explored this in some significant depth. Do you find that it tends to be something that is used primarily in microservices environments? Does it effectively require Kubernetes to become useful on day one? What is the onboarding path where people would sit back and say, "Ah, this problem I'm having, eBPF sounds like the solution"?
Liz: So, when we write tools that are typically going to be some sort of infrastructure, observability, security, networking tools, if we're writing them using eBPF, we're instrumenting the kernel. And the kernel gets involved every time our application wants to do anything interesting, because whenever it wants to read or write to a file, or send and receive network messages, or write something to the screen, or allocate memory, or all of these things, the kernel has to be involved. And we can use eBPF to instrument those events and do interesting things. And the kernel doesn't care whether those processes are running in containers, under Kubernetes, or just running directly on the host; all of those things are visible to eBPF.
So, in one sense, it doesn't matter. But one of the reasons why I think we're seeing eBPF-based tools really take off in cloud-native is that, by applying some programming, you can link events that happened in the kernel to specific containers in specific pods in whatever namespace and, you know, get the relationship between an event and the Kubernetes objects that are involved in that event. And then that enables a whole lot of really interesting observability or security tools, and it enables us to understand how network packets are flowing between different Kubernetes objects and so on. So, it's really having this vantage point in the kernel where we can see everything, and we didn't have to change those applications in any way to be able to use eBPF to instrument them.
Corey: When I see the stories about eBPF, it seems like it's focused primarily on networking and flow control. That's where I'm seeing it from a security standpoint; that's where I'm seeing it from a cost allocation aspect.
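[Editor's note: to make Liz's point concrete (the kernel sees every interesting event, containerized or not), here is a hedged BCC sketch that traces openat() for every process on the host from a single program. Same assumptions as above: Linux, bcc, root.]

    # One kernel, one probe: report which command opens files, host-wide.
    from bcc import BPF

    program = r"""
    TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
        char comm[16];
        bpf_get_current_comm(&comm, sizeof(comm));
        bpf_trace_printk("openat by %s\n", comm);
        return 0;
    }
    """

    BPF(text=program).trace_print()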
Corey: Because, frankly, out of the box, from a cloud provider's perspective, Kubernetes looks like a single-tenant application with a really weird behavioral pattern, and some of that crosstalk gets very expensive. Is there a better way than either using eBPF and/or VPC flow logs to figure out what's talking to what in the Kubernetes ecosystem, or is BPF really your first port of call?
Liz: So, I'm coming at this from the perspective of working for the company that created the Cilium networking project. And one of the reasons why I think Cilium is really powerful is because it has this visibility—it's got a component called Hubble—that allows you to see exactly how packets are flowing between these different Kubernetes identities. So, in a Kubernetes environment, there's not a lot of point having network flows that talk about IP addresses and ports when what you really want to know is, what's the Kubernetes namespace, what's the application? Defining things in terms of IP addresses makes no sense when they're just being refreshed and renewed every time you change pods. So yeah, Kubernetes changes the requirements on networking visibility and on firewalling as well, on network policy. And that, I think, is: you don't have to use eBPF to create those tools, but eBPF is a really powerful and efficient platform for implementing those tools, as we see in Cilium.
Corey: The only competitor I found to it that gives a reasonable explanation of why random things are transferring multiple petabytes between each other in the middle of the night has been oral tradition, where I'm talking to people who've been around there for a while. It's, "So, I'm seeing this weird traffic pattern at these times of day. Any idea what that might be?" And someone will usually perk up and say, "Oh, is it—" whatever job that they're doing. Great. That gives me a direction to go in.
But especially in this era of layoffs, and as environments exist for longer and longer, you have to turn into a bit of a data center archaeologist. That remains insufficient, on some level. And on some level, I'm annoyed with trying to understand or needing to use tooling like this that is honestly this powerful and this customizable, and yes, on some level, this complex in order to get access to that information in a meaningful sense. But on the other, I'm glad that that option is at least there for a lot of workloads.
Liz: Yeah. I think, you know, that speaks to the power of this new generation of tooling. And the same kind of thing applies to security forensics as well, where you might have an enormous stream of events, but unless you can tie those events back to specific Kubernetes identities—which you can use eBPF-based tooling to do—then how do you—the forensics job of tying back where did that event come from, what was the container that was compromised, it becomes really, really difficult. And eBPF tools—like, Cilium has a sub-project called Tetragon that is really good at this kind of tying events back to the Kubernetes pod, or whether we want to know what node it was running on, what namespace, or whatever. That's really useful forensic information.
Corey: Talk to me a little bit about how broadly applicable it is.
Because from my understanding from our last conversation, when you were on the show a year or so ago, if memory serves, one of the powerful aspects of it was very similar to what I've seen some of Brendan Gregg's nonsense doing in his kind of various talks, where you can effectively write custom programming on the fly and it'll tell you exactly what it is that you need. Is this something that can be instrumented once and then effectively used for basically anything, [OTEL 00:16:11]-style, or instead, does it need to be effectively custom-configured every time you want to get a different aspect of information out of it?
Liz: It can be both of those things.
Corey: "It depends." My least favorite but probably the most accurate answer to hear.
Liz: [laugh]. But I think Brendan did a really great—he's done many talks talking about how powerful BPF is and built lots of specific tools, but then he's also been involved with bpftrace, which is kind of like a language for—a high-level language for saying what it is that you want BPF to trace out for you. So, a little bit like, I don't know, awk but for events, you know? It's a scripting language. So, you can have this flexibility.
And with something like bpftrace, you don't have to get into the weeds yourself and do kernel programming, you know, in eBPF programs. But also there's gainful employment to be had for people who are interested in that eBPF kernel programming because, you know, I think there's just going to be a whole range of more tools to come, you know? I think we're, you know, we're seeing some really powerful tools with Cilium and Pixie and [Parker 00:17:27] and Kepler and many other tools and projects that are using eBPF. But I think there's also a whole load more to come as people think about different ways they can apply eBPF and instrument different parts of an overall system.
Corey: We're doing this over audio only, but behind me on my wall is one of my least favorite gifts ever to have been received by anyone. Mike, my business partner, got me a thousand-piece puzzle of the Kubernetes container landscape where—
Liz: [laugh].
Corey: This diagram is psychotic and awful and it looks like a joke, except it's not. And building that puzzle was maddening—obviously—but beyond that, it was a real primer in just how vast the entire container slash Kubernetes slash CNCF landscape really is. So, looking at this, I found that the only reaction that was appropriate was a sense of overwhelmed awe slash frustration, I guess. It's one of those areas where I spend a lot of time focusing on drinking from the AWS firehose because they have a lot of products and services because their product strategy is apparently, "Yes," and they're updating these things in a pretty consistent cadence. Mostly. And even that feels like it's multiple full-time jobs shoved into one.
There are hundreds of companies behind these things and all of them are in areas that are incredibly complex and difficult to go diving into. eBPF is incredibly powerful, I would say ridiculously so, but it's also fiendishly complex; at least shoulder-surfing behind people who know what they're doing with it has been breathtaking, on some level. How do people find themselves in a situation where doing a BPF deep dive makes sense for them?
Liz: Oh, that's a great question. So, first of all, I'm thinking, is there an AWS jigsaw as well, like the CNCF landscape jigsaw? There should be. And how many pieces would it have?
[It would be very cool 00:19:28].
Corey: No, because I think the CNCF at one point hired a graphic designer, and it's unclear that AWS has done such a thing because their icons for services are, to be generous here, not great. People have flashcards that they've built for, as in, which service does this logo represent? Haven't a clue, in almost every case, because I don't care in almost every case. But yeah, I've toyed with the idea of doing it. It's just not something that I'd ever want to have my name attached to, unfortunately. But yeah, I want someone to do it and someone else to build it.
Liz: Yes. Yeah, it would need to refresh every, like, five minutes, though, as they roll out a new service.
Corey: Right. Because given that it appears from the outside to be impenetrable, it's similar to learning vi in some cases, where oh, yeah, it's easy to get started with to do this trivial thing. Now, step two: draw the rest of the freaking owl. Same problem there. It feels off-putting just from a perspective of, you must be at least this smart to proceed. How do you find people coming to it?
Liz: Yeah, there is some truth in that, in that beyond kind of Hello World, you quite quickly start having to do things with kernel data structures. And as soon as you're looking at kernel data structures, you have to sort of understand, you know, more about the kernel. And if you change things, you need to understand the implications of those changes. So, yeah, you can rapidly say that eBPF programming is kernel programming, so why would anybody want to do it? The reason why I do it myself is not because I'm a kernel programmer; it's because I wanted to really understand how this is working and build up a mental model of what's happening when I attach a program to an event. And what kinds of things can I do with that program?
And that's the sort of exploration that I think I'm trying to encourage people to do with the book. But yes, there is going to be, at some point, a pretty steep learning curve that's kernel-related. But you don't necessarily need to know everything in order to really have a decent understanding of what eBPF is, and how you might, for example—you might be interested to see what BPF programs are running on your existing system, and learn why, and what they might be doing, and where they're attached, and what use could that be.
Corey: Falling down that path, looking at the process table once upon a time was a heck of an education, one week when I didn't have a lot to do and I didn't like my job in those days, where, "Oh, what is this Avahi daemon that's constantly running? mDNS forwarding? Who would need that?" And sure enough, that tickled something in the back of my mind when I wound up building out my networking box here on top of BSD, and oh, yeah, I want to make sure that I can still have discovery work from the IoT subnet over to wherever it is that my normal devices live. Ah, that's what that thing is always running for. Great, for that one use case. Almost never needed in other cases, but awesome. Like, you fire up a Raspberry Pi. It's, "Why are all these things running when I just want to have an embedded device that does exactly one thing well?" Ugh. Computers have gotten complicated.
Liz: I know. It's like when you get those pop-ups on—well, certainly on Mac, you get pop-ups occasionally, let's say such-and-such a daemon wants extra permissions, and you think, I'm not hitting that yes button until I understand what that daemon is.
And it turns out it's related to something completely innocuous that you've actually paid for, but just under a different name. Very annoying. So, if you have some kind of instrumentation like tracing or logging or security tooling that you want to apply to all of your containers, one of the things you can use is a sidecar container approach. And in Kubernetes, that means you inject the sidecar into every single pod. And—
Corey: Yes. Of course, the answer to any Kubernetes problem appears to be, have you tried running additional containers?
Liz: Well, right. And there are challenges that can come from that. And one of the reasons why you have to do that is because if you want a tool that has visibility over that container that's inside the pod, well, your instrumentation has to also be inside the pod so that it has visibility, because your pod is, by design, isolated from the host it's running on. But with eBPF, well, eBPF is in the kernel and there's only one kernel, however many containers we're running. So, there is no kind of isolation between the host and the containers at the kernel level.
So, that means if we can instrument the kernel, we don't have to have a separate instance in every single pod. And that's really great for all sorts of resource usage; it means you don't have to worry about how you get those sidecars into those pods in the first place; you know that every pod is going to be instrumented if it's instrumented in the kernel. And then for service mesh, service mesh usually uses a sidecar as a Layer 7 proxy injected into every pod. And that actually makes for a pretty convoluted networking path for a packet to sort of go from the application, through the proxy, out to the host, back into another pod, through another proxy, into the application.
What we can do with eBPF, we still need a proxy running in userspace, but we don't need to have one in every single pod because we can connect the networking namespaces much more efficiently. So, that was essentially the basis for sidecarless service mesh, which we did in Cilium, and Istio is now using a similar sort of approach with Ambient Mesh. So that, again, you know, avoids having the overhead of a sidecar in every pod. So that, you know, seems to be the way forward for service mesh as well as other types of instrumentation: avoiding sidecars.
Corey: On some level, avoiding things that are Kubernetes staples seems to be a best practice in a bunch of different directions. It feels like it's an area where you start to get aligned with the idea of service meesh—yes, that's how I pluralize the term service mesh, and if people have a problem with that, please, it's imperative you not send me letters about it—but this idea of discovering where things are in a variety of ways within a cluster, where things can talk to each other, when nothing is deterministically placed, it feels like it is screaming out for something like this.
Liz: And when you think about it, Kubernetes does sort of already have that at the level of a service, you know? Services are discoverable through native Kubernetes. There's a bunch of other capabilities that we tend to associate with service mesh, like observability or encrypted traffic or retries, that kind of thing. But one of the things that we're doing with Cilium, in general, is to say, well, a lot of this is just a feature of the networking, the underlying networking capability.
So, for example, we've got a next-generation mutual authentication approach, which is using SPIFFE IDs between an application pod and another application pod. So, it's like the equivalent of mTLS.
But the certificates are actually being passed into the kernel and the encryption is happening at the kernel level. And it's a really neat way of saying we don't need… we don't need to have a sidecar proxy in every pod in order to terminate those TLS connections on behalf of the application. We can have the kernel do it for us, and that's really cool.
Corey: Yeah, at some level, I find that it still feels weird—because I'm old—to have this idea of one shared kernel running a bunch of different containers. I got past that just by not requiring that [unintelligible 00:27:32] workloads need to run isolated, having containers run on the same physical host. I found that, for example, running some stuff, even in my home environment for IoT stuff, things that I don't particularly trust run inside of KVM on top of something, as opposed to just running it as a container on a cluster. Almost certainly stupendous overkill for what I'm dealing with, but it's a good practice to be in to start thinking about this. To my understanding, this is part of what AWS's Firecracker project starts to address a bit more effectively: fast provisioning, but still being able to use different primitives as far as isolation boundaries go. But, on some level, it's nice to not have to think about this stuff, but that's dangerous.
Liz: [laugh]. Yeah, exactly. Firecracker is a really nice way of saying, "Actually, we're going to spin up a whole VM," but we don't ne—when I say 'whole VM,' we don't need all of the things that you normally get in a VM. We can get rid of a ton of things and just have the essentials for running that Lambda or container service, and it becomes a really nice lightweight solution. But yes, that will have its own kernel, so unlike, you know, running multiple kernels on the same VM where—sorry, running multiple containers on the same virtual machine where they would all be sharing one kernel, with Firecracker you'll get a kernel per instance of Firecracker.
Corey: The last question I have for you before we wind up wrapping up this episode harkens back to something you said a little bit earlier. This stuff is incredibly technically nuanced and deep. You clearly have a thorough understanding of it, but you also have what I think many people do not realize is an orthogonal skill: being able to articulate and explain those complex concepts simply and approachably, in ways that make people understand what it is you're talking about, but also don't feel like they're being spoken to in a way that's highly condescending, which is another failure mode. I think it is not particularly well understood, particularly in the engineering community, that these are different skill sets that do not necessarily align congruently. Is this something you've always known, or is this something you've figured out as you've evolved your career—that, oh, I have a certain flair for this?
Liz: Yeah, I definitely didn't always know it. And I started to realize it based on feedback that people have given me about talks and articles I'd written. I think I've always felt that when people use jargon or they use complicated language or they, kind of, make assumptions about how things are, it quite often speaks to them not having a full understanding of what's happening.
If I want to explain something to myself, I'm going to use straightforward language to explain it to myself [laugh] so I can hold it in my head. And I think people appreciate that.
And you can get really—you know, you can get quite in-depth into something if you just start, step by step, build it up, explain everything as you go along the way. And yeah, I think people do appreciate that. And I think people, if they get lost in jargon, it doesn't help anybody. And yeah, I very much appreciate it when people say that, you know, they saw a talk or they read something I wrote and it meant that they finally grokked whatever that concept was that I was trying to explain.
I will say, at the weekend, I asked ChatGPT to explain DNS in the style of Liz Rice, and it started off, it was basically, "Hello there. I'm Liz Rice and I'm here to explain DNS in very simple terms." I thought, "Okay." [laugh].
Corey: Every time I think I've understood DNS, there's another level to it.
Liz: I'm pretty sure there is a lot about DNS that I don't understand, yeah. So, you know, there's always more to learn out there.
Corey: There certainly is. I really want to thank you for taking the time to speak with me today about what you're up to. Where's the best place for people to find you to learn more? And of course, to buy the book.
Liz: Yeah, so I am Liz Rice pretty much everywhere, all over the internet. There is a GitHub repo that accompanies the book; you can find it on GitHub: lizRice/learning-eBPF. So, that's a good place to find some of the example code, and it will obviously link to where you can download the book or buy it—because you can pay for it; you can also download it from Isovalent for the price of your contact details. So, there are lots of options.
Corey: Excellent. And we will, of course, put links to that in the [show notes 00:32:08]. Thank you so much for your time. It's always great to talk to you.
Liz: It's always a pleasure, so thanks very much for having me, Corey.
Corey: Liz Rice, Chief Open Source Officer at Isovalent. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that you have somehow discovered this episode by googling for knitting projects.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Organizations that experience Monitoring Data Obesity – having too many arbitrary logs or metrics without context – are suffering twice: high cost for storage and not getting the answers they need!
OpenTelemetry, the cloud native standard for observability, solves those challenges and therefore sees rapid adoption from both startups and established enterprises.
In this episode we have Daniel Gomez Blanco (@dan_gomezblanco), Principal Software Engineer at Skyscanner and author of the recently published book Practical OpenTelemetry.
Tune in and learn about the latest status of OpenTelemetry, lessons learned from adopting OpenTelemetry in a large organization, considerations between metrics and traces, the difference between statistical and tail-based sampling, and much more.
Here are the links we discussed during the episode:
Chat I had with M. Hausenblas on his podcast the other day: https://inuse.o11y.engineering/episode/meet-daniel-skyscanner
Link to QCon talk (although I believe the video won't be made available till later in the year): https://qconlondon.com/presentation/mar2023/effective-and-efficient-observability-opentelemetry
Recent InfoQ interview covering the talk: https://www.infoq.com/news/2023/03/effective-observability-otel/
Video on a talk I did with Ted Young a couple years ago during our tracing migration to OpenTelemetry: https://youtu.be/HExcLWA2b8M
Talk at o11yfest 2021 on our tracing migration to OTel: https://vi.to/hubs/o11yfest/videos/3143
Mastodon: https://mas.to/@dan_gomezblanco
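[Editor's note: the episode mentions the difference between statistical and tail-based sampling. Below is a hedged sketch of the statistical (head-based) side with the OpenTelemetry Python SDK; the 10% ratio is an arbitrary example value. Tail-based sampling, by contrast, is typically done in the OpenTelemetry Collector after whole traces have been buffered, and isn't shown here.]

    # Keep roughly 10% of traces, decided up front from the trace ID.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

    provider = TracerProvider(sampler=TraceIdRatioBased(0.10))
    trace.set_tracer_provider(provider)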
One World in a New World with Eileen Bild - Founder OTEL Universe, Executive Producer. Eileen and Trever Bild are living the dream of working together in their company, Ordinary To Extraordinary Life. Eileen admits she's been on a journey of self-discovery since her early 40s, consistent with statistical reports. Important childhood question: 'Can I fit in like I am?' Enjoy the exploration of that question with Eileen. A medical condition threatened her life, taking her into a deep reconciliation. She gave a vehement call request to either be taken or healed. She recovered what she was missing, the connection, and dove into the process. Her description and story are worth exploring here in this conversation. Does your authentic expression overwhelm others? How do you navigate that experience? Are people attracted to you, challenge you, and cause you to grow in uncomfortable ways? Eileen reveals some vulnerable moments in her development. Along about the 16-minute mark is a profound truth Eileen expresses that for sure will ripple through your thoughtmosphere with a lot of empathic resonance. The notion of 'nothingness' comes up for an apocalyptic moment as the discussion moves into managing the thoughts in our mind. What did you do previously that has precipitated your 'now moment'? In observing your creation cycles, what can you do differently to grow and mature into your 'flow' in life? Are you able to sit and observe mindfully easily? Does it give you a sense of autonomy? Eileen posits some very interesting questions regarding the nature of maturity. Her insights are profound in reflecting on the parental nurturing, or lack thereof, in child development processes. How can we nurture the nature within children to help them meet the challenges we've left them? Can we help develop better educational and world views to support the emerging global village? Further into the conversation the layers of the onion are peeled back, reflected and reviewed in apocalyptic style where subtle distinctions and evolutions in perception are revealed. The conversation is insightful, lively and vulnerable in the depth of the dive. Dive in with Eileen and Zen, ponder your own sojourn and reflect on what you've garnered as insight and/or understanding. Thanks in advance and please do share with those you think will benefit. Connect with Eileen on LI: https://www.linkedin.com/in/eileenbild/ OTEL Universe website: https://www.oteluniverse.com/ Eileen's Coaching website: corethinkingblueprint.com BizCatalyst 360: https://www.bizcatalyst360.com/otel-universe-a-universal-voice/ _________ Connect with Zen: https://linkedin.com/zenbenefiel Zen's web: https://zenbenefiel.com Transformational Coaching: https://BeTheDream.com Live and Let Live Global Peace Movement: https://liveandletlive.org
To mark the first anniversary of Russia's invasion of Ukraine, Tina Daheley talks to documentary film directors Alisa Kovalenko and Yelizaveta Smith about their experiences over the past year and how that has shaped their work. Alisa's feature We Will Not Fade Away tells the story of teenagers growing up in eastern Ukraine against the background of war and was selected for the Berlin Film Festival. Yelizaveta's feature School Number Three is about a school in the Donbas, which was destroyed during the war. Andrey Kurkov is one of Ukraine's most famous and prolific writers. His novel Death And The Penguin is a worldwide best seller and his books are full of black humour and intrigue. He is also a diarist who has been sharing his thoughts and experiences on life in Ukraine for the BBC. To mark this first anniversary he has written a piece especially for The Cultural Frontline. Ukrainian comedian Hanna Kochegura is currently taking her stand-up across Ukraine in a countrywide tour visiting 19 cities. She tells us why humour can be powerful in a time of war. Over the past decade, the club scene in Kyiv has been growing, with thousands of people attending raves known for their raw energy and vibe. One of the people at the centre of this scene is Pavlo Derhachov, co-founder and manager of the experimental club Otel'. He told The Cultural Frontline about the impact of the invasion on the club. (Image: A drawing of a bird on a wall in Kyiv. Credit: Roman Pilipey/Getty Images)
About Richard
Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.
Links Referenced:
Grafana Labs: https://grafana.com/
Twitter: https://twitter.com/TwitchiH
Richard Hartmann list of talks: https://github.com/richih/talks
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500-plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third-party services in a single pane of glass.
Combine these with drag-and-drop dashboards and machine-learning-based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14-day trial and get a complimentary T-shirt when you install the agent.
To learn more, visit datadoghq.com/screaminginthecloud to get started. That's www.datadoghq.com/screaminginthecloud.
Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. There are an awful lot of people who are incredibly good at understanding the ins and outs and the intricacies of the observability world. But they didn't have time to come on the show today.
Instead, I am talking to my dear friend of two decades now, Richard Hartmann, better known on the internet as RichiH, who is the Director of Community at Grafana Labs, here to suffer—in a somewhat atypical departure for the theme of this show—personal attacks for once. Richie, thank you for joining me.
Richard: And thank you for agreeing on personal attacks.
Corey: Exactly. It was one of your riders. Like, there have to be the personal attacks back and forth or you refuse to appear on the show. You've been on before. In fact, the last time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to?
You're still at Grafana Labs. And in many cases, I would point out that, wow, you've been there for many years; that seems to be an atypical thing, which is an American tech industry perspective, because every time you and I talk about this, you look at folks who—wow, you were only at that company for five years, what's wrong with you—you tend to take the longer view and I tend to have the fast-twitch, time-to-go-ahead-and-leave-jobs-because-it's-been-more-than-20-minutes approach. I see that you're continuing to live what you preach, though. How's it been?
Richard: Yeah, so there's a little bit of Covid brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center. But the last two-and-a-half years didn't really happen for many people, myself included. So, I guess [laugh] that includes you.
Corey: No, no, you're right. You've only been at Grafana Labs a couple of years. One would think I would check the notes for shooting my mouth off. But then, one wouldn't know me.
Richard: What notes? Anyway, I've been around Prometheus and Grafana since 2015. But, like, real, full-time everything is 2020. There was something in between. Since 2018, I contracted to do vulnerability handling and everything for Grafana Labs because they had something and they didn't know how to deal with it.
But no, full time is 2020. But as to the space in the [unintelligible 00:02:45] of itself, it's maybe a little bit German of me, but trying to understand the real world and trying to get an overview of systems and how they actually work, and if they are working correctly and as intended, and if not, how they're not working as intended, and how to fix this, is something which has always been super important to me, in part because I just want to understand the world. And this is a really, really good way to automate understanding of the world. So, it's basically a work-saving mechanism. And that's why I've been sticking to it for so long, I guess.
Corey: Back in the early days of monitoring systems—so we called it monitoring back then because, you know, using simple words that lack nuance was sort of de rigueur back then—we wound up effectively having tools. Nagios is the one that springs to mind, and it was terrible in all the ways you would expect a tool written in janky Perl in the early-2000s to be. But it told you what was going on. It tried to do a thing, generally reach a server or query it about things, and when things fell out of certain specs, it screamed its head off, which meant that when you had things like the core switch melting down—thinking of one very particular incident—you didn't get a Nagios alert; you got 4000 Nagios alerts.
But start to finish, you could wrap your head rather fully around what Nagios did and why it did the sometimes strange things that it did.
These days, when you take a look at Prometheus, which we hear a lot about, particularly in the Kubernetes space, and Grafana, which is often mentioned in the same breath, it's never been quite clear to me exactly where those start and stop. It always feels like it's a component in a larger system to tell you what's going on rather than a one-stop shop that's going to, you know, shriek its head off when something breaks in the middle of the night. Is that the right way to think about it? The wrong way to think about it?
Richard: It's a way to think about it. So personally, I use the terms monitoring and observability pretty much interchangeably. Observability is a relatively well-defined term, even though most people won't agree. But if you look back into the '70s, into control theory where the term is coming from, it is the measure of how much you're able to determine the internal state of a system by looking at its inputs and its outputs. Depending on the definition, some people don't include the inputs, but that is the OG definition as far as I'm aware.
And from this, there flow a lot of things. This question of—or this interpretation of the difference between telling that, yes, something's broken versus why something's broken. Or if you can't ask new questions on the fly, it's not observability. Like, all of those things are fundamentally mapped to this definition of, I need enough data to determine the internal state of whatever system I have just by looking at what is coming in and what is going out. And that is at the core the thing. Now, obviously, it's become a buzzword, which is oftentimes the fate of successful things. So, it's become a buzzword, and you end up with cargo culting.
Corey: I would argue periodically that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors. Which is tongue-in-cheek, but she has opinions, made, shall I say, nonetheless frustrating by the fact that she is invariably correct in those opinions, which just somehow makes it so much worse. It would be easy to dismiss things she says if she weren't always right. And the world is changing, especially as we get into the world of distributed systems.
"Is the server that runs the app working or not working" loses meaning when we're talking about distributed systems, when we're talking about containers running on top of Kubernetes, which turns every outage into a murder mystery. We start having distributed applications composed of microservices, so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue related to the request coming into a completely separate microservice? And it seems that for those types of applications, the answer has been tracing for a long time now, where originally that was something that felt like it sprung, fully formed, from the forehead of some god known as one of the hyperscalers, but now is available to basically everyone, in theory.
In practice, it seems that instrumenting applications is still one of the hardest parts of all of this. I tried hooking up one of my own applications to be observed via OTel, the OpenTelemetry project, and it turns out that right now, OTel and AWS Lambda have an intersection point that makes everything extremely difficult to work with. It's not there yet; it's not baked yet.
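[Editor's note: Richard's control-theory definition has a precise classical form, included here as a brief math aside. For a linear time-invariant system

    \dot{x} = Ax + Bu, \qquad y = Cx + Du

the system is observable (its internal state x can be determined from its inputs and outputs) exactly when the observability matrix has full rank:

    \mathcal{O} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix}, \qquad \operatorname{rank}(\mathcal{O}) = n

This is the Kalman rank condition from the control-theory literature he is pointing at; the episode itself uses the definition informally.]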
And someday, I hope that changes, because I would love to interchangeably just throw metrics and traces and logs to all the different observability tools and see which ones work and which ones don't, but that still feels very far away from the current state of the art.

Richard: Before we go there, maybe one thing which I don't fully agree with. You said that previously, you were told if a service is up or down, that's the thing which you cared about, and I don't think that's what people actually cared about. At that time, also, what they fundamentally cared about was: is the user-facing service up, or down, or impacted? Is it slow? Does it return errors for X percent of requests, something like this?

Corey: Is the site up? And—you're right, I was hand-waving over a whole bunch of things. It was, "Okay. First, is the web server returning a page, yes or no? Great. Can I ping the server?" Okay, well, there are ways a server can crash and still leave enough of the TCP/IP stack up that it can respond to pings and do little else.

And then you start adding things to it. But the Nagios check that I always wanted to add—and had to—was: is the disk full? And that was annoying. And, on some level, like, why should I care in the modern era how much stuff is on the disk, because storage is cheap and free and plentiful? The problem is, after the third outage in a month because the disk filled up, you start to not have a good answer for, well, why aren't you monitoring whether the disk is full?

And that was among the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to midsize applications, which is what I'm talking about, the only things that people would let me touch. I wasn't running hyperscale stuff where you have a fleet of 10,000 web servers and, "Is the server up?" Yeah, in that scenario, no one cares. But when we're talking about the database server and the two application servers and the four web servers talking to them, you think about it more in terms of pets than you do cattle.

Richard: Yes, absolutely. Yet, I think that was a mistake back then, and I tried to do it differently, with the disk as a specific example. And I'm absolutely agreeing that previous-generation tools limit you in how you can actually work with your data. In particular, once you're working with metrics, where you can do actual math on the data, it doesn't matter if the disk is almost full. It matters if that disk is going to be full within X amount of time.

If that disk is 98% full and it sits there at 98% for ten years and provides the service, no one cares. The thing is, will it actually run out in the next two hours, in the next five hours, what have you. Depending on this, is it currently or imminently customer-impacting or user-impacting? If yes, alert on it, raise hell, wake people, make them fix it; as opposed to, this thing can be dealt with during business hours on the next workday, and you don't have to wake anyone up.

Corey: Yeah. The big filer with massive amounts of storage has crossed the 70% line. Okay, now it's time to start thinking about that: what do you want to do? Maybe it's time to order another shelf of disks for it, which is going to take some time. That's a radically different scenario than the 20-gigabyte root volume on your server just started filling up dramatically; the rate of change is such that it'll be full in 20 minutes.

Yeah, one of those is something you want to wake people up for.
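That wake-someone-or-let-them-sleep decision is exactly what linear prediction over recent samples gives you. As a rough illustration, here is a minimal Python sketch of the idea; the sample data, the four-hour horizon, and the hand-rolled function are all hypothetical, and in practice Prometheus users would reach for the built-in PromQL function predict_linear() rather than writing this themselves:

```python
import time

def predict_linear(samples, horizon_seconds):
    """Least-squares linear extrapolation over (timestamp, value) samples,
    mirroring the idea behind PromQL's predict_linear()."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (
        sum((t - mean_t) * (v - mean_v) for t, v in samples)
        / sum((t - mean_t) ** 2 for t, _ in samples)
    )
    latest = samples[-1][0]
    # Value predicted horizon_seconds after the most recent sample.
    return mean_v + slope * (latest + horizon_seconds - mean_t)

# Hypothetical samples: free bytes on a disk, one sample per minute over
# the past hour, shrinking by roughly 20 MB per minute.
now = time.time()
samples = [(now - 3600 + i * 60, 50e9 - i * 2e7) for i in range(60)]

# Page someone only if the disk is predicted to be full within four hours.
if predict_linear(samples, horizon_seconds=4 * 3600) <= 0:
    print("wake someone up: disk predicted full within four hours")
else:
    print("let them sleep; deal with it during business hours")
```

The point is the shape of the alert: it fires on the projected value at some future time, not on the current fill percentage.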
Generally speaking, you don't want to wake people up for what is fundamentally a longer-term strategic business problem. That can be sorted out in the light of day, versus, "[laugh] we're not going to be making money in two hours if I don't wake up and fix this now." That's the kind of thing you generally want to be woken up for. Well, let's be honest, you don't want that to happen at all, but if it does happen, you kind of want to know in advance rather than after the fact.

Richard: You're literally describing predict_linear from Prometheus, which is precisely for this, where I can look back over X amount of time and make a linear prediction, because everything else breaks down at scale, blah, blah, blah, too much detail. But the thing is, I can draw a line with my pencil by hand on my data and I can predict when this thing is going to run out. Which is obviously precisely correct if I have a TLS certificate. It's a little bit more hand-wavy when it's a disk. But still, you can look into the future and say, "What will be happening in Y amount of time if current trends for the last X amount of time continue?" And that's precisely a thing where you get this more powerful ability of doing math with your data.

Corey: See, when you say it like that, it sounds like it actually is a whole term of art, where you're focusing on an in-depth field where salaries are astronomical. Whereas the tools that I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin that I was, grunting and pointing: "This is gonna fill up." And that is how I thought about it. And this is the challenge, where it's easy to think about these things in narrow, defined contexts like that, but at scale, things break.

Like the idea of anomaly detection. Well, okay, great: if normally the CPU and these things are super bored and suddenly it gets really busy, that's atypical. Maybe we should look into it, assuming that it has a challenge. The problem is, that is a lot harder than it sounds, because there are so many factors that factor into it. And as soon as you have something, quote-unquote, "intelligent," making decisions on this, it doesn't take too many false positives before you start ignoring everything it has to say and missing legitimate things. It's this weird and obnoxious conflation of both hard technical problems and human psychology.

Richard: And the breaking up of old service boundaries. Of course, when you say microservices and such—functionally a microservice, or nanoservice, or picoservice; the pendulum is already swinging back to larger units of complexity—it fundamentally does not make any difference if I have a monolith on some mainframe or if I have a bunch of microservices. Yes, I can scale differently, I can scale horizontally a lot more easily, vertically it's a little bit harder, blah, blah, blah, but fundamentally, the logic and the complexity which is being packaged is the same. More users, everything, but it is fundamentally the same. What's happening again and again is that I'm breaking up those old boundaries, which means the old tools, which have assumptions built in about certain aspects of how I can actually get an overview of a system, just start breaking down. When my complexity unit, or my service, or what have you, is congruent with a physical piece of hardware, or several services are congruent with that piece of hardware, it absolutely makes sense to think about things in terms of this one physical server.
The fact that you have different considerations in cloud, and microservices, and blah, blah, blah, does not inherently mean that it is more complex. On the contrary, it is fundamentally the same thing. It scales with users and everything, but it is fundamentally the same thing; I just have different boundaries for where I put interfaces onto my complexity, which basically allows me to hide all of this complexity from the downstream users.

Corey: That's part of the challenge that I think we're grappling with across this entire industry, from start to finish. Where we originally looked at these things and could reason about them because it's the computer and I know how those things work. Well, kind of, but okay, sure. But then we start layering levels of complexity on top of layers of complexity on top of layers of complexity, and suddenly, when things stop working the way that we expect, it can be very challenging to unpack and understand why. One of the ways I got into this whole space was understanding, to some degree, how system calls work, how the kernel wound up interacting with userspace, and how Linux systems worked from start to finish. And these days, that isn't particularly necessary most of the time for the care and feeding of applications.

The challenge is, when things start breaking, suddenly having that in my back pocket to pull out could be extremely handy. But I don't think it's nearly as central as it once was, and I don't know that I would necessarily advise someone new to this space to spend a few years as a systems person digging into a lot of those aspects. And this is why you need to know what inodes are and how they work. Not really, not anymore. It's not front and center the way that it once was, in most environments, at least in the world that I live in. Agree? Disagree?

Richard: Agreed. But it's very much unsurprising. You probably can't tell me precisely how to grow sugar cane or corn, you can't tell me how to refine the sugar out of it, but you can absolutely bake a cake. You will not be able to tell me even a third of—and, for the record, I'm also not able to tell you even a third of—the supply chain which goes from a field and some seeds to a package of refined sugar, yet you're absolutely enabled to do all of this. The thing is, you've been part of the previous generation of infrastructure, where you know how this underlying infrastructure works, so you have more ability to reason about this, but it's not needed for cloud services nearly as much.

You need different types of skill sets, but that doesn't mean the old skill set is completely useless, at least not as of right now. It's much more a case of: you need fewer of those people and you need them in different places, because those things have become infrastructure. Which is basically the cloud play, where a lot of this is just becoming infrastructure more and more.

Corey: Oh, yeah. Back then, I distinctly remember my elders looking down their noses at me because I didn't know assembly, and how could I possibly consider myself a competent systems admin if I didn't at least have a working knowledge of assembly? Or at least C, which I, over time, learned enough about to know that I didn't want to be a C programmer. And you're right, this is the value of cloud. Going back to those days, getting a web server up and running just to compile Apache's httpd took a week and an in-depth knowledge of GCC flags.

And then in time, oh, great, we're going to have rpm or debs.
Great, okay, then in time you have apt, if you're in Debian land, because I know you are a Debian developer, but over in Red Hat land, we had yum and other tools. And then in time, it became, oh, we can just use something like Puppet or Chef to ensure that thing is installed. And then, oh, just docker run. And now it's a checkbox in a web console for S3.

These things get easier with time, and step by step we're standing on the shoulders of giants. Even in the last ten years of my career, I used to have a great challenge question that I would interview people with: "Do you know what TinyURL is? It takes a short URL and then expands it to a longer one. Great, on the whiteboard, tell me how you would implement that." And you could go up one side and down the other, and then you could add constraints: multiple data centers; now one goes offline; how do you not lose data? Et cetera, et cetera.

But these days, there are so many ways to do that using cloud services that it almost becomes trivial. It's, okay: multiple data centers, API Gateway, a Lambda, and a global DynamoDB table. Now what? "Well, now it gets slow. Why is it getting slow?"

"Well, in that scenario, probably because of something underlying the cloud provider." "And so now you lose an entire AWS region. How do you handle that?" "Seems to me when that happens, the entire internet's kind of broken. Do people really need longer URLs?"

And that is a valid answer, in many cases. The question doesn't really work without a whole bunch of additional constraints that make it sound fake. And that's not a weakness. That is the fact that computers and cloud services have never been as accessible as they are now. And that's a win for everyone.

Richard: There's one aspect of accessibility which is actually decreasing—or two. A, you need to pay for them on an ongoing basis. And B, you need an internet connection which is suitably fast, low-latency, what have you. And those are things which actually do make things harder, for a variety of reasons. If I look at our back-end systems—as in Grafana—all of them have single-binary modes where you literally compile everything into a single binary and you can run it on your laptop, because if you're stuck on a plane, you otherwise can't do any work on it. That kind of thing is not the best of situations.

And if you have a huge CI/CD pipeline, everything is in the cloud and fine and dandy, but then your internet breaks. Yeah, so I do agree that it is becoming generally more accessible. I disagree that it is becoming more accessible along all possible axes.

Corey: I would agree. There is a silver lining to that as well, where, yes, they are fraught and dangerous and I would preface this with a whole bunch of warnings, but from a cost perspective, all of the cloud providers do have a free-tier offering where you can kick the tires on a lot of these things in return for no money. Surprisingly, the best one of those is Oracle Cloud, where they have an unlimited free tier: use whatever you want in this subset of services, and you will never be charged a dime. As opposed to the AWS model of free tier, where, well, okay, it suddenly got very popular or you misconfigured something, and surprise, you now owe us enough money to buy Belize. That doesn't usually lead to a great customer experience.

But you're right, you can't get away from needing an internet connection of at least some level of stability and throughput in order for a lot of these things to work.
The stuff you would do locally on a Raspberry Pi, for example, if you're budget-constrained and want to get something out there, or on your laptop? Great, that's not going to work in the same way as a full-on cloud service will.

Richard: It's not free unless you have hard guarantees that you're not going to ever pay anything. It's fine to send warnings, it's fine to switch the thing off, it's fine to have you hit random hard and soft quotas. It is not a free service if you can't guarantee that it is free.

Corey: I agree with you. I think that there needs to be a free offering where, "Well, okay, you want us to suddenly stop serving traffic to the world?" "Yes. When the alternative is you have to start charging me through the nose, yes, I want you to stop serving traffic." That is definitionally what it says on the tin.

And as an independent learner, that is what I want. Conversely, if I'm an enterprise, yeah, I don't care about money; we're running our Super Bowl ad right now, so whatever you do, don't stop serving traffic. Charge us all the money. And there's been a lot of hand-wringing about, well, how do we figure out which direction to go in? And it's: have you considered asking the customer?

So, on a scale of one to bank, how serious is this account going to be [laugh]? Like, what are your big concerns: never charge me, or never go down? Because we can build for either of those. Just let's make sure that all of those expectations are aligned. Because if you guess, you're going to get it wrong, and then no one's going to like you.

Richard: I would argue this: all those services from all cloud providers are actually built to address both of those. It's a deliberate choice not to offer certain aspects.

Corey: Absolutely. When I talk to AWS, it's like, "Yeah, but there is an eventual-consistency challenge in the billing system where it takes"—as anyone who's looked at the billing system can see—"multiple days, sometimes, for usage data to show up. So, how would we be able to stop things if the usage starts climbing?" To which my relatively direct response is: that sounds like a huge problem. I don't know how you'd fix that, but I do know that if suddenly you decide, as a matter of policy, okay, if you're in the free tier, we will not charge you, or even, we will not charge you more than $20 a month.

So, you build yourself some headroom, great. And anything that people are able to spin up, well, you're just going to have to eat the cost as a provider. I somehow suspect that would get fixed super quickly if that were the constraint. The fact that it isn't is a conscious choice.

Richard: Absolutely.

Corey: And the reason I'm so passionate about this, about the free space, is not because I want to get a bunch of things for free. I assure you I do not. I mean, I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely, "It's too expensive." It's that the billing dimension is hard to predict, or doesn't align with a customer's experience, or prices a service out of a bunch of use cases where it'll be great. But very rarely do I just sit here shaking my fist and saying, "It costs too much."

The problem is, when you scare the living crap out of a student with a surprise bill that's more than their entire college tuition, even if you waive it a week or so later, do you think they're ever going to be as excited as they once were to go and use cloud services and build things for themselves and see what's possible?
I mean, you and I met on IRC 20 years ago because, back in those days, the failure mode and the financial risk were extremely low. Yeah, the biggest concern that I had back then, when I was doing some of my Linux experimentation, was that if I typed the wrong thing, I was going to break my laptop. And yeah, that happened once or twice, and I learned not to make those same kinds of mistakes, or to put guardrails in so the blast radius was smaller, or to use a remote system instead. Yeah, someone else's computer that I can destroy. Wonderful. But that was all "we live and we learn" as we were coming up. There was never an opportunity for us, to my understanding, to wind up accidentally running up an $8 million charge.

Richard: Absolutely. And psychological safety is one of the most important things in what most people do. We are social animals. Without this psychological safety, you're not going to have long-term, self-sustaining groups. You will not make someone really excited about it. There are two basic ways to sell: trust or force. Those are the only two; there's nothing else.

Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's G-O M-O-M-E-N-T-O dot co slash screaming.

Corey: Yeah. And it also looks ridiculous. I was talking to someone somewhat recently who's used to spending four bucks a month on their AWS bill for some S3 stuff. Great. Good for them. That's awesome. Their credentials got compromised. Yes, that is on them to some extent. Okay, great.

But now, after six days, they were told that they owed $360,000 to AWS. And I don't know how, as a cloud company, you can sit there and ask a student to do that. That is not a realistic thing. They are what is known, in the United States at least, in the world of civil litigation, as quote-unquote "judgment-proof," which means, great, you could wind up finding that someone owes you $20 billion. Most of the time, they don't have that, so you're not able to recoup it. Yeah, the judgment feels good, but you're never going to see it.

That's the problem with something like that. Yeah, I would declare bankruptcy long before, as a student, I wound up paying that kind of money. And I don't hear any stories about them releasing the collection-agency hounds against people in that scenario. But I couldn't guarantee that. I would never urge someone to ignore that bill and see what happens.

And it's such an off-putting thing that, from my perspective, is beneath the company. And let's be clear, I see this behavior at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS, but they are the 800-pound gorilla in the space, and that's important. Or, as I'll just mention right now, because I was about to give you crap for this, too: if I go to grafana.com, it says, and I quote, "Play around with the Grafana Stack. Experience Grafana for yourself, no registration or installation needed."

Good. I was about to yell at you if it was, "Oh, just give us your credit card and go ahead and start spinning things up and we won't charge you. Honest." Even your free account does not require a credit card; you're doing it right.
That tells me that I'm not going to get a giant surprise bill.

Richard: You have no idea how much thought and work went into our free offering. There was a lot of math involved.

Corey: None of this is easy, I want to be very clear on that. Pricing is one of the hardest things to get right, especially in cloud. And when you get it right, it doesn't look like it was that hard for you to do. But I fix [sigh] people's AWS bills for a living, and still, five or six years in, one of the hardest things I still wrestle with is pricing engagements. It's incredibly nuanced, incredibly challenging, and at least for services in the cloud space where you're doing usage-based billing, that becomes a problem.

But glancing at your pricing page, you do hit the two things that are incredibly important to me. The first one is: use something for free. As an added bonus, you can use it forever. And I can get started with it right now. Great. When I go and look at your pricing page, or I want to use your product, and it tells me to "click here to contact us," that tells me it's an enterprise sales cycle, it's got to be really expensive, and I'm not solving my problem tonight.

Whereas on the other side of it, the enterprise offering needs to be "contact us," and you do that. That speaks to the enterprise procurement people who don't know how to sign a check that doesn't have two commas in it, and they want to have custom terms and all the rest, and they're prepared to pay for that. If you don't have that, you look too small-time. And it doesn't matter what price you put on it: you wind up offering your enterprise tier at some large number, and yeah, for some companies, that's a small number. You don't necessarily want to back yourself into a corner, depending upon what the specific needs are. You've gotten that right.

Every common criticism that I have about pricing, you folks have gotten right. And I definitely can pick up on your fingerprints on a lot of this. Because it sounds like a weird thing to say of, "Well, he's the Director of Community, why would he weigh in on pricing?" It's, "I don't think you understand what community is when you ask that question."

Richard: Yes, I fully agree. It's super important to get pricing right, or to get many things right. And usually the things which just feel naturally correct are the ones which took the most effort and the most time and everything. And yes, I was in those conversations, or part of them, and the one thing which was always clear is: when we say it's free, it must be free. When we say it is forever free, it must be forever free. No games, no lies; do what you say and say what you do, basically.

We have things where initially you get certain pro features, and you can keep paying and keep using them, or after X amount of time they go away. Things like these are built in, because that's what people want. They want to play around with the whole thing and see: hey, is this actually providing me value? Do I want to pay for this feature, which is nice, or this and that plugin, or what have you? And yeah, you're also absolutely right that once you leave these constraints of basically self-serve cloud, you are talking about bespoke deals, but you're also talking about: okay, let's sit down, let's actually understand what your business is. What are your business problems? What are you going to solve today?
What are you trying to solve tomorrow? Let us find a way of actually supporting you and investing in a mutual partnership, not just grabbing the money and running. We have extremely low churn, for, I would say, pretty good reasons, because this thing about our users, our customers, being successful, we do take it extremely seriously.

Corey: It's one of those areas that I just can't shake the feeling is underappreciated industry-wide. And the reason I say that these are your fingerprints on it is because, if this had been wrong, you have a lot of… we'll call them idiosyncrasies, where there are certain things you absolutely will not stand for, and misleading people and tricking them into paying money is high on that list. One of the reasons we're friends. So yeah, when I say I see your fingerprints on this, it's: if this hadn't been worked out the way that it is, you would not still be there. One other thing that I wanted to call out, well, I guess it's a confluence of pricing and logging and the rest: I look at your free tier, and it offers up to 50 gigabytes of ingest a month.

And it's easy for me to sit here and compare that to other services, other tools, and other logging stories, and then I have to stop and think for a minute that, yeah, disks have gotten way bigger, and internet connections have gotten way faster, and even the logs have gotten way wordier. I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark examples of what that looks like? Because it's been long enough since I've been playing in these waters that I can't really contextualize it anymore.

Richard: Lord of the Rings is roughly five megabytes. It's actually less. So, we're talking literally 10,000 copies of Lord of the Rings which you can just shove into us, and we're just storing this for you. Which also tells you that you're not going to be reading any of this. Or some of it, yes, but not all of it. You need better tooling and you need proper tooling.

And some of this is more modern. Some of this is where we actually pushed the state of the art. But I'm also biased. But I, for myself, do claim that we did push the state of the art here. But at the same time, you come back to those absolute fundamentals of how humans deal with data.

If you look back basically as far as we have writing—literally 6,000 years ago is the oldest writing—humans have always dealt with information, with the state of the world, in very specific ways. A: is it important enough to even write it down, to even persist it in whatever persistence mechanisms I have at my disposal? If yes, record a detailed account of whatever the thing is. But it turns out this is expensive, and it's not what you need. So, over time, you optimize towards only taking down key events, only noting key events. Maybe with their interconnections, but fundamentally, the key events.

As your data grows, as you have more stuff, as this is still important to your business and keeps being more important—or it doesn't even need to be a business; it can be social, can be whatever—whatever thing it is, it becomes expensive, again, to retain all of those key events. So, you turn them into numbers and you can do actual math on them. And that's this path which you've seen again, and again, and again, throughout humanity's history.
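To put that ballpark in more operational terms, here is a back-of-envelope calculation; the 200-bytes-per-line figure is an assumption for illustration, not a measurement of any particular system:

```python
# What 50 GB of log ingest per month can hold, assuming an average
# log line of roughly 200 bytes (an assumption, not a measurement).
GB = 10**9
monthly_ingest = 50 * GB
avg_line_bytes = 200

lines_per_month = monthly_ingest / avg_line_bytes
lines_per_second = lines_per_month / (30 * 24 * 3600)

print(f"~{lines_per_month / 1e6:.0f} million lines per month")
print(f"~{lines_per_second:.0f} lines per second, sustained all month")
```

On those assumptions, 50 gigabytes a month works out to roughly 250 million lines, or close to a hundred log lines per second, every second, for the entire month.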
Literally, as long as we have written records, this has played out again and again for every single field which humans actually cared about. At different times, like, power networks are way ahead of this, but fundamentally, power networks work on metrics; for transient load spikes and everything, they have logs built into their power measurement devices, but those are few and far between. Of course, the main thing is just metrics, time-series. And you see this again and again.

You also were a sysadmin in internet-related work: all switches have been metrics-based, or metrics-first, for basically forever, for 20, 30 years. But that stands to reason. Of course, the internet is running roughly 20 years ahead of the cloud, scale-wise, because obviously you need the internet; otherwise, you wouldn't be having a cloud. So, all of those growing pains, why metrics are all of a sudden the thing (or have been for a few years now), are basically, of course, that people who were writing software and providing their own software services hit the scaling limitations which internet service providers hit two decades, three decades ago. But fundamentally, you have this complete system: basically profiles, or distributed tracing, depending on how you view distributed tracing.

You can also argue that distributed tracing is key events which are linked to each other. Logs sit firmly in the key-event thing, and then you turn this into numbers, and that is metrics. And that's basically it. You have extremes at the ends, where you can have valid, depending on your circumstances, engineering trade-offs of where you invest the most, but fundamentally, that is why those always appear, again and again, in humanity's dealing with data, and observability is no different.

Corey: I take a look at last month's AWS bill. Mine is pretty well optimized. It's a bit over 500 bucks. And right around 150 of that is various forms of logging and detecting change in the environment. And on the one hand, I sit here and think, "Oh, I should optimize that," because the value of those logs to me is zero.

Except that whenever I have to go in and diagnose something, or respond to an incident, or do some forensic exploration, they then are worth an awful lot. And I am prepared to pay 150 bucks a month for that, because the potential value of having that when the time comes is going to be extraordinarily useful. And it basically just feels like a tax on top of what it is that I'm doing. The same thing happens with application observability, where, yeah, you just want the big substantial stuff, until you're trying to diagnose something. But in some cases, yeah, okay, then crank up the verbosity and then look for it.

But if you're trying to figure it out after an event that isn't likely to recur, or hopefully won't, you're going to wish that you spent a little bit more on collecting data out of it. You're always going to be wrong, you're always going to be unhappy, on some level.

Richard: Ish. You could absolutely be optimizing this. I mean, for $500, it's probably not worth your time unless you take it as an exercise, but outside of due diligence, where you need specific logs tied to—or specific events tied to—specific times, I would argue that a lot of the problems with logs come from dealing with them wrong. You have this one extreme of full-text indexing everything, and you have this other extreme of a data lake—which is just a euphemism for never looking at the data again—to keep storage vendors happy.
There is an in-between.

Again, I'm biased, but, for example, with Loki you have those same label sets as you have on your metrics with Prometheus, literally the same ones, which means you only index that part, and you only extract at ingestion time. If you don't have structured logs yet, extract only the metadata about whatever you care about, put it into your label set, and store this; that's the only thing you index. But it goes further than just this. You can also turn those logs into metrics.

And to me, this is a path of optimization. Where previously I logged this and that error. Okay, fine, but it's just a log line telling me it's an HTTP 500. No one cares that this happened at this precise time. Log levels are also basically an anti-pattern, because they're just trying to deal with the amount of data which I have, and to get a handle on it at that level, whereas it would be much easier if I just counted: every time I have an HTTP 500, I just up my counter by one. And again, and again, and again.

And all of a sudden, I have literally—and I did the math on this—over 99.8% of the data which I have to store just go away. It's just magic. And we're only talking about the first time I'm hitting this log line; the second time I'm hitting this log line is functionally free if I turn this into metrics. It becomes cheap enough that one of the mantras which I have, if you need to onboard your developers on modern observability, blah, blah, blah, the whole bells and whistles, is this: usually people have logs, like, that's what they have, unless they're from ISPs or power companies or so; there, they usually start with metrics.

But most users, which I see both with my Grafana and with my Prometheus [unintelligible 00:38:46], tend to start with logs. They have issues with those logs because they're basically unstructured and useless, and you need to first make them useful to some extent. But then you can leverage this, and instead of having a debug statement, just put a counter. Every single time you think, "Hey, maybe I should put a debug statement," just put a counter instead. In two months' time, see if it was worth it, or delete that line and remove that counter.

It's so much cheaper: you can just throw this in and have it run for a week or a month or whatever timeframe, and done. But it goes beyond this, because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my alerts on those metrics, I can actually persist those metrics, and I can more aggressively throw my logs away. But also, this transition is made a lot easier, where I don't have this huge lift where this day in three months is the cutover and we're going to release the new version of this and that software, and it's not going to have that, it's going to have 80% fewer logs, and everything will be great; and then you miss the first maintenance window, or someone is ill, or what have you, and then the next Big Friday is coming so you can't actually deploy there. I mean Black Friday. But we can also talk about deploying on Fridays.

But the thing is, you have this huge thing, whereas if you have this as a continuous-improvement process, I can just look at: this is the log which is coming out, I turn this into a number, I start emitting metrics directly, and I see that those numbers match.
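A minimal sketch of this counter-instead-of-a-debug-statement pattern, using the Python prometheus_client library; the metric name, label, and handler here are illustrative, not anything prescribed by Prometheus or Grafana:

```python
from prometheus_client import Counter, start_http_server

# Instead of logging "HTTP 500 at <timestamp>" on every failure, count it;
# the scrape interval supplies the time dimension.
HTTP_500S = Counter(
    "http_responses_500_total",  # hypothetical metric name
    "HTTP 500 responses served",
    ["handler"],                 # a label set, as discussed above
)

def handle_request(handler_name: str) -> None:
    try:
        ...  # the actual request handling would live here
    except Exception:
        # The line you would have spent on a debug statement
        # becomes a single increment.
        HTTP_500S.labels(handler=handler_name).inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
```

Repeated occurrences only move an existing counter rather than emitting a new log line each time, which is the sense in which the second hit on that log line is "functionally free."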
And so, I can just start—I build new stuff, I put it into a new data format, I actually emit the new data format directly from my code instrumentation, and only then do I start removing the instrumentation for the logs. And that allows me to, with full confidence, with psychological safety, just move a lot more quickly, deliver much more quickly, and also cut down on my costs more quickly, because I'm just using more efficient data types.

Corey: I really want to thank you for spending as much time as you have. If people want to learn more about how you view the world, and figure out what other personal attacks they can throw your way, where's the best place for them to find you?

Richard: Personal attacks, probably Twitter. It's, like, the go-to place for this kind of thing. For actually tracking me, I stopped maintaining my own website. Maybe I'll do that again, but if you go to github.com/RichiH/talks, you'll find a reasonably up-to-date list of all the talks, interviews, presentations, panels, what have you, which I did over the last whatever amount of time. [laugh].

Corey: And we will, of course, put links to that in the [show notes 00:41:23]. Thanks again for your time. It's always appreciated.

Richard: And thank you.

Corey: Richard Hartmann, Director of Community at Grafana Labs. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment. And then, when someone else comes along with an insulting comment they want to add, we'll just increment the counter by one.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.