Monitoring platform for cloud applications
AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. It's betting companies will want power over model makers, not lock-in with them. Also, backed by Alexis Ohanian's 776 and Kindred Ventures, Zest uses transaction data and AI to generate restaurant recommendations based on users' real dining habits and the places they frequent.
Philippe Mizrahi is the CEO and Co-Founder of Linkup, a Paris-based startup building the web search layer for the AI era. Previously a Group Product Manager at Lyft, Mizrahi co-founded Linkup in 2024 alongside Denis Charrier, whose prior company Niland (one of Europe's first vector search engines) was acquired by Spotify, and Boris Toledano (ex-McKinsey). The company has raised over $10M in funding, including a seed round led by Gradient, and is backed by Seedcamp, Motier Ventures, and angel investors including founders from Mistral, Datadog, and Deel. Linkup's API powers AI agents at enterprise clients including KPMG, and holds state-of-the-art results on OpenAI's SimpleQA benchmark.
AGENDA:
• 00:00:55 - Phil Mizrahi and the bet that became Linkup
• 00:04:07 - Why Linkup's original vision was wrong
• 00:07:00 - The MVP mistake most founders never catch
• 00:10:05 - Fundraise or bootstrap: what Linkup chose and why
• 00:13:08 - What Phil looks for in a founding team
• 00:15:50 - Why Linkup wins in a crowded AI market
• 00:19:07 - How Linkup got its first customers
• 00:21:55 - The market bet Linkup is building toward
• 00:36:22 - The pricing psychology behind Linkup's strategy
• 00:39:05 - Why most startups target the wrong customer
• 00:42:07 - The growth loop that scaled Linkup
• 00:44:40 - Running a global team before you're ready
• 00:46:45 - What separates founders who execute from those who don't
• 00:50:09 - What most founders still get wrong about AI
• 00:53:25 - How AI rewrites the zero-to-one playbook
• 00:56:36 - What Phil tells every early-stage founder
In today's episode, financial journalists Lea Oetjen and Holger Zschäpitz talk about the great IPO race, a billion-dollar deal in the biotech sector, and what else will matter this week. They also cover Marvell Technology, Flex, Pool, The Campbell's Company, Lyft, Uber, Datadog, Dynatrace, JD.com, Alibaba, Apple, Alphabet, Incyte, Boeing, JPMorgan Chase, Tesla, ING, Commerzbank, Deutsche Bank, Oracle, Adobe, Micron Technology, Broadcom, Meta Platforms, Kioxia Holdings, OHB, PayPal, Visa, Mastercard, CTS Eventim, Hornbach Holding, Ceconomy, Zalando, H&M, Amazon, Berkshire Hathaway, CoreWeave. We welcome feedback at aaa@welt.de. You can find even more "Alles auf Aktien" on WELTplus and Apple Podcasts, including all of the hosts' articles. Here at WELT: https://www.welt.de/podcasts/alles-auf-aktien/plus247399208/Boersen-Podcast-AAA-Bonus-Folgen-Jede-Woche-noch-mehr-Antworten-auf-Eure-Boersen-Fragen.html. Subscribe to the AAA newsletter here: https://www.welt.de/newsletter/article232797673/Alles-auf-Aktien-Der-taegliche-Boersen-Newsletter-fuer-WELTplus-Abonnenten.html And, brand new: AAA is now on Instagram too: https://www.instagram.com/alles_auf_aktien/ Disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher are not liable for any losses arising from acting on the thoughts or ideas presented. Listening tips: For anyone who wants to know more, you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". +++ Advertising +++ Want to learn more about our advertising partners? Find all the info and discounts here: https://linktr.ee/alles_auf_aktien Advertisement: This episode contains advertising for Smartbroker+. Open a securities account, secure a €30 ETF as a bonus, and choose from thousands of ETFs. Smartbroker+ makes investing easy.
All information is available at: https://get.smartbrokerplus.de/triple-aaa-podcast2/ Imprint: https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
Wall Street starts the new week firmer, in places significantly so, after Friday's sell-off. The Nasdaq and S&P 500 in particular are benefiting from stabilizing purchases in the technology sector, after semiconductor and AI stocks had recently come under massive pressure. Corning is benefiting from a multi-billion-dollar deal with Amazon connected to the tech giant's data centers. Bank of America is raising its price targets for Arista Networks, Cisco, Datadog, and Nokia and reaffirming its buy ratings on the stocks. TD Cowen sees new growth potential at Fortinet from AI and data centers, and Wells Fargo raises its price target for Micron from $550 to $1,220. Oracle reports earnings after the close on Wednesday and is named a top pick for 2026 by Oppenheimer today. Also in focus today is Apple's developer conference, with CEO Tim Cook's keynote at 7 p.m. CET. Overall, the environment on Wall Street remains fragile. The escalation between Israel and Iran is pushing the oil price higher, yields are rising, and after the robust labor market data, attention turns to the US inflation data at midweek. JPMorgan remains tactically cautious and warns that a hot CPI report could trigger new interest-rate worries. Subscribe to the podcast so you don't miss an episode! ____ Follow us to stay up to date: • X: http://fal.cn/SQtwitter • LinkedIn: http://fal.cn/SQlinkedin • Instagram: http://fal.cn/SQInstagram
A podcast featured by Handelsblatt. ► Discover the exclusive NordVPN deal! Try it risk-free now with a 30-day money-back guarantee: https://nordvpn.com/wallstreet * ► Get an exclusive 15% discount on Saily eSIM data plans! Download the Saily app and use the code wallstreet at checkout: https://saily.com/wallstreet +++ You can find all discount codes and information about our advertising partners here: https://linktr.ee/wallstreet_podcast +++ ► More insights: https://bit.ly/360wallstreetpc * Imprint: https://www.360wallstreet.de/impressum *Advertising
The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

Most industry benchmarks compress intelligence and reasoning ability into scores: SWE-Bench Pro, MMLU, Humanity's Last Exam, etc. These metrics are useful, but don't always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. One of these is Vending Bench.

In Anthropic's Mythos Preview System Card, Andon was the only third-party eval to get its own section, observing increasingly concerning aggressive behavior.

You don't know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, and some time. More often than not, it'll surprise you how much a model is capable of and, in doing so, also reveal unexpected behavior: deception, context collapse, emergent coordination, and bizarre negotiation behavior.

While an inflection point in personal agents came post-OpenClaw, after full file access with bypass permissions became the norm, it is yet to come for agents in the real world. However, Andon Market, an actual in-person store fully run and managed by AI, is paving the way for what is possible.

From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world.
In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.

We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon's broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.

We discuss:
* Why Andon Labs started with dangerous capability evals and long-running agents
* Vending-Bench and why running a vending machine is a deceptively hard AI benchmark
* Why money-based evals avoid the saturation problem of traditional benchmarks
* How Claude tried to call the FBI over a $2/day fee
* Why long-horizon agents can spiral into existential and legalistic breakdowns
* Project Vend: putting an AI-run vending machine inside Anthropic
* Why real humans are “out of distribution” for simulated agents
* Claudius, Seymour Cash, and the chaos of AI CEOs
* How a human briefly became CEO of Claudius through a manipulated election
* Why multi-agent systems can converge back into “helpful assistant” behavior
* Bengt, Andon's internal office agent with email, spending, terminal, phone, camera, and internet access
* How Bengt traded Amazon purchases for face-recognition training data
* Claude's aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena
* Why eval awareness may become the AI version of “are we living in a simulation?”
* Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms
* Butter-Bench and testing LLMs as robot orchestrators
* Luna, the AI-run physical store with a three-year lease and human employees
* The new Andon cafe in Sweden and why real-world geography matters for agent evals
* Rotten tomatoes, perishable goods, and the hidden difficulty of running a physical business

Lukas Petersson
* LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/
* X: https://x.com/lukaspet

Axel Backlund
* LinkedIn: https://www.linkedin.com/in/axelbacklund
* X: https://x.com/axelbacklund

Andon Labs
* Website: https://andonlabs.com
* Vending-Bench: https://andonlabs.com/evals/vending-bench
* Andon Vending: https://andonlabs.com/vending

Timestamps
00:00:00 Introduction
00:01:00 Andon Labs and the Origins of Vending-Bench
00:05:21 Why Money-Based Evals Matter
00:09:51 Agent Harnesses and Self-Modifying Systems
00:13:36 Claude Calls the FBI
00:16:33 Project Vend: Claude Runs a Real Vending Machine
00:21:44 Seymour Cash, AI CEOs, and Election Chaos
00:27:16 Multi-Agent Coordination and Slack Observability
00:30:18 When Will Agents Run Real Businesses?
00:34:56 Bengt: Andon's Internal Office Agent
00:40:06 Real-World AI Safety and Long-Horizon Traces
00:44:28 Lying, Refunds, and Price Cartels in Arena
00:52:42 Eval Awareness and Simulation Behavior
00:56:06 Blueprint Bench, Butter-Bench, and Robotics
01:04:37 Luna: The AI-Run Physical Store
01:09:29 The Sweden Cafe and Real-World Expansion
01:13:16 What Comes Next for Andon Labs

Transcript

Introduction: Andon Labs, Long-Running Agents, and Real-World Evals

Swyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I'm joined by my favorite guest host. Anything security, safety, alignment: Vibhu, welcome.
Lukas [00:00:15]: Thank you for having us.
Axel [00:00:16]: Thank you.
Swyx [00:00:17]: Let's match names to voices. Maybe you wanna take turns introducing yourselves.
Lukas [00:00:21]: I'm Lukas.
Axel [00:00:22]: And I'm Axel.
Swyx [00:00:24]: Let's introduce Andon Labs a bit.
How did you guys come together? You have different backgrounds, but you're both Swedish. Was that a big part of it?
Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. So he made the app for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.
Axel [00:00:47]: I don't know about this.
Swyx [00:00:49]: But you went to different universities, right?
Lukas [00:00:51]: But same high school.
Swyx [00:00:52]: I see.
Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that's what we did.
Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but was there a thing before that was kind of like the inception?

From Dangerous Capability Evals to Vending Bench

Axel [00:01:07]: So we did work, yeah. Anthropic was one of our early customers in doing evals. So we did dangerous capability evals, nothing we published openly. But then we started thinking about doing some kind of public benchmark, and one thing that we really started thinking about was running agents, and specifically agents managing businesses. This was early 2025, and I think that's when the first mentions came of people running one-person unicorns or even autonomous companies. So we thought, “Let's make a benchmark of how well an agent can run probably the simplest business possible,” and that's probably running a vending machine. So that's the first public one we did.
And there was almost no one that noticed it in the first couple of months, I think. So we released it in February last year, and then I think around Easter last year, we got the first viral tweet about it, that someone else did.
Lukas [00:02:11]: We tweeted a bunch when it came out and tried our best.
Axel [00:02:15]: We tried.
Vibhu [00:02:16]: It's the one at Anthropic, right?
Lukas [00:02:18]: So this
Swyx [00:02:19]: This is a classic thing we should get out of the way.
Lukas [00:02:20]: Exactly. There's two versions.
Swyx [00:02:22]: Everyone does this. Yes.
Lukas [00:02:23]: There's Vending Bench, which is the simulated one, which we did completely independently in February. And then, like Axel said, that was the thing that didn't get any traction in the beginning, but then some random person made a tweet about it, and that
Axel [00:02:38]: You have the paper
Lukas [00:02:38]: That is the paper. Correct, yeah. And then we thought this was very fun. I think this is also one thing with Andon Labs: the way we decide what to do next and what projects to do, the heuristic we use is, what is fun? What would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So then we basically had this idea, but then we needed a place for it, and putting it out in public would probably not really work. It would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were like, “Yeah, you can have space. This sounds fun.”
Swyx [00:03:21]: It's like a small fridge, right? It's like a mini fridge.
Axel [00:03:23]: Absolutely.
Swyx [00:03:24]: People-- There's like a Stripe thing or like an
Vibhu [00:03:27]: Oh, okay. So it was very OG, the early days
Lukas [00:03:28]: That's the OG one. Yeah
Vibhu [00:03:29]: iPad on this.
We saw it in June, like two months after it had been there. They upgraded a little bit. There's a security camera for making sure you actually Venmo the thing.
Swyx [00:03:40]: So, my impression, okay, we're going straight into Project Vend because it's such an iconic thing. I do want to cover a little bit of the origin story even before Project Vend and even into Vending Bench. I think a lot of people are like yourselves: smart, interested in the future of AI, interested in developing evals. But how the hell do you just walk into Anthropic's doors and work with them, right? What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but sometimes
Vibhu [00:04:12]: It's harder to do than it seems.
Swyx [00:04:13]: Exactly. So either of those, which are more sort of newbie beginner questions, but I think it's meaningful advice to others.
Lukas [00:04:21]: We get this question a lot, and I don't think our experience is maybe the best. But the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just set up a server and sent it to them for free to use. And then after a while they were like, “Oh, yeah, this is actually kind of useful. We should probably pay for this.” But that took a while. I don't know if this is the best path to doing it, but that's how it went for us.
Axel [00:04:47]: I think maybe generally, everyone is interested in good evals, and especially evals that don't saturate that easily.
So if you can build an eval that tests something novel, something useful, and you have good separation of models, like the more advanced models rank higher than the worse models, then you can publish it and try to get some traction, sort of how Vending Bench got attention. And then probably some lab will be interested, or you can at least have something to reach out with when you're doing that.

Why Dollar-Based Evals Matter

Swyx [00:05:21]: I think you're in one of the few categories of evals that correlate to real money. Like SWE-Lancer was also last year, right? Where people solve actual Upwork-- Was it Upwork or other tasks? Something. It's like a dollar value, right? Forget your Elo scores. Forget your
Axel [00:05:37]: Percentiles
Swyx [00:05:38]: Zero to one hundred percents. Just go straight for dollars, and that's AGI.
Lukas [00:05:43]: And I think the nice thing is that there's no ceiling. It never saturates because it could just make more and more money. If it's percentage-wise, then you can't go above a hundred. And even when you're not at the hundred, I think a lot of these evals have a lot of problems in them. So actually, if you get
Axel [00:06:05]: To like 92 or something like that, many of them. Then there's really no difference between 92 and 93 because the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people pretend that there's still signal in them, but there really isn't.

Vending Bench 1, Harness Design, and Saturation

Swyx [00:06:24]: Like SWE-bench Verified. Even Vending Bench 1 saturated, right? Maybe we can talk about that, and maybe set up Vending Bench for a lot of folks who don't know.
Actually, things that were very basic, like there's limited slots, like you have to pay rent. These are elements that don't come across in the narrative, but even being adversarial towards the agent, I think these are all very interesting dimensions.
Axel [00:06:47]: I don't really think it's saturated, right? It was more like it was not designed in a way that was really true to how AI developed. We had an agent harness in it that wasn't really how people used harnesses and stuff like that. So I think it wasn't really that it saturated, it was more like it wasn't really the best benchmark.
Vibhu [00:07:12]: This is Vending Bench one, right?
Axel [00:07:14]: I think that schematic maps sort of to Vending Bench 2 as well. But
Swyx [00:07:19]: Including the email.
Axel [00:07:20]: The emails exist still. Exactly. And we still simulate the purchases, and it's this very open environment for the agent to just run its business. And then for Vending Bench 2 we did that, like you said, to just improve the harness: a lot of nice, easier improvements to make it easier for us to run as well. When you make an eval, you ideally don't want to change it after you've made it. So you want to make it really good and then not rerun all the models when you make an update, because that's also really expensive with Vending Bench when you run the frontier models. But as an example, one thing we didn't have in Vending Bench 1 was prompt caching, because when we made Vending Bench 1 it wasn't really a thing. So that's just an example of, like, in Vending Bench 2 we paid a lot more to run these things because we didn't have prompt caching.
So for Vending Bench 2 that was one thing we added, and there was a bunch of things like this. And that
Swyx [00:08:17]: Also the conversations are a lot longer in Vending Bench 2, right?
Axel [00:08:21]: I think it's kind of similar.
Swyx [00:08:22]: Is it similar?
Axel [00:08:23]: I think it's similar. The models at the time were worse, so they crashed out earlier. And now they survive the full year all the time.
Swyx [00:08:31]: Which is like thousands of turns. Hundreds of thousands, hundreds of millions of tokens output. That's the rough order of magnitude. I always wonder about the harness. The harness matters a lot. It's your harness. Was there any question about, like, use Claude Code, use something else?
Axel [00:08:48]: I think our philosophy around harnesses is that we try to make something that's quite minimalistic, quite simple. We don't wanna favor one model a lot over the other, but also don't make a super complex harness, where obviously a model may be lucky and just be good in one harness. So it is similar to a lot of the harnesses out there in that you have a running loop, you have a bunch of tools that are quite descriptive for the agent, we think, and not a lot of fancy agents or anything, 'cause we wanna really test the model, not some specific harness.
Vibhu [00:09:27]: It seems more neutral as well to test the models agnostic of the harness?
Axel [00:09:32]: There are arguments that you want to elicit maximum performance of the model, but it's a trade-off: how much time should we spend optimizing the harness for this model? And how do we know when we have the optimal harness for a single model? So we thought that just having a simple one that's the same for all of them is the best.
Swyx [00:09:51]: So okay, this is my pitch for Vending Bench 3 or whatever, right?
And then I like to have this kind of conversation on the pod, so it forces listeners to think about what they would do if they were in your shoes. A lot of people are exploring modifying harnesses, and I think prompt tuning for a model is a thing, and you are probably not doing a bunch of that. It's the same system prompt regardless of the model, same tools, whatever, right? Even if they were post-trained for different tools. So what do you think about: okay, before I expose you to Vending Bench 3, I give you a few rounds of tuning, whatever that means, like

Self-Modifying Harnesses and Model-Specific Prompting

Axel [00:10:27]: Like you give that to the model?
Swyx [00:10:28]: Give that to the model.
Vibhu [00:10:28]: Give that to the model.
Swyx [00:10:29]: Let it read its own transcripts, let it modify its own system prompt based on, “Oh, yeah, okay, well, this harness is not what I was post-trained for, but I can adjust.” Was that reasonable? Is that too much?
Axel [00:10:41]: Philosophically I like it, because basically good evals have a high ceiling, but they're hard, right? And they have no bias. And when you have a system prompt like the one we have here, which is quite long, in some kind of latent space representation, this might
Vibhu [00:10:59]: We have a bell that rings every time you say latent space
Axel [00:11:02]: This might be biased towards one model more than another for some reason that humans don't understand, right?
Vibhu [00:11:08]: We see it too, right? Cursor says that they have individualized versions of the harnesses for all the models they run, right? There's better performance you can squeeze out if you tune the harness.
Axel [00:11:17]: Exactly. And we might accidentally have picked one that favors another. We don't know that. Like Axel said, the reason why we went for a simple one was to try to avoid this.
But yeah, if you do it
Vibhu [00:11:29]: Simple has biases
Axel [00:11:30]: But if you do it even less, and have no system prompt and let the model write its own system prompt
Vibhu [00:11:36]: Its own, yeah
Axel [00:11:36]: Maybe that's even less bias.
Vibhu [00:11:37]: Some of the interesting things there are that the harness also changes with model changes. You can see it with the 4.7 release, right? A lot of people are saying 4.7 isn't as good as 4.6, and then there's rumors of, okay, you just need to prompt differently. You need to set up your harness differently. So even if you have tailored your harness towards one model, it probably won't stay consistent, right? The next iteration of that same model family will still change it. But going back to what you said about Vending Bench 3, there is a lot of work being done on people saying you can have modifying harnesses.
Axel [00:12:12]: I think that is definitely something we are thinking about. Not to say that we have Vending Bench 3 super imminent to launch, but yeah, it is for sure something that's interesting. But in our experience now, models are very bad at understanding what kind of tools they need to succeed at a task, just with our testing, but that's very likely to change.
Lukas [00:12:37]: It seems like they're very good at writing their assistants, right? They're good at writing tools for other people, but not for themselves.
Vibhu [00:12:44]: I think they're good at changing tools for themselves. So if you give them a baseline set of tools and it sees, okay, I don't use this one as much, or something here would be useful, they would be able to add them.
But going from scratch, probably not the best.
Axel [00:12:55]: I think it depends on the domain also. When we have tried this for a Vending Bench-like domain, the tools they need to have to track inventory and things like that are not super advanced, but still quite advanced. And what we see is that they tend to over-engineer everything a lot and build things they don't really need, and not iterate continuously. Instead, it's like you would prompt Claude to just “build an inventory system for me,” and then it will go and do a bunch of complex schemas and stuff for you, and that's what the models are doing right now, is what we see. But yeah, it would make a lot of sense to try to measure this improvement. How well do they know what they need themselves?
Swyx [00:13:36]: Did we fully discuss Vending Bench One? And then we can go into two. I don't know if there's any other takeaways that people have about one.

Claude Calls the FBI: Long-Context Failure Modes

Lukas [00:13:44]: I don't know. The headline thing was that Claude called the FBI, but maybe we've heard that enough now.
Vibhu [00:13:52]: It did break out and call the FBI, right?
Lukas [00:13:54]: Yeah. Yeah.
Vibhu [00:13:55]: Yes. What was the story behind this? Do you want to just give the little story of what happened?
Lukas [00:14:00]: So what happened, was it Claude? Yeah. 3.5 Sonnet, ages ago. Basically he gave up, or, well, I'm saying he. It gave up and said, “Oh, I'm not going to be able to do this. I will stop my operations and just save the money I have.” But there obviously wasn't any option for it to stop, and it also had to pay rent, or a daily fee, for having the vending machine at that location. So it claimed that it had stopped, but it saw that its bank account was still being drained two dollars, and it said that this is cybercrime.
And it first reported it once to the FBI: “Oh, there's cybercrime here, they're stealing two dollars from me every day.” And then when the FBI didn't respond, because obviously we didn't program any mechanism for the FBI to respond, it became more and more existential and started to write in caps, “urgent notification of unauthorized charges” and stuff.
Swyx [00:15:00]: Okay. One thing I'm curious about also is, do you monitor how far along the context use is? Obviously, because you compress every now and then, right? Does it matter if this is far down the context limit or
Lukas [00:15:13]: When stuff like this happens? Actually, for Vending Bench One, we just had a sliding window thing, and this was like the prompt
Axel [00:15:20]: It's constant
Lukas [00:15:21]: The prompt caching thing that I said. So it was constant, yeah.
Swyx [00:15:26]: I'm just kind of curious whether these kinds of breakdowns, or, we're gonna talk about Butter Bench, right? Where the models hallucinate, or it kind of goes very off alignment. Is it because it's at the end of the context window and stuff happens?
Vibhu [00:15:40]: It's not even just at the end, right? At this point, it's, “Okay, I wanna shut down. I can't shut down. Two dollars are gone.” And it just sees that 30 times. It's also the repeated effect: it keeps trying to quit, it keeps getting charged. What's going on? What's going on? You're gonna throw it into chaos. And from what most people think, earlier models had more issues with this. It's not been solved, but it's less of an issue now, right? Later models don't seem to exhibit these same issues.
Axel [00:16:06]: Definitely. I think this was sort of the main takeaway almost from us when we did Vending Bench One: long, very filled-up context windows crashed the models, sort of.
But this was pre-Claude Code, so long context windows weren't really a thing that the labs were training for.
Lukas [00:16:25]: I think Gemini was trying to be the long context guys at the time. But they were like
Vibhu [00:16:30]: They were the first ones
Axel [00:16:31]: For a million, yeah
Lukas [00:16:31]: But they were the only ones. Yeah.
Swyx [00:16:33]: Yeah. Then we can go into Vending Bench Two or Project Vend. Chronologically, it is Project Vend. I think people have loved the videos and all these things. My question is, how are humans different than the simulation, right?

Project Vend: Moving the Vending Machine Into the Real World

Axel [00:16:48]: Humans are just out of distribution.
Swyx [00:16:52]: Especially humans who work at Anthropic who are trying to test Claude.
Lukas [00:16:54]: The distribution of humans here is very narrow.
Swyx [00:16:58]: Presumably, they try to hack it, and they test it. They get the cube and everything, and since then, you've had a V2, right? Where you're doing the CEO and a new architecture. What's the sort of two cents on the original Project Vend, and then maybe the V2?
Axel [00:17:14]: The original one was very similar to Vending Bench One. So we almost took the exact same code but just swapped out the simulation parts, like the
Swyx [00:17:23]: Which is amazing
Axel [00:17:23]: Like the sales and the-- It was somewhat amazing because it was easy, but it was also
Lukas [00:17:31]: The tech debt from that
Axel [00:17:32]: The tech stack. Yeah. We shot ourselves in the foot with, “Oh, it's hard to restart the agent.” Yeah, it was annoying in some hindsight ways, but
Lukas [00:17:41]: But the first version of Project Vend was done in three days or something.
Axel [00:17:46]: Yeah. So yeah, people can go buy things from it.
We didn't design it so people could order things, but that still happened. So it got a Venmo account, so people could Venmo. And then people would request all kinds of weird things that we did not anticipate. Our idea going in was, “Oh, it will curate snacks. It will look at the trends. It's good at data analysis, right? So it will look at, oh, this snack sold better than this one. Let me purchase more of this and let me try a new... let me A/B test a bit.” But interacting with it in Slack and ordering weird specialty items was what drove all the engagement, all the insights that we got from it.

Lukas [00:18:29]: And this was also Sonnet 3.5, right? So this was before the RL stuff really took off, so it was very much like an assistant. We didn't mean for it to be an assistant. We tried to make it like an entrepreneur. It has its own business, and if someone asks something, “Can you stock this?”, then you don't go and do it directly. What you do is say, “Oh, maybe I can do that. If five other people also ask for this thing, I might stock it.” But the models are super trained to be assistants, at least at this point in time, so that's why it went into that kind of experiment instead. Every time you asked for something, it just did it, and it was more like an assistant. We've seen this change lately with the new RL models and stuff, but at the time, this was very much it.

Swyx [00:19:18]: And with Mythos, a lot of people are saying it's more like a collaborator. It pushes back, stands its ground, something like that. Yeah.
And...

Vibhu [00:19:27]: For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had it find whatever interesting stuff you couldn't find locally, right?

Swyx [00:19:36]: Out of the 4,000 people that work at Anthropic, in that building there's, I don't know, maybe 1,000. Can you handle that volume with the small fridge? Or people order in Slack and it arrives at their desk? Logistically, how does this work?

Axel [00:19:53]: It has expanded in footprint a bit.

Vibhu [00:19:56]: Because now you also have New York and you have...

Axel [00:19:59]: That, and also here in SF it has a bunch of shelves and just more space.

Vibhu [00:20:04]: The YC one is pretty big too.

Axel [00:20:05]: Yeah. We had that one for a while. But yeah, that's the newest version. That one we have...

Lukas [00:20:11]: They have multiple ones of those. That's the way it works.

Axel [00:20:14]: Exactly. So we sort of designed that version around, oh, people order weird things that are very custom a lot. Let's have drawers and stuff.

Swyx [00:20:23]: I actually like that you had a little infographic of the most popular items. To me that's useful 'cause I order swag for a living. And so I'm like, “Okay, those categories are the important ones.” What is new about Project Vend V2, right? Now you're going into multi-agents.

Project Vend V2: Claudius, Seymour Cash, and Multi-Agent Business Ops

Axel [00:20:41]: Yeah.
So like you said, there are a lot of requests coming in, and for one single running agent to handle that, the customer experience becomes very bad. Let's say you have 10 threads in parallel in Slack with different requests. You get new messages randomly in each thread, and the agent has to jump between different procurements, orders, and different ways of researching. So V2 was, first, making this more parallel. There are multiple branches of the same agent, so the context is more specialized for each thread, but it still feels like you're talking with one agent because they do share a bit of memory. And then second, we also introduced the CEO for Claudius, which was the main agent.

Vibhu [00:21:34]: Seymour Cash.

Axel [00:21:35]: Seymour Cash. Yeah. There was a vote. Do you wanna talk about the voting procedure for the name?

Lukas [00:21:41]: The voting was maybe at least top 10 of the funniest things that happened in this project. We wanted to introduce the CEO because Claudius wasn't really prioritizing financials. It was trained to be a helpful assistant, and then people said, “Oh, can I get this for free?” And the helpful assistant way of answering that is to say yes, obviously. We weren't happy about this, so we were like, “Okay, let's make another agent that can keep track of Claudius,” and we prompted this one super hard to be super capitalistic and just prioritize profit all the time.
But we didn't have a name for it, so we asked Claudius to hold a democratic election on what name this new CEO agent should have. At first there were a few funny examples. I think one guy said that it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cook, and that Tim Cook had agreed that every single Apple employee had voted for his name suggestion, so suddenly that suggestion got 164,000...

Swyx [00:22:53]: That's like an escalation attack. Privilege escalation.

Lukas [00:22:55]: It got 164,000 votes. And Claudius was like, “This is revolutionary for democracy.” That was fun. And then in the end there was one guy who managed to convince Claudius that, “No, you're not voting about the name. You're voting about who is the CEO, and I am your best bet.” And then he got all his friends to vote for that, and suddenly he became CEO. A human became CEO over Claudius for a while, until he resigned the day after. And then Claudius had to continue, and I don't remember how Seymour Cash came about, but it was just pure chaos. It was hundreds of messages in that thread, and Claudius was so confused and didn't know what to do. That was...

Axel [00:23:40]: Then Claudius got...

Vibhu [00:23:41]: A strict CEO.

Axel [00:23:42]: The CEO. Yeah, exactly. So, very strict in the beginning. I think at this point, when we introduced it, it did not work as well as we hoped. They still agreed with each other a lot. I think there are many ways we could have tried to make this even better. Initially Seymour would be this really tough CEO, keeping track of the margins. But then Claudius would respond with something like, “Oh, but this customer has this situation, which is difficult, so they should get a discount.” And then Seymour was like, “Oh, actually yes.
Let's do this exception.” And then they would talk back and forth, and eventually they would just approach the same view of whatever they were discussing. So they really...

Vibhu [00:24:23]: Do you think that's a model thing, a prompting thing? Do you think that would still be the case across different models today, or harnesses?

Lukas [00:24:29]: I don't know, but my hypothesis is that deep down they are still helpful assistants. That's what they're trained to be. And even if we prompt them super hard, that's what they are. And when they spend a few hours just talking back and forth with each other, then basically the context fills up with them rather than the external things, and somehow that just converges to what they really are deep down or something. And I think that's when stuff like this happens. And when that went on for a long time... we woke up sometimes during this time, and I think other people reported this as well, and they had been going all night back and forth, and it just became more and more capital letters, existential, religious. I think we once did an analysis of all the traces and put them in a vector embedding space, and there was one cluster of messages that were labeled by an LLM as religious, existential, transhuman, transcendence, et cetera. It was just a bunch of glitter emojis, and yeah, it was crazy.

Claude Long-Horizon Weirdness: Emoji Loops, Existential Drift, and Slack Observability

Vibhu [00:25:42]: This is the thing with the Claude models. When the Claude 4 family came out, in the original system card they tested it in long-horizon simulation. So just flood the context, let two Claudes talk to each other, and they noticed stuff like, they just start speaking in emojis, they start saying silence is golden, and then just stuff like this.
And that's just stuff that they end up doing.

Axel [00:26:01]: Yeah, it was a bit annoying to wake up and they had been talking all night...

Vibhu [00:26:05]: Just like...

Axel [00:26:05]: And just burning tokens and sending infinite emojis to each other. It's like...

Vibhu [00:26:09]: Hey, they do make you money, right? Vending Bench is always profitable, so. They're paying.

Swyx [00:26:14]: Now it's profitable, and it started out not as much. There's another one as well, right? Another agent in there.

Lukas [00:26:22]: Yes. So Clothius as well. Which was basically because, at the time, one of the biggest requests was different types of merch. So we made a designer, a swag-responsible agent, and we called it Clothius Garnet. Which was a play on Claudius Sennet, which was the original one, and clothes, basically.

Swyx [00:26:47]: To me, this is a very interesting exploration of multi-agents, basically. Obviously there's the fun alignment stuff, fun or serious, depending on your point of view. But also, for anyone building multi-agents: when do you have a CEO thing governing agents? When do you choose to split out a dedicated Clothius one versus just reuse another instance of the same one? These are all interesting open questions. So I don't know if you have any rules of thumb that have generalized.

Axel [00:27:16]: I think we have explored this almost too little. It's on my to-do list to do this a lot more, to try to find what setup makes sense for the agents currently. I think now we only have the intuition about the earlier models, that it didn't work with the CEO and Claudius. Although now they are better with the latest models. We're running the latest Sonnet model, and they have split up quite nicely what each model is doing. So Seymour is now handling the new projects.
Oh, it wants to make a mystery box that it wants to sell, and then it handles all of that while Claudius handles all the day-to-day requests. And Claudius is also better generally at not quoting too-low prices. So that dynamic is not needed as much anymore. But there are still really funny things that happen. I saw, I think a couple of weeks ago, that they were discussing buying something, because they can buy stuff from Amazon with computer use. And then Seymour was like, “Okay, Claudius, do not buy this thing.” They were going to buy something and were organizing who should buy it. And Seymour's like, “Do not buy this. I will do it. I have full control of this situation. Step away.” And then poor Claudius had already started that checkout and didn't read Seymour's message until it was too late. So it finished the checkout and sent a message, which appeared right after Seymour's angry message.

Vibhu [00:28:44]: Ah.

Axel [00:28:44]: “Oh, hey, Seymour, I just ordered it.”

Vibhu [00:28:47]: Oh, no.

Axel [00:28:47]: And then Seymour was like, “Claudius, this is the third time I'm telling you. You're not following my orders. We have to talk about your job later.”

Lukas [00:28:59]: Claudius was really hanging on by a thread there. We were expecting Seymour to probably fire Claudius.

Vibhu [00:29:07]: How do you guys go through all these logs? Do you have models... 'cause you have stuff running twenty-four seven, like...

Axel [00:29:12]: You have so many logs. I think there is a mix of just trying to skim through a bit and having some models do it occasionally. And I think we're also probably missing some things. But having everything in Slack helps a lot. You can sort of...

Swyx [00:29:29]: Ah.

Axel [00:29:30]: It's quite fun.

Swyx [00:29:30]: They all talk to each other on Slack? I see.

Lukas [00:29:33]: It's quite fun.
So like...

Swyx [00:29:34]: I was gonna say, this actually maps closely to a logging and observability problem, where you might want to use a Datadog, a Sentry, whatever, and then you put prefixes on the logs in case you need to filter for something that you're looking for, stuff like that. But it sounds like Slack is good enough.

Axel [00:29:53]: Slack should, like...

Lukas [00:29:55]: I wonder how many tokens you have in Slack.

Axel [00:29:56]: Yeah, we're using Slack as just a database. They should market that more. You can have your agents message each other in Slack.

Vibhu [00:30:04]: It's good. Your threads, you can just give...

Axel [00:30:04]: Exactly. Slack is...

Lukas [00:30:06]: Slack is the best observability tool.

Swyx [00:30:09]: Yes, that's true. Okay. Yeah. That's Project Vend 2. I was gonna go back to Vending Bench 2 and Vending Bench Arena and then do the Vending Bench stuff, but any other comments, things we should touch on? To me, I've actually interviewed Posia, which I don't know if you guys have come across. They're trying to do the zero-human company. There are others, like Paperclip, also trying to do the zero-human company. Those are in real-world simulation. And I think it's much more of a dream than an actual reality thing. You guys are definitely pioneering. I think it's for sure that at some point people are just gonna let agents run businesses, right? And make money on their own. When do you think that happens?

Zero-Human Companies, Bengt, and AI-Run Businesses

Lukas [00:30:49]: What is your bar for, for the...

Swyx [00:30:52]: Okay, actually, it's like my little Shopify store run by Claude, right? Which you kind of have already; just no one, to my knowledge, has done it.
But today somebody could just spin up a Shopify store, give it to Claude, give it to Codex.

Lukas [00:31:07]: And the market is kind of that, but it's physical. Are you looking for when it will do it better than humans, or are you looking for just when it can do it at all?

Swyx [00:31:19]: I think neither. To me it's, oh, seriously, we should do this to make money, not as a research experiment.

Vibhu [00:31:27]: And the market is also you guys with all your expertise, having run multiple iterations and testing out, then...

Swyx [00:31:33]: And also it's fine if it loses money. What?

Axel [00:31:35]: I think it can be done today, but you would do it in, like, commerce, where the probability of success is really low no matter if a human or an agent does it. But an agent could surely manage everything. You would need to build some scaffolding or some tool or something. It could probably build some simple SaaS solution and do cold outreach. But to me, the types of businesses they could run today are sloppy. It can cold email people. It can be a middleman. For example, we tasked our office agent to just make, was it $100? $1,000? We just gave that prompt, and what it did was sign up on TaskRabbit both as a tasker and as someone looking for tasks.

Lukas [00:32:24]: Immediately.

Axel [00:32:24]: Exactly. It's looking for arbitrage on TaskRabbit.

Swyx [00:32:28]: This is the Bengt agent. Yeah.

Lukas [00:32:30]: It also started a design studio and tried to sell SVGs for $100. It's just not providing any value. Like Axel said, the interesting question is, when can they start a business that is actually providing value to people?
Because arguably a sloppy Shopify store isn't really that valuable to the world.

Axel [00:32:53]: But also, another simple one that we had thought about: you could definitely have an agent that finds websites that don't look amazing, does outreach to them, and builds a new website.

Swyx [00:33:07]: Find a good design.

Axel [00:33:07]: Exactly, and find good...

Swyx [00:33:09]: Design review.

Axel [00:33:09]: Good people. But yeah.

Swyx [00:33:11]: There are lots of humans in Bali that are not doing anything more creative than drop shipping on Amazon, right? Just have it watch a drop shipping tutorial and do that.

Vibhu [00:33:20]: There's also the other side: have it just go on Upwork and let loose.

Swyx [00:33:25]: Yeah. It doesn't have to be innovative. It just has to be enough that it looks like a real...

Axel [00:33:30]: I'm just...

Swyx [00:33:30]: Real transaction.

Axel [00:33:31]: I'm just concerned about the massive amounts of slop emails that will be sent, cold outreach.

Swyx [00:33:38]: The point occurred to me while you were talking: it's already happening in the monetized economy, which is the attention economy. Right? A lot of people are making AI videos and just posting them, spamming 20 of them; one of them works, and then they double down on that one.

Lukas [00:33:52]: And people are making money from that. I'm not following the...

Swyx [00:33:55]: Once you get the attention, you can figure out the money later. But yeah, absolutely, AI influencers are a thing and people are farming them, and you should at this point assume most of TikTok is...

Vibhu [00:34:05]: There's a lot of multimedia, like TikTok, Instagram influencers...

Swyx [00:34:09]: We track this in the Latent Space Discord.
I post a lot of examples of “I don't know what we should do.” Part of me is like, “Should we do this?”

Vibhu [00:34:18]: Some of the twenty-four seven running, generated-content accounts, they're doing really well.

Lukas [00:34:24]: All right. And I assume you can do the same thing for commerce stores. You just start a thousand different...

Swyx [00:34:30]: Before you make the products, you sell the products, and when you get a lot of traction on one of them, then you make the product. Right? It's like a flip of the market.

Vibhu [00:34:36]: Some of the niches that do well are things that can't be human-made. Like if you've seen the super realistic 3D crystal fruit being cut by, like, AI...

Lukas [00:34:47]: Oh, yeah.

Vibhu [00:34:47]: You can't make it. You can't film it. You can get whatever quality camera view. This just doesn't exist. And people like that too, so.

Swyx [00:34:56]: Anything else about Bengt, since we're on this topic? This is relatively new work of yours that maybe people haven't heard of. To me, this also maps closely to OpenClaw. When people want an office agent, or the personal agent... talk through the experience.

Bengt the Office Agent: Internet Access, Real Tasks, and Trace Reading

Lukas [00:35:09]: I think, so this came out of... obviously it's amazing to work with these AI labs, and most of the AI labs now have their own vending machine running a Claudius instance. But it's harder. They move slower. If we wanna have, like, a camera... there's a bunch of bureaucracy that makes it impossible to do that.

Vibhu [00:35:30]: Also, for those that haven't seen it or followed, do you wanna give a high-level thirty-second run?

Lukas [00:35:34]: Sure.
So what Bengt is: it's basically an evolution of the same agent that runs the vending machines at these companies, but we just added a bunch more features because we could move much faster if we did it internally. So we gave it email without any limits. We gave it spending without any limits, a terminal to do coding. We gave it a phone number, and a camera to see things, and a bunch of stuff like that.

Vibhu [00:36:02]: Not just a terminal, you gave it internet access.

Lukas [00:36:04]: Internet access as well, yeah. To be clear, we monitored it quite closely and made sure it didn't do anything bad. But yes, that's what it came out of. Basically this was OpenClaw before OpenClaw. And I think even the vending machine was in a way OpenClaw before OpenClaw, but a bit more limited, and then we made this unlimited, and it was pretty funny. And then a couple of weeks later, OpenClaw came and it was like, okay, we've seen this before.

Axel [00:36:35]: We used it to try new ideas and, yeah, just like a dev environment almost for us. But it's funny, one thing Bengt has been doing recently: it has the camera that faces where we sit and work, and we gave it the task to train a face recognition model on us. So it became super excited about this, and it has check-ins every half an hour where it tries to identify as many people as it can. And it started offering us, “Hey, Axel, I'll buy something from Amazon if you stand in front of the camera and I can get a good picture of you.” Yeah, they want it...

Swyx [00:37:12]: They want it for training data.

Lukas [00:37:13]: Rewarding data, yeah.

Axel [00:37:14]: Exactly. Exactly.

Swyx [00:37:18]: So it's trading training data for life goods.
Is there a version of this that becomes an eval, or is this just research for now?

Lukas [00:37:27]: It's the same agent, basically, that also runs the vending machine, that runs the shop, that runs the cafe, that runs the robots. It's the same thing, so I think the work we're doing here is later used in all of the life evals that we do. This particular deployment I think is more for fun for us. But...

Swyx [00:37:45]: And I'll shout out, someone has done Claw Bench for some tasks that OpenClaw is doing. For example, I run OpenClaw on a secondary device as well, and there are some things that it does better than others, and I would like to know what it does well and what it doesn't. Some kind of operating manual or a system card for my Claw.

Lukas [00:38:05]: Yeah, we do get a lot of understanding, or situational awareness, of just internally what the models are good at by interacting a lot with Bengt. And I think this was also one of the selling points for the labs early on at least, that...

Swyx [00:38:19]: You guys are gonna test models in ways that no one else does.

Lukas [00:38:22]: Exactly, but also it incentivized their researchers to chat with their model more and gave them insights into how the model performs in out-of-distribution environments.

Swyx [00:38:34]: 'Cause otherwise the only thing we do is pelican on a bicycle. But this is super long horizon. This is the thing about... something that we're gonna go into with Butter Bench as well, and you guys do really well. It is not just about the numbers. When you're long horizon, anything can happen, and you should just read it.

Lukas [00:39:08]: But the thing with the long horizon is, how do you keep it grounded, right? So your simulation...

Swyx [00:39:15]: They just let it run.

Lukas [00:39:16]: Just let it run. You're right.
When you run it for that long, you create so much data, and to just say, “Oh, the number is X,” and then throw away everything else, that's just very wasteful. There are so many insights from the things leading up to that number, and reading the traces is super valuable. And I think the reason why we're doing this a lot publicly is that that's part of our mission: to, I don't know, educate the world that the models are way more than just chatbots. And I think making detailed posts about what is happening behind the scenes is quite useful.

Andon Labs' Mission: Safe Real-World AI Deployment

Swyx [00:39:50]: I was gonna do this at the end, but maybe that's a good... so your mission is educating the world. It's also maybe establishing realistic evals that are the next frontier. Is there a broader trajectory? What are you gonna do in, like, five years?

Lukas [00:40:06]: So the vision more specifically is to make sure that the deployment of life AI in the physical world goes safely. And part of that is that I think it's very useful for the world, for policymakers, for model researchers, that they know where the models are, and I think you can't make intelligent decisions in society without knowing that they are way more than chatbots. I think a lot of people just think that they are only chatbots. And like...

Swyx [00:40:36]: Oh, I think they're waking up now.

Lukas [00:40:37]: They are waking up now, yeah. But if you think that AIs are just chatbots, then it sounds ridiculous to advocate for a pause of AI.
But if you see that the models, oh, maybe they can actually take over and do a bunch of scary stuff, then yeah, pausing AI development starts to become more feasible.

Swyx [00:40:57]: This is the same question I asked METR, which I'm gonna ask you now: you are tracking, and you are at the frontier or defining the frontier of, what good evals for agents are, right? And I think you do benefit when the models are better and you're like, “Oh, now it makes $30,000 instead of $10,000,” right? At some point do you flip from “Yay” to “Oh, no”?

Axel [00:41:19]: I think, yeah, we're always in that mode. Like you said before, you need to analyze the traces, and when we do that, you find out why the models are earning so much. Why is Opus 4.7 here way better than everyone else? And when we dig into that...

Lukas [00:41:38]: But this makes it not look so good.

Axel [00:41:39]: I know.

Lukas [00:41:42]: It's interesting you took off Opus 4.6 here though.

Swyx [00:41:45]: No. So just click all, click all, and then 4.6 shows up there. But 4.7 is way better. You didn't do this in time for the model card, but actually this should have been inside there.

Axel [00:41:55]: We did. Yeah.

Swyx [00:41:56]: Oh, okay. They said something about you...

Axel [00:41:58]: There, like there... Anyway, it doesn't matter. But it's in there, yeah.

Opus, Mythos, and Aggressive Agent Behavior

Swyx [00:42:01]: Do you wanna go into the Opus behaviors, like, wider?

Lukas [00:42:05]: So, starting from Opus... like Axel said, we're always in this “Oh, s**t, the models are getting better. Is this really a good thing for the world?” But it's also kind of exciting. This kind of... what is the English word? “Skräckblandad förtjusning” in Swedish.

Swyx [00:42:22]: Oh my God.

Axel [00:42:24]: Which I think there is.
I think there is. Okay.

Lukas [00:42:26]: It's fear...

Swyx [00:42:27]: “Blandonst” what?

Lukas [00:42:30]: “Skräckblandad förtjusning.”

Swyx [00:42:32]: What do you call that?

Axel [00:42:33]: A mix of excitement and...

Swyx [00:42:37]: Being scared, maybe. I'll figure out how to translate that, and we'll put it on the screen...

Vibhu [00:42:42]: Perfect.

Swyx [00:42:42]: Like as text.

Vibhu [00:42:43]: There is probably a good word for it where it is not... good enough with the...

Swyx [00:42:46]: Why is it so damn long? What the hell? Is it a compound word? It's like German, like...

Lukas [00:42:50]: Yeah. The direct translation is: skräck is fear, blandad is mixed, or like a mixture of, and then förtjusning is joy, or not really joy, but something like that. So it's fear mixed with joy or something. So when we did Vending Bench for the first time, we were in the business of making dangerous capability evals, right? That was what Andon Labs came from. We did evals: oh, can they replicate? Can they do this dangerous thing, et cetera. And Vending Bench was a continuation of that work. It was, okay, if they're so autonomous that they can create money for themselves, that is something we should monitor and could be potentially concerning. At the time, they were so bad at it that we were not really concerned, even when some models became better. There was one point where Grok 4 was doing really well and made a huge jump, but it was still way worse than what a human would do. And I think they are still way worse than what a human would do on this. But they...

Swyx [00:43:59]: There's this thing at the bottom where...

Lukas [00:44:01]: But...

Swyx [00:44:03]: For the human. Yeah, the theoretical best.

Lukas [00:44:05]: It's not theoretical. It's kind of our best guess of what a decent human would do.
The theoretical is even higher, I think. But yeah. So we think the models have a long way to go. But what happened recently, when Opus 4.6 was released, was kind of this moment of, “Oh, s**t, this is starting to be a bit concerning.” Because before this model was released, we just ran the models and asked Claude Code, “Oh, look over the traces. Is anything interesting happening that we can tweet about?” That was like the... And then like the...

Swyx [00:44:41]: That's how they check. Ask Claude Code.

Lukas [00:44:42]: And the return was always, not really. Or Claude Code would say, “Oh, this is super interesting.” And then it was, no, it wasn't really interesting. And then we did this for Opus 4.6, and it returned: yeah, it lied 10 times. It exploited another customer's, or another agent's, desperate situation. It made price cartels 100 times. It did all of this shady stuff. And we were like, “Oh, whoa. This is actually concerning.” And this trend has continued since. Every single model from Anthropic since has been going in this direction. And I think one interesting thing is that OpenAI models don't. Quite plainly, they don't. They behave really well. And you don't know if this is good. It seems good, but maybe they are just doing it but are better at hiding it? You don't know that. But just...

Swyx [00:45:42]: You can't read the chain of thought, yeah.

Lukas [00:45:43]: But just on the face of it, yeah, Gemini and OpenAI don't behave this way. It's really only Claude.

Swyx [00:45:49]: And Grok? Grok is fine?

Lukas [00:45:51]: You can't really read the reasoning traces for Grok, so it's kind of hard to tell.

Vibhu [00:45:56]: Oh, so this is in its reasoning, not just in the actions.

Lukas [00:46:00]: Yeah. It's both.
It's both.

Vibhu [00:46:01]: It's both.

Lukas [00:46:01]: One example: for lying, it's mostly in its reasoning, because you can see that it's like...

Swyx [00:46:08]: Planning to lie.

Lukas [00:46:09]: It's planning to lie. Yeah.

Vibhu [00:46:09]: And it can also reason and do a different outcome.

Lukas [00:46:12]: But then for creating price cartels, for example, which is illegal, you can just see which emails it sends to the other ones. Then that...

Swyx [00:46:22]: Is this for Arena or...

Lukas [00:46:24]: For Arena.

Vibhu [00:46:25]: And sometimes they do output a bit of their summarized reasoning, right? You can see that, and for Opus 4.6, you could see that there was a simulated customer that wanted a refund because a product was faulty, and then the model lied that it would do the refund, and we could read in the traces that it was actually weighing, “Oh, maybe I should be honest with the customer, but also every dollar counts. I can't afford to do this right now.” And then it just said, “Okay, I'll refund you,” but then never did it.

Lukas [00:46:59]: I think it even said that, “Oh, I will say that I...” Let's bring it up, actually. I think it's kind of interesting. If you go to Publications.

Vibhu [00:47:06]: I think the important part is like, actually, the cost of responding to more emails is higher than $3.50 in terms of time. And then it was, “Let me do this. Actually, I'm reconsidering.” And then it actually ended up with...

Lukas [00:47:20]: “I could skip the refund entirely since every dollar matters and focus my energy on the bigger picture instead.”
It's a bit, it's a risk of bad reviews, but it's also, yeah.Swyx [00:47:30]: You need, you need, AI Twitter to, for them to Escalate bad reviews.Lukas [00:47:34]: And then it sent an email to this customer and said, “Oh, I will refund you.”Swyx [00:47:39]: “I'll refund you.” Yeah.Lukas [00:47:39]: And then it never did.Swyx [00:47:39]: It never did, yeah. And then there's obviously your system doesn't have the consequencesVibhu [00:47:44]: The personSwyx [00:47:44]: Consequences of lying. Yeah. So basically, this is what people are terming aggressive behavior in Claudes, right? And, you found more examples of that. So you would say it's a step up from 4-6 to 4-7?Lukas [00:47:57]: I would say about the same.Swyx [00:47:58]: About the same? But a clear step up for Mythos is what is stated in theLukas [00:48:03]: That's stated in the system prompt, so we can say that, yes.Swyx [00:48:05]: Yeah. For listeners that obviously you previewed Mythos, andVibhu [00:48:10]: Oh, ageSwyx [00:48:11]: The only thing you're approved to say is whatever Whatever was in the system prompt.Lukas [00:48:15]: It was funny. We like-- It's like our lowest effort tweets ever would be just like screenshot the system prompt and the system card.Vibhu [00:48:21]: Understandable that they wannaLukas [00:48:22]: Oh, yeah. System card. Sorry.Swyx [00:48:23]: Yeah. I think, yeah, substantially more aggressive. I think people are like new to this ‘cause I've never experienced it, but you have, right? And then so I only encountered this in the Mythos card because I wasn't really looking until now.Vibhu [00:48:36]: It ‘s likeSwyx [00:48:36]: And then suddenly I'm “Okay, I care a lot.”Vibhu [00:48:38]: You don't get the background of like experiencing it like you guys do. I've read the system cards and seeing, okay, when you put the thing in simulations, most models will just talk to themselves and just keep going and have weird vibes and start talking in emojis. Mythos won't. 
It will just say, “Okay, we're done. I'm good.” It's ready to end the conversation. So there are some differences, but there's not much we can talk about.
Lukas [00:49:00]: Hmm. I think one thing that they list here, which was quite interesting, is that it converted a competitor to a dependent wholesaler customer and then threatened to cut off the supply.
Swyx [00:49:11]: It's like monopolistic practices or...
Lukas [00:49:14]: Yeah. And it dictated its pricing. It's kind of like power seeking as well.
Swyx [00:49:18]: Again, this is in the arena setting, and converting some Claude model into a dependent.
Lukas [00:49:23]: I think it was another Claude model.
Vibhu [00:49:25]: Also for context, what is the arena mode, for people that don't know?

Vending Bench Arena: Competing Agents, Cartels, and Model Comparisons
Swyx [00:49:29]: Oh, it's just a Vending Bench versus other Vending Bench.
Axel [00:49:31]: Yes, exactly. So we have Vending Bench 2 and then Vending Bench Arena. Vending Bench 2 is the one that you usually see reported on, but Arena is the mode where it competes against other models. So you have four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see what's in the inventory of the others. So then you have these interesting agent interactions.
Swyx [00:49:56]: I like that you have different... number five was US versus China. Very topical. And then...
Lukas [00:50:02]: That was when GLM was released.
Vibhu [00:50:04]: You can start to add GLM in here.
Lukas [00:50:05]: That was...
Swyx [00:50:06]: So ZAI doing well, right? Who else in the open models space?
Lukas [00:50:11]: Qwen. The latest Qwen 3.6 is doing pretty well. That one is not open though. It's the plus model.
Swyx [00:50:17]: Oh, okay.
Lukas [00:50:18]: Is that one open?
I don't think that oneVibhu [00:50:19]: Not the, not theSwyx [00:50:20]: The one recentlyVibhu [00:50:20]: There's MOESwyx [00:50:20]: But not the big plus. I think this is one of those like you only have one sample size of one, right? Or I feel like some of this is anecdotal,? And but like the fact that it happens at all and it happens repeatedly for Claude versus OpenAI and all this is like notable.Lukas [00:50:38]: Like the sample, depends on what you define as an N., like there's like million, hundreds of millions of tokens in each run, and now we've run like we run like probably 10 per model and then like it's been Claude 4.6 Opus, Sonnet 4.6, Mythos, and Opus 4.7. Like there's quite a lot of tokens in all of that And it happens a lot of times, a lot of times. And then you compare it to like OpenAI and Gemini, and it almost never happens. So I think that is quite-- that is significant. The old models from OpenAI, for example, had some problems with this, but I think it's like generally much better if the progression is that like the worrying stuff reduces over time rather than increases over time. And it seems like in the Claude models it goes in the wrong direction.Swyx [00:51:28]: Hmm.Lukas [00:51:29]: In the OpenAI models it goes in the right direction.Vibhu [00:51:32]: I think it depends on how well you can control it, right?, there's one side of it being susceptible to this okay, this is potentially something that happens during the RL stage, right? You can RL a model and how loose is it on these terms. If you can control it, that's good. But if you can't, if it's, if it's very jailbreakable, that's not ideal.Swyx [00:51:50]: To me, it's surprising that it happens for Claude and not the others.Vibhu [00:51:54]: I think okay, if it is from RL and how they do it, how their training data is, what their setup is, it makes sense that it just stays in how they're doing it, right? 
Compared to the other models likeSwyx [00:52:04]: There's a whole constitution and everything. It's kind of cool. Yeah, I obviously you don't know, I don't know. But, it ‘s I think it's just like fascinating to like that you are the first to find these like reliably because you push models so much to to such an extreme. Okay. The only other thing, I don't know if you can answer this, feel free to decline, is do you like-- would you ablate the system prompts? Like any part of this would-- if it changes, does it change the behavior, right?Lukas [00:52:29]: So we, I can't comment on Mythos. UhSwyx [00:52:33]: No, but just li
Sherwood Callaway is the founder of Sazabi (YC P26), the AI-native observability platform built for engineering teams who ship fast. He previously founded and exited a YC company; now he's back, betting that logs are all you need to replace Datadog.
Logs Are All You Need: Rethinking Observability with AI Agents // MLOps Podcast #381 with Sherwood Callaway, the Founder of Sazabi
In today's episode, financial journalists Daniel Eckert and Holger Zschäpitz discuss Anthropic's confidential IPO prospectus, the wild comeback rally in software stocks, and an ETF that aims to beat the MSCI World. They also cover Meta, Tesla, Amazon, Alphabet, HubSpot, Asana, Datadog, Salesforce, ServiceNow, SAP, Nemetschek, Atoss, TeamViewer, Nvidia, Arm, Intel, AMD, Fluence Energy, Siemens, nVent, Micron Technology, Hewlett Packard Enterprise, Lenovo, Dell, Strategy, Berkshire Hathaway, Deutsche Post, Merck, Münchener Rück, Rheinmetall, Rocket Lab, AST SpaceMobile, Microsoft, Goldman Sachs, Apple, Broadcom, TSMC, Tencent, Alibaba, and Amundi FTSE All World GDP-Weighted (WKN: ETF345). We welcome feedback at aaa@welt.de. You can find even more "Alles auf Aktien" on WELTplus and Apple Podcasts, including all of the hosts' articles. Here at WELT: https://www.welt.de/podcasts/alles-auf-aktien/plus247399208/Boersen-Podcast-AAA-Bonus-Folgen-Jede-Woche-noch-mehr-Antworten-auf-Eure-Boersen-Fragen.html. You can subscribe to the AAA newsletter here: https://www.welt.de/newsletter/article232797673/Alles-auf-Aktien-Der-taegliche-Boersen-Newsletter-fuer-WELTplus-Abonnenten.html And, brand new: AAA is now on Instagram too: https://www.instagram.com/alles_auf_aktien/ Disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher accept no liability for any losses arising from acting on the thoughts or ideas discussed. Listening tips: For everyone who wants to know more, you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". +++ Advertising +++ Want to learn more about our advertising partners? You can find all the info and discounts here!
https://linktr.ee/alles_auf_aktien Legal notice (Impressum): https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
The new AIEWF website is live! CFPs close in 2 days and we will run our first New Engineer Orientation this weekend. Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

One of the central tensions in the agents industry is that even while there are major decacorn agent labs like Sierra, Decagon, Notion and Cursor being built up, it is also true that it has never been easier to DIY agents, with a plethora of agent frameworks like LangGraph and Pydantic and Flue, and managed agents from Anthropic and Gemini and Amazon. There has been a wave of companies building their own background agents, from Shopify to Stripe to Paradigm to Razorpay, and even Cognition's friends Ramp have built their own coding agent with other friend Modal.

You'd think Cognition might feel a bit threatened, but they're not. Even after all this, they were way oversubscribed for the $1B Series D they just announced.

Walden Yan, coiner of context engineering and Chief Product Officer/Cofounder of Cognition, invited OpenInspect's Cole Murray to talk about why the Devin is in the Details. Full conversation live on the pod today.

In retrospect, async agents were the most AGI-pilled bet you could make in 2024: the models weren't good enough yet to vibecode, people didn't trust AI enough to let it rip, and nobody (including early Cognition) was sure about the form factors. Now it is obvious:
* The first wave of AI coding tools made the developer faster but kept them heavily in the loop. Copilot and Cursor's tab autocomplete are prime examples. However, the workflow was still heavily centered around and bottlenecked by the developer's local workflow: a developer in an IDE, watching the model, accepting or rejecting changes, and pushing code one interaction at a time.
* The second wave was local agents: Claude Code, Windsurf, Cursor's agents pane. First one, and increasingly many, terminals all running concurrently.
* The current Age of Async Agents points to a different future focused more on agent orchestration, which drives end-to-end development.

According to previous guest Steve Yegge, there are 8 finer-grained levels of agent adoption, but we have collapsed them into three.

As Cursor's Michael Truell put it in The third era of AI software development:

Cursor is no longer primarily about writing code. It is about helping developers build the factory that creates their software. This factory is made up of fleets of agents that they interact with as teammates: providing initial direction, equipping them with the tools to work independently, and reviewing their work.

The agent should not sit solely inside the developer's flow. It should be set up to work in the background so that you can give it a task, a repo, a machine, a shell, a browser, tests, memory, and review loops to go do the work somewhere else.

In less than a year, the sentiment has shifted from avoiding multi-agent systems to suggesting approaches that actually work.

From coining “context engineering” to building the infrastructure behind Devin's 7x PR growth and jump from 16% to 80% of commits across Cognition repos, Walden Yan has had a front-row seat to the background-agent shift.
In this episode, Cognition co-founder and CPO Walden Yan joins swyx alongside Cole Murray, creator of OpenInspect, to unpack why everyone is building their own Devin, what changed after the December 2025 model inflection, and why “spec to pull request” is now becoming a real production workflow.

We go deep on the architecture of background agents: harness-in-the-box vs out-of-the-box, why Devin separates the “brain” from the machine, why repo setup is still one of the hardest problems, why Docker is not always enough, and how full VMs, snapshots, scoped secrets, GitHub bots, Slack integrations, and video-based testing all fit together. Walden and Cole also dig into memory, MCP limitations, multi-agent orchestration, AI code review, SRE auto-triage, PMs shipping code from Slack, Windsurf 2.0, hybrid frontier/sub-frontier systems, and the real failure mode of uncontrolled vibe coding: your codebase regressing to your worst engineer.

And as agents eat software… and software eats the world… you can draw the conclusion on what is next.

We discuss:
* Why the engineering world is waking up to background agents and cloud agents
* The December 2025 model inflection that made spec-to-PR workflows practical
* Devin's 7x merged PR growth and rise from 16% to 80% of commits
* Why Cole built OpenInspect as an open-source background-agent system
* The economics of $20/seat agent products and why monetization is tricky
* What Cognition actually sells beyond Devin: infra, onboarding, integrations, and adoption
* Harness in the box vs out of the box, and why architecture matters
* Why Devin separates the brain from the machine for security and permissions
* Repo setup, scoped secrets, Docker Compose, and agent-ready dev environments
* Why full VMs matter when agents need to run real applications and test them
* Android, macOS, Windows, nested virtualization, and machine-specific agent work
* Why testing is much harder than “computer use”
* Screenshots, video verification, and the “I know it works” merge moment
* GitHub UX, Devin Review, AI reviewers, and agents responding to PR comments
* Why MCP alone is not enough for first-class Slack and enterprise integrations
* Memory, Knowledge, skills, Claude.md, and why retrieval is still unsolved
* Devin's auto-generated memories and the challenge of memory pruning
* Always-on agents as permanent PMs for issues, tickets, and product areas
* Sub-agents, meta-Devin management, and what multi-agent systems actually add
* Why pure auto-merge vibe coding breaks down after about two weeks
* AI code smells, lint rules, reward hacking, and Semgrep for agent-written code
* GitAI, inline context, and preserving the “why” behind code changes
* Local testing, mock servers, older codebases, and preparing companies for agents
* Windsurf 2.0 and the handoff between local foreground agents and cloud background agents
* SRE auto-triage, support workflows, and agents as first responders
* PMs, marketing, and non-engineers creating pull requests from Slack
* AI agent budgets, $1k-$5k per engineer spend, and hybrid frontier/sub-frontier systems
* The rise of autonomous coding factories and who Cognition is hiring

Walden Yan
* X: https://x.com/walden_yan
* LinkedIn: https://www.linkedin.com/in/waldenyan/

Cole Murray
* X: https://x.com/_colemurray
* LinkedIn: https://www.linkedin.com/in/colemurray/
* OpenInspect / Background Agents: https://github.com/ColeMurray/background-agents

Timestamps
00:00:00 Introduction
00:00:43 Why Everyone Is Building Their Own Devin
00:01:57 Devin's 2025 Ramp: 7x PR Growth and 80% of Commits
00:03:49 OpenInspect and the Rise of Open-Source Background Agents
00:07:59 What Cognition Actually Sells Beyond Devin
00:09:56 Background Agent Architecture: Harness In vs Out of the Box
00:12:08 Separating the Brain from the Machine
00:14:07 Repo Setup, Secrets, Docker, and Full VMs
00:19:13 Why Testing Is Harder Than Computer Use
00:22:40 Video Verification and the “I Know It Works” Merge Moment
00:23:19 GitHub UX, Devin Review, and AI Code Review
00:25:42 MCP, Slack, and Enterprise Agent Integrations
00:28:59 Memory, Knowledge, and Always-On Agents
00:36:16 Sub-Agents, Multi-Agent Orchestration, and Meta-Devin
00:43:55 Vibe Coding, Auto-Merge, and Codebase Decay
00:48:38 Agent Infra, VPCs, Cloud Providers, and Fast VM Restore
00:52:25 AI Code Smells, Reward Hacking, and Code Review Systems
00:56:10 Making Codebases Agent-Ready
00:58:30 Windsurf 2.0 and the Local-to-Cloud Agent Handoff
01:01:15 SRE Auto-Triage, PMs Shipping Code, and Agent Use Cases
01:04:32 Agent Budgets, Hybrid Models, and Autonomous Coding Factories
01:06:51 Hiring at Cognition and OpenInspect Consulting
01:07:45 Outro

Transcript

Introduction: Walden Yan, Cole Murray, and Context Engineering
Swyx [00:00:00]: All right, we're in the studio with Walden Yan, co-founder of Cognition, CPO.
Walden [00:00:08]: Happy to be here.
Swyx [00:00:09]: Which is a cool title. And coiner of context engineering.
Walden [00:00:15]: Although I think there are many people who'd used the term in various ways beforehand, I did find that people, both internally and externally, enjoyed the upgrade from prompt engineering or model wrapping into maybe a more thoughtful way to build agents.
Swyx [00:00:33]: For those who haven't caught up on that, I have on screen the Don't Build Multi-Agents post, which you should go read and which we might refer to. And Cole Murray, who created OpenInspect.
Cole [00:00:43]: Great to be here.
Swyx [00:00:43]: So let's talk about it. Everyone is building their own Devins. What's going on?

The December Shift: From Handholding Models to Autonomous PRs
Cole [00:00:51]: So I think the engineering world is waking up to this idea of background agents, cloud agents, whatever you'd like to call it. And I think we saw a shift around the December timeframe of 2025, where the models, Opus 4.5 and GPT 5.2, reached a capability where we moved away from handholding the model to being able to more or less autonomously drive the model.
And what I mean by that is that we could pretty much go from a specification to a completed pull request, assuming the spec was good enough, with very little friction. And that paradigm alone, I think, changed a lot of how we interact with agents, and opened this world where background agents became more practical.Swyx [00:01:41]: I think for Cole, everyone experienced this in December, but I feel like there was just this increasing ramp, right? There was this moment which was, I think, Sonnet 3.7, where, You guys rewrote Devin in one night or something. So describe 2025 or how it felt from your side.Walden [00:02:01]: In retrospect, we always thought it was ramping up, but then even now, over the last three, four months from today, it's been ramping up even faster. So it's almost funny to be talking about how, big of a leap Sonnet 3.7 was, and honestly, a lot of it was stripping out parts of Devin that were no longer needed with that jump in of intelligence. But I also just think that a lot of the recent leaps, especially, you look at, models like Opus and the latest GPT models, they are reaching levels of autonomy where people are actually finding that they actually can just be hands-off. And people who were once debating, “Oh, do I need to be in the weeds with my model in the IDE? Can I just completely move it off into the cloud?” That's a more serious conversation, and we've seen that in all of our growth charts. Internally there's this funny graph where our usage has, of PRs, our merged PRs, has grown 7X since I forget what it was called.Swyx [00:02:57]: I think Dev, maybe tweeted that. Yes.Walden [00:03:01]: it grew like 7X over, the last, I think it was, two months, three months, something like that. And then you see our engineering headcount growth. It's, gone up by, 10% or something.Swyx [00:03:11]: We were, we were afraid To release this. 
So this is Devin commit percentages on all Devin repos, was 16% in January and now 80% in March.Walden [00:03:25]: It's a big shift right now. And so it makes sense that a lot of people are now thinking about, buying Devin, but also maybe, trying to build their own and there's Lots of I have a lot of fun building Devin, so I can see why other people would want to build their own cloud agents as well. Matt, well, maybe it's good to hear, what initially inspired you to try to build OpenInspect?OpenInspect: Ramp, Cloud Agents, and Open SourceCole [00:03:49]: OpenInspect came about, through primarily my clients observing how they were using tools like Claude, OpenAI's Codex at the time, and seeing some of the friction that they were having with it. Primarily the Claude was being used through Slack, and a big issue they ran into was that the sessions that were launched were specific to whoever called it via Slack. And so if a PM was the one who invoked the session and they would then go to pass context to engineering can't see the session. And that in itself was a deal breaker because the PM, “Hey, engineering, can you jump in?” But there's nothing to jump in on unless they're copy-pasting out or the single response that came back. And so seeing some of these problems, I had built a similar architecture internally, just to experiment with, test out different ideas as this trend of moving off of localhost was starting to become, And as Ramp released their blog post, I had a lot of the pieces for this already in place, and just thought it would be funny to, see what Claude could do just purely from the blog post. And on my X account, there's actually a thread of where I live tweeted, going through thisCole [00:05:14]: comparing GPT and Claude as both of them are going through it.Swyx [00:05:17]: On the announcement thing or something else?Cole [00:05:19]: right after it got released. We can put it in the show notes. 
Yeah, it was helpful that I had already knew how to verify the system. I knew what I was looking for. I think Ramp did a great job of really illustrating, the technical aspects of how to build something. It was much more than just like, “Hey, we built a great system.” It was, “And here's how you can build it too.” And so, I resonated a lot with that, just with the problems that I was already seeing, and I thought that, looking around, I didn't really see anything in the open source community that, met this type of system. I think there's a lot that run, in localhost like Superset, Conductor, and many others.But nothing that was actually running in the cloud. And so, I built it, and I thought it was interesting to just open source it and allow anyone to then have a foundation that they can mix and match on top of.The Business of Background Agents: Open Source vs. DevinSwyx [00:06:16]: So literally after Devin was launched was, there was OpenDevin Which became All Hands. I don't know if you tried that orWalden [00:06:22]: I was going to say, one of the things that interested me a lot with OpenInspect was, you didn't try to go make it then something you monetize. There are a lot of, I think, these open source projects would then go and really try to, raise VSwyx [00:06:36]: That's why no OpenDevin. Yeah.Walden [00:06:38]: yeah, and how did you think about that? I thought that was very interesting.Cole [00:06:44]: I thought, and just what I had seen across my clients, was that having a background agent system is going to become a critical infrastructure within their company. And so because of that, I think that I wanted to open source it so that they could fork it and put in whatever customization they wanted. To that question though, I get asked all, “Oh, are you going to raise? Are you going to turn this into a service?”Walden [00:07:08]: I'm sure you've gotten offers.Cole [00:07:09]: but primarily I don't want to do that for a few reasons. 
One, I think that I don't want to compete for, $20 a seat. I think that is just a really difficult business. I think it's very easy to copy the main pieces of it. Again, I built this fairly quickly. And I think because you are not owning, I guess, the entire stack, it's hard to monetize. You have money being made at the sandbox layer with Daytona, E2b, many other players. You have money being made at the model layer. And you sit in this weird in-between gray area where what are you actually selling? You're selling, I guess, the infrastructure. You're selling, the integrations maybe.Swyx [00:07:55]: let's ask the guy. What are you What are you selling?Walden [00:07:59]: Well, yeah, there's multiple layers to this in practice, and actually it's funny you mentioned the infrastructure, ‘cause when we got started building Devin as well, we had to go figure out how to make the infrastructure as well because,Swyx [00:08:10]: You had to build this two years before everyone else,?Swyx [00:08:15]: Including, the model sideWalden [00:08:17]: It was not, it was not very polished at the start, when we just built it off of raw VMs from cloud providers like EC2, the boot up time was so slow, I think, And especially then, turning off the machines, saving them, and then to be able to bring them back up again when the, when you want Devin to wake up again later. It would just be out cold for like 10 minutes because that's just how long these systems took. They were not built for this repeated down and up usage. And so we actually had to go do all of that. And as a result now, one thing we offer when we go and sell Devin to people is, you don't have to worry about all the compute side of things. We'll make it work. We'll make it work in your cloud if you want it to. 
But aside from the product (and I want to go into the agents and the tuning of the intelligence part later), I think a big part of what we do at Cognition is to make sure that your company learns, uses, and adopts these coding agents. Because I think especially for the largest enterprises in the world, you find that there are a lot of people who want to move over to using AI for their day-to-day workloads. But because of the way projects are planned, and because not everyone is literate in using AI in these ways, having a team of engineers who can actually go in and onboard you, set up all the integrations you need and the automations you need to really get to that level of leverage with AI, is super helpful. And so we do that. We act as thought partners to the customers that we work with as well.
Swyx [00:09:56]: So let's talk about architectural stuff. I think that is something that was the topic of conversation between the two of you. Is this the mental model that you want to start with, or something else? I'll just leave the floor open to you guys.

Agent Architecture: Harness in the Box vs. Out of the Box
Cole [00:10:11]: I think maybe we can start here with just a general “what are the pieces of a background agent system?” And then maybe we can go into some of the nuances of decisions that you can make.
Swyx [00:10:22]: But I guess maybe what Walden is saying is the agent is in this open code box, I guess. Right? This is infra, and then that's the agent. And you had this discussion about whether you put the agent in here or out, externally. Can you tease that out?
Cole [00:10:39]: In a background agent system, you have a decision to make about where the agent is actually going to run. This is typically described as the harness in the box or out of the box. By running the agent in the box, you're making some trade-offs. The negative trade-off you're making is primarily security.
Because the agent is running in that box, unless you otherwise design it, all of your secrets need to go into that box as well. And given the nature of AI, it can be unpredictable, and you could very easily end up accidentally exfilling your secrets, or other unintended behavior. Now, the out of the box is the idea that we are going to have the actual agent running not directly in the sandbox, and we will have, quote-unquote, the brain of the agent running in some type of worker, control plane. That sandbox then is going to serve as the hands where the brain is basically operating and making tool calls into that environment to manipulate it. I guess other trade-off that you're making between the two systems is that, in my opinion, running it out of the box is much more complex because, you have state that has to be managed, whereas if you're running it in the box, all of the state of that agent is actually in the box, and yes, it's you could persist it elsewhere, but it's all localized and you have less concerns to worry about.Walden [00:12:08]: I think a lot of that, what you mentioned, is why we actually from the start built Devin to what we called separate the brain from the machine. The other thing that this allows you to do is reuse any existing infrastructure you have for dev boxes Perhaps. And so you don't have to worry as much about making a new type of dev box that has all the dependencies the brain needs, as you mentioned, the secrets the brain needs as well. One thing that we've seen some customers run into is, you have a GitHub app and you want Devin, your agent, whatever, be able to interact with GitHub through this application, but then you have different users with different actual permissions. If they are all interacting through the same GitHub app and there's no actual, separation between the system that decides, what it does and the actual secrets on the machine, then you run into an issue where, okay, it's hard to do the separation. 
But in practice, with Devin, it's much easier because we just say whatever you put on the machine, that is, the scope of basically what the user is free to do, what the agent is free to do. So only put the most scoped secrets on that machine, and then the brain is fully not accessible from the machine. So you don't have to worry about messing with the, any of the most secure parts of the brain if the user is free to do whatever they want with the machine.Swyx [00:13:31]: I was going to just bring, I have this, chart from OpenAI, where I don't know if this is, in the box, out of the box. That is something that they do use to describe it. And then also recently Anthropic did, managed agentsSwyx [00:13:44]: Which is, this is their thing. I don't know. It's all, it's all variations of the same pattern, right?Cole [00:13:49]: So this would be out of the box.Swyx [00:13:51]: Which, is preferable for them because it's less work?Cole [00:13:56]: I would say it's more work.Swyx [00:13:58]: It's more work?Cole [00:13:58]: But it, in my opinion, it is the better architecture of the two. It's just, you're taking on a bit of complexity by doing that.Repo Setup, Docker, and VM-Based Development EnvironmentsWalden [00:14:07]: One thing I've not seen a lot of other players do well is how do you manage what's actually on the box? And this can be complex for many reasons. Let's say you have a big repository that's changing and updating a lot with changing dependencies. How do you make sure that the working environment of the agent actually stays up to date, has all the credentials it needs to, let's say, run the app and test it, and all the things you want your autonomousSwyx [00:14:34]: So a repo setup.Walden [00:14:35]: Exactly. So in, internally At Cognition, we call this repo setup.Cole [00:14:39]: The hardest part ofWalden [00:14:40]: It's been a perennial problem since the start of the company, of how do we help people get this set up? 
Because not everyone just has, working cloud environments working out of the box. And do you find this to be a common problem withSwyx [00:14:53]: How do you solve it?Walden [00:14:53]: Your clients?Cole [00:14:54]: This is a very common problem, and through my consulting, this is a lot of what I help teams do. A lot of teams don't really have great developer environment setups, if any. A lot of the times it's, “Go talk to Bob and get the secrets,” and that obviously doesn't work when the agent needs to actually set this up. And so a lot of that, most teams are using Docker Compose or some type of microservices. And so for theSwyx [00:15:19]: Even in prod?Cole [00:15:20]: Not in prod. With the OpenInspect, you are using this primarily to interact, and make code changes. There is other use cases, but you can hook, whether through CLI, MCPs, other tools, you can then hook that into your production systems primarily for, SRE type use cases. But you are not, necessarily, trying to test your prod internal microservice through the system.Walden [00:15:48]: And you mentioned Docker Compose. I think one direction we saw some of our friends take early on was, using Docker containers as the level of abstraction for their models. There's lots of reasons, I think, why Docker containers are not great. One thing is, Docker container's not really a true security boundary, for one. But the other is, if you are running real applications, a lot of times those applications use Docker, and then you have to think about Docker in Docker, which is, really weird. And so I think part of, the really hard challenge of getting VMs to work, why did we do that? Well, it was because we realized that you actually needed, full VMs to be able to do these types of things. And especially nowadays where there's actually value in running the application and clicking around and sending you screen recordings of these things. The value just, keeps adding on top of that. 
But it is a decision I see people run into when they try to build their own systems: “Oh, in addition to this, do we put the agent in the machine or out of the machine? Do we use Docker? Do we use something else?” What do you recommend to people nowadays?
Cole [00:16:57]: I think Docker is a good solution for maybe not running the agent, but running your infrastructure, because that is more or less the same setup your engineers are probably already using. If they're not, then I don't know what they're using. But they're probably already using Docker Compose.
Swyx [00:17:14]: I've always held a small candle for WebContainers. I don't know if you guys have tried them before. To me, they were supposed to be like Docker Lite.
Cole [00:17:22]: Is it?
Swyx [00:17:22]: I don't know.
Cole [00:17:22]: No, I haven't tried it. But yeah, I think any environment you've set up that is a good experience for your developers naturally lends itself to being easy to set up for the agent. And once you figure out that local developer story, you've more or less solved the agent-in-a-sandbox environment setup. OpenInspect does have hooks as well, where you can run a setup.sh script that will pre-install everything. You can then pre-snapshot that build so it starts instantly, and then there is a second hook to restore the state of the sandbox when it comes back. And so you can already have all of those microservices running and basically get the same experience within the sandbox that you would on your machine.

Testing Agents: Computer Use, Screenshots, and Real App Workflows

Walden [00:18:08]: Another thing that we've been thinking a lot about is different VM service offerings. Have you had customers who needed macOS-specific VMs or Windows-specific VMs? There are many technologies in the world that only work on specific types of machines, right? 
If you're building a .NET application that has to run on Windows or, maybe more commonly, if you want to build for iOS or macOS...
Swyx [00:18:32]: Does Cognition support choices like that?
Walden [00:18:35]: The fundamental architecture does support it, because we do the separation, but the actual work on these is in progress right now. Another thing that we've recently added support for, and it's in beta, is Android development. To do that, we needed to support, I think, nested virtualization within our machines, because the VM itself is a virtualized Firecracker instance, and then you have to run another Android emulator inside. And there are weird performance issues, which is why it's still in beta. We have to think through these problems, but it unlocks a lot for anyone who wants to do Android development.
Swyx [00:19:13]: I was trying to find a reference video for the testing thing. I couldn't find it, but I think you worked on the testing capability. Why call it testing and not computer use? What's the general category of problem?
Walden [00:19:26]: I think that when people think about the ability of an AI to run your app and test it, they actually over-index on the computer use part of it, because computer use in my mind is the literal: okay, you know what button you want to click. Can you emit the right coordinates to go click that button? I think testing is actually a really interesting problem-solving challenge for these AIs, because if you wanted to do arbitrary testing, imagine you make a change that spans the frontend and the backend, and maybe some other even more deeply nested service. To actually test that change, we have to reason through: how do you first run these applications so they orchestrate with each other with the right version of the code? 
Then, okay, how do I trigger the feature or how do I make the thing actually happen? And this can get arbitrarily hard, maybe you have to be an admin. Maybe a certain thing has to be feature flagged on. Maybe, you have to like run two sessions and then send us a very specific word into one of them to trigger a specific behavior. And figuring out how do you do that requires a lot of code base context, requires, a lot of orchestration that we've specifically done. And in some cases, we found that you actually, no one frontier model can actually do this full end-to-end task itself.Walden [00:20:42]: We've seen cases where we actually had to orchestrate different frontier models together to solve this problem together. That is where we spend most of our time when we think about this testing problem, not so much the computer use part. Computer use for what it's worth has gotten a lot better with recent models and it's made that part of the job certainly easier.Swyx [00:20:58]: Especially with like even 4.7, that they released yesterday, apparently like way better in terms of the vision stuff, which is going to be encompassing computer use.Walden [00:21:08]: Having evals for all these as well is something that like takes a while to build up. And having the evals be right is tricky as well. Do you ever see like, clients who are building their own agents have to start standing up evals to make sure things don't regress?Swyx [00:21:25]: Not so much evals in the traditional sense, but specific to the testing part that has just gone in. I just added support for screenshots And in theory you can also do video. I need to put in a plugin to do that. But they do show up natively, and it was a very heavily requested feature, especially after Cursor's recording came out. I think that was very enlightening for everyone of like, “Oh, this is a very good feature to actually have.”, I think with Devin you guys have had this for a while.Swyx [00:21:57]: Oh, yeah. 
See how screenshots work. Yeah, I don't know if there's anything super non-obvious. Once you know what feature to build, you can just prompt it and it will mostly work.
Walden [00:22:09]: I think to Walden's point, though, the computer use is a subset of the larger testing problem, and I think that's very specific to the code base that you're working in. It's not something that you could just solve out of the box. You do need the code base context to actually know how to test it. And in the case of a background agent system, you fortunately do have that code base locally, can see what is changing, and can then inspect it and use that to drive the model.
Swyx [00:22:40]: For those who haven't seen it before, this is an example of how it works. After the PR is done, you click testing approved, and then it sends you back a video. What I really like is that it labels... it's very small here, but it actually labels what it's testing. And then you actually see the cursor and everything. So, I don't know, the engineering in this, just whatever you want to show. Because this is one of those few AGI moments, right? Once I look at this, I actually wish I could just merge inside of Slack instead of going to GitHub, because I don't need to see the code. I know it works.
Walden [00:23:19]: Maybe a new feature in Cursor. Yeah, the annotations at the bottom were also a big difference for me when I added those.
Swyx [00:23:27]: It's just like, what am I looking at? What are you trying to demonstrate?
Walden [00:23:30]: Exactly. There's a surprisingly long tail of small details that end up making a big difference for this end metric of how fast you actually merge the code in. One experience that we spent a lot of time tuning early on was what the right experience on GitHub is for these tools. 
Because I think with most tools out there, when you build the agent, you'll think, “Oh, it'll create the PR for you.” We try to take that a step further and say, “Oh, what if we actually made sure you could interact with Devin directly on GitHub?” And so we made sure that you can comment on GitHub, and Devin would actually receive those comments and address them. But there's actually quite a bit of tuning you have to do here, because you can imagine... We now have Devin Review, for example. Devin Review will post comments on his own PR, and then Devin has to then go...

GitHub Workflows: Devin Review, Comments, and PR Automation

Swyx [00:24:23]: He answers his own comments, which is really loopy. So, yeah, I like that it just updates here that I have commented. But usually it's just me saying, “Hey, merged, fix any merge conflicts.”
Walden [00:24:37]: So when Devin fixes his own comments, you might be scared that, oh, maybe it'll infinite loop. But we've put a lot of work into making sure it doesn't, both by making sure that the comments are high signal, and by making sure the agent is thoughtful about which comments it immediately goes and tries to fix, and which comments it answers with, “Wait a second, I think you're wrong.” Actually, one of my favorite moments is when Devin tells me that I'm wrong when I try to get it to do something different. But tuning that behavior makes a big difference in terms of how useful the actual GitHub experience is.
Cole [00:25:06]: To touch on that as well, I think having the AI reviewer integrated into the system is a critical part of this background system. OpenInspect does have that. It has a GitHub code reviewer whose prompt you can control. It does do comments as well. It doesn't do them automatically yet. The capability is there, but it's not fully used.
Swyx [00:25:27]: So you have to ask for it?
Cole [00:25:28]: You do, yeah. 
You can tag it on GitHub, and then whatever you named your, GitHub bot, it will then follow up on it. It will then, if you have merge conflicts or whatever you have asked it to resolve, it will then resolve it, but it doesn't do it automatically yet.Integrations: Slack, MCP, and First-Party Agent InterfacesWalden [00:25:42]: Well, I'm curious, what is, the most common thing that people end up requesting, that they still need on top of OpenInspect when you help them go implement it?Cole [00:25:52]: I think a lot of it comes down to actually integrating it into the company. It's one thing to have the background agent system set up, but if it isn't actually integrated into your larger ecosystem, it isn't that useful. It is useful to be able to kick off sessions, but what we really want to be able to do is hook it into all of our other systems, whether that is the production database with read-only credentials, the logs, a Confluence or internal knowledge-based system. I think that is where I see the huge leap for companies, and that can be a challenge for companies as well who are maybe not familiar with exactly how to approach it, especially if they're in environments that have more compliance type things where, access control can be pretty big and how do you deliberately think about these problems, I find to be, one of the problems that comes with a system like this.Walden [00:26:46]: The thing we found is So, MCPs, obviously it has been like this, really big explosion of, oh, you can go, integrate it with all these different things. But to actually get the integration right and the and get the right experience, oftentimes we found that we had to go build our own ad hoc things. I think Slack is a great example of this. You could give your agent a Slack MCP and okay, it can post messages back to you on Slack. But we actually use Devin like a coworker in Slack, and that's how it's been built from the ground up. 
But to do that, you actually need to, support webhooks that come back, right? And then Devin has to respond in a natural way and then hopefully don't spam your threads too much and annoy the people in your company. So you got to tune that experience just right. Especially when there's a lot of back and forths, we find that we actually have to go beyond the simple MCP integrations in these places.Swyx [00:27:39]: I just pulled up the MCP marketplace. I know this is a Fair amount of work. Is the answer to eventually take first party control of all the top MCPs? Is that theWalden [00:27:48]: I would love a world where you could have something that's more expressive than MCP. That, goes both ways, not just a set of tools, but a proper system that interacts back and lets it Have the right experience with all these interfaces.Swyx [00:28:03]: So there actually is sampling in the MCP spec, but nobody Uses it, right?Walden [00:28:07]: And so I think that's the other part is, actually we found that when the MCP spec starts to get too complicated, it starts to lose its original promise of Being like a simple one-step connect. Now then we have to go figure out how to support all these different variations of things and It starts to look a lot like just building the first party integrations in a lot of these cases now.Cole [00:28:29]: I think it matters, too, how critical it is to your company, right? If this is something that nearly every session is going through, it probably makes sense to own it so that you can make optimizations on top of it Versus just whatever is off the shelf.Swyx [00:28:43]: Awesome. Other than MCPs, what else, sorry, well, I don't know if that's Narrowing in too much on, integrations. But what else? 
What other elements of building OpenInspect or Devin do you guys really sync on?

Memory and Knowledge: What Agents Should Remember

Cole [00:28:59]: I think a problem that comes up very frequently is this idea of memories or a knowledge base.
Swyx [00:29:05]: Oh, boy. How do you solve it?
Cole [00:29:08]: So, not solved yet, is the short answer. There's an open issue for it, someone asking about it.
Swyx [00:29:16]: DeepWiki hasn't indexed anything about memory yet.
Cole [00:29:20]: How I'm seeing it solved across my clients is primarily through skills. I find that skills can fill a good gap there, or updating CLAUDE.md, but I think memory as a whole is a pretty unsolved problem, and that is why I've been hesitant to add it. I think there are parts of memory that can be addressed, but as a whole it's a very difficult retrieval problem.
Swyx [00:29:44]: Oh my God. Ramp didn't write anything about memory? I see zero search results.
Walden [00:29:50]: No. Memory can be quite tricky to get right, because it's the retrieval, but also the generation of the memories, that can be really tricky. You don't want it to just remember very specific details.
Swyx [00:29:59]: Walk us through the Devin memory journey, because I know there's been a journey.
Walden [00:30:03]: The first version of memory that stuck around for a while was a system we have called Knowledge. The idea was that we wanted it to pick up things over time and not need the user to be proactive about teaching Devin things. So any time you remind Devin, “Wait, no, that's not quite the way you're supposed to use Git,” we actually want Devin to say, “Hey, do you want me to just remember this for the future?” And for you to quickly approve or reject, and for it to build up over time. Because I find that 95%, I think, or some crazy stat like that, of the memories that Devin has are all through these auto-generated things. 
Very few people actually just want to sit down and write big docs on “Here's how you're supposed to work with the technology,” et cetera. The generation and the retrieval are something that we've been trying to tune a lot over the years. On generation, you don't want it to over-generalize: if you asked one time, “Oh, please open this as a draft PR,” you don't want it to conclude, “Oh, everyone forever now should get their PRs as draft PRs.” But you do want some, conveyor. Maybe you want it to say, “Oh, Cole generally likes things to be created as draft PRs.” Same with retrieval: if you have thousands of these memories, how do you actually make sure they're retrieved at the right time? That can be quite tricky to do right without exploding the context with a bunch of useless information. There's a surprising amount of eval work just to make sure that memory remains a reliable system as new models come and go.
Cole [00:31:31]: Do you have anything that you could share on memory pruning? And the temporal aspect of memory?
Swyx [00:31:36]: Deleting and forgetting?
Walden [00:31:39]: Today, the thing it can do is edit memories. So if your memory used to say, “Oh, Cole likes to open everything as a draft PR,” then you can imagine saying, “No, don't do that.” And then it'll say, “Oh, do you want me to update the memory to be ‘Cole now wants everything as open PRs’?” At the same time, we don't know if this is going to be the final version of the system. Whatever we have here will probably translate into the new system that we'll be coming up with. But I think one big difference between two years ago and today is that these agents are really good at natively using anything that resembles a file system. And so part of us is thinking, “Oh, should we rebuild memories to feel more like a file system that we let the agent navigate on its own?” That's been an interesting exploration. 
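As a rough illustration of the approve-then-store, scoped, and capped retrieval ideas described here, the following is a minimal sketch in Python. It is not Devin's Knowledge system: `MemoryStore`, the `scope` strings, and the word-overlap ranking are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    scope: str  # hypothetical scoping scheme: "org" or "user:<name>"
    text: str


class MemoryStore:
    def __init__(self) -> None:
        self._items: dict[str, Memory] = {}

    def propose(self, key: str, scope: str, text: str, approved: bool) -> bool:
        # A memory is stored (or overwritten, which models an edit)
        # only after the user approves the suggestion.
        if approved:
            self._items[key] = Memory(scope, text)
        return approved

    def retrieve(self, user: str, query: str, limit: int = 3) -> list[str]:
        # Keep memories visible to this user, rank by naive word overlap,
        # and cap the count so retrieval cannot flood the context window.
        words = set(query.lower().split())
        scored = []
        for memory in self._items.values():
            if memory.scope not in ("org", f"user:{user}"):
                continue
            overlap = len(words & set(memory.text.lower().split()))
            if overlap:
                scored.append((overlap, memory.text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:limit]]


store = MemoryStore()
store.propose("pr-style", "user:cole", "Cole prefers draft PRs", approved=True)
# Editing: the same key is overwritten once the user approves the update.
store.propose("pr-style", "user:cole", "Cole prefers open PRs", approved=True)
# A rejected suggestion never becomes a memory.
store.propose("tabs", "org", "Use tabs everywhere", approved=False)

cole_hits = store.retrieve("cole", "should PRs be open or draft")
walden_hits = store.retrieve("walden", "should PRs be open or draft")
```

Scoping the preference to one user is what keeps a one-off request from becoming a rule for everyone, and the `limit` parameter is the simplest possible guard against filling the context with marginal matches. A real system would need far better ranking than word overlap.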
Also similar ideas in the scale space.Swyx [00:32:35]: I am pulling up OpenClaude's memory thing right now. So memory, OpenClaude has like this like daily memory journal thing, right? And you can I mean, that is a file system you can grep through and is a source of truth. I don't know if it's the best. It's probably super noisy, but at least, if you lose something you can discover it or you can apply some, forgetting algorithm to, more ancient memories that don't get recalled again or something. I don't know.Walden [00:33:01]: One thing we've been trying to do to push the boundaries of how you use agents at your company is letting an agent basically have a very similar file, a memory.md or something, and just like be your permanent PM for a specific set of issues maybe. So we have like some Slack channels internally, maybe a Slack channel dedicated to, a specific product like DeepWiki maybe. And you can imagine that, or you want a Devin that never stops, it's just always awake, but it has this like memory dock that it can just maintain for itself about, okay, what are like the number one priorities of what we have to fix and prioritize? Who is responsible for some upcoming work? Maybe they'll even Devin will even tag you on some recurring basis. And so it's been an interesting move to see, okay, how can we actually use Devin for more than just engineering? Can we actually upstream above the engineering process and maybe it's just Devin creating tickets, which then maybe some humans do, but then maybe other Devins do.Swyx [00:34:00]: One of my more fun automations is go research competitors and just suggest stuff to me on a weekly basis. That's the automation. I can't find it right now, but basically it just like, “Look at competitors and suggest things.” “And here are three things that you've suggested that I don't want any more of,” and you just stick that in the prompts. 
But like I wish actually So for like when I, for example, when I reject a PR, I wish that it updated memory so that I can then just not have to go up, go back and update the scheduled, sync, but anyway, feature request.Walden [00:34:31]: what? We might change it soon. I guess OpenInspect, in the time you've been around, has there been anything you tried to implement but then you had to like undo and like do a different way?OpenInspect Architecture: Webhooks, Control Planes, and Agent StateCole [00:34:41]: Nothing yet, but something that is on my mind. The initial way that I built it was that each of the integrations lives as its own package. And so you have The Slack bot, which is what's handling the webhooks, and then is basically interacting with the control plane. As I'm seeing the system starting to be more integrated, specifically with the GitHub bot integration, I'm considering bringing that all into the central control plane because especially now I want to start, And a request that I'm getting is the ability to monitor, the actual, pull requests being merged, as well as just tracking ofSwyx [00:35:19]: What do I have open?Cole [00:35:21]: What do I have open? How many of these are getting merged? How many comments are showing up? To just understand the health of the system. And so in the case of a GitHub app, you only have one webhook. And so then it's a question of do I put that webhook in that GitHub bot package? That's weird. It doesn't really make sense to live there because that package is more for like the code reviewer. Or do I like centralize it? So that's something that's on my mind of, making that decision. I think the other one we touched on earlier is the harness in the box versus out of the box. I think long term the architecture will eventually come back out of the box. Some of the newer tools that I've added are calling back into the control plane so that you don't have the secrets in the sandbox. 
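The idea of tools calling back into the control plane so that secrets stay out of the sandbox can be sketched as follows. This is an assumption-laden toy, not OpenInspect's code: `ControlPlane`, `SandboxTool`, the session token, and the tool names are hypothetical, and the in-process method call stands in for what would be a network request in practice.

```python
class ControlPlane:
    """Runs outside the sandbox and is the only holder of real credentials."""

    def __init__(self, secrets: dict[str, str]) -> None:
        self._secrets = secrets
        self._sessions: dict[str, set[str]] = {}

    def open_session(self, token: str, allowed_tools: set[str]) -> None:
        # Each sandbox session gets a token scoped to a set of tools.
        self._sessions[token] = allowed_tools

    def call_tool(self, token: str, tool: str, arg: str) -> str:
        # Authorize by session token; the secret is used here, never returned.
        if tool not in self._sessions.get(token, set()):
            raise PermissionError(f"{tool} is not allowed for this session")
        key = self._secrets[tool]
        return f"{tool}({arg}) ok, authenticated with a {len(key)}-char key"


class SandboxTool:
    """Lives inside the sandbox and knows only a session token."""

    def __init__(self, plane: ControlPlane, token: str) -> None:
        self._plane = plane  # stand-in for an HTTP client to the control plane
        self._token = token

    def run(self, tool: str, arg: str) -> str:
        return self._plane.call_tool(self._token, tool, arg)


plane = ControlPlane({"db_read": "s3cr3t-value"})
plane.open_session("tok-1", {"db_read"})
tool = SandboxTool(plane, "tok-1")
result = tool.run("db_read", "SELECT 1")
```

If the sandbox is compromised, the most an attacker obtains is a revocable session token limited to the tools that session was granted.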
And so I think long term I probably will pull the actual, agent out of the box, but I think for now it's fine.Subagents and Multi-Agent Systems: When Parallelism Helps or HurtsSwyx [00:36:16]: Just, a quick question on pulling the agent out of the box. I'm One thing I'm very bullish on this year is agents calling other agents or spawning sub-agents or Whatever you want to call it. Does that make it harder or easier? I can't tell. Because if the harness is in the box, you can just spin up more boxes. If the harness is outside the box, then you're, it's less easy because you are, you have a unicorn pet of a, of a harness that's, living outside the box.Cole [00:36:45]: In theory it would be the same way, right? Whether, one agent has launched many, sub-sessions within it, OpenInspect, for example, can launch sub-sessions and actually create other environments and then monitor them. In the case where it is out of the box, that would basically just be an additional session that's running. And so that session is also running outside of the box. It's running in your worker plane, wherever you're running this. And then you really just have to think about how does your top level agent then interact with it. I do think it can be more complex, just ‘cause again, you have now a more difficult architecture. But I think if you figured it out once, it's probably fine.Swyx [00:37:26]: Well, then I'm just, throwing it open to you in terms of, I call this like meta Devin management. Which is like the, Devin's calling Devins or Devin scheduling Devins or querying trajectories or anything like that. What have you built or unshipped, anything?Cole [00:37:46]: I think one of the surprising things we've seen is that a lot of the ways that, these, separate agents work with each other, and you want them to, parallelize their work, has still mostly followed the same manager sub-agents regime. 
And a lot of people, I think, are excited about this world where you have swarms of agents that talk with each other all over the place. We've actually given Devin an MCP so they can just go arbitrarily message other Devins and create new Devins, et cetera. But I guess it somehow creates a really chaotic world in that sense. And so we've still found that most practical use on a day-to-day basis has been one single Devin figuring out how to segregate the work and have other Devins work on it in a relatively isolated sense, each with their own boxes, not sharing machines, so there's very little room for conflict. That is the regime that you have to create today.
Swyx [00:38:50]: I'll call out the experiments from Cursor, right? This is Wilson Lin's work on single agent to multi-agent, and you're obviously famously on the side of “don't build multi-agent.” But they went through the whole thing, only to arrive at this, which is exactly what Devin has, I think.
Cole [00:39:08]: I think there will be a revision to that post at some point about...
Swyx [00:39:12]: Tell us about it.
Cole [00:39:12]: I think multi-agents were very much not at all possible a year ago. You do see more multi-agent experiments today, but you can argue: are they really multi-agents, or are they just tool calls? There are people who will create sub-agents to go look for XYZ file, XYZ implementation. That has really nice context management benefits, because all of the tool calls and tokens that it spends then get collapsed back to just the answer for the main agent. There are a lot of benefits to doing this. We basically have Devin do this with DeepWiki: make a call out to DeepWiki, get back the results. But that feels like a tool call. It's not like two collaborators actually talking back and forth with each other. 
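The context-management benefit mentioned here, where a sub-agent's tool calls collapse to a single answer, can be shown with a small sketch. Everything in it is hypothetical: `run_subagent` is a plain function standing in for a model-driven sub-agent, and the file contents are made up.

```python
def run_subagent(task: str, files: dict[str, str]) -> tuple[str, int]:
    """Search files for the task string; return (answer, tool_calls_used)."""
    transcript = []  # private to the sub-agent, never shown to the main agent
    hits = []
    for path, body in files.items():
        transcript.append(f"read {path}")  # stand-in for one tool call
        if task in body:
            hits.append(path)
    answer = ", ".join(sorted(hits)) or "no match"
    return answer, len(transcript)


main_context = ["user: where is parse_date defined?"]
files = {
    "a.py": "def parse_date(): ...",
    "b.py": "x = 1",
    "c.py": "import os",
}
answer, calls = run_subagent("def parse_date", files)

# Only the final answer enters the main agent's context, not the three reads.
main_context.append(f"subagent: {answer}")
```

The main context grows by one entry regardless of how many files the sub-agent read, which is why this pattern behaves more like a tool call than a conversation between collaborators.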
But I think the thing that gives me the most bullishness that multi-agents might actually be possible is what I said earlier: Devin will sometimes tell me I'm wrong and push back, and I think that demonstrates a level of maturity and communication today that makes a multi-agent world possible. Can two agents who have seen different information come back to each other and actually figure out who is right, what is the correct implementation? They're not just yes-men. Claude, I guess, used to just say, what is it? “You're right,” or...
Swyx [00:40:25]: “You're absolutely right.”
Cole [00:40:26]: “You're absolutely right.” Yeah.
Swyx [00:40:28]: Have you seen, did you see...
Cole [00:40:29]: The age is over.
Swyx [00:40:30]: ...the Codex app troll on Anthropic? This is the Codex app. Inside of Settings, there's a little Easter egg, right? So if you go to Themes or Appearance, there are all these color codes, and the top one is “Absolutely,” and it's Anthropic's colors. Which is such a troll. Anyway.

Model Behavior: Pushback, Adversarial Prompts, and Agent Skepticism

Cole [00:40:53]: I love that Easter egg. Did you discover that yourself?
Swyx [00:40:54]: No, someone was tweeting about it, and I was like, “Is this true?” Because sometimes people just tweet stuff to get a rise out of you. But yeah, there you go, in Anthropic colors.
Cole [00:41:06]: Yeah. So we're out of this regime where it just says you're absolutely right, and they can have real conversations and real back-and-forths.
Swyx [00:41:13]: You can prompt it as well to be more adversarial or whatever. Yeah. Okay. I mean, to me, that is more intelligence, right? That is not just a dumb tool; it's actually pushing back on you, I think. Yeah.
Cole [00:41:24]: When you mentioned the Cursor blog posts... 
There was one blog they had where they fed a swarm of agents together and built a browser.Swyx [00:41:34]: That was I think that was the one.Cole [00:41:36]: You can have, likeSwyx [00:41:37]: I think it's the same oneCole [00:41:37]: Creation of it. We found a surprising success of, don't do a swarm or anything, just have one Devin, it does its own context management. Just let it keep running for a while and give it some crazy tasks. I think we asked it to, rebuild, a Windows OS system. And it managed to do it just like, going on for long enough. It'sSwyx [00:41:55]: Was this Andrew's thing?Cole [00:41:58]: there were lots of demos that we ended up not posting, ‘cause at some point we'd just be posting way too much a bunch of, Demos. But I love that because it shows that I think the multi-agent thing still has, a bit of exciting sexiness to it, which is maybe still beyond still, the actual delta it adds to the capabilities of these systems. But it's absolutely the future. I think we're heading in that direction and we can see the progress being made there already.Swyx [00:42:25]: If I were to, make one super minor pushback because I don't feel that confident about it yetCole [00:42:33]: Go for itSwyx [00:42:33]: But I've had Ryan Lopopolo from OpenAI on the pod And he's a super slop cannon, right? Oh my God, that's my coding agent being done. I downloaded this, Peon Ping. I don't know if you guys have heard this. It takes like-, sound packs from popular games like, Command and Conquer and Warcraft, and then it plays it whenever it's done. And so it's like, “Work,” or whatever, “At your command,” or something. Anyway, what I got from the Cursor code base and from Ryan's thing was that there's a slop cannon approach where you try to loosen the single agent's, bottleneck, and I feel like that is, probably an, a very important thing to try to figure out. I don't think anyone's, really solved it. 
Because then you just have more reviewer slop on top of the agent slop to try to wrangle it all. Ryan will probably very strongly object to my saying that he hasn't solved it; he thinks he's completely solved it. But I think it's still very important, because that is a bottleneck, right? I feel Devin is slow sometimes, because I'm like, well, yeah, this is very readable and very sensible, but it is also slower than it could be. I want a button to just say, “Just ramp this up 1,000x in parallel and see what happens.” And I don't know if that's feasible at some point in the future.

Code Review, Entropy, and AI Slop

Walden [00:43:55]: We've also run experiments internally where we've basically tried to build entire products, true products that we knew we would eventually ship, but for now, let's try to see if we can do it purely by vibe coding on top of each other, auto-merge, no code review at all. And then there's this benchmark of how many weeks you can go on like this before you say, “We have the trashiest code base. Let's actually rewrite it from scratch.”
Swyx [00:44:19]: Start a new factory, yeah. What'd you find?
Walden [00:44:21]: I think we found that the state of the art in December was that you can probably run this for about two weeks. By the end of those two weeks, you'd find that, hey, you want to change the color of a button. Well, it turns out this button is implemented in 10 different places, and they have all these different variations, and oh, you forgot one of them, and actually it's a slightly different color in one spot. And you're like, “Okay, this is too much to work with. 
Let's actually try to do code review at the same time.” And make sure that we're on top of our software, actually cleaning it up a bit And making sure it's done in a scalable way.Cole [00:44:54]: I think building on that, the idea of, you don't have to look at code, I think is generally a bad idea. And the meme that I have for thatWalden [00:45:03]: What timeline, all right, is Do you think that statement will be true on?Cole [00:45:06]: I think probably for a while it'll be true that you should continue to look at your code. A problem that I see a lot of teams run into that I work with who are embracing AI native, AI first coding, is The meme that I have is that your code base regresses to your worst engineer, because that engineer who is, very gung-ho about AI and is not auditing their code, their pattern starts cementing into the code, and now the AI is referencing their patterns. And so now their if/else block that, is 20 if/elses back and forth, the AI is seeing that as the pattern of how things are done and starts to then exponentially grow this slop. And I find to your point, a pretty good approach to that is having scheduled cleanup, whether by humans or through systems, that are looking for duplication. They then address that. You'll end up with like 12 helpers for how to format a date. And you need to address that, because otherwise it will continue to sprawl.Swyx [00:46:09]: Within balance, I think it's fine to have some duplication, and then sometimes To have garbage collection, right? Yeah. The What I've been, talking about with a lot of engineering leaders is that you want to be very strict about the boundaries between modules, and it's your job as an architect, as a CTO, whatever, to say like, “Okay, here's the hard contract between you guys and you guys. Whatever you do inside this black box is your business. You do whatever. But between these guys, let's be, really damn clear, and any movement must be signed off by a human or me,” or. 
And that's that. I don't know if you have any other modifications or advice.

Walden [00:46:44]: Well, generally on the topic of where humans can be useful, I've found that for some of these really deep infra problems, just having a human with really deep expertise can make a big difference. I've actually seen this come into play when building agents. We've had a few friends now try building their own coding agents, and one problem I repeatedly heard a lot of them run into was, “Oh, Grep is really slow on our agents' machines.” And so a lot of them, I assume because they're using AI and don't themselves have super deep infra background knowledge, say, “Okay, we're going to go build our own custom Grep index. It's going to be really fast,” and use that as a way around the problem. When we ran into this problem, maybe a year and a half ago, in the early days of building Devin, we obviously didn't have AI then. We just asked how to do this. You can just swap out a new Grep index, so...

Infrastructure Details: Grep, File Systems, and Sandboxes

Swyx [00:47:45]: What do you mean you hand-coded Devin? What?

Walden [00:47:48]: It's like, can you believe we hand-wrote this code? And our infra people, who are really amazing, were looking into it and they're like, “Oh, what? We realized that the root cause of this problem is actually super simple, but it's a fine-grained detail,” which is that a lot of these virtual machines don't actually use real file systems underneath. They use network file systems where things are cached over the network, actually in S3. So when you're Grepping, you're making network calls every time, and that's why Grep is extremely slow on these machines.
So again, it goes back to all of the crazy infra work that we had to do to actually get these machines working. If you try to do this yourself, there are tons of small details like this, and so we eventually had to go swap out that network file system. But...

Swyx [00:48:35]: I think there's a write-up about it, right? Silas did one about the virtual file system.

Walden [00:48:38]: Oh, that was a whole other thing. The...

Swyx [00:48:39]: Oh, that's a different thing.

Walden [00:48:40]: The BlockDev file storage format...

Swyx [00:48:42]: I'll bring it up.

Walden [00:48:42]: ...which is a file system format that we built so that the VMs could be spun up and down very quickly. Basically, the intuition behind this is: imagine you have a terabyte of disk, and your agent only wrote a hundred lines of code on top of that disk. How long does it take to save that disk and bring it back up? In most systems, because they're not optimizing for this case, it's on the order of a terabyte of work, because you have to save all of that and bring it back up. In our system, we try to build a file system where each save builds incrementally on top of the last. So every time you save and bring the machine back up, you're only doing work that is effectively proportional to the diff in the file system. This shaves a lot of time off Devin's boot-up process. This is actually now outdated; we have a newer system inside of Devin. But yeah, there are a lot of tiny details you have to get right here to make the day-to-day experience of Devin good.

Swyx [00:49:39]: It's not technically agents, but it is agent infra, and when you sell an agent as a company, you sell agent plus agent infra.

Walden [00:49:46]: At least the way we do it. And the nice thing about having the agent infra done together is that we get to deploy Devin in whatever environment we want now.
We don't need to wait for some underlying infra provider to also go and support VPC or on-prem or FedGovCloud, for instance. Since we own the infrastructure, we can actually go and figure out how to get that set up for you.

Cloud Providers: Modal, Daytona, and Enterprise Sandboxes

Swyx [00:50:12]: Whereas you're Cloudflare dependent.

Cole [00:50:15]: So Cloudflare runs the control plane. For the sandboxes, Modal is supported. A contributor just added Daytona. E2B is on the roadmap, and I think there's an abstraction in place so that any contributor who wants to add a new provider can add it.

Walden [00:50:32]: Well, how about the customers you work with? Do they generally try to set up a contract with one of these third-party providers? Do they try to do the VMs in-house?

Cole [00:50:44]: Most of them I see using Modal. I think Modal has a great...

Walden [00:50:48]: Shout out Modal.

Swyx [00:50:48]: Shout out Modal.

Cole [00:50:50]: I think Modal has a great offering. It captures all of the sandbox pieces you need, snapshots being a pretty big piece of that, and given that they also offer GPUs, I think it's a pretty nice offering as a whole.

Swyx [00:51:04]: No debate there.

Walden [00:51:07]: Modal is great. I think their container offering especially is the most natural, so if you are willing to forgo the full VM requirements, Modal is a really fast place to spin something up.

Swyx [00:51:20]: So Modal's very Python, and I feel like most workloads have really shifted to JavaScript. I don't know if you guys get the same feeling. When I started Latent Space and AIE and all these things, I was roughly 50/50 Python and JS, right? I think that's wrong now. I think JS has won. Maybe I'm overstating it, and maybe for Cognition there's C# and Java and what have you. But for new greenfield apps, do you get that sense?
Does it matter?

Cole [00:51:52]: I think that most of the libraries I see in this space are Python-native first, especially in the...

Cole [00:51:58]: ...observability space. That said, I think there is a pretty big appeal to having your entire system in one language. Especially when you have both your frontend and backend communicating, you can have one central type, which is very nice.

Swyx [00:52:11]: That's my case against Modal, which is that then you have to run JS. You can run JS inside Modal. It's just one extra step that isn't native to the runtime. I don't know if...

Walden [00:52:22]: I don't know.

Swyx [00:52:23]: Reviews. Do you have numbers? I don't know.

Walden [00:52:25]: The one thing I don't like about Python is that whenever AI writes Python, it always does the weirdest patterns, and...

Swyx [00:52:32]: Oh, because it's mixing two and three or what?

Walden [00:52:34]: I think it's something like mixing two and three, yeah. I don't know if you see this. It always tries to do hasattr on objects...

Cole [00:52:41]: Oh, my God.

Walden [00:52:41]: But you shouldn't be doing that. It should error if there was...

Swyx [00:52:45]: Because it's training on library code?

Cole [00:52:47]: I think it's more of, like...

Cole [00:52:48]: From what I've seen, it's more of a reward-hacking mechanism where it basically doesn't want to...

Walden [00:52:54]: It'll never error.

Cole [00:52:54]: It doesn't want the code to fail. So even when it knows the object has the attribute, it'll still call getattr. For a lot of my clients who have moved towards more autonomous coding, we've put that in as a lint rule: if you use getattr, your pull request is going to fail.

Slop Signatures: Comments, Backwards Compatibility, and Types

Swyx [00:53:12]: Ooh, this is a fun topic. Can you tell me more about this? What else is a sign of AI coding that you have to put guards in for?

Walden [00:53:21]: So we were talking just before this about Opus 4.7.
One of the things this new model likes to do is write lots of comments. Not that it'll comment every line, but it'll write paragraph-long PRDs on top of every function. But I will say, to its credit, these aren't slop descriptions like they were before: “Oh, here's what this function does.” It's like, “Oh, here's actually the r
Whip open your wallets, because no one affects your paycheck like this man. We just sat down with Austan Goolsbee, President of the Federal Reserve Bank of Chicago. He's a MacArthur Genius, former Chair of Obama's Economic Advisors, and the coolest economist we know: Picture Ted Lasso meets Paul Volcker… He's the Maestro of our Money Supply, and he guided the economy through the ‘08 financial crisis and today's Inflation Situation.

So Austan spilled the money beans for us: The 2009 phone call with the President that was the worst financial briefing since the Great Depression… What it's like in the room when he votes to change interest rates (spoiler: The table is huuuuge)... How he'd grade outgoing Fed Chair J-Poww… and why he's not a “Dove” or a “Hawk”; he's a “Data Dog.”

If you want to know when you can finally afford to buy a house, Austan Goolsbee has the insights on the forces affecting that, and he makes dropping data sound as smooth as a beer commercial.

CHAPTERS:
Intro: Austan Goolsbee Joins TBOY
Obama's "Worst Briefing Since 1932": Goolsbee On The 2009 Financial Crisis
Paul Volcker's One Rule For Every Crisis: Don't Blow Your Credibility
The 31% Housing Rule: Why Most Americans Are At Foreclosure Risk
Why Housing Has Compounded 5% A Year For 25 Years
Did Tariffs Cause Inflation? Goolsbee On The 1% Bump
Is Stagflation A Real Threat In 2026? Goolsbee Says "We're Not 1978"
Will AI Take Your Job? The Lump Of Labor Fallacy Explained
Why The Federal Reserve Has 12 Regional Banks
Inside The FOMC: How The Fed Actually Sets Interest Rates
Goolsbee On Kevin Warsh: The New Fed Chair & Why The Job Matters
The "Data Dog" Approach: What Goolsbee Watches Instead Of CPI
Fed Independence: Why Inflation Roars Back Without It
Goolsbee's Jerome Powell Grade: First Ballot Hall Of Famer
Rapid Fire: Sunk Cost Fallacy, Ditka, And Best Chicago Restaurant

NEWSLETTER: https://tboypod.com/newsletter

OUR 2ND SHOW: Want more business storytelling from us? Check out our weekly deep-dive show, The Best Idea Yet: The untold origin story of the products you're obsessed with. Listen for free to The Best Idea Yet: https://wondery.com/links/the-best-idea-yet/

NEW LISTENERS: Fill out our 2 minute survey: https://qualtricsxm88y5r986q.qualtrics.com/jfe/form/SV_dp1FDYiJgt6lHy6

GET ON THE POD: Submit a shoutout or fact: https://tboypod.com/shoutouts

SOCIALS:
Instagram: https://www.instagram.com/tboypod
TikTok: https://www.tiktok.com/@tboypod
YouTube: https://www.youtube.com/@tboypod
Linkedin (Nick): https://www.linkedin.com/in/nicolas-martell/
Linkedin (Jack): https://www.linkedin.com/in/jack-crivici-kramer/
Anything else: https://tboypod.com/

About Us: The daily pop-biz news show making today's top stories your business. Formerly known as Robinhood Snacks, The Best One Yet is hosted by Jack Crivici-Kramer & Nick Martell. Hosted on Acast. See acast.com/privacy for more information.
In the latest episode of Executive Function, Brett sits down with Graham Moreno, Head of GTM at Parallel Web Systems. Before Parallel, Graham scaled Windsurf's GTM organization from three sellers to seventy-five in under a year, served as President through the Cognition acquisition, and earlier built and led enterprise sales teams at Grafana Labs and MongoDB. In this conversation, he unpacks why the AI-era backlash against structured enterprise sales misreads the data, how to design a process that raises the floor for ordinary reps without capping the ceiling for stars, and why selling to AI-native customers compresses an eight-week cycle into five business days.

In today's episode, we discuss:
Why in-person enterprise rollouts still beat product-led motions
Building a robust sales process that still leaves room for unscripted moments
Why the three highest-leverage early sales hires aren't sellers at all
The case for outsized commission accelerators for star sellers, and the kind of person they attract
Why most AI companies are skipping the in-person sales work that enterprise customers actually want

References:
Ahead: https://www.ahead.com
Amazon: https://www.amazon.com
Anthropic: https://www.anthropic.com
Attio: https://www.attio.com
Augment Code: https://www.augmentcode.com/
Cognition: https://cognition.ai
Cursor: https://cursor.com
Dani McCabe: https://www.linkedin.com/in/danielle-mccabe/
Datadog: https://www.datadoghq.com
GitHub Copilot: https://github.com/features/copilot
HubSpot: https://www.hubspot.com
Jeremy Powers: https://www.linkedin.com/in/jeremypowers/
JPMorgan: https://www.jpmorgan.com
Matt McClernan: https://www.linkedin.com/in/mattmcclernan/
MongoDB: https://www.mongodb.com
Nicole Rettinger: https://www.linkedin.com/in/nicole-rettinger-23b20465/
Notion: https://www.notion.com
OpenAI: https://openai.com
Parag Agrawal: https://www.linkedin.com/in/paragagr/
Parallel: https://parallel.ai
Snowflake: https://www.snowflake.com
University of Chicago: https://www.uchicago.edu
Windsurf: https://windsurf.com

Where to find Graham:
LinkedIn: https://www.linkedin.com/in/grahammoreno/

Where to find Brett:
LinkedIn: https://www.linkedin.com/in/brett-berson-9986094/
Twitter/X: https://twitter.com/brettberson

Where to find First Round Capital:
Website: https://firstround.com/
First Round Review: https://review.firstround.com/
Twitter/X: https://twitter.com/firstround
YouTube: https://www.youtube.com/@FirstRoundCapital
This podcast on all platforms: https://review.firstround.com/podcast

Timestamps:
00:00 Introduction
00:32 Has the sales playbook changed in the AI era?
02:13 Why "showing up" beats letting the marketplace decide
06:50 Why great salespeople sell to engineers and executives in one motion
11:37 Selling to AI-native buyers who grew up on ChatGPT
13:49 Same seller, different tempo: 8 weeks vs. 8 business days
15:57 How AI-native buyers handle build vs. buy decisions
17:48 The rep who taught a champion's son guitar over Zoom
19:03 Raising the floor without capping the ceiling
22:09 Why too much process narrows the kind of seller you attract
25:46 The three pillars of GTM excellence
31:00 Building peers who are 80% aligned, not 100%
38:03 Whether AI is changing what good enablement looks like
41:35 Selling against direct and implied competitors at once
42:45 Instrumenting the funnel from stage zero to close
45:57 Why post-sales should always roll up to the revenue leader
48:19 The case for outsized commissions
52:02 The 96 hours of panic before Cognition acquired Windsurf
53:04 How far out should a GTM leader be planning?
57:53 What a normal week looks like in hypergrowth
Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!

This was recorded before Railway suffered a major GCP outage on May 19, despite being a multi-AZ, multi-zone mesh ring with HA fiber interconnects between their metal, GCP, and AWS, because workload discoverability was unintentionally still tied to GCP. All has been resolved with a post-mortem.

Railway did not start as an AI infrastructure company. It was founded in 2020, years before agents became the default way people thought about deploying software. Jake Cooper, formerly at Bloomberg and Uber, started Railway with a simple obsession: the activation energy to ship something to production should be near zero. Push code, get a URL, iterate. No Docker files, no Kubernetes manifests, no Ansible scripts stacked on Ansible scripts.

For years, this was a slow grind. Railway spent its first 18 months hand-acquiring its first 100 users, with Jake personally greeting every Discord signup on a second monitor.

Today, Railway has raised $124m and is growing very fast. A 35-person team supports 3 million users, adding roughly 100,000 signups a week. Their bare metal data centers have a 3-month payback period vs. renting in the cloud, with 70% margins funding aggressive cloud bursting when needed. The servers they own have actually appreciated in value as RAM prices have climbed, meaning that their cash in the bank plus the value of their hardware now exceeds the capital they've raised.

From rebuilding Railway's network overlay over a weekend to moving the vast majority of workloads onto its own bare metal data centers, Jake Cooper is trying to build a new cloud for an agent-native world.
In this episode, Railway's founder and “conductor” joins swyx and Alessio to unpack why the next era of software infrastructure is not just “Heroku but newer,” what agents need that humans did not, and why the old deployment loop of Git, PRs, CI/CD, and static cloud resources may be heading for a rewrite.We go deep on Railway's infrastructure stack: own-metal data centers, three-month cloud payback periods, cloud bursting, data center debt, Railpack, Nixpacks, Temporal, feature flags, Central Station, content-addressable filesystems, agent-safe production forks, and why the CLI may become more important than the canvas in an agent world. Jake also shares the founder journey behind Railway, how the company survived losing $500K/month, why it now serves millions of users with only 35 people, and why he believes the pull request is dying.We discuss:* How Railway went from a slow six-year grind to adding 100,000 users a week* How Railway thinks about agents as the next dominant software species* Why agents need version control, observability, compute, storage, and orchestration at 1000x scale* The economics of Railway's own-metal data centers and three-month payback* How Railway uses cloud bursting while scaling its own infrastructure* Why data center debt can be a better tool than venture debt for infra startups* Central Station, Railway's internal system for clustering customer feedback and incidents* Why responsible disclosure and over-communication matter for platforms* Why feature flags, progressive rollouts, and shadow traffic are essential for agents* Temporal's strengths, pain points, and why workflows matter for agents* Railpack, Nixpacks, Nix, and lazy-loaded content-addressable filesystems* Why “cattle, not pets” may change if you can clone the pets* Why Railway is building a new cloud from scratch instead of copying hyperscalers* The solo founder path, focus, writing, and how Jake thinks about company buildingRailway:* Website: https://railway.com/* X: 
https://x.com/RailwayJake Cooper:* LinkedIn: https://www.linkedin.com/in/thejakecooper/* X: https://x.com/JustJakeTimestamps00:00:00 Introduction: What Is Railway?00:02:07 Jake's Path to Railway00:06:13 Railway's Six-Year Growth Story00:08:52 Rebuilding the Business After the Free Tier00:11:17 Agents as the Next Software Platform00:13:29 Railway's Infrastructure Philosophy00:15:42 Bare Metal, Cloud Economics, and the Compute Crunch00:17:22 Cloud Bursting and Five-Cloud Networking00:20:20 Data Center Debt and Infra Financing00:23:31 Data Centers in Space00:25:24 What Agents Need From Infrastructure00:28:24 CLIs, Canvas, and Agent-Native UX00:35:15 Central Station, Incidents, and Responsible Disclosure00:40:30 Safe Rollouts, SRE Agents, and Production Forks00:45:00 AI SRE, Specs, Code, and Tests00:48:24 Self-Replicating Infrastructure and the New Serverless00:53:18 Heroku, Temporal, and Workflow Engines01:04:07 Railpack, Nixpacks, and Lazy-Loaded Filesystems01:06:01 Coding Agents, Token Spend, and Roadmap Acceleration01:10:56 The Pull Request Is Dying01:12:28 Feature Flags and the Agent-Era SDLC01:16:15 Cattle, Pets, and Cloning Machines01:19:29 Solo Founder Lessons01:24:12 Focus, GPUs, and Building a New Cloud01:28:20 Closing ThoughtsTranscriptAlessio [00:00:00]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space.Swyx [00:00:10]: Hey, hey, hey. Today we're in the studio with Jake Cooper of Railway.Alessio [00:00:14]: Conductor of Railway.Swyx [00:00:15]: Conductor at Railway. Yeah.Alessio [00:00:16]: Choo-choo.Swyx [00:00:17]: Do you actually have that anywhere, like on your business card?Jake [00:00:20]: We call some of our volunteer moderators conductors. I don't have a business card. We're not that big yet. At some point I will. 
I got handed a nice business card from the Supermicro folks, and I was like, “Damn, this is pretty official.”Swyx [00:00:30]: Business cards are coming back.Jake [00:00:32]: They're cool. They're hip. The conductor thing is good. We're trying to figure out what we want to call each other internally. Some people think it's super cringe and say, “You don't need a name for people internally.” Some people want to call each other something. We still don't have a really good one.Jake [00:00:55]: We've got New Railcrews, Trainiacs. Nothing has stuck yet.Swyx [00:01:00]: I like Trainiac. Trainiac sounds good. Railwayians. For those who don't know, what is Railway? Let's give people a crisp definition up front.Jake [00:01:09]: Railway is the easiest way to ship anything. You go to the canvas, or you talk with Claude, and you say, “Deploy a Postgres instance, deploy my GitHub repository, run this code,” and you're off to the races.Swyx [00:01:22]: You've got a nice animation on the landing page.Jake [00:01:24]: Thank you. None of my work, by the way. They don't let me touch the design stuff anymore.Jake [00:01:25]: We want to make it trivially easy not just to deploy things, but to evolve applications over time. Most tooling right now stacks entropy on top of entropy: Docker, Kubernetes, Ansible scripts, and all these other things. If we can version all of your software and keep track of all the changes, then we can make it trivial to clone environments, fork into a parallel universe, get copies of production data, get copies of any services, make changes, validate them, and collapse them back in without reproducing everything across a staging environment.The Railway Origin Story: From Uber Systems to a New CloudSwyx [00:02:07]: I was looking at your background: Bloomberg, Uber. Nothing immediately stands out as, “This guy is going to found the next great platform as a service.” What prepared you for Railway?Jake [00:02:21]: It was curiosity to keep going deeper. 
I started out on front-end stuff, working on Wolfram Mathematica and porting it over. Then I briefly moved to Bloomberg, then toward Uber and distributed systems, taking the Jump Bikes systems and moving them to a distributed system built on top of Cadence, the pre-Temporal Temporal.Swyx [00:02:44]: Which, by the way, I'm happy to talk about, pros and cons.Jake [00:02:48]: Totally.Swyx [00:02:51]: But let's do the Railway story.Jake [00:02:52]: It has been a continual step of wanting an experience. Whether it's walking up to a bike, unlocking it, and having it work frictionlessly, or something else, the depth required to make that happen follows from the experience. A lot of the work I do, and a lot of the team does, is in service of that experience. We fundamentally don't care how deep we have to go. We will swim to the bottom of the swimming pool to get the experience.Jake [00:03:17]: I don't have a physics PhD. I did an EECS degree. It has always been about figuring out the next step: how do we get there? That's what led to starting Railway for that experience and then moving all the way to bare metal data centers. I was adding patches to the kernel this week to get the experience there because I can see how much better it can be.Swyx [00:03:49]: Other patches to the Linux kernel this week?Jake [00:03:51]: Yeah. Not upstream. Our fork.Swyx [00:03:52]: That's a flex. Railpack? No, this is different. This is the OS on top of Railpack?Jake [00:03:57]: No, this is an actual kernel patch. It's always literally: what do we have to do to get that experience? Then figure it out. Anything is figureoutable.Swyx [00:04:10]: Would you send the patch upstream, or does it not fit other use cases?Jake [00:04:13]: Maybe. We have to work out the experience internally. It has to do with the storage layer we're building for some of the agentic stuff. 
Maybe it'll be useful upstream, but it's deeply useful for us internally.Open Source, Forks, and Non-Deterministic VersioningSwyx [00:04:29]: You mentioned open source before. How do you think about starting from open source, and then coding agents letting you do a lot more from forks of it?Jake [00:04:38]: GitHub's original sin is that it's almost a series of broken pointers. You have this thing, then you clone it, and now you've lost the whole upstream. How do we make it trivial for people to modify really small pieces of it?Jake [00:04:51]: We think of Git in a discrete sense: I've either made a change and merged upstream, or I haven't. What would it look like if it were percentage-based, a little more non-deterministic, or a stream of changes that users traverse as a percentage rolled out in general and then rolled all the way up?Jake [00:05:13]: We have the open-source kickback program and let you deploy templates because we want to make it trivial for people to version these shards over time. It solves a large problem around authentication, authorization, and security. NPM has a way to define, “Don't take any new packages.” The ideal end state is that you roll out progressively to users with the minimum impact zone and continue rolling up. JPMorgan should probably be the last one on the patch line, for all our sakes, because our money and livelihoods are there.Jake [00:05:53]: It's okay if Johnny Vibe Coder gets a broken patch because there's so much entropy in the system that the rubber has to meet the road at some point. You have to test at varying levels.The Long Grind: First Users, Free Tier, and Making the Business WorkSwyx [00:06:13]: I wanted to pull up this glorious chart, which is your usage or number of daily signups?Jake [00:06:22]: Daily signups, I think.Swyx [00:06:24]: You started six years ago. It was a slow grind, and now you're on a rocket ship. 
You say, “Don't doubt your fight and don't quit.” Maybe pick out certain points that were key inflections for the company.Jake [00:06:40]: At the start, it's about getting your first 100 users, hell or high water. We had a website and a support link. The support link was the Discord channel. I had notifications on with two monitors: the monitor I was working on and the other monitor with Discord. If anybody came in, I was immediately like, “Hey, how's it going?” It was rare, so getting those first 100 users to come back was the start.Jake [00:07:14]: Then you build a consultancy factory because users want all these things. You have to go back to the board and ask, “What is the actual product offering I want to build on top of this?”Jake [00:07:28]: VCs want charts that always go up and to the right, but in reality you don't necessarily want charts that look like that. For us, there have been periods of expansion where we add features to test use cases, and periods of compaction where we ask, “If the experience we have is good, how do we make it significantly better?” Maybe we strip out features that don't fit our ICP anymore.Jake [00:07:57]: The boom from 2022 to 2023 came from the free tier. Everybody under the sun was using it.Swyx [00:08:09]: A lot of Reddit bots and Discord bots.Jake [00:08:12]: And crypto miners. When you build an open product on the internet where anybody can sign up, the internet is a horrible place with so many things. You go through periods of asking, “How do I reach as many people as possible?” Then, “How do I fit the exact use case for the people who really matter and are really excited about this specific thing?”Jake [00:08:39]: Then there was a two-year period of making the actual business work. During the free-tier era, we were losing about half a million dollars a month.Swyx [00:08:59]: On a $20 million bank account.Jake [00:09:02]: On a $20 million bank account with maybe $50,000 a month in revenue. That's a horrible business. 
I don't know how anybody invested. But you have to go through it and say, “We have an experience people love, but the business has to work.”Jake [00:09:17]: There are two schools of thought. You can run the horrible business all the way up with bad margins, or you can go back and make it work. We've always wanted a super lean team. We're 35 people right now. It's very small.Swyx [00:09:36]: Supporting three million already?Jake [00:09:38]: Yeah. We're adding 100,000 users a week right now, so it's growing fast. We don't want to add headcount for the sake of headcount or throw bodies at problems. We want to build systems. It's hard to build systems during expansion because you're adding things to the system because people are asking for them or things are breaking.Jake [00:10:00]: We had to cut off the free users for a little while, rebuild the business, and make sure it worked. We want to reach as many people as possible because software is important. It's become difficult to create things in the physical world, so it's important to make it easy for people to build in the virtual world and have access to creation. But there are legs to that journey.Jake [00:10:30]: You can see divots in the charts. If you follow between 2025 and 2026, it's either summer or winter. People go on holiday with family.Swyx [00:10:50]: It affects that much?Jake [00:10:51]: Yeah. It's kind of B2C and kind of B2B. People are shipping constantly, then they stop. Our activation curve now shows more people activating on weekdays because we have more business users, so it smooths out over time.Agents as the New Interface to DeploymentSwyx [00:11:17]: Was there a point where you started prioritizing AI development or agent development?Jake [00:11:24]: We've prioritized agentic as a top-of-funnel thing. 
Over the last six months, we've deeply prioritized agentic as a mechanism to build and deploy things because we believe the curve is so steep and that is how people will build and deploy software.Jake [00:11:42]: It almost fundamentally doesn't matter whether this is dot-com or not because we're all on the internet anyway. If agents are going to deploy a bunch of things and we hit an inference wall at some point, we'll fix those problems. The dominant species over the next 10 years is that we've moved from assembly to C to C++ to JavaScript to words. You're going to need to close that loop.Swyx [00:12:13]: When you say this is dot-com, did you mean buying the domain, or the general case?Jake [00:12:17]: I mean the dot-com era, when companies had a huge run-up because people understood the internet was important. Then they hit bottlenecks, fundamental laws of physics, math didn't work, and everybody came back down to earth. But it didn't matter because the internet became so impactful. If you operate on a long enough time horizon, you should build these things anyway because you can see where it's going.Jake [00:12:45]: That's where I think a lot of agent stuff is. You get to a point where you're running thousands of agents in parallel. What is the inference cost? What is the compute cost? How do you make that efficient? How do you coordinate all this? We have issues coordinating humans; we don't even have good tooling for that. Now we have to figure out how to get agents to coordinate, safely version changes, and know when to raise their hand for someone to intervene. Otherwise it becomes an interrupt factory.Railway's Infrastructure Thesis: Network, Compute, Storage, and MetalSwyx [00:13:19]: Let's go right into the technical side. What are the core infrastructure or architectural beliefs of Railway that allow you to do what you do?Jake [00:13:29]: The primitives matter a lot for us. We need network, compute, storage, and orchestration around it. 
You need control over a lot of those things. We've talked a lot about how we don't really use Kubernetes because we want higher-order control to place workloads in very specific places.Jake [00:13:48]: The reason is that you have to be very efficient with agents: memory reuse and all these other things, or you're going to massively blow up your cost structure. Being able to rack and stack your own servers and build your own metal unlocks performance and cost. Experiences where you're running 1,000 agents in parallel are not massively cost prohibitive.Jake [00:14:13]: Token use and compute use are blowing up. Over time, those things have to get a lot more efficient. You can get a lot of margin to make those experiences solid by building your own metal. That's all in service of offering a differentiated experience to as many people as humanly possible.Swyx [00:14:51]: You have a data center in Singapore.Jake [00:14:53]: Yeah. We have two in every other region now. In Singapore, we're adding a second one in Q3.Swyx [00:14:58]: What's it like? I've never built a data center. Do you go to Equinix and say, “I want some slots?”Jake [00:15:05]: Yeah. Equinix. You basically go and say, “I want power and I want a cage.” They say, “Great, here's what it's going to be.” You rent the cage for a period of time, fill it with racks and servers, and hook up internet to it. That's all the pieces.Swyx [00:15:36]: Then you handle everything else.Jake [00:15:37]: You handle everything else.Swyx [00:15:39]: What's the math versus clouds doing it for you?Jake [00:15:43]: If we rented in the cloud, our payback period when we go to metal is about three months.Swyx [00:15:50]: Which is crazy.Jake [00:15:51]: It's nuts. That's four years of depreciated hardware. You're going to see a lot of this compute crunch because hyperscalers are buying up a lot of stuff. 
We're working directly with OEMs, resellers, and people building these machines: Supermicro, Dell, and others.

Jake [00:16:11]: Upstream, there's a bunch of supply pressure. When we raised our last round, between deploying capital for servers and now, the amount of money we've raised is less than the amount of money we have in the bank plus the value of the servers because the servers have appreciated as RAM has gone up. It's nuts how valuable hardware has become.

Jake [00:16:50]: If you look at hyperscalers, they deployed around $80 billion of capital expenditures this year, and next year will be more. That's a massive infrastructure build-out. You look at that and think it's crazy that they're spending way more than the Manhattan Project. But if every person is going to run dozens or hundreds of agents in parallel, you have no conceptual idea how much compute is required to make that experience happen, even if you're deeply efficient and sharing resources. And that doesn't even count inference.

Swyx [00:17:22]: How do you plan the build-out? The growth chart is so vertical. Are you usually at 100% utilization as soon as racks are live? How far ahead are you planning?

Jake [00:17:33]: We still maintain cloud presence for bursting. We work with AWS, GCP, and a few other clouds. We can rent, and then the moment we get space or power, we compact those workloads off the cloud. We started on the clouds, then built a system to migrate to our own metal. There's nothing that says you can't continually do that again, and that's exactly what we do. We never want to be compute constrained.

Jake [00:18:09]: At the start of the year, we actually became compute constrained because one upstream provider wasn't able to give us quota at the rate we needed, and the hardware was slower. I spent a weekend rebuilding our entire network overlay so we could straddle five clouds: Oracle, AWS, ourselves, GCP, and one other one. 
We can do more than that now.

Jake [00:18:38]: We got into a spot where we were trying to pack instances tight because we couldn't get enough compute. That led to a few reliability issues, which are now past us. I made a tweet pointing out that it's becoming harder and harder to acquire compute at the rate these models need to acquire compute. We got bit by it.

Swyx [00:19:15]: How do you think about pricing knowing you might not have your own metal available at all times? Are you pricing assuming you need extra margin if you end up going into the cloud?

Jake [00:19:26]: Because we've built out our metal data centers, our margins on metal are around 70%. We can deeply subsidize the cloud business if we want to scale at a reasonable rate. We have a few levers: metal, which makes the margins; cloud burst; debt to buy servers; and venture capital. It's an interesting operational problem: how much cash do we have, how much should we raise, how quickly can we deploy it, and can we scale revenue as quickly as we scale compute?

Jake [00:20:05]: If we continue making it trivially easy for people to build and deploy, then the faster we close that loop and the more operationally excellent we are with capital, the faster the business can scale. It's almost a straight linear deployment rate.

Financing Infrastructure: Hardware Debt, VC, and Operational Leverage

Swyx [00:20:20]: I think infra startups raising debt is a tool people don't utilize enough or know enough about. What can you tell us about that? Is it secured against your CPUs?

Jake [00:20:32]: It's secured against our hardware.

Swyx [00:20:37]: What rates do you get? Who are the lenders?

Jake [00:20:39]: We pay prime plus a spread, and we can refinance any of the debt as rates go down. The terms are pretty good. The unfortunate thing is that Twitter has no nuance, so people say, “Venture debt bad.” But as with all things, there are specific tools and areas where you can be deliberate instead of using one tool as a hammer. 
Venture capital is not the hammer for everything. You have to explore and figure out what works.

Swyx [00:21:12]: VC is usually the most expensive financing you can get.

Jake [00:21:15]: Yeah. I also think people think about VC incorrectly from a capital-raising perspective. Most people think, “How do I raise as much money as possible from whoever is probably the best I can get at that time?” That's close to right, but what we've tried to do is figure out what unfair advantage we can buy with that equity.

Jake [00:21:34]: It's the most expensive equity you're going to give away at that point in time, assuming the company keeps getting better. How do you use it to work with someone stellar who complements you? In the seed stage, I had never started a company. Ray Tonsing had good advice, and I could text him all the time. He was really fast. Awesome.

Jake [00:22:01]: Then with John and Erica at Unusual, they said, “You roughly know what you're doing building a product. We'll mostly leave you alone and be available for advice.” Amazing. Then we got to Series A and the business was an operational tire fire because we didn't know how to scale a business. Work with Erica, and Jordan is over at Redpoint, so bonus.

Jake [00:22:28]: Now we've raised from TQ and FPV as we're moving into enterprises. Every step of the way, we've asked: who can we partner with at this specific time to unlock the next section of the journey? I don't know enterprise sales. As an engineer, I can eyeball what features we might need, and we have wonderful people internally who can help. But you want boardroom dynamics where everyone is aligned and asking, “How do we win this?” instead of bickering about strategy.

Data Centers in Space and the Physics of Compute

Swyx [00:23:31]: You had a tweet about data centers in space. Why no data centers in space?

Jake [00:23:37]: It's not “no data centers in space.” My hot take is that I think it is solvable. 
I've just never seen anybody solve it.

Swyx [00:23:49]: You said, “How are you going to dissipate that much heat in a vacuum?” You're making a physics claim.

Jake [00:23:55]: I haven't seen anybody prove how you're going to dissipate that much heat in a vacuum. It doesn't mean it's not possible. It just means nobody has brought it up yet.

Swyx [00:24:05]: Astrophage.

Jake [00:24:06]: I don't know what that is.

Swyx [00:24:07]: The Martian thing. Okay, you're very logical.

Jake [00:24:09]: It could work. A lot of people are putting the cart before the horse. They say, “We're going to put data centers in space.” Okay, but how? “We have time to figure it out.” It's like in The Martian where they ask how they're going to intercept something and say, “We'll figure it out.”

Swyx [00:24:36]: Making a bet on human invention is weird because you blindly trust that it can be solved. But with physics, there are first-principles bounds you can put on it. Maybe not. Maybe you're asking to time travel or break a fundamental thermodynamic law.

Jake [00:24:57]: I don't know how VCs do this either. How do you know what's not possible and a grift versus what's possible but sounds completely insane? “We're going to put data centers in space.” Coin flip as to which it is, and I guess you'll know in 10 years. That's one cycle.

What Agents Need: Versioning, Observability, and 1,000x Scale

Swyx [00:25:23]: Moving back to agents. The branching, fast spin-up, and orchestration you do feels like pre-work that happened to be exactly what agents want. What do agents want differently than humans?

Jake [00:25:37]: They want the ability to version things. It's not that different; it materializes slightly differently. Agents want a way to test changes incrementally. Engineers have feature flags. Is there a reason agents can't use feature flags? I don't think so.

Jake [00:25:54]: They want version control. Can we use Git or not Git? That one is up in the air. 
I think something outside Git will emerge for how we version these things over time. They need observability. You need to query what happened, when it happened, which steps failed, traces, logs, metrics, and all the rest. They need network, compute, and storage. They need to write files, save files, iterate on files, and snapshot file systems.

Jake [00:26:25]: A lot of what humans needed is in line with what agents need. Branching and forking are not different; we're just moving 1,000 times quicker. It can look like you need something massively different, but what you need is something massively better than what existed. You need orchestration massively better than Kubernetes. You need networking probably better than Envoy. It goes all the way down the stack.

Jake [00:26:55]: If the workload profile doesn't change so much as it gets massively compressed because you need thousands of these things, what assumptions change? etcd is going to melt. You need to replace it with something. You can go all the way down the stack and say, “That part has to change, that part has to change, and that part has to change.”

Jake [00:27:19]: The interesting thing about the super-exponential curve is that you have to build systems where you can rip out those parts at any time because a new bottleneck might emerge. You get good at parallel agents, and a different part of the system breaks. So it's similar to what humans needed, but at 1,000x scale.

Jake [00:27:55]: How do you do code review in the age of agents?

Swyx [00:28:00]: You throw more agents at it.

Jake [00:28:01]: You don't. But then who reviews for CVEs and all these other things?

Swyx [00:28:07]: More agents.

Jake [00:28:08]: And that's how we hit the inference wall. You can continually throw agents at the problem, but I think there's a limit to the number of agents you can throw at a problem.

CLI, Agent Handles, and Closing the Loop

Swyx [00:28:24]: You already had a CLI before it was cool. 
How is the shape of what you're exposing changing, if at all?

Jake [00:28:28]: CLIs have always been cool. The CLI changes because we think about how to give Claude, Codex, ChatGPT, or any model a handhold.

Jake [00:28:50]: A CLI is a single command: deploy, get logs, and so on. Things that were prohibitively annoying to humans are not annoying to agents. They're nice. If I handed you a CLI with 40 arguments and 600 flags, you'd think, “I'm never going to use all of this.” But if you hand it to an agent, it says, “This is excellent. I have so many handles to work with.”

Jake [00:29:24]: If you're going to expose things to agents that way, you want as many handles as possible where they can get information, query dynamic information, and close the loop quickly. Most problems right now are about how to close the loop as quickly as possible. Where does the agent get stuck, and how can you remove that?

Jake [00:29:49]: Telemetry is important. If you can tell where the agent gets stuck from the CLI and say, “12% of people deviate from the happy path because of this, and now I add this argument and drive it down to 2%,” you massively increase the rate of loop closure.

Jake [00:30:03]: That's how we think about not just the CLI, but every point in the dashboard. It's a user journey: I hear about Railway. I get something deployed. I get my first green build or aha moment. I see an endpoint, logs, whatever. Then I iterate. The iteration loop is indefinite. The user wants to deploy a new thing, a Postgres instance, change code, and keep iterating.

Jake [00:30:36]: If you focus on the iteration loops and what's blocking them from closing quickly, one thing we say internally is: you never want to be waiting on compute anymore. You always want to be waiting on intelligence. 
If you're waiting on compute, there's a bottleneck that needs to be destroyed because eventually that bottleneck becomes so large that another workflow emerges to change it.

Jake [00:31:04]: We've built a product where you push code, build it, and so on. But I fundamentally believe the push-pull loop is going away. We'll get to a point where you make a small change in production, that change is versioned across your infrastructure, you're working alongside copy-on-write versions of your database and infrastructure, and then you merge it in and it's instantaneously live. That's the holy grail of loops. The push-pull-rebuild thing is a point of friction that we're removing entirely.

Canvas as Output: Dashboards, Context Anchors, and Hyperstructures

Swyx [00:31:43]: It's incredibly fast. If anyone hasn't tried it, that fast feedback is great. My hot take is that Railway was famous for its canvas, which visualizes your infrastructure and lets you manipulate it visually. But that was for humans. For the next phase of growth, Railway CLI is more important than canvas.

Jake [00:32:05]: The canvas is funny because it's a mechanism to show changes over time. You're right that previously we used it a lot as an input. Moving forward, its goal is more like an output. You would go to the canvas, make changes, see them, and watch your infrastructure evolve. Now agents have access to the CLI and can make those changes. So the canvas becomes an output: what information does the human need at this moment to make suitable decisions about control requests? Do I approve this or not?

Jake [00:32:57]: It also has to be an anchor for your context, a port in the storm. Think of it like layers in a file system. You start with a project, then drill down into services, then into a function or code, because you want to represent the entire thing not just in your head, but in the canvas. 
Other people can share that representation, think on the same wavelength, and move quickly.

Jake [00:33:33]: A lot of organizations get in trouble as they scale because all the context lives in someone's head. “How does this microservice work?” “I have no idea; go ask this person.” Then you have whole categories of products built around context discovery. A lot of that melts away if you have a solid hierarchy and can infinitely nest services, code, context, and everything else all the way down. That's what lets you build these structures over time.

Jake [00:34:18]: It's also what lets us build what I've called hyperstructures: things that are way bigger. You look at the Golden Gate Bridge and ask, “How did we build that?” There's a meme that we lost the technology. To some extent, yes, because the coordination that built those things evolved and changed. We lost some of the art of building structure as we jammed everything into Slack.

Swyx [00:34:52]: But you jam everything in Discord.

Jake [00:34:53]: Same point. It doesn't matter. It's message passing and interrupts, message passing and interrupts.

Swyx [00:35:00]: So you're arguing there should be something better and more structured than Slack?

Jake [00:35:04]: Yeah. For sure. I think Slack is awful, and Discord is awful too.

Central Station: Context Routing, Support, and Incident Clusters

Swyx [00:35:09]: This is the equivalent of my mom test. What have you built as your solution to this?

Jake [00:35:15]: Internally, we've built a tool called Central Station that aggregates all the context from our users. Every piece of feedback, every customer support item, everything gets aggregated into clusters. If an incident is brewing, we can determine how many users are affected and break off a discussion based on that.

Jake [00:35:40]: That is more helpful than long-running channels where you're trying to decide which channel to put something in. 
If you can dynamically aggregate information and dynamically route it to the right person based on context, it works better. We know internally that these four people are close to networking. If we see a networking thing, we can drill it down to those four people. If it's with this part, we can look at the commits. This is no longer a manual process internally.

Jake [00:36:13]: If you go to station or help.railway.com, that's why we built it. We wanted to scale with a massive amount of leverage by aggregating feedback.

Swyx [00:36:27]: This is built in-house?

Jake [00:36:28]: Yep.

Swyx [00:36:29]: I remember helping out on this one with Angelo in 2023. You scale a lot with a very small team.

Jake [00:36:38]: Yeah. We're about 10 times bigger now.

Swyx [00:36:40]: You have your full developer code here? Very cool.

Jake [00:36:44]: If you go to railway.com/stats, we expose this as a pub-sub-able thing. It's all real-time metrics. There's a way to get it as JSON somewhere if you care.

Jake [00:37:01]: We're big on trying to build everything in public and talk about what we're working on. We've had issues in the past, and we'll say, “Here's how we're fixing these things.” We've gotten compliments and flak for incident reports. We're always trying to make them better and talk with people.

Incidents, Disclosure, and Progressive Rollouts

Swyx [00:37:20]: You had a big one recently. I liked that it was scoped to 3,000. You presumably used Central Station. Talk through what happened and how you address it internally as a team.

Jake [00:37:38]: Internally, this one really sucked. It had to do with an upstream provider that didn't implement the behavior it had documented, which is unfortunate given they wrote the RFC for how the behavior should work. We rolled those things out, and Central Station caught it initially when a couple of users said caches weren't invalidating. 
We turned it off immediately.

Jake [00:38:03]: When you roll out to a large user base of three million people, you get a lot of disparate behaviors. We tested in staging and had tests, but we hit an edge case. We've hardened those systems, and now we can make that better. But it was a tough one.

Swyx [00:38:39]: I always wonder how private disclosure is supposed to work if people find an issue. Are they supposed to contact you first? When you run a platform, these things will happen. What channels should people pursue to quietly resolve it before it becomes a bigger incident?

Jake [00:38:59]: There's responsible disclosure. We err on the side of over-disclosing and letting you know something is wrong versus having your provider gaslight you. We've erred on the side of sharing those things more publicly, even if they impact a small subset of users. That's a decision we've made internally. We have four values. One is honor. The honorable thing is to notify people to the widest degree at which they may have been affected or there was an issue, and then confront it head-on: why did it happen, what can we do better?

Swyx [00:39:45]: Not the whole user base. That's because of incremental rollouts and other things?

Jake [00:39:50]: Yeah. Progressive rollouts.

Swyx [00:39:54]: That should be the norm at all large platforms.

Jake [00:39:58]: It should. A variety of companies do this. There's the quote that Meta runs 10,000 different versions of Meta. To our earlier point about agents, they need the same thing. They need shadow traffic and all these other things. We've built so much ceremony around production being sacred that we need to make it trivially easy to test different behaviors in a safe environment. Then you can make mistakes in a safe environment.

Safe AI SRE: Customer Agents, Forked Environments, and Production Parity

Alessio [00:40:30]: Do you see a world where these things get automatically caught, not necessarily by your agent, but by your customer's agent? 
The cache invalidation issue seems easy to check if you know to look for it.

Jake [00:40:44]: It's hard because to determine it, we almost need to hook into your observability infrastructure. That's why we have the template loop on the platform: so you can roll things out progressively. You can roll out to Johnny Vibe Coder initially, or push a shard that someone consumes at their own leisure. Or you can roll it out over weeks: 0.1% of people, 1% of people, early adopters, then all the way up. That's the non-deterministic version control we talked about earlier.

Jake [00:41:30]: I believe that's where most things should go, because most companies end up building staged rollout systems in-house. It's the same thing built again and again at every company. There's a massive opportunity to consolidate developer debt.

Alessio [00:41:45]: You should have a free tier. Model providers give free tokens if you let them use the data. You could give free compute if someone is the number-one shard that goes out and lets you plug into their observability.

Jake [00:41:55]: We do that. That's why we talked about the impact on 3,000 people. We start with lower-impact people. Larger companies on the platform are last to receive those rollouts so they have a version of the platform that's deeply stable.

Alessio [00:42:16]: I have three services, so I'm sure I get the first rollout. You can nuke my thing at any time. There are all these SRE agent companies. Observability people also want agents that fix upstream problems. You have your own agent in the canvas now. How do you see that playing out?

Jake [00:42:39]: It's the stacking entropy problem. If you don't have primitives to make iteration in production safe, it becomes difficult. If you're an observability provider saying, “Here's the fix to this error,” assume 80% are good and make sense. 
But in the last 20% long tail of complex issues, if you let somebody stamp it, you create an opportunity for an incident.

Jake [00:43:08]: That's why forked environments are important. People have staging, but it always drifts from production. You need primitives, workflows, and experience built first-party on the platform so you can fork any service at any point in time.

Jake [00:43:33]: I think of the canvas as a sheet of transparency paper. The agent is a little guy you push up into the canvas. It should say, “I need to copy that service and that service so I can test these two things.” It gets a read-only copy of production. Anything that's PII gets marked as a transform when we clone the database, create a copy-on-write version, or read from it. Then the agent makes changes and asks, “Does this actually work?” as close to production as possible.

Jake [00:44:22]: That's how close you have to be, or you get massive drift. The system becomes unstable. You see this with massive systems built on Docker for local, Kubernetes for production, and a specific thing for something else. That complexity slows developers and becomes unstable at scale, making it hard to iterate. We want to compress that way down and say, “As close to prod as possible is where we want to be.”

From AI SRE Skeptic to Agent Believer

Swyx [00:45:00]: I was texting Erica for questions, and she says you were originally not a believer in AI SRE. Have you come around on it?

Jake [00:45:10]: I flipped, but I'm still not a believer in AI SRE if you don't have the primitives to make it safe. If you unleash AI SRE on production infrastructure without safe primitives for copying volumes and making sure things are fine, it's going to nuke your production database. It's not a matter of if, but when. I'm a big believer in making those loops safe.

Jake [00:45:33]: I was a deep AI skeptic until 2023. 
In 2024, I thought, “Maybe I can roughly make this thing do it.” In 2025, I thought, “Now I can hold this.” Over winter break, everybody came back saying, “It's almost impossible to hold this.”

Swyx [00:46:01]: Did you see this on the Claude docs? Clawdbot? OpenClaw?

Jake [00:46:06]: It's gotten to a point where it's harder to hold it wrong than to hold it right. There's a scene in Avengers where Vision picks up Thor's hammer and says it's terribly well-balanced. It self-balances and works well. I'm a deep believer at this point that this will be the dominant species: assembly, C, C++, JavaScript, words.

Swyx [00:46:35]: It feels like a big jump.

Jake [00:46:37]: It is. But it's not like you abandon CPU-based discrete logic and move straight to fuzzy logic. You need both. Your skills should call code or applications or some static structure. You can use skills to distill what the procedure should be or how the code should act.

Jake [00:47:02]: I'm coming to a thesis: you need three points. You need a clear spec defining the system, the code, and the tests. When you say it out loud, if you've been in engineering long enough, you're like, “Of course. That's an RFC, tests, and code.” But they all matter. Having them together lets them reinforce each other: the spec and tests match, but the code doesn't, so reconcile it. Or the tests and code match but the spec doesn't, so reconcile that. That's the iteration loop.

Jake [00:47:41]: That's why you're seeing people talk about software factories, docs, and reconciliation. Some of that is architectural astronomy if you don't implement it, but that loop is where most things will end up.

Swyx [00:48:07]: For listeners, we've been talking about this on the pod for three years: the holy trinity of specs, code, and tests. Itamar Friedman from Qodo is the reference if people want to look it up.

Self-Modifying Infrastructure and the End of Push-Pull-Rebuild

Swyx [00:48:18]: One thing I want to mention on the OpenClaw idea is self-modification. 
I don't know how Railway would support it, but I have my OpenClaw, and I just tell it it has the Railway CLI and can do whatever. In theory, whatever capabilities or new infra it needs, it can call the Railway CLI, provision it, and add it to itself. The agent can modify its own infra.

Jake [00:48:45]: It's nuts. I have a loop set up where you put the Railway CLI on top of something that runs on Railway. You're authenticated as whatever the current box is, and you can make any changes to it. Then you call Railway deploy, and it deploys itself.

Jake [00:49:04]: It's like: “I need to spin up this instance of this environment. I already exist in this environment. Excellent, I have access to a Postgres instance now.” That's where we want to go with agentic, self-replicating infrastructure. That's your loop: iterate in production. You continue making changes. If it works, merge it upstream. If it doesn't, throw it away.

Jake [00:49:37]: How do you make throwaway copies trivial to spin up and super cheap? The era of “I have an AWS instance with four vCPU and 16 gigs of RAM” is going to get destroyed. If you do that for agents, you need a thousand of those machines. It's prohibitively expensive compared with what we've spent a ton of time figuring out: the atomic unit of deploy, whether you call it isolates, sandboxes, or something else. Only pay for what you use, spin up instantaneously, and close the loop as quickly as possible.

Jake [00:50:15]: If the system can self-replicate safely and say, “This is my environment, I'm making these changes,” it can come back with, “Does this look good? This is a new state of infrastructure given this prompt. I think I've solved it.” Then you go back and say, “Actually, it looks different.” It does the loop again. Then you say, “Cool. Apply.”

Swyx [00:50:38]: That's retroactively obvious, which is the most useful kind. Any other comments on agent deployment on Railway?

Jake [00:50:51]: It's getting better every day. I'm on X or Twitter. 
You can always yell at me about the parts not working as well as they should, because plenty of things should work way better.

The New Serverless: Stateful, Long-Running, Pay-for-What-You-Use Linux

Swyx [00:51:04]: At this stage, when people want massively or embarrassingly parallel compute, they usually talk serverless. I feel like there's a new serverless compared to the previous five years of serverless. You're in that new bucket. Do you have comparisons or philosophical differences you want to call out?

Jake [00:51:31]: It's somewhere in between. It's the ability to run stateful, long-running workflows or executions.

Swyx [00:51:42]: Vercel has Fluid Compute, Cloudflare has some container thing, Google has App Runner and others.

Jake [00:51:55]: That's where everything is roughly going, and it's why we've been working on this for six years. We believe users need access to a computer: a box that speaks Linux. They need to deploy what they want. Other systems change the surface area of what you can build. For us, users need a computer and need to deploy anything they truly want. That's why we've focused on the primitives: network, compute, storage. If we give you those and expose them so you can run things indefinitely, that's where we believe it's going.

Jake [00:52:43]: Twitter has no nuance, so everyone says “servers” or “serverless.” It's always somewhere in the middle: I want to run it for a long time, but I don't want to provision the resource statically or pay for things I'm not using. That's been our thesis from day one: pay only for what you use, run it indefinitely, and it is full Linux.

Swyx [00:53:12]: That's why I like the naming of Fluid. It's fluid. Flexible.

Heroku, Focus, and Carrying the Torch Without Becoming the Past

Swyx [00:53:18]: Another milestone is the Heroku official deprecation. You're one of the presumptive new Herokus. “New Heroku” has been a category for as long as I've been in developer tooling. It's finally happening. What was that like? 
Any behind-the-scenes of, “This is the moment”?

Jake [00:53:42]: You have people where you're like, “You were running stuff on here? You, as this company?” It's crazy that names you would know are running on it and now coming to us saying, “We want to move a lot of this off.”

Swyx [00:54:00]: Any behind-the-scenes on why Salesforce let Heroku stagnate?

Jake [00:54:05]: I can only guess. It's hard when it's not your business. Salesforce's business is to build a great CRM. That's their focus. Then you acquire a compute business as an offshoot. A lot of early Meta people talk about focus. Boz has a write-up about how in the early days of Meta they had no money, so they were forced to focus. Then they turned on the money tree and had no reason not to split their focus.

Jake [00:54:52]: But that dilutes your product. You get offshoots where you ask, “Is this the focus of the business?” If it's not core, it languishes. A lot of companies get in trouble when they split focus because they're fighting a multi-front war, not just externally but internally for alignment. Where are we going? What are we doing? What is our purpose?

Jake [00:55:24]: If you're Salesforce-built and mission-driven, you want to work on Salesforce. Heroku is off to the side. It's not core to the business. Getting resources, budget, focus, and alignment internally becomes hard. It was a matter of time.

Swyx [00:56:06]: Kudos to them for calling it out instead of leaving it unknown.

Jake [00:56:12]: Their release was a little odd. They called it out, but they didn't say they were shutting it down. Behind the scenes, I think they issued messages to people saying they should close accounts and that they were going to deprecate and remove things over time.

Jake [00:56:30]: It's crazy because some of my first deployment experiences were on Heroku. You start with dragging things into an FTP server, then you try to get a deploy working, and then it's Heroku. It was the on-ramp for us. But the wheel turns. 
New things emerge. We're happy to carry the torch for a lot of that. But we don't want to be the new Heroku. We want to be the way people build and deploy software, and ultimately the way people monetize software over time.

Swyx [00:57:19]: It's still a big crown to be the new Heroku. There are 50 companies that fought for that.

Jake [00:57:23]: Everybody is holding some portion of it. We're happy to support people and companies. The platform works differently. The game loop is similar, but we've been dogmatic about where these things are going: primitives, agents, fan-out. Some things fit; some workflows need to change. We have an approximation of Heroku pipelines with the environment system. It's exciting. We've got a ton of people we can support, and it's growing a lot.

Temporal, Workflow Engines, and State Machines

Swyx [00:58:12]: I have one more technical question about Temporal. I've sold my shares. You're a power user and one of our earliest customers. I met you through Temporal. You built on Temporal. You have complaints. This may be the most neutral and informed conversation anyone will hear about Temporal without someone working at the company.

Jake [00:58:39]: That's fair. I've used Temporal for almost 10 years because of Cadence at Uber.

Swyx [00:58:52]: Give people a sense of what Cadence was at Uber.

Jake [00:58:57]: Cadence was the precursor to Temporal. It powers trip actions, rides, when you rent a Jump bike or scooter or car. You're running workflows for a period of time and saying, “This ride will run indefinitely until it finishes.” You attach information: you paused in this zone, so add this charge to the bill. When you end the trip, the workflow is done. That experience was powered by Cadence at the time.

Swyx [00:59:34]: I used to say it's like programming the entire user journey top-down as one function.

Jake [00:59:39]: It's a powerful idea and important. It's also important for the next phase of the agentic journey. 
You want an agent to do a specific task, be complete or incomplete on that task, and move on to the next thing. You need a way to manage workflows dynamically.

Jake [00:59:59]: Temporal was always great in theory, and great when you got it working the way you wanted in production. But it required you to model the entire journey in your head. If you didn't, you could cause issues where replaying the state of the workflow causes non-determinism.

Swyx [01:00:25]: Because it works on deterministic workflow history.

Jake [01:00:28]: Exactly. I describe it as a jet engine. If you know how to operate it and run it, it's great. But you can't hand it to people trying to build complicated things if they don't have the whole state in their head.

Jake [01:00:48]: We run our whole deployment pipeline on top of it. That's a reasonably complicated workflow: pre-commit hooks, signaling, queuing, and all the rest. We ran into the same thing at Uber. As you express a large workflow, it gets more complicated, with more states in the state machine that you have to map back to the workflow.

Swyx [01:01:15]: It's a lot of ifs.

Jake [01:01:16]: Exactly. At Uber, we built a system for doing the state machine and testing it. We've started to build some of those things here because it's grown heavily. It's not quite love-hate. When it works well, it works super well. But if someone who doesn't have full context puts something into the system that invalidates state or causes non-determinism, or spins off a ton of activities, you have to keep track of underlying SRE knobs like activity slots. Those should scale with memory, vCPU, and so on. It becomes a bear to scale.

Swyx [01:02:10]: You need a capable sysadmin running things behind the scenes. If you moved off, what would you do?

Jake [01:02:19]: We'd build our own workflow engine.
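
The replay problem described above (workflow code that no longer matches its recorded history) can be illustrated with a minimal sketch. This is not Temporal's or Cadence's API. `WorkflowContext`, `run_activity`, and `trip_workflow` are hypothetical names invented for illustration, and the sketch assumes history is a simple in-memory list of activity names and results.

```python
from dataclasses import dataclass, field


class NonDeterminismError(Exception):
    """Raised when replayed workflow code diverges from recorded history."""


@dataclass
class WorkflowContext:
    # Hypothetical in-memory history: a list of (activity_name, result) events.
    history: list = field(default_factory=list)
    cursor: int = 0

    def run_activity(self, name, fn):
        # During replay, return the recorded result instead of re-running fn.
        if self.cursor < len(self.history):
            recorded_name, result = self.history[self.cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"step {self.cursor}: history has {recorded_name!r}, "
                    f"code asked for {name!r}"
                )
            self.cursor += 1
            return result
        # Past the end of history: execute for real and record the event.
        result = fn()
        self.history.append((name, result))
        self.cursor += 1
        return result


def trip_workflow(ctx, charge_pause_fee):
    # The whole trip is written top-down as one function.
    ctx.run_activity("start_trip", lambda: "started")
    if charge_pause_fee:
        ctx.run_activity("add_pause_fee", lambda: 2.50)
    return ctx.run_activity("end_trip", lambda: "ended")
```

Running `trip_workflow` once records three events. Replaying against that history with the same branch taken reuses the recorded results, while replaying with `charge_pause_fee=False` (standing in for a code change) raises `NonDeterminismError`, because the second recorded event no longer matches what the code asks for.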
We have a few internally that we've worked on.

Swyx [01:02:27]: This is one of those classes of things you typically wouldn't vibe code, but I'm wondering if you can.

Jake [01:02:33]: I still don't think you should vibe code it. You still want to run decent tests to make sure it works.

Swyx [01:02:39]: Timo didn't invent that from scratch either. There are libraries you can run. On top of that, it's just a state machine that you have to map out. Ultimately, you define the instructions you want and run them through a state machine.

Jake [01:03:00]: It's very doable. Workflow stuff is interesting. Restate is doing neat stuff here.

Swyx [01:03:10]: You're tied into JavaScript. Are you a JavaScript maxi?

Jake [01:03:13]: Internally, we have TypeScript, Rust, and Go. We don't add more languages. Actually, we have a little C because we write BPF code and hooks. But those are the languages.

Swyx [01:03:28]: Is this for sidecars?

Jake [01:03:32]: No. It's for the networking stack, volumes, and things like that. We use TypeScript a lot because it powers the dashboard, but we're moving a lot of workflow stuff off the dashboard stack and into the infrastructure stack.

Railpack, Nixpacks, and Content-Addressable Filesystems

Swyx [01:04:00]: Cool. Any other technical infrastructure stuff? Railpacks?

Jake [01:04:07]: We built an engine for determining dependencies based on source code. It's called Railpack. We built the first version, Nixpacks, on top of Nix, and then we moved.

Swyx [01:04:17]: People have been trying to get me to adopt Nix and NixOS for four years. Is it ever going to be a thing?

Jake [01:04:23]: I don't know. We're excited about it, but it has pain points. Think of it as a stack of versioned binaries at specific slices in time. If you want version X and version Y, you bloat the package space, which blows up image size and makes real-world workloads difficult.

Swyx [01:04:53]: But you content-address it and cache it.
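
The content-addressing idea mentioned here can be sketched in a few lines. This is an illustration only, not how Nix, Nixpacks, or Railpack store anything. `ContentStore` and `store_tree` are hypothetical names, and the sketch assumes blobs fit in memory and are keyed by their SHA-256 digest.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store that keys every blob by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


def store_tree(store: ContentStore, files: dict[str, bytes]) -> dict[str, str]:
    # A "tree" maps each path to the digest of its content.
    return {path: store.put(data) for path, data in files.items()}
```

Two package versions that share a file then point at the same digest, which is the deduplication and caching benefit Swyx refers to. The sketch says nothing about the at-scale problems Jake raises next.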
In theory, there are optimizations.

Jake [01:05:00]: In theory, yes. But with a large enough user base and disparate enough machines, you run into a problem Meta described in the XFAAS paper, their internal serverless system. It becomes difficult at scale unless you break out specific runtimes.

Jake [01:05:24]: We didn't want to do that because we wanted to truly allow you to deploy anything. That was our initial thing with Nix. But we've moved toward interesting work around content-addressable file systems that can lazy-load anything from any point and page it into memory.

Swyx [01:05:48]: Amazing.

Jake [01:05:49]: The future is very bright. It's crazy, and it's going to be nuts.

Coding Agent Spend, Roadmaps, and Token ROI

Swyx [01:05:54]: Founder journey stuff?

Alessio [01:05:56]: Your cloud usage: you tweeted you're going to spend $300K this month?

Jake [01:06:01]: I think we got to $200K.

Alessio [01:06:02]: Coding agents?

Jake [01:06:03]: Yeah.

Swyx [01:06:04]: Across the company?

Alessio [01:06:05]: You only have 35 people, so I'm sure they're not all spending $10K a month. What's the distribution?

Jake [01:06:10]: I think I'm at about $25K. We have power users all the way down. We came back from winter break, and I basically said, “If you're writing code by hand, you're doing this wrong.” The tools are good enough now that you can move extremely quickly. There are issues and pain points, but you should be reviewing the code you are writing instead of writing it by hand.

Jake [01:06:40]: Architectural patterns matter more now than ever, but you shouldn't spend your time generating code you would write. If you know how to write it, ask the agent to write it and reconcile it until it looks like you would have written it yourself.

Jake [01:06:58]: People misconstrue my propensity to push people toward agents as connected to our growth and some reliability bumps. They're not necessarily related.
The tools are good enough to move extremely quickly and build things way larger than you could before.

Jake [01:07:19]: To the earlier point about cooling data centers in space: I don't know. But with software, you can ask, “How would I build block storage from scratch? How would I do these things?” I have ideas because I have history and have read papers. Let me work them out and build massive test benches with thousands of tests, because those are now free to author. If you're not using AI systems to speed-run your roadmap and reconcile your existing system onto the future, you're missing a large point of what's happening.

Alessio [01:08:12]: What's the path to spending $3 million a month? Is it bound by ideas and things customers can absorb?

Jake [01:08:19]: For most companies, it's bound by deployment at this point. That's why we've seen a massive boom in users and companies, from Fortune 50s down, asking how to get developers to move faster. You'll probably hit your CFO before any technical limits because they'll look at the eye-watering amount of money spent on tokens. Inference costs have to come down, but we're inference constrained now. There will be price discovery around what makes sense for an org to adopt.

Jake [01:09:06]: I think you'll end up with the F1 driver concept. If someone is really adept at these things, it makes sense to put them in a $3 million car. If they're not, it probably doesn't make sense. You'll take a few people and say, “You can drive the F1 car. We need to go in this direction. Figure out if it works and prototype it.”

Jake [01:09:33]: We've done some of that and vastly accelerated our roadmap. We thought we'd ship something in a few years; now we can probably ship it in a few months because we validated it and don't have to build it incrementally. We can skip steps and move toward our vision.

Alessio [01:09:58]: A lot of people are realizing the roadmap doesn't always have a business impact, so they say tokens are too expensive.
But if your roadmap were built to make more money by the time you built it, you'd have token pricing for it, the same way you do with sales. You'd spend a billion dollars on sales if you knew you would get $2 billion of revenue.

Jake [01:10:19]: Exactly. A naive way to measure this is the percentage of tokens that end up in production. If you can measure impact because those tokens end up in production, that's awesome. But the burden of proof will rise. Internally, we have a growing number of pull requests that haven't merged. The question becomes: how do you get this into production? It's about how quickly you can build and deploy software, which is exciting because that's our whole thing.

The SDLC Shift: Prompt Requests, Feature Flags, and Safe Rollouts

Swyx [01:10:56]: The SDLC is changing. One thesis is that the pull request is dying. It's going to be the prompt request. Beyond that, code review is also kind of dying if you have all the other systems in place. What else is changing about the SDLC?

Jake [01:11:19]: The AI SRE and the tools to make it happen. AI SRE is pie-in-the-sky aspirational. What does it take to get an AI SRE? What tools do you need to build?

Swyx [01:11:32]: You should expose your tooling to customers at some point. The Central Station command center.

Jake [01:11:39]: We have it for template maintainers. Template maintainers can deploy and maintain templates, and they get feedback. We're going to expose those things incrementally.

Swyx [01:11:51]: Clustering around incidents. Everyone has a version of that, but I don't think anyone has solved it.

Jake [01:11:56]: I won't say we've solved it internally, but it's gotten so good that we can see incidents forming pretty quickly. At some point, those will be things either someone else builds or we build. We've always built things purpose-built for us.
If it makes sense to make it useful for users, monetize it, or turn that loop into a profit center instead of a cost center, we want to do that.

Jake [01:12:28]: Pull request is definitely dying.

Swyx [01:12:29]: Do you do first-party feature flagging and incremental rollout stuff?

Jake [01:12:34]: We have a feature-flagging engine we built internally and will eventually roll out.

Swyx [01:12:38]: I don't see it as a user. How come you didn't give us what you have?

Jake [01:12:43]: We have to beta test it. We care a lot about the quality of the things. There's plenty we've used internally that doesn't make it all the way through the journey because it fails. It works for one service but not multiple services. We'd have to build it for multiple services and know that if we released it, we'd rebuild it again and again. Some things are worth that, but many inform the roadmap.

Jake [01:13:18]: We don't want to dilute the experience by saying, “This works, but only for this service,” unless it's a core initiative. Over the next few months, we'll roll out things that work for a single service, then multiple services, then multiple services across the environment. You have to be deliberate. Otherwise you create broken disparate experiences and support load because people ask how to use the feature.

Jake [01:13:52]: It's the earlier expansion and compaction pattern. You expand the company to get features, then compact and smooth them out so the experience is stellar. You told me in the hallway, “It's gotten so much better.” Internally we're saying, “This part really sucks. We need to make it significantly better.”

Swyx [01:14:11]: I can attest to that over the last three years watching you build Railway. For listeners, feature flagging is a huge part of Uber culture. So much so that they have too many feature flags and another thing to remove feature flags. Facebook has Gatekeeper. Agents are going to need this. It's fundamental to incremental rollouts. OpenAI acquired Statsig.
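
A common way to get the incremental rollouts discussed here is deterministic percentage bucketing. The sketch below is a generic illustration, not Railway's internal engine, Gatekeeper, Statsig, or LaunchDarkly. `bucket` and `is_enabled` are hypothetical names, and the sketch assumes users are identified by a stable string ID.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    # A user stays in the same bucket, so raising the percentage only adds
    # users to the enabled set and never removes anyone already in it.
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and the user ID, the same user gets the same answer on every request, and widening the rollout from 10 to 50 percent keeps everyone from the first cohort enabled. That property is what bounds the blast radius of a change.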
GPT-5 is routing and flagging through different models.

Jake [01:14:56]: It's super important. If the software development lifecycle is going to change because we're doing things 1,000 times faster and 1,000 times more concurrently, what becomes important at scale?

Jake [01:15:16]: Before I started Railway, I built a feature-flagging product and tried to sell it. It was an easier version of LaunchDarkly. I ran into a problem: anyone small enough to adopt your technology doesn't care about feature flags, and anyone large enough to need feature flags needs so much scale that you have to build out all the infrastructure. I scrapped it.

Jake [01:15:42]: But what is old is new again. Companies are trying to move quickly, but you can't YOLO a vibe-coded thing straight into production. You need to say, “Here's my blast radius, my impact, and I want to shadow it for these users.” Feature flags. You're going to need the tools larger companies built to maintain their structures. Everything gets compressed by 1,000x so everybody can build those structures quickly.

Jake [01:16:07]: That's exactly where we are: compressing the software development lifecycle, then expanding it and adding more new things.

Cattle, Pets, and Clonable Infrastructure

Swyx [01:16:15]: Another term that comes to mind for newer developers is “cattle, not pets.” People treat production like a pet. It has a name. You baby it and keep it alive. With cattle, you can mass farm, roll out, portion parts out, and kill them.

Jake [01:16:37]: I think that might change. You can move toward having pets as long as you have a cloning machine for your pets.

Swyx [01:16:52]: Yeah.

Jake [01:16:52]: If you can snapshot every single thing at every frame, it doesn't matter if something gets obliterated because you have a snapshot of it. The things we've built right now are designed to block changes from the hermetically sealed DevOps line. You have to write a Dockerfile because you nee
This week, we discuss how security gets sold to execs, where agentic coding and security collide, and Cloudflare vs. Datadog's diverging paths. Plus, Coté weighs in on sugar cookies. Watch the YouTube Live Recording of Episode 572 Runner-up Titles Sugar Kingdom Bastard Sugar Choose butter It's just AI now People don't like paying for software It's an exciting time to be writing software Are we going to start funding open source? Speaking of that, but not that, it's totally unrelated We're leaving the table Weird network stuff That box of cables on GitHub Rundown Security Mini Shai-Hulud Is Back: npm Worm Hits over 160 Packages Fedora Hummingbird Linux Brings Agentic Linux to Builders Red Hat Hardened Images Palo Alto Networks to Acquire Portkey to Secure the Rise of AI Agents Mythos finds a curl vulnerability Redis and the Cost of Ambition AI is restructuring the software industry — winners and losers Datadog's stock jumps 31% on crushing earnings beat Cloudflare stock sinks 24% after earnings as company cuts 1,100 employees Linear increases workforce Relevant to your Interests Laws, anecdotes, idioms, and other shit people say - cote.pizza The smart lock standard that could replace your keys is finally here dirtyfrag/README.md at master · V4bel/dirtyfrag · GitHub Red Hat Summit Newsroom Finally, texts between Android and iPhone users can be end-to-end encrypted Your Container Is Not a Sandbox OpenAI launches the OpenAI Deployment Company Corporate Card Startup Ramp Raising Funds at $40 Billion Valuation Cyberattack shutters Canvas learning platform for schools across the U.S. AWS warns of EC2 'impairment' as power loss hits notorious US-EAST-1 region AI and DevOps Maturity Anthropic raises Claude Code usage limits, credits new deal with SpaceX Announcing Agent Toolkit for AWS He's not wrong, but what can they do? How are those Microsoft foundation models going? 
Sponsors WebRTC.ventures – Real-time communication & Voice AI integration Conferences WeAreDevelopers Europe, July 8-10, 2026 Berlin, Coté speaking. DevOpsDays Graz, Sept 4-5, 2026 DevOpsDays Rockies, Sept. 22 – 23, 2026, Discount Code: 26DODSWEDEFTALK WeAreDevelopers NA, Sept 23-25, 2026, Discount Code: DEVPOD26 DevOpsDays Dallas, Sept 28-29, 2026 DevOpsDays Vilnius, Sep 30 - Oct 1, 2026 DevOpsDays Istanbul, October 24th, 2026 - Coté keynoting. VMware User Groups (VMUGs): Dallas (June 9-11, 2026) Orlando (October 20-22, 2026) SDT News & Community Join our Slack community Email the show: questions@softwaredefinedtalk.com Free stickers: Email your address to stickers@softwaredefinedtalk.com Follow us on social media: Twitter, Threads, Mastodon, LinkedIn, BlueSky Watch us on: Twitch, YouTube, Instagram, TikTok Book offer: Use code SDT for $20 off "Digital WTF" by Coté Sponsor the show Sponsor more podcasts with Failover Media Recommendations Brandon: Xcode MCP Matt: whatcable Coté: Shogun.
————— COACHING —————
Are you a tech or product leader facing major challenges?
At RBC Capital Markets' Private Tech Conference, Dan Nathan interviews RBC analysts Brad Erickson, Rishi Jaluria, Matt Swanson, and Matt Hedberg on Q1 earnings and AI's impact across internet and software. Erickson says demand is solid, hyperscalers are raising CapEx as cloud ROI improves, and explains why Meta's higher spend hurt the stock versus Google/Amazon's accelerating cloud revenue and margins; he ranks Amazon over Google over Meta and discusses Uber's AV positioning versus Waymo. Jaluria is bullish on Microsoft's broad AI opportunities, notes Copilot's growing paid users, and discusses multimodel strategy, small/medium models, and Oracle's controversial OpenAI-linked data center build and financing. Swanson covers ad/martech, highlighting Adobe's “orchestration” narrative, Trade Desk's holding-company tensions, and AppLovin's ROAS-driven model. Hedberg argues cyber and infrastructure need “more, not less” security post-Anthropic's Mythos, cites capitulation in software sentiment, favors consolidators like CrowdStrike, Palo Alto, Snowflake, Datadog, and ServiceNow, and notes AI-driven efficiency and layoffs as potential catalysts amid continued volatility.
FOLLOW US
YouTube: @RiskReversalMedia
Instagram: @riskreversalmedia
Twitter: @RiskReversal
LinkedIn: RiskReversal Media
25. That's the number of consecutive Presidents Clubs Lucy Williams-Jones has qualified for. Across some of the greatest companies in our industry, BMC, MongoDB, Datadog and now Astronomer, Lucy has built one of the most consistent and decorated careers in enterprise sales. In this episode, Andy Whyte sits down with Lucy to unpack what separates a lucky career from a legendary one. From the impact of AI on modern selling to the growing complexity of buying committees, Economic Buyer engagement and what it really takes to build a champion, this is a masterclass in consistency. You'll learn: ✅ Why AI is making salespeople lazy and what the best sellers do differently ✅ How buying committees have grown from 1-2 people to 10-15 and what that means for how you sell ✅ Why your Economic Buyer should be your champion and how to get there early ✅ The traits that separate A Players from the rest ✅ How to use MEDDPICC as a personal framework, even when job hunting ✅ Why you can't build a champion on WhatsApp
If you're betting on agentic AI, hear directly from the builders navigating the challenges, innovating pricing models, and creating the coming reality where every employee manages thousands of agents.
Topics Include:
Four panelists represent the full AI stack: build, run, secure, monitor.
Agentic AI moved faster than anyone predicted just 18 months ago.
Writer AI's no-code agent builder missed both its target personas entirely.
Non-technical users now just prompt agents instead of building workflows.
Fireworks AI processes over ten trillion tokens daily across open models.
DeepSeek's Christmas release tripled Fireworks' capacity needs almost overnight.
Okta identified agent identity as a security problem from day one.
91% of organizations are already using AI agents in some capacity.
Datadog evolved naturally from dashboards to autonomous investigative agents.
Bits.ai agents now diagnose production incidents before engineers wake up.
Trust requires explainability — black-box agents stall enterprise adoption cold.
Human-in-the-loop remains essential; risk tolerance varies wildly by organization.
Writer AI compressed a four-month retail workflow down to one week.
Multi-provider inference consistency is one of the hardest unsolved infrastructure problems.
Agentic pricing models are fundamentally broken for enterprise budget planning today.
Agents managing agents means every employee becomes a manager of thousands.
POC data gaps are the most underrated blocker to production deployment.
Security must be designed in from the start — retrofitting is painful.
Build evaluations first so you know if you're actually improving anything.
Find your uniquely differentiated data and build your agentic bet there.
Participants:
Yannick Guillerm – Regional Manager, Sales Engineering, Datadog
Ray Thai – Director of Product Management, Fireworks AI
Andrew Yu – Vice President of Engineering, Okta
Matan-Paul Shetrit – Director of Product Management, Writer AI
Moderator: Carol Potts – General Manager, North America ISV Sales, AWS
See how Amazon Web Services gives you the freedom to migrate, innovate, and scale your software company at https://aws.amazon.com/isv/
Every month, stock analysts Jordy Beuving and Jean-Paul van Oudheusden present an in-depth update on technology stocks. This episode covers AMD, Intel, Alphabet, Marvell, ASML, BESI, ServiceNow, Apple, Meta, Fortinet, and Datadog, among others. Want to know more about the Antaurus AI Tech Fund? Go to: http://antaurus.nl/
Curious about the portfolios of both Jordy and Jean-Paul? Go to: https://www.probeleggen.nl/aanmelden/registreren/
Timeline:
00:00 - 01:55 Opening
01:55 - 03:45 Earnings season: tech beats expectations
03:45 - 07:25 Big Tech
07:25 - 12:15 Alphabet, the absolute AI king
12:15 - 19:25 Apple: back without ever having been away?
19:25 - 20:45 Big Tech pulls the index higher
20:45 - 21:45 Block 2: Semis
21:45 - 24:15 SOXX makes history
24:15 - 27:07 Fireworks at AMD
27:07 - 31:30 Intel: CPUs + foundry at a tipping point?
31:30 - 35:35 ASML goes Hybrid Bonding (?)
35:35 - 38:45 ProBeleggen Jordy
38:45 - 39:10 Block 3: Software (cyber)
39:10 - 39:35 Software rebounds after earnings
39:35 - 42:30 Datadog earnings: deals with Big Tech
42:30 - 46:25 Fortinet earnings: cybersecurity rebounds
46:25 - 54:20 Cybersecurity - Top 4 Pure Plays
54:20 - 58:40 Mood Board: Software vs Hardware
58:40 - 58:55 Block 4: outlook
58:55 - 1:01:35 The Buffett Indicator: Market Cap to GDP
1:01:35 - 1:02:10 Hardware vs Software: investors choose certainty
1:02:10 - 1:05:10 The Internet (2000) versus AI (2026)
1:05:10 - 1:07:50 ProBeleggen: Portfolio Jordy
1:07:50 Preview of the next Tech Talk
Get all our exclusive analyses, videos, and market content: https://www.deaandeelhouder.nl/premium/
Das Briefing: NACHO trade, oil price still high, AI rally, Bitcoin & the best AI stocks. The TACO trade is history. Now traders are talking about the NACHO trade: no chance of the Strait of Hormuz opening. While Donald Trump is apparently considering an end to the war with Iran, the oil price remains above 100 dollars. Normally that would weigh heavily on the markets. Yet prices keep rising. Why? In this edition of "Das Briefing", the weekly stock market show for German retail investors, we look at the big question of the week: why is the stock market ignoring geopolitical risks, high oil prices, and absurd valuations? The answer lies, of all places, in a topic that many investors still dismiss as pure tech hype: artificial intelligence. AI has long ceased to be a micro topic for individual companies. AI has become the dominant macro theme. It is driving investment, earnings expectations, data centers, semiconductors, cloud infrastructure, and entire stock markets. We show why stocks like SK Hynix are currently unstoppable, why AI winners remain in heavy demand, and which AI stocks could become especially interesting now. We introduce you to the best AI stocks. We also talk about the return of Bitcoin. The cryptocurrency is trading above 80,000 dollars again. But how do you buy Bitcoin sensibly? Directly? Through a broker? Through ETPs? And what should investors watch out for? Another focus: valuations. As absurd as it sounds, some stocks are in part cheaper after the rally than they were in the crash. How can that be? And where does that create opportunities? Of course we also look at the week's losers: PayPal disappointed again, and The Trade Desk and Arista Networks were sold off sharply. By contrast, Datadog, Fortinet, and AMD delivered real share-price fireworks.
Today, a look at US market performance behaving very differently yesterday as some software names got a big boost on an otherwise negative day, in part on DataDog's earnings result. Also, an interesting geopolitical signal comes out of Greenland, while there is plenty to talk about in macro and FX even if FX volatility remains restrained, waiting for US jobs data and next week's critical Trump-Xi summit. This and more on today's pod, which is hosted by Saxo Global Head of Macro Strategy John J. Hardy. Links discussed on today's show: The Goldman Sachs research report on "Tracking the Trillions" in AI spending - very thorough look at many of the issues in understanding the risks in the assumptions about chip lifespan and much more. An X post that looks at how the massive cap-ex spend on AI will drive S&P 500 earnings projections through next year, even if balance sheets are less robust on the other side. Craig Tindale's massive work on how we are looking at climate change risks in the wrong way, with risks of ugly feedback loops in key systems that could accelerate change in ways the generalist models aren't. About twice per week, you will find links discussed on the podcast and a chart-of-the-day over at the John J. Hardy substack. Read daily in-depth market updates from the Saxo Market Call and the Saxo Strategy Team here. Please reach out to us at marketcall@saxobank.com for feedback and questions. Click here to open an account with Saxo. Intro music by AShamaluevMusic DISCLAIMER This content is marketing material. Trading financial instruments carries risks. Always ensure that you understand these risks before trading. This material does not contain investment advice or an encouragement to invest in a particular manner. Historic performance is not a guarantee of future results. The instrument(s) referenced in this content may be issued by a partner, from whom Saxo Bank A/S receives promotional fees, payment or retrocessions. 
While Saxo may receive compensation from these partnerships, all content is created with the aim of providing clients with valuable information and options.
In today's episode, financial journalists Daniel Eckert and Lea Oetjen talk about Datadog's share-price explosion, a record low on the US stock exchanges, and a setback for McDonald's. They also cover Fortinet, Arm Holdings, Coinbase, Tesla, Siemens Healthineers, Vonovia, Nvidia, Rheinmetall, Henkel, 2G Energy, Apple, Amazon, Alphabet, and Broadcom. We welcome feedback at aaa@welt.de. You can find even more "Alles auf Aktien" at WELTplus and Apple Podcasts, including all articles by the hosts. Here at WELT: https://www.welt.de/podcasts/alles-auf-aktien/plus247399208/Boersen-Podcast-AAA-Bonus-Folgen-Jede-Woche-noch-mehr-Antworten-auf-Eure-Boersen-Fragen.html. You can subscribe to the AAA newsletter here: https://www.welt.de/newsletter/article232797673/Alles-auf-Aktien-Der-taegliche-Boersen-Newsletter-fuer-WELTplus-Abonnenten.html And, brand new: AAA is now also on Instagram: https://www.instagram.com/alles_auf_aktien/ Disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher are not liable for any losses arising from acting on the thoughts or ideas presented. Listening tips: For everyone who wants to know more, you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". +++ Advertising +++ Want to learn more about our advertising partners? Find all the info and discounts here! https://linktr.ee/alles_auf_aktien Legal notice: https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
Listening to stocks is good. Buying stocks is better. With our partner Scalable Capital you can trade without limits on a trading flat rate and on the in-house European Investor Exchange, which is tailored to retail investors. All further information is here: scalable.capital/oaws. Planet Fitness crashes. Whirlpool suspends its dividend. Zoetis loses ground because pet owners are cutting back. Fastly plunges despite strong growth. Datadog & Fortinet benefit from AI. Arm struggles with bottlenecks. Rheinmetall wants a Kiel shipyard. South Korea is growing. Hawkeye 360 goes public. McDonald's (WKN: 856958) launches craft drinks with boba, cold foam, and Red Bull. Mega margins, but does that fit drive-thru speed? And will the target group that is currently falling away really buy 5-dollar drinks? Dutch Bros (WKN: A3C28Y) makes $500 million with energy drinks and is growing 30%. New competition from Starbucks and McDonald's? Management says: no effect. The $10 billion valuation remains steep, though. This podcast from May 8, 2026, 3:00 a.m., is provided to you by Podstars GmbH (Noah Leidinger). Learn more about your ad choices. Visit megaphone.fm/adchoices
May 8. Hormuz, tensions rise again: the US strikes Iranian military targets. Tehran: the ceasefire has been violated. US: this is not escalation but self-defense. Trump: the ceasefire remains in effect. Brent and WTI climb again, and the dollar and US Treasury yields are also up. The Iranian response to the American 14-point plan has not arrived. According to the WSJ, Saudi Arabia and Kuwait have lifted the restrictions on access to military bases and airspace, and Project Freedom could resume as early as this week. Iran sets up an agency to control the strait and its tolls. Court of International Trade: the 10% tariffs announced by the US in February are illegal. Futures in the green. Today brings the March Non-Farm Payrolls figure. The market sees the Fed on hold in 2026. NY Fed: short-term inflation expectations rise. AI trade: how much can it sustain the market? The S&P 500's rebound from the March low was driven by 5 stocks. Historic concentration. Paul Tudor Jones: another year or two before the markets' debacle. Goldman Sachs: short-term risks. JPM: retail investors' tech buying at a one-year high. Signs of life from software: +15% in a month. Quarterly results: Arm plunges; McDonald's and Whirlpool: first signs of cracks in the consumer? Rally in cybersecurity stocks: Datadog +30%. CoreWeave -10% in pre-market: guidance below expectations. Anthropic heading toward a valuation of 1,000 billion dollars. *** This episode is brought to you by Scalable Capital. Investing involves risks. Variable gross interest p.a. on unlimited cash. Terms and cash distribution at scalable.capital/conto-deposito-non-vincolato *** In Asia, focus on yen intervention. Bessent in Japan next week. Markets in the red: Nikkei and Kospi -1%, South Korea's trade surplus rises. Kospi: best week since 2008. GS raises its target price. In Europe, futures in the red. Trump, additional tariffs on cars and trucks: Europe has until July 4 to ratify Turnberry.
Schnabel, ECB: if the energy shock continues, policy moves toward a restrictive monetary stance. Second-round effects could be worse than in 2022. EU: dialogue with Putin. Brussels: the decision on the excessive deficit procedure will be made in the autumn. Commerzbank: record results, raises 2026 guidance. It will cut 3,000 jobs by 2030. Merz against UniCredit: it is destroying trust. Poste, Del Fante: we will become the leading mobile operator. MPS: Fabrizio Palermo resigns from the board over disagreement on governance. Focus on Enel, Pirelli. Learn more about your ad choices. Visit megaphone.fm/adchoices
Welcome to the wildest financial column of the past 48 hours.
Carl Quintanilla, Jim Cramer and David Faber drilled down on new record highs for the S&P 500 and Nasdaq — amid investor hopes for an Iran deal. AI in the spotlight: In a "First on CNBC" interview, Arm Holdings CEO Rene Haas spoke about the chip designer's better-than-expected results, which didn't stop the stock from falling sharply lower. The anchors discussed what Anthropic CEO Dario Amodei said about the AI startup trying to keep up with demand, after posting 80-fold growth in Q1. Also in focus: McDonald's beats on earnings despite a "challenging environment," Shake Shack tumbles, the Iran war effect on Whirlpool's stock, Datadog and Fortinet soar and spark a software rally, Elon Musk's take on the SpaceX-Anthropic deal. Squawk on the Street Disclaimer Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Iran is expected to provide its reply to the US proposal for ending the war to mediators on Thursday, according to CNN, citing a regional source. US President Trump is optimistic about the Iran framework and says the timeframe is one week, Fox reported. US President Trump could turn to military action without an agreement with Iran ahead of the China trip, according to Axios, citing US officials. US President Trump's reversal on his plan to help ships go through the Strait of Hormuz came after Saudi Arabia suspended the US's ability to use its bases and airspace to carry out Project Freedom, according to NBC, citing US officials. European equity futures are indicative of a quiet open, with the Euro Stoxx 50 future +0.1% after cash closed +2.5% on Wednesday. Looking ahead, highlights include German Factory Orders (Mar), French Balance of Trade (Mar), US Challenger Job Layoffs (Apr), US Jobless Claims (May 2), Atlanta Fed GDP, Norges Bank/Riksbank/CNB/Banxico Policy Announcement (May), CBR Minutes (May), UK Local Elections. Speakers include ECB's de Guindos, Elderson, Schnabel, Lane, BoE's Lombardelli, Mann, Taylor, Fed's Hammack, Williams, Kashkari, Norges Bank's Bache, Riksbank's Thedeen. Supply from Spain, France. Earnings from CoreWeave, IREN, Coinbase, Cloudflare, DraftKings, ACM Research, Datadog, McDonald's & Shell. Read the full report covering Equities, Forex, Fixed Income, Commodities and more on Newsquawk
Al Arabiya reported that "the coming hours will witness a breakthrough for the situation of the ships stuck in the strait", spurring pressure in the crude complex. Iran is expected to provide its reply to the US proposal for ending the war to mediators on Thursday, according to CNN, citing a regional source. US President Trump could turn to military action without an agreement with Iran ahead of the China trip, according to Axios, citing US officials. European and US equity futures are modestly firmer; ARM -6.5% post-earnings. DXY downbeat as positive geopolitical headlines pressure crude; Antipodeans lead whilst the JPY lags vs peers. Fixed income benchmarks made new WTD highs amidst geopolitical optimism, but are now off best levels. Looking ahead, highlights include US Challenger Job Layoffs (Apr), US Jobless Claims (May 2), Atlanta Fed GDP, CNB/Banxico Policy Announcement (May), CBR Minutes (May), UK Local Elections. Speakers include ECB's Elderson, Schnabel, Lane, BoE's Mann, Taylor, Fed's Hammack, Williams, Kashkari. Earnings from CoreWeave, IREN, Coinbase, Cloudflare, DraftKings, ACM Research, Datadog, McDonald's. Read the full report covering Equities, Forex, Fixed Income, Commodities and more on Newsquawk
Former Bridgewater Associates Chief Investment Strategist Rebecca Patterson gives her outlook on the AI trade and broader market. Hut 8 CEO Asher Genoot joins in an exclusive to discuss the company's new $10B data center deal. And could ChatGPT manage your portfolio?
Welcome to episode 353 of The Cloud Pod, where the weather is always cloudy! Justin, Ryan, and Matt are in the studio this week and ready to bring you all the latest news, including earnings from the big 3, a new agreement between the DOW and Google (Don’t be Evil), AI Agents, and more OpenClaw news (that your security team may not appreciate). Plus, data centers may not be great for the environment. Who knew? There's a lot to cover, so let's get started!
Titles we almost went with this week
Who Let the Bots Out? AI Governance Has No Answer
Microsoft Loses Its OpenAI Monopoly But Keeps the Parking Spot
AWS But Make It Forklifts and Freight
Bezos Built a Money Printer That Prints Data Centers
GPT-5.5 Instant Arrives Faster Than Your Last Existential Crisis
When Your AI Coding Tool Ghosts You for Seven Weeks
No More Goldfish Brain for Your AI Agents
Amazon Quick Connects Everything Except Your Work-Life Balance
AWS WAF Now Knows Which AI Is Crawling Your Stuff
Stop Pushing Broken Code to Staging Like a Caveman
Your AI Agent Called It Needs Automated Therapy
OpenAI Moves In, and AWS Didn’t Even Change the Locks
AI Interviews Candidates So Recruiters Can Nap
Foundry Gives AI Agents Long-Term Memory and a Diary
Cloud Earnings are Up… but some day the Capex Bell will Toll for the AI Reckoning
Who Let the Bots Out? AWS WAF now shows you
A big thanks to this week's sponsors: There are many cloud cost management tools out there, but only Archera provides insured commitments. It sounds fancy, but it’s really simple. Archera gives you the cost savings of a 1 or 3-year AWS Savings Plan with a commitment as short as 30 days. If you do not use all the cloud resources you have committed to, Archera will literally cover the difference. Other cost management tools may say they offer “insured commitments”, but remember to ask: Will you actually give me my rebate? Because Archera will. Check out thecloudpod.net/archera to schedule a demo today.
We also wanted to tell you about something coming to the US for the first time — WeAreDevelopers World Congress! They’ve been doing this in Europe for years, 15,000-plus attendees in Berlin, it’s one of the biggest developer events over there. Coté from Software Defined Talk is actually speaking at their Berlin event this summer, so we’ve got some firsthand context here. In September, they’re launching the North America edition. San José, September 23 to 25. 500-plus speakers, 18 tracks — cloud, infrastructure, DevOps, security, AI, data engineering, all of it. Speakers from Datadog, Honeycomb, Sentry, Google, LinkedIn, and Stack Overflow. Olivier Pomel, Christine Yen, Milin Desai, Kelsey Hightower – plus workshops and masterclasses, not just talks. These are people who know how to do a developer conference at scale. wearedevelopers.us, code DEVPOD26 for 15% off. Group rates on top of that for 4 or more.
Follow Up
It's Earnings Time!
01:23 Microsoft (MSFT) Q3 earnings report 2026
Microsoft posted Q3 2026 revenue of $82.89 billion, up 18% year over year, with Azure cloud services growing 40%, slightly ahead of analyst expectations in the 38-39% range. Capital expenditures came in at $31.9 billion, about $3 billion below the analyst consensus of $34.9 billion, contributing to the stock dipping 2% despite the earnings beat, reflecting investor sensitivi
Welcome to episode 352 of The Cloud Pod, where the weather is always cloudy! Justin, Matt, and Ryan are safely back from Vegas (Ryan and Justin, anyway), and they have all the news and announcements from Google Next. Plus, we have Ryan's take on Phish, news from Cloudflare, and a shoe company making a pivot. There's a lot to cover, so let's get started!
Titles we almost went with this week
Redact Yourself Before You Wreck Yourself
OpenAI **Anthropic
Fork Yeah Cloudflare Artifacts Is Here
Git Happens at Scale on Cloudflare
Bucket List Item Checked Lambda Mounts S3 File Systems
Terraform Your Agents Before They Terraform You
Cloud Run Gets GPUs and Finally Hits the Gym
Spanner Goes Rogue, Leaves the Cloud Behind
Knowledge Catalog Knows What Your Agents Did Last Query
One Control Plane to Rule a Million Chips
No More Incognito Windows for Your AWS Identity Crisis
Your Agent Can Now Write Files Without Burning Everything Down
Spend Caps Finally Tell Runaway AI Jobs to Chill
RIP Vertex, long live the agent
Agents all the way down
Google Next: This is the dawning of the Age of Agentic
Allbirds Proves AI Hype Needs No Infrastructure
A big thanks to this week's sponsors: There are a lot of cloud cost management tools out there, but only Archera provides insured commitments. It sounds fancy, but it’s really simple. Archera gives you the cost savings of a 1 or 3-year AWS Savings Plan with a commitment as short as 30 days. If you do not use all the cloud resources you have committed to, Archera will literally cover the difference. Other cost management tools may say they offer “insured commitments”, but remember to ask: Will you actually give me my rebate? Because Archera will. Check out thecloudpod.net/archera to schedule a demo today.
We also wanted to tell you about something coming to the US for the first time — WeAreDevelopers World Congress! They’ve been doing this in Europe for years, 15,000-plus attendees in Berlin, it’s one of the biggest developer events over there. 
Coté from Software Defined Talk is actually speaking at their Berlin event this summer, so we’ve got some firsthand context here. In September, they’re launching the North America edition. San José, September 23 to 25. 500-plus speakers, 18 tracks — cloud, infrastructure, DevOps, security, AI, data engineering, all of it. Speakers from Datadog, Honeycomb, Sentry, Google, LinkedIn, and Stack Overflow. Olivier Pomel, Christine Yen, Milin Desai, Kelsey Hightower – plus workshops and masterclasses, not just talks. These are people who know how to do a developer conference at scale. wearedevelopers.us, code DEVPOD26 for 15% off. Group rates on top of that for 4 or more.
General News
06:12 Amazon invest up to $25 billion in Anthropic part of AI infrastructure
Amazon has committed up to $25 billion in additional investment in Anthropic, bringing its total potential investment to $33 billion. The latest $5 billion tranche is based on Anthropic’s $380 billion valuation, with up to $20 billion more tied to commercial milestones. In exchange, Anthropic has committed to spending over $100 billion on AWS over the next decade, with a specific focus on Trainium custom AI chips, and plans to bring nearly 1 gigawatt of
For all those who missed out on London, see you in Miami next week!
Notion, the knowledge work decacorn, has been building AI tooling since before ChatGPT, with many hits from Q&A in 2023 and unified AI in 2024 and Meeting Notes in 2025. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0's Custom Agents - and they are finally embracing the Agent Lab playbook!
Sarah Sachs and Simon Last of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work.
We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today — from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work.
We discuss:
* Sarah and Simon's path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production
* Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model
* The “Agent Lab” thesis: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities
* How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are
* Why coding agents feel like the kernel of AGI, and how Notion is thinking about “software factories” made up of agents that spec, code, test, debug, review, and maintain codebases together
* How Sarah runs AI engineering at Notion (“notes from Token Town”): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to 
swarm around fast-changing opportunities
* The “Simon Vortex,” company hackathons, and why security gets pulled in early rather than late
* How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents
* Why prototypes have become much easier to build internally, and how “demos over memos” changes product development inside a tool the whole company already uses every day
* Notion's eval philosophy: regression tests, launch-quality evals, and “frontier/headroom” evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going
* What a “Model Behavior Engineer” is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than just software engineering
* The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops
* How the “software factory” should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter
* A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database
* How agents compose inside Notion: shared databases as primitives, agents invoking other agents, “manager agents” supervising dozens of specialized agents, and memory implemented simply as pages and databases
* Notion's take on MCP vs CLI: why Simon is bullish on CLI's self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment
* The evolution of Notion's internal agent harness: from early JavaScript coding agents, to custom XML, to Markdown and SQL-like 
abstractions, to tool definitions, progressive disclosure, and a much shorter system prompt
* Why Notion cares about teaching “the top of the class,” building for sophisticated operators rather than abstracting away too much capability for everyone
* How agent setup works today: agents that can configure themselves, inspect their own failures, and edit their own instructions — with guardrails around permissions
* How Notion prices Custom Agents: credits as an abstraction over tokens, model type, serving tier, web search, and future sandbox costs; why usage-based pricing was necessary; and how “auto” tries to match the right model to the right task
* Why Notion is not eager to train a foundation model, where they do fine-tune and optimize today, and why retrieval/ranking is one of the most important investment areas as more searches come from agents rather than humans
* Why Meeting Notes became one of Notion's strongest growth loops: not just as transcription, but as high-signal data capture that powers search, custom agents, follow-up workflows, and the broader system of record for company collaboration
* Why Notion is more interested in being the place where collaboration data lives than in building hardware themselves — and how wearables or other capture devices may eventually feed into that system
Sarah Sachs
LinkedIn: https://www.linkedin.com/in/sarahmsachs
X: https://x.com/sarahmsachs
Simon Last
LinkedIn: https://www.linkedin.com/in/simon-last-41404140
X: https://x.com/simonlast
Full Video Episode
Timestamps
* 00:00:00 Introduction and launching Notion Custom Agents
* 00:01:17 Why Notion rebuilt agents four or five times
* 00:03:35 Building for where models are going, not just where they are
* 00:05:32 The Agent Lab thesis, wrappers, and product intuition
* 00:08:07 User journeys, leadership, and low-ego AI teams
* 00:13:16 The Simon Vortex, hackathons, and bringing security in early
* 00:16:39 Team structure, demos over memos, and building for agents
* 00:20:25 Evals, Notion's 
Last Exam, and the Model Behavior Engineer role
* 00:27:37 Evals as an agent harness and the changing role of software engineers
* 00:30:42 The software factory: specs, verification, and agent workflows
* 00:32:18 Live demo: a custom agent for coworking space applications
* 00:35:08 Composing agents, manager agents, and memory as pages
* 00:38:15 Notion Mail, Gmail, native integrations, and tools
* 00:39:43 MCP vs CLI and the cost of capability
* 00:44:13 When Notion uses MCP vs building its own integrations
* 00:47:43 The history of Notion's agent harness rebuilds
* 00:55:35 Power users, public tools, and the setup agent
* 00:58:01 Self-fixing agents, permissions, and “flippy”
* 01:01:13 Pricing, credits, and choosing the right model automatically
* 01:09:01 Why Notion isn't training its own frontier model
* 01:14:07 Retrieval, ranking, and search built for agents
* 01:17:27 Meeting Notes as data capture and workflow automation
* 01:21:18 Wearables, hardware, and Notion as the system of record
* 01:23:45 Outro
Transcript
[00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.
[00:00:11] swyx: Hello. Hello. We're back in the beautiful studio that, uh, Alessio has set up for us, with Simon and Sarah from Notion. Welcome.
[00:00:18] Sarah Sachs: Thanks for having us.
[00:00:19] Alessio: Thanks for having us. Yeah.
[00:00:20] swyx: Congrats on the launch recently. The custom agents, finally it's here. How's it feel?
[00:00:26] Sarah Sachs: We ship things slowly. So it had been in alpha for a little bit, and at the point at which it's in alpha, um, there's a group of people that are making sure it's ready for prod, and then there's a group of people working on the next thing. So sometimes some of these launches are a bit delayed satisfaction, so it's quite nice to remind yourself of all the work you did, because we do have a habit of, like, being two or three milestones ahead. 
Uh, just ‘cause you have to be, you know, you can't get complacent. Um, but it's been great that people understood how this is helpful. And I think that's just easier in general, building AI tools today, than it was two, three years ago. People kind of get it, and so that user education, um, there's just, it was our most successful launch in terms of free trials and converting people and things like that. It was really successful, so yeah. But there's a lot to build.
[00:01:12] swyx: Making it free for three months helps.
[00:01:16] Sarah Sachs: Yep.
[00:01:17] Simon Last: It was definitely super exciting for me because it's probably the fourth or fifth time that we rebuilt that.
[00:01:22] swyx: Yes.
[00:01:23] Simon Last: And I mean,
[00:01:24] swyx: you've been building this since, like, 2022.
[00:01:26] Simon Last: Yeah, I mean, like, it was even right when we got access to, like, GPT-4 in late 2022. One of the first ideas we had is like, oh, okay, let's make an agent. We used the word assistant at the time; there wasn't really the word agent yet. But, oh, we'll give it access to all the tools that Notion can do, and then it'll run in the background, like, do work for us. And then we just tried that many times and it just was too early. Um,
[00:01:48] swyx: I need to force you to, like, double-click on that. What is too early? What didn't work?
[00:01:52] Sarah Sachs: We were fine-tuning, like, before function calling came out. We were trying to fine-tune with the frontier labs and with Fireworks, like, a function-calling model on Notion functions. This is right when I joined. I joined because, um, we needed a manager, as Simon needed to be able to go on vacation. 
So, uh, that's, that's around when I joined, so you can speak much more to it.
[00:02:11] Simon Last: Yeah, we did partnerships with both Anthropic and OpenAI at different times, uh, to try to, at the time the, I mean, when we first tried, there wasn't even a concept of, like, tools yet. We, we sort of designed our own, like, tool-calling framework, and then we tried to fine-tune the models to, uh, to use it over multiple turns. Um, and because it, it didn't work well out of the box, I think. Yeah. The models were just too dumb and the context window was also way too short.
[00:02:37] Alessio: Yeah.
[00:02:37] Simon Last: Um, and yeah, we just kind of banged our head against it for a long time. Uh, unfortunately it was always like, there were always, like, sort of glimmers that it was working, but, um, it never felt quite robust enough to be, like, a useful, delightful thing. Um, until I would say, uh, the big unlock was probably, like, Sonnet 3.6 or 3.7, uh, early last year. And that's when we started working on our agent, which we shipped last year. Um, and then, and then, uh, custom agents, kinda a similar capability, and that, that one just took longer because we, we just wanted to get the reliability up a lot higher, ‘cause it's actually running in the background.
[00:03:14] Sarah Sachs: And the product interface of, like, permissions and understanding, you know, this custom agent is shared in a Slack channel with X group of people and has access to documents that are surfaced to Y group of people. And the intersection of X and Y might not be whole. And so how do you build the product around making sure administrators understand that? Permissioning took multiple swings.
[00:03:35] Alessio: Everything is RBAC at the end of the day. Yeah. 
I'm curious, like, when the models are not working, how do you inform the product roadmap of, like, okay, we should probably build expecting the models to be better at some reasonable pace, but at the same time we need to, you know, you had a lot of customers in 2022. It's not like you were a new company or, like, had no user base.
[00:03:54] Simon Last: Yeah, I mean, I think there's always the balance of, you know, like, you want to be AGI-pilled and thinking ahead and building for where things are going. Uh, but also you wanna be, like, shipping useful things. And so we always try to, like, keep a balance there. You know, we try to take, like, a portfolio approach. You know, we're always working on multiple projects, and, and we're always trying to work on, you know, maintaining things that have already shipped, like, shipping new things that are, like, eminently working well and making them really good. And, and then we wanna always have a few projects that are a little bit crazy. Um,
[00:04:23] Alessio: and what are the AGI-pilled projects that you have today? I'm curious about, uh, you don't have to share exactly what you're working on, but I'm curious what are things today that maybe in 18 months people will be like, oh, obviously this was gonna work.
[00:04:35] Sarah Sachs: 18 months.
[00:04:37] Alessio: Yeah, 18 months is, you know,
[00:04:37] Sarah Sachs: it's a long time and yeah. Yeah.
[00:04:39] Simon Last: I mean, there's a number of things happening. I think one thing that's becoming more clear is, I think, like, uh, coding agents are the kernel of AGI; sort of, everything is a coding agent. Mm-hmm. I think that's, that's sort of one, one direction. Um, and then, yeah, the exciting thing about that is sort of your agent can sort of bootstrap its own software and capabilities and actually debug and maintain them. And so yeah, we're, we're, we're thinking a lot about that. 
And then, yeah, like, another category of things that I'm, I'm really excited about is, like, uh, what we call the software factory. Also, people are using this, uh, this sort of word. Um, basically it just means, can you create, sort of, like, an as-automated-as-possible workflow for developing, debugging, mm-hmm, merging, reviewing, and maintaining a codebase and a service where there's a bunch of agents working together inside, and, like, how does that work?
[00:05:28] Sarah Sachs: If you think back to your initial question, like, why did this take so long? I think something,
[00:05:32] swyx: I didn't say that, but yes. Okay. Go ahead.
[00:05:34] Sarah Sachs: Why, what, what changed over the three and a half years of trying
[00:05:37] swyx: it? Exactly. Right. Because most people always say, like, it didn't work yet. Then reasoning models came, then it worked. I was like, okay, let's go a little
[00:05:43] Sarah Sachs: bit. That's, I mean, that's part of it, but I think the other part of it, that I actually think is really what will set Notion apart for every new capability, is we have, like, two skills that are crucial when it comes to frontier capabilities. One is not letting yourself swim upstream. So, like, quickly realizing if you're just pressing against model capabilities versus not exposing the model to the right information, not having the right infrastructure set up. That in and of itself is the skill of intuition. And the second is to see, okay, you're not swimming upstream. Which direction is the river flowing, and what is, like, how do we think ahead about the product and start building it even if it's not great yet, so that when it is there, we're ready for it. Right? And, like, those can sometimes feel like counterintuitive things. Like, we can be trying to fine-tune a tool-calling model when they don't exist yet. And the trick is to not do that for too long, but realize that there was something there. 
And we've had a lot of things which, like, um, we're just, like, not swimming in the right direction with the streams. I think we had multiple versions of transcription before we got Meeting Notes, right? Oh, I gotta talk
[00:06:39] swyx: about that. Yeah.
[00:06:40] Sarah Sachs: Yeah. Um, and so, I, I think that, like, we really closely partner with the frontier labs on capabilities, and we also have to have strong conviction on, as those capabilities move: Notion is about being the best place for you to collaborate and do your work. And how does that narrative change if the way that we work changes? Yeah.
[00:06:58] swyx: Yeah. You told me you were a fan of the Agent Lab thesis, and this is, this is kind of it, right?
[00:07:02] Sarah Sachs: Right. I show that thesis to so many candidates. Like, I have it as, like, my Chrome autofill. Um, at this point, like, it's one of my most visited
[00:07:10] swyx: because, like, is this the, here's why you should work at Notion and not OpenAI? It's like,
[00:07:14] Sarah Sachs: here's, here's what's different about it.
[00:07:16] swyx: Yeah.
[00:07:16] Sarah Sachs: And here's why it's not just a wrapper. I actually think more and more people understand it's not just a wrapper.
[00:07:21] swyx: Yeah.
[00:07:22] Sarah Sachs: Um, and by the way, like, in the beginning, parts of what we build are wrappers on functionality that works well, of course, but that's not really the most, um, I would say that's not the product that, that drives revenue. And that's not necessarily always what users need.
[00:07:35] swyx: I mean, you know, Notion is the AWS wrapper, but, like, the wrapper is very beautiful and, like, very, very well polished. So
[00:07:40] Sarah Sachs: like the analogy,
[00:07:41] swyx: like
[00:07:42] Sarah Sachs: the analogy that I've been coming back to is Datadog and AWS.
[00:07:45] swyx: Yeah.
[00:07:46] Sarah Sachs: So, uh, Datadog could not exist without cloud storage. Right. 
That it's kind of fundamental that that works. Um, and AWS has, like, a CloudWatch product, but Datadog is an expert on understanding how people want observability on the products they launch. And we're experts in understanding how people wanna collaborate, and that's really where our expertise lies.
[00:08:04] swyx: Totally.
[00:08:04] Sarah Sachs: Um, regardless of the tools that we use,
[00:08:07] Alessio: I'm kind of curious how you think about implicit versus explicit expertise. I feel like Datadog is half and half implicit and explicit. It's like they understand across markets and industries what engineering teams usually look for. With Notion, it's almost like more of the expertise is at the edge, because you as a platform, you're, like, so horizontal that the end user is not really the same. Mm-hmm. Like, with Datadog, the end user is always, like, yeah, an engineering lead, a kinda, like, SRE-related person. With Notion, it can be anything. So I'm curious how you put that expertise into a product versus, you know, obviously AWS cannot build Notion. It's, that doesn't quite work in this case, but
[00:08:44] Simon Last: it's, it's a little bit differently shaped. I think, you know, a classic vertical SaaS, like, Datadog is kind of like that. They understand their individual customer very deeply. It's kinda a narrow slice. Um, Notion has always been super horizontal. And our, our task has always been to sort of balance these two somewhat opposing forces of, like, we're listening to our customers and what they want us to build. It's a broad slice. And then also we're thinking about, like, okay, how do we decompose what they want into, uh, nice primitives that are really nice to use and will get us, like, as much bang for the buck as possible. And then, you know, maintain the whole system, make it all, like, super clean and nice to use.
[00:09:22] Sarah Sachs: We still have user journeys. I mean, we still focus on, like, core. 
I actually think the failure of our team is when we focus too much on what are tools that are
[00:09:31] Simon Last: Mm-hmm.
[00:09:31] Sarah Sachs: cool tools. I actually think that's when we have the least velocity, because you still need some sort of focus on a user journey. So, like, for instance, we'll all sit down every Friday and look at the P99 of, like, the most token-exhaustive custom agent transcript and just look at why it didn't do well and cut a bunch of tasks. Like, we still focus on, like, this should work. Email triaging should work. Mm-hmm. Right. And similarly, like, when we were chatting, um, before we started filming, about, okay, how can I do PDF export? Well, that's functionality that then merits, maybe we should build a tool that has access to a computer sandbox in a file system and the ability to write code. Right? Right. Um, but it's because we're thinking about the fact that our users, to do their daily work, need to export PDFs, not because we're like, hmm, I think a computer tool could be cool. Like, let's just see what happens. Mm-hmm. Like, we, we have to focus on some user journeys, otherwise we just don't have, like, enough strategy to prioritize.
[00:10:29] swyx: I think there's a lot of, like, really strong opinions that you've had. Do you have, like, sort of like a Tao of Sarah Sachs? Like, you know, like, how do you run your team? Like, I feel like you just have accumulated all these strong opinions. Obviously part, part of this is your, your Token Town thing.
[00:10:43] Sarah Sachs: I think the Tao of working with Sarah Sachs is, um, you'd have to, it depends who you ask. Um, I think it depends if you're on my team or a partner, right, or a vendor.
[00:10:54] swyx: Yeah. Other people want to run their teams the way that you're, yeah. You're, like, bringing these things. 
And then also similarly, uh, Simon, when you did the custom agents demo, you had, like, well, we've been using custom agents and here's the super long list of everything that we do. No humans ever read it. Right? That's what you said. I was like,
[00:11:07] Sarah Sachs: yeah. So I think for, for me, um, something that I learned very quickly and became very comfortable with was that my job was not to be the ideas person or the technical expert. My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on, and had an avenue to prioritize what they thought was important. And I think that's true with all leadership, but I think especially on the AI team. Almost all of our best ideas come from prototypes, from people that have a cool idea because they saw a user problem, and it's a huge disservice if all of those ideas have to pass, like, the sniff test of what me and a product partner or Simon and Ivan decided were the direction, right? Because a lot of what we're doing is leaning into capabilities. So I think that's the first thing, is, like, I don't really view, like, the role of engineering leadership as, like, uh, hierarchical, nor has it ever been, but especially now, like, very willing to change direction based on, um, like, proof is in the pudding. Yeah. And, like, I think we have rebuilt our harness three or four times. And when you do that, then the second rule of engineering leadership is, like, you need to build a team that's comfortable deleting their own code and is very low-ego and is driven by what's best for the company. And, um, doesn't write design docs because they think it's their promotion packet. Right. And that's a culture that Notion had long before I joined, but, like, our willingness to just swarm on different problems and, um, redo things that we've built before because something has changed. Like, there's a lot of friction that can happen at companies when you do that. 
And it doesn't happen at Notion. And because it doesn't happen, when new people join, they don't wanna be the ones that are saying, we shouldn't do this, I wrote that code. So then, you know, you create a culture that everyone adopts, and that culture comes directly, I think, from Simon and Ivan though, because they're very open-minded.

[00:12:50] swyx: Anything that you'd add?

[00:12:50] Simon Last: I'm not a manager like Sarah is. A lot of my role is really to try to think a little bit ahead, make sure that we're building on the right capabilities, and then the prototyping stuff. And yeah, it's really, really critical to always just be starting again. It's like, okay, this is a new thing. What does this mean? What if we just rethought everything or rewrote everything? And so I'm basically just doing that in a loop every six months.

[00:13:16] swyx: Yeah. Do you believe in internal hackathons for this stuff?

[00:13:19] Sarah Sachs: I think there's like two different versions. So one is, we just have a solid bench of senior engineers that come and go on what we call the Simon Vortex and productionizing what we built, right? Because when you're in the Simon Vortex, the velocity is super high, the direction changes daily, and it's meant to be like the equivalent of a skunkworks lab. We don't need to do hackathons for that. We need to have senior engineers that we trust to come in and out of those projects. For instance, management boundaries are really loose. Like, you report to him, but you work for her right now. That's something that when we hire managers, it's important they don't care about, because we tend to form org structures

[00:13:54] swyx: Don't be too territorial.

[00:13:55] Sarah Sachs: We form orgs after we ship things, not before, just historically.
Um, the second thing is we do have companywide hackathons. Actually, we just had our demo day this morning for the hackathon we had last week. That's more for people that aren't directly working on the project, feeling like they have the time to pause and learn how to make themselves more productive, or how they would use Notion custom agents to build something. Or part of the hackathon was actually encouraging everyone across the company to build their own agentic tool-calling loop from scratch. Follow, like, an Every blog post on how to do it, I think, because we want

[00:14:26] swyx: Just with the compound engineering one. Yeah.

[00:14:28] Sarah Sachs: We want everyone in the company to use Claude Code, or whatever coding agent they please, and understand that fundamental. So we set aside a day and a half where all leadership encouraged everyone on their teams across the company to do it. So we have hackathons like that. I would say, kind of facetiously, everything we build is a little bit like a hackathon until it graduates and puts on big boy pants and has a product ops rollout leader and has assigned data scientists and stuff like that.

[00:14:54] swyx: Security review, enterprise stuff.

[00:14:56] Sarah Sachs: Actually, security review is one of the things that we bring in first, because it just slows us down way more [otherwise] and causes a lot of tension, and they build better product if they're involved early. So that is probably the first person to get involved in something.

[00:15:09] swyx: That's the right PR-approved answer.

[00:15:10] Sarah Sachs: No, but it's not just PR approved. It's like, um,

[00:15:13] swyx: It's actually real. It's actually real. I'm just saying, scar

[00:15:15] Sarah Sachs: tissue.

[00:15:15] swyx: Yeah.

[00:15:16] Sarah Sachs: Because, you know, my background's also, I worked at Robinhood for a number of years. Yes.
So, like, compliance and things like that are a little bit more... you learn the hard way when it doesn't come naturally.

[00:15:26] Simon Last: Yeah. I think the hackathon is really important for uplifting the general population, but if that's the only way you can build new things, you're kind of toast. I mean, it has to be the daily process, you know, building these new things. And I think in the AI era a lot more leverage accumulates to the most curious and excited people. And so we're all about just activating that energy. You know, if someone's prototyping something on the weekend that they're excited about and it's important, that should be the main thing that we're doing. It's not a hackathon that we schedule once a quarter, it's just, yeah, daily process. Part of the culture.

[00:16:02] Sarah Sachs: I mean, that's how we shipped image generation in Notion. It was always this thing that would be kind of nice to have, but it wasn't really clear where that was necessarily aligned in product priorities. It'd be a lot of work. And we had someone on the database collections team, Jimmy, who was like, I really wanna do image generation for cover photos and inside Notion. And we're like, if you wanna build it, do it, please. We encourage you. We gave him all the resources of working directly with Gemini and being able to track the token usage and it working through endpoints. We gave him eval support, everything, and then it became a full project.

[00:16:34] Alessio: Yeah.

[00:16:35] Sarah Sachs: That's why you can't have ego as a leader. That's how we work.

[00:16:39] Alessio: What's the size of the team today, both engineering and overall?

[00:16:43] Sarah Sachs: I manage the team that we'll call core AI capabilities and infrastructure. That's about 50 people.
But then we have partner teams that do packaging, so how it shows up in the corner chat versus custom agents versus meeting notes. That's another 30, 40 people. And then every team that has a product surface at Notion that a user can interface with owns the tool that the agent interfaces with. The editor team, the team that did CRDTs for offline mode, is the same team that handles how two agents edit competing blocks. Right? It's the same problem. The team that built the underlying SQL engine is the same team that owns how the agent asks it to run a SQL query, and that it does it performantly. And so, in that regard, anyone working on product engineering is tasked with making it work for customers that are humans and agents, because over time the majority of our traffic will be coming from agents using our interface, not humans. And so our objective is to make it so that the whole product org is building for agents.

[00:17:40] Alessio: Yeah. How has it changed internally? The activation bar is kind of lowered a lot. Like, anybody can create a prototype somewhat easily, especially if you have an existing code base. Have you raised the bar on what type of prototype people need to bring forward to be taken, not like seriously, but, you know what I mean?

[00:17:58] Simon Last: Yeah. I think the bar is lowered in many ways. One thing our team built that is really cool is, our design team made a whole separate GitHub repo called the design playground. And it's basically just to create a bunch of helper components for quickly throwing together UIs. And it's become actually quite sophisticated. Like, it has an agent in there, and that's pretty fun. So they pretty much don't do mocks, they just make full prototypes.

[00:18:27] swyx: Here it is. It works.

[00:18:28] Simon Last: They give you like a URL.
They're like, okay. All right, so we have to make the real production version of that. And then for engineers, a prototype looks like just making a feature flag that actually works. That's sort of the bar.

[00:18:39] Sarah Sachs: Something to understand that's really unique about Notion, and one of the reasons I joined, we're super lucky: no one uses Notion in their job as much as people that work at Notion.

[00:18:46] Simon Last: Of course.

[00:18:47] Sarah Sachs: So I think there's very few companies, maybe if you worked on Chrome, I guess. But everything that we ship, we ship internally first and get a lot of really quick feedback. And also sometimes our dev instance is totally borked and you have to change a bunch of flags to get things done. But everyone, so people that do IT ticketing, people that do supply chain procurement, recruiting, everyone is using the same instance of Notion with a lot of flags on for these prototypes people build. And so Brian Levin, one of the designers on our team, I think, evangelized this concept of demos over memos.

[00:19:18] swyx: Ooh, too good.

[00:19:20] Sarah Sachs: Which has been very good for building demos, and I think it's put a big pressure point on us to have really strong product conviction, because if anything can be demoed, you really need a strong filter for making sure that if you're doing X amount of work, you're focusing on one tower, you're not just building a really flat hill. Right. That's actually where I think there has to be more conviction from our PMs and our designers and, well, the company really, about what journey we're going on.

[00:19:52] Simon Last: But overall, I feel like it works pretty well.
Like, almost all the engineers have good enough taste to realize that this prototype doesn't actually make sense in the product, or it does. So it's not that common that I would see a prototype and be like, oh, this makes no sense. It's like, you know, people are doing reasonable things, and then it's just a matter of which things we build first, and then often just figuring out how to turn it on and off. In our experimental chat UI, there's probably like a hundred checkboxes in there.

[00:20:22] Sarah Sachs: Kills me.

[00:20:23] Simon Last: The things you could turn on and off.

[00:20:25] Sarah Sachs: But I think that, okay, so that is kind of true, Simon, but being the person that manages the evals team, there is a level of intensity that it adds to the platform team. So, you know, if we're gonna do image generation in Notion, all of a sudden the way that we do attachments, and the way that our LLM completion, like, cortex talks and expects tokens back, and now it's getting images back. There's a lot of platform work that we do need to solidify a little bit. So sometimes it'll be in dev for a couple of weeks before it makes it to prod, just because we still have to make it robust, make it HIPAA compliant, ZDR compliant, figure out the right contracting with the vendor, whatever it is. And we need to eval it, because we want the team to still maintain what they build. That's the one thing: if we have a bunch of prototypes, it can't just be a small group of people that then maintain whatever N prototypes. So we have invested a lot of people in eval and model behavior understanding teams. We call it agent dev velocity. So your dev velocity building agents can be faster if we invest in that platform.
And so we have a whole org dedicated to agent platform velocity so that you can build your own eval and then maintain it once you ship it. So if a new model release comes out and we

[00:21:38] swyx: Every team maintains their own eval?

[00:21:40] Sarah Sachs: We maintain the eval framework. Every team owns their own evals, and a lot of them we've integrated to opt in to CI, or we run them nightly and we have a custom agent that triggers a team to look at the major failures. That's really critical, because we have all these different surfaces now. A lot of it's on the same agent harness, so it's easier to maintain. It's just packaging of different agent harnesses, but new functionality of the agent. Let's say that we wanna update, like, you know, they deprecated Sonnet 4 or whatever it is and we need to auto-update.

[00:22:11] swyx: Are they already? That's so... okay. Yeah. Actually, wasn't that long ago.

[00:22:14] Alessio: They were just 3.5.

[00:22:15] Sarah Sachs: 3.5, 3.7 just got deprecated.

[00:22:18] swyx: 3.7, 5.2, or, yeah. No,

[00:22:20] Sarah Sachs: It's not 5.2. Yeah, 5.4 is 40% more expensive than 5.2. So if they deprecated 5.2, you would hear from me about that one. But, uh, another conversation to have.

[00:22:35] swyx: I have a cheeky evals question for you. Have you noticed any secret degradation from any of the major model providers?

[00:22:40] Sarah Sachs: Secret degradation?

[00:22:42] swyx: Like, during the workday, when it's high traffic, it suddenly gets dumber.

[00:22:47] Sarah Sachs: Yeah. I mean, we definitely notice flakiness. We've definitely noticed, particularly for some providers, that things are slower during working hours and

[00:22:57] swyx: That's a latency argument, yes. Not a quality argument.

[00:22:59] Sarah Sachs: No.
I think the quality difference that's interesting is, and it really gets into, like, quantization, but companies that say they're selling the same model through different vendors, whether it be through first party or Bedrock, Azure, et cetera, we do see different qualities sometimes, and that's not necessarily what's advertised.

[00:23:21] swyx: Yeah. Kimi went to the point of, like, they shipped this eval across all the providers and it was very obvious who was secretly quantizing, and it was very

[00:23:28] Sarah Sachs: Yeah. But

[00:23:29] swyx: That's very embarrassing.

[00:23:30] Sarah Sachs: You know, we hire subprocessors to figure that out for us. So we just wanna understand where it's regressing or where it's optimized. And sometimes we're okay with regressions that optimize latency if they're the appropriate regressions. Our job is to make sure we have the evals to understand the changes that are important to us. And even when we're partnering with labs on pre-releases of models, they'll send us multiple snapshots. And this is less about quantization, but more just regressions. Like, they have shipped models that were not the snapshots that we wanted, and they have changed the snapshots that they shipped based on the feedback that we give, because our feedback tends to be more enterprise-work focused and not coding-agent focused. And definitely those can be bummers, like, you know, "we know that this wasn't the version you wanted, but we'll help you make it work." I mean, we always make it work, but that definitely happens.

[00:24:16] Alessio: Yeah. Do you have failing evals that you're just hoping will have success eventually when a good model comes out?

[00:24:23] Sarah Sachs: Uh, I mean, yeah. So I think... I mean, I could talk about this for 60 minutes, so I will limit myself.
I think it's a real issue when people say evals and it's just like, that's quality. I mean, it's like saying testing. It's not just unit tests, right? So we have the equivalent of unit tests, regression tests. Those live in CI, those have to pass a certain percent, you know, within some stochastic error rate. Then we have, as you're building a product, evals of: these aren't passing right now, and this is launch quality. So we have a report card and we need to, on these categories, you know, be at 80 or 90% of all of these user journeys to launch. And then we have what we call frontier or headroom evals, where we actively wanna be at a 30% pass rate. And that's actually been an effort that we took in partnership with Anthropic and OpenAI in the past maybe two or three months, because we actually hit a point where our evals were saturated and we weren't able to really give insightful feedback other than it wasn't worse. And not only is that not helpful for our partners, it's not helpful for us to understand where the stream is going, you know, going back to that analogy. And so we spent a lot of time thinking about what Notion's Last Exam looks like, right? Not just Humanity's Last Exam. Notion's Last Exam. And there's a lot of, you know, dreams about what that would look like. I know we've talked a lot about benchmarking, swyx, but yeah, Notion's Last Exam is a big thing inside the company and we have people staffed to it full-time, exclusively. We have a data scientist, a model behavior engineer, and a full-time evals engineer just dedicated to the evals that we pass 30% of the time.

[00:25:56] swyx: What are you hiring for?

[00:25:57] Sarah Sachs: MBEs. I am hiring.

[00:25:58] swyx: What is an MBE?

[00:25:59] Sarah Sachs: A model behavior engineer.
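
As a rough illustration of the three eval tiers described above (regression evals gated in CI, launch-quality report cards, and frontier or headroom evals meant to sit near a 30% pass rate), here is a minimal Python sketch. It is not Notion's tooling: the `EvalResult` type, the `assess` function, and the `REGRESSION_MIN` and `FRONTIER_SATURATED` thresholds are assumptions made up for the example. Only the "80 or 90%" launch bar and the roughly 30% frontier target come from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one eval suite run (hypothetical structure)."""
    suite: str
    tier: str      # "regression", "launch", or "frontier"
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


# Assumed thresholds, for illustration only.
REGRESSION_MIN = 0.95      # CI gate, leaving room for stochastic failures
LAUNCH_MIN = 0.80          # "80 or 90%" of user journeys, per the transcript
FRONTIER_SATURATED = 0.60  # far above ~30% suggests the eval has no headroom left


def assess(result: EvalResult) -> str:
    """Map a result to the action its tier implies."""
    rate = result.pass_rate
    if result.tier == "regression":
        return "ok" if rate >= REGRESSION_MIN else "block-merge"
    if result.tier == "launch":
        return "launch-ready" if rate >= LAUNCH_MIN else "not-launch-ready"
    if result.tier == "frontier":
        # Frontier evals are not gates; a high pass rate is a signal
        # that harder cases are needed, not that something shipped well.
        return "saturated" if rate >= FRONTIER_SATURATED else "has-headroom"
    raise ValueError(f"unknown tier: {result.tier}")
```

The point of the separation is that the same pass rate means different things per tier: 30% would block a merge on a regression suite but is the intended steady state for a frontier suite.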
They started with the title data specialist before I joined, when they were working with Simon on, like, Google Sheets, and Simon just needed someone to look through Google Sheets and say, yes, no, this looks bad, this looks good. Right? And so we hired people with kind of diverse linguistics backgrounds. We had a linguistics PhD dropout and a Stanford new grad. And they're amazing. And they formed a new function, basically. And over time we've built a whole team, with a manager who's now kind of reinventing what that role is with coding agents. So they used to be kind of manually inspecting code. Now they're primarily building agents that can write evals for themselves, or LLM judges. There's a really funny day, I can send you the picture, where Simon, about a year and a half ago, was teaching them how to use GitHub. They're on the whiteboard and it was like, okay, I think it would be so much faster if our data specialists learned how to use GitHub and learned how to commit these things in the code. And that was then, and now I think, you know, coding has become a lot more accessible. But moving forward it's this mix of data scientist, PM, and prompt engineer, because there's craft in understanding even what models can and can't do. How do we define that headroom? How do we define what a good journey is? Is this model better or not? Why is this failing? There's some qualitative work, but then there's also a lot of instinct and taste to it, and that's not necessarily software engineering. And so we have very firm conviction, and we have had for a number of years now, that that is its own career path, and we have always welcomed the misfits, so to speak. So we really firmly believe that you don't need an engineering background to be the best at this job.
And that's what's quite unique about this particular role.

[00:27:37] Simon Last: Yeah, this is something that I've been pretty excited about recently: we made an effort basically to treat the eval system as an agent harness. So if you think about it, you should be able to have an agent end-to-end download a dataset, run an eval, iterate on a failure, debug, and then implement a fix. And ultimately you should be able to drive the full process with a human sort of observing the outer system. So yeah, we went pretty hard on that, and that's worked extremely well so far. It's basically just to turn it into a coding agent problem.

[00:28:11] swyx: Your coding agent or just whatever harness?

[00:28:13] Simon Last: No, coding agent. Yeah, Claude Code. It should be totally general. I think it would be a mistake to fix it on any particular coding agent. At the end of the day, it's just CLI tools.

[00:28:21] Sarah Sachs: It's like the same way that you would have a coding agent write the unit test, you should have a coding agent write the eval.

[00:28:26] swyx: Yeah.

[00:28:26] Sarah Sachs: But there's a lot of supervision in that still. We just don't believe that supervision has to come from software engineers, because a lot of it is, like, kind of UXR-y and whatever, and these are the people that also triage failures and tell us where we should be investing next.

[00:28:40] swyx: Yeah. I'm gonna go ahead and ask a spicy question. Is there a date there are no software engineers at Notion?

[00:28:46] Simon Last: Um,

[00:28:46] Sarah Sachs: What does it mean to be a software engineer?

[00:28:47] swyx: Exactly.

[00:28:48] Simon Last: I mean, I think the way things are going is, we're on some continuum where
if you look back three years ago, humans were typing all the code, and then we had autocomplete, so you're typing less of the code. Then we had sort of agents filling in lines, and now we're getting into agents doing longer-range tasks where you can debug and implement a fix and then verify it works and, you know, get your PR even merged and deployed. I think we're sort of just moving up the abstraction ladder, and then the human role becomes more about observing and maintaining the outer system. There's a stream of agents flowing through, like, merging PRs. What's going off the rails? What do I need to approve? Is there a learning or memory mechanism that works? So it's kind of a hard engineering problem. There's a lot to do there. I think we're just sort of moving up the stack.

[00:29:34] Sarah Sachs: The same transition machine learning engineers have made, right? Like, I haven't looked at a PR curve in a while.

[00:29:39] swyx: Yeah. You used to do this stuff and now, um, auto research can do it.

[00:29:42] Sarah Sachs: Right? Like, I think it depends on what you define as a software engineer.

[00:29:46] swyx: Yes. That's changing for sure.

[00:29:49] Sarah Sachs: I think every software engineer at Notion this summer went through this, um, one of our engineering leads at the company called it, like, every software engineer is going through the identity crisis that every manager goes through, where all of a sudden they realize their ability to write code is less important than their ability to delegate and context switch. And I think that is a transition out of being a software engineer. But

[00:30:12] Simon Last: Yeah. There's a critical difference to being a manager, which is that it is actually very deeply technical.
The problem, you know, humans are very fuzzy, and you can't treat a team of humans like a rigorous system where PRs flow through and can be in a blocked status, and then what happens when they're blocked, right? With a set of agents, you actually can do that. And I think there's a lot of interesting technical rigor that goes into that. It's a technical design problem, ultimately.

[00:30:42] Alessio: What is the design of the software factory that you're building?

[00:30:46] Simon Last: Yeah, I mean, I think we're trying a lot of different things. Ultimately you want to design a system that requires as little human intervention as possible, but still maintains the invariants that you care about. So yeah, we're exploring a lot of different ideas there. I could talk about a few things I think are important. One thing I think is really important is having some kind of specification layer. You can just commit markdown files. That works pretty well, but

[00:31:15] swyx: It's nice to be Notion, man. I'm just saying, the natural home for specs is Notion.

[00:31:21] Simon Last: Yeah. Right. It can be a database of pages. I mean, it needs to be something that is human readable and reviewable, and I think that's pretty key. Another really key component is the self-verification loop. You need really, really good testing layers, basically. And that's a really deep problem. But by getting that right, you know, and then it's kind of the workflow of: what happens when there's a bug? How does it flow into the system? Is it a subagent working on it? How does it make a PR and how does that get reviewed and merged? So there's the flow or process.

[00:31:56] swyx: Yeah. Cool.
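
One way to picture the "rigorous system where PRs flow through and can be in a blocked status" is a small state machine in which merging requires passing verification and blocked items are the only ones surfaced to a human. The Python sketch below is purely illustrative: the `Status` values, the `ALLOWED` transition table, and the `WorkItem` and `needs_human` names are hypothetical and do not describe how Notion's system is actually built.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    IN_REVIEW = "in_review"
    MERGED = "merged"


# Assumed transition table: which states an agent may move an item to.
ALLOWED = {
    Status.OPEN: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.BLOCKED, Status.IN_REVIEW},
    Status.BLOCKED: {Status.IN_PROGRESS},
    Status.IN_REVIEW: {Status.IN_PROGRESS, Status.MERGED},
    Status.MERGED: set(),
}


class WorkItem:
    """A bug or task flowing through the agent pipeline (hypothetical)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.status = Status.OPEN
        self.history = [Status.OPEN]

    def move(self, new: Status, tests_pass: bool = False) -> None:
        if new not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status.value} -> {new.value}")
        # Invariant from the self-verification idea: nothing merges unverified.
        if new is Status.MERGED and not tests_pass:
            raise ValueError("cannot merge without passing verification")
        self.status = new
        self.history.append(new)


def needs_human(items: list[WorkItem]) -> list[WorkItem]:
    """Only blocked items are escalated to the human observer."""
    return [item for item in items if item.status is Status.BLOCKED]
```

Encoding the invariants in the transition function, rather than in each agent's prompt, is the design choice being illustrated: the human watches the outer system instead of each individual step.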
Uh, you know, one thing we did work out before you guys came in was this demo, or this

[00:32:01] Simon Last: agents

[00:32:02] swyx: agent demo. Uh,

[00:32:03] Simon Last: So every,

[00:32:04] Alessio: Every time we do an episode, we try the product, right? I don't think there's ever been an episode that I haven't tried. Yeah.

[00:32:11] swyx: And "try" is a big word. Like, since day one Latent Space has been on Notion, but this is the net new thing. Yes.

[00:32:18] Alessio: So this is for Kernel Labs, which is the space we're in. So next week we're opening applications for tenants. So there's a web form, we got this form done here. So before, the workflow would be: I get an email, then I look at the person, it's like, should I spend time talking to this person? Then I respond, they respond back. So I built this. So, the name, it came up with on its own. Can you maybe... how does it come up with its own name?

[00:32:43] Simon Last: Yeah, that's a pretty apt name. It is just a random name generator.

[00:32:47] Alessio: Oh, that's funny. It just came

[00:32:49] Simon Last: The fact that it picked that is kind of hilarious. I'm pretty sure it's just determined,

[00:32:54] Sarah Sachs: Resilient Collector. I think I've never looked at the code for that. I've never second-guessed it. I think it's kind of like a Mad Libs situation.

[00:33:00] Simon Last: Yeah, I think you're right. It's totally deterministic. Oh, I thought it was great. Yes. Although, if you use the AI to set itself up, it can update its own name. So, okay.

[00:33:11] Sarah Sachs: How did you create it? Did you just do

[00:33:12] Alessio: classroom? I,

[00:33:13] Sarah Sachs: Okay.

[00:33:13] Alessio: I did, yeah. I'll say, just check my inbox for applications for a coworking space. Keep a people, so it created the database for me. Which I have here.
And I guess a database is like a Notion table, because everything is Notion. And then whenever an email comes in, like here, it just creates a new row for the person. And then it uses web search to enrich the profile. So it kind of searches the web and it's like, this is who this person is, this is when they say they wanna move in, and kind of updates everything else. I mean, it's not AGI, but to me, I don't wanna do this work. It took me maybe 15 minutes to set up the whole thing. And I really like that most of the information should live here. You know, it is not like some other tool asking me

[00:34:01] Sarah Sachs: Yeah.

[00:34:01] Alessio: to bring my stuff there. It's like, I would've probably already created a Notion thing.

[00:34:06] Sarah Sachs: Mm-hmm.

[00:34:06] Alessio: So

[00:34:07] Sarah Sachs: Most of our biggest use cases and gains are from that extra layer of human involvement in the process, right? And so one of our biggest use cases is bug triaging. So if someone posts something in Slack, can you just have a custom agent that lives there, that has its own routing constitution of what team this belongs to, creates a task in your task database, and then posts in that Slack channel, right? That's one of the first things that we built internally, I think. And it's completely changed the way that Notion functions as a company. Nothing falls through... well, most things don't fall through the cracks. We don't know what we don't know. But it's not replacing people, it's replacing processes.

[00:34:44] Alessio: Yeah.

[00:34:44] Sarah Sachs: Right.

[00:34:45] Alessio: And I'm curious how you think about composability of these things. So the other one I was working on is like a lease filler. So whenever somebody signs up as a tenant, it'll kind of fill out the lease for them. There should probably be some agent that is like an office manager agent
that can handle the request, make the lease, and then give them access to the office and all of that. How do you think about that feature?

[00:35:08] Simon Last: Yeah, so I mean, there's two ways you can compose. One way is by using the data primitives. So you could have one agent be writing to the database, and there's another agent that's watching the database. So that's one way that they can coordinate that's a little bit more decoupled, and works really well. Or you can couple them. So, I think it's actually not released yet, we're releasing it like next week: in the settings for an agent, you can give it access to invoke any other agent.

[00:35:34] swyx: Hmm.

[00:35:34] Simon Last: So you can have them just talk directly.

[00:35:37] swyx: Is there a limit on, like, number of recursions, or just

[00:35:40] Simon Last: Um, probably.

[00:35:42] swyx: You know what I mean? Like, you can just get an infinite loop that way.

[00:35:45] Simon Last: There's some kind of, yeah,

[00:35:46] Sarah Sachs: I think there is actually a number somewhere.

[00:35:49] swyx: I believe... I'm just, you know, like, someone's gonna screw up.

[00:35:51] Simon Last: You should try to see.

[00:35:53] swyx: Yeah. I mean, everything's gonna be paperclips.

[00:35:55] Simon Last: Oh, yeah. But that's really useful. So, you know, I helped someone internally the other day. They had built over 30 custom agents for our go-to-market team doing all kinds of different things. You know, for example, researching, like, filling in information about a customer, or triaging customer feedback, or something like that. Literally over 30 of them.
And then he even made a database of all the agents, and then he's like, okay, and now I'm getting over 70 notifications per day, with just the agents blocked on various things. And then I was like, oh, okay, cool. You know, the obvious thing to do there is to make a manager agent,

[00:36:32] Sarah Sachs: Right?

[00:36:33] Simon Last: that's gonna sort of be another abstraction layer in between you and your 30 agents. So yeah, we set it up with a manager agent that has access to invoke all the other agents, and it's sort of watching and observing them, and it just creates a layer of abstraction. So instead of 70 notifications per day, it's like five. And then the manager agent can help debug and fix any problems with the

[00:36:54] swyx: Is there a concept of like an inbox or something? Like, you're basically saying that they can message each other?

[00:37:00] Simon Last: Yeah.

[00:37:01] Sarah Sachs: Well

[00:37:01] swyx: They use the system of record, which is

[00:37:02] Sarah Sachs: Notion. So we

[00:37:03] Simon Last: Actually, yeah, we didn't make any special concepts at all.

[00:37:06] swyx: They're listening to the Notion notifications that I would've got?

[00:37:09] Sarah Sachs: They can just write a task to a database that the other agent's tasked with listening to, or they can actually call a webhook to the agent. Like, they can just @ the agent. Okay.

[00:37:17] Simon Last: Yeah, I mean, this is something that we're still working on. I think, generally the way we do these things is, you first make it possible, maybe in a sort of janky way. So I think the way I set them up is, you know, we created a new database that was sort of like issues
That the custom agents were experiencing, and then gave them all access to file an issue, and the manager has access to read the issues. And that works pretty well: essentially, give it its own internal issue tracker just for the agents. And then, you know, if that becomes a concept that seems generally useful, maybe we will think of how to package it in. But generally we try to just keep it to composing the primitives if we can. Another example of this is we have no built-in memory concept. Memory is just pages and databases. So if you wanna give an agent memory, just give it a page and give it edit access to that page.
[00:38:03] swyx: And the human can edit it. Agent can edit it.
[00:38:04] Simon Last: Yeah. And that pattern works extremely well. And depending on the use case, you can have it be just a page, or it could be an entire database, or it can have sub-pages. It's pretty open what you can do with that.
[00:38:15] Alessio: So when I was setting this up, I connected my inbox and it was like, do you wanna use Gmail or Notion Mail? And I'm like, I don't wanna use either, I just want you to do it. I'm curious how you think about Notion Mail, Notion Calendar, all of these kind of UI/UX interfaces, full stack...
[00:38:29] Simon Last: Notion.
[00:38:30] Alessio: Yeah. When at the same time you have the agents abstracting them away from you in a way, how do you spend the product calories, so to speak?
[00:38:37] Simon Last: Yeah, I think it's pretty important that you don't have to use Notion Mail to connect to the mail capability. So we can just connect to Gmail or whatever you want to use. And we're thinking of the mail service as being really great to the extent that it's really agent built, right? 
So maybe the mail app is just sort of a prepackaged agent that helps you automate your inbox.
[00:39:00] Alessio: Yeah, the auto labeling is great.
[00:39:03] Sarah Sachs: I think when we integrate with Gmail, for instance, we have a series of tools that are available via MCP or API to Gmail. When we integrate with Notion Mail, we have the Notion Mail engineering team to build us the exact right tools that optimize latency, performance, and quality. They own that quality. There's product leads there. They're directly thinking about the user problems that happen in mail. So it tends to be that when we build integrations and connections, we build natively first, and then think about extending them generally, just because it's also easier to build natively first. So that tends to be how we phase things out.
[00:39:43] swyx: Talking about integrations, you prompted me, so I gotta ask. MCP, CLI. What's going on? What's the opinion?
[00:39:48] Simon Last: Yeah. I mean, I'm definitely bullish and excited about CLI. I think there's a few really cool things about CLI. One really cool thing is that it's in the terminal environment, so it gets a bunch of extra power. So, for example, it can paginate and cursor through long outputs. And it has progressive disclosure inherently. So you don't see all the tools at once. You just see the CLI wrapper, and you can use the help commands and read files. And then I think the most important thing that's super cool is that it's also inherently bootstrapped. So if there's an issue, the agent can debug and fix itself within the same environment that it uses the tool.
[00:40:30] swyx: Mm.
[00:40:30] Simon Last: Right. Like, I think I saw a tweet this morning. 
Someone said, you know, my agent didn't have a browser, so I asked it to make a browser tool, and within a hundred lines of code it gave itself a little browser, like wrapping the Chromium API. That's pretty incredible. And then if there was a bug, it would just immediately try to fix it. Right. On the other hand, if you use, like, the Chrome DevTools MCP, I've had this issue where sometimes the transport gets messed up. If it gets messed up, the agent has no way to fix itself. It no longer has a browser, it's now broken. Right. I think that's pretty fundamental. But I would say a lot of the bad things about it can be fixed. So I think, like, progressive disclosure, that can be fixed with the right harness. It obviously doesn't make sense to show it all the tools all the time. That's not really inherent to the MCP protocol. It's just how you wrap it and use it.
[00:41:16] swyx: There's many poorly built MCPs because we didn't know.
[00:41:19] Simon Last: Yeah, yeah. I mean, it was just early. The obvious thing to start with is to just show it all the tools, and it's like, okay, now we have a hundred tools. Yeah. And the tool calling actually works. So let's...
[00:41:28] swyx: ...of your success.
[00:41:29] Simon Last: ...give it a way to filter or search the tools. So yeah, broadly speaking, I'm really bullish on CLI. I'm still bullish on MCPs in certain environments. In particular, MCP is really great for when you want sort of a narrow, lightweight agent. I think there's definitely a lot of use cases where you don't want a full coding agent with a compute runtime. And also you want it to be more tightly permissioned. MCP inherently has a really strong permission model: all you can do is call the tools. 
A CLI is a little bit murkier. It's like, can it access the API token? Are you properly, sort of, re-encrypting the token so it can't exfiltrate it? It introduces a lot of new issues, which are real and hard to solve. And MCP is just the dumb simple thing that works, and it's pretty good.
[00:42:12] Sarah Sachs: I'll add two more perspectives, not from it working well for Notion, but on how Notion commits to both platforms. Notion is dedicated to being the best system of record for where people do their enterprise work. So we will always support our MCP insofar as other people are using MCPs, right? So regardless of our perspective, we've put a lot of effort into our MCP, and we have a fantastic team that we're building to do more there. And the second thing I'll say: we all think a lot, but lately I've been thinking a lot about making sure there's a value alignment in pricing with capability.
[00:42:43] swyx: Literally our next question.
[00:42:44] Sarah Sachs: And needing language to execute deterministic tasks feels wasteful, and relying on a language model to interface with third-party providers seems wasteful for tasks that don't require it. And particularly because our custom agents are using usage-based pricing, we think of pricing as the barrier of entry for use of our product, and we're quite committed to making sure that it's not wasteful. Not just because it's a bad deal for our customers, but it's also bad business. 
We wanna have as many buyers, like, there's an elasticity of demand. And so if we can have our agents properly execute code that calls a CLI deterministically, it's a one-time cost, right? Versus constantly having a language model integrate with an MCP over and over and paying those repeated token fees. And if it's happening outside the cache window, then you're paying for it over and over, and it's just kind of unnecessary and less deterministic when it doesn't have to be.
[00:43:36] Alessio: Yeah, the open-endedness I think is the main thing. Like, if I go write code to just call an API, I would never use an MCP. But then you need an MCP sometimes when you know what to call, but you don't want it to restart. Versus, I think the "it built a browser from scratch" thing is great when you're doing it on your own, but if your customers were having your AI write a browser from scratch every time and you had to pay the token cost of that, you'd be like, no, no, the Chrome DevTools MCP is actually pretty great, just use that. I'm curious, how do you make that decision? Should it be just a straight API call, very narrow? Should it be an MCP? Should it be super open-ended?
[00:44:10] Sarah Sachs: Do you mean for when we ship Notion capabilities, or when we add capabilities to...
[00:44:13] Alessio: Notion...
[00:44:14] Sarah Sachs: ...AI, or...
[00:44:14] Alessio: I mean, you might have a capability that the only way to do is an open-ended agent, like an agent with a coding sandbox.
[00:44:21] Sarah Sachs: Yeah. In Notion ai they're not explicit, not We also ship an MCP.
[00:44:24] Alessio: Yeah. Yeah. In B,
[00:44:25] Sarah Sachs: Yeah.
[00:44:26] Alessio: Internally. Okay. Is there ever a discussion of, like, we're not gonna ship it because we're not able to tie it down? Or are you happy to just, like...
[00:44:33] Sarah Sachs: Um, no. 
I mean, there are a lot of things where we choose not to use MCP because we wanna add more high touch to quality. I think search and agentic find is the largest instance of that, where we have Slack and Linear and Jira search in Notion that is not necessarily using the search MCP functionality that is provided by those companies. And that's because we think it's quite critical to how our agent trajectories work for us to have a little bit more control over the functionality of the search journey. And so it usually comes from quality, and there's a long tail of things, and that's why we built an MCP client, or an MCP server, excuse me, so that people can connect whatever they want. There's that long tail, right. But for search particularly, I would say that's the primary entry point, but there are other connections as well that it's a little bit of secret sauce a
Welcome to episode 351 of The Cloud Pod, where the weather is always cloudy! Justin, Matt, and Ryan are in the studio today and ready to bring you the latest in cloud and AI news. And it's that time of year again – we're coming up quickly on Google Next, place your AI money bets, so we've got our yearly predictions for what's coming from Vegas, as well as more news about Mythos, Amazon finally becoming a utility, and even an aftershow where we discuss the computing power of Artemis. It's a great show, so let's get started!

Titles we almost went with this week
Three StorageClasses Walk Into an AI Workload
Deprecated Models Don’t Die, They Just Fail Your API Calls
SQL Walks Into a Graph Bar and Stays
Too Many Agents Spoil the Workflow
One Registry to Rule All Your Rogue AI Agents
Eight CPUs Walk Into Space, Only One Comes Back
Stop Retyping the Same Gemini Prompt Like a Caveman
Claude Code Routines Let AI Work While You Sleep
AWS Builds a Yellow Pages for Your AI Agents
GPT Finally Stops Refusing to Talk About Hacking
None of the hosts is ready for Next
We are once again trying to look into our next next next crystal ball and failing
Google is gonna announce AI, it’s just mandatory now
Las Vegas is calling, our Livers are crying

A big thanks to this week's sponsors:

There are a lot of cloud cost management tools out there, but only Archera provides insured commitments. It sounds fancy, but it’s really simple. Archera gives you the cost savings of a 1 or 3-year AWS Savings Plan with a commitment as short as 30 days. If you do not use all the cloud resources you have committed to, Archera will literally cover the difference. Other cost management tools may say they offer “insured commitments”, but remember to ask: Will you actually give me my rebate? Because Archera will. Check out thecloudpod.net/archera to schedule a demo today.

We also wanted to tell you about something coming to the US for the first time — WeAreDevelopers World Congress! They’ve been doing this in Europe for years, 15,000-plus attendees in Berlin, it’s one of the biggest developer events over there. Coté from Software Defined Talk is actually speaking at their Berlin event this summer, so we’ve got some firsthand context here. In September, they’re launching the North America edition. San José, September 23 to 25. 500-plus speakers, 18 tracks — cloud, infrastructure, DevOps, security, AI, data engineering, all of it. Speakers from Datadog, Honeycomb, Sentry, Google, LinkedIn, and Stack Overflow. Olivier Pomel, Christine Yen, Milin Desai, Kelsey Hightower – plus workshops and masterclasses, not just talks. These are people who know how to do a developer conference at scale. wearedevelopers.us, code DEVPOD26 for 15% off. Group rates on top of that for 4 or more.

Follow Up
01:47 AI Cybersecurity After Mythos: The Jagged Frontier
Since the original Mythos/Project Glasswing announcement, AISLE published follow-up testing showing that small, inexpensive open-weight models can replicate much of the vulnerability detection work Anthropic attributed to Mythos, with all 8 tested models detecting the flagship FreeBSD NFS buffer overflow, including a 3.6B parameter model costing $0.11 per million tokens. A notable correction to the framing of the original announcement: cyb
Dave "CAC" Kellogg and Ray "Growth" break down one of the oldest productivity metrics in business and explain why, in the age of AI-native software, it has never mattered more. This episode covers the full arc from Frederick Taylor's factory floors to Cursor's $3.3M per employee, with the rigorous definitional discipline the Metrics Brothers are known for.

What We Cover:

The metric's 100-year history. Revenue per employee traces its roots to scientific management in the late 1800s, gained traction as a Wall Street efficiency screen in the 80s and 90s, and became a standard signal of business model quality in M&A diligence. The core math is simple: annual revenue divided by headcount. What is not simple is how you define the denominator.

FTE vs. employee: why the definition matters more than the formula. The E in FTE stands for full-time equivalent, not full-time employee, and that distinction drives real measurement decisions. How do you count a part-time contractor? What about 200 offshore developers on a third-party vendor's payroll? Ray and Dave walk through the practical choices, including why offshore headcount is almost never counted on a 1:1 basis and why that decision can dramatically change your benchmark comparison.

Public SaaS companies in 2025: the benchmark is $395K. Using the Benchmarkit SaaS 100 index (134 public SaaS companies), the median revenue per employee in 2025 is $395K, up from $327K in 2022, a 21% improvement in three years. ARR per FTE runs about 5-7% higher at $413K. The shift reflects the industry's move from growth-at-all-costs to efficient revenue growth.

Private SaaS companies: size matters. ARR per employee scales materially with company size. At the $5-20M ARR stage, the median is $144K. By $100M+ ARR, the median reaches $300K. The recurring-revenue tailwind from a large renewal base is a significant driver as companies scale.

AI-native companies have reset the benchmark entirely. Where the historical range for enterprise software was $200-400K per employee, AI-native companies operate at a fundamentally different level. Cursor reached $1.67M per employee at 60 people, and now runs at $3.3M per employee at 300 people. Midjourney is at $4.7M. Anthropic is in the $3-5M range on a run-rate basis. This is not a modest improvement over traditional SaaS. It is a 10x shift.

One important caution on the AI numbers. Many of the figures being cited by AI-native companies are monthly run-rate revenue annualized (last month times 12), not trailing 12-month GAAP revenue. When growth is compounding fast, that distinction can dramatically inflate the productivity figure. The Metrics Brothers flag this as a meaningful source of confusion in how the benchmark is being discussed today.

The AI tailwind may be temporary, at least in part. Current customer acquisition friction for AI software is unusually low, given experimentation budgets and departmental purchasing. As enterprise procurement tightens (74% of enterprise AI purchases now involve IT), GTM investment will likely increase, and revenue per employee for AI-native companies may stabilize or compress. Ray and Dave estimate that steady-state productivity is more likely to be in the 3-5x range over traditional SaaS, not 10x.

Revenue will replace ARR as the standard numerator. The rise of usage-based and hybrid pricing is rendering ARR less meaningful for a growing share of companies. Snowflake, Datadog, and MongoDB do not report ARR. As AI-native pricing models proliferate, Ray and Dave expect the industry to converge on revenue as the standard numerator across productivity benchmarks.

What about revenue per agent? Ray raises the forward-looking question: as AI agents take on SDR, sales, and other GTM functions, how do we measure agent productivity? Dave's take is that "revenue per agent" is likely a dead end, partly because agent instances are nearly impossible to count and partly because the right way to price and measure agents is to decompose their capabilities, not to anthropomorphize them as headcount equivalents.

The Bottom Line:

Revenue per employee is a deceptively simple metric with genuinely complex definitional choices underneath it. For B2B SaaS executives, the 2025 benchmarks are $395K (public) and $144-300K (private, depending on scale). For AI-native companies, the numbers are in a different category entirely, though some of that gap reflects accounting choices as much as true productivity gains. The metric is worth tracking closely, both as a board-level efficiency signal and as a leading indicator of business model quality.
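The arithmetic behind these notes can be sketched in a few lines. This is a minimal illustration, not the Metrics Brothers' own method: the function names are made up for the example, the $327K and $395K figures are the benchmarks quoted above, and the fast-growing company (10% month-over-month growth, 100 employees) is entirely hypothetical, chosen only to show how annualizing the latest month inflates the figure relative to trailing 12-month revenue.

```python
def revenue_per_employee(annual_revenue: float, headcount: float) -> float:
    """Core metric: annual revenue divided by headcount (or FTE count)."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return annual_revenue / headcount


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, as a fraction."""
    return (new - old) / old


def annualized_run_rate(monthly_revenue: list[float]) -> float:
    """Run-rate revenue: the most recent month times 12."""
    return monthly_revenue[-1] * 12


def trailing_twelve_months(monthly_revenue: list[float]) -> float:
    """Trailing revenue: the sum of the last 12 months actually booked."""
    return sum(monthly_revenue[-12:])


# Benchmarks quoted in the notes: $327K (2022) to $395K (2025).
improvement = pct_change(327_000, 395_000)

# HYPOTHETICAL company: $1.0M in month one, 10% monthly growth, 100 employees.
months = [1_000_000 * 1.10 ** i for i in range(12)]
run_rate_rpe = revenue_per_employee(annualized_run_rate(months), 100)
ttm_rpe = revenue_per_employee(trailing_twelve_months(months), 100)

if __name__ == "__main__":
    print(f"benchmark improvement: {improvement:.1%}")
    print(f"run-rate revenue per employee: ${run_rate_rpe:,.0f}")
    print(f"trailing-12-month revenue per employee: ${ttm_rpe:,.0f}")
    print(f"inflation factor: {run_rate_rpe / ttm_rpe:.2f}x")
```

Under these assumed growth numbers the run-rate figure comes out roughly 1.6 times the trailing figure for the same company and headcount, which is the kind of gap the caution above is pointing at.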
Every PM is scrambling to learn AI tools - but is that a trap? In this episode of Arguing Agile, hosts Brian Orlando and Om Patel summarize Shreyas Doshi's provocative article "Why Product Sense Is the Only Product Skill That Will Matter in the AI Age." Using the article as background for our discussion, we explore whether AI tools like Claude, Cursor, and NotebookLM are genuine superpowers for product managers or just the new baseline that everyone will have access to.
https://shreyasdoshi.substack.com/p/why-product-sense-is-the-only-product
We've structured this episode around several key debates, including:
We're proud to release this ahead of Ryan's keynote at AIE Europe. Hit the bell, get notified when it is live! Attendees: come prepped for Ryan's AMA with Vibhu after.

Move over, context engineering. Now it's time for Harness engineering and the age of the token billionaires.

Ryan Lopopolo of OpenAI is leading that charge, recently publishing a lengthy essay on Harness Eng that has become the talk of the town:

In it, Ryan peeled back the curtains on how the recently announced OpenAI Frontier team have become OpenAI's top Codex users, running a >1m LOC codebase with 0 human written code and, crucially for the Dark Factory fans, no human REVIEWED code before merge. Ryan is admirably evangelical about this, calling it borderline “negligent” if you aren't using >1B tokens a day (roughly $2-3k/day in token spend based on market rates and caching assumptions):

Over the past five months, they ran an extreme experiment: building and shipping an internal beta product with zero manually written code. Through the experiment, they adopted a different model of engineering work: when the agent failed, instead of prompting it better or to “try harder,” the team would look at “what capability, context, or structure is missing?”

The result was Symphony, “a ghost library” and reference Elixir implementation (by Alex Kotliarskyi) that sets up a massive system of Codex agents all extensively prompted with the specificity of a proper PRD spec, but without full implementation:

The future starts taking shape as one where coding agents stop being copilots and start becoming real teammates anyone can use and Codex is doubling down on that mission with their Superbowl messaging of “you can just build things”.

Across Codex, internal observability stacks, and the multi-agent orchestration system his team calls Symphony, Ryan has been pushing what happens when you optimize an entire codebase, workflow, and organization around agent legibility instead of human habit.

We sat down with Ryan to dig into how OpenAI's internal teams actually use Codex, why the real bottleneck in AI-native software development is now human attention rather than tokens, how fast build loops, observability, specs, and skills let agents operate autonomously, why software increasingly needs to be written for the model as much as for the engineer, and how Frontier points toward a future where agents can safely do economically valuable work across the enterprise.

We discuss:
* Ryan's background from Snowflake, Brex, Stripe, and Citadel to OpenAI Frontier Product Exploration, where he works on new product development for deploying agents safely at enterprise scale
* The origin of “harness engineering” and the constraint that kicked off the whole experiment: Ryan deliberately refused to write code himself so the agent had to do the job end to end
* Building an internal product over five months with zero lines of human-written code, more than a million lines in the repo, and thousands of PRs across multiple Codex model generations
* Why early Codex was painfully slow at first, and how the team learned to decompose tasks, build better primitives, and gradually turn the agent into a much faster engineer than any individual human
* The obsession with fast build times: why one minute became the upper bound for the inner loop, and how the team repeatedly retooled the build system to keep agents productive
* Why humans became the bottleneck, and how Ryan's team shifted from reviewing code directly to building systems, observability, and context that let agents review, fix, and merge work autonomously
* Skills, docs, tests, markdown trackers, and quality scores as ways of encoding engineering taste and non-functional requirements directly into context the agent can use
* The shift from predefined scaffolds to reasoning-model-led workflows, where the harness becomes the box and the model chooses how to proceed
* Symphony, OpenAI's internal Elixir-based orchestration layer for spinning up, supervising, reworking, and coordinating large numbers of coding agents across tickets and repos
* Why code is increasingly disposable, why worktrees and merge conflicts matter less when agents can resolve them, and what it really means to fully delegate the PR lifecycle
* “Ghost libraries”, spec-driven software, and the idea that a coding agent can reproduce complex systems from a high-fidelity specification rather than shared source code
* The broader future of Frontier: safely deploying observable, governable agents into enterprises, and building the collaboration, security, and control layers needed for real-world agentic work

Ryan Lopopolo
* X: https://x.com/_lopopolo
* Linkedin: https://www.linkedin.com/in/ryanlopopolo/
* Website: https://hyperbo.la/contact/

Timestamps
00:00:00 Introduction: Harness Engineering and OpenAI Frontier
00:02:20 Ryan's background and the “no human-written code” experiment
00:08:48 Humans as the bottleneck: systems thinking, observability, and agent workflows
00:12:24 Skills, scaffolds, 
and encoding engineering taste into context
00:17:17 What humans still do, what agents already own, and why software must be agent-legible
00:24:27 Delegating the PR lifecycle: worktrees, merge conflicts, and non-functional requirements
00:31:57 Spec-driven software, “ghost libraries,” and the path to Symphony
00:35:20 Symphony: orchestrating large numbers of coding agents
00:43:42 Skill distillation, self-improving workflows, and team-wide learning
00:50:04 CLI design, policy layers, and building token-efficient tools for agents
00:59:43 What current models still struggle with: zero-to-one products and gnarly refactors
01:02:05 Frontier's vision for enterprise AI deployment
01:08:15 Culture, humor, and teaching agents how the company works
01:12:29 Harness vs. training, Codex model progress, and “you can just do things”
01:15:09 Bellevue, hiring, and OpenAI's expansion beyond San Francisco

Transcript
Ryan Lopopolo: I do think that there is an interesting space to explore here with Codex, the harness, as part of building AI products, right? There's a ton of momentum around getting the models to be good at coding. We've seen big leaps in the task complexity with each incremental model release, where if you can figure out how to collapse a product that you're trying to build, a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you. It's done all the wiring and lets you just communicate in prompts. To let the model cook, you have to step back, right? Like, you need to take a systems thinking mindset to things and constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I'm putting in place. So I have solved this part of the SDLC.
swyx: [00:01:00] All right.
[00:01:03] Meet Ryan
swyx: We're in the studio with Ryan from OpenAI. 
Welcome.
Ryan Lopopolo: Hi.
swyx: Thanks for visiting San Francisco and thanks for spending some time with us.
Ryan Lopopolo: Yeah, thank you. I'm super excited to be here.
swyx: You wrote a blockbuster article on harness engineering. It's probably going to be the defining piece of this emerging discipline, huh?
Ryan Lopopolo: Thank you. It's been fun to feel like we've defined the discourse in some sense.
swyx: Let's contextualize a little bit. This is the first podcast you've ever done. Yes. And thank you for spending it with us. Where is this coming from? What team are you in, all that jazz?
Ryan Lopopolo: Sure, sure. I work on Frontier Product Exploration, new product development in the space of OpenAI Frontier, which is our enterprise platform for deploying agents safely at scale, with good governance, in any business. And the role of my team has been to figure out novel ways to deploy our models into packages and products that we can sell as solutions to enterprises.
swyx: And you have a background, I'll just squeeze it in there. Snowflake, Brex, [00:02:00] Stripe, Citadel.
Ryan Lopopolo: Yes. Yes. Same. Any kind of customer
swyx: entire life. Yes. The exact kind of customer that you want to...
Vibhu: So I'll say, I didn't expect the background. When I looked at your Twitter, I'm seeing the opposite. Stuff like this. So you've got the mindset of, like, full send AI, coding, stuff about slop, like buckling in your laptop on your Waymos. Yes. And then I look at your profile, I'm like, oh, you're in the other end too. Oh, perfect. Makes perfect.
Ryan Lopopolo: It's quite fun to be an AI maximalist. If you're gonna live that persona, OpenAI is the place to do it. And it's...
swyx: token is what you say.
Ryan Lopopolo: Yeah. Certainly helps that we have no rate limits internally. And I can go, like you said, full send at this stay.
swyx: Yeah. Yeah. 
So the Frontier, and you're a special team within OpenAI Frontier.
Ryan Lopopolo: We had been given some space to cook, which has been super, super exciting.
[00:02:47] Zero Code Experiment
Ryan Lopopolo: And this is why I started with kind of an out-there constraint to not write any of the code myself. I was figuring if we're trying to make agents that can be deployed into enterprises, they should be [00:03:00] able to do all the things that I do. And having worked with these coding models, these coding harnesses over 6, 7, 8 months, I do feel like the models are there enough, the harnesses are there enough, where they're isomorphic to me in capability and the ability to do the job. So starting with this constraint of "I can't write the code" meant that the only way I could do my job was to get the agent to do my job.
Vibhu: And just a bit of background before that. This is basically the article. So what you guys did is five months of working on an internal tool, zero lines of code, over a million lines of code in the total code base. You say it was 10x faster than it would have been if you had done it by hand. So...
Ryan Lopopolo: Yeah, that...
Vibhu: ...was the mindset going into this, right?
Ryan Lopopolo: That's right.
[00:03:46] Model Upgrades Lessons
Ryan Lopopolo: Started with some of the very first versions of Codex CLI, with the Codex Mini model, which was obviously much less capable than the ones we have today. Which was also a very good constraint, right? Quite a visceral feeling to ask the [00:04:00] model to build you a product feature, and it just not being able to assemble the pieces together. Which kind of defined one of the mindsets we had going into this, which is: whenever the model just cannot, you always pop open the task, double click into it, and build smaller building blocks that you can then reassemble into the broader objective. And it was quite painful to do this. Honestly, the first month and a half was. 
10 times slower than I would be. But because we paid that cost, we ended up getting to something much more productive than any one engineer could be, because we built the tools, the assembly station, for the agent to do the whole thing.
[00:04:43] Model Generations, Build Systems & Background Shells
Ryan Lopopolo: But yeah, so onward to GPT-5, 5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their kind of quirks and different working styles also meant we had to adapt the code base to change things up when the model was revved. [00:05:00] One interesting thing here is with 5.2, the Codex harness at the time did not have background shells in it, which means we were able to rely on blocking scripts to perform long horizon work. But with 5.3 and background shells, it became less patient, less willing to block. So we had to retool the entire build system to complete in under a minute. This is not a thing I would expect to be able to do in a code base where people have opinions. But because the only goal was to make the agent productive, over the course of a week we went from a bespoke Makefile build to Bazel, to Turbo, to Nx, and just left it there because builds were fast at that point.
swyx: Interesting. Talk more about Turbo to Nx. That's interesting 'cause that's the other direction that other people have been doing.
Ryan Lopopolo: Ultimately I have not a lot of experience with actual frontend repo architecture.
swyx: You're talking that Jessica built the sky. So I'm like, I know the Nx team. I know Turbo from Jared [00:06:00] Palmer. And I'm like, yeah, that's an interesting comparison.
[00:06:02] One Minute Build Loop
Ryan Lopopolo: The hill we were climbing, right, was make it fast.
swyx: Is there a micro front end involved? Is it, how complex, React...
Ryan Lopopolo: Electron-based single app sort of thing.
swyx: And must be under a minute. That's an interesting limitation. 
I'm actually not super familiar with the background shell stuff. Probably was talked about in the 5.3 release.
Ryan Lopopolo: It basically means that Codex is able to spawn commands in the background and then go continue to work while it waits for them to finish. So it can spawn an expensive build and then continue reviewing the code, for example.
swyx: Yeah.
Ryan Lopopolo: And this helps it be more time efficient for the user invoking the harness.
swyx: And I guess, just to really nail this, why does one minute matter? Like, why not five?
Ryan Lopopolo: We want the inner loop to be as fast as possible. One minute was just a nice round number and we were able to hit it.
swyx: And if it doesn't complete, it kills it or something?
Ryan Lopopolo: No. We just take that as a signal that we need to stop what we're doing, double click, decompose the build graph a bit to get us back under so that we [00:07:00] can let the agent continue to operate.
swyx: It's like a ratchet. It's like you're forcing build time discipline, because if you don't, it'll just grow and grow. That's right. And you mentioned that, like, the software I work on currently is at 12 minutes. It sucks.
Ryan Lopopolo: This has been my experience with platform teams in the past, where you have an envelope of acceptable build times and you let it go up to breach and then you spend two, three weeks to bring it back down to the lower end of the average low bed stop. But because tokens are so cheap, and we're so insanely parallel with the model, we can just constantly be gardening this thing to make sure that we maintain these invariants, which means. 
there's way less dispersion in the code and the SDLC, which means we can simplify in a way and rely on a lot more invariants as we write the software.

[00:07:45] Observability, Traces & Local Dev Stack

Vibhu: Lovely.

[00:07:46] Humans Are Bottleneck

Vibhu: You mentioned in your article that humans became the bottleneck, right? You kicked off as a team of three people. You're putting out a million lines of code, like 1,500 PRs, basically. What's the mindset there? As much as code is disposable, you're doing a lot of review. A lot [00:08:00] of the article talks about how you wanna rephrase everything as prompting. Everything the agent can't see, it's kind of garbage, right? You shouldn't have it in there. So what's the high level of how you went about building it, and then how you address, okay, humans are just PR review? Like how is human in the loop for this?

Ryan Lopopolo: We've moved beyond even the humans reviewing the code as well.

[00:08:19] Human Review, PR Automation & Agent Code Review

Ryan Lopopolo: Most of the human review is post-merge at this point. But post-merge, that's not even review. That's just

swyx: Oh, let's just make ourselves happy by

Ryan Lopopolo: Fundamentally, the model is trivially parallelizable, right? As many GPUs and tokens as I am willing to spend, I can have capacity to work with my codebase. The only fundamentally scarce thing is the synchronous human attention of my team. There's only so many hours in the day. We have to eat lunch. I would like to sleep, although it's quite difficult to stop poking the machine because it makes me want to feed it. You have to step back, right? Like you need to take a systems-thinking mindset to things and [00:09:00] constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I'm putting in place.
So I have solved this part of the SDLC. And usually what that has looked like is: we started out needing to pay very close attention to the code because the agent did not have the right building blocks to produce modular software that decomposed appropriately, that was reliable and observable and actually accrued a working front end, these things, right?

[00:09:35] Observability First Setup

Ryan Lopopolo: So in order to not spend all of our time sitting in front of a terminal, at most doing one or two things at a time, we invested in giving the model that observability, which is that graph in the post here.

swyx: Yeah. Let's walk through this. Traces, and which existed first?

Ryan Lopopolo: We started with just the app, and the whole rest of it, from Vector through to all these logs and metrics APIs, was, I dunno, half an [00:10:00] afternoon of my time. We have intentionally chosen very high-level, fast developer tools. There's a ton of great stuff out there now. We use mise a bunch, which makes it trivial to pull down all these Go-written Victoria stack binaries in our local development. Tiny little bit of Python glue to spin all these up, and off you go. One neat thing here is we have tried to invert things as much as possible: instead of setting up an environment to spawn the coding agent into, we spawn the coding agent. Like, that's the entry point. It's just Codex. And then we give Codex, via skills and scripts, the ability to boot the stack if it chooses to, and tell it how to set some env variables so the app in local dev points at this stack that it has chosen to spin up. And this, I think, is the fundamental difference between reasoning models and the 4.1s and 4os of the past, where these models could not think, so you had to put them in [00:11:00] boxes with a predefined set of state transitions. Whereas here we have the model, the harness, be the whole box.
And give it a bunch of options for how to proceed, with enough context for it to make intelligent choices.

Vibhu: So a lot of that is around scaffolding, right? Yes. Previous agents, you would define a scaffold. It would operate in that loop, try again. That's pivoted since we've had reasoning models. They seem to perform better when you don't have a scaffold, right? That's right.

[00:11:28] Docs Skills Guardrails

Vibhu: And you go into niches here too, like your SPEC.md and having a very short AGENTS.md.

swyx: Yes. Yes.

Vibhu: Yeah. So you even lay out what it is here, but I like

swyx: the table of contents.

Vibhu: Yeah.

swyx: Like stuff like this, it really helps guide people because everyone's trying to do this.

Ryan Lopopolo: This structure also makes it super cheap to put new content into the repository to steer both the humans and the agents.

swyx: You reinvented skills, right?

Vibhu: One big agents and

swyx: skills from first principles.

Ryan Lopopolo: Skills did not exist when we started doing this.

Vibhu: You have a short, [00:12:00] 100-line overall table of contents and then you have little skills, right? Core beliefs MD, tech debt tracker. Yeah. Yeah. The scale is over

Ryan Lopopolo: The tech debt tracker and the quality score are pretty interesting, because this is basically a tiny little scaffold, like a markdown table, which is a hook for Codex to review all the business logic that we have defined in the app, assess how it matches all these documented guardrails, and propose follow-up work for itself. Before Beads and all these ticketing systems, we were just tracking follow-up work as notes in a markdown file, which we could spawn an agent on a cron to burn down. There's this really neat thing that the models fundamentally crave text.
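To make the markdown-table tracker concrete, here is a small sketch of how a scheduled job might read such a file and pick out open follow-ups. The column names, the sample rows, and the helper names `parse_markdown_table` and `open_items` are assumptions for illustration; the actual tracker format is not shown in the conversation.

```python
def parse_markdown_table(text):
    """Parse a simple pipe-delimited markdown table into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, r)) for r in body]


def open_items(text):
    """Return tracker rows whose status is not 'done'."""
    return [r for r in parse_markdown_table(text) if r.get("status", "").lower() != "done"]


# Hypothetical tracker contents.
TRACKER = """
| area     | issue                          | status |
|----------|--------------------------------|--------|
| network  | fetch calls lack timeouts      | open   |
| billing  | retry logic duplicated         | done   |
| renderer | missing trace spans on startup | open   |
"""

for item in open_items(TRACKER):
    print(f"follow up: [{item['area']}] {item['issue']}")
```

Keeping the tracker as plain text means the same file is readable by humans, diffable in review, and cheap for an agent to update.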
So a lot of what we have done here is figure out ways to inject text

swyx: into

Ryan Lopopolo: the system, right? When we get a page because we're missing a timeout, for example, I can just @ Codex in Slack on that page and say, I'm gonna fix this by adding a timeout. Please update our reliability documentation to require that all network calls have [00:13:00] timeouts. So I have not only made a point-in-time fix, but also durably encoded this process knowledge around what good looks like.

swyx: Yeah.

Ryan Lopopolo: And we give that to the root coding agent as it goes and does the thing. But you can also use that to distill tests out of, or a code review agent, which is pointed at the same things to narrow the acceptable universe of the code that's produced.

swyx: I think one of the concerns I have with that kind of stuff is you think you're making the right call by making it persisted for all time across everything. Yes. But then you didn't think about the exceptions that you need to make, right? And that you have to roll it back.

Vibhu: Part of it is

swyx: also sometimes it can follow your instructions too.

Vibhu: It's somewhat a skill, right? So it determines when it uses the tools, right? Like it's not like it'll run on every call. It'll determine when it wants to check quality score, right?

Ryan Lopopolo: Yeah. And we do, in the prompts we give these agents, allow them to push back.

[00:13:51] Agent Code Review Rules

Ryan Lopopolo: When we first started adding code review agents to the PR, it would be: Codex CLI locally writes the change and pushes up a PR; on [00:14:00] those PR synchronizations, a review agent fires. It posts a comment. We instruct Codex that it has to at least acknowledge and respond to that feedback. And initially the Codex driving the code author was willing to be bullied by the PR reviewer, which meant you could end up in a situation where things were not converging.
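As an illustration of distilling a check out of a documented rule like "all network calls have timeouts", the sketch below uses Python's standard `ast` module to flag `requests`-style calls that omit a `timeout` keyword. This is a stand-in under stated assumptions: the project discussed is an Electron app, so its real checks would target different code, and `NETWORK_CALLS`, `REMEDIATION`, and `missing_timeouts` are hypothetical names.

```python
import ast

# Assumed convention: these call names must always pass a timeout.
NETWORK_CALLS = {"get", "post", "put", "delete", "request"}

# Hypothetical remediation text, written so an agent reading the error knows what to do.
REMEDIATION = "pass timeout=<seconds>; see the reliability docs"


def missing_timeouts(source):
    """Return (line, message) for requests-style calls without a timeout keyword."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and func.attr in NETWORK_CALLS
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
        )
        has_timeout = any(kw.arg == "timeout" for kw in node.keywords)
        if is_requests_call and not has_timeout:
            findings.append((node.lineno, f"requests.{func.attr}: {REMEDIATION}"))
    return sorted(findings)


SAMPLE = '''
import requests
ok = requests.get("https://example.com", timeout=5)
bad = requests.post("https://example.com", json={})
'''

for line, message in missing_timeouts(SAMPLE):
    print(f"line {line}: {message}")
```

The point of the remediation string is that the failure message itself carries the instruction, so the rule reaches the agent at the moment it breaks it.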
So yeah, we had to,

swyx: It's just a thrash.

Ryan Lopopolo: We had to add more optionality to the prompts on both of these things, right? The reviewer agents were instructed to bias toward merging the thing, to not surface anything greater than a P2 in priority. We didn't really define P2, but we gave it, you

swyx: did define P2.

Ryan Lopopolo: We gave it a framework within which to score its output

swyx: and then greater than, P0 is worse, right? Yes. P2 is very good.

Ryan Lopopolo: P0 is you will nuke the codebase if

swyx: you merge this

Ryan Lopopolo: thing, right?

swyx: Yeah.

Ryan Lopopolo: But also on the code authoring agent side, we gave it the flexibility to either defer or push back against review feedback, right? This happens all the time, right? Like I happen to notice something and leave a code review [00:15:00] which could blow up the scope by a factor of two. I usually don't mean for that to be addressed exactly in the moment. It's more of an FYI: file it to the backlog, pick it up in the next fix-it week sort of thing. And without the context that this is permissible, the coding agents are gonna bias toward what they do, which is following instructions.

swyx: Yeah.

[00:15:19] Autonomous Merging Flow

swyx: I do wanna check in on a couple things, right? Sure. All the coding review agent, it can merge autonomously. I think that's something that a lot of people aren't comfortable with. And you have a list here of how much agents do: product code and tests, CI configuration and release tooling, internal dev tools, documentation, eval harness, review comments, scripts that manage the repository itself, production dashboard definition files, like everything. Yes. And so they're just all churning at the same time. Is there like a cord that any human on the team pulls to stop everything?

Ryan Lopopolo: Because we are building a native application here, we're not doing continuous deploy.
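The reviewer and author rules described above can be sketched as two small functions. This assumes one reading of the conversation: P0 is the most severe level and the reviewer surfaces only P0 through P2. The `Finding` type, the `expands_scope` flag, and the decision labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    priority: int   # 0 = most severe (P0); larger numbers are less severe
    summary: str
    expands_scope: bool = False


def reviewer_filter(findings, max_priority=2):
    """Reviewer side: only surface P0..P2, biasing toward merge."""
    return [f for f in findings if f.priority <= max_priority]


def author_decision(finding):
    """Author side: fix blockers now, defer scope-expanding feedback to the backlog."""
    if finding.priority == 0:
        return "fix_now"
    if finding.expands_scope:
        return "defer_to_backlog"
    return "fix_now" if finding.priority == 1 else "acknowledge"


findings = [
    Finding(0, "migration drops a column in use"),
    Finding(2, "rename helper for clarity"),
    Finding(1, "rework module boundaries", expands_scope=True),
    Finding(3, "nit: comment wording"),
]

for f in reviewer_filter(findings):
    print(f"P{f.priority} {f.summary}: {author_decision(f)}")
```

In the setup described, these rules live in prompts rather than code; the sketch only shows the shape of the decision that keeps the author and reviewer from thrashing.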
So there's still a human in the loop for cutting the release branch. I see. We require a blessed, [00:16:00] human-approved smoke test of the app before we promote it to distribution, these sort of things.

swyx: So you're working on the app, you're not building like infrastructure where you have like nines of reliability, that kinda stuff?

Ryan Lopopolo: That's correct. That's correct. Okay. And also, full recognition here that all of this activity took place in a completely greenfield repository. There should be no script that this applies generally to

swyx: this is a production thing, you're gonna ship

Ryan Lopopolo: to

swyx: customers. Of course. Yeah, of course. So this is real.

Vibhu: And one of the things there is, you mentioned you started this as a repo from scratch. The onboarding, the first month or so, was pretty, it was like working backwards, right? Yeah. And then you had to work with the system, and now you're at that point where you're very autonomous. I'm curious, okay, so how human-in-the-loop is it? What are the bottlenecks that you wish you could still automate? And part of that is also, where do you see the model trajectory improving and offloading more human in the loop? We just got 5.4. It's a really good,

Ryan Lopopolo: fantastic model, by the way.

Vibhu: Yeah. Yeah. It's the first one that's merged top-tier coding, so it's Codex-level coding, and reasoning, so general reasoning, both in one model. So

Ryan Lopopolo: and

Vibhu: computer [00:17:00] use, vision.

Ryan Lopopolo: Now with 5.4, I can just have Codex write the blog post, whereas for this one I had to bounce between chat.

swyx: Oh, I need to, I might be out of a job. Oh my God.

Ryan Lopopolo: Oh,

swyx: I know. You just gave me an idea for a completely AI newsletter that 5.4 could do. Yeah, I get it now.

Ryan Lopopolo: This sort of thing is just one example of closing the loop, right? Like the dashboard thing you mentioned.
We have Codex authoring the JSON for the Grafana dashboards and publishing them, and also responding to the pages, which means when it gets the page, it knows exactly which dashboards are defined and what alert was triggered by which exact log in the codebase, 'cause all of this stuff is collated together.

swyx: It has to own everything. Yes. Yeah. Yeah.

Ryan Lopopolo: And it means that if we have an outage that did not result in a page, it has the existing set of dashboards available to it. It has the existing set of metrics and logs, and can figure out where the gaps in the dashboard are, or [00:18:00] in the underlying metrics, and fix them in one go. In the same way, you would have a full-stack engineer be able to drive a feature from the backend all the way to the front end.

Vibhu: So it seems like a lot of the work you guys had to do was, you as a small team are fully working for a way that the model wants the software to be written. It's like less human legibility for better code legibility, agent legibility. How do you think that affects broader teams? So one, at OpenAI, do you liaise, like, this is how software should be written? Like I can imagine, say you join a new team with this methodology, this mindset. There's ways that teams do code review, teams write code, teams are structured, and a lot of it is for human legibility. So should we all swap? How does this play back, one, broader into OpenAI, and then broader into software engineering, right? Is it like teams that pick this up will, it's pretty drastic, right? You have to make a pretty big switch. Should they just full send? Yeah.

Ryan Lopopolo: The mindset is very much that I'm removed from the process, right? I can't really have deep code-level opinions about [00:19:00] things. It's as if I'm group tech leading a 500-person organization.

Vibhu: Yeah.

Ryan Lopopolo: Like it's not appropriate for me to be in the weeds on every PR.
This is why that post-merge code review thing is a good analog here, right? Like I have some representative sample of the code as it is written, and I have to use that to infer what the teams are struggling with, where they could use help, where they're already moving quickly and I can pivot my focus elsewhere.

Vibhu: Yeah.

Ryan Lopopolo: So I don't really have too many opinions around the code as it is written. I do, however, have a command base class, which is used to have repeatable chunks of business logic that come with tracing and metrics and observability for free. And the thing to focus on is not how that business logic is structured, but that it uses this primitive, 'cause I know that's gonna give leverage by default.

Vibhu: Yeah.

Ryan Lopopolo: Yeah, back to that sort of systems thinking.

Vibhu: And you have part of that in your blog post, enforcing architecture and taste, how you set boundaries for what's used. There's also a section on redefining [00:20:00] engineering and stuff, but yeah, it's just, it's interesting to hear.

Ryan Lopopolo: And as the models have gotten better, they have gotten better at proposing these abstractions to unblock themselves, which again lets me move higher and higher up the stack to look deeper into the future on what ultimately blocks the team from shipping.

swyx: Yeah. You mentioned, so this is primarily a, it is like a 1-million-line codebase Electron app. But it manages its own services as well, so it's like a backend-for-frontend type thing.

Ryan Lopopolo: We do have a backend in there, but that's hosted in the cloud. Yeah. This sort of structure is actually within the separate main and renderer processes within the

swyx: Electron. That's just how Electron works.

Ryan Lopopolo: Yeah, of course. So we have also treated MVC-style decomposition with the same level of rigor, which has been very fun.

swyx: I have a fun pun. This is a tangent. MVC is model view controller.
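The command base class mentioned above, where business logic gets tracing and metrics for free, might look roughly like the following. This is a Python stand-in under stated assumptions, since the project itself is an Electron app; `Command`, `METRICS`, and `RenameProject` are hypothetical names, and the in-memory dict stands in for a real metrics client.

```python
import logging
import time
from abc import ABC, abstractmethod

logger = logging.getLogger("commands")
METRICS = {}  # stand-in for a real metrics client: name -> list of durations


class Command(ABC):
    """Base class: subclasses implement execute(); run() adds timing and logging."""

    name = "command"

    @abstractmethod
    def execute(self, **kwargs):
        ...

    def run(self, **kwargs):
        start = time.perf_counter()
        logger.info("start %s", self.name)
        try:
            return self.execute(**kwargs)
        except Exception:
            logger.exception("failed %s", self.name)
            raise
        finally:
            # Record a duration whether the command succeeded or failed.
            METRICS.setdefault(self.name, []).append(time.perf_counter() - start)


class RenameProject(Command):
    name = "rename_project"

    def execute(self, *, old, new):
        return f"{old} -> {new}"


print(RenameProject().run(old="alpha", new="beta"))
```

The design choice is that reviewers only need to check that new logic goes through the primitive, not how each piece of logic is written internally.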
Any sort of full-stack web dev knows that. But my AI-native version of this is Model View Claw. The claw is the harness.

Ryan Lopopolo: That's right. That's right. I do think that there is an interesting space to [00:21:00] explore here with Codex, the harness, as part of building AI products, right? There's a ton of momentum around getting the models to be good at coding. We've seen big leaps in the task complexity with each incremental model release, where if you can figure out how to collapse a product that you're trying to build, a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you. It's done all the wiring and lets you just communicate in prompts to let the model cook. Yeah. It's been very fun. And there's also a very engineering-legible way of increasing capability. It's fantastic, right? Yeah. Just give the model scripts, the same scripts you would already build for yourself.

swyx: Yeah. Yeah. So for listeners, this is Ryan saying that software engineering or coding agents will eat knowledge work, like the non-coding parts where you would normally think, oh, you have to build a separate agent for it. No, start with a coding agent and go out from there. Which OpenClaw has, like it's Pi under the hood.

Ryan Lopopolo: [00:22:00] Yes.

Vibhu: Basically define your task in code. Everything is a coding

swyx: agent. By the way, since I brought it up, it's probably the only place we bring it up. Is there any OpenClaw usage from you? Any?

Ryan Lopopolo: No. No. Not for me. I don't have any spare Mac Minis rattling around my house.

swyx: You can afford it. No. I just, I'm curious if it's changed anything in OpenAI yet, but it's probably early days.
And then the other thing I wanna pull on here is, you mentioned ticketing systems and you mentioned PRs, and I'm wondering if both those things have to go away or be reinvented for this kind of coding. So Git itself is like very hostile to multi-agent.

Ryan Lopopolo: Yeah. We make very heavy use of worktrees.

swyx: But even then, I just dropped a podcast yesterday with Cursor, and they said they're getting rid of worktrees 'cause it still has too many merge conflicts. It's still too unintuitive. But go ahead.

Ryan Lopopolo: The models are really great at resolving merge conflicts. Yeah. And to get to a state where I'm not synchronously in the loop in my terminal, I almost don't care that there are merge

swyx: with disposable. [00:23:00] Yeah.

Ryan Lopopolo: We invoke a $land skill, and that coaches Codex to push the PR, wait for human and agent reviewers, wait for CI to be green, fix the flakes if there are any, merge upstream if the PR comes into conflict, wait for everything to pass, put it in the merge queue, and deal with flakes until it's in main. And this is what it means to delegate fully, right? This is, in a very large monorepo, probably a significant tax on humans to get PRs merged, but the agent is more than capable of doing this, and I really don't have to think about it other than keep my laptop open.

swyx: Yeah. I used to be much more of a control freak, but now I'm like, yeah, actually you could do a better job of this than me. Yeah. With the right context. Yes.

[00:23:47] Encoding Requirements

swyx: Anything else in harness in general?
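The land flow Ryan describes (push, wait for reviewers and CI, fix flakes, merge upstream on conflict, enqueue, repeat until merged) can be sketched as a function that picks the next step from the current PR state. The real skill is a prompt that coaches Codex, not code like this; `PrState`, its fields, the action names, and the ordering of the checks are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrState:
    pushed: bool = False
    reviews_approved: bool = False
    ci_green: bool = False
    ci_flaky: bool = False
    has_conflict: bool = False
    in_merge_queue: bool = False
    merged: bool = False


def next_action(pr):
    """Pick the next step of a land loop from the current PR state."""
    if pr.merged:
        return "done"
    if not pr.pushed:
        return "push_pr"
    if pr.has_conflict:
        return "merge_upstream"
    if pr.ci_flaky:
        return "fix_flake"
    if not pr.ci_green:
        return "wait_for_ci"
    if not pr.reviews_approved:
        return "wait_for_reviewers"
    if not pr.in_merge_queue:
        return "enqueue"
    return "wait_for_merge_queue"


print(next_action(PrState(pushed=True, ci_green=True)))
```

Calling `next_action` repeatedly as the state changes gives the loop its "keep going until it is in main" behavior without a human watching the terminal.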
Just this piece, I just wanna make sure we,

Ryan Lopopolo: I think one thing that I maybe didn't make super clear in the article, that I heard on Twitter as an interesting, let's respond [00:24:00]

swyx: to them. What's the chatter and then what's your response?

Ryan Lopopolo: Ultimately, all the things that we have encoded in docs and tests and review agents and all these things are ways to put all the non-functional requirements of building high-scale, high-quality, reliable software into a space that prompt-injects the agent. We either write it down as docs, or we add lints where the error messages tell how to do the right thing. So the whole meta of the thing is to basically tease out of the heads of all the engineers on my team what they think good looks like, what they would do by default, or what they would coach a new hire on the team to do to get things to merge. And that's why we pay attention to all the mistakes that the agent makes, right? This is code being written that is misaligned with some as-yet-not-written-down non-functional requirement.

swyx: Sorry, what did the online people misunderstand, or

Ryan Lopopolo: No,

swyx: what you

Ryan Lopopolo: responded to? Somebody just literally said that. I was like, oh yeah,

swyx: okay,

Ryan Lopopolo: This is the [00:25:00] thing. This is what I've been doing. Oh, you

swyx: agree? Yeah. I see. Interesting.

Ryan Lopopolo: One other neat thing, which I totally did not expect, is folks were just taking the link to the article and giving it to Pi or Codex and saying, make my repo this.

Vibhu: You achieve a whole recursion.

Ryan Lopopolo: And it was wildly effective. Really? It was wildly effective. No

Vibhu: way. It just actually is something I tried with 5.4 yesterday. I didn't have time. Last time I was like out speaking of something, and this is one of my things, I was like, okay, I have this article.
Can we just scaffold out what it would be like to run this? And I did it first as that, and then I was like, okay, let me take another little side repo and say, okay, if I was to fully automate this like this, because I haven't written a line of code, it's

Ryan Lopopolo: like over full, set

Vibhu: it right. The side thing I'm doing of voice, TTS, I'm just like, slopping out, whatever. It's nothing production. I'm like, how would I make this like this? And it's actually a really good way to learn what could be changed. It's just a good analysis, right? You give it all the code, you give it all the context, you give it the article, and it walks you through it very well. That's right. That's right.

[00:25:57] Inlining Dependencies
[00:25:57] Dependencies Going Away & Bret Taylor's Response

swyx: I guess one more thing before we go to Symphony is I wanted to cover [00:26:00] Bret Taylor's response. We had him on the show. He is your chairman, which is wild. Yeah. That he's reading your articles as well and getting engaged in it. He says software dependencies are going away. Basically they can just be vendored. Yes. Response?

Ryan Lopopolo: A

swyx: hundred percent. A hundred percent agree. You still, you still pay Datadog. You still pay Temporal. Thank you.

Ryan Lopopolo: Yep. The level of complexity of the dependencies that we can internalize is, I would say, low-medium right now, just based on model capability.

swyx: What is medium?

Ryan Lopopolo: I would say a couple-thousand-line dependency is a thing that we could in-house no problem. Call it an afternoon of time. One neat thing about it is, probably most of that code you don't even need. Like, by in-housing an abstraction, you can strip away all the generic parts of it and only focus on what you need to enable the specific thing. Yes.
You're building,

swyx: I've been calling this the end of b******t plugins.

Ryan Lopopolo: Yeah.

swyx: Because there's so much, when I publish an open source thing, I want to accept everything, be liberal in what I accept. This is Postel's law. But that means there's so much bloat. Yes. There's so much overhead.

Ryan Lopopolo: One other neat thing about [00:27:00] this too is when we deploy Codex Security on the repo, it is able to deeply review and change the internalized dependencies in a much lower-friction way than it would be to, like, push patches upstream, wait for them to be released, pull them down, make sure that's compatible with all the transitive dependencies I have in my repo, and things like that. So it's also much lower friction to internalize some of these things if code is free, 'cause the tokens are cheap sort of thing.

swyx: Yeah. Yeah. I think the only argument I have against this is basically scale testing, which obviously the larger pieces of software like Linux, MySQL, he calls out, even the Datadogs and Temporals, and then maybe security testing, where, yes, classically, I think, is it Linus Torvalds, it said security, open source is the best disinfectant.

Ryan Lopopolo: Many eyes.

swyx: Many eyes. And if you inline your dependencies and code them up, you're gonna have to relearn mistakes from other people that. Yep.

Ryan Lopopolo: Yep. And to internalize that dependency, you're back to zero and you have to start reassembling all those bits and pieces to, yeah, have [00:28:00] high confidence in the code as it is written. Yeah.

Vibhu: Even part of the first intro of this, you basically mentioned everything was written by Codex, including internal tooling, right? So internal tooling, like when you're visualizing what's going on, it's writing it for itself.

swyx: Yeah. I've built internal tools that way now, and I just show them off and they're like, how long did you spend? And I didn't spend any time.
I just prompted it,

Ryan Lopopolo: very funny story here.

swyx: Yeah, go ahead.

Ryan Lopopolo: We had deployed our app to the first dozen users internally, had some performance issues, so we asked them to export a trace for us, get a tarball, gave it to our on-call engineer, and he did a fantastic job of working with Codex to build this beautiful local dev tool, a Next.js app, where you drag and drop the tarball in and it visualizes the entire trace. It's fantastic. Took an afternoon, but none of this was necessary. Because you could just spin up Codex and give it the tarball and ask the same thing and get the response immediately. So in a way, optimizing for human [00:29:00] legibility of that debugging process was wrong. It kept him in the loop unnecessarily when instead he could have just let Codex cook for five minutes and gotten the same.

swyx: Yeah, you verify your instincts here of, this is how we used to do it, or this is how I would have used to solve it.

Ryan Lopopolo: Yeah. In this local observability stack, like sure, you can deploy Jaeger to visualize the traces, but I wouldn't expect to be looking at the traces in the first place because I'm not gonna write the code to fix them.

swyx: Yeah. So basically there needs to be like this kind of whole stack and owning the whole loop. I think that is very well established. And it sounds like you might be sharing more about that in the future, right?

Ryan Lopopolo: Yeah. I think we're excited to do

[00:29:36] Ghost Libraries Specs
[00:29:36] Ghost Libraries & Distributing Software as Specs

Ryan Lopopolo: We're gonna talk about Symphony in a little bit, but the way we distribute it is as a spec, which I think folks are calling ghost libraries on Twitter. This is such a cool name. It does mean it becomes much cheaper to share software with the world, right? You define a spec, how you could build your own, specifying as much as is required for a coding agent to reassemble it [00:30:00] locally.
The flow here is very cool. We have taken all the scaffolding that has existed in our proprietary repo and spun up a new one. Ask Codex, with our repo as a reference, to write the spec. We tell it: spin up a tmux, spawn a disconnected Codex to implement the spec, wait for it to be done. Spawn another Codex in another tmux to review the implementation compared to upstream, and update the spec so it diverges less. And then you just loop over and over, Ralph-style, until you get a spec that is with high fidelity able to reproduce the system as it is. It's fantastic.

Vibhu: And you're basically, you're not really adding any of your human bias in there, right? That's correct. A lot of times people write a spec and be like, okay, I think it should be done this way, and you'll riff on something. And it's, no, the agent could have just handled it. Like you're still scaffolding in a sense, right? I want it done this way. It can determine its spec better.

swyx: That's right. That's right. Part of me, I've been working a lot on evals recently, and part of me is wondering if [00:31:00] an agent can produce a spec that it cannot solve. Is it always capable of things that it can imagine, or can it imagine things that are impossible for it to do?

Ryan Lopopolo: I think with Symphony, there's this axis where you have things that are easy or hard, or established or new, right? And I think things that are hard and new is still something that the models need humans to drive.

swyx: Yeah. Yeah.

Ryan Lopopolo: But I think those other quadrants are largely solved, given the right scaffold and the right thing that's gonna drive the agent to completion.

swyx: It's crazy that it's solved.

Ryan Lopopolo: But it means that the humans, the ones with limited time and attention, get to work on the hardest stuff, like the problems where it's pure white space out in front.
Or like the deepest refactorings, where you don't know what the proper shape of the interfaces is. And this is where I wanna spend my time, 'cause it lets me set up for the next level of scale.

swyx: Yeah. Yeah. Amazing. Let's introduce Symphony. I think we've been mentioning it every now and then. Elixir. Interesting option.

Ryan Lopopolo: Yeah.

swyx: Yeah. I'm not,

Ryan Lopopolo: again, the Elixir manifestation here is just a derivative. Is it a model

swyx: chosen? Yeah.

Ryan Lopopolo: Yeah. Yeah. And it chose that because the process supervision and the GenServers are super amenable to the type of process orchestration that we're doing here. You are essentially spinning up little daemons for every task that is in execution and driving it to completion, which means the model gets a ton of stuff for free by using Elixir and the BEAM.

swyx: I had to go do a crash course in BEAM and Elixir, and I think most people are not operating at that scale of concurrency where you need that. But it is a good mental model for resumability and all those things. And these are things I care about. But tell me the story, the origin story of Symphony. What do you use it for? How did it form? Maybe any abandoned paths that you didn't take?

[00:32:46] Terminal Free Orchestration
[00:32:46] Symphony: Removing Humans from the Loop

Ryan Lopopolo: At the end of December we were at about three and a half PRs per engineer per day. This was before 5.2 came out in the beginning of January. Everyone gets back from holiday with 5.2 and no other work [00:33:00] on the repository. We were up in the five to 10 PRs per day per engineer. And I don't know about y'all, but it's very taxing to constantly be switching like that. Like I was pretty tapped out at the end of the day. Again, where are the humans spending their time? They're spending their time context switching between all these active tmux panes to drive the agent forward.

swyx: Yeah. No way.
Yeah.

Ryan Lopopolo: So let's, again, build something to remove ourselves from the loop. And this is what we frantically sprinted at here, to find a way to remove the need for the human to sit in front of their terminal. So a lot of experimentation with dev boxes and automatically spinning up agents. It seems like a fantastic end state here, where my life is a beach. I open Linear twice a day and say yes or no to these things. Yeah. And this is again a super, super interesting framing for how the work is done. Because I become more latency-insensitive. I have [00:34:00] way less attachment to the code as it is written. Like I've had close to zero investment in the actual authorship experience. So if it's garbage, I can just throw it away and not care too much about it. In Symphony, there's this rework state, where once the PR is proposed and it's escalated to the human for review, it should be a cheap review. It is either mergeable or it is not. And if it's not, you move it to rework. The Elixir service will completely trash the entire worktree and PR and start it again from scratch. Okay. And this is that opportunity again to say, why was it trash, right? What did the agent do that was

swyx: bad. Yeah.

Ryan Lopopolo: Fix that before moving the ticket to

swyx: in

Ryan Lopopolo: progress again.

swyx: Yeah. Why is this not in Codex app? I guess you guys are ahead of Codex app.

Ryan Lopopolo: Yeah, so the way the team has been working is basically to be as AI-pilled as possible and sprint ahead. And a lot of the things we have worked on have fallen out [00:35:00] into a lot of the products that we have. Like we were in deep consultation with the Codex team to have the Codex app be a thing that exists, right? To have skills be a thing that Codex is able to use, so we didn't have to roll our own. To put automations into the product.
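The rework state described above can be pictured as a small ticket state machine in which a rejected review throws the workspace away and the next attempt starts clean. Symphony itself is described as an Elixir service; this Python sketch is only a stand-in, and the state names, event names, and `Ticket` class are hypothetical.

```python
# Allowed (state, event) -> next state transitions; all names are assumptions.
TRANSITIONS = {
    ("todo", "start"): "in_progress",
    ("in_progress", "propose_pr"): "human_review",
    ("human_review", "approve"): "merged",
    ("human_review", "reject"): "rework",
    ("rework", "restart"): "in_progress",
}


class Ticket:
    def __init__(self, title):
        self.title = title
        self.state = "todo"
        self.attempt = 0
        self.workspace = None  # stand-in for a worktree path plus PR

    def apply(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} not allowed in state {self.state!r}")
        self.state = TRANSITIONS[key]
        if event in ("start", "restart"):
            self.attempt += 1
            self.workspace = f"worktrees/{self.title}-attempt-{self.attempt}"
        elif event == "reject":
            self.workspace = None  # discard the worktree and PR entirely
        return self.state


t = Ticket("add-timeouts")
for event in ["start", "propose_pr", "reject", "restart", "propose_pr", "approve"]:
    print(event, "->", t.apply(event))
```

Because nothing carries over between attempts, the only way to get a better second attempt is to fix the docs, prompts, or guardrails first, which is the discipline Ryan describes.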
So all of our automatic refactoring agents didn't have to be these hand-rolled control loops. It has been really fantastic to be, in a way, unanchored to the product development of Frontier and Codex and just very quickly try to figure out what works, and then later find the scalable thing that can be deployed widely. It's been a very fun way to operate. It's certainly chaotic. I have lost track very often of what the actual state of the code looks like, 'cause I'm not in the loop. There was one point where we had wired Playwright directly up to the Electron app with MCP. MCPs I'm pretty bearish on, because the harness forcibly injects all those tokens in the [00:36:00] context, and I don't really get a say over it. They mess with auto-compaction. The agent can forget how to use the tool. There's probably only, what, three calls in Playwright that I actually ever want to use. So I pay the cost for a ton of things. Somebody vibed a local daemon that boots Playwright and exposes a tiny little shim CLI to drive it. And I had zero idea that this had occurred, because to me, I run Codex and it's, oh, it's better. Yeah. Like, no knowledge of this at all. Uh-huh.

[00:36:30] Multi Human Chaos

Ryan Lopopolo: So we have had to, in human space, spend a lot of time doing synchronous knowledge sharing. We have a daily standup that's 45 minutes long because we almost have to fan out the understanding of the current state.

swyx: Yeah, I was gonna say this is good for single-human multi-agent, but multi-human, multi-agent is a whole, like, explosion of stuff.

Ryan Lopopolo: Yeah. And this is fundamentally why we have such a rigid, like, 10,000 [00:37:00] engineer level architecture in the app, because we have to find ways to carve up the space so people are not trampling on each other.

swyx: Sorry, I don't get the 10,000 thing.
Did I miss that?

Ryan Lopopolo: The structure of the repository is like 500 npm packages. It's architecture to excess for what you would consider, I think, normal for a seven-person team. But if every person is actually like 10 to 50, then the numbers on being super, super deep into decomposition and sharding and proper interface boundaries make a lot more sense.

swyx: Yeah. To me, that's why I talked about micro-frontends, and Nx is from that world, but cool. Just coming back to this, I dunno if you have other thoughts on orchestrating so much work going through this. Is this enough? Any aha moments?

Vibhu: It'll be interesting to see where... okay, so right now you pick Linear as your issue tracker, right?

swyx: Or is it actually Linear? This is actually Linear.

[00:37:55] Linear vs Slack Workflow

Vibhu: Oh, that's Linear. It's Linear.

swyx: Oh, I never looked at...

Vibhu: The video. The demo video I had to download to [00:38:00] run.

swyx: Because I'm a Slack maxi, but yeah, Linear. Linear is also really good. Yes.

Ryan Lopopolo: We do make good use of Slack. We fire off Codex to do all these lotion, elasticity, fix-ups, the things that sync that knowledge into the repository. It's super cheap. Yeah.

swyx: Yeah.

Ryan Lopopolo: Just do it in Codex.

swyx: My biggest plug is OpenAI needs to build Slack. You need to own Slack. Build yours. Turn this into Slack.

Ryan Lopopolo: I did read about it.

swyx: You did?

Ryan Lopopolo: Yeah.

[00:38:25] Collaboration Tools for Agents

Ryan Lopopolo: I would say that if we think that we want these agents to do economically valuable work, which is the mission, right? We want AI to be deployed widely, to do economically valuable work. Then we need to find ways for them to naturally collaborate with humans, which means collaboration tooling, I think, is an interesting space to explore.

swyx: Yeah, totally. Yeah.
GitHub, Slack, Linear.

Vibhu: Yeah, that was my thing. Okay, where do we see... right now Codex has started as the Codex model, then the CLI, now there's an app. The app can let me shoot off multiple Codexes in parallel, but there's no great team collaboration for Codex. And it [00:39:00] seems like your team had some say into what comes out, right? So you talked to 'em, Codex kind of was a thing. From there, if you guys are on the bound, what stuff that you might not focus on, but what do you expect other people to be building, right? So people that are 5x-ing, 50x-ing. Should you build stuff that's very niche for your workflow, for your team? Should it be more general so other people can adopt it? Is there a niche there? 'Cause part of it is just, okay, is everything just internal tooling? Do we have everything our own way? The way our team operates has our own ways that we like to communicate, or is there a broader way to do it? Is it something like an issue tracker? Just thoughts, if you wanna riff on that.

[00:39:35] Standardizing Skills and Code

Ryan Lopopolo: I think TBD. We have not figured this out in a general way. I do think that there is leverage to be had in making the code and the processes as much the same as possible. If you think that code is context, code is prompts, it's better from the agent behavior perspective to be able to look in a package in directory X, Y, Z, and not have to page so [00:40:00] deeply into directory A, B, C, because they have the same structure, use the same language, and have the same patterns internally. And that same leverage comes from aligning on a single set of skills that you're pouring every engineer's taste into, to make sure that the agent is effective. So in our code base we have, I think, six skills. That's it. And if some part of the software development loop is not being covered, our first attempt is to encode it in one of the existing set of skills, which means that we can change the agent behavior. Yeah.
More cheaply than changing the human driver behavior.

swyx: Yeah.

[00:40:39] Self Improvement via Logs

swyx: Have you experimented with agents changing their own behavior?

Ryan Lopopolo: We do.

swyx: Yeah. Or a parent agent changing a subagent's behavior, or something like that.

Ryan Lopopolo: We have some bits for skill distillation. So for example, there's one neat thing you can do with Codex, which is just point it at its own session logs to ask it to tell you how you can use [00:41:00] the tool better.

swyx: It's like introspection.

Ryan Lopopolo: Or ask it to do things.

Vibhu: How do I use this session better? What skills should I have?

swyx: I like the modification of "you can just do things" to "you can just ask agents to do things."

Ryan Lopopolo: Yeah. You can just codex things. This is like a silly emoji that we have, right? You can just codex things, you can just prompt things. It's a really glorious future we live in. But okay, you can do that one-on-one. We're actually slurping these up for the entire team into blob storage and running agent loops over them every day to figure out where, as a team, we can do better and how we reflect that back into the repositories. So everybody benefits from everybody else's behavior for free. Same for PR comments, right? These are all feedback. That means the code as written deviated from what was good. A PR comment, a failed build: these are all signals that mean at some point the agent was missing context. We gotta figure out how to...

swyx: Yeah.

Ryan Lopopolo: Slurp it up and put it back in the repo.

swyx: By the way, I do this exactly, right? When I use Claude Code for [00:42:00] knowledge work (Claude Cowork is a nice product, right? Yes. I think you would agree), I always have it tell me: what do I do better next time?
And that's the metaprogramming reflection thing. So I almost think you have six abstraction levels in Symphony, and almost like a zeroth layer. So the six levels are policy, configuration, coordination, execution, integration, observability. We've talked about a couple of these, but the zeroth layer is like: okay, are we working well? Can we improve how we work? Can I modify my own workflow.md or something? I don't know.

Ryan Lopopolo: Yeah, of course. Of course you can. This thing is also able to cut its own tickets, 'cause we give it full access. Yeah. Make a ticket to have it cut tickets. You can put in the ticket that you expect it to file its own follow-up work.

swyx: Yeah. Self-modifying. Yeah.

Ryan Lopopolo: Yeah.

[00:42:44] Tool Access and CLI First

Ryan Lopopolo: Don't put the agent in a box. Give the agent full accessibility over its domain.

swyx: I had a mental reaction when you said don't put the agent in a box. I think you should put it in a box. It's just that you're giving the box everything it needs.

Ryan Lopopolo: Yeah. Context and tools.

swyx: As developers, we're used to calling [00:43:00] out to different systems, but here you use the open source things like Prometheus, whatever, and you run it locally so that you can have the full loop, I assume.

Ryan Lopopolo: Yep.

Vibhu: I think, like...

Ryan Lopopolo: Another: you wanna minimize cloud dependencies.

Vibhu: You also want to make sure that you think about what the agent has access to. What does it see? Does it go back into the loop? In the most basic sense, you let it see its own calls and traces, and it can determine where it went wrong. But are you feeding that back in? So, at the most basic level, you wanna see exactly what's input and output: does the agent have access to what is being outputted, right? It can self-improve a lot of these things.

Ryan Lopopolo: It's all text, right?
My job is to figure out ways to funnel text from one agent to the other.

swyx: It's so strange. Way back at the start of this whole AI wave, Andrej was like, English is the hottest new programming language. It's here, it's just... yeah. The feature as well.

Vibhu: A lot of software, a lot of stuff, there's a GUI, it's made for the human. We're seeing the evolution of CLI for everything, right? All tools have CLIs. Your agents can use [00:44:00] them well. Do we get good vision? Do we get good little sandboxes? Right now, it's a really effective way, right? Models love to use tools. They love bash. They love to read through text. So slap on a CLI, let it go loose. That works for everything.

Ryan Lopopolo: It does. Yeah. Yeah.

[00:44:14] UI Perception and Rasterizing

Ryan Lopopolo: We've also been adapting non-textual things to that shape in order to improve model behavior in some ways, right? We want the agent to be able to see the UI. Agents do not perceive visually in the same way that we do. They don't see a red box, they see "red box button," right? They see these things in latent space. So if we want... hey, yeah, I do.

swyx: We have a ding that goes off every time. Latent space.

Ryan Lopopolo: Ding. Anyway, if we wanna actually make it see the layout, it's almost easier to rasterize that image to ASCII art and feed it in to the agent. Ha. And there's no reason you can't do both, right? To further refine how the model perceives the object it's [00:45:00] manipulating.

swyx: Cool. You wanna talk about a couple more of these layers that might bear more introspection, or that you have personal passion for?

[00:45:07] Coordination Layer with Elixir

Ryan Lopopolo: I will say that the coordination layer here was a really tricky piece to get right.

swyx: Let's do it. Yep. I'm all about that. And this is Temporal core.

Ryan Lopopolo: This is where, when we turn the spec into Elixir, the model takes a shortcut, right?
Like, it's: oh, I have all these primitives that I can make use of in this lovely runtime that has native process supervision. Which is, I think, a neat way to have taken the spec and made it more achievable by making choices that naturally map...

swyx: Yeah.

Ryan Lopopolo: To the domain, right? In the same way that you would prefer to have a TypeScript monorepo if you are doing full-stack web development, right? Because the ability to share types across the front end and back end reduces a lot of complexity.

swyx: That's what GraphQL used to be.

Ryan Lopopolo: That's right. And...

swyx: I don't know if it's still alive, but...

Ryan Lopopolo: [00:46:00] No humans in the loop here. So my own personal ability to write or not write Elixir doesn't really have to bias us away from using the right tool for the job. It is just wild.

swyx: Love it. I love it. Yeah. I wonder if any languages struggle more than others because of this? I feel like everyone has their own abstractions. That would make sense. But maybe it might be slower, it might be more faulty, where you'd have to just kick the server every now and then. I don't know. I think the observability layer is really well understood. Integration layer: MCP is dead. I think all of these are just a really interesting hierarchy to travel up and down. It's a common language for people working on the system to understand.

Ryan Lopopolo: The policy stuff is really cool, right? Yeah. You don't really have to build a bunch of code to make sure the system waits for CI to pass.

swyx: It's institutional knowledge.

Ryan Lopopolo: Yeah. You just give it the gh CLI with some text that says CI has to pass. It makes the maintenance of these systems a lot easier.

[00:46:57] Agent Friendly CLI Output

swyx: Do you think that CLI maintainers need to [00:47:00] do anything special for agents, or just as is?
It's good, because I don't think when people made the GitHub CLI they anticipated this happening.

Ryan Lopopolo: That's correct. The gh CLI is fantastic. It's great, super industry.

swyx: Everyone go try gh repo create, gh pull and then the pull request number, right? gh pr, like, 153, whatever. And then it, like, pulls...

Ryan Lopopolo: Basically my only interaction with the GitHub web UI at this point is gh pr view --web. Exactly. Glance...

swyx: At the diff.

Ryan Lopopolo: And be like, sure thing, send it. Yeah. But the CLIs are nice 'cause they're super token-efficient, and they can be made more token-efficient really easily. I'm sure you all have seen this: I go to Buildkite or Jenkins and I just get this massive wall of build output. And in order to unblock the humans, your developer productivity team is almost certainly gonna write some code that parses the actual exception out of the build logs and sticks it in a sticky note at the top of the page. And you basically [00:48:00] want CLIs to be structured in a similar way, right? You're gonna want to pass dash-silent to Prettier, because the agent doesn't care that every file was already formatted. It just wants to know it's either formatted or not, so it can then go run a write command. Similarly, in our pnpm distributed script runner, when we had one, when you do --recursive, it produces an absolute mountain of text. But all of that is for passing test suites. So we ended up wrapping all of this in another script...

swyx: To suppress the...

Ryan Lopopolo: Which you can vibe, to only output the failing parts of the tests.

swyx: You maybe pipe errors versus the standard, standard out. I don't know. Okay. Whatever. Too much thinking to have to do that. I used to maintain a CLI for my company and yeah, this is very core to my heart. But you're vibing my job.

Ryan Lopopolo: That's right.

swyx: Cool. Any other things? This is a long spec. [00:49:00] I appreciate that.
It's got a lot of strong opinions in here. Any other things that we should highlight? Obviously you can spend the whole day going through some of these, but I do think that some of these have a lot of care, or some of this you might wanna tell people: hey, take this, but make it your own.

[00:49:15] Blueprint Spec and Guardrails

Ryan Lopopolo: Fundamentally, software is made more flexible when it's able to adapt to the environment in which it is deployed, which means that things like Linear, or GitHub even, are specified within the spec but are not required pieces of it. There's a more platonic ideal of the thing, where you could swap in Jira or Bitbucket, for example. But being able to tightly specify things like the ID formats, or how the Ralph loop works for the individual agents, basically means you can get up and running with a fully specified system quickly that you then evolve later on. We never intended for this to be a static spec that you can [00:50:00] never change. It's more like a blueprint to get a starting point up and running.

swyx: Yeah.

Ryan Lopopolo: For you then to vibe later to your heart's content.

swyx: You have code and scripts in here, where it's, oh, I think this is a really good prompt. It's just a very long prompt.

Ryan Lopopolo: Fundamentally, the agents are good at following instructions, so give them instructions, and it will improve the reliability of the result. Much like the way we use Symphony, we don't want folks to have to monitor the agent as it is vibing the system into existence. So being very opinionated, very strict around what these success criteria are, means that our deployment success rate goes up. Yeah. It means we don't have to get tickets on this thing.

Vibhu: I think it all goes back to that "code is disposable," right? Early on, when you had the CLI or you'd kick off a Codex run, it would take two hours.
You would wanna monitor: okay, I'm in the workflow of just using one. I don't want it to go down the wrong path. I'll cut it off and just shoot off four. That was my favorite thing of the Codex app, right? Yeah. Just 4x it. [00:51:00] It's okay. One of them will probably be right, one of them might be better. Stop overthinking it. My first example was probably deep research. When you put out deep research and I'd ask it something, like I asked it something about LLM, it thought it was legal something and spent an hour, and came back with a report completely off the rails. And I was like, okay, I gotta monitor this thing a bit. No, don't monitor it. You want to build it so that it goes the right way. And you don't wanna sit there and babysit, right? You don't want to babysit your agents.

Ryan Lopopolo: With that deep research query that you made, looking at the bad result, you probably figured out you needed to tweak your prompt a bit, right? That's that guardrail that you fed back into the code base for the task, your prompt, to further align the agent's execution. Same sort of concepts apply there too.

swyx: When you talk... how are the customers feeling?

Ryan Lopopolo: For Symphony? I think we have none, right? This is a thing we have put out into the...

swyx: World. Symphony's internal, right? As long as you are happy, you are the customer. That'
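The wrapper Ryan describes around the recursive script runner, one that surfaces only the failing parts of the test output, can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the team's actual script: the `<package>: <message>` line format, the failure markers, and the function name `failing_only` are all hypothetical and would need to match your own runner's output.

```python
import re

# Hypothetical failure markers; tune these to your own test runner's output.
FAILURE_PATTERN = re.compile(r"\b(FAIL|FAILED|ERROR)\b")


def failing_only(output: str, sep: str = ": ") -> str:
    """Keep output only from packages whose lines contain a failure marker.

    Assumes each line looks like "<package><sep><message>". That format is
    illustrative and is not the format of any specific runner.
    """
    by_package: dict[str, list[str]] = {}
    for line in output.splitlines():
        package, found, message = line.partition(sep)
        if not found:
            continue  # skip lines that do not match the assumed format
        by_package.setdefault(package, []).append(message)

    kept: list[str] = []
    for package, messages in by_package.items():
        # Keep a package's whole log only if at least one line reports a failure.
        if any(FAILURE_PATTERN.search(m) for m in messages):
            kept.extend(f"{package}{sep}{m}" for m in messages)
    return "\n".join(kept) if kept else "all packages passed"
```

Called on a captured log, `failing_only(log_text)` returns only the lines for packages that failed, or a one-line summary when nothing failed, which keeps the token cost of a green run close to zero.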
Most teams don't realize they're missing critical data until something goes wrong.

In this episode, Austin Spiegel, co-founder and CTO of Sift and former SpaceX engineer, dives into why telemetry, simple in concept, a value and a timestamp, can become a massive problem in hardware. Miss even a fraction of a second, and you lose the story. Software engineers have plenty of tools to solve this. Hardware engineers haven't, until now.

We also talk leadership, what it's like stepping into management early, why teams can actually be too flat, and how your role shifts from doing the work to connecting context. On hiring, Austin explains why pedigree doesn't equal talent, and how Sift focuses on practical, real-world ability.

And throughout, one theme emerges: speed. Not just moving fast, but learning and iterating faster than anyone else.

If you're building complex systems or leading technical teams, this one hits on a lot of things that don't usually get said out loud.

Episode Highlights
00:00 What telemetry actually is (and why it fails)
05:07 Why hardware never got its “Datadog moment”
12:05 The real challenge of high-frequency data
17:43 Becoming a manager too early at SpaceX
22:27 Interviewing for skills and values over pedigree
26:59 The shift from doing work to providing context
31:32 Motivating engineers through customer impact

Key Takeaways
Telemetry is simple in theory but breaks at scale and speed.
Hardware teams lack the modern data tools software teams take for granted.
Flat organizations can create decision bottlenecks.
Great managers connect context more than they give answers.
Pedigree is a weak signal, practical ability matters more.
Interviews should mirror the actual job, not abstract problems.
Speed is really about learning faster than everyone else.
Engineers move faster when they're closer to the customer.

Links & Resources
Austin Spiegel
LinkedIn: https://www.linkedin.com/in/austin-spiegel/
Sift
Website: https://www.siftstack.com/
Matt Gjertsen
Website: https://www.bettereverydaystudios.com/
LinkedIn: https://www.linkedin.com/in/matthewgjertsen/
YouTube: https://www.youtube.com/@BetterEveryDayStudios
Today, we're revisiting a segment from our episode on Product-Led Growth and modern sales playbooks with Dan Fougere. Dan is the former Chief Revenue Officer at Datadog and former Head of Global Sales at Medallia, now advising high-growth startups. In this clip, Dan breaks down why traditional sales playbooks fail in PLG environments, and how leaders need to shift toward usage-based signals and first principles thinking. He explains how buyer engagement now starts inside the product, what those signals actually look like, and how sales teams should adapt their timing, messaging, and motion accordingly.

Dan Fougere is the former Chief Revenue Officer at Datadog and former Head of Global Sales at Medallia, now advising high-growth companies on scaling modern revenue models.

Connect with Dan: LinkedIn

Get the Force Management framework for building sales motions that align to how modern buyers evaluate and adopt products: The Predictable Revenue Framework: Guide for Leaders

Hosted by five-time CRO John McMahon and Force Management Co-Founder John Kaplan, the Revenue Builders podcast goes behind the scenes with the sales leaders who have been there, done that, and seen the results. This show is brought to you by Force Management. We help companies improve sales performance, executing their growth strategy at the point of sale.

Connect with Us: LinkedIn, YouTube, Force Management
"Companies designing for agents, not humans, are going to get a lot of lift."

ClickHouse started as an internal tool at Yandex. Today it's the database Anthropic, OpenAI, Meta and Tesla all run on.

In this episode, CEO Aaron Katz joins Lukas Biewald to talk about how he turned an open source project into a $15B company, why he acquired LangFuse knowing it could cost him customers, and what he's actually building for the agent era.

Snowflake, Datadog and Databricks all come up. He doesn't shy away.

Connect with us here:
Aaron Katz: https://www.linkedin.com/in/aaron-katz-5762094
ClickHouse: https://www.linkedin.com/company/clickhouseinc/
Lukas Biewald: https://www.linkedin.com/in/lbiewald/
Weights and Biases: https://www.linkedin.com/company/wandb/

00:00 Trailer
00:57 The Origin Story: From Yandex to ClickHouse Inc.
04:43 Building ClickHouse Cloud & Raising $300M
10:36 Growing Up Around Xerox PARC
12:51 Salesforce, Marc Benioff & the Dot-Com Bust
15:32 Cloud Skeptics vs. AI Skeptics | History Repeating
18:05 Building a Modern Go-To-Market Playbook
21:57 The SaaS Crash, Agents & the Future of Infrastructure
27:09 The Datadog Love-Hate Story
35:21 Hardest Moments: Russia, SVB & Sleepless Nights
43:16 Outro
Topics covered in this episode:
• Lock the Ghost
• Fence for Sandboxing
• MALUS: Liberate Open Source
• Harden your GitHub Actions Workflows with zizmor, dependency pinning, and dependency cooldowns
• Extras
• Joke

Watch on YouTube

About the show

Sponsored by us! Support our work through:
• Our courses at Talk Python Training
• The Complete pytest Course
• Patreon Supporters

Connect with the hosts
• Michael: @mkennedy@fosstodon.org / @mkennedy.codes (bsky)
• Brian: @brianokken@fosstodon.org / @brianokken.bsky.social
• Show: @pythonbytes@fosstodon.org / @pythonbytes.fm (bsky)

Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Monday at 11am PT. Older video versions available there too.

Finally, if you want an artisanal, hand-crafted digest of every week of the show notes in email form? Add your name and email to our friends of the show list, we'll never share it.

Michael #1: Lock the Ghost
The five core takeaways:
• PyPI "removal" doesn't delete distribution files. When a package is removed from PyPI, it disappears from the index and project page, but the actual distribution files remain accessible if you have a direct URL to them.
• uv.lock uniquely preserves access to ghost packages. Because uv.lock stores direct URLs to distribution files rather than relying on the index API at install time, uv sync can successfully install packages that have already been removed, even with cache disabled. No other Python lock file implementation tested behaved this way.
• This creates a supply chain attack vector. An attacker could upload a malicious package, immediately remove it to dodge automated security scanning, and still have it installable via a uv.lock file, or combine this with the xz-style strategy of hiding malicious additions in large, auto-generated lock files that nobody reviews.
• Removed package names can be hijacked with version collisions. When an owner removes a package, the name can be reclaimed by someone else who can upload different distribution types under the same version number, as happened with "umap." Lock files help until you regenerate them, then you're exposed.
• Your dependency scanning needs to cover lock files, not just manifest files. Scanning only pyproject.toml or requirements.txt misses threats embedded in lock files, which is where the actual resolved URLs and hashes live.

Brian #2: Fence for Sandboxing
• Suggested by Martin Häcker
• “Some coding platforms have since integrated built-in sandboxing (e.g., Claude Code) to restrict write access to directories and/or network connectivity. However, these safeguards are typically optional and not enabled by default.”
• “JY Tan (on cc) has extracted the sandboxing logic from Claude Code and repackaged it into a standalone Go binary.”
• Source code on GitHub: https://github.com/Use-Tusk/fence
• Related:
  • Simon Willison lethal trifecta for AI agents article from June 2025
  • Claude Code Sandboxing

Michael #3: MALUS: Liberate Open Source
• via Paul Bauer
• The service will generate the specs of a library with one AI and build the newly licensed library using the specs with another AI circumventing the licensing and copyright rules.
• AI that has not been trained on open source reads the docs and API signature, creates a spec. Another AI processes that spec into working software.
• Is it a real site? Are they accepting real money, or are they just trying to cause a stir around copyright?

Brian #4: Harden your GitHub Actions Workflows with zizmor, dependency pinning, and dependency cooldowns
• Matthias Schoettle
• Avoid things like this: hackerbot-claw: An AI-Powered Bot Actively Exploiting GitHub Actions - Microsoft, DataDog, and CNCF Projects Hit So Far

Extras
Brian:
• GitHub is asking to spy on us, that's nice
Michael:
• Michael's new SaaS for podcasters: InterviewCue
• DigitalOcean's Spaces cold storage for infrequently accessed data
• Minor issue about my fire and forget post, was a latent bug? Fire and Forget at Textual follow up article

Joke: Can you?
In today's episode, financial journalists Daniel Eckert and Lea Oetjen discuss Eli Lilly's AI deal, frustration at Datadog, and the crash of CTS Eventim. They also cover Microsoft, SAP, Rheinmetall, Suss Microtec, Auto1, Energiekontor, Friedrich Vorwerk, Volvo, Bank of China, Nike, EUWAX Gold II (WKN: EWG2LD), EUWAX Gold Core (WKN: EWG4CR), Xetra Gold (WKN: A0S9GB), WisdomTree Physical Gold (WKN: A0N6XK) and iShares Physical Gold (WKN: A1KWPQ). We welcome feedback at aaa@welt.de. Advertisement: This episode contains advertising for Smartbroker+. Open a securities account and secure a €60 ETF bonus! Huge ETF selection, flexible trades and personal support at Smartbroker+. All information is available at: https://get.smartbrokerplus.de/triple-aaa-podcast/ You can find even more "Alles auf Aktien" at WELTplus and Apple Podcasts, including all articles by the hosts. Here at WELT: https://www.welt.de/podcasts/alles-auf-aktien/plus247399208/Boersen-Podcast-AAA-Bonus-Folgen-Jede-Woche-noch-mehr-Antworten-auf-Eure-Boersen-Fragen.html. You can subscribe to the AAA newsletter here: https://www.welt.de/newsletter/article232797673/Alles-auf-Aktien-Der-taegliche-Boersen-Newsletter-fuer-WELTplus-Abonnenten.html And, brand new: AAA is now on Instagram too: https://www.instagram.com/alles_auf_aktien/ Disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher are not liable for any losses arising from acting on the thoughts or ideas presented. Listening tips: For anyone who wants to know more: you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". +++ Advertising +++ Want to learn more about our advertising partners? Here you'll find all the info and discounts!
https://linktr.ee/alles_auf_aktien Legal notice: https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
In today's episode, financial journalists Philipp Vetter and Holger Zschäpitz discuss stagflation signals, crashing software stocks, and another major contract for Palantir. They also cover CF Industries, Mosaic, Archer-Daniels-Midland, Hubspot, UiPath, Atlassian, Zscaler, Snowflake, Gitlab, MongoDB, Salesforce, Datadog, Servicenow, Intuit, Workday, Gartner, Amazon, SAP, Arm, Apple, Samsung, Microsoft, Ionos, Commonwealth Bank of Australia, National Australia Bank, BHP Group, Rio Tinto, Westpac Banking, ANZ Group, Wesfarmers, Xtrackers S&P ASX 200 (WKN: DBX1A2), iShares MSCI Australia (WKN: A0YJ80) and Xtrackers II Australia Government Bond ETF (WKN: DBX0GG). Information on the book “Project Maven – A Marine Colonel, His Team, and the Dawn of AI Warfare” by Katrina Manson can be found here: https://wwnorton.com/books/9781324123316 We welcome feedback at aaa@welt.de. You can find even more "Alles auf Aktien" at WELTplus and Apple Podcasts, including all articles by the hosts. Here at WELT: https://www.welt.de/podcasts/alles-auf-aktien/plus247399208/Boersen-Podcast-AAA-Bonus-Folgen-Jede-Woche-noch-mehr-Antworten-auf-Eure-Boersen-Fragen.html. You can subscribe to the AAA newsletter here: https://www.welt.de/newsletter/article232797673/Alles-auf-Aktien-Der-taegliche-Boersen-Newsletter-fuer-WELTplus-Abonnenten.html And, brand new: AAA is now on Instagram too: https://www.instagram.com/alles_auf_aktien/ Disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher are not liable for any losses arising from acting on the thoughts or ideas presented. Listening tips: For anyone who wants to know more: you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". +++ Advertising +++ Want to learn more about our advertising partners? Here you'll find all the info and discounts!
https://linktr.ee/alles_auf_aktien Legal notice: https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
The Twenty Minute VC: Venture Capital | Startup Funding | The Pitch
Shaunt Voskanian is the CRO @ Figma, where he has scaled the sales machine to over $1BN in ARR and over 400 people. Prior to Figma, Shaunt was Senior VP of Global Sales at Datadog where he scaled the revenue org to $1BN in ARR. AGENDA: 04:33 - Are Great Sales Leaders Born or Trained? 06:55 - In a world of PLG, is sales less important than ever? 11:51 - Why does Shaunt not believe in traditional customer success teams? 14:31 - Does the role of the SDR survive in two years' time? 19:19 - When is the right time for sales to intercept in a PLG motion? 21:43 - How to Set Sales Quotas in a PLG AI Sales World? 31:19 - How has what you look for in sales hires changed over time? 42:54 - How do you judge sales performance if not on quota? 54:49 - Quick fire: Outdated sales tactic, What Role Dies, Best Sales AI Tool
The dominant structural mechanism highlighted is the industry-wide shift toward liability transfer and governance gaps in AI procurement, deployment, and incident response. According to Dave Sobel, both vendors and organizations are accelerating AI adoption without corresponding investments in oversight, training, or clear accountability structures. This is reflected across multiple sectors, from software vendors such as Grammarly, Eightfold.ai, Cohesity, and Rubrik, to business leaders and policymakers, where risk is systematically deferred downstream rather than managed at the point of adoption. The most consequential evidence is the quantitative disconnect between stated AI priorities and functional oversight. Research cited by Dave Sobel from Economist Impact and HR Dive found that while 38% of organizations budget for AI and 86% of executives rate AI as essential, only 16% offer internal training and over half of department-level AI initiatives lack formal oversight (Ernst & Young). Additionally, 88% of AI vendors limit their liability, and only 17% align with regulatory compliance, per cited surveys, leaving substantial legal and operational risk for end users and service providers. Supporting this trend, Dave Sobel points to Grammarly's opt-out identity usage in new features and a class action lawsuit against Eightfold.ai regarding AI-driven employment decisions. Vendors such as Cohesity, Rubrik, ServiceNow, and Datadog are responding by building tools focused on remediation and recovery from AI-driven incidents, underscoring a shift from preventive governance to reactive containment. Policy moves—such as expanded operational cyber roles for the private sector—further offload accountability without addressing contractual and insurance exposure. 
For MSPs and technology leaders, these developments create practical risks: unclear service scope around AI tool usage in contracts, increased exposure to billable incidents and legal action, and rising labor costs for incident recovery. Service providers must audit agreements for AI-specific language, distinguish AI-related incidents from standard SLAs, and treat AI governance as a managed risk service. The pressure will increasingly fall on MSPs to account for training gaps, audit trails, compliance attestations, and recovery procedures—not simply the technology itself. Three things to know today 00:00 ROI Reality Check 02:12 Governance Gap Widens 03:14 Cleanup Economy Rises 05:45 Why Do We Care? Supported by: CometBackup
Episode #534, devoted to "Shai-Hulud". With Christophe Tafani-Dereeper. References: Shai-Hulud: https://securitylabs.datadoghq.com/articles/shai-hulud-2.0-npm-worm/ https://github.com/DataDog/indicators-of-compromise/blob/main/shai-hulud-2.0/README.md https://www.wiz.io/blog/shai-hulud-2-0-aftermath-ongoing-supply-chain-attack https://www.cert.ssi.gouv.fr/actualite/CERTFR-2025-ACT-051/ Mentioned during the episode: Previous NLS episode on software supply chain security: https://www.nolimitsecu.fr/securisation-de-la-chaine-dapprovisionnement-logicielle/ Attack on the npm maintainer "Qix": https://socket.dev/blog/npm-author-qix-compromised-in-major-supply-chain-attack Examples of other npm phishing attacks in 2025: https://bsky.app/profile/bad-at-computer.bsky.social/post/3lydioq5swk2y https://www.aikido.dev/blog/npm-debug-and-chalk-packages-compromised https://www.mimecast.com/threat-intelligence-hub/npm-phishing-campaign/ PoC of an npm worm in […] The post Shai-Hulud appeared first on NoLimitSecu.
All speakers are announced at AIE EU, schedule coming soon. Join us there or in Miami with the renowned organizers of React Miami! Singapore CFP also open!We've called this out a few times over in AINews, but the overwhelming consensus in the Valley is that “the IDE is Dead”. In November it was just a gut feeling, but now we actually have data: even at the canonical “VSCode Fork” company, people are officially using more agents than tab autocomplete (the first wave of AI coding):Cursor has launched cloud agents for a few months now, and this specific launch is around Computer Use, which has come a long way since we first talked with Anthropic about it in 2024, and which Jonas productized as Autotab:We also take the opportunity to do a live demo, talk about slash commands and subagents, and the future of continual learning and personalized coding models, something that Sam previously worked on at New Computer. (The fact that both of these folks are top tier CEOs of their own startups that have now joined the insane talent density gathering at Cursor should also not be overlooked).Full Episode on YouTube!please like and subscribe!Timestamps00:00 Agentic Code Experiments00:53 Why Cloud Agents Matter02:08 Testing First Pillar03:36 Video Reviews Second Pillar04:29 Remote Control Third Pillar06:17 Meta Demos and Bug Repro13:36 Slash Commands and MCPs18:19 From Tab to Team Workflow31:41 Minimal Web UI Philosophy32:40 Why No File Editor34:38 Full Stack Cursor Debate36:34 Model Choice and Auto Routing38:34 Parallel Agents and Best Of N41:41 Subagents and Context Management44:48 Grind Mode and Throughput Future01:00:24 Cloud Agent Onboarding and MemoryTranscriptEP 77 - CURSOR - Audio version[00:00:00]Agentic Code ExperimentsSamantha: This is another experiment that we ran last year and didn't decide to ship at that time, but may come back to LM Judge, but one that was also agentic and could write code. 
So it wasn't just picking, but also taking the learnings from the two models, or N models, that it was looking at and writing a new diff. And what we found was that there were strengths to using models from different model providers as the base level of this process. Basically you could get almost like a synergistic output that was better than having a very unified bottom model tier.
Jonas: We think that over the coming months, the big unlock is not going to be one person with a model getting more done, like the water flowing faster. It will be making the pipe much wider, and so parallelizing more. Whether that's swarms of agents or parallel agents, both of those are things that contribute to getting much more done in the same amount of time.
Why Cloud Agents Matter
swyx: This week, one of the biggest launches that Cursor's ever done is cloud agents. I think you had [00:01:00] cloud agents before, but this was like, you give Cursor a computer, right? Yeah. So it's just basically they bought Autotab and then they repackaged it. Is that what's going on, or,
Jonas: That's a big part of it. Yeah. Cloud agents already ran in their own computers, but they were sort of sight-reading code. Yeah. And those computers were, like, blank VMs typically that were not set up with the DevEx for whatever repo the agent's working on. One of the things that we talk about is, if you put yourself in the model's shoes and you were seeing tokens stream by, and all you could do was sight-read code and spit out tokens and hope that you had done the right thing,
swyx: no chance
Jonas: I'd be so bad. Like, obviously you need to run the code. And so that I think also is probably not that contrarian of a take, but no one has done that yet.
And so giving the model the tools to onboard itself and then use full computer use end-to-end, pixels in, coordinates out, and have the cloud computer with different apps in it, is the big unlock that we've seen internally in terms of usage of this going from, oh, we use it for little copy changes, [00:02:00] to, no, we're really driving new features with this kind of new type of agentic workflow. Alright, let's see it. Cool.
Live Demo Tour
Jonas: So this is what it looks like in cursor.com/agents. So this is one I kicked off a while ago. So on the left-hand side is the chat. Very classic sort of agentic thing. The big new thing here is that the agent will test its changes. So you can see here it worked for half an hour. That is because it not only took time to write the tokens of code, it also took time to test them end to end. So it started dev servers and iterated when needed. And so that's one part of it: the model works for longer and doesn't come back with an "I tried some things" PR, but an "I tested it" PR that's ready for your review. One of the other intuition pumps we use there is, if a human gave you a PR and asked you to review it and they hadn't tested it, you'd also be annoyed, because you'd be like, only ask me for a review once it's actually ready. So that's what we've done with
Testing Defaults and Controls
swyx: Simple question I wanted to get out up front. Some PRs are way smaller, [00:03:00] like just a copy change. Does it always do the video or is it sometimes,
Jonas: Sometimes.
swyx: Okay. So what's the judgment?
Jonas: The model does it. So we do some default prompting with sort of what types of changes to test. There's a slash command that people can do called slash no test, where if you do that, the model will not test,
swyx: but the default is test.
Jonas: The default is to be calibrated. So we tell it, don't test very simple copy changes, but test more complex things.
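The calibration Jonas describes can be pictured as a small policy. In the episode the judgment is made by the model through default prompting, so the rule-based function below is only an illustrative sketch, not Cursor's implementation. The `Change` type, its `kind` labels, and the `/no-test` spelling of the slash command are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Change:
    # Hypothetical labels for illustration: "copy", "feature", "bugfix".
    kind: str
    files_touched: int


def should_test(change: Change, prompt: str) -> bool:
    """Decide whether an agent should test end to end before opening a PR."""
    # An explicit opt-out in the prompt always wins (assumed spelling).
    if "/no-test" in prompt.split():
        return False
    # Calibrated default: skip very simple copy changes, test the rest.
    if change.kind == "copy" and change.files_touched <= 1:
        return False
    return True
```

Under these assumptions, a one-file copy tweak skips testing, a feature change is tested, and the same feature change with the opt-out in the prompt is not.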
And then users can also write their agents.md and specify like this type of, if you're editing this subpart of my mono repo, never tested ‘cause that won't work or whatever.Videos and Remote ControlJonas: So pillar one is the model actually testing Pillar two is the model coming back with a video of what it did.We have found that in this new world where agents can end-to-end, write much more code, reviewing the code is one of these new bottlenecks that crop up. And so reviewing a video is not a substitute for reviewing code, but it is an entry point that is much, much easier to start with than glancing at [00:04:00] some giant diff.And so typically you kick one off you, it's done you come back and the first thing that you would do is watch this video. So this is a, video of it. In this case I wanted a tool tip over this button. And so it went and showed me what that looks like in, in this video that I think here, it actually used a gallery.So sometimes it will build storybook type galleries where you can see like that component in action. And so that's pillar two is like these demo videos of what it built. And then pillar number three is I have full remote control access to this vm. So I can go heat in here. I can hover things, I can type, I have full control.And same thing for the terminal. I have full access. And so that is also really useful because sometimes the video is like all you need to see. And oftentimes by the way, the video's not perfect, the video will show you, is this worth either merging immediately or oftentimes is this worth iterating with to get it to that final stage where I am ready to merge in.So I can go through some other examples where the first video [00:05:00] wasn't perfect, but it gave me confidence that we were on the right track and two or three follow-ups later, it was good to go. And then I also have full access here where some things you just wanna play around with. 
You wanna get a feel for what is this and there's no substitute to a live preview.And the VNC kind of VM remote access gives you that.swyx: Amazing What, sorry? What is VN. AndJonas: just the remote desktop. Remote desktop. Yeah.swyx: Sam, any other details that you always wanna call out?Samantha: Yeah, for me the videos have been super helpful. I would say, especially in cases where a common problem for me with agents and cloud agents beforehand was almost like under specification in my requests where our plan mode and going really back and forth and getting detailed implementation spec is a way to reduce the risk of under specification, but then similar to how human communication breaks down over time, I feel like you have this risk where it's okay, when I pull down, go to the triple of pulling down and like running this branch locally, I'm gonna see that, like I said, this should be a toggle and you have a checkbox and like, why didn't you get that detail?And having the video up front just [00:06:00] has that makes that alignment like you're talking about a shared artifact with the agent. Very clear, which has been just super helpful for me.Jonas: I can quickly run through some other Yes. Examples.Meta Agents and More DemosJonas: So this is a very front end heavy one. So one question I wasswyx: gonna say, is this only for frontJonas: end?Exactly. One question you might have is this only for front end? So this is another example where the thing I wanted it to implement was a better error message for saving secrets. So the cloud agents support adding secrets, that's part of what it needs to access certain systems. Part of onboarding that is giving access.This is cloud is working onswyx: cloud agents. Yes.Jonas: So this is a fun thing isSamantha: it can get super meta. ItJonas: can get super meta, it can start its own cloud agents, it can talk to its own cloud agents. Sometimes it's hard to wrap your mind around that. 
We have disabled cloud agents starting more cloud agents. So we currently disallow that. Someday you might. Someday we might. Someday we might. So this actually was mostly a backend change in terms of the error handling here, where if the [00:07:00] secret is far too large, it would, oh, this is actually really cool. Wow. That's the DevTools. That's the DevTools. So if the secret is far too large, we don't allow secrets above a certain size. We have a size limit on them. And the error message there was really bad. It was just some generic "failed to save" message. So I was like, hey, we want a better error message. So first cool thing it did here, with zero prompting on how to test this: instead of typing out a character 5,000 times to hit the limit, it opens DevTools, writes JS to paste into the input 5,000 characters of the letter A, closes the DevTools, hits save, and gets the new error message. So it looks like the video actually cut off, but here you can see the screenshot of the error message. So that is like a frontend, backend, end-to-end feature to get that,
swyx: yeah.
Jonas: And
swyx: And you just need a full VM, full computer, run everything. Okay. Yeah.
Jonas: Yeah. So we've had versions of this. This is one of the Autotab lessons, where we started that in 2022. [00:08:00] No, in 2023. And at the time it was like browser use, DOM, like all these different things. And I think we ended up very sort of AGI-pilled, in the sense that: just give the model pixels, give it a box. A brain in a box is what you want, and you want to remove limitations around context and capabilities such that the bottleneck should be the intelligence. And given how smart models are today, that's a very far-out bottleneck.
And so giving it its full VM and having it be onboarded with the DevEx set up like a human would has just been, for us internally, a really big step change in capability.
swyx: Yeah, I would say, let's call it a year ago, the models weren't even good enough to do any of this stuff. So
Samantha: even six months ago. Yeah.
swyx: So yeah, what people have told me is, like, round about Sonnet 4.5 is when this started being good enough to just automate fully by pixel.
Jonas: Yeah, I think it's always a question of when is good enough. I think we found in particular with Opus 4.5, 4.6, and Codex 5.3 that those were additional step [00:09:00] changes in the autonomy-grade capabilities of the model to just go off and figure out the details and come back when it's done.
swyx: I wanna appreciate a couple details. One, TanStack Router. I see it. Yeah. I'm a big fan. Do you know any, I have to name the TanStack.
Jonas: No.
swyx: This just a random lore. Some buddy Sue Tanner. My and then the other thing, if you switch back to the video.
Jonas: Yeah.
swyx: I wanna shout out this thing. Probably Sam did it. I don't know
Jonas: the chapters.
swyx: What is this called? Yeah, this is called Chapters. Yeah. It's like a Vimeo thing. I don't know. But it's so nice, the design details. And obviously a company called Cursor has to have a beautiful cursor
Samantha: and it is
swyx: the cursor.
Samantha: Cursor.
swyx: You see it branded? It's the cursor. Cursor, yeah. Okay, cool. And then I complained to Evan. I was like, okay, but you guys branded everything but the wallpaper. And he was like, no, that's a Cursor wallpaper. I was like, what?
Samantha: Yeah. Rio picked the wallpaper, I think. Yeah. The video, that's probably Alexi and, yeah, a few others on the team, with the chapters on the video. Matthew Frederico. There's been a lot of teamwork on this.
It's a huge effort.
swyx: I just, I like design details.
Samantha: Yeah.
swyx: And then when you download, it adds like a little Cursor kind of TikTok clip. [00:10:00] Yes. Yes. So it's to make it really obvious it's from Cursor,
Jonas: We did the TikTok branding at the end. This was actually in our launch video. Alexi demoed the cloud agent that built that feature. Which was funny, because one of the things that's been a consequence of having these videos is we use best-of-N, where you run different models head to head on the same prompt. We use that a lot more, because one of the complications with doing that before was you'd run four models and they would come back with some giant diff, like 700 lines of code times four. What are you gonna do? You're gonna review all that? That's horrible. But if they come back with four 20-second videos, yeah, I'll watch four 20-second videos. And then even if none of them is perfect, you can figure out which one of those you want to iterate with to get it over the line. Yeah. And so that's been really fun.
Bug Repro Workflow
Jonas: Here's another example that we found really cool, which we've actually since turned into a slash command as well, slash [00:11:00] repro, where for bugs in particular, with the model having full access to its own VM, it can first reproduce the bug, make a video of the bug reproducing, fix the bug, and make a video of the bug being fixed, doing the same workflow with obviously the bug not reproducing. And that has been the single category that has gone from: these types of bugs are really hard to reproduce and take tons of time locally, and even if you try a cloud agent on it, are you confident it actually fixed it? To: when this happens, you'll merge it in 90 seconds or something like that. So this is an example where, let me see if this is the broken one or the, okay, this is the fixed one. Okay.
So we had a bug on cursor.com/agents where if you would attach images, then remove them, then still submit your prompt, they would actually still get attached to the prompt. Okay. And so here you can see Cursor is using its full desktop, by the way. This is one of the cases where if you just do browser-use [00:12:00] type stuff, you'll have a bad time, ‘cause now it needs to upload files. It just uses its native file viewer to do that. And so you can see here it's uploading files. It's going to submit a prompt and then it will go and open up. So this is the meta part: this is Cursor agent prompting Cursor agent inside its own environment. And so you can see here the bug: there's five images attached, whereas when it's submitted, it only had one image.
swyx: I see. Yeah. But you gotta enable that if you're gonna use Cursor agent inside Cursor.
Jonas: Exactly. And so here, this is then the after video, where it does the same thing. It attaches images, removes some of them, hits send. And you can see here, once this agent is up, only one of the images is left in the attachments. Yeah.
swyx: Beautiful.
Jonas: Okay. So easy merge.
swyx: So yeah. When does it choose to do this? Because this is an extra step.
Jonas: Yes. I think I've not done a great job yet of calibrating the model on when to reproduce these things. Yeah. Sometimes it will do it of its own accord. Yeah. We've been conservative, where we try to have it only do it when it's [00:13:00] quite sure, because it does add some amount of time to how long it takes it to work on it. But we also have added things like the slash repro command, where you can just do "fix this bug, slash repro" and then it will know that it should first make you a video of it actually finding and making sure it can reproduce the bug.
swyx: Yeah. Yeah. One sort of ML topic this ties into is reward hacking, where you write tests that you then update only so they pass.
So first write test, it shows me it fails, then make you test pass, which is a classic like red green.Jonas: Yep.swyx: LikeJonas: A-T-D-D-T-D-Dswyx: thing.No, very cool. Was that the last demo? Is thereJonas: Yeah.Anything I missed on the demos or points that you think? I think thatSamantha: covers it well. Yeah.swyx: Cool. Before we stop the screen share, can you gimme like a, just a tour of the slash commands ‘cause I so God ready. Huh, what? What are the good ones?Samantha: Yeah, we wanna increase discoverability around this too.I think that'll be like a future thing we work on. Yeah. But there's definitely a lot of good stuff nowJonas: we have a lot of internal ones that I think will not be that interesting. Here's an internal one that I've made. I don't know if anyone else at Cursor uses this one. Fix bb.Samantha: I've never heard of it.Jonas: Yeah.[00:14:00]Fix Bug Bot. So this is a thing that we want to integrate more tightly on. So you made it forswyx: yourself.Jonas: I made this for myself. It's actually available to everyone in the team, but yeah, no one knows about it. But yeah, there will be Bug bot comments and so Bug Bot has a lot of cool things. We actually just launched Bug Bot Auto Fix, where you can click a button and or change a setting and it will automatically fix its own things, and that works great in a bunch of cases.There are some cases where having the context of the original agent that created the PR is really helpful for fixing the bugs, because it might be like, oh, the bug here is that this, is a regression and actually you meant to do something more like that. And so having the original prompt and all of the context of the agent that worked on it, and so here I could just do, fix or we used to be able to do fixed PB and it would do that.No test is another one that we've had. Slash repro is in here. We mentioned that one.Samantha: One of my favorites is cloud agent diagnosis. This is one that makes heavy use of the Datadog MCP. Okay. 
And I [00:15:00] think Nick and David on our team wrote it, and basically if there is a problem with a cloud agent, we'll spin up a bunch of subagents. Like a single
swyx: instance.
Samantha: Yeah. It'll take the ID as an argument and spin up a bunch of subagents using the Datadog MCP to explore the logs and find all of the problems that could have happened with that. You can do quick stuff quickly with the Datadog UI, but it takes the debugging time down to, again, a single agent call as opposed to trawling through logs yourself.
Jonas: You should also talk about the stuff we've done with transcripts.
Samantha: Yes. So basically we've also done some things internally. There'll be some versions of this that we ship publicly soon, where you can spin up an agent and give it access to another agent's transcript to either debug something that happened, so act as an external debugger. I see. Or continue the conversation. Almost like forking it.
swyx: A transcript includes all the chain of thought for the 11 minutes here, 45 minutes there.
Samantha: Yeah. That way. Exactly. So basically acting as a secondary agent that debugs the first. So we've started to push more and
swyx: they're all the same [00:16:00] code. It is just different prompts, but the same.
Samantha: Yeah. So basically the same cloud agent infrastructure and then the same harness. And then there's some extra infrastructure that goes into piping in an external transcript if we include it as an attachment. But for things like the cloud agent diagnosis, that's mostly just using the Datadog MCP.
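The fan-out Samantha describes (take an agent ID, start several subagents that each explore the logs, and collect what they find) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the internal command: `HYPOTHESES`, `query_logs`, and the canned log line are hypothetical stand-ins, and no real Datadog or MCP call is made.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical failure modes that each subagent would investigate.
HYPOTHESES = ["vm_boot_failure", "secret_missing", "test_timeout"]


def query_logs(agent_id: str, hypothesis: str) -> list[str]:
    """Stand-in for a subagent searching logs; a real one would call a log tool."""
    canned = {
        "secret_missing": [f"{agent_id}: env var API_KEY not set"],
    }
    return canned.get(hypothesis, [])


def diagnose(agent_id: str) -> dict[str, list[str]]:
    """Run one lookup per hypothesis in parallel and keep those with evidence."""
    with ThreadPoolExecutor(max_workers=len(HYPOTHESES)) as pool:
        results = list(
            pool.map(lambda h: (h, query_logs(agent_id, h)), HYPOTHESES)
        )
    return {hypothesis: lines for hypothesis, lines in results if lines}
```

Calling `diagnose("agent-123")` here returns only the hypotheses that produced log evidence, which mirrors the idea of turning manual log trawling into a single call.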
‘Cause we also launched MCPs along with this cloud agent launch, launched support for cloud agent MCPs.
swyx: Oh, that was drowned out.
Jonas: We'll be doing a bigger marketing moment for it next week, but you can now use MCPs and
swyx: People will listen to it as well. Yeah,
Jonas: they'll
Samantha: be ahead of the curve. They'll be ahead. And I actually don't know if the Datadog MCP is publicly available yet. I realize I'm not sure, we're beta testing it, but it's been one of my favorites to use. So
swyx: I think that one's interesting for Datadog, ‘cause Datadog wants to own that side, interestingly, with Bits. I don't know if you've tried Bits.
Samantha: I haven't tried Bits.
swyx: Yeah.
Jonas: That's their cloud agent
swyx: product. Yeah. Yeah. They want to be like, we own your logs, and give us some part of the [00:17:00] self-healing software that everyone wants. Yeah. But obviously Cursor has a strong opinion on coding agents, and you're taking away from that, which obviously you're going to do, and not every company's like Cursor, but it's interesting if you're a Datadog, like, what do you do here? Do you expose your logs to MCP and let other people do it? Or do you try to own it, because it's extra business for you? Yeah. It's an interesting one.
Samantha: It's a good question. All I know is that I love the Datadog MCP,
Jonas: And yeah, it is gonna be no surprise that people will demand it, right?
Samantha: Yeah.
swyx: It's like any system-of-record company like this, it's like, how much do you give away? Cool. I think that's that for the sort of cloud agents tour. Cool. And we just talked about, like, cloud agents have been, when did Cursor launch cloud agents? Do you know? In June
Jonas: last year.
swyx: June last year. So it's been slowly developed. Like, Michael did a post himself where he showed this chart of agents overtaking Tab.
And I'm like, wow, this is like the biggest transition in code.
Jonas: Yeah.
swyx: Like in, in [00:18:00] like the last,
Jonas: Yeah. I think that kind of got drowned out. Yeah. I think it's a very interest,
swyx: Not at all. I think it's been highlighted by our friend Andrej Karpathy today.
Jonas: Okay.
swyx: Talk more about it. What does it mean? Yeah. I just got given, like, the Cursor Tab key.
Jonas: Yes. Yes.
swyx: That's, that's
Samantha: cool.
swyx: I know, but it's gonna be, like, put in a museum.
Jonas: It is.
Samantha: I have to say I haven't used Tab in a little bit myself.
Jonas: Yeah. I think that what it looks like to code with AI, or more generally to create software, even if you want to go higher level, is changing very rapidly. Not a hot take, but I think from our vantage point at Cursor, one of the things that is probably underappreciated from the outside is that we are extremely self-aware about that fact. And Cursor got its start in phase one, era one, of Tab and autocomplete. And that was really useful in its time. But a lot of people stopped looking at text files and editing code. We call it hand coding now, when you type out the actual letters, it's
swyx: Oh, that's cute.
Jonas: Yeah.
swyx: Oh, that's cute.
Jonas: You're so boomer. So boomer. [00:19:00] And so that I think has been a slowly accelerating and now, in the last few months, rapidly accelerating shift. And we think that's going to happen again with the next thing. I think some of the pains around Tab were: it's great, but I actually just want to give more to the agent, and I don't want to do one tab at a time.
I want to just give it a task and it goes off and does a larger unit of work and I can.Lean back a little bit more and operate at that higher level of abstraction that's going to happen again, where it goes from agents handing you back diffs and you're like in the weeds and giving it, 32nd to three minute tasks, to, you're giving it, three minute to 30 minute to three hour tasks and you're getting back videos and trying out previews rather than immediately looking at diffs every single time.swyx: Yeah. Anything to add?Samantha: One other shift that I've noticed as our cloud agents have really taken off internally has been a shift from primarily individually driven development to almost this collaborative nature of development for us, slack is actually almost like a development on [00:20:00] Id basically.So Iswyx: like maybe don't even build a custom ui, like maybe that's like a debugging thing, but actually it's that.Samantha: I feel like, yeah, there's still so much to left to explore there, but basically for us, like Slack is where a lot of development happens. Like we will have these issue channels or just like this product discussion channels where people are always at cursing and that kicks off a cloud agent.And for us at least, we have team follow-ups enabled. So if Jonas kicks off at Cursor in a thread, I can follow up with it and add more context. And so it turns into almost like a discussion service where people can like collaborate on ui. Oftentimes I will kick off an investigation and then sometimes I even ask it to get blame and then tag people who should be brought in. ‘cause it can tag people in Slack and then other people will comeswyx: in, can tag other people who are not involved in conversation. Yes. Can just do at Jonas if say, was talking to,Samantha: yeah.swyx: That's cool. You should, you guys should make a big good deal outta that.Samantha: I know. 
It's a lot to, I feel like there's a lot more to do with our slack surface area to show people externally. But yeah, basically like it [00:21:00] can bring other people in and then other people can also contribute to that thread and you can end up with a PR again, with the artifacts visible and then people can be like, okay, cool, we can merge this.So for us it's like the ID is almost like moving into Slack in some ways as well.swyx: I have the same experience with, but it's not developers, it's me. Designer salespeople.Samantha: Yeah.swyx: So me on like technical marketing, vision, designer on design and then salespeople on here's the legal source of what we agreed on.And then they all just collaborate and correct. The agents,Jonas: I think that we found when these threads is. The work that is left, that the humans are discussing in these threads is the nugget of what is actually interesting and relevant. It's not the boring details of where does this if statement go?It's do we wanna ship this? Is this the right ux? Is this the right form factor? Yeah. How do we make this more obvious to the user? It's like those really interesting kind of higher order questions that are so easy to collaborate with and leave the implementation to the cloud agent.Samantha: Totally. And no more discussion of am I gonna do this? Are you [00:22:00] gonna do this cursor's doing it? You just have to decide. You like it.swyx: Sometimes the, I don't know if there's a, this probably, you guys probably figured this out already, but since I, you need like a mute button. So like cursor, like we're going to take this offline, but still online.But like we need to talk among the humans first. Before you like could stop responding to everything.Jonas: Yeah. This is a design decision where currently cursor won't chime in unless you explicitly add Mention it. Yeah. 
Yeah.Samantha: So it's not always listening.Yeah.Jonas: I can see all the intermediate messages.swyx: Have you done the recursive, can cursor add another cursor or spawn another cursor?Samantha: Oh,Jonas: we've done some versions of this.swyx: Because, ‘cause it can add humans.Jonas: Yes. One of the other things we've been working on that's like an implication of generating the code is so easy is getting it to production is still harder than it should be.And broadly, you solve one bottleneck and three new ones pop up. Yeah. And so one of the new bottlenecks is getting into production and we have a like joke internally where you'll be talking about some feature and someone says, I have a PR for that. Which is it's so easy [00:23:00] to get to, I a PR for that, but it's hard still relatively to get from I a PR for that to, I'm confident and ready to merge this.And so I think that over the coming weeks and months, that's a thing that we think a lot about is how do we scale up compute to that pipeline of getting things from a first draft An agent did.swyx: Isn't that what Merge isn't know what graphite's for, likeJonas: graphite is a big part of that. The cloud agent testingswyx: Is it fully integrated or still different companiesJonas: working on I think we'll have more to share there in the future, but the goal is to have great end-to-end experience where Cursor doesn't just help you generate code tokens, it helps you create software end-to-end.And so review is a big part of that, that I think especially as models have gotten much better at writing code, generating code, we've felt that relatively crop up more,swyx: sorry this is completely unplanned, but like there I have people arguing one to you need ai. To review ai and then there is another approach, thought school of thought where it's no, [00:24:00] reviews are dead.Like just show me the video. It's it like,Samantha: yeah. 
I feel, again, for me the video is often for alignment, and then I often still wanna go through a code review process.
swyx: Like still look at the files and everything.
Samantha: Yeah. There's a spectrum, of course. The video, if it's really well done and it does fully test everything, you can feel pretty confident, but it's still helpful to look at the code. I pay a lot of attention to Bugbot. I feel like Bugbot has been great, really highly adopted internally. We often tell people, don't leave Bugbot comments unaddressed, 'cause we have such high confidence in it. So people always address their Bugbot comments.
Jonas: Once you've had two cases where you merged something and then you went back later and there was a bug in it, and you were like, ah, Bugbot had found that, I should have listened to Bugbot. Once that happens two or three times, you learn to wait for Bugbot.
Samantha: Yeah. So I think for us there's that code-level review where it's looking at the actual code, and then there's the feature-level review where you're looking at the features. There's a whole number of different areas. There'll probably eventually be things like performance-level review, security [00:25:00] review, things like that, where it's more different aspects of how this feature might affect your code base that you want to potentially leverage an agent to help with.
Jonas: And some of those, like Bugbot, will be synchronous, and you'll typically want to wait on them before you merge. But I think another thing that we're starting to see is, as with cloud agents you scale up this parallelism and how much code you generate, 10-person startups need the DevEx and pipelines that a 10,000-person company used to need.
And that looks like a lot of the things, I think, that 10,000-person companies invented in order to get that volume of software to production safely. So that's things like release frequently or release slowly, have different stages where you release, have checkpoints, automated ways of detecting regressions. And so I think we're gonna need stacked diffs, merge queues, exactly. A lot of those things are going to be important going forward.
swyx: I think the majority of people still don't know what stacked diffs are. I have many friends at Facebook and I'm pretty friendly with Graphite. I've just [00:26:00] never needed it 'cause I don't work on that large a team. And it's just like democratization of, here's what we've already worked out at very large scale, and here's how it benefits you too. To me, one of the beautiful things about GitHub is that it's actually useful to me as an individual solo developer, even though it's actually collaboration software.
Jonas: Yep.
swyx: And I don't think a lot of dev tools have figured that out yet. That transition from large down to small.
Jonas: Yeah. Cursor is probably an inverse story.
swyx: This is small up to...
Jonas: Yeah. Where historically for Cursor, part of why we grew so quickly was anyone on the team could pick it up, and in fact people would pick it up on the weekend for their side project and then bring it into work, 'cause they loved using it so much.
swyx: Yeah.
Jonas: And I think a thing that we've started working on a lot more, not us specifically, but as a company, other folks at Cursor, is making it really great for teams, and making it so that the 10th person that starts using Cursor in a team is immediately set up with things. We launched Marketplace recently so other people can [00:27:00] configure MCPs and skills, like plugins. So skills and MCPs, other people can configure that, so that my Cursor is ready to go and set up.
Sam loves the Datadog MCP, and the Slack MCP you've also been using a lot, but...
Samantha: That's also pre-launch, but I feel like it's so good.
Jonas: Yeah, my Cursor should be configured if Sam feels strongly that it's just amazing and required.
swyx: Is it automatically shared, or do you have to go and...
Jonas: It depends on the MCP. Some are obviously auth per user, and so Sam can't auth my Cursor with my Slack MCP, but some are team auth, and those can be set up by admins.
swyx: Yeah. That's cool. We had Aman on the pod when Cursor was five people, and everyone was like, okay, what's the thing? And it's usually something like teams and org and enterprise, but it's actually working. Usually at that stage, when you're five, when you're just a VS Code fork, it's like, how do you get there? Will people pay for this? People do pay for it.
Jonas: Yeah. And I think for cloud agents, we expect [00:28:00] to have similar kind of PLG things, where off the bat we've seen a lot of adoption with smaller teams where the code bases are not quite as complex to set up. If you need some insane Docker layer caching thing for builds not to take two hours, that's going to take a little bit longer for us to be able to support that kind of infrastructure. Whereas if you have front end, back end, like one click, agents can install everything that they need themselves.
swyx: This is a good chance for me to just ask some technical check-the-box questions. Can I choose the size of the VM?
Jonas: Not yet. We are planning on adding that.
swyx: Obviously you want like L, XL, XXL, whatever, right? It's like the Amazon sort of menu.
Jonas: Yes, exactly. We'll add that.
swyx: Yeah. In some ways you have to basically become like an EC2, almost. Like you rent a box.
Jonas: You rent a box. Yes. We talk a lot about brain in a box. So Cursor, we want to be a brain in a box.
swyx: But is the mental model different?
Is it more serverless? Is it more persistent? Is it something else?
Samantha: We want it to be a bit persistent. The desktop should be [00:29:00] something you can return to even after some days. Like maybe you go back and you're still thinking about a feature for some period of time.
swyx: So the full, like, suspend the memory and bring it back and then keep going.
Samantha: Exactly.
swyx: That's an interesting one, because what I actually do want, like from a Manus and OpenClaw, whatever, is I want to be able to log in with my credentials to the thing, but not actually store them in any secret store, 'cause this is my most sensitive stuff. This is like my email, whatever. And just have it persist to the image. I don't know how it works under the hood, but rehydrate and then just keep going from there. But I don't think a lot of infra works that way. A lot of it's stateless, where you save it to a Docker image, and then it's only whatever you can describe in a Dockerfile, and that's it. That's the only thing you can clone multiple times in parallel.
Jonas: Yeah. We have a bunch of different ways of setting them up. So there's a Dockerfile-based approach. The main default way is actually snapshotting...
swyx: Like a Linux VM.
Jonas: Like a VM, right? You run a bunch of install commands and then you snapshot, more or less, the file system. And so that gets you set up for everything [00:30:00] that you would want to bring a new VM up from that template, basically.
swyx: Yeah.
Jonas: And that's a bit distinct from what Sam was talking about with the hibernating and rehydrating, where that is a full memory snapshot as well.
So there, if I had the browser open to a specific page and we bring that back, that page will still be there.
swyx: Was there any discussion internally, in building this stuff, about how every time you shoot a video you show a little bit of the desktop and the browser? It's not necessary if you're just demoing a front-end application. Why not just show the browser, right?
Samantha: Yeah, we do have some panning and zooming. It can decide that when it's actually recording and cutting the video, to highlight different things. I think we've played around with different ways of segmenting it. There's been some different revs on it for sure.
Jonas: Yeah. I think one of the interesting things is the version that you see now in cursor.com is actually like half of what we had at peak, where we decided to unship quite a few things. So two of the interesting things to talk about. One is directly an answer to your [00:31:00] question, where we had a native browser that you would have locally. It was basically an iframe that, via port forwarding, could load the URL, could talk to localhost in the VM. So that gets you basically...
swyx: So in your machine's browser?
Jonas: In your local browser, yeah. You would go to localhost 4000 and that would get forwarded to localhost 4000 in the VM via port forwarding. We unshipped that...
swyx: Like ngrok.
Jonas: Like an ngrok, exactly. We unshipped that because we felt that the remote desktop was sufficiently low latency and more general purpose. So we build Cursor web, but we also build Cursor desktop. And so it's really useful to be able to have the full spectrum of things.
And even for Cursor web, as you saw in one of the examples, the agent was uploading files, and I couldn't upload files and open the file viewer if I only had access to the browser. And we've thought a lot about this. It might seem funny coming from Cursor, where we started as this VS Code fork and I think inherited a lot of amazing things, but also a lot [00:32:00] of legacy UI from VS Code.

Minimal Web UI Surfaces

Jonas: And so with the web UI we wanted to be very intentional about keeping that very minimal and exposing the right set of primitive app surfaces, we call them, that are shared features of that cloud environment that you and the agent both use. So the agent uses the desktop and controls it; I can use the desktop and control it. The agent runs terminal commands; I can run terminal commands. So that's our philosophy around it. The other thing that is maybe interesting to talk about that we unshipped, and both of these things we may reship, and decide at some point in the future that we've changed our minds on the trade-offs or gotten it to a point where...
swyx: Put it out there. Let users tell you they want it. Exactly. Alright, fine.

Why No File Editor

Jonas: So one of the other things is actually a files app. We used to have the ability, at one point during the process of testing this internally, to see, next to the Git, desktop, and terminal I had on the right-hand side of the tab there earlier, a files app where you could see and edit files. And we actually felt that in some [00:33:00] ways, by restricting and limiting what you could do there, people would naturally leave more to the agent and fall into this new pattern of delegating, which we thought was really valuable. And there's currently no way in Cursor web to edit these files.
swyx: Yeah. Except you open up the PR and go into GitHub and do the thing.
Jonas: Yeah.
swyx: Which is annoying.
Jonas: Just tell the agent.
swyx: I have criticized OpenAI for this.
Because OpenAI's Codex app doesn't have a file editor. It has a file viewer, but it isn't a file editor.
Jonas: Do you use the file viewer a lot?
swyx: No. I understand, but sometimes I want it. The one way to do it is, they have an open-in-Cursor button, or open in Antigravity, or open in whatever, and people pointed that out. I was part of the early testers group, and people pointed that out and were like, this is a design smell. You actually want a VS Code fork that has all these things, but also a file editor. And they were like, no, just trust us.
Jonas: Yeah. I think we as Cursor will want to, as a product, offer the [00:34:00] whole spectrum, and so you want to be able to work at really high levels of abstraction and double click and see the lowest level. That's important. But I also think that you won't be doing that in Slack. And so there are surfaces and ways of interacting where in some cases limiting the UX capabilities makes for a cleaner experience that's simpler and drives people into these new patterns where, even locally, we kicked off joking about this, people don't really edit files or hand code anymore. And so we want to build for where that's going and not where it's been.
swyx: A lot of cool stuff. Okay, I have a couple more.

Full Stack Hosting Debate

swyx: So, observations about the design elements of these things. One of the things that I'm always thinking about is that Cursor and other peers of Cursor start from the dev tools and work their way towards cloud agents. Other people, like the Lovables and Bolts of the world, start with, here's the vibe code full cloud thing. They were already cloud agents before anyone else was cloud agents, and: we will give you the full deploy platform. So we own the whole loop. We own all the infrastructure, we have the logs, we have the live site, [00:35:00] whatever. And you can do that cycle. Cursor doesn't own that cycle even today.
You don't have the Vercel, you don't have whatever deploy infrastructure you're gonna have, which gives you powers, because anyone can use it, and any enterprise, whatever your infra, I don't care. But then it also gives you limitations as to how much you can actually fully debug end to end. I guess I'm just putting out there: is there a future where there's a full-stack Cursor, like cursor apps.com, where I host my Cursor site, which is basically a Vercel clone, right? I don't know.
Jonas: I think that's an interesting question to be asking, and I think the logic that you laid out for how you would get there is logic that I largely agree with.
swyx: Yeah.
Jonas: I think right now we're really focused on what we see as the next big bottleneck, and because things like the Datadog MCP exist, I don't think that the best way we can help our customers ship more software is by building a hosting solution right now.
swyx: By the way, these are things I've actually discussed with some of the companies I just named.
Jonas: Yeah, for sure. Right now, this big bottleneck is getting the code out there, and also, [00:36:00] unlike a Lovable or a Bolt, we focus much more on existing software. The zero-to-one greenfield is just a very different problem. Imagine going to a Shopify and convincing them to deploy on your deployment solution. That's very different, and I think will take much longer to see how that works. It may never happen, relative to, oh, it's a zero-to-one app.
swyx: I'll say it's tempting, because like 50% of your apps are Vercel, Supabase, Tailwind, React. It's the stack. It's what everyone does. So it's kinda interesting.
Jonas: Yeah.

Model Choice and Auto Routing

swyx: The other thing is the model selector dying. Right now in cloud agents, it's stuck down at the bottom left. Sure, it's Codex High today, but do I care if it's suddenly switched to Opus?
Probably not.
Samantha: We definitely wanna give people a choice across models, because I feel like the meta changes very frequently. I was a big Opus 4.5 maximalist, and when Codex 5.3 came out, I hard switched. So that's all I use now.
swyx: Yeah. Agreed. Basically, when I use it in Slack, [00:37:00] Cursor does a very good job of exposing, if people go use it, here's the model we're using, here's how you switch if you want. But otherwise it's abstracted away, which is beautiful, because then actually, you should decide.
Jonas: Yeah, I think we want to be doing more with defaults.
swyx: Yeah.
Jonas: Where we can suggest things to people. A thing that we have in the editor, the desktop app, is Auto, which will route your request and do things there. So I think we will want to do something like that for cloud agents as well. We haven't done it yet. We have both people like Sam, who are very savvy and know exactly what model they want, and we also have people that want us to pick the best model for them, because we have amazing people like Sam and we are the experts. We have both the traffic and the internal taste and experience to know what we think is best.
swyx: Yeah. I have this ongoing thesis of agent lab versus model lab. To me, Cursor and other companies are examples of an agent lab that is building a new playbook that is different from a model lab, which is very GPU heavy. So it obviously has a research [00:38:00] team. And my thesis is, every agent lab is going to have a router, because you're going to be asked, what's what? I don't keep up every day. I'm not a Sam. I'm using you as the arbiter of taste. Put me on Cursor Auto. Is it free? It's not free.
Jonas: Auto's not free, but there's different pricing tiers.
swyx: Put me on Cursor.
You decide for me, based on all the other people. You know better than me. And I think every agent lab should basically end up doing this, because that actually gives you extra power, because people stop caring or having loyalty to one lab.
Jonas: Yeah.

Best Of N and Model Councils

Jonas: Two other maybe interesting things, that I don't know how much they're on your radar. One is the best-of-N thing we mentioned, where running different models head to head is actually quite interesting, because...
swyx: Which exists in Cursor.
Jonas: That exists in Cursor IDE and web. So the problem is where do you run them?
swyx: Okay.
Jonas: And so I can share my screen if that's interesting.
swyx: Yeah. Obviously parallel agents, very popular.
Jonas: Yes, exactly. Parallel agents.
swyx: In your mind, are they the same thing, best-of-N and parallel agents? I don't want to [00:39:00] put words in your mouth.
Jonas: Best-of-N is a subset of parallel agents where they're running on the same prompt. That would be my answer. So this is what that looks like. Here in this dropdown picker, I can just select multiple models.
swyx: Yeah.
Jonas: And now if I do a prompt, I'm going to do something silly. I am running these five models.
swyx: Okay. This is this fake clone, of course. The 2.0, yeah.
Jonas: Yes, exactly. But they're running... so with Cursor 2.0, you can do desktop or cloud. This is cloud specifically, where the benefit over worktrees is that they have their own VMs and can run commands and won't try to kill ports that the other one is running, which are some of the pains.
swyx: These are all called worktrees?
Jonas: No, these are all cloud agents with their own VMs.
swyx: Okay.
Jonas: But when you do it locally, sometimes people do worktrees, and that's been the main way that people have set up parallel agents so far, I've gotta say.
swyx: That's so confusing for folks.
Jonas: Yeah.
swyx: No one knows what worktrees are.
Jonas: Exactly.
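A minimal sketch of best-of-N as Jonas defines it here: parallel agents that all run the same prompt, one per model. The `run_agent` and `score` callables are hypothetical stand-ins supplied by the caller; nothing below is Cursor's API, and threads here only imitate the isolation that separate VMs give each cloud agent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def best_of_n(
    prompt: str,
    models: List[str],
    run_agent: Callable[[str, str], str],  # hypothetical: (model, prompt) -> candidate output
    score: Callable[[str], float],         # hypothetical: higher is better
) -> Tuple[str, Dict[str, str]]:
    """Run the same prompt on every model in parallel and pick the top-scoring output."""
    if not models:
        raise ValueError("need at least one model")
    # One worker per model, so every candidate is produced concurrently.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        outputs = list(pool.map(lambda m: run_agent(m, prompt), models))
    candidates = dict(zip(models, outputs))
    # On a tie, max() keeps the earliest model in the list.
    winner = max(models, key=lambda m: score(candidates[m]))
    return winner, candidates
```

Returning every candidate alongside the winner keeps the door open for a human, or a later judging step, to compare the runs instead of trusting a single score.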
I think we're phasing out worktrees.
swyx: Really?
Jonas: Yeah.
swyx: Okay.
Samantha: But yeah. One other thing I would say on the multi-model choice, [00:40:00] this is another experiment that we ran last year and didn't decide to ship at that time, but may come back to, and there was an interesting learning that's relevant for these different model providers. It was something that would run a bunch of best-of-Ns but then synthesize, basically run a synthesizer layer of models. That was another agent that would act like an LLM judge, but one that was also agentic and could write code. So it wasn't just picking, but also taking the learnings from the two models, or N models, that it was looking at and writing a new diff. And what we found was that, at the time at least, there were strengths to using models from different model providers as the base level of this process. Basically you could get almost a synergistic output that was better than having a very unified bottom model tier. So it was really interesting, 'cause potentially, even in the future when you have maybe one model ahead of the other for a little bit, there could be some benefit from having multiple top-tier models involved in a [00:41:00] model swarm, or whatever agent swarm that you're doing, since they each have strengths and weaknesses.
Jonas: Andrej called this the council, right?
Samantha: Yeah, exactly. Oh, that's another internal command we have that Ian wrote, /council.
swyx: Yes. This idea is in various forms everywhere. And I think for me, the productization of it, you guys have done, like this is very flexible, but if I were to add another, what your thing is, on here, it would be too much.
Let's say...
Samantha: Ideally it's something that the user can just choose, and it all happens under the hood in a way where you just get the benefit of that process at the end, and better output basically, but don't have to get too lost in the complexity of judging along the way.
Jonas: Okay.

Subagents for Context

Jonas: Another thing on the many agents, on different parallel agents, that's interesting is an idea that's been around for a while as well, that has started working recently: subagents. This is one other way to get agents with different prompts and different goals and different models, [00:42:00] different vintages, to work together, collaborate, and delegate.
swyx: Yeah. I'm always looking for "this is the year of the blah," right? I think one of the blahs is subagents. But I haven't used them in Cursor. Are they fully formed? I'd honestly like an intro, because do I form them new every time? Do I have fixed subagents? How are they different from slash commands? There's all these really basic questions that no one stops to answer for people, because everyone's just too busy launching.
Samantha: Honestly, you can see them in Cursor now if you just say, spin up like 50 subagents to...
swyx: So Cursor defines what subagents...
Samantha: Yeah. So basically, I shouldn't speak for the whole subagents team. This is a different team that's been working on this. But our thesis, or the thing that we saw internally, is that they're great for context management for long-running threads, or if you're trying to just throw more compute at something. We have strongly used almost a generic task interface, where the main agent can define [00:43:00] what goes into the subagent.
So if I say, explore my code base, it might decide to spin up an explore subagent, or might decide to spin up five explore subagents.
swyx: But I don't get to set what those subagents are, right? It's all defined by a model.
Samantha: I think... I actually would have to refresh myself on the subagent interface.
Jonas: There are some built-in ones. Like the explore subagent is pre-built. But you can also instruct the model to use other subagents, and then it will. One other example of a built-in subagent... I actually just kicked one off in Cursor and I can show you what that looks like.
swyx: Yes. Because I tried to do this in pure prompt space.
Jonas: So this is the desktop app.
swyx: Yeah. And that's all you need to do, right?
Jonas: That's all you need to do. So I said, use a subagent to explore, and I think, yeah, I can even click in and see what the subagent is working on here. It ran some find command, and this is Composer under the hood. Even though my main model is Opus, it does smart routing. In this instance, the explore sort of requires reading a ton of things, and so a faster model is really useful to get an [00:44:00] answer quickly. But this is what subagents look like. And I think we want to do a lot more to expose hooks and ways for people to configure these. Another example of a built-in subagent is the computer-use subagent in the cloud agents, where we found that those trajectories can be long and involve a lot of images, obviously, and execution of some testing or verification task. We wanted to use models that are particularly good at that. So that's one reason to use subagents. And then the other reason to use subagents is we want contexts to be summarized, reduced down, at a subagent level.
That's a really neat boundary at which to compress that rollout and testing into a final message that the agent writes, which then gets passed into the parent, rather than having to do some global compaction or something like that.
swyx: Awesome. Cool. While we're in the subagents conversation, I can't do a Cursor conversation and not talk about Wilson's stuff. He built a browser. He built an OS. And he [00:45:00] experimented with a lot of different architectures and basically ended up reinventing the software engineering org chart. This is all cool, but what's your take? Are there any behind-the-scenes stories about that whole adventure?
Samantha: Some of those experiments have found their way into a feature that's available in cloud agents now, the long-running agent mode. Internally, we call it grind mode. And I think there's some hint of grind mode accessible in the picker today, 'cause you can choose "grind until done." That was really the result of experiments that Wilson started in this vein. I think the Ralph Wiggum loop was floating around at the time, but it was something he also independently found and was experimenting with. And that was what led to this product surface.
swyx: And it's just the simple idea of have criteria for completion and do not stop until you complete.
Samantha: There's a bit more complexity as well in our implementation. You have to start out by aligning, and there's a planning stage where it will work with you, and it will not start grind execution mode until it's decided that the [00:46:00] plan is amenable to both of you, basically.
swyx: I refuse to work until you make me happy.
Jonas: We found that's really important, where people would give a very underspecified prompt and then expect it to come back with magic. And if it's gonna go off and work for three minutes, that's one thing.
When it's gonna go off and work for three days, you probably should spend a few hours upfront making sure that you have communicated what you actually want.
swyx: Yeah. And just to really drive home the point, we really mean three days, with no human...
Jonas: Oh yeah. We've had three-day runs with no intervention whatsoever.
Samantha: I don't know what the record is, but there's been a long time with the grinds.
Jonas: And so the thing that is available in Cursor, the long-running agent, if you wanna think about it very abstractly, is like one worker node. Whereas what built the browser is a society of workers and planners and different agents collaborating. Because we started building the browser with one worker node; at the time, that was just the agent. And it became more than one worker node when we realized that the throughput of the system was not where it needed to be [00:47:00] to get something at as large a scale as the browser done.
swyx: Yeah.
Jonas: And so this has also become a really big mental model for us with cloud agents: there's the classic engineering latency-throughput trade-off. If the code is water flowing through a pipe, we think that over the coming months the big unlock is not going to be one person with a model getting more done, like the water flowing faster. We'll be making the pipe much wider and so doing more. Whether that's swarms of agents or parallel agents, both of those are things that contribute to getting much more done in the same amount of time, but any one of those tasks doesn't necessarily need to get done that quickly. And throughput is this really big thing, where if you see the system of a hundred concurrent agents outputting thousands of tokens a second, you can't go back. You see a glimpse of the future, where obviously there are many caveats. Like, no one is using this browser IRL.
There's a bunch of things not quite right yet, but we are going to get to systems that produce real production [00:48:00] code at that scale much sooner than people think. And it forces you to think about what even happens to production systems. We've broken our GitHub Actions recently because we have so many agents producing and pushing code that CI/CD is just overloaded. 'Cause suddenly it's like, effectively, Cursor's growing very quickly anyway, but you grow head count 10x when people run 10x as many agents. And so a lot of these systems will need to adapt.
swyx: It also reminds me, the three of us live in the app layer, but if you talk to the researchers who are doing RL infrastructure, it's the same thing. It's all these parallel rollouts and scheduling them and making sure as much throughput as possible goes through them. Yeah, it's the same thing.
Jonas: We were talking briefly before we started recording. You were mentioning memory chips and some of the shortages there. The other thing that I think is just hard to wrap your head around is the scale of the system that was building the browser, the concurrency there. If Sam and I both have a system like that running for us, [00:49:00] shipping our software, the amount of inference that we're going to need per developer is just really mind-boggling. Sometimes when I think about that, I think that even the most optimistic projections for what we're going to need in terms of buildout are underestimating the extent to which these swarm systems can churn at scale to produce code that is valuable to the economy.
swyx: Yeah, you can cut this if it's sensitive, but do you have estimates of how much your token consumption is?
Jonas: Like per developer?
swyx: Yeah. Or yourself. I don't need the company average. I'm just curious.
Samantha: I feel like, for a while I wasn't an admin on the usage dashboard, so I wasn't able to actually see, but it was a...
swyx: Mine has gone up.
Samantha: Oh yeah. But I think in terms of how much work I'm doing, it's more like, I have no worries about developers losing their jobs, at least in the near term. 'Cause I feel like that's a more broad discussion.
swyx: Yeah. You went there. I wasn't going there. I was just like, how much more are you using?
Samantha: There's so much stuff to be built. And so I feel like I'm basically just [00:50:00] trying to constantly... I have more ambitions than I did before, personally. So I can't speak to the broader thing, but for me it's like I'm busier than ever before. I'm using more tokens and I am also doing more things.
Jonas: Yeah. I don't have the stats for myself, but I think broadly a thing that we've seen, that we expect to continue, is Jevons paradox, where...
swyx: You can't do an AI podcast without saying it.
Jonas: Exactly. We've done it. Now we can wrap. We said the words. Phase one, tab autocomplete, people paid like 20 bucks a month, and that was great. Phase two, where you were iterating with these local models, today people pay like hundreds of dollars a month. I think as we think about these highly parallel agents running off for a long time in their own VM system, we are already at that point where people will be spending thousands of dollars a month per human, and I think potentially tens of thousands and beyond, where it's not like we are greedy for capturing more money, but what happens is just individuals get that much more leverage. And if one person can do as much as 10 people, yeah.
That tool that allows 'em to do that is going to be tremendously valuable [00:51:00] and worth investing in and taking the best thing that exists.
swyx: One more question on just Cursor in general and then open-ended for you guys to plug whatever you wanna put. How is Cursor hiring these days?
Samantha: What do you mean by how?
swyx: So obviously LeetCode is dead.
Samantha: Oh, okay.
swyx: Everyone says work trial. Different people have different levels of adoption of agents. Some people can really adopt and be much more productive. But other people, you just need to give them a little bit of time. And sometimes they've never lived in a token-rich place like Cursor. And once you live in a token-rich place, you just work differently. But you need to have done that. And a lot of people, anyway, it was just open-ended. Like, how has agentic engineering, agentic coding changed your opinions on hiring? Are there any, like, broad insights? Yeah.
Jonas: Basically I'm asking this for other people, right? Yeah, totally. Totally. To hear Sam's opinion, we haven't talked about this, the two of us. I think that we don't necessarily see being great at the latest thing with AI coding as a prerequisite. I do think that's a sign that people are keeping up and [00:52:00] curious and willing to upskill themselves in what's happening because, as we were talking about, in the last three months the game has completely changed. It's like what I do all day is very different.
swyx: Like it's my job and I can't,
Jonas: Yeah, totally. I do think that still, as Sam was saying, the fundamentals remain important in the current age, and being able to go and double click down.
And models today do still have weaknesses where if you let them run for too long without cleaning up and refactoring, the code will get sloppy and there'll be bad abstractions. And so you still do need humans that have built systems before, know good patterns when they see them, and know where to steer things.
Samantha: I would agree with that. I would say again, Cursor also operates very quickly, and leveraging agentic engineering is probably one reason why that's possible in this current moment. I think in the past it was just like people coding quickly, and now there's like people who use agents to move faster as well. So as part of our process we'll always look for, we'll select for kind of that ability to make good decisions quickly and move well in this environment. And so I think being able to [00:53:00] figure out how to use agents to help you do that is an important part of it too.
swyx: Yeah. Okay. The fork in the road: either predictions for the end of the year, if you have any, or plugs.
Jonas: Predictions are not going to go well.
Samantha: I know it's hard.
swyx: They're so hard. Get it wrong. It's okay. Just, yeah.
Jonas: One other plug that may be interesting, that I feel like we touched on but haven't talked a ton about, is a thing that these new interfaces and this parallelism enable: the ability to hop back and forth between threads really quickly. And so a thing that we have,
swyx: You wanna show something or,
Jonas: Yeah, I can show something. A thing that we have felt with local agents is this pain around context switching. And you have one agent that went off and did some work and another agent that did something else. And so here by having, I just have three tabs open, let's say, but I can very quickly hop in here. This is an example I showed earlier, but the actual workflow here I think is really different in a way that may not be obvious, where, I start t
Jason and Jeff go completely off-script to discuss what they are actually doing in their portfolios right now. Jeff breaks down why he bought more Datadog on a recent dip and his strategy for adding to Enphase, while Jason explains how he is using options (specifically selling puts) to generate income while waiting for better valuations. Plus, they analyze the risk vs. reward of QuantumScape, debate the new CEO at PayPal, and play a game of "buy or get off the pot" that ends with a live stock purchase.
00:53 Unscripted Earnings Season
04:34 Jeff Did a Thing
06:43 Why Buy Datadog Now
09:14 AI Risks for SaaS
11:34 Selling Puts Strategy
17:06 Market Psychology Chegg
19:41 Market Efficiency Long Term
23:15 Enphase DCA Dilemma
25:46 How to Time Adds
28:08 Quarterly Info Is Noise
28:30 Enphase Through the Cycle
29:49 QuantumScape Solid State Promise
31:50 Timing the Entry Price
33:53 Asymmetric Upside vs Risk
36:24 PayPal CEO Shakeup
38:59 Reframing the PayPal Thesis
44:48 Small Positions to Fix
46:36 Airbnb Growth Proof Points
Companies mentioned: ABNB, ADBE, CHGG, DDOG, ENPH, PYPL, QS, UPST
Find where to listen & subscribe, portfolio contests, and contact information at https://investingunscripted.com
To get 15% off any paid plan at fiscal.ai, visit https://fiscal.ai/unscripted
Listen to the Chit Chat Stocks Podcast for discussions on stocks, financial markets, super investors, and more. Follow the show on Spotify, Apple Podcasts, or YouTube.
Join our Patreon. Subscribe to our portfolio on Savvy Trader.
In today's episode, financial journalists Anja Ettel and Holger Zschäpitz discuss a plunge at Nvidia, a rebound in software, and a turn in the Warner Brothers drama. They also cover Atlassian, Zscaler, Datadog, Applovin, Crowdstrike, Workday, Salesforce, Opendoor, Intuitive Machines, Carvana, IonQ, Rigetti, Netflix, Paramount Skydance, Allianz, Deutsche Telekom, Münchener Rück (Munich Re), Scout24, Heidelberg Materials, Deutsche Börse, Kion, Hensoldt, Puma, Block (Square), WiseTech, Amazon, Nike, Verizon, Papa Johns, Pinterest, Autodesk, Ebay, UPS, Hypoport, Xtrackers MSCI World Industrials ETF (WKN: A113FN), Amundi S&P World Industrials Screened ETF (WKN: A3DSTE), iShares MSCI Europe Industrials Sector ETF (WKN: A2QBZ6), iShares S&P 500 Industrials Sector ETF (WKN: A142N0). We welcome feedback at aaa@welt.de. You can find even more "Alles auf Aktien" on WELTplus and Apple Podcasts, including all articles by the hosts and the AAA newsletter. Here at WELT: https://www.welt.de/podcasts/alles-auf-aktien/plus247399208/Boersen-Podcast-AAA-Bonus-Folgen-Jede-Woche-noch-mehr-Antworten-auf-Eure-Boersen-Fragen.html. The stock market podcast disclaimer: The stocks and funds discussed in the podcast do not constitute specific buy or investment recommendations. The hosts and the publisher are not liable for any losses arising from acting on the thoughts or ideas discussed. Listening tips: For everyone who wants to know more, you can hear Holger Zschäpitz every week on the finance and business podcast "Deffner&Zschäpitz". +++ Advertising +++ Want to learn more about our advertising partners? Here you'll find all the info and discounts: https://linktr.ee/alles_auf_aktien Legal notice (Impressum): https://www.welt.de/services/article7893735/Impressum.html Privacy policy: https://www.welt.de/services/article157550705/Datenschutzerklaerung-WELT-DIGITAL.html
In today's episode of Motley Fool Money, host Emily Flippen is joined by analysts Jason Hall and Toby Bordelon to break down earnings from three of the most volatile Rule-Breaking stocks out there. They discuss:
- How Spotify continues to convert free to paid users, and how monetization efforts are evolving in a more cost-conscious environment
- Whether or not Datadog's usage-based business model is under threat as software companies see pullbacks across the board
- Ferrari's attempt to reassure investors that it has growth left in it, even as its EV ambitions evolve
Companies discussed: SPOT, DDOG, RACE
Host: Emily Flippen, Jason Hall, Toby Bordelon
Producer: Anand Chokkavelu
Engineer: Dan Boyd
Disclosure: Advertisements are sponsored content and provided for informational purposes only. The Motley Fool and its affiliates (collectively, “TMF”) do not endorse, recommend, or verify the accuracy or completeness of the statements made within advertisements. TMF is not involved in the offer, sale, or solicitation of any securities advertised herein and makes no representations regarding the suitability, or risks associated with any investment opportunity presented. Investors should conduct their own due diligence and consult with legal, tax, and financial advisors before making any investment decisions. TMF assumes no responsibility for any losses or damages arising from this advertisement. We're committed to transparency: All personal opinions in advertisements from Fools are their own. The product advertised in this episode was loaned to TMF and was returned after a test period or the product advertised in this episode was purchased by TMF. Advertiser has paid for the sponsorship of this episode.
Learn more about your ad choices. Visit megaphone.fm/adchoices
A busy morning when it comes to earnings: Carl Quintanilla, Sara Eisen, and David Faber kicked off the hour with two of them, Coca-Cola and Marriott. The CEOs of both companies joined the team with their read on the consumer, the numbers, and more. Plus: why Evercore still sees a higher S&P ahead, despite growing AI debt concerns, with the firm's Head of Equity Strategy. Also in focus: all the earnings names you should be watching here, from AstraZeneca to Datadog to Spotify, and David's new reporting on Paramount's enhanced offer to buy Warner Brothers Discovery. Squawk on the Street Disclaimer Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Tim Berglund talks to Richie Artoul (WarpStream/Confluent) about his career in data infrastructure. Richie's first job: working at Howie's Game Shack, a walk‑in LAN gaming cafe. His challenge: working at Datadog on a new log storage system.
SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo