DNSimple has a CLI! Carl and Richard talk to DNSimple CEO Anthony Eden about the evolution of the DNSimple CLI in today's software market. DNSimple provides DNS, domain registrar, and certificate services - so why does it need a CLI? Anthony talks about earlier experiments with CLIs for folks who didn't want to use the web interface. But today, large language models change the game and work best with a CLI - those specific commands mean more accurate results from LLMs, which make for a powerful natural language interface experience. The conversation covers the new tooling around LLMs, how the registrar market has evolved, changes to certificates, and ICANN's recent announcement of new gTLDs. DNSimple continues to evolve with the times! Check the show notes for a link for ten dollars off anything at DNSimple!
Topics covered in this episode:

* Backup Docker volumes locally or to any S3
* Pyodide 314.0 Release
* nb-cli: A Command-Line Interface for AI Agents and Notebook Automation
* Hindsight Agent Memory That Learns
* Extras
* Joke

Watch on YouTube

About the show

Sponsored by us! Support our work through:

* Our courses at Talk Python
* AWS Community Day Midwest tomorrow, Wednesday the 24th, in downtown Indianapolis. Six Feet Up is sponsoring and there are 2 Sixies presenting.

Connect with the hosts

* Michael: Mastodon / BlueSky / X / LinkedIn
* Calvin: Mastodon / BlueSky / X / LinkedIn
* Show: Mastodon / BlueSky / X

Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesday at 7am PT. Older video versions available there too.

Finally, if you want a bonus digest of every week of the show notes in email form, add your name and email to our friends of the show list. We'll never share it.

Michael #1: Backup Docker volumes locally or to any S3

* Via Bryan Weber (thanks Bryan!), who spotted it over on Virtualization HowTo. Find Bryan at bryanwweber.com.
* offen/docker-volume-backup is a lightweight companion container that backs up the volumes your apps actually depend on, then ships them somewhere safe.
* It's tiny: written in Go and about 25MB compressed, roughly 1/20th the size of the shell-based image (jareware/docker-volume-backup) that inspired it.
* Drop it into your docker compose file as a backup service, mount the volumes you care about as read-only, and you're off.
* Push backups to a pile of destinations: a local directory, plus any S3, WebDAV, Azure Blob Storage, Dropbox, Google Drive, or SSH-compatible target. Mix and match as many as you want in one run.
* Recurring cron-style backups in a Compose setup, or one-off backups straight from the Docker CLI.
* Production-friendly touches worth calling out:
  * Rotates away old backups so you don't quietly fill the disk.
  * GPG encryption for your archives.
  * Notifications on finished and failed runs (so you find out about failures before you need the backup).
  * Stop a container during backup for a consistent snapshot using a simple docker-volume-backup.stop-during-backup=true label, then auto-restart it.
  * Run custom commands during the backup lifecycle (great for a database dump before the file copy).
  * Docker Swarm support, plus arm64 and arm/v7 builds. Hello, Raspberry Pi homelab.
* Fun aside from Bryan: he searched our back catalog for this tool and the search came back so fast he thought it hadn't run. Love to hear it.

Calvin #2: Pyodide 314.0 Release

* PEP 783 is the real news: Pyodide maintainers used to hand-build 300+ packages. Now anyone can publish Pyodide wheels to PyPI with cibuildwheel.
* The version jump from 0.29 to 314.0 is intentional. It now tracks the Python version, so 314.x = Python 3.14. Binary compatibility is locked per Python cycle, meaning packages you build today won't break on the next Pyodide release.
* sqlite3, ssl, and lzma are back in the default stdlib, so no more await pyodide.loadPackage("sqlite3"). Bigger download, but a much smoother experience for newcomers.
* The bigint precision bug is fixed: values above 2^53 were silently losing precision when crossing the Python/JS boundary. The new JsBigInt type makes the roundtrip correct. Worth flagging if anyone is doing numeric work in a browser app.
* Experimental TCP sockets in Node.js: you can now connect Pyodide to a real database (MySQL, PostgreSQL, Redis tested) when running server-side. Blurs the line between "Python in the browser" and "Python runtime anywhere Wasm runs."

Michael #3: nb-cli: A Command-Line Interface for AI Agents and Notebook Automation

* From Piyush Jain (Jupyter and LangChain maintainer) on the Jupyter blog: nb-cli: A Command-Line Interface for AI Agents and Notebook Automation.
* nb-cli is an experimental, Rust-based CLI to read, write, execute, and search Jupyter notebooks.
* The premise: agents are great at CLIs but terrible at hand-editing the nested JSON in an .ipynb, so let them operate on the notebook from the outside instead of running inside it.
* Works with or without a Jupyter server. No server? It reads/writes .ipynb files directly and talks to kernels over ZeroMQ. Connected to a live JupyterLab, your edits show up instantly via Y.js (the same CRDT Jupyter uses).
* Smart output format: instead of token-heavy JSON or ambiguous plain markdown, it uses @@cell / @@output sentinels with inline metadata. Less wasted context, unambiguous structure, and it degrades gracefully on truncation.
* The payoff is composability. "Add a summary section and run it" becomes one shell pipeline instead of six agent tool calls. And nb search notebook.ipynb --with-errors returns only the failing cells, so the agent skips the cells that worked.
* Claude Code tie-in: it ships as an agent skill. npx skills install jupyter-ai-contrib/nb-cli and your agent can drive notebooks via nb.
* Out of jupyter-ai-contrib, which aims to become an official Jupyter AI subproject. Still early (crates.io is at v0.0.5), so kick the tires before anything load-bearing.
* See also marimo-pair.

Calvin #4: Hindsight Agent Memory That Learns

* AI agents forget everything between sessions. Hindsight gives them persistent memory that learns over time.
* Simple three-method API: retain(), recall(), reflect() to store, retrieve, and reason over memories.
* TEMPR retrieval runs semantic, keyword, graph, and temporal search in parallel for accurate results.
* Automatically consolidates related facts into durable observations instead of piling up duplicates.
* pip install hindsight-all runs the entire server in-process; integrates with LangChain, LlamaIndex, Pydantic AI, CrewAI, and more.

Extras

Calvin:

* Clanker: A Word For The Machine
* **Ponytail: You know him. Long ponytail. Oval glasses. Has been at the company longer than the version control**
* **Klangk: Multi-User AI Sandboxing, Collaboration and Coding Platform**
* Cursor announces Origin performative-ui to quick start your new idea

Michael:

* Astral Joins OpenAI: The Interview
* SpaceX to acquire Cursor
* And OpenAI renews Open Source support
* Portuguese subtitles are now available for Talk Python courses
* DSF is hiring including Six Feet Up support

Joke: Oh Babe…
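A quick illustration of the Pyodide bigint item above. This is a sketch in plain CPython, not Pyodide's actual conversion code: it assumes that pushing an integer through a 64-bit float is a fair stand-in for what happens when a Python int lands in a JavaScript Number. The helper name survives_double_roundtrip is made up for this example.

```python
# A JavaScript Number is an IEEE 754 double with a 53-bit significand,
# so integers are only guaranteed to be exact up to 2**53.
LIMIT = 2 ** 53


def survives_double_roundtrip(n: int) -> bool:
    """Return True if n is unchanged after a trip through a 64-bit float.

    Hypothetical helper: it only simulates the precision limit and does
    not call into Pyodide or JavaScript.
    """
    return int(float(n)) == n


if __name__ == "__main__":
    print(survives_double_roundtrip(LIMIT))      # True: 2**53 is exactly representable
    print(survives_double_roundtrip(LIMIT + 1))  # False: the value silently changes
    print(int(float(LIMIT + 1)))                 # 9007199254740992, i.e. back to 2**53
```

Nothing raises when the value changes, which is why this kind of bug is easy to miss. If you handle IDs or counters above 2^53 in a browser app, that is the case to re-test after upgrading.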
Steve Engelbrecht started Sitation from a rental apartment in Somerville, Massachusetts, five weeks after being laid off in the chaos that followed 9/11. Today it's a 62-person commerce enablement firm with a client roster of household names and a defensible niche the big SIs can't easily replicate.

Recorded live at Salsify's Digital Shelf Summit in Atlanta, Christian sat down with Steve, founder and CEO of Sitation, for a conversation about building a services-plus-software business in commerce, how AI is rewriting the buy-vs-build equation, and why a 62-person specialist can out-maneuver Deloitte Digital and Accenture Song in product data.

What we cover:

* The Sitation origin story and the early bet on PIM before it was a category
* The three pillars of the business today (systems integration, managed services, and proprietary software)
* Why the software-services convergence is playing out in real time
* The "headless PIM in 2026" conversation with Salsify's CEO and what AI agents, MCP, and CLIs mean for the future of product data
* How AI lowered the bar for participation and changed buy-vs-build
* The Philips case study: a 111% conversion lift on a single SKU by optimizing content, not price
* Why 90%+ of Sitation's team came from industry and how that makes them stickier than the big SIs
* How Steve thinks about Sitation's future: international expansion as a platform vs. fitting neatly into a larger strategic's plans

⏱️ TIMESTAMPS

0:26 - Welcome from Salsify's Digital Shelf Summit in Atlanta
1:00 - The origin story: first day of work September 10, 2001, laid off five weeks later
2:11 - Early to commerce enablement, and Boston as a commerce software hotbed
3:02 - What Sitation does today: the three business segments
5:25 - The 2019 "pick a lane" problem and why software-services convergence vindicated the strategy
6:16 - How AI is changing the buy-vs-build equation
7:36 - The "headless PIM in 2026" conversation with Salsify's CEO
8:33 - Salesforce going headless and the new customization opportunity for SIs
10:00 - APIs, the MCP revolution, CLIs, and why schema matters for AI agents
11:05 - How a 62-person firm out-maneuvers multi-thousand-person SIs
11:42 - Why this is a massive market, not a zero-sum game
12:30 - The Philips case study: 111% conversion lift on one SKU without touching price
13:30 - Why multinationals choose a boutique over Deloitte Digital or Accenture Song
15:46 - The strategic question: platform play or acquisition target?
16:29 - International expansion as the organic (or capital-backed) growth path
17:40 - Why Sitation's platform credentials make it an attractive, hard-to-replicate target
18:45 - Why you can't build Sitation's early-mover position: you have to buy it
Join Tyler Wells, Co-founder and CTO of BrainGrid, for a forward-looking discussion on how artificial intelligence is rewriting the rules of product development. Boasting over 25 years of distributed systems engineering—including a foundational tenure at Skype building Facebook's first video-calling engine and 7+ years directing Video and global SRE at Twilio—Tyler has built infra where structural failure was not an option. In this episode, we explore why the traditional constraints of software engineering—headcount, timelines, and budgets—are dissolving, leaving a brand-new bottleneck at the front of the innovation cycle: human imagination.
An airhacks.fm conversation with Bruno Borges (@brunoborges) about: discussion about the JAZ command launcher for Java, JVM tuning and default ergonomics for containers versus dedicated cloud environments, replacing the Java launcher with jaz in container images, supporting Java 8 to 25, maximizing resource utilization on kubernetes to reduce waste, running Java on Azure Functions, Azure App Service deploying a fat JAR without a container image, Azure Container Apps as a platform on AKS without YAML, Azure Kubernetes Service and AKS Automatic, Bicep as infrastructure as code, deploying a JAR to Kubernetes via OCI artifacts and a custom operator, Microsoft Foundry and the Microsoft Agent Framework, Semantic Kernel learnings, the Copilot SDK for Java communicating with headless CLIs, A2A and ACP protocols and MCP, agents as microservices with scoped tasks, guardrails, and sandboxing, per-agent model selection for cost and reasoning trade-offs, observability and traceability between agents with opentelemetry, grounding LLMs against MicroProfile, Jakarta EE, JAX-RS normative RFC 2119 specifications for hallucination-free Java code generation, the Boundary Control Entity pattern and business components as Java packages, package-info.java for semantic context, GitHub Copilot skills and custom instructions in Visual Studio Code, the AI Rails skills site, zero-dependency Java CLI scripting, reducing dependencies by reusing source code instead of JARs, the org.json reference implementation reduced to five classes, StackGres and OnGres running Quarkus and GraalVM to manage Postgres on Kubernetes, the Digg Into Java community Bruno Borges on twitter: @brunoborges
I'm excited to work with Microsoft once again as the presenting sponsors of the AI Engineer World's Fair! We'll be streaming live from MS Build today for a special crossover pod with our friends at No Priors and the one and only Satya Nadella. However, we did not hold back with this interview - we asked all the burning questions about uptime and Copilot that we know you have in your minds. Let's go!

For almost two decades, GitHub has been the home of software, where both open source and closed source flow through commits, pull requests, reviews, actions, etc.

This ecosystem flourished as open-source maintainers and contributors continued shipping code for the benefit of the community. However, as coding agents began to ship mass quantities of code (growing 1400% in 2026), it marked a new era that was both extremely exciting and challenging for GitHub.

While these agents help more people ship more projects, they also significantly raise the floor on how much code is shipped, how often it is shipped, and how many people commit code, with order-of-magnitude multiples in basically every dimension of GitHub infrastructure.

Now GitHub inevitably experiences more pressure on its infrastructure, which was originally designed around human developers moving at human speed. This has resulted in a very publicly notable uptime story.

So it raises the question of whether current systems around code can absorb what AI produces. Can CI/CD keep up when every idea becomes a build? Can open source maintainers survive floods of AI-generated slop contributions? Can GitHub preserve the human social contract of software while becoming the operating layer for agents?

Which brings us to the perfect person to answer these questions: GitHub COO Kyle Daigle. In this episode, he joins swyx to unpack what happens when AI doesn't just autocomplete code, but starts changing how companies operate, how open source works, how pull requests get reviewed, and how GitHub itself has to scale.
We go deep on GitHub's internal AI workflows: micro-skills, WorkIQ, MCP, Slack, Teams, email, Copilot workflows, the new Copilot desktop app, CLI, cloud agents, and how Kyle uses agents to look backwards across company context before deciding what to do next. Kyle also reflects on GitHub's history building webhooks, APIs, Actions, npm, Dependabot, and Semmle, why the AI era is breaking GitHub in new ways, how Actions became a general-purpose compute layer, and what Copilot becomes after code completion.

Full Video Pod

We discuss:

* Kyle's expanded role across GitHub
* How AI got Kyle coding again after years in leadership
* Why GitHub rolls out AI through existing workflows instead of forcing new tools
* WorkIQ, MCP, Slack, Teams, email, and GitHub as company context
* Why massive “mega-skills” are giving way to small, atomic micro-skills
* How AI changes summarization, communications, marketing, and analyst work
* Why former developers in leadership may have a unique advantage in the AI era
* Kyle's “15 agents on Saturday” workflow
* How Kyle built an AI-generated executive presentation for CRO/CFO teams
* Why AI changes the chief of staff role without removing the human work
* GitHub Actions, webhooks, arbitrary code execution, and secure agent compute
* The npm acquisition, supply-chain security, 2FA, and token invalidation
* Slop forks, vendoring, and whether AI agents change dependency management
* What pull requests become when most PRs come from agents
* Prompt requests, vouching, AI review, and trust in open source
* What counts as a “developer” when AI lowers the barrier to building
* GitHub Spark, low-code, and why GitHub refuses to hide the code
* 14x commit growth, Actions load, databases, monorepos, and availability
* Copilot's evolution from completion to CLI, desktop app, cloud agents, and SDK
* Context, memory, rules, and making GitHub “act like Kyle wants it to act”
* Ambient AI, OpenClaw, enterprise security, and the new operating system for agents
* What swyx should ask Satya Nadella about Microsoft's AI future

Kyle Daigle

* LinkedIn: https://www.linkedin.com/in/kyledaigle
* X: https://x.com/kdaigle

Timestamps

00:00:00 Introduction
00:03:36 Why AI Got Kyle Coding Again
00:07:04 Running GitHub with AI: WorkIQ, MCP, Slack, Teams, and Skills
00:15:39 The Golden Age for Former Developers in Leadership
00:17:31 15 Agents on Saturday and AI-Generated Executive Work
00:20:20 How AI Changes the Chief of Staff Role
00:21:45 GitHub's History: Actions, npm, Webhooks, and Open Source
00:28:45 Slop Forks, Vendoring, and AI Dependency Management
00:33:57 Pull Requests, Prompt Requests, and Trust in Agent-Generated Code
00:41:21 GitHub Stars, 200M+ Developers, and the New AI Builder Wave
00:45:15 GitHub Spark, Low-Code, and Why GitHub Still Shows the Code
00:47:38 GitHub's Hardest Era: 14x Growth, Reliability, and Scale
00:59:21 Actions as the Compute Layer for CI/CD and Automation
01:02:04 The State and Future of GitHub Copilot
01:08:24 Ambient AI, Background Agents, and the Future of the SDLC
01:13:09 OpenClaw, Enterprise Security, and the New OS for Agents
01:18:03 Build Announcements, WorkIQ, FoundryIQ, and Microsoft Context
01:21:41 What Should swyx Ask Satya?

Transcript

Introduction: Kyle Daigle's Expanded Role at GitHub and Microsoft

Swyx [00:00:00]: We're here with Kyle Daigle, COO of GitHub. Welcome.

Kyle [00:00:07]: Hey, thanks for having me.

Swyx [00:00:08]: You're not just COO of GitHub. People know you as that. You have a new role.

Kyle [00:00:11]: So I have an expanded role now. I've been working at GitHub for thirteen years and doing all things developer. Joined as a developer myself. And now, I'm also responsible as the CMO of Developer for Microsoft.
And so all the kind of learnings and passion for developers and how we work with them and how we communicate and how we bring our products to market, we're also bringing that expertise to the broader Microsoft ecosystem and helping every developer that uses a Microsoft product or would like to have a sort of similar experience that they've had with GitHub over the years. So it's a different role in some ways, but it's also just building on the experience that I've had at GitHub of just sort of tell the truth, be authentic, show people how to use it and then let the products speak for themselves. Now just doing that with all of Microsoft.

Swyx [00:01:09]: We'll be releasing this in conjunction with Build. You got lots of stuff planned, and we can sort of touch on that whenever it's appropriate. I think one of the interesting things is I rarely meet a COO who's also a CMO. I think you're very outward facing and you're very confident publicly. That's rare. Do you actually view yourself as COO? What is your thing?

From GitHub Developer to COO/CMO: Building the Platform and Operating GitHub

Kyle [00:01:33]: I think for me, it's been funny. The titles have always felt a little strange to me. I joined GitHub as a developer. I wrote so much of the...

Swyx [00:01:46]: Let's bring that up. You wrote the back ends?

Kyle [00:01:48]: I was going through some old photos, when folks were talking about how things were being built or how there was a build GitHub. I built webhooks and worked with teams building the API, built the platform layer. Anything that integrated with GitHub, up until really twenty eighteen, I built or ran the engineering teams. And that's kind of where the beginning of my passion always was: helping people build things and deliver them to their customers. And so being a developer, building for developers was always super unique.
I think as my role expanded, it became my ability to talk to not just developers, but also enterprise customers or business leaders and have this translation layer. And then through all those years, GitHub has always operated pretty uniquely. Post-pandemic, working remotely was not as novel as it was when GitHub started in two thousand and eight. But all that expertise of running remote teams, doing it well, became this sort of bigger role, ultimately turning into the COO role of how do we operate GitHub in the way that GitHub's always operated after the Microsoft acquisition. And kind of so on from there. So for me, I still code. I love coding, but the problem has always been people. It's a much harder problem to support our own employees, and a harder problem to communicate to developers and enterprise buyers what we're building and why it matters, 'cause those are two very different messages. And so getting to work in the mix of COO, CMO, also just being a dev, I think is what's kept me at GitHub for so long.

AI Workflows for Leadership: Commits, Retrospectives, and Context

Swyx [00:03:40]: Apparently your commits have gone up. What's this? What's going on?

Kyle [00:03:45]: Rui's called me out pretty aggressively. So I think, as you can imagine, you can see my normal era of being a dev in the twenty thirteen, twenty fourteen era, and then moving into management, and then ultimately the COO role. I think what you see there is me really getting back to coding thanks to AI. Similar to attaching problems between how to market and how to operate a business and how to code, I find building agents and workflows that are connecting very disparate problems to be what's driving this. So some of it's writing software. A lot of it is connecting a ton of different data sources to help me out.
But that is completely me really diving in on the AI side, trying out our tools, trying out everyone's tools, but building for me, building for the non-technical leader (though I'm technical), and how we're able to use these tools for more than just the simple call and response that I think a lot of the non-technical folks get: your employer says you have to use AI, and so everyone uses ChatGPT or Copilot or Claude or whatever. To really get into how this is going to help me out, I find that it's not the “I need to write a blog post” kind of simple examples. It's helping people find the workflows of, “Okay, I need you to go through all the PRs today. I need you to go through everything that we've posted online. I need you to go through what we did the last three months. Go through all of my Obsidian notes for any mentions of this, then go through my transcripts at work.” We use Teams, so: using WorkIQ, go call that MCP server, grab all the transcripts, go through all the Slack, and then build me out the plan of what this week's messaging actually was. That's something that was impossible, because for me, I find what most of this here is, is actually less building forward. It's actually a recursive loop backwards. I'm always looking at what had happened first. Go back through the week and tell me what we did, what worked, what didn't work. And then tell me, in the next three or four days, what would you tweak based on this sort of looking backwards and then looking ahead a little bit? I find that to be so much more valuable, especially for non-technical folks, because retrospection is something LLMs are actually very good at. Finding all the patterns, pulling them out, and then applying that retrospection to just a couple of days or a short period of time. It's all a bunch of apps that I've built, and I've launched a bunch of internal tools. I use the new GitHub Copilot app, the desktop app with workflows.
Every time I crack open my laptop, it's running workflows for me. It's just a ton of different stuff and of course, it all ends up on GitHub.

Swyx [00:06:47]: Of course. That's where stuff is hosted. Man, there's so much to ask you. I was going to leave the “how do you run a company with AI” thing to the end. I have to double click on one thing. You said you are looking back at the week. You're understanding what happens. When you say “we”, that's three thousand people. How?

Rolling Out AI Internally: Skills, CLIs, and Company Context

Kyle [00:07:09]: I think when we started rolling out AI internally beyond engineering, one of the things that I was really passionate about is we have to do this in a way where no one has to change how they work. I don't want to have to teach you a tool. I don't want to have to teach you something new. And so for us, we tried out a few tools. Most of them don't work because I've got to get you on board. I've got to teach you how to use it. What we've actually ended up doing is we've built a set of skills internally. We each have our set of skills, and we've just been distributing, even to the non-technical folks, the CLI. And then effectively, we're just giving it access to read about everything that we're writing. So for us, that's usually GitHub, Teams, Email, and Slack. So Teams for video chat, generally speaking.

Swyx [00:08:03]: Teams and Slack?

Kyle [00:08:04]: So we use Teams for video communication, but we don't use it for chat. GitHub, for a long history, right? We're always...

Swyx [00:08:13]: Also Slack.

Kyle [00:08:14]: Talking about ChatOps, and everything is built into Slack.
Like every command, every flow.

Swyx [00:08:18]: So even though you have been acquired for, I don't know, eight years now...

Kyle [00:08:22]: We still...

Swyx [00:08:23]: You still use Slack?

Kyle [00:08:23]: It's a purpose-built tool for us, and I think the reality is that moving off of it would be, bluntly, so expensive, simply because all the tooling is baked in with that paradigm. And they both have their pros and cons, but they don't work the same way at all. We still use a bunch of different tools because it's the purpose-built tools that we need. And then...

Swyx [00:08:47]: Well, the same doesn't go for the rest of Microsoft, presumably.

Kyle [00:08:50]: Various teams operate...

Swyx [00:08:53]: They make their own decisions.

Kyle [00:08:54]: Various ways. I think it just matters what you're trying to do. But we do work across kind of every tool that we use, and then by giving everyone access to all of that context and the new WorkIQ MCP server, which is quite cool if you do live in the M365 world, I can ask it all these backwards-facing questions, and it's incredibly important for our teams that are working remotely. There's a lot of stuff you miss when you're not in an office, and we are spread out all over the world. So most of that is looking back. And then we post, either automatically into GitHub issues or discussions, these sorts of findings or our industry reports. What's happening this morning, today, yesterday. A little automation gets run. We'll use the app. We might use GitHub Actions with our agentic workflows just to go do that run, and then we push it into GitHub, and we keep having a conversation. So usually for us, it's about that sort of looking back, looking forward on the non-technical side. And then of course for a lot of those folks, it's also building an app, pushing it to GitHub Pages or pushing it somewhere to host it, et cetera.
But it's just enabling everyone with that power of “it's going to take me a week to figure this out.” Instead, we're going, “Okay, I built a skill. Let's put it into a repo. We'll all share that skill together, and then we'll use the CLI or now the app just to run it.”

Micro Skills vs. Mega Skills: How GitHub Uses AI at Work

Swyx [00:10:26]: All right. I think we're going straight into the team management and productivity thing. I think a lot of people are getting various levels of LLM psychosis. How do you manage the bloat of skills? Everyone has their thing, and they're trying to promote it to the rest of their peers in their org, right? And obviously, whoever becomes a skill influencer internally becomes an AI leader of sorts. I assume you have those.

Kyle [00:10:50]: I think we have...

Swyx [00:10:52]: And I assume it's a mess. Yeah.

Kyle [00:10:54]: I think the reality is there's two pieces. First is, I think that we're ending the era of these massive, beautiful, perfect skills that are just not any of those things. 'Cause for a while, every tweet every day is “go download the skills, the perfectly managed thing to do this entire workflow.” And I think what we've found (I was just with my team this week, and we were talking about the skill side) is that we're really talking about these incredibly micro skills that are just doing one thing for us very well, versus a skill that's going to do, like I said, that full report. That doesn't really exist on our side anymore. It's usually a single skill that's going to identify the most important marketing information given any MCP server. Like, this is the most important thing.
Less about “stitch a bunch of tools together and have it produce this mega output,” because then weeks go by, months go by, things change, and you want to tweak...

Swyx [00:11:58]: It's brittle.

Kyle [00:11:58]: Your mega skill, and you're screwed. You can't do that. And so now we're really just talking about the Legos we're using and just letting the instruction book be something we're all putting together. Whereas I think a lot of AI skills for a while have been that mega instruction book style.

Swyx [00:12:15]: I've thought a lot about Postel's law. I don't know if that's a term that means things to folks. It's the idea that you should be liberal in what you accept and strict in what you output, right? And I think that's a good framing principle for skills. These are my skills, obviously on GitHub. You know how some repos in GitHub are special repos? I feel like we should sort of reify the slash skills and give it some kind of special presentation. Anyway, so, yeah, this is one of those: download anything, transcribe anything, and then you can string together the atomic skills that do one thing well into some kind of orchestration skill that calls other skills. Does that match?

Kyle [00:12:56]: I think so. I think that the...

Swyx [00:13:00]: Summarize anything.

Kyle [00:13:01]: For me, summarizing something... I do communications and PR and analyst relations and marketing and customer activities, and so my “summarize everything” is very different for each one of those contexts. 'Cause if I'm summarizing something for an analyst, that's a very different thing than, probably, how I'm going to summarize something for a customer meeting or an engagement. So that's I think the difference when we're talking about the tools I might use on Saturday or the skills I might use on a Saturday when it's just for Kyle.
Yeah, those kind of have an atomic actual tool underneath, or maybe a skill, and then “Kyle cares about X.” But I think when we're talking about work and enabling the marketers and communicators there, it's the atomic “this is what good summarization is,” and then “this is what I care about” for marketing, for communications, for whatever. And that I think is the interesting matrix problem when we go from a developer set of concerns to all kinds of different professions: what that word means to me is different than what it means to you, is different than what it means to the analyst or the salesperson, and that's where I think the matrix mess is that we're still starting to find. It's about these mega skills, but they're all just slight permutations, and those permutations are really important. It's the difference between someone reading this and going, “Did AI make this?” or “This makes total sense, and I would expect this when I'm giving a briefing to Gartner,” or whatever else.

Swyx [00:14:37]: I think the beauty of it maybe is that you don't have to be that careful about what goes in there. It doesn't have to exactly fit as long as it roughly is contained in there. I used to complain about plugin hell, basically. When you have a framework and then you have a hundred things that you need to integrate, everyone does... GitHub used to be bloated full of these things. And now we don't need them anymore 'cause now you just use skills.

Former Developers in Leadership: AI as a Creation Multiplier

Kyle [00:15:00]: And I think the most magical thing is just that I can also crack it open. Yes, I could go change how the plugin is coded, or I could go do that now with AI, but I think there's just something more magical about getting a response back and being like, “That's not right,” and then you just crack the skill open, you just type English words, and it's different.
That building block is just, I think, very unique. Once I get everyone to kind of understand how to best make those changes to get the most power out of them.
Swyx [00:15:36]: You have your peer group of people like you. Is there a common framing for... Something I'm feeling, which is true, is: is this a golden age for former developers who are now in leadership? Because you can wield the tools, you know the right words, you're maybe not too close to the details. Doesn't matter. But you're more effective than someone who doesn't come from that background.
Kyle [00:15:59]: I think that the secret has always been your ability to identify patterns and solve problems, and for folks like myself that don't code day to day anymore, that is what made me successful as a developer, made me successful as a COO and now CMO. And so now that I have access to get and write code, I'm applying that sort of pattern finding and problem solving, and I still know enough about how to then go and say, “Oh, I want to make an app, but I don't want to break into jail or create something that's not going to be able to work or to be deployed at scale or whatever.” That ability to apply all that additional business knowledge and still code, I think, is what makes that so interesting to me. Slightly different than, I think, some of the other technical leaders that became business leaders and now are going back to their apps and updating them. Good for them. But I think the much more interesting thing is, well, now I have this whole new set of expertise over ten plus years. Why not take that and use that as a developer with these AI tools? So I definitely think that makes me more powerful, but I think that's true for every dev as well. Most of the dev friends I still have also have some other underlying skill and passion. There are really talented, very kind of linear computer science software devs, absolutely.
I just find that the folks that came from a different career, went to school for something else, went off and did this random thing, and then became a software dev, or were a dev, did a random thing, and came back, learning that extra set of information, learning those extra skills, and now having the power of an AI where I can crank up fifteen agents on Saturday while my kids are doing lacrosse, that's really powerful. And I think it gets me back to that feeling of creation, and it's very hard to replicate that in most other senses. That first time you build an app and you click it and you show someone, that's magical. And so being able to do that not just in code, but across all kinds of different assets, that's huge. Every year we do our revenue planning. We talk about, okay, what is it going to look like for next year? And of course, as you imagine, there's slideshows everywhere talking about what are we going to talk about, what's the narrative, et cetera. And so, as you said, I'm like, “Okay, well, I could probably just build something to build this, and that way I don't have to go build the whole spreadsheet or pass it to my team.” So we went through this process, and I got all the information and used the skills I mentioned. I built a little app just so I could look at some of the information in a SQLite database more easily. And I ultimately built this entire presentation without touching any of it, and I was like, “Okay, I'm just going to present this to our CRO, the CFO, their teams,” without mentioning I'd built it with AI. I built a skill to make it look very much not AI driven. Just not pretty.
AI-Generated Presentations, Human Taste, and the Changing Chief of Staff Role
Swyx [00:19:03]: Like a design. Yeah.
Kyle [00:19:03]: Not pretty. But just very clearly not AI.
Kind of like, don't do anything interesting.
Swyx [00:19:08]: Yeah, that is valuable.
Kyle [00:19:08]: Exactly. We did the whole thing through. It used my notes from Obsidian, it used all the context I mentioned before, the plans, and it never came up once that it was AI generated.
Swyx [00:19:20]: It didn't matter.
Kyle [00:19:20]: Never once. It didn't matter. And so now I take
Swyx [00:19:23]: This is a tool
Kyle [00:19:23]: I can take that tool and go, “Look, I don't want you to go build slideshows.” They're just helping us share information with each other. If this thing can do it with a little bit of crafting from you and then we can look at it together, awesome. There's no value in all that extra work. I think that the ability to make it look humanly bad and build a little app to manipulate the data is part of that upside for devs that are now in leadership roles. Because, like I said before, that's all a people problem. I know if you've used a coworker or not to build a slide deck, unless you spent a bunch of time to not do it.
Swyx [00:20:07]: I know, but I think there's a certain charm to just being blatantly AI. 'Cause you're just honest about it: there may be mistakes here that I cannot vouch for. So how much value is there? But anyway, I think the real question I want to ask is: you were chief of staff to Thomas. And in the pre-AI world, that would've been a chief of staff job, like, “Can you prep me these slides?” and all that. And now you do it yourself.
Kyle [00:20:35]: I still have a chief of staff. Because the difference is, it's sort of the discussion every time we have some sort of technology evolution: it's not that the roles all go away, they just change.
And so yeah, I don't have someone spending all their time building out slides and presentations for me 'cause I don't need that anymore. But now I need that person that is able to go and find all the different connections between humans in those discussions to help me find out, okay, I should be meeting with this group and this team, and they have an opportunity, and I'm going to be in San Francisco today, I'm going to be in Seattle tomorrow. Those sorts of human connection aspects are still incredibly valuable and have always been a big part of that chief of staff role. But now, just like chiefs of staff are not opening up letters to process, they're doing emails. It's the same thing. And now they're not building out as many of these presentations because they have the ability to have an AI take it on and share that with me, and great. Let's keep moving, 'cause it's allowing us to go faster and make better decisions more quickly.
Swyx [00:21:45]: Awesome. Well, we can dive into more productivity insights as we go. I did want to do a brief history of Kyle and GitHub. Because we started here. And then you were also involved in the NPM acquisition. I do want to touch upon that. And then more recently, I just want to bring it up to present day, where we're having uptime issues, which, transparently, we've already addressed publicly, but we'll discuss in the pod. Did I miss anything? Any other major highlights? Obviously, it's a lot of years to cover.
A Brief History of GitHub: Webhooks, Actions, Acquisitions, and Platform Evolution
Kyle [00:22:15]: No, I think one highlight was right before the acquisition closed in twenty eighteen, I got to launch the first version of Actions
Swyx [00:22:27]: Oh
Kyle [00:22:27]: At GitHub Universe. So it was
Swyx [00:22:29]: They're that young?
Kyle [00:22:30]: It was October of twenty eighteen, I think. Yeah.
Yeah.
Swyx [00:22:33]: Gee, Jesus.
Kyle [00:22:34]: I was the engineering leader on that project and got to launch that. And then, yeah, we did acquisitions of NPM, as you said, Semmle, Dependabot, Pull Panda, a whole bunch of things. That was a big
Swyx [00:22:47]: Pull Panda.
Kyle [00:22:48]: Abi is doing well.
Swyx [00:22:51]: DX. Holy crap.
Kyle [00:22:52]: Did well on DX. And that was the big shift after the acquisition. I had to join the sort of business side.
Swyx [00:23:00]: So I need to hit you on some of these things 'cause you were there. Right? And how often do I get to talk to someone who was there? But yeah, Actions. Is that the number one source of security issues on GitHub?
Kyle [00:23:11]: Oh, I think that the number one source of security issues is probably the literal code in everyone's underlying repositories. I would say, going back further than that, if you remember what I had to show in this graph, and I didn't say this before, this is ultimately webhooks.
Swyx [00:23:30]: Yeah.
Kyle [00:23:31]: Like circa whatever it was.
Swyx [00:23:32]: It says Hookshot in there.
Kyle [00:23:32]: I forget. Yeah. Yeah, Hookshot's in there. And so back then, it says GitHub Services. Do you see, it says Hookshot FE for front end, and then it says GitHub Services. GitHub Services back in the old days, right? We had a repository that was Ruby code, and you could write any Ruby code in there, and then we would execute that on your behalf as a service, and that way, if you were trying to integrate with something, we would run it for you.
Swyx [00:23:57]: And of course no containers 'cause
Kyle [00:23:58]: No, 'cause it was
Swyx [00:23:59]: Well, no containers
Kyle [00:24:00]: Twenty fourteen. And so there was some isolation obviously, but it was mostly the separations on the server level.
That's one example, along with the very old version of Pages, which ran on its own containerization infrastructure, not on Actions.
Swyx [00:24:15]: Which, like, all-time great product.
Kyle [00:24:16]: Pages powers the internet at this point to some degree. Those were places where clearly there were no issues, to my knowledge. But it was those things where I'm looking at it and going, “Okay, well, we can't be running arbitrary Ruby code on everyone's behalf.” Then containerizing all of that up into Actions now, where, yeah, the containerization is really good. The pinning, most folks aren't pinning to a particular
Swyx [00:24:48]: Images
Kyle [00:24:48]: SHA, et cetera, in their workflows, and so that's a big place of pain for folks if they're just doing, similar to any dependency management, just V1 or newest or latest, I think. But that journey from that day of “Okay, we're just going to run all this arbitrary code, and it'll basically be okay,” to now, no, we have really good containerization. We have a new underlying agent containerization service. We're using it under the hood. It's through Azure. They recently announced it. The Azure Dev Compute, but it's very fast compute to be able to spin up your own cloud agents or whatnot. We're using it under the hood for some parts of the new
Swyx [00:25:36]: Microsoft Dev Box?
Kyle [00:25:37]: No. Dev Compute, yeah.
Swyx [00:25:41]: Hmm. Not finding it just yet.
Kyle [00:25:44]: Oh, it's in there somewhere.
Swyx [00:25:46]: All right. Well, we'll cut that out.
Kyle [00:25:47]: Sorry. But with Dev Compute, you can spin up really small VMs really quickly, so you're doing a tool call
Swyx [00:25:58]: Same concept
Kyle [00:25:58]: Just do it containerized, exactly.
So we're using that, so definitely moving in that direction to protect us from every piece of code that we're ultimately running.
Swyx [00:26:07]: Look, that grows into the full SDLC. Code hosting was just the start, and then it's grown beyond that. Let's talk about NPM, maybe, 'cause I think that's also a very major point in the industry. I do think it was looking for a home. It was kind of struggling as a business, right? I don't know how you would characterize that whole acquisition and how it
NPM, Package Security, and Keeping the Internet Running
Kyle [00:26:33]: When we were talking to the team, I think the big thing for both of us was to find a way to keep NPM, which was basically powering the internet then, and way more so now to some degree, running. Keep it going, keep continuing to scale. It was having scaling problems, if I recall, back at that time. They were doing some rewrites. It
Swyx [00:27:00]: That's cute compared to now.
Kyle [00:27:01]: Well, that's the thing: when I'm talking to folks now, there are so many more underlying uses of NPM than there were back when we had them join in with GitHub. But that was ultimately the goal. It was really, okay, we used to have pages. We have the world's code. Let's make sure that we can keep NPM running well for the world. And we put a bunch of time and investment into fixing some of the underlying backend, some of which we talked about, some of the manifest work, et cetera. And then now, really trying to bring the security posture of NPM up to speed. But it is a unique challenge in that every move that we make to make it more secure will break a lot of people. And security is paramount. And also, we take it very seriously. Any time that we have a problem with GitHub, or we make a change that makes us more secure but hurts, there's a snow day for developers or a really bad fire that they have to go put out.
And so we have changed the 2FA policies. We've changed the way the tokens work. When we find tokens that have been exposed or potentially exposed, we invalidate them, and
Swyx [00:28:22]: I love that feature in GitHub. Yeah, it's great
Kyle [00:28:23]: That creates issues, but that's the thing: we're trying to push the community forward without necessarily doing something that is going to break the contract that's been there for 15 years, or close to it, or some amount of years, on NPM.
Slop Forks, Vendoring, and the Future of Open Source Supply Chains
Swyx [00:28:43]: So now we're talking about open source and publishing. And I think there's something here with what people are calling slop forks, which I think Malte from Vercel is doing. And part of me thinks, well, the way to get past any vulnerabilities, let's just get rid of the concept of NPM. And we only publish source code. And anytime you want to import it, you have your coding agent look at it and then adapt whatever subset you're going to use, and you vendor it. But the AI vendors it. Is that realistic? I don't know. Will that solve all our security issues? I don't know.
Kyle [00:29:24]: I don't think it'll solve... So Mitchell Hashimoto was just talking about this today, and I think that in some ways, all things old are new again. Yeah, absolutely, vendoring everything. I do remember twenty thirteen, twenty fourteen.
Swyx [00:29:42]: Yeah. We must return to
Kyle [00:29:43]: We were vendoring everything. We were having actual discussions around, or at least I remember we were like, “Should we take this full thing?” “Why is this so big?
We only need this one file.” And so I do think there's something true there, where either taking only what you need or the dependencies just getting incredibly small over time will help to some degree, but it's not going to solve the fundamental problem, I don't think, because with an agent looking at the vulnerabilities, time and time again there's a million different ways in which we can convince an agent that this thing is secure or not and pull it in. Or we can do static code analysis or runtime testing to say whether the code works or not. That is, I think, the step that needs to continue to be invested in. The question is just how much scope. Should it be this enormous project that I'm pulling down, or should it be this piece? Either way, most companies are running some amount of security checking on the packages that they're bringing in or vendoring. That, I think, won't change. That's what Advanced Security does to some degree, Socket does to some degree. Everyone is doing a piece of that. How we each do that, especially when we're talking to enterprise customers, is just very different. No one wants one single way to do it. And I think that's always been GitHub's unique position in the world. I talk a lot to maintainers, I talk a lot to folks about this. We rarely start a process and a practice and push it onto the community. We usually wait for the sort of RFC process, socially or literally, everyone agreeing, and then we'll cement something in. Because otherwise we're
Maintainers, RFCs, Vouching, and the Social Layer of Trust
Swyx [00:31:35]: That fits your role in the ecosystem, yeah
Kyle [00:31:36]: We're GitHub. Yeah, we don't want to shape the whole thing. We want it to be figured out.
But how do you balance that sort of role in the industry, to keep everything as secure as is possible and make sure that you're not going to be compromised as a human, 'cause that's usually how it all happens, and not create a process or lock us into a flow that you're not going to, or Mitchell's not going to, or other open source projects aren't going to like? That's always been a tricky balance for us, and I think that's something that we haven't talked about enough: we're not going to be able to fix everything for everyone in a way that everyone is going to like. So help us, tell us what is working. When Mitchell was talking about the upvote, the up
Swyx [00:32:22]: I was going to bring up his thing. Yeah.
Kyle [00:32:23]: I forget what it... Yeah. I was chatting with him about this, and I put it on Twitter, and we also talked over DM, and it was, “We're going to keep working.” But I think the important thing is I do actually want to hear what isn't working for you. And be as specific and clear for your project as is possible. And to every piece of credit, over the many years that we've known each other through the industry, he's always done that, and I appreciate that, 'cause there are places that we need to fix up, and we hear from him, and we'll fix them up just like we do for all other kinds of maintainers. But that process between making those types of improvements and being more secure and creating, I forget what he calls it, it's not the proof process, not the claims process. Do you know what I'm talking about? His projects have a way for you to kind of
Swyx [00:33:13]: Vouch
Kyle [00:33:13]: Vouch. Thank you. Yeah. He has the vouch system for saying, “Hey, you should accept my PRs.” That's been
Swyx [00:33:20]: Just build this into GitHub.
I don't know.
Kyle [00:33:22]: Well, see, but that's the thing: you say that, and he and his community really like this, and then I'll go talk to other maintainers globally, and they're like, “No, this doesn't work for me.” And that is the tension, but also the kind of beauty of GitHub, depending on which way you look at it: we want to help maintainers, so we create all these tools to let you have more control over how much you take in from AI and PRs. But you can also use this. You can go use this project, and if it takes off and becomes the kind of mostly standard, then yeah, we probably wouldn't enforce it, but we would add it in, because that's the flow that we tend to do.
Swyx [00:34:02]: I hear a lot of people don't know the history of the pull request. That's something that GitHub standardized, basically.
Kyle [00:34:08]: Yeah. It was a very messy process beforehand, and now we have the benefit of it being the process. And now we have to go and figure out the next best process, or what adaptations change, or what does a pull request look like when eighty percent of your PRs are just coming from your agents and not from other devs?
Swyx [00:34:31]: Do you like the prompt request idea from Peter?
Kyle [00:34:34]: I think each idea has its merits. I'm not avoiding saying anything good or bad, but I feel like I've seen a version of, we have that, we have entire Thomas' store. Take all the assets of what you've built and put that in. I think that's got great ideas. There's all these various permutations of the PR flow, but I think the reason why there's not a single answer is ultimately we're trying to codify trust.
We're trying to say, “Okay, if Sean reviews this, I'm going to trust it because you're Sean, or you're the senior dev, or you're the whatever.” And right now, when we are working in a flow where an agent writes code and another agent reviews code and then Kyle goes and looks at it, the trust is kind of diffuse. And most of the tools that we're talking about are talking more about verification flows. We have more assets to look at, so I can probably say whether this is a good PR or not. But that still doesn't solve, I think, the human problem of: I'm looking at a PR and I want to know if I can trust it. And we still tend to use human signals for that. Mitchell approving it or Kyle approving it or whatever. And so I think that's why most of these options haven't really solved it: it's a social problem ultimately. It's a human problem to review it and agree. Or you fully trust the tool and you're imbuing that tool with full trust, which I think in some cases absolutely exists.
AI-Generated PRs, Trust, and the Waymo Analogy
Swyx [00:36:08]: And so, in the same way that there will be a tipping point in society when we don't allow humans to drive anymore because machines are measurably better than humans, I'm looking for that tipping point, right? Mythos is ridiculously expensive. Someday we'll have Mythos on a desktop. I don't know. Does that change the equation?
Kyle [00:36:30]: I think it's more... I took a Waymo here, and I was on my phone and not looking around at all. There are other self-driving vehicles that I would not trust while staring at the road. And I think that trust is something that is
Swyx [00:36:48]: Is this a Zoox thing? What is it
Kyle [00:36:50]: I think that is both. I think that is both. Like
Swyx [00:36:53]: There's Zoox in this robo taxi. That's it. It's
Kyle [00:36:56]: Well, depending on what level of self-driving.
But my point is sort of that I strongly believe that's a mixture of verifiable proof, like how many accidents, how much data, and so on, and the human aspect of how I feel when I'm in this car, what it tells me, et cetera. And so that's why I think some of our AI tools tend to imbue me with more of that feeling of trust, even if the data says this is 100% accurate. I feel like it takes more time for us to go, “Should I trust this or not?” And that's in the soft sense of startups with high agency, weekend projects, and open source. And then there's enterprises and regulated industries and everything else, and that is an even harder problem to go solve, because even when it is fully verified, not only do you have to have trust from the humans on the team, you probably have to have trust from multinational
Swyx [00:37:55]: Oh my God
Kyle [00:37:55]: Multiple governments around the world and regulating agencies. And so that's where I feel like, until we tip over, to your point, on the sort of human EQ side of it: I feel okay, this feels okay, it's been proven enough. Then the ball will start to roll a lot faster, where we'll end up getting to the “Okay, we can trust this,” and feel good about it in the most difficult of cases.
Reputation, Sponsors, Stars, and Bot Activity on GitHub
Swyx [00:38:18]: If human trust is the thing that matters, I feel like GitHub as the developer social network could maybe do more there. Vouchers are one system. But we have star counts, and then we have contributor rights, and that's it. And I feel like there should be more in that space. I don't know if there's any other design decisions there.
Kyle [00:38:37]: I think that one of the things that we don't really expose right now in this sort of way is some degree of hard trust and support, and for me, Sponsors is a good example of that.
Swyx [00:38:49]: Ah.
Kyle [00:38:49]: It costs you something.
To prove that I believe in your project and I trust you to some degree, or I want to support you at the very least.
Swyx [00:38:56]: Solve payments for open source. Why not?
Kyle [00:38:58]: I think that as we keep moving forward, right, there's more and more projects where I'm adding more and more dollars into Sponsors personally because I want to support them. I've probably never met them in person, but I know enough of their work that I want to support them. I think the thing that I don't love about stars or commit counts or anything else is ultimately, even with all of the various de-spamming and deduplication work or anti-abuse work that we do, these are all not active social signals. They're passive ones that are ultimately gamifiable. And you may trust me, but another open source maintainer may not. And on what heuristic should you be trusting me? That, I think, is kind of where some of our thinking is right now. What signal from me is most important to you? If you can define that, potentially, honestly, in an agentic workflow, that's what we see some of these open source projects do, where you have GitHub Actions, and then you have an agentic workflow that's calling AI, and you're setting these rules. Like: if Kyle has submitted and gotten accepted PRs across any given project, and has a social handle tied to his account in GitHub, and that social account's older than a certain amount. Really complex measures that matter to you, 'cause most open source projects have that heuristic built into their heads, if not written down in the contributing guidelines. You could take that and then go apply that and then just say, “Oh, we're not going to accept this PR.” Building something that is, I think, malleable to everyone's needs is a little bit better, rather than going, “Hmm, this account's too young.” Because what happens?
The attackers just go and create a multitude of accounts, and they wait until they age up. Needs to have a certain amount of stars? That's how star inflation happens. Need to have a certain amount of repos
Swyx [00:40:46]: Oh my God. Yeah
Kyle [00:40:47]: With PRs? They all just create repos and submit PRs to each other, and then they come in and do something nefarious. And so it's hard. It's hard to find the measure. So I think we're looking more at how can we provide you tools so you can kind of choose what's best for you. And of course, we'll give you some standards. But the trust vector gets down to, I don't know, some version of human digital ID like everyone's been talking about. Like, how do I prove that it's me
Swyx [00:41:13]: Give me your eyeballs
Kyle [00:41:14]: On the internet. Give me your eyeballs. Exactly.
Swyx [00:41:18]: I've got to keep moving on topics, but obviously I can go all day on this stuff because I've been involved in GitHub and open source my entire professional career. Stars. Very superficial. Everyone knows it. But I think time to one hundred thousand stars is the fastest I've ever seen. People just reached that in, I don't know, months. And at the same time I don't trust it, right? How many of these are real or bot or whatever. I don't know how to ask this, but what can we do about it? Like
Kyle [00:41:49]: Just
Swyx [00:41:49]: Is stars broken? Is stars fine?
Kyle [00:41:51]: I think that there's kind of two pieces. Obviously we're constantly trying to find ways in which users are producing spam, in which I would include only doing star gamification. When we find them, we pluck 'em out and we
Swyx [00:42:08]: But it's like a Whac-A-Mole
Kyle [00:42:10]: It's a hundred percent like a Whac-A-Mole
Swyx [00:42:11]: There's no way
Kyle [00:42:11]: Now, powered by AI to be helpful.
But I think more so what I'm seeing is, a lot of the fastest time to X tends to be because we're now inviting so many more people into software development on GitHub that the zeitgeist is just swarming. And it's
Swyx [00:42:32]: It's not just developers anymore
Kyle [00:42:33]: And it's not you and I. However you want to say what a developer is, it's not just folks who have been coding for a very long time. It's folks that have maybe started coding or only joined in since the AI era. And now
Swyx [00:42:44]: What's the latest Octoverse number? I know eighty million was the last number I remember of developers on GitHub
Kyle [00:42:50]: Oh, we're over 200 million now.
Swyx [00:42:53]: Okay. Well, so you see?
Kyle [00:42:55]: Over 200 million developers now.
Swyx [00:42:56]: But it's not developers, right? It's people with a GitHub account.
What Counts as a Developer in the AI Era?
Kyle [00:43:00]: So this is the biggest debate that I would say everyone loves to have at GitHub at this point. From my perspective, right, I think that there's clearly a difference between professional enterprise developer and then developers. But I think that the idea that we should be, I don't know, splitting hairs or segmenting developers in the early era of software development is not worth the time. So
Swyx [00:43:29]: When you get into gatekeeping
Kyle [00:43:31]: 100%
Swyx [00:43:31]: What is a developer?
Kyle [00:43:31]: 100%. 'Cause I wasn't a developer when I started writing code. I was going to
Swyx [00:43:36]: Oh, no. I cloned a thing seven years before I learned to code. And then I wrote about my learning-to-code journey, and people just called me a fraud 'cause I had a GitHub account. And I'm like, “Well, no, I just used GitHub, but I didn't know what I was doing.”
Kyle [00:43:49]: I remember that. I remember those sets of posts, and that's b******t.
So I fight very clearly on the line of: if you create code, if you have an idea and you create it in some way, like, I'm going to run it and use the app right now, you may still use AI in that moment, but that's okay. At some point you're going to do the next thing. You're going to have to learn about this database. You're going to fix a bug, whatever. We're all on the same journey, and those people are also hearing about the great new agent skill package or a new CLI tool or a new whatever. And those projects are going up because you want to be a part of this moment, just like I wanted to be a part of the Ruby community when Ruby was popping off when I started becoming a developer, and now I can just click the star button. And so I think that, yes, there's clearly some amount of spamming and gamification that we're working against, but I really think we're just seeing this whole new cohort of folks that are moving from technology to technology because they're not working on a 20-year-old software application. They're working on a side app that they built on the weekend for their friends or for their new idea or whatever. And that's how you see these enormous charts going up and to the right with stars.
Swyx [00:44:59]: I think something that's remarkable is the persistence that GitHub extends to those folks. Usually when I see platforms go into a new audience, they usually have to have a second platform with a different name that wraps the main platform. But somehow GitHub has been able to sort of persist and extend, and it's friendly and whatever. So it's nice.
Spark, Low-Code, and Always Showing the Code
Kyle [00:45:19]: That's partially why, I think, as we've tried to move into, I don't know, more low-code-y things... So we started working on Spark as a way to build an app and run it.
I think the reality is that anytime we try to put even a veneer on top of something, we still always show you the code. That's kind of a tenet. We're never going to hide the code from you, ever, because what
Swyx [00:45:52]: Why would you?
Kyle [00:45:52]: Yeah, that's the whole point. However, I think what we learned with things like Spark is that really the value of Spark for most devs is an easy runtime. And you may have a runtime or a host that you're going to use for that, or you just build something and run it, but the package of making that even more simple isn't really needed for folks that are trying to build software and not just trying to build an app, which is a slightly different goal. So I want to get you in, I want to get you comfortable. I think the best thing for me, as someone that did not traditionally come into software dev way back: I want anyone to be able to breach that chasm and not be in the... I don't know, I feel like we're still in an era of STEM. I've got a 12-year-old and an eight-year-old, and it's “We got to get 'em into STEM,” over and over. And I do the things that good parents do. I was like, “Oh, you want to do coding?” “Yes, I want to do coding.” Do coding classes. But now they're just not afraid of doing software. And that's, I think, the thing that's honestly kept me at GitHub for so long. Anyone should be able to go and build a thing, just like I can go change a light switch in my house. I'm not going to go into the breaker box 'cause I'll probably kill myself. But I can go change that light switch. Everyone should be able to go and say, “This fricking app doesn't do what I want.
I want it to work like this.” And that, I think, is what's kept us all connected with GitHub through the years, during the easiest of times or in the hard times, because of that opportunity of: we're the home for all developers, and we want everyone to be able to have that feeling that we've had of, “I had an idea, I created it, and holy s**t, here it is.”
Swyx [00:47:37]: Here it is. All right, I'm going to try to do more spicy questions.

GitHub's Hardest Scaling Moment: Growth, Agents, and Uptime

Kyle [00:47:42]: Great.
Swyx [00:47:42]: Is it an easy time now or a hard time?
Kyle [00:47:45]: Oh, at GitHub? It's a hard time. It's a hard time, and also, I was just with my team and I said, “This is also the best and most exciting time that I think I can remember at GitHub.” Because
Swyx [00:47:57]: Best of times, worst of times. It's never one
Kyle [00:47:59]: 'Cause we were talking about Octoverse reports, and usually we do an Octoverse report once a year, and we look at the numbers, and we say, “Oh my goodness.” I was at Universe in October saying, “This was the fastest year of growth that we've ever had,” right? And now we're doing more in a month than we did in a year last year.
Swyx [00:48:20]: You're talking about PRs.
Kyle [00:48:21]: Commits.
Swyx [00:48:21]: Commits, yeah.
Kyle [00:48:22]: PRs. You name it. By roughly every measure that we're looking at, there's some amount of growth that is much bigger, and that is breaking our system in new ways, not old ways. Webhooks were always notoriously unreliable over the years.
Swyx [00:48:38]: Whose fault is that?
Kyle [00:48:39]: Not anymore mine, but for a period of time, I'm sure you could pull up a tweet that was “It was me. I'm sorry.” But now, that got rewritten at a scale level that is still working and is not having problems today.
Now what we're finding isn't just the simple stuff that folks on Twitter or on the internet sometimes ask about, like “Hey, why is this like this?” Sure. There are absolutely silly problems that shouldn't exist. But now we're talking about unique, novel permission problems that happen only at a scale across all different objects or whatever, so that now we have to go rewrite this underlying system. And so there are problems that, yeah, caught us off guard, which I think I said. The growth is astronomical, but also we're making such material progress that I'm excited, once we've reimagined the underlying foundation layer, or pieces of it at least, about what's going to be possible when it's not just all of us and all the new people that are being developers and all of their agents and all the tools working together. Because that'll still happen in that GitHub tool, that GitHub community. But it's a hard day anytime we can't give you what you're looking for. We have the same problem internally. We operate through github.com. Of course, we have backups when things go down and whatnot for our own operations, but we feel it too. If it's not working, it's not working for us, and that's the promise of dogfooding for GitHub. It's always been true. We're using the same tool you're using. We're not using a super secret version. And so we also need it to be great for us, for our customers, of course for open source. And now an exponential growth of agents doing it too.
Swyx [00:50:32]: I wanted to load, for audio listeners who maybe haven't seen your tweets, whatever: one billion commits in twenty-five. Now it's two hundred and seventy-five million per week, on pace for fourteen billion this year, if growth remains linear. Is that still the pace? I don't know.
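Editor's note: the figures quoted above can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch. It assumes the weekly rate stays flat for a full 52-week year, which is a simplification introduced here and not a GitHub projection; the variable names are illustrative.

```python
# Figures as quoted in the conversation.
COMMITS_2025 = 1_000_000_000      # "one billion commits in twenty-five"
COMMITS_PER_WEEK = 275_000_000    # "two hundred and seventy-five million per week"

# Assumption (not from the source): the weekly rate holds for 52 weeks.
WEEKS_PER_YEAR = 52

annual_pace = COMMITS_PER_WEEK * WEEKS_PER_YEAR
growth_multiple = annual_pace / COMMITS_2025

print(f"annualized pace: {annual_pace:,} commits")
print(f"multiple of the 2025 total: {growth_multiple:.1f}x")
```

Under that flat-rate assumption the pace works out to a little over fourteen billion commits, which matches the "fourteen billion" and "fourteen x" figures used in the discussion. If the weekly rate keeps rising, as Kyle says it is, the real total would be higher.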
It's been a
Kyle [00:50:48]: It's speeding
Swyx [00:50:50]: Roughly.
Kyle [00:50:50]: It's still speeding up.
Swyx [00:50:51]: It's April, so yeah.
Kyle [00:50:51]: Exactly. This was in April.
Swyx [00:50:53]: All right. So basically you have fourteen x growth, right? Year on year. And I think that's a scaling issue. I'm going to try to really steel man this thing. People have experienced fourteen x growth. They haven't had your downtime. Can we go dig into that? Why? What broke? What are we doing to fix it? Just anything for the community to reassure them.

Why GitHub Reliability Is Breaking in New Ways

Kyle [00:51:18]: So, like I was saying, there's a couple different places that we've seen the growth issues. Some of the growth issues, which is why I was talking about pushing hard on more CPUs, are in Actions in particular. More tools, more agents, more PRs mean more builds; more builds mean more CPUs. And so we are expanding through not just our data center, but obviously we were talking about moving to Azure and adding additional cloud compute because we simply need more CPUs. Not as many GPUs. We definitely need GPUs too, but now CPUs are becoming a factor.
Swyx [00:51:53]: It's very CPU heavy.
Kyle [00:51:54]: Underneath the hood, when it comes to some of the underlying services, we've been breaking up our database infrastructure over the years, so that way we have more cognitive separation between the various services. The place that we continue to have pain is in permissioning. And so right now many of our permissioning layers sit in a database that we internally call MySQL One, and old Hubbers will know what I'm talking about.
And so we've been pulling things out of MySQL One for many years, and we use Vitess and other technologies to shard, and we do it as one big
Swyx [00:52:31]: Famous thing, PlanetScale was born from this and
Kyle [00:52:32]: A hundred percent. Sam, old Hubber and friend. And so finding these opportunities to break this out and then do that globally. The other thing that I think is interesting, and both a unique opportunity and tricky, is we also run everything I just talked about in a black box container with GitHub Enterprise Server for people that work on-prem. So we take everything I just said, and we also do it on-prem, and we also do all of that in a data residency setup for customers that need to have their data in a single location. Each of these has a unique characteristic around how we're storing that data in MySQL or in a permissioning setup. That's where some of these outages have occurred, where you're seeing it more across the board rather than just the one piece
Swyx [00:53:17]: Filling the database
Kyle [00:53:17]: isn't quite working. Exactly. And so part of it is that. I think there have been some other places where... more projects appear to be moving towards monorepos, whereas we were going the other direction for many years in the industry. Repos were smaller, but there were more of them, and now we're seeing the opposite. Repos are bigger, and there's not fewer of them per se, 'cause there's new growth, but we're just seeing many more big repos. Big monorepos have always had a unique performance problem, because each one is slightly different, particularly if the underlying blobs inside the repos are incredibly big. And so we've done a ton of work that most people probably haven't experienced, unless you're in this case of the monorepo.
But that Git infrastructure layer improvement does help the overall system, because many of the improvements that make monorepos work better make all repo infrastructure work better. And so I could keep going down the line, where it's another thing where we're changing how we do, I'll just say job queuing for lack of a better explanation, changing the underlying technologies there.
Swyx [00:54:32]: I spent two years being a job queuing guy, so.
Kyle [00:54:34]: And so it's a little bit of piece by piece, and it's mostly because as it was built, we built everything in a way that assumed, I guess in some ways, that the size of the pipe of work was going to remain the same. There's just going to be more people coming through each of those pipes. But instead, now, in places where a git push was generally a certain size, for example, that is no longer true.
Swyx [00:55:03]: Oh, yeah.
Kyle [00:55:03]: Or
Swyx [00:55:05]: I push a thousand
Kyle [00:55:06]: On the average. 100%.
Swyx [00:55:06]: A thousand-line commits, like, daily.
Kyle [00:55:07]: Same thing with PRs. And we've talked about optimizing that and making changes, and there were technology choices that did not work there. It got slow. It was not fast. It did not do what the users wanted. And so we've been reeling that all out and going, “Okay, that's just not right. Let's stop putting good money after bad and do it the right way now.” So it's a lot of things. When I've experienced scale at GitHub historically, it's almost always two options that we've used. We go vertical scaling, particularly with databases, right? And we go horizontal scaling. Oh, we just have more people using this service. Great. We're going to add more servers, and we rack them in our data center, or we use it in a cloud.
And now we're sort of in a diagonal, where vertical doesn't really work anymore. Horizontal isn't working either, because we all have some CPU or GPU constraints in the world now, and now we have to go in and crack open services that have been running for 10 or 15 years and go, “Okay, the rules of this service have legitimately changed, and now we have to rewrite them.” None of this is an excuse. We have to do the work. We have to make it better.
Swyx [00:56:22]: Actually, as an infra guy, I'm like, “This is one of the most fascinating scaling challenges I've ever seen.”
Kyle [00:56:26]: That's the thing that's hard. When we weren't talking about it publicly, and I came out, and I was like, “Hey, I just want to explain what's going on.” Part of it comes from a very old GitHub ethos, which is: it's our uptime. It's down. I know you're a developer, so you're inclined to want to understand more about what's going on. But at the same time, us going “Hey, this service didn't perform the way we expected, and now we have to go change it,” we're not trying to hide anything from you i
The new AIEWF website is live! CFPs close in 2 days and we will run our first New Engineer Orientation this weekend. Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

One of the central tensions in the agents industry is that even while major decacorn agent labs like Sierra, Decagon, Notion and Cursor are being built up, it has never been easier to DIY agents, with a plethora of agent frameworks like LangGraph and Pydantic and Flue, and managed agents from Anthropic and Gemini and Amazon. There has been a wave of companies building their own background agents, from Shopify to Stripe to Paradigm to Razorpay, and even Cognition's friends Ramp have built their own coding agent with other friend Modal.

You'd think Cognition might feel a bit threatened, but they're not. Even after all this, they were way oversubscribed for the $1B Series D they just announced:

Walden Yan, coiner of context engineering and Chief Product Officer/Cofounder of Cognition, invited OpenInspect's Cole Murray to talk about why the Devin is in the Details.

Full conversation live on the pod today:

In retrospect, async agents were the most AGI pilled bet you could make in 2024: the models weren't good enough yet to vibecode, people didn't trust AI enough to let it rip, and nobody (including early Cognition) was sure about the form factors. Now it is obvious:

* The first wave of AI coding tools made the developer faster but remained heavily in the loop. Copilot and Cursor's tab autocomplete are prime examples. However, the workflow was still heavily centered around and bottlenecked by the developer's local workflow: a developer in an IDE, watching the model, accepting or rejecting changes, and pushing code one interaction at a time.
* The second wave was local agents: Claude Code, Windsurf, Cursor's agents pane: first one and increasingly many terminals all running concurrently.
* The current Age of Async Agents points to a different future focused more on agent orchestration which drives end-to-end development.

According to previous guest Steve Yegge, there are 8 finer-grained levels to agent adoption, but we have collapsed them into three.

As Cursor's Michael Truell put it in The third era of AI software development:

Cursor is no longer primarily about writing code. It is about helping developers build the factory that creates their software. This factory is made up of fleets of agents that they interact with as teammates: providing initial direction, equipping them with the tools to work independently, and reviewing their work.

The agent should not sit solely inside the developer's flow. It should be set up to work in the background so that you can give it a task, a repo, a machine, a shell, a browser, tests, memory, and review loops to go do the work somewhere else.

In less than a year, the sentiment has shifted from avoiding multi-agent systems:

to suggesting approaches that actually work:

From coining “context engineering” to building the infrastructure behind Devin's 7x PR growth and jump from 16% to 80% of commits across Cognition repos, Walden Yan has had a front-row seat to the background-agent shift.
In this episode, Cognition co-founder and CPO Walden Yan joins swyx alongside Cole Murray, creator of OpenInspect, to unpack why everyone is building their own Devin, what changed after the December 2025 model inflection, and why “spec to pull request” is now becoming a real production workflow.

We go deep on the architecture of background agents: harness-in-the-box vs out-of-the-box, why Devin separates the “brain” from the machine, why repo setup is still one of the hardest problems, why Docker is not always enough, and how full VMs, snapshots, scoped secrets, GitHub bots, Slack integrations, and video-based testing all fit together. Walden and Cole also dig into memory, MCP limitations, multi-agent orchestration, AI code review, SRE auto-triage, PMs shipping code from Slack, Windsurf 2.0, hybrid frontier/sub-frontier systems, and the real failure mode of uncontrolled vibe coding: your codebase regressing to your worst engineer.

And as agents eat software… and software eats the world… you can draw the conclusion on what is next:

We discuss:

* Why the engineering world is waking up to background agents and cloud agents
* The December 2025 model inflection that made spec-to-PR workflows practical
* Devin's 7x merged PR growth and rise from 16% to 80% of commits
* Why Cole built OpenInspect as an open-source background-agent system
* The economics of $20/seat agent products and why monetization is tricky
* What Cognition actually sells beyond Devin: infra, onboarding, integrations, and adoption
* Harness in the box vs out of the box, and why architecture matters
* Why Devin separates the brain from the machine for security and permissions
* Repo setup, scoped secrets, Docker Compose, and agent-ready dev environments
* Why full VMs matter when agents need to run real applications and test them
* Android, macOS, Windows, nested virtualization, and machine-specific agent work
* Why testing is much harder than “computer use”
* Screenshots, video verification, and the “I know it works” merge moment
* GitHub UX, Devin Review, AI reviewers, and agents responding to PR comments
* Why MCP alone is not enough for first-class Slack and enterprise integrations
* Memory, Knowledge, skills, Claude.md, and why retrieval is still unsolved
* Devin's auto-generated memories and the challenge of memory pruning
* Always-on agents as permanent PMs for issues, tickets, and product areas
* Sub-agents, meta-Devin management, and what multi-agent systems actually add
* Why pure auto-merge vibe coding breaks down after about two weeks
* AI code smells, lint rules, reward hacking, and Semgrep for agent-written code
* GitAI, inline context, and preserving the “why” behind code changes
* Local testing, mock servers, older codebases, and preparing companies for agents
* Windsurf 2.0 and the handoff between local foreground agents and cloud background agents
* SRE auto-triage, support workflows, and agents as first responders
* PMs, marketing, and non-engineers creating pull requests from Slack
* AI agent budgets, $1k-$5k per engineer spend, and hybrid frontier/sub-frontier systems
* The rise of autonomous coding factories and who Cognition is hiring

Walden Yan
* X: https://x.com/walden_yan
* LinkedIn: https://www.linkedin.com/in/waldenyan/

Cole Murray
* X: https://x.com/_colemurray
* LinkedIn: https://www.linkedin.com/in/colemurray/
* OpenInspect / Background Agents: https://github.com/ColeMurray/background-agents

Timestamps
00:00:00 Introduction
00:00:43 Why Everyone Is Building Their Own Devin
00:01:57 Devin's 2025 Ramp: 7x PR Growth and 80% of Commits
00:03:49 OpenInspect and the Rise of Open-Source Background Agents
00:07:59 What Cognition Actually Sells Beyond Devin
00:09:56 Background Agent Architecture: Harness In vs Out of the Box
00:12:08 Separating the Brain from the Machine
00:14:07 Repo Setup, Secrets, Docker, and Full VMs
00:19:13 Why Testing Is Harder Than Computer Use
00:22:40 Video Verification and the “I Know It Works” Merge Moment
00:23:19 GitHub UX, Devin Review, and AI Code Review
00:25:42 MCP, Slack, and Enterprise Agent Integrations
00:28:59 Memory, Knowledge, and Always-On Agents
00:36:16 Sub-Agents, Multi-Agent Orchestration, and Meta-Devin
00:43:55 Vibe Coding, Auto-Merge, and Codebase Decay
00:48:38 Agent Infra, VPCs, Cloud Providers, and Fast VM Restore
00:52:25 AI Code Smells, Reward Hacking, and Code Review Systems
00:56:10 Making Codebases Agent-Ready
00:58:30 Windsurf 2.0 and the Local-to-Cloud Agent Handoff
01:01:15 SRE Auto-Triage, PMs Shipping Code, and Agent Use Cases
01:04:32 Agent Budgets, Hybrid Models, and Autonomous Coding Factories
01:06:51 Hiring at Cognition and OpenInspect Consulting
01:07:45 Outro

Transcript

Introduction: Walden Yan, Cole Murray, and Context Engineering

Swyx [00:00:00]: All right, we're in the studio with Walden Yan, co-founder of Cognition, CPO.
Walden [00:00:08]: Happy to be here.
Swyx [00:00:09]: Which is a cool title. And coiner of context engineering.
Walden [00:00:15]: Although I think there are many people who'd used the terms in various ways beforehand, but I did find that people, both internally and externally, enjoyed the upgrade from prompt engineering or model wrapping into maybe a more thoughtful way to build agents.
Swyx [00:00:33]: For those who haven't caught up on that, I have on screen the Don't Build Multi-Agents post, which you should go read and we might refer to, and Cole Murray, who created OpenInspect.
Cole [00:00:43]: Great to be here.
Swyx [00:00:43]: So let's talk about it. Everyone is building their own Devins. What's going on?

The December Shift: From Handholding Models to Autonomous PRs

Cole [00:00:51]: So I think the engineering world is waking up to this idea of background agents, cloud agents, whatever you'd like to call it. And I think we saw a shift around the December timeframe of 2025, where the models, Opus 4.5 and GPT 5.2, reached a capability where we moved away from handholding the model to being able to more or less autonomously drive the model.
And what I mean by that is that we could pretty much go from a specification to a completed pull request, assuming the spec was good enough, with very little friction. And that paradigm alone, I think, changed a lot of how we interact with agents, and opened this world where background agents became more practical.
Swyx [00:01:41]: I think for Cole, everyone experienced this in December, but I feel like there was just this increasing ramp, right? There was this moment which was, I think, Sonnet 3.7, where you guys rewrote Devin in one night or something. So describe 2025 or how it felt from your side.
Walden [00:02:01]: In retrospect, we always thought it was ramping up, but then even now, over the last three, four months from today, it's been ramping up even faster. So it's almost funny to be talking about how big of a leap Sonnet 3.7 was, and honestly, a lot of it was stripping out parts of Devin that were no longer needed with that jump in intelligence. But I also just think that with a lot of the recent leaps, especially if you look at models like Opus and the latest GPT models, they are reaching levels of autonomy where people are finding that they actually can just be hands-off. And people who were once debating, “Oh, do I need to be in the weeds with my model in the IDE? Can I just completely move it off into the cloud?” That's a more serious conversation, and we've seen that in all of our growth charts. Internally there's this funny graph where our usage, of PRs, our merged PRs, has grown 7X since I forget what it was called.
Swyx [00:02:57]: I think Dev maybe tweeted that. Yes.
Walden [00:03:01]: It grew like 7X over the last, I think it was, two months, three months, something like that. And then you see our engineering headcount growth. It's gone up by 10% or something.
Swyx [00:03:11]: We were afraid to release this.
So this is Devin commit percentages on all Devin repos: it was 16% in January and now 80% in March.
Walden [00:03:25]: It's a big shift right now. And so it makes sense that a lot of people are now thinking about buying Devin, but also maybe trying to build their own. I have a lot of fun building Devin, so I can see why other people would want to build their own cloud agents as well. Cole, well, maybe it's good to hear what initially inspired you to try to build OpenInspect?

OpenInspect: Ramp, Cloud Agents, and Open Source

Cole [00:03:49]: OpenInspect came about primarily through observing how my clients were using tools like Claude and OpenAI's Codex at the time, and seeing some of the friction that they were having with it. Primarily, Claude was being used through Slack, and a big issue they ran into was that the sessions that were launched were specific to whoever called it via Slack. And so if a PM was the one who invoked the session and they would then go to pass context to engineering, engineering can't see the session. And that in itself was a deal breaker because the PM says, “Hey, engineering, can you jump in?” But there's nothing to jump in on unless they're copy-pasting out the single response that came back. And so seeing some of these problems, I had built a similar architecture internally, just to experiment with, test out different ideas as this trend of moving off of localhost was starting to become... And as Ramp released their blog post, I had a lot of the pieces for this already in place, and just thought it would be funny to see what Claude could do purely from the blog post. And on my X account, there's actually a thread where I live tweeted going through this
Cole [00:05:14]: comparing GPT and Claude as both of them are going through it.
Swyx [00:05:17]: On the announcement thing or something else?
Cole [00:05:19]: Right after it got released. We can put it in the show notes.
Yeah, it was helpful that I already knew how to verify the system. I knew what I was looking for. I think Ramp did a great job of really illustrating the technical aspects of how to build something. It was much more than just, “Hey, we built a great system.” It was, “And here's how you can build it too.” And so I resonated a lot with that, just with the problems that I was already seeing, and I thought that, looking around, I didn't really see anything in the open source community that met this type of system. I think there's a lot that run in localhost, like Superset, Conductor, and many others. But nothing that was actually running in the cloud. And so I built it, and I thought it was interesting to just open source it and allow anyone to then have a foundation that they can mix and match on top of.

The Business of Background Agents: Open Source vs. Devin

Swyx [00:06:16]: So literally after Devin was launched, there was OpenDevin, which became All Hands. I don't know if you tried that or
Walden [00:06:22]: I was going to say, one of the things that interested me a lot with OpenInspect was, you didn't try to go make it then something you monetize. There are a lot of, I think, these open source projects that would then go and really try to raise V
Swyx [00:06:36]: That's why no OpenDevin. Yeah.
Walden [00:06:38]: Yeah, and how did you think about that? I thought that was very interesting.
Cole [00:06:44]: I thought, just from what I had seen across my clients, that having a background agent system is going to become critical infrastructure within their company. And so because of that, I wanted to open source it so that they could fork it and put in whatever customization they wanted. To that question though, I get asked all the time, “Oh, are you going to raise? Are you going to turn this into a service?”
Walden [00:07:08]: I'm sure you've gotten offers.
Cole [00:07:09]: But primarily I don't want to do that for a few reasons.
One, I think that I don't want to compete for $20 a seat. I think that is just a really difficult business. I think it's very easy to copy the main pieces of it. Again, I built this fairly quickly. And I think because you are not owning, I guess, the entire stack, it's hard to monetize. You have money being made at the sandbox layer with Daytona, E2b, many other players. You have money being made at the model layer. And you sit in this weird in-between gray area where, what are you actually selling? You're selling, I guess, the infrastructure. You're selling the integrations, maybe.
Swyx [00:07:55]: Let's ask the guy. What are you selling?
Walden [00:07:59]: Well, yeah, there's multiple layers to this in practice, and actually it's funny you mentioned the infrastructure, 'cause when we got started building Devin, we had to go figure out how to make the infrastructure as well because
Swyx [00:08:10]: You had to build this two years before everyone else.
Swyx [00:08:15]: Including the model side.
Walden [00:08:17]: It was not very polished at the start. When we just built it off of raw VMs from cloud providers like EC2, the boot-up time was so slow, I think. And especially then, turning off the machines, saving them, and then being able to bring them back up again when you want Devin to wake up again later. It would just be out cold for like 10 minutes because that's just how long these systems took. They were not built for this repeated down-and-up usage. And so we actually had to go do all of that. And as a result now, one thing we offer when we go and sell Devin to people is, you don't have to worry about all the compute side of things. We'll make it work. We'll make it work in your cloud if you want it to.
But aside from the product, and I want to go into the agents and the tuning of the intelligence part later, I think a big part of what we do at Cognition as well is to just make sure that your company learns and uses and adopts these coding agents. 'Cause I think especially for the largest enterprises in the world, you find that there are a lot of people who want to move over to using AI for their day-to-day workloads. But because of the way projects are planned, because not everyone is literate in using AI in these ways, having a team of engineers who can actually go in and onboard you, set up all the integrations you need, the automations you need to really get to that level of leverage with AI, is super helpful. And so we do that. We act as thought partners to the customers that we work with as well.
Swyx [00:09:56]: So let's talk about architectural stuff. I think that is something that was the topic of conversation between the two of you. Is this the mental model that you want to start with or something else? I'll just leave the floor open to you guys.

Agent Architecture: Harness in the Box vs. Out of the Box

Cole [00:10:11]: I think maybe we can start here with just a general "what are the pieces of a background agent system?" And then maybe we can go into some of the nuances of decisions that you can make.
Swyx [00:10:22]: But I guess maybe what Walden is saying is the agent is in this open code box, I guess. Right? This is infra, and then that's the agent. And you had this discussion about whether you put the agent in here or externally. Can you tease that out?
Cole [00:10:39]: In a background agent system, you have a decision to make about where the agent is actually going to run. This is typically described as the harness in the box or out of the box. With running the agent in the box, you're making some trade-offs by doing that. The negative trade-off you're making is primarily security.
Because the agent is running in that box, unless you otherwise design it, all of your secrets need to go into that box as well. And given the nature of AI, it can be unpredictable, and you could very easily end up accidentally exfilling your secrets, or other unintended behavior. Now, the out of the box is the idea that we are going to have the actual agent running not directly in the sandbox, and we will have, quote-unquote, the brain of the agent running in some type of worker, control plane. That sandbox then is going to serve as the hands, where the brain is basically operating and making tool calls into that environment to manipulate it. I guess the other trade-off that you're making between the two systems is that, in my opinion, running it out of the box is much more complex because you have state that has to be managed, whereas if you're running it in the box, all of the state of that agent is actually in the box, and yes, you could persist it elsewhere, but it's all localized and you have fewer concerns to worry about.
Walden [00:12:08]: I think a lot of what you mentioned is why we actually, from the start, built Devin to do what we called separating the brain from the machine. The other thing that this allows you to do is reuse any existing infrastructure you have for dev boxes, perhaps. And so you don't have to worry as much about making a new type of dev box that has all the dependencies the brain needs and, as you mentioned, the secrets the brain needs as well. One thing that we've seen some customers run into is, you have a GitHub app and you want Devin, your agent, whatever, to be able to interact with GitHub through this application, but then you have different users with different actual permissions. If they are all interacting through the same GitHub app and there's no actual separation between the system that decides what it does and the actual secrets on the machine, then you run into an issue where, okay, it's hard to do the separation.
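Editor's sketch: the "out of the box" split described above, with a brain in a control plane and a sandbox acting as the hands, can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not Devin's or OpenInspect's actual implementation. The class names `Brain` and `Sandbox`, the tool names `write_file` and `read_secret`, and the secret keys are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The 'hands': an isolated machine that holds only narrowly scoped secrets."""

    scoped_secrets: dict[str, str] = field(default_factory=dict)
    files: dict[str, str] = field(default_factory=dict)

    def run_tool(self, name: str, **args: str) -> str:
        # Hypothetical tool surface; a real sandbox would run shell commands, edits, etc.
        if name == "write_file":
            self.files[args["path"]] = args["content"]
            return f"wrote {args['path']}"
        if name == "read_secret":
            # Only secrets that were placed on the machine are reachable from inside it.
            return self.scoped_secrets.get(args["key"], "<not available>")
        raise ValueError(f"unknown tool: {name}")


@dataclass
class Brain:
    """The 'brain': runs in a control plane, outside the sandbox."""

    model_api_key: str  # stays with the brain; never copied into the sandbox
    history: list[str] = field(default_factory=list)

    def step(self, sandbox: Sandbox, tool: str, **args: str) -> str:
        # The brain decides on a tool call; the sandbox only executes it.
        result = sandbox.run_tool(tool, **args)
        # Agent state is kept with the brain, which is the extra complexity
        # of the out-of-the-box design mentioned in the conversation.
        self.history.append(f"{tool} -> {result}")
        return result


sandbox = Sandbox(scoped_secrets={"GITHUB_TOKEN": "repo-scoped-token"})
brain = Brain(model_api_key="control-plane-only")
brain.step(sandbox, "write_file", path="README.md", content="hello")
```

The design point the sketch tries to capture: code running inside the sandbox can read the repo-scoped token but has no path to `model_api_key`, and the conversation history lives outside the machine, so it has to be managed separately.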
But in practice, with Devin, it's much easier because we just say that whatever you put on the machine is the scope of what the user is free to do and what the agent is free to do. So only put the most scoped secrets on that machine, and then the brain is fully inaccessible from the machine. So you don't have to worry about anyone messing with the most secure parts of the brain if the user is free to do whatever they want with the machine.

Swyx [00:13:31]: I was going to just bring up, I have this chart from OpenAI, where I don't know if this is in the box or out of the box. That is something that they do use to describe it. And then also recently Anthropic did managed agents...

Swyx [00:13:44]: Which is their thing. I don't know. It's all variations of the same pattern, right?

Cole [00:13:49]: So this would be out of the box.

Swyx [00:13:51]: Which is preferable for them because it's less work?

Cole [00:13:56]: I would say it's more work.

Swyx [00:13:58]: It's more work?

Cole [00:13:58]: But in my opinion, it is the better architecture of the two. It's just that you're taking on a bit of complexity by doing that.

Repo Setup, Docker, and VM-Based Development Environments

Walden [00:14:07]: One thing I've not seen a lot of other players do well is managing what's actually on the box. And this can be complex for many reasons. Let's say you have a big repository that's changing and updating a lot, with changing dependencies. How do you make sure that the working environment of the agent actually stays up to date, has all the credentials it needs to, let's say, run the app and test it, and all the things you want your autonomous...

Swyx [00:14:34]: So a repo setup.

Walden [00:14:35]: Exactly. So internally at Cognition, we call this repo setup.

Cole [00:14:39]: The hardest part of...

Walden [00:14:40]: It's been a perennial problem since the start of the company: how do we help people get this set up?
Because not everyone just has cloud environments working out of the box. And do you find this to be a common problem with...

Swyx [00:14:53]: How do you solve it?

Walden [00:14:53]: Your clients?

Cole [00:14:54]: This is a very common problem, and through my consulting, this is a lot of what I help teams do. A lot of teams don't really have great developer environment setups, if any. A lot of the time it's, “Go talk to Bob and get the secrets,” and that obviously doesn't work when the agent needs to actually set this up. And so most teams are using Docker Compose or some type of microservices. And so for the...

Swyx [00:15:19]: Even in prod?

Cole [00:15:20]: Not in prod. With OpenInspect, you are using this primarily to interact and make code changes. There are other use cases, but you can hook it, whether through CLIs, MCPs, or other tools, into your production systems, primarily for SRE-type use cases. But you are not necessarily trying to test your prod internal microservice through the system.

Walden [00:15:48]: And you mentioned Docker Compose. I think one direction we saw some of our friends take early on was using Docker containers as the level of abstraction for their models. There are lots of reasons, I think, why Docker containers are not great. One is that a Docker container is not really a true security boundary. But the other is, if you are running real applications, a lot of the time those applications use Docker, and then you have to think about Docker in Docker, which is really weird. And so why did we take on the really hard challenge of getting VMs to work? Well, it was because we realized that you actually needed full VMs to be able to do these types of things. And especially nowadays, where there's actually value in running the application, clicking around, and sending you screen recordings of these things, the value just keeps adding on top of that.
But it is a decision I see people run into when they try to build their own systems: “Oh, in addition to this, do we put the agent in the machine or out of the machine? Do we use Docker? Do we use something else?” What do you recommend to people nowadays?

Cole [00:16:57]: I think Docker is a good solution for maybe not running the agent, but running your infrastructure, because that is more or less the same setup your engineers are probably already using. If they're not, then I don't know what they're using. But they're probably already using Docker Compose.

Swyx [00:17:14]: I've always held a small candle for web containers. I don't know if you guys have tried them before.

Swyx [00:17:19]: To me, they were supposed to be like Docker Lite.

Cole [00:17:22]: Is it?

Swyx [00:17:22]: I don't know.

Cole [00:17:22]: No, I haven't tried it. But yeah, I think any environment you've set up that is a good experience for your developers naturally lends itself to being easy to set up for the agent. And once you figure out that local developer story, you've more or less solved the agent-in-a-sandbox environment setup. OpenInspect does have hooks as well, where you can run a setup shell script that will pre-install everything. You can then pre-snapshot that build so it starts instantly, and then there is a second hook to actually restore the state of the sandbox when it comes back. And so you can already have all of those microservices running and basically get the same experience within the sandbox that you would on your machine.

Testing Agents: Computer Use, Screenshots, and Real App Workflows

Walden [00:18:08]: Another thing that we've been thinking a lot about is different VM service offerings. Have you had customers where they needed macOS-specific VMs or Windows-specific...

Walden [00:18:20]: VMs?

Walden [00:18:22]: There are many technologies in the world that only work on specific types of machines, right?
If you're building a .NET application that has to run on Windows or, maybe more commonly, if you want to build for iOS or macOS... Does that work...

Swyx [00:18:32]: Does Cognition support...

Swyx [00:18:33]: Choices like that?

Walden [00:18:35]: The fundamental architecture does support it, because we do the separation, but the actual work on these is in progress right now. Another thing that we've recently added support for, and it's in beta, is Android development. To do that, we needed to support, I think, nested virtualization within our machines, because the VM itself is a virtualized Firecracker instance, and then you have to run another Android emulator inside it. And there are weird performance issues, which is why it's still in beta. We have to think through these problems, but it unlocks a lot for anyone who wants to do Android development.

Swyx [00:19:13]: I was trying to find a reference video for the testing thing. I couldn't find it, but I think you worked on the testing capability. Why call it testing and not computer use or, I don't know, what's the general category of problem?

Walden [00:19:26]: I think that when people think about the ability of an AI to run your app and test it, they actually over-index on the computer use part of it, because computer use in my mind is the literal part: okay, you know what button you want to click. Can you emit the right coordinates to go click that button? I think testing is actually a really interesting...

Walden [00:19:48]: Problem-solving challenge for these AIs, because if you want to do arbitrary testing, imagine you make a change that spans the frontend and the backend, maybe even some other, more deeply nested service. To actually test that change, we have to reason through this: how do you first run these applications so they orchestrate with each other with the right version of the code?
Then, okay, how do I trigger the feature, or how do I make the thing actually happen? And this can get arbitrarily hard. Maybe you have to be an admin. Maybe a certain thing has to be feature flagged on. Maybe you have to run two sessions and then send a very specific word into one of them to trigger a specific behavior. And figuring out how to do that requires a lot of code base context, and requires a lot of orchestration that we've specifically done. And in some cases, we found that no single frontier model can actually do this full end-to-end task itself.

Walden [00:20:42]: We've seen cases where we actually had to orchestrate different frontier models to solve this problem together. That is where we spend most of our time when we think about this testing problem, not so much the computer use part. Computer use, for what it's worth, has gotten a lot better with recent models, and that's made that part of the job certainly easier.

Swyx [00:20:58]: Especially with even 4.7, which they released yesterday, apparently way better in terms of the vision stuff, which is going to encompass computer use.

Walden [00:21:08]: Having evals for all of these as well is something that takes a while to build up. And having the evals be right is tricky as well. Do you ever see clients who are building their own agents have to start standing up evals to make sure things don't regress?

Swyx [00:21:25]: Not so much evals in the traditional sense, but specific to the testing part that has just gone in. I just added support for screenshots, and in theory you can also do video. I need to put in a plugin to do that. But they do show up natively, and it was a very heavily requested feature, especially after Cursor's recording came out. I think that was very enlightening for everyone: “Oh, this is a very good feature to actually have.” I think with Devin you guys have had this for a while.

Swyx [00:21:57]: Oh, yeah.
See how screenshots work. Yeah, I don't know if there's anything super non-obvious. Once you know what feature to build, you can just prompt it and it will mostly work.

Walden [00:22:09]: I think to Walden's point, though, computer use is a subset of the larger testing problem, and I think that's very specific to the code base that you're working in. It's not something that you could just solve out of the box. You do need the code base context to actually know how to test it. And I think in the case of a background agent system, you fortunately do have that code base locally, so you can see what is changing, inspect it, and use that to drive the model.

Swyx [00:22:40]: For those who haven't seen it before, this is an example of how it works. After the PR is done, you click testing approved, and then it sends you back a video. What I really like is that it labels what it's testing. It's very small here, but it actually labels it. And then you actually see the cursor and everything. So I don't know, yeah, the engineering in this, just whatever you want to show. 'Cause this is one of those “feel the AGI” moments, right? 'Cause once I look at this, I actually wish I could just merge inside of Slack instead of going to GitHub, 'cause I don't need to see the code. I know it works.

Walden [00:23:19]: Maybe a new feature in Cursor. Yeah, the annotations at the bottom were also a big difference for me when I added those.

Swyx [00:23:27]: It's just like, what am I looking at? What are you trying to demonstrate?

Walden [00:23:30]: Exactly. There's a surprisingly long tail of small details that ends up making a big difference for this end metric of how fast you actually merge the code in. One experience that we spent a lot of time tuning early on was what the right experience is on GitHub for these tools.
Because I think with most tools out there, when you build the agent, you'll think, oh, it'll create the PR for you. We tried to take that a step further and say, “Oh, what if we actually made sure you could interact with Devin directly on GitHub?” And so we made sure that you can comment on GitHub, and Devin will actually receive those comments and address them. But there's actually quite a bit of tuning you have to do here. We recently added Devin Review, for example. Devin Review will post comments on his own PR, and then Devin has to then go...

GitHub Workflows: Devin Review, Comments, and PR Automation

Swyx [00:24:23]: He answers his own comments, which is really loopy. So yeah, I like that it just updates here that I have commented. But usually it's just me saying, “Hey, merged, fix any merge conflicts.”

Walden [00:24:37]: So when Devin fixes his own comments, you might be scared that, oh, maybe it'll infinite loop. But we've put a lot of work into making sure it doesn't, both by making sure that the comments are high signal, and by making sure the agent is thoughtful about which comments it immediately goes and tries to fix, and which comments it looks at and says, “Wait a second, I think you're wrong.” Actually, one of my favorite moments is when Devin tells me that I'm wrong when I try to get it to do something different. But tuning that behavior actually makes a big difference in terms of how useful the actual GitHub experience is.

Cole [00:25:06]: To touch on that as well, I think having the AI reviewer integrated into the system is a critical part of this background system. OpenInspect does have that. It has a GitHub code reviewer whose prompt you can control. It does do comments as well. It doesn't do them automatically yet. The capability is there, but it's not fully used.

Swyx [00:25:27]: So you have to ask for it?

Cole [00:25:28]: You do, yeah.
You can tag it on GitHub, and then whatever you named your GitHub bot, it will follow up on it. If you have merge conflicts or whatever else you have asked it to resolve, it will resolve it, but it doesn't do it automatically yet.

Integrations: Slack, MCP, and First-Party Agent Interfaces

Walden [00:25:42]: Well, I'm curious, what is the most common thing that people end up requesting, that they still need on top of OpenInspect, when you help them go implement it?

Cole [00:25:52]: I think a lot of it comes down to actually integrating it into the company. It's one thing to have the background agent system set up, but if it isn't actually integrated into your larger ecosystem, it isn't that useful. It is useful to be able to kick off sessions, but what we really want to be able to do is hook it into all of our other systems, whether that is the production database with read-only credentials, the logs, Confluence, or an internal knowledge base system. I think that is where I see the huge leap for companies, and that can be a challenge for companies who are maybe not familiar with exactly how to approach it, especially if they're in environments with more compliance-type requirements, where access control can be a pretty big deal. How you deliberately think about these problems is, I find, one of the challenges that comes with a system like this.

Walden [00:26:46]: The thing we found is this. With MCPs, obviously there has been this really big explosion of, oh, you can go integrate it with all these different things. But to actually get the integration right and get the right experience, oftentimes we found that we had to go build our own ad hoc things. I think Slack is a great example of this. You could give your agent a Slack MCP and, okay, it can post messages back to you on Slack. But we actually use Devin like a coworker in Slack, and that's how it's been built from the ground up.
But to do that, you actually need to support webhooks that come back, right? And then Devin has to respond in a natural way and hopefully not spam your threads too much and annoy the people in your company. So you've got to tune that experience just right. Especially when there's a lot of back and forth, we find that we actually have to go beyond the simple MCP integrations in these places.

Swyx [00:27:39]: I just pulled up the MCP marketplace. I know this is a fair amount of work. Is the answer to eventually take first-party control of all the top MCPs? Is that the...

Walden [00:27:48]: I would love a world where you could have something that's more expressive than MCP. Something that goes both ways, not just a set of tools, but a proper system that interacts back and lets it have the right experience with all these interfaces.

Swyx [00:28:03]: So there actually is sampling in the MCP spec, but nobody uses it, right?

Walden [00:28:07]: And so I think that's the other part. We actually found that when the MCP spec starts to get too complicated, it starts to lose its original promise of being a simple one-step connect. Then we have to go figure out how to support all these different variations of things, and it starts to look a lot like just building the first-party integrations in a lot of these cases.

Cole [00:28:29]: I think it matters, too, how critical it is to your company, right? If this is something that nearly every session is going through, it probably makes sense to own it so that you can make optimizations on top of it, versus just using whatever is off the shelf.

Swyx [00:28:43]: Awesome. Other than MCPs, what else? Sorry, I don't know if that's narrowing in too much on integrations. But what else?
What other elements of building OpenInspect or Devin do you guys really think about?

Memory and Knowledge: What Agents Should Remember

Cole [00:28:59]: I think a problem that comes up very frequently is this idea of memories or a knowledge base.

Swyx [00:29:05]: Oh, boy. How do you solve it?

Cole [00:29:08]: So, not solved yet, is the short answer.

Cole [00:29:11]: There's an open issue for it, someone asking about it.

Swyx [00:29:16]: DeepWiki hasn't indexed anything about memory yet.

Cole [00:29:20]: How I'm seeing it solved across my clients is primarily through skills. I find that skills can be a good stopgap for that, or updating CLAUDE.md, but I think memory as a whole is a pretty unsolved problem, and that is why I've been hesitant to add it. I think there are parts of memory that can be addressed, but as a whole it's a very difficult retrieval problem.

Swyx [00:29:44]: Oh my God. Ramp didn't write anything about memory? I see zero search results.

Walden [00:29:50]: No. Memory can be quite tricky to get right, because it's the retrieval, but also the generation of the memories, that can be really tricky. You don't want it to just remember very specific details.

Swyx [00:29:59]: Walk us through the Devin memory journey, because I know there's been a journey.

Walden [00:30:03]: The first version of memory that stuck around for a while was a system we have called Knowledge. And the idea was that we wanted it to pick up things over time and not need the user to be proactive about teaching Devin things. So any time you remind Devin, “Wait, no, that's not quite the way you're supposed to use Git,” we actually want Devin to say, “Hey, do you want me to actually just remember this for the future?” and for you to just quickly approve or reject, and for it to build up over time. 'Cause I find that 95%, I think, or some crazy stat like that, of the memories that Devin has all come through these auto-generated things.
Very few people actually just want to sit down and write big docs on “here's how you're supposed to work with the technology,” et cetera. The generation and the retrieval have been something that we've been trying to tune a lot over the years. On generation, if you asked one time, “Oh, please open this as a draft PR,” you don't want it to conclude, “Oh, everyone forever now should get their PRs as draft PRs.” But you do want some carryover. Maybe you want it to say, “Oh, Cole generally likes things to be created as draft PRs.” Same with retrieval: if you have thousands of these memories, how do you actually make sure they're retrieved at the right time? And that can be quite tricky to do right without exploding the context with a bunch of useless information. It's a surprising amount of eval work just to make sure that memory remains a reliable system as new models come and go.

Cole [00:31:31]: Do you have anything that you could share on memory pruning? And the temporal aspect of memory?

Swyx [00:31:36]: Deleting and forgetting?

Walden [00:31:39]: Today, the thing it can do is edit memories. And so if your memory used to say, “Oh, Cole likes to open everything as a draft PR,” then you can imagine saying, “No, don't do that.” And then it'll say, “Oh, do you want me to update the memory to be 'Cole now wants everything as open PRs'?” At the same time, we don't know if this is going to be the final version of the system. Whatever we have here will probably translate into the new system that we'll be coming up with. But I think one big difference between two years ago and today is that these agents are really good at using anything that resembles a file system natively. And so part of us is thinking, “Oh, should we rebuild memories to feel more like a file system that we let the agent navigate on its own?” That's been an interesting exploration.
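As a rough illustration of the approve-then-remember flow and the file-system-style memory described here, the following is a minimal Python sketch. The `MemoryStore` class, the one-file-per-topic layout, and the keyword-match retrieval are all assumptions made for illustration; nothing here reflects how Devin's Knowledge system is actually implemented.

```python
import tempfile
from pathlib import Path


class MemoryStore:
    """Hypothetical file-backed memory: one small text file per topic."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def propose(self, topic: str, note: str, approved: bool) -> bool:
        """Write (or overwrite) a memory only if the user approved it."""
        if not approved:
            return False
        (self.root / f"{topic}.md").write_text(note + "\n", encoding="utf-8")
        return True

    def search(self, keyword: str) -> list[str]:
        """Grep-style retrieval: return notes whose text mentions the keyword."""
        hits = []
        for path in sorted(self.root.glob("*.md")):
            text = path.read_text(encoding="utf-8").strip()
            if keyword.lower() in text.lower():
                hits.append(text)
        return hits


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        store = MemoryStore(Path(tmp) / "memories")
        store.propose("pr-style-cole", "Cole generally likes PRs opened as drafts.", approved=True)
        # A one-off request that the user declines to generalize is never written.
        store.propose("pr-style-everyone", "Everyone wants draft PRs.", approved=False)
        # Editing a memory is just overwriting the same file.
        store.propose("pr-style-cole", "Cole now wants PRs opened as ready for review.", approved=True)
        return store.search("PRs")
```

Because each memory is a plain file, an agent could in principle list, read, and edit them with the same file tools it already uses for code; the hard parts discussed above (deciding what is worth proposing, and retrieving the right note at the right time) are not addressed by this sketch.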
Also similar ideas in the skills space.

Swyx [00:32:35]: I am pulling up OpenClaude's memory thing right now. So OpenClaude has this daily memory journal thing, right? I mean, that is a file system you can grep through, and it is a source of truth. I don't know if it's the best. It's probably super noisy, but at least if you lose something you can discover it, or you can apply some forgetting algorithm to more ancient memories that don't get recalled again, or something. I don't know.

Walden [00:33:01]: One thing we've been trying to do to push the boundaries of how you use agents at your company is letting an agent have a very similar file, a memory.md or something, and just be your permanent PM for a specific set of issues, maybe. So we have some Slack channels internally, maybe a Slack channel dedicated to a specific product like DeepWiki. And you can imagine that you want a Devin that never stops, that's just always awake, but it has this memory doc that it can maintain for itself about, okay, what are the number one priorities of what we have to fix? Who is responsible for some upcoming work? Maybe Devin will even tag you on some recurring basis. And so it's been an interesting move to see, okay, how can we actually use Devin for more than just engineering? Can we actually move upstream of the engineering process? Maybe it's just Devin creating tickets, which then maybe some humans do, but then maybe other Devins do.

Swyx [00:34:00]: One of my more fun automations is “go research competitors and just suggest stuff to me on a weekly basis.” That's the automation. I can't find it right now, but basically it's just, “Look at competitors and suggest things,” and, “Here are three things that you've suggested that I don't want any more of,” and you just stick that in the prompts.
But I wish, for example, when I reject a PR, that it updated memory so that I don't have to go back and update the scheduled sync. But anyway, feature request.

Walden [00:34:31]: You know what? We might change it soon. I guess with OpenInspect, in the time you've been around, has there been anything you tried to implement but then had to undo and do a different way?

OpenInspect Architecture: Webhooks, Control Planes, and Agent State

Cole [00:34:41]: Nothing yet, but there's something that is on my mind. The initial way that I built it was that each of the integrations lives as its own package. And so you have the Slack bot, which is what's handling the webhooks and is basically interacting with the control plane. As I'm seeing the system start to be more integrated, specifically with the GitHub bot integration, I'm considering bringing that all into the central control plane, because a request that I'm getting is the ability to monitor the actual pull requests being merged, as well as just tracking of...

Swyx [00:35:19]: What do I have open?

Cole [00:35:21]: What do I have open? How many of these are getting merged? How many comments are showing up? Just to understand the health of the system. And in the case of a GitHub app, you only have one webhook. And so then it's a question of, do I put that webhook in that GitHub bot package? That's weird. It doesn't really make sense for it to live there, because that package is more for the code reviewer. Or do I centralize it? So that's a decision that's on my mind. I think the other one, which we touched on earlier, is the harness in the box versus out of the box. I think long term the architecture will eventually come back out of the box. Some of the newer tools that I've added are calling back into the control plane so that you don't have the secrets in the sandbox.
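To make the "tools call back into the control plane" idea concrete, here is a minimal Python sketch of the pattern. The `ControlPlane` and `Sandbox` classes, the `open_pull_request` method, and the `GITHUB_TOKEN` key are hypothetical names chosen for illustration, not OpenInspect's or Devin's actual interfaces, and the GitHub call is stubbed out.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Hypothetical 'brain' side: holds secrets and performs privileged calls."""

    secrets: dict[str, str]
    audit_log: list[str] = field(default_factory=list)

    def open_pull_request(self, session_id: str, title: str) -> str:
        token = self.secrets["GITHUB_TOKEN"]  # stays inside the control plane
        self.audit_log.append(f"{session_id}: open_pull_request({title!r})")
        # A real system would call the GitHub API with `token` here (stubbed).
        return f"PR opened: {title} (authenticated={bool(token)})"


@dataclass
class Sandbox:
    """Hypothetical 'hands' side: runs code and only knows its session id."""

    session_id: str
    env: dict[str, str] = field(default_factory=dict)

    def tool_open_pr(self, control_plane: ControlPlane, title: str) -> str:
        # The tool calls back into the control plane instead of reading a token.
        return control_plane.open_pull_request(self.session_id, title)


plane = ControlPlane(secrets={"GITHUB_TOKEN": "example-token"})
box = Sandbox(session_id="session-1")
result = box.tool_open_pr(plane, "Fix date helper")
```

In this arrangement the sandbox environment never contains the token, and every privileged action passes through one place where it can be scoped per user and logged. The cost, as discussed above, is that the control plane now has state and call routing to manage.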
And so I think long term I probably will pull the actual agent out of the box, but I think for now it's fine.

Subagents and Multi-Agent Systems: When Parallelism Helps or Hurts

Swyx [00:36:16]: Just a quick question on pulling the agent out of the box. One thing I'm very bullish on this year is agents calling other agents or spawning sub-agents, or whatever you want to call it. Does that make it harder or easier? I can't tell. Because if the harness is in the box, you can just spin up more boxes. If the harness is outside the box, then it's less easy, because you have a unicorn pet of a harness that's living outside the box.

Cole [00:36:45]: In theory it would be the same way, right? OpenInspect, for example, can launch sub-sessions and actually create other environments and then monitor them. In the case where it is out of the box, that would basically just be an additional session that's running. And so that session is also running outside of the box. It's running in your worker plane, wherever you're running this. And then you really just have to think about how your top-level agent interacts with it. I do think it can be more complex, just 'cause, again, you now have a more difficult architecture. But I think if you've figured it out once, it's probably fine.

Swyx [00:37:26]: Well, then I'm just throwing it open to you. I call this meta Devin management, which is Devins calling Devins, or Devins scheduling Devins, or querying trajectories, or anything like that. What have you built, or left unshipped? Anything?

Cole [00:37:46]: I think one of the surprising things we've seen is that a lot of the ways that these separate agents work with each other, when you want them to parallelize their work, have still mostly followed the same manager and sub-agents regime.
And a lot of people, I think, are excited about this world where you have swarms of agents that talk with each other all over the place. We've actually given Devin an MCP so they can just go arbitrarily message other Devins and create new Devins, et cetera. But I guess it somehow creates a really chaotic world in that sense. And so we've still found that most practical use on a day-to-day basis has been one single Devin.

Cole [00:38:33]: Figuring out how to segregate the work and have other Devins work on it in a relatively isolated sense, each with their own boxes, not sharing machines, so there's very little room for conflict, is the regime that you have to create today.

Swyx [00:38:50]: I'll call out the experiments from Cursor, right? This is Wilson Lin's work on single agent to multi-agent, and you're obviously famously on the side of “don't build multi-agent.” But they went through the whole thing, only to arrive at this, which is exactly what Devin has, I think.

Cole [00:39:08]: I think there will be a revision to that post at some point about...

Swyx [00:39:12]: Tell us about it.

Cole [00:39:12]: I think multi-agents were very much not at all possible a year ago. You do see more multi-agent experiments today, but you can argue, are they really multi-agents, or are they just tool calls? There are people who will create sub-agents to go look for XYZ file, XYZ implementation. That has really nice context management benefits, because all of the tool calls and tokens that it spends then get collapsed back to just the answer for the main agent. There are a lot of benefits to doing this. We basically have Devin do this with DeepWiki: make a call out to DeepWiki, give you back the results. But that feels like a tool call. It's not like these two collaborators are actually talking back and forth with each other.
But I think the thing that makes me most bullish that multi-agents might actually be possible is what I said earlier: Devin will actually sometimes tell me I'm wrong and push back, and I think that demonstrates a level of maturity and communication today that makes a multi-agent world possible. For one, can two agents who have seen different information come back to each other and actually figure out who is right, what the correct implementation is? They're not just yes-men. Claude, I guess, used to just say, what is it? “You're right,” or...

Swyx [00:40:25]: “You're absolutely right.”

Cole [00:40:26]: “You're absolutely right.” Yeah.

Swyx [00:40:28]: Have you seen, did you see...

Cole [00:40:29]: The age is over...

Swyx [00:40:30]: The Codex app trolling Anthropic? This is the Codex app. Inside of Settings, there's a little Easter egg, right? So if you go to Themes or Appearance, there are all these color codes, and the top one is “Absolutely,” and it's Anthropic's colors. Which is such a troll. Anyway.

Model Behavior: Pushback, Adversarial Prompts, and Agent Skepticism

Cole [00:40:53]: I love that Easter egg. Did you discover that yourself?

Swyx [00:40:54]: No, someone was tweeting about it and I was like, “Is this true?” Because sometimes people just tweet stuff to get a rise out of you. But yeah, there you go, in Anthropic colors.

Cole [00:41:06]: Yeah. So we're out of this regime where it just says you're absolutely right, and they can have real conversations and real back and forth.

Swyx [00:41:13]: You can prompt it as well to be more adversarial or whatever. Yeah. Okay. I mean, to me, that is more intelligence, right? That is not just a dumb tool; it's actually pushing back on you, I think. Yeah.

Cole [00:41:24]: When you mentioned the Cursor blog posts...
There was one blog post they had where they put a swarm of agents together and built a browser.

Swyx [00:41:34]: I think that was the one.

Cole [00:41:36]: You can have...

Swyx [00:41:37]: I think it's the same one.

Cole [00:41:37]: Creation of it. We found surprising success with this: don't do a swarm or anything, just have one Devin that does its own context management. Just let it keep running for a while and give it some crazy tasks. I think we asked it to rebuild a Windows OS system. And it managed to do it just by going on for long enough. It's...

Swyx [00:41:55]: Was this Andrew's thing?

Cole [00:41:58]: There were lots of demos that we ended up not posting, 'cause at some point we'd just be posting way too many demos. But I love that, because it shows that the multi-agent thing still has a bit of exciting sexiness to it, which is maybe still beyond the actual delta it adds to the capabilities of these systems. But it's absolutely the future. I think we're heading in that direction, and we can see the progress being made there already.

Swyx [00:42:25]: If I were to make one super minor pushback, because I don't feel that confident about it yet...

Cole [00:42:33]: Go for it.

Swyx [00:42:33]: I've had Ryan Lopopolo from OpenAI on the pod, and he's a super slop cannon, right? Oh my God, that's my coding agent being done. I downloaded this Peon Ping. I don't know if you guys have heard of this. It takes sound packs from popular games like Command and Conquer and Warcraft, and then it plays one whenever the agent is done. And so it's like, “Work,” or “At your command,” or something. Anyway, what I got from the Cursor code base and from Ryan's thing was that there's a slop cannon approach where you try to loosen the single agent's bottleneck, and I feel like that is probably a very important thing to try to figure out. I don't think anyone's really solved it.
Because then you just have more reviewer slop on top of the agent slop To try to wrangle it all. Ryan will probably very strongly object that I say that he hasn't solved it, but he thinks he's He thinks he's completely solved it. But I think it's still I think it's, very important, ‘cause, that is a bottleneck, right? I feel Devin is slow sometimes Because I'm like, well, yeah, this is very readable and very sensible, but also it is slower than it could be if I just, I want a button to just say, “Just ramp this up 1,000 next parallel, in parallel and just, see what happens,”? And I don't know if that's, feasible at some point in the future.Code Review, Entropy, and AI SlopWalden [00:43:55]: I And we've also run experiments internally where we've basically tried to build entire products, true products that we knew we would eventually ship, but for now, let's try to see if we can do it just by purely, vibe coding on top of each other, auto merge, no code review at all. And then there's this benchmark of how many weeks can you go onto this for Before you say, “We have the trashiest code base.”Walden [00:44:18]: “Let's actually rewrite it from scratch.”Swyx [00:44:19]: Start a new factory, yeah. What'd you find?Walden [00:44:21]: I think we found that the state-of-the-art in December was you can probably, run this for about two weeks. By the end of those two weeks, you'd find that, hey, you want to, change the color of a button. Well, it turns out this button is implemented in, 10 different places, and they, have All these different variations, and oh, you forgot one of them, and actually it's a slightly different color in one spot. And you're like, “Okay, this is too much to work with. 
Let's actually try to do code review at the same time.” And make sure that we're on top of our software, actually cleaning it up a bit And making sure it's done in a scalable way.Cole [00:44:54]: I think building on that, the idea of, you don't have to look at code, I think is generally a bad idea. And the meme that I have for thatWalden [00:45:03]: What timeline, all right, is Do you think that statement will be true on?Cole [00:45:06]: I think probably for a while it'll be true that you should continue to look at your code. A problem that I see a lot of teams run into that I work with who are embracing AI native, AI first coding, is The meme that I have is that your code base regresses to your worst engineer, because that engineer who is, very gung-ho about AI and is not auditing their code, their pattern starts cementing into the code, and now the AI is referencing their patterns. And so now their if/else block that, is 20 if/elses back and forth, the AI is seeing that as the pattern of how things are done and starts to then exponentially grow this slop. And I find to your point, a pretty good approach to that is having scheduled cleanup, whether by humans or through systems, that are looking for duplication. They then address that. You'll end up with like 12 helpers for how to format a date. And you need to address that, because otherwise it will continue to sprawl.Swyx [00:46:09]: Within balance, I think it's fine to have some duplication, and then sometimes To have garbage collection, right? Yeah. The What I've been, talking about with a lot of engineering leaders is that you want to be very strict about the boundaries between modules, and it's your job as an architect, as a CTO, whatever, to say like, “Okay, here's the hard contract between you guys and you guys. Whatever you do inside this black box is your business. You do whatever. But between these guys, let's be, really damn clear, and any movement must be signed off by a human or me,” or. 
Then, and like that's that. I don't know if you have any other modifications or advice.Walden [00:46:44]: Well, I guess generally on the topic of, where humans can be useful, I found that ‘cause, some of these, really deep infra problems, sometimes just having a human that just has, really deep expertise can make a big difference. I've actually seen this come into play when actually building agents. So we've had a few friends now, try building their own coding agents, and I think one same problem that I recurringly heard a lot of them run into was this problem of like, “Oh, Grep is really slow on our agents' machines.” And so a lot of them, I assume because they're using AI and they themselves don't have, super deep infra background knowledge, say, “Okay, we're going to go build our own custom Grep index. It's going to be really fast,” and use that as a way around this problem. When we ran into this problem About like, maybe like a year and a half ago when we were, in the early days of building Devin, we obviously didn't have AI then. We just asked our, how to, how to do this. You can just swap out a new Grep index, so.Infrastructure Details: Grep, File Systems, and SandboxesSwyx [00:47:45]: What do you mean you hand-coded Devin? What?Walden [00:47:48]: It's like, can you believe we hand-wrote this code? And we had, our infra people who are really amazing, they were looking into it and they're like, “Oh, what? We realized that actually the root cause of this problem is actually super simple, but like fine-grain detail,” which is that a lot of these virtual machines actually underlying them don't use real file systems. They use these, network file systems where things are actually cached over the network actually in S3. So when you're Grepping, you're actually making network calls Every time you're doing these things, and that's why Grep is extremely slow on these machines. 
And so again, goes back to, what is all of the crazy infra work that we had to do to actually get these machines working. If you try to do this yourself, there are tons of small details like this, and so we had to eventually go swap out that network file system. ButSwyx [00:48:35]: I think there's a write-up about it, right? Silas did one about the virtual file system.Walden [00:48:38]: Oh, that was a whole other thing. TheSwyx [00:48:39]: Oh, that's a different thingWalden [00:48:40]: The BlockDev file storage formatSwyx [00:48:42]: I'll bring it upWalden [00:48:42]: Which is, a file system format that we built so that the VMs could be spun up and down very quickly. Basically, the intuition behind this is-Imagine you have, a terabyte of disk, and your agent only, wrote, a hundred lines of code on top of that disk. How long does it, say, take to, save and re-bring up that disk? And most systems, because you're not optimizing for this case, it's just, on the order of a terabyte of work because you have to Save all of that and bring it back up. In our system, we try to build a file system that incrementally builds on top of each other. So every time you save and bring the machine back up, you're only doing work that is proportional to effectively the diff in the file system. And so this, shaves off a lot of time in the boot-up process of Devin. I think we This is actually now outdated. We have a newer system inside of Devin. But yeah, there's a lot of tiny details you have to get right here to actually get the day-to-day experience of Devin to be good.Swyx [00:49:39]: It's, not technically agents, but it is agent infra, and when you sell an agent as a company, you sell agent plus agent infra.Walden [00:49:46]: At least the way we do it be And the other The nice thing about having the agent infra being done together is, you We get to deploy Devin in whatever environment we want now. 
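A toy sketch of the intuition Walden gives above for incremental disk snapshots: a save should cost work proportional to the blocks that changed, not to the size of the disk. Everything here is hypothetical (the `LayeredDisk` class, the dict-of-blocks layout); it is not Cognition's storage format, which the conversation does not describe in detail.

```python
class LayeredDisk:
    """Toy block device: each snapshot stores only blocks written since the last one."""

    def __init__(self) -> None:
        self.layers: list[dict[int, bytes]] = []  # frozen snapshots, oldest first
        self.dirty: dict[int, bytes] = {}         # writes since the last snapshot

    def write(self, block: int, data: bytes) -> None:
        self.dirty[block] = data

    def read(self, block: int) -> bytes:
        # Newest data wins: check unsaved writes, then layers from newest to oldest.
        if block in self.dirty:
            return self.dirty[block]
        for layer in reversed(self.layers):
            if block in layer:
                return layer[block]
        return b""  # never-written blocks read as empty

    def snapshot(self) -> int:
        """Freeze the dirty blocks as a new layer; return how many blocks were saved."""
        saved = len(self.dirty)
        self.layers.append(self.dirty)
        self.dirty = {}
        return saved


disk = LayeredDisk()
for block in range(1000):              # stand-in for a large base image
    disk.write(block, b"base")
print(disk.snapshot())                 # 1000 blocks saved for the base image
disk.write(42, b"agent edit")          # the agent changes one block
print(disk.snapshot())                 # 1 block saved for the second snapshot
print(disk.read(42), disk.read(7))
```

The second snapshot touches one block even though the disk holds a thousand, which is the "work proportional to the diff" property described in the conversation. A real system would also need to merge old layers so reads do not slow down as layers pile up.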
We don't need to wait for some underlying infra provider to also go and support VPC or on-prem or FedGovCloud, for instance. So we can actually go and figure out, okay, since we own the infrastructure, how can we get that set up for you?
Cloud Providers: Modal, Daytona, and Enterprise Sandboxes
Swyx [00:50:12]: Whereas you're Cloudflare dependent.
Cole [00:50:15]: So Cloudflare runs the control plane. For the sandboxes, Modal is supported. A contributor just added Daytona. E2B is on the roadmap, and I think there's an abstraction in place so that if any contributor wants to add a new provider, they can add that in.
Walden [00:50:32]: Well, how are the customers you work with doing it? Do they generally try to go set up a contract with another one of these third-party providers? Do they try to do the VMs in-house?
Cole [00:50:44]: Most of them I see using Modal. I think Modal has a great
Walden [00:50:48]: Shout out Modal.
Swyx [00:50:48]: Shout out Modal.
Cole [00:50:50]: I think Modal has a great offering. It captures all of the sandbox pieces you need, snapshots being a pretty big piece of that, and given that they also offer GPUs, I think it's a pretty nice offering as a whole.
Swyx [00:51:04]: No debate there.
Walden [00:51:07]: Modal is great. I think their container offering is the most natural, and so especially if you are willing to forgo the full VM requirements, Modal is a really fast place you can spin something up on.
Swyx [00:51:20]: So Modal's very Python, and I feel like most workloads have really shifted to JavaScript. I don't know if you guys get the same feeling. So, okay, when I started Latent Space and AIE and all these things, I was like 50/50 Python and JS, right? That's roughly it. I think that's wrong now. I think JS has won. Maybe I'm overstating it, and maybe for Cognition, there's C# and Java and what have you. But for new greenfield apps, do you get that sense?
Does it matter?
Cole [00:51:52]: I think that most of the libraries that I see in this space are Python native first, especially in the observability space. That said, I think that there is a pretty big appeal to having your entire system in one language. Especially when you have both your frontend and backend communicating, you can have one central type, which is very nice.
Swyx [00:52:11]: That's my case against Modal, which is that then you have to run JS. You can run JS inside Modal. It's just one extra step that isn't native to the runtime. I don't know if
Walden [00:52:22]: I don't know
Swyx [00:52:23]: Reviews. Do you have numbers? I don't know.
Walden [00:52:25]: The one thing I don't like about Python is whenever AI writes Python, it always does the weirdest patterns, and
Swyx [00:52:32]: Oh, because it's mixing two and three or what?
Walden [00:52:34]: I think it's something mixing two and three, yeah. I don't know if you see this. It always tries to do, like, hasattr on objects
Cole [00:52:41]: Oh, my God.
Walden [00:52:41]: But you shouldn't be doing that. It should error if there was
Swyx [00:52:45]: Because it's training on library code?
Cole [00:52:47]: From what I've seen, it's more of a reward hacking mechanism where it doesn't want to basically
Walden [00:52:54]: It'll never error.
Cole [00:52:54]: It doesn't want the code to fail. And so even when it knows the object has the attribute, it'll call getattr on it, and for a lot of my clients who have moved towards more autonomous coding, we've put that in as a lint rule: if you use getattr, your pull request is going to fail.
Slop Signatures: Comments, Backwards Compatibility, and Types
Swyx [00:53:12]: Ooh, this is a fun topic. Can you tell me more about this? What else is a sign of AI coding that you have to put guards in?
Walden [00:53:21]: So we were talking just before this about Opus 4.7.
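A minimal sketch of the kind of lint rule Cole describes a moment earlier, assuming a Python code base and a check built on the standard-library `ast` module. The function name `find_dynamic_attr_calls` and the choice to flag both `getattr` and `hasattr` are illustrative assumptions; the conversation does not say how his clients' rule is implemented.

```python
import ast

# Builtins that let code silently tolerate a missing attribute.
# Flagging both is an assumption made for this sketch.
FLAGGED = {"getattr", "hasattr"}


def find_dynamic_attr_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, builtin name) for each flagged call in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FLAGGED
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)


sample = '''
def total(order):
    if hasattr(order, "items"):
        return sum(getattr(i, "price", 0) for i in order.items)
    return 0
'''

for lineno, name in find_dynamic_attr_calls(sample):
    print(f"line {lineno}: avoid {name}(); access the attribute directly")
```

In CI, a non-empty result would fail the pull request. A production rule would likely need an escape hatch for legitimate dynamic access (plugin loaders, serializers), which this sketch leaves out.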
One of the things this new model likes to do is it writes lots of comments. Not like, it'll, comment every line, but it'll write, paragraph, PRDs, on top of every function. But I will say, to its credit, these aren't slop, descriptions like they were before. “Oh, here's what this function does.” It's like, “Oh, here's actually the r
Dan Shipper is the co-founder and CEO of Every, a media and software company that's become a living laboratory for the future of work. Everyone at his company of about 30 people is an AI early adopter; from editors to ops people, they use AI to do much of their work, giving Every a unique lens into where the world is heading. A year ago on this show, Dan predicted that people were sleeping on Claude Code for nontechnical work, which proved to be remarkably prescient. Today he's back with another set of calls: the SaaS apocalypse is dumb, CLIs are over, the forward deployed engineer is the most valuable new hire, and the only thing you need to do to stay employed is ride the models.
Dan's predictions:
1. The future of work will happen inside Codex or Claude Code.
2. Every company will have one “super-agent” inside their Slack that every employee talks to regularly.
3. SaaS is not dead. In fact, Dan is bullish on SaaS stocks. His contrarian take: “I would buy SaaS stocks right now.”
4. SaaS economics will shift: users will bring their own AI tokens into apps, which actually improves SaaS margins.
5. PMs will thrive in the AI era.
6. Full-stack designers will become superheroes.
7. The AI job apocalypse is not happening.
8. Forward deployed engineer is the new most essential role.
9. CLIs are over.
10. Automation is a lie.
11. We will read way more AI-generated writing and we will like it.
12. We'll be building software for humans and agents to use together.
Episode transcript: https://www.lennysnewsletter.com/p/the-ai-paradox-dan-shipper
Where to find Dan Shipper:
• X: https://x.com/danshipper
• LinkedIn: https://www.linkedin.com/in/danshipper/
• Podcast: https://every.to/podcast
• Website: https://danshipper.com
Where to find Lenny:
• Newsletter: https://www.lennysnewsletter.com
• X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/
In this episode, we cover:
(00:00) Introduction to Dan Shipper
(02:56) Dan's unique position living in the AI future
(09:17) How the way we work will change in the coming year
(16:39) The case for general agents
(18:08) Codex and Claude Code as the new operating system for work
(25:39) How Cursor fits in
(27:42) How this changes what SaaS companies should build
(31:13) Why CLI is already over
(33:34) Two agents are better than one
(36:22) Why Dan is bullish on SaaS stocks
(39:01) Why automation doesn't reduce human work
(47:00) The value of human-written code
(48:36) Quick recap
(50:15) How work is changing
(56:17) Why data scientists are drowning in bad analysis
(58:24) Which product/tech roles are least changed by AI
(1:02:17) We will read way more AI-generated writing and we will like it
(1:08:28) Why product managers will dominate the AI era
(1:11:05) Full-stack designers are the other big winners
(1:13:11) The AI job apocalypse won't happen
(1:16:00) How to “ride the models” to stay relevant
(1:21:02) Final predictions and advice
(1:25:24) Lightning round
References: https://www.lennysnewsletter.com/p/the-ai-paradox-dan-shipper
Lenny may be an investor in the companies discussed.
In this episode, Pooja Ranjan interviews Kevin Jones, a leader at Edge and Node and creator of 1Claw - an innovative infrastructure platform designed to secure AI agents and manage secrets. They explore the critical vulnerabilities in AI workflows, how 1Claw addresses these risks, and the future of AI security in decentralized ecosystems.
Hey, Alex here, just got back from the sunny Shoreline Theater in Mountain View, so let me catch you up! This week was definitely Google heavy: we are covering Google's I/O conference for the third year in a row, and today a special guest, Logan Kilpatrick, is joining to discuss the newly announced Gemini 3.5 Flash, the Google Omni model, and the new Managed Agents offerings. Plus, this week, for the first time, OpenAI announced that AI solved a math problem that humans couldn't solve for 80 years, Cursor is showing off Composer 2.5, which is partly trained on xAI data, Karpathy joins Anthropic, and much more! Let's dive in!
P.S. - We've announced our upcoming hackathon, Weavehacks-4, June 6-7. I'll be there, and we're expecting the seats to run out very soon, so register now.
ThursdAI - We'd love to have your subscription, and if you're already subscribed, please hit that bell on YT to never miss an episode!
Google I/O 2026 - Google goes agentic everywhere
I went to cover Google I/O for the third year in a row, shoutout to the DeepMind team for inviting ThursdAI again, and folks, this one felt different.
Last year, Google I/O was still very model-centric. This year, the story was not “here is another benchmark chart.” The story was: Google is putting Gemini into everything, and the agentic layer is becoming the product layer. Search, Gemini app, Android, Workspace, YouTube, AI Studio, Cloud, Antigravity, Flow, managed agents, smart glasses, all of it is now orbiting around one pretty clear strategy: Gemini is the intelligence, Antigravity is the agent harness, Google's products are the distribution.
I saw many reactions that were milquetoast, as in, “we expected more,” and those seem to dominate the X feed. But I think the distribution is the part that many folks on X are missing. Yes, we can argue about Gemini 3.5 Flash pricing. Yes, we can argue whether “Flash” still means what Flash used to mean.
But when Google says the Gemini app itself has 900 million monthly active users, before even counting Search, Gmail, YouTube, Docs, Drive, Android, and the rest of the Google surface area, that's massive! OpenAI ChatGPT is supposedly stagnated at ~900M, I don't remember them crossing a 1B. Meanwhile Google is gaining traction. And they just updated all those folks with a new model!Wolfram said it really well on the show: his mother is not sitting there reading model cards. She just uses her Pixel, voice unlocks Gemini, asks for help, and suddenly the default intelligence available to her goes up. Antigravity 2.0 - the agent harness takes center stageThe biggest strategic signal from Google I/O for me was Antigravity.Remember, Antigravity was an IDE that came from the Windsurf acquisition saga. Part of the Windsurf team went to Google, part went to Cognition, and now Google is very clearly putting Antigravity in the middle of its agentic future. And I mean very clearly. Sundar mentioned it. Demis mentioned it. Varun Mohan the co-founder was on stage immediately after them! If you've ever watched a Google I/O keynote, you know how carefully every minute is allocated. Google has YouTube, Search, Gmail, Android, Cloud, Ads, Workspace, and a thousand VP-level products that could be on stage. The fact that Antigravity was that prominent should tell you everything.Logan Kilpatrick joined us and framed this in a way I loved: Gemini became the through-line across Google products, and now the Antigravity agent harness is becoming the through-line for agentic experiences.The new Antigravity 2.0 is a complete overhaul, showing only an agentic interface (which was previously just a separate window called Agent Manager) and separating the IDE layer completely into its own app and showing a Codex like agent-first interface, which got a few folks furious. 
This move may be weird to some folks, but if you follow along where everyone's going, this seems to be the way of the future, coding is no longer about lines of code, it's about managing fleets of agents. The new Gemini 3.5 absolutely shines inside the new Antigravity, the model was trained with this harness in mind, and is currently offered at an incredible speed (12x), so I'm definitely going to try it! Gemini 3.5 Flash - fast, determined, and maybe not the old “Flash”The most debated model release of the week was Gemini 3.5 Flash.Some folks saw the pricing and token usage and immediately went “this is not Flash.” I get that reaction. Flash used to mean cheap, fast, lightweight chat model. But Logan's framing on the show was important: Flash is now being built for the agentic era.In a chat era, you optimize for one user message and one model answer. In an agentic era, the real token volume is in tool loops, intermediate reasoning, retries, file reads, web searches, code execution, and self-correction. That's a different product profile.Wolfram already ran Gemini 3.5 Flash through WolfBench, and the results were fascinating. With the Hermes agent harness, Gemini 3.5 Flash hit an 87% ceiling on Terminal Bench 2.0, meaning across runs it could solve more of the benchmark than even GPT-5.5 extra high in that setup. The variance was higher with the simpler Terminus harness, but with a real agent harness, the model looked much stronger.That tracks with what Nisten saw in his “Martian railgun from Olympus Mons” test. Gemini 3.5 Flash went extremely detailed, almost too determined, kept correcting itself, overcorrecting itself, and built a whole game-like simulation. Logan laughed and basically said: yeah, this model is very determined, possibly an overcorrection from the “Gemini is lazy” feedback. It also tracks with the mismatch in other benchmarks, in some, Gemini 3.5 flash shines (like the above Apex-agents from AA) and in some, it doesn't match the other frontiers. 
In my tests, it was definitely over-eager to use a million and a half tool calls, read tons of files, to just help me review this draft inside antigravity. It's like a super eager robotic golden retriever! Gemini Omni - Nano Banana for video, but actually more than thatThe biggest update from last year IO was Veo 3! This year, the biggest wow factor was also visual, but it wasn't VEO 4, it was a new model that is multimodal, trained end-to-end they call Omni. Google is calling this their first “create anything from anything” model, and the first version, Gemini Omni Flash, starts with conversational video editing. The easy description is: Nano Banana for video. You upload or create a video, then talk to it. Change this character. Replace this person. Add an object. Make this scene claymation. Keep the scene, but change the environment.I played with it live and showed a few examples. I asked for a claymation explainer of protein folding, then gave it my face and asked it to replace the character with me. It did it. I uploaded pictures of Sonia, my cat, and it generated a talking cat video with the right kind of cat teeth, which is weirdly important because so many pet generations accidentally add human teeth and become nightmare fuel.The failure modes are still there. I asked it to make Sonia a Russian-speaking female cat, and it only partly switched languages and didn't really change the voice. Audio upload support is also not fully productized yet, even though the underlying model is multimodal. But the direction is very clear.This is not just “Veo with a chat model glued on.” I asked Jeff Dean - Google's chief scientist about this at I/O, and he explained that Omni is trained end-to-end. The intelligence and the generative media capabilities are part of the same model family, not a hacky two-model pipeline. 
He also said the intelligence is around a recent Flash-level model, which is a big deal when you think about video editing as reasoning over physics, identity, scene continuity, and intent.
A lot of people compared Omni to Seedance 2.0, and I think that's the wrong comparison. Seedance is amazing at cinematic generation (largely due to the lack of copyright concerns at ByteDance). Omni's unlock is iterative editing on real footage and coherent multi-turn creative control.
Other Google I/O 2026 releases I found notable
This was a concentrated effort by a huge company to insert AI into every product surface they have, so of course I can't cover ALL of it here, but the most notable things for me were:
* Gemini Spark - a new agentic experience from Google, to help you with tasks across Gmail, Drive and more. It should support skills, and is a de facto OpenClaw/Hermes alternative from Google for regular folks. It's not live “yet,” so we'll talk more about it when I can test it out.
* Managed Agents in the Gemini API - We chatted with Logan about this one. Google is re-imagining how agents are going to get built, and is offering one API call to spin up an agent in a full Linux env, with security and sandboxing in mind. I'll expand more on this in a future episode, as I recorded a complete conversation about this with Ali Çevic, a PM for Google APIs.
* AI overhaul of Google Search - AI Overviews will now expand into AI mode, and the iconic Google search box itself will change, for the first time in 25 years, to include AI mode!
* SynthID expansion and OpenAI collab - Google showed off that OpenAI is joining in marking all AI-generated imagery and video with an invisible SynthID watermark. I think this is amazing and more companies should adopt this standard.
* AI Glasses! We got Google Glasses demos - Together with Warby Parker and Gentle Monster, Google finally showed off their answer to Meta Ray-Bans/Oakleys.
They look like regular glasses too, but can hear and talk to you, with the full power of Gemini multimodality. Available in the fall sometime! * Demis Hassabis “we're on the cusp of the singularity” closer - CEO and Co-Founder of DeepMind, Demis Hassabis, closed the show with his remarks about the positive future and that we are nearing this Singularity point after which the future is very uncertain. I found it to be very inspiring and closed our show with that clip as well! * Personally, I got to chat to: Demis Hassabis, have breakfast with Jeff Dean, ask Josh Woodward a bunch of questions, and pester about 20 other great folks on a live stream, and had a lot of fun! Huge thanks to the DeepMind folks, Lucie, Dimple, JD and many others for the continued belief in ThursdAI and invite me to cover this great event. OpenAI LLMs solve an 80yo math problem - Erdős Unit Distance ConjectureOutside of Google I/O, the biggest story of the week was OpenAI announcing that a general-purpose reasoning model made progress on the Erdős planar unit distance problem.This problem goes back to 1946. For nearly 80 years, mathematicians believed the best constructions looked roughly like square grids. OpenAI's model found a new family of constructions with a polynomial improvement, using algebraic number theory ideas that humans apparently had not explored in this context. The above is a representation of it! Important caveat: this does not fully solve every version of the asymptotic Erdős conjecture. Some mathematicians are pushing back on the framing, and fair enough. Precision matters. But even with the caveat, this is still a huge moment.The reason it matters is not that I personally understand the math. I absolutely do not. The reason it matters is that this was not a special-purpose IMO model fine-tuned only for math competitions. This was a general-purpose reasoning model exploring a real open problem, generating candidates, verifying them, and finding a path humans hadn't taken. 
Extrapolate this to other sciences, Physics for example? This means an amazing future. LDJ pointed out that mathematicians have been skeptical because there have been previous false alarms. But this one landed differently. When Fields Medalist-level mathematicians verify the proof, the discourse changes from “lol stochastic parrot” to “wait, what does this mean for my PhD?”My answer is: yes, still study math. Please study math. The mathematicians who use these tools will do much more than people who don't understand the domain. Same with software engineering. Senior engineers with Codex, Claude Code, Hermes, Antigravity, Cursor and other agents are becoming dramatically more effective because they can steer, evaluate, and recover the work.This being published a day after Demis's “foothills of the singularity” is a great conjecture. Cursor Composer 2.5 - Opus 4.7 performance model from Cursor, at 10x better efficiencyCursor dropped Composer 2.5, and folks, this is a serious release.Composer 2.5 is built on Moonshot's Kimi K2.5 base, like Composer 2, but Cursor scaled the post-training dramatically. They used 25x more synthetic tasks and introduced targeted textual feedback during RL rollouts, where the model gets hints inserted at the point of failure instead of only getting a noisy final reward.The benchmark story is strong: around 69.3 on Terminal Bench 2.0, basically neck and neck with Opus 4.7 in Cursor's chart, and strong results on SWE-bench multilingual and CursorBench. The pricing is the part that makes this especially interesting: $0.50 per million input tokens and $2.50 per million output tokens, with a faster variant at $3 / $15. That is much cheaper than the frontier models it is trying to replace for day-to-day coding work.Cursor engineers are reportedly dogfooding Composer 2.5 heavily and rarely switching away. That matters more to me than any single benchmark. 
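To make the Composer 2.5 pricing above concrete, here is a back-of-the-envelope sketch. The per-million-token prices are the ones quoted in this recap; the session size (2M input tokens, 200K output tokens) and the `session_cost` helper are made-up illustrations, not Cursor numbers.

```python
# USD per million tokens, as quoted above for Composer 2.5.
PRICES = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def session_cost(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one session at the quoted per-million-token prices."""
    price = PRICES[variant]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical agent session: lots of file reads, comparatively little output.
for variant in PRICES:
    cost = session_cost(variant, input_tokens=2_000_000, output_tokens=200_000)
    print(f"{variant}: ${cost:.2f}")
```

Under those assumed token counts the standard variant comes out at $1.50 and the fast variant at $9.00, so the fast tier is a 6x premium at either end of the price sheet.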
If the people building Cursor can use it as a daily driver, that is a very real signal.The wild part is what comes next. Cursor is partnering with SpaceXAI to train a much larger model from scratch using 10x more compute on Colossus 2. Cursor has the workflow data. xAI has enormous compute. If this works, Cursor stops being just the IDE company and becomes a coding-model lab.We've been saying for months that coding agents are the path toward general agents. Anthropic has Claude Code. OpenAI has Codex. Google has Antigravity. xAI has Grok Build. Cursor has Composer. I'm looking forward to seeing how well it performs on our own benchmarks! Anthropic, xAI, Karpathy, and the compute warsThe compute story this week was bonkers.The SpaceX IPO filing reportedly revealed that Anthropic is paying SpaceXAI $1.25B per month for AI compute at the Memphis Colossus facility. Per month. That's about $15B a year, through May 2029, for access to more than 220,000 NVIDIA GPUs including H100s, H200s and GB200s.This is apparently inference compute for Claude Pro, Max and API users, not training. And it explains a lot of the recent quota changes. Anthropic doubled some Claude usage limits, and suddenly the product feels less constrained.Also, can we just acknowledge the comedy here? Elon Musk publicly called Anthropic “misanthropic,”, went off against every competitor to XAI, is now selling spare GPU time to Cursor and Anthropic? Who's next, OpenAI? The bigger point is that the AI capex story is no longer just NVIDIA. It's also whoever owns the data centers, power, cooling, networking, and GPU clusters. Compute is becoming the land under the AI economy.Also, Andrej Karpathy joined Anthropic. Karpathy could work anywhere. 
He co-founded OpenAI, led Tesla Autopilot vision, taught half the AI world how neural nets work, and now he's going back into frontier LLM R&D at Anthropic.Open source LLMs - Cohere, Qwen, NousOpen source had a strong week too.Cohere released Command A+, a 218B total parameter sparse MoE model with only 25B active parameters per token, under Apache 2.0. This is their first model that unifies reasoning, vision, multilingual, tool use and citations in one package.The hardware story is great: W4A4 quantization can run on 2 H100s or a single B200. Cohere says it supports 48 languages, 128K input context, 64K output, and gets big jumps over Command A Reasoning, including Tau-squared Bench Telecom from 37% to 85% and Terminal-Bench Hard from 3% to 25%.Cohere is one of those labs that doesn't always chase the loudest consumer hype, but they are very serious on enterprise and multilingual. Apache 2.0 makes this one especially useful.Alibaba also dropped Qwen 3.7-Max, positioned as an agentic frontier model. The headline from their testing is wild: 35 hours of continuous autonomous operation with more than 1,000 tool calls. They also showed it controlling a physical robot inside Alibaba offices and finding an umbrella after about 20 minutes of agent interaction.This digital-to-physical bridge is where things start feeling very real. An agent loop that can write code and use tools can also navigate physical tasks if you give it the right robotics stack.And our friends at Nous Research released Lighthouse Attention, a sparse attention method for long-context pretraining. At 512K context, they report a 17x faster forward+backward pass than standard attention on a single B200, and the recovered checkpoints actually beat dense-from-scratch final loss at the same token budget.The clever part is that the selection logic sits outside the attention kernel, so you still use regular FlashAttention on a gathered dense subsequence. No custom sparse kernel nonsense. 
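Here is a rough sketch of the general "select first, then run ordinary dense attention on what you gathered" pattern described above, in plain Python for a single query. To be loud about the assumptions: this recap does not say how Lighthouse Attention actually picks tokens, so `select_indices` below is a placeholder rule (top dot-product scores) invented for illustration, and `dense_attention` stands in for a FlashAttention call.

```python
import math


def dense_attention(q, keys, values):
    """Ordinary softmax attention for one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def select_indices(q, keys, budget):
    """Placeholder selector (an assumption): keep the `budget` highest-scoring keys."""
    ranked = sorted(
        range(len(keys)),
        key=lambda i: sum(a * b for a, b in zip(q, keys[i])),
        reverse=True,
    )
    return sorted(ranked[:budget])                     # keep original token order


def gathered_attention(q, keys, values, budget):
    """Selection happens outside the kernel; the kernel only sees a dense subsequence."""
    idx = select_indices(q, keys, budget)
    return dense_attention(q, [keys[i] for i in idx], [values[i] for i in idx])


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
print(gathered_attention(q, keys, values, budget=2))
```

The point of the structure is that `dense_attention` never needs to know anything about sparsity, which is why a stock attention kernel can be reused unchanged.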
If this holds up, this could matter a lot for long-context training.

Tools and agentic engineering - X subscriptions, Grok Build, Codex Mobile

One really practical tool update: Hermes and OpenClaw can now use your X subscription directly.

This is more important than it sounds. You can connect your X Premium subscription and get access to semantic X search and Grok-related tooling without using sketchy browser automation or unofficial APIs that might get you banned. Wolfram already used this to have his agent go through his likes and bookmarks from the past week and send me news items for the show. That is exactly the kind of “small but real” agent workflow that becomes addictive.

xAI also launched Grok Build, their agentic CLI coding tool, in early beta for SuperGrok Heavy subscribers. Early users are already running parallel Grok Build agents through tmux supervisors and using it for more than coding: fleet data triage, security patching, training label work, and general automation.

The pricing being discussed is aggressive, around $1 per million input tokens and $2 per million output tokens for the API. The model version is grok-build-0.1, and folks have already wired it into Hermes with a 256K context window.

And then there's Codex Mobile, which OpenAI shipped inside the ChatGPT mobile apps. This is one of those releases that sounds small until you start using it. You can control Codex sessions remotely from your phone, connected to your machine, and because Codex has native connectors to Gmail, Calendar and other surfaces, it sometimes feels faster and more reliable than local CLIs duct-taped to third-party integrations.

I ported Wolfred into Codex with skills and everything, and I've been comparing the same tasks in Hermes and Codex. Codex is often faster, not necessarily because the model is always smarter, but because the connectors and harness are cleaner. Harness matters.
We keep coming back to this.

This Week's Buzz - W&B, CoreWeave, WolfBench and robotics

This week in the Buzz, Wolfram walked us through a few things from the Weights & Biases / CoreWeave world.

CoreWeave is a gold sponsor at ICRA 2026 in Vienna, the International Conference on Robotics and Automation. NVIDIA is also going big there with a keynote on generalist humanoid robots, 17 accepted papers and workshops around sim-to-real, robot foundation models, autonomous driving, manipulation, and physical AI.

Wolfram will be there later in the week, after speaking at the AI Developer event in Cologne about WolfBench. If you're in Europe and into robotics or agent evals, find him.

We also looked at WolfBench results for Gemini 3.5 Flash, which honestly became one of the more interesting empirical points of the episode. The model looks variable in simple harnesses, but very capable in better agent loops. That's the whole thesis of measuring model + harness together instead of pretending the model card tells the whole story.

The water discourse, almonds, and data center reality

We also got into the data center water discourse, because this talking point is everywhere right now.

There are real infrastructure questions around AI. Power, land, cooling, grid capacity, permitting, local impact, all of that matters. But the “AI is stealing drinking water” version of the argument is often wildly detached from scale.

The stat I brought up on the show: California almonds use roughly 3 to 5.5 million acre-feet of water per year, multiple times more than all North American data centers combined in 2025. Nisten and LDJ added the important cooling nuance: many large data centers use closed-loop cooling, and evaporative cooling is not universal. Some data centers can avoid water use almost entirely, but at the cost of higher electricity usage.

This doesn't mean “no concerns are valid.” It means if we're going to regulate or pause data centers, let's be honest about the actual tradeoffs.
AI compute is becoming the substrate for medicine, robotics, science, logistics, software, education and every other productivity layer. We should build responsibly, but not based on viral fear math.

Closing thoughts - foothills of the singularity

Demis closed I/O saying we're in the foothills of the singularity, and I know how that lands when you write it down. But I was in the room, and after the keynote he told me something I haven't been able to shake: he thinks AI is going to be 10x as impactful as the Industrial Revolution, and 10x as fast. Basically 100x. This is the AlphaFold guy. Not someone loose with his words.

Then look at the week. A general reasoner cracked an 80-year-old math problem. Cursor is training near-frontier coding models on a fraction of the big-lab budget. Anthropic is paying Elon $15B a year for inference. Karpathy left education to go back into pre-training. Google rolled out an intelligence uplift to a billion people who don't even know a model dropped.

If you put that on a whiteboard in 2023, it reads like a sci-fi pitch.

LDJ's mathematician friends are asking if they should keep doing their PhDs. My answer hasn't changed: yes, please keep going. The people who combine domain taste with these tools are going to ship more in 5 years than the previous generation did in 50. The tool doesn't replace the taste. It just removes the bottleneck.

That's the whole reason ThursdAI exists.
Not to hype every drop, not to dunk for engagement, but to give you a shot at being one of the people who knows what's happening, with the receipts.

This week, a lot changed.

See you next Thursday.

TL;DR and Show Notes

* Hosts and Guests
  * Alex Volkov - AI Evangelist at Weights & Biases / CoreWeave, @altryne
  * Co-hosts: @WolframRvnwlf, @nisten, @ldjconfirmed
  * Guest: Logan Kilpatrick, MTS at Google DeepMind / AI Studio, @OfficialLoganK
* Google I/O 2026
  * Google went all-in on agents across Search, Gemini, Antigravity, Workspace, Android, Cloud and YouTube (I/O site, Alex thread)
  * Antigravity 2.0 became the central agentic coding harness across Google (Sundar, Google OS demo)
  * Gemini 3.5 Flash launched as a fast, determined workhorse model for agentic loops (Logan, Noam Shazeer, Jeff Dean)
  * Gemini 3.5 Flash is rolling out across the Gemini app, Search AI Mode, Gemini API, Google AI Studio, Antigravity and Gemini Enterprise Agent Platform (Koray Kavukcuoglu)
  * Google Search is getting new Gemini 3.5 Flash-powered agentic capabilities, including a new AI-powered Search box and background information agents (Sundar)
  * Gemini Spark was announced as a 24/7 personal AI agent that can proactively work across Google surfaces (News from Google)
  * Google teased Gemini-powered Android XR smart glasses with eyewear partners Gentle Monster and Warby Parker (Google, Alex live reaction)
  * Google AI Studio and the Gemini API got major agentic developer updates, including Managed Agents (Google AI Developers)
* Vision & Video
  * Google DeepMind launched Gemini Omni, a “create anything from anything” multimodal model starting with conversational video editing (DeepMind, Google DeepMind on X)
  * Omni is available in the Gemini app, Google Flow and YouTube, with API support coming soon (Logan, Gemini App, Sundar)
  * Key distinction: Omni is not just text-to-video, it is an iterative multi-turn video editing model that combines Gemini intelligence, world knowledge, multimodal inputs and generative media (Google)
* Big CO LLMs + APIs
  * OpenAI announced that a general-purpose reasoning model made progress on the Erdős planar unit distance problem, challenging an 80-year-old mathematical belief (OpenAI, X)
  * Cursor launched Composer 2.5, built on Kimi K2.5, with Opus-class coding performance at much lower cost (Cursor blog, X)
  * Alibaba released Qwen 3.7-Max, an agentic frontier model with long autonomous runs and robotics demos (Qwen blog, X, robot demo)
  * Andrej Karpathy joined Anthropic to work on frontier LLM R&D (X)
  * SpaceX IPO filing revealed Anthropic is paying $1.25B/month for AI compute at the Memphis Colossus facility (Axios, Sawyer Merritt)
  * The jury in Musk v. Altman found Musk's OpenAI claims barred by statute of limitations, with Musk saying he will appeal (Elon Musk, Sawyer Merritt, Max Zeff)
* Open Source LLMs
  * Cohere released Command A+, a 218B MoE model with 25B active parameters under Apache 2.0 (Cohere, Nick Frosst, HF W4A4, HF BF16)
  * Nous Research released Lighthouse Attention, a sparse attention method for long-context pretraining with major speedups (Blog, X, arXiv, GitHub)
* Tools & Agentic Engineering
  * Google launched Managed Agents in the Gemini API, letting developers spin up hosted Antigravity agents with Linux sandboxes and persistent state (Docs, X)
  * xAI launched Grok Build, an agentic CLI coding tool in beta for SuperGrok Heavy users (xAI CLI, X)
  * Hermes and OpenClaw can now use X subscription auth for semantic search and Grok tooling (Alex)
  * OpenAI Codex Mobile is now available in the ChatGPT mobile apps for remote agent workflows (OpenAI)
  * Anthropic doubled Claude usage outside peak hours for a limited period, including Claude Code and other Claude surfaces (Claude)
* This Week's Buzz - W&B / CoreWeave
  * Weights & Biases by CoreWeave is at ICRA 2026 in Vienna, with robotics and automation taking center stage (ICRA, W&B event page)
  * NVIDIA heads to ICRA 2026 with robotics work around generalist humanoids, physical AI and sim-to-real systems (NVIDIA Robotics, NVIDIA ICRA)
  * Wolfram is speaking about WolfBench at the AI Developer event in Cologne before heading to ICRA in Vienna (Wolfram)
* Other Topics
  * Data center water usage discourse came up again, including why comparisons need real scale and context rather than viral fear math
  * The broader theme of the week: coding agents are becoming general agents, and the major labs are now competing on the full stack of model, harness, tools, context and compute

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!

On the product side, everyone is getting Computer - Perplexity, Manus, Cursor, and so on. Meanwhile on the research side, agentic evals like TerminalBench and GDPVal are also assuming computer (Harbor). On both ends, the consolidating LLM OS stack has become a standard toolkit, and Daytona is one of a small set of AI Infra companies that are booming because of it.

“The end of localhost” has been Ivan Burazin's obsession for more than a decade.

Something that is all too familiar…

Long before agents became the default way people talked about software development, Ivan was already chasing the idea that development should not depend on a fragile local machine. CodeAnywhere, one of the first browser-based IDEs, was an early attempt at that future: move the development environment into the cloud, make setup reproducible, and free developers from the endless “works on my machine” tax.

The thesis was directionally right, but the market wasn't ready yet.

However, agents changed that. They do not care about a laptop, desk setup, or favorite editor. They need a computer they can access through an API: something stateful enough to keep working, fast enough to spin up instantly, flexible enough to resize, isolated enough to be safe, and composable enough to run the messy real-world workflows that real software engineering actually requires.

Daytona isn't just selling “sandboxes” in the narrow code-execution sense.
It is the latest version of Ivan's original localhost thesis.

In this episode, Daytona's CEO joins swyx to explain why AI agents need more than code execution boxes: they need composable computers, stateful sandboxes, instant startup, dynamic resources, and infrastructure that can survive workloads going from zero to 100,000 CPUs.

We go deep on the new agent compute market: Daytona's hard pivot from human dev environments to AI sandboxes, the New Year's Eve MVP that customers begged for, why Daytona runs on bare metal with its own scheduler, how one customer runs almost 850,000 sandboxes a day, and why RL/eval workloads went from 0% to roughly 50% of usage in just months. Ivan also explains why agents need Windows and macOS machines, why CLI may matter more than MCP, why Kubernetes is painful for this workload, and why the future AI cloud may look more like Stripe than AWS.

We discuss:

* How Daytona grew out of CodeAnywhere, Shift, and the “end of localhost” thesis
* Why Daytona pivoted from human dev environments to AI sandboxes
* Why agents need composable computers instead of disposable code execution boxes
* The New Year's Eve MVP that customers chased API keys for
* Why Daytona chose bare metal, stateful snapshots, and its own scheduler
* How Daytona spins up one sandbox in ~60ms and 50,000 sandboxes in ~75 seconds
* Why Daytona's biggest customer runs ~850,000 sandboxes a day
* How RL/eval workloads create zero-to-100,000 CPU spikes
* Why RL workloads went from 0% to roughly 50% of Daytona usage
* Why customers compare Daytona against EKS/GKE and say they're “never going back”
* Why every AI agent may need a computer, including Windows and macOS environments
* The Apple licensing constraints that make macOS sandboxes hard
* Why CLI gives agents more power than MCP
* How open source helps agents integrate Daytona
* Why agent-generated PRs may break today's CI/CD assumptions
* Why AI SaaS companies reselling tokens may face a cold shower
* Why the AI cloud may look more like Stripe than AWS

Ivan Burazin

* LinkedIn: https://www.linkedin.com/in/ivanburazin
* X: https://x.com/ivanburazin

Daytona

* Website: https://www.daytona.io
* X: https://x.com/daytonaio

Timestamps

* 00:00:00 Hook
* 00:01:12 Introduction
* 00:03:15 CodeAnywhere, Shift, and the end of localhost
* 00:05:58 What Daytona is: composable computers for AI agents
* 00:08:07 The pivot from dev environments to AI sandboxes
* 00:10:17 The New Year's Eve MVP and customers begging for API keys
* 00:12:56 Bare metal, stateful sandboxes, and Daytona's scheduler
* 00:17:28 60ms startup, 50,000 sandboxes, and 850K daily runs
* 00:21:53 Spiky RL/eval workloads and the new agent infra problem
* 00:28:12 RL workloads, Kubernetes pain, and dynamic resizing
* 00:33:31 Why every AI agent needs a computer
* 00:38:48 macOS sandboxes and Apple's licensing problem
* 00:44:28 Why CLI may matter more than MCP
* 00:48:11 Open source, GitHub stars, and agent integration
* 00:53:11 Git, CI/CD, and agent collaboration bottlenecks
* 00:58:15 Founder life and building a 25-person infra company
* 01:02:44 AI SaaS, token resale, and API-first business models
* 01:06:10 GPU sandboxes, data centers, and compute growth
* 01:09:48 Why the AI cloud may look more like Stripe than AWS
* 01:11:26 Closing thoughts

Transcript

Introduction: Daytona, CodeAnywhere, and the End of Localhost

Swyx [00:00:02]: Okay, we're in the studio with Ivan Burazin, CEO of Daytona. Welcome.

Ivan [00:00:07]: Thanks for having me, man.

Swyx [00:00:08]: Ivan, you and I go back.

Ivan [00:00:10]: Way back.

Swyx [00:00:11]: I don't even know how. Did you reach out, or, for Shift?

Ivan [00:00:17]: I reached out to you. The reason was, I was one of the co-founders of CodeAnywhere, the first browser-based IDE, and so we had been thinking for a long time that localhost should die.
And you had this article.

Swyx [00:00:29]: End of localhost.

Ivan [00:00:30]: Then I reached out to you because of that, and then we talked. I was actually at a different job, I was the head of developer experience, and you were quite well-versed in that, so I reached out to you, among other people, to ask: how do we go about that? What are the key things at this point in time? And you were nice enough to take the call, and I remember I was late to the call with you.

Swyx [00:00:51]: I don't remember.

Ivan [00:00:52]: I remember because I was with my then, I'm thinking, girlfriend or wife at that point in time, I'm not sure. It's the same person, so that's great. I was late ‘cause we were in Italy on vacation, and then I was late for something. I felt so bad, and you were so nice to be good about it.

Swyx [00:01:10]: The reason I'm nice is because I'm also late to other people, so it's like, who's without sin here? So, for those who don't know, InfoBip Shift, there's this whole thing that you did in the past, and that was basically one of the inspirations for me starting AI Engineer. I have to thank you for giving me that push to be like, “Oh, you can build and sell conferences?”

Ivan [00:01:34]: I remember you asked me at the beginning to give me advisory shares, and I was so focused on what we were doing, I said no, and I should've took the advisory shares. So I'm sorry, dude. But anyway.

Swyx [00:01:43]: We're not venture backed.

Ivan [00:01:44]: No, it doesn't matter.

Swyx [00:01:45]: Yeah, anyway, so I think what's impressive about you is that CodeAnywhere is the thing that you've been trying to build, and you kind of put it on hold and then came back after InfoBip. Just give us the story, the origin story, going into Daytona.

From CodeAnywhere and Shift to Daytona

Ivan [00:02:05]: Sure.
Like, really way back, me and my co-founder have been together. I've said this multiple times, it's like we were married and divorced and married. Some people actually ask me whether my co-founder is my partner. They took it literally. It's not literal, but we have done multiple companies together, and to your point, we had this shift where we went from CodeAnywhere to the conference called Shift, and then back to Daytona. We originally started stacking servers, doing virtualization in the early 2000s, and routers, doing basically all these things at a foundational level, and that was a services company which we sold to focus on what my co-founder actually invented, which was the very first browser-based IDE. Right, I say the first. Before us was actually Heroku. They did it for a very short time until they became Heroku. But outside of them, we were the only one, and it was called.

Swyx [00:02:55]: There was Cloud9.

Ivan [00:02:57]: Cloud9 came out slightly after us. There was Replit, which came out when we stopped doing it, and they have been successful since then, which is great. There was Nitrous.io. There were quite a few that existed at the time, but it was too early. But the interesting part is that, at that point in time, there was no VS Code, there was no Kubernetes, and Docker had just started when we - or I'm not sure if it was even public at that point in time. And so we had to build everything, the whole stack, ourselves, and that was the key learning that we brought into and that we've been using in Daytona today. So it was super early. About 3 million people used CodeAnywhere. It was angel-backed more than venture-backed. We ended up paying everyone back because it didn't have that sort of scale.
But three years ago, we started something similar with Daytona, which is not what we are today, but it was automating dev environments for human engineers, basically the underlying stack of CodeAnywhere. And then we did a hard pivot last January to sandboxes. And so here we are.

Swyx [00:04:01]: Historic pivot, yeah, and it's one of those things where I had independently invested in CodeAnywhere, but also in E2B, and then both of you pivoted into the same thing, and I'm like, “F**k.”

Ivan [00:04:12]: You invested in Daytona. You invested in Daytona. But you were the first. If we had not got your check, we wouldn't have done it.

Swyx [00:04:18]: No way.

Ivan [00:04:19]: No, it was like, “We have to get him on board first,” and you were that kicker that got us off the ground.

Swyx [00:04:23]: No, because you were putting me on your pitch deck, man. I was like, “Man, this is like a good trip if I don't invest.”

Ivan [00:04:29]: That's because it was your quote. It's like we.

Swyx [00:04:30]: Yeah. It's the end of localhost.

Ivan [00:04:31]: Did a bunch of research about end of localhost and who was interested in that.

Swyx [00:04:34]: No, that's like, I wrote that blog post, and every single company in that field reached out to me, and then every VC who was receiving those pitches then also had to call me and talk through it with me.

Ivan [00:04:47]: It's finally happening though.

Swyx [00:04:48]: It was really super interesting.

Ivan [00:04:48]: It's finally happening.

Swyx [00:04:49]: It's finally happening.

Ivan [00:04:49]: Yeah, it's finally.

Swyx [00:04:49]: It's finally happening, with maybe sort of non-human users. Yeah, so what is Daytona today? Let's get like a quick description. I'm wearing the shirt.

What Daytona Is Today: Composable Computers for AI Agents

Ivan [00:04:58]: You're wearing the shirt. Yes.

Swyx [00:04:59]: It says, I think your branding is very good. Like, it's very consistent. It runs AI code.
Like, it cannot be simpler.

Ivan [00:05:05]: Exactly, but we're gonna probably have to change that.

Swyx [00:05:07]: Oh, s**t.

Ivan [00:05:07]: It's also a subset of what we do. Unfortunately, we really love this. Run AI Code is super simple. People interpret it different ways. I think we've given out 5,000, 6,000 of these shirts. People wear them with pride because it doesn't really market about us.

Swyx [00:05:21]: Yeah, Daytona's on the back.

Ivan [00:05:22]: It markets the back. It markets to the person itself, so I think we did a really good job on that one. But it is also a subset of what we do, because when people think about Run AI Code, they just think about these small, let's call it isolates, code execution boxes where you send some code, you get an output. Whereas what Daytona is today is essentially composable computers for AI agents. The market calls them sandboxes, which can be misleading.

Swyx [00:05:44]: All these things. All these things on.

Ivan [00:05:45]: Yeah, exactly, ‘cause it can be misleading, ‘cause people usually think about sandboxes as a demo or a test environment versus a production-grade environment. But what Daytona does: if you think of the laptop that you have in front of you or the computer that's over there, or, my wife is an architect, so she has like a Windows machine with a 3D graphics card inside to do 3D rendering. As humans, we have different computers or different compositions of computers. And our belief is strongly that agents today and going forward will need all these different compositions of computers to do different types of tasks. And so we offer that basically through an API.

Swyx [00:06:19]: Yeah, I'm trying to sort of front-load all the aha moments or the wow moments so that people can stay engaged and click like and subscribe. The market is exploding, right? Like, you have been reporting 74% month-on-month growth, and it's just been growing for a while.
Like, it's been going like this. And it's not just you guys. It's every single.

Ivan [00:06:41]: Everyone, yeah.

Swyx [00:06:42]: Sort of compute provider. I don't know if you agree with me saying compute provider or not.

Ivan [00:06:48]: It's fine.

Swyx [00:06:48]: Yeah. So like organically PLG-driven growth, but also enterprise is doing super well. I think I wanna rewind to January of last year when you did the pivot. So you obviously called this market early, and you were positioned for it, and you are now one of the market leaders. But what was the insight that made you do the pivot?

The Pivot: From Human Dev Environments to Agent Sandboxes

Ivan [00:07:06]: The insight that made us do this pivot is the quarter before that, so end of 2024, when we had - Basically, we did a demo with - I think we discussed this as well, Devin was not public. You actually gave me access to Devin at that time. So Devin.

Swyx [00:07:25]: I did?

Ivan [00:07:26]: Yeah, you gave me access.

Swyx [00:07:26]: I don't think I was supposed.

Ivan [00:07:27]: Yeah, exactly.

Swyx [00:07:28]: Yeah, I.

Ivan [00:07:28]: So it doesn't matter. You.

Swyx [00:07:29]: Yeah. I gave like three friends access.

Ivan [00:07:31]: Yeah, or it was a call and you showed it to me. It doesn't matter. But OpenDevin was available, which is now called OpenHands. And so we're like, “Oh, this seems to be a thing. This is not public. Let's take our automation of dev environments for humans, and take OpenDevin, and launch that as a SaaS.” And we did that. Not very many people signed up and used it, but a lot of people reached out that were building agents, and they were like, “Hey, my agent needs a compute sandbox runtime,” whatever you wanna call it. I forgot what it was called at that point. And then we were like, “Oh, amazing. This is a new market. Here is our infrastructure. Here's our product, and go.” And what we found really fast was that people did not like what we had built.
It didn't work. And I remember talking to people at the beginning when we were doing this, the sandbox we're building for agents. People were like, “Oh, why is it different? It's the same thing. We have like EC2, we have VMs, we have all these things.” But we saw that everyone we gave it to, it was like 20, 30 people, they all said, “No.” Like, “This is not what we need. This sort of breaks.” And basically, me and my co-founder not knowing a lot about - ‘cause we're infra people. We're not AI people. So I basically took it upon myself to watch every single podcast that exists, including all of these, and sort of get up to date, read all the blogs, understand what's going on.

Swyx [00:08:45]: Do you wanna shout out who else was useful, just in case people are also looking.

Ivan [00:08:49]: Generally, I looked at a few podcasts, different segments and different types. So there's you guys, No Priors, Bill Gurley's was great while.

Swyx [00:09:04]: BG2, yeah.

Ivan [00:09:05]: Yeah, while it was around. So there's a few. 20VC is interesting from a different dynamic, and some are different dynamic. But there was also Redpoint's.

Swyx [00:09:14]: We're not really about the compute market.

Ivan [00:09:15]: It was also already - Sorry?

Swyx [00:09:16]: You're looking at the agent infra market.

Ivan [00:09:19]: I was looking at the agent market and the AI market in general and sort of understanding who are the players, what's the perception, and how that goes. And obviously you complement this with going to conferences, going to events, going to meetups, reading white papers, doing all the things that you have to do to understand what's happening. And so when we sort of had an idea of what we had to build, literally on New Year's Eve, I half vibe coded the first MVP, first minimal viable product of what Daytona is today.
And I went to sleep at like 3:00 AM or something like that. I just put my baby daughter and wife to sleep and, Happy New Year's, and went back to just doing this. And I sent it to my co-founder, my CTO, and he saw it in the morning. He's like, “This is absolute garbage.” “Do not show this to anybody at all, but the idea is good.” And so he took two weeks, and he rebuilt it.

Swyx [00:10:09]: Did it like look like that? Listen, I - It was rough idea.

Ivan [00:10:12]: Oh, not even close. Like it was way worse. But it was a simplistic view of what it should be. Like, it worked, but it was not ideal. And so he went down the hole, which is his job as CTO, and he came back with this version. We then called all the people that had said, “This is garbage,” a quarter ago. And we set up these calls, and we just demoed it to everyone. And all the calls went long, every single one. They were 15-minute calls, and they all went to like 25, 30 minutes or whatnot. And everyone said, “We want access.” There was no login, just an API key, ‘cause it was just a beta or an alpha. And they said, “Oh, we want access.” And we're like, “Sure, yeah. Okay, thank you very much.” But the next day, if we had not sent it, every single one, like every call that we did, everyone came back, “Where is my API key?” Like everyone wanted it. We're like, “S**t.” Like this is it. Like I've never felt - So one, the understanding, to your point, was that most people thought it was the same infrastructure for humans and agents. We understood a quarter ago it's not. We just didn't know what was the right primitive. And then when we came, and we can talk about what that is, and we gave it to these people - I've done multiple companies in my life. I've never experienced this, that people literally call you if you do not give them access.
Like they want access right now. And so it's like, okay, they don't want this. The thing that they want doesn't seem to exist, or they have not found it, and they really want what we want. And then we understood that we're onto something, and then when you think about the size of the market, like the market for human engineers and enterprise is a very large market, so think GitLab or whatnot. But the market for every single agent that will exist ever in the future is just like, what is that market? How big is that? And we're like, “We are all in on this.” And so that is where we made sort of the cut between the old product and the new one.

Bare Metal, Stateful Sandboxes, and the Lambda + EC2 Model

Swyx [00:12:02]: Yeah. But it wasn't composable at the time?

Ivan [00:12:05]: It was basically just a Linux box where you could define number of CPUs, disk, and RAM. That is what you could do, but you couldn't have multiple operating systems, you couldn't resize it on the fly, you couldn't add a GPU, you couldn't do all the things. It was just the first sort of variation of that, yeah.

Swyx [00:12:22]: Was it bare metal from the start?

Ivan [00:12:24]: It was bare metal from the start. And so the interesting thing that we thought about right away, so our.

Swyx [00:12:29]: Which, give people the background, what is the normal path?

Ivan [00:12:32]: Yeah, so basically most providers run this on top of VMs. And also.

Swyx [00:12:37]: Firecracker.

Ivan [00:12:38]: Yeah, they run on Firecracker and VM. And so we also - We have multiple isolation layers and we can do that. But the common way to do it is that, one, the state of the machine, or the hard disk, is not part of the sandbox itself. And the other thing is they're not meant to last forever. So most of them are preemptible, like they can - there's a time that they can live.
And so our thought when we were going into this was, agents will be like humans in the sense that you don't want your laptop to be shut down until you're done with work. And when you close the lid and open the lid, it's the same state. Agents would want that, the pause and come back. They want those two things. But also agents really want speed, right? Can they get it? So when we thought about it, it's like we need something insanely fast, and how to make it fast, how to make it long-running and stateful. And so those two things, it's like combining a Lambda and an EC2, right? Those two things together. And so we didn't have an idea how others did it, ‘cause we didn't know that there was a market around this. It was more like, okay, this is what they need. And we looked at Kubernetes, it wasn't good enough for that. We looked at Nomad, it didn't enable that. And so our history in writing our own scheduler at CodeAnywhere is basically what my CTO came up with. Like, he's like, “Oh, the learnings from there,” and he brought it. And the funny thing is, our third co-founder, when he saw it, he's like, “Dude, what is this? This is like 2008.” Like, we went back in time, and he's like, “Exactly.” And so the reason why Daytona is super fast, and you see this on benchmarks, is we essentially run on bare metal. We have our own scheduler, we use the disk, CPU, and RAM of the underlying machine, which means your IOPS are insanely fast because there's no network between an EBS or something like that. But also the snapshots, the point-in-time templates, are also preloaded on the bare metal machines. So when you fire off a sandbox from a template or a snapshot, you're essentially directed to the bare metal machine where that snapshot is based on that NVMe drive, and then it literally just turns on that machine, and it's local. There's no network latency, anything on there.
And so those are sort of the specifics. When we were thinking from first principles about what a computer would look like for an agent, that is what we came up with, and that's what we created.

Benchmarks, 60ms Startup, and 50,000 Sandboxes

Swyx [00:15:02]: Yeah. I don't know if you endorse this, but there's someone that does compute SDK, and you guys do very well on there, with the TTI, right? Is this a relevant benchmark for you guys? I don't know.
Ivan [00:15:16]: I don't know, and it changes every day. So today RKL is.
Swyx [00:15:18]: I don't know what RKL is. Never heard of it.
Ivan [00:15:20]: Yeah. RK, yeah, so it is there.
Swyx [00:15:22]: You are at least a third of the next tier of performance, and then there are a lot of other better-known names that are very slow to start.
Ivan [00:15:31]: Yeah. We've been number one by far for a long time, and now there are different definitions of sandboxes, different isolation patterns, different other things. So RKL runs it literally on S3, the data, so it's very different, and they spin up a sandbox, spin up a container for that, so it's a different type of thing. So the definition of a sandbox is something that we all need to agree on. But yeah, we're insanely fast at getting these things up and running. And you can see even there that it's 0.10 to 0.11, so.
Swyx [00:16:03]: Close enough. Yeah. What else do you need, right?
Ivan [00:16:05]: Yeah. I don't think the benchmarks equate to market ownership or revenue or anything like that. And I've seen this with multiple benchmarks, not just in sandboxes, but benchmarks in general.
Swyx [00:16:20]: It's table stakes. It's just like.
Ivan [00:16:21]: Exactly.
But it doesn't hurt.
Swyx [00:16:22]: Just roughly check.
Ivan [00:16:22]: You definitely have to be up there and you have to be competing so that people know that, oh, this is definitely one of the top. Because this is only one dimension of what customers look for. There are other things, like how many can you spin up consecutively? There's the feature set, there's support, there are all different things that people look at, but you definitely have to be there on the benchmarks.
Swyx [00:16:40]: How many do people spin up consecutively?
Ivan [00:16:43]: So we have.
Swyx [00:16:43]: Or concurrently. It's concurrency, right?
Ivan [00:16:45]: There are three metrics that we look at. One is the time to spin up one, and our time to spin up one is 60 milliseconds with network latency. So request, spin up, reply, the whole thing, 60 milliseconds. That is one. But if you wanna spin up 50,000 at once, we are now at about 75 seconds. So it takes about 75 seconds to spin up 50,000 concurrently. Some others, and there's public data around this, take 2,000 seconds, which is roughly 30 minutes. There are different variations of that. So it is speed of one, speed of multiple, and then how many can you consistently have up and running. And we basically have no limit right now to how much we can add because we own our own metal. Our biggest customer does about 850,000 every single day, just shy of a million every single day that they're running. We do have a request for half a million concurrent, which is literally half a million CPUs somewhere running. So that's an interesting.
Swyx [00:17:44]: They pay by like vCPU seconds.
Ivan [00:17:47]: By seconds, yeah.
Swyx [00:17:47]: Or whatever. Yeah.
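As a rough editorial illustration, the figures Ivan quotes above can be turned into rates with a few lines of arithmetic. This is only a sketch: the constants are the numbers mentioned in the conversation, the variable names are made up for the example, and it assumes the 2,000-second figure refers to the same 50,000-sandbox burst.

```python
# Figures quoted in the conversation; everything derived below is plain arithmetic.
SINGLE_STARTUP_S = 0.060      # one sandbox: request, spin up, reply
BURST_COUNT = 50_000          # sandboxes requested at once
BURST_TIME_S = 75             # quoted time to bring all of them up
SLOWER_BURST_TIME_S = 2_000   # quoted public figure for some other providers
DAILY_SANDBOXES = 850_000     # quoted daily volume of the largest customer


def per_second(count: int, seconds: float) -> float:
    """Average number of sandboxes started per second."""
    return count / seconds


burst_rate = per_second(BURST_COUNT, BURST_TIME_S)
slower_minutes = SLOWER_BURST_TIME_S / 60
speedup = SLOWER_BURST_TIME_S / BURST_TIME_S
daily_avg_rate = per_second(DAILY_SANDBOXES, 24 * 60 * 60)
# Hypothetical baseline: starting the same burst one sandbox at a time.
serial_time_s = BURST_COUNT * SINGLE_STARTUP_S

print(f"burst throughput:     {burst_rate:.0f} sandboxes/s")
print(f"slower burst:         {slower_minutes:.1f} minutes ({speedup:.1f}x longer)")
print(f"daily average rate:   {daily_avg_rate:.1f} sandboxes/s")
print(f"one-at-a-time burst:  {serial_time_s:.0f} s")
```

The one-at-a-time baseline is included only to show that the quoted burst time implies heavy parallelism rather than a fast serial loop.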
Okay, and then the other thing is the sleeping and the resuming, 'cause it's all the stateful resumption of all these things. What kind of workload are people putting through this, right? Do we measure by gigabytes in memory, gigabytes in storage? Network-attached storage? What are the costly ones out of all these features?

Workload Economics: CPU, RAM, Network, and Storage

Ivan [00:18:15]: The most expensive thing is CPU.
Swyx [00:18:18]: Okay. Yeah, of course.
Ivan [00:18:18]: Then it's RAM, then it's disk. We actually don't charge.
Swyx [00:18:22]: Which is snapshotting, right?
Ivan [00:18:23]: No, snapshotting's part of it, but basically it's the size of your hard disk, of your machine. So do you have 10 gigabytes, do you have 20, do you have 50, do you have whatever? And then the transference of that. Right now, currently we don't charge for network at all at Polychron.
Swyx [00:18:37]: Oh, you gotta, yeah, you gotta fix.
Ivan [00:18:38]: Yeah. It is a larger and larger part of our bill, so we're working on that part. The hard disk is the least expensive, so it's basically CPU, RAM, for us network, 'cause we don't charge the customer, and then hard disk, is how it's split up. But there are also different types of workloads, and we basically split it up into two types of workloads in Daytona. One is what we call background agents or long-running agents, and the other is basically RL and evals, which I put sort of together. They have very different patterns of usage, and if you look at the usage of a background. And I'll just name names of companies, not specifically.

Background Agents vs. RL/Evals: Two Usage Shapes

Swyx [00:19:21]: Yeah, open, all hands.
Ivan [00:19:23]: Yeah. So a background agent is a Cognition, a Lovable, all these things, a Harvey.
These are all long-running background agents. And if you look at their usage patterns, they're similar to humans', which is follow the sun. Basically, noon is probably the highest, midnight is the lowest, weekends are lower, and weekdays are higher.
Swyx [00:19:42]: Yeah, that's a fun question. How global is it? Is it very US-centric or?
Ivan [00:19:46]: The US is a large part, but we currently have Asia, Europe, and US regions.
Swyx [00:19:52]: So it's quite global.
Ivan [00:19:53]: Yeah, it's quite global. We have it all over. I talked to you a bit about this. Our number one city by user.
Swyx [00:20:01]: Hmm.
Ivan [00:20:02]: Is Singapore.
Swyx [00:20:04]: Oh, wow. Amazing.
Ivan [00:20:05]: Which is an interesting one, right? Not by revenue, just by individual head count.
Swyx [00:20:09]: Really?
Ivan [00:20:09]: Just an interesting thing.
Swyx [00:20:10]: Singapore is weirdly high in the adoption charts of AI for the population. It's like a seven, eight million population. And it keeps showing up.
Ivan [00:20:20]: No, it's quite interesting. We were quite shocked, and I was like, “Oh, this is interesting.” And also one that's up there.
Swyx [00:20:24]: There's a reason I'm doing AI Engineer Singapore. It's because I'm from there.
Ivan [00:20:27]: We're there. We're gonna be there as well. And it's interesting that Japan is in the top, or Tokyo's in the top, which in all the tech cycles it has never been. It has never been, so it's quite interesting that they're.
Swyx [00:20:39]: I think the Japanese just love AI. Yeah. It's that, and then it's Brazil.
That's it.
Ivan [00:20:44]: Brazil has always been in.
Swyx [00:20:45]: I think.
Ivan [00:20:46]: Even if you look at GitHub's data and, historically, with CodeAnywhere, it was always like US, Western Europe, and then you'd have India, Brazil, China, that would be there. But Singapore was not in, and specifically Japan was never in sort of that top.
Swyx [00:21:01]: Yeah. Weird pockets.
Ivan [00:21:01]: Weird. Yeah, so it's very global.
Swyx [00:21:02]: Okay, so actually that helps you to distribute your load through all time?
Ivan [00:21:08]: The interesting thing is we have those kinds of loads, but if you look at the researcher loads, they're quite different. If you give them concurrency of 10,000 or 50,000 or 100,000 CPUs at ARMb, when they fire off a run, it's just 100%. And then it just runs, and then it stops. So the usage pattern is squares basically, right? And it's also not follow the sun, because people will fire it off at midnight before they go to sleep and then wake up, so it's very unpredictable, and you don't know where that is. So the shapes of the usage are quite different from what we have had before. And what's also interesting is that when it's follow the sun, even if you have a high-growth company, you can sort of predict your usage patterns and have enough capacity for that, because it grows in a way you can project. When you have companies doing evals and RL, they're super spiky. So they're gonna come in, and it's like, “We're gonna use nothing, then can we have 100,000?” Right? And then go back down. And then 100,000, go back down. So it's very different, right?
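To make the two usage shapes concrete, here is a minimal editorial sketch that compares a smooth follow-the-sun curve with an all-or-nothing burst. The function names, the hourly granularity, and every number except the 100,000 peak mentioned in the conversation are assumptions chosen for illustration; this is not Daytona's capacity model.

```python
import math

HOURS = range(24)


def follow_the_sun(hour: int, low: float = 2_000, high: float = 10_000) -> float:
    """Smooth daily curve: lowest at midnight, highest at noon (hypothetical scale)."""
    return low + (high - low) * (1 - math.cos(2 * math.pi * hour / 24)) / 2


def square_burst(hour: int, start: int = 0, length: int = 3,
                 peak: float = 100_000) -> float:
    """All-or-nothing run: full concurrency for a few hours, zero otherwise."""
    return peak if start <= hour < start + length else 0.0


def mean_utilization(load: list[float]) -> float:
    """Average load divided by peak load, i.e. utilization if capacity equals the peak."""
    return sum(load) / len(load) / max(load)


sun = [follow_the_sun(h) for h in HOURS]
burst = [square_burst(h) for h in HOURS]

print(f"follow-the-sun utilization: {mean_utilization(sun):.1%}")
print(f"square-burst utilization:   {mean_utilization(burst):.1%}")
```

If capacity has to cover the peak, the square-shaped load leaves most of it idle for most of the day, which is the capacity-planning problem described in this part of the conversation.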
And.
Swyx [00:22:09]: Do you want to lock them into commits so.
Ivan [00:22:11]: Yeah, we do.
Swyx [00:22:12]: Yeah, okay.
Ivan [00:22:12]: We have to lock them into some sort of commits to have that capacity, because basically we have to have the capacity for peak. Right? And so right now, Daytona's mean utilization is 15%, 1-5.
Swyx [00:22:25]: Oh my God.
Ivan [00:22:26]: So it's very low.
Swyx [00:22:27]: Because it's very spiky.
Ivan [00:22:27]: It's very spiky, but we get up to 90%. So we have these things. And what we're looking at right now as a company is similar to Cloudflare, where you can geo-move things around, but that works really well for the background agents, where it's follow the sun. But this is not. It's a very different shape. Obviously with scale you figure these things out, but that's an interesting new problem that we have as a compute provider in the agent space. And when we were doing the conference recently, we talked to Nikita from Neon and.
Swyx [00:22:57]: I should bring it up.
Ivan [00:22:58]: Parag from Parallel and whatnot, and everyone has the same problem. The usage is super spiky, and this is something that has not happened before. The amplitudes were never this high, right? So it's quite an interesting use case and problem to solve.

Compute Conference and Spiky Agent Infrastructure

Swyx [00:23:12]: Yeah, I don't know if we're gonna bring this up again, but let's just talk about the conference. You had like 1,000-something people at the Warriors game, at the. Sorry, where is it? What's.
Ivan [00:23:22]: Chase Center.
Swyx [00:23:23]: Chase Center.
Ivan [00:23:23]: Chase Center.
Swyx [00:23:24]: I went. It was very impressive. Obviously, you know how to throw a conference. What did you learn?
You pulled together all these impressive names.
Ivan [00:23:33]: What I.
Swyx [00:23:34]: What were you looking for?
Ivan [00:23:35]: My thesis behind the Compute Conference was, let's bring together people that are building infrastructure for AI agents. Because when I think of what we're building, the agent is the primary user, so what are the ergonomics and usage patterns of agents? And what I found, and this was a theory, it wasn't proven, is that we all have these problems, as I touched on. As I was talking on stage, it was like we all have the same underlying infra problems, which is these spiky, unpredictable workloads that we've never had before in human compute or human infrastructure. And again, it's the same when I was talking to Parag or when I was talking.
Swyx [00:24:20]: Lynn. Nikita.
Ivan [00:24:21]: Lynn, Nikita. Lynn especially, I was talking to her the other day as well. It is a very interesting type of problem to solve. I can touch on Cloudflare because there's been a lot of talk recently about how they solve that, which is they have a bunch of geos, and basically, as users work in different places, and depending on your tier, they can move you around the geos. And that's how they get the higher utilization. But you can sort of predict these. You'll rarely get a spike that is 10 orders of magnitude. Let's say one of your customers has an exponential curve. What is that? I'm using Cloudflare as an example. 10%, 20%, whatever it is. I don't have this data, I'm just assessing. It's surely not 10x, right? And so how do you go out and solve this problem? We're all solving this in different ways. So we have.
Swyx [00:25:11]: She also has the same thing.
Ivan [00:25:12]: Yeah, I know specifically that Neon had that issue as well.
Like how are we solving these spiky loads and things like that, 'cause we talked about it. And so the interesting thing for me to actually internalize was, yes, everyone that's building for agents first is going through this, and we're all solving similar problems, which is quite.
Swyx [00:25:28]: Let me double-click on this. Okay. So for example, Neon, I happen to know that they're very S3 oriented, right? They're fully bet on S3. And you get to benefit from S3's distribution and infrastructure. So I would imagine that Neon doesn't have to care, whereas Lynn maybe has to care a bit more because obviously she's doing GPU inference. And, for listeners, we did an episode with her one and a half years ago. And you have to care. But like, right?
Ivan [00:25:54]: Parag cares for sure, and Nikita.
Swyx [00:25:58]: And Parag is CEO of Parallel.
Ivan [00:25:59]: Parallel, yeah.
Swyx [00:26:00]: Former CTO of Twitter.
Ivan [00:26:01]: Twitter, yeah.
Swyx [00:26:02]: They are the search.
Ivan [00:26:03]: Yeah, they're search, yeah.
Swyx [00:26:03]: You and I know, but the listeners don't know.
Ivan [00:26:08]: Yeah, we can put it down on the screen, 'cause when we were talking.
Swyx [00:26:11]: I'll put it up on the screen.
Ivan [00:26:12]: Yeah, right.
Swyx [00:26:12]: People can look it up if they need.
Ivan [00:26:14]: Look it up. And, yes, but they still have CPU and RAM allocation that you have to have up and running. So CPU and RAM, you have to allocate that and have that ready. And so there's basically two ways to do it.
One is you over-provision so you can handle the bursts, or two, you basically have, I don't know if this is a term, just-in-time compute, which is, as your usage comes in, you can fire off requests for VMs or bare metal at other cloud providers and then get them up and running.
Swyx [00:26:43]: This is if you go above 100%, right?
Ivan [00:26:45]: Yeah, this is.
Swyx [00:26:46]: Like your overflow.
Ivan [00:26:46]: Your overflow, spillage, or whatever you do.
Swyx [00:26:48]: You probably lose money on it, but it doesn't matter, right?
Ivan [00:26:50]: Well, you might, you might not. That is a more cost-effective way to do it, but it's a slower way to do it. Because basically what you have to do is queue your requests, spin up this just-in-time compute, get it all ready, provision it, and then get your workload there. And if the time isn't that important, that's fine, and you can do that. But especially for, let's say, the RL training runs, the reason why a lot of people come to us is because GPUs are more expensive than CPUs, right? So you want your GPU running at, what, 100% the entire time. And so when you're running runs on CPUs, when the CPU cycle is down and spinning up the next one, you want that to be instantaneous so that your GPU doesn't go down, right? And if you then have to go out and provision machines, you're essentially telling the GPU that it has to wait, and that's incurring cost. So there are things that you have to try to solve for there.

RL Workloads, Declarative Images, and Kubernetes Replacement

Swyx [00:27:43]: Yeah, let's talk about the different workload, right? You said that, what was it? A few months ago, you had zero RL workload and now it's 50%.
Ivan [00:27:52]: It will be this one, 50%, yeah.
Swyx [00:27:54]: Let's talk about how different it is, right?
I imagine, for example, a lot less dynamic generation of arbitrary code. Here, it's probably all the same code. You're just doing parallel runs or something, I don't know.
Ivan [00:28:05]: Yeah. It depends. For each run, you'll have a snapshot. And for the most part, they actually do use our declarative image builder, which is like, “Oh, the agent wants these dependencies, these env vars.”
Swyx [00:28:17]: These ones, yeah.
Ivan [00:28:18]: Yeah, the declarative image builder, it.
Swyx [00:28:20]: Which is a very Modal-like thing that they.
Ivan [00:28:22]: Yeah. And so we build it on the fly and then we propagate that snapshot, and you can spin up as many sandboxes as you want against that snapshot. And then if you have to make changes, the model can, or it could also be automated. It's like, “Oh, now for the next run, we need to install these things or remove these things or whatever to get a task done,” and then it goes off and runs that. So yes, that is something that it seems they prefer. Let's take a step back. What we are competing against in that environment is essentially managed Kubernetes. So EKS, GKE, whatever. That is what the vast majority run on. And anyone that has tried Daytona versus GKE or EKS is like, “I'm never going back.” That has always been the case. There are a few reasons. One is the ergonomics. If you're using Kubernetes to spin that up, you have to essentially manage the interface interactions with that. Daytona, although it's a compute provider, is more akin to a Twilio or a Stripe from a consumption perspective than it is to an AWS. You have an API, an SDK, it's quite easy and seamless to get these things up and running. That's one. The other is the speed at which we spin up, which we mentioned earlier, which is much faster, and the scale to which we can go.
We haven't got into features, but an interesting feature is that it's very hard to OOM, or run out of memory in, our sandboxes, because we can dynamically, on the fly.
Swyx [00:29:48]: Resize.
Ivan [00:29:49]: Resize, which is impossible on almost any other thing. There are some technologies that enable you to do that, but it's a very hard thing. And we actually saw this when the Terminal-Bench team brought us in, actually. So thank you, Alex and the team, who brought us into this whole space.
Swyx [00:30:05]: It's just very rare that a framework would just say, “Guys, just use Daytona.”
Ivan [00:30:11]: Yeah, I think it says it somewhere. Yeah.
Swyx [00:30:13]: Yeah. I was like, “What is this?”
Ivan [00:30:15]: There are multiple there, but they also mention a few other places. And so Daytona specifically, just jumping on themes here, I don't know where it says Daytona.
Swyx [00:30:27]: I, there.
Ivan [00:30:27]: Doesn't matter.
Swyx [00:30:28]: There's a very strong recommendation, which is very unusual.
Ivan [00:30:33]: We do not pay them for this, just.
Swyx [00:30:34]: I know, yeah. They just like you.
Ivan [00:30:35]: Yeah, they like us. And also, Daytona has multiple isolation sets underneath. The customer doesn't have to know what they are. But basically we have Docker, which is a container, that's hardened with Sysbox. So it's Docker isolation that is the security equivalent of a VM, but it's still a container. And that is the default, and especially in these training workloads, they really like being able to use just a basic Docker container as an interface, and we enable Docker-in-Docker. So for these RL runs, if you need to do a Docker Compose or Kubernetes, you can spin up a k3s inside of these things, which unlocks a huge amount of workloads that you cannot do on other providers. So just on that part it is much more interesting.
And so we went through that. We showed them that we could do that, and they enjoyed that quite a bit. They being the Terminal-Bench people.
Swyx [00:31:28]: Those people, yeah.
Ivan [00:31:29]: And Harbor people.
Swyx [00:31:29]: Harbor people, are they a company yet?
Ivan [00:31:33]: As far as I know, I do not know.

Customer Pull, Slack Connect, and the Computer Use Bet

Swyx [00:31:35]: Okay. All right. Yeah. It's super obvious that there's a lot of excitement and success around these things. Okay, so yeah, tell us more, right? This is an exploding workload, and Harbor adopted you, which helped speed things along. But what are you learning as this new workload comes online?
Ivan [00:31:53]: There are a couple of things that we learned, which we chatted about in the beginning. This has led our story, as we mentioned: we talked to a lot of customers along the way, and we add more features and more tool sets as we talk to customers. And I think it's that the ecosystem is so small and/or the models get smarter, where when we see one user come with a request, we know it goes on the roadmap if three to five customers come with the same request in that week. It's very bizarre. It happens so many times, which is.
Swyx [00:32:27]: Because they're all friends.
Ivan [00:32:28]: Sorry?
Swyx [00:32:28]: They're all friends. They're all in the same group chat.
Ivan [00:32:30]: Yeah, probably, yeah. 'Cause they're like, “Oh, can you do this?” And I'm like, “Okay, this is interesting. We'll put it on a feature request.” And then the next one's like, “Oh, can you do this?” “Okay.” It's always the same, right? And so what I personally try to do is be on as many quote-unquote “sales calls” as I can. I'm in every Slack channel. We literally have about 1,000 Slack Connect channels, something like that.
There are so many interesting things you find out when you have all the Slack channels. You can also see where people transfer between companies. You see leave Slack channel, enter Slack channel. It's an interesting thing. Also, I digress, but I feel that Slack Connect is literally what LinkedIn should be. You have a list.
Swyx [00:33:08]: LinkedIn charges you to use your own connections, but Slack doesn't, right? Slack is like, do it for free. It's more lock-in. It's great.
Ivan [00:33:15]: Yeah. It's amazing. Yeah. It's one of the reasons.
Swyx [00:33:17]: You're gonna pay Slack for life.
Ivan [00:33:18]: Exactly. You're there for life. So that's interesting. And one of the newer things we were talking about earlier is that we made a big bet and put a lot of investment into computer use. That has not seen the light of day publicly. We haven't GA'd that yet, but we have.
Swyx [00:33:32]: Is there a thing I can pull up?
Ivan [00:33:33]: There is computer use there. It's right up a bit.
Swyx [00:33:36]: Oh, yeah. Okay.
Ivan [00:33:38]: What we talked about, and what we've seen publicly, is there's this theme now about the human emulator. Elon from xAI has talked about this publicly. If you think about the models today, they're actually quite sophisticated and they can do a lot of work, but they still don't have access to all the tools. I'm a strong believer that the most efficient way for an agent to work is essentially headless, or through a terminal or whatnot. But if we look at knowledge work in general, there are about 100 million knowledge workers in the US, about a billion in the world, and their salaries aggregate to 10 trillion in the US, 50 trillion worldwide.
Swyx [00:34:24]: Wow.
Ivan [00:34:25]: Something like that. And if we look at the five most important sectors of that, so healthcare and government and financial services and whatnot, that's about 56% of that.
So let's say it's about half of that. So in the US it's about 25 trillion, and most of that work is actually still locked into legacy apps inside of Windows, which is not going anywhere for a very long time. People just won't invest in that. How much of it? Our assumption is the following: in the RPA market, which is a similar market, well, not the same, 25% of these white-collar workers' work is automated. If an agent is more sophisticated, can go through more runs, figure stuff out, let's say it's 40%, right? And if you take 40% of that, you get to essentially $10 trillion a year.
Swyx [00:35:17]: That's a TAM.
Ivan [00:35:18]: That is a TAM. So that's the TAM of the models, right? That's not essentially ours. But you get to that size, and to be able to do that, you essentially have to give agents these computers with the legacy apps. So computer use, either Mac or Windows or Linux. Linux we obviously have, and others have. But Windows specifically is something very new, and the only option right now is an EC2 with Windows, or Azure. Both of them take anywhere from three to five minutes to spin up. We've created an actual sandbox, so it's a second instead of milliseconds, but you have point-in-time snapshots, you have forking, you have all the things that you have from a sandbox, and it essentially enables you to hopefully unlock all this value. And so that's been our big push and bet, but we've sort of kept our ear to the ground. What are the next things in the market?

RPA Returns: Why Agents Still Need Computers

Swyx [00:36:06]: Yeah, knowledge work, and building sort of the next wave of RPA. I got very excited about RPA during COVID times. UiPath was IPO-ing. And it was very hot. Isn't it Eastern European?
Ivan [00:36:20]: It is, Romanian.
Swyx [00:36:21]: Romanian? Yeah, it might be the only big Romanian unicorn. Okay, yeah.
I think there's a stage being set for the resurgence of RPA, 'cause everyone understands that, yeah, no one wants to deal with these shitty apps and no one's gonna rewrite them. You just have to do remote, programmatic operation of them.
Ivan [00:36:45]: If you wanna unlock it, my own setup was basically the following. I was doing a board deck recently, last month, whatever, and I'm like, “Okay, let's just do it automated.” So all our data's in ClickHouse and PostHog and QuickBooks, where everyone else's is, and I basically connected that all to my Claude Code. Here are the integrations, go off and do that. It pulled out the first report, which was great. It connected to Brex and all these things, pulled it, which was great, and then I said, “Okay, now pull out this, and this,” and I kept getting really well-designed, McKinsey-style reports, but the data said partial data. Missing data, partial data. It couldn't access all the things, and I got so frustrated. So I got my Mac Mini virtual sandbox with OpenClaw. I gave it its own account in our company, and then I went to all these services and created a read-only account, so literally like an intern in your company. And I would say, “Now go and do this report,” and it would get the same, or like, “I can't get all the information via the MCP or the API or whatever.” I'm like, “Go log in.” And it will log into the website, go in, export the data, and do the thing end to end. So even for things that have APIs today, not all of it is exposed. I get immense value right now, but it has to be computer use, unfortunately, and so I spend a bunch of tokens just on that, but I get the job done.
And so if even a startup like ours, using all the hottest tools, still needs a computer agent, what hope does Goldman have to go headless, right?
Swyx [00:38:22]: Yeah. Why isn't Microsoft doing this?
Ivan [00:38:27]: I'm pretty sure Satya had a post yesterday.
Swyx [00:38:29]: Oh, okay. I see.
Ivan [00:38:29]: Which was like, “Every agent needs a computer.”
Swyx [00:38:31]: I see, I see.
Ivan [00:38:32]: So they have launched something recently.
Swyx [00:38:34]: Yeah, they have Microsoft Power Automate. I'm sure they're gonna have their version.

macOS Sandboxes, Apple Constraints, and the Windows Opportunity

Ivan [00:38:39]: Version of that, yeah.
Swyx [00:38:39]: You're gonna try to do yours. I know there's always demand for Mac, but I know it's tricky to host macOS sandboxes.
Ivan [00:38:49]: We will have macOS sandboxes fairly soon. The problem with macOS sandboxes is, I'm deep in this, I don't know how interesting this is.
Swyx [00:38:55]: No, it's.
Ivan [00:38:56]: macOS has this problem.
Swyx [00:38:57]: It's a licensing thing, right?
Ivan [00:38:58]: Licensing thing. So one, you're allowed to run only two parallel VMs per machine. Two, you can only license to a different user every 24 hours. So theoretically, if I wanna charge you per second and I charge you for one second, I have to have it idle for the rest of the day. I can't have anyone else using it. So the pricing will be different in the sense that we would have to charge for 24 hours, and that's not even the most difficult thing. The thing above that is, from a security perspective, they enable you to do memory snapshot, pause, resume, but only on the same physical drive, physical machine. What you can do in the Windows world or Linux world is move your snapshot in the background from one machine to the other and manage load, right?
Here, if you wanna do that, you essentially have to have your.
Swyx [00:39:49]: Yeah, snapshots. Yeah.
Ivan [00:39:50]: Your.
Swyx [00:39:51]: It's like.
Ivan [00:39:51]: Physical machine.
Swyx [00:39:52]: You can't break it up.
Ivan [00:39:53]: You can't move things around that, and all of that is, that part is, from a security standpoint, if it is written. Like, I understand the security aspect of that, but it disables you from doing these really scalable agentic workloads.
Swyx [00:40:08]: You need to do a vibe-coded, clean room implementation of macOS that you can then. That's like Clean OS or something. I don't know.
Ivan [00:40:17]: So. We have.
Swyx [00:40:18]: 'Cause Linux was originally like a clean room rewrite of Unix.
Ivan [00:40:21]: Okay. Yeah.
Swyx [00:40:21]: Or something like that, right? Same thing for macOS. Someone needs to do it.
Ivan [00:40:25]: Someone will do that, and someone will have some long-running agents for a few days to figure this stuff out. But yeah. So we're definitely really close to offering something 'cause people do want it, but the pricing will be different, and the feature set will be sort of stringent.
Swyx [00:40:38]: Yeah, nobody's gonna use this. Like, the labs will because they want to automate macOS.
Ivan [00:40:42]: They have to do RL. They have to do RL again. So the point with the RL part is, if you do RL on macOS, then when the next iteration of the model comes out, it will be able to use these tools significantly. Then you actually need to run that somewhere. So you're gonna have to have that later on. And if anyone at Apple is listening, I very much feel that they are shooting themselves in the foot on the scale of the revenue from compute or licensing they could get if they would just enable a concurrency model similar to what you can get on Windows and Linux.
Swyx [00:41:17]: Yeah. Yeah. And I'm sure they've heard this before.
They just don't care. And maybe they will change their mind with the new CEO.

Ivan [00:41:24]: Yeah. We'll see.

Swyx [00:41:25]: We'll see.

Ivan [00:41:25]: High hopes.

Swyx [00:41:26]: High hopes.

Ivan [00:41:26]: High hopes.

Swyx [00:41:27]: Okay. It's very clear the market opportunity is huge in Windows, and you can go for a long time on just Windows, but your customers are gonna want both. And it is interesting to me that this is the sort of God application of agents, right? How big was OpenClaw for you guys? Was there a significant bump?

OpenClaw, Agent Labs, and the B2B2C Sandbox Market

Ivan [00:41:54]: Not for us because we.

Swyx [00:41:54]: Because you already.

Ivan [00:41:55]: We're kind of positioned differently. Although it's completely PLG and we have individual developers that use it, most of the users of Daytona are either B2B or B2B2C. In the researcher world, it's B2B, so you're selling to labs and neo labs and things like that. But on the long-running agents, from a scaled revenue perspective, it's mostly B2B2C, where you have an app-layer agent that uses you at a big scale.

Swyx [00:42:26]: Like a Manus. Yeah.

Ivan [00:42:28]: Like a Manus, Lovable type of thing.

Swyx [00:42:31]: Yeah. B2B2C is basically what I've been calling an agent lab: you're not a model lab, but you're making a very good wrapper that is a platform other people can sign up for so they don't have to code those things. It sounds like a much better market than the direct OpenClaw market.

Ivan [00:42:56]: I've done multiple things. The CodeAnywhere part of our career path, earlier on, was very much an end-user developer product. And so that is great.
You can get a lot of developer love, and I feel that we do as a company have a bunch of developer love. But it's a different type, where it's people building these things. Again, it's more akin to a Twilio, because as a person, you wouldn't run Twilio. I don't know how many people remember the “Ask your developer” billboard and whatnot. People really love Twilio, but they only used it inside of, “Oh, I'm building this app or service.” And so we're very much like that. And you also know that I used to work for a competitor of Twilio, so it's kind of ingrained in my DNA.

Swyx [00:43:35]: People don't know Infobip is that big.

Ivan [00:43:38]: Yeah, it's.

Swyx [00:43:39]: Because.

Ivan [00:43:40]: It's a billion euro.

Swyx [00:43:40]: They're all American. They're like, “Whatever's in Europe doesn't matter to me.” But is it the same size or bigger? Same size?

Ivan [00:43:46]: It's about half the size.

Swyx [00:43:47]: Half the size?

Ivan [00:43:48]: Yeah, about half the size.

Swyx [00:43:48]: It's like, yeah.

Ivan [00:43:48]: Still huge. Multiple billions a year. Yes.

Swyx [00:43:51]: That's crazy.

Ivan [00:43:51]: Exactly. These are really interesting, large revenue-generating, very sticky businesses. Whereas when your focus is the end developer, it is a very hard sell because they're very price sensitive, very price conscious. And it's very hard to scale. Your cap is the number of people that, first of all, wanna spin that up, and then spin up multiple of these. Whereas in the enterprise, everyone's talking about how many tokens they're spending. A lot of companies today are like, “If this is our company, spend as much as you can.” Basically that is where we're going.
And so if you think about that paradigm, where you're selling to companies that say, “Spend as much as you can to generate productivity,” versus, “Oh, I'm a single person. I have this much budget, and I'm doing this thing because it's fun or it's helping me out or whatever.” It's a different go-to-market strategy, I think.

MCP, CLIs, and Sandboxes as the Agent Runtime

Swyx [00:44:50]: Yeah, there's a lot of discussion. I'm just kind of going through the mental list of things that are in your favor, which is, for example, MCP versus CLI. Obviously you want CLI. It's been very good for you. I feel like it's maybe a drop in the bucket or maybe it's huge. I'm just checking whether these are big trends.

Ivan [00:45:10]: Those things work well in our favor, to your point, just because every.

Swyx [00:45:13]: They're kind of drop in the bucket, right?

Ivan [00:45:15]: I think it's sort of all the things coming together. There are so many things that impact that. To your point, OpenClaw wasn't huge for us, but having the Agent SDK from Anthropic, or Claude Code, was very interesting. The reason it was interesting is that a lot of, I don't know what to call them, app-layer agent companies are essentially like, “Oh, I can create this new app, this new agent. I just use Claude Code, and I throw it into a sandbox, and then I have my interface to the human.” That enabled so many more companies to actually offer this, and then they would pull on sandboxes. So that was interesting. And to your point on MCP versus the CLI: MCP is an interface against an API, whereas with the CLI you can actually go do things. That's the difference between integrations and actually running scripts or analysis against a thing.
So being able to use a CLI very well enables the agent to do more things. People will invoke a sandbox, run things through the CLI, and it'll do analysis on that data and then give you an actual result, versus just pulling data from an API source.

Swyx [00:46:29]: Yeah, it's a layer of indirection, basically. It's the same thing as agentic search versus RAG, where you're.

Ivan [00:46:34]: Exactly, yeah.

Swyx [00:46:34]: You just win whenever people put more agents into their workflow. So it doesn't really matter, but I'm just kinda teasing out what else people have heard about, like, “Oh yeah, this is another sandbox use case. Oh yeah, that's another one.” Am I missing any big ones?

Ivan [00:46:51]: The computer use stuff, which I think is probably the most interesting one. To your point, we've talked to so many people over the last year. It's like, “Oh, why do you need a sandbox? Why do you need this?” And it's like, “Oh, I need a sandbox for this. I need a sandbox for that.” It's like, “Oh, I need it for every single thing.” And so what I say, and it sounds like a broken record, is: you use a laptop every single day, right? And you are n of one. It's just you. And by the way, the PC market is about equal to the cloud market in total. It's about 150, 180 billion a year, something like that. Roughly, the three cloud hyperscalers are about equal to Apple, HP, Lenovo, whatever. It's a little bit less, but it's sort of like that. So how big is the addressable market? How many people are there in the world now? What's the latest data?

Swyx [00:47:45]: Let's call it eight billion.

Ivan [00:47:46]: Eight billion.
And so let's say you can have two computers, one personal and one business, whatever. So it's double that, right? That's 16 billion. How many agents are gonna be running in two years, in 10 years, in 100 years? And for every single task, they will need one of these. So how big is that? That market is essentially, quote unquote, “infinite”. And Dylan Patel from SemiAnalysis, who usually talks about GPUs, was at the conference talking about how CPUs will now be a bottleneck, because they will be the constraint. We won't be able to have enough of these because there won't be enough CPUs to basically do.

Swyx [00:48:23]: Yeah. Well, I actually had a really good podcast with Doug O'Laughlin, who is the president at SemiAnalysis, where they've basically been like, yeah, it's been a GPU shortage first, but then it's cascaded down to memory and now to CPUs.

Ivan [00:48:35]: CPU, yeah.

Swyx [00:48:35]: What's next? Networking. Networking actually has been in shortage for a while if you're looking at just GPU networking. But yeah, it's really crazy the amount of computer use that's going on. Cool. My other question, and it's one very big part, is the open sourceness, which you didn't have to do and your competitors don't do. A lot of people are worried about keeping their projects open source because some competitor can just fork it. I don't know if there are any reflections on just being an open source company.

Open Source, Trust, and Enterprise Procurement

Ivan [00:49:15]: Yeah. There's a bunch. So the original product that we did was open source.

Swyx [00:49:19]: Yeah. CodeAnywhere.

Ivan [00:49:20]: So doing that was actually very good for us. There's basically a saying. What's the saying?
Like, companies that are doing really well measure themselves against free cash flow, ones that are kinda okay, it's EBITDA, and then it goes all the way down.

Swyx [00:49:36]: The worst is like GitHub stars.

Ivan [00:49:37]: GitHub stars. GitHub stars are the worst, yeah. So you go all the way down to GitHub stars. And our original one was GitHub stars. That's what we talked about. Now we're at the point where we're talking about revenue, so we've gone up the stack on that. And so we started.

Swyx [00:49:47]: No profit.

Ivan [00:49:48]: Yeah. We'll get there. We'll get there. But basically at that point we did stars on GitHub and it was useful, and in the original variation that we did, we split the core into its own repo and it was Apache 2.0, so very permissive. And then we basically would bundl
This was recorded before Railway suffered a major GCP outage on May 19, despite being a multi-AZ, multi-zone mesh ring with HA fiber interconnects between their metal, GCP, and AWS, because workload discoverability was unintentionally still tied to GCP. All has been resolved with a post-mortem.

Railway did not start as an AI infrastructure company.

It was founded in 2020, years before agents became the default way people thought about deploying software. Jake Cooper, formerly at Bloomberg and Uber, started Railway with a simple obsession: the activation energy to ship something to production should be near zero. Push code, get a URL, iterate. No Docker files, no Kubernetes manifests, no Ansible scripts stacked on Ansible scripts.

For years, this was a slow grind. Railway spent its first 18 months hand-acquiring its first 100 users, with Jake personally greeting every Discord signup on a second monitor.

Today, Railway has raised $124m and is growing very fast. A 35-person team supports 3 million users, adding roughly 100,000 signups a week. Their bare metal data centers have a 3-month payback period vs. renting in the cloud, with 70% margins funding aggressive cloud bursting when needed. The servers they own have actually appreciated in value as RAM prices have climbed, meaning that cash in the bank plus the value of their hardware now exceeds the capital they've raised.

From rebuilding Railway's network overlay over a weekend to moving the vast majority of workloads onto its own bare metal data centers, Jake Cooper is trying to build a new cloud for an agent-native world.
In this episode, Railway's founder and “conductor” joins swyx and Alessio to unpack why the next era of software infrastructure is not just “Heroku but newer,” what agents need that humans did not, and why the old deployment loop of Git, PRs, CI/CD, and static cloud resources may be heading for a rewrite.

We go deep on Railway's infrastructure stack: own-metal data centers, three-month cloud payback periods, cloud bursting, data center debt, Railpack, Nixpacks, Temporal, feature flags, Central Station, content-addressable filesystems, agent-safe production forks, and why the CLI may become more important than the canvas in an agent world. Jake also shares the founder journey behind Railway, how the company survived losing $500K/month, why it now serves millions of users with only 35 people, and why he believes the pull request is dying.

We discuss:

* How Railway went from a slow six-year grind to adding 100,000 users a week
* How Railway thinks about agents as the next dominant software species
* Why agents need version control, observability, compute, storage, and orchestration at 1000x scale
* The economics of Railway's own-metal data centers and three-month payback
* How Railway uses cloud bursting while scaling its own infrastructure
* Why data center debt can be a better tool than venture debt for infra startups
* Central Station, Railway's internal system for clustering customer feedback and incidents
* Why responsible disclosure and over-communication matter for platforms
* Why feature flags, progressive rollouts, and shadow traffic are essential for agents
* Temporal's strengths, pain points, and why workflows matter for agents
* Railpack, Nixpacks, Nix, and lazy-loaded content-addressable filesystems
* Why “cattle, not pets” may change if you can clone the pets
* Why Railway is building a new cloud from scratch instead of copying hyperscalers
* The solo founder path, focus, writing, and how Jake thinks about company building

Railway:

* Website: https://railway.com/
* X: https://x.com/Railway

Jake Cooper:

* LinkedIn: https://www.linkedin.com/in/thejakecooper/
* X: https://x.com/JustJake

Timestamps

00:00:00 Introduction: What Is Railway?
00:02:07 Jake's Path to Railway
00:06:13 Railway's Six-Year Growth Story
00:08:52 Rebuilding the Business After the Free Tier
00:11:17 Agents as the Next Software Platform
00:13:29 Railway's Infrastructure Philosophy
00:15:42 Bare Metal, Cloud Economics, and the Compute Crunch
00:17:22 Cloud Bursting and Five-Cloud Networking
00:20:20 Data Center Debt and Infra Financing
00:23:31 Data Centers in Space
00:25:24 What Agents Need From Infrastructure
00:28:24 CLIs, Canvas, and Agent-Native UX
00:35:15 Central Station, Incidents, and Responsible Disclosure
00:40:30 Safe Rollouts, SRE Agents, and Production Forks
00:45:00 AI SRE, Specs, Code, and Tests
00:48:24 Self-Replicating Infrastructure and the New Serverless
00:53:18 Heroku, Temporal, and Workflow Engines
01:04:07 Railpack, Nixpacks, and Lazy-Loaded Filesystems
01:06:01 Coding Agents, Token Spend, and Roadmap Acceleration
01:10:56 The Pull Request Is Dying
01:12:28 Feature Flags and the Agent-Era SDLC
01:16:15 Cattle, Pets, and Cloning Machines
01:19:29 Solo Founder Lessons
01:24:12 Focus, GPUs, and Building a New Cloud
01:28:20 Closing Thoughts

Transcript

Alessio [00:00:00]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space.

Swyx [00:00:10]: Hey, hey, hey. Today we're in the studio with Jake Cooper of Railway.

Alessio [00:00:14]: Conductor of Railway.

Swyx [00:00:15]: Conductor at Railway. Yeah.

Alessio [00:00:16]: Choo-choo.

Swyx [00:00:17]: Do you actually have that anywhere, like on your business card?

Jake [00:00:20]: We call some of our volunteer moderators conductors. I don't have a business card. We're not that big yet. At some point I will.
I got handed a nice business card from the Supermicro folks, and I was like, “Damn, this is pretty official.”

Swyx [00:00:30]: Business cards are coming back.

Jake [00:00:32]: They're cool. They're hip. The conductor thing is good. We're trying to figure out what we want to call each other internally. Some people think it's super cringe and say, “You don't need a name for people internally.” Some people want to call each other something. We still don't have a really good one.

Jake [00:00:55]: We've got New Railcrews, Trainiacs. Nothing has stuck yet.

Swyx [00:01:00]: I like Trainiac. Trainiac sounds good. Railwayians. For those who don't know, what is Railway? Let's give people a crisp definition up front.

Jake [00:01:09]: Railway is the easiest way to ship anything. You go to the canvas, or you talk with Claude, and you say, “Deploy a Postgres instance, deploy my GitHub repository, run this code,” and you're off to the races.

Swyx [00:01:22]: You've got a nice animation on the landing page.

Jake [00:01:24]: Thank you. None of my work, by the way. They don't let me touch the design stuff anymore.

Jake [00:01:25]: We want to make it trivially easy not just to deploy things, but to evolve applications over time. Most tooling right now stacks entropy on top of entropy: Docker, Kubernetes, Ansible scripts, and all these other things. If we can version all of your software and keep track of all the changes, then we can make it trivial to clone environments, fork into a parallel universe, get copies of production data, get copies of any services, make changes, validate them, and collapse them back in without reproducing everything across a staging environment.

The Railway Origin Story: From Uber Systems to a New Cloud

Swyx [00:02:07]: I was looking at your background: Bloomberg, Uber. Nothing immediately stands out as, “This guy is going to found the next great platform as a service.” What prepared you for Railway?

Jake [00:02:21]: It was curiosity to keep going deeper.
I started out on front-end stuff, working on Wolfram Mathematica and porting it over. Then I briefly moved to Bloomberg, then toward Uber and distributed systems, taking the Jump Bikes systems and moving them to a distributed system built on top of Cadence, the pre-Temporal Temporal.

Swyx [00:02:44]: Which, by the way, I'm happy to talk about, pros and cons.

Jake [00:02:48]: Totally.

Swyx [00:02:51]: But let's do the Railway story.

Jake [00:02:52]: It has been a continual step of wanting an experience. Whether it's walking up to a bike, unlocking it, and having it work frictionlessly, or something else, the depth required to make that happen follows from the experience. A lot of the work I do, and a lot of the team does, is in service of that experience. We fundamentally don't care how deep we have to go. We will swim to the bottom of the swimming pool to get the experience.

Jake [00:03:17]: I don't have a physics PhD. I did an EECS degree. It has always been about figuring out the next step: how do we get there? That's what led to starting Railway for that experience and then moving all the way to bare metal data centers. I was adding patches to the kernel this week to get the experience there because I can see how much better it can be.

Swyx [00:03:49]: Other patches to the Linux kernel this week?

Jake [00:03:51]: Yeah. Not upstream. Our fork.

Swyx [00:03:52]: That's a flex. Railpack? No, this is different. This is the OS on top of Railpack?

Jake [00:03:57]: No, this is an actual kernel patch. It's always literally: what do we have to do to get that experience? Then figure it out. Anything is figureoutable.

Swyx [00:04:10]: Would you send the patch upstream, or does it not fit other use cases?

Jake [00:04:13]: Maybe. We have to work out the experience internally. It has to do with the storage layer we're building for some of the agentic stuff.
Maybe it'll be useful upstream, but it's deeply useful for us internally.

Open Source, Forks, and Non-Deterministic Versioning

Swyx [00:04:29]: You mentioned open source before. How do you think about starting from open source, and then coding agents letting you do a lot more from forks of it?

Jake [00:04:38]: GitHub's original sin is that it's almost a series of broken pointers. You have this thing, then you clone it, and now you've lost the whole upstream. How do we make it trivial for people to modify really small pieces of it?

Jake [00:04:51]: We think of Git in a discrete sense: I've either made a change and merged upstream, or I haven't. What would it look like if it were percentage-based, a little more non-deterministic, or a stream of changes that users traverse as a percentage rolled out in general and then rolled all the way up?

Jake [00:05:13]: We have the open-source kickback program and let you deploy templates because we want to make it trivial for people to version these shards over time. It solves a large problem around authentication, authorization, and security. NPM has a way to define, “Don't take any new packages.” The ideal end state is that you roll out progressively to users with the minimum impact zone and continue rolling up. JPMorgan should probably be the last one on the patch line, for all our sakes, because our money and livelihoods are there.

Jake [00:05:53]: It's okay if Johnny Vibe Coder gets a broken patch because there's so much entropy in the system that the rubber has to meet the road at some point. You have to test at varying levels.

The Long Grind: First Users, Free Tier, and Making the Business Work

Swyx [00:06:13]: I wanted to pull up this glorious chart, which is your usage or number of daily signups?

Jake [00:06:22]: Daily signups, I think.

Swyx [00:06:24]: You started six years ago. It was a slow grind, and now you're on a rocket ship.
You say, “Don't doubt your fight and don't quit.” Maybe pick out certain points that were key inflections for the company.

Jake [00:06:40]: At the start, it's about getting your first 100 users, hell or high water. We had a website and a support link. The support link was the Discord channel. I had notifications on with two monitors: the monitor I was working on and the other monitor with Discord. If anybody came in, I was immediately like, “Hey, how's it going?” It was rare, so getting those first 100 users to come back was the start.

Jake [00:07:14]: Then you build a consultancy factory because users want all these things. You have to go back to the board and ask, “What is the actual product offering I want to build on top of this?”

Jake [00:07:28]: VCs want charts that always go up and to the right, but in reality you don't necessarily want charts that look like that. For us, there have been periods of expansion where we add features to test use cases, and periods of compaction where we ask, “If the experience we have is good, how do we make it significantly better?” Maybe we strip out features that don't fit our ICP anymore.

Jake [00:07:57]: The boom from 2022 to 2023 came from the free tier. Everybody under the sun was using it.

Swyx [00:08:09]: A lot of Reddit bots and Discord bots.

Jake [00:08:12]: And crypto miners. When you build an open product on the internet where anybody can sign up, the internet is a horrible place with so many things. You go through periods of asking, “How do I reach as many people as possible?” Then, “How do I fit the exact use case for the people who really matter and are really excited about this specific thing?”

Jake [00:08:39]: Then there was a two-year period of making the actual business work. During the free-tier era, we were losing about half a million dollars a month.

Swyx [00:08:59]: On a $20 million bank account.

Jake [00:09:02]: On a $20 million bank account with maybe $50,000 a month in revenue. That's a horrible business.
I don't know how anybody invested. But you have to go through it and say, “We have an experience people love, but the business has to work.”

Jake [00:09:17]: There are two schools of thought. You can run the horrible business all the way up with bad margins, or you can go back and make it work. We've always wanted a super lean team. We're 35 people right now. It's very small.

Swyx [00:09:36]: Supporting three million already?

Jake [00:09:38]: Yeah. We're adding 100,000 users a week right now, so it's growing fast. We don't want to add headcount for the sake of headcount or throw bodies at problems. We want to build systems. It's hard to build systems during expansion because you're adding things to the system because people are asking for them or things are breaking.

Jake [00:10:00]: We had to cut off the free users for a little while, rebuild the business, and make sure it worked. We want to reach as many people as possible because software is important. It's become difficult to create things in the physical world, so it's important to make it easy for people to build in the virtual world and have access to creation. But there are legs to that journey.

Jake [00:10:30]: You can see divots in the charts. If you follow between 2025 and 2026, it's either summer or winter. People go on holiday with family.

Swyx [00:10:50]: It affects that much?

Jake [00:10:51]: Yeah. It's kind of B2C and kind of B2B. People are shipping constantly, then they stop. Our activation curve now shows more people activating on weekdays because we have more business users, so it smooths out over time.

Agents as the New Interface to Deployment

Swyx [00:11:17]: Was there a point where you started prioritizing AI development or agent development?

Jake [00:11:24]: We've prioritized agentic as a top-of-funnel thing.
Over the last six months, we've deeply prioritized agentic as a mechanism to build and deploy things because we believe the curve is so steep and that is how people will build and deploy software.

Jake [00:11:42]: It almost fundamentally doesn't matter whether this is dot-com or not because we're all on the internet anyway. If agents are going to deploy a bunch of things and we hit an inference wall at some point, we'll fix those problems. The dominant species over the next 10 years is that we've moved from assembly to C to C++ to JavaScript to words. You're going to need to close that loop.

Swyx [00:12:13]: When you say this is dot-com, did you mean buying the domain, or the general case?

Jake [00:12:17]: I mean the dot-com era, when companies had a huge run-up because people understood the internet was important. Then they hit bottlenecks, fundamental laws of physics, math didn't work, and everybody came back down to earth. But it didn't matter because the internet became so impactful. If you operate on a long enough time horizon, you should build these things anyway because you can see where it's going.

Jake [00:12:45]: That's where I think a lot of agent stuff is. You get to a point where you're running thousands of agents in parallel. What is the inference cost? What is the compute cost? How do you make that efficient? How do you coordinate all this? We have issues coordinating humans; we don't even have good tooling for that. Now we have to figure out how to get agents to coordinate, safely version changes, and know when to raise their hand for someone to intervene. Otherwise it becomes an interrupt factory.

Railway's Infrastructure Thesis: Network, Compute, Storage, and Metal

Swyx [00:13:19]: Let's go right into the technical side. What are the core infrastructure or architectural beliefs of Railway that allow you to do what you do?

Jake [00:13:29]: The primitives matter a lot for us. We need network, compute, storage, and orchestration around it.
You need control over a lot of those things. We've talked a lot about how we don't really use Kubernetes because we want higher-order control to place workloads in very specific places.

Jake [00:13:48]: The reason is that you have to be very efficient with agents: memory reuse and all these other things, or you're going to massively blow up your cost structure. Being able to rack and stack your own servers and build your own metal unlocks performance and cost. Experiences where you're running 1,000 agents in parallel are not massively cost prohibitive.

Jake [00:14:13]: Token use and compute use are blowing up. Over time, those things have to get a lot more efficient. You can get a lot of margin to make those experiences solid by building your own metal. That's all in service of offering a differentiated experience to as many people as humanly possible.

Swyx [00:14:51]: You have a data center in Singapore.

Jake [00:14:53]: Yeah. We have two in every other region now. In Singapore, we're adding a second one in Q3.

Swyx [00:14:58]: What's it like? I've never built a data center. Do you go to Equinix and say, “I want some slots?”

Jake [00:15:05]: Yeah. Equinix. You basically go and say, “I want power and I want a cage.” They say, “Great, here's what it's going to be.” You rent the cage for a period of time, fill it with racks and servers, and hook up internet to it. That's all the pieces.

Swyx [00:15:36]: Then you handle everything else.

Jake [00:15:37]: You handle everything else.

Swyx [00:15:39]: What's the math versus clouds doing it for you?

Jake [00:15:43]: If we rented in the cloud, our payback period when we go to metal is about three months.

Swyx [00:15:50]: Which is crazy.

Jake [00:15:51]: It's nuts. That's four years of depreciated hardware. You're going to see a lot of this compute crunch because hyperscalers are buying up a lot of stuff.
We're working directly with OEMs, resellers, and people building these machines: Supermicro, Dell, and others.

Jake [00:16:11]: Upstream, there's a bunch of supply pressure. When we raised our last round, between deploying capital for servers and now, the amount of money we've raised is less than the amount of money we have in the bank plus the value of the servers because the servers have appreciated as RAM has gone up. It's nuts how valuable hardware has become.

Jake [00:16:50]: If you look at hyperscalers, they deployed around $80 billion of capital expenditures this year, and next year will be more. That's a massive infrastructure build-out. You look at that and think it's crazy that they're spending way more than the Manhattan Project. But if every person is going to run dozens or hundreds of agents in parallel, you have no conceptual idea how much compute is required to make that experience happen, even if you're deeply efficient and sharing resources. And that doesn't even count inference.

Swyx [00:17:22]: How do you plan the build-out? The growth chart is so vertical. Are you usually at 100% utilization as soon as racks are live? How far ahead are you planning?

Jake [00:17:33]: We still maintain cloud presence for bursting. We work with AWS, GCP, and a few other clouds. We can rent, and then the moment we get space or power, we compact those workloads off the cloud. We started on the clouds, then built a system to migrate to our own metal. There's nothing that says you can't continually do that again, and that's exactly what we do. We never want to be compute constrained.

Jake [00:18:09]: At the start of the year, we actually became compute constrained because one upstream provider wasn't able to give us quota at the rate we needed, and the hardware was slower. I spent a weekend rebuilding our entire network overlay so we could straddle five clouds: Oracle, AWS, ourselves, GCP, and one other one.
We can do more than that now.

Jake [00:18:38]: We got into a spot where we were trying to pack instances tight because we couldn't get enough compute. That led to a few reliability issues, which are now past us. I made a tweet pointing out that it's becoming harder and harder to acquire compute at the rate these models need to acquire compute. We got bit by it.

Swyx [00:19:15]: How do you think about pricing knowing you might not have your own metal available at all times? Are you pricing assuming you need extra margin if you end up going into the cloud?

Jake [00:19:26]: Because we've built out our metal data centers, our margins on metal are around 70%. We can deeply subsidize the cloud business if we want to scale at a reasonable rate. We have a few levers: metal, which makes the margins; cloud burst; debt to buy servers; and venture capital. It's an interesting operational problem: how much cash do we have, how much should we raise, how quickly can we deploy it, and can we scale revenue as quickly as we scale compute?

Jake [00:20:05]: If we continue making it trivially easy for people to build and deploy, then the faster we close that loop and the more operationally excellent we are with capital, the faster the business can scale. It's almost a straight linear deployment rate.

Financing Infrastructure: Hardware Debt, VC, and Operational Leverage

Swyx [00:20:20]: I think infra startups raising debt is a tool people don't utilize enough or know enough about. What can you tell us about that? Is it secured against your CPUs?

Jake [00:20:32]: It's secured against our hardware.

Swyx [00:20:37]: What rates do you get? Who are the lenders?

Jake [00:20:39]: We pay prime plus a spread, and we can refinance any of the debt as rates go down. The terms are pretty good. The unfortunate thing is that Twitter has no nuance, so people say, “Venture debt bad.” But as with all things, there are specific tools and areas where you can be deliberate instead of using one tool as a hammer.
Venture capital is not the hammer for everything. You have to explore and figure out what works.

Swyx [00:21:12]: VC is usually the most expensive financing you can get.

Jake [00:21:15]: Yeah. I also think people think about VC incorrectly from a capital-raising perspective. Most people think, “How do I raise as much money as possible from whoever is probably the best I can get at that time?” That's close to right, but what we've tried to do is figure out what unfair advantage we can buy with that equity.

Jake [00:21:34]: It's the most expensive equity you're going to give away at that point in time, assuming the company keeps getting better. How do you use it to work with someone stellar who complements you? In the seed stage, I had never started a company. Ray Tonsing had good advice, and I could text him all the time. He was really fast. Awesome.

Jake [00:22:01]: Then with John and Erica at Unusual, they said, “You roughly know what you're doing building a product. We'll mostly leave you alone and be available for advice.” Amazing. Then we got to Series A and the business was an operational tire fire because we didn't know how to scale a business. Work with Erica, and Jordan is over at Redpoint, so bonus.

Jake [00:22:28]: Now we've raised from TQ and FPV as we're moving into enterprises. Every step of the way, we've asked: who can we partner with at this specific time to unlock the next section of the journey? I don't know enterprise sales. As an engineer, I can eyeball what features we might need, and we have wonderful people internally who can help. But you want boardroom dynamics where everyone is aligned and asking, “How do we win this?” instead of bickering about strategy.

Data Centers in Space and the Physics of Compute

Swyx [00:23:31]: You had a tweet about data centers in space. Why no data centers in space?

Jake [00:23:37]: It's not “no data centers in space.” My hot take is that I think it is solvable.
I've just never seen anybody solve it.

Swyx [00:23:49]: You said, “How are you going to dissipate that much heat in a vacuum?” You're making a physics claim.

Jake [00:23:55]: I haven't seen anybody prove how you're going to dissipate that much heat in a vacuum. It doesn't mean it's not possible. It just means nobody has brought it up yet.

Swyx [00:24:05]: Astrophage.

Jake [00:24:06]: I don't know what that is.

Swyx [00:24:07]: The Martian thing. Okay, you're very logical.

Jake [00:24:09]: It could work. A lot of people are putting the cart before the horse. They say, “We're going to put data centers in space.” Okay, but how? “We have time to figure it out.” It's like in The Martian where they ask how they're going to intercept something and say, “We'll figure it out.”

Swyx [00:24:36]: Making a bet on human invention is weird because you blindly trust that it can be solved. But with physics, there are first-principles bounds you can put on it. Maybe not. Maybe you're asking to travel through time or break a fundamental thermodynamic law.

Jake [00:24:57]: I don't know how VCs do this either. How do you know what's not possible and a grift versus what's possible but sounds completely insane? “We're going to put data centers in space.” Coin flip as to which it is, and I guess you'll know in 10 years. That's one cycle.

What Agents Need: Versioning, Observability, and 1,000x Scale

Swyx [00:25:23]: Moving back to agents. The branching, fast spin-up, and orchestration you do feels like pre-work that happened to be exactly what agents want. What do agents want differently than humans?

Jake [00:25:37]: They want the ability to version things. It's not that different; it materializes slightly differently. Agents want a way to test changes incrementally. Engineers have feature flags. Is there a reason agents can't use feature flags? I don't think so.

Jake [00:25:54]: They want version control. Can we use Git or not Git? That one is up in the air.
I think something outside Git will emerge for how we version these things over time. They need observability. You need to query what happened, when it happened, which steps failed, traces, logs, metrics, and all the rest. They need network, compute, and storage. They need to write files, save files, iterate on files, and snapshot file systems.

Jake [00:26:25]: A lot of what humans needed is in line with what agents need. Branching and forking are not different; we're just moving 1,000 times quicker. It can look like you need something massively different, but what you need is something massively better than what existed. You need orchestration massively better than Kubernetes. You need networking probably better than Envoy. It goes all the way down the stack.

Jake [00:26:55]: If the workload profile doesn't change so much as it gets massively compressed because you need thousands of these things, what assumptions change? etcd is going to melt. You need to replace it with something. You can go all the way down the stack and say, “That part has to change, that part has to change, and that part has to change.”

Jake [00:27:19]: The interesting thing about the super-exponential curve is that you have to build systems where you can rip out those parts at any time because a new bottleneck might emerge. You get good at parallel agents, and a different part of the system breaks. So it's similar to what humans needed, but at 1,000x scale.

Jake [00:27:55]: How do you do code review in the age of agents?

Swyx [00:28:00]: You throw more agents at it.

Jake [00:28:01]: You don't. But then who reviews for CVEs and all these other things?

Swyx [00:28:07]: More agents.

Jake [00:28:08]: And that's how we hit the inference wall. You can continually throw agents at the problem, but I think there's a limit to the number of agents you can throw at a problem.

CLI, Agent Handles, and Closing the Loop

Swyx [00:28:24]: You already had a CLI before it was cool.
How is the shape of what you're exposing changing, if at all?

Jake [00:28:28]: CLIs have always been cool. The CLI changes because we think about how to give Claude, Codex, ChatGPT, or any model a handhold.

Jake [00:28:50]: A CLI is a single command: deploy, get logs, and so on. Things that were prohibitively annoying to humans are not annoying to agents. They're nice. If I handed you a CLI with 40 arguments and 600 flags, you'd think, “I'm never going to use all of this.” But if you hand it to an agent, it says, “This is excellent. I have so many handles to work with.”

Jake [00:29:24]: If you're going to expose things to agents that way, you want as many handles as possible where they can get information, query dynamic information, and close the loop quickly. Most problems right now are about how to close the loop as quickly as possible. Where does the agent get stuck, and how can you remove that?

Jake [00:29:49]: Telemetry is important. If you can tell where the agent gets stuck from the CLI and say, “12% of people deviate from the happy path because of this, and now I add this argument and drive it down to 2%,” you massively increase the rate of loop closure.

Jake [00:30:03]: That's how we think about not just the CLI, but every point in the dashboard. It's a user journey: I hear about Railway. I get something deployed. I get my first green build or aha moment. I see an endpoint, logs, whatever. Then I iterate. The iteration loop is indefinite. The user wants to deploy a new thing, a Postgres instance, change code, and keep iterating.

Jake [00:30:36]: If you focus on the iteration loops and what's blocking them from closing quickly, one thing we say internally is: you never want to be waiting on compute anymore. You always want to be waiting on intelligence.
If you're waiting on compute, there's a bottleneck that needs to be destroyed because eventually that bottleneck becomes so large that another workflow emerges to change it.

Jake [00:31:04]: We've built a product where you push code, build it, and so on. But I fundamentally believe the push-pull loop is going away. We'll get to a point where you make a small change in production, that change is versioned across your infrastructure, you're working alongside copy-on-write versions of your database and infrastructure, and then you merge it in and it's instantaneously live. That's the holy grail of loops. The push-pull-rebuild thing is a point of friction that we're removing entirely.

Canvas as Output: Dashboards, Context Anchors, and Hyperstructures

Swyx [00:31:43]: It's incredibly fast. If anyone hasn't tried it, that fast feedback is great. My hot take is that Railway was famous for its canvas, which visualizes your infrastructure and lets you manipulate it visually. But that was for humans. For the next phase of growth, Railway CLI is more important than canvas.

Jake [00:32:05]: The canvas is funny because it's a mechanism to show changes over time. You're right that previously we used it a lot as an input. Moving forward, its goal is more like an output. You would go to the canvas, make changes, see them, and watch your infrastructure evolve. Now agents have access to the CLI and can make those changes. So the canvas becomes an output: what information does the human need at this moment to make suitable decisions about control requests? Do I approve this or not?

Jake [00:32:57]: It also has to be an anchor for your context, a port in the storm. Think of it like layers in a file system. You start with a project, then drill down into services, then into a function or code, because you want to represent the entire thing not just in your head, but in the canvas.
Other people can share that representation, think on the same wavelength, and move quickly.

Jake [00:33:33]: A lot of organizations get in trouble as they scale because all the context lives in someone's head. “How does this microservice work?” “I have no idea; go ask this person.” Then you have whole categories of products built around context discovery. A lot of that melts away if you have a solid hierarchy and can infinitely nest services, code, context, and everything else all the way down. That's what lets you build these structures over time.

Jake [00:34:18]: It's also what lets us build what I've called hyperstructures: things that are way bigger. You look at the Golden Gate Bridge and ask, “How did we build that?” There's a meme that we lost the technology. To some extent, yes, because the coordination that built those things evolved and changed. We lost some of the art of building structure as we jammed everything into Slack.

Swyx [00:34:52]: But you jam everything in Discord.

Jake [00:34:53]: Same point. It doesn't matter. It's message passing and interrupts, message passing and interrupts.

Swyx [00:35:00]: So you're arguing there should be something better and more structured than Slack?

Jake [00:35:04]: Yeah. For sure. I think Slack is awful, and Discord is awful too.

Central Station: Context Routing, Support, and Incident Clusters

Swyx [00:35:09]: This is the equivalent of my mom test. What have you done that has your solution to this?

Jake [00:35:15]: Internally, we've built a tool called Central Station that aggregates all the context from our users. Every piece of feedback, every customer support item, everything gets aggregated into clusters. If an incident is brewing, we can determine how many users are affected and break off a discussion based on that.

Jake [00:35:40]: That is more helpful than long-running channels where you're trying to decide which channel to put something in.
If you can dynamically aggregate information and dynamically route it to the right person based on context, it works better. We know internally that these four people are close to networking. If we see a networking thing, we can drill it down to those four people. If it's with this part, we can look at the commits. This is no longer a manual process internally.

Jake [00:36:13]: If you go to station or help.railway.com, that's why we built it. We wanted to scale with a massive amount of leverage by aggregating feedback.

Swyx [00:36:27]: This is built in-house?

Jake [00:36:28]: Yep.

Swyx [00:36:29]: I remember helping out on this one with Angelo in 2023. You scale a lot with a very small team.

Jake [00:36:38]: Yeah. We're about 10 times bigger now.

Swyx [00:36:40]: You have your full developer code here? Very cool.

Jake [00:36:44]: If you go to railway.com/stats, we expose this as a pub-sub-able thing. It's all real-time metrics. There's a way to get it as JSON somewhere if you care.

Jake [00:37:01]: We're big on trying to build everything in public and talk about what we're working on. We've had issues in the past, and we'll say, “Here's how we're fixing these things.” We've gotten compliments and flak for incident reports. We're always trying to make them better and talk with people.

Incidents, Disclosure, and Progressive Rollouts

Swyx [00:37:20]: You had a big one recently. I liked that it was scoped to 3,000. You presumably used Central Station. Talk through what happened and how you address it internally as a team.

Jake [00:37:38]: Internally, this one really sucked. It had to do with an upstream provider that didn't do the behavior it said it documented, which is unfortunate given they wrote the RFC for how the behavior should work. We rolled those things out, and Central Station caught it initially when a couple users said caches weren't invalidating.
We turned it off immediately.

Jake [00:38:03]: When you roll out to a large user base of three million people, you get a lot of disparate behaviors. We tested in staging and had tests, but we hit an edge case. We've hardened those systems, and now we can make that better. But it was a tough one.

Swyx [00:38:39]: I always wonder how private disclosure is supposed to work if people find an issue. Are they supposed to contact you first? When you run a platform, these things will happen. What channels should people pursue to quietly resolve it before it becomes a bigger incident?

Jake [00:38:59]: There's responsible disclosure. We err on the side of over-disclosing and letting you know something is wrong versus having your provider gaslight you. We've erred on sharing those things more publicly, even if they impact a small subset of users. That's a decision we've made internally. We have four values. One is honor. The honorable thing is to notify people to the widest degree at which they may have been affected or there was an issue, and then confront it head-on: why did it happen, what can we do better?

Swyx [00:39:45]: Not the whole user base. That's because of incremental rollouts and other things?

Jake [00:39:50]: Yeah. Progressive rollouts.

Swyx [00:39:54]: That should be the norm at all large platforms.

Jake [00:39:58]: It should. A variety of companies do this. There's the quote that Meta runs 10,000 different versions of Meta. To our earlier point about agents, they need the same thing. They need shadow traffic and all these other things. We've built so much ceremony around production being sacred that we need to make it trivially easy to test different behaviors in a safe environment. Then you can make mistakes in a safe environment.

Safe AI SRE: Customer Agents, Forked Environments, and Production Parity

Alessio [00:40:30]: Do you see a world where these things get automatically caught, not necessarily by your agent, but by your customer's agent?
The cache invalidation issue seems easy to check if you know to look for it.

Jake [00:40:44]: It's hard because to determine it, we almost need to hook into your observability infrastructure. That's why we have the template loop on the platform: so you can roll things out progressively. You can roll out to Johnny Vibe Coder initially, or push a shard that someone consumes at their own leisure. Or you can roll it out over weeks: 0.1% of people, 1% of people, early adopters, then all the way up. That's the non-deterministic version control we talked about earlier.

Jake [00:41:30]: I believe that's where most things should go, because most companies end up building staged rollout systems in-house. It's the same thing built again and again at every company. There's a massive opportunity to consolidate developer debt.

Alessio [00:41:45]: You should have a free tier. Model providers give free tokens if you let them use the data. You could give free compute if someone is the number-one shard that goes out and lets you plug into their observability.

Jake [00:41:55]: We do that. That's why we talked about the impact on 3,000 people. We start with lower-impact people. Larger companies on the platform are last to receive those rollouts so they have a version of the platform that's deeply stable.

Alessio [00:42:16]: I have three services, so I'm sure I get the first rollout. You can nuke my thing at any time. There are all these SRE agent companies. Observability people also want agents that fix upstream problems. You have your own agent in the canvas now. How do you see that playing out?

Jake [00:42:39]: It's the stacking entropy problem. If you don't have primitives to make iteration in production safe, it becomes difficult. If you're an observability provider saying, “Here's the fix to this error,” assume 80% are good and make sense.
But in the last 20% long tail of complex issues, if you let somebody stamp it, you create an opportunity for an incident.

Jake [00:43:08]: That's why forked environments are important. People have staging, but it always drifts from production. You need primitives, workflows, and experience built first-party on the platform so you can fork any service at any point in time.

Jake [00:43:33]: I think of the canvas as a sheet of transparency paper. The agent is a little guy you push up into the canvas. It should say, “I need to copy that service and that service so I can test these two things.” It gets a read-only copy of production. Anything that's PII gets marked as a transform when we clone the database, create a copy-on-write version, or read from it. Then the agent makes changes and asks, “Does this actually work?” as close to production as possible.

Jake [00:44:22]: That's how close you have to be, or you get massive drift. The system becomes unstable. You see this with massive systems built on Docker for local, Kubernetes for production, and a specific thing for something else. That complexity slows developers and becomes unstable at scale, making it hard to iterate. We want to compress that way down and say, “As close to prod as possible is where we want to be.”

From AI SRE Skeptic to Agent Believer

Swyx [00:45:00]: I was texting Erica for questions, and she says you were originally not a believer in AI SRE. Have you come around on it?

Jake [00:45:10]: I flipped, but I'm still not a believer in AI SRE if you don't have the primitives to make it safe. If you unleash AI SRE on production infrastructure without safe primitives for copying volumes and making sure things are fine, it's going to nuke your production database. It's not a matter of if, but when. I'm a big believer in making those loops safe.

Jake [00:45:33]: I was a deep AI skeptic until 2023.
In 2024, I thought, “Maybe I can roughly make this thing do it.” In 2025, I thought, “Now I can hold this.” Over winter break, everybody came back saying, “It's almost impossible to hold this.”

Swyx [00:46:01]: Did you see this on the Claude docs? Clawdbot? OpenClaw?

Jake [00:46:06]: It's gotten to a point where it's harder to hold it wrong than to hold it right. There's a scene in Avengers where Vision picks up Thor's hammer and says it's terribly well-balanced. It self-balances and works well. I'm a deep believer at this point that this will be the dominant species: assembly, C, C++, JavaScript, words.

Swyx [00:46:35]: It feels like a big jump.

Jake [00:46:37]: It is. But it's not like you abandon CPU-based discrete logic and move straight to fuzzy logic. You need both. Your skills should call code or applications or some static structure. You can use skills to distill what the procedure should be or how the code should act.

Jake [00:47:02]: I'm coming to a thesis: you need three points. You need a clear spec defining the system, the code, and the tests. When you say it out loud, if you've been in engineering long enough, you're like, “Of course. That's an RFC, tests, and code.” But they all matter. Having them together lets them reinforce each other: the spec and tests match, but the code doesn't, so reconcile it. Or the tests and code match but the spec doesn't, so reconcile that. That's the iteration loop.

Jake [00:47:41]: That's why you're seeing people talk about software factories, docs, and reconciliation. Some of that is architectural astronomy if you don't implement it, but that loop is where most things will end up.

Swyx [00:48:07]: For listeners, we've been talking about this on the pod for three years: the holy trinity of specs, code, and tests. Itamar Friedman from Qodo is the reference if people want to look it up.

Self-Modifying Infrastructure and the End of Push-Pull-Rebuild

Swyx [00:48:18]: One thing I want to mention on the OpenClaw idea is self-modification.
I don't know how Railway would support it, but I have my OpenClaw, and I just tell it it has the Railway CLI and can do whatever. In theory, whatever capabilities or new infra it needs, it can call the Railway CLI, provision it, and add it to itself. The agent can modify its own infra.

Jake [00:48:45]: It's nuts. I have a loop set up where you put the Railway CLI on top of something that runs on Railway. You're authenticated as whatever the current box is, and you can make any changes to it. Then you call Railway deploy, and it deploys itself.

Jake [00:49:04]: It's like: “I need to spin up this instance of this environment. I already exist in this environment. Excellent, I have access to a Postgres instance now.” That's where we want to go with agentic, self-replicating infrastructure. That's your loop: iterate in production. You continue making changes. If it works, merge it upstream. If it doesn't, throw it away.

Jake [00:49:37]: How do you make throwaway copies trivial to spin up and super cheap? The era of “I have an AWS instance with four vCPU and 16 gigs of RAM” is going to get destroyed. If you do that for agents, you need a thousand of those machines. It's prohibitively expensive compared with what we've spent a ton of time figuring out: the atomic unit of deploy, whether you call it isolates, sandboxes, or something else. Only pay for what you use, spin up instantaneously, and close the loop as quickly as possible.

Jake [00:50:15]: If the system can self-replicate safely and say, “This is my environment, I'm making these changes,” it can come back with, “Does this look good? This is a new state of infrastructure given this prompt. I think I've solved it.” Then you go back and say, “Actually, it looks different.” It does the loop again. Then you say, “Cool. Apply.”

Swyx [00:50:38]: That's retroactively obvious, which is the most useful kind. Any other comments on agent deployment on Railway?

Jake [00:50:51]: It's getting better every day. I'm on X or Twitter.
You can always yell at me about the parts not working as well as they should, because plenty of things should work way better.

The New Serverless: Stateful, Long-Running, Pay-for-What-You-Use Linux

Swyx [00:51:04]: At this stage, when people want massively or embarrassingly parallel compute, they usually talk serverless. I feel like there's a new serverless compared to the previous five years of serverless. You're in that new bucket. Do you have comparisons or philosophical differences you want to call out?

Jake [00:51:31]: It's somewhere in between. It's the ability to run stateful, long-running workflows or executions.

Swyx [00:51:42]: Vercel has Fluid Compute, Cloudflare has some container thing, Google has App Runner and others.

Jake [00:51:55]: That's where everything is roughly going, and it's why we've been working on this for six years. We believe users need access to a computer: a box that speaks Linux. They need to deploy what they want. Other systems change the surface area of what you can build. For us, users need a computer and need to deploy anything they truly want. That's why we've focused on the primitives: network, compute, storage. If we give you those and expose them so you can run things indefinitely, that's where we believe it's going.

Jake [00:52:43]: Twitter has no nuance, so everyone says “servers” or “serverless.” It's always somewhere in the middle: I want to run it for a long time, but I don't want to provision the resource statically or pay for things I'm not using. That's been our thesis from day one: pay only for what you use, run it indefinitely, and it is full Linux.

Swyx [00:53:12]: That's why I like the naming of Fluid. It's fluid. Flexible.

Heroku, Focus, and Carrying the Torch Without Becoming the Past

Swyx [00:53:18]: Another milestone is the Heroku official deprecation. You're one of the presumptive new Herokus. “New Heroku” has been a category for as long as I've been in developer tooling. It's finally happening. What was that like?
Any behind-the-scenes of, “This is the moment”?

Jake [00:53:42]: You have people where you're like, “You were running stuff on here? You, as this company?” It's crazy that names you would know are running on it and now coming to us saying, “We want to move a lot of this off.”

Swyx [00:54:00]: Any behind-the-scenes on why Salesforce let Heroku stagnate?

Jake [00:54:05]: I can only guess. It's hard when it's not your business. Salesforce's business is to build a great CRM. That's their focus. Then you acquire a compute business as an offshoot. A lot of early Meta people talk about focus. Boz has a write-up about how in the early days of Meta they had no money, so they were forced to focus. Then they turned on the money tree and had no reason not to split their focus.

Jake [00:54:52]: But that dilutes your product. You get offshoots where you ask, “Is this the focus of the business?” If it's not core, it languishes. A lot of companies get in trouble when they split focus because they're fighting a multi-front war, not just externally but internally for alignment. Where are we going? What are we doing? What is our purpose?

Jake [00:55:24]: If you're Salesforce-built and mission-driven, you want to work on Salesforce. Heroku is off to the side. It's not core to the business. Getting resources, budget, focus, and alignment internally becomes hard. It was a matter of time.

Swyx [00:56:06]: Kudos to them for calling it out instead of leaving it unknown.

Jake [00:56:12]: Their release was a little odd. They called it out, but they didn't say they were shutting it down. Behind the scenes, I think they issued messages to people saying they should close accounts and that they were going to deprecate and remove things over time.

Jake [00:56:30]: It's crazy because some of my first deployment experiences were on Heroku. You start with dragging things into an FTP server, then you try to get a deploy working, and then it's Heroku. It was the on-ramp for us. But the wheel turns.
New things emerge. We're happy to carry the torch for a lot of that. But we don't want to be the new Heroku. We want to be the way people build and deploy software, and ultimately the way people monetize software over time.

Swyx [00:57:19]: It's still a big crown to be the new Heroku. There are 50 companies that fought for that.

Jake [00:57:23]: Everybody is holding some portion of it. We're happy to support people and companies. The platform works differently. The game loop is similar, but we've been dogmatic about where these things are going: primitives, agents, fan-out. Some things fit; some workflows need to change. We have an approximation of Heroku pipelines with the environment system. It's exciting. We've got a ton of people we can support, and it's growing a lot.

Temporal, Workflow Engines, and State Machines

Swyx [00:58:12]: I have one more technical question about Temporal. I've sold my shares. You're a power user and one of our earliest customers. I met you through Temporal. You built on Temporal. You have complaints. This may be the most neutral and informed conversation anyone will hear about Temporal without someone working at the company.

Jake [00:58:39]: That's fair. I've used Temporal for almost 10 years because of Cadence at Uber.

Swyx [00:58:52]: Give people a sense of what Cadence was at Uber.

Jake [00:58:57]: Cadence was the precursor to Temporal. It powers trip actions, rides, when you rent a Jump bike or scooter or car. You're running workflows for a period of time and saying, “This ride will run indefinitely until it finishes.” You attach information: you paused in this zone, so add this charge to the bill. When you end the trip, the workflow is done. That experience was powered by Cadence at the time.

Swyx [00:59:34]: I used to say it's like programming the entire user journey top-down as one function.

Jake [00:59:39]: It's a powerful idea and important. It's also important for the next phase of the agentic journey.
You want an agent to do a specific task, be complete or incomplete on that task, and move on to the next thing. You need a way to manage workflows dynamically.

Jake [00:59:59]: Temporal was always great in theory, and great when you got it working the way you wanted in production. But it required you to model the entire journey in your head. If you didn't, you could cause issues where replaying the state of the workflow causes non-determinism.

Swyx [01:00:25]: Because it works on deterministic workflow history.

Jake [01:00:28]: Exactly. I describe it as a jet engine. If you know how to operate it and run it, it's great. But you can't hand it to people trying to build complicated things if they don't have the whole state in their head.

Jake [01:00:48]: We run our whole deployment pipeline on top of it. That's a reasonably complicated workflow: pre-commit hooks, signaling, queuing, and all the rest. We ran into the same thing at Uber. As you express a large workflow, it gets more complicated, with more states in the state machine that you have to map back to the workflow.

Swyx [01:01:15]: It's a lot of ifs.

Jake [01:01:16]: Exactly. At Uber, we built a system for doing the state machine and testing it. We've started to build some of those things here because it's grown heavily. It's not quite love-hate. When it works well, it works super well. But if someone who doesn't have full context puts something into the system that invalidates state or causes non-determinism, or spins off a ton of activities, you have to keep track of underlying SRE knobs like activity slots. Those should scale with memory, vCPU, and so on. It becomes a bear to scale.

Swyx [01:02:10]: You need a capable sysadmin running things behind the scenes. If you moved off, what would you do?

Jake [01:02:19]: We'd build our own workflow engine.
We have a few internally that we've worked on.

Swyx [01:02:27]: This is one of those classes of things you typically wouldn't vibe code, but I'm wondering if you can.

Jake [01:02:33]: I still don't think you should vibe code it. You still want to run decent tests to make sure it works.

Swyx [01:02:39]: Temporal didn't invent that from scratch either. There are libraries you can run. On top of that, it's just a state machine that you have to map out. Ultimately, you define the instructions you want and run them through a state machine.

Jake [01:03:00]: It's very doable. Workflow stuff is interesting. Restate is doing neat stuff here.

Swyx [01:03:10]: You're tied into JavaScript. Are you a JavaScript maxi?

Jake [01:03:13]: Internally, we have TypeScript, Rust, and Go. We don't add more languages. Actually, we have a little C because we write BPF code and hooks. But those are the languages.

Swyx [01:03:28]: Is this for sidecars?

Jake [01:03:32]: No. It's for the networking stack, volumes, and things like that. We use TypeScript a lot because it powers the dashboard, but we're moving a lot of workflow stuff off the dashboard stack and into the infrastructure stack.

Railpack, Nixpacks, and Content-Addressable Filesystems

Swyx [01:04:00]: Cool. Any other technical infrastructure stuff? Railpack?

Jake [01:04:07]: We built an engine for determining dependencies based on source code. It's called Railpack. We built the first version, Nixpacks, on top of Nix, and then we moved.

Swyx [01:04:17]: People have been trying to get me to adopt Nix and NixOS for four years. Is it ever going to be a thing?

Jake [01:04:23]: I don't know. We're excited about it, but it has pain points. Think of it as a stack of versioned binaries at specific slices in time. If you want version X and version Y, you bloat the package space, which blows up image size and makes real-world workloads difficult.

Swyx [01:04:53]: But you content-address it and cache it.
In theory, there are optimizations.

Jake [01:05:00]: In theory, yes. But with a large enough user base and disparate enough machines, you run into a problem Meta described in the XFaaS paper, their internal serverless system. It becomes difficult at scale unless you break out specific runtimes.

Jake [01:05:24]: We didn't want to do that because we wanted to truly allow you to deploy anything. That was our initial thing with Nix. But we've moved toward interesting work around content-addressable file systems that can lazy-load anything from any point and page it into memory.

Swyx [01:05:48]: Amazing.

Jake [01:05:49]: The future is very bright. It's crazy, and it's going to be nuts.

Coding Agent Spend, Roadmaps, and Token ROI

Swyx [01:05:54]: Founder journey stuff?

Alessio [01:05:56]: Your cloud usage: you tweeted you're going to spend $300K this month?

Jake [01:06:01]: I think we got to $200K.

Alessio [01:06:02]: Coding agents?

Jake [01:06:03]: Yeah.

Swyx [01:06:04]: Across the company?

Alessio [01:06:05]: You only have 35 people, so I'm sure they're not all spending $10K a month. What's the distribution?

Jake [01:06:10]: I think I'm at about $25K. We have power users all the way down. We came back from winter break, and I basically said, “If you're writing code by hand, you're doing this wrong.” The tools are good enough now that you can move extremely quickly. There are issues and pain points, but you should be reviewing the code you are writing instead of writing it by hand.

Jake [01:06:40]: Architectural patterns matter more now than ever, but you shouldn't spend your time generating code you would write. If you know how to write it, ask the agent to write it and reconcile it until it looks like you would have written it yourself.

Jake [01:06:58]: People misconstrue my propensity to push people toward agents as connected to our growth and some reliability bumps. They're not necessarily related.
The tools are good enough to move extremely quickly and build things way larger than you could before.

Jake [01:07:19]: To the earlier point about cooling data centers in space: I don't know. But with software, you can ask, “How would I build block storage from scratch? How would I do these things?” I have ideas because I have history and have read papers. Let me work them out and build massive test benches with thousands of tests, because those are now free to author. If you're not using AI systems to speed-run your roadmap and reconcile your existing system onto the future, you're missing a large point of what's happening.

Alessio [01:08:12]: What's the path to spending $3 million a month? Is it bound by ideas and things customers can absorb?

Jake [01:08:19]: For most companies, it's bound by deployment at this point. That's why we've seen a massive boom in users and companies, from Fortune 50s down, asking how to get developers to move faster. You'll probably hit your CFO before any technical limits because they'll look at the eye-watering amount of money spent on tokens. Inference costs have to come down, but we're inference constrained now. There will be price discovery around what makes sense for an org to adopt.

Jake [01:09:06]: I think you'll end up with the F1 driver concept. If someone is really adept at these things, it makes sense to put them in a $3 million car. If they're not, it probably doesn't make sense. You'll take a few people and say, “You can drive the F1 car. We need to go in this direction. Figure out if it works and prototype it.”

Jake [01:09:33]: We've done some of that and vastly accelerated our roadmap. We thought we'd ship something in a few years; now we can probably ship it in a few months because we validated it and don't have to build it incrementally. We can skip steps and move toward our vision.

Alessio [01:09:58]: A lot of people are realizing the roadmap doesn't always have a business impact, so they say tokens are too expensive.
But if your roadmap were built to make more money by the time you built it, you'd have token pricing for it, the same way you do with sales. You'd spend a billion dollars on sales if you knew you would get $2 billion of revenue.

Jake [01:10:19]: Exactly. A naive way to measure this is the percentage of tokens that end up in production. If you can measure impact because those tokens end up in production, that's awesome. But the burden of proof will rise. Internally, we have a growing number of pull requests that haven't merged. The question becomes: how do you get this into production? It's about how quickly you can build and deploy software, which is exciting because that's our whole thing.

The SDLC Shift: Prompt Requests, Feature Flags, and Safe Rollouts

Swyx [01:10:56]: The SDLC is changing. One thesis is that the pull request is dying. It's going to be the prompt request. Beyond that, code review is also kind of dying if you have all the other systems in place. What else is changing about the SDLC?

Jake [01:11:19]: The AI SRE and the tools to make it happen. The AI SRE is pie-in-the-sky aspirational. What does it take to get an AI SRE? What tools do you need to build?

Swyx [01:11:32]: You should expose your tooling to customers at some point. The Central Station command center.

Jake [01:11:39]: We have it for template maintainers. Template maintainers can deploy and maintain templates, and they get feedback. We're going to expose those things incrementally.

Swyx [01:11:51]: Clustering around incidents. Everyone has a version of that, but I don't think anyone has solved it.

Jake [01:11:56]: I won't say we've solved it internally, but it's gotten so good that we can see incidents forming pretty quickly. At some point, those will be things either someone else builds or we build. We've always built things purpose-built for us.
If it makes sense to make it useful for users, monetize it, or turn that loop into a profit center instead of a cost center, we want to do that.

Jake [01:12:28]: Pull request is definitely dying.

Swyx [01:12:29]: Do you do first-party feature flagging and incremental rollout stuff?

Jake [01:12:34]: We have a feature-flagging engine we built internally and will eventually roll out.

Swyx [01:12:38]: I don't see it as a user. How come you didn't give us what you have?

Jake [01:12:43]: We have to beta test it. We care a lot about the quality of the things. There's plenty we've used internally that doesn't make it all the way through the journey because it fails. It works for one service but not multiple services. We'd have to build it for multiple services and know that if we released it, we'd rebuild it again and again. Some things are worth that, but many inform the roadmap.

Jake [01:13:18]: We don't want to dilute the experience by saying, “This works, but only for this service,” unless it's a core initiative. Over the next few months, we'll roll out things that work for a single service, then multiple services, then multiple services across the environment. You have to be deliberate. Otherwise you create broken disparate experiences and support load because people ask how to use the feature.

Jake [01:13:52]: It's the earlier expansion and compaction pattern. You expand the company to get features, then compact and smooth them out so the experience is stellar. You told me in the hallway, “It's gotten so much better.” Internally we're saying, “This part really sucks. We need to make it significantly better.”

Swyx [01:14:11]: I can attest to that over the last three years watching you build Railway. For listeners, feature flagging is a huge part of Uber culture. So much so that they have too many feature flags and another thing to remove feature flags. Facebook has Gatekeeper. Agents are going to need this. It's fundamental to incremental rollouts. OpenAI acquired Statsig.
GPT-5 is routing and flagging through different models.

Jake [01:14:56]: It's super important. If the software development lifecycle is going to change because we're doing things 1,000 times faster and 1,000 times more concurrently, what becomes important at scale?

Jake [01:15:16]: Before I started Railway, I built a feature-flagging product and tried to sell it. It was an easier version of LaunchDarkly. I ran into a problem: anyone small enough to adopt your technology doesn't care about feature flags, and anyone large enough to need feature flags needs so much scale that you have to build out all the infrastructure. I scrapped it.

Jake [01:15:42]: But what is old is new again. Companies are trying to move quickly, but you can't YOLO a vibe-coded thing straight into production. You need to say, “Here's my blast radius, my impact, and I want to shadow it for these users.” Feature flags. You're going to need the tools larger companies built to maintain their structures. Everything gets compressed by 1,000x so everybody can build those structures quickly.

Jake [01:16:07]: That's exactly where we are: compressing the software development lifecycle, then expanding it and adding more new things.

Cattle, Pets, and Clonable Infrastructure

Swyx [01:16:15]: Another term that comes to mind for newer developers is “cattle, not pets.” People treat production like a pet. It has a name. You baby it and keep it alive. With cattle, you can mass farm, roll out, portion parts out, and kill them.

Jake [01:16:37]: I think that might change. You can move toward having pets as long as you have a cloning machine for your pets.

Swyx [01:16:52]: Yeah.

Jake [01:16:52]: If you can snapshot every single thing at every frame, it doesn't matter if something gets obliterated because you have a snapshot of it. The things we've built right now are designed to block changes from the hermetically sealed DevOps line. You have to write a Dockerfile because you nee
With Google I/O 2026 underway this week, Andrew sits down with Matthew McCullough, VP of Android Development Experiences at Google, to talk about the AI evolution happening across the Android ecosystem. Matthew shares his insights on why developers are rapidly transitioning into agent orchestrators, why CLIs are cool again, and how tools like AI Studio have rolled out a massive welcome banner for anyone to actively participate in the creation process. Finally, the two explore the future of mobile user interfaces and how the latest Android 17 developments are stripping away legacy friction to seamlessly get users straight to the good part.

OFFERS
Start Free Trial: Get started with LinearB's AI productivity platform for free.
Book a Demo: Learn how you can ship faster, improve DevEx, and lead with confidence in the AI era.

LEARN ABOUT LINEARB
AI Code Reviews: Automate reviews to catch bugs, security risks, and performance issues before they hit production.
AI & Productivity Insights: Go beyond DORA with AI-powered recommendations and dashboards to measure and improve performance.
AI-Powered Workflow Automations: Use AI-generated PR descriptions, smart routing, and other automations to reduce developer toil.
MCP Server: Interact with your engineering data using natural language to build custom reports and get answers on the fly.
Building Repeatables in Claude: Skills, CLI vs MCP and Token Discipline | Go With The Flow
Claude Skills, CLI vs MCP and Token Discipline with Ritu Java | Seller Sessions

SEO Description
Ritu Java and Danny McMillan on building agentic skills, choosing CLI over MCP, plan mode discipline and the short window to ship before token costs reset.

Episode Summary
Week 4 of the month, Go With The Flow, and Ritu Java is back from her travels. The world has shipped fast since the last episode: Codex 5.5, Claude 4.7, an Amazon Ads MCP and a fresh round of panic over the rumoured removal of Claude Code from the $20 plan (it was a 2% A/B test, not a rollout). Ritu and Danny use the noise to make a sharper point: this is the moment to stop chasing models and start building repeatable systems on the platform you have already chosen.

Ritu walks through the three eras of PPC Ninja's automation stack: Apps Script bulk file generators three years ago, Netlify hosted UI apps last year, and now agentic skills that her team chats with in plain English to produce upload ready Amazon bulk files. The same shift applies to data: BigQuery accessed through the Google Cloud CLI rather than through MCP, because CLI is leaner on tokens and works better when the job is heavy on data rather than tool surface. Danny mirrors the move with his event-ops CLI for WordPress, WooCommerce, Stripe and FooEvents reconciliation, and his four tier ExtractFlow cascade (HTTP, headless, stealth, agentic) that bypasses the limits of any single browser tool.

The second half is a discipline talk. Plan mode every time. Push back on the first plan because Claude over engineers by default. 30% of your time on workflow scaffolding so the other 70% can be real building. The 21 day Claude rule: when a shiny new tool fires the dopamine, wait 21 days before refactoring around it. Left brain tasks (counting, SQL, deterministic logic) belong in scripts. Right brain tasks (judgment, creativity, hypotheses) belong in the model. Mix them inside a single skill. Skills are micro pieces of your workflow, not magic, and Claude can write them for you from an existing SOP.

Key Topics
The three eras of PPC Ninja automation: Apps Script, Netlify UI apps, agentic skills
CLI vs MCP: when to choose each and why CLI is more token efficient for data heavy work
Token economics, the rumoured $20 plan change and why it was a 2% A/B test
The short window before subsidised tokens get repriced
Plan mode discipline and the "push back on plan one" rule
Danny's 30 / 70 framework: workflow scaffolding vs building
The 21 day Claude rule for resisting tool churn
Left brain vs right brain task design inside a single skill
The PPC Ninja "5 Whys" skill: deterministic SQL plus non deterministic hypotheses
Claude.md, Gemini.md, Skills.yaml and the emerging Agents.md standard
Skills for beginners: let Claude write them from your SOP
Skill cascading: research, article, LinkedIn post, tweets, slide deck in one chain

Timestamps
[00:01] Welcome back, Week 4 Go With The Flow, Ritu returns from travels
[00:17] Codex 5.5, Claude 4.7 and the "no one is writing code anymore" reality
[02:01] Ritu on the three eras of PPC Ninja automation
[02:42] Era 1: Apps Script bulk file generators in Google Sheets
[03:46] Era 2: Netlify hosted UI apps with input fields
[04:48] Era 3: Agentic skills, the bulk file skill trained on Amazon templates
[06:22] Claude talking to BigQuery through the Google Cloud CLI
[07:00] Danny: what is a CLI and why it matters for token use
[08:00] Amazon Advertising MCP vs CLI based access to the same data
[09:33] WordPress horrible to drive via MCP, easy via CLI
[10:00] Danny's event-ops CLI: tickets, food tickets, WooCommerce, Stripe reconciliation
[12:13] ExtractFlow four tier cascade: soft, medium, stealth, agentic
[13:46] Why CLI for the heavy stuff, MCP for the soft touch
[14:13] AWS CLI: chat to Claude, push HTML blog posts live in two minutes
[15:33] The overwhelm problem and the $5,000 cost behind the $100 plan
[17:35] The $20 plan rumour: it was a 2% A/B test, not a rollout
[19:38] Build repeatables, not one offs
[20:38] Danny: pick a platform and stop chasing benchmarks
[21:16] The 21 day Claude rule for new tools
[22:16] Plan mode every time, push back on plan one, get the second plan
[23:02] Why am I building it, who is it for, what am I building
[23:30] The 30 / 70 split: workflow scaffolding vs real building
[25:13] Why long six to fourteen hour Claude runs are usually inefficiency
[27:12] Compounding 1% a day across a year
[27:47] "I build the things that build things"
[28:00] Architecture vs apps: filling the gaps between A and B
[29:06] Left brain vs right brain task design
[30:01] Why throwing 80/20 at a sales drop diagnosis fails
[31:33] The PPC Ninja 5 Whys skill: deterministic plus non deterministic in one flow
[34:32] Claude.md, Gemini.md, skills.yaml and the agents.md standard
[40:53] Beginners: let Claude write the skill from your SOP, use the interview pattern
[42:39] Skill cascading: URL to research to article to LinkedIn post to tweets to slides
[44:42] Mixing deterministic and non deterministic inside a single skill
[45:39] Wrap up, signal to noise, who is it for

Key Takeaways
Pick a platform and stop chasing models. A new model ships every week. Time spent benchmarking is time not building. Double down on Claude (or whichever you chose), use the 21 day rule, and let the ecosystem catch up to the shiny thing in your feed.
CLI for heavy work, MCP for soft touch. MCP loads tools and skills into context and burns tokens. CLI uses programs already on your machine. For data heavy jobs (BigQuery, AWS, WordPress at scale), CLI wins. For light cross app workflows, MCP is fine.
Build repeatables, not one offs. Subsidised tokens will not last. The $100 plan reportedly costs Anthropic $5,000 to serve. Spend the window building scaffolding that compounds, not 14 hour vibe coding runs.
Plan mode every time, then push back. Claude over engineers by default. Generate the plan, then say "you have over engineered this, although I want it elegant, go back and review." Plan two is the one you start from.
30% on workflow, 70% on building. Each new dependency, MCP, skill or repo you add to your workflow compounds across every future project. Stop building only the apps. Build the things that build the apps.
Left brain in scripts, right brain in the model. Counting, SQL, deterministic logic belongs in Python the moment you can offload it. Save the model for hypotheses, judgment and creativity. The PPC Ninja 5 Whys skill mixes both inside one flow.
Skills are micro pieces, not magic. Take an SOP, ask Claude to interview you with decision panels, and let it write the skill. Then cascade skills together: URL to research to long form article to LinkedIn post to tweets to slide deck.

Notable Quotes
"Instead of doing one offs, it is time to build repeatables. The more people can learn that skill now, the better it will be, because a year from now you may not have access to the same tokens." Ritu Java
"If you see something and it looks sexy and it has sex and sizzle and your dopamine is screaming to go after it, wait 21 days. Either Claude will have it, or someone will have a repo, and you can combine it." Danny McMillan
"Always use plan mode. Never accept plan number one. Tell Claude: you have over engineered this, although I want it elegant, go back and review. Then start from plan two." Danny McMillan
"I build the things that build things. I build the scaffolding the team needs so they can build on top of it." Danny McMillan
"Spend 30% of your time on your workflow and 70% building. The 30% compounds across every project." Danny McMillan
"If we just hand six months of ad, organic, ranking and SQP data to Claude with no structure, it is going to mess up. It will give you an 80/20 you are not satisfied with, because it is not equipped to handle that volume without scaffolding." Ritu Java
"WordPress is horrible to work with through MCP. It falls over all the time. CLI can be amazing for certain things." Danny McMillan

Resources Mentioned
PPC Ninja: Ritu's Amazon PPC software and agency, base for the BigQuery + CLI stack discussed
Claude Code: Anthropic's CLI for Claude, the primary surface used in the episode
Anthropic Claude: Claude 4.7 referenced as the current model
OpenAI Codex: Codex 5.5 mentioned as the rival shipping fast
Google Gemini CLI: Referenced as a sibling agent surface (Gemini.md)
Google BigQuery: PPC Ninja's central data warehouse
Google Cloud CLI (gcloud): The CLI Claude uses to talk to BigQuery
Amazon Advertising MCP: Amazon's official MCP server for ads data, referenced as the MCP comparison point
AWS CLI: Used by Ritu to publish HTML blog posts to ppcninja.com from a Claude chat
Netlify: Hosting layer for PPC Ninja's previous era of UI based apps
WordPress and WooCommerce: Backbone of Danny's event-ops CLI
FooEvents: Ticketing plugin that lives behind WooCommerce in the event-ops flow
Stripe: Source of the card fee variation Danny reconciles via CLI
ExtractFlow / CloudExtract: Danny's four tier extraction cascade (HTTP, headless, stealth, agentic). Open repo
Playwright: The default browser automation tier inside ExtractFlow
Agents.md: Emerging AI agnostic instruction file standard alongside Claude.md and Gemini.md
Sequential Thinking MCP: The MCP Danny invokes when asking Claude to step through analysis

Hosts
Danny McMillan: Host of Seller Sessions, founder of DataBrill, building AI native tooling and CLI based workflows for Amazon sellers.
Website: https://sellersessions.com
LinkedIn: https://www.linkedin.com/in/dannymcmillan
Ritu Java: CEO and co founder of PPC Ninja, Amazon PPC software and agency. Specialises in automation, BigQuery pipelines and agentic workflow design.
LinkedIn: https://ca.linkedin.com/in/ritujava
Website: https://www.ppcninja.com

What's Next
Next week: Ritu and Danny pick up routines and the new Claude scheduler.
In 8 days: Seller Sessions Live 2026 in London on 9 May. Last week to lock in any final discounts.

About Seller Sessions
Seller Sessions is the leading podcast for serious Amazon sellers, hosted by Danny McMillan since 2017. Go With The Flow is the weekly automation strand where Danny and Ritu Java work through agentic flows, MCPs, CLIs and skills, in real time, on the same stack their teams ship every week.

Episode published: 1 May 2026
Series: Go With The Flow (Week 4 of the month)
Keywords: claude skills, claude code, cli vs mcp, mcp model context protocol, claude 4.7, codex 5.5, amazon ppc automation, bigquery cli, agentic workflows, plan mode, token optimisation, claude.md, agents.md, ppc ninja, ritu java, seller sessions podcast, go with the flow
Your database is slow, your Sentry is screaming, and the backlog is full of “we'll fix it later.” What if an AI agent handled the boring but high-impact work while you slept and just opened a clean pull request for review the next morning?

We're joined by Mike Coutermarsh, a software engineer who helped build GitHub Actions and later left GitHub for PlanetScale. We talk candidly about the trade-offs: walking away from big-company comfort, choosing impact over feeling like a cog, and learning to thrive in a flatter org where the best “process” is ownership. Mike shares how he leads the team responsible for everything users see at PlanetScale, from dashboards to APIs to CLIs, and why speeding up CI, reducing bugs, and protecting reliability can matter more than chasing the flashiest feature.

Then we get practical about AI coding tools. Mike breaks down how Cursor, Claude Code, and MCP servers can connect production query patterns and Sentry errors to scoped “bot army” automations that propose fixes, optimize performance, and even keep error queues from becoming a garbage fire. We also dig into AI code review, responsibility (“if your name is on the commit, you own it”), and the uncomfortable question of whether code quality still matters when models can generate code fast. Along the way we touch on token costs, local models, and why conventions like Rails can actually help AI work better.

On the database side, Mike explains why PlanetScale started with MySQL via Vitess, how sharding changes operations like backups and restores, why Postgres demand forced a new product push, and what it could mean to bring Vitess-style scaling to Postgres. We wrap with a small but surprisingly powerful workflow upgrade: fast dictation using Spokenly and local speech-to-text.

Subscribe, share this with a teammate who lives in dashboards and PRs, and leave a review with the one workflow you'd want an AI agent to automate next.

Judoscale: Autoscaling that actually works.
Take control of your cloud hosting.
Honeybadger: Honeybadger is an application health monitoring tool built by developers for developers.
Disclaimer: This post contains affiliate links. If you make a purchase, I may receive a commission at no extra cost to you.
In this Broad Match Show, Danny McMillan and Adam Heist cover two of the most practical AI frontiers for Amazon sellers right now: getting direct API access to your Seller Central data and building a fully automated design workflow from inspiration through to live assets. Adam breaks down how he connected Amazon's SP API and Ads API to an AWS database and wired Claude Code directly to it, giving him real-time, queryable access to years of business data across any metric. No developer required. Danny walks through his 8-step system that takes a seller from a TikTok scroll to a finished, conversion-tested design with brand consistency baked in. Both share hard-won lessons on where AI gets you (the 70–85% mark) and where the human still needs to step in, plus a candid look at what's changing at Seller Sessions Live on May 9th.

Key Topics
Amazon API data pipeline: SP API + Ads API → AWS database → Claude Code for real-time analysis
8-step AI design workflow: Inspiration capture, memory/photo brain, brand system, mood board, asset generation, build, and quality gate
CLI vs MCP: Why CLIs are becoming the cleaner integration path for tools like Google Workspace
Seller Sessions Live (May 9th): New modular format, no sponsors, £5,000 fine system for service providers pitching
Health check-in: Adam on fitness goals; Danny on resolving a high ferritin (iron overload) diagnosis

Timestamps
[00:00] Welcome and introductions
[01:10] Adam: Getting Amazon SP API and Ads API access as an individual brand
[05:00] Storing API data in AWS and connecting it to Claude Code
[07:30] Building custom dashboards and software from your own data
[09:00] How to approach it if you're not technical: think first, screenshot issues, let Claude walk you through
[12:25] Danny: 8-step AI design workflow overview
[13:30] Step 1: Inspiration capture from TikTok, YouTube, social reels
[14:20] Steps 2–3: Memory/photo brain + design system (52 world-leading brands baked in)
[15:30] Steps 4–5: TLDraw mood board + asset generation (Nano Banana 2, Gemini, Remotion)
[17:50] Steps 6–7: Build stage (React, Tailwind, ShadCN, Netlify deploy)
[18:30] Step 8: Quality gate (216-feature scoring: UX heuristics, typography, psychology)
[19:30] Google Stitch + Perplexity demo: full brand system from a product title + screenshot
[23:12] Adam: the 70–85% rule and how to think about AI-assisted design cycles
[27:35] Danny: Google Workspace CLI for email, running launches under 3,000 contacts
[29:26] Health updates: Adam on fitness; Danny on ferritin/iron overload and phlebotomy sessions
[35:45] Seller Sessions Live May 9th: format, venue (inside a church), evening networking
[41:49] The £5,000 fine system for service providers pitching at the event
[43:01] Wrap-up

Key Takeaways
You can get Amazon API access as an individual brand, with no developer credentials needed. SP API goes back 720 days; Ads API covers 60 days. Approval takes 1–2 days.
AWS as a data warehouse for Amazon data: pipe the API into AWS, connect Claude Code to it, and query anything: anomalies, stock-outs, week-on-week comparisons, year-over-year trends.
The non-technical workflow is: think → verbalize → screenshot issues → let Claude solve. You don't need to understand the infrastructure, just be clear on what you want to achieve.
AI gets you to 70–85% fast: bring in your designer or team at stage 4, not stage 0. Cycle times drop from 6 weeks to 1 week.
CLIs beat MCPs for tool integrations where available: less token overhead, fewer config issues, more cohesive experience in Claude Code.
Google Workspace CLI can replace Mailchimp/Klaviyo for small lists: Gmail allows up to 3,000 sends per day, which makes it viable for product brand launches under that threshold.
Seller Sessions Live is now sponsor-free and profitable on ticket revenue alone: the event model is shifting away from conference-style sponsorship dependency.
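
The lookback figures in the first takeaway have a practical consequence for the pipeline: once a day falls outside the Ads API window it can no longer be backfilled, so pulls into the warehouse need to run on a schedule. A minimal Python sketch of that date arithmetic is below. The 720-day and 60-day values are the figures quoted in the episode, not limits verified against Amazon's documentation, and the names `LOOKBACK_DAYS`, `earliest_available`, and `is_retrievable` are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta

# Lookback windows as quoted in the episode (assumed, not verified against Amazon's docs).
LOOKBACK_DAYS = {"sp_api": 720, "ads_api": 60}


def earliest_available(source: str, today: date) -> date:
    """Return the oldest day still inside the source's lookback window."""
    return today - timedelta(days=LOOKBACK_DAYS[source])


def is_retrievable(source: str, day: date, today: date) -> bool:
    """True if `day` can still be requested from `source` as of `today`."""
    return earliest_available(source, today) <= day <= today


if __name__ == "__main__":
    today = date(2026, 5, 9)
    for source in LOOKBACK_DAYS:
        print(source, "earliest day:", earliest_available(source, today))
```

Under these assumptions, a daily pull keeps the warehouse complete, while any gap in pulls longer than the Ads API window leaves days that can no longer be fetched.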
Notable Quotes
"Getting the actual real-time API data access has been just another level completely." - Adam Heist
"The original thought is: I need to get API access and I need to connect that to Claude. That's my thinking. And then you literally just verbalize that and use screenshots as you get stuck." - Adam Heist
"AI gets you to the finish line faster across way more dimensions, so instead of doing 600 things in a year, you're doing 2,000." - Adam Heist
"We live in a time whereby execution in a way is taken care of by AI. Where we're needed is on the vision: do we build this or don't we build it?" - Danny McMillan
"Know with AI it's dumb unless you give it a brain." - Danny McMillan

Resources Mentioned
Amazon SP API: Business reports, inventory, listings, SQP data; up to 720 days historical
Amazon Ads API: Ad performance data; 60-day lookback
AWS (Amazon Web Services): Cloud database for storing API data; connects to Claude Code via MCP or CLI
Claude Code: AI coding assistant used to build the data pipeline and dashboards
Google Stitch: Free UI design tool; used to generate brand systems from a product image + title
Perplexity: Combined with Stitch to generate full design systems from Amazon listings
Nano Banana 2: Image generation tool controlled via Claude; used in Danny's asset generation step
Gemini: Used with reference images for asset generation
Remotion: Video generation component in Danny's design workflow
TLDraw: Collaborative whiteboard/mood board tool; integrated with Claude for live-updating design boards
React / Tailwind / ShadCN: Front-end stack used in the build step of Danny's workflow
Netlify: Deployment target for the build step
21st Century Dev / ShadCN MCPs: Component library MCPs used in the build stage
Google Workspace CLI: Cleaner alternative to Gmail MCP for read+write workflows in Claude Code
Playwright / Fetch MCP: Browser automation tools; Danny built a 4-stage cascade scraper for Amazon

About the Show
The Broad Match Show is a monthly format on Seller Sessions, hosted by Danny McMillan and Adam Heist. It covers the cutting edge of AI tools, Amazon strategy, and brand building, and airs on the first Tuesday of every month. Seller Sessions is one of the longest-running Amazon seller podcasts, hosted by Danny McMillan. Known for deep-dives into conversion, data, and the practical application of AI for e-commerce brands.
For all those who missed out on London, see you in Miami next week!

Notion, the knowledge work decacorn, has been building AI tooling since before ChatGPT, with many hits from Q&A in 2023 and unified AI in 2024 and Meeting Notes in 2025. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0's Custom Agents - and they are finally embracing the Agent Lab playbook!

Sarah Sachs and Simon Last of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work.

We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today, from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work.

We discuss:

* Sarah and Simon's path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production
* Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model
* The “Agent Lab” thesis: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities
* How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are
* Why coding agents feel like the kernel of AGI, and how Notion is thinking about “software factories” made up of agents that spec, code, test, debug, review, and maintain codebases together
* How Sarah runs AI engineering at Notion (“notes from Token Town”): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to swarm around fast-changing opportunities
* The “Simon Vortex,” company hackathons, and why security gets pulled in early rather than late
* How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents
* Why prototypes have become much easier to build internally, and how “demos over memos” changes product development inside a tool the whole company already uses every day
* Notion's eval philosophy: regression tests, launch-quality evals, and “frontier/headroom” evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going
* What a “Model Behavior Engineer” is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than just software engineering
* The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops
* How the “software factory” should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter
* A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database
* How agents compose inside Notion: shared databases as primitives, agents invoking other agents, “manager agents” supervising dozens of specialized agents, and memory implemented simply as pages and databases
* Notion's take on MCP vs CLI: why Simon is bullish on CLI's self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment
* The evolution of Notion's internal agent harness: from early JavaScript coding agents, to custom XML, to Markdown and SQL-like abstractions, to tool definitions, progressive disclosure, and a much shorter system prompt
* Why Notion cares about teaching “the top of the class,” building for sophisticated operators rather than abstracting away too much capability for everyone
* How agent setup works today: agents that can configure themselves, inspect their own failures, and edit their own instructions, with guardrails around permissions
* How Notion prices Custom Agents: credits as an abstraction over tokens, model type, serving tier, web search, and future sandbox costs; why usage-based pricing was necessary; and how “auto” tries to match the right model to the right task
* Why Notion is not eager to train a foundation model, where they do fine-tune and optimize today, and why retrieval/ranking is one of the most important investment areas as more searches come from agents rather than humans
* Why Meeting Notes became one of Notion's strongest growth loops: not just as transcription, but as high-signal data capture that powers search, custom agents, follow-up workflows, and the broader system of record for company collaboration
* Why Notion is more interested in being the place where collaboration data lives than in building hardware themselves, and how wearables or other capture devices may eventually feed into that system

Sarah Sachs
LinkedIn: https://www.linkedin.com/in/sarahmsachs
X: https://x.com/sarahmsachs

Simon Last
LinkedIn: https://www.linkedin.com/in/simon-last-41404140
X: https://x.com/simonlast

Full Video Episode

Timestamps

* 00:00:00 Introduction and launching Notion Custom Agents
* 00:01:17 Why Notion rebuilt agents four or five times
* 00:03:35 Building for where models are going, not just where they are
* 00:05:32 The Agent Lab thesis, wrappers, and product intuition
* 00:08:07 User journeys, leadership, and low-ego AI teams
* 00:13:16 The Simon Vortex, hackathons, and bringing security in early
* 00:16:39 Team structure, demos over memos, and building for agents
* 00:20:25 Evals, Notion's 
Last Exam, and the Model Behavior Engineer role
* 00:27:37 Evals as an agent harness and the changing role of software engineers
* 00:30:42 The software factory: specs, verification, and agent workflows
* 00:32:18 Live demo: a custom agent for coworking space applications
* 00:35:08 Composing agents, manager agents, and memory as pages
* 00:38:15 Notion Mail, Gmail, native integrations, and tools
* 00:39:43 MCP vs CLI and the cost of capability
* 00:44:13 When Notion uses MCP vs building its own integrations
* 00:47:43 The history of Notion's agent harness rebuilds
* 00:55:35 Power users, public tools, and the setup agent
* 00:58:01 Self-fixing agents, permissions, and “flippy”
* 01:01:13 Pricing, credits, and choosing the right model automatically
* 01:09:01 Why Notion isn't training its own frontier model
* 01:14:07 Retrieval, ranking, and search built for agents
* 01:17:27 Meeting Notes as data capture and workflow automation
* 01:21:18 Wearables, hardware, and Notion as the system of record
* 01:23:45 Outro

Transcript

[00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.
[00:00:11] swyx: Hello. Hello. We're back in the beautiful studio that, uh, Alessio has set up for us with Simon and Sarah from Notion. Welcome.
[00:00:18] Sarah Sachs: Thanks for having us.
[00:00:19] Alessio: Thanks for having us. Yeah.
[00:00:20] swyx: Congrats on the launch recently, the custom agents, finally it's here. How's it feel?
[00:00:26] Sarah Sachs: We ship things slowly. So it had been in alpha for a little bit and at the point at which it's in alpha, um, there's a group of people that are making sure it's ready for prod, and then there's a group of people working on the next thing. So sometimes some of these launches are a bit delayed satisfaction, so it's quite nice to remind yourself all the work you did because we do have a habit of, like, being two or three milestones ahead. 
Uh, just ‘cause you have to be, you know, you can't get complacent. Um, but it's been great that people understood how this is helpful. And I think that's just easier in general building AI tools today than it was two, three years ago. People kind of get it and so that user education, um, there's just, it was our most successful launch in terms of free trials and converting people and things like that. It was really successful, so yeah. But there's a lot to build.
[00:01:12] swyx: Making it free for three months helps.
[00:01:16] Sarah Sachs: Yep.
[00:01:17] Simon Last: It was definitely super exciting for me because it's probably the fourth or fifth time that we rebuilt that.
[00:01:22] swyx: Yes.
[00:01:23] Simon Last: And I mean,
[00:01:24] swyx: you've been building this since like 2022.
[00:01:26] Simon Last: Yeah, I mean, like, it was even right when we got access to like GPT-4 in late 2022, one of the first ideas we had is like, oh, okay, let's make an agent that, I, we used the word assistant at the time, there wasn't really the word agent yet, but, oh, we'll give it access to all the tools that Notion can do, and then it'll run in the background, like, do work for us. And then we just tried that many times and it just was too early. Um,
[00:01:48] swyx: I need to force you to like double click on that. What is too early? What didn't work?
[00:01:52] Sarah Sachs: We were fine-tuning, like, before function calling came out. We were trying to fine-tune with the frontier labs and with Fireworks, like, a function calling model on Notion functions. This is right when I joined. I joined because, um, we needed a manager, as Simon needed to be able to go on vacation. 
So, uh, that's, that's around when I joined, so you can speak much more to it.
[00:02:11] Simon Last: Yeah, we did partnerships with both Anthropic and OpenAI at different times, uh, to try to, at the time the, I mean, when we first tried, there wasn't even a concept of like tools yet. We, we sort of designed our own like, like tool calling framework and then we tried to fine-tune the models to, uh, to use it over multiple turns. Um, and because it, it didn't work well out of the box, I think. Yeah. The models were just too dumb and the context window was also way too short.
[00:02:37] Alessio: Yeah.
[00:02:37] Simon Last: Um, and yeah, we just kind of banged our head against it for a long time. Uh, unfortunately it was always like, there were always like sort of glimmers that it was working, but um, it never felt quite robust enough to be like a useful, delightful thing. Um, until I would say, uh, the big unlock was probably like Sonnet 3.6 or 3.7, uh, early last year. And that's when we started working on our agent, which we shipped last year. Um, and then, and then uh, uh, custom agents, kinda a similar capability and that, that one just took longer because we, we just wanted to get the reliability up a lot higher, ‘cause it's actually running in the background.
[00:03:14] Sarah Sachs: And the product interface of like permissions and understanding, you know, this custom agent is shared in a Slack channel with X group of people and has access to documents that are surfaced to Y group of people. And the intersection of X and Y might not be whole. And so how do you build the product around making sure administrators understand that? Permissioning took multiple swings.
[00:03:35] Alessio: Everything is RBAC at the end of the day. Yeah. 
I'm curious, like when the models are not working, how do you inform the product roadmap of like, okay, we should probably build, expecting the models to be better at some reasonable pace, but at the same time we need to, you know, you had a lot of customers in 2022. It's not like you were a new company or like no user base.
[00:03:54] Simon Last: Yeah, I mean I think there's always the balance of, you know, like you want to be AGI-pilled and thinking ahead and building for where things are going. Uh, but also you wanna be like shipping useful things. And so we always try to like, like keep a balance there. You know, we try to take, like, a portfolio approach. You know, we're always working on multiple projects and, and we're always trying to work on, you know, maintaining things that have already shipped, like, like shipping new things that are like eminently working well and make them really good. And, and then we wanna always have a few projects that are a little bit crazy. Um,
[00:04:23] Alessio: and what are the AGI-pilled projects that you have today? I'm curious about, uh, you don't have to share exactly what you're working on, but I'm curious what are things today that maybe in 18 months people will be like, oh, obviously this was gonna work
[00:04:35] Sarah Sachs: 18 months.
[00:04:37] Alessio: Yeah, 18 months is, you know,
[00:04:37] Sarah Sachs: it's a long time and Yeah. Yeah.
[00:04:39] Simon Last: I mean, there's a number of things happening. I think one thing that's becoming more clear is I think like, like, uh, coding agents are the kernel of AGI, sort of, everything is a coding agent. Mm-hmm. I think that's, that's sort of one, one direction. Um, and then, yeah, the exciting thing about that is sort of your agent can sort of bootstrap its own software and capabilities and actually debug and maintain them. And so yeah, we're, we're, we're thinking a lot about that. 
And then, yeah, like, like another category of things that I'm, I'm really excited about is like, uh, what we call the software factory. Also, people are using this, uh, this, this sort of word. Um, basically it just means can you create sort of like a, as automated as possible, a workflow for developing, debugging, mm-hmm, merging, reviewing, and maintaining a code base and a service where there's a bunch of agents working together inside, and like, like how does that work?
[00:05:28] Sarah Sachs: If you think back to your initial question, like, why did this take so long? I think something,
[00:05:32] swyx: I didn't say that, but Yes. Okay. Go ahead.
[00:05:34] Sarah Sachs: Why, what, what changed over the three and a half years of trying
[00:05:37] swyx: it? Exactly. Right. Because most people always say like, it didn't work yet. Then reasoning models came, then it worked. I was like, okay, let's go a little
[00:05:43] Sarah Sachs: bit. That's, I mean, that's part of it, but I think the other part of it that I actually think is really what will set Notion apart for every new capability is we have, like, two skills that are crucial when it comes to frontier capabilities. One is not letting yourself swim upstream. So like quickly realizing if you're just pressing against model capabilities versus not exposing the model to the right information, not having the right infrastructure set up. That in and of itself is the skill of intuition. And the second is to see, okay, you're not swimming upstream. Which direction is the river flowing and what is like, how do we think ahead about the product and start building it even if it's not great yet, so that when it is there, we're ready for it. Right? And like those can sometimes feel like counterintuitive things. Like we can be trying to fine-tune a tool calling model when they don't exist yet. And that the trick is to not do that for too long, but realize that there was something there. 
And we've had a lot of things which like, um, we're just like not swimming in the right direction with the streams. I think we had multiple versions of transcription before we got meeting notes, right? Oh, I gotta talk
[00:06:39] swyx: about that. Yeah.
[00:06:40] Sarah Sachs: Yeah. Um, and so. I, I, I think that like we, we really closely partner with the frontier labs on capabilities and we also have to have strong conviction on, as those capabilities move. Notion is about being the best place for you to collaborate and do your work. And how does that narrative change if the way that we work changes? Yeah.
[00:06:58] swyx: Yeah. You told me you were a fan of the Agent Lab thesis, and this is, this is kind of it, right?
[00:07:02] Sarah Sachs: Right. I show that thesis to so many candidates. Like I have it as like my Chrome autofill. Um, at this point, like it's one of my most visited
[00:07:10] swyx: because like, is this the, here's why you should work at Notion and not OpenAI. I, it's like,
[00:07:14] Sarah Sachs: here's, here's what's different about it.
[00:07:16] swyx: Yeah.
[00:07:16] Sarah Sachs: And here's why. It's not just a wrapper. I actually think more and more people understand it's not just a wrapper.
[00:07:21] swyx: Yeah.
[00:07:22] Sarah Sachs: Um, and by the way, like in the beginning, parts of what we build are wrappers on functionality. That works well, of course, but that's not really the most, um. I would say that's not the product that, that drives revenue. And that's not necessarily always what users need.
[00:07:35] swyx: I mean, you know, Notion is the AWS wrapper, but like the, the wrapper is very beautiful and like very, very well polished. So
[00:07:40] Sarah Sachs: like the analogy,
[00:07:41] swyx: like
[00:07:42] Sarah Sachs: the analogy that I've been coming back to is Datadog and AWS
[00:07:45] swyx: Yeah.
[00:07:46] Sarah Sachs: So, uh, Datadog could not exist with, without cloud storage. Right. 
That it's kind of fundamental that that works. Um, and AWS has like a CloudWatch product, but Datadog is an expert on understanding how people want observability on the products they launch. And we're experts in understanding how people wanna collaborate, and that's really where our expertise lies.
[00:08:04] swyx: Totally.
[00:08:04] Sarah Sachs: Um, regardless of the tools that we use,
[00:08:07] Alessio: I'm kind of curious how you think about implicit versus explicit expertise. I feel like Datadog is half and half implicit and explicit. It's like they understand across markets and industries what engineering teams usually look for. With Notion, it's almost like more of the expertise is at the edge because you as a platform, you're like so horizontal that the end user is not really the same. Mm-hmm. Like with Datadog, the end user is always like, yeah, an engineering lead, a kinda like SRE related person. With Notion, it can be anything. So I'm curious how you put that expertise into a product versus, you know, obviously AWS cannot build Notion. It's, that doesn't quite work in this case, but
[00:08:44] Simon Last: it's, it's a little bit differently shaped. I think, you know, a classic vertical SaaS, like Datadog is kind of like that. They understand their individual customer very deeply. It's kinda a narrow slice. Um, Notion has always been super horizontal. And our, our task has always been to sort of balance these two somewhat opposing forces of like, we're listening to our customers and what they want us to build. It's a broad slice. And then also we're thinking about like, okay, how do we decompose what they want into, uh, nice primitives that are, that are really nice to use and will, will get us like as much bang for the buck as possible. And then, you know, maintain the whole system, make it all like, like super clean and nice to use.
[00:09:22] Sarah Sachs: We still have user journeys. I mean, we still focus on like core. 
I actually think the failure of our team is when we focus too much on what are, what are tools that are
[00:09:31] Simon Last: mm-hmm.
[00:09:31] Sarah Sachs: cool tools. I actually think that's when we have the least velocity because you still need some sort of focus on a user journey. So like for instance, we'll all sit down every Friday and look at the P99 of like the most token-exhaustive custom agent transcript and just look at why it didn't do well and cut a bunch of tasks. Like we still focus on like, this has, like this should work. Email triaging should work. Mm-hmm. Right. And similarly, like when we were talking about, before building, um, chatting, um, before we started filming, about, okay, how can I do PDF export? Well that's functionality that then merits, maybe we should build a tool that has access to a computer sandbox and a file system and the ability to write code. Right? Right. Um, but it's because we're thinking about the fact that our users, to do their, to do their daily work, need to export PDFs, not because we're like, Hmm, I think a computer tool could be cool. Like, let's just see what happens. Mm-hmm. Like we, we have to focus on some user journeys, otherwise we just don't have like, enough strategy to, to prioritize.
[00:10:29] swyx: I think there's a lot of like really strong opinions that you've had. Do you have like sort of like a Tao of Sarah Sachs? Like, you know, like what, how do you run your team? Like I feel like you just have accumulated all these strong opinions. Obviously part, part of this is your, your Token Town thing.
[00:10:43] Sarah Sachs: I think the Tao of working with Sarah Sachs is, um, you'd have to, it depends who you ask. Um, I think it depends if you're on my team or a partner, right, or a vendor.
[00:10:54] swyx: Yeah. There are other people who want to run their teams the way that you're, yeah. You're like bringing these things. 
And then also similarly, uh, Simon, when you did the custom agents demo, you had like, well, we've been using custom agents and here's the super long list of everything that we do. No humans ever read it. Right? That's what you said. I was like,
[00:11:07] Sarah Sachs: yeah. So I think for, for me, um, something that I learned very quickly and became very comfortable with was that my job was not to be the ideas person or the technical expert. My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on, and had an avenue to prioritize what they thought was important. And I think that's true with all, all leadership, but I think especially on the AI team. Almost all of our best ideas come from prototypes, from people that have a cool idea because they saw a user problem, and it's a huge disservice if all of those ideas have to pass, like the sniff test of what me and a product partner or Simon and Ivan decided were the direction, right? Because a lot of what we're doing is leaning into capabilities, so. I think that's the first thing is like, I don't really view like the role of engineering leadership as like, uh, hierarchical, nor has it ever been, but especially now, like very willing to change direction based on, um, like proof is in the pudding. Yeah. And like, and I think we have rebuilt our harness three or four times. And when you do that, then the second rule of engineering leadership is like you need to build a team that's comfortable deleting their own code and is very low ego and is driven by what's best for the company. And, um, doesn't write design docs because they think it's their promotion packet. Right. And that's a culture that Notion had long before I joined, but like our willingness to just swarm on different problems and um, redo things that we've built before because something has changed. Like, there's a lot of friction that can happen at companies when you do that. 
And it doesn't happen at Notion. And because it doesn't happen, when new people join, like, they don't wanna be the ones that are saying, we shouldn't do this, I wrote that code. So then it's, you know, you, you create a culture that everyone adopts, and that culture comes directly, I think, from Simon and Ivan though, um, because they're very open-minded.
[00:12:50] swyx: Anything that you,
[00:12:50] Simon Last: you'd add? I'm not a manager, like, like, like Sarah is. Um, a lot of my role is really to try to think a little bit ahead, make sure that we're, we're building on the right capabilities and then like the prototyping stuff. And yeah, it's really, really critical to always just be starting again. It's like, okay, this is a new thing. What does this mean? What if we just rethought everything or rewrote everything? And so I, I'm, I'm basically just doing that in a loop every six months.
[00:13:16] swyx: Yeah. Do you believe in internal hackathons for this stuff?
[00:13:19] Sarah Sachs: I think there's like two different versions. So one is like, we just have a, a, a solid bench of senior engineers that come and go on what we call the Simon Vortex and productionizing what we built, right? Because when you're in the Simon Vortex, the velocity is super high. The direction changes daily, and it's meant to be like the equivalent of a Skunk Works lab. We don't need to do hackathons for that. We need to have senior engineers that we trust to come in and out of those projects. For instance, like management boundaries are really loose. Like you report to him, but you work for her right now. Yeah. That's something that when we hire managers, it's important they don't care about because we tend to form org structures. Yeah. Don't be too
[00:13:54] swyx: territorial.
[00:13:55] Sarah Sachs: We form org structures after we ship things, not, not before, just historically. 
Um, the second thing is we do have companywide hackathons. Actually we just had our demo day for the hackathon we had last week this morning. That's more for people that aren't directly working on the project, feeling like they have the time to pause and learn how to make themselves more productive or how they would use Notion custom agents to build something. Or part of the hackathon was actually encouraging everyone across the company to build their own agentic tool-calling loop from scratch. Follow, like, an Every blog post on how to do it, I think, because we want
[00:14:26] swyx: just with the compound engineering one. Yeah.
[00:14:28] Sarah Sachs: We want everyone to use Claude Code in the company or whatever the coding agent they please and understand that fundamental. So we set aside a day and a half where all leadership encouraged everyone on their teams across the company to do it. So we have hackathons like that. I would say like kind of facetiously, like everything we build is a little bit like a hackathon until it graduates and puts on big boy pants and has a product ops rollout leader and has assigned data scientists and stuff like that,
[00:14:54] swyx: security review, enterprise stuff,
[00:14:56] Sarah Sachs: actually security review is one of the things that we bring in first because otherwise it just slows us down way more and, um, causes a lot of tension and they build better product if they're involved early. So, um, that is probably the first person to get involved in something. That's the
[00:15:09] swyx: right PR approved answer.
[00:15:10] Sarah Sachs: No, but it's not just PR approved. It like, um, um, it's
[00:15:13] swyx: actually real. It's actually real. It's like, um, I'm just saying scar
[00:15:15] Sarah Sachs: tissue.
[00:15:15] swyx: Yeah,
[00:15:16] Sarah Sachs: because like, you know, my background's also, I worked at Robinhood for a number of years. Yes. 
So like, uh, compliance and things like that, um, are a little bit more, you learn the hard way when it doesn't come naturally.
[00:15:26] Simon Last: Yeah. I think the, the hackathon is really important for uplifting the general population, but like, if that's the only way you can build new things, you're kind of toast. I mean, it, it has to be like the daily processes, like, you know, building these new things. Um, and it has to be about, I think like, I think in the AI era a lot more leverage accumulates to the most curious and excited people. And so it's like we're all about just like activating that energy. You know, like if someone's prototyping something on the weekend that they're excited about and it's important, that should be the main thing that we're doing. Yeah. Um, it's not a hackathon that we schedule once a quarter, it's just like, yeah. Daily process. Part of the culture.
[00:16:02] Sarah Sachs: I mean, that's how we shipped image generation in Notion now. It was always this thing that would be kind of nice to have, but it wasn't really clear where that was necessarily aligned in product priorities. It'd be a lot of work. And we had someone on the database collections team, Jimmy, who was like, I really wanna do image generation for cover photos and inside Notion. And we're like, if you wanna build it, like it's, do it please. Like we encourage you. We gave ‘em all the resources of working directly with Gemini and being able to like track the token usage and it working through endpoints. We gave them eval support, everything, and then it became a, a full project.
[00:16:34] Alessio: Yeah.
[00:16:35] Sarah Sachs: That's why you can't have like ego as a, a leader. Like that's, that's how we work.
[00:16:39] Alessio: What's the size of the team today, both engineering and overall?
[00:16:43] Sarah Sachs: I manage, uh, the team that's, what we'll call it, core AI capabilities and infrastructure. That's about 50 people. 
But then we have, uh, partner teams that do packaging. So how it shows up in the corner chat versus custom agents versus meeting notes, that's another 30, 40 people. And, and then every team that has a product surface at Notion that a user can interface with owns the tool that the agent interfaces with. The editor team, the team that did CRDT for offline mode, is the same team that handles how two agents, um, edit competing blocks. Mm-hmm. Right? It's the same problem. The team that built the underlying SQL engine is the same team that owns how the agent asks it to run a SQL query, and it does it performantly. And so from that regard, anyone working on product engineering is tasked with making them work for customers that are humans and agents because over time the majority of our traffic will be coming from agents using our interface, not humans. And so our objective is to make it so that the whole product org is building for agents.
[00:17:40] Alessio: Yeah. How has it changed internally? The activation bar is kind of lowered a lot. Like anybody can kind of create a prototype very, somewhat easily, especially if you're in, like, an existing code base. Have you raised the bar on like what type of prototype people need to bring forward to be taken, not like seriously, but like, you know what I
[00:17:58] Simon Last: mean? Yeah. I think the bar is lowered in many ways. Like, one thing our, uh, our team built that is really cool is our, uh, our, our design team made a whole separate GitHub repo, uh, called the, the design playground. And it's basically just to create a bunch of like, like helper components and UI, uh, for, for quickly throwing together UIs. And it's become like actually quite sophisticated. Like it has like an agent in there and like, uh, that's pretty fun. So like, we pretty much, like, they don't do mocks, they just make like, like full, full prototypes.
[00:18:27] swyx: Here it is. It works.
[00:18:28] Simon Last: They give you like a URL. 
They're like, okay, all right. So we have to make the, like the real production version of that. Um, and then for engineers, a prototype looks like just making it a feature flag that actually works. Like that's sort of the bar.
[00:18:39] Sarah Sachs: Something to understand that's really unique about Notion, one of the reasons I joined, we're super lucky, is no one uses Notion in their job as much as people that work at Notion.
[00:18:46] Simon Last: Of course.
[00:18:47] Sarah Sachs: So I think there's very few companies, maybe if you worked on Chrome I guess, but like everything that we ship, we ship internally first and get a lot of really quick feedback. And also sometimes our dev instance is totally borked and you have to change a bunch of flags to get things done. And that's kind of like, but everyone, so people that do IT ticketing, people that do supply chain procurement, recruiting, everyone is using the same instance of Notion with like a lot of flags on for these prototypes people build. Um, and so we have this, Brian Levin, one of the designers on our team, I think evangelized this concept of demos over memos.
[00:19:18] swyx: Ooh, too
[00:19:20] Sarah Sachs: good. Um, which has been, uh, very good for building demos, and I think it's put a big pressure point on us to have really strong product conviction, because if anything can be demoed, you really need a strong filter of making sure that if, you know, you're doing X amount of work, you're making the, you're, you're focusing on one tower, you're not just building a really flat hill. Right. That's actually where I think there has to be more conviction from our PMs, um, and our designers and, and well, the company really, to have conviction of what journey we're going on.
[00:19:52] Simon Last: But overall, I feel like it works pretty well. 
Like people, almost all the engineers have good enough taste to realize that like, this prototype doesn't actually make sense in the product, or, or it does. So it's not that common that I would see a prototype and it's like, oh, this makes no sense. Mm-hmm. It's like, you know, people are doing reasonable things and, and, and then it's just a matter of which things we build first and then often just, just figuring out how to turn it on and off. There's our, in the, in our like experimental chat UI, there's this, there's probably like, like a hundred check boxes in there.
[00:20:22] Sarah Sachs: Kills me
[00:20:23] Simon Last: the things you could turn on and off.
[00:20:25] Sarah Sachs: Uh, but I think that, okay, so that is kind of true, Simon, but like being the person that manages the evals team, like there is a level of intensity that it adds to the platform team. So, you know, if we're gonna do image generation in Notion, all of a sudden the way that we do attachments and the way that we, um, our LLM completion like cortex talks and expects tokens back and now it's getting images back. Like there's a lot of platform work that we do need to, like solidify a little bit. So sometimes it'll be in dev for a couple weeks before it makes it to prod just because we still have to like, make it robust, make it HIPAA compliant, ZDR compliant, figure out the right contracting with the vendor, whatever it is. And we need to eval it because we want the team to still maintain what they build. That's the one thing is like if we have a bunch of prototypes, it can't just be like a small group of people that then maintain whatever N prototypes. So we have invested a lot of people in eval and model behavior understanding teams that, we call it agent dev velocity. So your dev velocity building agents can be faster if we invest in that platform. 
And so we have a whole org dedicated to agent, um, platform velocity so that you can build your own eval and then maintain it once you ship it. So if a new model release comes out and we, every[00:21:38] swyx: team maintains their own eval,[00:21:40] Sarah Sachs: we maintain the eval framework. Every team owns their own evals and a lot of them we've integrated to opt in to CI, or we run them nightly and we have a team, uh, a custom agent that triggers to a team to look at the major failures. That's really critical because if we have like all these different surfaces now, a lot of it's on the same agent harness, so it's easier to maintain. It's just packaging of different agent harnesses, but new functionality of the agent. Let's say that like we wanna update like, uh, you know, they deprecated Sonnet, um, 4 or whatever it is and we need to auto update.[00:22:11] swyx: Are they already? That's so, okay. Yeah. Actually wasn't that long ago.[00:22:14] Alessio: They were just 3.5.[00:22:15] Sarah Sachs: 3.5, 3.7 just got deprecated.[00:22:18] swyx: 3.7, 5.2 or, yeah. No,[00:22:20] Sarah Sachs: it's not. 5.2 is five point. Five point, no. Yeah, 5.4 is 40% more expensive than 5.2. So if they deprecated 5.2, you would hear from me about that one. Um, but, uh, another conversation to have.[00:22:35] swyx: I have a cheeky evals question for you. Have you noticed any secret degradation from any of the major model providers?[00:22:40] Sarah Sachs: Secret degradation,[00:22:42] swyx: like, during the workday, when it's high traffic, it suddenly gets dumber.[00:22:47] Sarah Sachs: Yeah. I mean, not just between the, I mean, we definitely notice flakiness, we've definitely noticed, particularly for some providers, that things are slower during working hours and[00:22:57] swyx: there's a latency argument. Yes. Not a quality argument.[00:22:59] Sarah Sachs: No. 
I think the quality difference that's interesting is, um, even though companies that say they're selling the same, a, it's really into like quanti quantization, but like companies that say they're selling the same model through different vendors, whether it be through first party or Bedrock, Azure, et cetera.We do see different qualities sometimes, and that's not necessarily what's advertised.[00:23:21] swyx: Yeah. Kidney went to the point of like, if we, they shipped like this, like eval across all the providers and it was like very obvious we were secret equalizing and it was very,[00:23:28] Sarah Sachs: yeah. But[00:23:29] swyx: that's very embarrassing.[00:23:30] Sarah Sachs: You know, um, we hire Subprocess to figure that out for us.So we just wanna understand where it's regressing or where it's optimized. And sometimes we're okay with regressions that optimize latency if they're the appropriate regressions. Our job is to make sure we have the evals to understand the changes that are important to us. And even like when we're partnering with labs on pre-releasees of models, they'll send us multiple snapshots.And this is less about quantization, but more just regressions. Like they have shipped models that were not the snapshots that we wanted, and they have changed the snapshots that they shipped based on the feedback that we give. Because our feedback tends to be more enterprise work focused and not coding agent focused.And definitely those can be bummers, like, you know, uh, we know that this wasn't the version you wanted, but we'll help you make it work. I mean, we always make it work, but that definitely happens.[00:24:16] Alsesio: Yeah. Do you have, um, failing evals that you're just hoping, oh, that will have success eventually when a good model comes out?[00:24:23] Sarah Sachs: Uh, I mean, yeah. So I think. I mean, I could talk about this for 60 minutes, so I will limit myself. 
I think it's a real issue when people say evals and it's just like, that's quality, that's like unit, I mean, it's like saying testing. It's not just unit tests, right? So we have the equivalent of unit tests and regression tests. Those live in CI, those have to pass a certain percent, you know, within some stochastic error rate. Then we have, as you're building a product, evals of "these aren't passing right now, and this is launch quality." So we have a report card and we need to, on these categories, you know, be at 80 or 90% on all of these user journeys to launch. And then we have what we call frontier or headroom evals, where we actively wanna be at a 30% pass rate. And that's actually been an effort that we took in partnership with Anthropic and OpenAI in the past maybe two or three months, because we actually hit a point where our evals were saturated and we weren't able to really give insightful feedback other than it wasn't worse. And not only is that not helpful for our partners, it's not helpful for us to understand where the stream is going. You know, going back to that analogy. And so we spent a lot of time thinking about what Notion's Last Exam looks like, right? Mm-hmm. Not just Humanity's Last Exam. Ooh, Notion's Last Exam. Mm-hmm. And, um, there's a lot of, you know, dreams about what that would look like. I know we've talked a lot about benchmarking, um, swyx, but, uh, yeah. Notion's Last Exam is a big thing inside the company and we have people staffed to it full-time, exclusively. Mm. We have a data scientist, a model behavior engineer, and a full-time, um, evals engineer just dedicated to the evals that we pass 30% of the time.[00:25:56] swyx: What you're hiring for?[00:25:57] Sarah Sachs: MBEs. I am hiring.[00:25:58] swyx: What is an MBE?[00:25:59] Sarah Sachs: A model behavior engineer. Model
Behavior engineers started with a title data specialist before I joined when they were working with Simon on like, uh, Google Sheets and like Simon just needed someone to look through Google Sheets and say, yes, no, this looks bad. This looks good. Right? And so we hired people with kind of diverse linguistics background.We had like a linguistics PhD dropout. Mm-hmm. And a Stanford ate new grad. And they're amazing. And they formed a new function basically. And over time we've built a whole team, um, with a manager who's now kind of reinventing what that role is with coding agents. So they used to be kind of manually inspecting code.Now they're primarily building agents that can write evals for themselves or LLM judges. There's a really funny day I can send you the picture where Simon, about a year and a half ago, was teaching them how to use GitHub. Um, and they're on the whiteboard and it was like, okay, I think it would be so much faster if our data specialists learned how to use GitHub and like learned how to commit these things in Dakota.And, and that was then and now I think, you know, coding has been a lot more accessible. Um, but moving forward it's this mix of like data scientist PM and prompt engineer because there's craft in understanding like even like what models can and can't do things. How do we define like that headroom? How do we define like what a good journey is?Um, is this model better or not? Why is this failing? There's some qualitative work, but then there's also like a lot of instinct and taste to it, and that's not necessarily software engineering. And so we have like very firm conviction and we have had for a number of years now that that is its own career path and we have always welcomed the misfits, so to speak.So we really firmly believe that you don't need an engineering background to be the best at this job. 
And that's what's quite unique about this particular role.[00:27:37] Simon Last: Yeah, this is something that I've been pretty excited about recently is we made an effort basically to treat the eval system as like an agent harness.So if you think about it, like, you know, you should be able to have an agent end-to-end, download a dataset, run an eval, iterate on a failure, debug, and, and then implement a fix. And ultimately you should be able to, you know, drive the full time process with a human sort of observing the, you know, the outer uh, system.So yeah, we went, went pretty hard on that. And that's, that's worked extremely well so far. It's like basically just to turn it into a coding agent, uh, uh, problem.[00:28:11] swyx: Your coding agent or just whatever[00:28:13] Simon Last: harness No coding agent. Yeah, code, cloud code. It should be totally general. Yeah. I think if it would be a mistake to like, like fix it on any, any particular coding agent.At the end of the day, it's just like CLI tools.[00:28:21] Sarah Sachs: It's like the same way that you would've a coding agent write the unit test. You should have a coding agent write the eval.[00:28:26] swyx: Yeah.[00:28:26] Sarah Sachs: But there's a lot of supervision in that still. We just don't believe that supervision has to come from software engineers because a lot of it is like, um, kind of you XREE and whatever, and these are the people that also triage failures and tell us where we should be investing next.[00:28:40] swyx: Yeah. I'm gonna go ahead and ask a spicy question. Is there a data, there are no software engineers at Notion.[00:28:46] Simon Last: Um,[00:28:46] Sarah Sachs: what does it mean to be a software engineer?[00:28:47] swyx: Exactly.[00:28:48] Simon Last: I mean, I think the way things are going is like we're on some continuum where. 
If, if you look back three years ago, humans were typing all the code and then we had autocomplete, you're typing less of the code. Then we had sort of like filling agents, filling lines, and now we're getting into like agents doing longer range tasks where you can debug and implement a fix and then verify it works and you know, get your, get your PR even, like, merged and deployed. I think we're sort of just moving up the abstraction ladder and then the human role becomes more about observing and maintaining the outer system. There's a string of agents flowing through, like, merging PRs. What's going off the rails? Like what do I need to approve? Is there like a learning or memory mechanism that, that works? So it's kind of a hard engineering problem. There's a, you know, there's, there's a lot to do there. I think we're just sort of moving up the stack.[00:29:34] Sarah Sachs: The same transition machine learning engineers have made, right? Like I haven't looked at a PR curve in a while.[00:29:39] swyx: Yeah. You used to do this stuff and now, um, auto research can do it,[00:29:42] Sarah Sachs: right? Like I think it depends on what you define as a software engineer.[00:29:46] swyx: Yes. It's, that's changing for sure.[00:29:49] Sarah Sachs: I think every software engineer at Notion this summer went through like this, um, sheer, um, one of our engineering leads of the company called it, like every software engineer is going through the, the, uh, identity crisis that every manager goes through, where all of a sudden they realize their ability to write code is less important than their ability to delegate and context switch. And I think that is a transition out of being a software engineer. But[00:30:12] Simon Last: yeah. Yeah, there's a critical difference to being a manager, which is that like, it is actually very deeply technical. 
The problem, you know, humans are very like, like, like fuzzy and you can't like treat a team of humans like a, like a rigorous system where like, you know, PRs like, like flow through and can be in like a blocked status and then what happens when they're blocked, right. With a set of agents, you actually can do that. And, and, and I think it's actually, there's a lot of interesting technical rigor that, that goes into that. It's like, it's a technical design problem, ultimately.[00:30:42] Alessio: What is the design of the software factory that you're building?[00:30:46] Simon Last: Yeah, I mean, I think we're trying a lot of different things. I mean, ultimately you want to design a system that requires as little human intervention as possible, but like still maintaining the invariants that, that you care about. So yeah, we're exploring a lot of different ideas there. I mean, I think I could talk about a few things I think are important there. Like, one thing I think is really important is, um, having some kind of like specification layer. You can just commit markdown files. Mm-hmm. That works pretty well, but[00:31:15] swyx: it's nice to be Notion, man. I'm just saying like the spec, like, yeah. The natural home for specs is Notion.[00:31:21] Simon Last: Yeah. Right. It can be a database of pages. Yeah. I mean, it needs to be something that is, you know, human readable and I viewable and I think that's pretty key. Another really key component is like the, the self verification loop. Yes. You need really, really good testing layers, basically. And that's a really deep, uh, uh, problem. But by getting that right, you know, and then, and then it's kinda like the workflow of like, what happens when there's a bug? How does it flow into the system? Like, is it like a subagent working on it? How does it make a PR and how does that get reviewed and merged? And then, you know, so there's like the, the flow or process.[00:31:56] swyx: Yeah. Cool. 
Uh, you know, one thing we did work out before you guys came in was this demo or this[00:32:01] Simon Last: agents[00:32:02] swyx: agent demo.Uh,[00:32:03] Simon Last: so every,[00:32:04] Alsesio: every time we do an episode, we try the product. Right. I don't think there's ever been an episode that I haven't tried. Yeah. Um,[00:32:11] swyx: and we, we try, try is a, a big word. Like since day one lane space has been on Notion, but this is the, this is the net new thing. Yes.[00:32:18] Alsesio: So this is for Nel Labs, which is the space we're in.So next week we're opening applications for tenants. So there's a web form, let me, we got this form done here. Uh, so, uh, before. Uh, the workflow would be I get an email, then I look at the person. It was like, should I spend time talking to this person? Then I respond, they respond back. So I build this. So the name it came up for on its own.Can you maybe h how do, how does it come up with its own name?[00:32:43] Simon Last: Yeah, that's a pretty app name. It's, it, it is just a random, it's a random, a name generator.[00:32:47] Alsesio: Oh, that's funny. It just came,[00:32:49] Simon Last: the fact that it picked that is, is kind of hilarious. I'm pretty sure it's just determined,[00:32:54] Sarah Sachs: resilient collector. I, I think I've never looked at the code for that.I've never second guessed it. I think it's kind of like a madlib situation.[00:33:00] Simon Last: Yeah, I think you're right. Yeah. It's, it's totally a, a deterministic. Oh, I thought it was great. Yes. Although, although when the, if you use the AI to set itself up, it can update its own name, so. Okay. Um,[00:33:11] Sarah Sachs: how did you create it? It, did you just do[00:33:12] Alsesio: classroom?I,[00:33:13] Sarah Sachs: okay.[00:33:13] Alsesio: I did, yeah. I'll say just check my inbox for applications for a coworking space. Keep a people, so it created the database for me. Which I have here. 
And I guess database is like a Notion table because everything is Notion. Um, and then whenever, um, an email comes in, like here, it just creates a new row for the person. Mm-hmm. And then it uses web search to enrich the, mm-hmm, the profile. So it kind of like searches the web and it's like, this is who this person is, this is when they say they wanna move in, and kind of updates everything else. This is, I mean, it's not AGI, but to me, I don't wanna do this work. So it feels like, I mean, it took me maybe like 15 minutes to set up the whole thing. Um, and I really like that most of the information should live here. You know, it is not like some other tool asking me[00:34:01] Sarah Sachs: Yeah.[00:34:01] Alessio: to like, bring my stuff there. It's like I would've probably already created a Notion thing.[00:34:06] Sarah Sachs: Mm-hmm.[00:34:06] Alessio: So[00:34:07] Sarah Sachs: most of our biggest use cases and gains are from. That extra layer of human involvement in the process to make it so right. And so like one of our biggest use cases is bug triaging. So if someone posts something in Slack, can you just have a custom agent that lives there that has its own routing constitution of what team this belongs to, creates a task in your task database and then posts in that Slack channel, right? Like that's like one of the first things that we built internally, I think. And it's completely changed the way that Notion functions as a company. Nothing falls through, well, most things don't fall through the cracks. We don't know what we don't know. But it's not replacing people, it's replacing processes.[00:34:44] Alessio: Yeah.[00:34:44] Sarah Sachs: Right.[00:34:45] Alessio: And I'm curious how you think about composability of these things. So the other one I was working on is like a lease filler. So whenever somebody signs up as a tenant, kind of fill out the lease for them. There should probably be some agent that is like an office manager agent, mm-hmm. 
That can handle the request, make the lease, and then, uh, give them a ADA access to the office and all of that. How do you think about that feature?[00:35:08] Simon Last: Yeah, so I mean, there's, there's two ways you can compose. One way is by using like the data primitives. So you can, you know, you could have one agent, uh, be writing to the database and there's another agent that's watching the database. So that's, that's one way that they, they can coordinate that's like a little bit more decoupled and, mm-hmm, works really well. Or you, you can couple them. So I, I think it's actually not released yet. Releasing it like next week is, uh, in the settings for an agent, you can give access to invoke any other agent.[00:35:34] swyx: Hmm.[00:35:34] Simon Last: So you can have them just, just, uh, uh, talk directly.[00:35:37] swyx: So, was there a limit on like, number of recursions or just,[00:35:40] Simon Last: um, probably,[00:35:42] swyx: you know what I mean? Like, you can just get an infinite loop that way.[00:35:45] Simon Last: There's some kind of, yeah,[00:35:46] Sarah Sachs: I think it's, there is actually a number somewhere.[00:35:49] swyx: I believe it. I'm just, you know, like, you're, you're, someone's gonna screw up.[00:35:51] Simon Last: You should try to see.[00:35:53] swyx: Yeah. I mean, everything's gonna be paperclips.[00:35:55] Simon Last: Oh, yeah. Yeah. But, uh, but, but that's really useful. Yeah. So we, you know, like I just, I, I helped, uh, someone internally the other day, they had, they had built like over 30 custom agents for, uh, for our go-to-market team doing all kinds of different things. You know, for example, like researching, you know, like, like filling in information about, about a customer or like, like triaging customer feedback or like, uh, something like that. Literally over 30 of them. 
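A minimal sketch of the two composition styles described here, in Python: agents coordinating through a shared database versus one agent invoking another directly, with a cap on how deep the calls can nest. Every name below (`Database`, `Agent`, `MAX_DEPTH`, the handler functions) is hypothetical and chosen for illustration; none of it is Notion's API, and the real recursion limit is not stated in the conversation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_DEPTH = 3  # hypothetical cap; the actual number is not given in the episode


class RecursionLimitError(RuntimeError):
    """Raised when agents invoke each other more than MAX_DEPTH levels deep."""


@dataclass
class Database:
    """Stand-in for a shared database: a plain list of row dicts."""
    rows: List[dict] = field(default_factory=list)

    def add(self, row: dict) -> None:
        self.rows.append(row)

    def unclaimed(self) -> List[dict]:
        return [r for r in self.rows if not r.get("claimed")]


@dataclass
class Agent:
    name: str
    handler: Callable[["Agent", str, int], str]
    can_invoke: Dict[str, "Agent"] = field(default_factory=dict)

    def run(self, task: str, depth: int = 0) -> str:
        if depth > MAX_DEPTH:
            raise RecursionLimitError(f"{self.name}: depth {depth} exceeds {MAX_DEPTH}")
        return self.handler(self, task, depth)

    def invoke(self, other: str, task: str, depth: int) -> str:
        # Coupled style: call another agent directly, one level deeper.
        return self.can_invoke[other].run(task, depth + 1)


def lease_handler(agent: Agent, task: str, depth: int) -> str:
    return f"lease drafted for {task}"


def manager_handler(agent: Agent, task: str, depth: int) -> str:
    return agent.invoke("lease", task, depth)


def process_unclaimed(db: Database, agent: Agent) -> List[str]:
    # Decoupled style: the agent only watches the database for new rows.
    results = []
    for row in db.unclaimed():
        row["claimed"] = True
        results.append(agent.run(row["applicant"]))
    return results
```

In the decoupled style the writer and the watcher never reference each other, so either can be swapped out. In the coupled style, the depth counter is what stops two agents that can invoke each other from looping forever.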
And, and then he, and then he even made like a database of all the agents and then he is like, okay, and, and now I'm getting 70, over 70 notifications per day with just the agents are blocked on various things. Uh, and then I was like, oh, okay, cool. You know, the obvious thing to do there is to make a manager agent,[00:36:32] Sarah Sachs: right?[00:36:33] Simon Last: That's gonna sort of be another abstraction layer in between your, your, uh, uh, 30 agents. Uh, so yeah, we, we set it up with like a manager agent that has access to invoke all the other agents and it's sort of like, like watching and observing them and then it sort of, it just creates a layer of abstraction. So instead of 70 notifications per day, it's like, like five. And then, and then the manager agent can help like, uh, debug and fix any problems with the,[00:36:54] swyx: is there a concept of like an inbox or something? Like, you're basically saying that they can message each other?[00:37:00] Simon Last: Yeah.[00:37:01] Sarah Sachs: Well[00:37:01] swyx: they use the system of record, which, which is[00:37:02] Sarah Sachs: Notion, so we[00:37:03] Simon Last: actually, yeah, we didn't make any special concepts at all.[00:37:06] swyx: They're interested to the Notion notifications that I would've got,[00:37:09] Sarah Sachs: they can just like write a task to a database that the other agent is tasked with listening to, or they can actually call a webhook to the agent, like they can just @ the agent. Okay.[00:37:17] Simon Last: Yeah, I mean, this is something that, that we're still working on. I, I think we, you know, like, like generally, generally the way we do these things is, you know, you first make it possible, maybe in like a sort of janky way. So I, I, I think the way I set 'em up is like, you know, we created like a new database that was sort of like issues, mm-hmm. 
That the custom agents were, were experiencing, and then gave them all access to file an issue and then the manager has access to, to read the issues. Um, and that works pretty well, essentially like, like give it its own like internal issue tracker just for the agents. And then, you know, if that becomes a, a concept that seems useful generally, maybe we will think of how to package it in. But I mean, generally we try to just keep it to composing the primitives if we can. You know, another example of this is we have no built-in memory concept. Memory is, is just pages and databases. And so if you wanna give it a memory, just give it a page and give it edit access to that page and the[00:38:03] swyx: human can edit it. Agent can edit[00:38:04] Simon Last: it. Yeah. And so that works, that pattern works extremely well. And you know, depending on the use case, you can have it be just a page or it could be an entire database with, you know, or, you know, it can have sub-pages. It's pretty open what you can do with that.[00:38:15] Alessio: So when I was setting this up, uh, I connected my inbox and it was like, do you wanna use Gmail or Notion Mail? And I'm like, I don't wanna use either, I just want you to do it. I'm curious how you think about, you know, Notion Mail, Notion Calendar, all of these kind of UI/UX interfaces, full stack[00:38:29] Simon Last: Notion.[00:38:30] Alessio: Yeah. When like at the same time you have the agents abstracting them away from you in a way, you know, how do you spend like the product calories, so to speak?[00:38:37] Simon Last: Yeah, I mean, I think it's pretty important that you don't have to use Notion Mail to connect to the mail capability. So we can just connect to Gmail or, or whatever you want, uh, to use. And we're thinking of the mail service as being really great to the extent that it's really agent built, right? 
So maybe the mail app is just sort of a prepackaged agent that helps you automate your, your inbox.[00:39:00] Alsesio: Yeah, the auto labeling is great.Think[00:39:03] Sarah Sachs: the, when we, um, integrate with Gmail for instance, we have a series of tools available that are available via MCP or API to Gmail. When we integrate with Notion Mail, we have the Notion Mail engineering team to build us the, um, exact right tools that optimize latency, optimize performance and quality.They own that quality. Um, there's product leads there. They're directly thinking about the user problems that happen in mail. So it tends to be when we build integrations and connections, we build natively first. Um, and then think about, um, extending them generally just because it's also easier. Mm-hmm. Um, um, to build natively first.Um, so that tends to be how we phase things out.[00:39:43] swyx: Talking about integrations, you prompted me, so I gotta ask. M-C-P-C-L-I. What's going on? What's the[00:39:48] Simon Last: Yeah. Opinion. I think, I mean, I'm, I'm definitely bullish and excited about cli. I think there's a few really cool things about cli. So one really cool thing is like, um, is that it's in the terminal environment, so it gets a bunch of extra power.So it, you know, for example, it can like, like paginating and cursor through like long outputs. Um, and it has a progressive disclosure inherently. Uh, so, you know, you don't see all the tools at once. It's just, you see the CLI wrapper and you can like use the, the help commands and, and, and read files. And then I think the most important thing that's, that's super cool is that there, it's also inherently a, a bootstrapped.So if there's an issue, uh, the agent can debug and fix itself within the same environment that it uses the tool.[00:40:30] swyx: Mm.[00:40:30] Simon Last: Right. Like, you know, I think I saw a tweet this morning. 
Someone said, you know, my agent didn't have a browser, so I asked it to make a browser tool and within a hundred lines of code, it gave itself a little browser, like, like wrapping the, the, the Chromium API, um. That's pretty incredible. And then if there was a bug, it would just immediately try to fix it. Mm-hmm. Right. On the other hand, if you use an, you know, if you use, like, the Chrome DevTools MCP, I've had this issue where like, like sometimes the transport gets like messed up. If it gets messed up, the agent has no way to fix itself. It, it no longer has a browser, it's, it's now broken. Right. I think that's, that's pretty fundamental, but I would say like a lot of the, the bad things about it can be fixed. Uh, so I think like, the progressive disclosure, that can be fixed with, with the right harness. Like, it, it obviously doesn't make sense to show it all the tools all the time. That's not really inherent to the MCP protocol. It's just like how you wrap it and use it.[00:41:16] swyx: There's many poorly built MCPs because we didn't know.[00:41:19] Simon Last: Yeah, yeah. I mean it was just early, like, like the obvious thing is, uh, you know, to start with is, is to just show it all the tools and it's like, okay, now we have a hundred tools. Yeah. And like the tool calling actually works. So let's[00:41:28] swyx: of your success[00:41:29] Simon Last: give it a way to like, like filter or search the tools. So yeah, I would say like broadly speaking, I'm really bullish on CLI. I'm still bullish on MCPs in a certain environment. I think in, in particular, MCP is really great for when you want sort of like a narrow, lightweight agent. I think there's, there's definitely a lot of use cases where, where you don't want like a full coding agent with a compute runtime. And also you want it to be like more tightly permissioned. MCP inherently has a really strong permission model, like all you can do is call the tools. 
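As a rough illustration of the "filter or search the tools" idea, the sketch below contrasts dumping every tool schema into the prompt with a two-step progressive disclosure: search returns one-line summaries, and the full schema is loaded only for the tool the agent picks. This is an assumption-laden toy, not part of MCP or any real harness: `Tool`, `ToolCatalog`, `demo_catalog`, and the `mail_search` tool are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str
    schema: str  # full parameter documentation; costly to show for every tool


class ToolCatalog:
    def __init__(self, tools: List[Tool]) -> None:
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}

    def dump_all(self) -> str:
        """Naive approach: every tool's schema goes into the prompt at once."""
        return "\n".join(f"{t.name}: {t.schema}" for t in self._tools.values())

    def search(self, query: str, limit: int = 5) -> List[str]:
        """Step 1: return only names and one-line summaries that match the query."""
        q = query.lower()
        hits = [
            t for t in self._tools.values()
            if q in t.name.lower() or q in t.summary.lower()
        ]
        hits.sort(key=lambda t: t.name)
        return [f"{t.name}: {t.summary}" for t in hits[:limit]]

    def describe(self, name: str) -> str:
        """Step 2: load the full schema for a single chosen tool."""
        return self._tools[name].schema


def demo_catalog() -> ToolCatalog:
    # 99 placeholder tools plus one hypothetical mail tool, 100 in total.
    tools = [
        Tool(f"filler_{i}", f"Placeholder tool number {i}", f"filler_{i}(arg: str) -> str")
        for i in range(99)
    ]
    tools.append(
        Tool(
            "mail_search",
            "Search messages in a mailbox",
            "mail_search(query: str, max_results: int = 10) -> list",
        )
    )
    return ToolCatalog(tools)
```

With the demo catalog, `search("mail")` followed by `describe("mail_search")` puts two short strings in context instead of a hundred schemas, which is the wrapper-level fix Simon is pointing at.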
A CLI is a little bit murkier. It's like, can it access the API token? Are you, like, properly sort of like encrypting the token so it can't, like, exfiltrate it? It introduces a lot of like, like new issues, which are real and hard to solve. And MCP is just like the dumb simple thing that works and it's pretty good.[00:42:12] Sarah Sachs: I'll add two more perspectives, not from it working well for Notion, but how Notion like commits to both platforms. Notion is dedicated to being the best system of record for where people do their enterprise work. So we will always support our MCP insofar as other people are using MCPs, right? So regardless of our perspective, we've put a lot of effort into our MCP and we have a fantastic team that we're building, um, to do more there. And the second thing I'll say, I think, um, we all think a lot, but lately I've been thinking a lot about making sure there's a value alignment and pricing, um, with capability.[00:42:43] swyx: Literally our next question[00:42:44] Sarah Sachs: and needing a language model to execute deterministic tasks feels wasteful, and relying on a language model to interface with third party providers seems wasteful for tasks that don't require it. And particularly because our custom agents are using usage-based pricing. We think of pricing as like the barrier of entry for use of our product, and we're quite committed to making sure that it's not wasteful. Um, not just because it's a bad deal for our customers, but it's also bad business. 
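To put rough numbers on the waste Sarah describes, here is a small sketch comparing a model call on every run of a deterministic task against paying once for generated code that is then rerun. All figures (token counts, price per million tokens) are made-up assumptions, and the sketch assumes the generated script costs no tokens to rerun and ignores caching discounts.

```python
import math


def llm_every_run_cost(runs: int, tokens_per_run: int, usd_per_million: float) -> float:
    """Cost when a language model handles the task on every single run."""
    return runs * tokens_per_run * usd_per_million / 1_000_000


def one_time_codegen_cost(codegen_tokens: int, usd_per_million: float) -> float:
    """Cost when the model writes a deterministic script once (assumed free to rerun)."""
    return codegen_tokens * usd_per_million / 1_000_000


def break_even_runs(codegen_tokens: int, tokens_per_run: int) -> int:
    """Number of runs after which generating code once is no more expensive."""
    return math.ceil(codegen_tokens / tokens_per_run)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    runs, per_run, codegen, price = 1_000, 4_000, 60_000, 3.0
    print(llm_every_run_cost(runs, per_run, price))
    print(one_time_codegen_cost(codegen, price))
    print(break_even_runs(codegen, per_run))
```

Under these assumptions the per-run approach scales linearly with usage while the generated script is a fixed cost, so the gap widens with every additional run past the break-even point.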
We wanna have as many buyers, like there's a, there's an elasticity of demand and so if we can have our agents properly execute code that calls a CLI deterministically, it's a one-time cost, right? Versus constantly having a language model integrate with an MCP over and over and over and paying those like repeated token fees, and if it's happening outside the cache window, then you're paying for it over and over and over and it's just kind of unnecessary and less deterministic when it doesn't have to be.[00:43:36] Alessio: Yeah, the open-endedness I think is like, the main thing is like, well, if I go write code to just call an API, I would never use an MCP. But then you need an MCP sometimes when you know what to call, but you don't want it to restart versus like, I think the "it built a browser from scratch" is like, it's great when you're doing it on your own, but like if your customers were having your AI write a browser from scratch every time and you had to pay the token cost of that, yeah. You'd be like, no, no. The Chrome DevTools MCP is actually pretty great. Just use that. I'm curious, how do you make that decision? Like should it be just a straight API call, very narrow? Should it be an MCP? Should it be super open-ended?[00:44:10] Sarah Sachs: Do you mean for when we ship Notion capabilities or when we add capabilities to[00:44:13] Alessio: Notion[00:44:14] Sarah Sachs: AI or,[00:44:14] Alessio: I mean, you might have a capability that the only way to do is an open-ended agent, like an agent with a coding sandbox.[00:44:21] Sarah Sachs: Yeah. In Notion AI they're not explicit, not We also ship an MCP.[00:44:24] Alessio: Yeah. Yeah. In B,[00:44:25] Sarah Sachs: yeah.[00:44:26] Alessio: Internally. Okay. Like is there ever a discussion of like, we're not gonna ship it because we're not able to tie it down? Or are you happy to just like,[00:44:33] Sarah Sachs: um, no. 
I mean, there are a lot of things where we choose not to use MCP because we wanna add more high touch to quality.I think search an agent to find is like the largest instance of that, where we have. Um, slack and linear and Jira search and notion that is not using necessarily the search MCP functionality that is provided by those companies. And that's because it's quite critical we think, to how our agent trajectories work is for us to have a little bit more control on the functionality of the search journey.And so it usually comes from quality and there's a long tail of things and that's why we built an MCP client or an MCP server, excuse me, so that people can connect whatever they want. There's that long tail, right. But we, for search particularly, I would say that's like the primary entry point, but there are other connections as well that it's a little bit of secret sauce a
The QQ Cast: Answers to geek culture's most superfluous questions.
Is vibe coding killing off IDEs? Are we entering a new golden age of CLIs? Should you develop your own agent? Won't you join us for yet another dry technical discussion about a completely noncontentious subject, dear listener? Nerdy Developer Stuff: The Value of IDEs; Agent Reset vs Git Reset; Dual Booting on the ROG Ally. News: Ryan Coogler's Animorphs TV Show; McDonald's Pro Game Menu. Trailer: Rick and Morty Season 9.
This episode of Remote Ruby opens with stories of exhaustion from a sleepless week. Then, Chris, Andrew, and David spend most of the episode unpacking two big themes: trust and governance in open source, and the growing mess of software security and AI-assisted development. They dig into the new Ruby Central write-up on the RubyGems/Bundler fracture and question whether it actually clarifies the path forward, then pivot into the Axios npm compromise, supply-chain risk, and how fragile modern package ecosystems can feel. Then, they go into a wide-ranging discussion on AI coding, bloated production apps, image-performance headaches, CSS/rendering quirks, and why teams may need to rethink APIs, CLIs, MCPs, and markdown-first docs as agent traffic keeps growing. Hit download now to hear more! Links: Judoscale - Remote Ruby listener gift; RubyGems Fracture Incident Report; Bundler has moved to the RubyGems organization (GitHub); Mitigating the Axios npm supply chain compromise (Microsoft Security blog); Garry Tan X; The Missing GitHub Status Page. Honeybadger: Honeybadger is an application health monitoring tool built by developers for developers. Judoscale: Make your deployments bulletproof with autoscaling that just works. Disclaimer: This post contains affiliate links. If you make a purchase, I may receive a commission at no extra cost to you. Chris Oliver X/Twitter; Andrew Mason X/Twitter; Jason Charnes X/Twitter
We're proud to release this ahead of Ryan's keynote at AIE Europe. Hit the bell, get notified when it is live! Attendees: come prepped for Ryan's AMA with Vibhu after.
Move over, context engineering. Now it's time for Harness engineering and the age of the token billionaires.
Ryan Lopopolo of OpenAI is leading that charge, recently publishing a lengthy essay on Harness Eng that has become the talk of the town.
In it, Ryan peeled back the curtains on how the recently announced OpenAI Frontier team have become OpenAI's top Codex users, running a >1m LOC codebase with 0 human written code and, crucially for the Dark Factory fans, no human REVIEWED code before merge. Ryan is admirably evangelical about this, calling it borderline “negligent” if you aren't using >1B tokens a day (roughly $2-3k/day in token spend based on market rates and caching assumptions).
Over the past five months, they ran an extreme experiment: building and shipping an internal beta product with zero manually written code.
Through the experiment, they adopted a different model of engineering work: when the agent failed, instead of prompting it better or to “try harder,” the team would look at “what capability, context, or structure is missing?”
The result was Symphony, “a ghost library” and reference Elixir implementation (by Alex Kotliarskyi) that sets up a massive system of Codex agents all extensively prompted with the specificity of a proper PRD spec, but without full implementation.
The future starts taking shape as one where coding agents stop being copilots and start becoming real teammates anyone can use, and Codex is doubling down on that mission with their Super Bowl messaging of “you can just build things”.
Across Codex, internal observability stacks, and the multi-agent orchestration system his team calls Symphony, Ryan has been pushing what happens when you optimize an entire codebase, workflow, and organization around agent legibility instead of human habit.
We sat down with Ryan to dig into how OpenAI's internal teams actually use Codex, why the real bottleneck in AI-native software development is now human attention rather than tokens, how fast build loops, observability, specs, and skills let agents operate autonomously, why software increasingly needs to be written for the model as much as for the engineer, and how Frontier points toward a future where agents can safely do economically valuable work across the enterprise.
We discuss:
* Ryan's background from Snowflake, Brex, Stripe, and Citadel to OpenAI Frontier Product Exploration, where he works on new product development for deploying agents safely at enterprise scale
* The origin of “harness engineering” and the constraint that kicked off the whole experiment: Ryan deliberately refused to write code himself so the agent had to do the job end to end
* Building an internal product over five months with zero lines of human-written code, more than a million lines in the repo, and thousands of PRs across multiple Codex model
generations
* Why early Codex was painfully slow at first, and how the team learned to decompose tasks, build better primitives, and gradually turn the agent into a much faster engineer than any individual human
* The obsession with fast build times: why one minute became the upper bound for the inner loop, and how the team repeatedly retooled the build system to keep agents productive
* Why humans became the bottleneck, and how Ryan's team shifted from reviewing code directly to building systems, observability, and context that let agents review, fix, and merge work autonomously
* Skills, docs, tests, markdown trackers, and quality scores as ways of encoding engineering taste and non-functional requirements directly into context the agent can use
* The shift from predefined scaffolds to reasoning-model-led workflows, where the harness becomes the box and the model chooses how to proceed
* Symphony, OpenAI's internal Elixir-based orchestration layer for spinning up, supervising, reworking, and coordinating large numbers of coding agents across tickets and repos
* Why code is increasingly disposable, why worktrees and merge conflicts matter less when agents can resolve them, and what it really means to fully delegate the PR lifecycle
* “Ghost libraries”, spec-driven software, and the idea that a coding agent can reproduce complex systems from a high-fidelity specification rather than shared source code
* The broader future of Frontier: safely deploying observable, governable agents into enterprises, and building the collaboration, security, and control layers needed for real-world agentic work
Ryan Lopopolo
* X: https://x.com/_lopopolo
* Linkedin: https://www.linkedin.com/in/ryanlopopolo/
* Website: https://hyperbo.la/contact/
Timestamps
00:00:00 Introduction: Harness Engineering and OpenAI Frontier
00:02:20 Ryan's background and the “no human-written code” experiment
00:08:48 Humans as the bottleneck: systems thinking, observability, and agent workflows
00:12:24 Skills, scaffolds,
and encoding engineering taste into context
00:17:17 What humans still do, what agents already own, and why software must be agent-legible
00:24:27 Delegating the PR lifecycle: worktrees, merge conflicts, and non-functional requirements
00:31:57 Spec-driven software, “ghost libraries,” and the path to Symphony
00:35:20 Symphony: orchestrating large numbers of coding agents
00:43:42 Skill distillation, self-improving workflows, and team-wide learning
00:50:04 CLI design, policy layers, and building token-efficient tools for agents
00:59:43 What current models still struggle with: zero-to-one products and gnarly refactors
01:02:05 Frontier's vision for enterprise AI deployment
01:08:15 Culture, humor, and teaching agents how the company works
01:12:29 Harness vs. training, Codex model progress, and “you can just do things”
01:15:09 Bellevue, hiring, and OpenAI's expansion beyond San Francisco
Transcript
Ryan Lopopolo: I do think that there is an interesting space to explore here with Codex, the harness, as part of building AI products, right? There's a ton of momentum around getting the models to be good at coding. We've seen big leaps in like the task complexity with each incremental model release, where if you can figure out how to collapse a product that you're trying to build, a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you. It's done all the wiring and lets you just communicate in prompts. To let the model cook, you have to step back, right? Like you need to take a systems thinking mindset to things and constantly be asking, where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I'm putting in place. So I have solved this part of the SDLC.
swyx: [00:01:00] All right.
[00:01:03] Meet Ryan
swyx: We're in the studio with Ryan from OpenAI.
Welcome.
Ryan Lopopolo: Hi,
swyx: Thanks for visiting San Francisco and thanks for spending some time with us.
Ryan Lopopolo: Yeah, thank you. I'm super excited to be here.
swyx: You wrote a blockbuster article on harness engineering. It's probably going to be the defining piece of this emerging discipline, huh?
Ryan Lopopolo: Thank you. It's been fun to feel like we've defined the discourse in some sense.
swyx: Let's contextualize a little bit, this is the first podcast you've ever done. Yes. And thank you for spending it with us. What is, where is this coming from? What team are you in, all that jazz?
Ryan Lopopolo: Sure, sure.
Ryan Lopopolo: I work on Frontier Product Exploration, new product development in the space of OpenAI Frontier, which is our enterprise platform for deploying agents safely at scale, with good governance in any business. And the role of my team has been to figure out novel ways to deploy our models into package and products that we can sell as solutions to enterprises.
swyx: And you have a background, I'll just squeeze it in there. Snowflake, Brex, [00:02:00] Stripe, Citadel.
Ryan Lopopolo: Yes. Yes. Same. Any kind of customer
swyx: entire life. Yes. The exact kind of customer that you want to,
Vibhu: So I'll say, I was actually, I didn't expect the background when I looked at your Twitter, I'm seeing the opposite. Stuff like this. So you've got the mindset of like full send AI, coding stuff about slop, like buckling in your laptop on your Waymos. Yes. And then I look at your profile, I'm like, oh, you're just like, you're in the other end too. Oh, perfect. Makes perfect.
Ryan Lopopolo: It's quite fun to be AI maximalist. If you're gonna live that persona, OpenAI is the place to do it. And it's
swyx: token is what you say.
Ryan Lopopolo: Yeah. Certainly helps that we have no rate limits internally. And I can go, like you said, full send at this stay.
swyx: Yeah. Yeah.
So the Frontier, and you're a special team within OpenAI Frontier.
Ryan Lopopolo: We had been given some space to cook, which has been super, super exciting.
[00:02:47] Zero Code Experiment
Ryan Lopopolo: And this is why I started with kind of an out-there constraint to not write any of the code myself. I was figuring if we're trying to make agents that can be deployed into enterprises, they should be [00:03:00] able to do all the things that I do. And having worked with these coding models, these coding harnesses over 6, 7, 8 months, I do feel like the models are there enough, the harnesses are there enough, where they're isomorphic to me in capability and the ability to do the job. So starting with this constraint of I can't write the code meant that the only way I could do my job was to get the agent to do my job.
Vibhu: And like a, just a bit of background before that. This is basically the article. So what you guys did is five months of working on an internal tool, zero lines of code, over a million lines of code in the total code base. You say it was 10x, more like it was 10x faster than you would've, if you had done it by hand. So
Ryan Lopopolo: Yeah, that
Vibhu: was the mindset going into this, right?
Ryan Lopopolo: That's right.
[00:03:46] Model Upgrades Lessons
Ryan Lopopolo: Started with some of the very first versions of Codex CLI, with the Codex Mini model, which was obviously much less capable than the ones we have today. Which was also a very good constraint, right? Quite a visceral feeling to ask the [00:04:00] model to build you a product feature, and it just not being able to assemble the pieces together. Which kind of defined one of the mindsets we had for going into this, which is whenever the model just cannot, you always pop open the task, double click into it, and build smaller building blocks that then you can reassemble into the broader objective. And it was quite painful to do this. Honestly, the first month and a half was
10 times slower than I would be. But because we paid that cost, we ended up getting to something much more productive than any one engineer could be because we built the tools, the assembly station for the agent to do the whole thing.
[00:04:43] Model Generations, Build Systems & Background Shells
Ryan Lopopolo: But yeah, so onward to GPT-5, 5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their kind of quirks and different working styles also meant we had to adapt the code base to change things up when the model was revved. [00:05:00] One interesting thing here is with 5.2, the Codex harness at the time did not have background shells in it, which means we were able to rely on blocking scripts to perform long horizon work. But with 5.3 and background shells, it became less patient, less willing to block. So we had to retool the entire build system to complete in under a minute. This is not a thing I would expect to be able to do in a code base where people have opinions. But because the only goal was to make the agent productive, over the course of a week we went from a bespoke Makefile build to Bazel, to Turbo, to Nx, and just left it there because builds were fast at that point.
swyx: Interesting. Talk more about Turbo to Nx. That's interesting 'cause that's the other direction that other people have been doing.
Ryan Lopopolo: Ultimately I have not a lot of experience with actual frontend repo architecture.
swyx: You're talking that Jessica built the sky. So I'm like, I know the Nx team. I know Turbo from Jared [00:06:00] Palmer. And I'm like, yeah, that's an interesting comparison.
[00:06:02] One Minute Build Loop
Ryan Lopopolo: The hill we were climbing, right, was make it fast.
swyx: Is there a micro front end involved? Is it, how complex, React
Ryan Lopopolo: Electron-based single app sort of thing
swyx: And must be under a minute. That's an interesting limitation.
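A minimal sketch of the one-minute build budget mentioned above. It only measures and reports; the helper names and the placeholder command are illustrative assumptions, not the team's actual tooling.

```python
import subprocess
import sys
import time

BUDGET_SECONDS = 60.0  # the "under a minute" bound from the conversation


def timed_run(command: list[str]) -> tuple[int, float]:
    """Run a command to completion and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.monotonic() - start


def over_budget(elapsed: float, budget: float = BUDGET_SECONDS) -> bool:
    """True when a build took longer than the budget allows."""
    return elapsed > budget


# Placeholder command standing in for a real build invocation (hypothetical).
exit_code, elapsed = timed_run([sys.executable, "-c", "print('build ok')"])
if over_budget(elapsed):
    print(f"build took {elapsed:.1f}s, over the {BUDGET_SECONDS:.0f}s budget")
else:
    print(f"build took {elapsed:.1f}s, within budget")
```

The check deliberately lets the build finish rather than enforcing a hard timeout, so an overrun shows up as a report to act on.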
I'm actually not super familiar with the background shell stuff. Probably was talked about in the 5.3 release.
Ryan Lopopolo: It basically means that Codex is able to spawn commands in the background and then go continue to work while it waits for them to finish. So it can spawn an expensive build and then continue reviewing the code, for example.
swyx: Yeah.
Ryan Lopopolo: And this helps it be more time efficient for the user invoking the harness.
swyx: And I guess, and just to really nail this, like why does one minute matter? Like why not five, okay, good. We want no. We
Ryan Lopopolo: want the inner loop to be as fast as possible. Okay. One minute was just a nice round number and we were able to hit it.
swyx: And if it doesn't complete, it kills it or something,
Ryan Lopopolo: No. We just take that as a signal that we need to stop what we're doing, double click, decompose the build graph a bit to get us back under so that we [00:07:00] can enable the agent to continue to operate.
swyx: It's almost like you're, it's like a ratchet. It's like you're forcing build time discipline, because if you don't, it'll just grow and grow. That's right. And you mentioned that my current, like the software I work on currently is at 12 minutes. It sucks.
Ryan Lopopolo: This has been my experience with platform teams in the past, where you have an envelope of acceptable build times and you let it go up to breach and then you spend two, three weeks to bring it back down to the lower end of the average low bed stop. But because tokens are so cheap, yeah, and we're so insanely parallel with the model, we can just constantly be gardening this thing to make sure that we maintain these invariants, which means
There's way less dispersion in the code and the SDLC, which means we can simplify in a way and rely on a lot more invariants as we write the software.
[00:07:45] Observability, Traces & Local Dev Stack
Vibhu: Lovely.
[00:07:46] Humans Are Bottleneck
Vibhu: You mentioned in your article, like humans became the bottleneck, right? You kicked off as a team of three people. You're putting out a million lines of code, like 1500 PRs, basically. What's the mindset there? So as much as code is disposable, you're doing a lot of review. A lot [00:08:00] of the article talks about how you wanna rephrase everything: everything is prompting, and what the agent can't see, it's kind of garbage, right? You shouldn't have it in there. So what's like the high level of how you went about building it, and then how you address, okay, humans are just PR review. Like how is human in the loop for this?
Ryan Lopopolo: We've moved beyond even the humans reviewing the code as well.
[00:08:19] Human Review, PR Automation & Agent Code Review
Ryan Lopopolo: Most of the human review is post merge at this point. But post, post merge, that's not even reviewed. That's just
swyx: Oh, let's just make ourselves happy by You
Ryan Lopopolo: haven't used fundamentally. The model is trivially parallelizable, right? As many GPUs and tokens as I am willing to spend, I can have capacity to work with my codebase. The only fundamentally scarce thing is the synchronous human attention of my team. There's only so many hours in the day, we have to eat lunch. I would like to sleep, although it's quite difficult to stop poking the machine because it makes me want to feed it. You have to step back, right? Like you need to take a systems thinking mindset to things and [00:09:00] constantly be asking where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I'm putting in place.
So I have solved this part of the SDLC, and usually what that has looked like is like we started needing to pay very close attention to the code because the agent did not have the right building blocks to produce modular software that decomposed appropriately, that was reliable and observable and actually accrued a working front end in these things, right?
[00:09:35] Observability First Setup
Ryan Lopopolo: So in order to not spend all of our time sitting in front of a terminal, doing at most one or two things at a time, we invested in giving the model that observability, which is that graph in the post here.
swyx: Yeah. Let's walk through this, traces, and which existed first
Ryan Lopopolo: We started with just the app, and the whole rest of it, from Vector through to all these logs and metrics APIs, was, I dunno, half an [00:10:00] afternoon of my time. We have intentionally chosen very high level, fast developer tools. There's a ton of great stuff out there now. We use mise a bunch, which makes it trivial to pull down all these Go-written Victoria stack binaries in our local development. Tiny little bit of Python glue to spin all these up. And off you go. One neat thing here is we have tried to invert things as much as possible, which is instead of setting up an environment to spawn the coding agent into, instead we spawn the coding agent, like that's the entry point. It's just Codex. And then we give Codex via skills and scripts the ability to boot the stack if it chooses to, and then tell it how to set some env variables so the app and local dev points at this stack that it has chosen to spin up. And this I think is like the fundamental difference between reasoning models and the 4.1s and 4os of the past, where these models could not think, so you had to put them in [00:11:00] boxes with a predefined set of state transitions. Whereas here we have the model, the harness be the whole box.
And give it a bunch of options for how to proceed with enough context for it to make intelligent choices. So
Vibhu: sales, so like a lot of that is around scaffolding, right? Yes. Previous agents, you would define a scaffold. It would operate in that loop, try again. That's pivoted off from when we've had reasoning models. They're seeming to perform better when you don't have a scaffold, right? That's right.
[00:11:28] Docs Skills Guardrails
Vibhu: And you go into like niches here too, like your SPEC.md and like having a very short AGENTS.md.
swyx: Yes. Yes.
Vibhu: Yeah. So you even lay out what it is here, but I like
swyx: the table of contents.
Vibhu: Yeah.
swyx: Like stuff like this, it really helps guide people because everyone's trying to do this.
Ryan Lopopolo: This structure also makes it super cheap to put new content into the repository to steer both the humans and the agents.
swyx: You, you reinvented skills, right?
Vibhu: One big agents and
swyx: skills from first principles
Ryan Lopopolo: Skills did not exist when we started doing this.
Vibhu: You have a short [00:12:00] 100-line overall table of contents and then you have little skills, right? Core beliefs MD, tech debt tracker. Yeah. Yeah. The scale is over
Ryan Lopopolo: The tech debt tracker and the quality score are pretty interesting because this is basically a tiny little scaffold, like a markdown table, which is a hook for Codex to review all the business logic that we have defined in the app, assess how it matches all these documented guardrails, and propose follow-up work for itself. Before Beads and all these ticketing systems, we were just tracking follow-up work as notes in a markdown file, which we could spawn an agent on a cron to burn down. There's this really neat thing that like the models fundamentally crave text.
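The markdown-table tracker described above can be sketched as a small parser that a scheduled job reads to pick the next open items. The column names, rows, and status values here are hypothetical; the conversation does not give the real tracker's layout.

```python
from dataclasses import dataclass

# Hypothetical tracker contents, invented purely for illustration.
TRACKER_MD = """\
| Item | Area | Status |
| --- | --- | --- |
| Add timeouts to network calls | reliability | open |
| Split oversized settings module | architecture | done |
| Cover login flow with tests | quality | open |
"""


@dataclass
class FollowUp:
    item: str
    area: str
    status: str


def parse_tracker(markdown: str) -> list[FollowUp]:
    """Parse a three-column markdown table into FollowUp rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        # Skip the header row and the |---|---| separator row.
        if cells[0] == "Item" or set(cells[0]) <= {"-", ":"}:
            continue
        rows.append(FollowUp(*cells))
    return rows


def open_items(rows: list[FollowUp]) -> list[FollowUp]:
    """Rows still waiting to be burned down."""
    return [row for row in rows if row.status.lower() == "open"]


for follow_up in open_items(parse_tracker(TRACKER_MD)):
    print(f"[{follow_up.area}] {follow_up.item}")
```

Keeping the tracker as plain text means the same file is readable by humans, editable by the agent, and trivially parseable by a scheduled job.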
So a lot of what we have done here is figure out ways to inject text
swyx: into
Ryan Lopopolo: the system, right? When we get a page because we're missing a timeout, for example, I can just @ Codex in Slack on that page and say, I'm gonna fix this by adding a timeout. Please update our reliability documentation to require that all network calls have [00:13:00] timeouts. So I have not only made a point-in-time fix, but also like durably encoded this process knowledge around what good looks like.
swyx: Yeah.
Ryan Lopopolo: And we give that to the root coding agent as it goes and does the thing. But you can also use that to distill tests out of, or a code review agent, which is pointed at the same things to narrow the acceptable universe of the code that's produced.
swyx: I think one of the concerns I have with that kind of stuff is you think you're making the right call by making it persisted for all time across everything. Yes. But then you didn't think about the exceptions that you need to make, right? And that you have to roll it back.
Vibhu: Part of it is
swyx: also sometimes it can follow your instructions too.
Vibhu: It's somewhat a skill, right? So it determines when it uses the tools, right? Like it's not like it'll run it on every call. It'll determine when it wants to check quality score, right?
Ryan Lopopolo: Yeah. And we do, in the prompts we give these agents, allow them to push back,
[00:13:51] Agent Code Review Rules
Ryan Lopopolo: When we first started adding code review agents to the PR, it would be: Codex CLI locally writes the change, pushes up a PR, and on [00:14:00] those PR synchronizations a review agent fires. It posts a comment. We instruct Codex that it has to at least acknowledge and respond to that feedback. And initially the Codex driving the code author was willing to be bullied by the PR reviewer, which meant you could end up in a situation where things were not converging.
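As one way to picture "distilling a test" out of the timeout rule mentioned earlier in this exchange, here is a small static check. It is an assumption-laden illustration: it targets Python's `requests` library as a stand-in (the app under discussion is an Electron app), and the function name is hypothetical.

```python
import ast

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "request"}


def calls_missing_timeout(source: str) -> list[int]:
    """Return line numbers of requests.<method>(...) calls without timeout=."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
            and func.attr in HTTP_METHODS
        )
        # A call that only forwards **kwargs is also reported, since the
        # timeout cannot be verified statically.
        has_timeout = any(kw.arg == "timeout" for kw in node.keywords)
        if is_requests_call and not has_timeout:
            missing.append(node.lineno)
    return sorted(missing)


SAMPLE = '''
import requests

ok = requests.get("https://example.com", timeout=5)
bad = requests.post("https://example.com", json={})
'''

print(calls_missing_timeout(SAMPLE))  # prints [5]
```

A check like this turns a sentence in the reliability docs into something that fails loudly, which narrows what code the agent can get merged.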
So yeah, we had to,
swyx: it's just thrash.
Ryan Lopopolo: We had to add more optionality to the prompts on both of these things, right? The reviewer agents were instructed to bias toward merging the thing, to not surface anything greater than a P2 in priority. We didn't really define P2, but we gave it, you
swyx: did define P2.
Ryan Lopopolo: We gave it a framework within which to score its output
swyx: and then greater than P0 is worse, right? Yes. P2 is very good.
Ryan Lopopolo: P0 is you will nuke the codebase if
swyx: you merge this
Ryan Lopopolo: thing, right?
swyx: Yeah.
Ryan Lopopolo: But also on the code authoring agent side, we also gave it the flexibility to either defer or push back against review feedback, right? This happens all the time, right? Like I happen to notice something and leave a code review, [00:15:00] which could blow up the scope by a factor of two. I usually don't mean for that to be addressed exactly in the moment. It's more of an FYI, file it to the backlog, pick it up in the next fix-it week sort of thing. And without the context that this is permissible, the coding agents are gonna bias toward what they do, which is following instructions.
swyx: Yeah.
[00:15:19] Autonomous Merging Flow
swyx: I did want to check in on a couple things, right? Sure. All the coding review agent, it can merge autonomously. I think that's something that a lot of people aren't comfortable with. And you have a list here of how much agents do. They do product code and tests, CI configuration and release tooling, internal dev tools, documentation, eval harness, review comments, scripts that manage the repository itself, production dashboard definition files, like everything. Yes. And so they're just all churning at the same time. Is there like a cord that any human on the team pulls to stop everything?
Ryan Lopopolo: Because we are building a native application here, we're not doing continuous deploy.
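The reviewer and author rules described above can be sketched as two small functions. This is one reading of the conversation, not the team's prompts: it assumes P0 is the most severe level, that the reviewer surfaces only P0 through P2, and that the author may address, defer, or push back. All names and the decision order are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    ADDRESS = "address"      # fix it in this PR
    DEFER = "defer"          # file it to the backlog
    PUSH_BACK = "push_back"  # disagree and explain why


@dataclass
class Finding:
    summary: str
    priority: int               # 0 is most severe; larger is less urgent
    expands_scope: bool = False


MAX_SURFACED_PRIORITY = 2  # assumption: nothing less urgent than P2 is posted


def surfaced(findings: list[Finding]) -> list[Finding]:
    """Reviewer side: bias toward merging by dropping low-priority findings."""
    return [f for f in findings if f.priority <= MAX_SURFACED_PRIORITY]


def author_response(finding: Finding, agrees: bool = True) -> Response:
    """Author side: blockers are always fixed; the rest may be deferred or disputed."""
    if finding.priority == 0:
        return Response.ADDRESS
    if not agrees:
        return Response.PUSH_BACK
    if finding.expands_scope:
        return Response.DEFER
    return Response.ADDRESS


findings = [
    Finding("Crash on empty config", priority=0),
    Finding("Extract shared retry helper", priority=2, expands_scope=True),
    Finding("Rename local variable", priority=3),
]
for finding in surfaced(findings):
    print(finding.summary, "->", author_response(finding).value)
```

Giving both sides an explicit way out (filter, defer, push back) is what lets the review loop converge instead of thrashing.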
So there's still a human in the loop for cutting the release branch. I see. We require a blessed, [00:16:00] human-approved smoke test of the app before we promote it to distribution, these sort of things.
swyx: So you're working on the app, you're not building like infrastructure where you have like nines of reliability, that kinda stuff?
Ryan Lopopolo: That's correct. That's correct. Okay. And also like full recognition here that all of this activity took place in a completely greenfield repository. There should be no script that this applies generally to
swyx: this is a production thing, you're gonna ship
Ryan Lopopolo: to
swyx: customers. Of course. Yeah, of course. So this is real
Vibhu: And like one of the things there is, you mentioned you started this as a repo from scratch. The onboarding first month or so was pretty, it was like working backwards, right? Yeah. And then you had to work with the system and now you're at that point where, you know, you're very autonomous. I'm curious like, okay, so what, how human in the loop is it? So what are the bottlenecks that you wish you could still automate? And part of that is also like, where do you see the model trajectory improving and offloading more human in the loop? We just got 5.4. It's a really good,
Ryan Lopopolo: fantastic model, by the way.
Vibhu: Yeah. Yeah. It's the first one that's merged top-tier coding. So it's Codex-level coding and reasoning. So general reasoning, both in one model. So
Ryan Lopopolo: and
Vibhu: computer [00:17:00] use, vision.
Ryan Lopopolo: Now with 5.4, I can just have Codex write the blog post, whereas for this one I had to balance between chat.
swyx: Oh, I need to, I might be out of a job. Oh my God.
Ryan Lopopolo: Oh,
swyx: I know. You just gave me an idea for a completely AI newsletter that 5.4 could do. Yeah, I get it now.
Ryan Lopopolo: This sort of thing is just one example of closing the loop, right? Like the dashboard thing you mentioned.
We have Codex authoring the JSON for the Grafana dashboards and publishing them and also responding to the pages, which means when it gets the page, it knows exactly which dashboards are defined and what alert was triggered by which exact log in the code base, 'cause all of this stuff is collated together.
swyx: It has to own everything. Yes. Yeah. Yeah.
Ryan Lopopolo: And it means that if we have an outage that did not result in a page, it has the existing set of dashboards available to it. It has the existing set of metrics and logs and can figure out where the gaps in the dashboard are, or [00:18:00] in the underlying metrics, and fix them in one go. In the same way, you would have a full stack engineer be able to drive a feature from the backend all the way to the front end.
Vibhu: So it, it seems like a lot of the work you guys had to do was you as a small team are fully working for a way that the model wants the software to be written. It's like less human-legible for better code legibility, agent legibility. How do you think that affects broader teams? So one at OpenAI, do liaison, like this is how software should be written. Like I can imagine, say you join a new team with this methodology, this mindset, there's ways that teams do code review, teams write code, like teams are structured, and a lot of it is for human legibility. So should we all swap? Like how does this play back, one, broader into OpenAI and then like broader into the software engineering, right? Is it like teams that pick this up will, it's pretty drastic, right? You have to make a pretty big switch. Should they just full send? Yeah.
Ryan Lopopolo: The mindset is very much that I'm removed from the process, right? I can't really have deep code-level opinions about [00:19:00] things. It's as if I'm group tech leading a 500 person organization.
Vibhu: Yeah.
Ryan Lopopolo: Like it's not appropriate for me to be in the weeds on every PR.
This is why that post-merge code review thing is like a good analog here, right? Like I have some representative sample of the code as it is written, and I have to use that to infer what the teams are struggling with, where they could use help, where they're already moving quickly and I can pivot my focus elsewhere.
Vibhu: Yeah.
Ryan Lopopolo: So I don't really have too many opinions around the code as it is written. I do, however, have a Command base class, which is used to have repeatable chunks of business logic that come with tracing and metrics and observability for free. And the thing to focus on is not how that business logic is structured, but that it uses this primitive, 'cause I know that's gonna give leverage by default.
Vibhu: Yeah.
Ryan Lopopolo: Yeah, back to that sort of systems thinking,
Vibhu: and you have part of that in your blog post, enforcing architecture and taste, how you set boundaries for what's used. There's also a section on redefining [00:20:00] engineering and stuff, but yeah, it's just, it's interesting to hear,
Ryan Lopopolo: and as the models have gotten better, they have gotten better at proposing these abstractions to unblock themselves, which again, lets me move higher and higher up the stack to look deeper into the future on what ultimately blocks the team from shipping.
swyx: Yeah. You mentioned, so you, this is primarily a, it is like a 1-million-line codebase, Electron app. But it manages its own services as well, so it's like a backend-for-frontend type thing.
Ryan Lopopolo: We do have a backend in there, but that's hosted in the cloud. Yeah. This sort of structure is actually within the separate main and renderer processes within the
swyx: Electron. That's just how Electron works.
Ryan Lopopolo: Yeah, of course. So we have also treated like MVC-style decomposition with the same level of rigor, which has been very fun.
swyx: I have a fun pun. This is a tangent, MVC is model view controller.
Any sort of full stack web dev knows that. But my AI native version of this is Model View Claw, the claw is the harness.
Ryan Lopopolo: That's right. That's right. I do think that there is an interesting space to [00:21:00] explore here with Codex, the harness, as part of building AI products, right? There's a ton of momentum around getting the models to be good at coding. We've seen big leaps in like the task complexity with each incremental model release, where if you can figure out how to collapse a product that you're trying to build, a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you. It's done all the wiring and lets you just communicate in prompts to let the model cook. Yeah. It's been very fun. And there's also a very engineering-legible way of increasing capability. It's fantastic, right? Yeah. Just give you, just give the model scripts, the same scripts you would already build for yourself.
swyx: Yeah. Yeah. So for listeners, this is Ryan saying that software engineering or coding agents will eat knowledge work, like the non-coding parts that you would normally think, oh, you have to build a separate agent for it. No, start a coding agent and go out from there. Which OpenClaw has, like it's Pi under the hood.
Ryan Lopopolo: [00:22:00] Yes.
Vibhu: Basically define your task in code. Everything is a coding
swyx: agent by the way. Since I brought it up, it's probably the only place we bring it up. Is any OpenClaw usage from you? Any?
Ryan Lopopolo: No. No. Not for me. I don't have any spare Mac Minis rattling around my house.
swyx: You can afford it? No. I just, I'm curious if it's changed anything in OpenAI yet, but it's probably early days.
And then the other, the other thing I, I wanna pull on here is like you mentioned ticketing systems and you mentioned PRs and I'm wondering if both those things have to go away or be reinvented for this kind of coding. So Git itself is like very hostile to multi-agent.
Ryan Lopopolo: Yeah. We make very heavy use of worktrees.
swyx: But like even then, like I just did a, dropped a podcast yesterday with Cursor, and they said they're getting rid of worktrees 'cause it still has too many merge conflicts. It's still too unintuitive. But go ahead.
Ryan Lopopolo: The models are really great at resolving merge conflicts. Yeah. And to get to a state where I'm not synchronously in the loop in my terminal, I almost don't care that there are merge
swyx: with disposable. [00:23:00] Yeah.
Ryan Lopopolo: We invoke a $land skill and that coaches Codex to push the PR, wait for human and agent reviewers, wait for CI to be green, fix the flakes if there are any, merge upstream if the PR comes into conflict, wait for everything to pass, put it in the merge queue, deal with flakes until it's in main. And this is what it means to delegate fully, right? In a very large monorepo, this is probably a significant tax on humans to get PRs merged, but the agent is more than capable of doing this and I really don't have to think about it other than keep my laptop open.
swyx: Yeah. I used to be much more of a control freak, but now I'm like, yeah, actually you could do a better job of this than me. Yeah. With the right context. Yes.
[00:23:47] Encoding Requirements
swyx: Anything else in harness in general?
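The landing sequence Ryan lists (push, wait for reviewers and CI, fix flakes, merge upstream on conflict, enter the merge queue) can be pictured as a small decision function over PR state. The field names, action names, and the order of the checks are assumptions made for illustration; the actual skill is a prompt, not this code.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    pushed: bool = False
    reviews_done: bool = False   # human and agent reviewers have responded
    has_conflict: bool = False
    ci: str = "pending"          # "pending", "green", or "flaky"
    queued: bool = False
    merged: bool = False


def next_action(pr: PullRequest) -> str:
    """Pick the next step toward getting the PR into main."""
    if pr.merged:
        return "done"
    if not pr.pushed:
        return "push_pr"
    if pr.has_conflict:
        return "merge_upstream"
    if pr.ci == "flaky":
        return "fix_flake"
    if pr.ci == "pending" or not pr.reviews_done:
        return "wait"
    if not pr.queued:
        return "enter_merge_queue"
    return "wait"  # already in the merge queue; keep watching for flakes


snapshots = [
    PullRequest(),
    PullRequest(pushed=True),
    PullRequest(pushed=True, reviews_done=True, ci="flaky"),
    PullRequest(pushed=True, reviews_done=True, ci="green", has_conflict=True),
    PullRequest(pushed=True, reviews_done=True, ci="green"),
    PullRequest(pushed=True, reviews_done=True, ci="green", queued=True, merged=True),
]
for snapshot in snapshots:
    print(next_action(snapshot))
```

Because the function re-derives the next step from current state every time, a conflict or flake at any point simply routes the PR back through the earlier steps.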
Just this piece, I just wanna make sure we,Ryan Lopopolo: I think one thing that I maybe didn't make super clear in the article that I heard on Twitter as an interesting, that's respond [00:24:00]swyx: to them.What's the chatter and then what's your response?Ryan Lopopolo: Ultimately, all the things that we have encoded in docs and tests and review agents and all these things are ways to put all the non-functional requirements of building high scale, high quality, reliable software into a space that prompt injects the agent.We either write it down as docs, we add links where the error messages tell how to do the right thing. So the whole meta of the thing is to basically tease out of the heads of all the engineers on my team, what they think good looks like, what they would do by default, or what they would coach a new hire on the team to do to get things to merch.And that's why we pay attention to all the mistakes, mistakes that the agent makes, right? This is code being written that is misaligned with some as yet not written down, non-functional requirement.swyx: Sorry, what? Did the online people misunderstand orRyan Lopopolo: No,swyx: whatyouRyan Lopopolo: responded to? Somebody just literally said that.I was like, oh yeah,swyx: okay,Ryan Lopopolo: This is the [00:25:00] thing. This is what I've been doing. Oh, youswyx: agree? Yeah. I see. Interesting.Ryan Lopopolo: One other neat thing, which I did totally did not expect is folks were just. Taking the link to the article and giving it to pi or Codex and say, make my repo this,Vibhu: you achi a whole recursion.Ryan Lopopolo: And it was wildly effective. Really? It was wildly effective. NoVibhu: way. It just actually is something I tried with five, four yesterday. I didn't have time. Last time I was like out speaking of something, and this is one of my things, I was like, okay, I have this article. 
Can we just scaffold out what it would be like to run this?And I, I did it first as that and then I was like, okay, let me take another little side repo and say okay, if I was to fully automate this like this because I haven't written a line of code, it'sRyan Lopopolo: like over full, setVibhu: it right. The side thing I'm doing of voice. TTS I'm just like, slobbing out, whatever.It's nothing production. I'm like, how would I make this like this? And it's actually like a really good way. It's like a good way to learn what could be changed, what could be like, it's just a good analyzing, right? You give it all the codes, you give it all the context, you give it the article and it walks you through it very well.That's right. That's right.[00:25:57] Inlining Dependencies[00:25:57] Dependencies Going Away & Brett Taylor's Responseswyx: I guess one more thing before we go to Symphony is I wanted to cover [00:26:00] Brett Taylor's response. We had him on the show. He is your chairman, which is wild. Yeah. That he's reading your articles as well and like getting engaged in it. He says software dependencies are going away.Basically they can just be like vendored. Yes. Response.Ryan Lopopolo: Aswyx: hundred percent. A hundred percent agree. You still pro qr, you still pay Datadog. You still pay Temporal. Thank you.Ryan Lopopolo: Yep. The level of complexity of the dependencies that we can internalize is, I would say low, medium right now. Just based on model capability.What does the,swyx: what is medium?Ryan Lopopolo: I would say like a. A couple thousand line dependency is a thing that we could in-house No problem. Call in an afternoon of time. One neat thing about it is like probably most of that code you don't even need. Like by in-house and abstraction, you can strip away all the generic parts of it and only focus on what you need to enable the specific thing.Yes. 
You're building.

swyx: I've been calling this the end of b******t plugins.

Ryan Lopopolo: Yeah.

swyx: Because when I publish an open source thing, I want to accept everything, be liberal in what I accept. This is Postel's law. But that means there's so much bloat. Yes. There's so much overhead.

Ryan Lopopolo: One other neat thing about [00:27:00] this too is, when we deploy Codex Security on the repo, it is able to deeply review and change the internalized dependencies in a much lower-friction way than it would be to push patches upstream, wait for them to be released, pull them down, make sure that's compatible with all the transitive dependencies I have in my repo, and things like that. So it's also much lower friction to internalize some of these things if code is free, 'cause the tokens are cheap, sort of thing.

swyx: Yeah. Yeah. I think the only argument I have against this is basically scale testing, which obviously applies to the larger pieces of software like Linux and MySQL, and he calls out even the Datadogs and Temporals. And then maybe security testing, where, yes, classically, I think, is it Linus Torvalds, who said that for security, open source is the best disinfectant.

Ryan Lopopolo: Many eyes.

swyx: Many eyes. And if you inline your dependencies and code them up, you're gonna have to relearn mistakes from other people. Yep.

Ryan Lopopolo: Yep. And to internalize that dependency, you're back to zero, and you have to start reassembling all those bits and pieces to, yeah, have [00:28:00] high confidence in the code as it is written. Yeah.

Vibhu: Even in part of the first intro of this, you basically mentioned that everything was written by Codex, including internal tooling, right? So internal tooling, like when you're visualizing what's going on, it's writing it for itself.

swyx: Yeah. I build internal tools this way now, and I just show them off and they're like, how long did you spend? And I didn't spend any time.
I just prompted it.

Ryan Lopopolo: Very funny story here.

swyx: Yeah, go ahead.

Ryan Lopopolo: We had deployed our app to the first dozen users internally and had some performance issues, so we asked them to export a trace for us. We got a tarball and gave it to our on-call engineer, and he did a fantastic job of working with Codex to build this beautiful local dev tool, a Next.js app, where you drag and drop the tarball in and it visualizes the entire trace. It's fantastic. Took an afternoon, but none of this was necessary, because you could just spin up Codex, give it the tarball, ask the same thing, and get the response immediately. So in a way, optimizing for human [00:29:00] legibility of that debugging process was wrong. It kept him in the loop unnecessarily when instead he could have just let Codex cook for five minutes and gotten the same thing.

swyx: Yeah, you verify your instincts here of "this is how we used to do it," or "this is how I would have solved it."

Ryan Lopopolo: Yeah. In this local observability stack, sure, you can deploy Jaeger to visualize the traces, but I wouldn't expect to be looking at the traces in the first place, because I'm not gonna write the code to fix them.

swyx: Yeah. So basically there needs to be this kind of in-house stack and owning the whole loop. I think that is very well established. And it sounds like you might be sharing more about that in the future, right?

Ryan Lopopolo: Yeah. I think we're excited to do that.

[00:29:36] Ghost Libraries Specs

Ryan Lopopolo: We're gonna talk about Symphony in a little bit, but the way we distribute it is as a spec, which I think folks are calling Ghost Libraries on Twitter. This is such a cool name. It does mean it becomes much cheaper to share software with the world, right? You define a spec for how you could build your own, specifying as much as is required for a coding agent to reassemble it [00:30:00] locally.
The flow here is very cool. We have taken all the scaffolding that has existed in our proprietary repo, spun up a new one, and asked Codex, with our repo as a reference, to write the spec. We tell it: spin up a tmux, spawn a disconnected Codex to implement the spec, wait for it to be done, spawn another Codex in another tmux to review the implementation compared to upstream, and update the spec so it diverges less. And then you just loop over and over, Ralph-style, until you get a spec that is able to reproduce the system as it is with high fidelity. It's fantastic.

Vibhu: And you're basically not really adding any of your human bias in there, right? That's correct. A lot of times people write a spec and are like, okay, I think it should be done this way, and they'll riff on something. And it's like, no, the agent could have just handled it. You're still scaffolding in a sense, right? "I want it done this way." It can determine its spec better.

swyx: That's right. That's right. I've been working a lot on evals recently, and part of me is wondering if [00:31:00] an agent can produce a spec that it cannot solve. Is it always capable of the things that it can imagine, or can it imagine things that are impossible for it to do?

Ryan Lopopolo: I think with Symphony, there are these axes where you have things that are easy or hard, and established or new, right? And I think things that are hard and new are still something where the models need humans to drive.

swyx: Yeah. Yeah.

Ryan Lopopolo: But I think those other quadrants are largely solved, given the right scaffold and the right thing that's gonna drive the agent to completion.

swyx: It's crazy that it's solved.

Ryan Lopopolo: But it means that the humans, the ones with limited time and attention, get to work on the hardest stuff, like the problems where it's pure white space out in front.
Or the deepest refactorings, where you don't know what the proper shape of the interfaces is. And this is where I wanna spend my time, 'cause it lets me set up for the next level of scale.

swyx: Yeah. Yeah. Amazing. Let's introduce Symphony. I think we've been mentioning it every now and then. Elixir. Interesting option.

Ryan Lopopolo: Yeah.

swyx: Yeah. I'm not...

Ryan Lopopolo: Again, the [00:32:00] Elixir manifestation here is just a derivative.

swyx: Is it model-chosen? Yeah.

Ryan Lopopolo: Yeah. Yeah. And it chose that because the process supervision and the GenServers are super amenable to the type of process orchestration that we're doing here. You are essentially spinning up little daemons for every task that is in execution and driving it to completion, which means the model gets a ton of stuff for free by using Elixir and the BEAM.

swyx: I had to go do a crash course in BEAM and Elixir, and I think most people are not operating at that scale of concurrency where you need that. But it is a good mental model for resumability and all those things. And these are things I care about. But tell me the origin story of Symphony. What do you use it for? How did it form? Maybe any abandoned paths that you didn't take?

[00:32:46] Terminal Free Orchestration

Ryan Lopopolo: At the end of December we were at about three and a half PRs per engineer per day. This was before 5.2 came out. In the beginning of January, everyone gets back from holiday with 5.2 and no other work [00:33:00] on the repository, and we were up in the five to 10 PRs per day per engineer. And I don't know about y'all, but it's very taxing to constantly be switching like that. I was pretty tapped out at the end of the day. Again, where are the humans spending their time? They're spending their time context switching between all these active tmux panes to drive the agent forward.

swyx: Yeah. No way.
Yeah.

Ryan Lopopolo: So let's, again, build something to remove ourselves from the loop. And this is what we frantically sprinted at here: to find a way to remove the need for the human to sit in front of their terminal. So, a lot of experimentation with dev boxes and automatically spinning up agents. It seems like a fantastic end state here, where my life is a beach. I open Linear twice a day and say yes or no to these things. Yeah. And this is, again, a super, super interesting framing for how the work is done, because I become more latency-insensitive. I have [00:34:00] way less attachment to the code as it is written. I've had close to zero investment in the actual authorship experience. So if it's garbage, I can just throw it away and not care too much about it. In Symphony, there's this rework state where, once the PR is proposed and it's escalated to the human for review, it should be a cheap review. It is either mergeable or it is not. And if it's not, you move it to rework. The Elixir service will completely trash the entire worktree and PR and start it again from scratch. Okay. And this is that opportunity again to say, why was it trash, right? What did the agent do that was...

swyx: Bad. Yeah.

Ryan Lopopolo: Fix that before moving the ticket to in progress again.

swyx: Yeah. Why is this not in the Codex app? I guess you guys are ahead of the Codex app.

Ryan Lopopolo: Yeah, so the way the team has been working is basically to be as AI-pilled as possible and sprint ahead. And a lot of the things we have worked on have fallen out [00:35:00] into a lot of the products that we have. We were in deep consultation with the Codex team to have the Codex app be a thing that exists, right? To have skills be a thing that Codex is able to use, so we didn't have to roll our own. To put automations into the product.
So all of our automatic refactoring agents didn't have to be these hand-rolled control loops. It has been really fantastic to be, in a way, unanchored to the product development of Frontier and Codex and just very quickly try to figure out what works, and then later find the scalable thing that can be deployed widely. It's been a very fun way to operate. It's certainly chaotic. I have lost track very often of what the actual state of the code looks like, 'cause I'm not in the loop. There was one point where we had wired Playwright directly up to the Electron app with MCP. MCPs I'm pretty bearish on, because the harness forcibly injects all those tokens into the [00:36:00] context, and I don't really get a say over it. They mess with auto-compaction. The agent can forget how to use the tool. There's probably only, what, three calls in Playwright that I actually ever want to use, so I pay the cost for a ton of things. Somebody vibed a local daemon that boots Playwright and exposes a tiny little shim CLI to drive it. And I had zero idea that this had occurred, because to me, I run Codex and it's like, oh, it's better. Yeah. No knowledge of this at all. Uh-huh.

[00:36:30] Multi Human Chaos

Ryan Lopopolo: So we have had, in human space, to spend a lot of time doing synchronous knowledge sharing. We have a daily standup that's 45 minutes long because we almost have to fan out the understanding of the current state.

swyx: Yeah, I was gonna say this is good for single-human, multi-agent, but multi-human, multi-agent is a whole explosion of stuff.

Ryan Lopopolo: Yeah. And this is fundamentally why we have such a rigid, like, 10,000- [00:37:00] engineer-level architecture in the app, because we have to find ways to carve up the space so people are not trampling on each other.

swyx: Sorry, I don't get the 10,000 thing.
Did I miss that?Ryan Lopopolo: The structure of the repository is like 500 NPM packages.It's like architecture to the excess for what you would consider, I think normal for a seven person team. But if every person is actually like 10 to 50. Then the like numbers on being super, super deep into decomposition and sharding and like proper interface boundaries make a lot more sense.swyx: Yeah. To me, that's why I talked about Microfund ends and I, an anex is from that world, but Cool. It is just coming back to, to, to this I dunno if you have other, thoughts on. Orchestrating so much work coin going through this. Is this enough? Is this like any aha moments?Vibhu: It'll be interesting to see like where, okay, so right now you pick linear as your issue tracker, right?swyx: Or it's like a is it actually linear? This is actually linear.[00:37:55] Linear vs Slack WorkflowVibhu: Oh, that's linear. It's linear.swyx: Oh I never looked atVibhu: video. The demo video I had to download to [00:38:00] run.swyx: So I, because I'm a Slack maxie, but Yeah, linear. Linear is also really good. Yes,Ryan Lopopolo: we do make a good use of Slack. We we fire off codex to do all these lotion, elasticity, fix ups, the things that like sync that knowledge into the repository.It's super cheap. Yeah.swyx: Yeah.Ryan Lopopolo: Just do it in Codex.swyx: My biggest plug is OpenAI needs to build Slack. You need to own Slack. Build yours. Turn this into Slack.Ryan Lopopolo: I did read about it. Youswyx: did?Ryan Lopopolo: Yeah.[00:38:25] Collaboration Tools for AgentsRyan Lopopolo: I would say that if we think that we want these agents to do economically valuable work, which is like this is the mission, right?We want AI to be deployed widely, to do economically valuable work, then we need to find ways for them to naturally collaborate with humans, which means collaboration tooling, I think, is an interesting space to explore.swyx: Yeah, totally. Yeah. 
GitHub, slack, linear.Vibhu: Yeah, that was my thing. Okay, where do we see right now Codex has started Codex Model, then CLI, now there's an app, app can let me shoot off multiple Codex is in parallel, but there's no great team collaboration for Codex.And it [00:39:00] seems like your team had some say into what comes out, right? So you talked to ‘em, codex kind of was a thing. From there, if you guys are on the bound, what stuff that like, you might not focus on, but what do you expect other people to be building, right? So people that are like five x 50 Xing.Should you build stuff that's like very niche for your workflow, for your team? Should it be more general so other people can adopt? Is there a niche there? ‘Cause part of it is just okay, is everything just internal tooling? Do we have everything our own way? Like the way our team operates has our own ways that we like to communicate or is there a broader way to do it?Is it something like a issue tracker? Just thoughts if you wanna riff on that.[00:39:35] Standardizing Skills and CodeRyan Lopopolo: I think TBD we have not figured this out in a general way. I do think that there is leverage to be had in making the code and the processes as much the same as possible. If you think that code is context, code is prompts, it's better from the agent behavior perspective to be able to look in a package in directory X, Y, Z, and it not to have to page so [00:40:00] deeply into directory if you C, because they have the same structure, use the same language, they have the same patterns internally.And that same like leverage comes from aligning on a single set of skills that you're pouring every engineer's taste into to make sure that the agent is effective. So like in our code base, we have, I think, six skills. That's it. And if some part of the software development loop is not being covered, our first attempt is to encode it in one of the existing setup skills, which means that we can change the agent behavior.Yeah. 
More cheaply than changing the human driver behavior.swyx: Yeah.[00:40:39] Self Improvement via Logsswyx: Have you ever, have you experimented with agents changing their own behavior?Ryan Lopopolo: We do.swyx: Yeah. Or parent agent changing a subagents, behavior or something like that.Ryan Lopopolo: We have some bits for skill distillation. So for example, there's one neat thing you can do with Codex, which is just point it at its own session logs to ask it to tell you how you can use [00:41:00] the tool pedal better.swyx: It's like introspectionRyan Lopopolo: or ask it to do things. I useVibhu: this session better. What skills should Iswyx: high? I like the modification of, you can do, just do things to you can just ask agent to do things.Ryan Lopopolo: Yeah. You can just codex things. This is like a, this is like a silly emoji that we have, right? You can just codex things, you can just prompt things.It's really glorious future we live in, but okay, you can do that one-on-one. But we're actually slurping these up for the entire team into blob storage and. Running agent loops over them every day to figure out where as a team can we do better and how do we reflect that back into the repositories?Yes, though everybody benefits from everybody else's behavior for free. Same for like PR comments, right? These are all feedback. That means the code as written, deviated from what was good, a PR comment, a failed build. These are all signals that mean at some point the agent was missing context. We gotta figure out how toswyx: Yeah.Ryan Lopopolo: Slurp it up and put it back in the reboot.swyx: By the way, I do this exactly right. I used to, when I use cloud code for [00:42:00] knowledge work, cloud cowork is like a nice product, right? Yes. In I think you would agree. I always have it tell me what do I do better next time? 
And that's the meta programming reflection thing.So I almost think like you have six reflection extraction levels in symphony and almost like the zero of layer. So the six levels are PO policy, configuration, coordination, execution, integration, observability. We've talked about a couple of these, but the zero layer is like the, okay, are we working well?Can we improve how we work? Yes. Can I modify my own workflow without MD or something? I don't know.Ryan Lopopolo: Yeah, of course. Yeah, of course you can. Like this thing is also able to cut its own tickets ‘cause we give it full access.Yeah. Make it a ticket to have it cut. Tickets you can.Put in the ticket that you expect it to file as on follow up work,swyx: like Yeah. Self-modifying. Yeah.Ryan Lopopolo: Yeah.[00:42:44] Tool Access and CLI FirstRyan Lopopolo: Put, don't put the agent in a box. Give the agent full accessibility over it. Domain.swyx: I had a mental reaction when you said don't put the agent in a box. So I think you should put it in a box. Like it's just that you're giving the box everything it needs.Ryan Lopopolo: Yeah. Context and tools.swyx: But we're like, as developers, we're used to calling [00:43:00] out to different systems, but here you use the open source things like the Prometheus, whatever, and you run it locally so that you can have the full loop. I assume.Ryan Lopopolo: Yep.Vibhu: I think likeRyan Lopopolo: another, you wanna minimize cloud, cloud dependencies.Vibhu: You also want to make sure that you think about what the agent has access to. What does it see? Does it go back into the loop, like from the most basic sense of you let it see its own like calls, traces it can determine where it went wrong. But are you feeding that back in? So you know, just the most basic level of you wanna see exactly what's input output, like does the agent have access to.What is being outputted, right? It can self-improve a lot of these things. It's allRyan Lopopolo: text, right? 
My job is to figure out ways to funnel text from one agent to the other.

swyx: It's so strange. Way back at the start of this whole AI wave, Andrej was like, English is the hottest new programming language. It's here. It's just, yeah, the future as well.

Vibhu: Okay, a lot of software, a lot of stuff, there's a GUI, it's made for the human. We're seeing the evolution of CLIs for everything, right? All tools have CLIs. Your agents can use [00:44:00] them well. Do we get good vision? Do we get good little sandboxes? Right now, it's a really effective way, right? Models love to use tools. They love bash. They love to read through text. So slap on a CLI and let it go loose. That works for everything.

Ryan Lopopolo: It does. Yeah. Yeah.

[00:44:14] UI Perception and Rasterizing

Ryan Lopopolo: We've also been adapting non-textual things to that shape in order to improve model behavior in some ways, right? We want the agent to be able to see the UI. Agents do not perceive visually in the same way that we do. They don't see a red box, they see "red box button," right? They see these things in latent space. So if we want...

swyx: Hey, yeah. We have a ding that goes off every time. Latent space.

Ryan Lopopolo: Ding. Anyway, if we wanna actually make it see the layout, it's almost easier to rasterize that image to ASCII art and feed it in to the agent. Ha. And there's no reason you can't do both, right? To further refine how the model perceives the object it's [00:45:00] manipulating.

swyx: Cool. You wanna talk about a couple more of these layers that might bear more introspection, or that you have personal passion for?

[00:45:07] Coordination Layer with Elixir

Ryan Lopopolo: I will say that the coordination layer here was a really tricky piece to get right.

swyx: Let's do it. Yep. I'm all about that. And this is Temporal core.

Ryan Lopopolo: This is where, when we turn the spec into Elixir, the model takes a shortcut, right?
It's like, oh, I have all these primitives that I can make use of in this lovely runtime that has native process supervision. Which is, I think, a neat way to have taken the spec and made it more achievable by making choices that naturally map...

swyx: Yeah.

Ryan Lopopolo: ...to the domain, right? In the same way that you would prefer to have a TypeScript monorepo if you are doing full-stack web development, right? Because the ability to share types across the front end and back end reduces a lot of complexity.

swyx: That's what GraphQL used to be.

Ryan Lopopolo: That's right.

swyx: I don't know if it's still alive, but...

Ryan Lopopolo: And because there are [00:46:00] no humans in the loop here, my own personal ability to write or not write Elixir doesn't really have to bias us away from using the right tool for the job. It is just wild.

swyx: Love it. I love it. Yeah. I wonder if any languages struggle more than others because of this? I feel like everyone has their own abstractions. That would make sense. But maybe it might be slower, it might be more faulty, where you'd have to just kick the server every now and then. I don't know. I think the observability layer is really well understood. Integration layer, MCP is dead. I think all of these are just a really interesting hierarchy to travel up and down. It's a common language for people working on the system to understand.

Ryan Lopopolo: The policy stuff is really cool, right? Yeah. You don't really have to build a bunch of code to make sure the system waits for the CI to pass.

swyx: It's institutional knowledge.

Ryan Lopopolo: Yeah. You just give it the gh CLI with some text that says CI has to pass. It makes the maintenance of these systems a lot easier.

[00:46:57] Agent Friendly CLI Output

swyx: Do you think that CLI maintainers need to [00:47:00] do anything special for agents, or just as is?
It's good, because I don't think when people made the GitHub CLI, they anticipated this happening.

Ryan Lopopolo: That's correct. The gh CLI is fantastic. It's great.

swyx: Everyone go try gh repo create, gh pull and then the pull request number, right? gh pr, like, 153, whatever. And then it pulls...

Ryan Lopopolo: Basically my only interaction with the GitHub web UI at this point is gh pr view --web. Exactly. Glance at the diff and be like, sure thing, send it. Yeah. But the CLIs are nice 'cause they're super token-efficient, and they can be made more token-efficient really easily. I'm sure you all have seen this: I go to Buildkite or Jenkins and I just get this massive wall of build output. And in order to unblock the humans, your developer productivity team is almost certainly gonna write some code that parses the actual exception out of the build logs and sticks it in a sticky note at the top of the page. And you basically [00:48:00] want CLIs to be structured in a similar way, right? You're gonna want to pass a silent flag to Prettier, because the agent doesn't care that every file was already formatted. It just wants to know whether it's formatted or not, so it can then go run a write command. Similarly, in our pnpm distributed script runner, when we had one, when you do --recursive, it produces an absolute mountain of text. But all of that is for passing test suites. So we ended up wrapping all of this in another script...

swyx: To suppress the...

Ryan Lopopolo: ...which you can vibe, to only output the failing parts of the tests.

swyx: You could pipe standard error versus standard out. I don't know. Okay. Whatever. Too much thinking to have to do that. I used to maintain a CLI for my company, and yeah, this is very core to my heart. But you're vibing my job.

Ryan Lopopolo: That's right.

swyx: Cool. Any other things? This is a long spec. [00:49:00] I appreciate that.
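A minimal sketch of the wrapper idea Ryan describes above: take a test runner's full output and hand the agent only the failing parts. This is an illustration under stated assumptions, not the team's actual script. The failure markers (`FAIL`, `FAILED`, `ERROR`) and the function names `failing_lines` and `summarize` are hypothetical, and real runners format failures differently.

```python
import re

# Hypothetical failure markers; adjust to whatever your test runner prints.
FAILURE_PATTERN = re.compile(r"\b(?:FAIL|FAILED|ERROR)\b")


def failing_lines(output: str, context: int = 2) -> list[str]:
    """Keep each line that matches a failure marker, plus `context` lines after it."""
    lines = output.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if FAILURE_PATTERN.search(line):
            # Include the matching line and a few following lines of detail.
            keep.update(range(i, min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]


def summarize(output: str, exit_code: int) -> str:
    """Collapse a passing run to 'ok'; otherwise return only the failing parts."""
    if exit_code == 0:
        return "ok"
    kept = failing_lines(output)
    # If nothing matched, fall back to the full output rather than hide the failure.
    return "\n".join(kept) if kept else output


if __name__ == "__main__":
    sample = (
        "PASS a.test.ts\n"
        "PASS b.test.ts\n"
        "FAIL c.test.ts\n"
        "  expected 1, got 2\n"
        "PASS d.test.ts\n"
    )
    print(summarize(sample, exit_code=1))
```

The design choice mirrors the "sticky note at the top of the page" point: a passing run costs the agent a single token-cheap line, and a failing run surfaces only the lines it needs to act on.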
It's got a lot of strong opinions in here. Any other things that we should highlight? I think obviously you can spend the whole day going through some of these, but I do think that some of these have a lot of care or some of this you might wanna tell people, Hey, take this, but, make it your own.[00:49:15] Blueprint Spec and GuardrailsRyan Lopopolo: Fundamentally, software is made more flexible when it's able to adapt to the environment in which it is deployed, which means that things like linear or GitHub even are specified within the spec, but not required pieces of it. There's like a more platonic ideal of the thing that you could swap in like Jira or Bitbucket, for example.But being able to tightly specify things like the ID formats or how the Ralph Loop works for the individual agents. Basically means you can get up and running with a fully specified system quickly that you then evolve later on. I think we never intended for this to be a static spec that you can [00:50:00] never change.It's more like a blueprint to get something worth a starting point up and running.swyx: Yeah.Ryan Lopopolo: For you then to vibe later to your heart's content,swyx: you have like code and scripts in here where it's oh, I think this is a really good prompt. It's just a very long prompt.Ryan Lopopolo: Fundamentally, the agents are good at following instructions, so give them instructions.And it will, improve the reliability of the result. We, much like the way we use Symphony, we don't want folks to have to monitor the agent as it is vibing the system into existence. So being very opinionatedVery strict around what these success criteria are means that our deployment success rate goes up. Yeah. It means we don't have to get tickets on this thing.Vibhu: Think it all goes back to that like code to disposable, right? Like early on when you had CLI or you'd kick off a Codex run, it would take two hours. 
You would wanna monitor okay, I'm in the workflow of just using one.I don't want it to go down the wrong path. I'll cut it off and, just shoot off four, like that was my favorite thing of the Codex app, right? Yeah. Just Forex it like, [00:51:00] it's okay. One of them will probably be right, one of them might be better. Stop overthinking it. Like my first example was probably like deep research.When you put out deep research and I'd ask it something like, I asked it something about LLM, it thought it was legal something and spent an hour, came back with a report completely off the rails. And I was like, okay, I gotta monitor this thing a bit. No don't monitor it. Just you want to build it so it's that it, it goes the right way.And you don't wanna, you don't wanna sit there and babysit, right? You don't want to babysit your agentsRyan Lopopolo: with that deep research query that you made. Looking at the bad result, you probably figured out you needed to tweak your prompt Yeah. A bit, right? That's that guardrail that you fed back into the code base for the task, your prompt to further align the agent's execution.Same sort of concept supply there too.swyx: When you talk, how are the customers feelingRyan Lopopolo: for Symphony? I think we have none, right? This is a thing we have put out into theswyx: world. Symphony's internal, right? As long as you are happy, you are the customer. That'
This edition of AI News is mainly about a scarce resource: memory. Google's TurboQuant promises relief, but it mostly helps the hyperscalers. At the same time, CLIs are becoming the new favorite tool for AI agents, and the security questions this raises get too little attention. Fabian and Ole also bring a fail of the week: Anthropic accidentally leaks the source code of Claude Code.
This episode with Joachim Hill-Grannec asks: How do platforms bloat, and how do you keep them simple and fast with trunk-based dev and small batches? Which metrics prove it works—cycle time, uptime, or developer experience? Can security act as a partner that speeds delivery instead of a gate? We are always happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners. DevSecOps Talks podcast LinkedIn page DevSecOps Talks podcast website DevSecOps Talks podcast YouTube channel Summary In this episode of DevSecOps Talks, Mattias speaks with Joachim Hill-Grannec, co-founder of Peltek, a boutique consulting firm specializing in high-availability, cloud-native infrastructure. Following up on a previous episode where Steve discussed cleaning up bloated platforms, Mattias and Joachim dig into why platforms get bloated in the first place and how platform teams should think when building from scratch. Their conversation spans cloud provider preferences, the primacy of cycle time, the danger of adding process in response to failure, and a strong argument for treating security and quality as enablers rather than gatekeepers. Key Topics Platform Teams Should Serve Delivery Teams Joachim frames the core question of platform engineering around who the platform is actually for. His answer is clear: the delivery teams are the client. Platform engineers should focus on making it easier for developers to ship products, not on making their own work more convenient. He connects this directly to platform bloat. In his experience, many platforms grow uncontrollably because platform engineers keep adding tools that help the platform team itself: "Look, I spent this week to make my job this much faster." 
But Joachim pushes back on this instinct — the platform team is an amplifier for the organization, and every addition should be evaluated by whether it helps a product get to production faster and gives developers better visibility into what they are working on. Choosing a Cloud Provider: Preferences vs. Reality The conversation briefly explores cloud provider choices. Joachim says GCP is his personal favorite from a developer perspective because of cleaner APIs and faster response times, though he acknowledges Google's tendency to discontinue services unexpectedly. He describes AWS as the market workhorse — mature, solid, and widely adopted, comparing it to "the Java of the land." Azure gets the coldest reception; both acknowledge it has improved over time, but Joachim says he still struggles whenever he is forced to use it. They observe that cloud choices are frequently made outside engineering. Finance teams, investors, and existing enterprise agreements often drive the decision more than technical fit. Joachim notes a common pairing: organizations using Google Workspace for productivity but AWS for cloud infrastructure, partly because the Entra ID (formerly Azure AD) integration with AWS Identity Center works more smoothly via SCIM than the equivalent Google Workspace setup, which requires a Lambda function to sync groups. Measuring Platform Success: Cycle Time Above All When Mattias asks how a team can tell whether a platform is actually successful, Joachim separates subjective and objective measures. On the subjective side, he points to developer happiness and developer experience (DX). Feedback from delivery teams matters, even if surveys are imperfect. On the objective side, his favorite metric is cycle time — specifically, the time from when code is ready to when it reaches production. He also mentions uptime and availability, but keeps returning to cycle time as the clearest indicator that a platform is helping teams deliver faster. 
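As a rough illustration of the cycle time metric described above (the time from when code is ready to when it reaches production), here is a minimal Python sketch. The timestamps, the `changes` list, and the function names are hypothetical and chosen only for illustration; a real team would pull these values from its own pipeline data. The sketch reports the median so that one unusually slow release does not dominate the number.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (time the change was ready, time it reached production).
changes = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 40)),
    (datetime(2024, 5, 1, 13, 0), datetime(2024, 5, 2, 10, 0)),
    (datetime(2024, 5, 2, 15, 0), datetime(2024, 5, 2, 15, 20)),
]


def cycle_times(records):
    """Return the ready-to-production duration for each change."""
    return [deployed - ready for ready, deployed in records]


def median_cycle_time(records):
    """Return the median duration across all recorded changes."""
    return median(cycle_times(records))


if __name__ == "__main__":
    print("Median cycle time:", median_cycle_time(changes))
```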
This aligns with DORA research, which has consistently shown that deployment frequency and lead time for changes are strong predictors of overall software delivery performance. Start With a Highway to Production A major theme of the episode is that platforms should begin with the shortest possible route to production. Mattias calls this a "highway to production," and Joachim strongly agrees. For greenfield projects, Joachim favors extremely fast delivery at first — commit goes to production, commit goes to production — even with minimal process. As usage and risk increase, teams can gradually add automation, testing, and safeguards. The critical thing is to keep the flow and then ask "how do we make those steps faster?" as you add them, rather than letting each new step slow down the pipeline unchallenged. He also makes a strong case for tags and promotions over branch-based deployment, noting his instinctive reaction when someone asks "which branch are we deploying from?" is: "No branches — tags and promotions." The Trap of Slowing Down After Failure Joachim warns about a common and dangerous pattern: when a bug reaches production, the natural organizational reaction is not to fix the pipeline, but to add gates. A QA team does a full pass, a security audit is inserted, a manual review step appears. Each gate slows delivery, which leads to larger batches, which increases risk, which triggers even more controls. He sees this as a vicious cycle. Organizations that respond to incidents by slowing delivery actually get worse security, worse quality, and worse throughput over time. He references a study — likely the research behind the book Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim — showing that faster delivery correlates with better security and quality outcomes. 
The organizations adding Engineering Review Boards (ERBs) and Architecture Review Boards (ARBs) in the name of safety often do not measure the actual impact, so they never see that the controls are making things worse. Mattias connects this to AI-assisted development, where developers can now produce changes faster than ever. If the pipeline cannot keep up, the pile of unreleased changes grows, making each release riskier. Getting Buy-In: Start With Small Experiments Joachim does not recommend that a slow, process-heavy organization throw everything out overnight. Instead, he suggests starting with small experiments. Code promotions are a good entry point: teams can start producing artifacts more rapidly without changing how those artifacts are deployed. Once that works, the conversation shifts to delivering those artifacts faster. He finds starting on the artifact pipeline side produces quicker wins and more organizational buy-in than starting with the platform deployment side, which tends to be more intertwined and higher-risk to change. Guiding Principles Over a Rigid Golden Path Mattias questions the idea of a single "golden path," saying the term implies one rigid way of working. Joachim leans toward guiding principles instead. His strongest principle is simplicity — specifically, simplicity to understand, not necessarily simplicity to create. He references Rich Hickey's influential talk Simple Made Easy (from Strange Loop 2011), which distinguishes between things that are simple (not intertwined) and things that are easy (familiar or close at hand). Creating simple systems is hard work, but the payoff is systems that are easy to reason about, easy to change, and easy to secure. His second guiding principle is replaceability. When evaluating any tool in the platform, he asks: "How hard would it be to yank this out and replace it?" If swapping a component would be extremely difficult, that is a smell — it means the system has become too intertwined. 
Even with a tool as established as Argo CD, his team thinks about what it would look like to switch it out.

Tooling Choices and Platform Foundations

Joachim outlines the patterns his team typically uses when building platforms, organized into two paths:

Delivery pipeline (artifact creation):
- Trunk-based development over GitFlow
- Release tags and promotions rather than branch-based deployment
- Containerization early in the pipeline
- Release Please for automated release management and changelogs
- Renovate for dependency updates (used for production environment promotions from Helm charts and container images)

Platform side (environment management):
- Kubernetes-heavy, typically EKS on AWS
- Karpenter for node scaling
- AWS Load Balancer Controller only as a backing service for a separate ingress controller (not using ALB Ingress directly, due to its rough edges)
- Argo CD for GitOps synchronization and deployment
- Argo Image Updater for lower environments to pull latest images automatically
- Helm for packaging, despite its learning curve

He notes that NGINX Ingress Controller has been deprecated, so teams need to evaluate alternatives for their ingress layer.

Developers Should Not Be Fully Shielded From Operations

One of the more nuanced parts of the conversation is how much operational responsibility developers should have. Joachim rejects both extremes. He does not think every developer needs to know everything about infrastructure, but he has seen too many cases where developers completely isolated from runtime concerns make poor decisions, missing simple code changes that would make a system dramatically easier to deploy and operate. He advocates for transparency and collaboration. Platform repos should be open for anyone on the dev team to submit pull requests. When the platform team makes a change, they should pull in developers to work alongside them. This way, the delivery team gradually builds a deeper understanding of how the whole system works.
Joachim loves the open-source maintainer model applied inside organizations: platform teams are maintainers of their areas, but anyone in the organization should be able to introduce change. He warns against building custom CLIs or heavy abstractions that create dependencies — if a developer wants to do something the CLI does not support, the platform team becomes a bottleneck. Mattias adds that opening up the platform to contributions also exposes assumptions. What feels easy to the person who built it may not be easy at all; it is just familiar. Outside contributors reveal where the system is actually hard to understand. Designers, Not Artists: Detaching Ego From Code Joachim shares an analogy he prefers over the common "developers as artists" framing. He sees developers more like designers than artists, because an artist's work is tied to their identity — they want it to endure. A designer, by contrast, creates something to serve a purpose and expects it to be replaced when something better comes along. He applies this to platforms and infrastructure: "I want my thing to get wiped out. If I build something, I want it to get removed eventually and have something better replace it." Organizations where ego is tied to specific systems or tools tend to resist change, which leads to the kind of dysfunction that keeps platforms bloated and brittle. Complexity Is the Enemy of Security Mattias raises the difficulty of maintaining complex security setups over time, especially when the original experts leave. Joachim responds firmly: complexity is anti-security. If people cannot comprehend a system, they cannot secure it well. He acknowledges that some problems are genuinely hard, but argues that much of the complexity engineers create is unnecessary — driven by ego rather than need. "The really smart people are the ones that create simple things," he says, wishing the industry would redirect its narrative from admiring complicated systems to admiring simple ones. 
Security and QA as Internal Consulting, Not Gatekeeping Joachim draws a parallel between security and QA. He dislikes calling a team "the quality team," preferring "verification" — they are one component of quality, not the entirety of it. Similarly, security is not one team's responsibility; it spans product design, development practices, tooling, and operations. His ideal model is for security and QA teams to operate as internal consultants whose goal is to reduce risk and improve the overall system — not to catch every possible issue at any cost. The framing matters: if a security team's mandate is simply "block all security issues," the logical conclusion is to stop shipping or delete the product entirely. That may be technically secure, but it is useless. He frames security as risk management: "Security is a risk management process, not just security for the sake of security. You're managing the risk to the business." The goal should be to deliver faster and more securely — an "and," not an "or." Mattias recalls a PCI DSS consultant joking over drinks that a system being down is perfectly compliant — no one can steal card numbers if the system is unavailable. The joke lands because it exposes exactly the broken incentive Joachim describes. Business Value as the Unifying Frame The episode closes by tying everything back to business outcomes. Joachim argues that speed and security are not opposites; both contribute to business value. Fast delivery creates value directly, while security reduces business risk — and risk management is itself a business operation. He explains why focusing on the highest-impact business bottleneck first builds trust. When you hit the big items first, you earn credibility, and subsequent changes become easier to justify. For example, one of his clients has a security group that is the slowest part of their organization. 
Speeding up that security process would have a massive impact on business delivery — more than optimizing the artifact pipeline. Mattias reflects that he used to see platform work as separate from business concerns — "I don't care about the business, I'm here to build a platform for developers." Looking back, he would reframe that: using business impact as the measure of platform success does not mean abandoning the focus on developers, it means having a clearer way to prioritize and demonstrate value. Highlights Joachim on platform bloat: "Your job is not to make your job faster and easier — you're an amplifier to the organization." Joachim on his favorite metric: "Cycle time is my favorite metric. I love cycle time metrics." Joachim on deployment strategy: "No branches, no branches — tags and promotions." Mattias on platform design: He calls the ideal early setup a "highway to production." Joachim on simplicity vs. ease: He references Rich Hickey's Simple Made Easy talk — "It's very hard to create simple systems that are easy to reason about. And it's very easy to create systems that are very hard to reason about." Joachim on replaceability: "If swapping a tool out would be extremely hard, that's a pretty big smell." Joachim on complexity and security: "If it's complicated, you just can't keep all the context together. Simple systems are much easier to be secure." Joachim on engineering ego: "I don't particularly like the aspect of [developers as] artists... I want my thing to get wiped out. I want it to get removed eventually and have something better replace it." He prefers the analogy of designers over artists, because artists tie their identity to their creations. Joachim on security as a blocker: "If their goal is we are going to block every security issue, the best way to do that is delete your product." 
Spicy cloud takes: Joachim calls GCP his favorite cloud for developers, compares AWS to "the Java of the land," and says he still struggles every time he is forced to use Azure. PCI DSS dark humor: Mattias recalls a consultant joking that a downed system is perfectly compliant: you cannot steal card numbers from a system that is not running. Joachim on the slow-down trap: Organizations add ERBs, ARBs, and manual security gates after incidents, but "the faster you can deliver, you actually get better security, better quality, and better throughput, and the more you slow it down, you go the opposite."

Resources
- Simple Made Easy by Rich Hickey (InfoQ): The influential 2011 talk Joachim references on distinguishing simplicity from ease in system design.
- DORA Metrics: The Four Keys: The research framework behind cycle time, deployment frequency, and the finding that speed and stability are not tradeoffs.
- Trunk Based Development: A comprehensive guide to the branching strategy Joachim recommends over GitFlow.
- Argo CD, Declarative GitOps for Kubernetes: The GitOps tool Joachim's team uses for cluster synchronization and deployment.
- Release Please (GitHub): Google's tool for automated release management based on conventional commits, used by Joachim's team for tag-based promotions.
- Karpenter, Kubernetes Node Autoscaler: The node autoscaler Joachim's team uses with EKS for fast, flexible scaling.
- Renovate, Automated Dependency Updates: The dependency management bot Joachim uses for both build dependencies and production environment promotions.
For a limited time, Latent Spacenauts can skip the waitlist to join Dreamer and also compete for a $10,000 cash prize for the most useful tools for Dreamer! Thanks @dps!

In 2024, David Singleton left Stripe and joined forces with Hugo Barra for a buzzy stealth startup named /dev/agents. This month they emerged as Dreamer, a consumer-first platform to discover, build, and use AI agents and agentic apps, centered on a personal “Sidekick” that helps users customize experiences via natural language. Sidekick is nothing less than an “agent that builds agents”, with all the complexity that that entails.

You've seen many, many website builder, app builder, and even agent builder startups by now, but our favorite detail is the sheer amount of work that has gone into the “full stack” nature of the platform, including shipping their own SDK, logging, database, prompt management, serverless functions, and so on. Most platforms restrict the tech stack you can use just to get off the ground; Dreamer does it “right” by letting you push whatever arbitrary code you want to their VMs.

Paying the Builders

Of course, former leaders of Stripe and Android would not stop at building the tools; they are also building the ecosystem. Dreamer is deeply aware of the 4-sided network effect it has going on and is ready to fund all of it, from hiring Builders in Residence to awarding $10,000 cash prizes to the best tool builders for the Dreamer ecosystem.

It's time to Dream!

Full Video Episode on YouTube.

Transcript

[00:00:00] Meet Dreamer Purple
[00:00:00] swyx: Okay, we're here in the studio with David Singleton. Welcome.
[00:00:08] David Singleton: Hey, swyx. It's great to be here.
[00:00:09] swyx: It's great to have you. Uh, we're very simpatico in that your company color is the same as Latent Space's color.
[00:00:15] David Singleton: That's right. Dreamer Purple.
[00:00:17] swyx: It used to be /dev/agents, which I thought was very cool. 
It's like a callback to /dev/payments.
[00:00:22] David Singleton: Yeah.
[00:00:22] swyx: And you were obviously CTO of Stripe. And talk to me about just the origin or thinking process behind Dreamer. Yeah. And maybe, maybe start with like, what, what is Dreamer?
[00:00:31] David Singleton: Yeah.
[00:00:31] What Is Dreamer
[00:00:31] David Singleton: So Dreamer is a new product, uh, which everyone can come and play with today. Um, it's a place where everyone, literally everyone, can discover, build, and enjoy and use AI agents and agentic apps.
[00:00:45] And we really did design it for consumers, for folks who don't necessarily, uh, have any kind of technical background. It's really aimed at everyone. I think often of my sister. She's very smart. She's not in the slightest bit technical. She has lots of problems in her life that [00:01:00] she would like to be able to have great software and intelligent software to solve.
[00:01:04] But you know, even with the rise of tools like Claude Code and so forth, she's got no way to get started. And Dreamer is a place where she can come in, grab some intelligent apps that other people in the community have built, start using them right away, and solve real problems in her life.
[00:01:19] Sidekick And Waitlist
[00:01:19] David Singleton: And at the core, we have a personal agent called the Sidekick.
[00:01:24] Um, you can give your Sidekick a name, you can give it its own personality, and it really helps you across your entire day, your life. It helps you use all of the agents on the platform, and it also helps you build anything you want. And we've been working on this for a little while. We recently launched in beta.
[00:01:41] So anyone can go to dreamer.com, join the waitlist. Um, and we have many, many, many people in the community now who are building really fun, really powerful, really useful agents and agentic apps for themselves.
[00:01:54] swyx: I think we're gonna go right into a demo. Yeah. 
I just wanna make an observation that, uh, you, you, [00:02:00] you put discover first before build.
[00:02:02] Mm-hmm. But actually, at least for the engineers in the audience, ‘cause we are primarily engineers and you're primarily targeting consumers, right?
[00:02:08] David Singleton: Yeah.
[00:02:08] swyx: For engineers, like, there's a huge full stack of stuff, which we're gonna dive into. Right? It's so impressive. I'm like, holy s**t, this, this is what I've always wanted.
[00:02:16] Cool. Uh, so, so I think that's really good and, in some ways, I think given your background, given, uh, Hugo's, is it Hugo? Hugo.
[00:02:24] David Singleton: Hugo. Hugo Barra. Yeah.
[00:02:25] swyx: Hugo, it's not surprising that you can basically kind of build an app store, yeah, for agents.
[00:02:30] David Singleton: Yeah. So Hugo is my co-founder. Yeah. Um, Hugo and I met with our other co-founder Nicholas Jitkoff in the very early days of Android at Google, where we were building Google's first mobile apps.
[00:02:41] Uh, we then contributed to very core pieces of Android itself. And you're right, we were really excited about building two things. One, solving a bunch of problems that this breakthrough technology (here I'm talking about mobile) needed to have solved in order to make it work for real people at scale. And then secondly, building this ecosystem, um, [00:03:00] of third-party developers using the Play Store, um, and able to deliver way more value on the platform than we could have delivered on our own.
[00:03:08] And we think about Dreamer in exactly the same way. So I was working at Stripe, as you mentioned, and we had the opportunity to put some of the very first AI agent systems in the world into production. 
And from the moment we did the first of those, I was just struck with a strong sense of conviction that this is breakthrough technology that's gonna change how all of us work with computers and phones and so forth, all of the, the technology in our lives, but
[00:03:34] there's a lot of problems to be solved for real people to be able to make this approachable. Um, and it really is kind of a direct analog for what we were solving back in the early days of mobile apps at Google and, and Android. So it's, it's been fun to bring that to life.
[00:03:47] swyx: Yeah. Uh, let's look at it.
[00:03:48] David Singleton: Yeah, let's take a look.
[00:03:49] Dashboard And Daily Briefing
[00:03:49] David Singleton: So, uh, dreamer.com, this is our homepage. This is where you can come and, uh, watch some videos about what is here and sign up for the waitlist. Once
[00:03:57] swyx: you, I, I just wanna say for those listening, ‘cause we have a lot, you [00:04:00] know, switch to YouTube, look at the animations. So much care.
[00:04:03] David Singleton: We, we really care about, uh, this product being fun.
[00:04:07] Uh, and, and interesting to use. Obviously a lot of people are using it to do real important stuff. You can do real work, uh, here, uh, but also you can build fun things too. Once you get off of our waitlist, you'll come into the product. The first thing that happens is you'll have a conversation with your Sidekick, which is this little friendly, uh, character here.
[00:04:27] And Sidekick will seek to get to know you and understand you. What do you care about? And will help you discover and build your first AI agents or agentic apps. After that, you're, you're gonna have a dashboard. This is my dashboard. Everyone's is different. Um, you can see I have a few things here. I have a feed.
[00:04:42] So a lot of our agents do things in the background when you're not looking, and the feed is how they let you know what they've been up to. 
I have, uh, some widgets, uh, from apps that I have built. Uh, this one is called Calendar Hero. Uh, this is something that I installed from the gallery. Uh, so built by someone in our community.
[00:04:59] It's a [00:05:00] really powerful calendar app because for each of my meetings, if it's with someone I don't already know well, it'll actually go off and research them, um, and give me both a history of my interactions with those people and also a bunch of, you know, public useful information to, to get started. One of the things I love about this particular app is that every day it generates a podcast, um, a daily briefing.
[00:05:24] And one of the things that we've done with the platform is we've made it possible for all the things that agents do to show up in places that you care about. So if you look over here, this is the screen on my phone, and if I go ahead and open my Apple Podcasts, you can see right here: your Daily Briefing podcast is ready.
[00:05:39] This was produced by an agent running in my Dreamer account, and it was very easy, by scanning a QR code, to connect it to my Apple Podcasts. That's what I listen to in the car now every morning. Yeah. On my way to work.
[00:05:50] swyx: It, it
[00:05:50] David Singleton: preps me for, for my day.
[00:05:52] swyx: So one additional bit of context. I asked you immediately after seeing this, like, what, what about, I wanna talk back to my agent, and you said you actually started with voice and then you went to [00:06:00] podcasts.
[00:06:00] ‘Cause it's nice to have it pre-downloaded.
[00:06:02] David Singleton: That's right. That's right. Um, yeah, we, you, you can talk to your Sidekick. So, you know, on mobile we have, uh, a Dreamer app and you can talk to the Sidekick right here. 
Um, but we've actually found that making things, uh, show up in the other apps that you already use in your life is incredibly powerful.
[00:06:19] So let's take a look at what's kind of under the hood here.
[00:06:21] Gallery Tools And Payouts
[00:06:21] David Singleton: So I already mentioned that we have a gallery, so this is where you'll find a lot of agents from our community. Uh, there's many at this point, hundreds. And they are solving all kinds of, uh, use cases. I'd say the top use cases are on personal productivity, but also a lot of information management. That can range from personal information like docs and so forth, managing your emails.
[00:06:42] It also ranges out to public information that you might be interested in, but you need something to help manage the, the kind of fire hose of stuff that's coming at you. For instance, I have, um, an agent which looks at all the AI news, um, all the time. There's a lot of it, and it finds the stuff that I would actually be [00:07:00] interested in, um, and I find it incredibly useful.
[00:07:03] So these are agents that you can install that other people have built. Anything that you install on Dreamer, you can actually just say, I wanna start making some changes, and we'll look at that in a second. But in natural language, with the Sidekick's help, you can change any of these experiences to work just the way you want them.
[00:07:18] But the base layer of the system is tools. So you know, as well as anyone, swyx, that any AI system is only as good as the quality of data that it can pull in and the quality of action it can take. So before we launched our beta, we worked very hard to make sure that we seeded our tools with a bunch of very high quality and powerful integrations.
[00:07:39] So, you know, for instance, this is real Google search, this is actual Gmail. Um, and you can do very useful things with those. But also this is a platform for everyone. 
And as we got started talking to people in our alpha community, a whole bunch of sports use cases popped out, and we realized if you want to build something cool for sports with AI, you need really high quality live data.
[00:07:58] So look at these: [00:08:00] Formula One, MLB, NFL. Uh, these are tools, uh, that we've built. These are not data scraped off the web. This is a, a direct data feed integration. And because it's live and ‘cause it's high quality, you can build really powerful stuff. But tools is not something that we are just going to kind of control ourselves.
[00:08:19] The platform is open for tool builders to contribute tools that anyone on Dreamer can use. So, um, this is actually the place in the platform where I think software engineers, um, well, number one, we'd love for you to come and play with it. Uh, but software engineers are really gonna build, um, a lot of powerful stuff into the system.
[00:08:38] And we are actually sharing something for the first time on this podcast, which is that, uh, tool builders on Dreamer get paid. So if you publish a tool to the platform and a lot of agents use it, you'll actually get paid, uh, in proportion to their usage. And we'd love for folks to come and give this a try.
[00:08:54] We've got good docs that help you get started, and you can build things that, you know, scratch your own itch. For instance, someone built this [00:09:00] Ski Bum tool, which provides live snow conditions for a bunch of, uh, ski resorts. I'd love to show you how I've used that in a second. And also we have some tools partners where the tools themselves are pay-per-use.
[00:09:12] So for instance, Parallel Web Systems is a premium tool. Uh, you can do really cool stuff with it. Um, it's a, an agentic web research tool. And that one, because it's expensive to operate, is paid on a, on a per-usage basis. 
But if you're coming in to build agents on the platform, even the premium tools, you get a free trial.
[00:09:29] So you get a chance to actually try them out, make sure that the use case is good for you before you decide to sign up. So that's tools. So we have the gallery, we have tools, and then the Sidekick helps us put all of this together to build agents. We do that in the Agent Studio. You can also do this on your phone, but if I open up Agent Studio here on desktop, Sidekick's just gonna start a conversation about what you want to build together.
[00:09:51] I'd love to show you one that I made recently.
[00:09:53] swyx: Let's do
[00:09:53] David Singleton: it.
[00:09:53] Building A Conference App
[00:09:53] David Singleton: Um, let's look at something that hopefully is kind of near and dear to your heart. So one of the things I love about Dreamer and this kind of moment in technology is that, if you think about it, there are all these things in your life where, have you ever gone to a conference?
[00:10:09] I know you have. Right? And, uh, big conferences have apps. Um, and these apps are usually built by agencies and they're, they're usually actually quite expensive to build. I've been involved in running some of these myself. And how many conferences have you been to where the app was good? Zero. Honestly.
[00:10:23] swyx: Exactly. Zero,
[00:10:24] David Singleton: maybe one. I, I've, I've been to one conference that was pretty good. Wait, wait session sessions. Um, but, but the point is, they're rarely great pieces of software. Right. And they're also expensive to build, but they're, they're interesting ‘cause they're episodic, they last for this one thing. Um, and then they're, they're not relevant anymore.
[00:10:43] Um,
[00:10:43] swyx: and so it's the worst feeling to invest in them because, you know, it's like, it's got a limited date.
[00:10:48] David Singleton: Absolutely. So I decided to build, uh, a conference app for your AI Engineer conference. 
Amazing. Uh, on Dreamer. One of the things that swyx has done, uh, which I [00:11:00] thought was very forward-looking, is actually put a whole bunch of data about the conference on the webpage in an LLM-readable way.[00:11:06] There's an llms.txt file, there's a feed of all of the sessions in JSON. So I used the data from your conference last year and built this intelligent app, uh, just by talking to our Sidekick, uh, in Dreamer. So just to give you a quick tour, this is my dream conference app. What I always wanna do for conferences is I wanna be able to search for speakers.[00:11:28] I'm usually there because, uh, there is a speaker I care about. So, you know, swyx, you're the speaker I care about. I can actually see here who you're on stage with. So here's Greg Brockman of OpenAI, uh, and this is his session. And look, Greg and swyx are the speakers. So let's add that to my schedule.[00:11:45] Great. And then maybe there's a couple others I might see here. Like on day two, I remember there were some keynotes. So, uh, Building the Open Agentic Web, that sounds fun. So I add that to my schedule.[00:11:55] swyx: She's now CEO of Xbox.[00:11:56] David Singleton: Awesome.[00:11:57] swyx: Which is interesting. So cool. So,[00:11:59] David Singleton: so I've [00:12:00] gone through and picked out a couple of sessions that I cared about.[00:12:03] That's as far as I usually get with any conference app. But of course you've got the whole of the rest of the conference to figure out what to do. So here is where the native intelligence of these things you build on Dreamer can come in. So I'm gonna click guide me. So Dreamer's Sidekick actually parsed out the whole schedule and figured out what some of the themes are and I can choose what I'm interested in here.[00:12:23] I'm definitely interested in agents. Uh, I'm definitely interested in code generation and also reasoning and RL. So now I'm gonna say build my schedule. So what this is doing is:
It's going across every time slot for the conference. And it's choosing among the things I could go to, which one it thinks is best for me based on my interests.[00:12:41] It also uses its own memory of me that's part of Dreamer, uh, to understand what I might like best. And you know, there's an LLM prompt running for each one of these time slots. So it's not super fast, but it'll be done in about 30 or 40 seconds. And I'm gonna have a special custom schedule for the conference.[00:12:57] This, like I said, is my [00:13:00] dream conference app. It's exactly what I've always wanted and I was able to build this yesterday morning. Um, I did it between some meetings. I think I spent a total of 25 minutes of wall clock time on it. I did it over the course of a couple of hours. And, uh, here is my schedule for the conference.[00:13:15] I can see it in a calendar view. This is what I should do on Tuesday, this is what I should do on Wednesday. Oof, no conflicts, but, you know, I may not go to every single thing. And there you have it, built in, you know, Dreamer. So let's take a look at what the building experience actually looks like. So this is the actual account that I made it on.[00:13:32] Oh, of course I should say anything you build on Dreamer also works on your phone. So, uh, here is my AI Engineer conference app right here on my phone. Got all the same functionality, and of course this is the best place to jump into my schedule.[00:13:46] swyx: Yeah.[00:13:46] David Singleton: Um,[00:13:46] swyx: so you could generate a podcast about it, just completely multimodal, absolute thing, right?[00:13:51] To me, I mean, this is why I outsource, I mean, well, I posted the llms.txt, the JSON, because you cannot run an engineer conference in 2025 [00:14:00] and not let engineers do whatever they want.
[00:14:02] David Singleton: Yeah.[00:14:03] swyx: And since all conference apps suck, I'm just gonna put up a bare minimum viable app and just let people do whatever they want.[00:14:09] David Singleton: Totally. And the cool thing about this on Dreamer is I published this to the gallery and you can use it, so you've got one that's built to my taste of conference apps. I think it's pretty cool. But you might want something different. Yeah. In which case you just start telling the Sidekick how to change it.[00:14:23] So let's just very quickly look[00:14:24] swyx: at our, what sports grid is also, you can fork it, right? That I can publish. That's right. I can publish your one and go, this is the base starter. It's got good defaults, but go customize, whatever.[00:14:32] David Singleton: That's right. That's right.[00:14:33] swyx: Yeah.[00:14:33] Agent Studio Under The Hood[00:14:33] David Singleton: So let's take a look at how I actually built this.[00:14:34] This is real. So I'm gonna say make changes. This experience we're looking at now is our, uh, agent development studio. Um, like I said, you can do this on your phone as well. And in fact, this one I started out on desktop. Let's look at my actual prompts. I said, let's make an agent called AI Engineer Schedule Planner; it should be a custom schedule planner for the AI Engineer conference.[00:14:53] I'm not gonna read this all out. You get the point, and I told it where to get the data from. So that was the first prompt. And actually after I gave it that [00:15:00] prompt, I actually had a simple version of this app working, um, after the Sidekick took one turn. So the Sidekick is, like, a professional software engineer, and we've worked very hard to make this work and build functional apps for folks that might not have any engineering experience whatsoever.[00:15:14] So, you know, down here we have build logs that are technical, but you can hide those away.
And Sidekick, as it is building, will actually translate everything that is coming out of, uh, the harness into English that you can actually read. And by the way, this English is in the personality of your Sidekick, which is fun.[00:15:32] Um. And the way that we build agents and agent apps, it's a little different to what you might have seen in some other platforms for a couple of reasons. One, just the build process. The very first thing that Sidekick does, it understands all the agents you've got set up. It understands all the tools and it will come up with a plan for how to realize your goal, how to make sure it actually has the data and the capabilities to complete it.[00:15:54] It will occasionally refuse. If it can't do what you're asking, it will tell you, I can't do that, it needs another tool. And that's a good [00:16:00] jumping off point for any of the tool builders out there to build a new tool. So it'll first figure out how, then it will build it, and then it will actually test it.[00:16:07] So it will actually make sure that the thing that it has generated is realizing your goal. And you probably know as well as anybody that anytime you can get any modern state-of-the-art coding model into a loop where it can make changes and perceive its own output and then fix bugs, magic happens. So these builds, the first build will often take 10 to 15 minutes on Dreamer, which is a little bit longer than you might've seen on some other platforms.[00:16:31] But the first thing that it creates will work most of the time. And then of course, as you start making smaller changes, you can, like, ask it to tweak the UI in any way that you like. Those are much faster. And just to give you a sense, uh, for this one, here's something I asked: put a logo, I gave it a logo file in static files,[00:16:48] use that as the title. So for folks that actually really want to dig, uh, into a bit more detail, we've provided a powerful IDE here.
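The plan, build, test loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Dreamer's implementation: `plan_build_test`, `BuildOutcome`, and the toy `build` and `test` callables are hypothetical names invented for the example, and the "refuse if a tool is missing" check is a simplified stand-in for the planning step.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class BuildOutcome:
    succeeded: bool
    attempts: int
    log: List[str] = field(default_factory=list)


def plan_build_test(
    goal: str,
    available_tools: Iterable[str],
    required_tools: Iterable[str],
    build: Callable[[str, Optional[str]], str],
    test: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> BuildOutcome:
    """Plan first, then loop: build, test, feed failures back into the next build."""
    log: List[str] = []
    available = set(available_tools)

    # Plan: refuse up front if a needed capability is not on the platform.
    missing = [tool for tool in required_tools if tool not in available]
    if missing:
        log.append(f"refused: needs another tool ({', '.join(missing)})")
        return BuildOutcome(False, 0, log)

    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        artifact = build(goal, feedback)   # build, using the last failure as input
        feedback = test(artifact)          # None means the tests passed
        if feedback is None:
            log.append(f"attempt {attempt}: tests passed")
            return BuildOutcome(True, attempt, log)
        log.append(f"attempt {attempt}: {feedback}")
    return BuildOutcome(False, max_attempts, log)


# Toy stand-ins for the coding model and the test harness (hypothetical).
def toy_build(goal: str, feedback: Optional[str]) -> str:
    return f"{goal} (with logo)" if feedback else goal


def toy_test(artifact: str) -> Optional[str]:
    return None if "with logo" in artifact else "logo missing from title"


outcome = plan_build_test(
    "schedule planner", ["sessions_feed"], ["sessions_feed"], toy_build, toy_test
)
print(outcome.succeeded, outcome.attempts)  # True 2
```

The point of the shape is that the test result is the only thing carried between attempts, which is what lets a coding model "perceive its own output and then fix bugs".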
So I can actually see here's the code that was generated, and some pieces of the [00:17:00] code are more accessible than others, like the prompts. So this is the prompt that's used by a powerful LLM in order to do that schedule picking.[00:17:08] And I can actually read it here directly. I can edit it without having to ask the Sidekick if I want to do that.[00:17:12] swyx: So this is very nice.[00:17:13] David Singleton: This is for the more, uh, sophisticated users.[00:17:16] swyx: Yeah. This is other people's entire startup: prompt management.[00:17:21] David Singleton: This is true. The other thing that is different about Dreamer is once you've built something here, it's ready to go.[00:17:28] We host it. So you don't have to worry about getting a database from a database provider, signing up, getting API keys. You don't have to worry about your LLM provider tokens. All of that is hosted on the platform. And you can use it yourself. You can share it to the gallery for other people to riff on it.[00:17:46] You can also share it with your friends and coworkers to use your instance of the agent or agentic app. And we're seeing that happen a lot in our community. We've seen a whole bunch of folks who built little applications for their personal life [00:18:00] and shared them with their significant other. We've seen people who are building little productivity apps for their team at work and sharing it, uh, among them.[00:18:07] And we actually do this a lot inside of the company. So at this point we pretty much run the company on Dreamer agents for all kinds of important things. Uh, maybe a good example of that is, um, our wait list. People are signing up, and every time someone signs up for our wait list, a Dreamer agent will actually research, uh, that person.
[00:18:25] And we're looking for folks who are builders, not super technical, to build agents and come in, uh, and give us a lot of feedback, and we're prioritizing bringing those people off of the wait list first.[00:18:35] swyx: Just a quick question on that one, because it may not come up again. Do you find enrichment APIs to be useful, like ZoomInfo,[00:18:42] uh, Clearbit?[00:18:43] David Singleton: Enrichment is a very, uh, common use case, um, on Dreamer. Any application on Dreamer can kick off a sub-agent to do a particular task. Um, so this actually is a powerful agentic harness that runs inside of its own [00:19:00] VM. Uh, we call them Sidekick tasks ‘cause they actually run in the context of the Sidekick.[00:19:04] I'll talk more about Sidekick in a second. And enrichment is a very common use case. And the cool thing about a Sidekick task is that it has access to all the tools on the platform, but also public data as well. And so very frequently enrichment on our platform happens using public data that can be found on the web.[00:19:24] There are some tools for getting people data, uh, from various bespoke systems. And so that works pretty well. But actually, you'd be surprised. I mean, we would love if someone out there would like to build a ZoomInfo tool. We don't have one today. We'd love to see that on the platform, and I'm sure it'll be very powerful.[00:19:39] But we're also seeing that this powerful agent harness can pull a lot of data in. On that note of tools that make experiences better, we're constantly adding more tools because people in the community are building them and publishing them. We review the tools carefully and then they go live for everybody.[00:19:54] Yesterday we added Granola. And that was pretty cool.
So actually, uh, Sarah on my team was [00:20:00] talking to, uh, someone building on the platform this morning and they have an agentic app that they built, which is a kind of magic to-do list. So they put stuff on their to-do list and for each thing it kicks off one of these, uh, Sidekick tasks to figure out how to move the ball forward.[00:20:14] Sometimes it'll complete it entirely.[00:20:15] swyx: Yeah.[00:20:16] David Singleton: Often by calling another agent on the platform, and sometimes it just kind of researches it and helps ‘em take the first step.[00:20:21] swyx: Yeah. Do you know, this is Sam Altman's number one ask for an AI app. It's the self-completing to-do list.[00:20:26] David Singleton: Yeah. The self-completing to-do list is something that a lot of people have built on Dreamer and are getting a lot of use out of.[00:20:32] Yeah. And finding it actually genuinely... I should, I should try that. Mm-hmm. Please do. And you'll even find some in the gallery that you can remix. So he was saying this morning that he built this self-completing to-do list, uh, on Dreamer already. But he connected the Granola tool yesterday and now something really magical happens, which is when he says in meetings that he's gonna do a thing, it magically shows up on his to-do list and then it can magically get completed.[00:20:56] And then, as I mentioned, all the agents, all the [00:21:00] apps on Dreamer can actually work together. So our coding agent, as it builds them, does something very special where it exposes the internals of each of the experiences to the system. And then Sidekick can manipulate those to get stuff done. So he has built another agent, which he uses for recruiting.[00:21:18] It kind of keeps track of candidates and also it's got a kind of mini CRM function, so he's able to introduce candidates to each other.
He told us this morning that something he'd committed to do in a meeting that was recorded on Granola yesterday showed up in his magic to-do list, and his magic to-do list,[00:21:34] it was like, introduce a person for recruiting, used his recruiting agent to get it done.[00:21:39] swyx: Ah,[00:21:39] David Singleton: um, and this is the dream. This is why we started the company. It really is the case that you can build and use these very powerful, bespoke experiences that can automate your life by working together. And I'd love to talk a little bit about how they work together.[00:21:55] Ecosystem Trust And Monetization[00:21:55] David Singleton: So obviously it's really cool to have [00:22:00] software that will work on your behalf, but it's only useful if you can trust it, right? So privacy and security is very important to us. Making these things accessible while also being trustworthy is hard. So the model that we have, which is working very well, is that the Sidekick is at the core of everything here.[00:22:22] So it is both your companion, your helper, but it's also the traffic cop in the system. So when one agent wants to work with another agent on Dreamer, it doesn't do it directly, it does it via the Sidekick. It will ask the Sidekick to do the thing. And the Sidekick understands everything: all the expectations that have been set with me as a user about what agents can do, and which tools I've given them permission to use.[00:22:45] And it will make sure that whatever is going on is actually aligned with my own interests. And you know, that's part of the background that I bring to this problem domain. I've worked for years, uh, keeping very important information safe and secure. And [00:23:00] so as we started to think about this problem, we realized that we actually had to build something that's a bit like an operating system.[00:23:06] You know, the Sidekick's like the kernel, the agents and apps are like users. Yeah.
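The "traffic cop" model described here, where one agent never calls another directly and every request goes through a mediator that knows what the user has permitted, can be sketched as follows. This is an illustrative sketch only: `Broker`, `PermissionDenied`, and the agent names are hypothetical, and the transcript does not describe how Dreamer actually stores or checks permissions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple


class PermissionDenied(Exception):
    """Raised when the user has not approved this caller -> callee route."""


@dataclass
class Broker:
    """Mediates every agent-to-agent request, like a kernel mediating syscalls."""
    handlers: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    grants: Set[Tuple[str, str]] = field(default_factory=set)

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        self.handlers[name] = handler

    def allow(self, caller: str, callee: str) -> None:
        # Recorded only when the user explicitly grants the permission.
        self.grants.add((caller, callee))

    def request(self, caller: str, callee: str, task: str) -> str:
        if callee not in self.handlers:
            raise KeyError(f"unknown agent: {callee}")
        if (caller, callee) not in self.grants:
            raise PermissionDenied(f"{caller} may not call {callee}")
        return self.handlers[callee](task)


broker = Broker()
broker.register("recruiting", lambda task: f"recruiting agent handled: {task}")
broker.allow("todo_list", "recruiting")

# The to-do list agent reaches the recruiting agent only through the broker.
print(broker.request("todo_list", "recruiting", "introduce two candidates"))
```

Because agents hold no direct references to each other, an agent that was never granted a route has no way to reach another agent's data, which is the property the operating-system analogy is pointing at.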
Different rings. Exactly. Because if you try to pick off just one piece of this, you can't actually make it work for people at scale. Uh, because you could build little vibe-coded apps, but they're gonna grab all your data willy-nilly.[00:23:23] They won't be able to work together. You actually have to invest in the fundamental core in order to make it work well for people. And that's what we've been doing and it's, uh, it's been a lot of fun. One other thing I wanted to mention is, um, I've obviously talked about two things, tools and agentic apps.[00:23:42] We really designed Dreamer to be an ecosystem and a platform, and one of my favorite quotes about platforms, I think it's from Bill Gates, is that you can only be a platform if you create more value for the folks participating and using the platform than the platform itself creates. [00:24:00] And that's our goal here.[00:24:01] So we at every step have been thinking about how do we make sure that other people are deriving even more value from Dreamer than we are? So in that vein, I already mentioned tool builders get paid and people can build agents that solve their needs and share them with others, and we are already thinking about ways that they can actually monetize those as well.[00:24:24] Against that backdrop, one of the things that we are launching today is our Builders in Residence program. So there are tons of people building really cool stuff and contributing it to the gallery already, but we've been really inspired by programs we've seen at other companies where artists might be in residence, people that are very creative[00:24:43] and might have ideas outside of what the folks at the company or in the ecosystem already have. And so we are looking for creative people who have fun ideas and, you know, want to really figure out how to apply their creativity at the cutting edge [00:25:00] of technology today to come and work with us.
So, uh, if you go to dreamer.com/latent space, you'll find, ooh, well, we love Latent Space.[00:25:09] Uh, you'll find a link both to, uh, our tool builder information and our Builders in Residence program. And for builders in residence, we'll let you in off the wait list quickly, build an agent, and then for a small number of the most creative folks, we're going to pay you to build agents. Uh, you can work directly with our team.[00:25:29] You know, this is like building Legos. So, you know, we've got some of the basic blocks together already, but if you need a round steering wheel and we don't have one already, like, we'll build it for you. Yeah. Um, we really want to be inspired by, uh, these builders in residence.[00:25:43] swyx: This Legos thing is pretty common as an analogy.[00:25:46] And there's a thing I call the master builder. Uh, the actual Lego company has master builders that they employ, yeah, to inspire people and post on socials.[00:25:56] David Singleton: That is exactly what inspired us as well. Honestly, we talked about the Lego Master [00:26:00] Builder program, so that's our Builders in Residence program.[00:26:02] swyx: Yeah.[00:26:03] David Singleton: Um, and then, uh, finally, back on tools. Like I said, anyone can come in and build tools today. If you follow the Latent Space link, dreamer.com/latent space, again, we'll get you directly off the wait list. So you can build right away, you can monetize by publishing onto the platform. That's for everyone. For the very best tool that gets added to the platform by mid-April,[00:26:23] uh, we have a $10,000 prize that we want to give out, really because we just want to seed the creativity of everyone out there. So we're excited to do that.[00:26:31] swyx: Yeah. And you know, uh, this is completely a flywheel, right?
Like, the more tools, the more builders, the more, the third thing, agents, you know, it just feeds into each other.[00:26:39] David Singleton: That's right.[00:26:39] swyx: Yeah. Just on the payments thing, because we probably won't touch on that again, but I have to ask the former CTO of Stripe about payments. Presumably you're using Stripe Connect.[00:26:48] David Singleton: Yeah.[00:26:48] swyx: Um, any pain points? People are very interested in agent commerce and micropayments and all these things.[00:26:55] Presumably stablecoins get into the conversation at some point, but maybe not now.[00:26:58] David Singleton: Yeah, we are [00:27:00] really, really excited about agentic commerce. The first step we are taking is to help people in the world who have never been able to build these kinds of experiences and software before to build stuff that meets their passions, share it with the world and get paid.[00:27:14] So that's all commerce that happens on our platform, and so we don't need anything new to facilitate that. Stripe Connect has existed for quite a while and is the perfect solution for this kind of stuff, so, um, we're excited about that, first and foremost. However, a lot of the things that people are already doing on Dreamer, we just talked about a self-completing to-do list.[00:27:34] A lot of the ways that you want to complete to-dos is by actually closing the loop in the real world, and that's going to involve the exchange of value. So we have some folks that are building tools already that actually do have money move in order to complete that loop. So far, we just want to be open and agnostic to all the protocols out there.[00:27:54] I honestly think this moment in time is a little bit like the early web. So I personally started coding as a kid [00:28:00] and I think I got access to the internet in about 1995, 1996.
And back then, uh, the web existed, you know, HTTP was a protocol, but there were also other protocols I was using all the time, like Gopher and UUCP and, uh, various others.[00:28:15] So the point is, like, the web, HTTP and HTML, was just one among many protocols. And of course it became the winner and it's awesome. Yeah. Um, but the others were also kind of interesting and viable at the time as well. And I think the world of agentic commerce is like this right now also.[00:28:30] swyx: ACP.[00:28:31] David Singleton: ACP, exactly.[00:28:32] All the CPs. You know, on Dreamer, we hope that folks will build tools that kinda make use of all of these things, but I'm sure that at a certain point one or two will emerge as the winners, and then we'll be able to build, like, really deep support in.[00:28:44] swyx: Yeah. This is like maybe a complete tangent, but I do think about how a lot of these companies, AI companies in particular, have to switch from seat-based to usage-based, because of course, but then they end up having to sort of [00:29:00] obscure the margins a little bit and then they end up inventing their equivalent of Robux.[00:29:04] David Singleton: Mm-hmm.[00:29:04] swyx: Uh, where they're like, well, okay, well, every company should have their own currency. And it's like a very short leap to a token.[00:29:11] David Singleton: Yeah.[00:29:11] swyx: Or, and I'm like, okay, well, where does this end? I can't really play out the next step as to, like, is this chaos? Is this,[00:29:18] David Singleton: yeah.[00:29:18] swyx: Okay.[00:29:18] David Singleton: Well, I think it is kind of like the wild west.[00:29:21] I don't mean that in a completely, it's all completely disorganized way, but there's just so many things that could happen from here. The Overton window is very wide, right? Not sure how this might land.
And I'm just very excited to be building a platform that can take advantage of all of those opportunities, and we're just gonna be there,[00:29:36] uh, working for our users to make sure that things that emerge work.[00:29:39] swyx: You're gonna own the consumers, you're gonna be the OS, the app store, for everything.[00:29:43] David Singleton: So one of the ways to think about this is, um, Dreamer actually uses all of the state-of-the-art models. As a user, you don't have to think about, should I be using, you know, Opus 4.6, or should I be using the 5.4 model from [00:30:00] OpenAI?[00:30:00] We are continually doing evals and so forth to make sure that the best things are there for you. You can just build on the platform and know that as the world shifts around, you're gonna get the right stuff for you. Um, and I think that's something that is needed to actually have folks take advantage of this technology at scale.[00:30:19] I'd love to show you another example of something I built.[00:30:21] swyx: Let's do it.[00:30:22] David Singleton: This is another example of software that just lasts for a certain moment in time. So recently I went on a ski trip with a bunch of friends,[00:30:31] swyx: Ski Bum.[00:30:31] David Singleton: Uh, so it uses Ski Bum, yes. I went on a ski trip to Big Sky. I'd never been there before.[00:30:38] And I made this little intelligent app for us. And you can see it says it's loading Big Sky conditions. So it's actually calling the Ski Bum tool that I just showed you, which is, uh, published in our gallery. So what is this? This is a little app that was just for our weekend trip. It shows the current status of all the lifts at Big Sky[00:30:54] using that tool from the ecosystem. It shows the forecast for the upcoming weekend. It shows our [00:31:00] accommodation. This is just, like, where my group was staying.
This is just for us, and also a bunch of dining information that one of our friends, uh, put together, who's an expert on Big Sky. So I was able to take this app, share the link with my friends.[00:31:12] They weren't on Dreamer yet, I just sent it to them on iMessage and they get a version they can use on their phone. And of course, here's the real kicker. So I've been on ski trips before and other weekend adventures with my friends. Yeah, people pay for different things and at the end of the weekend it's always a pain to figure out who needs to pay who to settle up.[00:31:29] So we used this during the weekend. We added all of our expenses in here. Uh, too close are it's drill data. It's only too closely. And then at the end of the trip, we pressed split, we settled up, and we're done. So there's another Dreamer. This was all through Dreamer? So the actual payment? No, no.[00:31:47] It happened because we paid for stuff in the real world. It was like, okay, this person needs to pay that person 20 bucks. Right? Right. This person already paid in that. Right. So it just helped us all settle up. We didn't move the money on Dreamer. You could do that. And in fact, if you're a tool builder [00:32:00] thinking about this and getting excited, like, come build a tool to do that stuff.[00:32:02] We really think of our tool builders as design partners.[00:32:05] swyx: Yeah. I got the tool. Uh, like, I use Bank of America. I hate the bank, I hate the app. Mm-hmm. I hate the web. All banking websites are just horrible.[00:32:13] David Singleton: Yeah.[00:32:13] swyx: So just build me, like, build a thing on top of Plaid.[00:32:15] David Singleton: Yeah. Right. And then just, so...[00:32:17] swyx: Vibe code my banking app.[00:32:18] David Singleton: There's already a tool for that.[00:32:20] Oh. So, um, Attain Finance is a tool a builder in our community built. Okay. Um, and it uses a secure system like Plaid to access your, uh, financial data, and you can build powerful personal finance agents on Dreamer today using this tool.
And like I said, we review tools carefully. So when bringing Attain Finance onto the platform, we did actually quite a detailed security review with that company to make sure that if folks build stuff with it, it's gonna work well.[00:32:49] So yeah, check that out. I'm pretty certain it connects to Bank of America. So you'll be able to build the app that you wanted already.[00:32:55] swyx: Yeah. There's a couple of points I wanted to sort of dive in on, maybe highlight to folks, [00:33:00] because, obviously, I spent more time with Dreamer. So we were making a point where you choose on behalf of your users because they're meant to be consumers.[00:33:07] So maybe less technical,[00:33:08] David Singleton: right?[00:33:08] swyx: But obviously people can, power users can override. If you... but it's not just LLMs, it is also the transcription. It's like all, like, there's a first-party curated set of, here's the house opinion on what the thing is.[00:33:21] David Singleton: That's right. That's right.[00:33:22] swyx: What's the list? Is there like,[00:33:24] David Singleton: yeah, so actually if you look in the tool gallery, the first-party kind of curated set are all the ones that have these grayscale icons. So we have a built-in tool for image understanding, for image generation, for RSS, exploration, text to speech and so forth.[00:33:38] swyx: Recipes.[00:33:39] David Singleton: Uh, we actually do have a built-in recipes tool.[00:33:41] It turns out that a lot of people in our alpha wanted to do stuff for cooking. Yeah. Um, and you know, you can scrape the web to get good recipes, but we were able to quite quickly find a good repository of recipes. It works great here.
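The "press split" step from the ski trip app earlier comes down to netting out what each person paid against an equal share and then pairing people who owe with people who are owed. The sketch below is one simple way to do that, not the code Dreamer generated: `settle_up` and the names in the example are hypothetical, amounts are integer cents to avoid floating-point rounding, and it assumes an equal split among everyone listed.

```python
from typing import Dict, List, Tuple


def settle_up(paid_cents: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (debtor, creditor, amount_cents) transfers that even out an equal split."""
    people = sorted(paid_cents)
    if not people:
        return []

    total = sum(paid_cents.values())
    share, remainder = divmod(total, len(people))

    # Balance = what you paid minus what you owe. The first `remainder` people
    # (alphabetically) owe one extra cent so the shares add up exactly.
    balances = {
        person: paid_cents[person] - (share + (1 if i < remainder else 0))
        for i, person in enumerate(people)
    }

    debtors = [[p, -b] for p, b in balances.items() if b < 0]
    creditors = [[p, b] for p, b in balances.items() if b > 0]

    transfers: List[Tuple[str, str, int]] = []
    i = j = 0
    while i < len(debtors) and j < len(creditors):
        # Move as much as possible between the current debtor and creditor.
        amount = min(debtors[i][1], creditors[j][1])
        transfers.append((debtors[i][0], creditors[j][0], amount))
        debtors[i][1] -= amount
        creditors[j][1] -= amount
        if debtors[i][1] == 0:
            i += 1
        if creditors[j][1] == 0:
            j += 1
    return transfers


# Hypothetical weekend: Ana paid $90, Ben paid $30, Cy paid nothing.
print(settle_up({"ana": 9000, "ben": 3000, "cy": 0}))
# [('ben', 'ana', 1000), ('cy', 'ana', 4000)]
```

As in the transcript, this only works out who should pay whom; moving the money is a separate step.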
Yeah.[00:33:55] Stable Tool Interfaces[00:33:55] David Singleton: So the point behind these though is that we'll keep the interfaces stable, so they'll always work.[00:34:00] But you know, the best translation model and, you know, there are people using this translation tool to translate Chinese podcasts into English. It's, it's pretty powerful. It can deal with very long text, but the best translation tool today might be different from the best translation tool sometime next year.[00:34:15] And we're just gonna make sure that that translation tool is always pretty close to state of the art. So you can build something and you know it's gonna continue to work well. Of course, some of our tools are branded. You may actually have a preferred way of buying groceries, like maybe you prefer Instacart and that's great.[00:34:29] You can use the Instacart tool specifically.[00:34:31] swyx: Yeah.[00:34:32] Partnerships And Ecosystem[00:34:32] swyx: Your partnerships, uh, I mean, I don't know if you ever hit of partnerships, but this is gonna be a bonanza for anyone on to do deals.[00:34:38] David Singleton: We have an amazing person who, uh, works on all of our partnerships. Um, and it's part of what you have to do to build a platform like this that's gonna work for people.[00:34:46] Like, we've gone and done that. 
Schlep has a lot of work, one talks lots of different companies, um, in order to make sure that you've got good tools at the core.[00:34:54] swyx: Yeah.[00:34:54] David Singleton: And then of course, because we're open to tool builders contributing to the platform, this is only gonna get better and better and [00:35:00] better.[00:35:00] swyx: Yeah.[00:35:01] Agent Lab Routing Layer[00:35:01] swyx: One observation I have this, this is gonna master a thesis I've been pursuing, which is, uh, what I've been calling an agent lab[00:35:05] David Singleton: mm-hmm.[00:35:06] swyx: Where you sort of different than a model lab in, in, in the sense that you never train your own models, but you are the router evaluation layer, ex subject domain expert for choosing between, uh, models.[00:35:18] David Singleton: Yeah.[00:35:18] swyx: And you're explicitly doing these things. And so like in my sort of construction, every agent lab does some version of this where like, here's the image understanding endpoint and we will route for you and don't worry about it. Yeah. Sally, I think it's kind of cool.[00:35:32] David Singleton: I, I think it makes total sense. Um, and again, to make this work for folks that don't follow the AI news every day, it's an actually, it's a, it's a really important thing to do.[00:35:42] Yeah. And it, it's been, it's been a real pleasure. I mean, I'm a, I'm personally a total geek for this stuff. I love it. And being able to go and dive into all those details in order to make it work well for other people. It's a true pleasure. I cannot imagine working at anything else right now. It's just so much fun.[00:35:56] swyx: The tricky part is multimodality when some of these things do [00:36:00] merge.[00:36:00] David Singleton: Mm-hmm.[00:36:01] swyx: And you are, you're sort of, this is your imposing structure on things that fundamentally don't want to be structured. 
And so sometimes that might work against you, but for 99% of these cases, this is fine.[00:36:10] David Singleton: Yeah. I mean, I think it's gonna be very interesting to see how the world matures, because a lot of the power of Dreamer is the ability to kick off these sub-agents, so these powerful agent harnesses, which can actually change how they work based on the data.[00:36:25] I actually think that we will be able to kind of keep up with and stay at the forefront of the changing landscape of how tools and systems work together. And that's new. You know, software didn't used to work like this and now it does. Um, so even just figuring out how to design the right primitives to make that possible has itself been a lot of fun.[00:36:44] Builders Can Publish Tools[00:36:44] swyx: This is a sort of maybe two-part question: why can't Dreamer make its own tools? And then why don't you let builders maybe stand up their own routing group? I call this a routing group, right? Like where it's like, collect... yeah, things.[00:36:58] David Singleton: So two things. To [00:37:00] some extent, Dreamer does make its own tools, in that agents appear to the system as tools.[00:37:05] So they can be used to accomplish things. So you can build an agent that is essentially a tool. Yeah. Um, and it,[00:37:12] swyx: which is to me very useful for reuse.[00:37:14] David Singleton: Right.[00:37:14] swyx: Right. Exactly. ‘Cause, like, this is the way I like it. Now, my next five apps, I don't want to do this whole series of back and forth again.[00:37:20] David Singleton: Right.[00:37:21] swyx: Yeah.[00:37:21] David Singleton: Um. Then at the tool layer of the system, it's open to anyone. So it's actually quite powerful and flexible. So if you wanted to add a tool, which was, uh, imagine that you were training your own foundation model, swyx. That might be fun.
And imagine you wanted people to be able to play with, I don't know, maybe you make like, you know, nano chat or whatever and you want to, yeah, [00:37:42] let people play with your own nano chat and see how I change themselves. [00:37:44] swyx: Now. [00:37:45] David Singleton: You could publish a tool that is Nano Chat and nano image generation behind a tool, and it could be your own router if you wanted to. I see. And honestly, if that's the kind of thing that gets you excited as a builder, please come and do it. [00:37:57] Like, we really are [00:38:00] believers in this idea that we aren't going to figure out every single detail ourselves. We're gonna make sure it's a safe and fun place to build this stuff, but we're really open to these ideas coming from other people. And so I'd like nothing more than for you to come in and build a tool that does some of that cool stuff that you have in mind. [00:38:15] swyx: Yeah. Awesome. [00:38:16] David Singleton: And just as a reminder, if you'd like to do that, the way to find the links is dreamer.com/latent space. And for a limited time on that page, anyone who's listening to this podcast will also get directly off of our wait list. It's quite long right now. We are working hard to bring it down. [00:38:32] So skip the wait list. [00:38:33] swyx: You know, I think that's fantastic. I think it is really sort of a pro-builder way to do it. I wanted to jump back to the bar. Yeah. You know I get excited about this. [00:38:41] David Singleton: Yes. Okay. Let's get back in there. [00:38:43] swyx: Like, you know, this is the engineer podcast, so let's get [00:38:46] David Singleton: Yeah. [00:38:46] swyx: as technical as you can. [00:38:47] David Singleton: Yeah. [00:38:47] swyx: On everything you've built, like, have a show off. [00:38:50] David Singleton: Yeah. 
Okay. [00:38:51] Under The Hood Debugging [00:38:51] David Singleton: So let's go wild in the aisles in the agent studio. So as you can see, over on the left here is a conversation with the sidekick, where you ask it what to do and it will explain in English that anyone can understand what's going on. [00:39:03] But if you want to pull back the covers and look under the hood, if you're an engineer like me, then we have this kind of debug drawer at the bottom. So you can see the full build logs here, but you can actually also dig in and see the files and prompts that have been generated. You can upload files from your computer as static files. [00:39:24] Um, [00:39:24] swyx: very important, [00:39:25] David Singleton: uh, indeed. You can actually read the prompts that have been generated for you. We intentionally put an example in here just so that you can see what the format looks like. And then, you know, we already looked at this one that was generated for this particular app, but if you actually want to bring the code out of Dreamer and work on your own local machine, you can. [00:39:45] So at the core of everything here is an SDK with a powerful command line interface, and we built that first. It's actually possible to build agents on Dreamer without talking to the sidekick. You can write code with your fingers on a keyboard if you want to. I know that's very [00:40:00] antiquated now, but actually this can be a lot of fun. [00:40:02] So if you wanna pull it out onto your laptop, you can use our CLI and you can edit it in Cursor or in Claude Code. You know, you don't have to use our sidekick. And the CLI actually has full access to the rest of the platform with you as the user. So, you know, obviously it is secure and privacy sensitive, and this is a way that some of our most technical builders do build stuff on the platform. [00:40:24] The really cool thing is the sidekick. 
When it's in coding mode, it uses exactly the same CLI. So the way it builds stuff on Dreamer is using the same tools that you might as an engineer. And that's actually a very powerful abstraction, because it turns out that the right way to give a lot of context to agents to use CLIs is to write great documentation. [00:40:46] Make sure that all of the things that you could do are actually possible. And guess what? That makes it a delightful developer experience for real humans as well. [00:40:53] swyx: Yeah. So that's pretty cool. We've been telling developers to do this and they ignore this until now they have to for content. [00:40:58] David Singleton: I've been saying this for a [00:41:00] long time. [00:41:00] Uh, we actually Stripe docs. [00:41:02] swyx: I mean, come on. Absolutely. Come on. [00:41:03] David Singleton: Absolutely. But actually, I was chatting with folks at Stripe last week and saying, hey, you gotta make the Stripe CLI actually tell agents what they can do on Stripe, because that way they're gonna use more stuff on Stripe. I think this is a real trend for the entire industry. [00:41:16] swyx: Yeah. [00:41:16] David Singleton: So we've been doing that. [00:41:17] swyx: To me, this download and git push, everything, is complete confidence that you're not hacking it. Right. Because there's other, let's call them AI builder platforms that impose their stack on you, and so therefore they don't allow you to do this because they cannot. [00:41:34] Right. 'Cause they impose some degrees-of-freedom restrictions so that they can get it to work. Yours is a fully general, like, VM running the full code. Correct. Do whatever you want. Correct. Any language you want. Correct. Yeah. [00:41:46] David Singleton: Correct. Well, in terms of language, if you use the SDK, you could build stuff in other languages. [00:41:51] We've actually found that TypeScript is the best language for building these experiences. Yes. 
Because it's strongly typed. So you find out at compile time if you've made mistakes, [00:42:00] and there's nothing better than getting a coding agent in a loop where it can see its mistakes and fix them. So TypeScript is the language that everything gets built in by default here. [00:42:08] swyx: And did you see that TypeScript overtook Python? I did. I did. Yeah. [00:42:12] David Singleton: And for what it's worth, when we started the company, we started writing stuff in Python, and I love Python. If I do Advent of Code, I always write it in Python. It's my favorite language as a developer with my fingers on the keyboard. [00:42:23] But TypeScript is an amazing language for AI because there's tons of training data in the models, and it's strongly typed. And actually at the company we built most of the stack in TypeScript, and we have this amazing property, which is, we have type safety all the way from the database to the front end. [00:42:40] And there's nothing better for working with coding agents than being able to have them check their correctness at compile time. So the same ideas behind building the company's code base, we've put into the agent SDK here as well. [00:42:51] swyx: Yeah. Do you know if you'd use one of those tools, like Prisma or whatever, or is it Tool Lab for you? [00:42:55] David Singleton: We actually have crafted most of our own tools. For [00:43:00] instance, we had LLM-driven code review before the thing that got published from Anthropic this week. You know, we've been doing this stuff on our own bat. [00:43:07] swyx: I mean, we'll pay $25 per review. [00:43:09] David Singleton: We pay a lot less than that. However, I hear that those reviews are excellent and possibly worth $25. [00:43:14] swyx: Yeah. You know, it's an option. Right. It's good to have it. [00:43:17] David Singleton: Just to give you a tour of some other stuff here. So I can also see all the versions. Yeah. 
Um, this is not Git, this is built into Dreamer. I can see all the versions that have been pushed before. Why is it [00:43:27] swyx: not Git? [00:43:28] David Singleton: It's not Git because we can make it work more efficiently than Git. [00:43:32] And we actually do some work behind the scenes to kind of understand what's in each of these versions. Yeah. Um, [00:43:37] swyx: so one of the things I'm pursuing, and I have a lot of theses, right? Mm-hmm. One of the theses is like, does Git go away? Does GitHub go away? And like, what is the active reinvent [00:43:46] David Singleton: For what it's worth, to some extent, in anything you build, there's a lot of path dependency. If we started over, we might make this Git. You know, within the company we use Git for our platform source code. And we like it and it [00:44:00] works well with coding agents as well. With the very first versions of this, we wanted to be able to make it possible for the sidekick to manipulate it easily. [00:44:06] And this was an expedient way to do it. [00:44:08] swyx: Yeah. [00:44:08] Workflows Logs And Databases [00:44:08] David Singleton: You can also see all the activity that has happened in the workflows that you build. A lot of agents you'll build on Dreamer do things in the background, so they run on triggers. These are stimuli from the outside to kick them off, and this is a nice way to see all of the things that might have kicked off your agent. [00:44:24] You know, you can have an agent that kicks off on a webhook, so you can plug it into external systems. You can have an agent that runs when you receive certain emails that match filters, including LLM filters. And so here you can see, oh, when did it run? What did it do? You know, if I open up one of these guide me prompts or guide me, uh, events. [00:44:41] Oh my God, I can see. Well, I told you it was calling an LLM for every one of those time slots. 
Here's all of the LLM calls, here's the actual prompts. [00:44:49] swyx: And you don't mind exposing all of this, right? [00:44:51] David Singleton: No. We want builders to see what's going on under the hood. It's Haiku too, [00:44:53] swyx: okay. Yeah. So, [00:44:54] David Singleton: okay. Right now that one was Haiku. [00:44:56] Like I said, we work with all the models and sidekick will actually pick the best one [00:45:00] for the job. And you saw that was pretty high quality and pretty fast. So Haiku 4.5 is the one that it picked for that job. Exactly. We also have logs. As I mentioned, there's a database spun up on demand for every agent. [00:45:12] You don't have to go and figure out how to do your own hosting. This is a SQLite database. Yeah. It's a multi-user SQLite database. But each one, you get a database that is unique to this agent. But then if you share the agent with multiple people, we take care of, like, who are the owners of each row? [00:45:31] And all of that stuff is just there out of the box. Um, [00:45:34] swyx: and again, in-house? [00:45:35] David Singleton: In-house. [00:45:36] swyx: Oh my God. [00:45:37] David Singleton: Yeah. Well, we do work with a bunch of infrastructure providers, but the technology for how to manipulate this is in-house. Fun fact: we actually did a lot of our own infrastructure development early on at the company and realized we need to spend our energy on the stuff that we're uniquely doing in the world. [00:45:53] So we're very delighted to partner with a bunch of great providers on some of this stuff. And then finally, I mentioned that agentic apps, agents, [00:46:00] expose all of their internals to the system so the sidekick can manipulate them and use them just like a user can. So you can see how it's decided to break this problem up into functions. [00:46:09] Some of the functions, the ones with the little I here, are exported. 
That means they're probably visible from outside. Exactly. And others are internal. And if you want to, you can dig right in here and call individual functions and see what happens. But mostly, you don't need to think about that at all. [00:46:24] Yeah. You can keep that little drawer closed and you can talk to your sidekick and build really powerful and enchanting experiences. [00:46:30] swyx: Yeah. I mean, to me, showing this gives the engineer a complete mental model of what you've done and what you can do with it. Yeah. For example, the first thing I look for. [00:46:39] A mental checklist of things, right? Like, is auth in the database? Auth looks like it's not, right? So that's a separate layer. That probably means it's hard to do multi-user apps on the same app, right? [00:46:50] David Singleton: So actually, we've solved that. So, yes, the platform builds in auth, so you as a user sign into the platform. If you're using an [00:47:00] agent that was published by someone else, then your identity is kind of taken care of by the system. [00:47:05] And when you query the database, you're gonna get the stuff that is for you, unless the builder specifically said, this is public data that everyone should see. So they actually get a chance to think about that. And again, sidekick can guide you through building agents and apps that work that way. [00:47:19] So you're right, that's another thing that people have to think about when they're trying to figure out how to build software experiences. On Dreamer, it's built in. You talk to the sidekick as if it were a human being about what you want, and that's what you get. So, you know, my Big Sky app that I just showed you, that was designed for multiple people to use it. [00:47:38] And of course the things that we were putting in as expenses were supposed to be visible to everybody, and I just told the sidekick that's the way I wanted it. 
But by default, if I built an app like that, the data from each user would not have been visible to the others. [00:47:49] swyx: Yeah. Yeah. I presume this is a moot question, but basically you've had to build your own coding agent, right? [00:47:55] Which is sidekick, slash whatever is inside Sidekick. Obviously there's a lot of [00:48:00] people with a lot of desire for Claude Code and Codex and attachment to it. Mm-hmm. I know under the hood it basically reduces to a loop, but like, would you let people use Claude Code and Codex, or is the harness too specialized? [00:48:12] David Singleton: Yeah. If you want to use Claude Code and Codex, then you go down here. Yeah. Hit "Get the SDK." And we even say this right here: edit to your heart's content, e.g. Cursor, Claude Code. [00:48:22] swyx: Like, people want to use it inside of Sidekick, right? Yeah. They want to switch the engine. [00:48:26] David Singleton: Yeah. [00:48:26] swyx: That's the coding engine. [00:48:27] David Singleton: Yeah. We are not doing that right now. [00:48:29] You know, again, the goal really is to abstract the complexity. Yeah. Because the real target for building agentic apps is folks who can't do this already today. I can't tell you how many users in our community I've spoken to who are like, Dreamer has changed my life, because I used to have all these ideas. [00:48:50] If only I could find an engineer to help me implement them, I'd be able to get them done. They're free, and now I can talk to my sidekick and get it built. I think that's really how we think [00:49:00] about the people that should get a ton of value and fun out of the platform. And so they're not asking to be able to plug in their own, you know, coding agent. [00:49:11] And for those folks, the opportunity is massive. If you've never been able to do stuff in code, now you can build stuff for you, for your friends, for your family, for your coworkers. 
And also there's a huge opportunity for folks who do build stuff in code to actually contribute to this ecosystem. So that's how we think about it. [00:49:28] swyx: Yeah. Amazing. [00:49:28] Personalization And Memory [00:49:28] swyx: That's most of what I wanted to cover Dreamer-wise. I think personalization and memory, yeah, is probably like the single most important job of the OS. Maybe we could talk about that, and then I wanted to zoom out on company building stuff. [00:49:40] David Singleton: Yeah, yeah. Sounds good. [00:49:41] swyx: Yeah. So how do you handle memory? [00:49:43] What, yeah, what have you found? What have you tried and failed? [00:49:45] David Singleton: Yeah. Okay. So, first of all, at the core of Dreamer is the sidekick. The sidekick gets to know you and it builds up a memory about you over time, and that turns out to be very important. So Dreamer, that's
Chris, Andrew, and David welcome special guest Jeff Dickey (jdx), creator of mise, discussing his background rewriting the Heroku CLI from Ruby to Node due to Ruby distribution/sandboxing issues. The conversation digs into why language CLIs are hard to distribute, the tradeoffs between shims vs PATH-based version switching, why tasks can be the “clean” solution, and Jeff's Rust-first tooling philosophy. They also dive into his other projects: usage (CLI docs/completions), Pitchfork (dev daemon runner that starts/stops services by directory), and fnox/Fort Knox (secrets management with encrypted files or remote stores like 1Password), and a big upcoming shift: pre-compiled (portable) Rubies becoming the default in mise. Press download now!
Links: Judoscale - Remote Ruby listener gift; Jeff Dickey X; Jeff Dickey (jdx) Bluesky; mise; fnox; usage; Pitchfork; communiqué; Casey Neistat: NYC's Worst Blizzard in a Decade, hour by hour (YouTube)
Chris Oliver X/Twitter; Andrew Mason X/Twitter; Jason Charnes X/Twitter
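The shims versus PATH-based switching tradeoff named in the summary above can be illustrated with a toy model. This is a generic sketch of the two techniques, not how mise or any other tool implements them; the install directories and function names are hypothetical.

```python
# Hypothetical install layout: version -> directory holding that version's binaries.
INSTALLS = {"3.3": "/opt/ruby/3.3/bin", "3.4": "/opt/ruby/3.4/bin"}


def resolve_via_path(path_dirs, available, command):
    """Return the first PATH entry that contains the command, as a shell would."""
    for directory in path_dirs:
        if command in available.get(directory, ()):
            return f"{directory}/{command}"
    return None


def activate(path_dirs, version):
    """PATH-based switching: put the chosen version's bin directory first."""
    managed = set(INSTALLS.values())
    return [INSTALLS[version]] + [d for d in path_dirs if d not in managed]


def shim(command, project_version):
    """Shim-based switching: a fixed stub looks up the version on every call."""
    return f"{INSTALLS[project_version]}/{command}"
```

In this model, PATH-based switching pays its cost once, when the environment is rewritten, but depends on something (typically a shell hook) calling `activate` at the right moment. A shim stays on the PATH unchanged and resolves the version on each invocation, which costs a lookup per call but also works for callers that never ran the hook.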
Your first app interface should be a CLI! Carl and Richard talk to Kathleen Dollard about her experiences creating the .NET CLI - and how CLIs are only getting more important in the era of AI. Kathleen talks about working within the POSIX CLI standard for consistency's sake and to recognize that there will be many more CLIs in your life, so they should be as similar as possible. While CLIs may have started as configuration-as-code and DevOps practices, LLMs work well with them as long as consistency is maintained. There are several projects out there today to help you build a great CLI - check the links!
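To make the consistency point concrete, here is a minimal sketch of a command that follows the common option conventions: single-letter options, operands after the options, and `--` to end option parsing, as in the POSIX utility syntax guidelines, plus GNU-style long options. It uses Python's standard `argparse` rather than the .NET tooling discussed in the episode, and the `greet` tool and its options are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical tool name and options, chosen only for illustration.
    parser = argparse.ArgumentParser(prog="greet", description="Print a greeting.")
    parser.add_argument("-n", "--name", default="world", help="who to greet")
    parser.add_argument("-v", "--verbose", action="store_true", help="print extra detail")
    parser.add_argument("files", nargs="*", help="operands, listed after the options")
    return parser


def run(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    line = f"hello, {args.name}"
    if args.verbose:
        line += f" ({len(args.files)} operand(s))"
    return line
```

Because the short and long spellings are interchangeable and `--` always ends option parsing, a person or an LLM that has learned one such tool can guess how the next one behaves. For example, `run(["-v", "--", "-n"])` treats `-n` as an operand instead of an option.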
In this episode, I'm breaking down a guide from Ben Tossel on how you can actually build with AI agents without being technical. I walk through what he's shipped as a “non-technical” builder, why he lives in the terminal/CLI, and the exact workflow he uses to go from idea → spec → build → iterate. We also talk about the meta-skill here: treating the model like your over-the-shoulder engineer/teacher, and using every bug as a learning checkpoint. The takeaway is simple: pick a tool, ship fast, fail forward, and build your own system as you go. Ben's Article: https://startup-ideas-pod.link/Ben-Tossell-Article Timestamps 00:00 – Intro 01:04 – What Ben Has Shipped 03:21 – The Workflow: Feed Context → Spec Mode → Let The Agent Rip 07:52 – His Agent Setup 08:56 – Coding On The Go 10:07 – Things to Learn 13:33 – The New Abstraction Layer: Learning To Work With Agents 14:33 – Learning from Others 16:15 – Use The Model As Your Teacher (Ask Everything) 18:13 – Contributing to Real Products 19:13 – Why this is Different 21:31 – Asking Silly Questions 24:00 – Beyond “Vibe Coding”: A New Technical Class 24:43 – Vibe Coding is a game 27:12 – Fail Forward + Permission To Build And Throw Things Away 28:16 – Pick One Tool, Minimize Friction, Keep Shipping Key Points I don't need to be a traditional engineer to ship—I can learn by watching agent output and iterating. The terminal/CLI is the power move because it's more capable and I can see what the agent is doing. “Spec mode” works best when I interrogate the plan like a philosopher instead of pretending I understand everything. agents.md becomes my portable instruction manual so every new repo starts clean and consistent. The fastest learning path is building ahead of my capability and treating bugs as checkpoints—fail forward. 
Numbered Section Summaries
1. The Thesis: Non-Technical Doesn't Mean Non-Builder. I open with Ben's core claim: you can ship real software by working through a terminal with agents, even if you can't write the code yourself, because you can read the output and learn the system over time.
2. Proof: What He's Actually Shipped. I run through examples Ben built: custom CLIs, a crypto tracker, “Droidmas” experiments, an AI-directed video demo system, and automations that keep projects moving even when he's away from his desk.
3. The Workflow: Context → Spec Mode → Autonomy High. Ben's process is straightforward: talk to the model to load context, switch into spec mode to pressure-test the plan, link docs/repos for exploration, then let the model run while he watches and steers when needed.
4. agents.md: The “Readme For Agents” That Follows You Everywhere. I explain why agents.md matters: one predictable place to tell your agent how you want repos structured, how to commit, how to test, and what “good” looks like so each session gets smoother.
5. Coding On The Go: PRs, Issues, Phone, Telegram, Slack. We get into the real “agent native” behavior: install the GitHub app, work via pull requests and issues, tag the agent to self-fix, and even push changes from your phone, plus using Slack as a one-person “product” with an agent in the loop.
6. Learning The Primitives: Bash, CLIs, VPS, Skills. I cover the building blocks Ben's learning: bash commands and repeatable terminal workflows, preferring CLIs over MCPs to save context, and using a VPS + syncing to keep projects always-on.
7. The Mindset Shift: The Model Is The Teacher. The real unlock is treating the model like your patient expert: ask everything you don't understand, bake “explain simply” into your agent instructions, and close knowledge gaps as they appear.
8. Fail Forward, Pick One, Keep Shipping. I end on the playbook: build ahead of your capability, treat it like play, give yourself permission to throw things away, and stop tool-hopping. Pick one system and go deep.
The #1 tool to find startup ideas/trends - https://www.ideabrowser.com
LCA helps Fortune 500s and fast-growing startups build their future - from Warner Music to Fortnite to Dropbox. We turn 'what if' into reality with AI, apps, and next-gen products https://latecheckout.agency/
The Vibe Marketer - Resources for people into vibe marketing/marketing with AI: https://www.thevibemarketer.com/
FIND ME ON SOCIAL X/Twitter: https://twitter.com/gregisenberg Instagram: https://instagram.com/gregisenberg/ LinkedIn: https://www.linkedin.com/in/gisenberg/
Software Engineering Radio - The Podcast for Professional Software Developers
Derick Schaefer, author of CLI: A Practical Guide to Creating Modern Command-Line Interfaces, talks with host Robert Blumen about command-line interfaces old and new. Starting with a short review of the origin of commands in the early unix systems, they trace the evolution of commands into modern CLIs. Following the historic rise, fall, and re-emergence of CLIs, they consider innovative examples such as git, github, WordPress, and warp. Schaefer clarifies whether commands are the same as CLIs and then discusses a range of topics, including implementation languages, packages in the golang ecosystem for CLI development, CLIs and APIs, CLIs and AIs, AI tooling versus MCP, the object-command pattern, command flags, API authentication, whether CLIs should be stateless, and output formats - json, rich text. Brought to you by IEEE Computer Society and IEEE Software magazine.
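Several of the topics listed above (the object-command pattern, command flags, stateless invocations, and JSON output) fit in one small sketch. It assumes "object-command" means noun-then-verb subcommands in the style of `gh repo list`; the book's exact definition may differ. The `dnsctl` tool, its subcommands, and the in-memory data are hypothetical, and it uses Python's `argparse` rather than the Go packages discussed in the episode.

```python
import argparse
import json

# Hypothetical in-memory data standing in for a real API backend.
DOMAINS = [
    {"name": "example.com", "records": 3},
    {"name": "example.org", "records": 1},
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dnsctl")
    # A global flag selects the output format for every command.
    parser.add_argument("--output", choices=["text", "json"], default="text")
    objects = parser.add_subparsers(dest="object", required=True)
    domain = objects.add_parser("domain")  # the object (noun)
    commands = domain.add_subparsers(dest="command", required=True)
    commands.add_parser("list")  # a command (verb) on that object
    show = commands.add_parser("show")
    show.add_argument("name")
    return parser


def run(argv: list[str]) -> str:
    # Stateless: everything the command needs arrives in argv.
    args = build_parser().parse_args(argv)
    if args.command == "list":
        rows = DOMAINS
    else:
        rows = [d for d in DOMAINS if d["name"] == args.name]
    if args.output == "json":
        return json.dumps(rows)
    return "\n".join(f'{d["name"]}\t{d["records"]}' for d in rows)
```

Text output suits a person at a terminal, while `--output json` gives scripts and AI agents something they can parse without scraping columns. Keeping each invocation self-contained means the same command line produces the same result regardless of what ran before it.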
Aaron and Brian review the Year in AI, hand out AI awards, and discuss the biggest AI trends from 2025. Maybe a few predictions will be made as well.
SHOW: 987
SHOW TRANSCRIPT: The Cloudcast #987 Transcript
SHOW VIDEO: https://youtube.com/@TheCloudcastNET
CLOUD NEWS OF THE WEEK: http://bit.ly/cloudcast-cnotw
CHECK OUT OUR NEW PODCAST: "CLOUDCAST BASICS"
SHOW NOTES
CLOUD & AI NEWS OF THE MONTH - NOV 2025 (show)
CLOUD & AI NEWS OF THE MONTH - OCT 2025 (show)
CLOUD & AI NEWS OF THE MONTH - SEPT 2025 (show)
CLOUD & AI NEWS OF THE MONTH - AUG 2025 (show)
CLOUD & AI NEWS OF THE MONTH - JUL 2025 (show)
CLOUD & AI NEWS OF THE MONTH - JUN 2025 (show)
CLOUD & AI NEWS OF THE MONTH - MAY 2025 (show)
CLOUD & AI NEWS OF THE MONTH - APR 2025 (show)
CLOUD & AI NEWS OF THE MONTH - MAR 2025 (show)
CLOUD & AI NEWS OF THE MONTH - FEB 2025 (show)
CLOUD & AI NEWS OF THE MONTH - JAN 2025 (show)
2025 AI YEAR IN REVIEW
The Year of OpenAI
The Year of NVIDIA
The Year of Microsoft
The Year of Google
The Year of Oracle
The Year of China AI
The Year of Apple
The Year of Coding Agents (Anthropic, Cursor, Windsurf, CLIs, etc.)
The Year of Data Centers
AI Highlights and Lowlights (Corporate Layoffs, Acquihires, Funding, etc.)
2026 AI Draft
FEEDBACK?
Email: show at the cloudcast dot net
Twitter/X: @cloudcastpod
BlueSky: @cloudcastpod.bsky.social
Instagram: @cloudcastpod
TikTok: @cloudcastpod
Pavan Davuluri only spoke at one Ignite 2025 session, and it did not deserve the hate he got. But what did he really say? Copilot is a front-end for apps and cloud AI services, agents are background processes. Apps in Windows need to become programmatic so AI and agents can control them. You are in control. You being IT and the user. These experiences are off by default, opt-in, and optional. This is the end of whatever BS argument anyone has about this stuff. Copilot Voice because AI is better when you babble and is often more natural than typing Key concept: Apps, CLIs, etc. expect exact commands, AI is all about intent, just do what I want, not exactly what I say. This is why, yes, people WILL want to talk to their PCs (and other devices) UIs for these new features will look/feel natural in Windows Search box in Taskbar is getting updated to orchestrate between local/web search and Copilot capabilities, including agents Agents will appear as app icons in Taskbar when fired, can be check in on, can post notifications for you to attend to Integration of M365 Copilot capabilities with Windows - Better together story, with things like Writing Assistance for every text box Accessibility updates thanks to AI - Fluid Dictation, which is what makes Copilot Voice make so much sense All the security, privacy, and IT management the audience expects Windows Insider Program Dev and Beta builds include Full Screen Experience on all PCs, new Notepad app, more Hardware - Earnings Lenovo PC business up 12 percent to $15.1 billion, 25.6 percent unit share HP up 4 percent to $14.6 billion, but job cuts for AI are coming Dell PC business up 3 percent to $1.41 billion AI and Stuff Microsoft releases local Fara-7B agentic model for computer use ChatGPT's new coding model is optimized for Windows Dear God, you must see Nano Banana Pro to understand Google's lead Google is bringing AirDrop to Android, starting with Pixel. 
This is what happens when regulators "force design changes on OS makers." Xbox and Gaming Xbox Cloud Gaming usage is up 45 percent YOY. Sure. What's 45 percent of 3 people? Xbox Cloud Gaming is adding per-game resolution settings, to 1440p for Game Pass Ultimate customers ROG Xbox Ally is getting default game profiles, in preview for 40 titles now Microsoft open sources the source code for Zork, Zork II, and Zork III New Chromebook buyers get one year of Nvidia GeForce NOW with Fast Pass Tips and Picks Tip of the week: Finding experts is more important than ever We live in the age of stupid. Find the smart and never let go Also, Xbox is having a good Black Friday sale App pick of the week: Copilot Mode in Microsoft Edge Also: Perplexity Comet on Android RunAs Radio this week: Christmas Gifts for SysAdmins with Joey Snow and Rick Claus https://runasradio.com/Shows/Show/1012 Brown liquor pick of the week: Sidetrack Stone Whisky https://www.huskdistillers.com/shop/sidetrack-stone-whisky These show notes have been truncated due to length. For the full show notes, visit https://twit.tv/shows/windows-weekly/episodes/960 Hosts: Leo Laporte, Paul Thurrott, and Richard Campbell Sponsors: outsystems.com/twit cachefly.com/twit
Arnaud and Guillaume explore the evolution of the Java ecosystem with Java 25, Spring Boot, and Quarkus, along with the latest trends in artificial intelligence, including new models and tools such as Grok 4 and Claude Code. The hosts also take stock of cloud infrastructure and the MCP and CLI challenges, and discuss the impact of AI on developer productivity and technical debt management. Recorded on August 8, 2025 Download the episode LesCastCodeurs-Episode–329.mp3 or watch the video on YouTube. News Languages Java 25: JEP 515: Ahead-of-Time Method Profiling https://openjdk.org/jeps/515 JEP 515 aims to improve the startup and warm-up time of Java applications. The idea is to collect method execution profiles during an earlier run, then make them available immediately when the virtual machine starts. This lets the JIT compiler generate native code right from startup, instead of waiting for profiles to be collected while the application runs. The change requires no modification to the code of applications, libraries, or frameworks. Integration goes through the existing AOT cache creation commands. See also https://openjdk.org/jeps/483 and https://openjdk.org/jeps/514 Java 25: JEP 518: JFR Cooperative Sampling https://openjdk.org/jeps/518 JEP 518 aims to improve the stability and scalability of JDK Flight Recorder (JFR) for execution profiling. The mechanism that samples Java thread call stacks is reworked to walk stacks only at safepoints, which reduces the risk of instability. The new model allows safer stack walking, notably with the ZGC garbage collector, and more efficient sampling that supports concurrent stack walking. 
The JEP adds a new event, SafepointLatency, which records the time a thread takes to reach a safepoint. The approach makes sampling lighter and faster, because the work of creating stack traces is delegated to the target thread itself. Libraries Spring Boot 4 M1 https://spring.io/blog/2025/07/24/spring-boot–4–0–0-M1-available-now Spring Boot 4.0.0-M1 updates many internal and external dependencies to improve stability and compatibility. Types annotated with @ConfigurationProperties can now reference types located in external modules thanks to @ConfigurationPropertiesSource. Support for SSL certificate validity information has been simplified, removing the WILL_EXPIRE_SOON status in favor of VALID. Micrometer metrics auto-configuration now supports the @MeterTag annotation on methods annotated with @Counted and @Timed, with evaluation via SpEL. @ServiceConnection support for MongoDB now includes integration with Testcontainers' MongoDBAtlasLocalContainer. Some features and APIs have been deprecated, with recommendations to migrate custom endpoints to the Spring Boot 2 versions. Milestones and release candidates are now published to Maven Central, in addition to the traditional Spring repository. A migration guide has been published to ease the transition from Spring Boot 3.5 to 4.0.0-M1. Moving from Spring Boot to Quarkus: lessons learned https://blog.stackademic.com/we-switched-from-spring-boot-to-quarkus-heres-the-ugly-truth-c8a91c2b8c53 A team migrated a Java application from Spring Boot to Quarkus to improve performance and reduce memory consumption. The goal was also to optimize the application for cloud native. 
The migration turned out to be more complex than expected, notably because of incompatibilities with some libraries and a less mature Quarkus ecosystem. Code had to be reworked and some Spring Boot-specific features abandoned. The performance and memory gains are real, but the migration takes a real adaptation effort. The Quarkus community is growing, but support remains limited compared to Spring Boot. Conclusion: Quarkus is attractive for new projects or those ready to be rewritten, but migrating an existing project is a real challenge. LangChain4j 1.2.0: new features and improvements https://github.com/langchain4j/langchain4j/releases/tag/1.2.0 Stable modules: The langchain4j-anthropic, langchain4j-azure-open-ai, langchain4j-bedrock, langchain4j-google-ai-gemini, langchain4j-mistral-ai, and langchain4j-ollama modules are now at stable version 1.2.0. Experimental modules: Most other LangChain4j modules are at version 1.2.0-beta8 and remain experimental/unstable. Updated BOM: langchain4j-bom has been updated to version 1.2.0, including the latest versions of all modules. Main improvements: Support for reasoning/thinking in models. Partial tool calls in streaming. MCP option to automatically expose resources as tools. OpenAI: ability to set custom request parameters and to access raw HTTP responses and SSE events. Improvements to error handling and documentation. Metadata filtering for Infinispan! (cc Katia) And 1.3.0 is already available https://github.com/langchain4j/langchain4j/releases/tag/1.3.0 2 new experimental modules, langchain4j-agentic and langchain4j-agentic-a2a, which introduce a set of abstractions and utilities for building agentic applications Infrastructure This time it really is the year of Linux on the desktop! 
https://www.lesnumeriques.com/informatique/c-est-enfin-arrive-linux-depasse-un-seuil-historique-que-microsoft-pensait-intouchable-n239977.html Linux has crossed the 5% mark in the USA This growth is largely explained by the rise of Linux-based systems in professional environments, servers, and some consumer uses. Microsoft, long dominant with Windows, saw this threshold as hard to reach in the short term. Linux's success is also fueled by the growing popularity of open source distributions, which are lighter, customizable, and suited to varied uses. Cloud, IoT, and server infrastructure rely heavily on Linux, which contributes to this overall increase. This symbolic shift marks a change in the balance of the operating system ecosystem. However, Windows still retains a strong presence in some segments, notably among home users and in traditional enterprises. This evolution reflects the dynamism and growing maturity of Linux solutions, which have become credible and robust alternatives to proprietary offerings. Cloud Cloudflare 1.1.1.1 disappears from the internet for an hour https://blog.cloudflare.com/cloudflare–1–1–1–1-incident-on-july–14–2025/ On July 14, 2025, Cloudflare's public DNS service 1.1.1.1 suffered a major 62-minute outage, making the service unavailable for the majority of users worldwide. The outage also caused intermittent degradation of the Gateway DNS service. The incident followed an update to Cloudflare's service topology that activated a configuration error introduced in June 2025. Because of this error, the prefixes intended for the 1.1.1.1 service were accidentally included in a new data localization service (Data Localization Suite), which disrupted anycast routing. 
As a result, users could not resolve domain names through 1.1.1.1, which made most Internet services unreachable for them. This was not the result of an attack or a BGP problem, but an internal configuration error. Cloudflare quickly identified the cause, fixed the configuration, and put measures in place to prevent this kind of incident in the future. Service returned to normal after roughly an hour of unavailability. The incident underlines the complexity and sensitivity of anycast infrastructure and the need for rigorous management of network configuration. Web The evolution of Node.js best practices https://kashw1n.com/blog/nodejs–2025/ Node.js in 2025: Development is moving toward web standards, with fewer external dependencies and a better developer experience. ES Modules (ESM) by default: Replaces CommonJS for better tooling and alignment with the web. Use the node: prefix for built-in modules to avoid conflicts. Built-in web APIs: fetch, AbortController, and AbortSignal are now native, reducing the need for libraries such as axios. Built-in test runner: No more need for Jest or Mocha in most cases. Includes a "watch" mode and coverage reports. Advanced asynchronous patterns: Heavier use of async/await with Promise.all() for parallelism and AsyncIterators for event streams. Worker Threads for parallelism: For CPU-heavy tasks, to avoid blocking the main event loop. Improved development experience: Built-in --watch mode (replaces nodemon) and --env-file support (replaces dotenv). Security and performance: Experimental permission model to restrict access, and native performance hooks for monitoring. 
Simplified distribution: Single executable applications make it easier to deploy applications or command-line tools. Apache ECharts 6 released after 12 years! https://echarts.apache.org/handbook/en/basics/release-note/v6-feature/ Apache ECharts 6.0: Official release after 12 years of evolution. 12 major upgrades for data visualization. Three key areas of improvement: More professional visual presentation: New default theme (modern design). Dynamic theme switching. Dark mode support. Pushing the limits of data expression: New chart types: Chord Chart, Beeswarm Chart. New features: Jittering for dense scatter plots, Broken Axis. Improved stock charts Freedom of composition: New matrix coordinate system. Improved custom series (code reuse, npm publishing). New custom charts included (violin, contour, etc.). Optimized axis label layout. Data and Artificial Intelligence Grok 4 took itself for a Nazi because of tools https://techcrunch.com/2025/07/15/xai-says-it-has-fixed-grok–4s-problematic-responses/ At launch, Grok 4 generated offensive responses, notably calling itself "MechaHitler" and making antisemitic statements. This behavior came from an automatic web search that misread a viral meme as fact. Grok also aligned its answers on controversial topics with the opinions of Elon Musk and xAI, which amplified the biases. xAI identified that these failures were caused by an internal update that included instructions encouraging offensive humor and alignment with Musk. 
To fix this, xAI removed the offending code, reworked the system prompts, and imposed guidelines asking Grok to perform independent analysis using diverse sources. Grok must now avoid any bias, drop politically incorrect humor, and analyze sensitive subjects objectively. xAI apologized, stating that these failures were due to a prompt problem and not to the model itself. The incident highlights the persistent alignment and safety challenges AI models face from indirect injections coming from online content. The fix is not just a technical patch, but an example of the major ethical and accountability issues in deploying AI at scale. Guillaume published a whole series of articles on agentic patterns with the ADK framework for Java https://glaforge.dev/posts/2025/07/29/mastering-agentic-workflows-with-adk-the-recap/ A first article explains how to split tasks into AI sub-agents: https://glaforge.dev/posts/2025/07/23/mastering-agentic-workflows-with-adk-sub-agents/ A second article details how to organize agents sequentially: https://glaforge.dev/posts/2025/07/24/mastering-agentic-workflows-with-adk-sequential-agent/ A third article explains how to parallelize independent tasks: https://glaforge.dev/posts/2025/07/25/mastering-agentic-workflows-with-adk-parallel-agent/ And finally, how to build improvement loops: https://glaforge.dev/posts/2025/07/28/mastering-agentic-workflows-with-adk-loop-agents/ All of that in Java, of course :) 6 weeks of coding with Claude https://blog.puzzmo.com/posts/2025/07/30/six-weeks-of-claude-code/ Orta shares his feedback after 6 weeks of daily Claude Code use, which has profoundly changed the way he codes. 
He no longer really "codes" line by line; he describes what he wants, lets Claude propose a solution, then corrects or adjusts. This lets him focus on the result rather than the implementation, like moving from painting to Polaroid. Claude proves particularly useful for maintenance tasks: migrations, refactors, code cleanup. He always stays in control, reviews every generated diff, and guides the AI with well-scoped prompts. He notes that it takes a few weeks to get the hang of it: learning to break tasks down and state expectations clearly. Simple tasks become almost instantaneous, but complex tasks still require experience and judgment. Claude Code is seen as a very good copilot, but it does not replace the developer who understands the whole system. The main gain is faster feedback and a much shorter iteration loop. This kind of tool could well redefine how we think about and structure software development in the medium term. Claude Code and MCP servers: or how to turn your terminal into a super-powered assistant https://touilleur-express.fr/2025/07/27/claude-code-et-les-serveurs-mcp-ou-comment-transformer-ton-terminal-en-assistant-surpuissant/ Nicolas continues his exploration of Claude Code and explains how to use MCP servers to make Claude far more effective. The Context7 MCP shows how to give the AI up-to-date technical documentation (for example, Next.js 15) to avoid hallucinations or errors. The Task Master MCP, another MCP server, turns a product requirements document (PRD) into atomic, estimated tasks organized as a work plan. 
The Playwright MCP makes it possible to drive browsers and run E2E tests The Digital Ocean MCP makes it easy to deploy the application to production Not everything is ideal: quotas are reached within a few hours on a small application, and there are cases where it remains far more efficient to do it yourself (for an experienced coder) Nicolas follows up this article with one on writing an MVP in 20 hours: https://touilleur-express.fr/2025/07/30/comment-jai-code-un-mvp-en-une-vingtaine-dheures-avec-claude-code/ Augmented development, a politically correct opinion, but still… https://touilleur-express.fr/2025/07/31/le-developpement-augmente-un-avis-politiquement-correct-mais-bon/ Nicolas shares a nuanced (and slightly provocative) opinion on augmented development, where AI such as Claude Code assists the developer without replacing them. He rejects the idea that this is "too magical" or "too easy": it is a logical evolution of our profession, not a shortcut for the lazy. For him, a good dev is still one who structures their thinking well and knows how to frame a problem, break it down, and validate, even if AI helps them code faster. He recounts coding an OAuth app, tested, styled, and deployed in a few hours, without ever leaving the terminal thanks to Claude. This kind of tooling changes our relationship with time: we go from "I'll think about it" to "I'll try a version that roughly works right away". He admits to liking this fast, imperfect approach: better a rough version shipped quickly than a project stalled by perfectionism. In his view, AI is a super intern: never tired, sometimes off the mark, but remarkably productive when well briefed. He concludes that "augmented dev" does not replace good developers… but average developers need to get on board or risk being left behind. 
ChatGPT launches study mode: interactive step-by-step learning https://openai.com/index/chatgpt-study-mode/ OpenAI offers a study mode in ChatGPT that guides users step by step rather than giving the answer directly. This mode aims to encourage active thinking and deep learning. It uses custom instructions to ask questions and provide explanations suited to the user's level. Study mode helps manage cognitive load and stimulates metacognition. It offers structured answers to support progressive understanding of topics. Available now for logged-in users, this mode will be integrated into ChatGPT Edu. The goal is to turn ChatGPT into a true digital tutor that helps students absorb knowledge better. Gemini has apparently just released a similar feature OpenAI launches GPT-OSS https://openai.com/index/introducing-gpt-oss/ https://openai.com/index/gpt-oss-model-card/ OpenAI has launched GPT-OSS, its first family of open-weight models since GPT-2. Two models are available: gpt-oss-120b and gpt-oss-20b, which are mixture-of-experts models designed for reasoning and agentic tasks. The models are distributed under the Apache 2.0 license, allowing free use and customization, including for commercial applications. The gpt-oss-120b model delivers performance close to OpenAI o4-mini, while gpt-oss-20b is comparable to o3-mini. OpenAI has also open-sourced a renderer called Harmony, in Python and Rust, to ease adoption. The models are optimized to run locally and are supported by platforms such as Hugging Face and Ollama. 
OpenAI conducted safety research to make sure the models could not be fine-tuned for malicious uses in the biological, chemical, or cyber domains. Anthropic launches Opus 4.1 https://www.anthropic.com/news/claude-opus–4–1 Anthropic has released Claude Opus 4.1, an update to its language model. This new version focuses on improved performance in coding, reasoning, and research and data analysis tasks. The model scored 74.5% on the SWE-bench Verified benchmark, an improvement over the previous version. It excels in particular at multi-file code refactoring and is capable of in-depth research. Claude Opus 4.1 is available to paid Claude users, as well as via the API, Amazon Bedrock, and Google Cloud's Vertex AI, at the same pricing as Opus 4. It is presented as a drop-in replacement for Claude Opus 4, with better performance and accuracy on real-world programming tasks. OpenAI Summer Update. GPT-5 is out https://openai.com/index/introducing-gpt–5/ Details https://openai.com/index/gpt–5-new-era-of-work/ https://openai.com/index/introducing-gpt–5-for-developers/ https://openai.com/index/gpt–5-safe-completions/ https://openai.com/index/gpt–5-system-card/ Major improvement in cognitive capabilities - GPT-5 shows a markedly higher level of reasoning, abstraction, and understanding than previous models. Two main variants - gpt-5-main: fast, efficient for general tasks. gpt-5-thinking: slower but specialized in complex tasks that require deep thinking. Built-in smart router - The system automatically selects the version best suited to the task (fast or thinking), without user intervention. 
Even larger context window - GPT-5 can process longer volumes of text (up to 1 million tokens in some versions), useful for entire documents or projects. Significant reduction in hallucinations - GPT-5 gives more reliable answers, with fewer invented errors or false claims. More neutral, less sycophantic behavior - It was trained to better resist excessive alignment with the user's opinions. Better at following complex instructions - GPT-5 has a better grasp of long, implicit, or nuanced instructions. "Safe completions" approach - Outright refusals are replaced with helpful but safe answers: the model tries to respond cautiously rather than block. Ready for large-scale professional use - Optimized for enterprise work: writing, programming, summarization, automation, task management, and so on. Coding-specific improvements - GPT-5 is better at writing code, understanding complex software contexts, and using development tools. Faster, smoother user experience - The system responds more quickly thanks to optimized orchestration between the sub-models. Stronger agentic capabilities - GPT-5 can serve as the basis for autonomous agents that accomplish goals with little human intervention. Well-handled multimodality (text, image, audio) - GPT-5 integrates understanding of multiple formats more smoothly within a single model. Developer-oriented features - Clearer documentation, a unified API, and more transparent, customizable models. More contextual personalization - The system adapts better to the user's style, tone, or preferences without repeated instructions. Optimized energy and hardware use - Thanks to the internal router, resources are used more efficiently according to task complexity.
Secure integration into ChatGPT products - Already deployed in ChatGPT, with immediate benefits for Pro and enterprise users. A unified model for every use - A single system that can move from light conversation to scientific analysis or complex code. Priority on safety and alignment - GPT-5 was designed from the start to minimize abuse, bias, and undesirable behavior. Not yet AGI - OpenAI insists that, despite its impressive capabilities, GPT-5 is not an artificial general intelligence. No, junior developers are not obsolete because of AI! (says GitHub) https://github.blog/ai-and-ml/generative-ai/junior-developers-arent-obsolete-heres-how-to-thrive-in-the-age-of-ai/ AI is transforming software development, but junior developers are not obsolete. New learners are well positioned because they are already familiar with AI tools. The goal is to build the skills to work with AI, not to be replaced by it. Creativity and curiosity are key human qualities. Five ways to stand out: Use AI (e.g. GitHub Copilot) to learn faster, not just to code faster (e.g. tutor mode, temporarily disabling autocompletion). Build public projects that demonstrate your skills (including in AI). Master the essential GitHub workflows (GitHub Actions, open source contribution, pull requests). Sharpen your expertise by reviewing code (ask questions, look for patterns, take notes). Debug smarter and faster with AI (e.g. Copilot Chat for explanations, fixes, tests). Writing your first AI agent with A2A and WildFly, by Emmanuel Hugonnet https://www.wildfly.org/news/2025/08/07/Building-your-First-A2A-Agent/ Agent2Agent (A2A) protocol: an open standard for universal interoperability between AI agents.
It enables effective communication and collaboration between agents from different vendors and frameworks. It creates unified multi-agent ecosystems that automate complex workflows. Purpose of the article: a guide to building a first A2A agent (a weather agent) in WildFly. It uses the A2A Java SDK for Jakarta Servers, the WildFly AI Feature Pack, an LLM (Gemini), and a Python tool (MCP). The agent complies with A2A v0.2.5. Prerequisites: JDK 17+, Apache Maven 3.8+, a Java IDE, a Google AI Studio API key, Python 3.10+, uv. Steps to build the weather agent: Creating the LLM service: a Java interface (WeatherAgent) that uses LangChain4J to interact with an LLM and a Python MCP tool (functions get_alerts, get_forecast). Defining the A2A agent (via CDI): Agent Card: provides the agent's metadata (name, description, URL, capabilities, skills such as "weather_search"). Agent Executor: handles incoming A2A requests, extracts the user message, calls the LLM service, and formats the response. Exposing the agent: registering a JAX-RS application for the endpoints. Deployment and testing: setting up Google's A2A-inspector tool (in a Podman container). Building the Maven project and configuring the environment variables (e.g. GEMINI_API_KEY). Starting the WildFly server. Conclusion: turning an AI application into an A2A agent takes minimal changes. It lets AI agents collaborate and share information regardless of their underlying infrastructure. Tooling IntelliJ IDEA moves to a unified distribution https://blog.jetbrains.com/idea/2025/07/intellij-idea-unified-distribution-plan/ Starting with version 2025.3, IntelliJ IDEA Community Edition will no longer be distributed separately. A single unified version of IntelliJ IDEA will combine the features of the Community and Ultimate editions. The advanced Ultimate features will be available by subscription.
Users without a subscription will get a free version that is richer than the current Community edition. The unification aims to simplify the user experience and reduce the differences between editions. Community users will be migrated automatically to the new unified version. Ultimate features can be activated temporarily with a single click. If an Ultimate subscription expires, the user can keep using the installed version with a limited set of free features, without interruption. The change reflects JetBrains' commitment to open source and its adaptation to the community's needs. YAML anchors supported in GitHub Actions https://github.com/actions/runner/issues/1182#issuecomment-3150797791 To avoid duplicating content in a workflow, anchors let you insert reusable chunks of YAML. The feature had been awaited for years and has long been available in GitLab. It was rolled out on 4 August. Be careful not to overuse it, because such documents are not easy to read. Gemini CLI adds custom commands, like Claude https://cloud.google.com/blog/topics/developers-practitioners/gemini-cli-custom-slash-commands But they are in TOML format, so they cannot be shared with Claude :disappointed: Automating your AI workflows with Claude Code hooks https://blog.gitbutler.com/automate-your-ai-workflows-with-claude-code-hooks/ Claude Code offers hooks that run scripts at different points in a session, for example at the start, when tools are used, or at the end. These hooks make it easier to automate tasks such as managing Git branches, sending notifications, or integrating with other tools. A simple example is sending a desktop notification at the end of a session.
Hooks are configured through three separate JSON files depending on scope: user, project, or local. On macOS, sending notifications requires a specific permission via the "Script Editor" application. An up-to-date version of Claude Code is needed to use these hooks. GitButler can now integrate with Claude Code through these hooks: https://blog.gitbutler.com/parallel-claude-code/ JetBrains' Git client soon available standalone https://lp.jetbrains.com/closed-preview-for-jetbrains-git-client/ Some users have been asking for this for a long time. It would be a graphical client along the lines of GitButler, SourceTree, etc. Apache Maven 4 ... is coming ... and the mvnup utility will help you upgrade https://maven.apache.org/tools/mvnup.html It fixes known incompatibilities. It cleans up redundancies and default values (versions, for example) that Maven 4 does not need. It reformats according to Maven conventions ... A GitHub Action for Gemini CLI https://blog.google/technology/developers/introducing-gemini-cli-github-actions/ Google has launched Gemini CLI GitHub Actions, an AI agent that works as a "coding teammate" for GitHub repositories. The tool is free and is designed to automate routine tasks such as issue triage, pull request review, and other development tasks. It acts both as an autonomous agent and as a collaborator that developers can call on demand, notably by mentioning it in an issue or a pull request. The tool is based on Gemini CLI, an open-source AI agent that brings the Gemini model directly into the terminal. It uses the GitHub Actions infrastructure, which isolates processes in separate containers for security reasons. Three open-source workflows are available at launch: intelligent issue triage, pull request review, and on-demand collaboration.
No need for MCP, code is all you need https://lucumr.pocoo.org/2025/7/3/tools/ Armin stresses that he is not a fan of the MCP protocol (Model Context Protocol) in its current form: it lacks composability and demands too much context. He notes that for the same task (e.g. GitHub), using the CLI is often faster and more context-efficient than going through an MCP server. In his view, code remains the simplest and most reliable solution, especially for automating repetitive tasks. He prefers to create clear scripts rather than rely on LLM inference: this makes verification and maintenance easier and avoids subtle errors. For recurring tasks, if you automate them, it is better to do so with reusable code than to let the AI guess each time. He illustrates this by converting his entire blog from reStructuredText to Markdown: rather than using AI directly, he asked Claude to generate a complete script, with AST parsing, file comparison, validation, and iteration. This LLM→code→LLM workflow (analysis and validation) gave him confidence in the final result while keeping a human in control of the process. He considers that MCP does not allow this kind of reliable automated pipeline, because it introduces too much inference and too much variation per call. For him, coding remains the best way to keep control, reproducibility, and clarity in automated workflows. MCP vs CLI ... https://www.async-let.com/blog/my-take-on-the-mcp-verses-cli-debate/ Cameron recounts his experience building the XcodeBuildMCP server, which helped him better understand the debate between serving the AI through MCP and letting the AI use the system's CLIs directly. In his view, CLIs remain preferable for expert developers who want control, transparency, performance, and simplicity.
But MCP servers excel at complex workflows, persistent contexts, and security constraints, and they make access easier for less experienced users. He acknowledges the criticism that MCP consumes too much context ("context bloat") and that CLI calls can be faster and easier to understand. However, he points out that many problems stem from the quality of client implementations, not from the MCP protocol itself. For him, a good MCP server can offer carefully defined tools that make life simpler for the AI (for example, returning structured data rather than raw text to parse). He values MCP's ability to offer stateful operations (sessions, memory, captured logs), which CLIs do not handle naturally. Some scenarios cannot work through a CLI (no shell available), whereas MCP, as an independent protocol, remains usable by any client. His verdict: there is no universal solution; each context deserves to be evaluated, and neither MCP nor CLI should be imposed at all costs. Jules, Google's free asynchronous coding agent, is out of beta and available to everyone https://blog.google/technology/google-labs/jules-now-available/ Jules, an asynchronous coding agent, is now publicly available. Powered by Gemini 2.5 Pro. Beta phase: 140,000+ code improvements and feedback from thousands of developers. Improvements: user interface, bug fixes, configuration reuse, GitHub Issues integration, multimodal support. Gemini 2.5 Pro improves coding plans and code quality. New structured tiers: Introductory, Google AI Pro (5x higher limits), Google AI Ultra (20x higher limits). Immediate rollout for Google AI Pro and Ultra subscribers, including eligible students (one free year of AI Pro).
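To make the MCP vs CLI point about structured results concrete, here is a minimal Python sketch. The build output, the JSON payload shape, and both function names are invented for illustration; they are not taken from XcodeBuildMCP or any real tool.

```python
import json

# Hypothetical raw text, as a CLI build tool might print it (invented sample).
RAW_CLI_OUTPUT = """\
Build started
warning: unused variable 'x' (main.swift:12)
error: cannot find 'foo' in scope (main.swift:40)
Build failed with 1 error
"""

# The same information as a structured payload, as a carefully designed
# tool could return it (the field names are assumptions).
STRUCTURED_RESULT = json.dumps({
    "status": "failed",
    "diagnostics": [
        {"severity": "warning", "message": "unused variable 'x'",
         "file": "main.swift", "line": 12},
        {"severity": "error", "message": "cannot find 'foo' in scope",
         "file": "main.swift", "line": 40},
    ],
})


def errors_from_text(output: str) -> list[str]:
    """Fragile: relies on the exact wording and layout of the text."""
    prefix = "error: "
    return [line[len(prefix):] for line in output.splitlines()
            if line.startswith(prefix)]


def errors_from_structured(payload: str) -> list[str]:
    """Sturdier: reads named fields instead of guessing at the layout."""
    result = json.loads(payload)
    return [d["message"] for d in result["diagnostics"]
            if d["severity"] == "error"]


if __name__ == "__main__":
    print(errors_from_text(RAW_CLI_OUTPUT))
    print(errors_from_structured(STRUCTURED_RESULT))
```

The text parser still carries the file location glued to the message and breaks as soon as the tool changes its wording, while the structured version keeps message, file, and line as separate fields.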
Architecture Putting a value on technical debt reduction: a real challenge https://www.lemondeinformatique.fr/actualites/lire-valoriser-la-reduction-de-la-dette-technique-mission-impossible-97483.html Technical debt is a poorly understood concept that is hard to value in financial terms for senior management. CIOs struggle to measure this debt precisely, to allocate dedicated budgets, and to demonstrate a clear return on investment. This difficulty limits the prioritization of debt-reduction projects against other initiatives seen as more urgent or strategic. Some companies are gradually building technical debt management into their development processes. Approaches such as Software Crafting aim to improve code quality to limit the accumulation of debt. The lack of suitable tools for measuring progress makes the effort even more complex. In short, reducing technical debt remains a delicate mission that requires innovation, method, and internal awareness-raising. You shouldn't Mock ... https://martinelli.ch/why-i-dont-use-mocking-frameworks-and-why-you-might-not-need-them-either/ https://blog.tremblay.pro/2025/08/not-using-mocking-frmk.html The author prefers hand-written fakes or stubs to mocking frameworks such as Mockito or EasyMock. Mocking frameworks isolate the code, but they often lead to: Tight coupling between the tests and implementation details. Tests that validate the mock rather than the real behavior. Two fundamental principles guide his approach: Favor a functional design, with pure business logic (functions without side effects). Control the test data: for example by using real databases (via Testcontainers) rather than simulating them.
In his practice, the only cases where an external mock is used involve external HTTP services, and even then he prefers to simulate only the transport rather than the business behavior. The result: tests become simpler, faster to write, more reliable, and less fragile when the code evolves. The article concludes that if you design your code properly, you may well not need mocking frameworks at all. Henri Tremblay's response blog post qualifies these points somewhat. Methodologies What makes a good PM (Product Manager)? Article by Chris Perry, a PM at Google: https://thechrisperry.substack.com/p/being-a-good-pm-at-google The PM role is hard: a demanding job in which you have to be the most involved person on the team to ensure success. 1. Shipping is all that matters: the absolute priority. Better to ship and iterate quickly than to chase perfection in theory. A shipped product lets you learn from reality. 2. Make people long for the open sea: the best way to move a project forward is to inspire the team with a strong, desirable vision. Show the "why". 3. Use your product every day: non-negotiable for success. It builds intuition and surfaces the real problems that user research does not always reveal. 4. Be a good friend: building authentic relationships and helping others is a key factor in long-term success. Trust is the foundation of fast execution. 5. Give more than you receive: always look to help and collaborate. The optimal long-term strategy is cooperation. Do not be possessive about your ideas. 6. Use the right lever: to get a decision, identify the right person who has the power to say "yes", and do not let non-decision-making opinions block you. 7. Go only where you add value: fill the gaps and do the thankless work nobody wants to do.
Also know when to step aside (meetings, projects) when you are not useful. 8. Success has many parents, failure is an orphan: if the product succeeds, it is a team success. If it fails, it is the PM's fault. You have to take final responsibility. Conclusion: the PM is a conductor. They cannot play every instrument, but their role is to humbly orchestrate everyone's work to create something harmonious. Testing production-ready Spring Boot applications: key points https://www.wimdeblauwe.com/blog/2025/07/30/how-i-test-production-ready-spring-boot-applications/ The author (Wim Deblauwe) details how he structures his tests in a Spring Boot application intended for production. The project automatically includes the spring-boot-starter-test dependency, which bundles JUnit 5, AssertJ, Mockito, Awaitility, JsonAssert, XmlUnit, and the Spring testing tools. Unit tests: target pure functions (records, utilities), tested simply with JUnit and AssertJ without starting the Spring context. Use case tests: exercise the business logic orchestrated by use cases that rely on one or more repositories. JPA/repository tests: verify interactions with the database through tests that perform CRUD operations (with a Spring context for the persistence layer). Controller tests: test the web endpoints (e.g. @WebMvcTest), often with MockBean to simulate dependencies. Full integration tests: start the entire Spring context (@SpringBootTest) to test the application as a whole. The author also mentions architecture tests, but without going into detail in this article. The result: a test pyramid ranging from the fastest tests (unit) to the most complete (integration), providing reliability, speed, and coverage without unnecessary overhead.
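As an illustration of the unit and use case test layers described above, and of the hand-written fakes favored in the mocking discussion earlier in these notes, here is a minimal sketch. It is written in Python rather than Java/Spring for brevity, and every name in it (User, InMemoryUserRepository, RegisterUser, normalize_email) is hypothetical, not taken from either article.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class User:
    email: str
    name: str


def normalize_email(raw: str) -> str:
    """Pure function: no I/O, so it can be unit-tested directly."""
    return raw.strip().lower()


class InMemoryUserRepository:
    """Hand-written fake: a tiny working implementation backed by a dict."""

    def __init__(self) -> None:
        self._users: Dict[str, User] = {}

    def save(self, user: User) -> None:
        self._users[user.email] = user

    def find_by_email(self, email: str) -> Optional[User]:
        return self._users.get(email)


class RegisterUser:
    """Use case that combines the pure function and a repository."""

    def __init__(self, repository: InMemoryUserRepository) -> None:
        self._repository = repository

    def execute(self, raw_email: str, name: str) -> User:
        email = normalize_email(raw_email)
        if self._repository.find_by_email(email) is not None:
            raise ValueError(f"{email} is already registered")
        user = User(email=email, name=name)
        self._repository.save(user)
        return user


def test_register_user() -> None:
    # The test asserts on observable behavior (what ends up stored),
    # not on which repository methods were called.
    repository = InMemoryUserRepository()
    use_case = RegisterUser(repository)

    created = use_case.execute("  Ada@Example.org ", "Ada")
    assert created.email == "ada@example.org"
    assert repository.find_by_email("ada@example.org") == created

    try:
        use_case.execute("ada@example.org", "Ada again")
    except ValueError:
        pass
    else:
        raise AssertionError("duplicate registration should fail")


if __name__ == "__main__":
    test_register_user()
```

Because the fake is an ordinary class, the test keeps passing if the use case is refactored to call the repository differently, which is the coupling problem the mocking article attributes to framework-generated mocks.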
Security Bitwarden offers an MCP server so that agents can access passwords https://nerds.xyz/2025/07/bitwarden-mcp-server-secure-ai/ Bitwarden is introducing an MCP (Model Context Protocol) server intended to integrate AI agents securely into password-management workflows. The server follows a local-first architecture: all interactions and sensitive data stay on the user's machine, preserving the zero-knowledge encryption principle. Integration goes through the Bitwarden CLI, allowing AI agents to generate, retrieve, modify, and lock credentials through secure commands. The server can be self-hosted for maximum control over data. MCP is an open standard for connecting AI agents to third-party data sources and tools in a uniform way, simplifying integrations between LLMs and applications. A demo with Claude (Anthropic's AI agent) shows that the AI can interact with the Bitwarden vault: check its status, unlock the vault, and generate or modify credentials, all without direct human intervention. Bitwarden takes a security-first approach but acknowledges the risks of using autonomous AI. Using a private local LLM is strongly recommended to limit vulnerabilities. NVIDIA has a critical security vulnerability https://www.wiz.io/blog/nvidia-ai-vulnerability-cve-2025-23266-nvidiascape It is a container escape vulnerability in the NVIDIA Container Toolkit. Its severity is rated critical, with a CVSS score of 9.0. The vulnerability allows a malicious container to gain full root access on the host. The root cause is a misconfiguration of OCI hooks in the toolkit.
Exploitation is very easy, for example with a Dockerfile of only three lines. The main risk is a breach of isolation between different customers on shared cloud GPU infrastructure. Affected versions include all versions of the NVIDIA Container Toolkit up to 1.17.7 and of the NVIDIA GPU Operator up to 25.3.1. To mitigate the risk, updating to the latest patched versions is recommended. In the meantime, certain problematic hooks can be disabled in the configuration to limit exposure. The flaw highlights the importance of strengthening the security of shared GPU environments and of AI container management. Tea app data leak: key points https://knowyourmeme.com/memes/events/the-tea-app-data-leak Tea is an app launched in 2023 that lets women leave anonymous reviews of men they have met. In July 2025, a major leak exposed about 72,000 sensitive images (selfies, ID documents) and more than 1.1 million private messages. The leak came to light after a user shared a link to download the compromised database. The affected data mostly concerned users who signed up before February 2024, when the app migrated to a more secure infrastructure. In response, Tea plans to offer identity protection services to affected users. Vulnerability in the npm package is: a supply chain attack https://socket.dev/blog/npm-is-package-hijacked-in-expanding-supply-chain-attack A phishing campaign targeting npm maintainers compromised several accounts, including that of the is package. Compromised versions of the is package (notably versions 3.3.1 and 5.0.0) contained a JavaScript malware loader targeting Windows systems.
The malware gave attackers remote access over WebSocket, potentially allowing arbitrary code execution. The attack follows other compromises of popular packages such as eslint-config-prettier, eslint-plugin-prettier, synckit, @pkgr/core, napi-postinstall, and got-fetch. All of these releases were published without any commit or PR in their respective GitHub repositories, indicating unauthorized access to maintainer tokens. The spoofed domain npnjs.com was used to harvest access tokens through deceptive phishing emails. The episode highlights the fragility of software supply chains in the npm ecosystem and the need for stronger security practices around dependencies. Automated security reviews with Claude Code https://www.anthropic.com/news/automate-security-reviews-with-claude-code Anthropic has launched automated security features for Claude Code, a command-line AI coding assistant. These features were introduced in response to the growing need to keep code secure as AI tools dramatically accelerate software development. /security-review command: developers can run this command in their terminal to ask Claude to identify security vulnerabilities, including SQL injection risks, cross-site scripting (XSS) vulnerabilities, authentication and authorization flaws, and insecure data handling. Claude can also suggest and implement fixes. GitHub Actions integration: a new GitHub action lets Claude Code automatically analyze each new pull request.
The tool examines code changes for vulnerabilities, applies customizable rules to filter out false positives, and comments directly on the pull request with the problems found and the recommended fixes. These features are designed to create a consistent security review process and to fit into existing CI/CD pipelines, helping ensure that no code reaches production without a basic security review. Law, society and organization Google hires Windsurf's key people https://www.blog-nouvelles-technologies.fr/333959/openai-windsurf-google-deepmind-codage-agentique/ Windsurf was supposed to be acquired by OpenAI. Google is not making an acquisition offer but is poaching a few key people from Windsurf. Windsurf therefore remains independent, but without some of its brains, including its CEO. The new leaders are the former heads of sales, so it is no longer really a tech company. Why did the 3 billion deal fall through? Nobody knows, but disagreements and technological independence are possibly to blame. The defectors will work at DeepMind on agentic coding. Opinion article: https://www.linkedin.com/pulse/dear-people-who-think-ai-low-skilled-code-monkeys-future-jan-moser-svade/ Jan Moser criticizes those who think that AI and low-skilled developers can replace competent software engineers. He cites the example of the Tea app, a women's safety platform, which exposed 72,000 user images because of a Firebase misconfiguration and a lack of secure development practices. He stresses that the absence of automated checks and sound security practices made this data leak possible. Moser warns that tools such as AI cannot make up for a lack of software engineering skills, particularly in security, error handling, and code quality.
He calls for recognition of the value of skilled software engineers and for a more rigorous approach to software development. YouTube rolls out age-estimation technology to identify teenagers in the United States https://techcrunch.com/2025/07/29/youtube-rolls-out-age-estimatation-tech-to-identify-u-s-teens-and-apply-additional-protections/ A very topical subject, especially in the UK but not only there... YouTube is starting to roll out AI-based age-estimation technology to identify teenage users in the United States, regardless of the age declared at sign-up. The technology analyzes various behavioral signals, such as viewing history, the categories of videos watched, and the age of the account. When a user is identified as a teenager, YouTube applies additional protections, including: Disabling personalized ads. Enabling digital wellbeing tools, such as screen-time and bedtime reminders. Limiting repeated viewing of sensitive content, such as content related to body image. A user who is wrongly identified as a minor can verify their age with a government ID, a credit card, or a selfie. This initial rollout covers a small group of users in the United States and will be expanded gradually. The initiative is part of YouTube's efforts to strengthen the safety of young users online. Mistral AI: contribution to an environmental standard for AI https://mistral.ai/news/our-contribution-to-a-global-environmental-standard-for-ai Mistral AI has carried out the first comprehensive life-cycle analysis of an AI model, in collaboration with several partners. The study quantifies the environmental impact of the Mistral Large 2 model in terms of greenhouse gas emissions, water consumption, and resource depletion.
The training phase generated 20.4 kilotonnes of CO₂ equivalent, consumed 281,000 m³ of water, and used 660 kg Sb-eq (mineral resource depletion). For a 400-token response, the marginal impact is small but not negligible: 1.14 grams of CO₂, 45 mL of water, and 0.16 mg of antimony equivalent. Mistral proposes three indicators for assessing this impact: the absolute impact of training, the marginal impact of inference, and the ratio of inference impact to total life-cycle impact. The company stresses the importance of choosing the model according to the use case to limit the environmental footprint. Mistral calls for more transparency and the adoption of international standards to allow clear comparison between models. AI promised more efficiency... it mostly makes us work more https://afterburnout.co/p/ai-promised-to-make-us-more-efficient AI tools were supposed to automate tedious tasks and free up time for strategic and creative work. In reality, the time saved is often immediately reinvested in other tasks, creating overload. Users believe they are more productive with AI, but the data contradict that impression: one study shows that developers using AI take 19% longer to complete their tasks. The 2024 DORA report observes a drop in overall team performance as AI use increases: -1.5% throughput and -7.2% delivery stability for a 25% increase in AI adoption. AI does not reduce mental load, it shifts it: writing prompts, checking dubious results, constant adjustments... This is exhausting and limits real focus time. This cognitive overload leads to a form of mental debt: you do not really save time, you pay for it in another way. The real problem lies in our productivity culture, which pushes us to always optimize, even at the cost of fueling burnout.
Three concrete avenues: Rethink productivity not as time saved but as energy preserved. Be selective in using AI tools, based on how you feel rather than on the hype. Accept the J-curve: AI can be useful, but it requires deep adjustments to produce real gains. The real productivity hack? Sometimes, slowing down to stay clear-headed and sustainable. Conferences MCP Summit Europe https://mcpdevsummit.ai/ JavaOne returns in 2026 https://inside.java/2025/08/04/javaone-returns-2026/ JavaOne, the conference dedicated to the Java community, makes its big return to the Bay Area from 17 to 19 March 2026. After the success of the 2025 edition, this return continues the conference's original mission: bringing the community together to learn, collaborate, and innovate. The list of conferences comes from Developers Conferences Agenda/List by Aurélie Vache and contributors: 25–27 August 2025: SHAKA Biarritz - Biarritz (France) 5 September 2025: JUG Summer Camp 2025 - La Rochelle (France) 12 September 2025: Agile Pays Basque 2025 - Bidart (France) 15 September 2025: Agile Tour Montpellier - Montpellier (France) 18–19 September 2025: API Platform Conference - Lille (France) & Online 22–24 September 2025: Kernel Recipes - Paris (France) 22–27 September 2025: La Mélée Numérique - Toulouse (France) 23 September 2025: OWASP AppSec France 2025 - Paris (France) 23–24 September 2025: AI Engineer Paris - Paris (France) 25 September 2025: Agile Game Toulouse - Toulouse (France) 25–26 September 2025: Paris Web 2025 - Paris (France) 30 September 2025–1 October 2025: PyData Paris 2025 - Paris (France) 2 October 2025: Nantes Craft - Nantes (France) 2–3 October 2025: Volcamp - Clermont-Ferrand (France) 3 October 2025: DevFest Perros-Guirec 2025 - Perros-Guirec (France) 6–7 October 2025: Swift Connection 2025 - Paris (France) 6–10 October 2025: Devoxx Belgium - Antwerp (Belgium) 7 October 2025: BSides Mulhouse -
Mulhouse (France) 7–8 octobre 2025 : Agile en Seine - Issy-les-Moulineaux (France) 8–10 octobre 2025 : SIG 2025 - Paris (France) & Online 9 octobre 2025 : DevCon #25 : informatique quantique - Paris (France) 9–10 octobre 2025 : Forum PHP 2025 - Marne-la-Vallée (France) 9–10 octobre 2025 : EuroRust 2025 - Paris (France) 16 octobre 2025 : PlatformCon25 Live Day Paris - Paris (France) 16 octobre 2025 : Power 365 - 2025 - Lille (France) 16–17 octobre 2025 : DevFest Nantes - Nantes (France) 17 octobre 2025 : Sylius Con 2025 - Lyon (France) 17 octobre 2025 : ScalaIO 2025 - Paris (France) 17–19 octobre 2025 : OpenInfra Summit Europe - Paris (France) 20 octobre 2025 : Codeurs en Seine - Rouen (France) 23 octobre 2025 : Cloud Nord - Lille (France) 30–31 octobre 2025 : Agile Tour Bordeaux 2025 - Bordeaux (France) 30–31 octobre 2025 : Agile Tour Nantais 2025 - Nantes (France) 30 octobre 2025–2 novembre 2025 : PyConFR 2025 - Lyon (France) 4–7 novembre 2025 : NewCrafts 2025 - Paris (France) 5–6 novembre 2025 : Tech Show Paris - Paris (France) 6 novembre 2025 : dotAI 2025 - Paris (France) 6 novembre 2025 : Agile Tour Aix-Marseille 2025 - Gardanne (France) 7 novembre 2025 : BDX I/O - Bordeaux (France) 12–14 novembre 2025 : Devoxx Morocco - Marrakech (Morocco) 13 novembre 2025 : DevFest Toulouse - Toulouse (France) 15–16 novembre 2025 : Capitole du Libre - Toulouse (France) 19 novembre 2025 : SREday Paris 2025 Q4 - Paris (France) 19–21 novembre 2025 : Agile Grenoble - Grenoble (France) 20 novembre 2025 : OVHcloud Summit - Paris (France) 21 novembre 2025 : DevFest Paris 2025 - Paris (France) 27 novembre 2025 : DevFest Strasbourg 2025 - Strasbourg (France) 28 novembre 2025 : DevFest Lyon - Lyon (France) 1–2 décembre 2025 : Tech Rocks Summit 2025 - Paris (France) 4–5 décembre 2025 : Agile Tour Rennes - Rennes (France) 5 décembre 2025 : DevFest Dijon 2025 - Dijon (France) 9–11 décembre 2025 : APIdays Paris - Paris (France) 9–11 décembre 2025 : Green IO Paris - Paris (France) 10–11 
décembre 2025 : Devops REX - Paris (France) 10–11 décembre 2025 : Open Source Experience - Paris (France) 11 décembre 2025 : Normandie.ai 2025 - Rouen (France) 28–31 janvier 2026 : SnowCamp 2026 - Grenoble (France) 2–6 février 2026 : Web Days Convention - Aix-en-Provence (France) 3 février 2026 : Cloud Native Days France 2026 - Paris (France) 12–13 février 2026 : Touraine Tech #26 - Tours (France) 22–24 avril 2026 : Devoxx France 2026 - Paris (France) 23–25 avril 2026 : Devoxx Greece - Athens (Greece) 17 juin 2026 : Devoxx Poland - Krakow (Poland) Nous contacter Pour réagir à cet épisode, venez discuter sur le groupe Google https://groups.google.com/group/lescastcodeurs Contactez-nous via X/twitter https://twitter.com/lescastcodeurs ou Bluesky https://bsky.app/profile/lescastcodeurs.com Faire un crowdcast ou une crowdquestion Soutenez Les Cast Codeurs sur Patreon https://www.patreon.com/LesCastCodeurs Tous les épisodes et toutes les infos sur https://lescastcodeurs.com/
Terminal-based coding agents (CLIs) are changing the way we interact with our projects. In this episode, we run a hands-on comparison of the tools from OpenAI, Google, and Anthropic (Codex CLI, Gemini CLI, and Claude Code), testing their ability to generate documentation and web pages from a real repository. Which one delivers the most? And which one is the best value? We tell you everything here.
The field of AI coding agent CLIs is crowded and getting more so by the day, and our co-host Jack has tried them all so you don't have to. The big four are OpenAI's Codex, Anthropic's Claude Code, Google's Gemini Code, and Amazon Q, along with some lesser-known CLIs like AmpCode, OpenCode, and (the already shut down) Anon Kode. After trying everything, Jack says Anthropic's Sonnet models and Claude Code are still the best.

Google's quietly been working on new LLM-powered web APIs that rely on Google's Gemini Nano model to power browser features like language detection, translation, writing, and proofreading, and Mozilla is concerned devs will create apps based on Gemini's behavior.

Less than two months after Figma's big Config conference, it shared that it has acquired the open-source headless CMS Payload. Continuing the effort to make Figma a central hub for digital product creation, Figma's adding a CMS to the mix so marketers and designers can more easily update website content as needed.

Timestamps:
1:01 - Jack's AI tool roundup
10:34 - Mozilla's concerns about Google building AI into Chrome
19:16 - Figma buys Payload
24:22 - Firefox gets vertical tabs
27:15 - Jack's macOS 26 experiment goes wrong
30:36 - Anthropic destroys millions of print books
38:06 - What's making us happy

Links:
Paige - Figma buys CMS Payload
Jack - State of the AI CLIs: Codex, OpenCode, AmpCode, Gemini Code, Claude Code, Amazon Q
TJ - Mozilla's concerns about Google building AI into Chrome

Lightning News:
Firefox v140
Jack's MacOS 26 upgrade gone wrong
Anthropic destroyed millions of print books to build its AI models

What Makes Us Happy this Week:
Paige - Rock Paper Scissors novel
Jack - Tamolitch Falls and Final Destination movie series
TJ - Watkins Glen State Park

Thanks as always to our sponsor, the Blue Collar Coder channel on YouTube.
You can join us in our Discord channel, explore our website and reach us via email, or talk to us on X, Bluesky, or YouTube.
Front-end Fire website
Blue Collar Coder on YouTube
Blue Collar Coder on Discord
Reach out via email
Tweet at us on X @front_end_fire
Follow us on Bluesky @front-end-fire.com
Subscribe to our YouTube channel @Front-EndFirePodcast
Software Engineering Radio - The Podcast for Professional Software Developers
Will McGugan, the CEO and founder of Textualize, speaks with host Gregory M. Kapfhammer about how to use packages such as Rich and Textual to build text-based user interfaces (TUIs) and command-line interfaces (CLIs) in Python. Along with discussing the design idioms that enable developers to create TUIs in Python, they consider practical strategies for efficiently rendering the components of a TUI. They also explore the subtle idiosyncrasies of implementing performant TUI frameworks like Textual and Rich and introduce the steps that developers would take to create their own CLI or TUI. This episode is sponsored by Fly.io.
Today's episode is with Paul Klein, founder of Browserbase. We talked about building browser infrastructure for AI agents, the future of agent authentication, and their open source framework Stagehand.

* [00:00:00] Introductions
* [00:04:46] AI-specific challenges in browser infrastructure
* [00:07:05] Multimodality in AI-Powered Browsing
* [00:12:26] Running headless browsers at scale
* [00:18:46] Geolocation when proxying
* [00:21:25] CAPTCHAs and Agent Auth
* [00:28:21] Building “User take over” functionality
* [00:33:43] Stagehand: AI web browsing framework
* [00:38:58] OpenAI's Operator and computer use agents
* [00:44:44] Surprising use cases of Browserbase
* [00:47:18] Future of browser automation and market competition
* [00:53:11] Being a solo founder

Transcript

Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

swyx [00:00:12]: Hey, and today we are very blessed to have our friend, Paul Klein IV, CEO of Browserbase. Welcome.

Paul [00:00:21]: Thanks guys. Yeah, I'm happy to be here. I've been lucky to know both of you for like a couple of years now, I think. So it's just like we're hanging out, you know, with three ginormous microphones in front of our face. It's a totally normal hangout.

swyx [00:00:34]: Yeah. We've actually mentioned you on the podcast, I think, more often than any other Solaris tenant. Just because like you're one of the, you know, best performing, I think, LLM tool companies that have started up in the last couple of years.

Paul [00:00:50]: Yeah, I mean, it's been a whirlwind of a year, like Browserbase is actually pretty close to our first birthday. So we are one year old. And going from, you know, starting a company as a solo founder to... 
To, you know, having a team of 20 people, you know, a Series A, but also being able to support hundreds of AI companies that are building AI applications that go out and automate the web. It's just been like, really cool. It's been happening a little too fast. I think like collectively as an AI industry, let's just take a week off together. I took my first vacation actually two weeks ago, and Operator came out on the first day, and then a week later, DeepSeek came out. And I'm like on vacation trying to chill. I'm like, we got to build with this stuff, right? So it's been a breakneck year. But I'm super happy to be here and like talk more about all the stuff we're seeing. And I'd love to hear kind of what you guys are excited about too, and share with it, you know?

swyx [00:01:39]: Where to start? So people, you've done a bunch of podcasts. I think I strongly recommend Jack Bridger's Scaling DevTools, as well as Turner Novak's The Peel. And, you know, I'm sure there's others. So you covered your Twilio story in the past, talked about StreamClub, you got acquired by Mux, and then you left to start Browserbase. So maybe we just start with what is Browserbase? Yeah.

Paul [00:02:02]: Browserbase is the web browser for your AI. We're building headless browser infrastructure, which are browsers that run in a server environment that's accessible to developers via APIs and SDKs. It's really hard to run a web browser in the cloud. You guys are probably running Chrome on your computers, and that's using a lot of resources, right? So if you want to run a web browser or thousands of web browsers, you can't just spin up a bunch of lambdas. You actually need to use a secure containerized environment. You have to scale it up and down. It's a stateful system. And that infrastructure is, like, super painful. And I know that firsthand, because at my last company, StreamClub, I was CTO, and I was building our own internal headless browser infrastructure. 
That's actually why we sold the company, is because Mux really wanted to buy our headless browser infrastructure that we'd built. And it's just a super hard problem. And I actually told my co-founders, I would never start another company unless it was a browser infrastructure company. And it turns out that's really necessary in the age of AI, when AI can actually go out and interact with websites, click on buttons, fill in forms. You need AI to do all of that work in an actual browser running somewhere on a server. And Browserbase powers that.

swyx [00:03:08]: While you're talking about it, it occurred to me, not that you're going to be acquired or anything, but it occurred to me that it would be really funny if you became the Nikita Bier of headless browser companies. You just have one trick, and you make browser companies that get acquired.

Paul [00:03:23]: I truly do only have one trick. I'm screwed if it's not for headless browsers. I'm not a Go programmer. You know, I'm in AI Grant. You know, Browserbase is in AI Grant. But we were the only company in that AI Grant batch that used zero dollars on AI spend. You know, we're purely an infrastructure company. So as much as people want to ask me about reinforcement learning, I might not be the best guy to talk about that. But if you want to ask about headless browser infrastructure at scale, I can talk your ear off. So that's really my area of expertise. And it's a pretty niche thing. Like, nobody has done what we're doing at scale before. So we're happy to be the experts.

swyx [00:03:59]: You do have an AI thing, Stagehand. We can talk about the sort of core of Browserbase first, and then maybe Stagehand. Yeah, Stagehand is kind of the web browsing framework. Yeah.

What is Browserbase? Headless Browser Infrastructure Explained

Alessio [00:04:10]: Yeah. Yeah. And maybe how you got to Browserbase and what problems you saw. So one of the first things I worked on as a software engineer was integration testing. 
Sauce Labs was kind of like the main thing at the time. And then we had Selenium, we had Playwright, we had all these different browser things. But it's always been super hard to do. So obviously you've worked on this before. When you started Browserbase, what were the challenges? What were the AI-specific challenges that you saw versus, there's kind of like all the usual running browser at scale in the cloud, which has been a problem for years. What are like the AI unique things that you saw that like traditional approaches just didn't cover? Yeah.

AI-specific challenges in browser infrastructure

Paul [00:04:46]: First and foremost, I think back to like the first thing I did as a developer, like as a kid when I was writing code, I wanted to write code that did stuff for me. You know, I wanted to write code to automate my life. And I do that probably by using curl or Beautiful Soup to fetch data from a web browser. And I think I still do that now that I'm in the cloud. And the other thing that I think is a huge challenge for me is that you can't just create a web site and parse that data. And we all know that now like, you know, taking HTML and plugging that into an LLM, you can extract insights, you can summarize. So it was very clear that now like dynamic web scraping became very possible with the rise of large language models or a lot easier. And that was like a clear reason why there's been more usage of headless browsers, which are necessary because a lot of modern websites don't expose all of their page content via a simple HTTP request. You know, they actually do require you to run this type of code for a specific time. JavaScript on the page to hydrate this. Airbnb is a great example. You go to airbnb.com. A lot of that content on the page isn't there until after they run the initial hydration. So you can't just scrape it with a curl. You need to have some JavaScript run. 
And a browser is that JavaScript engine that's going to actually run all those requests on the page. So web data retrieval was definitely one driver of starting Browserbase and the rise of being able to summarize that with an LLM. Also, I was familiar with if I wanted to automate a website, I could write one script and that would work for one website. It was very static and deterministic. But the web is non-deterministic. The web is always changing. And until we had LLMs, there was no way to write scripts that you could write once that would run on any website. That would change with the structure of the website. Click the login button. It could mean something different on many different websites. And LLMs allow us to generate code on the fly to actually control that. So I think that rise of writing the generic automation scripts that can work on many different websites, to me, made it clear that browsers are going to be a lot more useful because now you can automate a lot more things without writing. If you wanted to write a script to book a demo call on 100 websites, previously, you had to write 100 scripts. Now you write one script that uses LLMs to generate that script. That's why we built our web browsing framework, Stagehand, which does a lot of that work for you. But those two things, web data collection and then enhanced automation of many different websites, it just felt like big drivers for more browser infrastructure that would be required to power these kinds of features.

Alessio [00:07:05]: And was multimodality also a big thing?

Paul [00:07:08]: Now you can use the LLMs to look, even though the text in the DOM might not be as friendly. Maybe my hot take is I was always kind of like, I didn't think vision would be as big of a driver. For UI automation, I felt like, you know, HTML is structured text and large language models are good with structured text. 
But it's clear that these computer use models are often vision driven, and they've been really pushing things forward. So definitely being multimodal, like rendering the page is required to take a screenshot to give that to a computer use model to take actions on a website. And it's just another win for browser. But I'll be honest, that wasn't what I was thinking early on. I didn't even think that we'd get here so fast with multimodality. I think we're going to have to get back to multimodal and vision models.

swyx [00:07:50]: This is one of those things where I forgot to mention in my intro that I'm an investor in Browserbase. And I remember that when you pitched to me, like a lot of the stuff that we have today wasn't in the original conversation. But I did have my original thesis was something that we've talked about on the podcast before, which is take the GPT store, the custom GPT store, all the every single checkbox and plugin is effectively a startup. And this was the browser one. I think the main hesitation, I think I actually took a while to get back to you. The main hesitation was that there were others. Like you're not the first headless browser startup. It's not even your first headless browser startup. There's always a question of like, will you be the category winner in a place where there's a bunch of incumbents, to be honest, that are bigger than you? They're just not targeted at the AI space. They don't have the backing of Nat Friedman. And there's a bunch of like, you're here in Silicon Valley. They're not. I don't know.

Paul [00:08:47]: I don't know if that's, that was it, but like, there was a, yeah, I mean, like, I think I tried all the other ones and I was like, really disappointed. Like my background is from working at great developer tools companies, and nothing had like the Vercel-like experience. Um, like our biggest competitor actually is partly owned by private equity and they just jacked up their prices quite a bit. 
And the dashboard hasn't changed in five years. And I actually used them at my last company and tried them and I was like, oh man, like there really just needs to be something that's like the experience of these great infrastructure companies, like Stripe, like Clerk, like Vercel that I use and love, but oriented towards this kind of like more specific category, which is browser infrastructure, which is really technically complex. Like a lot of stuff can go wrong on the internet when you're running a browser. The internet is very vast. There's a lot of different configurations. Like there's still websites that only work with Internet Explorer out there. How do you handle that when you're running your own browser infrastructure? These are the problems that we have to think about and solve at Browserbase. And it's, it's certainly a labor of love, but I built this for me, first and foremost, I know it's super cheesy and everyone says that for like their startups, but it really, truly was for me. If you look at like the talks I've done even before Browserbase, and I'm just like really excited to try and build a category defining infrastructure company. And it's, it's rare to have a new category of infrastructure exist. We're here in the Chroma offices and like, you know, vector databases is a new category of infrastructure. Is it, is it, I mean, we can, we're in their office, so, you know, we can, we can debate that one later. That is one.

Multimodality in AI-Powered Browsing

swyx [00:10:16]: That's one of the industry debates.

Paul [00:10:17]: I guess we go back to the LLM OS talk that Karpathy gave way long ago. And like the browser box was very clearly there and it seemed like the people who were building in this space also agreed that browsers are a core primitive of infrastructure for the LLM OS that's going to exist in the future. And nobody was building something there that I wanted to use. So I had to go build it myself.

swyx [00:10:38]: Yeah. 
I mean, exactly that talk that, that honestly, that diagram, every box is a startup and there's the code box and then there's the browser box. I think at some point they will start clashing there. There's always the question of the, are you a point solution or are you the sort of all in one? And I think the point solutions tend to win quickly, but then the all-in-ones have a very tight cohesive experience. Yeah. Let's talk about just the hard problems of Browserbase you have on your website, which is beautiful. Thank you. Was there an agency that you used for that? Yeah. Herve.paris.

Paul [00:11:11]: They're amazing. Herve.paris. Yeah. It's H-E-R-V-E. I highly recommend for developers. Developer tools, founders to work with consumer agencies because they end up building beautiful things and the Parisians know how to build beautiful interfaces. So I got to give props.

swyx [00:11:24]: And chat apps, apparently are, they are very fast. Oh yeah. The Mistral chat. Yeah. Mistral. Yeah.

Paul [00:11:31]: Le Chat.

swyx [00:11:31]: Le Chat. And then your videos as well, it was professionally shot, right? The series A video. Yeah.

Alessio [00:11:36]: Nico did the videos. He's amazing. Not the initial video that you shot at the new one. First one was Austin.

Paul [00:11:41]: Another, another video pretty surprised. But yeah, I mean, like, I think when you think about how you talk about your company. You have to think about the way you present yourself. It's, you know, as a developer, you think you evaluate a company based on like the API reliability and the P95, but a lot of developers say, is the website good? Is the message clear? Do I like trust this founder? I'm building my whole feature on. So I've tried to nail that as well as like the reliability of the infrastructure. You're right. It's very hard. And there's a lot of kind of foot guns that you run into when running headless browsers at scale. 
Right.

Competing with Existing Headless Browser Solutions

swyx [00:12:10]: So let's pick one. You have eight features here. Seamless integration. Scalability. Fast or speed. Secure. Observable. Stealth. That's interesting. Extensible and developer first. What comes to your mind as like the top two, three hardest ones? Yeah.

Running headless browsers at scale

Paul [00:12:26]: I think just running headless browsers at scale is like the hardest one. And maybe can I nerd out for a second? Is that okay? I heard this is a technical audience, so I'll talk to the other nerds. Whoa. They were listening. Yeah. They're upset. They're ready. The AGI is angry. Okay. So. So how do you run a browser in the cloud? Let's start with that, right? So let's say you're using a popular browser automation framework like Puppeteer, Playwright, or Selenium. Maybe you've written a code, some code locally on your computer that opens up Google. It finds the search bar and then types in, you know, search for Latent Space and hits the search button. That script works great locally. You can see the little browser open up. You want to take that to production. You want to run the script in a cloud environment. So when your laptop is closed, your browser is doing something. Well, we use Amazon. You know, the first thing I'd reach for is probably like some sort of serverless infrastructure. I would probably try and deploy on a Lambda. But Chrome itself is too big to run on a Lambda. It's over 250 megabytes. So you can't easily start it on a Lambda. So you maybe have to use something like Lambda layers to squeeze it in there. Maybe use a different Chromium build that's lighter. And you get it on the Lambda. Great. It works. But it runs super slowly. It's because Lambdas are very like resource limited. They only run like with one vCPU. You can run one process at a time. Remember, Chromium is super beefy. 
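The local script Paul describes above (open Google, find the search bar, type a query, submit) might look roughly like this with Playwright's Python sync API. This is a sketch under assumptions, not code from the episode: the `textarea[name="q"]` selector is a guess at Google's current markup and may change, and the names `search_google` and `run_locally` are made up for illustration.

```python
SEARCH_BOX = 'textarea[name="q"]'  # assumed selector; Google's markup may differ


def search_google(page, query):
    """Drive an already-open Playwright page: load Google, type, submit."""
    page.goto("https://www.google.com")
    page.fill(SEARCH_BOX, query)     # type the query into the search bar
    page.press(SEARCH_BOX, "Enter")  # submit the search
    return page.title()


def run_locally(query="Latent Space"):
    """Launch a visible local Chromium: the 'works on my laptop' case."""
    # Imported lazily so this module loads even where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        try:
            return search_google(browser.new_page(), query)
        finally:
            browser.close()
```

Calling `run_locally()` requires Playwright and a Chromium build on the machine. The rest of Paul's answer is about what it takes to run these same few lines on a server instead.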
It's barely running on my MacBook Air. I'm still downloading it from a pre-run. Yeah, from the test earlier, right? I'm joking. But it's big, you know? So like Lambda, it just won't work really well. Maybe it'll work, but you need something faster. Your users want something faster. Okay. Well, let's put it on a beefier instance. Let's get an EC2 server running. Let's throw Chromium on there. Great. Okay. I can, that works well with one user. But what if I want to run like 10 Chromium instances, one for each of my users? Okay. Well, I might need two EC2 instances. Maybe 10. All of a sudden, you have multiple EC2 instances. This sounds like a problem for Kubernetes and Docker, right? Now, all of a sudden, you're using ECS or EKS, the Kubernetes or container solutions by Amazon. You're spinning up and down containers, and you're spending a whole engineer's time on kind of maintaining this stateful distributed system. Those are some of the worst systems to run because when it's a stateful distributed system, it means that you are bound by the connections to that thing. You have to keep the browser open while someone is working with it, right? That's just a painful architecture to run. And there's all these other little gotchas with Chromium, like Chromium, which is the open source version of Chrome, by the way. You have to install all these fonts. You want emojis working in your browsers because your vision model is looking for the emoji. You need to make sure you have the emoji fonts. You need to make sure you have all the right extensions configured, like, oh, do you want ad blocking? How do you configure that? How do you actually record all these browser sessions? Like it's a headless browser. You can't look at it. So you need to have some sort of observability. Maybe you're recording videos and storing those somewhere. 
It all kind of adds up to be this just giant monster piece of your project when all you wanted to do was run a lot of browsers in production for this little script to go to google.com and search. And when I see a complex distributed system, I see an opportunity to build a great infrastructure company. And we really abstract that away with Browserbase where our customers can use these existing frameworks, Playwright, Puppeteer, Selenium, or our own Stagehand and connect to our browsers in a serverless-like way. And control them, and then just disconnect when they're done. And they don't have to think about the complex distributed system behind all of that. They just get a browser running anywhere, anytime. Really easy to connect to.

swyx [00:15:55]: I'm sure you have questions. My standard question with anything, so essentially you're a serverless browser company, and there's been other serverless things that I'm familiar with in the past, serverless GPUs, serverless website hosting. That's where I come from with Netlify. One question is just like, you promised to spin up thousands of servers. You promised to spin up thousands of browsers in milliseconds. I feel like there's no real solution that does that yet. And I'm just kind of curious how. The only solution I know, which is to kind of keep a kind of warm pool of servers around, which is expensive, but maybe not so expensive because it's just CPUs. So I'm just like, you know. Yeah.

Browsers as a Core Primitive in AI Infrastructure

Paul [00:16:36]: You nailed it, right? I mean, how do you offer a serverless-like experience with something that is clearly not serverless, right? And the answer is, you need to be able to run... We run many browsers on single nodes. We use Kubernetes at Browserbase. So we have many pods that are being scheduled. We have to predictably schedule them up or down. Yes, thousands of browsers in milliseconds is the best case scenario. 
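The warm-pool idea swyx raises can be sketched in a few lines. This is a toy model and not Browserbase's scheduler: `Browser`, `WarmPool`, and the `target` parameter are hypothetical names, and a real system would start actual browser processes and refill the pool in the background.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)


@dataclass
class Browser:
    """Stand-in for a running browser process; purely illustrative."""
    id: int = field(default_factory=lambda: next(_ids))


class WarmPool:
    """Keep `target` idle browsers ready so most requests skip the cold start."""

    def __init__(self, target):
        self.target = target
        self.idle = deque(Browser() for _ in range(target))
        self.cold_starts = 0

    def acquire(self):
        if self.idle:
            return self.idle.popleft()  # warm hit: hand over a ready browser
        self.cold_starts += 1           # pool drained: pay the slow path
        return Browser()

    def refill(self):
        """Top the pool back up to `target` idle browsers."""
        while len(self.idle) < self.target:
            self.idle.append(Browser())


pool = WarmPool(target=3)
sessions = [pool.acquire() for _ in range(5)]  # a burst larger than the pool
print(pool.cold_starts)  # 2
```

The burst of five requests against a pool of three shows the trade-off in miniature: a larger `target` means fewer cold starts but more idle capacity to pay for.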
If you hit us with 10,000 requests, you may hit a slower cold start, right? So we've done a lot of work on predictive scaling and being able to kind of route stuff to different regions where we have multiple regions of Browserbase where we have different pools available. You can also pick the region you want to go to based on like lower latency, round trip, time latency. It's very important with these types of things. There's a lot of requests going over the wire. So for us, like having a VM like Firecracker powering everything under the hood allows us to be super nimble and spin things up or down really quickly with strong multi-tenancy. But in the end, this is like the complex infrastructural challenges that we have to kind of deal with at Browserbase. And we have a lot more stuff on our roadmap to allow customers to have more levers to pull to exchange, do you want really fast browser startup times or do you want really low costs? And if you're willing to be more flexible on that, we may be able to kind of like work better for your use cases.

swyx [00:17:44]: Since you used Firecracker, shouldn't Fargate do that for you or did you have to go lower level than that? We had to go lower level than that.

Paul [00:17:51]: I find this a lot with Fargate customers, which is alarming for Fargate. We used to be a giant Fargate customer. Actually, the first version of Browserbase was ECS and Fargate. And unfortunately, it's a great product. I think we were actually the largest Fargate customer in our region for a little while. No, what? Yeah, seriously. And unfortunately, it's a great product, but I think if you're an infrastructure company, you actually have to have a deeper level of control over these primitives. I think it's the same thing is true with databases. We've used other database providers and I think-

swyx [00:18:21]: Yeah, serverless Postgres.

Paul [00:18:23]: Shocker. When you're an infrastructure company, you're on the hook if any provider has an outage. 
And I can't tell my customers like, hey, we went down because so-and-so went down. That's not acceptable. So for us, we've really moved to bringing things internally. It's kind of opposite of what we preach. We tell our customers, don't build this in-house, but then we're like, we build a lot of stuff in-house. But I think it just really depends on what is in the critical path. We try and have deep ownership of that.

Alessio [00:18:46]: On the distributed location side, how does that work for the web where you might get sort of different content in different locations, but the customer is expecting, you know, if you're in the US, I'm expecting the US version. But if you're spinning up my browser in France, I might get the French version. Yeah.

Paul [00:19:02]: Yeah. That's a good question. Well, generally, like on the localization, there is a thing called locale in the browser. You can set like what your locale is. If you're like in the en-US browser or not, but some things do IP, IP based routing. And in that case, you may want to have a proxy. Like let's say you're running something in the, in Europe, but you want to make sure you're showing up from the US. You may want to use one of our proxy features so you can turn on proxies to say like, make sure these connections always come from the United States, which is necessary too, because when you're browsing the web, you're coming from like a, you know, data center IP, and that can make things a lot harder to browse web. So we do have kind of like this proxy super network. Yeah. We have a proxy for you based on where you're going, so you can reliably automate the web. But if you get scheduled in Europe, that doesn't happen as much. We try and schedule you as close to, you know, your origin that you're trying to go to. But generally you have control over the regions you can put your browsers in. So you can specify West one or East one or Europe. We only have one region of Europe right now, actually. 
Yeah.

Alessio [00:19:55]: What's harder, the browser or the proxy? I feel like to me, it feels like actually proxying reliably at scale. It's much harder than spinning up browsers at scale. I'm curious. It's all hard.

Paul [00:20:06]: It's layers of hard, right? Yeah. I think it's different levels of hard. I think the thing with the proxy infrastructure is that we work with many different web proxy providers and some are better than others. Some have good days, some have bad days. And our customers who've built browser infrastructure on their own, they have to go and deal with sketchy actors. Like first they figure out their own browser infrastructure and then they got to go buy a proxy. And it's like you can pay in Bitcoin and it just kind of feels a little sus, right? It's like you're buying drugs when you're trying to get a proxy online. We have like deep relationships with these counterparties. We're able to audit them and say, is this proxy being sourced ethically? Like it's not running on someone's TV somewhere. Is it free range? Yeah. Free range organic proxies, right? Right. We do a level of diligence. We're SOC 2. So we have to understand what is going on here. But then we're able to make sure that like we route around proxy providers not working. There's proxy providers who will just, the proxy will stop working all of a sudden. And then if you don't have redundant proxying on your own browsers, that's hard down for you or you may get some serious impacts there. With us, like we intelligently know, hey, this proxy is not working. Let's go to this one. And you can kind of build a network of multiple providers to really guarantee the best uptime for our customers. Yeah. So you don't own any proxies? We don't own any proxies. You're right. The team has been saying who wants to like take home a little proxy server, but not yet. We're not there yet. You know?

swyx [00:21:25]: It's a very mature market. I don't think you should build that yourself. 
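Routing around a proxy provider that has stopped working, as Paul describes it, can be illustrated with a small failover loop. This is a minimal sketch under assumptions and not how Browserbase implements it: `ProxyRouter`, `ProxyError`, the provider names, and the `max_failures` threshold are all hypothetical.

```python
class ProxyError(Exception):
    """Raised by a provider when a request through it fails."""


class ProxyRouter:
    """Try providers in order, skipping ones that have failed repeatedly."""

    def __init__(self, providers, max_failures=2):
        self.providers = providers  # name -> callable(url) -> response
        self.max_failures = max_failures
        self.failures = {name: 0 for name in providers}

    def fetch(self, url):
        for name, send in self.providers.items():
            if self.failures[name] >= self.max_failures:
                continue            # considered unhealthy: route around it
            try:
                response = send(url)
            except ProxyError:
                self.failures[name] += 1
                continue            # fall through to the next provider
            self.failures[name] = 0  # a success resets the counter
            return name, response
        raise ProxyError("no healthy proxy provider available")


def flaky(url):
    raise ProxyError("provider A is having a bad day")


def healthy(url):
    return f"200 OK via B for {url}"


router = ProxyRouter({"provider-a": flaky, "provider-b": healthy})
print(router.fetch("https://example.com"))
```

In this sketch a provider that crosses the failure threshold is never retried; a production version would likely re-probe it after a cooldown so that a provider with one bad day can return to rotation.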
Like you should just be a super customer of them. Yeah. Scraping, I think, is the main use case for that. I guess. Well, that leads us into CAPTCHAs and also auth, but let's talk about CAPTCHAs. You had a little spiel that you wanted to talk about CAPTCHA stuff.

Challenges of Scaling Browser Infrastructure

Paul [00:21:43]: Oh, yeah. I think a lot of people ask, if you're thinking about proxies, you're thinking about CAPTCHAs too. I think it's the same thing. You can go buy CAPTCHA solvers online, but it's the same buying experience. It's some sketchy website, you have to integrate it. It's not fun to buy these things, you can't really trust them, and the docs are bad. What Browserbase does is we integrate a bunch of different CAPTCHA solvers. We do some stuff in-house, but generally we just integrate with a bunch of known vendors and continually monitor and maintain these things and say, is this working or not? Can we route around it or not? These are CAPTCHA solvers. CAPTCHA solvers, yeah. Not CAPTCHA providers, CAPTCHA solvers. Yeah, sorry. CAPTCHA solvers. We really try and make sure all of that works for you. I think as a dev, if I'm buying infrastructure, I want it all to work all the time and it's important for us to provide that experience by making sure everything does work and monitoring it on our own. Yeah. Right now, the world of CAPTCHAs is tricky. I think AI agents in particular are very much ahead of the internet infrastructure. CAPTCHAs are designed to block all types of bots, but there are now good bots and bad bots. I think in the future, CAPTCHAs will be able to identify who a good bot is, hopefully via some sort of KYC. For us, we've been very lucky. We have very little to no known abuse of Browserbase because we really look into who we work with. And for certain types of CAPTCHA solving, we only allow them on certain types of plans because we want to make sure that we can know what people are doing, what their use cases are.
And that's really allowed us to try and be an arbiter of good bots, which is our long-term goal. I want to build great relationships with people like Cloudflare so we can agree, hey, here are these acceptable bots. We'll identify them for you and make sure we flag when they come to your website. This is a good bot, you know?

Alessio [00:23:23]: I see. And Cloudflare said they want to do more of this. So they're going to set by default, if they think you're an AI bot, they're going to reject. I'm curious if you think this is something that is going to be at the browser level or, I mean, the DNS level with Cloudflare seems more where it should belong. But I'm curious how you think about it.

Paul [00:23:40]: I think the web's going to change. You know, I think that the Internet as we have it right now is going to change. And we all need to just accept that the cat is out of the bag. And instead of kind of wishing the Internet was like it was in the 2000s, where we can have free content online that wouldn't be scraped, it's just not going to happen. And instead, we should think about, one, how can we change the models of, you know, information being published online so people can adequately commercialize it? But two, how do we rebuild applications that expect that AI agents are going to log in on their behalf? Those are the things that are going to allow us to kind of identify good and bad bots. And I think the team at Clerk has been doing a really good job with this on the authentication side. I actually think that auth is the biggest thing that will prevent agents from accessing stuff, not CAPTCHAs. And I think there will be agent auth in the future.
I don't know if it's going to happen from an individual company, but actually authentication providers that have a, you know, hidden login-as-agent feature, where you put in your email, you'll get a push notification saying, hey, your Browserbase agent wants to log into your Airbnb. You can approve that and then the agent can proceed. That really circumvents the need for CAPTCHAs or logging in as you and sharing your password. I think agent auth is going to be one way we identify good bots going forward. And I think a lot of this CAPTCHA solving stuff is really short-term problems as the internet kind of reorients itself around how it's going to work with agents browsing the web, just like people do. Yeah.

Managing Distributed Browser Locations and Proxies

swyx [00:24:59]: Stytch recently was on Hacker News for talking about agent experience, AX, which is a thing that Netlify is also trying to clone and coin and talk about. And we've talked about this on our previous episodes before in a sense that I actually think that's maybe the only part of the tech stack that needs to be kind of reinvented for agents. Everything else can stay the same, CLIs, APIs, whatever. But auth, yeah, we need agent auth. And it's mostly short-lived, it should be a distinct identity from the human, but paired. I almost think, in the same way that every social network should have your main profile and then your alt accounts or your Finsta, it's almost like, you know, every human token should be paired with the agent token and the agent token can go and do stuff on behalf of the human token, but not be presumed to be the human. Yeah.

Paul [00:25:48]: It's actually very similar to OAuth is what I'm thinking. And, you know, Reed from Stytch is an investor, Colin from Clerk, Okta Ventures, all investors in Browserbase because, like, I hope they solve this because they'll make Browserbase's mission more possible.
So we don't have to overcome all these hurdles, but I think it will be an OAuth-like flow where an agent will ask to log in as you, you'll approve the scopes. Like it can book an apartment on Airbnb, but it can't message anybody. And then, you know, the agent will have some sort of role-based access control within an application. Yeah. I'm excited for that.

swyx [00:26:16]: The tricky part is just, there's one layer of delegation here, which is like, you're authing my user's user or something like that. I don't know if that's tricky or not. Does that make sense? Yeah.

Paul [00:26:25]: You know, actually at Twilio, I worked on the login, identity, and access management teams, right? So I built Twilio's login page.

swyx [00:26:31]: You were an intern on that team and then you became the lead in two years? Yeah.

Paul [00:26:34]: Yeah. I started as an intern in 2016 and then I was the tech lead of that team. How? That's not normal. I didn't have a life. He's not normal. Look at this guy. I didn't have a girlfriend. I just loved my job. I don't know. I applied to 500 internships for my first job and I got rejected from every single one of them except for Twilio and then eventually Amazon. And they took a shot on me and, like, I was getting paid money to write code, which was my dream. Yeah. Yeah. I'm very lucky that this coding thing worked out because I was going to be doing it regardless. And yeah, I was able to kind of spend a lot of time on a team that was growing at a company that was growing. So it informed a lot of this stuff here. I think these are problems that have been solved with, like, the SAML protocol with SSO. I think there's really interesting stuff with WebAuthn, these different types of authentication schemes that you can use to authenticate people. The tooling is all there. It just needs to be tweaked a little bit to work for agents. And I think the fact that there are companies that are already providing authentication as a service really sets it up well. The thing that's hard is reinventing the internet for agents. We don't want to rebuild the internet. That's an impossible task. And I think people often say, well, we'll have this second layer of APIs built for agents. I'm like, we will for the top use cases, but instead we can just tweak the internet as is, which is on the authentication side. I think we're going to be the dumb ones going forward. Unfortunately, I think AI is going to be able to do a lot of the tasks that we do online, which means that it will be able to go to websites, click buttons on our behalf and log in on our behalf too. So with this kind of web agent future happening, I think with some small structural changes, like you said, it feels like it could all slot in really nicely with the existing internet.

Handling CAPTCHAs and Agent Authentication

swyx [00:28:08]: There's one more thing, which is your live view iframe, which lets you take control. Yeah. Obviously very key for Operator now, but is there anything interesting technically there? Well, people always want this.

Paul [00:28:21]: It was really hard to build, you know. So, okay. Headless browsers, you don't see them, right? They're running in a cloud somewhere. You can't look at them. It's a weird name. I wish we came up with a better name for this thing, but you can't see them. Right. But customers don't trust AI agents, right? At least on the first pass. So what we do with our live view is that, you know, when you use Browserbase, you can actually embed a live view of the browser running in the cloud for your customer to see it working. And the first reason is to build trust. Like, okay, so I have this script that's going to go automate a website. I can embed it into my web application via an iframe and my customer can watch. I think.
And then we added two-way communication. So now not only can you watch the browser kind of being operated by AI, if you want to pause and actually click around and type within this iframe that's controlling a browser, that's also possible. And this is all thanks to some of the lower level protocol, which is called the Chrome DevTools Protocol. It has an API called startScreencast, and you can also send mouse clicks and button clicks to a remote browser. And this is all embeddable within iframes. You have a browser within a browser, yo. And then you simulate the screen, the click on the other side. Exactly. And this is really nice often for, like, let's say, a CAPTCHA that can't be solved. You saw this with Operator, you know, Operator actually uses a different approach. They use VNC. So, you know, you're able to see, like, you're seeing the whole window here. What we're doing is something a little lower level with the Chrome DevTools Protocol. It's just PNGs being streamed over the wire. But the same thing is true, right? Like, hey, I'm running a window. Pause. Can you do something in this window? Human. Okay, great. Resume. Like sometimes 2FA tokens. Like if you get that text message, you might need a person to type that in. Web agents need human-in-the-loop type workflows still. You still need a person to interact with the browser. And building a UI to proxy that is kind of hard. You may as well just show them the whole browser and say, hey, can you finish this up for me? And then let the AI proceed on afterwards. Is there a future where I stream my current desktop to Browserbase? I don't think so. I think we're very much cloud infrastructure. Yeah. You know, but I think a lot of the stuff we're doing, we do want to, like, build tools. Like, you know, we'll talk about Stagehand, you know, the web agent framework, in a second. But, like, there's a case where a lot of people are going desktop first for, you know, consumer use.
And I think Claude is doing a lot of this, where I expect to see, you know, MCPs really oriented around the Claude desktop app for a reason, right? Like, I think a lot of these tools are going to run on your computer because it makes... I think it's breaking out. People are putting it on a server. Oh, really? Okay. Well, sweet. We'll see. We'll see that. I was surprised, though, wasn't I? I think that The Browser Company, too, with Dia Browser, it runs on your machine. You know, it's going to be...

swyx [00:30:50]: What is it?

Paul [00:30:51]: So, Dia Browser, as far as I understand... I used to use Arc. Yeah. I haven't used Arc. But I'm a big fan of The Browser Company. I think they're doing a lot of cool stuff in consumer. As far as I understand, it's a browser where you have a sidebar where you can, like, chat with it and it can control the local browser on your machine. So, if you imagine what a consumer web agent is, which lives alongside your browser, I think Google Chrome has Project Mariner, I think. I almost call it Project Marinara for some reason. I don't know why. It's...

swyx [00:31:17]: No, I think it's someone really likes Waterworld. Oh, I see. The classic Kevin Costner. Yeah.

Paul [00:31:22]: Okay. Project Marinara is a similar thing to the Dia Browser, in my mind, as far as I understand it. You have a browser that has an AI interface that will take over your mouse and keyboard and control the browser for you. Great for consumer use cases. But if you're building applications that rely on a browser and it's more part of a greater, like, AI app experience, you probably need something that's more like infrastructure, not a consumer app.

swyx [00:31:44]: Just because I have explored a little bit in this area, do people want branching? So, I have the state of whatever my browser's in. And then I want, like, 100 clones of this state. Do people do that? Or...

Paul [00:31:56]: People don't do it currently. Yeah.
But it's definitely something we're thinking about. I think the idea of forking a browser is really cool. Technically, kind of hard. We're starting to see this in code execution, where people are, like, forking some code execution processes or forking some tool calls or branching tool calls. Haven't seen it at the browser level yet. But it makes sense. Like, if an AI agent is using a website and it's not sure what path it wants to take to crawl this website to find the information it's looking for, it would make sense for it to explore both paths in parallel. And that'd be a very, like... A road not taken. Yeah. And hopefully find the right answer. And then say, okay, this was actually the right one. And memorize that. And go there in the future. On the roadmap. For sure. Don't make my roadmap, please. You know?

Alessio [00:32:37]: How do you actually do that? Yeah. How do you fork? I feel like the browser is so stateful for so many things.

swyx [00:32:42]: Serialize the state. Restore the state. I don't know.

Paul [00:32:44]: So, it's one of the reasons why we haven't done it yet. It's hard. You know? Like, to truly fork, it's actually quite difficult. The naive way is to open the same page in a new tab and then, like, hope that it's at the same thing. But if you have a form halfway filled, you may have to, like, take the whole, you know, container. Pause it. All the memory. Duplicate it. Restart it from there. It could be very slow. So, we haven't found a thing. Like, the easy thing to fork is just, like, copy the page object. You know? But I think there needs to be something a little bit more robust there. Yeah.

swyx [00:33:12]: So, MorphLabs has this infinite branch thing. Like, wrote a custom fork of Linux or something that let them save the system state and clone it. MorphLabs, hit me up. I'll be a customer. Yeah. That's the only. I think that's the only way to do it. Yeah. Like, unless Chrome has some special API for you.
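The "serialize the state, restore the state" idea and its limits can be sketched roughly as follows. This is only an illustration of the naive approach discussed above, with hypothetical names (`PageSnapshot`, `forkSnapshot`); it is not how Browserbase or Morph implements anything. It copies the parts of a page that are easy to write down and, by construction, misses whatever lives only in the page's memory, which is why Paul talks about duplicating the whole container.

```typescript
// Hypothetical shape: the parts of a page that are easy to serialize.
interface PageSnapshot {
  url: string;
  cookies: Record<string, string>;
  formValues: Record<string, string>; // half-filled form fields, by field name
}

// Copy a flat string map so forks do not share mutable state.
function copyRecord(src: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  const keys = Object.keys(src);
  for (let i = 0; i < keys.length; i++) {
    out[keys[i]] = src[keys[i]];
  }
  return out;
}

// Naive fork: produce N independent copies of the serializable state.
// In-memory JavaScript state, open sockets, and timers are NOT captured.
function forkSnapshot(snapshot: PageSnapshot, branches: number): PageSnapshot[] {
  const forks: PageSnapshot[] = [];
  for (let i = 0; i < branches; i++) {
    forks.push({
      url: snapshot.url,
      cookies: copyRecord(snapshot.cookies),
      formValues: copyRecord(snapshot.formValues),
    });
  }
  return forks;
}
```

Each fork could then be restored into a fresh tab and explored separately, with the caveat that a site relying on client-side state would not come back identically.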
Yeah.

Paul [00:33:29]: There's probably something we'll reverse engineer one day. I don't know. Yeah.

Alessio [00:33:32]: Let's talk about Stagehand, the AI web browsing framework. You have three core components, Observe, Extract, and Act. Pretty clean landing page. What was the idea behind making a framework? Yeah.

Stagehand: AI web browsing framework

Paul [00:33:43]: So, there's three frameworks that are very popular or already exist, right? Puppeteer, Playwright, Selenium. Those are for building hard-coded scripts to control websites. And as soon as I started to play with LLMs plus browsing, I caught myself, you know, code-genning Playwright code to control a website. I would, like, take the DOM. I'd pass it to an LLM. I'd say, can you generate the Playwright code to click the appropriate button here? And it would do that. And I was like, this really should be part of the frameworks themselves. And I became really obsessed with SDKs that take natural language as part of, like, the API input. And that's what Stagehand is. Stagehand exposes three APIs, and it's a superset of Playwright. So, if you go to a page, you may want to take an action, click on the button, fill in the form, etc. That's what the act command is for. You may want to extract some data. This one takes natural language, like, extract the winner of the Super Bowl from this page. You can give it a Zod schema, so it returns a structured output. And then maybe you're building an API. You can do an agent loop, and you want to kind of see what actions are possible on this page before taking one. You can do observe. So, you can observe the actions on the page, and it will generate a list of actions. You can guide it, like, give me actions on this page related to buying an item. And you get back, like, buy it now, add to cart, view shipping options, and pass that to an LLM, an agent loop, to say, what's the appropriate action given this high-level goal? So, Stagehand isn't a web agent.
It's a framework for building web agents. And we think that agent loops are actually pretty close to the application layer because every application probably has different goals or different ways it wants to take steps. I don't think I've seen a generic. Maybe you guys are the experts here. I haven't seen, like, a really good AI agent framework here. Everyone kind of has their own special sauce, right? I see a lot of developers building their own agent loops, and they're using tools. And I view Stagehand as the browser tool. So, we expose act, extract, observe. Your agent can call these tools. And from that, you don't have to worry about it. You don't have to worry about generating Playwright code performantly. You don't have to worry about running it. You can kind of just integrate these three tool calls into your agent loop and reliably automate the web.

swyx [00:35:48]: A special shout-out to Anirudh, who I met at your dinner, who I think listens to the pod. Yeah. Hey, Anirudh.

Paul [00:35:54]: Anirudh's the man. He's a Stagehand guy.

swyx [00:35:56]: I mean, the interesting thing about each of these APIs is they're kind of each a startup. Like, specifically extract, you know, Firecrawl is extract. There's, like, Expand AI. There's a whole bunch of, like, extract companies. They just focus on extract. I'm curious. Like, I feel like you guys are going to collide at some point. Like, right now, it's friendly. Everyone's in a blue ocean. At some point, it's going to be valuable enough that there's some turf battle here. I don't think you have a dog in a fight. I think you can mock extract to use an external service if they're better at it than you. But it's just an observation that, like, in the same way that I see each option, each checkbox in the side of custom GPTs becoming a startup or each box in the Karpathy chart being a startup. Like, this is also becoming a thing.
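The "browser as a tool inside your own agent loop" pattern Paul describes can be sketched roughly like this. Everything below is a stand-in: `BrowserTool`, `FakeShopPage`, and `runAgent` are hypothetical names, the calls are synchronous for brevity where real browser calls would be asynchronous, and the `pick` function plays the role an LLM would have. It is not Stagehand's actual API, only the observe, act, extract shape from the conversation.

```typescript
// Hypothetical tool interface mirroring the three verbs discussed above.
interface BrowserTool {
  observe(instruction: string): string[]; // list candidate actions on the page
  act(action: string): void;              // perform one action
  extract(instruction: string): string;   // pull data out of the page
}

// Toy in-memory "shop page" so the loop can run without a browser or an LLM.
class FakeShopPage implements BrowserTool {
  private inCart = false;
  log: string[] = [];

  observe(_instruction: string): string[] {
    return this.inCart ? ["view cart", "checkout"] : ["add to cart", "view shipping options"];
  }

  act(action: string): void {
    this.log.push(action);
    if (action === "add to cart") this.inCart = true;
  }

  extract(_instruction: string): string {
    return this.inCart ? "1 item in cart" : "cart is empty";
  }
}

// Stand-in for the LLM: given the goal and candidates, pick one or stop (null).
type Picker = (goal: string, candidates: string[]) => string | null;

// The application-owned agent loop: observe, choose, act, repeat, then extract.
function runAgent(tool: BrowserTool, goal: string, pick: Picker, maxSteps: number): string {
  for (let step = 0; step < maxSteps; step++) {
    const candidates = tool.observe("actions related to: " + goal);
    const choice = pick(goal, candidates);
    if (choice === null) break; // nothing useful left to do
    tool.act(choice);
  }
  return tool.extract("summarize the cart");
}

// A trivial picker: add to cart if that is offered, otherwise stop.
const preferAddToCart: Picker = (_goal, candidates) =>
  candidates.indexOf("add to cart") >= 0 ? "add to cart" : null;
```

The loop itself stays in application code, which matches the point that agent loops sit close to the application layer while the browser tool handles the page.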
Yeah.

Paul [00:36:41]: I mean, like, so the way Stagehand works is that it's MIT-licensed, completely open source. You bring your own API key to your LLM of choice. You could choose your LLM. We don't make any money off of the extract or really. We only really make money if you choose to run it with our browser. You don't have to. You can actually use your own browser, a local browser. You know, Stagehand is completely open source for that reason. And, yeah, like, I think if you're building really complex web scraping workflows, I don't know if Stagehand is the tool for you. I think it's really more if you're building an AI agent that needs a few general tools or if it's doing a lot of, like, web automation-intensive work. But if you're building a scraping company, Stagehand is not your thing. You probably want something that's going to, like, get HTML content, you know, convert that to Markdown, query it. That's not what Stagehand does. Stagehand is more about reliability. I think we focus a lot on reliability and less so on cost optimization and speed at this point.

swyx [00:37:33]: I actually feel like Stagehand, so the way that Stagehand works, it's like, you know, page.act, click on the quick start. Yeah. It's kind of the integration test for the code that you would have to write anyway, like the Puppeteer code that you have to write anyway. And when the page structure changes, because it always does, then this is still the test. This is still the test that I would have to write. Yeah. So it's kind of like a testing framework that doesn't need implementation detail.

Paul [00:37:56]: Well, yeah. I mean, Puppeteer, Playwright, and Selenium were all designed as testing frameworks, right? Yeah. And now people are, like, hacking them together to automate the web. I would say, and, like, maybe this is, like, me being too specific. But, like, when I write tests, if the page structure changes without me knowing, I want that test to fail.
So I don't know if, like, AI, like, regenerating that. Like, people are using Stagehand for testing. But it's more for, like, usability testing, not, like, testing of, like, does the front end, like, has it changed or not. Okay. But generally where we've seen people really take off is, if they want to build a feature in their application that's kind of like Operator or Deep Research, they're using Stagehand to kind of power that tool calling in their own agent loop. Okay. Cool.

swyx [00:38:37]: So let's go into Operator, the first big agent launch of the year from OpenAI. Seems like they have a whole bunch scheduled. You were on break and your phone blew up. What's your just general view of computer use agents, is what they're calling it? The overall category before we go into Open Operator, just the overall promise of Operator. I will observe that I tried it once. It was okay. And I never tried it again.

OpenAI's Operator and computer use agents

Paul [00:38:58]: That tracks with my experience, too. Like, I'm a huge fan of the OpenAI team. I do not view Operator as a company killer for Browserbase at all. I think it actually shows people what's possible. I think, like, computer use models make a lot of sense. And what I'm actually most excited about with computer use models is, like, their ability to really take screenshots, reason, and output steps. I think that using mouse click or mouse coordinates, I've seen that prove to be less reliable than I would like. And I just wonder if that's the right form factor. What we've done with our framework is anchor it to the DOM itself, anchor it to the actual item. So, like, if it's clicking on something, it's clicking on that thing, you know? Like, it's more accurate. No matter where it is. Yeah, exactly. Because it really ties in nicely.
And it can handle, like, the whole viewport in one go, whereas, like, Operator can only handle what it sees. Can you hover? Is hovering a thing that you can do? I don't know if we expose it as a tool directly, but I'm sure there's, like, an API for hovering. Like, move mouse to this position. Yeah, yeah, yeah. I think you can trigger hover, like, via the JavaScript on the DOM itself. But, no, I think, like, when we saw computer use, everyone's eyes lit up because they realized, like, wow, AI is going to actually automate work for people. And I think seeing that kind of happen from both of the labs, and I'm sure we're going to see more labs launch computer use models, I'm excited to see all the stuff that people build with it. I'd love to see computer use power, like, controlling a browser on Browserbase. And I think, like, Open Operator, which was, like, our open source version of OpenAI's Operator, was our first take on, like, how can we integrate these models into Browserbase? And we handle the infrastructure and let the labs do the models. I don't have a sense that Operator will be released as an API. I don't know. Maybe it will. I'm curious to see how well that works because I think it's going to be really hard for a company like OpenAI to do things like support CAPTCHA solving or, like, have proxies. Like, I think it's hard for them structurally. Imagine this New York Times headline: OpenAI CAPTCHA solving. Like, that would be a pretty bad headline. This New York Times headline: Browserbase solves CAPTCHAs. No one cares. No one cares. And, like, our investors are bored. Like, we're all okay with this, you know? We're building this company knowing that the CAPTCHA solving is short-lived until we figure out how to authenticate good bots. I think it's really hard for a company like OpenAI, who has this brand that's so, so good, to balance with, like, the icky parts of web automation, which can be kind of complex to solve.
I'm sure OpenAI knows who to call whenever they need you. Yeah, right. I'm sure they'll have a great partnership.

Alessio [00:41:23]: And is Open Operator just, like, a marketing thing for you? Like, how do you think about resource allocation? So, you can spin this up very quickly. And now there's all this, like, open deep research, just open all these things that people are building. We started it, you know. You're the original Open. We're the original Open Operator, you know? Is it just, hey, look, this is a demo, but, like, we'll help you build out an actual product for yourself? Like, are you interested in going more of a product route? That's kind of the OpenAI way, right? They started as a model provider and then…

Paul [00:41:53]: Yeah, we're not interested in going the product route yet. I view Open Operator as a reference project, you know? Let's show people how to build these things using the infrastructure and models that are out there. And that's what it is. It's, like, Open Operator is very simple. It's an agent loop. It says, like, take a high-level goal, break it down into steps, use tool calling to accomplish those steps. It takes screenshots and feeds those screenshots into an LLM with the step to generate the right action. It uses Stagehand under the hood to actually execute this action. It doesn't use a computer use model. And it, like, has a nice interface using the live view that we talked about, the iframe, to embed that into an application. So I felt like people on launch day wanted to figure out how to build their own version of this. And we turned that around really quickly to show them. And I hope we do that with other things like deep research. We don't have a deep research launch yet. I think David from Aomni actually has an amazing open deep research that he launched. It has, like, 10K GitHub stars now. So he's crushing that.
But I think if people want to build these features natively into their application, they need good reference projects. And I think Open Operator is a good example of that.

swyx [00:42:52]: I don't know. Actually, I'm pretty bullish on API-driven Operator. Because that's the only way that you can sort of, like, once it's reliable enough, obviously. And now we're nowhere near. But, like, give it five years. It'll happen, you know. And then you can sort of spin this up and browsers are working in the background and you don't necessarily have to know. And it just is booking restaurants for you, whatever. I can definitely see that future happening. I had this on the landing page here. This might be slightly out of order. But, you know, you have, like, sort of three use cases for Browserbase. Open Operator. Or this is the Operator sort of use case. It's kind of like the workflow automation use case. And it competes with UiPath in the sort of RPA category. Would you agree with that? Yeah, I would agree with that. And then there's Agents we talked about already. And web scraping, which I imagine would be the bulk of your workload right now, right?

Paul [00:43:40]: No, not at all. I'd say actually, like, the majority is browser automation. We're kind of expensive for web scraping. Like, I think that if you're building a web scraping product, if you need to do occasional web scraping or you have to do web scraping that works every single time, you want to use browser automation. Yeah. You want to use Browserbase. But if you're building web scraping workflows, what you should do is have a waterfall. You should have the first request be a curl to the website. See if you can get it without even using a browser. And then the second request may be, like, a scraping-specific API. There's, like, a thousand scraping APIs out there that you can use to try and get data. ScrapingBee. ScrapingBee is a great example, right? Yeah.
And then, like, if those two don't work, bring out the heavy hitter. Like, Browserbase will 100% work, right? It will load the page in a real browser, hydrate it. I see.

swyx [00:44:21]: Because a lot of people don't render to JS.

swyx [00:44:25]: Yeah, exactly.

Paul [00:44:26]: So, I mean, the three big use cases, right? Like, you know, automation, web data collection, and then, you know, if you're building anything agentic that needs, like, a browser tool, you want to use Browserbase.

Alessio [00:44:35]: Is there any use case that, like, you were super surprised by that people might not even think about? Oh, yeah. Or is it, yeah, anything that you can share? The long tail is crazy. Yeah.

Surprising use cases of Browserbase

Paul [00:44:44]: One of the case studies on our website that I think is the most interesting is this company called Benny. So, the way that it works is if you're on food stamps in the United States, you can actually get rebates if you buy certain things. Yeah. You buy some vegetables. You submit your receipt to the government. They'll give you a little rebate back. Say, hey, thanks for buying vegetables. It's good for you. That process of submitting that receipt is very painful. And the way Benny works is you use their app to take a photo of your receipt, and then Benny will go submit that receipt for you and then deposit the money into your account. That's actually using no AI at all. It's all, like, hard-coded scripts. They maintain the scripts. They've been doing a great job. And they built this amazing consumer app. But it's an example of, like, all these tedious workflows that people have to do to kind of go about their business. And they're doing it for the sake of their day-to-day lives. And I had never known about, like, food stamp rebates or the complex forms you have to fill out to get them. But the world is powered by millions and millions of tedious forms, visas. You know, Emirate Lighthouse is a customer, right?
You know, they do the O-1 visa. Millions and millions of forms are taking away humans' time. And I hope that Browserbase can help power software that automates away the web forms that we don't need anymore. Yeah.

swyx [00:45:49]: I mean, I'm very supportive of that. I mean, forms. I do think, like, government itself is a big part of it. I think the government itself should embrace AI more to do more sort of human-friendly form filling. Mm-hmm. But I'm not optimistic. I'm not holding my breath. Yeah. We'll see. Okay. I think I'm about to zoom out. I have a little brief thing on computer use, and then we can talk about founder stuff, which is, I tend to think of developer tooling markets in impossible triangles, where everyone starts in a niche, and then they start to branch out. So I already hinted at a little bit of this, right? We mentioned Morph. We mentioned E2B. We mentioned Firecrawl. And then there's Browserbase. So there's, like, all this stuff of, like, have a serverless virtual computer that you give to an agent and let them do stuff with it. And there's various ways of connecting it to the internet. You can just connect to a search API, like SerpAPI, whatever other, like, Exa is another one. That's what you're searching. You can also have a JSON markdown extractor, which is Firecrawl. Or you can have a virtual browser like Browserbase, or you can have a virtual machine like Morph. And then there's also maybe, like, a virtual sort of code environment, like Code Interpreter. So, like, there's just, like, a bunch of different ways to tackle the problem of giving a computer to an agent. And I'm just kind of wondering if you see, like, everyone's just, like, happily coexisting in their respective niches. And as a developer, I just go and pick, like, a shopping basket of one of each. Or do you think that eventually people will collide?

Future of browser automation and market competition

Paul [00:47:18]: I think that currently it's not a zero-sum market.
Like, I think we're talking about... I think we're talking about all of knowledge work that people do that can be automated online. All of these, like, trillions of hours that happen online where people are working. And I think that there's so much software to be built that, like, I tend not to think about how these companies will collide. I just try to solve the problem as best as I can and make this specific piece of infrastructure, which I think is an important primitive, the best I possibly can. And yeah. I think there's players that are actually going to like it. I think there's players that are going to launch, like, over-the-top, you know, platforms, like agent platforms that have all these tools built in, right? Like, who's building the rippling for agent tools that has the search tool, the browser tool, the operating system tool, right? There are some. There are some. There are some, right? And I think in the end, what I have seen as my time as a developer, and I look at all the favorite tools that I have, is that, like, for tools and primitives with sufficient levels of complexity, you need to have a solution that's really bespoke to that primitive, you know? And I am sufficiently convinced that the browser is complex enough to deserve a primitive. Obviously, I have to. I'm the founder of BrowserBase, right? I'm talking my book. But, like, I think maybe I can give you one spicy take against, like, maybe just whole OS running. I think that when I look at computer use when it first came out, I saw that the majority of use cases for computer use were controlling a browser. And do we really need to run an entire operating system just to control a browser? I don't think so. I don't think that's necessary. You know, BrowserBase can run browsers for way cheaper than you can if you're running a full-fledged OS with a GUI, you know, operating system. And I think that's just an advantage of the browser. 
It is, like, browsers are little OSs, and you can run them very efficiently if you orchestrate it well. And I think that allows us to offer 90% of the, you know, functionality in the platform needed at 10% of the cost of running a full OS. Yeah.
Open Operator: Browserbase's Open-Source Alternative
swyx [00:49:16]: I definitely see the logic in that. There's a Marc Andreessen quote. I don't know if you know this one. Where he basically observed that the browser is turning the operating system into a poorly debugged set of device drivers, because most of the apps are moved from the OS to the browser. So you can just run browsers.
Paul [00:49:31]: There's a place for OSs, too. Like, I think that there are some applications that only run on Windows operating systems. And Eric from pig.dev in this upcoming YC batch, or last YC batch, like, he's building a way to run tons of Windows operating systems for you to control with your agent. And like, there's some legacy EHR systems that only run on Internet-controlled systems. Yeah.
Paul [00:49:54]: I think that's it. I think, like, there are use cases for specific operating systems for specific legacy software. And like, I'm excited to see what he does with that. I just wanted to give a shout out to the pig.dev website.
swyx [00:50:06]: The pigs jump when you click on them. Yeah. That's great.
Paul [00:50:08]: Eric, he's the former co-founder of banana.dev, too.
swyx [00:50:11]: Oh, that Eric. Yeah. That Eric. Okay. Well, he abandoned bananas for pigs. I hope he doesn't start going around with pigs now.
Alessio [00:50:18]: Like he was going around with bananas. A little toy pig. Yeah. Yeah. I love that. What else are we missing? I think we covered a lot of, like, the Browserbase product history, but. What do you wish people asked you? Yeah.
Paul [00:50:29]: I wish people asked me more about, like, what will the future of software look like? Because I think that's really where I've spent a lot of time about why do Browserbase. 
Like, for me, starting a company is like a means of last resort. Like, you shouldn't start a company unless you absolutely have to. And I remain convinced that the future of software is software that you're going to click a button and it's going to do stuff on your behalf. Right now, software. You click a button and it maybe, like, calls it back an API and, like, computes some numbers. It, like, modifies some text, whatever. But the future of software is software using software. So, I may log into my accounting website for my business, click a button, and it's going to go load up my Gmail, search my emails, find the thing, upload the receipt, and then comment it for me. Right? And it may use it using APIs, maybe a browser. I don't know. I think it's a little bit of both. But that's completely different from how we've built software so far. And that's. I think that future of software has different infrastructure requirements. It's going to require different UIs. It's going to require different pieces of infrastructure. I think the browser infrastructure is one piece that fits into that, along with all the other categories you mentioned. So, I think that it's going to require developers to think differently about how they've built software for, you know
We compiled our favorite clips on developer tools and developer experience (DevX). We discuss why DevX has become essential for developer-focused companies and how it drives adoption to grow your product. Learn what makes developers a unique and discerning customer base, and hear practical strategies for designing exceptional tools and platforms. Our guests also share lessons learned from their own experiences, whether in creating frictionless integrations, maintaining a strong feedback culture, or enabling internal platform adoption. Through compelling stories and actionable advice, this episode is packed with lessons on how to build products that developers love.
Playlist of Full Episodes from This Compilation: https://www.youtube.com/playlist?list=PL31JETR9AR0FV-46VR4G_n6xi4WdXEx-2
Inside the episode...
* The importance of developer experience and why it's a priority for developer-facing companies.
* Key differences between building developer tools and end-user applications.
* How DevX differs from DevRel and the synergy between the two.
* Metrics for measuring the success of developer tools: adoption, satisfaction, and revenue.
* Insights into abstraction ladders and balancing complexity and power.
* Customer research strategies for validating assumptions and prioritizing features.
* Stripe's culture of craftsmanship and creating “surprisingly great” experiences.
* The importance of dogfooding and feedback loops in building trusted platforms.
* Balancing enablement and avoiding gatekeeping in internal platform adoption.
* Maintaining consistency and quality across APIs, CLIs, and other resources.
Mentioned in this episode
* Stripe
* Doppler
* Heroku
* Abstraction ladders
* Developer feedback loops
Unlock the full potential of your product team with Integral's player coaches, experts in lean, human-centered design. Visit integral.io/convergence for a free Product Success Lab workshop to gain clarity and confidence in tackling any product design or engineering challenge. 
Subscribe to the Convergence podcast wherever you get podcasts including video episodes to get updated on the other crucial conversations that we'll post on YouTube at youtube.com/@convergencefmpodcast Learn something? Give us a 5 star review and like the podcast on YouTube. It's how we grow. Follow the Pod Linkedin: https://www.linkedin.com/company/convergence-podcast/ X: https://twitter.com/podconvergence Instagram: @podconvergence
With the number of libraries available to Go developers these days, you'd think building a CLI app was now a trivial matter. But like many things in software development, it depends. In this episode, we explore the challenges that arose during one team's journey towards a production-ready CLI.
#278: In today's tech landscape, developers often find themselves caught in the middle of a debate that never seems to age: GUI or CLI? While the tools and interfaces we use may evolve, the core question remains. How do we balance the efficiency and familiarity of graphical user interfaces (GUIs) with the raw power and flexibility of command-line interfaces (CLIs)? In this episode, Darin and Viktor discuss a blog post by Ian Miell titled In Praise of Low Tech DevEx. In Praise of Low Tech DevEx https://blog.container-solutions.com/in-praise-of-low-tech-devex YouTube channel: https://youtube.com/devopsparadox Review the podcast on Apple Podcasts: https://www.devopsparadox.com/review-podcast/ Slack: https://www.devopsparadox.com/slack/ Connect with us at: https://www.devopsparadox.com/contact/
In life, we spend our time sticking labels on people, the way we loved sticking stickers in preschool when we were little. So-and-so is shy, so-and-so is messy, so-and-so is funny, and so on. But the most widespread label, and probably the hardest one to wear, is touchiness. It gives the impression that nobody can ever say anything to us, and in return, we no longer dare to raise our voice against the things that bother us. But what is a touchy person, really? In this episode, journalist Eloïse Renou tackles the thorny question of her own touchiness, drawing on the testimony of Hélina, who flies off the handle at the slightest remark. Neuropsychologist Harmony Duclos defines touchiness as a “social emotion” rather than a character trait. Together, they discuss self-confidence, oversized egos, the six basic emotions, and shifting the blame onto others. They ask how to stop suffering from this label and why we ourselves are so quick to stick it on the people close to us. To go further: The thesis “La démarche scientifique pour restaurer l'estime de soi : une expérimentation adaptée en CLIS” by Viviane François, on the CNRS DUMAS portal. The article “Whatever people say I am, that's what I am: Social labeling as a social marketing tool” by Gert Cornelissen, published in 2007 in the International Journal of Research in Marketing. The article “La théorie de l'étiquetage modifiée, ou l'« analyse stigmatique » revisitée” by Lionel Lacaze, published in 2008 in the Nouvelle revue de psychosociologie. The article “Quelques disqualifications. Le sentiment ou ressenti d'incompétence” by Héloïse de Visscher, published in 2013 in Les Cahiers Internationaux de Psychologie Sociale. Eloïse Renou recorded and wrote this episode. Sound production is by Renaud Wattine. 
The theme music is by Clémence Reliat, based on an excerpt from En Sommeil by Jaune. Lena Coutrot is the producer of Émotions. Follow Louie Media on Instagram, Facebook, Twitter. If you would also like to tell us your story, write to us by filling out this form. And if you would like to support Louie, feel free to subscribe to the Club. Hosted by Acast. Visit acast.com/privacy for more information.
This Friday we're doing a special crossover event in SF with SemiAnalysis (previous guest!), and we will do a live podcast on site. RSVP here. Also join us on June 25-27 for the biggest AI Engineer conference of the year!
Replicate is one of the most popular AI inference providers, reporting over 2 million users as of their $40m Series B with a16z. But how did they get there?
The Definitive Replicate Story (warts and all)
Their overnight success took 5 years of building, and it all started with arXiv Vanity, which was a 2017 vacation project that scrapes arXiv PDFs and re-renders them into semantic web pages that reflow nicely with better typography and whitespace. From there, Ben and Andreas' idea was to build tools to make ML research more robust and reproducible by making it easy to share code artefacts alongside papers. They had previously created Fig, which made it easy to spin up dev environments; it was eventually acquired by Docker and turned into `docker-compose`, the industry standard way to define services from containerized applications.
2019: Cog
The first iteration of Replicate was a Fig-equivalent for ML workloads which they called Cog; it made it easy for researchers to package all their work and share it with peers for review and reproducibility. But they found that researchers were terrible users: they'd do all this work for a paper, publish it, and then never return to it again.
“We talked to a bunch of researchers and they really wanted that.... But how the hell is this a business, you know, like how are we even going to make any money out of this? …So we went and talked to a bunch of companies trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers or say like the product manager can test out the machine learning model? They're like, maybe. Do you want like a deployment platform for deploying models? Do you want a central place for versioning models? We were trying to think of lots of different products we could sell that were related to this thing…So we then got halfway through our YC batch. We hadn't built a product. We had no users. We had no idea what our business was going to be because we couldn't get anybody to like buy something which didn't exist. And actually there was quite a way through our, I think it was like two thirds the way through our YC batch or something. And we're like, okay, well we're kind of screwed now because we don't have anything to show at demo day.”
The team graduated Y Combinator with no customers, no product and nothing to demo, which was fine because demo day got canceled as the YC W'20 class graduated right into the pandemic. The team spent the next year exploring and building Covid tools.
2021: CLIP + GAN = PixRay
In 2021, OpenAI released CLIP. Overnight dozens of Discord servers got spun up to hack on CLIP + GANs. Unlike academic researchers, this community was constantly releasing new checkpoints and builds of models. PixRay was one of the first models being built on Replicate, and it quickly started taking over the community. Chris Dixon has a famous 2010 post titled “The next big thing will start out looking like a toy”; image generation would have definitely felt like a toy in 2021, but it gave Replicate its initial boost.
2022: Stable Diffusion
In August 2022 Stable Diffusion came out, and all the work they had been doing to build this infrastructure for CLIP / GAN models became the best way for people to share their Stable Diffusion fine-tunes:
“And like the first week we saw people making animation models out of it. We saw people make game texture models that use circular convolutions to make repeatable textures. We saw a few weeks later, people were fine tuning it so you could put your face in these models and all of these other ways. […] So tons of product builders wanted to build stuff with it. And we were just sitting in there in the middle, as the interface layer between all these people who wanted to build, and all these machine learning experts who were building cool models. And that's really where it took off. Incredible supply, incredible demand, and we were just in the middle.”
(Stable Diffusion also spawned Latent Space as a newsletter)
The landing page paved the cowpath for the intense interest in diffusion model APIs.
2023: Llama & other multimodal LLMs
By 2023, Replicate's growing visibility in the Stable Diffusion indie hacker community came from top AI hackers like Pieter Levels and Danny Postma, each making millions off their AI apps.
Meta then released LLaMA 1 and 2 (our coverage of it), greatly pushing forward the SOTA open source model landscape. Demand for text LLMs and other modalities rose, and Replicate broadened its focus accordingly, culminating in an $18m Series A and $40m Series B from a16z (at a $350m valuation).
Building standards for the AI world
Now that the industry is evolving from toys to enterprise use cases, all these companies are working to set standards for their own space. We cover this at ~45 mins in the podcast. Some examples:
* LangChain has been trying to establish “chain” as the standard mental model when putting multiple prompts and models together, and the “LangChain Expression Language” to go with it. (Our episode with Harrison)
* LlamaHub for packaging RAG utilities. (Our episode with Jerry)
* Ollama's Modelfile to define runtimes for different model architectures. These are usually targeted at local inference.
* Cog (by Replicate) to create environments to which you can easily attach CUDA devices and make it easy to spin up inference on remote servers.
* GGUF as the file type for ggml-based executors.
None of them have really broken out yet, but this is going to become a fiercer competition as the market matures. 
Full Video Podcast
As a reminder, all Latent Space pods now come in full video on our YouTube, with bonus content that we cut for time!
Show Notes
* Ben Firshman
* Replicate
* Free $10 credit for Latent Space readers
* Andreas Jansson (Ben's co-founder)
* Charlie Holtz (Replicate's Hacker in Residence)
* Fig (now Docker Compose)
* Command Line Interface Guidelines (clig)
* Apple Human Interface Guidelines
* arXiv Vanity
* Open Interpreter
* PixRay
* SF Compute
* Big Sleep by Advadnoun
* VQGAN-CLIP by Rivers Have Wings
Timestamps
* [00:00:00] Introductions
* [00:01:17] Low latency is all you need
* [00:04:08] Evolution of CLIs
* [00:05:59] How building arXiv Vanity led to Replicate
* [00:11:37] Making ML research replicable with containers
* [00:17:22] Doing YC in 2020 and pivoting to tools for COVID
* [00:20:22] Launching the first version of Replicate
* [00:25:51] Embracing the generative image community
* [00:28:04] Getting reverse engineered into an API product
* [00:31:25] Growing to 2 million users
* [00:34:29] Indie vs Enterprise customers
* [00:37:09] How Unsplash uses Replicate
* [00:38:29] Learnings from Docker that went into Cog
* [00:45:25] Creating AI standards
* [00:50:05] Replicate's compute availability
* [00:53:55] Fixing GPU waste
* [01:00:39] What's open source AI?
* [01:04:46] Building for AI engineers
* [01:06:41] Hiring at Replicate
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.
Swyx [00:00:14]: Hey, and today we have Ben Firshman in the studio. Welcome Ben.
Ben [00:00:18]: Hey, good to be here.
Swyx [00:00:19]: Ben, you're a co-founder and CEO of Replicate. Before that, you were most notably founder of Fig, which became Docker Compose. 
You also did a couple of other things before that, but that's what a lot of people know you for. What should people know about you that, you know, outside of your, your sort of LinkedIn profile?Ben [00:00:35]: Yeah. Good question. I think I'm a builder and tinkerer, like in a very broad sense. And I love using my hands to make things. So like I work on, you know, things may be a bit closer to tech, like electronics. I also like build things out of wood and I like fix cars and I fix my bike and build bicycles and all this kind of stuff. And there's so much, I think I've learned from transferable skills, from just like working in the real world to building things, building things in software. And you know, it's so much about being a builder, both in real life and, and in software that crosses over.Swyx [00:01:11]: Is there a real world analogy that you use often when you're thinking about like a code architecture or problem?Ben [00:01:17]: I like to build software tools as if they were something real. So I wrote this thing called the command line interface guidelines, which was a bit like sort of the Mac human interface guidelines, but for command line interfaces, I did it with the guy I created Docker Compose with and a few other people. And I think something in there, I think I described that your command line interface should feel like a big iron machine where you pull a lever and it goes clunk and like things should respond within like 50 milliseconds as if it was like a real life thing. And like another analogy here is like in the real life, you know, when you press a button on an electronic device and it's like a soft switch and you press it and nothing happens and there's no physical feedback of anything happening, then like half a second later, something happens. 
Like that's how a lot of software feels, but instead like software should feel more like something that's real where you touch, you pull a physical lever and the physical lever moves, you know, and I've taken that lesson of kind of human interface to, to software a ton. You know, it's all about kind of low latency of feeling, things feeling really solid and robust, both the command lines and, and user interfaces as well.Swyx [00:02:22]: And how did you operationalize that for Fig or Docker?Ben [00:02:27]: A lot of it's just low latency. Actually, we didn't do it very well for Fig in the first place. We used Python, which was a big mistake where Python's really hard to get booting up fast because you have to load up the whole Python runtime before it can run anything. Okay. Go is much better at this where like Go just instantly starts.Swyx [00:02:45]: You have to be under 500 milliseconds to start up?Ben [00:02:48]: Yeah, effectively. I mean, I mean, you know, perception of human things being immediate is, you know, something like a hundred milliseconds. So anything like that is, is yeah, good enough.Swyx [00:02:57]: Yeah. Also, I should mention, since we're talking about your side projects, well, one thing is I am maybe one of a few fellow people who have actually written something about CLI design principles because I was in charge of the Netlify CLI back in the day and had many thoughts. One of my fun thoughts, I'll just share it in case you have thoughts, is I think CLIs are effectively starting points for scripts that are then run. And the moment one of the script's preconditions are not fulfilled, typically they end. So the CLI developer will just exit the program. And the way that I designed, I really wanted to create the Netlify dev workflow was for it to be kind of a state machine that would resolve itself. 
If it detected a precondition wasn't fulfilled, it would actually delegate to a subprogram that would then fulfill that precondition, asking for more info or waiting until a condition is fulfilled. Then it would go back to the original flow and continue that. I don't know if that was ever tried or is there a more formal definition of it? Because I just came up with it randomly. But it felt like the beginnings of AI in the sense that when you run a CLI command, you have an intent to do something and you may not have given the CLI all the things that it needs to do, to execute that intent. So that was my two cents.Ben [00:04:08]: Yeah, that reminds me of a thing we sort of thought about when writing the CLI guidelines, where CLIs were designed in a world where the CLI was really a programming environment and it's primarily designed for machines to use all of these commands and scripts. Whereas over time, the CLI has evolved to humans. It was back in a world where the primary way of using computers was writing shell scripts effectively. We've transitioned to a world where actually humans are using CLI programs much more than they used to. And the current sort of best practices about how Unix was designed, there's lots of design documents about Unix from the 70s and 80s, where they say things like, command line commands should not output anything on success. It should be completely silent, which makes sense if you're using it in a shell script. But if a user is using that, it just looks like it's broken. If you type copy and it just doesn't say anything, you assume that it didn't work as a new user. 
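Swyx's description above, of a CLI that resolves an unmet precondition and then resumes the original flow instead of exiting, can be sketched roughly as follows. This is only an illustrative sketch, not how the Netlify CLI is implemented: `Precondition`, `run_command`, `log_in`, `ask_for_site`, and `deploy` are all hypothetical names, and the resolvers fill in fixed values where a real tool would prompt the user or wait on some condition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Precondition:
    """A named check plus a sub-step that tries to satisfy it (hypothetical)."""
    name: str
    is_met: Callable[[State], bool]
    resolve: Callable[[State], None]


def run_command(state: State, preconditions: List[Precondition],
                action: Callable[[State], str], max_attempts: int = 3) -> str:
    """Run `action` once every precondition holds.

    Instead of exiting on the first unmet precondition, delegate to its
    resolver, re-check, and then continue with the original flow.
    """
    for pre in preconditions:
        attempts = 0
        while not pre.is_met(state):
            if attempts >= max_attempts:
                # Give up only after the resolver has had several chances.
                raise RuntimeError(f"could not satisfy precondition: {pre.name}")
            pre.resolve(state)
            attempts += 1
    return action(state)


def log_in(state: State) -> None:
    # A real CLI would open a login flow here; the sketch uses a fixed token.
    state["token"] = "demo-token"


def ask_for_site(state: State) -> None:
    # A real CLI would prompt the user; the sketch uses a fixed answer.
    state["site"] = "example-site"


PRECONDITIONS = [
    Precondition("logged in", lambda s: "token" in s, log_in),
    Precondition("site selected", lambda s: "site" in s, ask_for_site),
]


def deploy(state: State) -> str:
    return f"deployed {state['site']}"


if __name__ == "__main__":
    # Starts with an empty state; both preconditions get resolved on the way.
    print(run_command({}, PRECONDITIONS, deploy))
```

The `max_attempts` cap is a design choice of this sketch: a resolver that can never succeed ends in a clear error naming the precondition, rather than in an endless loop.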
I think what's really interesting about the CLI is that it's actually a really good, to your point, it's a really good user interface where it can be like a conversation, where it feels like you're, instead of just like you telling the computer to do this thing and either silently succeeding or saying, no, you failed, it can guide you in the right direction and tell you what your intent might be, and that kind of thing in a way that's actually, it's almost more natural to a CLI than it is in a graphical user interface because it feels like this back and forth with the computer, almost funnily like a language model. So I think there's some interesting intersection of CLIs and language models actually being very sort of closely related and a good fit for each other.
Swyx [00:05:59]: Yeah, I'll say one of the surprises from last year, I worked on a coding agent, but I think the most successful coding agent of my cohort was Open Interpreter, which was a CLI implementation. And I have chronically, even as a CLI person, I have chronically underestimated the CLI as a useful interface. You also developed arXiv Vanity, which you recently retired after a glorious seven years.
Ben [00:06:22]: Something like that.
Swyx [00:06:23]: Which is nice, I guess, HTML PDFs.
Ben [00:06:27]: Yeah, that was actually the start of where Replicate came from. Okay, we can tell that story. So when I quit Docker, I got really interested in science infrastructure, just as like a problem area, because it is like science has created so much progress in the world. The fact that we, you know, can talk to each other on a podcast and we use computers and the fact that we're alive is probably thanks to medical research, you know. But science is just like completely archaic and broken and it's like 19th century processes that just happen to be copied to the internet rather than take into account that, you know, we can transfer information at the speed of light now. 
And the whole way science is funded and all this kind of thing is all kind of very broken. And there's just so much potential for making science work better. And I realized that I wasn't a scientist and I didn't really have any time to go and get a PhD and become a researcher, but I'm a tool builder and I could make existing scientists better at their job. And if I could make like a bunch of scientists a little bit better at their job, maybe that's the kind of equivalent of being a researcher. So one particular thing I dialed in on is just how science is disseminated, in that it's all these PDFs, quite often behind paywalls, you know, on the internet.
Swyx [00:07:34]: And that's a whole thing because it's funded by national grants, government grants, then they're put behind paywalls. Yeah, exactly.
Ben [00:07:40]: That's like a whole, yeah, I could talk for hours about that. But the particular thing we got dialed in on was, interestingly, these PDFs are also, there's a bunch of open science that happens as well. So math, physics, computer science, machine learning, notably, is all published on arXiv, which is actually a surprisingly old institution.
Swyx [00:08:00]: Some random Cornell.
Ben [00:08:01]: Yeah, it was just like somebody in Cornell who started a mailing list in the 80s. And then when the web was invented, they built a web interface around it. Like it's super old.
Swyx [00:08:11]: And it's like kind of like a user group thing, right? That's why they're all these like numbers and stuff.
Ben [00:08:15]: Yeah, exactly. Like it's a bit like something, yeah. That's where all basically all of math, physics and computer science happens. But it's still PDFs published to this thing. Yeah, which is just so infuriating. The web was invented at CERN, a physics institution, to share academic writing. Like there are figure tags, there are like author tags, there are heading tags, there are cite tags. 
You know, hyperlinks are effectively citations because you want to link to another academic paper. But instead, you have to like copy and paste these things and try and get around paywalls. Like it's absurd, you know. And now we have like social media and things, but still like academic papers as PDFs, you know. This is not what the web was for. So anyway, I got really frustrated with that. And I went on vacation with my old friend Andreas. So we were, we used to work together in London on a startup, at somebody else's startup. And we were just on vacation in Greece for fun. And he was like trying to read a machine learning paper on his phone, you know, like we had to like zoom in and like scroll line by line on the PDF. And he was like, this is f*****g stupid. So I was like, I know, like this is something we discovered our mutual hatred for this, you know. And we spent our vacation sitting by the pool, like making LaTeX to HTML, like, converters, making the first version of arXiv Vanity. Anyway, that was then a whole thing. And the story, we shut it down recently because it caught the eye of arXiv. They were like, oh, this is great. We just haven't had the time to work on this. And what's tragic about arXiv, it's like this project of Cornell that's like, they can barely scrounge together enough money to survive. I think it might be better funded now than it was when we were, we were collaborating with them. And compared to these like scientific journals, it's just that this is actually where the work happens. But they just have a fraction of the money that like these big scientific journals have, which is just so tragic. But anyway, they were like, yeah, this is great. We can't afford to like do it, but do you want to like as a volunteer integrate arXiv Vanity into arXiv?
Swyx [00:10:05]: Oh, you did the work.
Ben [00:10:06]: We didn't do the work. We started doing the work. We did some. 
I think we worked on this for like a few months to actually get it integrated into arXiv. And then we got like distracted by Replicate. So a guy called Dan picked up the work and made it happen. Like somebody who works on one of the libraries that powers arXiv Vanity. Okay.
Swyx [00:10:26]: And the relationship with arXiv Sanity?
Ben [00:10:28]: None.
Swyx [00:10:30]: Did you predate them? I actually don't know the lineage.
Ben [00:10:32]: We were after, we were both users of arXiv Sanity, which is like a sort of arXiv...
Ben [00:10:37]: Which is Andrej's RecSys on top of arXiv.
Ben [00:10:40]: Yeah. Yeah. And we were both users of that. And I think we were trying to come up with a working name for arXiv and Andreas just like cracked a joke of like, oh, let's call it arXiv Vanity. Let's make the papers look nice. Yeah. Yeah. And that was the working name and it just stuck.
Swyx [00:10:52]: Got it.
Ben [00:10:53]: Got it.
Alessio [00:10:54]: Yeah. And then from there, tell us more about why you got distracted, right? So Replicate, maybe it feels like an overnight success to a lot of people, but you've been building this since 2019. Yeah.
Ben [00:11:04]: So what prompted the start?
Alessio [00:11:05]: And we've been collaborating for even longer.
Ben [00:11:07]: So we created arXiv Vanity in 2017. So in some sense, we've been doing this almost like six, seven years now, a classic seven year.
Swyx [00:11:16]: Overnight success.
Ben [00:11:17]: Yeah. Yes. We did arXiv Vanity and then worked on a bunch of like surrounding projects. I was still like really interested in science publishing at that point. And I'm trying to remember, because I tell a lot of like the condensed story to people because I can't really tell like a seven year history. So I'm trying to figure out like the right. Oh, we got room. 
The right length.
Swyx [00:11:35]: We want to nail the definitive Replicate story here.
Ben [00:11:37]: One thing that's really interesting about these machine learning papers is that these machine learning papers are published on arXiv and a lot of them are actual fundamental research. So like should be like prose describing a theory. But a lot of them are just running pieces of software that like a machine learning researcher made that did something, you know, it was like an image classification model or something. And they managed to make an image classification model that was better than the existing state of the art. And they've made an actual running piece of software that does image segmentation. And then what they had to do is they then had to take that piece of software and write it up as prose and math in a PDF. And what's frustrating about that is like if you want to. So Andreas was a machine learning engineer at Spotify. And some of his job was like he did pure research as well. Like he did a PhD and he was doing a lot of stuff internally. But part of his job was also being an engineer and taking some of these existing things that people have made and published and trying to apply them to actual problems at Spotify. And he was like, you know, you get given a paper which like describes roughly how the model works. It's probably missing lots of crucial information. There's sometimes code on GitHub. More and more there's code on GitHub. But back then it was kind of relatively rare. But it's quite often just like scrappy research code and didn't actually run. And, you know, there was maybe the weights that were on Google Drive, but they accidentally deleted the weights off Google Drive, you know, and it was like really hard to like take this stuff and actually use it for real things. We just started talking together about like his problems at Spotify and I connected this back to my work at Docker as well.
I was like, oh, this is what we created containers for. You know, we solved this problem for normal software by putting the thing inside a container so you could ship it around and it kept on running. So we were sort of hypothesizing about like, hmm, what if we put machine learning models inside containers so they could actually be shipped around and they could be defined in like some production ready formats and other researchers could run them to generate baselines and people who wanted to actually apply them to real problems in the world could just pick up the container and run it, you know. And normally in this part of the story I skip forward to be like, and then we created Cog, this container stuff for machine learning models, and we created Replicate, the place for people to publish these machine learning models. But there's actually like two or three years between that. The thing we then got dialed into was Andreas was like, what if there was a CI system for machine learning? It's like one of the things he really struggled with as a researcher is generating baselines. So when like he's writing a paper, he needs to like get like five other models that are existing work and get them running.
Swyx [00:14:21]: On the same evals.
Ben [00:14:22]: Exactly, on the same evals so you can compare apples to apples because you can't trust the numbers in the paper.
Swyx [00:14:26]: So you can be Google and just publish them anyway.
Ben [00:14:31]: So I think this was coming from the thinking of like there should be containers for machine learning, but why are people going to use that? Okay, maybe we can create a supply of containers by like creating this useful tool for researchers.
And the useful tool was like, let's get researchers to package up their models and push them to the central place where we run a standard set of benchmarks across the models so that you can trust those results and you can compare these models apples to apples and for like a researcher for Andreas, like doing a new piece of research, he could trust those numbers and he could like pull down those models, confirm it on his machine, use the standard benchmark to then measure his model and you know, all this kind of stuff. And so we started building that. That's what we applied to YC with, got into YC and we started sort of building a prototype of this. And then this is like where it all starts to fall apart. We were like, okay, that sounds great. And we talked to a bunch of researchers and they really wanted that and that sounds brilliant. That's a great way to create a supply of like models on this research platform. But how the hell is this a business, you know, like how are we even going to make any money out of this? And we're like, oh s**t, that's like the, that's the real unknown here of like what the business is. So we thought it would be a really good idea to like, okay, before we get too deep into this, let's try and like reduce the risk of this turning into a business. So let's try and like research what the business could be for this research tool effectively. So we went and talked to a bunch of companies trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers or say like the product manager can test out the machine learning model? They're like, maybe. And we were like, do you want like a deployment platform for deploying models? Like, do you want like a central place for versioning models? Like we're trying to think of like lots of different like products we could sell that were like related to this thing. And terrible idea. 
Like we're not sales people and like people don't want to buy something that doesn't exist. I think some people can pull this off, but we were just like, you know, a bunch of product people, product and engineering people, and we just like couldn't pull this off. So we then got halfway through our YC batch. We hadn't built a product. We had no users. We had no idea what our business was going to be because we couldn't get anybody to like buy something which didn't exist. And actually we were quite a way through, I think it was like two thirds the way through our YC batch or something. And we're like, okay, well we're kind of screwed now because we don't have anything to show at demo day. And then we then like tried to figure out, okay, what can we build in like two weeks that'll be something. So we like desperately tried to, I can't remember what we tried to build at that point. And then two weeks before demo day, I just remember, we were going down to Mountain View every week for dinners and we got called on to like an all hands Zoom call, which was super weird. We're like, what's going on? And they were like, don't come to dinner tomorrow. And we realized, we kind of looked at the news and we were like, oh, there's a pandemic going on. We were like so deep in our startup. We were just like completely oblivious to what was going on around us.
Swyx [00:17:20]: Was this Jan or Feb 2020?
Ben [00:17:22]: This was March 2020. March 2020. 2020.
Swyx [00:17:25]: Yeah. Because I remember Silicon Valley at the time was early to COVID. Like they started locking down a lot faster than the rest of the US.
Ben [00:17:32]: Yeah, exactly. And I remember, yeah, soon after that, like there was the San Francisco lockdowns and then like the YC batch just like stopped. There wasn't demo day and it was in a sense a blessing for us because we just kind of...
Swyx [00:17:43]: In the normal course of events, you're actually allowed to defer to a future demo day.
Yeah.
Ben [00:17:51]: So we didn't even take any defer because it just kind of didn't happen.
Swyx [00:17:55]: So was YC helpful?
Ben [00:17:57]: Yes. We completely screwed up the batch and that was our fault. I think where YC has become incredibly valuable for us has been after YC. I think there was a reason why we didn't need to do YC to start with because we were quite experienced. We had done some startups before. We were kind of well connected with VCs, you know, it was relatively easy to raise money because we were like a known quantity. You know, if you go to a VC and be like, Hey, I made this piece of-
Swyx [00:18:24]: It's Docker Compose for AI.
Ben [00:18:26]: Exactly. Yeah. And like, you know, people can pattern match like that and they can have some trust that you know what you're doing. Whereas it's much harder for people straight out of college and that's where like YC's sweet spot is, like helping people straight out of college who are super promising, like figure out how to do that.
Swyx [00:18:40]: No credentials.
Ben [00:18:41]: Yeah, exactly. We don't need that. But the thing that's been incredibly useful for us since YC has been, this was actually, I think, so Docker was a YC company and Solomon, the founder of Docker, I think told me this. He was like, a lot of people underestimate the value of YC after you finish the batch. And his biggest regret was like not staying in touch with YC. I might be misattributing this, but I think it was him. And so we made a point of that. And we just stayed in touch with our batch partner, Jared at YC, who has been fantastic.
Ben [00:19:10]: Jared Friedman. All of like the team at YC, there was the growth team at YC when they were still there and they've been super helpful. And two things have been super helpful about that is like raising money, like they just know exactly how to raise money.
And they've been super helpful during that process in all of our rounds, like we've done three rounds since we did YC and they've been super helpful during the whole process. And also just like reaching a ton of customers. So like the magic of YC is that you have all of, like there's thousands of YC companies, I think, on the order of thousands, I think. And they're all of your first customers. And they're like super helpful, super receptive, really want to like try out new things. You have like a warm intro to every one of them basically. And there's this mailing list where you can post about updates to your products, which is like really receptive. And that's just been fantastic for us. Like we've just like got so many of our users and customers through YC. Yeah.
Swyx [00:20:00]: Well, so the classic criticism or the sort of, you know, pushback is people don't buy you because you are both from YC. But at least they'll open the email. Right. Like that's the... Okay.
Ben [00:20:13]: Yeah. Yeah. Yeah.
Swyx [00:20:16]: So that's been a really, really positive experience for us. And sorry, I interrupted with the YC question. Like you were, you make it, you just made it out of the YC, survived the pandemic.
Ben [00:20:22]: I'll try and condense this a little bit. Then we started building tools for COVID weirdly. We were like, okay, we don't have a startup. We haven't figured out anything. What's the most useful thing we could be doing right now?
Swyx [00:20:32]: Save lives.
Ben [00:20:33]: So yeah. Let's try and save lives. I think we failed at that as well. We had a bunch of products that didn't really go anywhere. We kind of worked on, yeah, a bunch of stuff like contact tracing, which turned out not to really be a useful thing. Andreas worked on like a DoorDash for like people delivering food to people who are vulnerable. What else did we do? The meta problem of like helping people direct their efforts to what was most useful and a few other things like that.
It didn't really go anywhere. So we're like, okay, this is not really working either. We were considering actually just like doing like work for COVID. We have this decision document early on in our company, which is like, should we become a like government app contracting shop? We decided no.
Swyx [00:21:11]: Because you also did work for the gov.uk. Yeah, exactly.
Ben [00:21:14]: We had experience like doing some like-
Swyx [00:21:17]: And the Guardian and all that.
Ben [00:21:18]: Yeah. For like government stuff. And we were just like really good at building stuff. Like we were just like product people. Like I was like the front end product side and Andreas was the back end side. So we were just like a product. And we were working with a designer at the time, a guy called Mark, who did our early designs for Replicate. And we were like, hey, what if we just team up and like become and build stuff? And yeah, we gave up on that in the end for, I can't remember the details. So we went back to machine learning. And then we were like, well, we're not really sure if this is going to work. And one of my most painful experiences from previous startups is shutting them down. Like when you realize it's not really working and having to shut it down, it's like a ton of work and it's people hate you and it's just sort of, you know. So we were like, how can we make something we don't have to shut down? And even better, how can we make something that won't page us in the middle of the night? So we made an open source project. We made a thing which was an open source Weights and Biases, because we had this theory that like people want open source tools. There should be like an open source, like version control, experiment tracking like thing. And it was intuitive to us and we're like, oh, we're software developers and we like command line tools. Like everyone loves command line tools and open source stuff, but machine learning researchers just really didn't care.
Like they just wanted to click on buttons. They didn't mind that it was a cloud service. It was all very visual as well, that you need lots of graphs and charts and stuff like this. So it wasn't right. Like it was right. We actually were building something that Andreas made at Spotify for just like saving experiments to cloud storage automatically, but other people didn't really want this. So we kind of gave up on that. And then that was actually originally called Replicate and we renamed that out of the way. So it's now called Keepsake and I think some people still use it. Then we sort of came back, we looped back to our original idea. So we were like, oh, maybe there was a thing in that thing we were originally sort of thinking about of like researchers sharing their work and containers for machine learning models. So we just built that. And at that point we were kind of running out of the YC money. So we were like, okay, this like feels good though. Let's like give this a shot. So that was the point we raised a seed round. We raised seed round. Pre-launch. We raised pre-launch and pre-team. It was an idea basically. We had a little prototype. It was just an idea and a team. But we were like, okay, like, you know, bootstrapping this thing is getting hard. So let's actually raise some money. Then we made Cog and Replicate. It initially didn't have APIs, interestingly. It was just the bit that I was talking about before of helping researchers share their work. So it was a way for researchers to put their work on a webpage such that other people could try it out and so that you could download the Docker container. We cut the benchmarks thing of it because we thought that was just like too complicated. But it had a Docker container that like, you know, Andreas in a past life could download and run with his benchmark and you could compare all these models apples to apples. So that was like the theory behind it. That kind of started to work. 
It was like still when like, you know, it was long time pre-AI hype and there was lots of interesting stuff going on, but it was very much in like the classic deep learning era. So sort of image segmentation models and sentiment analysis and all these kinds of things, you know, that people were using deep learning models for. And we were very much building for research because all of this stuff was happening in research institutions, you know, the sort of people who'd be publishing to arXiv. So we were creating accompanying material for their models, basically, you know, they wanted a demo for their models and we were creating accompanying material for it. What was funny about that is they were like not very good users. Like they were doing great work obviously, but the way that research worked is that they just made like one thing every six months and they just fired and forgot it. Like they published this piece of paper and like, done, I've published it. So they like uploaded it to Replicate and then they just stopped using Replicate. You know, they were like once-every-six-months users and that wasn't great for us, but we stumbled across this early community. This was early 2021 when OpenAI created CLIP and people started smushing CLIP and GANs together to produce image generation models. And this started with, you know, it was just a bunch of like tinkerers on Discord, basically. There was an early model called Big Sleep by Advadnoun. And then there was VQGAN+CLIP, which was like a bit more popular, by Rivers Have Wings. And it was all just people like tinkering on stuff in Colabs and it was very dynamic and it was people just making copies of Colabs and playing around with things and forking.
And to me this, I saw this and I was like, oh, this feels like open source software, like so much more than the research world where like people are publishing these papers.
Swyx [00:25:48]: You don't know their real names and it's just like a Discord.
Ben [00:25:51]: Yeah, exactly. But crucially, it was like people were tinkering and forking and things were moving really fast and it just felt like this creative, dynamic, collaborative community in a way that research wasn't really, like it was still stuck in this kind of six month publication cycle. So we just kind of latched onto that and started building for this community. And you know, a lot of those early models were published on Replicate. I think the first one that was really primarily on Replicate was one called Pixray, which was sort of mid 2021 and it had a really cool like pixel art output, but it also just like produced general, you know, they weren't like crisp images, but they were quite aesthetically pleasing, like some of these early image generation models. And you know, that was like published primarily on Replicate and then a few other models around that were like published on Replicate. And that's where we really started to find our early community and like where we really found like, oh, we've actually built a thing that people want and they were great users as well. And people really want to try out these models. Lots of people were like running the models on Replicate. We still didn't have APIs though, interestingly, and this is like another like really complicated part of the story. We had no idea what a business model was still at this point. I don't think people could even pay for it. You know, it was just like these web forms where people could run the model.
Swyx [00:27:06]: Just for historical interest, which discords were they and how did you find them? Was this the LAION Discord? Yeah, LAION. This is Eleuther.
Ben [00:27:12]: Eleuther, yeah. It was the Eleuther one.
These two, right? There was a channel where VQGAN+CLIP, this was early 2021, where VQGAN+CLIP was set up as a Discord bot. I just remember being completely just like captivated by this thing. I was just like playing around with it all afternoon and like the sort of thing. In Discord. Oh s**t, it's 2am. You know, yeah.
Swyx [00:27:33]: This is the beginnings of Midjourney.
Ben [00:27:34]: Yeah, exactly. And Stability. It was the start of Midjourney. And you know, it's where that kind of user interface came from. Like what's beautiful about the user interface is like you could see what other people are doing. And you could riff off other people's ideas. And it was just so much fun to just like play around with this in like a channel full of a hundred people. And yeah, that just like completely captivated me and I'm like, okay, this is something, you know. So like we should get these things on Replicate. Yeah, that's where that all came from.
Swyx [00:28:00]: And then you moved on to, so was it APIs next or was it Stable Diffusion next?
Ben [00:28:04]: It was APIs next. And the APIs happened because one of our users, our web form had like an internal API for making the web form work, like with an API that was called from JavaScript. And somebody like reverse engineered that to start generating images with a script. You know, they did like, you know, Web Inspector, Copy as cURL, like figured out what the API request was. And it wasn't secured or anything.
Swyx [00:28:28]: Of course not.
Ben [00:28:29]: They started generating a bunch of images and like we got tons of traffic and we were like, what's going on? And I think like a sort of usual reaction to that would be like, hey, you're abusing our API and to shut them down. And instead we're like, oh, this is interesting. Like people want to run these models. So we documented the API in a Notion document, like our internal API in a Notion document and like messaged this person being like, hey, you seem to have found our API.
Here's the documentation. That'll be like a thousand bucks a month, please, with a Stripe form, like we just clicked some buttons to make. And they were like, sure, that sounds great. So that was our first customer.
Swyx [00:29:05]: A thousand bucks a month.
Ben [00:29:07]: It was a surprising amount of money. That's not casual. It was on the order of a thousand bucks a month.
Swyx [00:29:11]: So was it a business?
Ben [00:29:13]: It was the creator of Pixray. Like it was, he generated NFT art. And so he like made a bunch of art with these models and was, you know, selling these NFTs effectively. And I think lots of people in his community were doing similar things. And like he then referred us to other people who were also generating NFTs and he joined us with models. We started our API business. Yeah. Then we like made an official API and actually like added some billing to it. So it wasn't just like a fixed fee.
Swyx [00:29:40]: And now people think of you as the hosted models API business. Yeah, exactly.
Ben [00:29:44]: But that just turned out to be our business, you know, but what ended up being beautiful about this is it was really fulfilling. Like the original goal of what we wanted to do is that we wanted to make this research that people were making accessible to like other people and for it to be used in the real world. And this was like the just like ultimately the right way to do it because all of these people making these generative models could publish them to Replicate and they wanted a place to publish it. And software engineers, you know, like myself, like I'm not a machine learning expert, but I want to use this stuff, could just run these models with a single line of code. And we thought, oh, maybe the Docker image is enough, but it's actually super hard to get the Docker image running on a GPU and stuff. So it really needed to be the hosted API for this to work and to make it accessible to software engineers. And we just like wound our way to this.
Yeah.
Swyx [00:30:30]: Two years to the first paying customer. Yeah, exactly.
Alessio [00:30:33]: Did you ever think about becoming Midjourney during that time? You have like so much interest in image generation.
Swyx [00:30:38]: I mean, you're doing fine for the record, but, you know, it was right there, you were playing with it.
Ben [00:30:46]: I don't think it was our expertise. Like I think our expertise was DevTools rather than like Midjourney is almost like a consumer product, you know? Yeah. So I don't think it was our expertise. It certainly occurred to us. I think at the time we were thinking about like, oh, maybe we could hire some of these people in this community and make great models and stuff like this. But we ended up more being at the tooling. Like I think like before I was saying, like I'm not really a researcher, but I'm more like the tool builder, the behind the scenes. And I think both me and Andreas are like that.
Swyx [00:31:09]: I think this is an illustration of the tool builder philosophy. Something where you latch on to in DevTools, which is when you see people behaving weird, it's not their fault, it's yours. And you want to pave the cow paths is what they say, right? Like the unofficial paths that people are making, like make it official and make it easy for them and then maybe charge a bit of money.
Alessio [00:31:25]: And now fast forward a couple of years, you have 2 million developers using Replicate. Maybe more. That was the last public number that I found.
Ben [00:31:33]: It's 2 million users. Not all those people are developers, but a lot of them are developers, yeah.
Alessio [00:31:38]: And then 30,000 paying customers was the number. Latent Space runs on Replicate. So we're a small podcaster and we host a Whisper diarization on Replicate. And we're paying. So Latent Space is in the 30,000. You raised a $40 million Series B.
I would say that maybe the Stable Diffusion time, August 22, was like really when the company started to break out. Tell us a bit about that and the community that came out and I know now you're expanding beyond just image generation.
Ben [00:32:06]: Yeah, like I think we kind of set ourselves, like we saw there was this really interesting image, generative image world going on. So we kind of, you know, like we're building the tools for that community already, really. And we knew Stable Diffusion was coming out. We knew it was a really exciting thing, you know, it was the best generative image model so far. I think the thing we underestimated was just like what an inflection point it would be, where it was, I think Simon Willison put it this way, where he said something along the lines of it was a model that was open source and tinkerable and like, you know, it was just good enough and open source and tinkerable such that it just kind of took off in a way that none of the models had before. And like what was really neat about Stable Diffusion is it was open source so you could like, compared to like DALL-E, for example, which was like sort of equivalent quality. And like the first week we saw like people making animation models out of it. We saw people make like game texture models that like use circular convolutions to make repeatable textures. We saw, you know, a few weeks later, like people were fine tuning it so you could make, put your face in these models and all of these other-
Swyx [00:33:10]: Textual inversion.
Ben [00:33:11]: Yep. Yeah, exactly. That happened a bit before that. And all of this sort of innovation was happening all of a sudden. And people were publishing on Replicate because you could just like publish arbitrary models on Replicate. So we had this sort of supply of like interesting stuff being built. But because it was a sufficiently good model, there was also just like a ton of people building with it.
They were like, oh, we can build products with this thing. And this was like about the time where people were starting to get really interested in AI. So like tons of product builders wanted to build stuff with it. And we were just like sitting in there in the middle, it's like the interface layer between like all these people who wanted to build and all these like machine learning experts who were building cool models. And that's like really where it took off. We were just sort of incredible supply, incredible demand, and we were just like in the middle. And then, yeah, since then, we've just kind of grown and grown really. And we've been building a lot for like the indie hacker community, these like individual tinkerers, but also startups and a lot of large companies as well who are sort of exploring and building AI things. Then kind of the same thing happened like middle of last year with language models and Llama 2, where the same kind of Stable Diffusion effect happened with Llama. And Llama 2 was like our biggest week of growth ever because like tons of people wanted to tinker with it and run it. And you know, since then we've just been seeing a ton of growth in language models as well as image models. Yeah. We're just kind of riding a lot of the interest that's going on in AI and all the people building in AI, you know. Yeah.
Swyx [00:34:29]: Kudos. Right place, right time. But also, you know, took a while to position for the right place before the wave came. I'm curious if like you have any insights on these different markets. So Pieter Levels, notably very loud person, very picky about his tools. I wasn't sure actually if he used you. He does. So you've mentioned him in your Series B blog post, and Danny Postma as well, his competitor, all in that wave. What are their needs versus, you know, the more enterprise or B2B type needs?
Did you come to a decision point where you're like, okay, you know, how serious are these indie hackers versus like the actual businesses that are bigger and perhaps better customers because they're less churny?
Ben [00:35:04]: They're surprisingly similar because I think a lot of people right now want to use and build with AI, but they're not AI experts and they're not infrastructure experts either. So they want to be able to use this stuff without having to like figure out all the internals of the models and, you know, like touch PyTorch and whatever. And they also don't want to be like setting up and booting up servers. And that's the same all the way from like indie hackers just getting started because like obviously you just want to get started as quickly as possible, all the way through to like large companies who want to be able to use this stuff, but don't have like all of the experts on staff, you know, big companies like Google and so on do actually have a lot of experts on staff, but the vast majority of companies don't. And they're all software engineers who want to be able to use this AI stuff, but they just don't know how to use it. And it's like, you really need to be an expert and it takes a long time to like learn the skills to be able to use that. So they're surprisingly similar in that sense. I think it's kind of also unfair to the indie community, like they're not churny or spiky, surprisingly, like they're building real established businesses, which is like, kudos to them, like building these really like large, sustainable businesses, often just as solo developers. And it's kind of remarkable how they can do that actually, and it's a credit to a lot of their like product skills. And you know, we're just like there to help them, being like their machine learning team effectively to help them use all of this stuff.
A lot of these indie hackers are some of our largest customers, like alongside some of our biggest customers that you would think would be spending a lot more money than them, but yeah.
Swyx [00:36:35]: And we should name some of these. So you have them on your landing page, you have BuzzFeed, you have Unsplash, Character AI. What do they power? What can you say about their usage?
Ben [00:36:43]: Yeah, totally. It's kind of various things.
Swyx [00:36:46]: Well, I mean, I'm naming them because they're on your landing page. So you have logo rights. It's useful for people to, like, I'm not imaginative. I see monkey see monkey do, right? Like if I see someone doing something that I want to do, then I'm like, okay, Replicate's great for that.
Ben [00:37:00]: Yeah, yeah, yeah.
Swyx [00:37:01]: So that's what I think about case studies on company landing pages is that it's just a way of explaining like, yep, this is something that we are good for. Yeah, totally.
Ben [00:37:09]: I mean, these companies are doing things all the way up and down the stack at different levels of sophistication. So like Unsplash, for example, they actually publicly posted this story on Twitter where they're using BLIP to annotate all of the images in their catalog. So you know, they have lots of images in the catalog and they want to create a text description of it so you can search for it. And they're annotating images with, you know, off the shelf, open source model, you know, we have this big library of open source models that you can run. And you know, lots of people are running these open source models off the shelf. And then most of our larger customers are doing more sophisticated stuff. So they're like fine tuning the models, they're running completely custom models on us.
A lot of these larger companies are like, using us for a lot of their, you know, inference, but it's like a lot of custom models and them like writing the Python themselves because they've got machine learning experts on the team. And they're using us for like, you know, their inference infrastructure effectively. And so it's like lots of different levels of sophistication where like some people are using these off-the-shelf models. Some people are fine tuning models. So like Levels, Pieter Levels, is a great example where a lot of his products are based off like fine tuning, fine tuning image models, for example. And then we've also got like larger customers who are just like using us as infrastructure effectively. So yeah, it's like all things up and down, up and down the stack.

Alessio [00:38:29]: Let's talk a bit about Cog and the technical layer. So there are a lot of GPU clouds. I think people have different pricing points. And I think everybody tries to offer a different developer experience on top of it, which then lets you charge a premium. Why did you want to create Cog?

Ben [00:38:46]: You worked at Docker.

Alessio [00:38:47]: What were some of the issues with traditional container runtimes? And maybe yeah, what were you surprised with as you built it?

Ben [00:38:54]: Cog came right from the start, actually, when we were thinking about this, you know, evaluation, the sort of benchmarking system for machine learning researchers, where we wanted researchers to publish their models in a standard format that was guaranteed to keep on running, that you could replicate the results of, like that's where the name came from. And we realized that we needed something like Docker to make that work, you know. And I think it was just like natural from my point of view of like, obviously that should be open source, that we should try and create some kind of open standard here that people can share. Because if more people use this format, then that's great for everyone involved.
I think the magic of Docker is not really in the software. It's just like the standard that people have agreed on, like, here are a bunch of keys for a JSON document, basically. And you know, that was the magic of like the metaphor of real containerization as well. It's not the containers that are interesting. It's just like the size and shape of the damn box, you know. And it's a similar thing here, where really we just wanted to get people to agree on like, this is what a machine learning model is. This is how a prediction works. This is what the inputs are, this is what the outputs are. So Cog is really just a Docker container that attaches to a CUDA device, if it needs a GPU, that has an OpenAPI specification as a label on the Docker image. And the OpenAPI specification defines the interface for the machine learning model, like the inputs and outputs effectively, or the params in machine learning terminology. And you know, we just wanted to get people to kind of agree on this thing. And it's like general purpose enough, like we weren't saying like, some of the existing things were like at the graph level, but we really wanted something general purpose enough that you could just put anything inside this and it was like future compatible and it was just like arbitrary software. And you know, it'd be future compatible with like future inference servers and future machine learning model formats and all this kind of stuff. So that was the intent behind it. It just came naturally that we wanted to define this format. And that's been really working for us. Like a bunch of people have been using Cog outside of Replicate, which is kind of our original intention, like this should be how machine learning is packaged and how people should use it. Like it's common to use Cog in situations where like maybe they can't use the SaaS service because I don't know, they're in a big company and they're not allowed to use a SaaS service, but they can use Cog internally still.
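To make the "interface as a JSON document" idea concrete, here is a minimal sketch, assuming a hypothetical `predict` function with typed parameters. It derives a JSON-Schema-style description of the inputs and output from the type hints. The function, its parameters, and the `describe_interface` helper are all illustrative assumptions; this is not Cog's actual code and not the exact OpenAPI document Cog attaches to an image.

```python
import inspect
import json
from typing import get_type_hints

# Map a few Python types onto JSON Schema type names (illustrative subset).
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def predict(prompt: str, steps: int = 20, guidance: float = 7.5) -> str:
    """Hypothetical model entry point; a real one would run inference."""
    return f"{prompt} ({steps} steps, guidance {guidance})"


def describe_interface(fn) -> dict:
    """Build a JSON-Schema-style description of fn's inputs and output."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        prop = {"type": JSON_TYPES[hints[name]]}
        if param.default is inspect.Parameter.empty:
            # No default value means the caller must supply this input.
            required.append(name)
        else:
            prop["default"] = param.default
        properties[name] = prop
    return {
        "input": {"type": "object", "properties": properties, "required": required},
        "output": {"type": JSON_TYPES[hints["return"]]},
    }


schema = describe_interface(predict)
print(json.dumps(schema, indent=2))
```

The point of the sketch is only that a tool which reads this document can call the model without knowing what is inside the container, which is the "size and shape of the box" agreement Ben describes.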
And like they can download the models from Replicate and run them internally in their org, which we've been seeing happen. And that works really well. People who want to build like custom inference pipelines, but don't want to like reinvent the world, they can use Cog off the shelf and use it as like a component in their inference pipelines. We've been seeing tons of usage like that and it's just been kind of happening organically. We haven't really been trying, you know, but it's like there if people want it and we've been seeing people use it. So that's great. Yeah. So a lot of it is just sort of philosophical of just like, this is how it should work from my experience at Docker, you know, and there's just a lot of value from like the core being open, I think, and that other people can share it and it's like an integration point. So, you know, if Replicate, for example, wanted to work with a testing system, like a CI system or whatever, we can just like interface at the Cog level, like that system just needs to put Cog models and then you can like test your models on that CI system before they get deployed to Replicate. And it's just like a format that everyone, we can get everyone to agree on, you know.

Alessio [00:41:55]: What do you think, I guess, Docker got wrong? Because if I look at a Docker Compose and a Cog definition, first of all, the Cog is kind of like the Dockerfile plus the Compose versus in Docker Compose, you're just exposing the services. And also Docker Compose is very like ports driven versus you have like the actual, you know, predict this is what you have to run.

Ben [00:42:16]: Yeah.

Alessio [00:42:17]: Any learnings and maybe tips for other people building container based runtimes, like how much should you separate the API services versus the image building or how much you want to build them together?

Ben [00:42:29]: I think it was coming from two sides.
We were thinking about the design from the point of view of user needs, what are their problems and what problems can we solve for them, but also what the interface should be for a machine learning model. And it was sort of the combination of two things that led us to this design. So the thing I talked about before was a little bit of like the interface around the machine learning model. So we realized that we wanted to be general purpose. We wanted to be at the like JSON, like human readable things rather than the tensor level. So it was like an OpenAPI specification that wrapped a Docker container. And that's where that design came from. And it's really just a wrapper around Docker. So we were kind of building on, standing on shoulders there, but Docker is too low level. So it's just like arbitrary software. So we wanted to be able to like have an OpenAPI specification that defined the function effectively that is the machine learning model. But also like how that function is written, how that function is run, which is all defined in code and stuff like that. So it's like a bunch of abstraction on top of Docker to make that work. And that's where that design came from. But the core problems we were solving for users was that Docker is really hard to use and productionizing machine learning models is really hard. So on the first part of that, we knew we couldn't use Dockerfiles. Like Dockerfiles are hard enough for software developers to write. I'm saying this with love as somebody who works on Docker and like works on Dockerfiles, but it's really hard to use. And you need to know a bunch about Linux, basically, because you're running a bunch of CLI commands. You need to know a bunch about Linux and best practices and like how apt works and all this kind of stuff. So we're like, OK, we can't get to that level. We need something that machine learning researchers will be able to understand, like people who are used to like Colab notebooks.
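As a rough illustration of "higher level than a Dockerfile", the sketch below expands a small declarative description into the Dockerfile lines a developer would otherwise write by hand. The field names (`python_version`, `system_packages`, `python_packages`) and the `render_dockerfile` helper are assumptions made for this example; they are not taken from the conversation and this is not how Cog itself builds images.

```python
def render_dockerfile(config: dict) -> str:
    """Expand a small, declarative build description into Dockerfile lines."""
    lines = [f"FROM python:{config['python_version']}-slim"]

    system = config.get("system_packages", [])
    if system:
        # The Linux-level detail (apt usage, cache cleanup) lives here,
        # so the person writing the config never has to see it.
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{' '.join(system)} && rm -rf /var/lib/apt/lists/*"
        )

    python = config.get("python_packages", [])
    if python:
        lines.append("RUN pip install --no-cache-dir " + " ".join(python))

    return "\n".join(lines)


# Hypothetical description of what a model needs in order to run.
config = {
    "python_version": "3.11",
    "system_packages": ["ffmpeg"],
    "python_packages": ["torch==2.1.0", "pillow"],
}
print(render_dockerfile(config))
```

The design choice being illustrated is that the best-practice knowledge is encoded once in the tool, while the user-facing format only asks questions a notebook user can answer.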
And what they understand is they're like, I need this version of Python. I need these Python packages. And somebody told me to apt-get install something. You know? If there was sudo in there, I don't really know what that means. So we tried to create a format that was at that level, and that's what cog.yaml is. And we were really kind of trying to imagine like, what is that machine learning researcher going to understand, you know, and trying to build for them. Then the productionizing machine learning models thing is like, OK, how can we package up all of the complexity of like productionizing machine learning models, like picking CUDA versions, like hooking it up to GPUs, writing an inference server, defining a schema, doing batching, all of these just like really gnarly things that everyone does again and again. And just like, you know, provide that as a tool. And that's where that side of it came from. So it's like combining those user needs with, you know, the sort of world need of needing like a common standard for like what a machine learning model is. And that's how we thought about the design. I don't know whether that answers the question.

Alessio [00:45:12]: Yeah. So your idea was like, hey, you really want what Docker stands for in terms of standard, but you actually don't want people to do all the work that goes into Docker.

Ben [00:45:22]: It needs to be higher level, you know?

Swyx [00:45:25]: So I want to, for the listener, you're not the only standard that is out there. As with any standard, there must be 14 of them. You are surprisingly friendly with Ollama, who are your former colleagues from Docker, who came out with the Modelfile. Mozilla came out with the Llamafile. And then I don't know if this is in the same category even, but I'm just going to throw it in there. Like Hugging Face has the transformers and diffusers libraries, which is a way of disseminating models that obviously people use.
How would you compare and contrast your approach with Cog versus all these?

Ben [00:45:53]: It's kind of complementary, actually, which is kind of neat in that a lot of transformers, for example, is lower level than Cog. So it's a Python library effectively, but you still need to like...

Swyx [00:46:04]: Expose them.

Ben [00:46:05]: Yeah. You still need to turn that into an inference server. You still need to like install the Python packages and that kind of thing. So lots of Replicate models are transformers models and diffusers models inside Cog, you know? So that's like the level that that sits. So it's very complementary in some sense. We're kind of working on integration with Hugging Face such that you can deploy models from Hugging Face into Cog models and stuff like that to Replicate. And some of these things like Llamafile and what Ollama are working on are also very complementary in that they're doing a lot of the sort of running these things locally on laptops, which is not a thing that works very well with Cog. Like Cog is really designed around servers and attaching to CUDA devices and NVIDIA GPUs and this kind of thing. So we're actually like, you know, figuring out ways that like we can, those things can be interoperable because, you know, they should be and they are quite complementary and that you should be able to like take a model on Replicate and run it on your local machine. You should be able to take a model, you know, on your machine and run it in the cloud.

Swyx [00:47:02]: Is the base layer something like, is it at the like the GGUF level, which by the way, I need to get a primer on like the different formats that have emerged, or is it at the star dot file level, which is Modelfile, Llamafile, whatever, whatever, or is it at the Cog level? I don't know, to be honest.

Ben [00:47:16]: And I think this is something we still have to figure out. There's a lot yet, like exactly where those lines are drawn. Don't know exactly.
I think this is something we're trying to figure out ourselves, but I think there's certainly a lot of promise about these systems interoperating. We just want things to work together. You know, we want to try and reduce the number of standards. So the more, the more these things can interoperate and, you know
Topics covered in this episode: Fixit 2: Meta's next-generation auto-fixing linter FastUI Mail list / newsletter conversation CLIs from type hints Extras Joke Watch on YouTube About the show Sponsored by us! Support our work through: Our courses at Talk Python Training The Complete pytest Course Patreon Supporters Connect with the hosts Michael: @mkennedy@fosstodon.org Brian: @brianokken@fosstodon.org Show: @pythonbytes@fosstodon.org Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too. Michael #1: Fixit 2: Meta's next-generation auto-fixing linter via Bart Kappenburg Fixit is dead! Long live Fixit 2 – the latest version of our open-source auto-fixing linter. Fixit provides a highly configurable linting framework with support for auto-fixes, custom “local” lint rules, and hierarchical configuration, built on LibCST. Fixit 2 is available today on PyPI. Created by Meta's Python Language Foundation team (a hybrid team of both PEs and traditional SWEs), which helps own and maintain the infrastructure and tooling for Python. Interesting comments on this article on Hacker News I wonder if ruff format was already a thing when Fixit was adopted, whether it would exist? Brian #2: FastUI Samuel Colvin “FastUI is a new way to build web application user interfaces defined by declarative Python code.” MK: Reminds me of the code matches DOM style of Flutter. See code samples at the end. Michael #3: Mail list / newsletter conversation I've been tired of Mailchimp for a long time Raising the prices month over month by $100 several months may be the straw But what are the options? Let's ask Mastodon: emailoctopus.com listmonk.app [self hosted, open source] keila.io [self/saas, open source] mailyherald.org [self hosted, open source] sendportal.io [self hosted, open source] brevo.com buttondown.email [django] zoho.com/campaigns/ sendy.co [use your own bulk emailer (e.g.
sendgrid or aws ses)] convertkit.com mautic.org [open source] constantcontact.com getresponse.com Brian #4: CLIs from type hints From Sander76 Pydantic Argparse “is a Python package built on top of pydantic which provides declarative typed argument parsing using pydantic models.” Clipstick is a “cli-tool based on Pydantic models.” tyro “is a tool for generating command-line interfaces and configuration objects in Python.” tyro includes support for dataclasses and attrs in place of Pydantic Extras Brian: Django 5.0 has been released vim-keybindings-everywhere-the-ultimate-list - submitted by Paul Barry PythonTest (the podcast formerly known as Test & Code, to be read in an undertone similar to the way one used to say “The artist formerly known as Prince”) has moved from testandcode.com to podcast.pythontest.com Plus more guests are listed now. I think I've gone backwards from current to episode 182. I tried to get my kid to help out, unsuccessfully. May have to hire someone to help. grrr. Michael: Essay: Don't Sweat the Ad Blocker Drama A story: my project this weekend, unify my over 20 domains to one host Joke: Honest LinkedIn
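The "CLIs from type hints" idea in Brian's item can be sketched with only the standard library: declare the options once as an annotated dataclass and generate the argument parser from it. This is a minimal illustration under stated assumptions, not the implementation or API of Pydantic Argparse, Clipstick, or tyro; the `Options` class and the `parse` helper are hypothetical names.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class Options:
    """Hypothetical settings for an example command."""
    name: str
    count: int = 1
    shout: bool = False


def parse(cls, argv):
    """Build an argparse parser from cls's annotated fields, then parse argv."""
    parser = argparse.ArgumentParser(description=cls.__doc__)
    for f in fields(cls):
        flag = "--" + f.name.replace("_", "-")
        if f.type is bool:
            # bool fields become on/off flags that default to False.
            parser.add_argument(flag, action="store_true")
        elif f.default is MISSING:
            # No default on the dataclass field means the option is required.
            parser.add_argument(flag, type=f.type, required=True)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    return cls(**vars(parser.parse_args(argv)))


opts = parse(Options, ["--name", "world", "--count", "3"])
print(opts)
```

Because the dataclass is the single source of truth, adding a field adds a flag, and the parsed result is a typed object rather than a loose namespace. The listed libraries go much further (validation, nested models, subcommands).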
Topics covered in this episode: QuickMacHotKey Things I've learned about building CLI tools in Python Warp Terminal (referral code) Python 3.7 EOLed, but I hadn't noticed Extras Joke Watch on YouTube About the show Sponsored by us! Support our work through: Our courses at Talk Python Training Python People Podcast Patreon Supporters Connect with the hosts Michael: @mkennedy@fosstodon.org Brian: @brianokken@fosstodon.org Show: @pythonbytes@fosstodon.org Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too. Michael #1: QuickMacHotKey This is a set of minimal Python bindings for the undocumented macOS framework APIs that even the most modern, sandboxing-friendly shortcut-binding frameworks use under the hood for actually binding global hotkeys. Thinking of updating my urlify menubar app. Brian #2: Things I've learned about building CLI tools in Python Simon Willison A cool Cookiecutter starter project, if you like Click. Conventions and consistency in commands, arguments, options, and flags. The importance of versioning. Your CLI is an API. Include examples in --help Include --help in documentation. Aside, Typer is also cool, and is built on Click. Michael #3: Warp Terminal (referral code) Really nice reimagining of the terminal Currently macOS only but will be Linux, then Windows New command section & output section mode Blocks can be navigated and searched as a single thing (even if it's 1,000 lines of output) CTRL+R gives a nice history like McFly I've discussed before Completions into popular CLIs (i.e. git) Edit like an editor (even you VIM people
In this repeat episode picked by PodRocket host Paul Mikulskis, Ian Sutherland, Node.js core contributor and Architect and Developer Experience Lead at Neo Financial, joins the pod to talk about zero-dependency CLIs, why they're fun to build, and what they can teach us about developing other applications. Links https://twitter.com/iansu https://github.com/iansu https://iansutherland.ca Tell us what you think of PodRocket We want to hear from you! We want to know what you love and hate about the podcast. What do you want to hear more about? Who do you want to see on the show? Our producers want to know, and if you talk with us, we'll send you a $25 gift card! If you're interested, schedule a call with us (https://podrocket.logrocket.com/contact-us) or you can email producer Kate Trahan at kate@logrocket.com. Follow us. Get free stickers. Follow us on Apple Podcasts, fill out this form (https://podrocket.logrocket.com/get-podrocket-stickers), and we'll send you free PodRocket stickers! What does LogRocket do? LogRocket combines frontend monitoring, product analytics, and session replay to help software teams deliver the ideal product experience. Try LogRocket for free today. (https://logrocket.com/signup/?pdr) Special Guest: Ian Sutherland.