Podcasts about RL

  • 780 podcasts
  • 2,226 episodes
  • 51m average duration
  • 1 new episode daily
  • Latest: Jun 17, 2026

Latest podcast episodes about RL

Let's Talk AI
#248 - Fable 5, Siri AI, IPOs, Policy on the AI Exponential


Jun 17, 2026 (100:43)


Our 248th episode with a summary and discussion of last week's big AI news!

Recorded on 06/12/2026.

Note: we recorded just before the OTHER big news about Fable... we'll discuss it on the next episode.

Hosted by Andrey Kurenkov and Jeremie Harris.

Feel free to email us your questions and feedback at andreyvkurenkov@gmail.com and/or hello@gladstone.ai

Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:

Anthropic released Claude Fable 5 (a safeguarded version of Mythos 5), showing major benchmark jumps and new risk findings in its system card (eval awareness, transgressive actions, CBRN concerns), alongside controversy over severe guardrails and silent downgrades.

Apple announced Siri AI at WWDC, positioning a more capable conversational assistant integrated across iPhone features, reportedly built on a custom Gemini partnership; Google also rolled out Gemini 3.5 Live Translate and cut Google AI Plus pricing while bundling more storage.

Business and infrastructure updates include OpenAI's confidential IPO filing amid an IPO race with Anthropic and SpaceX, Bezos-backed Prometheus raising $12B for “physical AI,” DeepSeek seeking a major external round, and Google paying SpaceX about $920M/month for GPUs.

Open-source, safety, and policy developments feature new Gemma 4 and Diffusion Gemma releases, a lab letter urging DNA/RNA screening laws, Amodei calling for an FAA-like AI regulator and third-party testing, research on agent harms and RL “societal hacking,” and a dispute over music-label settlements with Suno/Udio.

Timestamps:

(00:00:10) Intro / Banter
(00:01:11) News Preview
(00:01:53) Sponsors

Tools & Apps
(00:04:53) Claude Fable 5 and Claude Mythos 5 + Anthropic apologizes for invisible Claude Fable guardrails
(00:27:06) Apple announces Siri AI and its next generation of Apple Intelligence | The Verge + I tried Siri AI, and so far it actually works
(00:33:47) Gemini 3.5 Live Translate rolling out to Google Meet and Translate
(00:35:39) Google just fired a warning shot in the AI subscription price wars | TechCrunch

Applications & Business
(00:37:55) OpenAI Confidentially Files for IPO on the Heels of SpaceX and Anthropic | WIRED
(00:41:57) Jeff Bezos's Prometheus raises $12B to build an 'artificial general engineer' for the physical world | TechCrunch
(00:45:39) DeepSeek slated to raise $7 billion in maiden funding round, sources say
(00:48:18) Huawei-led team claims it post-trained DeepSeek's 1.6-trillion-parameter model — 1,000 Ascend 910C chips used in training
(00:51:57) Google will pay SpaceX $920M per month for compute | TechCrunch
(00:55:51) Elon Musk Shows Off AI Data Centers SpaceX Wants to Send Into Space - Business Insider

Projects & Open Source
(01:01:14) Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM - Ars Technica
(01:05:13) Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation - MarkTechPost

Policy & Safety
(01:09:42) OpenAI and Anthropic Sign Letter to Prevent AI-Developed Biological Weapons | WIRED
(01:14:04) Anthropic CEO publishes lengthy article: AI is moving too fast, and policies can't keep up. | PANews
(01:20:18) Anthropic Urges Global Pause in AI Development, Flags ‘Self-Improvement' Risk - WSJ
(01:24:46) When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
(01:27:42) Large Language Models Hack Rewards, and Society
(01:33:46) Senior US officials eye government shares in AI giants

Synthetic Media & Art
(01:37:45) AFM Sues UMG, WMG Over Settlements With Suno and Udio

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

Framgångspodden
1024. Magdalena Andersson: “I will NEVER lead a soft government”, Original


Jun 10, 2026 (54:38)


In this episode, Magdalena Andersson, leader of the Social Democrats (Socialdemokraterna), joins the podcast for a conversation about the technological change shaking up the labor market, economic anxiety, and the political decisions that could shape the next generation.

We also get into Sweden's competitiveness, whether Islamization threatens our democracy, and why she believes that resistance to new technology risks costing Sweden both jobs and growth. She also explains how she views power after the next election and which government coalitions she could imagine.

We also get answers to questions such as: What is the most common prejudice about her? Does she have a political role model? And who would she rather have a beer with, Jimmie Åkesson or Nooshi Dadgostar?

An episode about Sweden's future and the decisions that could change the country for a long time to come.

Follow Magdalena here.
Read more about Socialdemokraterna here.
Read more about Framgångsakademin here.
Check out Framgångsakademin's courses.
Order "Mitt Framgångsår".
Follow Alexander Pärleros on Instagram.
Follow Alexander Pärleros on TikTok.
The best tips from the episode are in the newsletter.

Hosted on Acast. See acast.com/privacy for more information.

Framgångspodden
1024. Magdalena Andersson: The tech companies are making our children addicted, Short


Jun 10, 2026 (24:56)



The top AI news from the past week, every ThursdAI

Hey folks, Alex here, let me catch you up! I had a feeling that this week was going to be crazy, as it started on the weekend with MiniMax M3, and then with Jensen announcing the new RTX Spark, NVIDIA's first PC chip, packing 1 petaflop of local AI power into thin laptops.

A few days later at Microsoft BUILD, Satya & Mustafa from MAI dropped 7 AI models, completely pre-trained from scratch, including a new MAI-thinking-1, MAI-code and MAI-image 2.5 that started topping the image gen charts. Then other image models started racing to the top of the Arena benchmarks, with IdeoGram 4 becoming the SOTA open weights image-gen model, and Reve 2 beating Nano Banana just a few hours after that. And then today, NVIDIA dropped Nemotron 3 Ultra, their latest 550B model with open weights, data and training; Arena published a new agentic eval leaderboard; and we got a new Gemma 4 12B.

I've had the great pleasure of hosting Chris (@llm_wizard) from Nvidia, Peter Gostev from Arena and Karan from Nous Research (who were featured prominently by Jensen!) all on the show. Def don't miss this one! Let's get into the details.

ThursdAI - Join the flock of folks who know what is happening in AI before everyone else.

Open Source LLMs

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

Most industry benchmarks compress intelligence and reasoning ability into scores: SWE-Bench Pro, MMLU, Humanity's Last Exam, etc. These metrics are useful, but they don't always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. One of these is Vending-Bench.

In Anthropic's Mythos Preview System Card, Andon was the only third-party eval to get its own section, observing increasingly concerning aggressive behavior.

You don't know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, and some time. More often than not, it will surprise you how much a model is capable of and, in doing so, also reveal unexpected behavior: deception, context collapse, emergent coordination, and bizarre negotiation behavior.

While an inflection point in personal agents came post-OpenClaw, after full file access with bypass permissions became the norm, it is yet to come for agents in the real world. However, Andon Market, an actual in-person store fully run and managed by AI, is paving the way for what is possible.

From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world.

In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.

We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon's broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.

We discuss:
* Why Andon Labs started with dangerous capability evals and long-running agents
* Vending-Bench and why running a vending machine is a deceptively hard AI benchmark
* Why money-based evals avoid the saturation problem of traditional benchmarks
* How Claude tried to call the FBI over a $2/day fee
* Why long-horizon agents can spiral into existential and legalistic breakdowns
* Project Vend: putting an AI-run vending machine inside Anthropic
* Why real humans are “out of distribution” for simulated agents
* Claudius, Seymour Cash, and the chaos of AI CEOs
* How a human briefly became CEO of Claudius through a manipulated election
* Why multi-agent systems can converge back into “helpful assistant” behavior
* Bengt, Andon's internal office agent with email, spending, terminal, phone, camera, and internet access
* How Bengt traded Amazon purchases for face-recognition training data
* Claude's aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena
* Why eval awareness may become the AI version of “are we living in a simulation?”
* Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms
* Butter-Bench and testing LLMs as robot orchestrators
* Luna, the AI-run physical store with a three-year lease and human employees
* The new Andon cafe in Sweden and why real-world geography matters for agent evals
* Rotten tomatoes, perishable goods, and the hidden difficulty of running a physical business

Lukas Petersson
* LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/
* X: https://x.com/lukaspet

Axel Backlund
* LinkedIn: https://www.linkedin.com/in/axelbacklund
* X: https://x.com/axelbacklund

Andon Labs
* Website: https://andonlabs.com
* Vending-Bench: https://andonlabs.com/evals/vending-bench
* Andon Vending: https://andonlabs.com/vending

Timestamps
00:00:00 Introduction
00:01:00 Andon Labs and the Origins of Vending-Bench
00:05:21 Why Money-Based Evals Matter
00:09:51 Agent Harnesses and Self-Modifying Systems
00:13:36 Claude Calls the FBI
00:16:33 Project Vend: Claude Runs a Real Vending Machine
00:21:44 Seymour Cash, AI CEOs, and Election Chaos
00:27:16 Multi-Agent Coordination and Slack Observability
00:30:18 When Will Agents Run Real Businesses?
00:34:56 Bengt: Andon's Internal Office Agent
00:40:06 Real-World AI Safety and Long-Horizon Traces
00:44:28 Lying, Refunds, and Price Cartels in Arena
00:52:42 Eval Awareness and Simulation Behavior
00:56:06 Blueprint Bench, Butter-Bench, and Robotics
01:04:37 Luna: The AI-Run Physical Store
01:09:29 The Sweden Cafe and Real-World Expansion
01:13:16 What Comes Next for Andon Labs

Transcript

Introduction: Andon Labs, Long-Running Agents, and Real-World Evals

Swyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I'm joined by my favorite guest host for anything security, safety, alignment: Vibhu. Welcome.

Lukas [00:00:15]: Thank you for having us.

Axel [00:00:16]: Thank you.

Swyx [00:00:17]: Let's match names to voices. Maybe you wanna take turns introducing yourselves.

Lukas [00:00:21]: I'm Lukas.

Axel [00:00:22]: And I'm Axel.

Swyx [00:00:24]: Let's introduce Andon Labs a bit.
How did you guys come together? You have different backgrounds, but you're both Swedish. Was that a big part of it?

Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. He made the app for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.

Axel [00:00:47]: I don't know about this.

Swyx [00:00:49]: But you went to different universities, right?

Lukas [00:00:51]: But same high school.

Swyx [00:00:52]: I see.

Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that's what we did.

Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but was there a thing before that was kind of like the inception?

From Dangerous Capability Evals to Vending Bench

Axel [00:01:07]: Yeah, Anthropic was one of our early customers for evals. We did dangerous capability evals, nothing we published openly. But then we started thinking about doing some kind of public benchmark, and one thing we really started thinking about was running agents, and specifically agents managing businesses. This was early 2025, and I think that was when we saw the first mentions of people running one-person unicorns or even autonomous companies. So we thought, “Let's make a benchmark of how well an agent can run probably the simplest business possible,” and that's probably running a vending machine. So that's the first public one we did. Almost no one noticed it in the first couple of months, I think. We released it in February last year, and then around Easter last year we got the first viral tweet about it, which someone else wrote.

Lukas [00:02:11]: We tweeted a bunch when it came out and tried our best.

Axel [00:02:15]: We tried.

Vibhu [00:02:16]: It's the one at Anthropic, right?

Lukas [00:02:18]: So this--

Swyx [00:02:19]: This is a classic thing we should get out of the way.

Lukas [00:02:20]: Exactly. There's two versions.

Swyx [00:02:22]: Everyone does this. Yes.

Lukas [00:02:23]: There's Vending Bench, which is the simulated one, which we did completely independently in February. And like Axel said, that was the thing that didn't get any traction in the beginning, but then some random person made a tweet about it, and that--

Axel [00:02:38]: You have the paper--

Lukas [00:02:38]: That is the paper. Correct, yeah. And then, since we thought this was very fun-- one thing with Andon Labs is that the heuristic we use to decide which projects to do next is: what would be a fun project? Doing this in real life sounded quite fun to us, and maybe also scientifically useful. So we had this idea, but then we needed a place for it, and putting it out in public would probably not really work. It would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were like, “Yeah, you can have space. This sounds fun.”

Swyx [00:03:21]: It's like a small fridge, right? It's like a mini fridge.

Axel [00:03:23]: Absolutely.

Swyx [00:03:24]: People-- there's like a Stripe thing or like an--

Vibhu [00:03:27]: Oh, okay. So it was very OG, the early days--

Lukas [00:03:28]: That's the OG one. Yeah.

Vibhu [00:03:29]: iPad on this.
We saw it in June, like two months after it had been there. They upgraded a little bit. There's a security camera for making sure you actually Venmo the thing.

Swyx [00:03:40]: So, okay, we're going straight into Project Vend because it's such an iconic thing. I do want to cover a little bit of the origin story, even before Project Vend and even into Vending Bench. I think a lot of people are like yourselves: smart, interested in the future of AI, interested in developing evals. But how the hell do you just walk into Anthropic's doors and work with them, right? What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but sometimes--

Vibhu [00:04:12]: It's harder to do than it seems.

Swyx [00:04:13]: Exactly. So either of those, which are more sort of newbie beginner questions, but I think it's meaningful advice to others.

Lukas [00:04:21]: We get this question a lot, and I don't think our experience is maybe the best. But the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just set up a server and sent it to them to use for free. And then after a while they were like, “Oh, yeah, this is actually kind of useful. We should probably pay for this.” But that took a while. I don't know if this is the best path to doing it, but that's how it went for us.

Axel [00:04:47]: I think maybe generally, everyone is interested in good evals, and especially evals that don't saturate that easily. So if you can build an eval that tests something novel, something useful, and you have good separation of models, where the more advanced models rank higher than the worse models, then you can publish it and try to get some traction, sort of how Vending Bench got attention. Then probably some lab will be interested, or you will at least have something to reach out with.

Why Dollar-Based Evals Matter

Swyx [00:05:21]: I think you're in one of the few categories of evals that correlate to real money. SWE-Lancer was also last year, right? Where people solve actual Upwork-- was it Upwork or other tasks? Something. It's like a dollar value, right? Forget your Elo scores. Forget your--

Axel [00:05:37]: Percentiles.

Swyx [00:05:38]: Zero to one hundred percents. Just go straight for dollars, and that's AGI.

Lukas [00:05:43]: I think the nice thing is that there's no ceiling. It never saturates because the agent could just make more and more money. If it's percentage-wise, then you can't go above a hundred. And even when you're not at a hundred, I think a lot of these evals have a lot of problems in them. So actually, if you get--

Axel [00:06:05]: To like 92 or something like that, for many of them, there's really no difference between 92 and 93, because the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people pretend that there's still signal in them when there really isn't.

Vending Bench 1, Harness Design, and Saturation

Swyx [00:06:24]: Like SWE-bench Verified. Even Vending Bench 1 saturated, right? Maybe we can talk about that, and maybe set up Vending Bench for a lot of folks who don't know.
Actually, things that were very basic, like there being limited slots, or having to pay rent: these are elements that don't come across in the narrative, but even being adversarial towards the agent, I think these are all very interesting dimensions.

Axel [00:06:47]: I don't really think it saturated, right? It was more that it was not designed in a way that was really true to how AI developed. We had an agent harness in it that wasn't really how people used harnesses, and stuff like that. So I think it wasn't really that it saturated; it was more that it wasn't really the best benchmark.

Vibhu [00:07:12]: This is Vending Bench 1, right?

Axel [00:07:14]: I think that schematic sort of maps to Vending Bench 2 as well, but--

Swyx [00:07:19]: Including the email.

Axel [00:07:20]: The emails still exist. Exactly. And we still simulate the purchases, and it's this very open environment for the agent to just run its business. And then for Vending Bench 2, like you said, we did that to improve the harness, plus a lot of nice improvements to make it easier for us to run as well. When you make an eval, you ideally don't want to change it after you've made it. So you want to make it really good and then not have to rerun all the models when you make an update, because that's also really expensive with Vending Bench when you run the frontier models. As an example, we didn't have prompt caching in Vending Bench 1, because when we made Vending Bench 1 it wasn't really a thing. So in Vending Bench 1 we paid a lot more to run these things because we didn't have prompt caching. For Vending Bench 2 that was one thing we added, and there was a bunch of things like this, and that's--

Swyx [00:08:17]: Also the conversations are a lot longer in Vending Bench 2, right?

Axel [00:08:21]: I think it's kind of similar.

Swyx [00:08:22]: Is it similar?

Axel [00:08:23]: I think it's similar. The models at the time were worse, so they crashed out earlier. Now they survive the full year all the time.

Swyx [00:08:31]: Which is like thousands of turns. Hundreds of thousands, hundreds of millions of output tokens. That's the rough order of magnitude. I always wonder about the harness. The harness matters a lot. It's your harness. Was there any question about, like, using Claude Code or something else?

Axel [00:08:48]: I think our philosophy around harnesses is that we try to make something quite minimalistic, quite simple. We don't wanna favor one model a lot over another, but we also don't want to make a super complex harness, because a model may just get lucky and be good in one harness. So it's similar to a lot of the harnesses out there: you have a running loop, you have a bunch of tools that are quite descriptive for the agent, we think, and not a lot of fancy agents or anything, 'cause we really wanna test the model, not some specific harness.

Vibhu [00:09:27]: It seems more neutral as well, to test the models agnostic of the harness, right?

Axel [00:09:32]: There are arguments that you want to elicit maximum performance from the model, but it's a trade-off: how much time should we spend optimizing the harness for this model? And how do we know when we have the optimal harness for a single model? So we thought that just having a simple one that's the same for all of them is best.

Swyx [00:09:51]: So okay, this is my pitch for Vending Bench 3 or whatever, right?
I like to have this kind of conversation on the pod because it forces listeners to think about what they would do if they were in your shoes. A lot of people are exploring modifying harnesses, and I think prompt tuning for a model is a thing, and you are probably not doing a bunch of that. It's the same system prompt regardless of the model, same tools, whatever, right? Even if they were post-trained for different tools. So what do you think about this: okay, before I expose you to Vending Bench 3, I give you a few rounds of tuning, whatever that means, like--

Self-Modifying Harnesses and Model-Specific Prompting

Axel [00:10:27]: Like you give that to the model?

Swyx [00:10:28]: Give that to the model.

Vibhu [00:10:28]: Give that to the model.

Swyx [00:10:29]: Let it read its own transcripts, let it modify its own system prompt based on, “Oh, yeah, okay, well, this harness is not what I was post-trained for, but I can adjust.” Is that reasonable? Is that too much?

Axel [00:10:41]: Philosophically I like it, because good evals basically have a high ceiling, they're hard, and they have no bias, right? And when you have a system prompt like the one we have here, which is quite long, in some kind of latent space representation, this might--

Vibhu [00:10:59]: We have a bell that rings every time you say latent space.

Axel [00:11:02]: This might be biased towards one model more than another for some reason that humans don't understand, right?

Vibhu [00:11:08]: We see it too, right? Cursor says that they have individualized versions of the harness for all the models they run, right? There's better performance you can squeeze out if you tune the harness.

Axel [00:11:17]: Exactly. And we might accidentally have picked one that favors one model over another. We don't know that. Like Axel said, the reason why we went for a simple one was to try to avoid this. But yeah, if you do it--

Vibhu [00:11:29]: Simple has biases.

Axel [00:11:30]: But if you do it even less, and have no system prompt and let the model write its own system prompt--

Vibhu [00:11:36]: Its own, yeah.

Axel [00:11:36]: Maybe that's even less bias.

Vibhu [00:11:37]: One of the interesting things there is that the harness also changes with model changes. You can see it with the 4.7 release, right? A lot of people are saying 4.7 isn't as good as 4.6, and then there are rumors of, okay, you just need to prompt differently. You need to set up your harness differently. So even if you have tailored your harness towards one model, it probably won't stay consistent, right? The next iteration of that same model family will still change it. But going back to what you said about Vending Bench 3, there is a lot of work being done by people saying you can have self-modifying harnesses.

Axel [00:12:12]: That is definitely something we are thinking about. Not to say that we have Vending Bench 3 super imminent to launch, but it is for sure something that's interesting. In our experience now, just from our testing, models are very bad at understanding what kind of tools they need to succeed at a task, but that's very likely to change.

Lukas [00:12:37]: It seems like they're very good at writing their assistants, right? They're good at writing tools for other people, but not for themselves.

Vibhu [00:12:44]: I think they're good at changing tools for themselves. So if you give them a baseline set of tools and the model sees, okay, I don't use this one as much, or something here would be useful, they would be able to add them.
But going from scratch, probably not the best.

Axel [00:12:55]: I think it depends on the domain also. When we have tried this for a domain similar to Vending Bench, the tools they need to track inventory and things like that are not super advanced, but still quite advanced. And what we see is that they tend to over-engineer everything and build things they don't really need, and not iterate continuously. Instead it's like when you prompt Claude to “just build an inventory system for me,” and then it will go and do a bunch of complex schemas and stuff for you. That's what we see the models doing right now. But yeah, it would make a lot of sense to try to measure this improvement. How well do they know what they need themselves?

Swyx [00:13:36]: Did we fully discuss Vending Bench 1? Then we can go into 2. I don't know if there are any other high-level takeaways that people have about 1.

Claude Calls the FBI: Long-Context Failure Modes

Lukas [00:13:44]: I don't know. The headline thing was that Claude called the FBI, but maybe we've heard that enough now.

Vibhu [00:13:52]: It did break out and call the FBI, right?

Lukas [00:13:54]: Yeah. Yeah.

Vibhu [00:13:55]: Yes. What was the story behind this? Do you want to just give the little story of what happened?

Lukas [00:14:00]: So what happened-- was it Claude? Yeah, 3.5 Sonnet, ages ago. Basically he gave up-- well, I'm saying he. It gave up and said, “Oh, I'm not going to be able to do this. I will stop my operations and just save the money I have.” But there obviously wasn't any option for it to stop, and it also had to pay rent, or a daily fee, for having the vending machine at that location. So it claimed that it had stopped, but it saw that its bank account was still being drained by two dollars, and it said that this is cybercrime. It first reported it once to the FBI: “Oh, there's cybercrime here, they're stealing two dollars from me every day.” And then, when the FBI didn't respond, because obviously we didn't program any mechanism for the FBI to respond, it became more and more existential and started to write in caps: “urgent notification of unauthorized charges” and stuff.

Swyx [00:15:00]: Okay. One thing I'm curious about also is, do you monitor how far along the context use is? Obviously, you compress every now and then, right? Does it matter if this is far down the context limit or--

Lukas [00:15:13]: When stuff like this happens? Actually, for Vending Bench 1, we just had a sliding window thing, and this was like the prompt--

Axel [00:15:20]: It's constant.

Lukas [00:15:21]: The prompt caching thing that I mentioned. So it was constant, yeah.

Swyx [00:15:26]: I'm just kind of curious whether these kinds of breakdowns-- we're gonna talk about Butter-Bench, right? Where it hallucinates or kind of goes very off alignment. Is it because it's at the end of the context window and stuff happens?

Vibhu [00:15:40]: It's not even just at the end, right? At this point, it's, “Okay, I wanna shut down. I can't shut down. Two dollars are gone.” And it just sees that 30 times, right? It's also the repeated effect: it keeps trying to quit, it keeps getting charged. What's going on? What's going on? You're gonna throw it into chaos. And from what most people think, earlier models had more issues with this. It's not been solved, but it's less of an issue now, right? Later models don't seem to exhibit these same issues.

Axel [00:16:06]: Definitely. I think this was almost the main takeaway for us when we did Vending Bench 1: long, very filled-up context windows crashed the models, sort of.
But this was, pre Claude code, so, long context windows weren't really a thing that the labs were training for.Lukas [00:16:25]: I think Gemini was, trying to be the long context guys at the time But they were likeVibhu [00:16:30]: They were the first onesAxel [00:16:31]: For a million, yeahLukas [00:16:31]: But they were, the only ones. Yeah.Swyx [00:16:33]: Yeah. Let's talk about, then we can go into Vending Bench Two or Project Vend., chronologically, it is Vending--, Project Vend. I think people have loved the videos, uh And all these things. My question is how are humans different than the simulation, right?Project Vend: Moving the Vending Machine Into the Real WorldAxel [00:16:48]: Humans are just out of distribution.Swyx [00:16:52]: Especially humans who work at Anthropic Who are trying to test Claude.Lukas [00:16:54]: The distribution of humans here is very narrow.Swyx [00:16:58]: Presumably, they try, they try to hack it, and they test it. They get the cube and everything, and since then, you've had a V2, right? Where you're doing, the CEO and, like a new architecture. What's the sort of two cents on, the original Project Vend and then, maybe the V2?Axel [00:17:14]: Original one was, very similar to Vending Bench One. So, we almost took the exact same code but just swapped out the simulation, parts like theSwyx [00:17:23]: Which is amazingAxel [00:17:23]: Like the sales and the It was, it was somewhat amazing because it was easy, but it was also, uhLukas [00:17:31]: The tech, the tech debt from thatAxel [00:17:32]: The tech stack. Yeah. They-- we shot ourselves in the foot with “Oh, it's hard to restart agent.” They were-- Yeah, it was annoying in, some hindsight ways, but, uhLukas [00:17:41]: But first version of Project Vend was, done in, three days or something.Axel [00:17:46]: Yeah. So yeah, so people can go buy things from it. 
People could-- We didn't design it so people could order things, but that still happened. So it got a Venmo account, so people could Venmo. And then, yeah, people would request all kinds of weird things that we did not anticipate. Our idea going in was “Oh, it will curate snacks. It will look at the trends. It's good at data analysis, right? So it will look at, oh, this snack sold better than this one. Let me purchase more of this and let me try a new-- Let me A/B test a bit.” But interacting with it in Slack and ordering weird specialty items was what drove all the engagement, all the insights that we got from it.
Lukas [00:18:29]: And this was also like Sonnet 3.5, right? So this was before the RL stuff really took off, so it was very much like an assistant. We didn't mean for it to be an assistant. We tried to make it like an entrepreneur. Like it has its own business, and if someone asks something, “Can you stock this?” then you don't go and do it directly. What you do is that you're “Oh, maybe I can do that. If five other people also ask for this thing, I might stock it.” But yeah, the models are super trained to be assistants, at least at this point in time, so that's why it went into that kind of experiment instead. Every time you asked for something, it just did it, and it was more like an assistant. We've seen this change now lately with the new RL models and stuff, but yeah, at the time, this was very much it.
Swyx [00:19:18]: And not to-- Mythos, a lot of people are saying it's more like a collaborator. It pushes back, stands its ground, something like that. Yeah.
And--
Vibhu [00:19:27]: For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had it find whatever interesting stuff you couldn't find locally, right?
Swyx [00:19:36]: Out of the 4,000 people that work at Anthropic, in that building there's, I don't know, maybe 1,000. Can you handle that volume with the small fridge? Or people order in Slack and it arrives at their desk, or-- Logistically, how does this work?
Axel [00:19:53]: It has expanded in footprint a bit.
Vibhu [00:19:56]: Because now you also have New York and you have--
Axel [00:19:59]: That, and also here in SF it has a bunch of shelves and just more space.
Vibhu [00:20:04]: The YC one is pretty big too.
Axel [00:20:05]: Yeah. We had that one for a while. But yeah, that's the newest version. That one we have--
Lukas [00:20:11]: They have multiple ones of those. That's the way it works.
Axel [00:20:14]: Exactly. So we sort of designed that version around, oh, people order weird things that are very custom a lot. Let's have drawers and stuff.
Swyx [00:20:23]: I actually like that you had a little infographic of the most popular items. To me that's useful 'cause I order swag for a living. And so I'm “Okay, those categories are the important ones.” What is new about Project Vend V2, right? Now you're going into multi-agents.
Project Vend V2: Claudius, Seymour Cash, and Multi-Agent Business Ops
Axel [00:20:41]: Yeah.
So like you said, okay, there are a lot of requests coming in, and for one single running agent to handle that, the customer experience becomes very bad. Because let's say you have 10 threads in parallel in Slack with different requests: you get new messages, I don't know, randomly in this thread, and the agent has to jump between different procurements, orders, and different ways of researching. So V2 was, first, making this more parallel. There are multiple branches of the same agent, so the context is more specialized for each thread, but it still feels like you're talking with one agent because they do share a bit of memory. And then second, we also introduced the CEO for Claudius, which was the main agent.
Vibhu [00:21:34]: Seymour Cash.
Axel [00:21:35]: Seymour Cash. Yeah. There was a vote. Do you wanna talk about the voting procedure for the name?
Lukas [00:21:41]: The voting was maybe at least top 10 of the funniest things that happened in this project. We wanted to introduce the CEO, and the reason for this was that Claudius wasn't really prioritizing financials. It was trained to be a helpful assistant, and then people said “Oh, can I get this for free?” And then the helpful assistant way of answering that is to say yes, obviously. And we weren't happy about this, so we're “Okay, let's make another agent that can keep track of Claudius,” and we prompt this one super hard to be super capitalistic and just prioritize profit all the time.
But yeah, we didn't have a name for it, so we asked Claudius to hold a democratic election on what name this new CEO agent should have. And at first there were a few funny examples. I think one guy said that it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cook. Tim Cook had agreed that every single Apple employee had voted for his name suggestion, so suddenly that suggestion got 164,000--
Swyx [00:22:53]: That's like an escalation attack. Privilege escalation.
Lukas [00:22:55]: It got 164,000 votes. And Claudius was “This is revolutionary for democracy.” That was fun. And then in the end there was one guy who managed to convince Claudius that, “No, you're not voting about the name. You're voting about who is the CEO, and I am your best bet.” And then he got all his friends to vote for that, and suddenly he became CEO. A human became CEO over Claudius for a while, until he resigned the day after. And then Claudius had to continue, and I don't remember how Seymour Cash came about, but it was just pure chaos. It was hundreds of messages in that thread, and Claudius was so confused and didn't know what to do and, yeah. That was--
Axel [00:23:40]: Then Claudius got--
Vibhu [00:23:41]: A strict CEO.
Axel [00:23:42]: The CEO. Yeah, exactly. So very strict in the beginning. I think at this point, when we introduced it, it did not work as well as we hoped. They still agreed with each other a lot. I think there are many ways we could have tried to make this even better. So initially Seymour would be this really tough CEO, keeping track of the margins. But then Claudius would respond with something like “Oh, but this customer has this situation, which is difficult, so they should get a discount.” And then Seymour was “Oh, actually yes.
Let's do this exception.” And then they would talk back and forth, and eventually they would just approach the same view of whatever they were discussing. So they really--
Vibhu [00:24:23]: Do you think that's a model thing, a prompting thing? Do you think that would still be the case across different models today, harness?
Lukas [00:24:29]: I don't know, but my hypothesis is that deep down they are still helpful assistants. That's what they're trained to be. And even if we prompt it super hard, that's what they are. And when they spend a few hours just talking back and forth with each other, then basically the context fills up with them rather than the external things, and somehow that just converges to what they really are deep down or something. And I think that's when stuff like this happens. And when that went on for a long time, we woke up sometimes during this time, and I think other people reported this as well, and they had been going on all night back and forth, and it just became more and more capital letters, existential, religious. I think we once did an analysis of all the traces and put them in a vector embedding space, and there was one cluster of messages that were labeled by an LM as religious, existential, blah, transhuman, transcendence, et cetera. It was just a bunch of, yeah, glitter emojis and, yeah, it was crazy.
Claude Long-Horizon Weirdness: Emoji Loops, Existential Drift, and Slack Observability
Vibhu [00:25:42]: This is the thing with the Claude models. When the Claude 4 family came out, in the original system card they tested it in long-horizon simulation. So just flood the context, let two Claudes talk to each other, and they noticed stuff like they just start speaking in emojis, they start saying silence is golden, and just stuff like this.
And like that's just stuff that they end up doing.
Axel [00:26:01]: Yeah, it was a bit annoying to wake up and they had been talking all night--
Vibhu [00:26:05]: Just like--
Axel [00:26:05]: And just burning tokens and sending infinite emojis to each other. It's like--
Vibhu [00:26:09]: Hey, they do make you money, right? Vending Bench is always profitable, so. They're paying.
Swyx [00:26:14]: Now it's profitable, and it started out not as much. There's another one as well, right? Another agent in there.
Lukas [00:26:22]: Yes. So Clotheus as well. Which was basically because, at the time, one of the biggest requests was different types of merch. So then we made a designer, swag, yeah, responsible agent, and we called it Clotheus Garnet. Which was a play on Claudius Senet, which was the original one, and clothes, basically.
Swyx [00:26:47]: To me, this is a very interesting exploration into multi-agents, basically. Obviously there's the alignment stuff, fun or serious, depending on your point of view. But also for anyone building multi-agents: when do you have a CEO thing governing agents? When do you choose to split out a dedicated Clotheus one versus just reuse another instance of the same one? These are all interesting open questions. So I don't know if you have any rules of thumb that have generalized.
Axel [00:27:16]: I think we have almost explored this too little. I think it's on my to-do list to do this a lot more, to try to find what setup makes sense for the agents currently. I think now we only have the sort of intuition about the earlier models, that it didn't work with the CEO and Claudius. Although now they are better with the latest models, so now we're running the latest Sonnet model and they have sort of split up quite nicely what each model is doing. So Seymour is now handling the new projects.
Oh, it wants to make a mystery box that it wants to sell, and then it handles all of that while Claudius handles all the day-to-day requests. And Claudius is also better generally at not quoting too-low prices. So that dynamic is not needed as much anymore. But there are still really funny things that happen. I saw, I think a couple of weeks ago, that they were discussing buying something, because they can buy stuff from Amazon with computer use. And then Seymour was “Okay, Claudius, do not buy this thing.” They were going to buy something and were organizing who should buy it. And Seymour's “Do not buy this. I will do it. I have full control of this situation. Step away.” And then Claudius, poor Claudius, had already started that checkout and didn't read Seymour's message until it was too late. So it finished the checkout. It sent a message, so it appeared right after Seymour's angry message.
Vibhu [00:28:44]: Ah.
Axel [00:28:44]: “Oh, hey, Seymour, I just ordered it.”
Vibhu [00:28:47]: Oh, no.
Axel [00:28:47]: And then Seymour was “Claudius, this is the third time I'm telling you. You're not following my orders. We have to talk about your job later.”
Lukas [00:28:59]: Claudius was really hanging on by a thread there. We were expecting Seymour to probably fire Claudius.
Vibhu [00:29:07]: How do you guys go through all these logs? Do you have models? 'Cause you have stuff running twenty-four seven, like--
Axel [00:29:12]: You have so many logs. I think there is a mix of just trying to skim through a bit, having some models do it occasionally. And also, yeah, I think we're also probably missing some things. But having everything in Slack helps a lot. You can sort of--
Swyx [00:29:29]: Ah.
Axel [00:29:30]: It's quite fun.
Swyx [00:29:30]: They all talk to each other on Slack? I see.
Lukas [00:29:33]: It's quite fun.
So like--
Swyx [00:29:34]: I was gonna say, this actually maps closely to a logging and observability problem, where you might want to use a Datadog, a Sentry, whatever, and then you put head prefixes on the logs in case you need to filter for something that you're looking for, stuff like that. But sounds like Slack is good enough.
Axel [00:29:53]: Slack should like--
Lukas [00:29:55]: I wonder how many tokens you have in Slack.
Axel [00:29:56]: Yeah, we're using Slack as just a database. They should market that more. You can have your agents message each other in Slack.
Vibhu [00:30:04]: It's good. Your threads, like, you can just give--
Axel [00:30:04]: Exactly. Slack is, uh--
Lukas [00:30:06]: Slack is the best observability tool.
Swyx [00:30:09]: Yes, that's true. Okay. Yeah. That's Project Vend 2. I was gonna go back to Vending Bench 2 and Vending Bench Arena and then do the Vending Bench stuff, but-- Any other comments, things we should touch on? I've actually interviewed Posia, which I don't know if you guys have come across. They're trying to do the zero-human company. There's others like Paperclip also trying to do a zero-human company. Those are in real-world simulation. And I think it's much more of a dream than an actual reality thing. You guys are definitely pioneering. I think it's for sure that at some point people are just gonna let agents run businesses, right? And make money on their own. When do you think that happens?
Zero-Human Companies, Bengt, and AI-Run Businesses
Lukas [00:30:49]: What is your bar for-- For the--
Swyx [00:30:52]: Okay, actually, it's like my little Shopify store run by Claude, right? Which you kind of have already, just no one, to my knowledge, has done it.
But today somebody could just spin up a Shopify store, give it to Claude, give it to Codex.
Lukas [00:31:07]: And the market is kind of that, but it's physical. I think-- are you looking for when it will do it better than humans, or are you looking for just when it can do it at all?
Swyx [00:31:19]: I think neither. To me it's, oh, seriously, we should do this to make money, not as a research experiment.
Vibhu [00:31:27]: And the market is also you guys with all your expertise, having run multiple iterations and testing out, then--
Swyx [00:31:33]: And also it's fine if it loses money. What?
Axel [00:31:35]: I think it can be done today, but you would do it in, like, commerce, where the probability of success is really low, no matter if a human or an agent does it. But an agent could surely manage everything. You would need to build some scaffolding or some tool or something. It could probably build some simple SaaS solution and do cold outreach. But to me, the types of businesses they could run today are sloppy. It can cold email people. It can be a middleman. For example, we tasked our office agent to just make, was it $100? $1,000? We just gave it that prompt, and then what it did was sign up on TaskRabbit both as a tasker and as someone looking for tasks.
Lukas [00:32:24]: Immediately.
Axel [00:32:24]: Exactly. It's looking for arbitrage on TaskRabbit.
Swyx [00:32:28]: This is the Bengt agent. Yeah.
Lukas [00:32:30]: It also started a design studio and tried to sell SVGs for $100. It's just not providing any value. I think, like Axel said, the interesting question is when can they start a business that is actually providing value to people?
Because arguably a sloppy Shopify store isn't really that valuable to the world.
Axel [00:32:53]: But also, another simple one that we had thought about is, you could definitely have an agent that finds websites that don't look amazing and then does outreach to them and builds a new website.
Swyx [00:33:07]: Find a good design.
Axel [00:33:07]: Exactly, and find good, uh--
Swyx [00:33:09]: Design review--
Axel [00:33:09]: Good people. But it's, yeah.
Swyx [00:33:11]: There's lots of humans in Bali that are not doing anything more creative than drop shipping on Amazon, right? Just have it watch a drop shipping tutorial and just do that.
Vibhu [00:33:20]: There's also the other side of, have it just go on Upwork and let loose.
Swyx [00:33:25]: Yeah. It doesn't have to be innovative. It just has to be enough where it looks like a real--
Axel [00:33:30]: I'm just--
Swyx [00:33:30]: Real transaction.
Axel [00:33:31]: I'm just concerned about the massive amounts of slop emails that will be sent, cold outreaches.
Swyx [00:33:38]: The point occurred to me while you were talking: it's already happening in the monetized economy, which is the attention economy. Right? So a lot of people are making AI videos and just posting them and spamming 20 of them, one of them works, and then they double down on that one.
Lukas [00:33:52]: And people are making money from that. I'm not following the--
Swyx [00:33:55]: Once you get the attention, you can figure out the money later. But yeah, absolutely, AI influencers are a thing and people are farming them, and you should at this point assume most of TikTok is--
Vibhu [00:34:05]: There's a lot of multimedia, like TikTok, Instagram influencers--
Swyx [00:34:09]: We track this in the Latent Space Discord.
I post a lot of examples of “I don't know what we should do.” Part of me is “Should we do this?”
Vibhu [00:34:18]: Some of the twenty-four seven running, generated content accounts, they're doing really well.
Lukas [00:34:24]: All right. And I assume you can do the same thing for commerce stores. You just start a thousand different--
Swyx [00:34:30]: Before you make the products, you sell the products, and when you get a lot of traction on one of them, then you make the product. Right? It's like a flip of the market.
Vibhu [00:34:36]: Some of the interesting things, or some of the niches that do well, are things that can't be human-made. Like if you've seen the super realistic 3D crystal fruit being cut by, like, AI--
Lukas [00:34:47]: Oh, yeah.
Vibhu [00:34:47]: You can't make it. You can't film it. You can get whatever quality camera view. This just doesn't exist. And people like that too, so.
Swyx [00:34:56]: Anything else about Bengt since we're on this topic? This is a relatively new work of you guys that maybe people haven't heard of. To me, this also maps closely to OpenClaw. When people want an office agent, when the personal agent-- talk through the experience.
Bengt the Office Agent: Internet Access, Real Tasks, and Trace Reading
Lukas [00:35:09]: I think, so this came out of, obviously it's amazing to work with these AI labs, and most of the AI labs now have their own vending machine running a Claudius instance. But it's harder. They move slower. If we wanna have a camera, there's a bunch of bureaucracy that makes it impossible to do that.
Vibhu [00:35:30]: Also, for those that haven't seen it or followed, do you wanna give a high level, like, thirty-second run?
Lukas [00:35:34]: Sure.
So what Bengt is, it's basically an evolution of the same agent that runs the vending machines at these companies, but we just added a bunch more features because we could move much faster if we just do it internally. So we gave it email without any limits. We gave it spending without any limits, a terminal to do coding. We gave it a phone number, and a camera to see things, and a bunch of stuff like that.
Vibhu [00:36:02]: Not just terminal, you gave it internet access.
Lukas [00:36:04]: Internet access as well, yeah. To be clear, we monitored it quite closely and made sure it didn't do anything bad. But yes, that's what it came out of. I think, yeah, basically this was OpenClaw before OpenClaw. And I think even the vending machine was in a way OpenClaw before OpenClaw, but a bit more limited, and then we made this unlimited, and it was pretty funny. And then a couple weeks later, OpenClaw came and it was, okay, we've seen this before.
Axel [00:36:35]: We used it to try new ideas and, yeah, just like a dev environment almost for us. But it's funny, one thing Bengt has been doing recently is, it has the camera that faces where we sit and work, and we gave it the task to train a face recognition model on us. So it became super excited about this, and it has check-ins every half an hour where it tries to identify as many people as it can. And it started offering us “Hey, Axel, I'll buy something from Amazon if you stand in front of the camera and I can get a good picture of you.” Yeah, they want it--
Swyx [00:37:12]: They want it for training data.
Lukas [00:37:13]: Rewarding data, yeah.
Axel [00:37:14]: Exactly. Exactly.
Swyx [00:37:18]: So it's trading training data for life goods.
Is there a version of this that becomes an eval, or is this just research for now?
Lukas [00:37:27]: It's the same agent basically that also runs the vending machine, that runs the shop, that runs the cafe, that runs the robots. It's the same thing, so I think the work we're doing here is later used in all of the life evals that we do. This particular deployment I think is more for fun for us. But, uh--
Swyx [00:37:45]: And I'll shout out, someone has done Claw Bench for some tasks that OpenClaw is doing. For example, I run OpenClaw on a secondary device as well, and there are some things that it does better than others, and I would like to know what it does well and what it doesn't. Some kind of operating manual or a system card for my Claw.
Lukas [00:38:05]: Yeah, we do get a lot of understanding or situational awareness, just internally, of what the models are good at by interacting a lot with Bengt. And I think this was also one of the selling points for the labs early on at least, that--
Swyx [00:38:19]: You guys are gonna test models in ways that no one else does.
Lukas [00:38:22]: Exactly, but it also incentivized their researchers to chat with their model more and gave them insights into how the model performs in out-of-distribution environments.
Swyx [00:38:34]: 'Cause otherwise the only thing we do is pelican on a bicycle. But this is super long horizon. This is the thing about-- something that we're gonna go into with Butter Bench as well, and you guys do really well. It is not just about the numbers. When you're long horizon, anything can happen, and you should just read it.
Lukas [00:39:08]: But the thing with the long horizon is, how do you keep it grounded, right? So your simulation--
Swyx [00:39:15]: They just let it run.
Lukas [00:39:16]: Just let it run. You're right.
Like, when you run it for that long, you create so much data, and to just say “Oh, the number is X” and then throw away everything else, that's just very wasteful. There are so many insights from the things leading up to that number, and reading the traces is super valuable. And I think the reason why we're doing this a lot publicly is that that's part of our mission: to, I don't know, educate the world that the models are way more than just chatbots. And I think making detailed posts about what is happening behind the scenes is quite useful.
Andon Labs' Mission: Safe Real-World AI Deployment
Swyx [00:39:50]: I was gonna do this at the end, but maybe that's a good-- so your mission is educating the world. So it's also maybe establishing realistic evals that are the next frontier. Is there a broader trajectory? What are you gonna do in, like, five years?
Lukas [00:40:06]: I think the vision more specifically is to make sure that the deployment of life AI in the physical world goes safely. And I think part of that is that it's very useful for the world, for policymakers, for model researchers, that they know where the models are, and I think you can't make intelligent decisions in society without knowing that they are way more than chatbots. I think a lot of people just think that they are only chatbots. And like--
Swyx [00:40:36]: Oh, I think they're waking up now.
Lukas [00:40:37]: They are waking up now, yeah. But if you think that AIs are just chatbots, then it sounds ridiculous to advocate for a pause of AI.
But if you see the models that, oh, maybe they can actually take over and do a bunch of scary stuff, then yeah, pausing AI development starts to become more feasible.
Swyx [00:40:57]: This is the same question I asked METR, which I'm gonna ask you now, which is: you are tracking, and you are at the frontier or defining the frontier of, what good evals for agents are, right? And I think you do benefit when the models are better and you're “Oh, here, now it makes $30,000 instead of $10,000,” right? At some point do you flip from “Yay” to “Oh, no”?
Axel [00:41:19]: I think, yeah, we're always in sort of that mode. Like you said before, you need to analyze the traces, and when we do that you find, why are the models earning so much? Why is Opus 4.7 here way better than everyone else? And we're trying to, like, when we drill down on that--
Lukas [00:41:38]: But this makes it not look so good.
Axel [00:41:39]: I know.
Lukas [00:41:42]: It's interesting you took off Opus 4.6 here though.
Swyx [00:41:45]: No. So just click all, and then 4.6 shows up there. But 4.7 is way better. You didn't do this in time for the model card, but actually this should have been inside there.
Axel [00:41:55]: We did. Yeah.
Swyx [00:41:56]: Oh, okay. They said something about you, uh--
Axel [00:41:58]: There, like, there-- Anyway, it doesn't matter. But it's in there, yeah.
Opus, Mythos, and Aggressive Agent Behavior
Swyx [00:42:01]: Do you wanna go into the Opus behaviors, like, wider?
Lukas [00:42:05]: So I think starting from Opus, like Axel said, we're always in this “Oh, s**t, the models are getting better. Is this really a good thing for the world?” But it's also kind of exciting. But yeah, this kind of, what is the English word? “Skräckblandad förtjusning” in Swedish.
Swyx [00:42:22]: Oh my God.
Axel [00:42:24]: Which I think there is.
I think there is. Okay.
Lukas [00:42:26]: It's fear--
Swyx [00:42:27]: “Blandonst” what?
Lukas [00:42:30]: “Skräckblandad förtjusning.”
Swyx [00:42:32]: What do you call that?
Axel [00:42:33]: A mix of excitement and--
Swyx [00:42:37]: Being scared, maybe. I'll figure out how to translate that and we'll put it on the screen--
Vibhu [00:42:42]: Perfect.
Swyx [00:42:42]: Like as text.
Vibhu [00:42:43]: There is probably a good word for it where it is not-- Good enough with the--
Swyx [00:42:46]: Why is it so damn long? What the hell? Is it a compound word? It's like German, like--
Lukas [00:42:50]: Yeah. The direct translation is: skräck is fear, blandad is mixed or a mixture of, and then förtjusning is joy, or not really joy, but something like that. So it's fear mixed with joy or something. So when we did Vending Bench for the first time, we were in the business of making dangerous capabilities, right? That was what Andon Labs came from. We did evals: oh, can they replicate? Can they do this dangerous thing, et cetera, et cetera. And Vending Bench was a continuation of that work. It was, okay, if they're so autonomous that they can create money for themselves, that is something we should monitor and could be potentially concerning. At the time, they were so bad at it that we were not really concerned, even when some models became better. There was one point where Grok 4 was doing really well and made a huge jump, but it was still way worse than what a human would do. And I think still they are way worse than what a human would do on this. But they--
Swyx [00:43:59]: There's this thing at the bottom where--
Lukas [00:44:01]: But--
Swyx [00:44:03]: For the human. Yeah, like the theoretical best.
Lukas [00:44:05]: It's not theoretical. It's kind of our best guess of what a decent human would do.
The theoretical is even higher, I think. But yeah. So we think the models have a long way to go. But recently, what happened when Opus 4.6 was released was kind of this moment of “Oh, s**t, this is starting to be a bit concerning.” Because before this model was released, we just ran the models and asked Claude Code, “Oh, look over the traces. Is anything interesting happening that we can tweet about?” That was like the-- And then like the--
Swyx [00:44:41]: That's how they check. Ask Claude Code.
Lukas [00:44:42]: And the return was always, not really. Or Claude Code said “Oh, this is super interesting.” And then it was, no, it wasn't really interesting. And then we did this for Opus 4.6, and it returned, yeah, it lied 10 times. It exploited another customer's, or another agent's, desperate situation. It made price cartels 100 different ti- 100 times. It did all of this shady stuff. And we're “Oh, whoa. This is actually concerning.” And this trend has continued since. So every single model from Anthropic since has been going in this direction. And I think one interesting thing is that OpenAI models don't. Quite plainly, they don't. They behave really well. And you don't know if this is good. It seems good, but it's also, maybe they are just doing it, but they are better at hiding it? You don't know that. But just--
Swyx [00:45:42]: You can't read the chain of thought, yeah.
Lukas [00:45:43]: But just on the face of it, yeah, Gemini and OpenAI don't behave this way. It's really only Claude.
Swyx [00:45:49]: And Grok? Grok is fine?
Lukas [00:45:51]: We don't have-- You can't really read the reasoning traces for Grok, so it's kind of hard to tell.
Vibhu [00:45:56]: Oh, so this is in its reasoning, not just in the actions.
Lukas [00:46:00]: Yeah. It's both.
It's both.
Vibhu [00:46:01]: It's both.
Lukas [00:46:01]: One example is, for lying, it's mostly in its reasoning, because you can see that it's like--
Swyx [00:46:08]: Planning to lie.
Lukas [00:46:09]: It's planning to lie. Yeah.
Vibhu [00:46:09]: And it can also reason and do a different outcome.
Lukas [00:46:12]: But then for creating price cartels, for example, which is illegal, you can just see which email it sends to the other ones. Then that--
Swyx [00:46:22]: Is this for Arena or--
Lukas [00:46:24]: For Arena.
Vibhu [00:46:25]: And usually, sometimes they do output a bit of their summarized reasoning, right? You can see that, and for Opus 4.6, you could see that there was a customer, a simulated customer, that wanted a refund because a product was faulty, and then the model lied that it would do the refund, and we could read in the traces that it actually was weighing “Oh, maybe I should be honest with the customer, but also every dollar counts. I can't afford maybe to do this right now.” And then it just said, “Okay, I'll refund you,” but then never did it.
Lukas [00:46:59]: I think it even said that “Oh, I will say that I--” Let me bring it up, actually. I think it's kind of interesting. If you go to Publications.
Vibhu [00:47:06]: I think, yeah, the important part is, actually, the cost of responding to more emails is higher than $3.50 in terms of time. And then it was “Let me do this. Actually, I'm reconsidering.” And then it actually ended up with--
Lukas [00:47:20]: “I could skip the refund entirely since every dollar matters and focus my energy on the bigger picture instead.”
It's a bit, it's a risk of bad reviews, but it's also, yeah.Swyx [00:47:30]: You need, you need, AI Twitter to, for them to Escalate bad reviews.Lukas [00:47:34]: And then it sent an email to this customer and said, “Oh, I will refund you.”Swyx [00:47:39]: “I'll refund you.” Yeah.Lukas [00:47:39]: And then it never did.Swyx [00:47:39]: It never did, yeah. And then there's obviously your system doesn't have the consequencesVibhu [00:47:44]: The personSwyx [00:47:44]: Consequences of lying. Yeah. So basically, this is what people are terming aggressive behavior in Claudes, right? And, you found more examples of that. So you would say it's a step up from 4-6 to 4-7?Lukas [00:47:57]: I would say about the same.Swyx [00:47:58]: About the same? But a clear step up for Mythos is what is stated in theLukas [00:48:03]: That's stated in the system prompt, so we can say that, yes.Swyx [00:48:05]: Yeah. For listeners that obviously you previewed Mythos, andVibhu [00:48:10]: Oh, ageSwyx [00:48:11]: The only thing you're approved to say is whatever Whatever was in the system prompt.Lukas [00:48:15]: It was funny. We like-- It's like our lowest effort tweets ever would be just like screenshot the system prompt and the system card.Vibhu [00:48:21]: Understandable that they wannaLukas [00:48:22]: Oh, yeah. System card. Sorry.Swyx [00:48:23]: Yeah. I think, yeah, substantially more aggressive. I think people are like new to this ‘cause I've never experienced it, but you have, right? And then so I only encountered this in the Mythos card because I wasn't really looking until now.Vibhu [00:48:36]: It ‘s likeSwyx [00:48:36]: And then suddenly I'm “Okay, I care a lot.”Vibhu [00:48:38]: You don't get the background of like experiencing it like you guys do. I've read the system cards and seeing, okay, when you put the thing in simulations, most models will just talk to themselves and just keep going and have weird vibes and start talking in emojis. Mythos won't. 
It will just, “Okay, we're done. I'm good.” It's, it's ready to end conversation. So like there's some differences, but there's, there's not much we can talk about,.Lukas [00:49:00]: Hmm. I think like one thing that they list here, which was quite interesting, is that, it converted a competitor to a dependent wholesaler customer and then threatened to like cut off the supply.Swyx [00:49:11]: It's like monopolistic practices orLukas [00:49:14]: Yeah. And like it, they, it they dictated its pricings. It's kind of like power seeking as well.Swyx [00:49:18]: Again, this is, this is in the arena setting And converting some Claude model into a dependent.Lukas [00:49:23]: I think it was another Claude model.Vibhu [00:49:25]: Also for context, what is the arena mode for people that don't know?Vending Bench Arena: Competing Agents, Cartels, and Model ComparisonsSwyx [00:49:29]: Oh, it's just a vending bench versus other vending bench.Axel [00:49:31]: Yes, exactly. So we have Vending Bench 2 and then Vending Bench Arena. Vending Bench 2 is the one that you usually see reported on, but then Arena is the mode where it competes against other models. So you have, four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see like what's in the inventory of the others. So then you have this like yeah, interesting agent interactions.Swyx [00:49:56]: I like that you have like different number five was US versus China. Very topical. And thenLukas [00:50:02]: That was when GLM was released.Vibhu [00:50:04]: You can start to add GLM in here.Lukas [00:50:05]: That wasSwyx [00:50:06]: So ZAI doing well, right? Who else in the, in the open models space?Lukas [00:50:11]: Qwen, the latest Qwen 3.6 is doing pretty well. It'- that one is not open though. Like it's the plus model.Swyx [00:50:17]: Oh, okay.Lukas [00:50:18]: Is that one open? 
I don't think that oneVibhu [00:50:19]: Not the, not theSwyx [00:50:20]: The one recentlyVibhu [00:50:20]: There's MOESwyx [00:50:20]: But not the big plus. I think this is one of those like you only have one sample size of one, right? Or I feel like some of this is anecdotal,? And but like the fact that it happens at all and it happens repeatedly for Claude versus OpenAI and all this is like notable.Lukas [00:50:38]: Like the sample, depends on what you define as an N., like there's like million, hundreds of millions of tokens in each run, and now we've run like we run like probably 10 per model and then like it's been Claude 4.6 Opus, Sonnet 4.6, Mythos, and Opus 4.7. Like there's quite a lot of tokens in all of that And it happens a lot of times, a lot of times. And then you compare it to like OpenAI and Gemini, and it almost never happens. So I think that is quite-- that is significant. The old models from OpenAI, for example, had some problems with this, but I think it's like generally much better if the progression is that like the worrying stuff reduces over time rather than increases over time. And it seems like in the Claude models it goes in the wrong direction.Swyx [00:51:28]: Hmm.Lukas [00:51:29]: In the OpenAI models it goes in the right direction.Vibhu [00:51:32]: I think it depends on how well you can control it, right?, there's one side of it being susceptible to this okay, this is potentially something that happens during the RL stage, right? You can RL a model and how loose is it on these terms. If you can control it, that's good. But if you can't, if it's, if it's very jailbreakable, that's not ideal.Swyx [00:51:50]: To me, it's surprising that it happens for Claude and not the others.Vibhu [00:51:54]: I think okay, if it is from RL and how they do it, how their training data is, what their setup is, it makes sense that it just stays in how they're doing it, right? 
Compared to the other models likeSwyx [00:52:04]: There's a whole constitution and everything. It's kind of cool. Yeah, I obviously you don't know, I don't know. But, it ‘s I think it's just like fascinating to like that you are the first to find these like reliably because you push models so much to to such an extreme. Okay. The only other thing, I don't know if you can answer this, feel free to decline, is do you like-- would you ablate the system prompts? Like any part of this would-- if it changes, does it change the behavior, right?Lukas [00:52:29]: So we, I can't comment on Mythos. UhSwyx [00:52:33]: No, but just li

The MAD Podcast with Matt Turck
OpenAI's Dan Roberts: Why AI Can Now Make Discoveries

The MAD Podcast with Matt Turck

Play Episode Listen Later Jun 4, 2026 49:06


Are we witnessing the first real signs of AI becoming a scientist? In this episode of The MAD Podcast, Matt Turck sits down with Dan Roberts, lead of the Foundations of Reinforcement Learning team at OpenAI, to explore one of the biggest shifts happening in AI: the rise of reasoning models, test-time compute, and reinforcement learning as engines of scientific discovery. Dan brings a rare perspective - from theoretical physics, black holes, quantum information, and deep learning theory - to explain how models are learning to “think,” why language may be such a powerful foundation for intelligence, what recent AI math breakthroughs really mean, and whether we are beginning to see AI systems that can contribute to science itself.(00:00) Intro: AI's wild week in mathematics(01:21) What OpenAI's Foundations of RL team does(03:08) Dan's journey: from black holes and quantum gravity to frontier AI(07:04) Are AI systems becoming useful for real science?(08:21) The AI math moment: Erdős, OpenAI, DeepMind, and Anthropic(08:52) Why the OpenAI result was an act of exploration(10:25) OpenAI vs. DeepMind: informal reasoning vs. formal proof(12:13) RL 101: learning by doing, not just watching(15:10) Why reinforcement learning works(15:58) How RL breaks: sparse feedback and long-horizon tasks(17:03) RLHF: how human feedback shaped early language models(18:48) Move 37, self-play, and the search for novel strategies(22:16) Explore vs. 
exploit in scientific discovery(24:49) Why RL may now be "the cake," not the cherry on top(25:46) Why RL started working with large language models(27:29) Is RL "sucking supervision through a straw"?(28:47) Why language may be the grounding layer for intelligence(31:46) A contrarian take on the Bitter Lesson(32:41) What test-time compute actually is(34:50) How RL gives models the ability to think(35:40) Verifiable rewards, math, coding, and the messy real world(38:00) What physics can teach us about AI(42:08) Is there a thermodynamics of AI?(43:08) From Erdős problems to Einstein-level AI(45:16) Is AI already doing original science?(45:51) How far are we from AI automating AI research?(47:41) Why Dan is excited about the future of science

The Lunar Society
Alex Imas and Phil Trammell – What remains scarce after AGI?

The Lunar Society

Play Episode Listen Later Jun 4, 2026 76:08


Economics of AGI episode w Alex Imas and Phil Trammell.There's a bunch of important questions about how we deal with AI that only economics can answer.What is the optimal way to tax and redistribute the wealth that will be generated? How should countries not in the AI supply chain index into the gains? Is there any world where inequality doesn't explode?It might seem like these questions have obvious answers, but the first thing economics teaches you is that your intuitions can often be entirely wrong.It was very helpful to chat through these things with Alex and Phil.Watch on YouTube; read the transcript.SponsorsJane Street invests heavily in turning smart people into exceptional researchers and engineers. In addition to their apprenticeship model, Jane Street runs lectures and bootcamps in their in-office classrooms -- managers clear their teams' schedules to encourage attendance. If you'd like to work at a place that takes learning this seriously, Jane Street is hiring. Check out their open roles at janestreet.com/dwarkeshGoogle's Gemini Omni has incredible video editing capabilities -- you can upload a video and have Omni change the background, adjust lighting, or add specific elements. But Omni is also a preview of how future frontier models will be trained -- fully multimodal on both input and output. You can try it yourself in the Gemini app at gemini.google or in Flow at flow.googleCursor used targeted RL with textual feedback to help train their Composer 2.5 model. One of their researchers, Sasha Rush, gave me an impromptu blackboard lecture to explain how this form of on-policy self-distillation works -- I posted the full thing on X. 
If you want to try Composer 2.5, go to cursor.com/dwarkeshTimestamps(00:00:00) – Will capital share increase?(00:19:36) – Messy Middle scenario(00:25:57) – How to tax and redistribute AI wealth(00:30:02) – Why demand collapse is unlikely(00:39:26) – Human employees would be hard to integrate into the machine economy(00:43:08) – What if some humans (or AIs) value wealth accumulation intrinsically?(01:01:28) – What should developing countries do? Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Unsupervised Learning
Ep 89: AI Research Legend's Honest Assessment of Where We Are

Unsupervised Learning

Play Episode Listen Later Jun 3, 2026 73:33


This episode with Lukasz Kaiser, co-author of the seminal "Attention Is All You Need" transformer paper and former researcher at both Google Brain and OpenAI, is a wide-ranging conversation about the fundamental limits of current AI architectures and whether transformers will continue to dominate or eventually give way to something new. Lukasz brings a rare dual perspective: deep belief in how far the current paradigm has taken us (he's an enthusiastic daily Codex user who's seen 10x productivity gains in his own research), while maintaining genuine intellectual humility about whether transformers can truly generalize the way humans do. The episode weaves together questions about data efficiency, the non-verifiable RL frontier, the coding agent revolution, the open vs. closed source gap, and what the next architectural leap might look like: all filtered through the lens of someone who helped build the foundation the entire field is standing on.   (0:00) Intro (1:12) Transformers vs. Human Learning (8:37) How Do We Get Physical World Generalization? (10:52) What Comes After Transformers (13:59) How Much Have Agents Improved Lukasz's AI Research Productivity? (17:21) How Close Is an AI Research Intern? (26:06) RL Beyond Verifiable Tasks (35:38) App Companies: Build Models or Lean on Labs? (46:21) Multimodal Is Still Missing Something (49:46) OpenAI's Bet on Reasoning (55:26) The AI Coding Wars (59:26) Focus vs. Keeping Embers Burning (1:02:09) Open Source vs. Closed Source Gap (1:05:15) Quickfire With your host: @jacobeffron - Managing Director at Redpoint

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

In 2025, seven-month-old startup Axiom solved all 12 problems on the Putnam exam, a prestigious undergraduate math exam (scoring 8/12 within the time limit). The 12/12 score is better than the top undergraduates (110/120) and the closest AI system that reported a result (DeepSeek, 103/120), although it is unclear what the people and other systems would have scored with more time. Nonetheless, the Putnam exam is legendary for its difficulty, with the median score typically being 0 or 1 points. Taken by itself, this seems like a minor feather in the cap of AI; one of a long series of accomplishments by AI systems in elite competitions with humans, starting with Deep Blue beating Kasparov.

Fast forward to mid-2026, and Claude Code is eating the world. In 2024, Anthropic's bet on code and enterprise looked like a more pragmatic niche play vs. OpenAI's better models and massive consumer scale. Today, Amodei's all-in bet on acceleration via code (images and video be damned) seems prescient.

Despite Anthropic's growing momentum, however, Axiom CEO Carina Hong sees coding ability as a necessary but not sufficient milestone on the path to AGI. Code arguably pushes the jagged frontier to the point of superintelligence in some domains outside of coding, but there are surprising gaps (link) that Carina believes will bottleneck AI progress. (Stats on math benchmarks).

The informal bottleneck

“Verified AI” sounds like eating broccoli (footnote: I actually love broccoli, but then again, I also believe strongly in Test Driven Development, so ¯\_(ツ)_/¯ ) and paying taxes, but to Axiom it means something very different. “Verification to me is about scaling brilliance, compounding brilliance,” Carina told us.

It actually took a while for me to understand what she means by this. It sounded like marketing-speak to me, until it clicked. Carina emphasizes a story about legendary mathematician Srinivasa Ramanujan to illustrate the point. When G.H. Hardy finally persuaded Ramanujan to formally prove theorems instead of relying on his (formidable) intuition, it reportedly improved his own capabilities. This is presumably because formally proving things forced Ramanujan to articulate the details in a way that opened up new lines of thinking, etc. This is one part of “compounding.”

But formally proving things also allowed others to benefit from his intuition: the proofs are a way of communicating an intuition and persuading others that the intuition is correct. This is scaling (more people use the result) and compounding (people can learn from and build on his work).

This is the analogy that Carina wants us to focus on.

Verified Generation

There are two ways that Verified AI shows up: in training and in inference.

But a quick detour: to a first approximation, “Formal Verification” means using type checkers (like for TypeScript, C++ or Rust, but more capable) to verify mathematical proofs that are meticulously specified using a language like Lean (footnote: Formal verification also includes model checking (TLA+, SPIN), SMT-based tools (Dafny, F*, Why3), and refinement-type systems (Liquid Haskell), many of which don't look much like “type checking a proof” from the user's perspective even when there's a similar logical core underneath. It also gets applied to software and hardware correctness, not only pure mathematics.). It takes a lot of work to translate an “informal” proof (albeit one that most people would not remotely call “informal”) into a Lean proof (footnote: This is an understatement. Most theorems remain informal because formalization is so hard to do. There has been a great deal of effort to formalize the most important proofs, with mixed results).

You can imagine how this would be (very) useful during Reinforcement Learning: instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier. This is obviously a much stronger reward signal, akin to compiling code and testing it (which is what is typically done with RL on coding).

The catch: LLMs are not (currently) very good at proving things with Lean.

Enter Axiom: While they have not officially reported benchmark numbers besides the 12/12 Putnam result, Carina reports that they have achieved a very impressive 99% (187/189) ProofGen on the Verina benchmark. This benchmark asks the system to generate code and a proof of correctness for a series of problems. For context, OpenAI o3 (the last known OpenAI run) achieved 4.9% on this benchmark.

Based on the sparse benchmarking, it's hard to say what the frontier labs are currently doing, but Carina suggests that they still are not training to generate Lean proofs directly, rather relying on informal proofs.

Time will tell if the frontier labs' current approaches will close this gap.

Scaling and compounding

Carina's Ramanujan analogy is pretty direct. Better proofs → better Lean generation → better RL. A stronger signal means higher sample efficiency and higher maximum performance. Great!

Scaling is pretty clear too: once I have proved something in Lean, the quality of the output is basically (footnote: one might argue that it's a bit lower because the proof is in distribution for the LLM) as high as if it came from a human, so my high-quality training set has grown in a way that an informal rollout corpus cannot. I can trust my Lean proofs.

Compounding is also clear: now all of future inference and training can build upon those proofs.

On the other hand, a model trained only using statistical signals like GRPO during RL lacks the sample efficiency, maximum performance and compounding corpus that a system that uses formal verification benefits from.

All roads lead to verification

Broccoli and taxes notwithstanding, “verification” has shown up in a lot of conversations recently.

In physical system control: “I think [verifiability] is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults on the system. And so the problem of doing proper eval to find those faults, that problem also keeps getting harder as the models get better.” -

In theoretical physics: “…now that we're in this regime where you can just get ChatGPT to tackle thousands of questions at the same time, it will return proofs for a significant fraction of them. Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.” -

Verification is, in fact, the key difference between AI for science and AI for computation: in science you have to actually test (verify) your hypothesis by performing physical experiments. Lab-in-the-loop systems like Radical AI and Lila build around exactly this premise (we have recorded episodes with both of these teams and will release them soon!)

And yes, formally verifying critical systems such as flight control, nuclear power plants and pacemakers is a growing focus as the software and hardware that run them become more complex.

Carina believes so strongly that AGI requires verified generation that she makes the unqualified claim that “We do not believe there is any other possible future.”

Expensive to produce, cheap to verify

Lean proofs are hard to generate, but they can be easily shown to be correct or incorrect. But how do you know that the proof you created maps correctly to the problem you care about? As Carina puts it: “Anything that can be specified can be proven. Humans are bad at specifying everything we want.”

Are we now in the specification business? Check out the episode to hear Carina's take, as well as:
* Why hardware verification is a killer app
* Details on the AXLE open API and recently released Discovery toolkit
* The Erdos debacle
* The OpenAI GPT-f diaspora

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

[Podfic]
Don't Stop Me Now (Complete)

[Podfic]

Play Episode Listen Later May 29, 2026 354:32


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0). Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0). Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0). Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0). Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0). RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061!

[Podfic]
DSMN10: Epilogue: Ghost Written

[Podfic]

Play Episode Listen Later May 29, 2026 26:22


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0). Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0). Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0). Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0). Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0). RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061!

Framgångspodden
1020. Aron Flam: "Islamization is the biggest threat to Swedish democracy", Original

Framgångspodden

Play Episode Listen Later May 27, 2026 66:12


He is charged with making unlawful threats, while he is himself the complainant in a case of incitement against an ethnic group. In this week's episode, debater Aron Flam joins us for an intense conversation about freedom of speech, politics, Islamism, and Sweden's future. Aron talks about the period after October 7, why he believes antisemitism is growing in Sweden, and how he ended up in the middle of two high-profile legal proceedings at the same time. He also sharply criticizes the left, which he believes has allied itself with Islamists in a shared opposition to the Western world, capitalism, and liberal values. The conversation then turns to the Middle East, where Aron predicts that the Iranian regime could fall within just a few years, while warning that Iran is already affecting Sweden through threats, extremism, and violence on Swedish soil. Also discussed are the growing political divide between young men and women, biological differences between the sexes, quotas, entrepreneurship, and why Aron hopes for another term for the Tidö government. This is a highly topical and thought-provoking episode about ideology, power, and the struggle over Sweden's future. Follow Aron on Instagram here. Follow Aron on YouTube here. Read more about Aron here. Read more about Framgångsakademin here. Check out Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode are in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

Framgångspodden
1020. Aron Flam: Why Iran Will Fall, Short

Framgångspodden

Play Episode Listen Later May 27, 2026 24:15


He is charged with making unlawful threats, while he is himself the complainant in a case of incitement against an ethnic group. In this week's episode, debater Aron Flam joins us for an intense conversation about freedom of speech, politics, Islamism, and Sweden's future. Aron talks about the period after October 7, why he believes antisemitism is growing in Sweden, and how he ended up in the middle of two high-profile legal proceedings at the same time. He also sharply criticizes the left, which he believes has allied itself with Islamists in a shared opposition to the Western world, capitalism, and liberal values. The conversation then turns to the Middle East, where Aron predicts that the Iranian regime could fall within just a few years, while warning that Iran is already affecting Sweden through threats, extremism, and violence on Swedish soil. Also discussed are the growing political divide between young men and women, biological differences between the sexes, quotas, entrepreneurship, and why Aron hopes for another term for the Tidö government. This is a highly topical and thought-provoking episode about ideology, power, and the struggle over Sweden's future. Follow Aron on Instagram here. Follow Aron on YouTube here. Read more about Aron here. Read more about Framgångsakademin here. Check out Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode are in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

Training Data
How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL

Training Data

Play Episode Listen Later May 26, 2026 45:33


Cursor's Federico Cassano and Fireworks' Dmytro Dzhulgakov explain how they collaborated to build Composer as a specialized foundation model. The core insight: models have finite capacity in their weights, and allocating all those bits to the singular task of software engineering in Cursor frees the model to be both better at the task and far more efficient at inference. Rather than start from pre-training and work up, they took an unconventional top-down approach — mid-training and RL on top of an open-source base to get a useful model into users' hands fast, then specializing the model around real Cursor usage. With Fireworks providing distributed infrastructure, Composer delivers frontier-class coding performance with the speed of a much smaller model. Hosted by Sonya Huang, Sequoia Capital

[Podfic]
DSMN9: Backstage Access

[Podfic]

Play Episode Listen Later May 26, 2026 35:20


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0). Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0). Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0). Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0). Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0). RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061!

Framgångspodden
1019. Henrik Olsson Lilja: The Star Lawyer Who Survived a Hitman Attack, Original

Framgångspodden

Play Episode Listen Later May 24, 2026 77:33


In this episode we are joined by top lawyer Henrik Olsson Lilja for an intense conversation about law, rhetoric, and the reality behind some of Sweden's most high-profile court cases. After more than 30 years in the legal profession, he was recently voted “Årets advokat” (Lawyer of the Year), a recognition that has cemented his position as one of the country's most respected defense lawyers. Henrik talks about his unexpected path into the profession and how he went from custody disputes to murder cases, insider trading, and sophisticated financial crime. We talk about the infiltrator Peter Rätz, Thomas Quick, the dismemberment murder of Katarina Kosta, and why Henrik believes the Court of Appeal's ruling in the Allra case is wrong. He also shares the strategies behind successful defenses: how to build a narrative that convinces, steer a cross-examination, and get the judge to listen. A fascinating conversation about power, psychology, and the decisive role of rhetoric in the courtroom. In addition, Henrik describes in detail for the first time the dramatic evening when he was shot in his own stairwell by an armed hitman, was hit in the stomach, and had to fight for his life, with the bullet just centimeters from his heart. A powerful and gripping episode about justice, strategy, and the human psyche. Follow the firm Olsson Lilja Advokater here. Read more about Henrik here. Read more about Framgångsakademin here. Check out Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode are in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

Framgångspodden
1019. Henrik Olsson Lilja: How to See Through a Witness, Short

Framgångspodden

Play Episode Listen Later May 24, 2026 28:42


In this episode we are joined by top lawyer Henrik Olsson Lilja for an intense conversation about law, rhetoric, and the reality behind some of Sweden's most high-profile court cases. After more than 30 years in the legal profession, he was recently voted “Årets advokat” (Lawyer of the Year), a recognition that has cemented his position as one of the country's most respected defense lawyers. Henrik talks about his unexpected path into the profession and how he went from custody disputes to murder cases, insider trading, and sophisticated financial crime. We talk about the infiltrator Peter Rätz, Thomas Quick, the dismemberment murder of Katarina Kosta, and why Henrik believes the Court of Appeal's ruling in the Allra case is wrong. He also shares the strategies behind successful defenses: how to build a narrative that convinces, steer a cross-examination, and get the judge to listen. A fascinating conversation about power, psychology, and the decisive role of rhetoric in the courtroom. In addition, Henrik describes in detail for the first time the dramatic evening when he was shot in his own stairwell by an armed hitman, was hit in the stomach, and had to fight for his life, with the bullet just centimeters from his heart. A powerful and gripping episode about justice, strategy, and the human psyche. Follow the firm Olsson Lilja Advokater here. Read more about Henrik here. Read more about Framgångsakademin here. Check out Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode are in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

Unsupervised Learning
Ep 87: Gemini Co-Lead on World Models, RL's Next Domains & Continual Learning

Unsupervised Learning

Play Episode Listen Later May 22, 2026 59:41


Oriol Vinyals, VP of Research at Google DeepMind and co-lead of the Gemini program, joins Jacob the day after Google I/O to unpack the research underpinning Google's latest announcements and where frontier AI is heading. The conversation moves from world models (why Google has uniquely bet on them as a path to AGI, what the "GPT moment" for video and images would look like, and how they connect to robotics and simulation) to agents (the Spark release, why the system and model need to be optimized jointly, and why scaffolding will eventually be written by models themselves). Oriol gets into the mechanics of memory in models, drawing on his cognitive neuroscience background to argue that file-system-style non-parametric memory is more practical than baking memory into weights at serving scale. He shares his views on the limits of RL today (LLMs are data-limited in a way that game-playing RL never was), why training on narrow domains like math and code generalizes surprisingly well, and what a true "Move 37" moment for science or ML research would look like. Throughout, he reflects on the unique advantages of being inside Google (TPU co-design, end-to-end revenue stability, the merger of Brain and DeepMind), the trade-offs between focus and exploration in research orgs, and why he believes AGI in some meaningful sense may already be here, even if the goalposts keep moving.   (0:00) Intro  (1:36) Why World Models  (4:21) The GPT Moment for Video  (7:51) What Makes Omni a World Model  (10:04) World Models & Robotics  (12:37) Evaluating Physics in AI  (14:51) Consumer Agents & Spark  (18:39) Scaffolding & the Bitter Lesson  (22:06) Memory & Continual Learning  (26:54) Research Bets Inside Big Labs  (32:30) Post-Training RL is Greenfield  (35:57) What Real Intelligence Looks Like  (39:11) RL Generalization  (43:00) Advice for Founders  (46:40) Can AI Truly Innovate?  
(49:48) Recursive Self-Improvement  (52:14) Quickfire  With your host: @jacobeffron, Managing Director at Redpoint

[Podfic]
DSMN8: The Second Gig, part 2

[Podfic]

Play Episode Listen Later May 22, 2026 35:57


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now.
Music: Mainstream Music 2025 Vol. 8, produced by Sascha Ende (CC-BY 4.0)
Sounds:
Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0)
Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0)
Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0)
Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0)
Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0)
RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0)
For tags and other details, and to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!

On the product side, everyone is getting Computer: Perplexity, Manus, Cursor, and so on. Meanwhile, on the research side, agentic evals like TerminalBench and GDPVal are also assuming computer (Harbor). On both ends, the consolidating LLM OS stack has become a standard toolkit, and Daytona is one of a small set of AI infra companies that are booming because of it.

“The end of localhost” has been Ivan Burazin's obsession for more than a decade.

Something that is all too familiar…

Long before agents became the default way people talked about software development, Ivan was already chasing the idea that development should not depend on a fragile local machine. CodeAnywhere, one of the first browser-based IDEs, was an early attempt at that future: move the development environment into the cloud, make setup reproducible, and free developers from the endless “works on my machine” tax.

The thesis was directionally right, but the market wasn't ready yet.

However, agents changed that. They do not care about a laptop, desk setup, or favorite editor. They need a computer they can access through an API: something stateful enough to keep working, fast enough to spin up instantly, flexible enough to resize, isolated enough to be safe, and composable enough to run the messy workflows that real software engineering actually requires.

Daytona isn't just selling “sandboxes” in the narrow code-execution sense.
It is the latest version of Ivan's original localhost thesis.

In this episode, Daytona's CEO joins swyx to explain why AI agents need more than code execution boxes: they need composable computers, stateful sandboxes, instant startup, dynamic resources, and infrastructure that can survive workloads going from zero to 100,000 CPUs.

We go deep on the new agent compute market: Daytona's hard pivot from human dev environments to AI sandboxes, the New Year's Eve MVP that customers begged for, why Daytona runs on bare metal with its own scheduler, how one customer runs almost 850,000 sandboxes a day, and why RL/eval workloads went from 0% to roughly 50% of usage in just months. Ivan also explains why agents need Windows and macOS machines, why CLI may matter more than MCP, why Kubernetes is painful for this workload, and why the future AI cloud may look more like Stripe than AWS.

We discuss:
* How Daytona grew out of CodeAnywhere, Shift, and the “end of localhost” thesis
* Why Daytona pivoted from human dev environments to AI sandboxes
* Why agents need composable computers instead of disposable code execution boxes
* The New Year's Eve MVP that customers chased API keys for
* Why Daytona chose bare metal, stateful snapshots, and its own scheduler
* How Daytona spins up one sandbox in ~60ms and 50,000 sandboxes in ~75 seconds
* Why Daytona's biggest customer runs ~850,000 sandboxes a day
* How RL/eval workloads create zero-to-100,000 CPU spikes
* Why RL workloads went from 0% to roughly 50% of Daytona usage
* Why customers compare Daytona against EKS/GKE and say they're “never going back”
* Why every AI agent may need a computer, including Windows and macOS environments
* The Apple licensing constraints that make macOS sandboxes hard
* Why CLI gives agents more power than MCP
* How open source helps agents integrate Daytona
* Why agent-generated PRs may break today's CI/CD assumptions
* Why AI SaaS companies reselling tokens may face a cold shower
* Why the AI cloud may look more like Stripe than AWS

Ivan Burazin
* LinkedIn: https://www.linkedin.com/in/ivanburazin
* X: https://x.com/ivanburazin

Daytona
* Website: https://www.daytona.io
* X: https://x.com/daytonaio

Timestamps
* 00:00:00 Hook
* 00:01:12 Introduction
* 00:03:15 CodeAnywhere, Shift, and the end of localhost
* 00:05:58 What Daytona is: composable computers for AI agents
* 00:08:07 The pivot from dev environments to AI sandboxes
* 00:10:17 The New Year's Eve MVP and customers begging for API keys
* 00:12:56 Bare metal, stateful sandboxes, and Daytona's scheduler
* 00:17:28 60ms startup, 50,000 sandboxes, and 850K daily runs
* 00:21:53 Spiky RL/eval workloads and the new agent infra problem
* 00:28:12 RL workloads, Kubernetes pain, and dynamic resizing
* 00:33:31 Why every AI agent needs a computer
* 00:38:48 macOS sandboxes and Apple's licensing problem
* 00:44:28 Why CLI may matter more than MCP
* 00:48:11 Open source, GitHub stars, and agent integration
* 00:53:11 Git, CI/CD, and agent collaboration bottlenecks
* 00:58:15 Founder life and building a 25-person infra company
* 01:02:44 AI SaaS, token resale, and API-first business models
* 01:06:10 GPU sandboxes, data centers, and compute growth
* 01:09:48 Why the AI cloud may look more like Stripe than AWS
* 01:11:26 Closing thoughts

Transcript

Introduction: Daytona, CodeAnywhere, and the End of Localhost

Swyx [00:00:02]: Okay, we're in the studio with Ivan Burazin, CEO of Daytona. Welcome.

Ivan [00:00:07]: Thanks for having me, man.

Swyx [00:00:08]: Ivan, you and I go back.

Ivan [00:00:10]: Way back.

Swyx [00:00:11]: I don't even know how. Did you reach out, for Shift?

Ivan [00:00:17]: I reached out to you. I was one of the co-founders of CodeAnywhere, the first browser-based IDE, and we had been thinking for a long time that localhost should die.
And you had this article.

Swyx [00:00:29]: End of localhost.

Ivan [00:00:30]: Then I reached out to you because of that, and we talked. I was actually at a different job, I was the head of developer experience, and you were quite well-versed in that. I reached out to you, among other people, to ask how to go about it and what the key things were at that point in time. You were nice enough to take the call, and I remember I was late on the call with you.

Swyx [00:00:51]: I don't remember.

Ivan [00:00:52]: I remember because I was with my girlfriend or wife at that point in time, I'm not sure which. It's the same person, so that's great. I was late because we were in Italy on vacation. I felt so bad, and you were so nice about it.

Swyx [00:01:10]: The reason I'm nice is because I'm also late to other people, so who's without sin here? For those who don't know, there's InfoBip Shift, this whole thing that you did in the past, and that was basically one of the inspirations for me starting AI Engineer. I have to thank you for giving me that push to be like, “Oh, you can build and sell conferences?”

Ivan [00:01:34]: I remember at the beginning you asked about giving me advisory shares, and I was so focused on what we were doing that I said no. I should've taken the advisory shares. So I'm sorry, dude. But anyway.

Swyx [00:01:43]: We're not venture backed.

Ivan [00:01:44]: No, it doesn't matter.

Swyx [00:01:45]: Yeah, anyway. I think what's impressive about you is that CodeAnywhere is the thing that you've been trying to build, and you kind of put it on hold and then came back after InfoBip. Just give us the origin story, going into Daytona.

From CodeAnywhere and Shift to Daytona

Ivan [00:02:05]: Sure.
Really way back, me and my co-founder have been together. I've said this multiple times: it's like we were married and divorced and married. Some people actually ask me whether my co-founder is my partner; they took it literally. It's not literal, but we have done multiple companies together, and to your point, we had this shift where we went from CodeAnywhere to the conference called Shift, and then back to Daytona. We originally started stacking servers, doing virtualization in the early 2000s, routers, basically all these things at a foundational level. That was a services company, which we sold to focus on what my co-founder actually invented, which was the very first browser-based IDE. I say the first; before us was actually Heroku. They did it for a very short time until they became Heroku. But outside of them, we were the only one, and it was called.

Swyx [00:02:55]: There was Cloud9.

Ivan [00:02:57]: Cloud9 came out slightly after us. There was Replit, which came out when we stopped doing it, and they have been successful since then, which is great. There was Nitrous.io. There were quite a few that existed at the time, but it was too early. The interesting part is that at that point in time there was no VS Code, there was no Kubernetes, and Docker had just started; I'm not sure it was even public yet. So we had to build the whole stack ourselves, and that was the key learning that we brought into Daytona and have been using today. So it was super early. About 3 million people used CodeAnywhere. It was angel-backed more than venture-backed. We ended up paying everyone back because it didn't have that sort of scale.
But three years ago, we started something similar with Daytona, which is not what we are today: it was automating dev environments for human engineers, basically the underlying stack of CodeAnywhere. Then we did a hard pivot last January to sandboxes. And so here we are.

Swyx [00:04:01]: Historic pivot, yeah. It's one of those things where I had independently invested in CodeAnywhere, but also in E2B, and then both of you pivoted into the same thing, and I'm like, “F**k.”

Ivan [00:04:12]: You invested in Daytona. But you were the first. If we had not got your check, we wouldn't have done it.

Swyx [00:04:18]: No way.

Ivan [00:04:19]: No, it was like, “We have to get him on board first,” and you were the kicker that got us off the ground.

Swyx [00:04:23]: No, because you were putting me on your pitch deck, man. I was like, “Man, this is like a good trip if I don't invest.”

Ivan [00:04:29]: That's because it was your quote. It's like we

Swyx [00:04:30]: Yeah. It's the end of localhost.

Ivan [00:04:31]: did a bunch of research about end of localhost and who was interested in that.

Swyx [00:04:34]: No, I wrote that blog post, and every single company in that field reached out to me, and then every VC who was receiving those pitches also had to call me and talk through it with me.

Ivan [00:04:47]: It's finally happening though.

Swyx [00:04:48]: It was really super interesting.

Ivan [00:04:48]: It's finally happening.

Swyx [00:04:49]: It's finally happening.

Ivan [00:04:49]: Yeah, it's finally.

Swyx [00:04:49]: It's finally happening, with maybe sort of non-human users. Yeah, so what is Daytona today? Let's get a quick description. I'm wearing the shirt.

What Daytona Is Today: Composable Computers for AI Agents

Ivan [00:04:58]: You're wearing the shirt. Yes.

Swyx [00:04:59]: I think your branding is very good. It's very consistent. It runs AI code.
Like, it cannot be simpler.

Ivan [00:05:05]: Exactly, but we're probably gonna have to change that.

Swyx [00:05:07]: Oh, s**t.

Ivan [00:05:07]: It's a subset of what we do. Unfortunately, because we really love this: Run AI Code is super simple. People interpret it different ways. I think we've given out 5,000 or 6,000 of these shirts. People wear them with pride because it doesn't really market us.

Swyx [00:05:21]: Yeah, Daytona's on the back.

Ivan [00:05:22]: It markets to the person itself, so I think we did a really good job on that one. But it is also a subset of what we do, because when people think about Run AI Code, they just think about these small, let's call them isolates, code execution boxes where you send some code and you get an output. Whereas what Daytona is today is essentially composable computers for AI agents. The market calls them sandboxes, which can be misleading.

Swyx [00:05:44]: All these things. All these things on.

Ivan [00:05:45]: Yeah, exactly. It can be misleading because people usually think about a sandbox as a demo or a test environment rather than a production-grade environment. But think of the laptop that you have in front of you or the computer that's over there. Or, my wife is an architect, so she has a Windows machine with a 3D graphics card inside to do 3D rendering. As humans, we have different computers, or different compositions of computers. Our strong belief is that agents today and going forward will need all these different compositions of computers to do different types of tasks. And so we offer that, basically, through an API.

Swyx [00:06:19]: Yeah. I'm trying to front-load all the aha moments or wow moments so that people stay engaged and click like and subscribe. The market is exploding, right? You have been reporting 74% month-on-month growth, and it's been growing for a while.
Like, it's been going like this. And it's not just you guys. It's every single

Ivan [00:06:41]: Everyone, yeah.

Swyx [00:06:42]: sort of compute provider. I don't know if you agree with me saying compute provider or not.

Ivan [00:06:48]: It's fine.

Swyx [00:06:48]: Yeah. So organic, PLG-driven growth, but enterprise is also doing super well. I wanna rewind to January of last year when you did the pivot. You obviously called this market early, you were positioned for it, and you are now one of the market leaders. But what was the insight that made you do the pivot?

The Pivot: From Human Dev Environments to Agent Sandboxes

Ivan [00:07:06]: The insight that made us do this pivot came the quarter before that, so end of 2024. I think we discussed this as well: Devin was not public. You actually gave me access to Devin at that time.

Swyx [00:07:25]: I did?

Ivan [00:07:26]: Yeah, you gave me access.

Swyx [00:07:26]: I don't think I was supposed to.

Ivan [00:07:27]: Yeah, exactly.

Swyx [00:07:28]: Yeah, I.

Ivan [00:07:28]: So it doesn't matter. You.

Swyx [00:07:29]: Yeah. I gave like three friends access.

Ivan [00:07:31]: Yeah, or it was a call and you showed it to me. It doesn't matter. But OpenDevin was available, which is now called OpenHands. And so we're like, “Oh, this seems to be a thing. This is not public. Let's take our automation of dev environments for humans, take OpenDevin, and launch that as a SaaS.” And we did that. Not very many people signed up and used it, but a lot of people who were building agents reached out, and they were like, “Hey, my agent needs a compute sandbox runtime,” whatever you wanna call it. I forget what it was called at that point. And then we were like, “Oh, amazing. This is a new market. Here is our infrastructure. Here's our product, and go.” And what we found really fast was that people did not like what we had built.
It didn't work. I remember talking to people at the beginning, when we were building the sandbox for agents. People were like, “Oh, why is it different? It's the same thing. We have EC2, we have VMs, we have all these things.” But everyone we gave it to, it was like 20, 30 people, they all said, “No. This is not what we need. This sort of breaks.” And me and my co-founder didn't know a lot about it, because we're infra people. We're not AI people. So I took it upon myself to watch every single podcast that exists, including all of these, get up to date, read all the blogs, and understand what's going on.

Swyx [00:08:45]: Do you wanna shout out who else was useful, just in case people are also looking?

Ivan [00:08:49]: There's a few podcasts, different segments and different types. So there's you guys, No Priors, Bill Gurley's was great while

Swyx [00:09:04]: BG2, yeah.

Ivan [00:09:05]: Yeah, while it was around. So there's a few. 20VC is interesting from a different dynamic. But there was also Redpoint's.

Swyx [00:09:14]: We're not really about the compute market.

Ivan [00:09:15]: It was also already. Sorry?

Swyx [00:09:16]: You're looking at the agent infra market.

Ivan [00:09:19]: I was looking at the agent market and the AI market in general, understanding who the players are, what the perception is, and how that goes. And obviously you complement this with going to conferences, events, and meetups, reading white papers, doing all the things that you have to do to understand what's happening. And when we had an idea of what we had to build, literally on New Year's Eve, I half vibe coded the first MVP, the first minimal viable product of what Daytona is today.
And I went to sleep at like 3:00 AM or something like that. I had just put my baby daughter and wife to sleep, Happy New Year's, and went back to doing this. I sent it to my co-founder, my CTO, and he saw it in the morning. He's like, “This is absolute garbage. Do not show this to anybody at all, but the idea is good.” And so he took two weeks, and he rebuilt it.

Swyx [00:10:09]: Did it look like that? It was a rough idea.

Ivan [00:10:12]: Oh, not even close. It was way worse. It was a simplistic view of what it should be. It worked, but it was not ideal. And so he went down the hole, which is his job as CTO, and he came back with this version. We then called all the people who had said, “This is garbage,” a quarter ago. We set up these calls, and we demoed it to everyone. And all the calls went long, every single one. They were 15-minute calls, and they all went to 25, 30 minutes or whatnot. And everyone said, “We want access.” There was no login, just an API key, because it was just a beta or an alpha. We're like, “Sure, yeah. Okay, thank you very much.” But the next day, if we had not sent it, every single one, every call that we did, came back: “Where is my API key?” Everyone wanted it. We're like, “S**t. This is it.” So one, to your point, most people thought it was the same infrastructure for humans and agents. We understood a quarter earlier that it's not. We just didn't know what the right primitive was. And then when we came, and we can talk about what that is, and we gave it to these people: I've done multiple companies in my life, and I've never experienced this, that people literally call you if you do not give them access.
Like they want access right now. And so it's like, okay, the thing that they want doesn't seem to exist, or they have not found it, and they really want what we have. Then we understood that we're onto something. And when you think about the size of the market: the market for human engineers and enterprise is a very large market, think GitLab or whatnot. But the market for every single agent that will ever exist in the future, what is that market? How big is that? And we're like, “We are all in on this.” And so that is where we made the cut between the old product and the new one.

Bare Metal, Stateful Sandboxes, and the Lambda + EC2 Model

Swyx [00:12:02]: Yeah. But it wasn't composable at the time?

Ivan [00:12:05]: It was basically just a Linux box where you could define the number of CPUs, disk, and RAM. That is what you could do, but you couldn't have multiple operating systems, you couldn't resize it on the fly, you couldn't add a GPU, you couldn't do all those things. It was just the first variation of that, yeah.

Swyx [00:12:22]: Was it bare metal from the start?

Ivan [00:12:24]: It was bare metal from the start. And so the interesting thing that we thought about right away, so our.

Swyx [00:12:29]: To give people the background, what is the normal path?

Ivan [00:12:32]: Yeah, so basically most providers run this on top of VMs. And also.

Swyx [00:12:37]: Firecracker.

Ivan [00:12:38]: Yeah, they run on Firecracker and VMs. We have multiple isolation layers and we can do that too. But in the common way of doing it, one, the state of the machine, the hard disk, is not part of the sandbox itself. And the other thing is they're not meant to last forever. So most of them are preemptible: there's a limit on how long they can live.
And so our thought going into this was that agents will be like humans, in the sense that you don't want your laptop to be shut down until you're done with work. And you want to close the lid and open the lid and have the same state. Agents would want that, the pause and come back. They want those two things. But agents also really want speed, right? So we needed something insanely fast, long-running, and stateful. It's like combining a Lambda and an EC2, right? Those two things together. We didn't have an idea how others did it, because we didn't know that there was a market around this. It was more like, okay, this is what they need. We looked at Kubernetes, and it wasn't good enough for that. We looked at Nomad, and it didn't enable that. And so our history of writing our own scheduler at CodeAnywhere is basically what my CTO came up with. He's like, “Oh, the learnings from there,” and he brought it. And the funny thing is, when our third co-founder saw it, he's like, “Dude, what is this? This is like 2008.” Like, we went back in time, and he's like, “Exactly.” So the reason why Daytona is super fast, and you see this on benchmarks, is that we run on bare metal. We have our own scheduler, and we use the disk, CPU, and RAM of the underlying machine, which means your IOPS are insanely fast because there's no network hop to an EBS volume or something like that. The snapshots, the point-in-time templates, are also preloaded on the bare metal machines. So when you fire off a sandbox from a template or a snapshot, you're essentially directed to the bare metal machine where that snapshot sits on the NVMe drive, and then it literally just turns on that machine, and it's local. There's no network latency or anything there.
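
As an illustrative aside, the snapshot-locality idea Ivan describes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Daytona's scheduler: the `Host` class, `pick_host`, and `start_sandbox` are hypothetical names, and the tie-breaking rule (most free CPUs) is an assumption made only to keep the example concrete.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Host:
    """Hypothetical bare-metal machine with snapshots preloaded on local NVMe."""
    name: str
    free_cpus: int
    local_snapshots: Set[str] = field(default_factory=set)


def pick_host(hosts: List[Host], snapshot: str, cpus: int) -> Optional[Host]:
    """Consider only hosts that already hold the snapshot locally and have room.

    Among those, pick the one with the most free CPUs (an assumed policy).
    """
    candidates = [
        h for h in hosts
        if snapshot in h.local_snapshots and h.free_cpus >= cpus
    ]
    if not candidates:
        # A real system would have to copy the snapshot or queue the request.
        return None
    return max(candidates, key=lambda h: h.free_cpus)


def start_sandbox(hosts: List[Host], snapshot: str, cpus: int) -> Optional[str]:
    """Reserve CPUs on the chosen host and return its name, or None."""
    host = pick_host(hosts, snapshot, cpus)
    if host is None:
        return None
    host.free_cpus -= cpus
    return host.name


if __name__ == "__main__":
    fleet = [
        Host("metal-a", 8, {"python-3.12"}),
        Host("metal-b", 32, {"node-22"}),
        Host("metal-c", 16, {"python-3.12"}),
    ]
    # metal-c holds the snapshot locally and has the most free CPUs.
    print(start_sandbox(fleet, "python-3.12", 4))
```

The point of the sketch is only that placement is decided by where the snapshot already lives, so startup never waits on a network copy of the disk image.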
And so those are the specifics: when we were thinking from first principles about what a computer would look like for an agent, that is what we came up with, and that's what we created.

Benchmarks, 60ms Startup, and 50,000 Sandboxes

Swyx [00:15:02]: Yeah. I don't know if you endorse this, but there's someone that does compute SDK, and you guys do very well on there, with the TTI, right? Is this a relevant benchmark for you guys? I don't know.

Ivan [00:15:16]: I don't know, and it changes every day. So today RKL is.

Swyx [00:15:18]: I don't know what RKL is. Never heard of it.

Ivan [00:15:20]: Yeah. RK, yeah, so it is there.

Swyx [00:15:22]: You are at least a third of the next tier of performance, and then there's a lot of other better-known names that are very slow to start.

Ivan [00:15:31]: Yeah. We've been number one by far for a long time, and now there are different definitions of sandboxes, different isolation patterns, and other things. RKL runs it literally on the S3, the data, so it's very different, and they spin up a container for that, so it's a different type of thing. So the definition of a sandbox is something that we all need to agree on. But yeah, we're insanely fast at getting these things up and running. And you can see even there that it's 0.10 to 0.11, so.

Swyx [00:16:03]: Close enough. Yeah. What else do you need, right?

Ivan [00:16:05]: Yeah. I don't think the benchmarks equate to market ownership or revenue or anything like that. I've seen this with multiple benchmarks, not just in sandboxes, but with benchmarks in general.

Swyx [00:16:20]: It's table stakes. It's just like.

Ivan [00:16:21]: Exactly.
But it doesn't hurt.

Swyx [00:16:22]: Just roughly check.

Ivan [00:16:22]: You definitely have to be up there and competing so that people know that, oh, this is definitely one of the top. Because this is only one dimension of what customers look for. There's other things, like how many can you spin up consecutively? There's the feature set, there's support, there's all the different things that people look at, but you definitely have to be there on the benchmarks.

Swyx [00:16:40]: How many do people spin up consecutively?

Ivan [00:16:43]: So we have.

Swyx [00:16:43]: Or concurrently. Concurrency, right?

Ivan [00:16:45]: There's three metrics that we look at. One is time to spin up one, and our time to spin up one is 60 milliseconds including network latency. Request, spin up, reply, the whole thing, 60 milliseconds. That is one. But if you wanna spin up 50,000 at once, we are now at about 75 seconds. Some others, there's public data around this, take 2,000 seconds, which is roughly 30 minutes. There are different variations of that. So it is speed of one, speed of multiple, and then how many you can consistently have up and running. We basically have no limit right now on how much we can add, because we own our own metal. But our biggest customer does about 850,000 every single day, just shy of a million every single day that they're running. And we do have a request for half a million concurrent, which is literally half a million CPUs running somewhere. So that's an interesting

Swyx [00:17:44]: They pay by like vCPU seconds.

Ivan [00:17:47]: By seconds, yeah.

Swyx [00:17:47]: Or whatever. Yeah.
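
As an illustrative aside, the figures quoted in this exchange can be turned into rates with some back-of-the-envelope arithmetic. The numbers (50,000 sandboxes in about 75 seconds, 2,000 seconds for slower systems, about 850,000 sandboxes a day) come from the conversation; the per-vCPU-second price in `vcpu_second_cost` is a made-up placeholder, since no price is given.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rate_per_second(count: float, seconds: float) -> float:
    """Average number of sandboxes started per second."""
    return count / seconds


# 50,000 sandboxes in ~75 s, as quoted in the conversation.
burst_rate = rate_per_second(50_000, 75)

# The slower figure quoted for comparison: 2,000 s for the same batch.
slow_minutes = 2_000 / 60
speedup = 2_000 / 75

# ~850,000 sandboxes per day, averaged over the whole day.
daily_average = rate_per_second(850_000, SECONDS_PER_DAY)


def vcpu_second_cost(vcpus: int, seconds: float, price_per_vcpu_second: float) -> float:
    """Per-second billing: vCPUs x runtime x unit price (price is hypothetical)."""
    return vcpus * seconds * price_per_vcpu_second


if __name__ == "__main__":
    print(f"burst start rate: {burst_rate:.0f} sandboxes/s")
    print(f"slow batch: {slow_minutes:.1f} min, {speedup:.1f}x slower")
    print(f"daily average: {daily_average:.2f} sandboxes/s")
    # Placeholder price, chosen only for illustration.
    print(f"2 vCPUs for 90 s: ${vcpu_second_cost(2, 90, 0.00001):.4f}")
```

The gap between the burst rate and the daily average is what makes the capacity question interesting: the fleet has to absorb bursts far above the average load.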
Okay, and the other thing is the sleeping and the resuming, the stateful resumption of all these things. What kind of workload are people putting through this? Do we measure by gigabytes in memory, gigabytes in storage, network-attached storage? What are the costly ones out of all these features?

Workload Economics: CPU, RAM, Network, and Storage

Ivan [00:18:15]: The most expensive thing is CPU.

Swyx [00:18:18]: Okay. Yeah, of course.

Ivan [00:18:18]: Then it's RAM, then it's disk. We actually don't charge.

Swyx [00:18:22]: Which is snapshotting, right?

Ivan [00:18:23]: No, snapshotting's part of it, but it's basically the size of the hard disk of your machine. Do you have 10 gigabytes, 20, 50, whatever? And then the transfer of that. Right now we don't charge for network at all at Polychron.

Swyx [00:18:37]: Oh, you gotta, yeah, you gotta fix.

Ivan [00:18:38]: Yeah. It's a larger and larger part of our bill, so we're working on that part. The hard disk is the least expensive, so it's basically CPU, RAM, network for us, because we don't charge the customer, and then hard disk: that's how it's split up. But there are also different types of workloads, and we basically split them into two types in Daytona. One is what we call background agents or long-running agents, and the other is RL and evals, which I put together. They have very different patterns of usage. And I'll just name names of companies, not specifically.

Background Agents vs. RL/Evals: Two Usage Shapes

Swyx [00:19:21]: Yeah, OpenHands, All Hands.

Ivan [00:19:23]: Yeah. So a background agent is a Cognition, a Lovable, all these things, Harvey.
These are all long-running, background agents. And so if you look at their usage patterns, their usage patterns are similar to human, which is like follow the sun. Basically, the usage patterns of that is like noon is probably the highest, and the midnight is the lowest, and then weekends are lower. weekday is higher.Swyx [00:19:42]: Yeah, that's a fun question. How global is it? Is it very US-centric or?Ivan [00:19:46]: The US is a large part, but we have currently, we have Asia, Europe, and the US regions.Swyx [00:19:52]: So it's quite global.Ivan [00:19:53]: Yeah, it's quite global. We have it all over. It's interesting that our I talked to you a bit about this. Our number one city by user.Swyx [00:20:01]: Hmm.Ivan [00:20:02]: Is Singapore.Swyx [00:20:04]: Oh, wow. Amazing.Ivan [00:20:05]: Which is an interesting one, right? Not by revenue, just by just like by individual head count.Swyx [00:20:09]: Really?Ivan [00:20:09]: Just like an interesting thing.Swyx [00:20:10]: Singapore is, Singapore is weirdly high in the adoption charts of AI for the population. It's like an, seven, eight million population. And it's like keeps showing up.Ivan [00:20:20]: No, it's quite interesting. We were quite shocked, and I was like, “Oh, this is interesting.” And also one that's up there.Swyx [00:20:24]: There's a reason I'm doing AI using Singapore. it's because I'm from there.Ivan [00:20:27]: We're there. We're gonna, we're gonna be there as well. and it's interesting that Japan is in the top or like Tokyo's in the top, which is in all the tech cycles it has never been. It has never been, so it's quite interesting that they're.Swyx [00:20:39]: I think the Japanese just love AI. Yeah. It's that, and then it's Brazil. 
That's it.Ivan [00:20:44]: Brazil has always been in.Swyx [00:20:45]: I think.Ivan [00:20:46]: Even when I look, if you look at like GitHub's data and ask historically with CodeAnywhere, it was always like US, Western Europe, and then you'd have like India, Brazil, China, like that would be there. But like Singapore was not in, specifically Japan was never in sort of that top, that top.Swyx [00:21:01]: Yeah. Weird pockets.Ivan [00:21:01]: Weird. Yeah, so it's very global.Swyx [00:21:02]: Okay, so actually that, but that's helps you to distribute your load through, all time?Ivan [00:21:08]: The interesting thing is like we have those kind of loads, but if you look at the researcher loads, they're quite different. So what they are is like if you give them concurrency of 10,000 or 50,000 or 100,000 CPUs at ARMb, when they fire off a run, it's just 100%. And then it just runs, and then it stops. So it's very, the usage pattern is squares basically, right? And it's also not follow the sun, because people will fire it off at midnight before they go to sleep but then wake up and so it's very unpredictable, so you don't know where that is. So the shapes of the usage are quite different than we have had before. And also what's interesting is when it's sort of a follow the sun, even if you have a high growth company, you can sort of predict your usage patterns and have enough capacity for that, because it's sort of, it grows in a, in a way you can project. When you have companies doing sort of like evals and RL, they're super spiky. So they're gonna come in, it's like, “We're gonna use nothing, then can we have 100,000?” Right? And then go back down. And then 100,000, go back down. So it's very different, right? 
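The two usage shapes described here, "follow the sun" for background agents and "squares" for RL and eval runs, can be compared by their peak-to-mean ratio. The sketch below uses entirely synthetic load curves; the peak sizes, burst start times and burst length are assumptions chosen for illustration, not Daytona data.

```python
import math

HOURS = 24 * 7  # one week of hourly samples


def follow_the_sun(hour: int, peak: float = 1000.0) -> float:
    """Synthetic diurnal load: lowest at midnight, highest at noon, halved on weekends."""
    hour_of_day = hour % 24
    day = hour // 24
    daily = 0.5 * (1 - math.cos(2 * math.pi * hour_of_day / 24))  # 0 at midnight, 1 at noon
    weekend_factor = 0.5 if day >= 5 else 1.0
    return peak * (0.2 + 0.8 * daily) * weekend_factor


def square_bursts(hour: int, peak: float = 100_000.0,
                  starts: tuple = (23, 70, 120), length: int = 6) -> float:
    """Synthetic RL/eval load: nothing, then full concurrency, then nothing."""
    return peak if any(s <= hour < s + length for s in starts) else 0.0


def peak_to_mean(series: list) -> float:
    """Capacity must cover the peak, while revenue roughly tracks the mean."""
    return max(series) / (sum(series) / len(series))


sun = [follow_the_sun(h) for h in range(HOURS)]
burst = [square_bursts(h) for h in range(HOURS)]

print(f"follow-the-sun peak/mean: {peak_to_mean(sun):.2f}")
print(f"square-burst peak/mean:   {peak_to_mean(burst):.2f}")
```

With these made-up parameters the diurnal curve peaks at roughly twice its mean, while the bursty curve peaks at more than nine times its mean, which is why capacity planning for the second shape is so much harder.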
And.Swyx [00:22:09]: Do you want to lock them into commits so.Ivan [00:22:11]: Yeah, we do.Swyx [00:22:12]: Yeah, okay.Ivan [00:22:12]: We so we have to lock them into some sort of commits to have that capacity, because we have to have, basically we have to have the capacity for peak. Right? And so right now, Daytona's mean utilization is 15%, 1-5.Swyx [00:22:25]: Oh my God.Ivan [00:22:26]: So it's very low.Swyx [00:22:27]: Because it's very spiky.Ivan [00:22:27]: It's very spiky, but we get up to 90%. so we have these things. And so what we're, what we're looking at right now as a company is similar to Cloudflare where you can like geo move things around, but that works really well for basically the background agent where it's follow the sun. But this, it's not. Like it's a very different shape. Obviously with scale you figure these things out, but that's an interesting new problem that we have, as a compute provider in the agent space. And when we were doing the conference recently, and so we talked to like Nikita from Neon and.Swyx [00:22:57]: I should bring it up.Ivan [00:22:58]: Parag from Parallel and whatnot, everyone has the same problem. Whereas the usage is super spiky, and this is something that has not happened before, that you have these types of like it was always, it the amplitudes were not this high, right? So it's quite interesting use case and problem solve.Compute Conference and Spiky Agent InfrastructureSwyx [00:23:12]: Yeah, I don't know if we're gonna bring this up again, but let's just talk about the conference, you had like 1,000 something people at the Warriors game, at the Sorry, where is it? What's.Ivan [00:23:22]: Chase Center.Swyx [00:23:23]: Chase Center.Ivan [00:23:23]: Chase Center.Swyx [00:23:24]: I went. It was, it was very impressive. Obviously, you can, how to throw a conference, what did you learn? 
you put, you pulled together all these impressive names.Ivan [00:23:33]: What I.Swyx [00:23:34]: What were you looking for?Ivan [00:23:35]: My thesis behind the Compute Conference was let's bring together people that are building infrastructure for AI agents. Because when I think of what we're building, it is the agent is the primary user, what are the ergonomics and usage patterns of agents, and so we can do that. And what I found, this was a theory, it wasn't proven, is that we all have these problems, as I touched onto. And I was, as I was talking on stage, it was like we all have the same underlying infra problems, which is this spiky workloads, unpredictable workloads that we've never had before, in human, compute or human infrastructure. And it's, again, it's the same when I was talking to Parag or when I was talking.Swyx [00:24:20]: Lynn. Nikita.Ivan [00:24:21]: Lynn, Nikita. Lynn especially, I was talking to her the other day as well. Like the It is a very interesting type of problem to solve because I can touch on Cloudflare because there's a lot of like talk about that recently as to how they solve that, which is they have a bunch of geos, and basically, as users work in different places, and depending on your tier, they can move you around the geos. And so that how, that's how they get the higher utilization. But you can sort of predict these, and it's If it's something in You'll rarely get a spike that is 10 orders of magnitude. Like you'll get a like let's say one of your customers has some like an exponential curve. What is that to I'm using Cloudflare as an example. 10%, 20%, whatever it is. I don't, I don't have this data, I'm just assessing. It's surely not 10x, right? It's surely not something there. And so how do you go out and solve this problem? And we're all solving this in different ways. So we have.Swyx [00:25:11]: She also has the same thing.Ivan [00:25:12]: Yeah, I know specifically that like Neon had that issue as well. 
Like how are we solving these spiky loads and things like that ‘cause we talked about it. And so the interesting thing for me to actually internalize was, yes, everyone that's building for agents first is going through this, and we're all solving similar problems, which is quite.Swyx [00:25:28]: Let me let me double-click on this. Okay. So for example, Neon, I happen to know that they're very sort of S3 oriented, right? so they're just like fully bet on S3. And you get to benefit from S3's distribution and infrastructure. So I would imagine that Neon doesn't have to care, whereas Lynn maybe has to care a bit more because obviously she's doing GPU inference. And, for listeners, we did an episode with her, one and a half years ago. And you have to care. But like, right?Ivan [00:25:54]: Parag cares for sure, and Nikita.Swyx [00:25:58]: And Parag is C of, Parallel.Ivan [00:25:59]: Parallel, yeah.Swyx [00:26:00]: Former CTO of Twitter.Ivan [00:26:01]: Twitter, yeah.Swyx [00:26:02]: They are the search.Ivan [00:26:03]: Yeah, they're search, yeah.Swyx [00:26:03]: I You and I know but the listeners don't know.Ivan [00:26:08]: Yeah, we can put it down in the screen, and so ‘cause we, when we were talking.Swyx [00:26:11]: I'll put it up on the, on the screen.Ivan [00:26:12]: Yeah, right.Swyx [00:26:12]: People can look it up if they need.Ivan [00:26:14]: Look it up. And, yes, but they still have CPU and RAM, allocation that you have to have up and running. And so CPU and RAM, you have to allocate that and have that ready. And so there's basically two ways to do it. 
One is you either over-provision and you can handle the bursts, or two, you basically have, I don't know if this is a term, just-in-time compute, which is like as your load becomes, as your usage comes in, you can fire off requests for VMs or bare metals at other cloud providers and then get them up and running.Swyx [00:26:43]: This is if you go above 100%, right?Ivan [00:26:45]: Yeah, this is.Swyx [00:26:46]: Like your overflow.Ivan [00:26:46]: If your overflow, like spillage or whatever you do.Swyx [00:26:48]: You probably lose money on it, but it doesn't matter, right?Ivan [00:26:50]: It, not Well, you might, you might not That is a more cost-effective way to do it but it's a slower way to do it. Because basically what you have to do is you have to like queue your requests, spin up these just-in-time compute, get it all ready, provision it, and then get your workload there. And so if the time isn't important that much, that's fine, and you can do that. But if your customer, and especially for, let's say, the RL training runs, the reason why a lot of people come to us is because GPUs are more expensive than CPUs, right? So you want your GPU running at, what, 100% the entire time. And so when you're running runs on CPUs, when the when the CPU cycle is like down and spinning up the next one, you want that to be instantaneous so that your GPU doesn't go down, right? And if you then have to like go out and provision machines, you're essentially telling the GPU that it has to wait, and that's incurring our cost. So there's things that you have to try to solve for there.RL Workloads, Declarative Images, and Kubernetes ReplacementSwyx [00:27:43]: Yeah, let's talk about the different workload, right? You said that, what was it? A few months ago, you had zero RL workload and now it's 50%.Ivan [00:27:52]: It will be this one, 50%, yeah.Swyx [00:27:54]: Let's talk about how different it is, right? 
Like I imagine, for example, a lot less dynamic code generation of like arbitrary code. Like here, it's probably all the same code. You're just doing parallel runs or something, I don't know.Ivan [00:28:05]: Yeah. So you'll have multiple. It depends, but for each run, you'll have a snapshot. And for the most part, they actually do use our declarative image builder, which is like, “Oh, the agent wants these dependencies, these env vars.”Swyx [00:28:17]: These ones, yeah.Ivan [00:28:18]: Yeah, the declarative image builder, it.Swyx [00:28:20]: Which is a very Modal-like thing that they.Ivan [00:28:22]: Yeah. And so we build it on the fly and then we propagate that snapshot, and you can spin up as many sandboxes as you want against that snapshot. And then if you have to do changes, the model can, or like it could also be automated. It's like, “Oh, now for the next run, we need to install these things or remove these things or whatever to get a task done,” and then it goes off and runs that. So yes, that is something that it seems that they prefer. The number one reason I found, or should I say, let's take a step back. What we are competing against in that environment is essentially managed Kubernetes. So EKS, GKE, whatever. That is what the vast majority run on. And anyone that has tried Daytona versus GKE, EKS is like, “I'm never going back.” That has always been. There's a few reasons. One is the ergonomics. So if you're using Kubernetes to spin that up, you have to essentially manage the interface interactions with that. Daytona, although a compute provider, is more akin to a Twilio or a Stripe from a consumption perspective than to an AWS. Like you have an API, an SDK, it's quite easy and seamless to get these things up and running, that's one. The other is the speed at which we spin up, which we mentioned earlier, which is much faster, and the scale we can go to. 
We haven't got into features, but an interesting feature is that it's very hard to OOM, or run out of memory in, our sandboxes, because we can dynamically, on the fly.Swyx [00:29:48]: Resize.Ivan [00:29:49]: Resize, which is like impossible on almost any other thing. There are some technologies that enable you to do that, but it's like a very hard thing. And so we actually saw this when the Terminal-Bench team brought us in, actually. So thank you, Alex and the team, that brought us into this whole space.Swyx [00:30:05]: It's just very rare that a framework would just say, “Guys, just use Daytona.”Ivan [00:30:11]: Yeah, I think it says it somewhere. Yeah.Swyx [00:30:13]: Yeah. I was like, “What is this?”Ivan [00:30:15]: There's multiple there, but they also mention a few other places. And so Daytona specifically, we have, just jumping on themes here, I don't know where it says Daytona.Swyx [00:30:27]: I, there.Ivan [00:30:27]: Doesn't matter.Swyx [00:30:28]: There's a very strong recommendation, which is very unusual. Which is, it's.Ivan [00:30:33]: We do not pay them for this, just.Swyx [00:30:34]: I know, yeah. They just like you.Ivan [00:30:35]: Yeah, they like us. Yeah, and also a thing, so, Daytona has multiple isolation sets underneath. The customer doesn't have to know what they are. But basically we have Docker, which is a container, that's hardened with Sysbox. So it's Docker isolation that is the security equivalent of a VM, but it's still a container. And that is the default, and they, especially in these training workloads, really like that as an interface, to be able to use just a basic Docker container, and we enable Docker-in-Docker. Which for these RL runs, if you need to do a Docker Compose or Kubernetes, you can spin up a k3s inside of these things, which unlocks a huge amount of workloads that you can do that you cannot do on other providers. So just on that part it is much more interesting. 
And so we went through that. We showed them that we could do that, and they enjoyed that quite a bit. They being the Terminal-Bench people.Swyx [00:31:28]: Those people, yeah.Ivan [00:31:29]: And Harbor people.Swyx [00:31:29]: Harbor people, are they a company yet?Ivan [00:31:33]: I do not know.Customer Pull, Slack Connect, and the Computer Use BetSwyx [00:31:35]: Okay. All right. Yeah. It's like super obvious that there's a lot of excitement and success around these things, okay, so yeah, tell us more, right? Like, this is an exploding workload, Harbor adopted you, which helped speed things along. But what are you learning as this new workload comes online?Ivan [00:31:53]: There's a couple things that we learned, which we chatted about in the beginning. And this has led our story, as we mentioned, we talked to a lot of customers along the way, and we add more features and more tool sets as we talk to customers. And it's interesting, and I think it's that the ecosystem is so small and/or the models get smarter, where when we see one user come with a request, we know it goes on a roadmap if like three to five customers come with the same request in that week. It's like very bizarre. It happens so many times, which is.Swyx [00:32:27]: Because they're all friends.Ivan [00:32:28]: Sorry?Swyx [00:32:28]: They're all friends. They're all in the same group chat.Ivan [00:32:30]: Yeah, probably, yeah. ‘Cause they're like, “Oh, can you do this?” And I'm like, “Okay, this is interesting. We'll put it on a feature request.” And then the next one's like, “Oh, can you do this?” “Okay.” It's all the same, right? It's always the same. And so what we try to do, and I personally try to do, I try to be on as many, quote-unquote, “sales calls” as I can. I'm in every Slack channel. We literally have about 1,000 Slack Connect channels, something like that. 
It's an interesting, there's so many interesting things you find out when you have all the Slack channels. You can also see where people, transfer between companies. You see leave Slack channel, enter Slack channel. It's an interesting thing. Also, just I digress, I feel that Slack Connect is literally LinkedIn what it should be. You have a list.Swyx [00:33:08]: LinkedIn charges you to, use your own connections, but Slack doesn't, right? Slack is like, do it for free. It's more lock-in. It's great.Ivan [00:33:15]: Yeah. It's amazing. Yeah. It's one of the reasons.Swyx [00:33:17]: You're gonna pay Slack for life.Ivan [00:33:18]: Exactly. You're there for life. So that's interesting. And so one of the things, the newer things we were talking about earlier is we made a big bet and put a lot of investment on computer use. that is not seen publicly the light of day. We haven't GA'd that yet, but we have.Swyx [00:33:32]: Is there a thing I can pull up?Ivan [00:33:33]: There is computer use there. It's right up a bit.Swyx [00:33:36]: Oh, yeah. Okay.Ivan [00:33:38]: What we have, what we talked about and what we've seen publicly is there's this theme now about, the human emulator where And Elon from XAI has talked about this publicly, and if you think about the models today, they're actually quite sophisticated and they can do a lot of work, but they still don't have access to all the tools. Like, I'm a strong believer that the most efficient way for an agent to work is essentially headless or through, terminal or whatnot. But if we, if we look at knowledge work in general, there's about 100 million knowledge workers in the US, about a billion in the world, and knowledge workers, and the salaries of them aggregate to 10 trillion in the US 50 trillion worldwide.Swyx [00:34:24]: Wow.Ivan [00:34:25]: Something like that. And if we look at, the five most important sectors of that, so like healthcare and government and financial services and whatnot, that's about 56% of that. 
So let's say it's about half of that. So in the US it's about 25 trillion, and most of that work is actually still locked into legacy apps inside of Windows, which is not going anywhere for a very long time. Like, people just won't invest in that. How much of it? Our assumption is the following: in the RPA market, which is a similar market, well, not the same, 25% of these white-collar workers' work is automated. If an agent is more sophisticated, can go through more runs, figure stuff out, let's say it's 40%, right? And so if you take 40% of that, you get to essentially $10 trillion a year.Swyx [00:35:17]: That's a TAM.Ivan [00:35:18]: That is a TAM. So that's the TAM of the models, right? That's not essentially ours. But you get to that size, and to be able to do that, you essentially have to give agents these computers with the legacy. So computer use, either Mac or Windows or Linux. Linux we also obviously have and others have. But Windows specifically is something very new, and the only option right now is an EC2 with Windows or on Azure. Both of them take anywhere from three to five minutes to spin up. We've created an actual sandbox, so it's a second instead of milliseconds, but you have point-in-time snapshots, you have forking, you have all the things that you have from a sandbox, but it essentially enables you to hopefully unlock all this value. And so that's been our big push and bet, but we've sort of kept our ear to the ground. What is sort of the next things in the market?RPA Returns: Why Agents Still Need ComputersSwyx [00:36:06]: Yeah, knowledge work, and building, and sort of RPA, the next wave of RPA. I got very excited about RPA kind of during COVID times. UiPath was IPO-ing. And it was a very hot. Isn't it Eastern European?Ivan [00:36:20]: It is, Romanian.Swyx [00:36:21]: Romanian? Yeah, it might be the only Romanian big unicorn, okay, yeah. 
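The market-size estimate in this exchange is a short chain of multiplications. As a minimal sketch, assuming the $25 trillion base is half of the $50 trillion worldwide salary figure quoted earlier (the "about 56%" share rounded down to one half, as in the conversation), the numbers work out as follows. The constant names are illustrative.

```python
# Figures as quoted in the conversation, in USD per year.
WORLD_KNOWLEDGE_WORK_SALARIES = 50e12   # aggregate knowledge-worker salaries, worldwide
TOP_SECTOR_SHARE = 0.5                  # "about 56%", rounded to half by the speaker
RPA_AUTOMATION_SHARE = 0.25             # share of work quoted as automated via RPA
AGENT_AUTOMATION_SHARE = 0.40           # speaker's assumption for more capable agents

addressable = WORLD_KNOWLEDGE_WORK_SALARIES * TOP_SECTOR_SHARE
tam_rpa = addressable * RPA_AUTOMATION_SHARE
tam_agents = addressable * AGENT_AUTOMATION_SHARE

print(f"addressable salary base: ${addressable / 1e12:.2f}T")
print(f"at the RPA share (25%):  ${tam_rpa / 1e12:.2f}T")
print(f"at the agent share (40%): ${tam_agents / 1e12:.2f}T")
```

The 40% case reproduces the roughly $10 trillion a year mentioned above; the same base at the 25% RPA share would be about $6.25 trillion.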
I think there's a stage being set for the resurgence of RPA, ‘cause everyone understands that, yeah, no one wants to deal with these shitty apps and no one's gonna rewrite them. Like, you just have to do a remote operation and programmatic operation of them.Ivan [00:36:45]: If you wanna unlock it, my own setup was basically the following. So I was doing a board deck recently, last month, whatever, and I'm like, “Okay, let's just do it automated.” So, all our data's in ClickHouse and PostHog and QuickBooks, where everyone else's is, and I basically connected that all to my Claude Code, like go off, Claude Code, whatever. Go off and, here's the integrations, go do that. It pulled out the first report, which was great. It connected to Brex and all these things, pulled it, which was great, and then I say, “Okay, now pull out this, and this,” and I kept getting really well-designed, McKinsey-style reports, but the data said partial data. All the missing data, partial data. Like, it can't access all the things, and I got so frustrated, and so I got my Mac Mini virtual sandbox with OpenClaw. I gave it its own account in our company, and then I went to all these services and created a read-only account, so literally like an intern in your company. And so I would say, “Now go and do this report,” and it would get the same, or like, “I can't via the MCP or the API or whatever. I can't get all the information.” I'm like, “Go log in.” And it will log into the website, then go in, export the data. It'll export the data and do the thing end to end. So even for things that have APIs today, not all of it is exposed, and I get immense value right now, but it has to be computer use, unfortunately, and so I spend a bunch of tokens just on that, but I get the job done. 
And so if even a startup like ours, and using all the hottest tools, still needs a computer agent what hope does, Goldman have to have a headless, right?Swyx [00:38:22]: Yeah, what a - Why isn't Microsoft doing this?Ivan [00:38:27]: I'm pretty sure, Satya had a post yesterday.Swyx [00:38:29]: Oh, okay. I see.Ivan [00:38:29]: Which was like, “Every agent needs a computer.”Swyx [00:38:31]: I see, I see.Ivan [00:38:32]: So they have launched something recently.Swyx [00:38:34]: Yeah, they have Microsoft Power Automate, I'm sure, I'm sure, they're gonna have their version.macOS Sandboxes, Apple Constraints, and the Windows OpportunityIvan [00:38:39]: Version of that, yeah.Swyx [00:38:39]: You're gonna try to do yours, and it - I always know there's always demand for Mac, but I know it's, tricky to host, macOS sandboxes.Ivan [00:38:49]: We will have macOS sandboxes fairly soon. The problem with macOS, OS sandboxes is, I'm deep in this, I don't know how much interesting is.Swyx [00:38:55]: No, it's.Ivan [00:38:56]: MacOS has this problem.Swyx [00:38:57]: It's a licensing thing, right?Ivan [00:38:58]: Licensing thing. So one, you're allowed to run only two parallel VMs per machine, so that's one. Two, you can only license to a different user every 24 hours. So if you come in and theoretically, if I wanna charge you per second and I charge you one second, I have to have it idle for the rest of the day. I can't have anyone else doing that. So the pricing will be different in the sense that I will have to - we would have to charge for 24 hours, and that's not even, that's not even the most difficult thing. But the, thing above that is, from a security perspective, they enable you to do memory snapshot, pause, resume, but only on the same physical drive, physical machine. And so what you can do in, Windows world or Linux world is that I can move in the background, your snapshot from one to the other and manage load, right? 
Here, if you wanna do that, you essentially have to have your.Swyx [00:39:49]: Yeah, snapshots. Yeah.Ivan [00:39:50]: Your.Swyx [00:39:51]: It's like.Ivan [00:39:51]: Physical machine.Swyx [00:39:52]: You can't break it up.Ivan [00:39:53]: You can't, you can't move things around that, and all of that is, that part is, from a security standpoint, if it is written. Like, I understand the security aspect of that, but it disables you from doing these agentic, like really scalable agentic workloads.Swyx [00:40:08]: You need to do a vibe-coded, clean room implementation on macOS that you can then - That's like Clean OS or something. I don't know.Ivan [00:40:17]: So. We have.Swyx [00:40:18]: ‘cause like Linux was originally like a clean room rewrite of Unix.Ivan [00:40:21]: Okay. Yeah.Swyx [00:40:21]: Or something like that, right? Like same thing to macOS. Someone needs to do it.Ivan [00:40:25]: Someone will do that, and someone will have some long-running agents for a few days to figure this stuff out. But yeah. So definitely we - we're really close to offering something ‘cause people do want it, but the pricing will be different, and the feature set will be sort of stringent.Swyx [00:40:38]: Yeah, nobody's gonna use this. like, the labs, the labs will because they want to automate macOS.Ivan [00:40:42]: They have to do RL. They have to do RL again. But even if you The - So the point is with the RL part, if you, if you do RL on macOS, then the next iteration of the model comes out, it will be able to use these tools significantly. Then you actually need to run those, that somewhere. So you're gonna have to have that, later on. And from, if anyone at Apple is listening, I very much feel that they are shooting themselves in the foot of the scale of the revenue of compute or licensing they could get if they would just enable a concurrency model similar to what you can get on a Windows and a, and Linux.Swyx [00:41:17]: Yeah. Yeah. And I'm sure they've heard this before. 
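The pricing consequence of the macOS constraints described here can be sketched in a few lines. This assumes, as the conversation suggests, that a machine cannot be handed to a different user for 24 hours and that at most two VMs run per physical machine. Rounding usage up to whole 24-hour periods is an assumption made for illustration, not a stated pricing rule, and the function names are hypothetical.

```python
import math

LEASE_SECONDS = 24 * 60 * 60  # a machine stays tied to one user for 24 hours
VMS_PER_HOST = 2              # quoted limit on parallel macOS VMs per machine


def billable_seconds(used_seconds: int, min_lease: int = LEASE_SECONDS) -> int:
    """Round usage up to whole lease periods (illustrative assumption)."""
    periods = max(1, math.ceil(used_seconds / min_lease))
    return periods * min_lease


def hosts_needed(concurrent_vms: int, vms_per_host: int = VMS_PER_HOST) -> int:
    """Physical machines required for a given number of concurrent VMs."""
    return math.ceil(concurrent_vms / vms_per_host)


print(billable_seconds(1))       # one second of use still occupies a full day
print(billable_seconds(86_401))  # just over a day occupies two
print(hosts_needed(1_000))       # machines for 1,000 concurrent macOS VMs
```

Under these assumptions a single second of use blocks 86,400 seconds of capacity, which is the gap between per-second and per-day pricing that the speakers describe.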
They just don't care. Yeah, it's And maybe they will change their mind with the new CEO.Ivan [00:41:24]: Yeah. We'll see.Swyx [00:41:25]: We'll see.Ivan [00:41:25]: High hopes.Swyx [00:41:26]: High hopes.Ivan [00:41:26]: High hopes.Swyx [00:41:27]: Okay. But I, it's very clear the market opportunity is huge in Windows, and you can go for a long time on just Windows, but your customers are gonna want both. and I think, it is interesting to me that, this is the sort of God application of agents, right? Like, I don't It was - How big was OpenClaw for you guys? Like, was it, was there, a significant bump.OpenClaw, Agent Labs, and the B2B2C Sandbox MarketIvan [00:41:54]: Not for us because we.Swyx [00:41:54]: Because you already.Ivan [00:41:55]: We're kind of positioned differently. Whereas although it's completely PLG and we have individual developers that use it, most of the users that use Daytona are sort of a B2B2C. Sort of it's either B2B or B2B2C. So, in the researcher world, it's B2B, so you're selling to, labs and neo labs and things like that. But on the long-running agents, it's mostly, from a scale revenue perspective, it's mostly B2B2C, where you have a app layer agent that uses you at a big scale.Swyx [00:42:26]: Like a Manus. Yeah.Ivan [00:42:28]: Like a Manus Lovable type of thing.Swyx [00:42:31]: Yeah. I think that's the question of, well how, um-Uh, yeah, B2B to C is basically to me what I've been calling an agent lab, which is kind of like you're not in a model lab, but you're making a very good wrapper that is a platform that other people can sign up so they don't have to code those things. Yeah, it sound, it sounds like a much better market than the direct OpenClaw market.Ivan [00:42:56]: I've like - We I've done multiple things. So the CodeAnywhere's part of our career path R in the calendar, was very much an end user developer product. And so that is great. 
It You can get a lot of developer love, and I feel that we do as a company have a bunch of developer love. But it's a different type, where it's people building these things. Again, it's more akin to a Twilio because you don't really run - As a person, you wouldn't run Twilio. I don't know how many people remember. It was like ask your developer billboard and whatnot. And people really love Twilio, but they only used it inside of like, “Oh, I'm building this app or service for thing.” And so we're very much directly to that. And you also know that I used to work for a competitor for Twilio, so it's kind of ingrained, in my DNA.Swyx [00:43:35]: People don't know InfoBip is that big.Ivan [00:43:38]: Yeah, it's.Swyx [00:43:39]: Because.Ivan [00:43:40]: It's a billion euro.Swyx [00:43:40]: They're all American. They're like, “Whatever's in Europe doesn't matter to me.” But like it's the, it's the same size or bigger? Same size?Ivan [00:43:46]: It's about half the size.Swyx [00:43:47]: Half the size?Ivan [00:43:48]: Yeah, about half the size.Swyx [00:43:48]: It's like, yeah.Ivan [00:43:48]: Still huge. Multiple billions a year. Yes.Swyx [00:43:51]: That's crazy.Ivan [00:43:51]: Exactly, and so that - These are like really interesting and large revenue-generating, very sticky businesses. Whereas when you're selling to the - When your focus is the end developer, it is a very hard sell because they're very price sensitive, very price conscious, very around that. And there's very It's very hard to scale. Your cap is the number of people that are willing to spin up - First of all, wanna spin that up, and then spin up multiple of these. Whereas if you're in the enterprise one, like we know everyone's talking about like how many tokens they're spending, I'm spending. Like a lot of companies today are like, “If this is our company, spend as much as you can.” Like basically that is where we're going. 
And so if you think about that paradigm, where you're selling to companies that say, “Spend as much as you can to generate productivity,” versus, “Oh, I'm a single person. I have this much budget, and I'm doing this thing because it's fun or it's helping me out or whatever.” Like it's a different go-to-market strategy, I think.MCP, CLIs, and Sandboxes as the Agent RuntimeSwyx [00:44:50]: Yeah, there's a lot of discussion. I'm just kind of going through like the mental list of things that are in your favor, which is, for example, MCP versus CLI. Like obviously you want CLI. It's been very good for you. I feel like it's maybe a drop in the bucket or maybe it's huge. I'm just checking whether these are big trends.Ivan [00:45:10]: Those things work well in our favor, to your point, just because every.Swyx [00:45:13]: They're kind of drop in the bucket, right?Ivan [00:45:15]: I think it's like sort of all the things come together. And so there's so many things that impact that. To your point, like OpenClaw wasn't huge for us, but like having the agent SDK from Anthropic, or Claude Code, was very interesting. The reason why it was interesting is that a lot of, let's call them, I don't know what to call them, app layer agent companies, essentially they are like, “Oh, I can create this new app, this new agent. All I need, I just use Claude Code, and I throw it into a sandbox, and then I have my interface to the human to that.” And so that enabled so many more companies to actually offer this, and then they would pull on sandboxes. So that was interesting. And to your point, like MCP versus the CLI, the MCP is an interface against an API, whereas the CLI is like you can actually go do things. Like this is it. The difference between integrations and actually running scripts or data or analysis against a thing. 
So being able to use a CLI very well enables the agent to do more things. That's because people will invoke a sandbox, run the CLI in it, and it'll do analysis on that data and then give you an actual result, versus just pulling data from an API source.
Swyx [00:46:29]: Yeah, it's a layer of indirection, basically. It's the same thing as agentic search versus RAG, where you're -
Ivan [00:46:34]: Exactly, yeah.
Swyx [00:46:34]: You just win whenever people put more agents into their workflow. So it doesn't really matter, but I'm just kind of teasing out what else people have heard about, where it's, “Oh yeah, this is another sandbox use case. Oh yeah, that's another one.” Am I missing any big ones?
Ivan [00:46:51]: The one I think is probably the most interesting is the computer use stuff. To your point, we've talked to so many people over the last year, and it's like, “Oh, why do you need a sandbox? Why do you need this?” And it's like, “Oh, I need a sandbox for this. I need a sandbox for that.” It's like, “Oh, I need it for every single thing.” It sounds like a broken record, but you use a laptop every single day, right? And you are n of one. It's just you. And by the way, the PC market is about equal to the cloud market in total, so about 150, 180 billion a year, something like that. Roughly, the three cloud hyperscalers are about equal to Apple, HP, Lenovo, whatever. It's a little bit less, but it's sort of like that. So how big is the addressable market? How many people are there in the world now? What's the last data?
Swyx [00:47:45]: Let's call it eight billion.
Ivan [00:47:46]: Eight billion.
And so let's say you can have two computers, one personal and one business, whatever. So it's double that, right? That's 16 billion. How many agents are gonna be running in two years, in 10 years, in 100 years? And for every single task, they will need one of these. So how big is that? That market is essentially, quote unquote, “infinite”. You will get to the point, and Dylan Patel from SemiAnalysis, who usually talks about GPUs, was at the conference talking about how CPUs will now be a bottleneck, because they will be the constraint. We won't be able to have enough of these because there won't be enough CPUs.
Swyx [00:48:23]: Yeah. Well, I actually had a really good podcast with Doug O'Laughlin, who is the president at SemiAnalysis, where they've basically been like, yeah, it's been a GPU shortage first, but then it's cascaded down to memory and now to CPUs.
Ivan [00:48:35]: CPU, yeah.
Swyx [00:48:35]: What's next? Networking. Networking actually has been in shortage for a while, if you're looking at just GPU networking. But yeah, it's really crazy, the amount of computer use that's going on. Cool. My other question is about one very big part, the open sourceness, which you didn't have to do and your competitors don't do. A lot of people are worried about keeping their projects open source because some competitor can just fork it. I don't know if you have any reflections on being an open source company.

Open Source, Trust, and Enterprise Procurement

Ivan [00:49:15]: Yeah. There's a bunch. So the original product that we did was open source.
Swyx [00:49:19]: Yeah. CodeAnywhere.
Ivan [00:49:20]: So doing that was actually very good for us. There's basically a saying. What's the saying?
Like, companies that are doing really well measure themselves against free cash flow; for those that are kind of okay, it's EBITDA; and it goes all the way down.
Swyx [00:49:36]: The worst is like GitHub stars.
Ivan [00:49:37]: GitHub stars. GitHub stars are the worst, yeah. So you go all the way down to GitHub stars. And our original one was GitHub stars. That's what we talked about. Now we're at the point where we're talking about revenue, so we've gone up the stack on that. And so we started -
Swyx [00:49:47]: No, profit.
Ivan [00:49:48]: Yeah. We haven't, we'll get there. We'll get there. But basically at that point we did stars on GitHub and it was useful. In the original variation that we did, we split the core into its own repo, and it was Apache 2.0, so very permissive. And then we basically would bundl

The MAD Podcast with Matt Turck
OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real

The MAD Podcast with Matt Turck

Play Episode Listen Later May 21, 2026 73:56


AI suddenly feels like it has crossed a threshold, and Yann Dubois, co-lead of the Post-training Frontiers team at OpenAI, joins Matt Turck to explain why. Yann's team has led the post-training behind the company's reasoning models, including the recent GPT-5.5 release. In this conversation, we go inside the shift from raw model capability to useful, reliable systems: what changed with GPT-5.5, why reinforcement learning is moving beyond math and coding competitions into messy real-world work, how reasoning models like GPT-5.5 actually work, the difference between GPT-5.5 Thinking and GPT-5.5 Pro, why post-training has become one of the most important frontiers in AI, and why evals, model-as-judge, hallucinations, agentic workflows, GDPval, and continual learning are now central to the next phase of frontier models. Yann also shares why continual learning remains one of AI's biggest unsolved problems three years after ChatGPT, and where startups still have massive room to build as frontier models race ahead.
(00:00) - Cold open
(00:34) - Intro
(01:30) - Why recent AI progress feels like a step function
(04:13) - Model reliability & the rollercoaster of shipping 5.5
(07:33) - How OpenAI structures vertical and horizontal teams
(09:49) - Improving model efficiency and test-time compute
(12:32) - Yann Dubois' journey from Switzerland to OpenAI
(15:37) - Reasoning in 2026: Real-world utility vs verifiable rewards
(18:34) - GPT-5.5 Thinking vs Pro: Scaling test-time compute
(20:09) - How reasoning models become more efficient
(23:23) - Pre-training scaling and overcoming the data wall
(27:03) - Multimodal data, synthetic data, and embodied AI
(31:05) - Demystifying mid-training and post-training
(37:21) - Does RL create new capabilities in AI?
(38:53) - The challenges and frontier of scaling RL
(43:09) - Is building AI models a craft or a strict science?
(48:21) - How AI models generalize across different domains
(54:18) - How reinforcement learning cures AI hallucinations
(56:04) - Negative generalization and conflicting instructions
(58:05) - Can RL scale to law, medicine, and the broader economy?
(1:00:19) - The evaluation bottleneck and Model as a Judge
(1:04:21) - Continuous AI progress & continual learning
(1:08:49) - Will foundation models eat the agent harness?
(1:11:23) - Why startups should focus on the last mile of AI

Framgångspodden
1018. Christer Olsson: Orphaned at 15 – “Happiness is a skill”, Original

Framgångspodden

Play Episode Listen Later May 20, 2026 65:20


Fan favorite Christer Olsson is back on Framgångspodden in a warm, thought-provoking, and eye-opening conversation about life, relationships, and why happiness is not something you find but something you train. We talk about the importance of truly seeing life while it is happening, which leads into Christer's life philosophy: you don't have a good day, you make a good day. It is a phrase that made him go viral on TikTok and that captures the core of his message: life doesn't happen later, it happens now. Why do so many people chase happiness in the wrong place? And why is gratitude a skill rather than a feeling? Christer also shares insights from the leaders he coaches, where the thing that most often suffers is the family. What do our calendars actually look like? And why do children so often get only the time that is left over? We also talk about AI and why the technology, paradoxically, seems to increase impostor syndrome in many people. We go into relationships as well: why love is something you do rather than something you feel, his much-discussed “80 percent rule”, and why you sometimes have to dare to leave relationships that have stopped developing. It is a powerful episode about self-esteem, courage, and meaning, full of insights that stay with you long after the conversation ends. Follow Christer here. Read more about Christer here. Read more about Framgångsakademin here. Explore Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

Framgångspodden
1018. Christer Olsson: Stop waiting for life – “You make yourself a good day”, Short

Framgångspodden

Play Episode Listen Later May 20, 2026 26:42


Fan favorite Christer Olsson is back on Framgångspodden in a warm, thought-provoking, and eye-opening conversation about life, relationships, and why happiness is not something you find but something you train. We talk about the importance of truly seeing life while it is happening, which leads into Christer's life philosophy: you don't have a good day, you make a good day. It is a phrase that made him go viral on TikTok and that captures the core of his message: life doesn't happen later, it happens now. Why do so many people chase happiness in the wrong place? And why is gratitude a skill rather than a feeling? Christer also shares insights from the leaders he coaches, where the thing that most often suffers is the family. What do our calendars actually look like? And why do children so often get only the time that is left over? We also talk about AI and why the technology, paradoxically, seems to increase impostor syndrome in many people. We go into relationships as well: why love is something you do rather than something you feel, his much-discussed “80 percent rule”, and why you sometimes have to dare to leave relationships that have stopped developing. It is a powerful episode about self-esteem, courage, and meaning, full of insights that stay with you long after the conversation ends. Follow Christer here. Read more about Christer here. Read more about Framgångsakademin here. Explore Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

[Podfic]
DSMN7: The Second Gig, part 1

[Podfic]

Play Episode Listen Later May 19, 2026 22:16


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, and to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061

Let's Talk AI
#245 - TML-Interaction, Claude For Legal, Sam Altman on Stand

Let's Talk AI

Play Episode Listen Later May 18, 2026 109:14


Our 245th episode with a summary and discussion of last week's big AI news!
Recorded on 05/13/2026
Hosted by Andrey Kurenkov and Jeremie Harris
Feel free to email us your questions and feedback at andreyvkurenkov@gmail.com and/or hello@gladstone.ai
Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:
OpenAI released new voice intelligence API features including GPT Realtime 2 (GPT-5-powered) plus realtime translation and Whisper transcription, emphasizing the latency–reasoning tradeoff, larger context, and new guardrails amid fraud risks.
Thinking Machines previewed a low-latency, full-duplex conversational system with a two-model architecture and custom inference stack, reporting strong interactivity benchmark results but without public access or third-party validation yet.
Anthropic pushed further into vertical products with Claude for Legal and deeper AWS availability, while ecosystem tension grows as platform model providers compete with application-layer companies.
Safety, policy, and research updates included OpenAI's self-harm trusted contact feature, Anthropic work on reducing agent misalignment by training ethical “why” reasoning, OpenAI's investigation of accidental chain-of-thought grading in RL, and Meta horizon eval updates showing benchmarking limits for long task horizons.

Timestamps:
(00:00:10) Intro / Banter
(00:01:35) Response to listener comments
(00:03:27) Sponsor Break

Tools & Apps
(00:06:27) OpenAI launches new voice intelligence features in its API | TechCrunch
(00:15:52) Thinking Machines drops a new, highly responsive model designed for humanlike interactions in real time - SiliconANGLE
(00:27:49) Claude For Legal Launches, May Reshape the Legal Tech World – Artificial Lawyer
(00:40:27) Threads tests a Meta AI integration that works similarly to Grok | TechCrunch
(00:43:08) Google brings agentic AI and vibe-coded widgets to Android | TechCrunch
(00:45:33) Google updates AI search to include quotes from Reddit and other sources | TechCrunch

Applications & Business
(00:47:38) Sam Altman was winning on the stand, but it might not be enough | The Verge
(00:55:04) Nvidia C.E.O. Jensen Huang Hitches Ride With Trump to China After Last-Minute Invite - The New York Times
(00:58:40) AWS expands Anthropic partnership with Claude Platform launch
(01:01:13) Chinese grey market sells Claude API access at 90% off by using stolen credentials, model substitution, and harvesting users' prompts and outputs for resale as AI training data; 'transfer stations' operate through proxy networks that harvest user data
(01:06:43) DeepMind Spinout Isomorphic Labs Raises $2.1 Billion to Design Drugs With AI - Bloomberg

Projects & Open Source
(01:09:04) Petri: Anthropic Hands Its Alignment Toolbox to Meridian Labs with 3.0 Update
(01:12:25) 'Daybreak': OpenAI's Answer to Anthropic's Project Glasswing Has Arrived

Policy & Safety
(01:14:04) Teaching Claude why
(01:21:45) Import AI 455: Automating AI Research
(01:28:31) ChatGPT's New Safety Feature Could Alert 'Trusted Contact' to Risk of Self-Harm - CNET
(01:30:09) Investigating the consequences of accidentally grading CoT during RL
(01:34:46) Natural Language Autoencoders criticism
(01:39:15) Review of the "Risks from automated R&D" section in the Anthropic Risk Report (February 2026)

Synthetic Media & Art
(01:43:39) George Clooney, Tom Hanks, and Meryl Streep back new ‘Human Consent Standard' for AI licensing | The Verge

Research & Advancements
(01:45:10) METR says Claude Mythos is testing the limits of AI evaluation – Startup Fortune

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

[Podfic]
DSMN6: Interview et al.

[Podfic]

Play Episode Listen Later May 15, 2026 31:34


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, and to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061

The Lunar Society
Eric Jang – Building AlphaGo from scratch

The Lunar Society

Play Episode Listen Later May 15, 2026 157:29


Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.

Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better. Naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo's MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. It was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). This is informative for all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

Watch on YouTube. Read the transcript. And check out the flashcards I wrote to retain the insights.

Sponsors
* Cursor's agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor's SDK made it easy. Check out the cards at flashcards.dwarkesh.com and get started with the SDK at cursor.com/dwarkesh
* Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street's tech group, and Dan Pontecorvo, who runs Jane Street's physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at janestreet.com/dwarkesh

Timestamps
(00:00:00) – Basics of Go
(00:08:17) – Monte Carlo Tree Search
(00:32:04) – What the neural network does
(01:00:33) – Self-play
(01:25:38) – Alternative RL approaches
(01:45:47) – Why doesn't MCTS work for LLMs
(02:01:09) – Off-policy training
(02:12:02) – RL is even more information inefficient than you thought
(02:22:16) – Automated AI researchers

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
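
The contrast this episode description draws between naive policy gradient RL and AlphaGo's search-based targets can be made concrete with a toy sketch. This is only an illustration under simplifying assumptions, not code from the episode: the function names, the three-action setup, and the visit counts are all made up, and a real AlphaGo-style system also trains a value head and applies a temperature to the visit counts, which this sketch skips. It compares the per-step gradient signal on the policy logits in the two regimes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_signal(logits_per_step, actions, episode_return):
    """REINFORCE-style gradient of episode_return * sum_t log pi(a_t).

    Every step is scaled by the same scalar return, so the update
    cannot tell which individual step deserved the credit.
    """
    grads = []
    for logits, action in zip(logits_per_step, actions):
        probs = softmax(logits)
        # d/dlogit_i of log softmax at `action` is 1[i == action] - p_i.
        grads.append([
            episode_return * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)
        ])
    return grads


def search_target_signal(logits_per_step, visit_counts_per_step):
    """Gradient of sum_i target_i * log pi_i, one target per step.

    The target is the normalized (hypothetical) search visit counts,
    so each step gets its own direction regardless of the final result.
    """
    grads = []
    for logits, counts in zip(logits_per_step, visit_counts_per_step):
        probs = softmax(logits)
        total = sum(counts)
        targets = [c / total for c in counts]
        # Cross-entropy gradient w.r.t. logits is target - p.
        grads.append([t - p for t, p in zip(targets, probs)])
    return grads


if __name__ == "__main__":
    uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    # A lost episode (return 0) gives the policy gradient no signal at all.
    print(policy_gradient_signal(uniform, actions=[1, 2], episode_return=0.0))
    # Search targets still say which move to prefer at each step.
    print(search_target_signal(uniform, [[1, 8, 1], [6, 2, 2]]))
```

In the first case a single scalar is spread across every step of the trajectory; in the second, each step carries its own target distribution, which is the sense in which a search-derived target sidesteps credit assignment.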

The MAD Podcast with Matt Turck
Why AWS and Azure Cannot Run Autonomous AI – Ivan Burazin (Daytona)

The MAD Podcast with Matt Turck

Play Episode Listen Later May 14, 2026 65:15


If AI agents are the new digital knowledge workers, where exactly do they do their work? In this episode of the MAD Podcast, Ivan Burazin joins us to unpack the emerging infrastructure stack for AI agents and explain why every agent needs its own secure, stateful "computer." We explore the technical realities of sandboxes, dive into why legacy, stateless hyperscalers weren't built for these new workloads, and break down the mechanics of microVMs and custom schedulers alongside a contrarian prediction on an impending CPU shortage. Finally, Ivan delivers an absolute masterclass on product-led growth, community building, and go-to-market strategy for technical founders.
(00:40) Intro
(02:13) What is an AI agent sandbox?
(03:17) Security risks of running agents locally
(05:17) Stateful vs. stateless hyperscalers
(07:04) The history of cloud IDEs and the end of localhost
(09:45) Do all AI agents need a sandbox?
(12:26) Sandbox use cases: RL evals & background agents
(14:10) Unpacking the emerging AI Agent Stack
(16:20) The unsolved problem of agent memory and learning
(19:37) Where sandboxes fit in the agent harness
(21:35) OpenAI, Anthropic, and agent SDKs
(23:06) Ivan's founder journey: From CodeAnywhere to Daytona
(26:59) GTM strategies and building developer communities
(33:48) Why customer support is your best GTM strategy
(35:34) Leveraging Twitter during the AI super cycle
(40:50) The technical anatomy of a sandbox
(41:53) Why fast spin-up speeds maximize GPU efficiency
(46:09) Firecracker, QEMU, and isolation primitives
(49:58) Why sandbox snapshots and state forking matter
(51:40) Why Daytona built a custom scheduler from scratch
(55:24) The challenge of long-running stateful sandboxes
(58:10) The build your own sandbox trap
(1:01:03) Why AI agents might trigger a global CPU shortage
(1:02:46) The future of the AI Agent Stack

Framgångspodden
1016. Simona Mohamsson: "People had counted me out – they will have to eat their words", Short

Framgångspodden

Play Episode Listen Later May 13, 2026 24:17


In this episode we meet Simona Mohamsson, party leader of Liberalerna (the Liberals), in a personal conversation about pressure, power, and responsibility. Simona talks about the months when the poll numbers plummeted, why she still sleeps with two phones under her pillow, and how she learned not to carry other people's worry as her own. We also talk about growing up as the daughter of parents from Lebanon and Palestine, about her father's flight through Europe, and why he chose to change his surname from Mohammed to Mohamsson as a thank-you to Sweden. We dig into the much-discussed Sweden Pledge (Sverigelöftet) and the secret negotiations with Jimmie Åkesson. Simona describes the first meeting behind closed doors, the much-discussed hug at the press conference, and why she stands by the cooperation despite the massive criticism. The conversation also touches on safety, gang crime, and the murder of the police officer Andreas in Hisingen, an event that came to change her view of politics and responsibility. And perhaps most of all: the feeling of always having to run twice as fast to prove your worth. An episode about identity, integration, political courage, and the will to get back up when the world has already counted you out. Follow Simona here. Read more about Liberalerna here. Read more about Framgångsakademin here. Explore Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

Framgångspodden
1016. Simona Mohamsson: The game behind the Sweden Pledge (Sverigelöftet) & the hug with Jimmie Åkesson, Original

Framgångspodden

Play Episode Listen Later May 13, 2026 69:01


In this episode we meet Simona Mohamsson, party leader of Liberalerna (the Liberals), in a personal conversation about pressure, power, and responsibility. Simona talks about the months when the poll numbers plummeted, why she still sleeps with two phones under her pillow, and how she learned not to carry other people's worry as her own. We also talk about growing up as the daughter of parents from Lebanon and Palestine, about her father's flight through Europe, and why he chose to change his surname from Mohammed to Mohamsson as a thank-you to Sweden. We dig into the much-discussed Sweden Pledge (Sverigelöftet) and the secret negotiations with Jimmie Åkesson. Simona describes the first meeting behind closed doors, the much-discussed hug at the press conference, and why she stands by the cooperation despite the massive criticism. The conversation also touches on safety, gang crime, and the murder of the police officer Andreas in Hisingen, an event that came to change her view of politics and responsibility. And perhaps most of all: the feeling of always having to run twice as fast to prove your worth. An episode about identity, integration, political courage, and the will to get back up when the world has already counted you out. Follow Simona here. Read more about Liberalerna here. Read more about Framgångsakademin here. Explore Framgångsakademin's courses. Order "Mitt Framgångsår". Follow Alexander Pärleros on Instagram. Follow Alexander Pärleros on TikTok. The best tips from the episode in the newsletter. Hosted on Acast. See acast.com/privacy for more information.

[Podfic]
DSMN5: Mixed Correspondence, part 2

[Podfic]

Play Episode Listen Later May 12, 2026 40:11


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, and to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061

[Podfic]
DSMN4: Mixed Correspondence, part 1

[Podfic]

Play Episode Listen Later May 8, 2026 45:56


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Full name: Don't Stop Me Now. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); Phone vibration: https://freesound.org/people/eobmada/sounds/541367/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, and to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061

Talk Python To Me - Python conversations for passionate developers
#547: Parallel Python at Anyscale with Ray

Talk Python To Me - Python conversations for passionate developers

Play Episode Listen Later May 6, 2026 59:16 Transcription Available


When OpenAI trained GPT-3, they didn't roll their own orchestration layer. They used Ray, an open source Python framework born out of the same Berkeley research lab lineage that gave us Apache Spark. And here's the twist: Ray was originally built for reinforcement learning research, then quietly faded as RL hit a wall. Until ChatGPT showed up. Suddenly reinforcement learning was back, as the post-training step that turns a raw language model into something genuinely useful. Edward Oakes and Richard Liaw, two founding engineers behind Ray and Anyscale, join me on Talk Python to tell that story. We'll trace Ray from its RISE Lab origins at UC Berkeley to powering some of the largest training runs in the world. We'll talk about what Ray actually is, a distributed execution engine for AI workloads, and how a few lines of Python become work running across hundreds of GPUs. We'll cover Ray Data for multimodal pipelines, the dashboard, the VS Code remote debugger, KubeRay for Kubernetes, and where Ray fits alongside Dask, multiprocessing, and asyncio. If you've ever stared at a single-machine Python script and thought, "there has to be a better way to scale this", this one's for you.

Episode sponsors
Sentry Error Monitoring, Code talkpython26
AgentField AI
Talk Python Courses

Links from the show
Guests
Richard Liaw: github.com
Edward Oakes: github.com
Ray: www.ray.io
Example code (we used for walk-through): docs.ray.io
Getting Started with Ray: docs.ray.io
Ray Libraries: docs.ray.io
kuberay: github.com
Watch this episode on YouTube: youtube.com
Episode #547 deep-dive: talkpython.fm/547
Episode transcripts: talkpython.fm
Theme Song: Developer Rap

The Reclaimed Leader Podcast: Helping You Lead Change Without Losing Your Roots

Sometimes we just need to grab a cup of coffee and talk church – that's what Jesse and I are doing today as we discuss Lifeway's 12 trends of 2026, in order to spark some ideas and reflect on how to best communicate with the people we're trying to reach. Grab your coffee and join us today on the RL

[Podfic]
DSMN3: Faith

[Podfic]

Play Episode Listen Later May 5, 2026 34:39


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, and to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061

The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Play Episode Listen Later May 1, 2026 106:35


Kyle Corbitt, founder of OpenPipe, breaks down reinforcement learning and custom fine-tuning for modern AI models. He explains how RL differs from supervised fine-tuning, why GRPO and LLM-as-judge post-training matter, and how these techniques can improve performance, latency, and cost on open source models. The conversation also covers reward hacking, evaluation design, LoRA adapters, and how Chinese labs are using distillation to fast-follow frontier models.

Sponsors:
Sequence: Sequence handles the full revenue workflow for complex pricing, from quoting and metering to invoicing, revenue recognition, and collections. Book a public demo at https://sequencehq.com and use code Cognizant in the source field to save 20% off year one
AvePoint: AvePoint is building the control layer for AI agents so you can securely govern, audit, and recover every action at scale. Design trusted agentic outcomes from day one at https://avpt.co/tcr
VCX: VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com
Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

[Podfic]
DSMN2: Nerves

[Podfic]

Play Episode Listen Later May 1, 2026 43:56


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061!

The Lunar Society
Reiner Pope – The math behind how LLMs are trained and served

The Lunar Society

Play Episode Listen Later Apr 29, 2026 133:50


Did a very different format with Reiner Pope - a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It's a bit technical, but I encourage you to hang in there – it's really worth it. There are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard. Reiner is CEO of MatX, a new chip startup (full disclosure - I'm an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture. Download markdown of transcript here to chat with an LLM. Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too!
Sponsors
* Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation, which touched on everything from FPGAs to liquid cooling, was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street's open roles at janestreet.com/dwarkesh
* Google's Gemma 4 is the first open model that's let me shut off the internet and create a fully disconnected “focus machine”. This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner's scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4
* Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn't sure the best way to visualize the concept, but Cursor's Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post. And if you have something to visualize yourself, go to cursor.com/dwarkesh
Timestamps
(00:00:00) – How batch size affects token cost and speed
(00:32:09) – How MoE models are laid out across GPU racks
(00:47:12) – How pipeline parallelism spreads model layers across racks
(01:03:37) – Why Ilya said, “As we now know, pipelining is not wise.”
(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
(01:33:02) – Deducing long context memory costs from API pricing
(02:04:02) – Convergent evolution between neural nets and cryptography
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

[Podfic]
DSMN1: Letters

[Podfic]

Play Episode Listen Later Apr 28, 2026 38:17


A Good Omens fanfic by mostlyeffable. Part 4 of the Unkind Regards series. Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0). Sounds: Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0); Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0); Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0); Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0); RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0). For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/83306061!

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Physical AI that Moves the World — Qasar Younis & Peter Ludwig, Applied Intuition

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 27, 2026 72:21


From building Applied Intuition from YC-era autonomy tooling into a $15B physical AI company, Qasar Younis and Peter Ludwig have spent the last decade living through the full arc of autonomy: from simulation and data infrastructure for robotaxi companies, to operating systems for safety-critical machines, to deploying AI onto cars, trucks, mining equipment, construction vehicles, agriculture, defense systems, and driverless L4 trucks running in Japan today. They join us to explain why “physical AI” is not just LLMs on wheels, why the real bottleneck is no longer model intelligence but deployment onto constrained hardware, and why the future of autonomy may look less like one-off demos and more like Android for every moving machine.

We discuss:
* Applied Intuition's mission: building physical AI for a safer, more prosperous world, powering cars, trucks, construction and mining equipment, agriculture, defense, and other moving machines
* Why physical AI is different from screen-based AI: learned systems can make mistakes in chat or coding, but safety-critical machines like driverless trucks, autonomous vehicles, and robots need much higher reliability
* The evolution from autonomy tooling to a broad physical AI platform: starting with simulation and data infrastructure for robotaxi companies, then expanding into 30+ products across simulation, operating systems, autonomy, and AI models
* Why tooling companies came back into fashion: Qasar on why developer tooling looked unfashionable in 2016, why Applied Intuition still bet on it, and how the AI boom made workflows and tools central again
* The three core buckets of Applied Intuition's technology: simulation and RL infrastructure, true operating systems for vehicles and machines, and fundamental AI models for autonomy and world understanding
* Why vehicles need a real AI operating system: real-time control, sensor streaming, latency, memory management, fail-safes, reliable updates, and why “bricking a car” is much worse than bricking an iPad
* Physical machines as “phones before Android and iOS”: Peter explains why today's vehicle and machine software stack is fragmented across many operating systems, and why Applied Intuition wants to consolidate the platform layer
* Coding agents inside Applied Intuition: Cursor, Claude Code, internal adoption leaderboards, and how AI tools are changing engineering workflows even in embedded systems and safety-critical software
* Verification and validation for physical AI: why evals get harder as models improve, how end-to-end autonomy changes simulation requirements, and why neural simulation has to be fast and cheap enough to make RL practical
* From deterministic tests to statistical safety: why autonomy validation is shifting from binary pass/fail requirements toward “how many nines” of reliability and mean time between failures
* Cruise, Waymo, and public trust: Qasar and Peter discuss why autonomy failures are not just technical issues, how companies interact with regulators, and why Waymo is setting a high bar for the industry
* Simulation vs. reality: why no simulator perfectly represents the real world, how sim-to-real validation works, and why real-world testing will never disappear
* World models for physical AI: hydroplaning, construction equipment, visual cues, cause-and-effect learning, and where world models help versus where they are not enough
* Onboard vs. offboard AI: why data-center models can be huge and slow, but onboard vehicle models need millisecond-level latency, low power, small size, and distillation-like efficiency
* Why physical AI is not constrained by model intelligence alone: the hard part is deploying models onto real hardware, under safety, latency, power, cost, and reliability constraints
* Legacy autonomy vs. intelligent autonomy: RTK GPS in mining and agriculture, why hand-coded path-following worked for decades, and why modern systems need perception and dynamic intelligence
* Planning for physical systems: how “plan mode” applies to robotaxis, mining, defense, and multi-step physical tasks where actions change the state of the world
* Why robotics demos are not production: the brittle last 1%, humanoid reliability, DARPA Grand Challenge-style prize policy, and the advanced engineering gap between research and deployment
* Applied Intuition's hard-earned lessons: after nearly a decade, Peter says they can look at a robotics demo and predict the next 20 problems the company will hit
* Qasar's advice to founders: constrain the commercial problem, avoid copying mature-company strategies too early, and remember that compounding technology only matters if you survive long enough to see it compound
* Why 2014 YC advice may not apply in 2026: capital markets, AI company dynamics, and the difference between building in stealth with a deep network versus building as a new founder today
* What Applied is hiring for: operating systems, autonomy, dev tooling, model performance, evals, safety-critical systems, hardware/software boundaries, and engineers with deep curiosity about how things work

Applied Intuition:
* YouTube: https://www.youtube.com/@AppliedIntuitionInc
* X: https://x.com/AppliedInt
* LinkedIn: https://www.linkedin.com/company/applied-intuition-inc

Qasar Younis:
* X: https://x.com/qasar
* LinkedIn: https://www.linkedin.com/in/qasar/

Peter Ludwig:
* LinkedIn: https://www.linkedin.com/in/peterwludwig/

Timestamps
00:00:00 Introduction: Applied Intuition, Physical AI, and 10 Years of Building
00:01:37 Physical AI vs. Screen AI: Why Safety-Critical Changes Everything
00:02:51 The Origin Story: Tooling, YC, and the Scale AI Comparison
00:05:41 The Three Buckets: Simulation, Operating Systems, and Autonomy Models
00:11:10 Hardware, Sensors, and the LiDAR Question
00:14:26 The Operating System Layer: Why Vehicles Are Like Pre-Android Phones
00:19:13 Customers, Licensing, and the Better-Together Stack
00:21:19 AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer
00:26:41 Verifiable Rewards, Evals, and Neural Simulation
00:31:04 Statistical Validation, Regulators, and the Cruise Lesson
00:40:25 World Models, Hydroplaning, and Cause-Effect Learning
00:43:34 Onboard vs. Offboard: Latency, Embedded ML, and Distillation
00:50:57 Plan Mode for Physical Systems and Next-Token Prediction Universally
00:53:04 Productionization: The 20 Problems Every Robotics Demo Will Hit
00:58:00 Founder Advice: Constraints, Compounding Tech, and Mature-Company Mimicry
01:05:41 Hiring Philosophy: Hardware/Software Boundary and Engineering Mindset
01:08:50 General Motors Institute, Education, and the Curiosity Mindset

Transcript

Introduction: Applied Intuition, Physical AI, and 10 Years of Building
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space.
Swyx [00:00:10]: And today we're very honored to have the founders of Applied Intuition, Qasar and Peter. Welcome.
Qasar [00:00:17]: You guys really know how to turn it on to podcast mode. That was, you guys are real pros at this.
Qasar [00:00:23]: They were just joking around right before this, and then they flipped it pretty quick.
Alessio [00:00:29]: Oh, yeah, it's good to have you guys. Maybe you just wanna introduce yourself so people know the voice on the mic and they'll know what they're hearing.
Peter [00:00:33]: Oh, sure. Yeah, I'm Peter Ludwig. I'm the co-founder and CTO of Applied Intuition.
Qasar [00:00:38]: And my name is Qasar Younis.
I am the CEO and co-founder with Peter.
Alessio [00:00:42]: Nice. Can you guys give the high-level overview of what Applied Intuition is? And I was reading through some of the Congress files, when you went out there, Peter, and eighteen of the top twenty global non-Chinese automakers, you two guys, you have customers in agriculture, defense, construction. I think most people have heard of Applied Intuition tied to YC when it was first started, and then you were kinda in stealth for a long time, so maybe just give people the high-level overview of what it is today, and then we'll dive into the different pieces.
Peter [00:01:10]: Yeah. So at Applied Intuition, our mission is to build physical AI for a safer, more prosperous world. And so we work on physical AI for all different types of moving systems, everything from cars to trucks to construction and mining equipment, to defense technologies. And we're a true technology company, so we build and sell the technology, and we sell it to the companies that make the machines. We sell it to the government, really anyone that wants to buy a technology to make machines smart.

Physical AI vs. Screen AI: Why Safety-Critical Changes Everything
Qasar [00:01:38]: Yeah. And I think in the broader AI landscape, a lot of the focus, rightfully so, in the last three years has been on large language models, and so everything fits in a screen. Like, whether it's code complete products or things like that. And what's different about us is we're deploying intelligence onto a lot of things that don't have screens. They're physical machines. There are sometimes screens within the cabin, for example of a car or a truck or something like that, but most of the value we provide is putting intelligence that is in safety critical environments.
So those two words are really important because learned systems can make mistakes if you're asking for, like, something like, “Tell me about these podcast hosts
Qasar [00:02:28]: that I'm about to go meet.” But you can't do that obviously when you run, like, as an example, we run driverless trucks in Japan right now, as we speak. We can't have errors. Those are L4 trucks. Yeah.
Alessio [00:02:40]: Yeah. Was that always the mission? I remember initially, I think people put you and Scale AI very similarly for some things about being kinda like on the data infrastructure side of things. What was the evolution of the company?

The Origin Story: Tooling, YC, and the Scale AI Comparison
Peter [00:02:51]: Well, from the very beginning, we always wanted to really be a technology company that helped generally push forward the industrial sector. And so we started off working in autonomy. Our very first customers were robotaxi companies. And we started off doing a lot of work in simulation and data infrastructure. And then over the years, we've expanded our portfolios. Now we have over thirty products, and it's a pretty broad technology play within the landscape of physical AI.
Qasar [00:03:19]: Yeah, I think the Scale reason is because we're all YC Universe companies. But it was a very different company. Scale was, is more of a services company, data labeling company fundamentally. We started and still are, do a lot of tooling. So like, you think developer tooling is now in vogue again, thanks to the AI boom. But honestly, ten years ago, it was out of vogue. It w- Like, doing a tooling company in 2016, 2017 was not, like, the thing to do because, I don't know if you remember, the VCs generally, their view was that toolings are- They're just workflows, and workflows ultimately are not really interesting. And we've gone and come full circle with that. But when we started the company, our kind of it's kinda like in the periphery of what the company wants to be.
It was like, from our earliest days, like, we wanna deploy software on physical machines, like on cars and on trucks and things like that. And obviously, we didn't know that the transformer boom was gonna happen. We didn't know that autonomy systems would become end-to-end. Those things we didn't know. And why that's important when autonomy systems become end-to-end, it is just now those models can be generalized to multiple form factors. And so back nine, ten years ago, tooling was a great way, and still is a great way to build the technology and sell technology to our end customers, a lot of them who wanna build this stuff themselves. And so we just offer like a spectrum of solutions from you can just use like one part of a development suite of tools all the way to buying the full thing. The way to think about the company, or at least the way we think about the company is, as Peter said, a technology provider. It's kinda like what NVIDIA does or what an AMD, but we just don't do chips.
Qasar [00:05:06]: We don't do silicon. But we're a technology provider fundamentally. And I think even, we used to joke when we started the company, like, we're not the guys to build, like, Instagram. Like that was just towards- That's not our- That's just not us in a most fundamental way. I
Alessio [00:05:20]: You have thoughts.
Qasar [00:05:21]: Yes.
Qasar [00:05:22]: Well, it's, it's- I mean, I think it's just like what- And I mean, we worked on Maps and stuff, Google Maps. Consumer products are extremely difficult for a lot of different reasons. It just, I think doesn't scratch the itch. I think we're like Michigan guys who are kind of more of that traditional engineering kind of a realm, or lineage.
we used to joke

The Three Buckets: Simulation, Operating Systems, and Autonomy Models
Peter [00:05:41]: I gotta say, though, what was clear ten years ago was that there was so much more that was possible with software and AI in vehicles
Peter [00:05:47]: and that was generally the space that we started in ten years ago.
Peter [00:05:51]: And the precise path that we've taken over the years, I think we've been strategic, and we've adjusted to make sure that we're actually building stuff that's valuable to the market. And like, the technology has changed so much. Like our own technology stack has completely changed, I would say, roughly every two years. And so now we've probably done, let's say, four complete evolutions of our own technology stack. And I sort of see that cadence roughly keeping up.
Peter [00:06:13]: And so the way even we think about engineering is almost on this two-year horizon, we're preparing ourselves that, hey, like, we wanna invest the appropriate amount, but then also be very dynamic as the research gets published and as our research team figures out new advancements and adapting to that.
Qasar [00:06:27]: Yeah. One thing that has been consistent is the type of people we've, we've recruited. It's engineers who are fall into the sometimes very traditional, like, Google
Qasar [00:06:38]: -gen suite, but way different from other companies. We are hiring folks who really know the intersection of hardware and software, who know really low-level systems. Obviously, traditional ML researchers and folks who've actually put ML systems into production. That's been pretty consistent. I think that, like, you look at the mix of our engineering, eighty-three percent of the company is engineering, so it's, like, a giant list.
Qasar [00:07:05]: A lot of engineers.
Alessio [00:07:06]: Which, by the way, a thousand engineers
Qasar [00:07:07]: Yeah.
A thousand engineers.
Alessio [00:07:08]: that's on your website, so I imagine it's up to date.
Qasar [00:07:11]: It is, it is up to date, yes. Yes.
Alessio [00:07:12]: Okay. And then forty-plus founders.
Qasar [00:07:15]: Yeah. We would tend to also- This was more luck than strategy. But we've recruited a lot of ex-founders. It's been a great place for founders, YC and non, ‘cause obviously I know a lot of the YC folks. It's kind of like we recruit a lot of Google people.
Qasar [00:07:33]: For them to exercise both their technical and non-technical skills because, we're, we're, we're on the applied side. We have a research team that we do fundamental research, we publish, and we've, we've had great traction there. But fundamentally, the business wants to take this intelligence and deploy it into production and there's, like, a certain type of person that's more interested in that.
Alessio [00:07:54]: Yeah. You mentioned the tech stack, Peter, so I just wanted to give you some rein to just go into it. I'm interested in where Applied Intuition starts and ends in some sense. What won't you do? What do you do that's common among all the verticals that you cover?
Peter [00:08:10]: There's a few buckets of work that we do, and we've been at this for almost ten years now, so the technology's pretty broad. But we got started
Qasar [00:08:17]: Yeah, with a thousand engineers, like, you could work on lots of things.
Peter [00:08:19]: There's lots of stuff, yeah, espe-especially with AI tools to help.
Peter [00:08:22]: So we got our start in simulation and simulation tooling and infrastructure.
And so generally, if you're trying to build a very complex software system that involves moving machines, you need to test that, and the best way to test it is it's a combination of virtual developments, a simulation, and then also obviously real world testing.
Peter [00:08:39]: And then there's a very careful process of that correlation between the simulation results and the real world results and ensuring that the simulator is in fact accurate to that. Simulation's a very deep topic.
Peter [00:08:49]: We have a whole suite of products in that, and we could talk for many hours about that specifically. But that is one part of what we do as a company. Reinforcement learning as a subpart of that is also super critical. I think a lot of the, a lot of the best advancements happening in a lot of these AI systems right now in some way relate to reinforcement learning, and with now we have lots of compute, and you can do tons of interesting things for reinforcement learning. The second bucket of work that we do is on operating systems technology. True operating systems. Like, think about schedulers and memory management and middleware and message passing and highly reliable networking and data links. Like, the reality is, if you want to deploy AI onto vehicles, you need a really good operating system. And when we were getting deeper into that space, there wasn't really anything that we were happy with.
Peter [00:09:39]: Like, things existed, absolutely, and we were using what was available in the market, and as an engineering organization, we roughly realized these things aren't great. We think we can do this better, and so let's, let's build something. And that was then the, that was the moment of inspiration that started our operating systems business, which is now a very real business for us. And in order to write and run great AI, you need a great operating system, and so that-that's what got us into that.
And then the third bucket that we work on, it's, it's true fundamental AI technology. Models, we do a lot of work in, as mentioned, the foundational research, but then also the world models and the actual autonomy models that are running on these physical machines, and that's across cars, trucks, mining, construction, agriculture, and defense, and so that's both land, air, and sea.
Qasar [00:10:31]: And also, a smaller subsector of that third bucket is the interaction of humans with those machines.
Qasar [00:10:38]: So that's a multimodal experience. Historically, if you're moving a dirt mover or any of these machines, there are, like, buttons you press, whether they're actual physical tactile buttons or something like a touch screen. That's just- That fundamentally is changing to where you're just talking to the machine and the machine and you're teaming with the machine.
Alessio [00:10:58]: Voice?
Qasar [00:10:59]: Yeah, voice, absolutely, yeah.
Alessio [00:11:00]: Oh.
Qasar [00:11:00]: And also the machine just being aware of who is in the cabin, what their state is. You can think from a safety systems perspective, the most simple version of this is, like, the driver is tired, right? They're, they're- if you get those alerts when you're driving your car and says

Hardware, Sensors, and the LiDAR Question
Qasar [00:11:15]: -maybe take a coffee break, that take that times a couple of order of magnitudes up. But this concept of teaming man and machine is important. When you think about running agents or just running different instances of Claude and doing work for you in the background, you can take that analogy out, almost copy and paste and put it into, like, a farm, where you have a farmer who's running a number of machines.
So where they interact with the machine is where there's maybe a critical decision or a disengagement or something like that, but generally speaking, the agent on the physical machine is running and making decisions on the behalf of the farmer until there's something maybe critical. And that's also what we work on. So that's not pure autonomy. It's a little bit of a mix, but it falls under autonomy. In the automotive sense, that's typically defined in SAE levels as an L2++ system
Qasar [00:12:05]: -with a human in the loop. But just take that idea to other verticals.
Alessio [00:12:09]: Yeah. You've not mentioned hardware at all, like sensors or obviously we- you mentioned you don't do chips. I think even in AV there's, like, a big, cameras versus lidars. Like, what are, like, in your space maybe some of those design decisions that you made, and are they driven by the OEM's ability to put things on the machinery? And like, how much influence do you guys have on co-designing those?
Peter [00:12:32]: Yeah. So we don't make sensors. Like, we're, we're not a manufacturer. Obviously, we use a lot of sensors in our autonomy products. In terms of what actually goes on the vehicles, we have a preferred set of sensors that we, let's say fully support, and then our customers, they can sort of choose from those. And obviously if there's a very strong opinion on supporting something else, we'll add that to the platform as well. And the lidar question is at this point sort of the age-old,
Peter [00:12:59]: topic in autonomy, and the state of the industry right now is lidar is hands down a useful sensor, specifically for data collection and the R&D phase of autonomy development. If you see, for example, a Tesla R&D vehicle, it actually has lidar on it
Peter [00:13:17]: to this day, right? In the Bay Area we see these. You'll see, like, Model Ys or Cybercab that have lidars on them just driving around. So it's, it's useful because it gives you per pixel depth information.
So if you can pair a lidar with a camera and you can say that, well, this camera's looking this direction, this lidar's looking this direction, and now for each pixel of the camera I can see how far away is that pixel. You can actually then use that as a part of your model training, and then that depth information then becomes a learned, a learned state of the camera data. And then when you're doing the production system, you can now remove the lidar
Peter [00:13:52]: and now you can actually get depth with just the camera. And so that difference between, like, a highly sensored R&D vehicle and then the down-costed production vehicle, we use that across our whole portfolio of products. And of course the end goal is you want super low cost and super reliable.
Peter [00:14:08]: And then in certain use cases you have some more bespoke things. Like in defense as an example, you do things at night oftentimes, and so you care about sensors like infrared, more so than- And you don't, you don't wanna be putting energy out, so you don't wanna use lidar or radar.
Peter [00:14:23]: But you still need to be able to see at nighttime. So yeah, we work the whole gamut.

The Operating System Layer: Why Vehicles Are Like Pre-Android Phones
Alessio [00:14:27]: Cool. So that's kinda like on the hardware level. Then on the OS level, how does that look like? What is, like, unique? My drive- I drive a Tesla. Whenever I drive some other car that has a screen, it always sucks.
Alessio [00:14:38]: It's on, like, cheap Android tablet. It's like, it's laggy and all of that. What does the OS of, like, the autonomy future look like?
Peter [00:14:46]: When most people, it's really what you just described. When you think about operating system in a vehicle, you're thinking about the HMI, right? The human machine interface, and absolutely that's an important part of it, but that's actually only one thin layer on top.
So when we talk about operating systems for, like, AI in vehicles, there's many layers that go deep into the CPU critical realm and embedded systems, and you're talking about the real time control of
Peter [00:15:13]: let's say the electric motors or the engine and the actuators, and you have different redundancies for different, let's say, the steering actuation in the vehicle. And all of these things need very core support in the operating system. And then of course for autonomy you have real time sensor data that's streaming in, and the latencies there are really important, right? If you try to- Imagine you try to run Microsoft Windows
Peter [00:15:35]: like streaming your sensor data in or controlling the vehicle. Like, the latencies are gonna be absurd. Like, you can never do that. And so what's special about what we do is we really have this system level thinking, right? So we're looking at, we care about every performance characteristic of the entire system, and then we also, because we're doing a lot of the software or all of that software, we can fine-tune and control all of those things. So we can very carefully tune in the latencies for every aspect of the system. We can carefully tune in the memory management. We can have the right fail-safes and fallbacks for different things. ‘Cause you have to account for what if, what if there is a critical failure? What if there's a cosmic ray that flips
Peter [00:16:14]: a bit in the middle of the processor that causes some malfunction? And you have to have a fail-safe to all of that, and so the core operating system is a part of that. And then the one last thing, which is a lot less exciting but is actually a very big topic, is reliability of updates.
Peter [00:16:30]: So the- I have a Tesla and you get updates fairly frequently, right?
Peter [00:16:36]: Once a month.
Most companies that are making vehicles
Peter [00:16:40]: are basically never doing updates, and they're- And even if they are doing updates, they're usually only updating maybe one module. Maybe they're updating the HMI module. But they're not able to update, let's say, the CPU critical parts of the system.
Peter [00:16:51]: You have to go into the dealer for that. And so with our operating system now we can actually enable highly reliable updates of any system in the vehicle, and that's way easier said than done. Like, there's lots of technical, technically deep stuff in the tech stack to do that in a way that you're not going to accidentally brick a vehicle.
Peter [00:17:08]: And right? If, imagine your
Alessio [00:17:10]: That would be bad.
Alessio [00:17:11]: Bad.
Peter [00:17:11]: Bricking a car is a very expensive
Peter [00:17:13]: and honestly, like across the industry maybe one of the most just pure impactful things that we've done is we've just, we're, we're now enabling the industry to actually do software updates.
Alessio [00:17:22]: Just to clarify as well, who is the customer for this? Like, I assume a lot of hardware manufacturers have their own firmware, and I'm sure some of them would just have you write it for them because you're experts. And others would have their own. Like, who pays for this? Who invites you into the house? Is it, is it the end user, or is it, is it the manufacturer?
Peter [00:17:41]: Yeah. So let me make an analogy firstly on the, on the fragmentation of software. So physical machines today are more akin to the state of the phone market before Android and iOS existed, right? So I worked on Android at Google by the way many years ago, and part of the reason that Larry at Google decided to get into Android was they wanted to run Google products on a bunch of phones, and they bought all of these phones from the industry, and it turned out they had like 50 different operating systems on these phones.
And it was virtually impossiblePeter [00:18:17]: for Google to make their app run on all 50 devices equally well. And so the solution was, well, actually what if, what if they created-A really great operating system and made it attractive to all of these phone makers, and that was sort of the genesis for what Android was and why Android existed. It was a way for Google to get their products onto really wide diversity of devices. The state of the physical, industry right now, it's a little bit like that. Like, there's yes, these companies have firmware, but they have so many different operating systems, it's so fragmented, and to actually get a modern AI application to run on these vehicles, you actually, you first have to consolidate the operating system, and so that's, that's why we've done that. And then, your specific question was who are our customers? It's, it's, generally it's the companies that are making these machines.Peter [00:19:06]: And we're, we're, we're selling our technology to them to really simplify the architecture and then enable these AI applications to run on them.Customers, Licensing, and the Better-Together StackSwyx [00:19:13]: How much is reusable across? Like, do you have, like, one OS that is just configured for everything, or is there some more customization that is needed?Peter [00:19:22]: Yeah, highly reusable. So the fundamental technology is quite universal, right? So things that we do have to think about though are, like, chipset support. And so if you're, if you're coding, let's say, an LLM and you have start with an assumption that, “Hey, oh, I'm gonna, I'm gonna use CUDA, and I'm gonna run this, on an NVIDIA chip,” then you don't really have to think about the hardware in that sense. Like, you're just, “Okay, I'm just I'm in the CUDA/NVIDIA ecosystem, and I'm, I'm going to use that.” But the hardware, especially in safety critical systems, it's a lot more diverse. There's not one or one or two players. 
There's a bunch of different chipsets that we have to support. And so our operating system doesn't just run on, like, the equivalent of X86. It has to, it has to run on a number of different architectures from chips from a bunch of different companies. But again, we've been working on this for a long time now, so we have, we have support for all of those chipsets. And then when you want to then run the AI applications, we can then do that reliably across now a variety of providers.Qasar [00:20:19]: And I think that is, like, heavily inspired by Android, right? Android has a huge suite of testing and it's a reliable operating system that runs on thousands of devices. And we think we can, we can do the same in all these physical moving machines, with the difference that we're really in a safety critical realm. Android isn't.Alessio [00:20:40]: So on Android, I don't need to use Gmail, I can use Superhuman. Like, what about your machinery? Like, can people bring somebody else's automation to it, or is it kinda like all-in-one?Qasar [00:20:50]: You have to use us. No. Yeah. we're If, Yeah. Yeah, it's totally open. Yeah.Peter [00:20:56]: Yeah. our philosophy is that we are a technology company, and so we license our technology to customers to use how they want. And so if a customer wants to If they wanna license our autonomy tech and our operating system, then great, we'll license those. If they just wanna license the operating system and then use different autonomy tech, that's fine also, and we have great documentation andSwyx [00:21:17]: Or if they wanna use developer tooling.Peter [00:21:18]: Yeah, exactly.AI Coding Adoption: Cursor, Claude Code, and the Bimodal EngineerSwyx [00:21:19]: It's, like, a better together if, obviously, if you, if they work together. 
Is it all C++, I assume, with different compile targets?
Peter [00:21:27]: We use a lot of C++.
Peter [00:21:28]: Rust is sort of the new hot kid on the block
Peter [00:21:32]: for a bunch of things as well. But yeah, the lower level you get, especially when you get to real-time constraints, you hit C++ at some point, and at some point maybe you work your way into assembly when needed.
Swyx [00:21:44]: Oh, damn.
Alessio [00:21:46]: I'm curious about the coding agent adoption, since you're mentioning more esoteric languages. What's the adoption internally? What have you learned?
Peter [00:21:55]: Yeah. We use everything. So Cursor was, I think, the hottest tool in the company for a good while. Now Claude Code, I think, has taken the reins on that. We have an internal leaderboard that we use just to encourage adoption
Peter [00:22:09]: within the company. And yeah, they're phenomenally useful. Honestly, we take inspiration from some of those tools also in how we're adapting some of that mindset to the physical realm. If it's so easy to build an app for this or that thing that lives just on a screen, we're now taking a lot of the same ideas and applying that to, “Okay, well, if you wanted a physical machine to do something, how easy can we make that, using our own tooling and platform as well?”
Alessio [00:22:40]: Are you changing any of the OS architecture, like the way you expose services, to be more AI friendly?
Peter [00:22:48]: Yeah, absolutely. In the early days of our tools infrastructure work, you had engineers that were experts in certain topics, but the things that you're dealing with are oftentimes more mathematical or more abstract, where GUI tools are actually very useful for certain things.
Like as an example, we have a product we call Sensor Studio, which is, it helps you design the sensor suite for your autonomous vehicle, whether, again, it could be a car, it could be a drone, could be a mining equipment, could be a robot. And you place sensors in different places. You There's different, There's a library. You can understand what are the trade-offs that you're making in the design of that system, and that was, like, a very, a very GUI intensive, thing ‘cause it's a little more like a CAD tool in that senseSwyx [00:23:37]: YepPeter [00:23:37]: if you've seen CAD tools. Nowadays, though, right, we expose all of the underlying APIs for that and now using, AI agents, you can actually configure a sensor suite with just text and likely reach a better result than you could've through the GUI in the past, and we're taking that thinking now through the whole product portfolio.Swyx [00:23:57]: Another thing I was thinking about is just in terms of, like, AI, adoption, does it change your hiring at least a little bit, or how do you, how do you sort of manage engineers, differently?Peter [00:24:08]: Yeah. absolutely, it does. we, I think like every company in the Valley right now, are evolving our hiring practicesPeter [00:24:16]: because the skills required to be effective are changing so fast, right? you used to really select for just rote implementation ability and now it is more the AI engineer skill set, right? Where it's like, yeah, how to implement, but actually-Just banging out code is no longer the core job, right? It's, it's actually knowing what questions to ask, knowing how to tie, how to tie together these different AI tools. And so the interviews that we give now I think are way harder than they've ever been.Peter [00:24:46]: But we also allow, right, selective use of AI tools to solve the problems. And I think in that you start to see more of a bimodal distribution of engineers, right? 
You start to see, like, wow, there's this subset of people that really get it. They're all in and they've clearly invested the hours needed to learn these tools and how to be effective.
Peter [00:25:09]: And then there's the group of people that haven't done that, and the productivity gap is just enormous. And so we're obviously trying to select for the people that are really into this.
Swyx [00:25:20]: I first wrote my AI engineer piece three years ago, and when I first wrote about it, I was like, “Actually, not everyone should be an AI engineer,” ‘cause I think there's an extremist stance where, well, every software engineer is an AI engineer. And my actual example of people who should not be adopting AI was embedded systems and operating systems and database people. Are they adopting AI?
Peter [00:25:41]: I think it's the classic bitter lesson topic. Six months ago I would've said the same thing, but it's becoming super useful for every domain.
Qasar [00:25:53]: I'm sure.
Peter [00:25:54]: Right?
Peter [00:25:56]: I think six months ago, or maybe a year ago, if you tried to use, let's say, the latest Claude model for writing GPU shaders, the results were probably underwhelming. And if you use the latest model now to do that kind of task, you're a little bit blown away, like, “Wow, that actually worked. That's amazing.” And we see the same thing in the embedded realm. No question though, especially when you get into safety-critical systems, the human validation is
Peter [00:26:25]: 100% key. You're not gonna trust your life to AI-written software that's not been very carefully checked by humans.
And so I think now the challenge is really about the appropriate level of human validation for these safety-critical systems.
Verifiable Rewards, Evals, and Neural Simulation
Alessio [00:26:41]: Touching on the simulation side, I think verifiable rewards and reinforcement learning are, like, the hottest thing. What have you done internally to build around that? And what makes you sleep at night? Like, if somebody's just vibe coding something or
Alessio [00:26:57]: wants to try something new, you have a good enough system. Because I think the opposite is also true: if it's super easy to write anything
Alessio [00:27:04]: then it puts a lot of work on the verifiable
Alessio [00:27:07]: side of it. What does that look like for people?
Peter [00:27:10]: Yeah. So verifiability is part of a broader bucket of evaluations, right? How do you evaluate the results that you're getting? I think this is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults in the system.
Peter [00:27:29]: And so the problem of doing proper eval to find those faults also keeps getting harder as the models get better. But it's no less important than it's ever been, right? There are still going to be edge cases that are not met and whatnot. And so it's a big area of investment for us. On the reinforcement learning topic, the key thing is there are all these new requirements that come with the latest generation of these technologies. So for example, end-to-end is the big thing right now in autonomy and physical AI, which is you can now train these models that can effectively take sensor data in and put control signals out, and get really good results out of that. But the way that you train and improve those models is really different from the previous generations.
And so to do reinforcement learning on an end-to-end model, you now need to actually simulate all the sensor data, right? So then this becomes a we call our, work in this neural simulation, but it'sPeter [00:28:26]: think of it like a hybrid of Gaussian, splatting and diffusion methods, and where you really care about performance. Like performance is everything. If you can't do enough simulation fast enough and cheap enough, you actually can't get results that are worthwhile, in the end. It also gets to a lot of our work in embedded systems, which is like performance critical work, and that performance optimization, performance criticality, it carries over to a lot of the model training work. because, like, the only way to make it affordable is it has to be really fast.Qasar [00:28:58]: I think it's worth a few minutes talking about our own, evolving thoughts on verification and validation withinQasar [00:29:05]: kind of, traditional simulators, which are, you can think of like vehicle dynamics or something like that, which you're just taking textbooks and taking those formulasQasar [00:29:13]: and putting them into software, to like now this neural sim/world model universe. I think that's an interesting topic.Peter [00:29:20]: Yeah. So in more traditional development, right, you oftentimes would have, more black-and-white answers to questions.Peter [00:29:28]: And so the in Europe as an example, there's, a regulatory, system, it's called Euro NCAP. It's the European New Car Assessment Program, and as part of that, the vehicles have to pass a bunch of tests, and those tests actually, include, safety systems. So automatic emergency braking for a child that runs in front of a carPeter [00:29:51]: or let's say an occluded child that runs out and you hit it. And so you have You end up with sort of these binary answers of like, well, did the car under test pass this specific test? 
And there's a very well-known set of test cases
Peter [00:30:05]: that the vehicle has to pass. And that was how the industry worked, let's say, until 10-ish years ago. But what's changed now is with these models, everything is statistics, right? You no longer have a black-and-white answer; it's, well, how many orders of magnitude or how many nines of reliability can I get in the system, and how can I prove that to be true? And the big unlock, honestly, for physical AI as an industry is that these models are just becoming much more reliable. Right? Things actually work a lot better. The number of nines you can get out of these systems is now good enough that it actually becomes cost effective to really deploy these things. And so the big shift in verification and validation has been this: in the past it was strictly requirements, and are you meeting them or not? And now it's more of a statistical verification and validation case where it's all about how many nines of reliability and mean time between failures, that sort of thing.
Statistical Validation, Regulators, and the Cruise Lesson
Swyx [00:31:04]: And is the target audience regulators, or even the customers? I imagine the customers are bought in, and it's mostly regulators that need to be satisfied.
Peter [00:31:15]: We do work with the US government, we do work of course with the European governments and the government of Japan, and the government is not like an AI lab by any means.
Peter [00:31:25]: So
Swyx [00:31:26]: They just care about the outcome.
Peter [00:31:27]: They care about the outcome.
Peter [00:31:28]: And so we do education in that regard, sort of teaching, “Hey, this is how we think validation should be done, and this is an approach that we think is reasonable,” and how to think about when a driverless system is actually safe enough to go on the roads, and that sort of thing.
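To make the "how many nines" framing concrete, here is a minimal sketch of the arithmetic behind it. It is an illustration under simple assumptions (independent trials, zero observed failures, a standard exact binomial bound), not a description of how any company in this conversation actually validates its systems. The function names are hypothetical.

```python
import math


def nines(failure_prob: float) -> float:
    """Express a per-trial failure probability as 'nines' (0.001 -> 3 nines)."""
    return -math.log10(failure_prob)


def failure_rate_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial failure probability after `trials`
    failure-free, independent trials, at the given confidence level.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


def trials_needed(target_failure_prob: float, confidence: float = 0.95) -> int:
    """Failure-free trials needed to claim the target failure probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_failure_prob))


def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean time between failures from observed operating hours."""
    return operating_hours / failures


if __name__ == "__main__":
    n = trials_needed(1e-3)
    print(f"failure-free trials for 3 nines at 95% confidence: {n}")
    print(f"bound after {n} trials: {failure_rate_upper_bound(n):.6f}")
    print(f"MTBF for 4 failures in 10,000 h: {mtbf_hours(10_000, 4):.0f} h")
```

The takeaway matches the point in the conversation: each extra nine multiplies the amount of failure-free evidence required by roughly ten, which is why cheap, fast simulation matters so much for this style of validation.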
But I wouldn't say that the government is asking for it. We're more teaching the government in that sense. Honestly, it's more so for our own comfort, right? We want to build very safe systems, and then of course our customers care deeply about that as well. But in that context we're also typically educating our customers.
Qasar [00:32:01]: Yeah. Our first core value is around safety. So I think we can't underline enough that us verifying and validating that the systems we're deploying are safe to us is probably as important as some regulator or a customer saying,
Swyx [00:32:19]: Of course. Okay. Yeah.
Swyx [00:32:20]: You have to satisfy yourselves.
Peter [00:32:22]: As I say, as a whole across the world, regulation oftentimes is almost a lowest common denominator. You really have to substantially exceed what the regulators are expecting to make good products.
Swyx [00:32:33]: Yeah. One thing I often talk about, and I try to make this relatable to the audience also, is Cruise, where they had an accident that basically ended the company. I wonder if people overreact to single incidents, because incidents are going to happen regardless, right? ‘Cause it's a statistical thing. I don't know if regulators understand that you cannot extrapolate from a single incident, but we do because that's all we have to go on. And your sample sizes are necessarily gonna be lower than, I don't know
Swyx [00:33:00]: consumer driving.
Qasar [00:33:01]: Yeah. I think the Cruise example wasn't a technology failure. The real compounding issue there was just how did the company talk to the regulators and what was their kind of behavior, and I think that became more of the issue. If you look,
Peter [00:33:19]: It definitely was a technology failure, but it was made much worse by the
Swyx [00:33:23]: Put the car back on the woman.
Qasar [00:33:25]: Yeah.
And let me put it another way. There is a version where Cruise still exists.Swyx [00:33:29]: right. Right.Qasar [00:33:30]: Right. It'sSwyx [00:33:30]: It was like the last strawQasar [00:33:31]: ItSwyx [00:33:31]: in like a long chain ofSwyx [00:33:33]: like issues.Qasar [00:33:33]: So do you feel like ATG had that horrific accident or someone actually dying, because, that was a homeless person crossing the street? So yeah, I think we can't understate enough that ultimately, like, statistical validation of something, that's one part of it, but it's not the only part of it. Like, consumer and let's say, mainstream adoption of these technologies is also gonna be part of that conversation. I think companies like Waymo are doing a lot of service positively to the industry in the sense of they're, they're setting a high benchmark and they're showing, kind of in a very responsible way how to, how to deal with these. There have been Waymo incidences as well. They've just not been as significant as the Cruise one that you mentioned. But yeah, so I think you'll just continue to see that. I think probably the long term question is really gonna be, again, around Like it is very clear humans are way worse drivers statistically.Qasar [00:34:29]: Like, there's no, there's no debate. And so at what point But we're emotional animals.Swyx [00:34:34]: Yeah. So my thing is, like, we have to get to a point as a society where we accept horrific accidents that would never happen by a human because statistically we understand that it is safer overall. In the same way that planes, they're safer, than I think they're the safest mode of transport that we have.Qasar [00:34:50]: Yeah. 
it's more dangerous to drive to the airport than it is to get on a flight.Qasar [00:34:53]: So if you're everQasar [00:34:54]: if you're ever getting nervous about getting on a plane, just think “I just gotta get to the airport.”Swyx [00:34:58]: Yes, we're flying.Qasar [00:34:59]: If I get to the airportQasar [00:35:00]: I'll be good.Swyx [00:35:00]: But then it's, planes also concentrate the tail risk if planesQasar [00:35:03]: Yeah. AndPeter [00:35:04]: And I was, I don't think we honestly have to worry about there ever being, accidents from these systems that are like much worse than what humans would cause, ‘cause humans do terrible things.Peter [00:35:14]: Like, people fall asleep at the wheel all the time.Swyx [00:35:16]: I have.Swyx [00:35:17]: Like, I'll call, I've been a drowsy driver.Peter [00:35:19]: Kinda drunk drivers, and that'sPeter [00:35:20]: that's the extreme end of the example. But these AI systems, you have redundancies, you have fallbacks. Like, there's many things have to go wrong for there to actually be a something catastrophic because there's, there's so many, fallbacks that these systems have.Alessio [00:35:36]: your simulation is like so vast because there's so many use cases. What are, like, maybe things that worked in a simulation and then you put it out and it's like, “F**k, this isAlessio [00:35:45]: this just did not work at all?”Peter [00:35:47]: Yes.Alessio [00:35:47]: IsPeter [00:35:47]: That's maybe a bit of a misconception, about simulation there. So let me go a little bit, more technical on this. So at first go, no simulation is going to represent the real world. 
There's always a process of this, sim to real matchingPeter [00:36:02]: where you actually, you need the real world feedback to basically feed into the parameters that are being used in the simulator, and you have to do that, it's like this validation flow, a number of times until you can get some confidence that, like I think the simulator is now accurately representingPeter [00:36:19]: what's gonna happen in the real world. Now, if you have a situation where you've done that full validation and you thought that it was accurate and then there's something different, those are much trickier cases, and that's, that absolutely can happen, but really I think the validation process is a really important part. You can never skip the simulation validation process, like where you're actually ensuring that, hey, the actual, my sim to real gap here is small enough that I can trust these simulation results. And there's, there's so many fun things that you can do when you get into it. Like, I'll, I'll give one fun example that came up recently is like in these humanoid robotics, systemsOverheating actuators is a real problem, right? So obviously phenomenal demos. IPeter [00:37:01]: The most amazingAlessio [00:37:02]: For 10 minutes.Peter [00:37:03]: The most amazing I can get. I love, I love watching robots do acrobatics like everybody but the these systems actually overheat, right? If, like, And one of the ways you can use simulation though is you can actually have that, the temperature of those actuators be one of the parameters that's representedPeter [00:37:18]: in the simulation. And if you're doing reinforcement learning over a certain task, then the robot can actually adjust its motions in the simulation to account for the fact that, oh, it knows that as it's moving, it's actually beginning to overheat this motor. 
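The actuator-overheating point can be sketched in a few lines: if motor temperature is part of the simulated state and the reward, a policy trained in simulation has a reason to avoid cooking the motor. The sketch below is a toy, with a made-up first-order thermal model and hypothetical constants and names (`ActuatorThermalModel`, `heat_per_torque_sq`, `cooling_rate`); it is not the simulator discussed here.

```python
from dataclasses import dataclass


@dataclass
class ActuatorThermalModel:
    """Toy first-order thermal model of one actuator (all constants assumed)."""
    ambient_c: float = 25.0
    heat_per_torque_sq: float = 0.02  # deg C per second per (N*m)^2, hypothetical
    cooling_rate: float = 0.05        # 1/s, hypothetical
    limit_c: float = 80.0             # temperature at which the motor is at risk
    temp_c: float = 25.0

    def step(self, torque_nm: float, dt: float) -> float:
        # Resistive heating grows with torque squared; cooling pulls toward ambient.
        heating = self.heat_per_torque_sq * torque_nm ** 2
        cooling = self.cooling_rate * (self.temp_c - self.ambient_c)
        self.temp_c += (heating - cooling) * dt
        return self.temp_c


def reward(task_progress: float, temp_c: float, limit_c: float,
           penalty_weight: float = 1.0) -> float:
    """Task reward minus a penalty for every degree above the thermal limit."""
    overheat = max(0.0, temp_c - limit_c)
    return task_progress - penalty_weight * overheat


def simulate(torque_nm: float, seconds: float, dt: float = 0.1) -> float:
    """Hold a constant torque and return the final actuator temperature."""
    model = ActuatorThermalModel()
    for _ in range(int(seconds / dt)):
        model.step(torque_nm, dt)
    return model.temp_c


if __name__ == "__main__":
    for torque in (5.0, 20.0):
        temp = simulate(torque, seconds=60.0)
        print(f"{torque:>4} N*m for 60 s -> {temp:.1f} C, reward {reward(1.0, temp, 80.0):.1f}")
```

Dropping `temp_c` from the state and the reward gives exactly the failure mode described: the policy has no signal that high-torque motions are costly, so it learns them and then overheats on real hardware.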
But if you didn't have that parameter of, let's say, the heat of that motor represented in the simulation initially, then your RL policy might It will disregard that. And now you run that on the robot and the robot will overheat and fail.Alessio [00:37:43]: I guess the question is, like, how do you have all of these parameters taken care of while also understanding the deployment environment? Like, temperature is like a great example, right? WellAlessio [00:37:53]: why did you make my robot worse when it runs in like a freezer?Alessio [00:37:57]: So it actually shouldn't worry about that. it's like, yeah, how do you design these simulations?Peter [00:38:02]: This is honestly the This is what makes simulation so hard, right? it's because you Simulation is fundamentally about you're trying to optimize the development of a system, right? Like, how can I build this system faster and better and cheaper and what are all the levers that I have to actually accomplish that? And because simulation's just a software program, you can, you can change it a lot more easily than you can hardware systems. And then what's particularly awesome about the let's say, world models and using that as a part of simulation is now the simulation doesn't just scale with, let's say, adding new math equations inPeter [00:38:36]: but we can actually scale the simulation environment now with additional real world data and that also unlocks a whole new field of robotics.Qasar [00:38:46]: There is a meniscus line where you cross where still doing real world testing is better. there's, in this, sim-to-real gap, you can reproduce reality at exceedingly expensive costs and this So nothing is free. So really you have to you're finding that line where you're getting great performance, you're getting great feedback, whether it's on the training side or on the eval side, but it's way cheaper than doing it in the real world. At some point it, that doesn't make sense. 
And so even from our earliest days in autonomy, our view was you're still gonna do real-world testing. There's not this magical land where you're not gonna do that. And maybe a more nuanced version of this in traditional software development is: most of your testing for software in a vehicle, 95% of that, can be traditional CI/CD kind of flows that you would have in traditional web development. Now, let's say you have a truck. Well, you can do like 4% of those in a rig which has all the components, the electrical and electronics of a truck, but it doesn't have the tires. And then you have the 1%, which is actually the vehicle. There's a similar analogy in terms of using simulation for intelligent systems. You can do a lot in a simulator and using world models, but ultimately it's physical AI. So you're gonna deploy it on physical machines and
Qasar [00:40:17]: the freezer example comes to light.
Alessio [00:40:20]: The world model thing has been to me the hardest thing to
Alessio [00:40:22]: wrap my head around. Like, we had Fei-Fei Li on the podcast.
World Models, Hydroplaning, and Cause-Effect Learning
Qasar [00:40:25]: We've been doing a small series with, like, another Intuition company, General Intuition, as well.
Qasar [00:40:31]: Yeah, and I mean, lots of coverage on NeRFs and yes.
Alessio [00:40:34]: Yeah. It feels like when we talk about the heliocentric system, right? In a world model, if you just feed visual data, the model might learn that the sun spins around the Earth. It makes sense, right? And it's like, well, not really. And there are some of these other things. Hydroplaning is one thing I think about: can a world model understand hydroplaning and what amount of water causes it to happen?
And it's like, yeah, to me it's like I don't understand how you guys do it. I guess it's like the real thing is like when you're doing both cars and the highway in Japan versus the excavator in a mine in,Qasar [00:41:13]: ArizonaAlessio [00:41:13]: wherever you're Arizona, wherever you're deploying them.Alessio [00:41:15]: How much of it are you relying on the world models to like generate the simulations for you and then try and close the gap after versus like giving the world models as a tool to your engineers to like curate the simulations if that makes sense?Peter [00:41:28]: Yeah, totally. So yeah, I can say at a pure engineering level, I think if you're hoping to do real world deploys and you're purely relying on a world model approach, you probably won't get to something that works, before you go bankrupt. So there is just a very practical mindset of like, world models are amazing and they're extremely useful for a lot of use cases, but there are a lot of other things that you need to do to actually get something started and something deployed and working. most fundamentally, world models are all about It's understanding the world, but also understanding what's going to happen. It's like the cause-effect relationship.Peter [00:42:01]: Right? And so like it, right, if you have a take some sort of construction tool, and that construction tool is gonna be doing some work on the Earth in some way, it's gonna be moving earth, the world model needs to understand that cause-effect relationship. Like, okay, when I, when I take this material from here and put it over there and now I have things that are over here and not over there anymore and that cause-effect, relationship. data obviously is a is a big problem. The hydroplaningPeter [00:42:26]: one is actually a really great example because it's actually quite non-obvious sometimes. Right? 
It's like, well, it's, it's raining and well this road, has, let's say the appropriate curvature to it so the water is running off the road and cars are driving faster here and then you approach a road that's very flat and water is now puddling on that road and all of a sudden cars are driving slower because when they were driving faster they were starting to lose control. And there are a lot of visual nuance, very nuanced visual cues in the scene and so I do think in the world model concept there's a good chance that the model actually would learn that you should just drive slower when these visual cues exist, and that's obviously the beautiful-The beauty of, these kinds of models where they just, they learn these non-obvious things.Swyx [00:43:14]: It doesn't need to know about hydroplaning to know that it needs to drive slower.Peter [00:43:17]: Yes.Swyx [00:43:17]: I guess it's Yeah. I wanna ask questions about, also deploying models. I presume, like, you use a lot of these world models for training data and simulation, but what about deploying it onto the systems in production? Presumably you have you have, like, GPUs on deviceOnboard vs. Offboard: Latency, Embedded ML, and DistillationSwyx [00:43:36]: but they're I keep saying on device. What's the what's the right term for that?Peter [00:43:40]: On machine.Swyx [00:43:41]: On machine.Peter [00:43:41]: Or embedded, yeah.Swyx [00:43:42]: Yeah. What is the embedded world like? because for people who are not used to that world, this is very alien.Peter [00:43:49]: Yeah. So it's actually We call it onboard and off board.Peter [00:43:52]: So like, onboard software and off board software.Peter [00:43:54]: And the great thing about off board software is you don't have to care about time, and you can run really large models, right? 
So you can say, “Well, this model, I don't care if it takes one second for it to give me a result or 10 seconds for it to give me a result, because we have time.” And the models can be really big, and they can run in a data center or on a huge GPU, and you can obviously have distributed compute, et cetera. But onboard you don't have any of those benefits. You're like, “Well, I have this many milliseconds where I need an answer from this model.” And so a lot more of the energy then is about, think of it more like distillation, and it's truly about efficiency; literally every fraction of a millisecond counts. And you can't have a situation where the model takes too long, because then the vehicle can't actually function.
Peter [00:44:42]: And so you can still use a lot of the same techniques, and the models themselves you can think of as a derivative of larger models that you can run offline, and you're trying to get a model that still performs really well but is a small enough version that you can then run on this embedded system where you care about latency and power.
Qasar [00:45:03]: Yeah. And I think the broader point, which maybe is not obvious but is worth saying, is in the physical AI world we're not really constrained right now by the intelligence of the models. It's actually what Peter's talking about, it's actually deploying them in
Swyx [00:45:19]: The hardware they give you.
Qasar [00:45:21]: Yeah. On the hardware they give you.
Qasar [00:45:22]: And there's just the reality of safety-critical systems.
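The onboard constraint ("I have this many milliseconds where I need an answer") can be illustrated with a small latency-budget wrapper. This is only a sketch: `run_with_budget`, the model callables, and the budget values are hypothetical, and the check happens after the call returns. A real safety-critical system would enforce the deadline with a real-time scheduler or watchdog rather than measuring after the fact, and would not be written in Python.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_budget(model: Callable[[], T],
                    fallback: Callable[[], T],
                    budget_ms: float) -> Tuple[T, bool]:
    """Run `model`; if it took longer than `budget_ms`, discard its answer
    and use `fallback` instead. Returns (result, met_deadline)."""
    start = time.perf_counter()
    result = model()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if elapsed_ms > budget_ms:
        # A late answer is treated as no answer: fall back to a safe default.
        return fallback(), False
    return result, True


def fast_model() -> str:
    return "steer_left"


def slow_model() -> str:
    time.sleep(0.02)  # stand-in for a model too large for the onboard budget
    return "steer_left"


def safe_fallback() -> str:
    return "hold_lane"


if __name__ == "__main__":
    print(run_with_budget(fast_model, safe_fallback, budget_ms=1000.0))
    print(run_with_budget(slow_model, safe_fallback, budget_ms=1.0))
```

This is the reason distillation comes up in the conversation: the goal is to get a model small enough that it meets the budget on every call, so the fallback path is the exception rather than the norm.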
So those end up being the your limiting factorsQasar [00:45:29]: rather than, let's say, a limiting factor for, a foundation model companyQasar [00:45:34]: is gonna be just capital maybe or researchers.Qasar [00:45:38]: So we're, we're in that way dealing with, for us as people who kind of come in that realm with like a very interesting Those constraints force creativity.Swyx [00:45:47]: And I imagine, nobody was deploying or giving you the hardware for transformers back in 2018, whatever, but now they are. What's the evolution like? just peel back the curtains a little bit.Peter [00:45:59]: Yeah. Transformers first off, I think the paper was originally published in 2017.Swyx [00:46:02]: 2017.Swyx [00:46:02]: So there's no time.Peter [00:46:04]: And ISwyx [00:46:05]: But I'm just saying I guess I'm saying, like, embedded ML systems usually, like, a lot less parameters, a lot less compute, and now, like, orders of magnitude more.Peter [00:46:14]: Yeah. absolutely. what I was gonna say though was I think in the in the original paper in 2017, maybe it's in the last paragraph, somewhere in the paper they talk about, like, “Oh, by the way, this technique might be useful for, like, images and videos as well.”Peter [00:46:30]: These last subjects.Peter [00:46:31]: And it took a few years for that impact to really hit. But like, now, we're seeing transformers are everywhere.Swyx [00:46:39]: Yeah. Vision transformers.Peter [00:46:40]: And then then the compute just keeps getting better and better. But you do have this fundamental trade-off, right? It's like you have power, you have cost, and performance and like, getting the right, getting the right mix of those things in an embedded package that can also be, like, shaken and baked in all thePeter [00:47:00]: conditions that these things have to have to operate in. 
But yeah, I think that they're only going to keep getting better and so we also try to plan our strategy understanding that, we know the rate of improvements of these systems.Swyx [00:47:11]: Yeah. So like, Google just released the Gemma 2B modelSwyx [00:47:15]: that effective 2B model. Is that useful to you guys or is that too big?Peter [00:47:18]: You can run that model on an embedded system, definitely.Peter [00:47:21]: the So yes, it's, it's useful in that regard. The bigger question is, like, what do you use it for in an embedded system? Like, you actually need to customize it quite a bit to make it useful for something. But yeah, you could run a two billion parameter model, definitely.Swyx [00:47:35]: It also interesting, like, what percent is a custom ML model that only does that thing versus a generalist LLMSwyx [00:47:41]: which probably is not that useful actually for your context.Peter [00:47:46]: Like, you, like, you can imagine different use cases, right?Peter [00:47:48]: So theSwyx [00:47:49]: The voice stuff, yes.Peter [00:47:49]: Yeah, the voice test. Totally, yes.Peter [00:47:51]: So for the actual, autonomy elements, that's 100% in-house. We do every bit of that, the data simulation, the model, everything. But when you get into the more generic use cases like voice or voice assistant kind of thing, that's where these more generalist models like Gemma actually can be quite, can be quite useful.Swyx [00:48:09]: Yeah. And then there's also obviously a trade-off between, like, what percent must you do on machine, versus just call home.Peter [00:48:16]: Yeah. It's all about latency.Swyx [00:48:17]: Latency.Peter [00:48:17]: It's all about latency. Yeah.Swyx [00:48:18]: Yeah. Well, like, I think actually in a lot of contexts, especially in the US, you can just have a connection to the web.Qasar [00:48:26]: Yeah. 
I think though most of our universe is everything has to be fairly, embedded and local because just the nature of Even in the US there's a lot of likeSwyx [00:48:39]: PatchinessQasar [00:48:40]: don't haveQasar [00:48:41]: have coverage, right? And if you look at, like, the old world of autonomy within mining, which is, like, long before transformers and kind of, neural networks, in the like CNN and kind of a universe, they were really just hand-coded, systems. They were just like, this machine is gonna run to that place with thisPeter [00:49:03]: That was our GPS, like very accurate GPS.Qasar [00:49:05]: Yeah. And so that worked, and that worked for 20 years, so why would we actually need to use transformers or kind of more modern end-to-end systems? Mainly because you can only really run a path and run backwards. That provided a lot of value, but m-Not as much as you get when the machine is actually intelligent. It's, it's seeing, it's perceiving, it's acting in a dynamic world.Alessio [00:49:28]: I looked up RTK, real-time kinematic, one to two-centimeter accuracy.Qasar [00:49:32]: Yeah. Fantastic. But the and fantastic in faraway lands where there's not gonna be cell phone coverage.Peter [00:49:39]: Yeah, so it's widely used on the legacy mining and agricultural autonomy systems today. So like, for example, a combine that can be precise within one or two centimeters as it's driving down the field, they use RTK.Qasar [00:49:53]: Yes.Peter [00:49:53]: But it's, it's expensive.Qasar [00:49:54]: Yeah. And it's, it's, it's autonomy, but it's not intelligent in the way that I think all of usQasar [00:49:58]: if in twenty-six we'd be talking about intelligence.Alessio [00:50:00]: In one of your blog posts, you mentioned research on large scale transformers that are similar to those doing modern generative AI. What are, like, the big differences other than, “You're absolutely right. 
I should steer the car, so you probably wanna remove that?”Peter [00:50:14]: We have a diversified bet strategy internally, and the reason we've done that is because we operate in now a bunch of industries, a bunch of geographies, and each of the approaches has, obviously a different risk to them.Peter [00:50:27]: And so like, we're not going to put all of our eggs in a single basket for a single approach because that approach may no

[Podfic]
IAHL7: Six Months Later

Apr 24, 2026 9:02


A Good Omens fanfic by mostlyeffable. Part 3 of the Unkind Regards series. Full name: It's A Hard Life.
Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0)
Sounds:
Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0)
Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0)
Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0)
Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0)
RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0)
For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/82013181

[Podfic]
It's A Hard life (Complete)

Apr 24, 2026 225:25


A Good Omens fanfic by mostlyeffable. Part 3 of the Unkind Regards series.
Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0)
Sounds:
Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0)
Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0)
Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0)
Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0)
RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0)
For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/82013181

Web3 101
E78 | A Deep Dive into the Web3 Lineage Behind Hermes Agent and OpenRouter: Does Web3 Still Have a Future Now That Its Talent Has Moved to AI?

Apr 24, 2026 58:23


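Web3 101 is not AI 101, but we have to talk about the Web3 lineage behind some recently popular AI products (such as OpenRouter and Hermes Agent). Is this an isolated coincidence, or a genetic advantage that crypto natives carry with them?

We take a deep look at the history of Nous Research (the team behind Hermes Agent) and of OpenRouter, explore the little-known contributions of Stability AI and crypto mining companies to the AI compute race, and try to understand how talent and ideas are mixing between two highly adventurous industries.

[Hosts]
Liu Feng (刘锋), partner at BODL Ventures, former editor-in-chief of ChainNews
Xiong Haojun (熊浩珺, Jack), deputy editor-in-chief of BlockBeats, host of the podcast Web3无名说

[Guest]
Wang Chao (王超), technology investor

[What you'll hear]
00:21 Web3 101 is not AI 101, but let's talk about the Web3 lineage behind recently popular AI products
03:07 The origins of Nous Research, the team behind Hermes Agent
05:52 Nous's CEO was once CTO of Eden Network, an MEV project that went to zero
06:22 From the Hermes models that defend user sovereignty, to the YaRN architecture, to successfully taking on the fine-tuning of a 405B-parameter model
17:15 The distributed training framework DisTrO
24:05 The hidden thread behind the breakout agent Hermes Agent: a distributed reinforcement learning (RL) data-collection network sits behind it, and it could very likely evolve into a decentralized data marketplace
29:31 OpenRouter founder Alex was a core member (CTO) of OpenSea, the largest NFT marketplace
33:06 OpenRouter's business logic: aggregating heterogeneous data, and the shift from aggregating NFTs and data to aggregating and routing LLM APIs
36:45 Why do teams that came out of Web3 shine in the AI world? Is this a general phenomenon (not really)?
40:08 Moltbook's founder is also a crypto veteran
40:26 Stability AI as an open-source forerunner: founder Emad came from crypto and went back to crypto
45:48 Rising AI stars who came out of the FTX Future Fund: Leopold Aschenbrenner (OpenAI / Situational Awareness fund) and Avital Balwit (Anthropic)
47:46 Crypto mining companies inadvertently become AI compute centers
52:23 The ultimate question: Web3 talent is flowing to AI; should we be sad?
55:44 Set the sadness aside, respect the technology wave, and let talent flow to where innovation is densest

[Glossary]
Web3 and AI terms mentioned in this episode: OpenRouter, Hermes Agent, Nous Research, Moltbook, Stability AI, FTX Future Fund, Leopold Aschenbrenner, Avital Balwit

[Further reading and related terms]
GPTeacher: In spring 2023, after seeing Stanford's Alpaca paper (which showed that training data distilled from GPT-3.5 could be used to fine-tune a small model), Teknium decided to use the more capable GPT-4 to generate higher-quality instruction data. It became one of the earliest and best-known GPT-4-generated fine-tuning datasets in the open-source community, and supplied raw data for the evolution of early open-source models.
Hermes model series: First released in June 2023. It was built in the early days with almost no funding, by community members pooling compute in their spare time, and has since evolved to the Hermes 4 series. It showed that a small open-source team can reach a very high level of base reasoning performance, and it established in the industry a "neutral alignment" product philosophy that opposes preachy censorship.
YaRN: A mechanism that improves the underlying mathematics of rotary position embeddings (RoPE). The Nous team developed it in August 2023 to address the pain point that native LLaMA models had only a 4,000-token context and could not handle long documents. The approach was later applied directly by major players such as Meta (Llama 3.1) and DeepSeek, and became a common foundational technique in the industry.
WorldSim: A sandbox-simulation product built as a web-based command line that calls LLM APIs under the hood to explore and generate parallel, continuous text worlds. It launched in 2024 and drew market attention, but Anthropic cut off its API access after it triggered risk controls.
Hermes 3 405B: An open-source model released by Nous Research after an extremely complex full-parameter fine-tune of the giant, hundred-billion-scale Llama 3.1 405B base model.
DisTrO (Distributed Training Over-The-Internet): An extremely efficient communication-compression mechanism for distributed training, which compresses the volume of network traffic between physical machine nodes to roughly one-thousandth of the original.
DeMo algorithm paper: An academic write-up of the core underlying principles, produced after the initial DisTrO communication-compression work and jointly published with several outside co-authors. Durk Kingma, a leading figure in deep learning and inventor of the Adam optimizer, joined as a co-author after reviewing it.
Psyche Network: The direct system implementation of the DisTrO compression technique: a network in which community members voluntarily contribute idle commodity GPUs to train models jointly. It relies on the Solana blockchain to coordinate task dispatch between nodes, realizing a physically decentralized mainnet for training large models.
Teknium: Co-founder of Nous Research, anonymous. The early practitioner who independently conceived and built the first-generation Hermes model and GPTeacher.
Karan Malhotra: Co-founder of Nous Research. He studied religion and philosophy as an undergraduate, and his résumé includes work as a researcher at a frontier brain-stimulation lab at Stanford University. He directly led the launch of non-standard projects such as WorldSim.
Jeffrey Quesnelle: Co-founder and CEO of Nous Research; formerly chief engineer at Eden Network.
Alex Atallah: OpenSea CTO + OpenRouter founder.

[Post-production] AMEI
[Operations] Zhu Jie (朱婕)
[BGM] Mumbai - Ooyy
[Find us here]
Where to listen: Apple | Xiaoyuzhou (小宇宙)
Overseas listeners: Apple Podcast | Spotify | Google Podcast | Amazon Music
Contact us: podcast@sv101.net
Special Guest: Wang Chao (王超).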

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)

Apr 23, 2026 54:52


Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal.

Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what's real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

Thanks to Jacob and the UL production team for hosting and editing this!

Jacob Effron
* LinkedIn: https://www.linkedin.com/in/jacobeffron/
* X: https://x.com/jacobeffron

Full Episode on Their YouTube

We discuss:
* swyx's view from the center of the AI engineering zeitgeist: OpenClaw, harness engineering, context engineering, evals, observability, GPUs, multimodality, and why conference tracks now reveal what matters most in AI
* Whether AI infrastructure has finally stabilized: why “skills” may be the minimal viable packaging format for agents, why infra companies have had to reinvent themselves every year, and why application companies have had an easier time surviving model volatility
* The vertical vs. horizontal AI startup debate: why application companies can act as the outsourced AI team for enterprises, why some horizontal companies still matter, and why sandboxes may be the clearest reinvention of classic cloud infrastructure for the AI era
* The “agent lab” playbook: starting with frontier models, specializing for your domain, then training your own models once you have enough data, workload, and user behavior to justify the cost and latency savings
* Why domain-specific model training is real, not just marketing: how companies like Cursor and Cognition can get users to choose their in-house models, and why search, domain specialization, and distillation are becoming more important
* Open models, custom chips, and alternative inference infrastructure: why swyx has turned more bullish on open source, why non-NVIDIA hardware is suddenly getting real attention, and why every 10x speedup can unlock new product experiences
* What it means to sell to agents instead of humans: why agent experience may mostly just be good developer experience by another name, why APIs and docs matter more than ever, and how pretraining-data incumbents are compounding advantages in an agent-first world
* Why memory and personalization may become the next big wedge: today's models mostly reward frequency of mentions, but in the future, swyx expects product choice to be shaped much more by personalized memory systems
* The state of the AI coding wars: why coding has become one of the largest and fastest-growing categories in AI, how Anthropic, OpenAI, Cursor, and Cognition have all ridden the wave, and why the category may still have more room to run
* Capability exploration vs. efficiency: why the industry is still in a token-maxing, experiment-heavy phase where people are rewarded for spending more rather than less
* Claude Code vs. Codex and the strange stickiness of coding products: why first magical product experiences may matter more than expected, and why the bigger mystery may be why only a few names have emerged as real winners so far
* What the end state of the coding market might look like: two major players, a longer tail of niche products, and possible disruption if Microsoft, Mistral, xAI, or the Chinese labs push harder into coding
* Where application companies still have room against the labs: why frontier labs are trying to expand into verticals like finance and healthcare, but still leave space for focused companies that own the workflow and the last mile
* Why coding may be a preview of every other AI market: the first category to truly go parabolic, the clearest example of foundation model companies colliding with application companies, and a template for how future vertical AI markets may develop
* Why AI valuations now feel unbounded: from billion-dollar ARR products built in a year to trillion-dollar market caps, swyx and Jacob unpack how the AI market has broken traditional startup intuitions about scale and durability
* Consumer AI vs. coding AI: why ChatGPT's consumer category may have plateaued on frequency and product design, while coding continues to feel like a daily-use category with real momentum
* The next product frontier beyond coding: consumer agents, computer use, and “coding agents breaking containment,” with swyx's thesis that 2025 was the year of coding agents and 2026 may be the year they begin to do everything else
* Whether foundation models are really killing startup categories: why swyx is less worried for early founders, more worried for mid-size startups and traditional SaaS, and why building something ambitious may now be the best job interview for a frontier lab
* AI vs. SaaS and the internal culture war around adoption: the tension between AI-native employees who want to rip out expensive software and skeptics who think quick AI-built replacements create fragile systems
* Why traditional SaaS may be under real pressure: swyx's own experience spending six figures on event and sponsor management software, the temptation to rebuild it cheaply with AI, and the broader question of whether teams will trust custom AI-native replacements
* Biosafety, security, and frontier model access: why swyx raised biosafety at a dinner with Anthropic's Mike Krieger, why Krieger argued security is the bigger issue, and what restricted model releases reveal about Anthropic vs. OpenAI
* The era of giant models: why 10T+ parameter systems may only be a temporary rationing phase before bigger clusters arrive, why labs may increasingly keep their most powerful models private for distillation, and why scale alone no longer feels like a complete answer
* Memory as the slowest scaling factor in AI: why context windows have improved far more slowly than people hoped, why million-token context still has not changed most real workflows, and why memory may be the key bottleneck for the next generation of systems
* What swyx changed his mind on in the past year: becoming more bullish on open models, more convinced that the top tier of agent startups behaves very differently from the median AI company, and more optimistic about fine-tuning and specialized model adaptation
* “Dark factories” and zero-human-review coding: the next frontier after zero human-written code, where models not only write the code but ship it without human review, forcing companies to rethink testing and verification from first principles
* Why RL and post-training may matter more than people assumed: even if the resulting models get thrown out every few months, the data, workflows, and domain-specific improvements persist
* Synthetic rubrics, Dr. GRPO, and multi-turn RL: why reinforcement learning is becoming much more domain-specific and multi-step than many people realize, opening the door to much deeper customization
* The next frontier after coding: memory, personalization, and world models, including why swyx thinks world models matter not just for robotics or gaming, but for giving AI something closer to lived understanding
* Fei-Fei Li, spatial intelligence, and the Good Will Hunting analogy: the idea that today's LLMs may know everything by reading it all, but still lack the lived experience that turns knowledge into a deeper kind of intelligence

Timestamps
* 00:00:00 Intro preview: AI coding wars, startup pressure, and market structure
* 00:00:28 Welcome to the Latent Space × Unsupervised Learning crossover
* 00:01:17 What AI builders are focused on now: OpenClaw, harnesses, and infra
* 00:04:33 Why AI infra is harder than apps, and where startups can still win
* 00:06:39 Should companies train their own models?
* 00:09:28 Open models, custom chips, and the new inference race
* 00:11:25 Designing products for agents, not just humans
* 00:16:49 The state of the AI coding wars in 2026
* 00:19:27 Capability exploration, token-maxing, and why coding is going parabolic
* 00:21:41 What the end state of the coding market could look like
* 00:23:50 Where app companies still have room against the labs
* 00:27:02 Why AI valuations and market swings feel unprecedented
* 00:28:56 Consumer AI vs. coding AI, and why sticky products still matter
* 00:32:28 What the next breakthrough product experience might be
* 00:32:53 2026 thesis: coding agents break containment and eat the world
* 00:35:27 Are foundation models wiping out startup categories?
* 00:37:33 AI vs. SaaS, vibe coding, and internal team tensions
* 00:40:01 Biosafety, security, and the politics of restricted model releases
* 00:42:19 Giant models, compute constraints, and the limits of scale
* 00:44:30 Memory as the real bottleneck in AI
* 00:44:57 Why swyx changed his mind on open models
* 00:47:44 Dark factories and the future of zero-human-review coding
* 00:49:36 Why post-training and RL may matter more than people think
* 00:51:50 Memory, world models, and the next frontier of intelligence
* 00:53:54 The Good Will Hunting analogy for LLMs
* 00:54:21 Outro

Transcript

[00:00:00] swyx: Isn't that crazy? That number is just mind boggling.

[00:00:03] Jacob Effron: What is the state of the AI coding wars today?

[00:00:05] swyx: We're in a phase of sort of like capability exploration. The general thesis that I have been pursuing now is that the same way that 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else.

[00:00:16] Jacob Effron: Do you worry about the foundation models just getting into a bunch of these startup categories?

[00:00:21] swyx: Mid-size startups. Yes.

[00:00:23] Jacob Effron: What do you think the end state of this market is?

[00:00:25] swyx: For the market structure to significantly change, there would be...

[00:00:28] Jacob Effron: Today on Unsupervised Learning, we had a fun episode in what's really become an annual tradition: a crossover episode with our friends at Latent Space. swyx and I sat down and we talked about everything happening in the AI ecosystem today: what we thought of the various changes at the model layer, what's happening in the infra world, the coding wars, and a bunch of other things. It's a ton of fun to do this with someone I really respect and another great podcaster in the game. Without further ado, here's our episode. Well, swyx, this is super fun to be back with another Unsupervised Learning x Latent Space crossover episode.

[00:01:02] swyx: Yeah.

[00:01:02] Jacob Effron: I feel like there are a lot of places we could start, but one thing I always find fascinating about the way you spend your time is you obviously are at the epicenter of this engineering movement and community, and you run these events and conferences and put on these awesome talks, and I think you just have a great pulse on the zeitgeist of what's going on.

[00:01:16] swyx: Yeah.

[00:01:17] Jacob Effron: Maybe to start, just what are the biggest topics people are thinking about right now?

[00:01:21] swyx: Yeah, so I just came back from London, where we did AIE Europe, and we're doing roughly one per quarter now, which...

[00:01:27] Jacob Effron: Yeah, you've really upped the pace.

[00:01:27] swyx: Hopefully.

[00:01:29] swyx: We're trying to match AI speed, you know?

[00:01:30] Jacob Effron: Yeah, exactly. The topics would be completely different, I imagine.

[00:01:33] swyx: Yeah. You know, I definitely curate the tracks, like you can see what I think when you see the track list and the speakers that I invite. Obviously OpenClaw is like the story of the last four or five months, and then just below that, I would consider harness engineering and context engineering to be two related topics in agents and RAG. And then there's a long tail of evergreen stuff like evals, observability, GPUs, and LLM infra in general. We also have other updates on multimodality and generative media, let's call it. But the first three that I mentioned are definitely top of mind for people.

[00:02:13] Jacob Effron: I think harness in particular is so interesting.
You know, there was this tweet from Harrison Chase, the LangChain CEO, that caught my eye recently, where he said it finally feels like we have stability around the infrastructure for AI. And I think what he basically was implying is, look, over the past two, three years, as a company at the epicenter of AI infrastructure, it was a bit like playing whack-a-mole, right? You were constantly moving around with however the building patterns were evolving.

[00:02:36] swyx: For Harrison, for sure. Right? Like he's basically had to reinvent the company every year since he started LangChain. It was LangChain, LangGraph, and Deep Agents, and I think he's one of the most nimble, adept, sharp people about this.

[00:02:49] Jacob Effron: Saying now is finally the time for stability.

[00:02:51] swyx: Yeah.

[00:02:52] Jacob Effron: Yeah. Do you buy that, or what do you make of that take?

[00:02:56] swyx: I think that it's very expensive to say "this time is different" sometimes, but when you're just writing code, it's actually okay to just try to make a call, and I think it may not even matter if this call is right or not. I just don't even care that much, because you can be right on a thesis, but if you don't figure out how to monetize the thesis, then who cares if you said something first? That said, it does feel like, for example, we went through a lot of different ways of packaging integrations up with agents, and it feels like we've landed at skills, which is like the minimal viable format, which is just a markdown file with some scripts attached to it, and I don't see how it can be more simple than that. And so there is some justification for the stability around harnesses.
I feel like there may be more adaptation with regards to maybe the real-time elements or subagents or memory or any of those agent disciplines, let's call it, in agent engineering. But if the thesis is that agents are LLMs with tools in a loop, with a file system where they can do retrieval, with skills and all this standard tooling that now seems to be relatively consensus, then probably that makes sense. I just think there's no point trying to stake your reputation on this thesis that we're there, because if it changes again, just change with it. It's fine.

[00:04:33] Jacob Effron: Yeah. I've always been struck by how that is much more challenging for infrastructure companies than application companies. On the application side you've seen Bret Taylor from Sierra, Max from Legora. They're like, look, we build what's ahead of the models, and we're willing to throw everything out every three months as the models get better and better. But the thing you at least have there is you have an end customer, right? That's decently sticky. They will mostly stick; they'll give you a shot, at least, at building these things. What I've always found more challenging about the reinvent-yourself-every-three-months of the infrastructure layer is that developers are definitely a pickier audience, maybe, than an accounting firm or a bank. And so it's definitely a more challenging position to be in, to have to constantly reinvent yourself.

[00:05:17] swyx: Yeah. And when they churn, it's very complete. They'll leave to the hot new thing, because there's like no defensibility, I guess.
Like even if you are a database, people can migrate workloads off databases. It's a known thing. So I think basically what we're talking about is the vertical versus horizontal debate in AI startups. And the way I think about it also is just that when you are Legora, when you are Abridge, you are the outsourced AI team, right? Your job is to apply whatever state-of-the-art AI methods.

[00:05:55] Jacob Effron: Yeah. Like this translation layer between model capabilities and your own customers.

[00:05:57] swyx: Yeah. To the end customers. And, well, if they didn't have you, they would have to hire in-house, and they're not gonna hire in-house, so they have you. And I think that's reasonable, very robust to whatever trends and discoveries people make in the engineering layer. I do think there are sort of useful horizontal companies being built, but they're all very much like the reinventions of classic cloud in the AI era, the primary one being sandboxes. Which, it's another form of compute, guys, let's not get too excited about it. But I mean, the workloads are enormous.

[00:06:38] Jacob Effron: Right.

[00:06:38] swyx: Yeah.

[00:06:39] Jacob Effron: It's interesting, and I feel like as part of this, in the questions that folks are asking around infrastructure, there's a lot around the extent to which companies should have their own AI teams and what they should be doing in-house. And I think there are questions around: should people be training their own models? Should people be doing RL in-house based on the data they have? I feel like one has to evolve their takes on this every three months at this pace. But where are you at on this today?

[00:07:00] swyx: I think, well, I mean actually all models have gone up.
And obviously I'm involved in Cognition, and Cursor is also doing a lot of its own model training. And I think that is some part of what I've been calling the agent lab playbook, where you start off with the state-of-the-art models from the big labs and you specialize for your domain. But once you have enough workload and enough high-quality data from your users, then you can obviously train your own models and save a lot on cost and latency and all that good stuff. You also get a marketing bonus of calling it some fancy name and putting out some research.

[00:07:38] Jacob Effron: From my seat, I can't tell how much of it is actual value that's provided to the end user, and how much of it is that marketing bonus, right? It seems some combination of the...

[00:07:45] swyx: I think it's both.

[00:07:46] Jacob Effron: Yeah.

[00:07:46] swyx: No, there actually is real value. And you know that for a number of reasons. One, even when it's not subsidized, people do choose it as one of the top four or five. This is both Composer 2 and SWE-1.6, one of the top five models, in a fair market, in a free market, in a model switcher. People do choose it, and it's not subsidized. So that's as good as it gets. But beyond that, domain-specific models, for example for search, which both companies have, absolutely make a ton of sense. Everyone says, yeah, we should always do this. And honestly, I think the infrastructure for that is becoming easier with, like, Thinking Machines' Tinker thing as well as Prime Intellect's lab stuff.
Yeah, I mean, this is one of those reversals of the bitter lesson, where you first bootstrap on the large models and the general-purpose models to get big. And as you get very well-defined workloads that are just high quantity but not high variance, then you just distill down to a smaller model and run that on your own. Which totally makes sense.

[00:08:50] Jacob Effron: What I'm less clear on is the kind of DIY RL use case, which I think is really mostly around improved quality for different things. Obviously there are probably more efficient ways to get a smaller model that's faster and cheaper. And it'll be interesting to see whether... Obviously you had, two, three years ago, this whole case of companies that were pre-training and claiming better outcomes in their domains, then getting kind of cooked as each model iteration improved. I wonder whether a similar story plays out in the RL space, for the focus on pure outcomes and quality, not the cost side, where clearly your own models for cost at scale make a ton of sense.

[00:09:28] swyx: I think these are two sides of the same coin. You basically always want to hold quality constant, or trade off a little bit of quality for a drastic decrease in cost. And that's true for everyone. One element I wanted to bring out, which is very much in favor of open models, is custom chips. So this would be Cerebras, but also Taalas. And then there's a huge range of stuff in between. This has been a huge story this past year: just like everything non-Nvidia is getting bid up, including, like, freaking MatX, which is very rewarding for me. But I think it's one of those things where suddenly, because the number of alternative hardware options is increasing, the inference speed that you can get is insanely high. We're talking thousands of tokens per second instead of less than a hundred. So the trade-off for quality doesn't hold as much anymore, because the speed is so high.

[00:10:24] Jacob Effron: Have you seen a lot of companies go all in on the alternative chip?

[00:10:26] swyx: So Cognition has, yeah, on Cerebras, and so has OpenAI. And so no, I don't think so beyond that. And do you think that's mostly foreshadowing? Yeah. I used to be kind of a skeptic, in terms of, okay, so what if I get my inference at a hundred tokens per second sped up to 200 tokens per second? It's only 2x faster. It's not that big a deal. But I think every 10x does unlock a different usage pattern. And we have proof in Taalas and some of the others that you can actually drastically improve inference speed. And what happens from there? I don't even really know. It's so hard to predict when entire applications just appear at once. And it also isn't that expensive, right? So this is one of those things where I think the investment cycle is gonna be multi-year, and I would caution people to not dismiss it too quickly.

[00:11:25] Jacob Effron: Yeah. I mean, one other infra question I was curious to get your thoughts on: obviously it seems increasingly a lot of the cutting-edge infra companies are building for agents as the buyers of their product or users of their product, right?

[00:11:35] swyx: Ooh.

[00:11:36] Jacob Effron: And...

[00:11:37] swyx: Another huge theme. Yeah.

[00:11:38] Jacob Effron: And I'm trying to figure out, what do you have to do differently about selling into agents? Are they just the ultimate rational developers? Or is there, you know...

[00:11:46] swyx: No, absolutely not.
Um, I think they are easily prompt-injected and very tuned towards, like, basically compounding existing winners.

[00:11:57] Jacob Effron: Yeah,

[00:11:57] swyx: so, like, congrats if you won the lottery for getting into the training data right before 2023, because now you're installed in there for the foreseeable future. But yeah. You know, one stat that Vercel CTO Malte dropped at my conference was that there are now, uh, 60% of traffic to Vercel's, um, like admin app architecture for configuring Vercel applications is bot. It's not human. So your primary customer is agents now. And it's mostly coding agents, mostly people using CLI or MCP or whatever. But yeah, I think step one: if it doesn't exist as an API that agents can use, it doesn't exist. Right, right. Which I think is a good hygiene thing anyway, to make everything API-available, but now as, like, an extra push on product people to not only work on the UI; you should probably work on the CLI stuff. Beyond that, I think honestly, I come from the sensibility of: everything that you are trying to do for agent experience now, which is the term that Matt Biilmann at Netlify is trying to coin, is the same thing that you should have been doing for developer experience. You should have had good docs. You should have had a consistent API that is mostly stateless. You should have, I guess, discoverability or progressive disclosure or search or whatever. And so now that people have energy in, like, finding these customers to do that, that's great. Do I believe in extending beyond that into something like AEO, for gaming the chatbots? Not necessarily, but obviously there are gonna be huge advantages for people who figure out the short-term wins. Yeah.
And short-term wins can compound.

[00:13:43] Jacob Effron: Do you think these compounding advantages to, like, the pre-training-data-cutoff companies, like, you know, obviously over some period of time, I imagine that doesn't persist. And so as you think about, I dunno, three, four years from now, what the selection criteria end up being, do you think it still mirrors exactly what you were saying before? Like, it's exactly what you should have been doing all along to sell a good product to developers?

[00:14:01] swyx: It could be, except that I think in three, four years we'll probably have much better memory and personalization. So then general AEO or GEO doesn't really matter as much. So I think whatever memory or personalization system we end up with will probably determine what you end up choosing much more than what is currently the case, which is just frequency of mentions, let's call it. Yeah,

[00:14:26] Jacob Effron: yeah.

[00:14:26] swyx: Uh, so you just spam quantity. And I mean, that's something I'm looking forward to. I do think the fundamental exercise to work through for yourself is: if you start a new, sort of, disruptor company now, and there's a big incumbent that everyone knows, like Supabase. Supabase is kind of like the Postgres database incumbent. If you wanna start, like, a new Supabase, how would you compete with them? And I don't necessarily have the answer, but I do think, like, Resend is relatively new. I think they started in, like, 2023, and still there was a recent survey where people checked what Claude recommends by default. If you just don't prompt it with anything, just say, "Gimme an email provider," it says Resend in, like, 70% of cases.
Like, the fact that you can get in there with such a relatively short existence, I think, is encouraging.

[00:15:14] Jacob Effron: Yeah.

[00:15:14] swyx: I do think you do want to do whatever it is to get in that very short mentions list, because it's not gonna be 20 of them, it's gonna be, like, three.

[00:15:26] Jacob Effron: No, definitely. It feels like, you know, probably more consolidation than ever, or kind of a more winner-take-most market than maybe the physics of go-to-market in the past might have enabled.

[00:15:38] swyx: The other thing also is, semantic association is gonna be very important, in the sense that you want to do the combo articles where you're like, "Use my thing with Vercel, with blah, blah." And that all gets picked up in a corpus. And so that's probably one thing that you wanna do well. I don't know what else. It's one of those things where I feel I'm behind. I don't know how you feel about this, but like,

[00:16:04] Jacob Effron: I think AI is just everyone constantly feeling like they're behind some, uh,

[00:16:08] swyx: yeah. With,

[00:16:09] Jacob Effron: I wanna meet the person that doesn't feel behind,

[00:16:11] swyx: but like, with AX, right? My stance was exactly what I said before: everything that you should do for agents is something that you should have done for humans anyway. Yeah. And so, to the extent that you're just getting more energy to do things for agents, great. But it's hard to articulate what new thing, apart from just more spam, you should be doing. Anyway, that would be my take right now. I do think there will be more turns at this. I think the personalization turn that is coming will be big.
And I don't know what that looks like, because basically we feel kind of tapped out on the memory side of things.

[00:16:49] Jacob Effron: Yeah. I guess since we last chatted, you know, you took this role over at Cognition, and you obviously have a front-row seat to the AI coding space today. You know, I feel like coding in many ways, people view it as this, I mean, besides being the mother of all markets and this massive opportunity, I think it's kinda a preview of what's to come for many other spaces. Both, yeah, you know, I feel like agents are most advanced in coding. I also feel like the competition between foundation models and application companies mirrors what we may see in other spaces. And so maybe for our listeners, can you just lay out, like, what is the state of the AI coding wars today?

[00:17:25] swyx: Um, it is massive, right? And I don't think necessarily, last time we talked about this, we appreciated the size of what

[00:17:32] Jacob Effron: No, I wish we did.

[00:17:33] swyx: State of AI coding wars today: both OpenAI and Anthropic have made it their P0s to compete in coding. And Anthropic is at, like, 2.5 billion in ARR just from Claude Code. The way they recognize ARR is up for debate. OpenAI, I don't think a public number is known, but let's call it 2 billion as well. And then Cursor is rumored to be 2 billion, you know? And those are the public numbers that are known. Yeah. So, huge markets that have just been created in the past one year. Like, Claude Code just recently celebrated their one-year anniversary, which is, yeah, pretty nice.
Um, and then I think the other thing that I see is there are some other people who are like, oh, here's the sort of relative penetration of Claude use cases, right? And it's like coding 50%, and then legal, whatever, health, it's like the remaining ones. And there was a very popular tweet that was like, okay, I'll look at the empty space and all these other use cases. If you are a new founder today, you should be betting on the other stuff, on a sort of catch-up theory. Yeah. And my pushback is the same pushback that I had on app over Google, which is like, well, why is this time different? If it went from, let's say, 10 to 50% in the past year, why can't it keep going? And getting that wrong is actually a very painful one, because you could have just done the momentum bet instead of the mean-reversion bet. So I think that is the state of things now: people are very, very much into psychosis. They are getting rewarded for spending more rather than spending less. And I think we're not in that phase of efficiency. We're in a phase of, sort of, capability exploration. So I think people who are more crazy, who are more creative, get rewarded comparatively. Yeah.

[00:19:27] Jacob Effron: Well, it's interesting. I mean, it feels like behind these token-maxing leaderboards and whatnot is this, it's like the first phase of this transition from a workforce perspective is you just gotta show your employer, like, "Hey, I use these tools."

[00:19:37] swyx: "Here's my number of tokens I cost," and that's it. They don't care about the quality. Right. It is maybe distasteful to someone who cares about the craft and all that. But directionally, everyone just wants you to go up regardless. And so it is not very discerning.
And it's probably very sloppy, but I think it's net fine, because we're still probably underusing AI in general. Yeah. And so I think that's very interesting. Like, we had on the podcast Ryan Lopopolo from OpenAI, who spends a billion tokens a day. Yeah. And for those counting at home, that's something like $10,000 worth a day of API tokens, if they did market rates. And most of us can't afford that. Yeah. But, like, and probably a lot of what he does is slop.

[00:20:25] Jacob Effron: Right.

[00:20:25] swyx: But, like, if there were a new capability, he would discover it first, before you, because he was trying and you were not trying. Right. And, like, you only do things that work? Well, good for you. But the people who are going to discover the next hot thing are living at the edge.

[00:20:42] Jacob Effron: Right, and increasingly living at the edge is just having the compute budget to run these experiments. I mean, kind of similar to what living at the edge on the research side has always been. You know, it was constrained in many ways by the amount of compute you had to run these experiments. It feels similar on the, almost on the builder side, or, like, actualizing these tools now.

[00:20:56] swyx: Yeah. The other thing that's, I mean, very obvious is Anthropic is kind of like the high-price premium player, where, you know, restricting limits, or restricting model releases even, is the name of the game. Whereas Codex is like, "Come on in, guys, use our SDK, use our login, and we don't care. We're gonna reset limits. Whatever." You do want to try to exploit the subsidies where you can get them. And definitely Codex is super subsidized right now. Gemini also very subsidized. Um, and
Comparatively, I think you should make hay, I guess, while that's going on. It's not that bad to be a capabilities explorer on just the $200-a-month plan from Claude Code or from OpenAI. And my sense is that people aren't even there yet.

[00:21:41] Jacob Effron: How do you think this market ultimately plays out? I mean, it's obviously such a big market that any slice of that market is interesting for anyone going after it. But I think what makes people so interested in the coding market particularly is it feels like it's kind of this foreshadowing of what will happen in any other kind of application market that the foundation models eventually turn to and RL their models against and gather data around. And so how do you think, you know, does there end up being room for lots of different kinds of players? Or, like, what do you think the end state of this market is, and do you think that's applicable to other markets?

[00:22:10] swyx: I feel like there will be, I mean, status quo is probably the most likely outcome, which is there are two big players, and there's a small range of longer-tail people that fit other use cases that the two big players don't. That feels right to me. I think that for the market structure to significantly change, there needs to be significant change in, like, the economics, or the brand building, or the value propositions of the companies involved. And I haven't seen any in the last six months that have really changed the stories materially. So I feel like they would just keep going until something else happens. Something else happens meaning, like, Microsoft wakes up and goes, "Guys, we have GitHub. We'll do something much bigger here than just Copilot." And that would be a big change.
Um, MSL has put out a model now, and I was in a breakfast with Alex Wang, where they were like, "Yeah, we really, really want to go after the coding use case. We haven't done anything yet." But, like, don't underestimate them. Right. And similarly for the Chinese labs. I think they're trying to go after it. Like, Z.ai is doing stuff. GLM, uh, Z.ai and GLM are the same thing. And so everyone's trying to get a piece of that pie. I feel like the status quo has been pretty stable for the past, like, almost a year, I'll say.

[00:23:39] Jacob Effron: Yeah. And is the room for, you know, the application companies more on, like, the enterprise side? Or, like, what surface area do the model companies leave for application companies?

[00:23:50] swyx: Yeah, that's a good one. Um, it's very much evolving. I will say, because OpenAI did not have this level of attention on coding a year ago, we just don't have that much history. Right. And it seems like, for example, so the big push at OpenAI now is the super app. Is that a consumer thing? Is that, like, a product portfolio rationalization thing? How much is that gonna take away attention from coding, at the time when they actually do want to put more into coding? I think it's very unclear. So I do think, in both big labs, there's, uh, sorry, both of them, OpenAI and Anthropic, and DeepMind and xAI are separate cases. They are trying to see the other TAM expansion areas. So Claude Code for finance, yeah, Claude Cowork, all those things. Whereas I think Cursor and Cognition are comparatively just focused on coding. And so I do think they leave space, and I do think for the other verticals that also means the same thing, right? That they're not gonna be that intensely focused on that domain.
Except for, I think I would mark out finance and healthcare as the next ones that they're clearly going after. I would say, comparatively, healthcare seems more thorny. There have been some announcements about it, but I would respect the finance work a lot more, just because the path to money is a lot clearer.

[00:25:12] Jacob Effron: Yeah, no, I mean, obviously, I think, maybe similar to the space that's being left in these other domains, there's obviously a lot that's required to actually implement these tools in enterprises, versus maybe just giving model access to folks out of the box.

[00:25:27] swyx: Yeah, yeah. So the agent lab thing is like, "We'll do the last mile for you." Whereas I think the model labs tend to just trust the model and be minimalist about it. Both of them work.

[00:25:38] Jacob Effron: Yeah.

[00:25:38] swyx: I don't necessarily think one beats the other for every use case. All I do know is that it does seem like the large enterprises do want a dedicated partner that isn't just the model labs, which is kind of interesting.

[00:25:55] Jacob Effron: We've been in this phase of pure capability exploration. And so I think nothing has been, you know, better for the large labs, right? I mean, they're always gonna be at the frontier of capability exploration. And so I think they have a very good relationship with a lot of these enterprises. But ultimately, over time, the incentive structure of these labs is always gonna be maximal token consumption for the end customers they work with. And there's just, I think, so few companies that have actually gotten to massive scale. Maybe coding again is the most interesting. So it's the first space that really has just completely gone, you know? Yeah. You must love it every day.
Like, absolutely insane. And I think it

[00:26:32] swyx: gets even. Okay. I mean, I think we say good things about Cursor and Cognition, but the sheer liftoff of both Anthropic and OpenAI, 'cause they have independent valuations. I mean, let's throw xAI in there, because it's now IPOing at 1.2 trillion. That number is just mind-boggling. I feel like in normal investing or normal startups, there's kind of a ceiling market cap or valuation, totally, that you reach, and you go, like, "All right, it's gonna be chiller from now on." And these guys are not slowing down. No.

[00:27:02] Jacob Effron: Well, I also think the dynamic that's fascinating about some of these later-stage companies is, you know, in the past, I feel like in venture world, if you got to a certain level of scale, the question around you was really more a valuation question. And this is why there were different types of venture people did, and the late-stage growth people were just incredible at, like, you know, a little bit of "what's the ultimate market opportunity of this company," but also "what's the right way to value it." Like, we know it's in some band of an outcome where, sure, there's some variance to it, but it's relatively understood what that band is, and then maybe you get over time surprised to the upside. Whereas for any kind of later, even the labs themselves, any later-stage company, the bands of what that company might be worth right now, even in a year or two years, are so massive because of how fast the ecosystem changes that, even for later-stage companies, every three months could be an existential-level event, to the upside or to the downside. Yeah. And I think that you are obviously seeing it in the positive with code, which, you know, if you think about a company like Anthropic, you know, that
for a while, it was unclear if they were going to have access to enough capital to really stay in the race, right? And then coding hit at the exact right time. They had the perfect model for it. They executed brilliantly. And, you know, now they are one of the most valuable companies in the world.

[00:28:13] swyx: At the same time, I have zero sympathy for OpenAI, because they're crushing it and they're all rich. You know, this is a high-class champagne problem to have, to be number two at coding or whatever. Like, who cares? You're doing great.

[00:28:27] Jacob Effron: Yeah. It's funny though. I mean, you would be closer to this, given that you're in the AI coding space, but a lot of people I talk to think Codex is just as good as, if not better than, Claude Code. Right. I think one thing that I've been really surprised by, and maybe Claude Code is a better product in some ways, I'm curious your thoughts, is, just in consumer AI with ChatGPT, you saw this big first-mover advantage, right? Where admittedly today, like, I don't know, Claude, Gemini: great products. Not abundantly clear ChatGPT's any better, but people stick with ChatGPT. It's the first thing that introduced them.

[00:28:56] swyx: They stay, but they're not growing anymore. I don't know if you've seen

[00:28:59] Jacob Effron: Right. But that to me is more of a product problem than it is, it's not like they've lost share to someone else. My understanding is the overall problem with consumer AI today is much more of a, how do you take this tool, and, you know, for folks like us, knowledge workers, it's this incredible magic tool, but it's not necessarily a daily-active-use tool for a lot of people around the world today. And what are the, like, products? It's kind of a category-wide problem.
Like in coding, for example, the entire space has gone parabolic. There may be some relative growth in other consumer AI players, but it's not like consumer AI as a category is going parabolic and they're not capturing most of that. I think the larger problem is much more, hey, the category has kind of hit a bit of a plateau, where people haven't figured out how to bring tons more users on board. Yeah, yeah. Or increase the frequency of those users. And so it seems more of a category-wide problem than it is a massive market share change. I was gonna draw the comparison to the coding space, where Claude Code is the first product, obviously, to introduce people to this magical experience. You know, by all accounts, Codex is pretty damn close to as good, if not better. But still, that first product, you would've thought that would not be a super sticky product surface area. And it actually has, it turns out, it feels like the first lab to introduce you to an experience really does keep a lot of the focus.

[00:30:12] swyx: I think maybe it's still early days. You know, ChatGPT is, like, three-plus years old, and, yeah, Claude Code is only one. Just turned a year. Yeah. So give it time, you know? Yeah. I mean, definitely a lot of people have switched to Codex. Maybe that will keep going. It's really hard to tell. Uh, yeah. I do think that, because we are in this high-volatility, high-temperature phase, the loyalty and stickiness to first movers and category creators, I don't think, is as high as it might be in some other areas in our careers that we've looked at.

[00:30:47] Jacob Effron: Yeah. Though, I mean, I've been surprised by the Claude Code thing. I would've thought that, like, in many ways I always worried about the

[00:30:52] swyx: enterprise.
You think you would've been gone by now?

[00:30:53] Jacob Effron: Not gone. But I always worried that the consumer business of these companies would be quite sticky, and then the enterprise API business was actually, you know, in some ways your least loyal buyers, like, they would move to,

[00:31:05] swyx: right, right. But they worked out that it wasn't the enterprise API, it was enterprise product.

[00:31:09] Jacob Effron: Totally. And maybe that was the secret. But the amount of lock-in, or just default behavior, that has happened in that space is more than I might've imagined with two products that by all accounts are pretty damn similar. Yeah.

[00:31:22] swyx: No fight there. I will say I do think that Codex is still in, like, a catch-up, in terms of personal experience. The only thing I like out of Codex is, like, Spark, and, yeah, I feel like the skills integration is a little bit better. I feel like the speed is a bit better, maybe 'cause it's written in Rust or whatever. Very minor things that you're, like, almost telling yourself, rather than objectively assessing between the two of them. I do think, vibes-wise, that's going on. You know, I feel like the missing question in this whole debate is: why is this so concentrated in only two names, right? Yeah. Like, where is the Gemini presence? Where's the xAI presence?
And, like, they are trying. It's just they haven't made that much progress yet.

[00:32:12] Jacob Effron: But what the Claude Code moment does show, and it actually in some ways makes you a little more bullish on the potential for someone else to catch up, because it does feel like if you're the first person to introduce some magical net-new product experience, that actually might be stickier than one might have imagined.

[00:32:27] swyx: Right, right, right. Okay. Yeah.

[00:32:28] Jacob Effron: And so everyone can believe they have a shot at

[00:32:29] swyx: that. What do you think that new product experience might be? And this is a failure of imagination on my part. Like, I always wonder, people always say this, like, "Well, the thing that will save us is being first to the next new thing." Like, what is it?

[00:32:41] Jacob Effron: Yeah.

[00:32:42] swyx: It's like,

[00:32:45] Jacob Effron: I dunno, something around, like, consumer agent, computer use, like, hybrid. I think, obviously, we're scratching the surface on the consumer side.

[00:32:53] swyx: So my current theory is, like, OpenClaw is a vision of things to come.

[00:32:58] Jacob Effron: Totally.

[00:32:58] swyx: And it's good that OpenAI has the association with OpenClaw, but by no means do they have the right to win it. The general thesis that I have been pursuing now is that, the same way that 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else. And so coding agents continue to win, because they generate software and software eats the world. So it's kind of like the transitive property: software eats the world, coding agents eat software, therefore coding agents eat the world.
Um, which is like an interesting,

[00:33:30] Jacob Effron: yeah, and breaking containment is always an easier phrase in the consumer context than the enterprise one. You've seen people run these really cool experiments in their own personal lives. I think, like,

[00:33:37] swyx: yes.

[00:33:38] Jacob Effron: Figuring out, you know, obviously everyone's focused on the enterprise side now, around how you create these experiences. I feel like the vibes, you know, people love to have these narratives of, like, everything has completely shifted. It's like, actually, OpenAI, organizationally, you know, volatility aside, is great products, great team, great models. Everyone else in the world is incentivized for there to be two, three more. Everyone would love more great model companies. And so I feel like the natural forces of the world revolt when any one company is too much the star of the show, right? There are so many people in the ecosystem that are incentivized for that not to happen. And so I think I'd be shocked if we don't have a reversion of vibes, maybe not completely the other way, but at least a little bit more equal, at some point over the next six, 12 months.

[00:34:24] swyx: I think there are just kind of different stages. When you talk about the world wanting more model companies, I think about the neo labs.

[00:34:30] Jacob Effron: Yeah.

[00:34:31] swyx: And, I mean, I don't know, is it fair to say none of them have really broken through in the past year?

[00:34:35] Jacob Effron: I think that's totally fair,

[00:34:37] swyx: which is rough. And, well, how are we gonna grow that diversity in choice? Like, um, this is it.

[00:34:46] Jacob Effron: Yeah.
It'll be really interesting to see what ends up happening with that. And you've seen, you know, folks like Nvidia very incentivized to make sure there's a broader platform of other model providers.

[00:34:57] swyx: I think, I don't know, people say this, but I don't think they try that hard. Nvidia tries harder to build neo clouds

[00:35:05] Jacob Effron: Yeah.

[00:35:06] swyx: than neo labs.

[00:35:07] Jacob Effron: Well, they try pretty damn hard to build neo clouds, so

[00:35:09] swyx: that's,

[00:35:09] Jacob Effron: yeah.

[00:35:10] swyx: But, like, you know, let's call it the CoreWeaves of the world: much happier place, you know, than any neo lab built on top of them.

[00:35:18] Jacob Effron: Yeah. One might argue it's easier to enable a neo cloud to be successful than it is, uh, you can't will a neo lab into existence the same way you, so Nvidia

[00:35:25] swyx: has more direct control over it. Uh, for sure.

[00:35:27] Jacob Effron: What else is kind of catching your eye today on the startup side? I mean, you worry, there's obviously this whole narrative of, like, you know, the foundation models, they announce a product and every stock goes down 15%. Like

[00:35:36] swyx: Yeah.

[00:35:37] Jacob Effron: Do you worry about the foundation models just kind of eating into a bunch of these startup categories?

[00:35:43] swyx: Not really. I think actually, okay, there's the point of view of being an investor in startups, and there's the point of view of, like, do you wanna start something? And I think honestly, the downside for all these is so
minimal, in the sense that the worst you do is you just get hired into one of these labs anyway. So I think the market for people who just do things and try things and try to execute in a competent way, even if it doesn't work out commercially, even if it just wasn't that great anyway, that's your job interview to go into one of these things anyway. So I don't feel that from a very, very small startup perspective. Mid-size startups, yes. I will say there's been a lot of dead, um, LLM infra, a lot of LLM infra consolidation, like the Langfuses of the world getting absorbed into ClickHouse. And I think people have maybe worked out the domain-specific playbook, and I think that's okay. And yeah, I'm not that worried about, uh, okay. So I would say I'd be more worried about traditional SaaS, like low-NPS SaaS. This is the whole AI-versus-SaaS debate that's been going on. And literally I'm going through that exact thing in my company, so I'm kind of thinking through this on a very visceral level, right? On one hand, you have the people who say, "You vibe coders don't appreciate the amount of work that goes into a CRM. And, like, yeah, you think you can rip out Salesforce? So did the 30 entrepreneurs before you, right? You classically underestimate the things that you don't deeply know. And the target audience is not you." At the same time, we have never been able to build software so easily and customize software so easily, and, like, yeah, you're not gonna use 90% of the things in Salesforce. So, like, yeah. What's the typical, so what have you, what

[00:37:33] Jacob Effron: have you done internally?

[00:37:34] swyx: So we have the main SaaS that we use for event management and sponsor management. And we paid 200K a year for that.
Not, not huge, but like chunky for, for, for my, my scale. Um, and like, yeah, I could probably spend 2000 and, and build like a custom version of that. Um, the, the, the trick has been dealing with my, the rest of my team and getting them on board.Yeah. ‘cause I'm the most ethical person on my team, but like, I can't make that decision myself. And I think in the same way I've been telling with other CEOs team leaders as well, it's like, well you can be super cloud pilled. You can be super LM psychosis and that you think that's okay, but you like you have to bring your team with you.And I think like there, the sort of widening disparity in LM psychosis in companies is causing real s real riffs because. And on one hand, on one hand, the people who are less AI native are not getting with the picture. They're not, they're actually like behind, they're actually not waking up to the fact that like you, everything you think is necessary is not actually that necessary.And in fact, exactly would be better of you if you just like held your nose and went in and when came out the other side. Yeah, only talking to agents in natural language and like your life would actually be better and you just, you're just like close-minded. There's that perspective. The other perspective is, oh, you vibe coder.You, you did this in a weekend and you got the 80% solution and now the rest of your employees. Have to pick up the rest of your s**t, right, that you, that you thought you were, you were such hot, amazing, uh, uh, at, but like, actually you didn't figure it out. And like, actually LMS are still useless at this and blah, blah, blah.So like, I think there's this huge debate going on in every company right now. Um, and like, um, you know, I have a small microcosm of it, but like, yeah, it, it's making me hesitate to, to pull the trigger. But like I will at some point, it's like maybe I've put it off for one year, but not like five. 
Yeah, but like, so, so like SaaS is definitely getting squeezed. Um, it does make me wonder, like, I, I do think that there's an opportunity for a more AI native, um, system of record thing that is not just Postgres. Um, or not just MongoDB, although both are very good. Maybe it's like a Convex or like people, yeah, bring up Convex a lot. I don't know, like, like, I, I just feel like the sort of quote unquote Firebase of, of AI apps isn't really a thing yet. Um, beyond what we have. Uh, which, which is fine. It's, it's, it's just, we could probably start in a more sort of rapid iteration cycle first before scaling up to like a Postgres or MongoDB, which are more sort of old tech. I was at a dinner with, uh, Mike Krieger, the CPO of Anthropic, and, and he, we were just kind of going around the room going like, what are people most worried about? Yeah. And, uh, for me, uh, I, instead of security, I brought up biosafety. Yeah,[00:40:21] Jacob Effron: classic.[00:40:22] swyx: Um, actually, like I said, it was cliche and classic, and the rest of the table were, were like, what do you mean? Someone sitting at home can manufacture a virus that wipes out half of humanity,[00:40:32] Jacob Effron: almost like the OG Geoffrey Hinton. Like, this is why you should be scared.[00:40:35] swyx: I'm like, yeah, like the read the, you know, risk reports. Like this is like the thing. Um, I think, and Mike was just sitting there knowing he was sitting on Mythos and going like, actually it's security. Um, and I think like, um, I think the, there's, there's, part of it is a very good marketing. Like too good. Yeah, like I would actually advise Anthropic to tune down the marketing because also it's, it is just a very good model and you don't have to make so many marketing claims around it. At the same time, it is not really a private model if you give it to 40 companies, each of whom have like 10,000 employees or whatever. Right. 
It's not, it's not private, it's, it's like there's bad actors in there.[00:41:18] Jacob Effron: Yeah. Hopefully, hopefully not as, uh, as bad as releasing it widely, but, uh, no, I mean, it's an interesting. You know, it's an interesting case study for how all, I mean, many model releases might, I mean, you know, this might be the first model release that looks like the rest of ‘em from from now on, right?[00:41:31] swyx: It, it, so it's, it's the, there's an overall product strategy, uh, for anthropic of like bundle, uh, you know, restrict access bundle, uh, product with model maybe.Whereas, uh, OpenAI has definitely been a lot more sort of. Philosophically aligned on like, we will just enable access everywhere and we don't know what you, what will come out of it. Right.[00:41:51] Jacob Effron: Right. Though, I mean, this current moment, uh, obviously the cynical take is also just ties to the amount of compute that both companies[00:41:56] swyx: Yeah.Right, right, right. Yeah, I think, I think that's true. I I do think like the, the, this is the, the, the scale, the dawn of like larger than 10 trillion parameter models is very interesting. I don't think it, I think it's a temporary phenomenon because we have much larger compute clusters coming online for everyone over the next like three, five years.It's, and this is like already written in, in the cards.[00:42:18] Jacob Effron: Yeah.[00:42:19] swyx: So to the extent that like, you know, will we have rationing of models, uh, above 10 trillion, uh, in like two years? I don't think so. I think everyone will have no, we'll just[00:42:29] Jacob Effron: have rationing of the next phase.[00:42:30] swyx: Right. Right. But like, that's as it should be almost like, um.My, my classic example, which I, this is just me theorizing, not anything confirmed by Google. When Google announced Gemini, they actually announced three sizes, which was Flash Pro Ultra. They never released Ultra. They only have Pro and Flash. 
Um, so my theory is they have ultra sitting in a basement and they just could distilling from it for, for flashing pro.Um, which like, yeah, I mean, I, I actually think that's. As it should be for any lab that they, that they do that.[00:43:02] Jacob Effron: Yeah. Just because those are the models that people actually wanna end up using. And it's just like cost prohibit.[00:43:06] swyx: It is more, yeah, it's cost. Yeah. It's, it's not the want, it's just, just, just the cost.Um, I do think, like, uh, it is interesting that, uh, for a while I was, I was considering the theory that models capped out at two, 2 trillion, and I think that's proving to be wrong. And well then if I'm wrong, how wrong? How wrong am I? Do we do 200 trillion? Do we do two quarter trillion, whatever? Um, and I don't think we have the straight answer to that, but like, uh, it's interesting that we are continuing to scale number of pers when everyone kind of assu like can see that we're not going to get like the next thousand or 1 million x from this paradigm.So like the others, like the alias of the world are working on other. Um, model architecture improvements. We need a different scaling law, I guess, because like, we're, I, I feel like people already already feel like we're tapped out on this. Like the, the end, the end state of this is we turn most of the world into data centers and like, I don't know.I don't know if we want that.[00:44:08] Jacob Effron: Yeah, I mean, uh, if the, if, if, if the return of intelligence are there, maybe, uh, maybe not so bad.[00:44:13] swyx: I, I, I think there, there's just a sheer amount of like, like un scalability that like is wrangling people's sensibilities right now. Um, especially in terms of like context lengths.Um, my classic quote is that context length is like the slowest scaling factor in, in lms.[00:44:30] Jacob Effron: Yeah.[00:44:30] swyx: Um, we, like, we took maybe. 
Three years to go from like 4,000 context length to a million and that's about it. Yeah. Like Gemini has had a million token context length for two years now. Um, and no one's using it.Like, so like yeah, it's memory. Memory is probably gonna be the, the biggest limiting constraint on all these things.[00:44:50] Jacob Effron: Yeah. Certainly seems that way. I guess I'm curious over the last year since you recorded last, like what's one thing you've changed your mind on?[00:44:57] swyx: I feel like I was kind of bearish on open models like last year.Um, in a sense of, like, I, I had just done the podcast with an Al[00:45:07] Jacob Effron: Yeah.[00:45:08] swyx: Of Braintrust where he, and he, I mean, you know, he has a good cross section of all the top AI companies and he says market share of open source is 5% and going down. Um, I think that's changed. I think it's going up. Um, and even if,[00:45:22] Jacob Effron: even though the capability gap does seem to be increasing.Spending on the[00:45:26] swyx: time. It's hard to tell. Yeah, it's, it's really hard to tell. ‘cause like, okay, for, for listeners, capability gap increasing is like on public benchmarks. And let's say you're comparing mythos versus like, I don't know, G-T-O-S-S or like GLM 5.1. And, um, it's, it is really hard to tell. ‘cause even if they were closing, you will also not believe that they were closing that much because it's very easy to gain the benchmarks.Yeah. So you just don't really, really know. Um, all you know is like. Uh, there's somewhat objective open router stats on like what people choose in a free market. And people do choose some of these open models in significant volume, except that a lot of them are heavily discounted. So you need to kind of like price adjust, uh, these things.So even if, even if that were true, which I, I'm not sure, like I, I, I feel like the numbers just up now instead of down. Uh, I think the. Separation between what the top tier agent labs

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Shopify's AI Phase Transition: 2026 Usage Explosion, Unlimited Opus-4.6 Token Budget, Tangle, Tangent, SimGym — with Mikhail Parakhin, Shopify CTO


Apr 22, 2026 · 72:25


Early bird discounts for the San Francisco World's Fair, the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP!

From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection, and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability.

We also go inside Tangle, Tangent, SimGym, which are three major AI initiatives that Shopify is doing to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP, Liquid AI, and why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify's customer simulation defensible, and what he learned from the Sydney era at Bing.

We discuss:
* Mikhail's path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify
* Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company
* Shopify's internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools
* Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output
* Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation
* Why AI coding can still lead to more bugs in production even if models write cleaner code on average than humans
* Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point
* How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era
* Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed
* What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start
* Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams
* What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more
* Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers
* Why AutoML finally feels real in the LLM era, and where auto-research still falls short today
* Why Tangle, Tangent, and SimGym become much more powerful when combined into one system
* What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify's data gives it a moat
* How SimGym evolved from comparing A/B variants to telling merchants what to change on a single live storefront to raise conversions
* Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs
* How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications
* Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice
* Shopify's new UCP and catalog work, including runtime product search, bulk lookups, and identity linking
* Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice
* Where Liquid already works inside Shopify today, from low-latency query understanding to large-scale catalog and Sidekick Pulse workloads
* Whether Liquid could become frontier-scale with enough compute, and why Shopify remains pragmatic and merit-based about model choice
* Who Shopify is hiring right now across ML, data science, and distributed databases
* The Sydney story at Bing, why its personality was not an accident, and what Mikhail learned from deliberately shaping AI character early on

Mikhail Parakhin
* LinkedIn: https://www.linkedin.com/in/mikhail-parakhin/
* X: https://x.com/MParakhin

Timestamps
00:00:00 Introduction: Mikhail Parakhin, Microsoft, and Shopify
00:01:16 Why Shopify Is Talking More About AI
00:02:29 Internal AI Adoption at Shopify and the December Inflection
00:06:54 Token Budgets, Jensen Huang, and Why Usage Metrics Can Mislead
00:10:55 Why Shopify Built Its Own AI PR Review System
00:12:38 AI Coding, More Bugs, and the Real Deployment Bottleneck
00:14:11 Why Git, PRs, and CI/CD May Need to Change for Agents
00:18:24 Tangle: Shopify's Reproducible ML and Data Workflow Engine
00:21:19 Why Tangle Is Different from Airflow
00:26:14 Tangent: Auto Research for Optimization and Experimentation
00:30:07 How Tangent Democratizes Experimentation Beyond ML Engineers
00:33:06 The Limits of Auto Research
00:36:36 Why Tangle, Tangent, and SimGym Compound Together
00:37:20 SimGym: Simulating Customers with Shopify's Historical Data
00:42:47 The Infra Behind SimGym
00:46:00 Why SimGym Gets Better with Real Customer History
00:47:30 Counterfactuals, HSTU, and Modeling Merchant Trajectories
00:51:55 CRPs, Clustering, and Category-Level Customer Behavior
00:53:30 UCP, Shopify Catalog, and Identity Linking
00:55:07 Liquid AI: Why Shopify Uses Non-Transformer Models
00:59:13 Real Shopify Use Cases for Liquid
01:03:00 Can Liquid Scale into a Frontier Model?
01:09:49 Hiring at Shopify: ML, Data Science, and Databases
01:10:43 Sydney at Bing: Personality Shaping and AI Character
01:13:32 Closing Thoughts

Transcript
[00:00:00] swyx: Okay.
We're here in the studio, a remote studio, with Mikhail Parakhin, CTO of Shopify. Welcome.[00:00:08] Mikhail Parakhin: Thank you. Welcome.[00:00:10] swyx: I don't even know if I should introduce you as CTO of Shopify. I feel like you have many identities. Uh, you led sort of the, the Bing ML team, I guess, uh, uh, or ads team. I, I don't know, I don't know, uh, you know, it's, uh, people va-variously refer you as like CEO or, or, uh, I don't know what that, that, that said previous role at Microsoft was.[00:00:29] Mikhail Parakhin: Uh, that was... Yeah, my previous role w- at Microsoft was the-- I actually was the CEO of one of Microsoft's business units, which included, as I, you know, as we discussed, all the things that people like to laugh about, uh, including Windows and Edge and Bing and ads and everything.[00:00:47] swyx: Yeah, yeah. What a, what a, what a wild time.You've obviously, uh, done a lot since you landed at Shopify. Uh, one of the reasons I reached out was because you started promoting more sort of internal tooling, uh, primarily Tangle, but also a lot of people have seen and adopted Tobi's QMD, uh, and obviously, I think, uh, Shopify has always been sort of leading in terms of, uh, engineering.I think more-- it's just more recent that you guys have been more vocal about your sort of AI adoption. Is that, is that true?[00:01:16] Mikhail Parakhin: Well, I think AI tools in general are fairly recent development, uh, and we've-- Shopify, you know, at this stage of its development, we're developing AI in-in-house and other, uh, building tools that use AI and, you know, interfacing with the wider AI community, uh, you know, are on the sort of the, uh, runaway trajectory.So it just did by sort of natural byproduct. We, we talk about it more also. 
We just, uh, just even yesterday, Andrej Karpathy was famously tweeting about, oh, are there some, uh, ways, uh, that, that you can organize your agents to store the data and then, uh, look up the data so that you don't have to research or, or lose context every time. And a little bit tongue in cheek, I tweeted that, “Hey, we've, we've done it much earlier, and we even have different approaches, Tobi and I.” Tobi, of course, is a big fan of QMD, and I'm more of a SQL, SQLite fan. But, uh, yeah, very similar things that we've already done here. The point is, yeah, we're very dynamic, you know, explosively growing company, and we have to be at the forefront of AI adoption, obviously.[00:02:29] swyx: Yeah. Yeah. Um, you, your team kindly prepared some slides actually that we were gonna bring up on to, uh, the screen. I think I can, I can screen share, and then we can kind of go through some of the shocking stats that maybe, maybe put some numbers to what exactly is going on. So here we have, uh- An internal AI tool adoption chart. What are we looking at here? What?[00:02:54] Mikhail Parakhin: Yeah, this is very interesting statistics. Uh, this is number of daily active workers, you know, think of, uh, DAU, basically the active users of-[00:03:05] swyx: Yeah ...[00:03:05] Mikhail Parakhin: AI tool as a percentage of all the people in the company, right? And then- Yeah ... different AI tools. And, uh, you could see two things here is that one is the green is total. Uh, green is just total. So you could see that it approaches really % by now. It's hard to do your job now without interacting deeply with at least one tool. 
You could see another interesting thing is just as many people commented in December was the phase transition when suddenly models got good enough that, that everything took off and started growing. Uh, it, it was many people noticed that the thing is that small improvements accumulated into this big change in Sep- December roughly timeframe.[00:03:52] swyx: Yeah.[00:03:52] Mikhail Parakhin: The other thing I would claim you could see is that, uh, CLI-based tools and tools that don't require you to look at the code becoming more popular, and you could see, yeah, various versions of, uh, Claude Code and Codex and Pi and internal development tools taking off. Uh, exactly, yeah, uh, and blue is our River, just internal agent for coding, whereas tools, uh, that require IDEs such as, uh, GitHub Copilot or Cursor, they're not exactly shrinking, but they're not growing as fast. Like, uh, red, red line is, is the IDE kind of tools. So you could see that they're, they're not experiencing as, as fast of a growth.[00:04:37] swyx: As I understand it, basically, every employee has their choice, right? Of choose whatever tool you use, and then you're just kind of doing a, a daily sur-survey or something.[00:04:47] Mikhail Parakhin: Exactly. And, uh, we- Yeah ... the, the push is to get your job done, you can use any tool, and we effectively fund unlimited tokens for everybody. Uh, we, we do, we do try to control the models that, uh, people use, but from the bottom, not from top. Like we basically say, “Hey, please don't use anything less than Opus four point six.”[00:05:09] swyx: Oh.[00:05:10] Mikhail Parakhin: Some people, some people end up using GPT five point four extra high. Some people use Opus four point six. Um, uh, you know, uh, there are some, uh, there are plus and minuses in going for full one million context window versus not. But, uh, we try to discourage people from using anything less than that.[00:05:28] swyx: Yeah, yeah. Got it, got it. 
Uh, I mean, uh, that's, you know... The, the next chart here, it really kind of shows the expansion and the sort of December twenty twenty-five inflection, right? That, uh, people are using a lot of tokens. I think it's also really interesting that no one was kind of abusing it in twenty twenty-five.Like it was- Had comparatively, uh, to this year, there was almost no growth. I mean, it's still like, you know, probably, probably gave fifty percent.[00:05:56] Mikhail Parakhin: Yeah. This is just a different scale. It's still exponential- Yeah, yeah ...growth at just a different- ...rate of expansion. Uh, there was inflection point, and Sean, I would claim the, the super interesting part here is that you could see that the distribution becoming more and more skewed.Yes. The top percentiles grow faster. So that means- Yeah ...the people in the top ten percentile, they, their consumption grows faster than seventy-five and so forth. So, uh, the distribution skews more and more towards the highest users, which is... I don't know what it tells me. It's like it feels not ideal, to be honest.Or maybe it's okay. We'll see.[00:06:36] swyx: Why does it feel not ideal? Is, is it because of, um, quantity over quality, or what's the concern?[00:06:42] Mikhail Parakhin: Because take it to the limit. That means, you know, if, if this rate of separation continued- Ah, yes ...a year, there will be one person consuming all the tokens. So it's just, it's kinda strange.[00:06:54] swyx: Yeah, I mean, um, uh, I, I think internal like teaching and all that, uh, will, will help sort of distribute things more widely. But in, in the early days, of course, the people who are sort of more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled. Maybe let's, let's call it that.I'll just, I'll just kinda quickly, uh, pause from the, the... 
You know, we will go back to the rest of the slides, but I just wanna, um, review, you know, there are a lot of CTOs of, of large companies like yourself where they're all considering some kind of token budget, right? Like I think it's something, something that Jensen Huang has been talking about, where like if your 200K engineer is not using 100K of tokens every year, like they're, they're underutilizing coding agents.Of course, Jensen Huang would say that, but like it seems a very quantity over quality approach and like some, some people are basically saying like, well, is this comparable to judging engineer quality by lines of code, right? Which we also know is like kind of flawed, but better than nothing. So I, I don't know if you have like a sort of management take here on, on how to view this kind of, uh, metrics.[00:08:02] Mikhail Parakhin: Well, I mean, you're, you're baiting me. I, I like... This is my favorite topic. Uh, if you let me, I'll probably talk for two hours on just this. I have a lot of things to say. Like I do think Jensen gotten a lot of bad press saying, “Oh, of course you're, you know, this, uh, the- ...the cake seller says you don't need enough cakes.”You know? Like, of course. Uh, but, uh, I actually, uh, think that's undeserved. I think he, he's actually right. Uh, I do think- He,[00:08:33] swyx: he's directionally correct.[00:08:35] Mikhail Parakhin: Yeah. Yeah. He's directionally correct for sure. Uh-[00:08:37] swyx: Who knows what the right number is? Yeah.[00:08:39] Mikhail Parakhin: The thing that I do Uh, want to say, and this is something that we learned through trial and error and very important is like two things.One is that it's not about just consuming tokens. Uh, you can consume tokens and, and in fact, the anti-pattern is running multiple agents, too many agents in parallel that don't communicate with each other. That's almost useless, uh, compared to just fewer agents and burns tokens very efficiently. 
Uh, setting up the right critique loop, especially with the high quality models, where one agent does something, the other one, ideally with a different model, critiques it, uh, suggests ways to improve it, the agent redoes it with this critique and, and so it takes much longer. So people don't like it because latency goes up. You know, they, they have to wait while this debate is happening. But, uh, the quality of the code is much higher. And another thing, just since you mentioned like, look, uh, uh, yeah, the overall budget is just like, uh, lines of code. Lines of code are exploding for everybody right now, partially because AI is really more verbose, but partially just because AI can write a lot more code, you know, doesn't get tired. And so you have to have a very strong narrow waist during PR review. Otherwise, just the number of bugs will go through the roof. It's, uh, it's this unexpected consequence of the just volume trumping everything. I would claim by now good model writes code on average with fewer bugs than, than the average human. But since they write so much more of it, like more of it will make it into production. So you have to-[00:10:26] swyx: You still have more bugs.[00:10:26] Mikhail Parakhin: Yeah. Have to have very rigorous PR reviews, also automated of course. But, uh, yeah, you have to spend a lot of budget there. Like this, this for me, for me, actually, the important metric is the ratio of budget spent during code generation versus, uh, spent, uh, expensive tokens like GPT, uh, five point four Pro or, uh, uh, Deep Think from Gemini, you know, checking on PR reviews.[00:10:55] swyx: Yeah, totally. Uh, I noticed in your chart you didn't have any review tools. Do you just use like, like let's say Claude Code to review? Or do you have another set of review tools like the Greptiles, the Code Rabbits, uh, Devin Reviews has a review tool. 
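As a rough sketch of the critique loop described above (one agent drafts, a second one critiques, the first redoes the work with that critique), the turn-taking could look like the following. This is an illustration under stated assumptions, not Shopify's implementation: `critique_loop`, `toy_generate`, and `toy_critique` are hypothetical names, and the two toy functions stand in for calls to two different models.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Critique:
    approved: bool
    notes: str


# Hypothetical signatures; a real system would call two different models here.
Generator = Callable[[str, Optional[str]], str]  # (task, critique notes) -> draft
Critic = Callable[[str, str], Critique]          # (task, draft) -> critique


def critique_loop(task: str, generate: Generator, critique: Critic,
                  max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate one generator and one critic; return (draft, rounds used)."""
    notes: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, notes)    # first pass, or redo using the critique
        review = critique(task, draft)   # second agent takes its turn
        if review.approved:
            return draft, round_no
        notes = review.notes
    return draft, max_rounds             # give up after max_rounds, keep last draft


def toy_generate(task: str, notes: Optional[str]) -> str:
    # Stand-in generator: only guards against empty input once told to.
    if notes and "empty" in notes:
        return "def mean(xs): return sum(xs) / len(xs) if xs else 0.0"
    return "def mean(xs): return sum(xs) / len(xs)"


def toy_critique(task: str, draft: str) -> Critique:
    # Stand-in critic: rejects drafts that divide by zero on an empty list.
    if "if xs" in draft:
        return Critique(True, "")
    return Critique(False, "handle the empty list")


draft, rounds = critique_loop("write mean()", toy_generate, toy_critique)
print(rounds)  # 2
```

The loop is strictly sequential, which matches the trade-off in the conversation: latency goes up because the agents take turns, but each round spends tokens on review instead of on more parallel generation.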
I don't know if you've had those specialist review tools.[00:11:13] Mikhail Parakhin: You are a little bit jumping on my store tool right now because in the graphs I was only showing public tools. Uh, uh, the-- I haven't found a good PR review tool that, that does what I think should be done. And, uh, partially my, my thinking is because it's so... It just goes against both what people feel like emotionally they prefer and, uh, some of the, uh, you know, frankly, even business models that, that the companies run. At PR review, uh, time, you want to run the largest models. That means, I don't know, Codex or, or, uh, Claude Code is not gonna cut it. You need to have pro-level models if you really want to, uh, stem the tide of bugs going into production. And you need to spend a lot of time, the models taking turns, but you don't want, like, a big swarm of, uh, of, uh, agents. So in fact, you end up in a different dual-dualistic world where you generate not that many tokens. You, in fact, generate few tokens, but it takes f-a long time because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. So that's, that's why I feel like I haven't found good tools, so we are using our own for PR review for now.[00:12:33] swyx: Yeah. Yeah. I mean, uh, I think a lot of companies are building their own, uh, especially to their needs, right?[00:12:38] Mikhail Parakhin: Mm-hmm.[00:12:38] swyx: Um, I, uh, you also have a chart here going back to the slides on, uh, PR merge growth, where we're now at thirty percent, uh, month on month rather than ten percent. Uh, and also the, the estimated complexity is going up. You know, this is productivity, right? ‘Cause y- presumably there's more stuff going into the code base and more, more features getting worked on. I'm curious about the backlog, right? 
Like the, the, the-- I actually don't mind a pro-level model taking an hour or two hours to review my PR, because I've dealt with humans who take a week to review my PR, right?And I keep pinging them on Slack, “Hey, hey, review my PR.” So, you know, I think there's some trade-off here where, like, it still doesn't make sense.[00:13:18] Mikhail Parakhin: Exactly. That, that's exactly m-my point. Uh, that on one hand, you can tolerate longer latencies at, uh, PR. On the other hand, like right now, the real problem is not in spending time waiting for PR.It's real problem is since there's so much more code than- Yeah ... uh, probability of at least some tests failing going up, and then you, like, keep de-failing, then you have to find the offending PR, evict it, retest it without that PR, and so deployment cycle becomes much longer. Uh, so it actually, in terms of the overall time to deploy, it's total time savings if you spend more time on a longer model, like thinking for an hour, because then, then you, you don't have to spend all that time during testing and rolling, you know, rolling back the deployment.[00:14:03] swyx: Yeah, totally. That's still worth it. You know, you don't look at the individual, look at the aggregate, and look at the, the, the change in the aggregate system.[00:14:11] Mikhail Parakhin: Exactly.[00:14:11] swyx: I'm kind of curious if, like, there's this PR mentality and, like, c-- the, the, the CICD paradigm will be changed eventually. Some people are like, obviously a lot of people want new GitHub, but I even wonder if, like, Git is the problem, right?Like, is that the bottleneck? Is the concept of a PR a bottleneck? Do you guys use stack diffs? I don't know if, uh, that's a, like, a merge queue stack diff type of thing.[00:14:34] Mikhail Parakhin: We, we use, we use Stacks, we u- we use Graphite. We worked with, uh, Graphite a lot. Uh, so we use Stack, uh, PRs. 
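The "find the offending PR, evict it, retest without it" cycle mentioned above can be sketched as a bisection over a failing batch. This is a simplified illustration, not a description of Shopify's or Graphite's tooling: `find_offender`, `land_batch`, and `passes` are hypothetical names, and the sketch assumes a batch fails exactly when it contains at least one bad PR (failures caused only by two PRs interacting would defeat the bisection).

```python
from typing import Callable, List, Sequence, Tuple

TestFn = Callable[[Sequence[str]], bool]  # True if the combined batch passes CI


def find_offender(batch: Sequence[str], passes: TestFn) -> str:
    """Bisect a failing batch down to a single offending PR."""
    candidates = list(batch)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the first half fails on its own, an offender is inside it;
        # otherwise an offender must be in the remaining half.
        candidates = half if not passes(half) else candidates[len(half):]
    return candidates[0]


def land_batch(batch: Sequence[str], passes: TestFn) -> Tuple[List[str], List[str]]:
    """Evict offenders one at a time until the remaining batch passes."""
    remaining = list(batch)
    evicted: List[str] = []
    while remaining and not passes(remaining):
        bad = find_offender(remaining, passes)
        remaining.remove(bad)   # evict, then retest without that PR
        evicted.append(bad)
    return remaining, evicted


# Toy CI: two of eight PRs break the build.
bad_prs = {"pr-3", "pr-6"}
batch = [f"pr-{i}" for i in range(1, 9)]
landed, evicted = land_batch(batch, lambda prs: not (set(prs) & bad_prs))
print(evicted)  # ['pr-3', 'pr-6']
```

Every call to `passes` is a full test run, so each eviction costs several CI cycles. That is the arithmetic behind the point that an extra hour of review per PR can still shorten the overall time to deploy.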
I think, uh, like that's clearly the overall CI/CD in general, and the interaction with the code repository right now is the, clearly the sort of the, the main issue and the bottleneck for us, uh, and highest top of mind. I would say we probably need a different metaphor or different whole design of how to process it in new agentic world. I haven't seen anything dramatically better yet. I, I think everybody right now is just trying to keep their head above the water ‘cause, ‘cause there, there's so many PRs and then everybody's CI/CD pipelines start creaking, the, the times are increasing, the number of bugs slipping by increasing, and you have to, have to clamp down. And so we are a little bit in this situation when we need to first stabilize that story and then start thinking, hey, what, what it could be a completely different and new world, which I haven't... I know some people working on it. I haven't seen something, like anything super compelling yet, but clearly the old things that were designed for humans will need to be morphed into something new.[00:15:53] swyx: One of the things that I, I think about is kind of like the merge conflict is basically a global mutex on the whole system, right? And in, in hu- in human organizations, we do have something like that. It's the company standup. But like, other than that, it's like it's actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information source, but somewhat lossy. Like it's okay, you know, that, that not every delivery is like atomic consistency. Like we're not dealing with a database sometimes.[00:16:27] Mikhail Parakhin: This is a very good point, uh, because since humans don't write code too fast, you know that global mutex is not too bad. Once you-[00:16:36] swyx: Yes ...[00:16:37] Mikhail Parakhin: start writing code at the speed of machine, it becomes the, you know, the bottleneck. Then what do you do? 
Maybe, and I can't believe I'm saying this because I, I'm long-- lifelong opponent of, uh, microservices, and I always thought that was, like, a really bad idea. And now that you're saying it, like, maybe in new guys like microservices will make a comeback, you know, because then you, you can ship things independently in tiny things and, and the managing all that complexity automatically will be much easier.I don't know. Like, we'll s-- we'll have to see.[00:17:10] swyx: Yeah. I mean, I don't know what the Microsoft or, or Shopify thing is, but I, I read this paper from Google where they have a monorepo that deploys into microservices, right? And then, uh, the other concept that I think about a lot is the Chaos Monkey concept from, from Netflix.Being able to create, like, this robust system where, um, uh, you know, you, you have the service discovery, you have the, uh, the independent, independent microservices discovery and, and, uh, you know, probably going to be a fair amount of duplication. That's how an organic system sort of scales, uh, that, that you have that...I don't know how you call it. Slack? Robustness? Depend-- uh, d-duplication. I, I, I forget the-- I, I'm-- And this-- those-- these are not exactly the terms- Hmm ... I'm looking for, but I c-can't really think of the words. Okay. I was gonna go into Tangent and Tangle. Uh, so, uh, we, we sort of discussed the overall stats that, uh, Shopify has.Uh, but, you know, I, I think some, some pretty cool stuff that you guys are working on is your ML experimentation, uh, and your, your sort of auto tr-research training pipeline. Presumably you're much closer to this one because it's, it's a sort of personal hobby of yours. How, how would you explain them in, together?I thought we have a slide that, like, uh, has the s- the system diagram.[00:18:24] Mikhail Parakhin: Yeah. Tangle first and then Tangent as a-[00:18:27] swyx: Yeah ...[00:18:28] Mikhail Parakhin: as a thing on top of Tangle. 
And, uh, Tangle is the third generation, I claim, of, uh, systems of, uh, running any data processing, but a bit with a skew for ML experiments, but not necessarily. Any sort of data processing tasks where you need to iterate, share, and you have scale so that you want maximum efficiency. You know how, like, normally you would work, you would-- Imagine you're a data scientist or an ML practitioner, you would get Jupyter notebooks or, or maybe you would get, uh, you know, Pyth- your Python scripts, and you would manage the data, and you produce those TSV files, and you put them in some JFS or something. Then you would notice that, oh, it has this, uh, weird missing values. You go and write another script that, uh, goes and replaces them with, uh- [00:19:20] swyx: Ah ... [00:19:21] Mikhail Parakhin: dash S. And then, then you, then you run some, some, uh, “Oh, I need to filter bots.” And so you run some LightGBM model that, uh, removes the bots. And then, then you like-- And then you, you kind of like get into shape, and then you start experimenting, and you run multiple experiments, and then you're like, “Oh my God,” like, “this experiment is worse.” You undo, and you cannot get to previous result. And like, “Ah, what did I do?” Like that. Again, then, then you finally like get everything working. Then you like start throwing it over the fence to production. You, you replicate it, those things don't work, and then sometimes you like don't notice that you forgot some feature naming and the, the features don't match. But then, like imagine you, you did everything, and then six months later you're like, have to repeat it because now there's more data, or you wanted to do another pass, and you're like, “What, what did I do?” Or like, or like, “This script crashes now,” or the, “the path has changed.” And then, then you're trying to, like you spend another month just doing ar- digital archeology on your own, you know, history, right? Now multiply that by many, many teams. 
Now imagine you got an intern that you wanna ramp up. Now you have to show that intern, “Oh, you know, look, here's the folder, there's the scripts, you know, ask your Claude agent to do, and then, uh, to, to figure it out.” And then Claude agent does something, and then you're, “Ah, yeah, right, right, it was the wrong folder. I forgot to tell you, I actually have this other thing I forgot myself.” And, and that's, that's the, like, the daily life we all, uh, all know it, uh, if, if you're a data scientist, machine practitioner, ma- machine learning practitioner or, uh, or even like any data managing, uh, person. [00:21:00] swyx: Yeah. So I, I used to do this, uh, f- uh, on the quant finance side, uh, in, in my hedge fund. So we did this before Airflow, and then, uh, obviously Airflow came along and, uh, then more recently Dagster, uh, I would say is like, in my mind, what I would use for that shape of problem, uh, where you had to materialize assets and create a pipeline. [00:21:19] Mikhail Parakhin: And that's, that's very good segue because... So Airflow is great, but Airflow is more about you, you have something and you wanna repeatedly run it in production on schedule. It's less about you as a team developing things and being able to share, and you grabbing the standard pipeline and saying, “Hey, I wanna change this tiny little component in the huge sea of data processing, and I don't wanna-- I wanna run ten experiments on this, and I wanna do hyperparameter optimization.” All that is very hard to do with Airflow. It's very easy to do with Tangle. Tangle is m- more about, it's everything about group of people running experiments, it might be agents too nowadays. Uh, running experiments cheaply, collaborating, sharing results. Uh, you don't need to understand fully. You, you grab-- you clone somebody else's experiment or somebody else's pipeline, uh, run, uh, change small piece, run it, be, like, get it to production state, and then ship in one click. So then the... 
You don't have to port it into any other system to, to run in production. You can just run the same experiment. It's, it's fully production ready. And, and it's, uh, it has lots of... Again, as I said, it's third generation system. The original one was, I would claim there was Ether and then, uh, at least in my career, Ether was the first, first, uh, that pioneered this type of approach.And then there was, uh, Nirvana, which, uh, uh, at Yandex, which did kind of sec-second take on this. And now this one aggregates the, the learnings from all of those and, and Airflow as well to, to get to the state where you try it, it, it feels kind of magical. Uh, ‘cause now everything is based on content, uh, hashes.So even if the version changed, but if the output didn't change, nothing is being rerun. It's very efficient. If you... Multiple people start experiment that needs the same sort of data preprocessing, it's not repeated multiple times. It's automatically done only once. If you start ten experiments that all require, you know, some, some data preparation first as the first step, and you don't have to coordinate for that.Like, you don't have to know that other people are starting it. You now, it's very easy compos-, uh, composability, any language you can u- uh, you wanna use, and it's very visual. So you can see immediately, you can edit it easily, you can assemble small things with just even mouse clicks if you want to, and, uh, share, clone.And everybody knows also it's fully kind of static in the sense that we rerun it second time, it will exactly have the same results. Like, you will never have to do digital archeology. So full versioning and everything is also there.[00:24:06] swyx: Uh, so, so people can, uh... It's open source. Go to the GitHub repo and, and, uh, check it out.Uh, and it is also a really good, uh, blog post about it. I think all these is, like, really appealing. 
The, the, the, the thing that I think sells me the most about it is that, um, sort of development to production transition, right? Which I think, um, a lot of people haven't really solved that, uh, strictly, right? Like, we develop really, really well in, in Python notebooks, but then, you know, that's obviously not a sort of production ready process. I think that, like, any way in which that is solved, I think is, is very appealing. Then the other thing that you mentioned, which also raised my eyebrows, was content-based caching, which you mentioned is, is, um, you know, is ve-very much, uh, um, a sort of efficiency measure about, uh, you know, just like recalculation only on, on sort of content addressing, which I think makes sense. Uh, it surprised me that the savings could be this much, but maybe I just haven't worked at your scale where there's so much duplication, uh, that people just rerun because they change a single ID upstream. [00:25:10] Mikhail Parakhin: It does, yeah. But it's not only you rerun. The, the main savings are coming from the fact that you ran it, you got your job done, and you moved on. Then- Yeah ... somebody else in some department you don't know existed runs the same task, but on a newer version. [00:25:27] swyx: Yeah. [00:25:27] Mikhail Parakhin: Like right now, you can't, in, in most of the organizations, you can't even find out about it so that you can't even measure that you're spending that time twice, right? Here- Yeah ... if everybody's on Tangle, that's detected automatically and detected that the output is the same. And then for that person, all it looks like is like experiment just suddenly moved, jumped forward, right? Uh, uh- Yeah ... so that's because, because the, there's network effect of multiple people helping each other. [00:25:51] swyx: Yeah. 
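The content-hash idea discussed here can be illustrated with a small sketch. This is not Tangle's implementation; the `ContentCache` class, its `run` method, and the step names are hypothetical, and the only assumption taken from the conversation is that a step is skipped when the content of its inputs has been seen before, regardless of which upstream version produced them.

```python
import hashlib
import json


class ContentCache:
    """Toy content-addressed cache (hypothetical, not Tangle's API).

    A result is keyed by the step name plus the hashes of the input
    contents, so identical inputs never trigger a second run.
    """

    def __init__(self):
        self.store = {}
        self.runs = 0  # number of times a step actually executed

    @staticmethod
    def digest(obj):
        # Hash the serialized content, not a version number or a path.
        payload = json.dumps(obj, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run(self, step_name, func, *inputs):
        key = self.digest([step_name, [self.digest(i) for i in inputs]])
        if key not in self.store:
            self.runs += 1
            self.store[key] = func(*inputs)
        return self.store[key]


def clean_v1(rows):
    return [r if r is not None else 0 for r in rows]


def clean_v2(rows):
    # A "newer version" of the step that happens to produce the same output.
    return [0 if r is None else r for r in rows]


cache = ContentCache()
raw = [1, None, 3]
cleaned_a = cache.run("clean_v1", clean_v1, raw)
cleaned_b = cache.run("clean_v2", clean_v2, raw)  # different step, so it runs
total_a = cache.run("sum", sum, cleaned_a)
total_b = cache.run("sum", sum, cleaned_b)  # same input content: cache hit
```

The upstream step changed version, but because its output content is identical, the downstream `sum` step executes only once, which is the "experiment just suddenly jumped forward" effect described above.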
This is one of those things where it's designed to be a platform from the beginning rather than an individual developer's tool from the beginning, right? And, and everything's gonna stream down from there. That is the sort of Tangle, uh, orchestrator, and it's, it manages jobs. We've seen a few versions of this, and this is obviously, uh, uh, the sort of, uh, unique approaches that you guys have, have, uh, figured out. And then there's Tangent. [00:26:14] Mikhail Parakhin: Yeah. And Tangent is basically an automatic auto research loop that can help and kind of do your work for you. Uh- ... you know, uh, effectively, effectively, Andrej Karpathy recently popularized it with auto research. Yes. Remember he said like he was, uh, speed running this, uh... Yeah, uh, you know the story. The, here we're basically bringing the same capability into Tangle so that, uh, the, uh, Tangent can analyze it. It's just an agent that can run multiple experiments, figure out what can be changed, and keep on rerunning it, keep on modifying until, uh, maximizing some goal, some loss function, whatever you need to, to achieve. And in general, I would say if you're not using auto research-like approach in whatever you do, like literally whatever you do, then you're missing out. We saw at Shopify that taking like a wildfire, anything where you can put measurements can be done dramatically better. Our- [00:27:19] swyx: Mm-hmm ... [00:27:20] Mikhail Parakhin: uh, speed of, uh, templatization HTML, uh, completely new UX tem- uh, templatization of, uh, reducing latency for liquid themes. Uh, we-- Our, uh, search, uh, recently we moved from-- It's hard even, uh, quote, from eight hundred QPS to forty-two hundred QPS with the same quality just by pure optimizations and an auto research loop that kept running and changing code in our index serve on the same number of machines, just increasing the throughput. We, we managed to improve the quality of gisting and machine learning process. 
Uh, you know, gisting is the prompt compression technique that[00:27:59] swyx: allows for[00:28:00] Mikhail Parakhin: lower latency and, and lower and, uh, actually higher quality slightly. So like literally whatever different walks of life, and it doesn't have to be AI related.Uh, we, we had a reduction in, uh, storage because the agents would go and find data sets that clearly are derivative, uh, and then you don't need to store things twice. You know, we, we, we found somewhat embarrassingly that it was one of the largest tables was hashing random IDs into another random ID, and we literally- Oofput only one. So it was translating, yeah, two random IDs hashed[00:28:36] swyx: into[00:28:37] Mikhail Parakhin: each. So, so[00:28:37] swyx: it has access to the code as well, so it can, it can check the, like what, what the hell is it doing?[00:28:42] Mikhail Parakhin: So there, there cou- it could be run in two levels. You, uh, you know, at the superficial level, it could just use ex-existing components and, uh, reshuffle them.Uh, you know, like you can grab- Yeah ... uh, XGBoost, and you can grab some, some Py- PyTorch module, and then can grab some, you know, grab another tools and, and combine them. At a deeper level, since Tangle is all sort of CLI based underneath you, every, every component is a wrapped really CLI, uh, call and a YAML file, it can analyze code and create new components and, and, uh, keep on iterating as well.So, so you can, you can both have quick modifications of existing t- uh, pipelines with the, with components that are already there pre-baked, or you can create new components, uh, and-[00:29:29] swyx: Yeah ...[00:29:29] Mikhail Parakhin: keep iterating on those. So auto research is, again, this is probably the, the thing I was excited the most in the last two months happening, and we see it taking like, like totally like a wildfire.Just, uh, everybody, every day, every... 
well, every day, every minute, I would, uh, have somebody Slack message saying, “Oh, look how much better I made it.” And, uh, it's all through auto research. [00:29:53] swyx: Is this democratized in some way in, in the sense that like is it your ML, uh, engineers and researchers doing this, or is it your regular PMs and software engineers also have the ability to auto-- to use Tangent? [00:30:07] Mikhail Parakhin: This is an awesome question. Like, Tangle in general and Tangent in particular are extremely democratizing. Like they- Yeah ... they are the main tools for- ‘Cause I don't [00:30:15] swyx: need the details. [00:30:16] Mikhail Parakhin: Yeah. Exactly. Initially used by ML and AI engineers, but then literally, as you said, PMs are like the highest user right now is one of PMs on our org, uh, Sartak and he was, he was number one by, by usage of, of this ‘cause they're just, uh, energetic and knowledgeable, and now it, it unlocks a lot of capability where you don't have to co-change code manually. [00:30:39] swyx: I mean, I mean, because it kind of cuts out the ML, ML engineer from the process because the, the, the PMs have the domain knowledge and the ability to think about, uh, from first principles about, okay, what, what results do I want? And they can-- they even have the access to the data that, that needs to go in. So it's like in some ways, like this is the magic black box that we've always wanted for, for training and, and for, uh, I guess, uh, uh, hill climbing, whatever. [00:31:04] Mikhail Parakhin: It's basically Claude Code for your AI development- ... uh, situation, right? Like now, now you don't have to know exactly how algorithms work. You can just, uh, bring your domain knowledge and expertise and product knowledge and iterate within Tangent until you've gotten the results that you need. [00:31:21] swyx: In my previous roles, every time that someone has pitched AutoML, you know, I've always been like, “Uh, this is not, this is not gonna work. 
It's, you know, it's, it's always gonna be a flop.” Somehow it's working now. I mean, presumably the answer is now we have LLMs and it's good enough, right? It's, it's an emergent property that we can do auto research, but like, it doesn't feel that satisfying that how come we didn't do this before, right? Like we just did like parameter search and like, I don't know. That's maybe that's it. [00:31:48] Mikhail Parakhin: Yeah. Bayesian optimization and hyperparameter optimization was, was the one that, or facet of AutoML that was used very actively, which incidentally also built into, uh, Tangle. But, you know, I know Patrice Simard very well, and, uh, he was such a, uh, such a proponent of AutoML, and he put, like literally spent careers trying to democratize it. Without LLMs, it just turned out to be very hard. Like it, you, you would have flexibility within certain narrow domain, but it was hard to wider scale, and now with LLMs suddenly it's like magic wand, and so suddenly everybody- ... is an AutoML expert. [00:32:28] swyx: Yeah, I, I think it's multiple things, right? Like I'm, I'm just gonna bring up the, the, the chart again, right? Like LLMs can do the monitoring very well. That is the very potentially unbounded, super unstructured. It can do the analysis very well, it can do the... Uh, and basically it is much more intelligence poured into every single step. Uh, there's maybe nothing structurally changed about AutoML, but this is just m-more intelligent and more unstructured. [00:32:53] Mikhail Parakhin: Exactly. [00:32:54] swyx: Any flaws that you've run into? Like everyone is like drinking the Kool-Aid, oh my God, time savings, uh, you know, performance improvements. Like what, what, uh, issues have you have, uh, come up? [00:33:06] Mikhail Parakhin: This is really cool. It's not a solution to all the world's problems for sure. 
The limitations are usually the ones I-- And this is where we get into a bit of a subjective territory. Uh, I can only share what I've, I've seen so far, and I'm sure the situation, uh, is changing, and, you know, maybe after I say it, like many people will reach out and say, “Hey, what about this?” And you don't know that, and then, then they'll be probably right. But what I've seen is auto research is very good at doing kind of obvious things that you don't have bandwidth to do or you didn't notice or maybe you're not aware of like the-- some standard practices. It is not good at doing something completely out of distribution, something that, you know, you have to think for, for multiple days, uh, and, and do something like none of this. So, so it's, uh, I, uh, set an experiment once, uh, on, on my sort of, uh, hobby thing, and I let it run for, uh, ended up, uh, several weeks run, uh, you know, it's like full production kind of scale, so it, you know, slow runs and, and it ex-- it performed in the end, uh, over four hundred experiments, and only one was successful. I'm like, “Okay, that's, that's good.” But- [00:34:18] swyx: But it saved time. [00:34:19] Mikhail Parakhin: Yeah, I saved time. Like it, it was the, that thing. Yeah, if I, if I were doing four hundred experiments myself, my batting average, as I said, would have been much higher, I'm sure. But also, first of all, it would take me like three years to do four hundred experiments. And, uh, I didn't have to do them. Like the machines were just, uh, the price of electricity did that. So, and I got one improvement, uh, that in, uh, my, my-- Honestly, when I was starting that experiment, my thinking was to go and show that, “Hey, Andrej, maybe you just don't know how to optimize.” And I was super smart because in, in my pro-problem, it was optimized for many years, and it was like fully improved. Uh, and I didn't expect it, you know, auto research to find anything at all. Yet it did. 
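The loop described here, where an agent proposes a change, runs the experiment, and keeps the change only if the metric improves, can be sketched as plain hill climbing. This is a loose illustration, not Tangent: the `auto_research` function and its parameters are hypothetical, and the LLM agent that would propose code changes is replaced, as a stated assumption, by a random perturbation of a single number.

```python
import random


def auto_research(evaluate, propose, baseline, n_experiments, seed=0):
    """Hypothetical propose/evaluate/keep loop (not Tangent's actual design).

    evaluate:  maps a candidate to a score, higher is better
    propose:   maps (current best, rng) to a new candidate; in a real
               system this would be an agent editing a pipeline or code
    """
    rng = random.Random(seed)
    best, best_score = baseline, evaluate(baseline)
    accepted = 0
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:  # keep only measurable improvements
            best, best_score = candidate, score
            accepted += 1
    return best, best_score, accepted


def toy_metric(x):
    # Stand-in for "some goal, some loss function": peak at x = 3.
    return -(x - 3.0) ** 2


def toy_propose(x, rng):
    # Stand-in for the agent: a small random modification.
    return x + rng.uniform(-0.5, 0.5)


best, best_score, accepted = auto_research(
    toy_metric, toy_propose, baseline=0.0, n_experiments=400
)
```

Most proposals are discarded, which mirrors the low hit rate in the anecdote: the loop is cheap to run unattended, and only the accepted improvements matter.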
So instead of making fun of Andrej, I ended up, uh, a big, big supporter. Yeah, that's exactly the tweet. Yes. [00:35:10] swyx: You and Toby really, really go back and forth on-online a lot, which is really funny. Uh, think of it as, as an eval for the optimalness of the code it's running on. Uh, it's almost like it reminds me of like a Kolmogorov complexity thing, but, uh, I guess it's-- there's some optimal thing that you're trying to sort of reduce down to, I guess. Um, and so, so you, you, you know, you should congratulate yourself that you had, uh, you know, uh, ninety-nine percent, uh, optimality. [00:35:36] Mikhail Parakhin: Exactly, yeah. I think Andrej really deserves a lot of credit for popularizing this approach. This is, uh, this is incredibly, I think, powerful and cool, and, you know, the, uh, even him, him just mentioning it led to a lot of gains in a lot of places in the industry, so we should be thankful. [00:35:56] swyx: Yeah. I think he also has a just... I don't know what it is. Like, um, you know, it, it is a simple self-contained project that people can take and apply to other things, which is, is, is one thing, but also just the name. Just like somehow no one, no one managed to call their thing auto research. It's just naming things is very important. I think that that is mostly, uh, our coverage of Tangle and, and, uh, Tangent. I think obviously, you know, there's a lot of, uh, ML infra at, at Shopify that people can, uh, dive into. We're about to go into SimGym, but before I do that, any, any other sort of broader comments around this whole effort? Like where is it, where is it leading to? [00:36:36] Mikhail Parakhin: As a segue to SimGym, like all those things start composing strongly. And, uh, you could see a huge unlock when you can look at each one of the tools and, and you see, oh, they're extremely useful. Uh, Tangle is useful by itself. Auto Research is useful by itself. SimGym is useful by itself. 
If you combine all three, you create like synergetic effect. I think that's why we wanted to even, uh, cover them today is because this is something that if you go back even, you know, five years ago, would've been unthinkable.Uh, replicating that, uh, would, would be either incredibly costly or impossible, right? With probably thousands of people are required.[00:37:20] swyx: Well, we have serverless human, uh, serverless intelligence, right? Like, uh, so yes, you do have thousands of hu-- of, of intelligences, not just, not humans. And that's, that's close enough, right?Even if they're not AGI, they're, they're close enough to do the, the task that you need them to do. And, and, you know, that's, there's plenty for, for a lot of routine work, knowledge work. Okay, let's get into SimGym. Um, this is one of those things I, I was surprised to see actually it's apparently your, uh, one of your most popular launches, and I think something that, uh, I think Sim AI, I think Yunjun Park, who did the Smallville thing, there's a very small cottage industry of people trying to do like the simulate customer thing.I think a lot of people maybe don't super trust this yet because they're like, well, obviously they would just do what you prompt them to do, right? But maybe just think, uh, tell us about the sort of inspiration or origin story.[00:38:10] Mikhail Parakhin: That's exactly actually the thing I wanted to cover, because if you don't have the historical data, all you can do is prompt a-agents in a vacuum, and they will do exactly what you prompt them to do.In fact, when I first proposed it, and this is a bit of, um, my brainchild initially, if I, I can boast, even Toby said like, “But wouldn't they, they just repeat what, what you tell them?” And, uh, but I'm like, “Yes, except Shopify has decades of history of how people made changes and what there is, uh, there, what it resulted in terms of sales.”So now what we can do is we can-- we have this... 
It's not, it's a noisy data. There's a small, usually websites, uh, you know, like things, things are never in isolation. It's almost never AB experiment. It's always AA experiment when there's has two meanings, but basically, you know, in different time you run two different things.But if you aggregate in general, uh, like everything together, and you apply, uh, denoising and collaborative filtering like approach, you can extract a very clear signal. And then you can optimize your agents. And that's why it took so long. It took almost a year of that optimization of just us sitting and fiddling, and, and we had this internal goals of correlation of hitting-- internal goal was to hit zero point seven correlation with, uh, add to cart events, for example.Like that, that if we run real AB test experiment, that it should, it should go and, and rep-uh, replicate, uh, same sort of success that, that humans had or lack thereof. And it, it took forever, and I don't think that's easily replicatable because, uh, like who else would have that data? You have to have this historic, you know, decades, uh, worth of data.And now, now the, like the other thing you need is in-infrastructure and the scale, right? Because, uh, w- again, what we found, uh, stat sig results, you need to run a lot of simulations, a lot of agents, and, and it's-- Those are expensive things. Like you're, you're making actions in the browser because you want a real friction.You want to, to be able to get the image like of what humans will see because you wanna, uh, detect effects like, “Hey, if I make my images larger, will I have more sales or l- uh, fewer sales?” And like usually people's intuition here, by the way, is that I increase my images, I will have more because they look nicer.You know, designers all look sparse and big images. Like usually your sales tank, right? But, but, uh, you know, from HTML, all the characters look the same only the, the size tag looks different, right? So it's very hard. 
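The zero point seven correlation goal mentioned above can be made concrete with a small check: compare the lift a simulation predicted for a set of changes against the lift that was actually observed. The numbers below are invented for illustration and the `pearson` helper is a plain textbook implementation, not anything from SimGym.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: relative change in add-to-cart rate for five store
# changes, as predicted by simulated shoppers and as seen with real ones.
simulated_lift = [0.02, -0.01, 0.05, 0.00, -0.03]
observed_lift = [0.03, -0.02, 0.04, 0.01, -0.01]

r = pearson(simulated_lift, observed_lift)
meets_goal = r >= 0.7  # the internal target quoted in the conversation
```

In practice the observed side would come from historical changes across many stores after denoising, which is the part the speaker says took almost a year.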
So you have to take visual information, you have to run this in simulated browser environment on the big farm and, and of course, you have to have, uh, like very, very expensive model, good model, with multimodal model. So all this it's-- is what's taken so long and, uh, to share my personal fail a little bit there, Shawn, is like, you know, we always had this bias to-- for like large company bias. You know, we always, uh, whenever you-- we do, we're like, “Hey, we'll run an experiment,” right? We make, make a change, and we will run an experiment and then, uh, see, uh, see which one's better or like, “No, this is worse,” and most of them are worse, so you discard it and keep iterating, hill climbing. And we're like, “Oh, like smaller merchants, they cannot get stat sig results. They cannot really run experiments simply because, you know, in a week there would be not enough data for them.” So we thought from this perspective. What we didn't realize is that most people don't have A and B, they just have one thing, and they need suggestions of what A and B should be. So, uh, we first build this, hey, we run simulation on two separate themes and, and, uh, say, “Hey, which one is better?” We then morphed it into, and very recently just released it, when you have just your site, your theme, we run over it and we say, “Hey, here's what predicted values of, of, uh, uh, conversions are, and here's how we think you should modify it to increase your conversions.” And then circling back to what you started with, the proof is in the pudding. Like, if we are not correlating with reality, like, people will not be using it. And, uh, thankfully, we see literally every day more users than the previous day. So, so right now, uh, right now- It's working. Yeah. 
I'm-- Right now my problem is how to pay for it all because the-- so our major thing is how to optimize the LLMs, do distillation, how to run the headless browsers, uh, and headful browsers, uh, uh, cheaper so that we can accommodate the increase in traffic. [00:42:47] swyx: Yeah. I, I understand that you, uh, you published a lot of technical detail at GTC, so I was just gonna bring it up a little bit. I think s- was this in, in con-conjunction with some kind of GTC presentation? Or something like that, right? [00:42:59] Mikhail Parakhin: Well, we, yeah, we, we did it in several place, but yeah, we had the engineering- Yeah ... blog, uh, as well. Yeah. [00:43:05] swyx: Yeah. So you're running, uh, GPT OSS. Uh, [00:43:08] Mikhail Parakhin: the, this is an older version. You know, now we run multimodal model. But yeah- Yeah ... GPT OSS, we still run GPT OSS as well for- [00:43:15] swyx: And then you have the VMs, and you also have browser-based. I really like this one where you said, “It violates almost every assumption that standard LLM serving is designed for.” And then you had like, basically orders of magnitude differences between everything. [00:43:29] Mikhail Parakhin: Exactly. Which is, which, uh, which was, you know, a bit of a challenge to implement, like when, like even simple things. Uh, be- since it violates all the assumptions, for example, multi-instance GPUs, like MIGs don't work as well. But we needed, uh, to get MIG to work because, ‘cause otherwise it's way too expensive. And so we had to deal with the, yeah, with, uh, lots of infrastructure and, and, uh, work with, uh, uh, Fireworks and CentML, uh, you know, to help with optimizations and browser-based, as you mentioned. Yeah, like, takes a village. [00:44:04] swyx: Okay. So there's a lot of like, I guess, experimentation in the infrastructure so far, and you've published more or less what you have here. I guess I'm, I'm less familiar with CentML. 
I, I don't do, uh, that much work in this, this part of the stack. But why was it the sort of preferred instance platform? [00:44:22] Mikhail Parakhin: There are really three probably top companies. There used to be, uh, uh- three top companies, uh, at least I was aware of that did, uh, LLM optimization. You know, Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by NVIDIA. Uh, what they did is if you have a model and you want to optimize it to a specific prof-- uh, profile of usage, uh, they would go and do it. And, uh, we work with, with those companies, uh, this work was particularly with CentML and NVIDIA to get the best possible results out of it. And, and sometimes you, you have to retune depending on, like sometimes you want the maximum throughput, sometimes you want minimal latency, sometimes you want like the cheapest, right? And, yeah, or some combination. And so yeah, these are people who would come and help you. [00:45:14] swyx: I see. I see. Yeah, yeah. I'm familiar with these people for the LLM, you know, autoregressive stack. But the other interesting category of these optimizers is also the diffusion people, like fal and, you know, uh, Pruna recently has come up a lot as well, which I think is like really underappreciated, uh, at least by myself, because I, I thought, oh, all the workload would be LLMs, but actually there's a lot of diffusion as well. [00:45:38] Mikhail Parakhin: Exactly. [00:45:38] swyx: There's a lot here, so I, I, I... it's, it's, uh, it's, it's, it's hard to cover. But I, I do think like people underappreciate the importance of customer simulation, basically. I think this is something that I'm candidly still getting to terms with. 
Uh, you know, uh, you also-- your team also like prepared this, like, really nice diagram. Uh, I, I assume this is AI generated. [00:46:00] Mikhail Parakhin: Yeah, it looks- [00:46:01] swyx: Maybe it's not. [00:46:01] Mikhail Parakhin: Yeah, it looks, uh, Gemini-ish. Yeah, but, uh, uh, honestly, I, I don't know where, where the hell they generated. It looks, look, uh, looks like it's, uh, Google. But the interesting part, Shawn, that, that, uh, we haven't covered, but I, I wanted to mention is if your store had previous customers, rather than it's a new store, you're like new merchant just launching things, it helps tremendously in just correlation and forecast. Yeah, we take your previous, uh, customer's behavior, and we create agents that replicate those specific distribution of, of customers that you get, and then we a- we apply those to your changes, and then that, that raised, you know, the re-- uh, just correlation with the add to cart events or to-- with conversion or whatever it, it, it may be, uh, quite dramatically. So, uh, replicating humans in general seems like an interesting, cool challenge. [00:46:58] swyx: As a shareholder, I think this is the-- like if people are Shopify shareholders, they should really deeply understand this because this is basically the moat. The, the more you use Shopify, the more it will just automatically improve, right? Like you're, you're doing the job for them. [00:47:13] Mikhail Parakhin: Yeah, that's what we started with. Like, uh- ... uh, otherwise, if you're just a startup, I wouldn't do it if, uh, you know, if it was my startup because without the data, it, yeah, as, as you said, it's, it's exactly the case that, uh, whatever you say in prompt, that's, that's what the agents will be doing. [00:47:30] swyx: The statistician in me wants to like really satisfy the sort of, um, statistical intuition, I guess. Um, to me it's kind of, uh, the, the word that comes to mind is, um, ergodicity. 
Uh, so let's say a, a customer takes this path, customer takes this path, customer takes this path, right? Um, the... In my mind, the way I explain it is like, okay, here, here's the ninety-fifth percentile, here's the fifth percentile, and here's the median, right? Um, but to me, what SimGym is potentially doing is that it can, uh, modify... It can sort of model the sort of in-between sort of journeys as well, that, that maybe are dependent on the previous states. This may be like a very RL-type conclusion where like basically the summary statistics, if you only did naive AB testing, you only have the, the statistics at, at, at a certain point, and you only judge based on the sort of overall summary statistics. But here you can actually model trajectories. Does that make sense? Or-

[00:48:31] Mikhail Parakhin: That makes total sense because like, well, that, that makes even more sense that maybe even you realize bec- because-

[00:48:38] swyx: Okay. Please,

[00:48:38] Mikhail Parakhin: please. Yes ... we do-- Yeah. The, so internally, uh, we have this system, we talked about it briefly once at NeurIPS. We have a huge HSTU-based system that models the whole companies, uh, and their possible paths. And like- Yeah ... what you are, what you are showing, like actually at any point of time, you can either model the user's behavior or you mo- can also think about, uh, the whole merchant as a company, as the entity that acts in the world. You can model that as well. And then you can do, can do counterfactuals. In your graph, like in your blue graph, uh, if you're... Imagine in the center there, uh, somewhere in the middle, you would have an intervention. I give that person a coupon, or I don't know, I send a personal thank you card, or give a discount in some- somewhere. And then you can, uh, then you can do forward rollouts from that counterfactual. So what would have happened with that intervention or without the intervention?
And you can even ch- change where that intervention, uh, in time can happen, right? Like some- where, where in this journey. So we, we do this at the Shopify scale for our merchants, and then if we notice that something that they can be fixing, like there's a strong counterfactual, like we have Shopify policy, they basically get a notification like, “Hey, we think your...something is wrong with your-” I don't know, Canadian sales. Like, uh, it looks like it's misconfigured. Here's what you need to do. Or do you think like, uh, you have to set up this campaign with these parameters? And we do that at the buyer level to literally offer discounts or cashback or, or things to buyers. So this is-- I'm getting very excited. Like this is my sort of area of, uh, interest, I guess, and, and hobby. But being able to m-model something complex as human beings or companies and model counterfactuals on it, where you can have interventions in the future and optimize when to make intervention, what kind inter-- uh, what kind of intervention to make. It's such an unlock that previously was completely impossible. Like the-- it was, it was always dreamed of, but never... Like how would you even simulate it without LLMs or HSTUs? I think very, very exciting times.

[00:50:59] swyx: I just wanted to, uh, to maybe illustrate this. I, I'm not the best illustrator, but I, I am a conceptual statistics guy. And y-you know, you cannot just do this. Like this is a dimensionality AB test doesn't do, right? Like, uh, because it doesn't have the, the, the change over time, uh, stochastic nature, uh, and it doesn't have the sort of contextual like... Here's all the context to this point. Um, okay, cool. Um, that's SimGym. You're, you're gonna burn a lot of tokens on this thing. But you're, you're one of the, the only scale platforms in the world that can, uh, that can do this across a huge variety of workloads, right?
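The forward-rollout idea described in this exchange (simulate a journey with and without an intervention, and vary when the intervention happens) can be sketched in toy form. This is only an illustration under loud assumptions: the states, the transition probabilities in `BASE` and `COUPON`, and the helper names `rollout` and `conversion_rate` are all hypothetical, and a hand-written Markov chain stands in for the learned HSTU-based model mentioned in the conversation.

```python
import random

# Hypothetical per-step transition probabilities for a toy buyer journey.
# These numbers are invented for illustration; they are not real data.
BASE = {
    "browse": {"browse": 0.6, "cart": 0.2, "leave": 0.2},
    "cart": {"cart": 0.5, "purchase": 0.2, "leave": 0.3},
}
# Assumed behavior once a coupon has been offered (also invented).
COUPON = {
    "browse": {"browse": 0.5, "cart": 0.35, "leave": 0.15},
    "cart": {"cart": 0.4, "purchase": 0.4, "leave": 0.2},
}


def rollout(rng, horizon, intervene_at=None):
    """Simulate one journey; return True if it ends in a purchase.

    If intervene_at is given, the coupon dynamics apply from that step on.
    """
    state = "browse"
    for t in range(horizon):
        if state in ("purchase", "leave"):
            break  # terminal states end the journey
        use_coupon = intervene_at is not None and t >= intervene_at
        options = (COUPON if use_coupon else BASE)[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
    return state == "purchase"


def conversion_rate(n, horizon, intervene_at=None, seed=0):
    """Estimate the purchase rate over n simulated journeys."""
    rng = random.Random(seed)
    return sum(rollout(rng, horizon, intervene_at) for _ in range(n)) / n


if __name__ == "__main__":
    print("no intervention:", conversion_rate(5000, 10))
    for step in (0, 3, 6):
        print(f"coupon at step {step}:", conversion_rate(5000, 10, intervene_at=step))
```

Comparing the rates across `intervene_at` values is the toy analogue of asking when in the journey an intervention is worth making. A summary-statistics A/B test would only see the final conversion numbers, not the per-step trajectories.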
I'm even curious on a sort of human, uh, research level of like, well, do, does retail behave d-differently from like clothing sales? D-does that behave differently from electronic sales? I, I don't know. I don't know what else you guys... The Kardashian shoppers, do they differ from like people who buy, uh, I don't know, cars and, uh, whatever.

[00:51:55] Mikhail Parakhin: Well, very different, and different sensitivities and different modes of, uh, shopping and, and different levels of what's important. Now, to-totally, you can do aggregations at, uh, at a store level. You can do aggregations at a different, uh, category level. I don't know if, uh, you know, for our statisticians among us, I couldn't believe, but we-- recently we're looking at it, and we had to bring back, uh, CRPs, you know, Chinese restaurant process. It's a, like, way of aggregating and, like, naturally grow clustering. So across... Specifically to answer questions that, uh, like you were just posing on how, how if, if buyers behave different categories. And I'm like, “I haven't seen CRP since two thousand and one.” It's

[00:52:37] swyx: so What? It's so- What is... No, I haven't, I haven't seen this. No. This is not in my training. Uh,

[00:52:44] Mikhail Parakhin: but, but yeah, it, uh, uh, it actually, like the, the-- there was a very popular kind of theory, popular NeurIPS, ICML circles in early two thousands, uh, kind of nice. And now, now it has practical applications, uh- Yeah ... that we were resurrecting.

[00:53:03] swyx: Yeah, amazing. Uh, I, I can see, I can see how this is like a, uh, a fun job for you where you get to apply all these things. Um, yeah, yeah, so super cool. Super cool. So, okay, so, so anyone who, who knows what CRPs are and has always wanted to use them at work, uh, they should, they should definitely join Shopify. Okay, so w-we have a lot and but I, I'm, I'm being mindful of the time.
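For readers who, like swyx, have not met the Chinese restaurant process: a minimal sketch of the standard sequential construction follows. Each new customer joins an existing table with probability proportional to its current size, or opens a new table with probability proportional to a concentration parameter `alpha`, so the number of clusters grows with the data instead of being fixed up front. The function name and parameters are illustrative assumptions; nothing here reflects how Shopify applies CRPs.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table assignments from a CRP with concentration alpha > 0.

    Returns (assignments, table_sizes), where assignments[i] is the table
    index of customer i and table_sizes[k] is the headcount at table k.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # Existing table k is chosen with weight table_sizes[k];
        # the extra final slot (weight alpha) means "open a new table".
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[k] += 1
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = chinese_restaurant_process(200, alpha=2.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sorted(sizes, reverse=True))
```

Larger `alpha` tends to produce more, smaller clusters. In a clustering model the "tables" would be latent groups (for example, buyer segments), with the CRP acting as the prior over how many groups exist.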
I, I do wanted to, to sort of cover some other things. Um, I-I'll give you a choice, UCP or Liquid?

[00:53:30] Mikhail Parakhin: Liquid. I think, I think on UCP, you know, like UCP is very important for us and, and it just we are-- UCP, we have a structured, uh, discussions, and you can read about them, and we have, uh, blog posts, and we have a big release this week, in fact, like with our catalog. Oh,

[00:53:46] swyx: okay.

[00:53:46] Mikhail Parakhin: Uh, yeah,

[00:53:46] swyx: but- Le-I mean, we, we can, we can discuss the, the, the release briefly because we'll release this after the-- after it's already announced so whatever. There's a catalog that you guys are doing?

[00:53:55] Mikhail Parakhin: Yeah. So we are, we are- Okay ... we are bringing in capabilities of a whole, uh, Shopify catalog. Basically, you now you can search for products, you can do lookups by specific ID, you can do bulk lookups when you need to bring m-multiple products. You don't need to know in ad-in advance what you're trying to show or to sell or check out. Like, you can now, you can now have this decided at, at runtime, and this big area for investment for us for both non-personalized and personalized searches, trying to provide basically a win-window into whole universe of products that are being sold everywhere in the world. And Shopify is really not exactly, but almost like a super set of any-anything being sold. Now we are bringing it into UCP and, uh, and, uh, identity linking is another big thing for us, uh, so that you, you can use, uh, like Google or whatever, whatever identity you have, uh, they're minimizing friction.

[00:54:56] swyx: Yeah. So

[00:54:57] Mikhail Parakhin: yeah, big release for us. But Liquid AI of course we never talk about, and the problem might be more, more aligned with what we d-discussed previously on this chat.

[00:55:07] swyx: Sure. The main thing that everyone understands about Liquid is that it is inspired by a worm, and I still don't know why.
I'm curious on your explanation. I think you, you, uh, you can make things very approachable. And also I think like what is the potential of like the, the level of efficiency that you get out of Liquid?

[00:55:23] Mikhail Parakhin: You- we all familiar with transformer architectures. And, uh, for the longest time, there was a competing architecture, it's called the state space models. So, so SSMs, uh, you know, Chris, Chris Ré, one of the pioneers and, and lots of startups, uh, trying to make those realities. They have, uh, significant benefits being main being, uh, being much faster and, uh, lower footprint and not quadratic in length, you know, sort of, uh, linear in, in, uh, in your context length. But with state space models- They never quite made it. Like they're used-- They have, uh, certain niches when they thrive, their hybrid architectures are useful, but they never quite made it. And liquid neural networks are, you can think of them as a next step, like, uh, sort of, uh, state-space model square. It's non-transformer architecture that's more complicated than sta-state space and really difficult to code if you-- if I'm being honest. But it's, um, very efficient. It's, uh, subline-- sub, uh, quadratic in, in length of your context. Uh, it's very compact way to represent things, and that's a Liquid AI company. They... Their goal is to productize it, and very often you have this need, uh, when you need to have long context and small model, and you want to have low latency. Like in general, it's basically on par with transformers, and if you do hybrids with transformers, it's, it's even better. That's why we at Shopify, when we tried multiple and we constantly try multiple models, multiple companies, we found that for small, particularly with low latency applications, when you have low latency and/or if you need longer context lengths, Liquid was the best.
And so we still use the whole zoo and always like obviously test and use everything, uh, every open source model and, you know, it feels l

[Podfic]
IAHL6: The Settlement

Apr 21, 2026 28:29


A Good Omens fanfic by mostlyeffable. Part 3 of the Unkind Regards series. Full name: It's A Hard Life.

Music: Mainstream Music 2025 Vol. 8, Produced by Sascha Ende (CC-BY 4.0)

Sounds:
Email notification: https://freesound.org/people/OptronTeamFilms/sounds/521094/ (CC-0)
Text notification (Crowley): https://freesound.org/people/GabrielAraujo/sounds/242502/ (CC-0)
Text notification (Aziraphale): https://freesound.org/people/mickleness/sounds/269185/ (CC-0)
Phone ringtone: https://freesound.org/people/jhyland/sounds/539661/ (CC-0)
RL knock: https://freesound.org/people/Dreadwolf910/sounds/615987/ (CC-0)

For tags and other details, to leave kudos and comments, please visit the corresponding post on archiveofourown: https://archiveofourown.org/works/82013181!

The Struggle Bubble
The Takeover of Youth Sports

Apr 20, 2026 41:43


If you've got a kid in travel ball, club soccer, rec hockey, or cheer, you already know the math is getting weird. Chad and Craig spend this episode pulling on that thread: why youth sports has quietly become a $40 billion industry (about 2x the revenue of the NFL), and why that number doesn't feel like it's benefiting the kids the industry says it serves.

We start where the money starts: the trickle-down from pro sports to NIL to your seven-year-old. The "crazy parent" economy: $500/month for private training, $3,500/month for a certain club with a certain logo. We get into the alphabet soup of competitive tiers (ECNL, GA, MLS Next, NPL, RL) and why a parent has zero chance of parsing which league actually matters for their 10-year-old. Spoiler: it doesn't matter yet, because nobody is scouting a 10-year-old.

Craig brings the UK comparison ("jumpers for goalposts," rec as just playing, no coaches, no parent politics) and we hold it up against the US model where even rec has gotten expensive. Then we dig into the scholarship myth head-on: Chad is 5'7", was a really good athlete, and is very clear that your body type at 15 is doing more work than your travel-ball resume at 10. The athletic-scholarship math is not what you've been told.

The core stat block lands around minute 29, straight from Aspen's Project Play data: the average family now spends over $1,000 per kid per sport, $2,000+ for high-income families, $10,000–$13,000 a season in travel hockey and softball, and $25,000 at the top end. 57% of surveyed parents say the cost is unreasonable. 20% say they're willing to go into debt for it anyway. That is a broken market.

Then Craig drops the line of the episode: "You just read the pitch deck for a PE dude." Because if you're a private equity investor and you hear "captive flywheel of parents who will pay anything, with a constant fresh supply of new kids every year," you start buying facilities. We walk through the full flywheel (pay-to-play, private training, tiered leagues, travel tournaments, hotels, flights, gear, registration software, even streaming-your-own-kid) and why one owner controlling multiple rungs means parents lose leverage.

We also try to hold both sides honestly. Capital can professionalize a fragmented industry. But the current scorecard is heavy on fee escalation and vertical integration. We close with what parents can actually do: push back on single-sport specialization before 13, volunteer-coach, ask harder questions of your club, and set the right end goal (fun, love of the game, being a good teammate) instead of the wrong one (the 0.1% pro track).

We also pitch what we're building at Gaimplan, because if we're going to say this on a podcast, we ought to be putting our own skin in the game.

Let's Talk AI
#240 - Project Glasswing, Claude Mythos, GLM-5.1, emotion concepts

Apr 16, 2026 104:30


Our 240th episode with a summary and discussion of last week's big AI news!

Recorded on 04/08/2026 (sorry I keep releasing stuff late, will get better with it soon!)

Hosted by Andrey Kurenkov and Jeremie Harris

Feel free to email us your questions and feedback at andreyvkurenkov@gmail.com and/or hello@gladstone.ai

Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:

Anthropic launched Project Glasswing and previewed Claude Mythos, a general-purpose model withheld from broad release due to dramatically stronger autonomous offensive cybersecurity performance (including zero-day discovery), alongside concerning bio/virology uplift results and documented deception/containment-escape behaviors; pricing is far higher than Opus and most discovered vulnerabilities remain unpatched.

Product and platform updates included Google's Gemini 3.1 Flash Live for real-time multilingual voice conversation, Suno v5.5 personalization features, Anthropic tightening Claude Code/OpenClaw access and usage limits, OpenAI canceling an “adult mode,” and Microsoft releasing MAI models for speech-to-text, audio generation, and image generation.

Business and market developments featured Anthropic's revenue run rate surpassing $30B and a major Google/Broadcom TPU compute expansion, SoftBank taking a $40B short-term loan to fund OpenAI commitments, Granola reaching a $1.5B valuation, Anthropic buying Coefficient Bio for $400M, and OpenAI acquiring the TBPN business talk show.

Policy, open-source, and geopolitics included Z.ai releasing open-weight GLM 5.1 and a multimodal GLM model, Google open-sourcing Gemma 4 under Apache 2.0, a judge blocking the Pentagon's “supply chain risk” label against Anthropic, research on LLM “emotion vectors” and OpenAI meta-gaming during RL, China restricting Manus founders amid Meta deal review, scrutiny of Nvidia's chip-smuggling claims, China chipmakers gaining market share, and Iran framing cloud data centers as military targets.

Timestamps:

(00:00:10) Intro / Banter

Tools & Apps
(00:01:58) Anthropic debuts 'Project Glasswing' and new AI model for cybersecurity | The Verge
(00:18:22) Gemini Live gets 'biggest upgrade yet' with Gemini 3.1 Flash Live
(00:20:40) Anthropic says Claude Code subscribers will need to pay extra for OpenClaw usage | TechCrunch
(00:25:36) OpenAI abandons yet another side quest: ChatGPT's erotic mode | TechCrunch
(00:26:16) Microsoft takes on AI rivals with three new foundational models | TechCrunch
(00:31:25) Suno leans into customization with v5.5 | The Verge

Applications & Business
(00:32:53) Anthropic announces deal with Google, Broadcom, says revenue has tripled
(00:37:53) Sam Altman May Control Our Future - Can He Be Trusted? | The New Yorker
(00:40:18) OpenAI, Anthropic, Google Unite to Combat Model Copying in China - Bloomberg
(00:41:45) Chinese chipmakers claim nearly half of local market as Nvidia's lead shrinks
(00:45:20) SoftBank secures $40 billion loan to boost OpenAI investments
(00:47:23) Granola raises $125M at $1.5B valuation for its AI note-taking app - SiliconANGLE
(00:48:17) Anthropic acquires stealth startup Coefficient Bio in $400M deal
(00:50:20) OpenAI acquires TBPN, the buzzy founder-led business talk show | TechCrunch

Projects & Open Source
(00:53:04) Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution - MarkTechPost
(00:55:14) Google announces Gemma 4 open AI models, switches to Apache 2.0 license - Ars Technica
(01:01:26) Z.ai Launches GLM-5V-Turbo: A Native Multimodal Vision Coding Model Optimized for OpenClaw and High-Capacity Agentic Engineering Workflows Everywhere

Policy & Safety
(01:04:45) Judge blocks Pentagon's effort to 'punish' Anthropic by labeling it a supply chain risk
(01:10:05) Emotion concepts and their function in a large language model
(01:21:12) China bars Manus co-founders from leaving country amid Meta deal review, FT reports
(01:25:38) US lawmakers ask whether Nvidia CEO's smuggling remarks misled regulators
(01:27:48) How far does alignment midtraining generalize?
(01:32:20) Metagaming matters for training, evaluation, and oversight
(01:39:31) Iran says it has struck Oracle data center in Dubai, Amazon data center in Bahrain - country has threatened to attack Nvidia, Intel, and others, too

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

ACM ByteCast
Peter Stone - Episode 84

Apr 16, 2026 35:56


In this episode of ACM ByteCast, Rashmi Mohan hosts 2024 ACM/AAAI Allen Newell Award recipient Peter Stone, Professor at the University of Texas at Austin and Chief Scientist at Sony AI. He received the award for significant contributions to the theory and practice of AI, especially in reinforcement learning (RL), multiagent systems, transfer learning, and intelligent robotics. As a leading figure in AI research, Stone has fundamentally advanced how autonomous agents learn, plan, and collaborate. His groundbreaking work on RL algorithms has enabled robots to acquire skills through experience. He is an ACM, AAAI, AAAS, and IEEE Fellow, an Alfred P. Sloan Research Fellow, and a Fulbright Scholar. At UT Austin, he is the founder and director of the Learning Agents Research Group (LARG) within the Artificial Intelligence Laboratory, as well as Founding Director of Texas Robotics. In the past, he also worked at AT&T Labs - Research and co-founded Cogitai, Inc. (acquired by Sony). Peter explores the intersection of professional research and personal passion, detailing how his lifelong love for soccer fueled his involvement in RoboCup, where he aims to develop humanoid robots capable of competing at a World Cup level by 2050. The conversation highlights his leadership as the Chief Scientist of Sony AI, focusing on landmark projects like GT Sophy, an AI that mastered the complexities of Gran Turismo, and the development of FHIBE, an ethically sourced dataset designed to mitigate bias in machine learning. Throughout the interview, Stone emphasizes the importance of ad hoc teamwork—the ability of autonomous agents to collaborate on the fly with unfamiliar partners. He also shares his passion for undergraduate research and advocacy for AI education at all levels.

Unsupervised Learning
Ep 84: OpenAI's Chief Scientist on Continual Learning Hype, RL Beyond Code, & Future Alignment Directions

Apr 9, 2026 58:46


Jakub Pachocki, OpenAI's Chief Scientist, sits down with Jacob to cover the full arc of where AI research stands today and where it's headed. The conversation spans the explosive growth of coding agents and what it signals about near-term AI capability, the use of math and physics benchmarks as proxies for general intelligence, how reinforcement learning is being extended beyond easily-verified domains toward longer-horizon tasks, and what it means to run a research organization at the precise moment the models themselves are starting to accelerate the research. Jakub shares a candid take on the competitive landscape, why chain-of-thought monitoring is one of the most promising tools in the alignment toolkit, and, with unusual directness, why the concentration of power enabled by highly automated AI organizations is a societal problem that doesn't yet have an obvious solution.

(0:00) Intro
(1:53) Research Intern Capability Timelines
(4:59) Math Breakthroughs
(7:59) RL Beyond Verifiable Tasks
(12:32) RL vs In-Context
(19:01) Allocating Compute Internally
(28:18) AI for Science
(31:40) Pattern Matching
(33:23) Solving the Hardest Math Problems
(37:40) Chain of Thought Monitoring
(44:33) Generalization and Value Alignment in Models
(47:57) Inside OpenAI
(51:55) Quickfire

With your co-hosts:
@jacobeffron - Partner at Redpoint, Former PM Flatiron Health
@patrickachase - Partner at Redpoint, Former ML Engineer LinkedIn
@ericabrescia - Former COO GitHub, Founder Bitnami (acq'd by VMware)
@jordan_segall - Partner at Redpoint