Podcasts about latent space

  • 39 PODCASTS
  • 183 EPISODES
  • 1h 14m AVG DURATION
  • 1 EPISODE EVERY OTHER WEEK
  • Feb 24, 2026 LATEST

POPULARITY

[Popularity chart: 2019–2026]



Latest podcast episodes about latent space

AI + a16z
AI's Capital Flywheel: Models, Money, and the Future of Power

AI + a16z

Feb 24, 2026 · 57:52


a16z's Martin Casado and Sarah Wang join Latent Space hosts Alessio Fanelli and Swyx to discuss what makes this AI investment cycle unlike anything in the history of venture capital. They cover why the lines between venture and growth, apps and infrastructure are blurring, how frontier model companies can raise more than the aggregate of everyone built on top of them, and why the industry-wide gap between perception and reality has never been wider. Follow Alessio Fanelli on X: https://x.com/FanaHOVA Follow Swyx (Shawn Wang) on X: https://twitter.com/swyx Follow Martin Casado on X: https://twitter.com/martin_casado Follow Sarah Wang on X: https://twitter.com/sarahdingwang Listen to more from Latent Space: https://www.youtube.com/@LatentSpacePod Stay Updated: Find a16z on YouTube, X, and LinkedIn. Listen to the a16z Show on Spotify or Apple Podcasts. Follow our host: https://twitter.com/eriktorenberg Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts. Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

a16z
Capital, Compute, and the Fight for AI Dominance

a16z

Feb 19, 2026 · 57:52


a16z's Martin Casado and Sarah Wang join Latent Space hosts Alessio Fanelli and Swyx to discuss what makes this AI investment cycle unlike anything in the history of venture capital. They cover why the lines between venture and growth, apps and infrastructure are blurring, how frontier model companies can raise more than the aggregate of everyone built on top of them, and why the industry-wide gap between perception and reality has never been wider. Stay Updated: Find a16z on YouTube, X, and LinkedIn. Listen to the a16z Show on Spotify or Apple Podcasts. Follow our host: https://twitter.com/eriktorenberg Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Bitter Lessons in Venture vs Growth: Anthropic vs OpenAI, Noam Shazeer, World Labs, Thinking Machines, Cursor, ASIC Economics — Martin Casado & Sarah Wang of a16z

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Feb 19, 2026 · 55:18


Tickets for AIE Miami and AIE Europe are live, with first wave speakers announced!

From pioneering software-defined networking to backing many of the most aggressive AI model companies of this cycle, Martin Casado and Sarah Wang sit at the center of the capital, compute, and talent arms race reshaping the tech industry. As partners at a16z investing across infrastructure and growth, they've watched venture and growth blur, model labs turn dollars into capability at unprecedented speed, and startups raise nine-figure rounds before monetization.

Martin and Sarah join us to unpack the new financing playbook for AI: why today's rounds are really compute contracts in disguise, how the "raise → train → ship → raise bigger" flywheel works, and whether foundation model companies can outspend the entire app ecosystem built on top of them. They also share what's underhyped (boring enterprise software), what's overheated (talent wars and compensation spirals), and the two radically different futures they see for AI's market structure.

We discuss:

* Martin's "two futures" fork: infinite fragmentation and new software categories vs. a small oligopoly of general models that consume everything above them
* The capital flywheel: how model labs translate funding directly into capability gains, then into revenue growth measured in weeks, not years
* Why venture and growth have merged: $100M–$1B hybrid rounds, strategic investors, compute negotiations, and complex deal structures
* The AGI vs. product tension: allocating scarce GPUs between long-term research and near-term revenue flywheels
* Whether frontier labs can out-raise and outspend the entire app ecosystem built on top of their APIs
* Why today's talent wars ($10M+ comp packages, $B acqui-hires) are breaking early-stage founder math
* Cursor as a case study: building up from the app layer while training down into your own models
* Why "boring" enterprise software may be the most underinvested opportunity in the AI mania
* Hardware and robotics: why the ChatGPT moment hasn't yet arrived for robots and what would need to change
* World Labs and generative 3D: bringing the marginal cost of 3D scene creation down by orders of magnitude
* Why public AI discourse is often wildly disconnected from boardroom reality and how founders should navigate the noise

Show Notes:

* "Where Value Will Accrue in AI: Martin Casado & Sarah Wang" - a16z show
* "Jack Altman & Martin Casado on the Future of Venture Capital"
* World Labs

Martin Casado
• LinkedIn: https://www.linkedin.com/in/martincasado/
• X: https://x.com/martin_casado

Sarah Wang
• LinkedIn: https://www.linkedin.com/in/sarah-wang-59b96a7
• X: https://x.com/sarahdingwang

a16z
• https://a16z.com/

Timestamps

00:00:00 – Intro: Live from a16z
00:01:20 – The New AI Funding Model: Venture + Growth Collide
00:03:19 – Circular Funding, Demand & "No Dark GPUs"
00:05:24 – Infrastructure vs Apps: The Lines Blur
00:06:24 – The Capital Flywheel: Raise → Train → Ship → Raise Bigger
00:09:39 – Can Frontier Labs Outspend the Entire App Ecosystem?
00:11:24 – Character AI & The AGI vs Product Dilemma
00:14:39 – Talent Wars, $10M Engineers & Founder Anxiety
00:17:33 – What's Underinvested? The Case for "Boring" Software
00:19:29 – Robotics, Hardware & Why It's Hard to Win
00:22:42 – Custom ASICs & The $1B Training Run Economics
00:24:23 – American Dynamism, Geography & AI Power Centers
00:26:48 – How AI Is Changing the Investor Workflow (Claude Cowork)
00:29:12 – Two Futures of AI: Infinite Expansion or Oligopoly?
00:32:48 – If You Can Raise More Than Your Ecosystem, You Win
00:34:27 – Are All Tasks AGI-Complete? Coding as the Test Case
00:38:55 – Cursor & The Power of the App Layer
00:44:05 – World Labs, Spatial Intelligence & 3D Foundation Models
00:47:20 – Thinking Machines, Founder Drama & Media Narratives
00:52:30 – Where Long-Term Power Accrues in the AI Stack

Transcript

Latent.Space - Inside AI's $10B+ Capital Flywheel — Martin Casado & Sarah Wang of a16z

[00:00:00] Welcome to Latent Space (Live from a16z) + Meet the Guests[00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast, live from a16z. Uh, this is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.[00:00:08] swyx: Hey, hey, hey. Uh, and we're so glad to be on with you guys. Also a top AI podcast, uh, Martin Casado and Sarah Wang. Welcome, very[00:00:16] Martin Casado: happy to be here and welcome.[00:00:17] swyx: Yes, uh, we love this office. We love what you've done with the place. Uh, the new logo is everywhere now. It's, it's still getting, takes a while to get used to, but it reminds me of like sort of a callback to a more ambitious age, which I think is kind of[00:00:31] Martin Casado: definitely makes a statement.[00:00:33] swyx: Yeah.[00:00:34] Martin Casado: Not quite sure what that statement is, but it makes a statement.[00:00:37] swyx: Uh, Martin, I go back with you to Netlify.[00:00:40] Martin Casado: Yep.[00:00:40] swyx: Uh, and, uh, you know, you created software-defined networking and all, all that stuff people can read up on your background. Yep. Sarah, I'm newer to you. Uh, you, you sort of started working together on AI infrastructure stuff.[00:00:51] Sarah Wang: That's right. Yeah. Seven, seven years ago now.[00:00:53] Martin Casado: Best growth investor in the entire industry.[00:00:55] swyx: Oh, say[00:00:56] Martin Casado: more hands down there is, there is. [00:01:00] I mean, when it comes to AI companies, Sarah, I think has done the most kind of aggressive, um, investment thesis around AI models, right? So, worked with, uh, Noam, uh, Mira, uh, Fei-Fei, and so just these frontier, kind of like large AI models.[00:01:15] I think, you know, Sarah's been the, the broadest investor. Is that fair?[00:01:20] Venture vs. Growth in the Frontier Model Era[00:01:20] Sarah Wang: No, I, well, I was gonna say, I think it's been a really interesting tag, tag team actually just 'cause the, a lot of these big C deals, not only are they raising a lot of money, um, it's still a tech founder bet, which obviously is inherently early stage.[00:01:33] But the resources,[00:01:36] Martin Casado: so many, I[00:01:36] Sarah Wang: was gonna say the resources one, they just grow really quickly. But then two, the resources that they need day one are kind of growth scale. So I, the hybrid tag team that we have is. Quite effective, I think,[00:01:46] Martin Casado: what is growth these days?
You know, you don't wake up if it's less than a billion or like, it's, it's actually, it's actually very like, like no, it's a very interesting time in investing because like, you know, take like the character around, right?[00:01:59] These tend to [00:02:00] be like pre monetization, but the dollars are large enough that you need to have a larger fund and the analysis. You know, because you've got lots of users. ‘cause this stuff has such high demand requires, you know, more of a number sophistication. And so most of these deals, whether it's US or other firms on these large model companies, are like this hybrid between venture growth.[00:02:18] Sarah Wang: Yeah. Total. And I think, you know, stuff like BD for example, you wouldn't usually need BD when you were seed stage trying to get market biz Devrel. Biz Devrel, exactly. Okay. But like now, sorry, I'm,[00:02:27] swyx: I'm not familiar. What, what, what does biz Devrel mean for a venture fund? Because I know what biz Devrel means for a company.[00:02:31] Sarah Wang: Yeah.[00:02:32] Compute Deals, Strategics, and the ‘Circular Funding' Question[00:02:32] Sarah Wang: You know, so a, a good example is, I mean, we talk about buying compute, but there's a huge negotiation involved there in terms of, okay, do you get equity for the compute? What, what sort of partner are you looking at? Is there a go-to market arm to that? Um, and these are just things on this scale, hundreds of millions, you know, maybe.[00:02:50] Six months into the inception of a company, you just wouldn't have to negotiate these deals before.[00:02:54] Martin Casado: Yeah. These large rounds are very complex now. Like in the past, if you did a series A [00:03:00] or a series B, like whatever, you're writing a 20 to a $60 million check and you call it a day. Now you normally have financial investors and strategic investors, and then the strategic portion always still goes with like these kind of large compute contracts, which can take months to do.[00:03:13] And so it's, it's very different ties. I've been doing this for 10 years. It's the, I've never seen anything like this.[00:03:19] swyx: Yeah. Do you have worries about the circular funding from so disease strategics?[00:03:24] Martin Casado: I mean, listen, as long as the demand is there, like the demand is there. Like the problem with the internet is the demand wasn't there.[00:03:29] swyx: Exactly. All right. This, this is like the, the whole pyramid scheme bubble thing, where like, as long as you mark to market on like the notional value of like, these deals, fine, but like once it starts to chip away, it really Well[00:03:41] Martin Casado: no, like as, as, as, as long as there's demand. I mean, you know, this, this is like a lot of these sound bites have already become kind of cliches, but they're worth saying it.[00:03:47] Right? Like during the internet days, like we were. Um, raising money to put fiber in the ground that wasn't used. And that's a problem, right? Because now you actually have a supply overhang.[00:03:58] swyx: Mm-hmm.[00:03:59] Martin Casado: And even in the, [00:04:00] the time of the, the internet, like the supply and, and bandwidth overhang, even as massive as it was in, as massive as the crash was only lasted about four years.[00:04:09] But we don't have a supply overhang. Like there's no dark GPUs, right? I mean, and so, you know, circular or not, I mean, you know, if, if someone invests in a company that, um. You know, they'll actually use the GPUs. 
And on the other side of it is the, is the ask for customer. So I I, I think it's a different time.[00:04:25] Sarah Wang: I think the other piece, maybe just to add onto this, and I'm gonna quote Martin in front of him, but this is probably also a unique time in that. For the first time, you can actually trace dollars to outcomes. Yeah, right. Provided that scaling laws are, are holding, um, and capabilities are actually moving forward.[00:04:40] Because if you can put translate dollars into capabilities, uh, a capability improvement, there's demand there to Martin's point. But if that somehow breaks, you know, obviously that's an important assumption in this whole thing to make it work. But you know, instead of investing dollars into sales and marketing, you're, you're investing into R&D to get to the capability, um, you know, increase.[00:04:59] And [00:05:00] that's sort of been the demand driver because. Once there's an unlock there, people are willing to pay for it.[00:05:05] Alessio: Yeah.[00:05:06] Blurring Lines: Models as Infra + Apps, and the New Fundraising Flywheel[00:05:06] Alessio: Is there any difference in how you built the portfolio now that some of your growth companies are, like the infrastructure of the early stage companies, like, you know, OpenAI is now the same size as some of the cloud providers were early on.[00:05:16] Like what does that look like? Like how much information can you feed off each other between the, the two?[00:05:24] Martin Casado: There's so many lines that are being crossed right now, or blurred. Right. So we already talked about venture and growth. Another one that's being blurred is between infrastructure and apps, right? So like what is a model company?[00:05:35] Mm-hmm. Like, it's clearly infrastructure, right? Because it's like, you know, it's doing kind of core R&D. It's a horizontal platform, but it's also an app because it's um, uh, touches the users directly. And then of course. You know, the, the, the growth of these is just so high. And so I actually think you're just starting to see a, a, a new financing strategy emerge and, you know, we've had to adapt as a result of that.[00:05:59] And [00:06:00] so there's been a lot of changes. Um, you're right that these companies become platform companies very quickly. You've got ecosystem build out. So none of this is necessarily new, but the timescales of which it's happened is pretty phenomenal. And the way we'd normally cut lines before is blurred a little bit, but.[00:06:16] But that, that, that said, I mean, a lot of it also just does feel like things that we've seen in the past, like cloud build out the internet build out as well.[00:06:24] Sarah Wang: Yeah. Um, yeah, I think it's interesting, uh, I don't know if you guys would agree with this, but it feels like the emerging strategy is, and this builds off of your other question, um.[00:06:33] You raise money for compute, you pour that or you, you pour the money into compute, you get some sort of breakthrough. You funnel the breakthrough into your vertically integrated application. That could be ChatGPT, that could be Claude Code, you know, whatever it is. You massively gain share and get users.[00:06:49] Maybe you're even subsidizing at that point. Um, depending on your strategy. You raise money at the peak momentum and then you repeat, rinse and repeat. Um, and so. And that wasn't [00:07:00] true even two years ago, I think. Mm-hmm. And so it's sort of to your, just tying it to fundraising strategy, right?
There's a, and hiring strategy.[00:07:07] All of these are tied, I think the lines are blurring even more today where everyone is, and they, but of course these companies all have API businesses and so they're these, these frenemy lines that are getting blurred in that a lot of, I mean, they have billions of dollars of API revenue, right? And so there are customers there.[00:07:23] But they're competing on the app layer.[00:07:24] Martin Casado: Yeah. So this is a really, really important point. So I, I would say for sure, venture and growth, that line is blurry app and infrastructure. That line is blurry. Um, but I don't think that that changes our practice so much. But like where the very open questions are like, does this layer in the same way.[00:07:43] Compute traditionally has like during the cloud is like, you know, like whatever, somebody wins one layer, but then another whole set of companies wins another layer. But that might not, might not be the case here. It may be the case that you actually can't verticalize on the token string. Like you can't build an app like it, it necessarily goes down just because there are no [00:08:00] abstractions.[00:08:00] So those are kinda the bigger existential questions we ask. Another thing that is very different this time than in the history of computer sciences is. In the past, if you raised money, then you basically had to wait for engineering to catch up. Which famously doesn't scale like the mythical mammoth. It take a very long time.[00:08:18] But like that's not the case here. Like a model company can raise money and drop a model in a, in a year, and it's better, right? And, and it does it with a team of 20 people or 10 people. So this type of like money entering a company and then producing something that has demand and growth right away and using that to raise more money is a very different capital flywheel than we've ever seen before.[00:08:39] And I think everybody's trying to understand what the consequences are. So I think it's less about like. Big companies and growth and this, and more about these more systemic questions that we actually don't have answers to.[00:08:49] Alessio: Yeah, like at Kernel Labs, one of our ideas is like if you had unlimited money to spend productively to turn tokens into products, like the whole early stage [00:09:00] market is very different because today you're investing X amount of capital to win a deal because of price structure and whatnot, and you're kind of pot committing.[00:09:07] Yeah. To a certain strategy for a certain amount of time. Yeah. But if you could like iteratively spin out companies and products and just throw, I, I wanna spend a million dollar of inference today and get a product out tomorrow.[00:09:18] swyx: Yeah.[00:09:19] Alessio: Like, we should get to the point where like the friction of like token to product is so low that you can do this and then you can change the Right, the early stage venture model to be much more iterative.[00:09:30] And then every round is like either 100 k of inference or like a hundred million from a 16 Z. There's no, there's no like $8 million C round anymore. Right.[00:09:38] When Frontier Labs Outspend the Entire App Ecosystem[00:09:38] Martin Casado: But, but, but, but there's a, there's a, the, an industry structural question that we don't know the answer to, which involves the frontier models, which is, let's take.[00:09:48] Anthropic it. Let's say Anthropic has a state-of-the-art model that has some large percentage of market share. 
And let's say that, uh, uh, uh, you know, uh, a company's building smaller models [00:10:00] that, you know, use the bigger model in the background, Opus 4.5, but they add value on top of that. Now, if Anthropic can raise three times more.[00:10:10] Every subsequent round, they probably can raise more money than the entire app ecosystem that's built on top of it. And if that's the case, they can expand beyond everything built on top of it. It's like imagine like a star that's just kind of expanding, so there could be a systemic. There could be a, a, a systemic situation where the SOTA models can raise so much money that they can out pay anybody that builds on top of 'em, which would be something I don't think we've ever seen before just because we were so bottlenecked in engineering, and this is a very open question.[00:10:41] swyx: Yeah. It's, it is almost like bitter lesson applied to the startup industry.[00:10:45] Martin Casado: Yeah, a hundred percent. It literally becomes an issue of like raise capital, turn that directly into growth. Use that to raise three times more. Exactly. And if you can keep doing that, you literally can outspend any company that's built the, not any company.[00:10:57] You can outspend the aggregate of companies on top of [00:11:00] you and therefore you'll necessarily take their share, which is crazy.[00:11:02] swyx: Would you say that kind of happens in Character? Is that the, the sort of postmortem on. What happened?[00:11:10] Sarah Wang: Um,[00:11:10] Martin Casado: no.[00:11:12] Sarah Wang: Yeah, because I think so,[00:11:13] swyx: I mean the actual postmortem is, he wanted to go back to Google.[00:11:15] Exactly. But like[00:11:18] Martin Casado: that's another difference that[00:11:19] Sarah Wang: you said[00:11:21] Martin Casado: it. We should talk, we should actually talk about that.[00:11:22] swyx: Yeah,[00:11:22] Sarah Wang: that's[00:11:23] swyx: Go for it. Take it. Take,[00:11:23] Sarah Wang: yeah.[00:11:24] Character.AI, Founder Goals (AGI vs Product), and GPU Allocation Tradeoffs[00:11:24] Sarah Wang: I was gonna say, I think, um. The, the, the Character thing raises actually a different issue, which actually the Frontier Labs will face as well. So we'll see how they handle it.[00:11:34] But, um, so we invested in Character in January 2023, which feels like eons ago, I mean, three years ago. Feels like lifetimes ago. But, um, and then they, uh, did the IP licensing deal with Google in August 2024. And so, um, you know, at the time, no, you know, he's talked publicly about this, right? He wanted to, Google wouldn't let him put out products in the world.[00:11:56] That's obviously changed drastically. But, um, he went to go do [00:12:00] that. Um, but he had a product attached. The goal was, I mean, it's Noam Shazeer, he wanted to get to AGI. That was always his personal goal. But, you know, I think through collecting data, right, and this sort of very human use case, that the Character product.[00:12:13] Originally was and still is, um, was one of the vehicles to do that. Um, I think the real reason that, you know. I if you think about the, the stress that any company feels before, um, you ultimately going one way or the other is sort of this AGI versus product. Um, and I think a lot of the big, I think, you know, OpenAI is feeling that, um, Anthropic, if they haven't started, you know, felt it, certainly given the success of their products, they may start to feel that soon.[00:12:39] And the real. I think there's real trade-offs, right?
It's like how many, when you think about GPUs, that's a limited resource. Where do you allocate the GPUs? Is it toward the product? Is it toward new re research? Right? Is it, or long-term research, is it toward, um, you know, near-to-midterm research? And so, um, in a case where you're resource constrained, um, [00:13:00] of course there's this fundraising game you can play, right?[00:13:01] But the fund, the market was very different back in 2023 too. Um. I think the best researchers in the world have this dilemma of, okay, I wanna go all in on AGI, but it's the product usage revenue flywheel that keeps the revenue in the house to power all the GPUs to get to AGI. And so it does make, um, you know, I think it sets up an interesting dilemma for any startup that has trouble raising up until that level, right?[00:13:27] And certainly if you don't have that progress, you can't continue this fly, you know, fundraising flywheel.[00:13:32] Martin Casado: I would say that because, 'cause we're keeping track of all of the things that are different, right? Like, you know, venture growth and uh, app infra and one of the ones is definitely the personalities of the founders.[00:13:45] It's just very different this time I've been. Been doing this for a decade and I've been doing startups for 20 years. And so, um, I mean a lot of people start this to do AGI and we've never had like a unified North star that I recall in the same [00:14:00] way. Like people built companies to start companies in the past.[00:14:02] Like that was what it was. Like I would create an internet company, I would create infrastructure company, like it's kind of more engineering builders and this is kind of a different. You know, mentality. And some companies have harnessed that incredibly well because their direction is so obviously on the path to what somebody would consider AGI, but others have not.[00:14:20] And so like there is always this tension with personnel. And so I think we're seeing more kind of founder movement.[00:14:27] Sarah Wang: Yeah.[00:14:27] Martin Casado: You know, as a fraction of founders than we've ever seen. I mean, maybe since like, I don't know the time of like Shockley and the traitorous eight or something like that. Way back in the beginning of the industry, I, it's a very, very.
It's like, it is like, basically Zuckerberg kind of came out swinging and then now he's kind of back to building.[00:15:30] Yeah,[00:15:31] swyx: yeah. You know, you gotta like pay up to like assemble team to rush the job, whatever. But then now, now you like you, you made your choices and now they got a ship.[00:15:38] Martin Casado: I mean, the, the o other side of that is like, you know, like we're, we're actually in the job hiring market. We've got 600 people here. I hire all the time.[00:15:44] I've got three open recs if anybody's interested, that's listening to this for investor. Yeah, on, on the team, like on the investing side of the team, like, and, um, a lot of the people we talk to have acting, you know, active, um, offers for 10 million a year or something like that. And like, you know, and we pay really, [00:16:00] really well.[00:16:00] And just to see what's out on the market is really, is really remarkable. And so I would just say it's actually, so you're right, like the really flashy one, like I will get someone for, you know, a billion dollars, but like the inflated, um, uh, trickles down. Yeah, it is still very active today. I mean,[00:16:18] Sarah Wang: yeah, you could be an L five and get an offer in the tens of millions.[00:16:22] Okay. Yeah. Easily. Yeah. It's so I think you're right that it felt like a blip. I hope you're right. Um, but I think it's been, the steady state is now, I think got pulled up. Yeah. Yeah. I'll pull up for[00:16:31] Martin Casado: sure. Yeah.[00:16:32] Alessio: Yeah. And I think that's breaking the early stage founder math too. I think before a lot of people would be like, well, maybe I should just go be a founder instead of like getting paid.[00:16:39] Yeah. 800 KA million at Google. But if I'm getting paid. Five, 6 million. That's different but[00:16:45] Martin Casado: on. But on the other hand, there's more strategic money than we've ever seen historically, right? Mm-hmm. And so, yep. The economics, the, the, the, the calculus on the economics is very different in a number of ways. And, uh, it's crazy.[00:16:58] It's cra it's causing like a, [00:17:00] a, a, a ton of change in confusion in the market. Some very positive, sub negative, like, so for example, the other side of the, um. The co-founder, like, um, acquisition, you know, mark Zuckerberg poaching someone for a lot of money is like, we were actually seeing historic amount of m and a for basically acquihires, right?[00:17:20] That you like, you know, really good outcomes from a venture perspective that are effective acquihires, right? So I would say it's probably net positive from the investment standpoint, even though it seems from the headlines to be very disruptive in a negative way.[00:17:33] Alessio: Yeah.[00:17:33] What's Underfunded: Boring Software, Robotics Skepticism, and Custom Silicon Economics[00:17:33] Alessio: Um, let's talk maybe about what's not being invested in, like maybe some interesting ideas that you would see more people build or it, it seems in a way, you know, as ycs getting more popular, it's like access getting more popular.[00:17:47] There's a startup school path that a lot of founders take and they know what's hot in the VC circles and they know what gets funded. Uh, and there's maybe not as much risk appetite for. Things outside of that. 
Um, I'm curious if you feel [00:18:00] like that's true and what are maybe, uh, some of the areas, uh, that you think are under discussed?[00:18:06] Martin Casado: I mean, I actually think that we've taken our eye off the ball in a lot of like, just traditional, you know, software companies. Um, so like, I mean. You know, I think right now there's almost a barbell, like you're like the hot thing on X, you're deep tech.[00:18:21] swyx: Mm-hmm.[00:18:22] Martin Casado: Right. But I, you know, I feel like there's just kind of a long, you know, list of like good.[00:18:28] Good companies that will be around for a long time in very large markets. Say you're building a database, you know, say you're building, um, you know, kind of monitoring or logging or tooling or whatever. There's some good companies out there right now, but like, they have a really hard time getting, um, the attention of investors.[00:18:43] And it's almost become a meme, right? Which is like, if you're not basically growing from zero to a hundred in a year, you're not interesting, which is just, is the silliest thing to say. I mean, think of yourself as like an introvert person, like, like your personal money, right? Mm-hmm. So. Your personal money, will you put it in the stock market at 7% or you put it in this company growing five x in a very large [00:19:00] market?[00:19:00] Of course you can put it in the company five x. So it's just like we say these stupid things, like if you're not going from zero to a hundred, but like those, like who knows what the margins of those are mean. Clearly these are good investments. True for anybody, right? True. Like our LPs want whatever.[00:19:12] Three x net over, you know, the life cycle of a fund, right? So a, a company in a big market growing five X is a great investment. We'd, everybody would be happy with these returns, but we've got this kind of mania on these, these strong growths. And so I would say that that's probably the most underinvested sector.[00:19:28] Right now.[00:19:29] swyx: Boring software, boring enterprise software.[00:19:31] Martin Casado: Traditional. Really good company.[00:19:33] swyx: No, no AI here.[00:19:34] Martin Casado: No. Like boring. Well, well, the AI of course is pulling them into use cases. Yeah, but that's not what they're, they're not on the token path, right? Yeah. Let's just say that like they're software, but they're not on the token path.[00:19:41] Like these are like they're great investments from any definition except for like random VC on Twitter saying VC on x, saying like, it's not growing fast enough. What do you[00:19:52] Sarah Wang: think? Yeah, maybe I'll answer a slightly different. Question, but adjacent to what you asked, um, which is maybe an area that we're not, uh, investing [00:20:00] right now that I think is a question and we're spending a lot of time in regardless of whether we pull the trigger or not.[00:20:05] Um, and it would probably be on the hardware side, actually. Robotics, right? And the robotics side. Robotics. Right. Which is, it's, I don't wanna say that it's not getting funding ‘cause it's clearly, uh, it's, it's sort of non-consensus to almost not invest in robotics at this point. But, um, we spent a lot of time in that space and I think for us, we just haven't seen the chat GPT moment.[00:20:22] Happen on the hardware side. Um, and the funding going into it feels like it's already. Taking that for granted.[00:20:30] Martin Casado: Yeah. Yeah. 
But we also went through the drone, you know, um, there's a zip line right, right out there. What's that? Oh yeah, there's a zip line. Yeah. What the drone, what the AV. And like one of the takeaways is when it comes to hardware, um, most companies will end up verticalizing.[00:20:46] Like if you're. If you're investing in a robot company for an A for agriculture, you're investing in an ag company. 'cause that's the competition and that's surprising. And that's supply chain. And if you're doing it for mining, that's mining. And so the AD team does a lot of that type of stuff 'cause they actually set up to [00:21:00] diligence that type of work.[00:21:01] But for like horizontal technology investing, there's very little when it comes to robots just because it's so fit for, for purpose. And so we kinda like to look at software. Solutions or horizontal solutions like Applied Intuition. Clearly from the AV wave, DeepMap, clearly from the AV wave, I would say Scale AI was actually a horizontal one for That's fair, you know, for robotics early on.[00:21:23] And so that sort of thing we're very, very interested. But the actual like robot interacting with the world is probably better for different team. Agree.[00:21:30] Alessio: Yeah, I'm curious who these teams are supposed to be that invest in them. I feel like everybody's like, yeah, robotics, it's important and like people should invest in it.[00:21:38] But then when you look at like the numbers, like the capital requirements early on versus like the moment of, okay, this is actually gonna work. Let's keep investing. That seems really hard to predict in a way that is not,[00:21:49] Martin Casado: I think co, CO two, kla, gc, I mean these are all invested in in hardware companies. He just, you know, and [00:22:00] listen, I mean, it could work this time for sure.[00:22:01] Right? I mean if Elon's doing it, he's like, right. Just, just the fact that Elon's doing it means that there's gonna be a lot of capital and a lot of attempts for a long period of time. So that alone maybe suggests that we should just be investing in robotics just 'cause you have this North star who's Elon with a humanoid and that's gonna like basically willing into being an industry.[00:22:17] Um, but we've just historically found like. We're a huge believer that this is gonna happen. We just don't feel like we're in a good position to diligence these things. 'cause again, robotics companies tend to be vertical. You really have to understand the market they're being sold into. Like that's like that competitive equilibrium with a human being is what's important.[00:22:34] It's not like the core tech and like we're kind of more horizontal core tech type investors. And this is Sarah and I. Yeah, the AD team is different. They can actually do these types of things.[00:22:42] swyx: Uh, just to clarify, AD stands for[00:22:44] Martin Casado: American Dynamism.[00:22:45] swyx: Alright. Okay. Yeah, yeah, yeah. Uh, I actually, I do have a related question that, first of all, I wanna acknowledge also just on the, on the chip side.[00:22:51] Yeah. I, I recall a podcast that where you were on, i, I, I think it was the a CC podcast, uh, about two or three years ago where you, where you suddenly said [00:23:00] something, which really stuck in my head about how at some point, at some point kind of scale it makes sense to. Build a custom ASIC Yes. For per run.[00:23:07] Martin Casado: Yes.[00:23:07] It's crazy.
Yeah.[00:23:09] swyx: We're here and I think you, you estimated 500 billion, uh, something.[00:23:12] Martin Casado: No, no, no. A billion, a billion dollar training run. At a $1 billion training run, it makes sense to actually do a custom ASIC if you can do it in time. The question now is timelines. Yeah, but not money because just, just, just rough math.[00:23:22] If it's a billion dollar training run. Then the inference for that model has to be over a billion, otherwise it won't be solvent. So let's assume it's, if you could save 20%, which you could save much more than that with an ASIC, 20%, that's $200 million. You can tape out a chip for $200 million. Right? So now you can literally like justify economically, not timeline wise.[00:23:41] That's a different issue. An ASIC per model, which[00:23:44] swyx: is because that, that's how much we leave on the table every single time. We, we, we do like generic Nvidia.[00:23:48] Martin Casado: Exactly. Exactly. No, it, it is actually much more than that. You could probably get, you know, a factor of two, which would be 500 million.[00:23:54] swyx: Typical MFU would be like 50%.[00:23:55] Yeah, yeah. And that's good.[00:23:57] Martin Casado: Exactly. Yeah. Hundred[00:23:57] swyx: percent. Um, so, so, yeah, and I mean, and I [00:24:00] just wanna acknowledge like, here we are in, in, in 2025 and OpenAI is confirming, like, Broadcom and all the other like custom silicon deals, which is incredible. I, I think that, uh, you know, speaking about AD, there's, there's a really like interesting tie in that obviously you guys are hit on, which is like these sort, this sort of like America first movement or like sort of re industrialized here.[00:24:17] Yeah. Uh, move TSMC here, if that's possible. Um, how much overlap is there from AD[00:24:23] Martin Casado: Yeah.[00:24:23] swyx: To, I guess, growth and, uh, investing in particularly like, you know, US AI companies that are strongly bounded by their compute.[00:24:32] Martin Casado: Yeah. Yeah. So I mean, I, I would view, I would view AD as more as a market segmentation than like a mission, right?[00:24:37] So the market segmentation is, it has kind of regulatory compliance issues or government, you know, sale or it deals with like hardware. I mean, they're just set up to, to, to, to, to. To diligence those types of companies. So it's a more of a market segmentation thing. I would say the entire firm. You know, which has, since its inception, you know, had geographical biases, right?[00:24:58] I mean, for the longest time we're like, you [00:25:00] know, Bay Area is gonna be like, great, where the majority of the dollars go. Yeah. And, and listen, there, there's actually a lot of compounding effects for having a geographic bias. Right. You know, everybody's in the same place. You've got an ecosystem, you're there, you've got presence, you've got a network.[00:25:12] Um, and, uh, I mean, I would say the Bay area's very much back. You know, like I, I remember during pre COVID, like it was like almost Crypto had kind of. Pulled startups away. Miami from the Bay Area. Miami, yeah. Yeah. New York was, you know, because it's so close to finance, came up like Los Angeles had a moment 'cause it was so close to consumer, but now it's kind of come back here.[00:25:29] And so I would say, you know, we tend to be very Bay Area focused historically, even though of course we've invested all over the world.
And then I would say like, if you take the ring out, you know, one more, it's gonna be the US of course, because we know it very well. And then one more is gonna be getting us and its allies and Yeah.[00:25:44] And it goes from there.[00:25:45] Sarah Wang: Yeah,[00:25:45] Martin Casado: sorry.[00:25:46] Sarah Wang: No, no. I agree. I think from a, but I think from the intern that that's sort of like where the companies are headquartered. Maybe your questions on supply chain and customer base. Uh, I, I would say our customers are, are, our companies are fairly international from that perspective.[00:25:59] Like they're selling [00:26:00] globally, right? They have global supply chains in some cases.[00:26:03] Martin Casado: I would say also the stickiness is very different.[00:26:05] Sarah Wang: Yeah.[00:26:05] Martin Casado: Historically between venture and growth, like there's so much company building in venture, so much so like hiring the next PM. Introducing the customer, like all of that stuff.[00:26:15] Like of course we're just gonna be stronger where we have our network and we've been doing business for 20 years. I've been in the Bay Area for 25 years, so clearly I'm just more effective here than I would be somewhere else. Um, where I think, I think for some of the later stage rounds, the companies don't need that much help.[00:26:30] They're already kind of pretty mature historically, so like they can kind of be everywhere. So there's kind of less of that stickiness. This is different in the AI time. I mean, Sarah is now the, uh, chief of staff of like half the AI companies in, uh, in the Bay Area right now. She's like, ops Ninja Biz, Devrel, BizOps.[00:26:48] swyx: Are, are you, are you finding much AI automation in your work? Like what, what is your stack.[00:26:53] Sarah Wang: Oh my, in my personal stack.[00:26:54] swyx: I mean, because like, uh, by the way, it's the, the, the reason for this is it is triggering, uh, yeah. We, like, I'm hiring [00:27:00] ops, ops people. Um, a lot of ponders I know are also hiring ops people and I'm just, you know, it's opportunity Since you're, you're also like basically helping out with ops with a lot of companies.[00:27:09] What are people doing these days? Because it's still very manual as far as I can tell.[00:27:13] Sarah Wang: Hmm. Yeah. I think the things that we help with are pretty network based, um, in that. It's sort of like, Hey, how do do I shortcut this process? Well, let's connect you to the right person. So there's not quite an AI workflow for that.[00:27:26] I will say as a growth investor, Claude Cowork is pretty interesting. Yeah. Like for the first time, you can actually get one shot data analysis. Right. Which, you know, if you're gonna do a customer database, analyze a cohort retention, right? That's just stuff that you had to do by hand before. And our team, the other, it was like midnight and the three of us were playing with Claude Cowork.[00:27:47] We gave it a raw file. Boom. Perfectly accurate. We checked the numbers. It was amazing. That was my like, aha moment. That sounds so boring. But you know, that's, that's the kind of thing that a growth investor is like, [00:28:00] you know, slaving away on late at night. Um, done in a few seconds.[00:28:03] swyx: Yeah. You gotta wonder what the whole, like, philanthropic labs, which is like their new sort of products studio.[00:28:10] Yeah. What would that be worth as an independent, uh, startup? 
You know, like a[00:28:14] Martin Casado: lot.[00:28:14] Sarah Wang: Yeah, true.[00:28:16] swyx: Yeah. You[00:28:16] Martin Casado: gotta hand it to them. They've been executing incredibly well.[00:28:19] swyx: Yeah. I, I mean, to me, like, you know, Anthropic, like building on Claude Code, I think, uh, it makes sense to me the, the real. Um, pedal to the metal, whatever the, the, the phrase is, is when they start coming after consumer with, uh, against OpenAI and like that is like red alert at OpenAI.[00:28:35] Oh, I[00:28:35] Martin Casado: think they've been pretty clear. They're enterprise focused.[00:28:37] swyx: They have been, but like they've been free. Here's[00:28:40] Martin Casado: care publicly,[00:28:40] swyx: it's enterprise focused. It's coding. Right. Yeah.[00:28:43] AI Labs vs Startups: Disruption, Undercutting & the Innovator's Dilemma[00:28:43] swyx: And then, and, but here's Claude, Claude Cowork, and, and here's like, well, we, uh, they, apparently they're running Instagram ads for Claude.[00:28:50] I, on, you know, for, for people on, I get them all the time. Right. And so, like,[00:28:54] Martin Casado: uh,[00:28:54] swyx: it, it's kind of like this, the disruption thing of, uh, you know. OpenAI has been doing, [00:29:00] consumer been doing the, just pursuing general intelligence in every mo modality, and here's Anthropic that only focused on this thing, but now they're sort of undercutting and doing the whole innovator's dilemma thing on like everything else.[00:29:11] Martin Casado: It's very[00:29:11] swyx: interesting.[00:29:12] Martin Casado: Yeah, I mean there's, there's a very open que so for me there's like, do you know that meme where there's like the guy in the path and there's like a path this way? There's a path this way. Like one which way Western man. Yeah. Yeah.[00:29:23] Two Futures for AI: Infinite Market vs AGI Oligopoly[00:29:23] Martin Casado: And for me, like, like all the entire industry kind of like hinges on like two potential futures.[00:29:29] So in, in one potential future, um, the market is infinitely large. There's perverse economies of scale. 'cause as soon as you put a model out there, like it kind of sublimates and all the other models catch up and like, it's just like software's being rewritten and fractured all over the place and there's tons of upside and it just grows.[00:29:48] And then there's another path which is like, well. Maybe these models actually generalize really well, and all you have to do is train them with three times more money. That's all you have to [00:30:00] do, and it'll just consume everything beyond it. And if that's the case, like you end up with basically an oligopoly for everything, like, you know mm-hmm.[00:30:06] Because they're perfectly general and like, so this would be like the, the AGI path would be like, these are perfectly general. They can do everything. And this one is like, this is actually normal software. The universe is complicated. You've got, and nobody knows the answer.[00:30:18] The Economics Reality Check: Gross Margins, Training Costs & Borrowing Against the Future[00:30:18] Martin Casado: My belief is if you actually look at the numbers of these companies, so generally if you look at the numbers of these companies, if you look at like the amount they're making and how much they, they spent training the last model, they're gross margin positive.[00:30:30] You're like, oh, that's really working. But if you look at like.
The current training that they're doing for the next model, their gross margin negative. So part of me thinks that a lot of ‘em are kind of borrowing against the future and that's gonna have to slow down. It's gonna catch up to them at some point in time, but we don't really know.[00:30:47] Sarah Wang: Yeah.[00:30:47] Martin Casado: Does that make sense? Like, I mean, it could be, it could be the case that the only reason this is working is ‘cause they can raise that next round and they can train that next model. ‘cause these models have such a short. Life. And so at some point in time, like, you know, they won't be able to [00:31:00] raise that next round for the next model and then things will kind of converge and fragment again.[00:31:03] But right now it's not.[00:31:04] Sarah Wang: Totally. I think the other, by the way, just, um, a meta point. I think the other lesson from the last three years is, and we talk about this all the time ‘cause we're on this. Twitter X bubble. Um, cool. But, you know, if you go back to, let's say March, 2024, that period, it felt like a, I think an open source model with an, like a, you know, benchmark leading capability was sort of launching on a daily basis at that point.[00:31:27] And, um, and so that, you know, that's one period. Suddenly it's sort of like open source takes over the world. There's gonna be a plethora. It's not an oligopoly, you know, if you fast, you know, if you, if you rewind time even before that GPT-4 was number one for. Nine months, 10 months. It's a long time. Right.[00:31:44] Um, and of course now we're in this era where it feels like an oligopoly, um, maybe some very steady state shifts and, and you know, it could look like this in the future too, but it just, it's so hard to call. And I think the thing that keeps, you know, us up at [00:32:00] night in, in a good way and bad way, is that the capability progress is actually not slowing down.[00:32:06] And so until that happens, right, like you don't know what's gonna look like.[00:32:09] Martin Casado: But I, I would, I would say for sure it's not converged, like for sure, like the systemic capital flows have not converged, meaning right now it's still borrowing against the future to subsidize growth currently, which you can do that for a period of time.[00:32:23] But, but you know, at the end, at some point the market will rationalize that and just nobody knows what that will look like.[00:32:29] Alessio: Yeah.[00:32:29] Martin Casado: Or, or like the drop in price of compute will, will, will save them. Who knows?[00:32:34] Alessio: Yeah. Yeah. I think the models need to ask them to, to specific tasks. You know? It's like, okay, now Opus 4.5 might be a GI at some specific task, and now you can like depreciate the model over a longer time.[00:32:45] I think now, now, right now there's like no old model.[00:32:47] Martin Casado: No, but let, but lemme just change that mental, that's, that used to be my mental model. Lemme just change it a little bit.[00:32:53] Capital as a Weapon vs Task Saturation: Where Real Enterprise Value Gets Built[00:32:53] Martin Casado: If you can raise three times, if you can raise more than the aggregate of anybody that uses your models, that doesn't even matter.[00:32:59] It doesn't [00:33:00] even matter. See what I'm saying? Like, yeah. Yeah. So, so I have an API Business. My API business is 60% margin, or 70% margin, or 80% margin is a high margin business. So I know what everybody is using. 
If I can raise more money than the aggregate of everybody that's using it, I will consume them whether I'm a GI or not.[00:33:14] And I will know if they're using it ‘cause they're using it. And like, unlike in the past where engineering stops me from doing that.[00:33:21] Alessio: Mm-hmm.[00:33:21] Martin Casado: It is very straightforward. You just train. So I also thought it was kind of like, you must ask the code a GI, general, general, general. But I think there's also just a possibility that the, that the capital markets will just give them the, the, the ammunition to just go after everybody on top of ‘em.[00:33:36] Sarah Wang: I, I do wonder though, to your point, um, if there's a certain task that. Getting marginally better isn't actually that much better. Like we've asked them to it, to, you know, we can call it a GI or whatever, you know, actually, Ali Goi talks about this, like we're already at a GI for a lot of functions in the enterprise.[00:33:50] Um. That's probably those for those tasks, you probably could build very specific companies that focus on just getting as much value out of that task that isn't [00:34:00] coming from the model itself. There's probably a rich enterprise business to be built there. I mean, could be wrong on that, but there's a lot of interesting examples.[00:34:08] So, right, if you're looking the legal profession or, or whatnot, and maybe that's not a great one ‘cause the models are getting better on that front too, but just something where it's a bit saturated, then the value comes from. Services. It comes from implementation, right? It comes from all these things that actually make it useful to the end customer.[00:34:24] Martin Casado: Sorry, what am I, one more thing I think is, is underused in all of this is like, to what extent every task is a GI complete.[00:34:31] Sarah Wang: Mm-hmm.[00:34:32] Martin Casado: Yeah. I code every day. It's so fun.[00:34:35] Sarah Wang: That's a core question. Yeah.[00:34:36] Martin Casado: And like. When I'm talking to these models, it's not just code. I mean, it's everything, right? Like I, you know, like it's,[00:34:43] swyx: it's healthcare.[00:34:44] It's,[00:34:44] Martin Casado: I mean, it's[00:34:44] swyx: Mele,[00:34:45] Martin Casado: but it's every, it is exactly that. Like, yeah, that's[00:34:47] Sarah Wang: great support. Yeah.[00:34:48] Martin Casado: It's everything. Like I'm asking these models to, yeah, to understand compliance. I'm asking these models to go search the web. I'm asking these models to talk about things I know in the history, like it's having a full conversation with me while I, I engineer, and so it could be [00:35:00] the case that like, mm-hmm.[00:35:01] The most a, you know, a GI complete, like I'm not an a GI guy. Like I think that's, you know, but like the most a GI complete model will is win independent of the task. And we don't know the answer to that one either.[00:35:11] swyx: Yeah.[00:35:12] Martin Casado: But it seems to me that like, listen, codex in my experience is for sure better than Opus 4.5 for coding.[00:35:18] Like it finds the hardest bugs that I work in with. Like, it is, you know. The smartest developers. I don't work on it. It's great. 
Um, but I think Opus 4.5 is actually very, it's got a great bedside manner and it really, and it, it really matters if you're building something very complex because like, it really, you know, like you're, you're, you're a partner and a brainstorming partner for somebody.[00:35:38] And I think we don't discuss enough how every task kind of has that quality.[00:35:42] swyx: Mm-hmm.[00:35:43] Martin Casado: And what does that mean to like capital investment and like frontier models and Submodels? Yeah.[00:35:47] Why "Coding Models" Keep Collapsing into Generalists (Reasoning vs Taste)[00:35:47] Martin Casado: Like what happened to all the special coding models? Like, none of 'em worked right. So[00:35:51] Alessio: some of them, they didn't even get released.[00:35:53] Magical[00:35:54] Martin Casado: Devrel. There's a whole, there's a whole host. We saw a bunch of them and like there's this whole theory that like, there could be, and [00:36:00] I think one of the conclusions is, is like there's no such thing as a coding model,[00:36:04] Alessio: you know?[00:36:04] Martin Casado: Like, that's not a thing. Like you're talking to another human being and it's, it's good at coding, but like it's gotta be good at everything.[00:36:10] swyx: Uh, minor disagree only because I, I'm pretty like, have pretty high confidence that basically OpenAI will always release a GPT-5 and a GPT-5 Codex. Like that's the code's. Yeah. The way I call it is one for raisin, one for Tiz. Um, and, and then like someone internal at OpenAI was like, yeah, that's a good way to frame it.[00:36:32] Martin Casado: That's so funny.[00:36:33] swyx: Uh, but maybe it, maybe it collapses down to reason and that's it. It's not like a hundred dimensions doesn't life. Yeah. It's two dimensions. Yeah, yeah, yeah, yeah. Like and exactly. Bedside manner versus coding. Yeah.[00:36:43] Martin Casado: Yeah.[00:36:44] swyx: It's, yeah.[00:36:46] Martin Casado: I, I think for, for any, it's hilarious. For any, for anybody listening to this for, for, for, I mean, for you, like when, when you're like coding or using these models for something like that.[00:36:52] Like actually just like be aware of how much of the interaction has nothing to do with coding and it just turns out to be a large portion of it. And so like, you're, I [00:37:00] think like, like the best SOTA-ish model. You know, it is going to remain very important no matter what the task is.[00:37:06] swyx: Yeah.[00:37:07] What He's Actually Coding: Gaussian Splats, Spark.js & 3D Scene Rendering Demos[00:37:07] swyx: Uh, speaking of coding, uh, I, I'm gonna be cheeky and ask like, what actually are you coding?[00:37:11] Because obviously you, you could code anything and you are obviously a busy investor and a manager of the good. Giant team. Um, what are you coding?[00:37:18] Martin Casado: I help, um, uh, Fei-Fei at World Labs. Uh, it's one of the investments and um, and they're building a foundation model that creates 3D scenes.[00:37:27] swyx: Yeah, we had it on the pod.[00:37:28] Yeah. Yeah,[00:37:28] Martin Casado: yeah. And so these 3D scenes are Gaussian splats, just by the way that kind of AI works. And so like, you can reconstruct a scene better with, with, with radiance fields than with meshes. 'cause like they don't really have topology. So, so they, they, they produce each.
But the actual industry support for Gaussian splats isn't great. [00:37:50] It's always been meshes; things like Unreal use meshes. So I work on an open-source library called Spark.js, which is a [00:38:00] JavaScript rendering layer for Gaussian splats. You need that support, and right now there's kind of a Three.js moment that's all meshes, and so it's become kind of the default in the Three.js ecosystem. [00:38:13] As part of that, to exercise the library, I just build a whole bunch of cool demos. So if you see me on X, you see all my demos and all the world building, but all of that is just to exercise this library that I work on, because it's actually a very tough algorithmics problem to scale a library that much. [00:38:29] And just so you know, this is ancient history now, but 30 years ago I paid for undergrad working on game engines in college in the late nineties. So it's a very old background, but I actually have a background in this, and a lot of it's fun. But the whole goal is just for this rendering library to... [00:38:47] Sarah Wang: Are you one of the most active contributors on their GitHub, Spark? [00:38:50] Martin Casado: Yes. [00:38:51] Sarah Wang: Yeah, yeah. [00:38:51] Martin Casado: There's only two of us there, so yes. No, so by the way, the primary [00:39:00] developer is a guy named Andres Quist, who's an absolute genius. He and I did our PhDs together, and we studied for our quals together. It was almost like hanging out with an old friend, you know? [00:39:09] So he's the core guy. I do mostly the side stuff; I run a venture fund. [00:39:14] swyx: It's amazing. Five years ago you would not have done any of this, and it brought you back. [00:39:19] Martin Casado: The activation energy used to be so high, because you had to learn all the framework b******t. Man, I f*****g used to hate that. Now I don't have to deal with that; I can focus on the algorithmics, I can focus on the scaling. [00:39:29] swyx: Yeah. [00:39:29] LLMs vs Spatial Intelligence + How to Value World Labs' 3D Foundation Model [00:39:29] swyx: And then I'll observe one irony, and then I'll ask a serious investor question. The irony is that Fei-Fei actually doesn't believe that LLMs can lead us to spatial intelligence, and here you are using LLMs to help achieve spatial intelligence. I see some disconnect in there. [00:39:45] Martin Casado: Yeah. So I think what she would say is LLMs are great to help with coding. [00:39:51] swyx: Yes. [00:39:51] Martin Casado: But that's very different than a model that actually provides... [00:39:56] swyx: spatial intelligence. [00:39:56] Martin Casado: Exactly. And listen, our brains clearly have [00:40:00] both: our brains clearly have a language reasoning section and they clearly have a spatial reasoning section. These are two pretty independent problems. [00:40:07] swyx: Okay.
And the one data point I recently had against that is the DeepMind IMO gold, where typically the answer is that this is where you start going down the neurosymbolic path, right? [00:40:21] Like one sort of abstract reasoning thing and one formal thing. And that's what DeepMind had in 2024 with AlphaProof and AlphaGeometry, and now they just use Deep Think and extended thinking tokens. And it's one model, and it's an LLM. [00:40:36] Martin Casado: Yeah, yeah. [00:40:37] swyx: And so that was my indication that maybe you don't need a separate system. [00:40:42] Martin Casado: Yeah. So let me step back. At the end of the day, these things are like nodes in a graph with weights on them, right? It can be modeled that way if you distill it down. But let me just talk about the two different substrates. Let me put you in a dark room, [00:41:00] a totally black room, and then let me just describe how you exit it: to your left there's a table, duck below this thing, right? The chances that you're not going to run into something are very low. Now let me turn on the light so you actually see, and you can judge distance, and you know how far away something is and where it is. Then you can do it, right? Language is not the right primitive to describe the universe, because it's not exact enough. That's all Fei-Fei is talking about. When it comes to spatial reasoning, you actually have to know that this is three feet away, that far away, that it is curved. You have to understand the actual movement through space. [00:41:40] swyx: Yeah. [00:41:40] Martin Casado: So listen, I do think these models are definitely converging as models, but there are different representations of the problems you're solving. One is language, which would be like describing to somebody what to do. And the other one is actually just showing them, and spatial reasoning is just showing them. [00:41:55] swyx: Yeah, right. Got it. The investor question was on World Labs: [00:42:00] how do I value something like this? What work do you do there? I'm just like, Fei-Fei is awesome, Justin's awesome, and so are the other two co-founders. Everyone's building cool tech. But what's the value of the tech? And this is the fundamental question. [00:42:16] Martin Casado: Well, let me just maybe give you a rough sketch on the diffusion models. I'd actually love to hear Sarah on this, because I'm the venture guy, and venture is always kind of wild-west-type stuff. [00:42:24] swyx: You paint a dream and she has to, like, actually... [00:42:28] Martin Casado: I'm gonna marry it to reality. So I'll give the venture view, and she can be like, okay, you little kid. Yeah. So these diffusion models literally create something for almost nothing, and something that the world has found to be very valuable in the past, in real markets, right? [00:42:45] Like a 2D image. I mean, that's been an entire market. People value them.
It takes a human being a long time to create, right? To turn me into a whatever, like an image, would cost a hundred bucks and an hour. The inference costs [00:43:00] us a hundredth of a penny, right? So we've seen this with speech, in very successful companies. We've seen this with 2D images. We've seen this with movies. Now think about 3D scenes. I mean, when's the next Grand Theft Auto coming out? It's been what, six... ten years? [00:43:14] Alessio: Yeah. [00:43:15] Martin Casado: How much would it cost to reproduce this room in 3D? If you hired somebody on Fiverr, in any sort of quality, probably $4,000 to $10,000. And if you had a professional, probably $30,000. And we know these scenes get used: they're using Unreal, they're using Blender, they're using them in movies and video games and all of it. So if you could generate the exact same thing from a 2D image for [00:43:36] less than a dollar, that's four or five orders of magnitude cheaper. You're bringing the marginal cost of something that's useful down by orders of magnitude, which historically has created very large companies. So that would be the venture kind of strategic dreaming map. [00:43:49] swyx: Yeah. And for listeners, you can do this yourself on your own phone with Marble. [00:43:55] Martin Casado: Yeah, Marble. [00:43:55] swyx: Or there are also many NeRF apps where you just go on your iPhone and do this. [00:43:59] Martin Casado: Yeah. [00:44:00] And in the case of Marble, though, what you do is you literally give it an image. Most NeRF apps, you kind of run around and take a whole bunch of pictures and then you reconstruct the scene. [00:44:08] swyx: Yeah. [00:44:08] Martin Casado: Things like Marble, and the whole generative 3D space, will just take a 2D image and reconstruct everything else. [00:44:16] swyx: Meaning it has to fill in... [00:44:18] Martin Casado: Stuff at the back of the table, under the table, the back of the room: the stuff the image doesn't see. So the generative stuff is very different from reconstruction, in that it fills in the things you can't see. [00:44:26] swyx: Yeah. Okay. [00:44:27] Martin Casado: All right. So now the... [00:44:28] Sarah Wang: No, no. I mean, I love that... [00:44:29] Martin Casado: the adult [00:44:29] Sarah Wang: perspective. Well, no, I was gonna say these are very much a tag team. We started this pod with that premise, and I think this is a perfect question to build on that further, because we truly are tag-teaming all of these together. [00:44:39] Investing in Model Labs, Media Rumors, and the Cursor Playbook (Margins & Going Down-Stack) [00:44:39] Sarah Wang: But I think every investment fundamentally starts with the same two premises. One is that, at this point in time, we actually believe there are N-of-1 founders for their particular craft, and that has to be demonstrated in their prior careers, right? So we're not investing in every, you know, now the term is neo-lab, but we're not investing in every foundation model company, every founder trying to build a foundation model. Contrary to popular opinion, we're
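For scale, the order-of-magnitude claim in the exchange above works out roughly as follows. The dollar figures are the ones quoted in conversation, not measured costs; this is just a quick sanity check:

```python
import math

# Quoted figures: hiring out a 3D reconstruction of a room vs. generating it.
manual_cost_low, manual_cost_high = 4_000, 30_000   # Fiverr-quality to professional, USD
generated_cost = 1.0                                 # "less than a dollar" of inference

for manual in (manual_cost_low, manual_cost_high):
    ratio = manual / generated_cost
    print(f"${manual:>6,} vs ${generated_cost:.2f}  ->  {ratio:,.0f}x cheaper "
          f"(~{math.log10(ratio):.1f} orders of magnitude)")

# Output:
# $ 4,000 vs $1.00  ->  4,000x cheaper (~3.6 orders of magnitude)
# $30,000 vs $1.00  ->  30,000x cheaper (~4.5 orders of magnitude)
```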

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

From rewriting Google's search stack in the early 2000s to reviving sparse trillion-parameter models and co-designing TPUs with frontier ML research, Jeff Dean has quietly shaped nearly every layer of the modern AI stack. As Chief AI Scientist at Google and a driving force behind Gemini, Jeff has lived through multiple scaling revolutions, from CPUs and sharded indices to multimodal models that reason across text, video, and code.

Jeff joins us to unpack what it really means to "own the Pareto frontier," why distillation is the engine behind every Flash model breakthrough, how energy (in picojoules), not FLOPs, is becoming the true bottleneck, what it was like leading the charge to unify all of Google's AI teams, and why the next leap won't come from bigger context windows alone, but from systems that give the illusion of attending to trillions of tokens.

We discuss:

* Jeff's early neural net thesis in 1990: parallel training before it was cool, why he believed scaling would win decades early, and the "bigger model, more data, better results" mantra that held for 15 years
* The evolution of Google Search: sharding, moving the entire index into memory in 2001, softening query semantics pre-LLMs, and why retrieval pipelines already resemble modern LLM systems
* Pareto frontier strategy: why you need both frontier "Pro" models and low-latency "Flash" models, and how distillation lets smaller models surpass prior generations
* Distillation deep dive: ensembles → compression → logits as soft supervision, and why you need the biggest model to make the smallest one good
* Latency as a first-class objective: why 10–50x lower latency changes UX entirely, and how future reasoning workloads will demand 10,000 tokens/sec
* Energy-based thinking: picojoules per bit, why moving data costs 1000x more than a multiply, batching through the lens of energy, and speculative decoding as amortization
* TPU co-design: predicting ML workloads 2–6 years out, speculative hardware features, precision reduction, sparsity, and the constant feedback loop between model architecture and silicon
* Sparse models and "outrageously large" networks: trillions of parameters with 1–5% activation, and why sparsity was always the right abstraction
* Unified vs. specialized models: abandoning symbolic systems, why general multimodal models tend to dominate vertical silos, and when vertical fine-tuning still makes sense
* Long context and the illusion of scale: beyond needle-in-a-haystack benchmarks toward systems that narrow trillions of tokens to 117 relevant documents
* Personalized AI: attending to your emails, photos, and documents (with permission), and why retrieval + reasoning will unlock deeply personal assistants
* Coding agents: 50 AI interns, crisp specifications as a new core skill, and how ultra-low latency will reshape human–agent collaboration
* Why ideas still matter: transformers, sparsity, RL, hardware, systems — scaling wasn't blind; the pieces had to multiply together

Show Notes:

* Gemma 3 Paper
* Gemma 3
* Gemini 2.5 Report
* Jeff Dean's "Software Engineering Advice from Building Large-Scale Distributed Systems" Presentation (with Back of the Envelope Calculations)
* Latency Numbers Every Programmer Should Know by Jeff Dean
* The Jeff Dean Facts
* Jeff Dean Google Bio
* Jeff Dean on "Important AI Trends" @Stanford AI Club
* Jeff Dean & Noam Shazeer — 25 years at Google (Dwarkesh)

Jeff Dean
* LinkedIn: https://www.linkedin.com/in/jeff-dean-8b212555
* X: https://x.com/jeffdean

Google
* https://google.com
* https://deepmind.google

Full Video Episode

Timestamps

00:00:04 — Introduction: Alessio & Swyx welcome Jeff Dean, chief AI scientist at Google, to the Latent Space podcast
00:00:30 — Owning the Pareto Frontier & balancing frontier vs low-latency models
00:01:31 — Frontier models vs Flash models + role of distillation
00:03:52 — History of distillation and its original motivation
00:05:09 — Distillation's role in modern model scaling
00:07:02 — Model hierarchy (Flash, Pro, Ultra) and distillation sources
00:07:46 — Flash model economics & wide deployment
00:08:10 — Latency importance for complex tasks
00:09:19 — Saturation of some tasks and future frontier tasks
00:11:26 — On benchmarks, public vs internal
00:12:53 — Example long-context benchmarks & limitations
00:15:01 — Long-context goals: attending to trillions of tokens
00:16:26 — Realistic use cases beyond pure language
00:18:04 — Multimodal reasoning and non-text modalities
00:19:05 — Importance of vision & motion modalities
00:20:11 — Video understanding example (extracting structured info)
00:20:47 — Search ranking analogy for LLM retrieval
00:23:08 — LLM representations vs keyword search
00:24:06 — Early Google search evolution & in-memory index
00:26:47 — Design principles for scalable systems
00:28:55 — Real-time index updates & recrawl strategies
00:30:06 — Classic "Latency numbers every programmer should know"
00:32:09 — Cost of memory vs compute and energy emphasis
00:34:33 — TPUs & hardware trade-offs for serving models
00:35:57 — TPU design decisions & co-design with ML
00:38:06 — Adapting model architecture to hardware
00:39:50 — Alternatives: energy-based models, speculative decoding
00:42:21 — Open research directions: complex workflows, RL
00:44:56 — Non-verifiable RL domains & model evaluation
00:46:13 — Transition away from symbolic systems toward unified LLMs
00:47:59 — Unified models vs specialized ones
00:50:38 — Knowledge vs reasoning & retrieval + reasoning
00:52:24 — Vertical model specialization & modules
00:55:21 — Token count considerations for vertical domains
00:56:09 — Low resource languages & contextual learning
00:59:22 — Origins: Dean's early neural network work
01:10:07 — AI for coding & human–model interaction styles
01:15:52 — Importance of crisp specification for coding agents
01:19:23 — Prediction: personalized models & state retrieval
01:22:36 — Token-per-second targets (10k+) and reasoning throughput
01:23:20 — Episode conclusion and thanks

Transcript

Alessio Fanelli [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space. Shawn Wang [00:00:11]: Hello, hello. We're here in the studio with Jeff Dean, chief AI scientist at Google. Welcome. Thanks for having me. It's a bit surreal to have you in the studio. I've watched so many of your talks, and obviously your career has been super legendary. So, I mean, congrats. I think the first thing must be said: congrats on owning the Pareto frontier. Jeff Dean [00:00:30]: Thank you, thank you. Pareto frontiers are good. It's good to be out there. Shawn Wang [00:00:34]: Yeah, I mean, I think it's a combination of both. You have to own the Pareto frontier. You have to have frontier capability, but also efficiency, and then offer that range of models that people like to use. And, you know, some part of this was started because of your hardware work. Some part of that is your model work, and I'm sure there's lots of secret sauce that you guys have worked on cumulatively. But it's really impressive to see it all come together at this level. Jeff Dean [00:01:04]: Yeah, yeah. I mean, I think, as you say, it's not just one thing. It's a whole bunch of things up and down the stack. And all of those really combine to help make us able to make highly capable large models, as well as software techniques to get those large-model capabilities into much smaller, lighter-weight models that are much more cost effective and lower latency, but still quite capable for their size. Yeah. Alessio Fanelli [00:01:31]: How much pressure do you have on, like, having the lower bound of the Pareto frontier, too? I think the new labs are always trying to push the top performance frontier because they need to raise more money and all of that. And you guys have billions of users. And I think initially, when you worked on the TPU, you were thinking about, you know, if everybody that used Google used the voice model for, like, three minutes a day, you would need to double your CPU count. What's that discussion today at Google? How do you prioritize frontier versus, like, "we have to do this; we actually need to deploy it if we build it"? Jeff Dean [00:02:03]: Yeah, I mean, I think we always want to have models that are at the frontier or pushing the frontier, because I think that's where you see what capabilities now exist that didn't exist in the slightly less capable last year's version or six-months-ago version. At the same time, you know, we know those are going to be really useful for a bunch of use cases, but they're going to be a bit slower and a bit more expensive than people might like for a bunch of other, broader uses. So I think what we want to do is always have kind of a highly capable, affordable model that enables a whole bunch of lower-latency use cases (people can use them for agentic coding much more readily) and then have the high-end frontier model that is really useful for deep reasoning, solving really complicated math problems, those kinds of things. And it's not that one or the other is useful. They're both useful. So I think we'd like to do both.
And also, you know, through distillation, which is a key technique for making the smaller models more capable, you have to have the frontier model in order to then distill it into your smaller model. So it's not an either-or choice. You sort of need that in order to actually get a highly capable, more modest-size model. Yeah. Alessio Fanelli [00:03:24]: I mean, you and Geoffrey came up with distillation in 2014. Jeff Dean [00:03:28]: Don't forget Oriol Vinyals as well. Yeah, yeah. Alessio Fanelli [00:03:30]: A long time ago. But I'm curious how you think about the cycle of these ideas, even, you know, sparse models: how do you reevaluate them? How do you think about what is worth revisiting in the next generation of models? You've worked on so many ideas that end up being influential, but in the moment, they might not feel that way necessarily. Yeah. Jeff Dean [00:03:52]: I mean, I think distillation was originally motivated because we had a very large image data set at the time, you know, 300 million images that we could train on. And we were seeing that if you create specialists for different subsets of those image categories (this one's going to be really good at mammals, and this one's going to be really good at indoor room scenes or whatever), and you cluster those categories and train on an enriched stream of data after you do pre-training on a much broader set of images, you get much better performance. You can then treat that whole set of maybe 50 models you've trained as a large ensemble, but that's not a very practical thing to serve, right? So distillation really came about from the idea of: okay, what if we want to actually serve that, and train all these independent expert models and then squish them into something that actually fits in a form factor you can actually serve? And that's not that different from what we're doing today. Often today, instead of having an ensemble of 50 models, we're having a much larger-scale model that we then distill into a much smaller-scale model. Shawn Wang [00:05:09]: Yeah. A part of me also wonders if distillation also has a story with the RL revolution. So let me try to articulate what I mean by that: RL basically spikes models in a certain part of the distribution. But it might be lossy in other areas; it's kind of an uneven technique. And you can probably distill it back, and I think the general dream is to be able to advance capabilities without regressing on anything else. And that whole capability merging without loss, I feel like some part of that should be a distillation process, but I can't quite articulate it. I haven't seen many papers about it. Jeff Dean [00:06:01]: Yeah, I mean, I tend to think one of the key advantages of distillation is that you can have a much smaller model and a very large training data set, and you can get utility out of making many passes over that data set, because you're now getting the logits from the much larger model in order to coax the right behavior out of the smaller model, behavior that you wouldn't otherwise get with just the hard labels. And so, you know, I think that's what we've observed.
Is you can get, you know, very close to your largest model performance with distillation approaches. And that seems to be, you know, a nice sweet spot for a lot of people because it enables us to kind of, for multiple Gemini generations now, we've been able to make the sort of flash version of the next generation as good or even substantially better than the previous generations pro. And I think we're going to keep trying to do that because that seems like a good trend to follow.Shawn Wang [00:07:02]: So, Dara asked, so it was the original map was Flash Pro and Ultra. Are you just sitting on Ultra and distilling from that? Is that like the mother load?Jeff Dean [00:07:12]: I mean, we have a lot of different kinds of models. Some are internal ones that are not necessarily meant to be released or served. Some are, you know, our pro scale model and we can distill from that as well into our Flash scale model. So I think, you know, it's an important set of capabilities to have and also inference time scaling. It can also be a useful thing to improve the capabilities of the model.Shawn Wang [00:07:35]: And yeah, yeah, cool. Yeah. And obviously, I think the economy of Flash is what led to the total dominance. I think the latest number is like 50 trillion tokens. I don't know. I mean, obviously, it's changing every day.Jeff Dean [00:07:46]: Yeah, yeah. But, you know, by market share, hopefully up.Shawn Wang [00:07:50]: No, I mean, there's no I mean, there's just the economics wise, like because Flash is so economical, like you can use it for everything. Like it's in Gmail now. It's in YouTube. Like it's yeah. It's in everything.Jeff Dean [00:08:02]: We're using it more in our search products of various AI mode reviews.Shawn Wang [00:08:05]: Oh, my God. Flash past the AI mode. Oh, my God. Yeah, that's yeah, I didn't even think about that.Jeff Dean [00:08:10]: I mean, I think one of the things that is quite nice about the Flash model is not only is it more affordable, it's also a lower latency. And I think latency is actually a pretty important characteristic for these models because we're going to want models to do much more complicated things that are going to involve, you know, generating many more tokens from when you ask the model to do so. So, you know, if you're going to ask the model to do something until it actually finishes what you ask it to do, because you're going to ask now, not just write me a for loop, but like write me a whole software package to do X or Y or Z. And so having low latency systems that can do that seems really important. And Flash is one direction, one way of doing that. You know, obviously our hardware platforms enable a bunch of interesting aspects of our, you know, serving stack as well, like TPUs, the interconnect between. Chips on the TPUs is actually quite, quite high performance and quite amenable to, for example, long context kind of attention operations, you know, having sparse models with lots of experts. These kinds of things really, really matter a lot in terms of how do you make them servable at scale.Alessio Fanelli [00:09:19]: Yeah. Does it feel like there's some breaking point for like the proto Flash distillation, kind of like one generation delayed? I almost think about almost like the capability as a. In certain tasks, like the pro model today is a saturated, some sort of task. So next generation, that same task will be saturated at the Flash price point. 
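To make the soft-label idea Dean describes above concrete, here is a minimal distillation-loss sketch in the spirit of the Hinton–Vinyals–Dean formulation: the student is trained against the teacher's temperature-scaled logits as well as the hard labels. The temperature, loss weighting, and model names are illustrative assumptions, not Gemini specifics:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) soft-label KL against the teacher's temperature-scaled
    distribution and (b) ordinary cross-entropy against the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 so gradients stay comparable across temperatures.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Usage sketch: teacher is the big frontier-scale model run without gradients,
# student is the small "Flash-sized" model being trained.
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, batch_labels)
```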
And I think for most of the things that people use models for at some point, the Flash model in two generation will be able to do basically everything. And how do you make it economical to like keep pushing the pro frontier when a lot of the population will be okay with the Flash model? I'm curious how you think about that.Jeff Dean [00:09:59]: I mean, I think that's true. If your distribution of what people are asking people, the models to do is stationary, right? But I think what often happens is as the models become more capable, people ask them to do more, right? So, I mean, I think this happens in my own usage. Like I used to try our models a year ago for some sort of coding task, and it was okay at some simpler things, but wouldn't do work very well for more complicated things. And since then, we've improved dramatically on the more complicated coding tasks. And now I'll ask it to do much more complicated things. And I think that's true, not just of coding, but of, you know, now, you know, can you analyze all the, you know, renewable energy deployments in the world and give me a report on solar panel deployment or whatever. That's a very complicated, you know, more complicated task than people would have asked a year ago. And so you are going to want more capable models to push the frontier in the absence of what people ask the models to do. And that also then gives us. Insight into, okay, where does the, where do things break down? How can we improve the model in these, these particular areas, uh, in order to sort of, um, make the next generation even better.Alessio Fanelli [00:11:11]: Yeah. Are there any benchmarks or like test sets they use internally? Because it's almost like the same benchmarks get reported every time. And it's like, all right, it's like 99 instead of 97. Like, how do you have to keep pushing the team internally to it? Or like, this is what we're building towards. Yeah.Jeff Dean [00:11:26]: I mean, I think. Benchmarks, particularly external ones that are publicly available. Have their utility, but they often kind of have a lifespan of utility where they're introduced and maybe they're quite hard for current models. You know, I, I like to think of the best kinds of benchmarks are ones where the initial scores are like 10 to 20 or 30%, maybe, but not higher. And then you can sort of work on improving that capability for, uh, whatever it is, the benchmark is trying to assess and get it up to like 80, 90%, whatever. I, I think once it hits kind of 95% or something, you get very diminishing returns from really focusing on that benchmark, cuz it's sort of, it's either the case that you've now achieved that capability, or there's also the issue of leakage in public data or very related kind of data being, being in your training data. Um, so we have a bunch of held out internal benchmarks that we really look at where we know that wasn't represented in the training data at all. There are capabilities that we want the model to have. Um, yeah. Yeah. Um, that it doesn't have now, and then we can work on, you know, assessing, you know, how do we make the model better at these kinds of things? Is it, we need different kind of data to train on that's more specialized for this particular kind of task. 
Do we need a bunch of architectural improvements, or some sort of model capability improvements? What would help make that better? Shawn Wang [00:12:53]: Is there such an example, a benchmark that inspired an architectural improvement? I'm just jumping on that because you just mentioned it. Jeff Dean [00:13:02]: I mean, I think some of the long-context capability of the Gemini models, which came first in 1.5, really was about looking at, okay, we want to have... Shawn Wang [00:13:15]: Immediately everyone jumped to completely green charts. Everyone had them. I was like, how did everyone crack this at the same time? Right. Yeah. Jeff Dean [00:13:23]: I mean, I think, as you say, the single needle-in-a-haystack benchmark is really saturated for at least context lengths up to 128K or something, and most models don't actually go much larger than 128K these days or so. We're trying to push the frontier of 1 million or 2 million context, which is good, because I think there are a lot of use cases where putting a thousand pages of text, or multiple hour-long videos, in the context and then actually being able to make use of that is useful. The areas left to explore there are fairly large. But the single needle-in-a-haystack benchmark is sort of saturated. So you really want more complicated, sort of multi-needle or more realistic "take all this content and produce this kind of answer from a long context" benchmarks that better assess what people really want to do with long context. Which is not just, you know, can you tell me the product number for this particular thing? Shawn Wang [00:14:31]: Yeah, it's retrieval. It's retrieval within machine learning. It's interesting, because the more meta level I'm trying to operate at here is: you have a benchmark, and you're like, okay, I see the architectural thing I need to do in order to go fix that. But should you do it? Because sometimes that's an inductive bias, basically. It's exactly the kind of thing Jason Wei, who used to work at Google, would say: you're going to win short term; longer term, I don't know if that's going to scale. You might have to undo that. Jeff Dean [00:15:01]: I mean, I like to sort of not focus on exactly what solution we're going to derive, but on what capability you would want. And I think we're very convinced that long context is useful, but it's way too short today. Right? I think what you would really want is: can I attend to the internet while I answer my question? But that's not going to happen; I don't think that's going to be solved by purely scaling the existing solutions, which are quadratic. So a million tokens kind of pushes what you can do. You're not going to do that for a billion tokens, let alone a trillion. But I think if you could give the illusion that you can attend to trillions of tokens, that would be amazing. You'd find all kinds of uses for that. You could attend to the internet. You could attend to the pixels of YouTube and the deeper representations that we can find, not just for a single video but across many videos. And on a personal Gemini level, you could attend to all of your personal state, with your permission.
So like your emails, your photos, your docs, your plane tickets you have. I think that would be really, really useful. And the question is, how do you get algorithmic improvements and system level improvements that get you to something where you actually can attend to trillions of tokens? Right. In a meaningful way. Yeah.Shawn Wang [00:16:26]: But by the way, I think I did some math and it's like, if you spoke all day, every day for eight hours a day, you only generate a maximum of like a hundred K tokens, which like very comfortably fits.Jeff Dean [00:16:38]: Right. But if you then say, okay, I want to be able to understand everything people are putting on videos.Shawn Wang [00:16:46]: Well, also, I think that the classic example is you start going beyond language into like proteins and whatever else is extremely information dense. Yeah. Yeah.Jeff Dean [00:16:55]: I mean, I think one of the things about Gemini's multimodal aspects is we've always wanted it to be multimodal from the start. And so, you know, that sometimes to people means text and images and video sort of human-like and audio, audio, human-like modalities. But I think it's also really useful to have Gemini know about non-human modalities. Yeah. Like LIDAR sensor data from. Yes. Say, Waymo vehicles or. Like robots or, you know, various kinds of health modalities, x-rays and MRIs and imaging and genomics information. And I think there's probably hundreds of modalities of data where you'd like the model to be able to at least be exposed to the fact that this is an interesting modality and has certain meaning in the world. Where even if you haven't trained on all the LIDAR data or MRI data, you could have, because maybe that's not, you know, it doesn't make sense in terms of trade-offs of. You know, what you include in your main pre-training data mix, at least including a little bit of it is actually quite useful. Yeah. Because it sort of tempts the model that this is a thing.Shawn Wang [00:18:04]: Yeah. Do you believe, I mean, since we're on this topic and something I just get to ask you all the questions I always wanted to ask, which is fantastic. Like, are there some king modalities, like modalities that supersede all the other modalities? So a simple example was Vision can, on a pixel level, encode text. And DeepSeq had this DeepSeq CR paper that did that. Vision. And Vision has also been shown to maybe incorporate audio because you can do audio spectrograms and that's, that's also like a Vision capable thing. Like, so, so maybe Vision is just the king modality and like. Yeah.Jeff Dean [00:18:36]: I mean, Vision and Motion are quite important things, right? Motion. Well, like video as opposed to static images, because I mean, there's a reason evolution has evolved eyes like 23 independent ways, because it's such a useful capability for sensing the world around you, which is really what we want these models to be. So I think the only thing that we can be able to do is interpret the things we're seeing or the things we're paying attention to and then help us in using that information to do things. Yeah.Shawn Wang [00:19:05]: I think motion, you know, I still want to shout out, I think Gemini, still the only native video understanding model that's out there. So I use it for YouTube all the time. Nice.Jeff Dean [00:19:15]: Yeah. Yeah. I mean, it's actually, I think people kind of are not necessarily aware of what the Gemini models can actually do. Yeah. Like I have an example I've used in one of my talks. 
It was a YouTube highlight video of 18 memorable sports moments across the last 20 years or something. So it has Michael Jordan hitting some jump shot at the end of the Finals, some soccer goals, things like that. And you can literally just give it the video and say: can you please make me a table of what all these different events are, what the date is when they happened, and a short description? And so you now get an 18-row table of that information extracted from the video, which is not something most people think of as "turn a video into a SQL-like table." Alessio Fanelli [00:20:11]: Has there been any discussion inside of Google of, you mentioned attending to the whole internet, right? Google is almost built because a human cannot attend to the whole internet, and you need some sort of ranking to find what you need. Yep. That ranking is much different for an LLM, because you can expect a person to look at maybe the first five or six links in a Google search, versus for an LLM, should you expect to have 20 links that are highly relevant? How do you internally figure out, you know, how do we build the AI Mode that has maybe a much broader search span versus the more human one? Yeah. Jeff Dean [00:20:47]: I mean, I think even pre-language-model work, our ranking systems would be built to start with a giant number of web pages in our index, many of them not relevant. So you identify a subset of them that are relevant with very lightweight kinds of methods, and you're down to like 30,000 documents or something. And then you gradually refine that, applying more and more sophisticated algorithms and more and more sophisticated signals of various kinds, in order to get down to ultimately what you show, which is the final 10 results, or 10 results plus other kinds of information. And I think an LLM-based system is not going to be that dissimilar, right? You're going to attend to trillions of tokens, but you're going to want to identify the 30,000-ish documents, with maybe 30 million interesting tokens, and then how do you go from that to the 117 documents I really should be paying attention to in order to carry out the task the user has asked? And you can imagine systems where you have a lot of highly parallel processing to identify those initial 30,000 candidates, maybe with very lightweight kinds of models. Then you have some system that helps you narrow down from 30,000 to the 117, with maybe a little more sophisticated model or set of models. And then maybe the final model, the thing that looks at those 117 things, might be your most capable model. So I think it's going to be some system like that that really enables you to give the illusion of attending to trillions of tokens. Sort of the way Google Search gives you, not the illusion, you really are searching the internet, but you're finding a very small subset of things that are relevant.
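A schematic of the staged funnel Dean describes: cheap filters over a huge corpus, then progressively more expensive models over progressively fewer candidates. The stage sizes echo the numbers he quotes; the scoring functions are placeholders, not Google's actual pipeline:

```python
def staged_retrieval(query, corpus, cheap_score, rerank_score, answer_model):
    """Give the 'illusion' of attending to a huge corpus by narrowing it in stages.

    corpus       : millions/billions of documents (only touched by the cheap stage;
                   real systems use indexes rather than scoring everything)
    cheap_score  : very fast relevance signal (e.g. lexical match / tiny embedding model)
    rerank_score : mid-sized model scoring query-document pairs
    answer_model : the most capable model, which only sees the final handful of docs
    """
    # Stage 1: lightweight, highly parallel filter -> ~30,000 candidates
    coarse = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:30_000]
    # Stage 2: more sophisticated reranker -> ~117 documents
    focused = sorted(coarse, key=lambda d: rerank_score(query, d), reverse=True)[:117]
    # Stage 3: the frontier model reasons over only what survived the funnel
    return answer_model(query, focused)
```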
Shawn Wang [00:22:47]: Yeah. I often tell a lot of people who are not steeped in Google Search history that BERT was basically put inside of Google Search immediately, and that improved results a lot, right? I don't have any numbers off the top of my head, but I'm sure you guys do; those are obviously the most important numbers to Google. Yeah. Jeff Dean [00:23:08]: I mean, I think going to an LLM-based representation of text and words and so on enables you to get out of the explicit, hard notion of particular words having to be on the page, and really getting at the notion that the topic of this page, or this paragraph, is highly relevant to this query. Yeah. Shawn Wang [00:23:28]: I don't think people understand how much LLMs have taken over all these very high-traffic systems. Like, it's Google, it's YouTube. YouTube has this semantic ID thing where every token, every item in the vocab, is a YouTube video or something that predicts the video using a codebook, which is absurd to me at YouTube's size. Jeff Dean [00:23:50]: And then most recently Grok as well, for xAI. I mean, I'll call out that even before LLMs were used extensively in search, we put a lot of emphasis on softening the notion of what the user actually entered into the query. Shawn Wang [00:24:06]: So do you have a history of, like, what's the progression? Oh yeah. Jeff Dean [00:24:09]: I mean, I actually gave a talk at, I guess, the web search and data mining conference in 2009, where, although we never actually published any papers about the origins of Google Search, we went through sort of four or five or six generations of redesigning the search and retrieval system, from about 1999 through 2004 or 2005. And that talk is really about that evolution. And one of the things that really happened in 2001 was we were working to scale the system in multiple dimensions. One is we wanted to make our index bigger, so we could retrieve from a larger index, which always helps your quality in general, because if you don't have the page in your index, you're not going to do well. And then we also needed to scale our capacity, because our traffic was growing quite extensively. And so we had, you know, a sharded system where you have more and more shards as the index grows: you have like 30 shards, and then if you want to double the index size, you make 60 shards, so that you can bound the latency with which you respond to any particular user query. And then as traffic grows, you add more and more replicas of each of those. And so we eventually did the math and realized that in a data center where we had, say, 60 shards and 20 copies of each shard, we now had 1200 machines with disks. And we did the math and we're like, hey, one copy of that index would actually fit in memory across 1200 machines. So in 2001, we put our entire index in memory, and what that enabled from a quality perspective was amazing. Before, you had to be really careful about how many different terms you looked at for a query, because every one of them would involve a disk seek on every one of the 60 shards, and as you make your index bigger, that becomes even more inefficient. But once you have the whole index in memory, it's totally fine to have 50 terms you throw into the query from the user's original three- or four-word query, because now you can add synonyms like restaurant and restaurants and cafe and, you know, things like that. Bistro and all these things.
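The 2001 in-memory index decision is a nice example of the back-of-the-envelope style Dean keeps returning to. A toy version of that arithmetic follows; the shard and replica counts are the ones he quotes, while the index size and per-machine RAM are made-up illustrative numbers, not real Google figures:

```python
# Back-of-envelope: when does it pay to hold the whole index in RAM?
shards = 60                 # index split 60 ways to bound per-query latency
replicas_per_shard = 20     # replicas added as query traffic grew
machines = shards * replicas_per_shard
print(machines)             # 1200 machines already running, each with disks

# Illustrative sizes only (not real Google numbers):
index_size_gb = 2_000       # total index
ram_per_machine_gb = 2      # typical early-2000s server
total_ram_gb = machines * ram_per_machine_gb
print(total_ram_gb >= index_size_gb)   # True: one full copy of the index fits in RAM

# Once the index is memory-resident, expanding a 3-word query to ~50 terms
# (synonyms like "restaurant", "cafe", "bistro") no longer costs a disk seek
# per term per shard.
```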
And you can suddenly start, uh, sort of really, uh, getting at the meaning of the word as opposed to the exact semantic form the user typed in. And that was, you know, 2001, very much pre LLM, but really it was about softening the, the strict definition of what the user typed in order to get at the meaning.Alessio Fanelli [00:26:47]: What are like principles that you use to like design the systems, especially when you have, I mean, in 2001, the internet is like. Doubling, tripling every year in size is not like, uh, you know, and I think today you kind of see that with LLMs too, where like every year the jumps in size and like capabilities are just so big. Are there just any, you know, principles that you use to like, think about this? Yeah.Jeff Dean [00:27:08]: I mean, I think, uh, you know, first, whenever you're designing a system, you want to understand what are the sort of design parameters that are going to be most important in designing that, you know? So, you know, how many queries per second do you need to handle? How big is the internet? How big is the index you need to handle? How much data do you need to keep for every document in the index? How are you going to look at it when you retrieve things? Um, what happens if traffic were to double or triple, you know, will that system work well? And I think a good design principle is you're going to want to design a system so that the most important characteristics could scale by like factors of five or 10, but probably not beyond that because often what happens is if you design a system for X. And something suddenly becomes a hundred X, that would enable a very different point in the design space that would not make sense at X. But all of a sudden at a hundred X makes total sense. So like going from a disk space index to a in memory index makes a lot of sense once you have enough traffic, because now you have enough replicas of the sort of state on disk that those machines now actually can hold, uh, you know, a full copy of the, uh, index and memory. Yeah. And that all of a sudden enabled. A completely different design that wouldn't have been practical before. Yeah. Um, so I'm, I'm a big fan of thinking through designs in your head, just kind of playing with the design space a little before you actually do a lot of writing of code. But, you know, as you said, in the early days of Google, we were growing the index, uh, quite extensively. We were growing the update rate of the index. So the update rate actually is the parameter that changed the most. Surprising. So it used to be once a month.Shawn Wang [00:28:55]: Yeah.Jeff Dean [00:28:56]: And then we went to a system that could update any particular page in like sub one minute. Okay.Shawn Wang [00:29:02]: Yeah. Because this is a competitive advantage, right?Jeff Dean [00:29:04]: Because all of a sudden news related queries, you know, if you're, if you've got last month's news index, it's not actually that useful for.Shawn Wang [00:29:11]: News is a special beast. Was there any, like you could have split it onto a separate system.Jeff Dean [00:29:15]: Well, we did. We launched a Google news product, but you also want news related queries that people type into the main index to also be sort of updated.Shawn Wang [00:29:23]: So, yeah, it's interesting. And then you have to like classify whether the page is, you have to decide which pages should be updated and what frequency. 
Oh yeah. Jeff Dean [00:29:30]: There's a whole system behind the scenes that's trying to decide update rates and the importance of pages. So even if the update rate seems low, you might still want to recrawl important pages quite often, because the likelihood they change might be low, but the value of having them updated is high. Shawn Wang [00:29:50]: Yeah. Well, this mention of latency and saving things to disk reminds me of one of your classics, which I have to bring up, which is "Latency Numbers Every Programmer Should Know." Was there just a general story behind that? Did you just write it down? Jeff Dean [00:30:06]: I mean, this has sort of eight or ten different kinds of metrics: how long does a cache miss take? How long does a branch mispredict take? How long does a reference to main memory take? How long does it take to send a packet from the US to the Netherlands, or something? Shawn Wang [00:30:21]: Why the Netherlands, by the way? Is that because of Chrome? Jeff Dean [00:30:25]: We had a data center in the Netherlands. So, I mean, I think this gets to the point of being able to do back-of-the-envelope calculations. These are sort of the raw ingredients of those, and you can use them to say: okay, if I need to design a system to do image search and thumbnailing of the result page, what would I do? I could pre-compute the image thumbnails, or I could try to thumbnail them on the fly from the larger images. What would that do? How much disk bandwidth do I need? How many disk seeks would I do? And you can actually do those thought experiments in 30 seconds or a minute with the basic numbers at your fingertips. And then as you build software using higher-level libraries, you kind of want to develop the same intuitions for how long it takes to look up something in this particular kind of... Shawn Wang [00:31:51]: ...which is a simple byte conversion. That's nothing interesting. I wonder, if you were to update your numbers... Jeff Dean [00:31:58]: I mean, I think it's really good to think about the calculations you're doing in a model, either for training or inference. [00:32:09] Often a good way to view that is: how much state will you need to bring in from memory, either on-chip SRAM, or HBM (the accelerator-attached memory), or DRAM, or over the network? And then how expensive is that data motion relative to the cost of, say, an actual multiply in the matrix multiply unit? And that cost is actually really, really low, right? Because depending on your precision, it's on the order of, I think, sub one picojoule. Shawn Wang [00:32:50]: Oh, okay. You measure it by energy. Yeah. Jeff Dean [00:32:52]: Yeah. I mean, it's all going to be about energy and how you make the most energy-efficient system. And then moving data from the SRAM on the other side of the chip, not even off-chip, but on the other side of the same chip, can be, you know, a thousand picojoules. And so all of a sudden, this is why your accelerators require batching. Because if you move, say, a parameter of a model from SRAM on the chip into the multiplier unit, that's going to cost you a thousand picojoules. So you'd better make use of that thing that you moved many, many times.
Interesting ML research ideas of things we think will start to work in that timeframe or will be more important in that timeframe, uh, really enables us to then get, you know, interesting hardware features put into, you know, TPU N plus two, where TPU N is what we have today.Shawn Wang [00:37:10]: Oh, the cycle time is plus two.Jeff Dean [00:37:12]: Roughly. Wow. Because, uh, I mean, sometimes you can squeeze some changes into N plus one, but, you know, bigger changes are going to require the chip. Yeah. Design be earlier in its lifetime design process. Um, so whenever we can do that, it's generally good. And sometimes you can put in speculative features that maybe won't cost you much chip area, but if it works out, it would make something, you know, 10 times as fast. And if it doesn't work out, well, you burned a little bit of tiny amount of your chip area on that thing, but it's not that big a deal. Uh, sometimes it's a very big change and we want to be pretty sure this is going to work out. So we'll do like lots of carefulness. Uh, ML experimentation to show us, uh, this is actually the, the way we want to go. Yeah.Alessio Fanelli [00:37:58]: Is there a reverse of like, we already committed to this chip design so we can not take the model architecture that way because it doesn't quite fit?Jeff Dean [00:38:06]: Yeah. I mean, you, you definitely have things where you're going to adapt what the model architecture looks like so that they're efficient on the chips that you're going to have for both training and inference of that, of that, uh, generation of model. So I think it kind of goes both ways. Um, you know, sometimes you can take advantage of, you know, lower precision things that are coming in a future generation. So you can, might train it at that lower precision, even if the current generation doesn't quite do that. Mm.Shawn Wang [00:38:40]: Yeah. How low can we go in precision?Jeff Dean [00:38:43]: Because people are saying like ternary is like, uh, yeah, I mean, I'm a big fan of very low precision because I think that gets, that saves you a tremendous amount of time. Right. Because it's picojoules per bit that you're transferring and reducing the number of bits is a really good way to, to reduce that. Um, you know, I think people have gotten a lot of luck, uh, mileage out of having very low bit precision things, but then having scaling factors that apply to a whole bunch of, uh, those, those weights. Scaling. How does it, how does it, okay.Shawn Wang [00:39:15]: Interesting. You, so low, low precision, but scaled up weights. Yeah. Huh. Yeah. Never considered that. Yeah. Interesting. Uh, w w while we're on this topic, you know, I think there's a lot of, um, uh, this, the concept of precision at all is weird when we're sampling, you know, uh, we just, at the end of this, we're going to have all these like chips that I'll do like very good math. And then we're just going to throw a random number generator at the start. So, I mean, there's a movement towards, uh, energy based, uh, models and processors. I'm just curious if you've, obviously you've thought about it, but like, what's your commentary?Jeff Dean [00:39:50]: Yeah. I mean, I think. There's a bunch of interesting trends though. 
Energy based models is one, you know, diffusion based models, which don't sort of sequentially decode tokens is another, um, you know, speculative decoding is a way that you can get sort of an equivalent, very small.Shawn Wang [00:40:06]: Draft.Jeff Dean [00:40:07]: Batch factor, uh, for like you predict eight tokens out and that enables you to sort of increase the effective batch size of what you're doing by a factor of eight, even, and then you maybe accept five or six of those tokens. So you get. A five, a five X improvement in the amortization of moving weights, uh, into the multipliers to do the prediction for the, the tokens. So these are all really good techniques and I think it's really good to look at them from the lens of, uh, energy, real energy, not energy based models, um, and, and also latency and throughput, right? If you look at things from that lens, that sort of guides you to. Two solutions that are gonna be, uh, you know, better from, uh, you know, being able to serve larger models or, you know, equivalent size models more cheaply and with lower latency.Shawn Wang [00:41:03]: Yeah. Well, I think, I think I, um, it's appealing intellectually, uh, haven't seen it like really hit the mainstream, but, um, I do think that, uh, there's some poetry in the sense that, uh, you know, we don't have to do, uh, a lot of shenanigans if like we fundamentally. Design it into the hardware. Yeah, yeah.Jeff Dean [00:41:23]: I mean, I think there's still a, there's also sort of the more exotic things like analog based, uh, uh, computing substrates as opposed to digital ones. Uh, I'm, you know, I think those are super interesting cause they can be potentially low power. Uh, but I think you often end up wanting to interface that with digital systems and you end up losing a lot of the power advantages in the digital to analog and analog to digital conversions. You end up doing, uh, at the sort of boundaries. And periphery of that system. Um, I still think there's a tremendous distance we can go from where we are today in terms of energy efficiency with sort of, uh, much better and specialized hardware for the models we care about.Shawn Wang [00:42:05]: Yeah.Alessio Fanelli [00:42:06]: Um, any other interesting research ideas that you've seen, or like maybe things that you cannot pursue a Google that you would be interested in seeing researchers take a step at, I guess you have a lot of researchers. Yeah, I guess you have enough, but our, our research.Jeff Dean [00:42:21]: Our research portfolio is pretty broad. I would say, um, I mean, I think, uh, in terms of research directions, there's a whole bunch of, uh, you know, open problems and how do you make these models reliable and able to do much longer, kind of, uh, more complex tasks that have lots of subtasks. How do you orchestrate, you know, maybe one model that's using other models as tools in order to sort of build, uh, things that can accomplish, uh, you know, much more. Yeah. Significant pieces of work, uh, collectively, then you would ask a single model to do. Um, so that's super interesting. How do you get more verifiable, uh, you know, how do you get RL to work for non-verifiable domains? I think it's a pretty interesting open problem because I think that would broaden out the capabilities of the models, the improvements that you're seeing in both math and coding. Uh, if we could apply those to other less verifiable domains, because we've come up with RL techniques that actually enable us to do that. 
Uh, effectively, that would really make the models improve quite a lot. I think.Alessio Fanelli [00:43:26]: I'm curious, like when we had Noam Brown on the podcast, he said, um, they already proved you can do it with deep research. Um, you kind of have it with AI mode in a way it's not verifiable. I'm curious if there's any thread that you think is interesting there. Like what is it? Both are like information retrieval of JSON. So I wonder if it's like the retrieval is like the verifiable part that you can score, or what are like, yeah, yeah. How would you model that problem?Jeff Dean [00:43:55]: Yeah. I mean, I think there are ways of having other models that can evaluate the results of what a first model did, maybe even retrieving. Can you have another model that says, are these things you retrieved relevant? Or can you rate these 2000 things you retrieved to assess which ones are the 50 most relevant or something? Um, I think those kinds of techniques are actually quite effective. Sometimes it can even be the same model, just prompted differently to be a, you know, a critic as opposed to a, uh, actual retrieval system. Yeah.Shawn Wang [00:44:28]: Um, I do think like there is that weird cliff where like, it feels like we've done the easy stuff and then now it's, but it always feels like that every year. It's like, oh, like we know, and the next part is super hard and nobody's figured it out. And, uh, exactly with this RLVR thing where like everyone's talking about, well, okay, how do we do the next stage of the non-verifiable stuff. And everyone's like, I don't know, you know, LLM judge.Jeff Dean [00:44:56]: I mean, I feel like the nice thing about this field is there's lots and lots of smart people thinking about creative solutions to some of the problems that we all see. Uh, because I think everyone sort of sees that the models, you know, are great at some things and they fall down around the edges of those things and are not as capable as we'd like in those areas. And then coming up with good techniques and trying those, and seeing which ones actually make a difference, is sort of what the whole research aspect of this field is pushing forward. And I think that's why it's super interesting. You know, if you think about two years ago, we were struggling with GSM-8K problems, right? Like, you know, Fred has two rabbits. He gets three more rabbits. How many rabbits does he have? That's a pretty far cry from the kinds of mathematics that the models can do now, where you're doing IMO and Erdős problems in pure language. Yeah. Yeah. Pure language. So that is a really, really amazing jump in capabilities in, you know, a year and a half or something. And I think, um, for other areas, it'd be great if we could make that kind of leap. Uh, and you know, we don't exactly see how to do it for some areas, but we do see it for some other areas and we're going to work hard on making that better. Yeah.Shawn Wang [00:46:13]: Yeah.Alessio Fanelli [00:46:14]: Like YouTube thumbnail generation. That would be very helpful. We need that. That would be AGI. We need that.Shawn Wang [00:46:20]: That would be, as far as content creators go.Jeff Dean [00:46:22]: I guess I'm not a YouTube creator, so I don't care that much about that problem, but I guess, uh, many people do.Shawn Wang [00:46:27]: It does. Yeah. It doesn't, it doesn't matter. People do judge books by their covers as it turns out.
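Dean's point a moment earlier, about using a second model (or the same model prompted differently) as a critic that rates retrieved candidates, is simple to sketch. The snippet below is a generic illustration, not Google's setup; the `generate` callable is a hypothetical placeholder for whatever LLM call you have available.

```python
from typing import Callable, List, Tuple

def rerank_with_critic(
    query: str,
    candidates: List[str],
    generate: Callable[[str], str],   # hypothetical LLM call: prompt in, text out
    keep: int = 50,
) -> List[Tuple[float, str]]:
    """Score each retrieved candidate with a critic prompt and keep the top `keep`.
    The critic can be the same base model as the retriever, just prompted as a judge."""
    scored = []
    for doc in candidates:
        prompt = (
            "You are a relevance judge. Rate how relevant the document is to the "
            "query on a 0-10 scale. Reply with only the number.\n\n"
            f"Query: {query}\n\nDocument:\n{doc}\n\nScore:"
        )
        reply = generate(prompt)
        try:
            score = float(reply.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # an unparseable judgment counts as irrelevant
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]
```

Judge scores like these are also one pragmatic answer to the "RL for non-verifiable domains" question raised above: they are noisy, but they give you a reward signal where no exact checker exists.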
Um, uh, just to draw a bit on the IMO goal. Um, I'm still not over the fact that a year ago we had AlphaProof and AlphaGeometry and all those things. And then this year we were like, screw that, we'll just chuck it into Gemini. Yeah. What's your reflection? Like, I think this question about the merger of symbolic systems and LLMs, uh, was very much a core belief. And then somewhere along the line, people just said, nope, we'll just do it all in the LLM.Jeff Dean [00:47:02]: Yeah. I mean, I think it makes a lot of sense to me because, you know, humans manipulate symbols, but we probably don't have like a symbolic representation in our heads. Right. We have some distributed representation that is neural net-like in some way, of lots of different neurons and activation patterns firing when we see certain things, and that enables us to reason and plan and, you know, do chains of thought and, you know, roll them back once that approach for solving the problem doesn't seem like it's going to work, and I'm going to try this one. And, you know, in a lot of ways we're emulating what we intuitively think, uh, is happening inside real brains in neural net based models. So it never made sense to me to have like completely separate, uh, discrete, uh, symbolic things, and then a completely different way of, uh, you know, thinking about those things.Shawn Wang [00:47:59]: Interesting. Yeah. Uh, I mean, it maybe seems obvious to you, but it wasn't obvious to me a year ago. Yeah.Jeff Dean [00:48:06]: I mean, I do think like that IMO progression, with, you know, translating to Lean and using Lean and also a specialized geometry model one year, and then this year switching to a single unified model that is roughly the production model with a little bit more inference budget, uh, is actually, you know, quite good, because it shows you that the capabilities of that general model have improved dramatically and now you don't need the specialized model. This is actually sort of very similar to the 2013 to 16 era of machine learning, right? Like it used to be, people would train separate models for each different problem, right? I want to recognize street signs or something, so I train a street sign recognition model, or I want to, you know, do speech recognition, so I have a speech model, right? I think now the era of unified models that do everything is really upon us. And the question is how well do those models generalize to new things they've never been asked to do, and they're getting better and better.Shawn Wang [00:49:10]: And you don't need domain experts. Like one of my, uh, so I interviewed ETA who was on, who was on that team. Uh, and he was like, yeah, I don't know how they work. I don't know where the IMO competition was held. I don't know the rules of it. I just trained the models. Yeah. Yeah. And it's kind of interesting that like people with this like universal skill set of just like machine learning, you just give them data and give them enough compute and they can kind of tackle any task, which is the bitter lesson, I guess. I don't know. Yeah.Jeff Dean [00:49:39]: I mean, I think, uh, general models, uh, will win out over specialized ones in most cases.Shawn Wang [00:49:45]: Uh, so I want to push there a bit. I think there's one hole here, which is like, uh.
There's this concept of like, uh, maybe capacity of a model, like abstractly a model can only contain the number of bits that it has. And, uh, and so it, you know, God knows like Gemini Pro is like one to 10 trillion parameters. We don't know, but, uh, the Gemma models, for example, right? Like a lot of people want like the open source local models that are like that, and, uh, they have some knowledge, which is not necessary, right? Like they can't know everything. Like you have the luxury of the big model, and the big model should be capable of everything. But like when you're distilling and you're going down to the small models, you know, you're actually memorizing things that are not useful. Yeah. And so like, how do we, I guess, do we want to extract that? Can we divorce knowledge from reasoning, you know?Jeff Dean [00:50:38]: Yeah. I mean, I think you do want the model to be most effective at reasoning if it can retrieve things, right? Because having the model devote precious parameter space to remembering obscure facts that could be looked up is actually not the best use of that parameter space, right? Like you might prefer something that is more generally useful in more settings than this obscure fact that it has. Um, so I think that's always a tension. At the same time, you also don't want your model to be kind of completely detached from, you know, knowing stuff about the world, right? Like it's probably useful to know how long the Golden Gate Bridge is, just as a general sense of like how long bridges are, right? And, uh, it should have that kind of knowledge. It maybe doesn't need to know how long some teeny little bridge in some other more obscure part of the world is, but, uh, it does help it to have a fair bit of world knowledge, and the bigger your model is, the more you can have. Uh, but I do think combining retrieval with sort of reasoning and making the model really good at doing multiple stages of retrieval. Yeah.Shawn Wang [00:51:49]: And reasoning through the intermediate retrieval results is going to be a pretty effective way of making the model seem much more capable, because if you think about, say, a personal Gemini, yeah, right?Jeff Dean [00:52:01]: Like we're not going to train Gemini on my email. Probably we'd rather have a single model that, uh, we can then use, being able to retrieve from my email as a tool and have the model reason about it and retrieve from my photos or whatever, uh, and then make use of that and have multiple, um, you know, uh, stages of interaction. That makes sense.Alessio Fanelli [00:52:24]: Do you think the vertical models are like, uh, an interesting pursuit? Like when people are like, oh, we're building the best healthcare LLM, we're building the best law LLM, are those kind of like short-term stopgaps or?Jeff Dean [00:52:37]: No, I mean, I think vertical models are interesting. Like you want them to start from a pretty good base model, but then you can sort of, uh, view them as enriching the data distribution for that particular vertical domain. For healthcare, say, or for say robotics, we're probably not going to train Gemini on all possible robotics data you could train it on, because we want it to have a balanced set of capabilities.
Um, so we'll expose it to some robotics data, but if you're trying to build a really, really good robotics model, you're going to want to start with that and then train it on more robotics data. And then maybe that would hurt its multilingual translation capability, but improve its robotics capabilities. And we're always making these kinds of, uh, you know, trade-offs in the data mix that we train the base Gemini models on. You know, we'd love to include data from 200 more languages and as much data as we have for those languages, but that's going to displace some other capabilities of the model. It won't be as good at, um, you know, Perl programming. You know, it'll still be good at Python programming, because we'll include enough of that, but there's other long tail computer languages or coding capabilities that it may suffer on, or multimodal reasoning capabilities may suffer because we didn't get to expose it to as much data there, but it's really good at multilingual things. So I think some combination of specialized models, maybe more modular models. So it'd be nice to have the capability to have those 200 languages, plus this awesome robotics model, plus this awesome healthcare, uh, module, that all can be knitted together to work in concert and called upon in different circumstances. Right? Like if I have a health related thing, then it should enable using this health module in conjunction with the main base model to be even better at those kinds of things. Yeah.Shawn Wang [00:54:36]: Installable knowledge. Yeah.Jeff Dean [00:54:37]: Right.Shawn Wang [00:54:38]: Just download as a, as a package.Jeff Dean [00:54:39]: And some of that installable stuff can come from retrieval, but some of it probably should come from preloaded training on, you know, uh, a hundred billion tokens or a trillion tokens of health data. Yeah.Shawn Wang [00:54:51]: And for listeners, I think, uh, I will highlight the Gemma 3n paper where there was a little bit of that, I think. Yeah.Alessio Fanelli [00:54:56]: Yeah. I guess the question is like, how many billions of tokens do you need to outpace the frontier model improvements? You know, it's like, if I have to make this model better at healthcare and the main Gemini model is still improving, do I need 50 billion tokens? Can I do it with a hundred? If I need a trillion healthcare tokens, they're probably not out there, you know. I think that's really like the.Jeff Dean [00:55:21]: Well, I mean, I think healthcare is a particularly challenging domain, so there's a lot of healthcare data that, you know, we don't have access to appropriately, but there's a lot of, you know, uh, healthcare organizations that want to train models on their own data that is not public healthcare data. Um, so I think there are opportunities there to say, partner with a large healthcare organization and train models for their use that are going to be, you know, more bespoke, but probably, uh, might be better than a general model trained on, say, public data. Yeah.Shawn Wang [00:55:58]: Yeah. I believe, uh, by the way, also this is like somewhat related to the language conversation. Uh, I think one of your favorite examples was you can put a low resource language in the context and it just learns.
Yeah.Jeff Dean [00:56:09]: Oh, yeah, I think the example we used was Kalamang, which is truly low resource because it's only spoken by, I think, 120 people in the world and there's no written text.Shawn Wang [00:56:20]: So, yeah. So you can just do it that way. Just put it in the context. Yeah. Yeah. But you can't put your whole data set in the context, right?Jeff Dean [00:56:27]: If you take a language like, uh, you know, Somali or something, there is a fair bit of Somali text in the world, uh, or Ethiopian Amharic or something. Um, you know, we probably are not putting all the data from those languages into the Gemini base training. We put some of it, but if you put more of it, you'll improve the capabilities of those models.Shawn Wang [00:56:49]: Yeah.
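The "put the language in the context" idea is straightforward to try with any long-context model. The sketch below is a generic illustration of the approach, not the actual Gemini/Kalamang setup; the file paths and the `generate` function are hypothetical placeholders.

```python
from pathlib import Path
from typing import Callable

def translate_with_in_context_language(
    sentence: str,
    grammar_path: str,
    parallel_examples_path: str,
    generate: Callable[[str], str],   # hypothetical long-context LLM call
) -> str:
    """Ask a long-context model to translate into a low-resource language by
    placing a grammar sketch and a small parallel corpus directly in the prompt,
    instead of fine-tuning on that language."""
    grammar = Path(grammar_path).read_text(encoding="utf-8")
    examples = Path(parallel_examples_path).read_text(encoding="utf-8")
    prompt = (
        "You are given a grammar sketch and a list of translated sentence pairs "
        "for a low-resource language.\n\n"
        f"--- GRAMMAR ---\n{grammar}\n\n"
        f"--- PARALLEL EXAMPLES ---\n{examples}\n\n"
        "Using only the material above, translate the following English sentence "
        f"into the target language:\n{sentence}\n"
    )
    return generate(prompt)
```

This mirrors the "learn a language from one grammar book" style of evaluation mentioned in the conversation; the only hard requirement is that the reference material fits in the context window.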

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Feb 6, 2026 68:01


From Palantir and Two Sigma to building Goodfire into the poster-child for actionable mechanistic interpretability, Mark Bissell (Member of Technical Staff) and Myra Deng (Head of Product) are trying to turn “peeking inside the model” into a repeatable production workflow by shipping APIs, landing real enterprise deployments, and now scaling the bet with a recent $150M Series B funding round at a $1.25B valuation.

In this episode, we go far beyond the usual “SAEs are cool” take. We talk about Goodfire's core bet: that the AI lifecycle is still fundamentally broken because the only reliable control we have is data, and we post-train, RLHF, and fine-tune by “slurping supervision through a straw,” hoping the model picks up the right behaviors while quietly absorbing the wrong ones. Goodfire's answer is to build a bi-directional interface between humans and models: read what's happening inside, edit it surgically, and eventually use interpretability during training so customization isn't just brute-force guesswork.

Mark and Myra walk through what that looks like when you stop treating interpretability like a lab demo and start treating it like infrastructure: lightweight probes that add near-zero latency, token-level safety filters that can run at inference time, and interpretability workflows that survive messy constraints (multilingual inputs, synthetic→real transfer, regulated domains, no access to sensitive data). We also get a live window into what “frontier-scale interp” means operationally (i.e. steering a trillion-parameter model in real time by targeting internal features), plus why the same tooling generalizes cleanly from language models to genomics, medical imaging, and “pixel-space” world models.

We discuss:

* Myra + Mark's path: Palantir (health systems, forward-deployed engineering) → Goodfire early team; Two Sigma → Head of Product, translating frontier interpretability research into a platform and real-world deployments
* What “interpretability” actually means in practice: not just post-hoc poking, but a broader “science of deep learning” approach across the full AI lifecycle (data curation → post-training → internal representations → model design)
* Why post-training is the first big wedge: “surgical edits” for unintended behaviors like reward hacking, sycophancy, and noise learned during customization, plus the dream of targeted unlearning and bias removal without wrecking capabilities
* SAEs vs probes in the real world: why SAE feature spaces sometimes underperform classifiers trained on raw activations for downstream detection tasks (hallucination, harmful intent, PII), and what that implies about “clean concept spaces”
* Rakuten in production: deploying interpretability-based token-level PII detection at inference time to prevent routing private data to downstream providers, plus the gnarly constraints: no training on real customer PII, synthetic→real transfer, English + Japanese, and tokenization quirks
* Why interp can be operationally cheaper than LLM-judge guardrails: probes are lightweight, low-latency, and don't require hosting a second large model in the loop
* Real-time steering at frontier scale: a demo of steering Kimi K2 (~1T params) live, finding features via SAE pipelines, auto-labeling via LLMs, and toggling a “Gen-Z slang” feature across multiple layers without breaking tool use
* Hallucinations as an internal signal: the case that models have latent uncertainty / “user-pleasing” circuitry you can detect and potentially mitigate more directly than black-box methods
* Steering vs prompting: the emerging view that activation steering and in-context learning are more closely connected than people think, including work mapping between the two (even for jailbreak-style behaviors)
* Interpretability for science: using the same tooling across domains (genomics, medical imaging, materials) to debug spurious correlations and extract new knowledge, up to and including early biomarker discovery work with major partners
* World models + “pixel-space” interpretability: why vision/video models make concepts easier to see, how that accelerates the feedback loop, and why robotics/world-model partners are especially interesting design partners
* The north star: moving from “data in, weights out” to intentional model design where experts can impart goals and constraints directly, not just via reward signals and brute-force post-training

Goodfire AI
* Website: https://goodfire.ai
* LinkedIn: https://www.linkedin.com/company/goodfire-ai/
* X: https://x.com/GoodfireAI

Myra Deng
* Website: https://myradeng.com/
* LinkedIn: https://www.linkedin.com/in/myra-deng/
* X: https://x.com/myra_deng

Mark Bissell
* LinkedIn: https://www.linkedin.com/in/mark-bissell/
* X: https://x.com/MarkMBissell

Full Video Episode

Timestamps
00:00:00 Introduction
00:00:05 Introduction to the Latent Space Podcast and Guests from Goodfire
00:00:29 What is Goodfire? Mission and Focus on Interpretability
00:01:01 Goodfire's Practical Approach to Interpretability
00:01:37 Goodfire's Series B Fundraise Announcement
00:02:04 Backgrounds of Mark and Myra from Goodfire
00:02:51 Team Structure and Roles at Goodfire
00:05:13 What is Interpretability? Definitions and Techniques
00:05:30 Understanding Errors
00:07:29 Post-training vs. Pre-training Interpretability Applications
00:08:51 Using Interpretability to Remove Unwanted Behaviors
00:10:09 Grokking, Double Descent, and Generalization in Models
00:10:15 404 Not Found Explained
00:12:06 Subliminal Learning and Hidden Biases in Models
00:14:07 How Goodfire Chooses Research Directions and Projects
00:15:00 Troubleshooting Errors
00:16:04 Limitations of SAEs and Probes in Interpretability
00:18:14 Rakuten Case Study: Production Deployment of Interpretability
00:20:45 Conclusion
00:21:12 Efficiency Benefits of Interpretability Techniques
00:21:26 Live Demo: Real-Time Steering in a Trillion Parameter Model
00:25:15 How Steering Features are Identified and Labeled
00:26:51 Detecting and Mitigating Hallucinations Using Interpretability
00:31:20 Equivalence of Activation Steering and Prompting
00:34:06 Comparing Steering with Fine-Tuning and LoRA Techniques
00:36:04 Model Design and the Future of Intentional AI Development
00:38:09 Getting Started in Mechinterp: Resources, Programs, and Open Problems
00:40:51 Industry Applications and the Rise of Mechinterp in Practice
00:41:39 Interpretability for Code Models and Real-World Usage
00:43:07 Making Steering Useful for More Than Stylistic Edits
00:46:17 Applying Interpretability to Healthcare and Scientific Discovery
00:49:15 Why Interpretability is Crucial in High-Stakes Domains like Healthcare
00:52:03 Call for Design Partners Across Domains
00:54:18 Interest in World Models and Visual Interpretability
00:57:22 Sci-Fi Inspiration: Ted Chiang and Interpretability
01:00:14 Interpretability, Safety, and Alignment Perspectives
01:04:27 Weak-to-Strong Generalization and Future Alignment Challenges
01:05:38 Final Thoughts and Hiring/Collaboration Opportunities at Goodfire

Transcript

Shawn Wang [00:00:05]: So welcome to the Latent Space pod.
We're back in the studio with our special MechInterp co-host, Vibhu. Welcome. Mochi, Mochi's special co-host. And Mochi, the mechanistic interpretability doggo. We have with us Mark and Myra from Goodfire. Welcome. Thanks for having us on. Maybe we can sort of introduce Goodfire and then introduce you guys. How do you introduce Goodfire today?Myra Deng [00:00:29]: Yeah, it's a great question. So Goodfire, we like to say, is an AI research lab that focuses on using interpretability to understand, learn from, and design AI models. And we really believe that interpretability will unlock the new generation, next frontier of safe and powerful AI models. That's our description right now, and I'm excited to dive more into the work we're doing to make that happen.Shawn Wang [00:00:55]: Yeah. And there's always like the official description. Is there an understatement? Is there an unofficial one that sort of resonates more with a different audience?Mark Bissell [00:01:01]: Well, being an AI research lab that's focused on interpretability, there's obviously a lot of people have a lot that they think about when they think of interpretability. And I think we have a pretty broad definition of what that means and the types of places that can be applied. And in particular, applying it in production scenarios, in high stakes industries, and really taking it sort of from the research world into the real world. Which, you know. It's a new field, so that hasn't been done all that much. And we're excited about actually seeing that sort of put into practice.Shawn Wang [00:01:37]: Yeah, I would say it wasn't too long ago that Anthopic was like still putting out like toy models or superposition and that kind of stuff. And I wouldn't have pegged it to be this far along. When you and I talked at NeurIPS, you were talking a little bit about your production use cases and your customers. And then not to bury the lead, today we're also announcing the fundraise, your Series B. $150 million. $150 million at a 1.25B valuation. Congrats, Unicorn.Mark Bissell [00:02:02]: Thank you. Yeah, no, things move fast.Shawn Wang [00:02:04]: We were talking to you in December and already some big updates since then. Let's dive, I guess, into a bit of your backgrounds as well. Mark, you were at Palantir working on health stuff, which is really interesting because the Goodfire has some interesting like health use cases. I don't know how related they are in practice.Mark Bissell [00:02:22]: Yeah, not super related, but I don't know. It was helpful context to know what it's like. Just to work. Just to work with health systems and generally in that domain. Yeah.Shawn Wang [00:02:32]: And Mara, you were at Two Sigma, which actually I was also at Two Sigma back in the day. Wow, nice.Myra Deng [00:02:37]: Did we overlap at all?Shawn Wang [00:02:38]: No, this is when I was briefly a software engineer before I became a sort of developer relations person. And now you're head of product. What are your sort of respective roles, just to introduce people to like what all gets done in Goodfire?Mark Bissell [00:02:51]: Yeah, prior to Goodfire, I was at Palantir for about three years as a forward deployed engineer, now a hot term. Wasn't always that way. And as a technical lead on the health care team and at Goodfire, I'm a member of the technical staff. And honestly, that I think is about as specific as like as as I could describe myself because I've worked on a range of things. 
And, you know, it's it's a fun time to be at a team that's still reasonably small. I think when I joined one of the first like ten employees, now we're above 40, but still, it looks like there's always a mix of research and engineering and product and all of the above. That needs to get done. And I think everyone across the team is, you know, pretty, pretty switch hitter in the roles they do. So I think you've seen some of the stuff that I worked on related to image models, which was sort of like a research demo. More recently, I've been working on our scientific discovery team with some of our life sciences partners, but then also building out our core platform for more of like flexing some of the kind of MLE and developer skills as well.Shawn Wang [00:03:53]: Very generalist. And you also had like a very like a founding engineer type role.Myra Deng [00:03:58]: Yeah, yeah.Shawn Wang [00:03:59]: So I also started as I still am a member of technical staff, did a wide range of things from the very beginning, including like finding our office space and all of this, which is we both we both visited when you had that open house thing. It was really nice.Myra Deng [00:04:13]: Thank you. Thank you. Yeah. Plug to come visit our office.Shawn Wang [00:04:15]: It looked like it was like 200 people. It has room for 200 people. But you guys are like 10.Myra Deng [00:04:22]: For a while, it was very empty. But yeah, like like Mark, I spend. A lot of my time as as head of product, I think product is a bit of a weird role these days, but a lot of it is thinking about how do we take our frontier research and really apply it to the most important real world problems and how does that then translate into a platform that's repeatable or a product and working across, you know, the engineering and research teams to make that happen and also communicating to the world? Like, what is interpretability? What is it used for? What is it good for? Why is it so important? All of these things are part of my day-to-day as well.Shawn Wang [00:05:01]: I love like what is things because that's a very crisp like starting point for people like coming to a field. They all do a fun thing. Vibhu, why don't you want to try tackling what is interpretability and then they can correct us.Vibhu Sapra [00:05:13]: Okay, great. So I think like one, just to kick off, it's a very interesting role to be head of product, right? Because you guys, at least as a lab, you're more of an applied interp lab, right? Which is pretty different than just normal interp, like a lot of background research. But yeah. You guys actually ship an API to try these things. You have Ember, you have products around it, which not many do. Okay. What is interp? So basically you're trying to have an understanding of what's going on in model, like in the model, in the internal. So different approaches to do that. You can do probing, SAEs, transcoders, all this stuff. But basically you have an, you have a hypothesis. You have something that you want to learn about what's happening in a model internals. And then you're trying to solve that from there. You can do stuff like you can, you know, you can do activation mapping. You can try to do steering. There's a lot of stuff that you can do, but the key question is, you know, from input to output, we want to have a better understanding of what's happening and, you know, how can we, how can we adjust what's happening on the model internals? How'd I do?Mark Bissell [00:06:12]: That was really good. I think that was great. 
I think it's also a, it's kind of a minefield of a, if you ask 50 people who quote unquote work in interp, like what is interpretability, you'll probably get 50 different answers. And. Yeah. To some extent also like where, where good fire sits in the space. I think that we're an AI research company above all else. And interpretability is a, is a set of methods that we think are really useful and worth kind of specializing in, in order to accomplish the goals we want to accomplish. But I think we also sort of see some of the goals as even more broader as, as almost like the science of deep learning and just taking a not black box approach to kind of any part of the like AI development life cycle, whether that. That means using interp for like data curation while you're training your model or for understanding what happened during post-training or for the, you know, understanding activations and sort of internal representations, what is in there semantically. And then a lot of sort of exciting updates that were, you know, are sort of also part of the, the fundraise around bringing interpretability to training, which I don't think has been done all that much before. A lot of this stuff is sort of post-talk poking at models as opposed to. To actually using this to intentionally design them.Shawn Wang [00:07:29]: Is this post-training or pre-training or is that not a useful.Myra Deng [00:07:33]: Currently focused on post-training, but there's no reason the techniques wouldn't also work in pre-training.Shawn Wang [00:07:38]: Yeah. It seems like it would be more active, applicable post-training because basically I'm thinking like rollouts or like, you know, having different variations of a model that you can tweak with the, with your steering. Yeah.Myra Deng [00:07:50]: And I think in a lot of the news that you've seen in, in, on like Twitter or whatever, you've seen a lot of unintended. Side effects come out of post-training processes, you know, overly sycophantic models or models that exhibit strange reward hacking behavior. I think these are like extreme examples. There's also, you know, very, uh, mundane, more mundane, like enterprise use cases where, you know, they try to customize or post-train a model to do something and it learns some noise or it doesn't appropriately learn the target task. And a big question that we've always had is like, how do you use your understanding of what the model knows and what it's doing to actually guide the learning process?Shawn Wang [00:08:26]: Yeah, I mean, uh, you know, just to anchor this for people, uh, one of the biggest controversies of last year was 4.0 GlazeGate. I've never heard of GlazeGate. I didn't know that was what it was called. The other one, they called it that on the blog post and I was like, well, how did OpenAI call it? Like officially use that term. And I'm like, that's funny, but like, yeah, I guess it's the pitch that if they had worked a good fire, they wouldn't have avoided it. Like, you know what I'm saying?Myra Deng [00:08:51]: I think so. Yeah. Yeah.Mark Bissell [00:08:53]: I think that's certainly one of the use cases. I think. Yeah. Yeah. I think the reason why post-training is a place where this makes a lot of sense is a lot of what we're talking about is surgical edits. You know, you want to be able to have expert feedback, very surgically change how your model is doing, whether that is, you know, removing a certain behavior that it has. 
So, you know, one of the things that we've been looking at, or is another like common area where you would want to make a somewhat surgical edit, is some of the models that have, say, political bias. Like you look at Qwen or, um, R1, and they have sort of like this CCP bias.Shawn Wang [00:09:27]: Is there a CCP vector?Mark Bissell [00:09:29]: Well, there are certainly internal, yeah, parts of the representation space where you can sort of see where that lives. Yeah. Um, and you want to kind of, you know, extract that piece out.Shawn Wang [00:09:40]: Well, I always say, you know, whenever you find a vector, a fun exercise is just like, make it very negative to see what the opposite of CCP is.Mark Bissell [00:09:47]: The super America, bald eagles flying everywhere. But yeah. So in general, like lots of post-training tasks where you'd want to be able to do that. Whether it's unlearning a certain behavior or, you know, some of the other kind of cases where this comes up is, are you familiar with like the grokking behavior? I mean, I know the machine learning term of grokking.Shawn Wang [00:10:09]: Yeah.Mark Bissell [00:10:09]: Sort of this like double descent idea of having a model that is able to learn a generalizing solution, as opposed to, even if memorization of some task would suffice, you want it to learn the more general way of doing a thing. And so, you know, another way that you can think about having surgical access to a model's internals would be learn from this data, but learn in the right way, if there are many possible, you know, ways to do that. Can we make interp solve the double descent problem?Shawn Wang [00:10:41]: Depends, I guess, on how you... Okay. So I viewed double descent as a problem because then you're like, well, if the loss curves level out, then you're done, but maybe you're not done. Right. Right. But like, if you actually can interpret what is generalizing, or what is still changing, even though the loss is not changing, then maybe you can actually not view it as a double descent problem. And actually you're just sort of translating the space in which you view loss, and then you have a smooth curve. Yeah.Mark Bissell [00:11:11]: I think that's certainly like the domain of problems that we're looking to get at.Shawn Wang [00:11:15]: Yeah. To me, like double descent is like the biggest thing to like ML research where like, if you believe in scaling, then you need to know where to scale. But if you believe in double descent, then you don't believe in anything where like anything levels off, like.Vibhu Sapra [00:11:30]: I mean, also tangentially there's like, okay, when you talk about the China vector, right, there's the subliminal learning work. It was from the Anthropic Fellows program, where basically you can have hidden biases in a model, and as you distill down or, you know, as you train on distilled data, those biases always show up, even if like you explicitly try to not train on them. So, you know, it's just like another use case of: okay, if we can interpret what's happening in post-training, you know, can we clear some of this? Can we even determine what's there? Because yeah, it's just like some worrying research that's out there that shows, you know, we really don't know what's going on.Mark Bissell [00:12:06]: That is. Yeah. I think that's the biggest sentiment that we're sort of hoping to tackle.
Nobody knows what's going on. Right. Like subliminal learning is just an insane concept when you think about it. Right. Train a model on not even the logits, literally the output text of a bunch of random numbers. And now your model loves owls. And you see behaviors like that, that are just, they defy, they defy intuition. And, and there are mathematical explanations that you can get into, but. I mean.Shawn Wang [00:12:34]: It feels so early days. Objectively, there are a sequence of numbers that are more owl-like than others. There, there should be.Mark Bissell [00:12:40]: According to, according to certain models. Right. It's interesting. I think it only applies to models that were initialized from the same starting Z. Usually, yes.Shawn Wang [00:12:49]: But I mean, I think that's a, that's a cheat code because there's not enough compute. But like if you believe in like platonic representation, like probably it will transfer across different models as well. Oh, you think so?Mark Bissell [00:13:00]: I think of it more as a statistical artifact of models initialized from the same seed sort of. There's something that is like path dependent from that seed that might cause certain overlaps in the latent space and then sort of doing this distillation. Yeah. Like it pushes it towards having certain other tendencies.Vibhu Sapra [00:13:24]: Got it. I think there's like a bunch of these open-ended questions, right? Like you can't train in new stuff during the RL phase, right? RL only reorganizes weights and you can only do stuff that's somewhat there in your base model. You're not learning new stuff. You're just reordering chains and stuff. But okay. My broader question is when you guys work at an interp lab, how do you decide what to work on and what's kind of the thought process? Right. Because we can ramble for hours. Okay. I want to know this. I want to know that. But like, how do you concretely like, you know, what's the workflow? Okay. There's like approaches towards solving a problem, right? I can try prompting. I can look at chain of thought. I can train probes, SAEs. But how do you determine, you know, like, okay, is this going anywhere? Like, do we have set stuff? Just, you know, if you can help me with all that. Yeah.Myra Deng [00:14:07]: It's a really good question. I feel like we've always at the very beginning of the company thought about like, let's go and try to learn what isn't working in machine learning today. Whether that's talking to customers or talking to researchers at other labs, trying to understand both where the frontier is going and where things are really not falling apart today. And then developing a perspective on how we can push the frontier using interpretability methods. And so, you know, even our chief scientist, Tom, spends a lot of time talking to customers and trying to understand what real world problems are and then taking that back and trying to apply the current state of the art to those problems and then seeing where they fall down basically. And then using those failures or those shortcomings to understand what hills to climb when it comes to interpretability research. So like on the fundamental side, for instance, when we have done some work applying SAEs and probes, we've encountered, you know, some shortcomings in SAEs that we found a little bit surprising. And so have gone back to the drawing board and done work on that. And then, you know, we've done some work on better foundational interpreter models. 
And a lot of our team's research is focused on what is the next evolution beyond SAEs, for instance. And then when it comes to like control and design of models, you know, we tried steering with our first API and realized that it still fell short of black box techniques like prompting or fine tuning. And so went back to the drawing board and we're like, how do we make that not the case and how do we improve it beyond that? And one of our researchers, Ekdeep, who just joined is actually Ekdeep and Atticus are like steering experts and have spent a lot of time trying to figure out like, what is the research that enables us to actually do this in a much more powerful, robust way? So yeah, the answer is like, look at real world problems, try to translate that into a research agenda and then like hill climb on both of those at the same time.Shawn Wang [00:16:04]: Yeah. Mark has the steering CLI demo queued up, which we're going to go into in a sec. But I always want to double click on when you drop hints, like we found some problems with SAEs. Okay. What are they? You know, and then we can go into the demo. Yeah.Myra Deng [00:16:19]: I mean, I'm curious if you have more thoughts here as well, because you've done it in the healthcare domain. But I think like, for instance, when we do things like trying to detect behaviors within models that are harmful or like behaviors that a user might not want to have in their model. So hallucinations, for instance, harmful intent, PII, all of these things. We first tried using SAE probes for a lot of these tasks. So taking the feature activation space from SAEs and then training classifiers on top of that, and then seeing how well we can detect the properties that we might want to detect in model behavior. And we've seen in many cases that probes just trained on raw activations seem to perform better than SAE probes, which is a bit surprising if you think that SAEs are actually also capturing the concepts that you would want to capture cleanly and more surgically. And so that is an interesting observation. I don't think that is like, I'm not down on SAEs at all. I think there are many, many things they're useful for, but we have definitely run into cases where I think the concept space described by SAEs is not as clean and accurate as we would expect it to be for actual like real world downstream performance metrics.Mark Bissell [00:17:34]: Fair enough. Yeah. It's the blessing and the curse of unsupervised methods where you get to peek into the AI's mind. But sometimes you wish that you saw other things when you walked inside there. Although in the PII instance, I think weren't an SAE based approach actually did prove to be the most generalizable?Myra Deng [00:17:53]: It did work well in the case that we published with Rakuten. And I think a lot of the reasons it worked well was because we had a noisier data set. And so actually the blessing of unsupervised learning is that we actually got to get more meaningful, generalizable signal from SAEs when the data was noisy. But in other cases where we've had like good data sets, it hasn't been the case.Shawn Wang [00:18:14]: And just because you named Rakuten and I don't know if we'll get it another chance, like what is the overall, like what is Rakuten's usage or production usage? 
Yeah.Myra Deng [00:18:25]: So they are using us to essentially guardrail and inference time monitor their language model usage and their agent usage to detect things like PII so that they don't route private user information.Myra Deng [00:18:41]: And so that's, you know, going through all of their user queries every day. And that's something that we deployed with them a few months ago. And now we are actually exploring very early partnerships, not just with Rakuten, but with other people around how we can help with potentially training and customization use cases as well. Yeah.Shawn Wang [00:19:03]: And for those who don't know, like it's Rakuten is like, I think number one or number two e-commerce store in Japan. Yes. Yeah.Mark Bissell [00:19:10]: And I think that use case actually highlights a lot of like what it looks like to deploy things in practice that you don't always think about when you're doing sort of research tasks. So when you think about some of the stuff that came up there that's more complex than your idealized version of a problem, they were encountering things like synthetic to real transfer of methods. So they couldn't train probes, classifiers, things like that on actual customer data of PII. So what they had to do is use synthetic data sets. And then hope that that transfer is out of domain to real data sets. And so we can evaluate performance on the real data sets, but not train on customer PII. So that right off the bat is like a big challenge. You have multilingual requirements. So this needed to work for both English and Japanese text. Japanese text has all sorts of quirks, including tokenization behaviors that caused lots of bugs that caused us to be pulling our hair out. And then also a lot of tasks you'll see. You might make simplifying assumptions if you're sort of treating it as like the easiest version of the problem to just sort of get like general results where maybe you say you're classifying a sentence to say, does this contain PII? But the need that Rakuten had was token level classification so that you could precisely scrub out the PII. So as we learned more about the problem, you're sort of speaking about what that looks like in practice. Yeah. A lot of assumptions end up breaking. And that was just one instance where you. A problem that seems simple right off the bat ends up being more complex as you keep diving into it.Vibhu Sapra [00:20:41]: Excellent. One of the things that's also interesting with Interp is a lot of these methods are very efficient, right? So where you're just looking at a model's internals itself compared to a separate like guardrail, LLM as a judge, a separate model. One, you have to host it. Two, there's like a whole latency. So if you use like a big model, you have a second call. Some of the work around like self detection of hallucination, it's also deployed for efficiency, right? So if you have someone like Rakuten doing it in production live, you know, that's just another thing people should consider.Mark Bissell [00:21:12]: Yeah. And something like a probe is super lightweight. Yeah. It's no extra latency really. Excellent.Shawn Wang [00:21:17]: You have the steering demos lined up. So we were just kind of see what you got. I don't, I don't actually know if this is like the latest, latest or like alpha thing.Mark Bissell [00:21:26]: No, this is a pretty hacky demo from from a presentation that someone else on the team recently gave. So this will give a sense for, for technology. So you can see the steering and action. 
Honestly, I think the biggest thing that this highlights is that as we've been growing as a company and taking on kind of more and more ambitious versions of interpretability related problems, a lot of that comes down to scaling up in various different forms. And so here you're going to see steering on a 1 trillion parameter model. This is Kimi K2. And so it's sort of fun that in addition to the research challenges, there are engineering challenges that we're now tackling. Cause for any of this to be sort of useful in production, you need to be thinking about what it looks like when you're using these methods on frontier models as opposed to sort of like toy kind of model organisms. So yeah, this was thrown together hastily, pretty fragile behind the scenes, but I think it's quite a fun demo. So screen sharing is on. So I've got two terminal sessions pulled up here. On the left is a forked version that we have of the Kimi CLI that we've got running to point at our custom hosted Kimi model. And then on the right is a setup that will allow us to steer on certain concepts. So I should be able to chat with Kimi over here. Tell it hello. This is running locally. So the CLI is running locally, but the Kimi server is running back at the office. Well, hopefully should be, um, that's too much to run on that Mac. Yeah. I think it's, uh, it takes a full, like, H100 node. I think it's like, you can run it on eight GPUs, eight H100s. So, so yeah, Kimi's running. We can ask it a prompt. It's got a forked version of our, uh, of the SGLang code base that we've been working on. So I'm going to tell it, hey, this SGLang code base is slow. I think there's a bug. Can you try to figure it out? There's a big code base, so it'll spend some time doing this. And then on the right here, I'm going to initialize, in real time, some steering. Let's see here.Mark Bissell [00:23:33]: Searching for any bugs. Feature ID 43205.Shawn Wang [00:23:38]: Yeah.Mark Bissell [00:23:38]: 20, 30, 40. So let me, uh, this is basically a feature that we found that inside Kimi seems to cause it to speak in Gen Z slang. And so on the left, it's still sort of thinking normally. It might take, I don't know, 15 seconds for this to kick in, but then we're going to start hopefully seeing it do "this code base is massive for real." So we're going to start seeing Kimi transition as the steering kicks in from normal Kimi to Gen Z Kimi, both in its chain of thought and its actual outputs.
Yeah.Mark Bissell [00:25:15]: So in this case, um, we, our platform that we've been building out for a long time now supports all the sort of classic out of the box interp techniques that you might want to have like SAE training, probing things of that kind, I'd say the techniques for like vanilla SAEs are pretty well established now where. You take your model that you're interpreting, run a whole bunch of data through it, gather activations, and then yeah, pretty straightforward pipeline to train an SAE. There are a lot of different varieties. There's top KSAEs, batch top KSAEs, um, normal ReLU SAEs. And then once you have your sparse features to your point, assigning labels to them to actually understand that this is a gen Z feature, that's actually where a lot of the kind of magic happens. Yeah. And the most basic standard technique is look at all of your d input data set examples that cause this feature to fire most highly. And then you can usually pick out a pattern. So for this feature, If I've run a diverse enough data set through my model feature 43, two Oh five. Probably tends to fire on all the tokens that sounds like gen Z slang. You know, that's the, that's the time of year to be like, Oh, I'm in this, I'm in this Um, and, um, so, you know, you could have a human go through all 43,000 concepts andVibhu Sapra [00:26:34]: And I've got to ask the basic question, you know, can we get examples where it hallucinates, pass it through, see what feature activates for hallucinations? Can I just, you know, turn hallucination down?Myra Deng [00:26:51]: Oh, wow. You really predicted a project we're already working on right now, which is detecting hallucinations using interpretability techniques. And this is interesting because hallucinations is something that's very hard to detect. And it's like a kind of a hairy problem and something that black box methods really struggle with. Whereas like Gen Z, you could always train a simple classifier to detect that hallucinations is harder. But we've seen that models internally have some... Awareness of like uncertainty or some sort of like user pleasing behavior that leads to hallucinatory behavior. And so, yeah, we have a project that's trying to detect that accurately. And then also working on mitigating the hallucinatory behavior in the model itself as well.Shawn Wang [00:27:39]: Yeah, I would say most people are still at the level of like, oh, I would just turn temperature to zero and that turns off hallucination. And I'm like, well, that's a fundamental misunderstanding of how this works. Yeah.Mark Bissell [00:27:51]: Although, so part of what I like about that question is you, there are SAE based approaches that might like help you get at that. But oftentimes the beauty of SAEs and like we said, the curse is that they're unsupervised. So when you have a behavior that you deliberately would like to remove, and that's more of like a supervised task, often it is better to use something like probes and specifically target the thing that you're interested in reducing as opposed to sort of like hoping that when you fragment the latent space, one of the vectors that pops out.Vibhu Sapra [00:28:20]: And as much as we're training an autoencoder to be sparse, we're not like for sure certain that, you know, we will get something that just correlates to hallucination. You'll probably split that up into 20 other things and who knows what they'll be.Mark Bissell [00:28:36]: Of course. Right. Yeah. 
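For readers who want the mechanics behind a demo like the one above: activation steering usually means adding a fixed direction to the hidden states at one or more layers during generation. The PyTorch sketch below is a generic, minimal illustration under assumptions of my own (a HuggingFace-style causal LM and a steering direction you already extracted, e.g. one decoder row of an SAE); it is not Goodfire's Ember API or their production setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # stand-in small model; any HF causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

# Assume we already have a steering direction for some concept,
# e.g. one decoder row of a sparse autoencoder trained on this layer.
layer_idx = 12
hidden_size = model.config.hidden_size
steer_dir = torch.randn(hidden_size)          # placeholder; use a real feature direction
steer_dir = steer_dir / steer_dir.norm()
strength = 8.0                                # how hard to push along the direction

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * steer_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)
try:
    prompt = "Explain what a sparse autoencoder does."
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=60, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```

Because the hook only adds a vector to the residual stream, it can be toggled or re-weighted between tokens, which is what makes the "steer the model mid-generation" style of demo possible.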
So there's no sort of problems with like feature splitting and feature absorption. And then there's the off target effects, right? Ideally, you would want to be very precise where if you reduce the hallucination feature, suddenly maybe your model can't write. Creatively anymore. And maybe you don't like that, but you want to still stop it from hallucinating facts and figures.Shawn Wang [00:28:55]: Good. So Vibhu has a paper to recommend there that we'll put in the show notes. But yeah, I mean, I guess just because your demo is done, any any other things that you want to highlight or any other interesting features you want to show?Mark Bissell [00:29:07]: I don't think so. Yeah. Like I said, this is a pretty small snippet. I think the main sort of point here that I think is exciting is that there's not a whole lot of inter being applied to models quite at this scale. You know, Anthropic certainly has some some. Research and yeah, other other teams as well. But it's it's nice to see these techniques, you know, being put into practice. I think not that long ago, the idea of real time steering of a trillion parameter model would have sounded.Shawn Wang [00:29:33]: Yeah. The fact that it's real time, like you started the thing and then you edited the steering vector.Vibhu Sapra [00:29:38]: I think it's it's an interesting one TBD of what the actual like production use case would be on that, like the real time editing. It's like that's the fun part of the demo, right? You can kind of see how this could be served behind an API, right? Like, yes, you're you only have so many knobs and you can just tweak it a bit more. And I don't know how it plays in. Like people haven't done that much with like, how does this work with or without prompting? Right. How does this work with fine tuning? Like, there's a whole hype of continual learning, right? So there's just so much to see. Like, is this another parameter? Like, is it like parameter? We just kind of leave it as a default. We don't use it. So I don't know. Maybe someone here wants to put out a guide on like how to use this with prompting when to do what?Mark Bissell [00:30:18]: Oh, well, I have a paper recommendation. I think you would love from Act Deep on our team, who is an amazing researcher, just can't say enough amazing things about Act Deep. But he actually has a paper that as well as some others from the team and elsewhere that go into the essentially equivalence of activation steering and in context learning and how those are from a he thinks of everything in a cognitive neuroscience Bayesian framework, but basically how you can precisely show how. Prompting in context, learning and steering exhibit similar behaviors and even like get quantitative about the like magnitude of steering you would need to do to induce a certain amount of behavior similar to certain prompting, even for things like jailbreaks and stuff. It's a really cool paper. Are you saying steering is less powerful than prompting? More like you can almost write a formula that tells you how to convert between the two of them.Myra Deng [00:31:20]: And so like formally equivalent actually in the in the limit. Right.Mark Bissell [00:31:24]: So like one case study of this is for jailbreaks there. I don't know. Have you seen the stuff where you can do like many shot jailbreaking? You like flood the context with examples of the behavior. 
And then Anthropic put out that paper.Shawn Wang [00:31:38]: A lot of people were like, yeah, we've been doing this, guys.Mark Bissell [00:31:40]: Like, yeah, what's in this in-context learning and activation steering equivalence paper is that you can predict the number of examples that you will need to put in there in order to jailbreak the model, by doing steering experiments and using this sort of equivalence mapping. That's cool. That's really cool. It's very neat. Yeah.Shawn Wang [00:32:02]: I was going to say, I can back-rationalize that this makes sense, because what context is, is basically just updating the KV cache, and then every next-token inference is still the sum of everything all the way up to date, plus all the context. And you could, I guess, theoretically steer that, or probably replace that with your steering. The only problem is steering typically is on one layer, maybe three layers like you did. So it's not exactly equivalent.Mark Bissell [00:32:33]: Right, right. There's sort of, you need to get precise about how you define steering and how you're modeling the setup. But yeah, I've got the paper pulled up here. Belief dynamics reveal the dual nature. Yeah. The title is Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering. So Eric Bigelow and Dan Urgraft, who are doing fellowships at Goodfire, and Ekdeep's the final author there.Myra Deng [00:32:59]: I think actually, to your question of what is the production use case of steering, maybe just think one level beyond steering as it is today. Imagine if you could adapt your model to be an expert legal reasoner in almost real time, very quickly and efficiently, using human feedback, or using your semantic understanding of what the model knows and where it knows that behavior. While it's not clear what the product is at the end of the day, it's clearly very valuable. Thinking about what's the next interface for model customization and adaptation is a really interesting problem for us. We have heard from a lot of people who are actually interested in fine-tuning and RL for open-weight models in production. And so people are using things like Tinker or open source libraries to do that, but it's still very difficult to get models fine-tuned and RL'd for exactly what you want them to do unless you're an expert at model training. And so that's something we'reShawn Wang [00:34:06]: looking into. Yeah. I hadn't thought about that. So Tinker from Thinking Machines famously uses rank-one LoRA. Is that basically the same as steering? Like, what's the comparison there?Mark Bissell [00:34:19]: Well, so in that case, you are still applying updates to the parameters, right?Shawn Wang [00:34:25]: Yeah. You're not touching a base model. You're touching an adapter. It's kind of, yeah.Mark Bissell [00:34:30]: Right. But I guess it still is more in parameter space then. Maybe it's like, are you modifying the pipes, or are you modifying the water flowing through the pipes to get what you're after? That's maybe one way to put it.Mark Bissell [00:34:44]: I like that analogy.
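For readers who have not seen what editing a steering vector looks like mechanically, a common open source recipe adds a scaled direction to one layer's residual stream with a forward hook, roughly as below. The model name, layer index, scale, and direction are placeholder assumptions, not details of Goodfire's platform.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder; any HF causal LM with model.model.layers works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx, scale = 15, 6.0                        # arbitrary layer and steering strength
# In practice the direction might be a difference of mean activations between prompts
# with and without the behavior, or a row of an SAE decoder; a random vector is used
# here purely as a placeholder.
steering_vec = torch.randn(model.config.hidden_size)

def add_steering(module, inputs, output):
    hidden = output[0]                            # decoder layers return a tuple; hidden states come first
    hidden = hidden + scale * steering_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("Explain interest rates.", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=64)[0]))
finally:
    handle.remove()                               # detach the hook so later calls are unsteered
```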
That's my mental map of it at least, but it gets at this idea of model design and intentional design, which is something that we're very focused on. And I hope that we look back at how we're currently training and post-training models and just think what a primitive way of doing that this was. Like there's no intentionalityShawn Wang [00:35:06]: really in... It's just data, right? The only thing in control is what data we feed in.Mark Bissell [00:35:11]: So Dan from Goodfire likes to use this analogy: he has a couple of young kids, and he talks about, what if I could only teach my kids how to be good people by giving them cookies, or giving them a slap on the wrist if they do something wrong, without telling them why it was wrong or what they should have done differently? Just figure it out. Right. Exactly. So that's RL. Yeah. Right. And, you know, it's sample inefficient. What do they say, it's like slurping feedback, slurping supervision. Right. And so you'd like to get to the point where you can have experts giving feedback to their models that gets internalized, and steering is an inference-time way of getting at that idea. But ideally you're moving to a world whereVibhu Sapra [00:36:04]: it is much more intentional design in perpetuity for these models. Okay. This is one of the questions we asked Emmanuel from Anthropic on the podcast a few months ago. Basically the question was: you're at a research lab that does model training, foundation models, and you're on an interp team. How does it tie back? Do ideas come from the pre-training team? Do they flow back? So for those interested, you can watch that. There wasn't too much of a connection there, but it's still something they want toMark Bissell [00:36:33]: push for down the line. It can be useful for all of the above. Like there are certainly post-hocVibhu Sapra [00:36:39]: use cases where it doesn't need to touch that. I think the other thing a lot of people forget is this stuff isn't too computationally expensive, right? I would say, if you're interested in getting into research, mech interp is one of the most approachable fields. A lot of this, train an SAE, train a probe, the budget for it is modest, and there's already a lot done. There's a lot of open source work. You guys have done some too.Shawn Wang [00:37:04]: There are notebooks from the Gemini team, from Neel Nanda, like, this is how you do it. Just step through the notebook.Vibhu Sapra [00:37:09]: Even if you're not technical with any of this, you can still make progress. You can look at different activations. But if you do want to get into training this stuff, correct me if I'm wrong, it's in the thousands of dollars, it's not that high scale. And same with applying it, doing it for post-training; all this stuff is fairly cheap on the scale of, okay, I want to get into model training but I don't have compute for pre-training. So it's a very nice field to get into. And there are also a lot of open questions, right? Some of them have to do with, okay, I want a product, I want to solve this. But there's also just a lot of open-ended stuff that people could work on.
That's interesting. Right. I don't know if you guys have any calls for, what are the open questions, what's open work that you'd either welcome collaboration on or would just like to see solved, for people listening that want to get into mech interp, because people always talk about it. What are the things they should check out? And of course, join you guys as well, I'm sure you're hiring.Myra Deng [00:38:09]: There's a paper, I think from, was it Lee Sharkey? It's Open Problems in Mechanistic Interpretability, which I recommend everyone who's interested in the field read. It's just a really comprehensive overview of what the experts in the field think are the most important problems to be solved. I also think, to your point, it's been really inspiring to see a lot of young people getting interested in interpretability, and actually not just young people, also scientists who have been experts in physics for many years, or in biology, transitioning into interp, because the barrier to entry is in some ways low now and there's a lot of information out there and ways to get started. There's this anecdote of professors at universities saying that all of a sudden every incoming PhD student wants to study interpretability, which was not the case a few years ago. So it just goes to show how exciting the field is, how fast it's moving, how quick it is to get started, and things like that.Mark Bissell [00:39:10]: And also just a very welcoming community. You know, there's an open source mech interp Slack channel. People are always posting questions, and folks in the space are always responsive if you ask things on various forums and stuff. But yeah, the Open Problems paper is a really good one.Myra Deng [00:39:28]: For other people who want to get started, I think, you know, MATS is a great program. What's the acronym for? ML Alignment and Theory Scholars? It's like the...Vibhu Sapra [00:39:40]: Normally summer internship style.Myra Deng [00:39:42]: Yeah, but they've been doing it year round now. And actually a lot of our full-time staff have come through that program. It's great for anyone who is transitioning into interpretability. There are a couple of other fellows programs. We do one, as does Anthropic. And so those are great places to get started if anyone is interested.Mark Bissell [00:40:03]: Also, I think interp has been seen as a research field for a very long time. But engineers are sorely wanted for interpretability as well, especially at Goodfire, but elsewhere too, as it does scale up.Shawn Wang [00:40:18]: I should mention that Lee actually works with you guys, right? In the London office. And I'm adding our first ever mech interp track at AI Europe, because I see these industry applications now emerging, and I'm pretty excited to help push that along. Yeah, I was looking forward to that. It'll effectively be the first industry mech interp conference. Yeah. I'm so glad you added that. You know, it's still a little bit of a bet. It's not that widespread, but I can definitely see this is the time to really get into it. We want to be early on things.Mark Bissell [00:40:51]: For sure. And I think the field understands this, right?
So at ICML, I think the title of the mech interp workshop this year was Actionable Interpretability. And there was a lot of discussion around bringing it to various domains. Everyone's adding pragmatic, actionable, whatever.Shawn Wang [00:41:10]: It's like, okay, well, we weren't actionable before, I guess. I don't know.Vibhu Sapra [00:41:13]: And I mean, just being in Europe, you see the interp room at one of these old-school conferences; I think they had a very tiny room until they got lucky and got it doubled. But there's definitely a lot of interest, a lot of niche research. You see a lot of research coming out of universities, from students. We covered a paper last week with two unknown authors, not many citations. But you can do a lot of meaningful work there. Yeah.Shawn Wang [00:41:39]: Yeah. I think people haven't really mentioned this yet: interp for code. I think it's an abnormally important field. The conspiracy theory from two years ago, when the first SAE work came out of Anthropic, was that they would just use SAEs to turn the bad code vector down and then turn up the good code. And isn't that the dream? But, I guess, why is it funny? If it was realistic, it would not be funny; it would be like, no, actually, we should do this. But it's funny because we feel there are some limitations to what steering can do. And I think a lot of the public image of steering is the Gen Z stuff. Like, oh, you can make it really love the Golden Gate Bridge, or you can make it speak like Gen Z. To make it an expert legal reasoner seems like a huge stretch. Yeah. And I don't know if that will get there this way. Yeah.Myra Deng [00:42:36]: I will say we are announcing something very soon that I will not speak too much about. But this is what we've run into again and again: we don't want to be in a world where steering is only useful for stylistic things. That's definitely not what we're aiming for. But I think the types of interventions that you need to do to get to things like legal reasoning are much more sophisticated and require breakthroughs in learning algorithms. And that's, um...Shawn Wang [00:43:07]: And is this an emergent property of scale as well?Myra Deng [00:43:10]: I think so. Yeah. I mean, scale definitely helps. Scale allows you to learn a lot of information and reduce noise across large amounts of data. But we also think there are ways to do things much more effectively, even at scale. So, actually learning exactly what you want from the data and not learning things that you don't want exhibited in the data. We're not anti-scale, but we are also realizing that scale alone is not going to get us to the type of AI development that we want to be at in the future, as these models get more powerful and get deployed in all these sorts of mission-critical contexts. The current life cycle of training, deploying, and evaluating models is, to us, deeply broken and has opportunities to improve. So, more to come on that very, very soon.Mark Bissell [00:44:02]: And I think that's a use case, basically, or maybe just a proof point that these concepts do exist.
Like if you can manipulate them in the precise best way, you can get the ideal combination of them that you desire. And steering is maybe the most coarse-grained sort of peek at what that looks like. But I think it's evocative of what you could do if you had total surgical control over every concept, every parameter. Yeah, exactly.Myra Deng [00:44:30]: There were, like, bad code features. I've got it pulled up.Vibhu Sapra [00:44:33]: Yeah. Just coincidentally, as you guys are talking.Shawn Wang [00:44:35]: This is like, this is exactly it.Vibhu Sapra [00:44:38]: There's specifically a code error feature that activates, and they show it's not typo detection, it's typos in code, not typical typos. And you can see it clearly activates where there's something wrong in code. And they have malicious code, code error, a whole bunch of broken-down, fine-grained sub-features. Yeah.Shawn Wang [00:45:02]: Yeah. So the rough intuition for me, why I talked about post-training, was that you just have a few different rollouts with all these things turned off and on and whatever. And then that's synthetic data you can kind of post-train on. Yeah.Vibhu Sapra [00:45:13]: And I think we make it sound easier than it is just saying it, you know; they do the real hard work.Myra Deng [00:45:19]: I mean, you guys have the right idea. Exactly. Yeah. We replicated a lot of these features in our Llama models as well. I remember there was like.Vibhu Sapra [00:45:26]: And I think a lot of this stuff is open, right? You guys opened yours. DeepMind has opened a lot of SAEs on Gemma. Even Anthropic has opened a lot of this. There are a lot of resources that we can probably share for people that want to get involved.Shawn Wang [00:45:41]: Yeah. And special shout out to Neuronpedia as well. Yes. Amazing piece of work to visualize those things.Myra Deng [00:45:49]: Yeah, exactly.Shawn Wang [00:45:50]: I guess I wanted to pivot a little bit onto the healthcare side, because I think that's a big use case for you guys. We haven't really talked about it yet. This is a bit of a crossover for me because we do have a separate science pod that we're starting up for AI for science, just because it's such a huge investment category and also I'm less qualified to do it; we actually have bio PhDs to cover that, which is great. But I need to just kind of recap your work, maybe on the Evo 2 stuff, and then build forward.
What knowledge can the AIs teach us, as sort of the other direction of that? And so some of our life science work to date has been getting at exactly that question. Some of it does look like debugging these various life sciences models: understanding if they're actually performing well on tasks, or if they're picking up on spurious correlations. For instance, with genomics models, you would like to know whether they are focusing on the biologically relevant things that you care about, or using some simpler correlate, like the ancestry of the person it's looking at. But then also, in the instances where they are superhuman, maybe they are understanding elements of the human genome that we don't have names for, or have made specific discoveries that we don't know about; surfacing that is a big goal. And we're already seeing that, right? We are partnered with organizations like Mayo Clinic, a leading research health system in the United States, Arc Institute, as well as a startup called Prima Mente, which focuses on neurodegenerative disease. And in our partnership with them, we've taken foundation models they've been training and applied our interpretability techniques to find novel biomarkers for Alzheimer's disease. So I think this is just the tip of the iceberg, but that's a flavor of some of the things that we're working on.Shawn Wang [00:48:36]: Yeah, I think that's really fantastic. Obviously, we did the Chan Zuckerberg pod last year as well. And there's a plethora of these models coming out, because there's so much potential and research. And it's very interesting how it's basically the same as language models, but just with a different underlying data set. It's the same exact techniques. There's no change, basically.Mark Bissell [00:48:59]: Yeah. Well, and even in other domains, right? Like robotics, I know a lot of the companies just use Gemma as the backbone, and then they make it into a VLA that takes these actions. It's transformers all the way down. So yeah.Vibhu Sapra [00:49:15]: Like we have MedGemma now, right? This week, even, there was MedGemma 1.5. And they're training it on this stuff, like 3D scans, medical domain knowledge, and all that stuff, too. So there's a push from both sides. But one of the things about mech interp is that you're a little bit more cautious in some domains, right? Healthcare mainly being one: guardrails, understanding; we're more risk-averse to something going wrong there. So even just from a basic understanding, if we're trusting these systems to make claims, we want to know why and what's going on.Myra Deng [00:49:51]: Yeah, I think there's totally a kind of deployment bottleneck to actually using foundation models for real patient usage or things like that. Like, say you're using a model for rare disease prediction; you probably want some explanation as to why your model predicted a certain outcome, and an interpretable explanation at that. So that's definitely a use case. But I also think being able to extract scientific information that no human knows, to accelerate drug discovery and disease treatment and things like that, actually is a really, really big unlock for science, for scientific discovery. And you've seen a lot of startups say that they're going to accelerate scientific discovery.
And I feel like we actually are doing that through our interp techniques. And kind of almost by accident: we got reached out to very, very early on by these healthcare institutions, and none of us had healthcare backgrounds.Shawn Wang [00:50:49]: How did they even hear of you? A podcast.Myra Deng [00:50:51]: Oh, okay. Yeah, podcast.Vibhu Sapra [00:50:53]: Okay, well, now's that time, you know.Myra Deng [00:50:55]: Everyone can call us.Shawn Wang [00:50:56]: Podcasts are the most important thing. Everyone should listen to podcasts.Myra Deng [00:50:59]: Yeah, they reached out. They were like, you know, we have these really smart models that we've trained, and we want to know what they're doing. And we were really early at that time, like three months old, and it was a few of us. And we were like, oh my God, we've never used these models. Let's figure it out. But it's also great proof that interp techniques scale pretty well across domains. We didn't really have to learn too much about them.Shawn Wang [00:51:21]: Interp is a machine learning technique, and machine learning skills transfer everywhere, right? Yeah. It's just a general insight. Probably to finance too, I think, which would be fun given our history. I don't know if you have anything to say there.Mark Bissell [00:51:34]: Yeah, well, just across the sciences. We've also done work on materials science. Yeah, it really runs the gamut.Vibhu Sapra [00:51:40]: Yeah. Awesome. And, you know, for those that should reach out, you're obviously experts in this, but is there a call out for people that you're looking to partner with, design partners, people to use your stuff, outside of just the general developer that wants to plug and play steering stuff? On the research side, more so, are there ideal design partners, customers, stuff like that?Myra Deng [00:52:03]: Yeah, I can talk about maybe non-life sciences, and then I'm curious to hear from you on the life sciences side. But we're looking for design partners across many domains. Language: anyone who's customizing language models or trying to push the frontier of code or reasoning models is really interesting to us. And then we're also interested in the frontier of modeling. There are a lot of models that work in, like, pixel space, as we call it. So if you're doing world models, video models, even robotics, where there's not a very clean natural language interface to interact with, we think that interp can really help, and we're looking for a few partners in that space.Shawn Wang [00:52:43]: Just because you mentioned the keyword

The AI-Powered Biohub: Why Mark Zuckerberg & Priscilla Chan are Investing in Data, from Latent.Space

Play Episode Listen Later Feb 1, 2026 62:12


This crossover episode from the Latent Space podcast features Mark Zuckerberg and Priscilla Chan on the 10-year anniversary of the Chan Zuckerberg Initiative and their expanded Biohub vision. They discuss how a “Frontier Biology Lab” working in sync with a “Frontier AI Lab” could enable breakthroughs like a Virtual Cell and true N-of-1 precision medicine. The conversation covers the acquisition of Evolutionary Scale and ESM3, new biological data collection at scale, and how AI-powered biology might transform drug discovery and disease prevention. Sponsors: Blitzy: Blitzy is the autonomous code generation platform that ingests millions of lines of code to accelerate enterprise software development by up to 5x with premium, spec-driven output. Schedule a strategy session with their AI solutions consultants at https://blitzy.com Framer: Framer is an enterprise-grade website builder that lets business teams design, launch, and optimize their.com with AI-powered wireframing, real-time collaboration, and built-in analytics. Start building for free and get 30% off a Framer Pro annual plan at https://framer.com/cognitive Serval: Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week four at https://serval.com/cognitive Tasklet: Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai CHAPTERS: (00:00) About the Episode (04:27) CZI origins and focus (08:29) Why tools over cures (14:43) Virtual cells and imaging (Part 1) (20:19) Sponsors: Blitzy | Framer (23:24) Virtual cells and imaging (Part 2) (25:22) Data diversity and grounding (32:30) Evaluating models and Biohub (Part 1) (37:53) Sponsors: Serval | Tasklet (40:42) Evaluating models and Biohub (Part 2) (41:06) Future healthcare and aging (53:39) Modeling scales and immunity (58:53) Timelines, data, and collaboration (01:04:01) Outro PRODUCED BY: https://aipodcast.ing SOCIAL LINKS: Website: https://www.cognitiverevolution.ai Twitter (Podcast): https://x.com/cogrev_podcast Twitter (Nathan): https://x.com/labenz LinkedIn: https://linkedin.com/in/nathanlabenz/ Youtube: https://youtube.com/@CognitiveRevolutionPodcast Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431 Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

Machine Learning Street Talk
VAEs Are Energy-Based Models? [Dr. Jeff Beck]

Machine Learning Street Talk

Play Episode Listen Later Jan 25, 2026 46:56


What makes something truly *intelligent?* Is a rock an agent? Could a perfect simulation of your brain actually *be* you? In this fascinating conversation, Dr. Jeff Beck takes us on a journey through the philosophical and technical foundations of agency, intelligence, and the future of AI.Jeff doesn't hold back on the big questions. He argues that from a purely mathematical perspective, there's no structural difference between an agent and a rock – both execute policies that map inputs to outputs. The real distinction lies in *sophistication* – how complex are the internal computations? Does the system engage in planning and counterfactual reasoning, or is it just a lookup table that happens to give the right answers?*Key topics explored in this conversation:**The Black Box Problem of Agency* – How can we tell if something is truly planning versus just executing a pre-computed response? Jeff explains why this question is nearly impossible to answer from the outside, and why the best we can do is ask which model gives us the simplest explanation.*Energy-Based Models Explained* – A masterclass on how EBMs differ from standard neural networks. The key insight: traditional networks only optimize weights, while energy-based models optimize *both* weights and internal states – a subtle but profound distinction that connects to Bayesian inference.*Why Your Brain Might Have Evolved from Your Nose* – One of the most surprising moments in the conversation. Jeff proposes that the complex, non-smooth nature of olfactory space may have driven the evolution of our associative cortex and planning abilities.*The JEPA Revolution* – A deep dive into Yann LeCun's Joint Embedding Prediction Architecture and why learning in latent space (rather than predicting every pixel) might be the key to more robust AI representations.*AI Safety Without Skynet Fears* – Jeff takes a refreshingly grounded stance on AI risk. He's less worried about rogue superintelligences and more concerned about humans becoming "reward function selectors" – couch potatoes who just approve or reject AI outputs. His proposed solution? Use inverse reinforcement learning to derive AI goals from observed human behavior, then make *small* perturbations rather than naive commands like "end world hunger."Whether you're interested in the philosophy of mind, the technical details of modern machine learning, or just want to understand what makes intelligence *tick,* this conversation delivers insights you won't find anywhere else.---TIMESTAMPS:00:00:00 Geometric Deep Learning & Physical Symmetries00:00:56 Defining Agency: From Rocks to Planning00:05:25 The Black Box Problem & Counterfactuals00:08:45 Simulated Agency vs. 
Physical Reality00:12:55 Energy-Based Models & Test-Time Training00:17:30 Bayesian Inference & Free Energy00:20:07 JEPA, Latent Space, & Non-Contrastive Learning00:27:07 Evolution of Intelligence & Modular Brains00:34:00 Scientific Discovery & Automated Experimentation00:38:04 AI Safety, Enfeeblement & The Future of Work---REFERENCES:Concept:[00:00:58] Free Energy Principle (FEP)https://en.wikipedia.org/wiki/Free_energy_principle[00:06:00] Monte Carlo Tree Searchhttps://en.wikipedia.org/wiki/Monte_Carlo_tree_searchBook:[00:09:00] The Intentional Stancehttps://mitpress.mit.edu/9780262540537/the-intentional-stance/Paper:[00:13:00] A Tutorial on Energy-Based Learning (LeCun 2006)http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf[00:15:00] Auto-Encoding Variational Bayes (VAE)https://arxiv.org/abs/1312.6114[00:20:15] JEPA (Joint Embedding Prediction Architecture)https://openreview.net/forum?id=BZ5a1r-kVsf[00:22:30] The Wake-Sleep Algorithmhttps://www.cs.toronto.edu/~hinton/absps/ws.pdf---RESCRIPT:https://app.rescript.info/public/share/DJlSbJ_Qx080q315tWaqMWn3PixCQsOcM4Kf1IW9_EoPDF:https://app.rescript.info/api/public/sessions/0efec296b9b6e905/pdf

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah-Hill Smith

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jan 8, 2026 78:24


Happy New Year! You may have noticed that in 2025 we had moved toward YouTube as our primary podcasting platform. As we'll explain in the next State of Latent Space post, we'll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They then were one of the few Nat Friedman and Daniel Gross' AIGrant companies to raise a full seed round from them and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities.We have chatted with both Clementine Fourrier of HuggingFace's OpenLLM Leaderboard and (the freshly valued at $1.7B) Anastasios Angelopoulos of LMArena on their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use.George Cameron and Micah-Hill Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is “open” really?We discuss:* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers* The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints* How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. 
leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs* Omissions Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding ”I don't know”), and Claude models lead with the lowest hallucination rates despite not always being the smartest* GDP Val AA: their version of OpenAI's GDP-bench (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)* V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)Links to Artificial Analysis* Website: https://artificialanalysis.ai* George Cameron on X: https://x.com/georgecameron* Micah-Hill Smith on X: https://x.com/micahhsmithFull Episode on YouTubeTimestamps* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins* 01:19 Business Model: Independence and Revenue Streams* 04:33 Origin Story: From Legal AI to Benchmarking Need* 16:22 AI Grant and Moving to San Francisco* 19:21 Intelligence Index Evolution: From V1 to V3* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology* 13:52 Mystery Shopper Policy and Maintaining Independence* 28:01 New Benchmarks: Omissions Index for Hallucination Detection* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning* 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks* 50:19 Stirrup Agent Harness: Open Source Agentic Framework* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses* 58:25 The Smiling Curve: Cost Falling While Spend Rising* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions* 1:16:50 Closing: The Insatiable Demand for IntelligenceTranscriptMicah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time artificial analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing.swyx [00:00:17]: Which was January 2024. I don't even remember doing that, but yeah, it was very influential to me. 
Yeah, I'm looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it's an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing Fireworks, like, how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I've been following your progress. Congrats on... It's been an amazing year. You guys have really come together to be the presumptive new Gartner of AI, right? Which is something that...George [00:01:09]: Yeah, but you can't pay us for better results.swyx [00:01:12]: Yes, exactly.George [00:01:13]: Very important.Micah [00:01:14]: Start off with a spicy take.swyx [00:01:18]: Okay, how do I pay you?Micah [00:01:20]: Let's get right into that.swyx [00:01:21]: How do you make money?Micah [00:01:24]: Well, very happy to talk about that. So it's been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. We run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, and technologies across the AI stack for building stuff. We're very committed to doing that and intend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We've got just over 20 people now and two main customer groups. So we want to be who enterprises look to for data and insights on AI, and we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We've been very clear about that from the very start, because there's no use doing what we do unless it's independent AI benchmarking. Yeah. But it turns out a bunch of our stuff can be pretty useful to companies building AI stuff.swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that?George [00:02:53]: So we have a benchmarking and insights subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. So, for instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself, which is an example of the kind of decision that big enterprises face and that is hard to reason through; this AI stuff is really new to everybody. And so we try to help companies navigate that with our reports and insights subscription. We also do custom private benchmarking. And so that's very different from the public benchmarking that we publicize, where there's no commercial model around it. For private benchmarking, we'll at times create benchmarks, or run benchmarks to specs that enterprises want. And we'll also do that sometimes for AI companies who have built things, and we help them understand what they've built with private benchmarking.
Yeah. So that's a piece mainly that we've developed through trying to support everybody publicly with our public benchmarks. Yeah.swyx [00:04:09]: Let's talk about the tech stack behind that. But okay, I'm going to rewind all the way to when you guys started this project. You were all the way in Sydney? Yeah. Well, Sydney, Australia for me.Micah [00:04:19]: George was in SF, but he's Australian, but he moved here already. Yeah.swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting Artificial Analysis in the first place? You know, you started with public benchmarks. And so let's start there. We'll get to the private benchmarks. Yeah.George [00:04:33]: Why don't we even go back a little bit to why we thought that it was needed? Yeah.Micah [00:04:40]: The story kind of begins in 2022, 2023. Both George and I have been into AI stuff for quite a while. In 2023 specifically, I was trying to build a legal AI research assistant. It actually worked pretty well for its era, I would say. Yeah. So I was finding that the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem. So I had this multistage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it as you build that out, right? You're trying to think about accuracy, a bunch of other metrics, and performance and cost. And mostly, just no one was doing anything to independently evaluate all the models. And certainly not to look at the trade-offs for speed and cost. So we basically set out just to build a thing that developers could look at to see the trade-offs between all of those things, measured independently across all the models and providers. Honestly, it was probably meant to be a side project when we first started doing it.swyx [00:05:49]: Like we didn't get together and say, hey, we're going to stop working on all this stuff, this is going to be our main thing. When I first called you, I think you hadn't decided on starting a company yet.Micah [00:05:58]: That's actually true. I don't even think we'd paused anything; George still had his job. I didn't quit working on my legal AI thing. It was genuinely a side project.George [00:06:05]: We built it because we needed it as people building in the space, and thought, oh, other people might find it useful too. So we bought a domain, linked it to the Vercel deployment that we had, and tweeted about it. But very quickly it started getting attention. Thank you, Swyx, for doing an initial retweet and spotlighting the project that we released. And then very quickly, it was useful to others, but it became more useful as the number of models released accelerated. We had Mixtral 8x7B, and it was key. That's a fun one. Yeah. An open source model that really changed the landscape and opened up people's eyes to other serverless inference providers, and to thinking about speed, thinking about cost. And so that was key. And so it became more useful quite quickly. Yeah.swyx [00:07:02]: What I love about talking to people like you who sit across the ecosystem is, well, I have theories about what people want, but you have data, and that's obviously more relevant. But I want to stay on the origin story a little bit more.
When you started out, I would say the status quo at the time was every paper would come out and they would report their numbers versus competitor numbers, and that's basically it. And I remember I did the legwork. I think everyone has some version of an Excel sheet or a Google sheet where you just copy and paste the numbers from every paper and post it up there. And then sometimes they don't line up, because when they're independently run, your reproductions of other people's numbers are going to look worse, because you don't hold their models correctly or whatever the excuse is. I think then Stanford HELM, Percy Liang's project, would also have some of these numbers. And I don't know if there's any other source that you can cite. If I were to start Artificial Analysis at the same time you guys started, I would have used EleutherAI's eval harness. Yup.Micah [00:08:06]: Yup. That was some cool stuff. At the end of the day, running these evals, if it's a simple Q&A eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy. But it turns out there are an enormous number of things that you've got to control for. And I mean, back when we started the website, one of the reasons why we realized that we had to run the evals ourselves, and couldn't just take results from the labs, was that they would all prompt the models differently. And when you're competing over a few points, then you can pretty easily get- You can put the answer into the model. Yeah. That in the extreme. And you get crazy cases, like back when Google had Gemini 1.0 Ultra and needed a number that would say it was better than GPT-4, and constructed, I think never published, chain-of-thought examples, 32 of them, in every topic in MMLU, to run it and get the score. There are so many things that you- They never shipped Ultra, right? That's the one that never made it out. Not widely. Yeah. Yeah. I mean, I'm sure it existed, but yeah. So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models. Yeah. And we were also certain from the start that you couldn't look at those in isolation. You needed to look at them alongside the cost and performance stuff. Yeah.swyx [00:09:24]: Okay. A couple of technical questions. I mean, so obviously I also thought about this, and I didn't do it because of cost. Yep. Did you not worry about costs? Were you funded already? Clearly not, but you know. No. Well, we definitely weren't at the start.Micah [00:09:36]: So, I mean, we were paying for it personally at the start. Well, the numbers weren't nearly as bad a couple of years ago. So we certainly incurred some costs, but we were probably in the order of hundreds of dollars of spend across all the benchmarking that we were doing. Yeah. So nothing. It was kind of fine. These days that's gone up an enormous amount, for a bunch of reasons that we can talk about. But yeah, it wasn't that bad, because you have to remember that the number of models we were dealing with was hardly any, and the complexity of the stuff that we wanted to do to evaluate them was a lot less. We were just asking some Q&A type questions, and one specific thing was that for a lot of evals initially, we were just sampling an answer.
You know, like, what's the answer for this? We just wanted the answer directly, without letting the models think. We weren't even doing chain-of-thought stuff initially. And that was the most useful way to get some results initially. Yeah.swyx [00:10:33]: And so for people who haven't done this work, literally parsing the responses is a whole thing, right? Because sometimes the models can answer any way they see fit, and sometimes they actually do have the right answer, but they just returned the wrong format, and they will get a zero for that unless you work it into your parser. And that involves more work. And so, I mean, there's an open question whether you should give it points for not following your instructions on the format.Micah [00:11:00]: It depends what you're looking at, right? Because if you're trying to see whether or not it can solve a particular type of reasoning problem, and you don't want to test it on its ability to do answer formatting at the same time, then you might want to use an LLM-as-answer-extractor approach to make sure that you get the answer out no matter how it's phrased. But these days, it's mostly less of a problem. If you instruct a model and give it examples of what the answers should look like, it can get the answers in your format, and then you can do a simple regex.swyx [00:11:28]: Yeah, yeah. And then there are other questions around, I guess, multiple choice: sometimes there's a bias towards the first answer, so you have to randomize the options. All these nuances; once you dig into benchmarks, you're like, I don't know how anyone believes the numbers on all these things. It's such dark magic.Micah [00:11:47]: You've also got the different degrees of variance in different benchmarks, right? Yeah. So, if you run a four-option multi-choice eval on a modern reasoning model at the temperatures suggested by the labs for their own models, the variance that you can see is pretty enormous if you only do a single run of it, especially if it has a small number of questions. So one of the things that we do is run an enormous number of repeats of all of our evals when we're developing new ones and doing upgrades to our intelligence index to bring in new things. So that we can dial in the right number of repeats, so that we can get to the 95% confidence intervals that we're comfortable with, and so that when we pull that together, we can be confident in Intelligence Index to at least as tight as plus or minus one at 95% confidence. Yeah.swyx [00:12:32]: And, again, that just adds a straight multiple to the cost. Oh, yeah. Yeah, yeah.George [00:12:37]: So, that's one of many reasons that cost has gone up a lot more than linearly over the last couple of years. We report a cost to run the Artificial Analysis Intelligence Index on our website, and currently that's assuming one repeat in terms of how we report it, because we want to reflect a bit about the weighting of the index. But our cost is actually a lot higher than what we report there because of the repeats.swyx [00:13:03]: Yeah, yeah, yeah. And probably this is true, but just checking: you don't have any special deals with the labs. They don't discount it. You just pay out of pocket or out of your sort of customer funds. Oh, there is a mix.
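A toy version of the mechanics discussed here, regex answer extraction plus repeated runs to get a confidence interval, might look like the following sketch. The regex, the repeat count, and the ask_model callable are stand-ins for illustration, not Artificial Analysis's actual harness.

```python
import re
import statistics

def extract_answer(text: str):
    # Expect the model to end with "Answer: X"; pull out the letter if present.
    m = re.search(r"Answer:\s*([A-D])", text)
    return m.group(1) if m else None

def run_once(questions, ask_model) -> float:
    # `ask_model(prompt)` is a hypothetical callable returning the model's text output.
    correct = sum(extract_answer(ask_model(q["prompt"])) == q["gold"] for q in questions)
    return correct / len(questions)

def score_with_ci(questions, ask_model, repeats: int = 8):
    runs = [run_once(questions, ask_model) for _ in range(repeats)]
    mean = statistics.mean(runs)
    # Rough 95% interval from the standard error across repeats.
    half_width = 1.96 * statistics.stdev(runs) / (len(runs) ** 0.5)
    return mean, (mean - half_width, mean + half_width)
```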
So, the issue is that sometimes they may give you a special end point, which is… Ah, 100%.Micah [00:13:21]: Yeah, yeah, yeah. Exactly. So, we laser focus, like, on everything we do on having the best independent metrics and making sure that no one can manipulate them in any way. There are quite a lot of processes we've developed over the last couple of years to make that true for, like, the one you bring up, like, right here of the fact that if we're working with a lab, if they're giving us a private endpoint to evaluate a model, that it is totally possible. That what's sitting behind that black box is not the same as they serve on a public endpoint. We're very aware of that. We have what we call a mystery shopper policy. And so, and we're totally transparent with all the labs we work with about this, that we will register accounts not on our own domain and run both intelligence evals and performance benchmarks… Yeah, that's the job. …without them being able to identify it. And no one's ever had a problem with that. Because, like, a thing that turns out to actually be quite a good… …good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either.swyx [00:14:23]: That's true. I never thought about that. I've been in the database data industry prior, and there's a lot of shenanigans around benchmarking, right? So I'm just kind of going through the mental laundry list. Did I miss anything else in this category of shenanigans? Oh, potential shenanigans.Micah [00:14:36]: I mean, okay, the biggest one, like, that I'll bring up, like, is more of a conceptual one, actually, than, like, direct shenanigans. It's that the things that get measured become things that get targeted by labs that they're trying to build, right? Exactly. So that doesn't mean anything that we should really call shenanigans. Like, I'm not talking about training on test set. But if you know that you're going to be great at another particular thing, if you're a researcher, there are a whole bunch of things that you can do to try to get better at that thing that preferably are going to be helpful for a wide range of how actual users want to use the thing that you're building. But will not necessarily work. Will not necessarily do that. So, for instance, the models are exceptional now at answering competition maths problems. There is some relevance of that type of reasoning, that type of work, to, like, how we might use modern coding agents and stuff. But it's clearly not one for one. So the thing that we have to be aware of is that once an eval becomes the thing that everyone's looking at, scores can get better on it without there being a reflection of overall generalized intelligence of these models. Getting better. That has been true for the last couple of years. It'll be true for the next couple of years. There's no silver bullet to defeat that other than building new stuff to stay relevant and measure the capabilities that matter most to real users. Yeah.swyx [00:15:58]: And we'll cover some of the new stuff that you guys are building as well, which is cool. Like, you used to just run other people's evals, but now you're coming up with your own. And I think, obviously, that is a necessary path once you're at the frontier. You've exhausted all the existing evals. I think the next point in history that I have for you is AI Grant that you guys decided to join and move here. What was it like? I think you were in, like, batch two? Batch four. Batch four. 
Okay.Micah [00:16:26]: I mean, it was great. Nat and Daniel are obviously great. And it's a really cool group of companies that we were in AI Grant alongside. It was really great to get Nat and Daniel on board. Obviously, they've done a whole lot of great work in the space with a lot of leading companies, and they were extremely aligned with the mission of what we were trying to do. Like, we're not quite typical of a lot of the other AI startups that they've invested in.swyx [00:16:53]: And they were very much here for the mission of what we want to do. Did they give any advice that really affected you in some way, or were any of the events very impactful? That's an interesting question.Micah [00:17:03]: I mean, I remember fondly a bunch of the speakers who came and did fireside chats at AI Grant.swyx [00:17:09]: Which is also, like, a crazy list. Yeah.George [00:17:11]: Oh, totally. Yeah, yeah, yeah. There was something about speaking to Nat and Daniel about the challenges of working through a startup, working through the questions that don't have clear answers, and how to work through those methodically and just work through the hard decisions. And they've been great mentors to us as we've built Artificial Analysis. Another benefit for us was that other companies in the batch and other companies in AI Grant are pushing the capabilities of what AI can do at this time. And so being in contact with them, making sure that Artificial Analysis is useful to them, has been fantastic for supporting us in working out how we should build out Artificial Analysis to continue being useful to those building on AI.swyx [00:17:59]: I think to some extent, I'm of mixed opinion on that one, because to some extent, your target audience is not people in AI Grant who are obviously at the frontier. Yeah. Do you disagree?Micah [00:18:09]: To some extent. To some extent. But then, a lot of what the AI Grant companies are doing is taking capabilities coming out of the labs and trying to push the limits of what they can do across the entire stack for building great applications, which actually makes some of them pretty archetypical power users of Artificial Analysis. Some of the people with the strongest opinions about what we're doing well and what we're not doing well and what they want to see next from us. Yeah. Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models. You're maybe switching reasonably frequently between different models for different parts of your application, to optimize what you're able to do with them at an accuracy level and to get better speed and cost characteristics. So for many of them, no, they're not commercial customers of ours, because we don't charge for all our data on the website. Yeah. But they are absolutely some of our power users.swyx [00:19:07]: So let's talk about the evals as well. So you started out from the general MMLU and GPQA stuff. What's next? How do you build up to the overall index? What was in V1 and how did you evolve it? Okay.Micah [00:19:22]: So first, just as background, we're talking about the Artificial Analysis Intelligence Index, which is our synthesis metric that we pull together currently from 10 different eval data sets to give what we're pretty confident is the best single number to look at for how smart the models are.
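Mechanically, synthesizing ten data sets into one number is usually just a weighted average of per-eval scores. A toy illustration with invented eval names and weights, not the real Intelligence Index recipe:

```python
# Invented per-eval scores (0-100) and weights; not the real Intelligence Index recipe.
eval_scores = {"mmlu_pro": 78.2, "gpqa_diamond": 61.5, "agentic_tasks": 44.0, "long_context": 52.3}
weights     = {"mmlu_pro": 0.30, "gpqa_diamond": 0.30, "agentic_tasks": 0.25, "long_context": 0.15}

index = sum(eval_scores[name] * w for name, w in weights.items())
print(f"composite index: {index:.1f}")
```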
Obviously, it doesn't tell the whole story. That's why we published the whole website of all the charts to dive into every part of it and look at the trade-offs. But best single number. So right now, it's got a bunch of Q&A type data sets that have been very important to the industry, like a couple that you just mentioned. It's also got a couple of agentic data sets. It's got our own long context reasoning data set and some other use case focused stuff. As time goes on, the things that we're most interested in, that are going to be important to the capabilities that are becoming more important for AI and what developers are caring about, are going to be first around agentic capabilities. So surprise, surprise. We're all loving our coding agents, and how the models perform on that, and then doing similar things for different types of work, are really important to us. Linking to use cases, to economically valuable use cases, is extremely important to us. And then we've got some of these things that the models still struggle with, like working really well over long contexts, that are not going to go away as specific capabilities and use cases that we need to keep evaluating. swyx [00:20:46]: But I guess one thing I was driving at was the V1 versus the V2 and how it saturated over time. Micah [00:20:53]: Like how we've changed the index to where we are. swyx [00:20:55]: And I think that reflects on the change in the industry. Right. So that's a nice way to tell that story. Micah [00:21:00]: Well, V1 would be completely saturated right now by almost every model coming out, because doing things like writing the Python functions in HumanEval is now pretty trivial. It's easy to forget, actually, I think, how much progress has been made in the last two years. Like, we obviously play the game constantly of today's version versus last week's version and the week before, and all of the small changes in the horse race between the current frontier and who has the best, like, smaller-than-10B model right now this week. Right. And that's very important to a lot of developers and people, especially in this particular city of San Francisco. But when you zoom out a couple of years, literally most of what we were doing to evaluate the models then would all be 100% solved by even pretty small models today. And that's been one of the key things, by the way, that's driven down the cost of intelligence at every tier of intelligence. We can talk about that more in a bit. So V1, V2, V3, we made things harder. We covered a wider range of use cases. And we tried to get closer to things developers care about, as opposed to just the Q&A type stuff that MMLU and GPQA represented. Yeah. swyx [00:22:12]: I don't know if you have anything to add there. Or we could just go right into showing people the benchmark and looking around and asking questions about it. Yeah. Micah [00:22:21]: Let's do it. Okay. This would be a pretty good way to chat about a few of the new things we've launched recently. Yeah. George [00:22:26]: And I think a little bit about the direction that we want to take it, and where we want to push benchmarks. Currently, the Intelligence Index and evals focus a lot on kind of raw intelligence. But we want to diversify how we think about intelligence. And we can talk about it. But new evals that we've built and partnered on focus on topics like hallucination. And we've got a lot of topics that I think are not covered by the current eval set that should be.
And so we want to bring that forth. But before we get into that. swyx [00:23:01]: And so for listeners, just as a timestamp, right now, number one is Gemini 3 Pro High, then followed by Claude Opus at 70, GPT-5.1 High. You don't have 5.2 yet. And Kimi K2 Thinking. Wow. Still hanging in there. So those are the top four. That will date this podcast quickly. Yeah. Yeah. I mean, I love it. I love it. No, no. 100%. Look back this time next year and go, how cute. Yep. George [00:23:25]: Totally. A quick view of that is, okay, there's a lot. I love it. I love this chart. Yeah. Micah [00:23:30]: This is such a favorite, right? Yeah. And almost every talk that George or I give at conferences and stuff, we always put this one up first just to talk about situating where we are in this moment in history. This, I think, is the visual version of what I was saying before about zooming out and remembering how much progress there's been. If we go back to just over a year ago, before o1, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing. And the game was very, very different. If we go back even a little bit before then, we're in the era where, when you look at this chart, OpenAI was untouchable for well over a year. And, I mean, you would remember that time period well, of there being very open questions about whether or not AI was going to be competitive, like, full stop, whether or not OpenAI would just run away with it, whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs. I am quite happy overall that the world that we have ended up in is one where... Multi-model. Absolutely. And strictly more competitive every quarter over the last few years. Yeah. This year has been insane. Yeah. George [00:24:42]: You can see it. This chart with everything added is hard to read currently. There are so many dots on it, but I think it reflects a little bit what we felt, like how crazy it's been. swyx [00:24:54]: Why 14 as the default? Is that a manual choice? Because you've got ServiceNow in there, which is a less traditional name. Yeah. George [00:25:01]: It's models that we're kind of highlighting by default in our charts, in our Intelligence Index. Okay. swyx [00:25:07]: You just have a manually curated list of stuff. George [00:25:10]: Yeah, that's right. But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose which models are highlighted. Yeah. And so if we take off a few names, it gets a little easier to read. swyx [00:25:25]: Yeah, yeah. A little easier to read. Totally. Yeah. But I love that you can see the o1 jump. Look at that. September 2024. And the DeepSeek jump. Yeah. George [00:25:34]: Which got close to OpenAI's leadership. They were so close. I think, yeah, we remember that moment. Around this time last year, actually. Micah [00:25:44]: Yeah, yeah, yeah. I agree. Yeah, well, a couple of weeks off. It was Boxing Day in New Zealand when DeepSeek V3 came out. And we'd been tracking DeepSeek and a bunch of the other global players that were less known over the second half of 2024, and had run evals on the earlier ones and stuff. I very distinctly remember Boxing Day in New Zealand, because I was with family for Christmas and stuff, running the evals and getting back result by result on DeepSeek V3. So this was the first of their V3 architecture, the 671B MoE. Micah [00:26:19]: And we were very, very impressed.
That was the moment where we were sure that DeepSeek was no longer just one of many players, but had jumped up to be a thing. The world really noticed when they followed that up with the RL working on top of V3 and R1 succeeding a few weeks later. But the groundwork for that absolutely was laid with just an extremely strong base model, completely open weights, which we had as the best open weights model. So, yeah, that's the thing that really stood out to us on Boxing Day last year. George [00:26:48]: Boxing Day is the day after Christmas, for those not familiar. George [00:26:54]: I'm from Singapore. swyx [00:26:55]: A lot of us remember Boxing Day for a different reason, for the tsunami that happened. Oh, of course. Yeah, but that was a long time ago. So yeah. So this is the rough pitch of AAQI. Is it A-A-Q-I or A-A-I-I? I-I. Okay. Good memory, though. Micah [00:27:11]: I don't know. I'm not used to it. Once upon a time, we did call it Quality Index, and we would talk about quality, performance, and price, but we changed it to intelligence. George [00:27:20]: There's been a few naming changes. We added hardware benchmarking to the site, and so benchmarks at a kind of system level. And so then we changed our throughput metric to, we now call it output speed, and then swyx [00:27:32]: throughput makes sense at a system level, so we took that name. Take me through more charts. What should people know? Obviously, the way you look at the site is probably different than how a beginner might look at it. Micah [00:27:42]: Yeah, that's fair. There's a lot of fun stuff to dive into. Maybe we can skip past, like, we have lots and lots of evals and stuff. The interesting ones to talk about today that would be great to bring up are a few of our recent things, I think, that probably not many people will be familiar with yet. So the first one of those is our Omniscience index. This one is a little bit different to most of the intelligence evals that we've run. We built it specifically to look at the embedded knowledge in the models, and to test hallucination by looking at, when the model doesn't know the answer, so it's not able to get it correct, what's its probability of saying, I don't know, versus giving an incorrect answer. So the metric that we use for Omniscience goes from negative 100 to positive 100, because we're simply taking off a point if you give an incorrect answer to the question. We're pretty convinced that this is an example of where it makes most sense to do that, because it's strictly more helpful to say, I don't know, instead of giving a wrong answer to a factual knowledge question. And one of our goals is to shift the incentive that evals create for models, and the labs creating them, to get higher scores. Almost every eval across all of AI up until this point has been graded by simple percentage correct as the main metric, the main thing that gets hyped. And so you should take a shot at everything; there's no incentive to say, I don't know. So we did that for this one here.
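A minimal sketch of that scoring rule, as described: plus one point for a correct answer, minus one for an incorrect answer, zero for an abstention, scaled to a -100 to +100 index, with the hallucination rate taken as the share of not-correct answers that were wrong rather than "I don't know". The label names below are assumptions; the grading of each answer happens upstream.

```python
# Sketch of the Omniscience-style scoring described above (illustrative).
from collections import Counter

def omniscience_metrics(labels: list[str]) -> dict[str, float]:
    """labels: one of 'correct', 'incorrect', or 'abstain' per question."""
    counts = Counter(labels)
    n = len(labels)
    # +1 for correct, -1 for incorrect, 0 for abstaining, scaled to -100..100.
    index = 100.0 * (counts["correct"] - counts["incorrect"]) / n
    not_correct = counts["incorrect"] + counts["abstain"]
    # When the model doesn't know, how often does it guess wrong vs. say "I don't know"?
    hallucination_rate = counts["incorrect"] / not_correct if not_correct else 0.0
    return {
        "accuracy": counts["correct"] / n,
        "omniscience_index": index,
        "hallucination_rate": hallucination_rate,
    }

print(omniscience_metrics(["correct", "abstain", "incorrect", "correct"]))
```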
You put it like a JSON field, say, say confidence, and maybe it spits out something. Yeah. You know, we have done a few evals podcasts over the years. And when we did one with Clémentine of Hugging Face, who maintains the Open LLM Leaderboard, this was one of her top requests, some kind of hallucination slash lack-of-confidence calibration thing. And so, hey, this is one of them. Micah [00:30:05]: And I mean, like anything that we do, it's not a perfect metric or the whole story of everything that you think about as hallucination. But yeah, it's pretty useful and has some interesting results. Like, one of the things that we saw in the hallucination rate is that Anthropic's Claude models are at the very left-hand side here, with the lowest hallucination rates out of the models that we've evaluated Omniscience on. That is an interesting fact. I think it probably correlates with a lot of the previously not really measured vibes stuff that people like about some of the Claude models. Is the dataset public, or is there a held-out set? There's a held-out set for this one. So we have published a public test set, but we've only published 10% of it. The reason is that for this one here specifically, it would be very, very easy to have data contamination, because it is just factual knowledge questions. We'll update it over time to also prevent that, but yeah, we've kept most of it held out so that we can keep it reliable for a long time. It lets us do a bunch of really cool things, including breaking down quite granularly by topic. And so we've got some of that disclosed on the website publicly right now, and there's lots more coming in terms of our ability to break out very specific topics. Yeah. swyx [00:31:23]: I would be interested. Let's dwell a little bit on this hallucination one. I noticed that Haiku hallucinates less than Sonnet, which hallucinates less than Opus. And yeah. Would that be the other way around in a normal capability environment? I don't know. What do you make of that? George [00:31:37]: One interesting aspect is that we've found that there's not really a strong correlation between intelligence and hallucination, right? That's to say, how smart the models are in a general sense isn't correlated with their ability to, when they don't know something, say that they don't know. It's interesting that Gemini 3 Pro Preview was a big leap over here, over Gemini 2.5 Flash and 2.5 Pro. And if I add Pro quickly here. swyx [00:32:07]: I bet Pro's really good. Uh, actually no, I meant the GPT Pros. George [00:32:12]: Oh yeah. swyx [00:32:13]: Because the GPT Pros are rumored, we don't know for a fact, to be like eight runs and then with an LLM judge on top. Yeah. George [00:32:20]: So we saw a big jump in, this is accuracy, so this is just the percent that they get correct, and Gemini 3 Pro knew a lot more than the other models. And so, big jump in accuracy, but relatively no change between the Google Gemini models, between releases, in the hallucination rate. Exactly. And so it's likely just a kind of different post-training recipe for the Claude models. Yeah. Micah [00:32:45]: Um, that's what's driven this. Yeah. You can partially blame us and how we define intelligence, having until now not counted hallucination as a negative in the way that we think about intelligence. swyx [00:32:56]: And so that's what we're changing.
Uh, I know many smart people who are confidently incorrect. George [00:33:02]: Uh, look, look at that. That is very human. Very true. And there's a time and a place for that. I think our view is that hallucination rate makes sense in this context, where it's around knowledge, but in many cases, people want the models to hallucinate, to have a go. Often that's the case in coding, or when you're trying to generate newer ideas. One eval that we added to Artificial Analysis is Critical Point, and it's really hard physics problems. Okay. swyx [00:33:32]: And is it sort of like a HumanEval type, or something different, or like a FrontierMath type? George [00:33:37]: It's not dissimilar to FrontierMath. So these are kind of research questions that academics in the physics world would be able to answer, but models really struggle to answer. So the top score here is only 9%. swyx [00:33:51]: And the people that created this, like Minway and actually Ofir, who was kind of behind SWE-bench, and what organization is this? Oh, is this, it's Princeton. George [00:34:01]: A range of academics from different academic institutions, really smart people. They talked about how they turn the models up in terms of the temperature, as high a temperature as they can, when they're trying to explore kind of new ideas in physics with the model as a thought partner, just because they want the models to hallucinate. Um, yeah, sometimes it's something new. Yeah, exactly. swyx [00:34:21]: Um, so not right in every situation, but I think it makes sense, you know, to test hallucination in scenarios where it makes sense. Also, the obvious question is, this is one of many, in that every lab has a system card that shows some kind of hallucination number, and you've chosen to not endorse that and you've made your own. And I think that's a choice. Um, totally, in some sense, the rest of Artificial Analysis is public benchmarks that other people can independently rerun. You provide it as a service here. You have to fight the, well, who are we to do this? And your answer is that we have a lot of customers, and, you know, but, like, I guess, how do you convince the individual? Micah [00:35:08]: I mean, I think for hallucinations specifically, there are a bunch of different things that you might care about reasonably, and that you'd measure quite differently. Like, we've called this an Omniscience hallucination rate, not trying to declare that, like, it's humanity's last hallucination. You could have some interesting naming conventions and all this stuff. Um, the biggest-picture answer to that, and it's something that I actually wanted to mention just as George was explaining Critical Point as well, is that as we go forward, we are building evals internally. We're partnering with academia and partnering with AI companies to build great evals. We have pretty strong views, in various ways for different parts of the AI stack, on where there are things that are not being measured well, or things that developers care about that should be measured more and better. And we intend to be doing that. We're not obsessed with the idea that everything we do, we have to do entirely within our own team. Critical Point is a cool example of where we were a launch partner for it, working with academia, and we've got some partnerships coming up with a couple of leading companies.
Those ones, obviously, we have to be careful with on some of the independent stuff, but with the right disclosure, we're completely comfortable with that. A lot of the labs have released great data sets in the past that we've used to great success independently. And so between all of those techniques, we're going to be releasing more stuff in the future. Cool. swyx [00:36:26]: Let's cover the last couple. And then I want to talk about your trends analysis stuff, you know? Totally. Micah [00:36:31]: So actually, I have one little factoid on Omniscience. If you go back up to accuracy on Omniscience, an interesting thing about this accuracy metric is that it tracks the total parameter count of models more closely than anything else that we measure. Makes a lot of sense intuitively, right? Because this is a knowledge eval. This is the pure knowledge metric. We're not looking at the index and the hallucination rate stuff that we think is much more about how the models are trained. This is just: what facts did they recall? And yeah, it tracks parameter count extremely closely. Okay. swyx [00:37:05]: What's the rumored size of Gemini 3 Pro? And to be clear, not confirmed by any official source, just rumors. But rumors do fly around. Rumors. I hear all sorts of numbers. I don't know what to trust. Micah [00:37:17]: So if you draw the line on Omniscience accuracy versus total parameters, we've got all the open weights models, and you can squint and see that likely the leading frontier models right now are quite a lot bigger than the roughly one trillion parameters that the open weights models we're looking at here cap out at. There's an interesting extra data point that Elon Musk revealed recently about xAI: three trillion parameters for Grok 3 and 4, six trillion for Grok 5, but that's not out yet. Take those together, have a look. You might reasonably form a view that there's a pretty good chance that Gemini 3 Pro is bigger than that, that it could be in the 5 to 10 trillion parameter range. To be clear, I have absolutely no idea, but just based on this chart, that's where you would land if you have a look at it. Yeah. swyx [00:38:07]: And to some extent, I actually kind of discourage people from guessing too much, because what does it really matter? As long as they can serve it at a sustainable cost, that's about it. Like, yeah, totally. George [00:38:17]: They've also got different incentives in play compared to, like, open weights models, who are thinking about supporting others in self-deployment. For the labs who are doing inference at scale, it's, I think, less about total parameters in many cases when thinking about inference costs, and more around the number of active parameters. And so there's a bit of an incentive towards larger, sparser models. Agreed. Micah [00:38:38]: Understood. Yeah. Great. I mean, obviously, if you're a developer or a company using these things, exactly as you say, it doesn't matter. You should be looking at all the different ways that we measure intelligence. You should be looking at the cost to run the index and the different ways of thinking about token efficiency and cost efficiency based on the list prices, because that's all that matters. swyx [00:38:56]: It's not as good for the content creator rumor mill, where I can say, oh, GPT-4 is this small circle, look, GPT-5 is this big circle. That used to be a thing for a while.
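The "squint at the line" reasoning can be made concrete with a toy fit of accuracy against log parameter count for open weights models, then reading off what a frontier score would imply. Every number below is invented for illustration; this is not an estimate of any real model's size.

```python
# Toy extrapolation: fit knowledge-eval accuracy vs. log10(total parameters)
# on open weights models, then invert the fit for a hypothetical frontier score.
# All data points and the frontier score are invented for illustration.
import math

open_weights_points = [  # (total parameters, accuracy %) - invented values
    (70e9, 28.0),
    (235e9, 36.0),
    (671e9, 43.0),
    (1000e9, 47.0),
]
xs = [math.log10(p) for p, _ in open_weights_points]
ys = [acc for _, acc in open_weights_points]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

hypothetical_frontier_accuracy = 56.0
implied_params = 10 ** ((hypothetical_frontier_accuracy - intercept) / slope)
print(f"implied total parameters: {implied_params / 1e12:.1f}T")
```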
Yeah. Micah [00:39:07]: But that is, like, on its own actually a very interesting one, right? That is, it's just purely that chances are the last couple of years haven't seen a dramatic scaling up in the total size of these models. And so there's a lot of room to go up properly in total size of the models, especially with the upcoming hardware generations. Yes. swyx [00:39:29]: So, you know, taking off my shitposting hat for a minute. Yes. Yes. At the same time, I do feel like, you know, especially coming back from Europe, people do feel like Ilya is probably right that the paradigm doesn't have many more orders of magnitude to scale out, and therefore we need to start exploring at least a different path. GDPval, I think, is only like a month or so old. I was also very positive when it first came out. I actually talked to Tejal, who was the lead researcher on that. Oh, cool. And you have your own version. George [00:39:59]: It's a fantastic data set. Yeah. swyx [00:40:01]: And maybe we'll recap for people who are out of the loop. It's like 44 occupations based on some kind of GDP cutoff, meant to represent broad white-collar work that is not just coding. Yeah. Micah [00:40:12]: Each of the tasks has a whole bunch of detailed instructions, and some input files for a lot of them. Within the 44, it's divided into like 220 to 225, maybe, subtasks that are the level that we run through the agentic harness. And yeah, they're really interesting. I will say that it doesn't necessarily capture all the stuff that people do at work. No eval is perfect; there are always going to be more things to look at, largely because in order to make the tasks well enough defined that you can run them, they need to only have a handful of input files and very specific instructions for that task. And so I think the easiest way to think about them is that they're like quite hard take-home exam tasks that you might do in an interview process. swyx [00:40:56]: Yeah, for listeners, it is no longer like a long prompt. It is like, well, here's a zip file with a spreadsheet or a PowerPoint deck or a PDF, go nuts and answer this question. George [00:41:06]: OpenAI released a great data set, and they released a good paper which looks at performance across the different web chatbots on the data set. It's a great paper, I encourage people to read it. What we've done is taken that data set and turned it into an eval that can be run on any model. So we created a reference agentic harness that can run the models on the data set, and then we developed an evaluator approach to compare outputs. That's kind of AI-enabled, so it uses Gemini 3 Pro Preview to compare results, which we tested pretty comprehensively to ensure that it's aligned to human preferences. One data point there is that even with Gemini 3 Pro as the evaluator, Gemini 3 Pro itself, interestingly, doesn't actually do that well. So that's kind of a good example of what we've done in GDPval-AA. swyx [00:42:01]: Yeah, the thing that you have to watch out for with an LLM judge is self-preference, that models usually prefer their own output, and in this case, it was not the case.
Totally. Micah [00:42:08]: I think the way that we're thinking about the places where it makes sense to use an LLM-as-judge approach now is quite different to some of the early LLM-as-judge stuff a couple of years ago, because some of that, and MT-Bench was a great project that was a good example of some of this a while ago, was about judging conversations and a lot of style-type stuff. Here, the task that the grading model is doing is quite different to the task of taking the test. When you're taking the test, you've got all of the agentic tools you're working with, the code interpreter and web search, the file system, to go through many, many turns to try to create the documents. Then on the other side, when we're grading it, we're running it through a pipeline to extract visual and text versions of the files and be able to provide that to Gemini, and we're providing the criteria for the task and getting it to pick which one, out of two potential outputs, more effectively meets the criteria of the task. Yeah. It turns out that it's just very, very good at getting that right, matching human preference a lot of the time, because I think it's got the raw intelligence, but it's combined with the correct representation of the outputs, the fact that the outputs were created with an agentic task that is quite different to the way the grading model works, and we're comparing against criteria, not just kind of zero-shot asking the model to pick which one is better. swyx [00:43:26]: Got it. Why is this an Elo and not a percentage, like GDPval? George [00:43:31]: So the outputs look like documents, and there are video outputs or audio outputs from some of the tasks. It has to make a video? Yeah, for some of the tasks. Some of the tasks. swyx [00:43:43]: What task is that? George [00:43:45]: I mean, it's in the data set. Like be a YouTuber? It's a marketing video. Micah [00:43:49]: Oh, wow. What? Like, the model has to go find clips on the internet and try to put it together. The models are not that good at doing that one, for now, to be clear. It's pretty hard to do that with a code interpreter. I mean, the computer use stuff doesn't work quite well enough, and so on and so on, but yeah. George [00:44:02]: And so there's no kind of ground truth, necessarily, to compare against, to work out percentage correct. It's hard to come up with correct or incorrect there. And so it's on a relative basis, and we use an Elo approach to compare outputs from each of the models across the tasks. swyx [00:44:23]: You know what you should do? You should pay a contractor, a human, to do the same task, and then give it an Elo, so you have a human in there. It's just, I think what's helpful about GDPval, the OpenAI one, is that 50% is meant to be a normal human, and maybe a domain expert is higher than that, but 50% was the bar: if you've crossed 50, you are superhuman. Yeah. Micah [00:44:47]: So we haven't grounded this score in that exactly. I agree that it can be helpful, but we wanted to generalize this to a very large number of models. It's one of the reasons that presenting it as an Elo is quite helpful and allows us to add models, and it'll stay relevant for quite a long time. I also think it can be tricky looking at these exact tasks compared to human performance, because the way that you would go about it as a human is quite different to how the models would go about it.
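A minimal sketch of how pairwise judge preferences can be turned into Elo-style ratings, in the spirit of the approach described above. The judge function here is a placeholder; the real pipeline renders the document outputs, supplies the task criteria, and uses a grading model to pick the better of two outputs.

```python
# Sketch: pairwise LLM-judge preferences aggregated into Elo ratings (illustrative).
import itertools
import random

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16.0) -> tuple[float, float]:
    """Standard Elo update for one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * ((1.0 - score_a) - (1.0 - expected_a))

def judge_prefers_a(criteria: str, output_a: str, output_b: str) -> bool:
    """Placeholder for a grading model that picks which output better meets the criteria."""
    return random.random() < 0.5  # a real judge-model call would go here

def rate_models(outputs: dict[str, list[str]], criteria: list[str]) -> dict[str, float]:
    """outputs maps model name -> one output per task; returns Elo-style ratings."""
    ratings = {name: 1000.0 for name in outputs}
    for m1, m2 in itertools.combinations(outputs, 2):
        for i, task_criteria in enumerate(criteria):
            a_won = judge_prefers_a(task_criteria, outputs[m1][i], outputs[m2][i])
            ratings[m1], ratings[m2] = elo_update(ratings[m1], ratings[m2], a_won)
    return ratings
```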
Yeah. swyx [00:45:15]: I also liked that you included Llama 4 Maverick in there. Is that, like, just one last, like... Micah [00:45:20]: Well, no, no, no, it is the best model released by Meta. And... So it makes it into the homepage default set, still, for now. George [00:45:31]: Another inclusion that's quite interesting is that we also ran it across the latest versions of the web chatbots. And so we have... swyx [00:45:39]: Oh, that's right. George [00:45:40]: Oh, sorry. swyx [00:45:41]: I, yeah, I completely missed that. Okay. George [00:45:43]: No, not at all. So that's the one with a checkered pattern. So that is their harness, not yours, is what you're saying. Exactly. And what's really interesting is that if you compare, for instance, Claude 4.5 Opus using the Claude web chatbot, it performs worse than the model in our agentic harness. And so in every case, the model performs better in our agentic harness than its web chatbot counterpart, the harness that they created. swyx [00:46:13]: Oh, my backwards explanation for that would be that, well, it's meant for consumer use cases, and here you're pushing it for something else. Micah [00:46:19]: The constraints are different, and the amount of freedom that you can give the model is different. Also, they, like, have a cost goal. We let the models work as long as they want, basically. Yeah. Do you copy-paste manually into the chatbot? Yeah. Yeah. That was how we got the chatbot reference. We're not going to be keeping those updated at quite the same scale as hundreds of models. swyx [00:46:38]: Well, so I don't know, talk to Browserbase. They'll automate it for you. You know, I have thought about, like, well, we should turn these chatbot versions into an API, because they are legitimately different agents in themselves. Yes. Right. Yeah. Micah [00:46:53]: And that's grown a huge amount over the last year, right? Like, the tools that are available have actually diverged, in my opinion, a fair bit across the major chatbot apps, and the number of data sources that you can connect them to has gone up a lot, meaning that your experience and the way you're using the model is more different than ever. swyx [00:47:10]: What tools and what data connections come to mind when you say that? What's interesting, what's notable work that people have done? Micah [00:47:15]: Oh, okay. So my favorite example on this is that until very recently, I would argue that it was basically impossible to get an LLM to draft an email for me in any useful way. Because most times that you're sending an email, you're not just writing something for the sake of writing it. Chances are the context required is a whole bunch of historical emails. Maybe it's notes that you've made, maybe it's meeting notes, maybe it's pulling something from wherever you store stuff at work. So for me, that's, like, Google Drive, OneDrive, our Supabase databases if we need to do some analysis on some data or something. Preferably the model can be plugged into all of those things and can go do some useful work based on it. The thing that I find most impressive currently, that I am somewhat surprised works really well in late 2025, is that I can have models use the Supabase MCP to query, read-only of course, and run a whole bunch of SQL queries to do pretty significant data analysis, and make charts and stuff, and they can read my Gmail and my Notion. And okay. You actually use that. That's good. That's good. Is that a Claude thing?
To varying degrees, but ChatGPT and Claude right now. I would say that this stuff, like, barely works, in fairness, right now. Like. George [00:48:33]: Because people are actually going to try this after they hear it. If you get an email from Micah, odds are it wasn't written by a chatbot. Micah [00:48:38]: So, yeah, I think it is true that I have never actually sent anyone an email drafted by a chatbot. Yet. swyx [00:48:46]: Um, and so you can feel it, right. And yeah, this time next year, we'll come back and see where it's going. Totally. Um, Supabase shout-out, another famous Kiwi. Uh, I don't know if you've had any conversations with him about anything in particular on AI building and AI infra. George [00:49:03]: We have had Twitter DMs with him, because we're quite big Supabase users and power users. And we probably do some things more manually than we should in Supabase, and the support line is always super friendly. One extra point regarding GDPval-AA is that, on the basis of the overperformance of the models compared to the chatbots, we realized that, oh, our reference harness that we built actually works quite well on, like, generalist agentic tasks. This proves it in a sense. And so the agent harness is very minimalist. I think it follows some of the ideas that are in Claude Code, and all that we give it is context management capabilities, a web search tool, a web browsing tool, and a code execution environment. Anything else? Micah [00:50:02]: I mean, we can equip it with more tools, but by default, yeah, that's it. We give it, for GDPval, a tool to view an image specifically, because the models, you know, can just use a terminal to pull stuff in text form into context. But to pull visual stuff into context, we had to give them a custom tool, but yeah, exactly. Um, you can explain it better. No. George [00:50:21]: So it turned out that we created a good generalist agentic harness, and so we released that on GitHub yesterday. It's called Stirrup. So if people want to check it out, it's a great base for building a generalist agent for more specific tasks. Micah [00:50:39]: I'd say the best way to use it is git clone it and then have your favorite coding agent make changes to it, to do whatever you want, because it's not that many lines of code and the coding agents can work with it super well. swyx [00:50:51]: Well, that's nice for the community to explore and share and hack on. I think maybe in other similar environments, the Terminal-Bench guys have done sort of the Harbor thing. And so it's a bundle of, well, we need our minimal harness, which for them is Terminus, and we also need the RL environments or Docker deployment thing to run independently. So I don't know if you've looked at Harbor at all. Is that like a standard that people want to adopt? George [00:51:19]: Yeah, we've looked at it from an evals perspective, and we love Terminal-Bench and host Terminal-Bench benchmarks on Artificial Analysis. Um, we've looked at it from a coding agent perspective, but could see it being a great basis for any kind of agent. I think where we're getting to is that these models have gotten smart enough.
They've gotten better at using tools, so that they can perform better when just given a minimalist set of tools and let run, letting the model control the agentic workflow rather than using another framework that's a bit more built out and tries to dictate the flow. Awesome. swyx [00:51:56]: Let's cover the Openness Index and then let's go into the report stuff. Uh, so that's the last of the proprietary numbers, I guess. I don't know how you sort of classify all these. Yeah. Micah [00:52:07]: Or, call it, let's call it the last of the three new things that we're talking about from the last few weeks. Because, I mean, we do a mix of stuff: where we're using open source, where we open source what we do, and proprietary stuff that we don't always open source. Like the long context reasoning data set last year, we did open source. Um, and then for all of the work on performance benchmarks across the site, some of them we're looking to open source, but some of them we're constantly iterating on, and so on and so on. So there's a huge mix, I would say, of stuff that is open source and not across the site. So that's AA-LCR, for people. Yeah, yeah, yeah, yeah. swyx [00:52:41]: Uh, but let's talk about open. Micah [00:52:42]: Let's talk about the Openness Index. This here is, call it, a new way to think about how open models are. We, for a long time, have tracked whether the models are open weights and what the licenses on them are. And that's pretty useful. That tells you what you're allowed to do with the weights of a model, but there is this whole other dimension to how open models are that is pretty important and that we haven't tracked until now. And that's how much is disclosed about how it was made. So transparency about data, pre-training data and post-training data, whether you're allowed to use that data, and transparency about methodology and training code. So basically, those are the components. We bring them together to score an Openness Index for models, so that you can in one place get this full picture of how open models are. swyx [00:53:32]: I feel like I've seen a couple of other people try to do this, but they're not maintained. I do think this does matter. I don't know what the numbers mean, apart from, is there a max number? Is this out of 20? George [00:53:44]: It's out of 18 currently, and so we've got an Openness Index page, but essentially these are points, you get points for being more open across these different categories, and the maximum you can achieve is 18. So AI2, with their extremely open Olmo 3 32B Think model, is the leader in a sense. swyx [00:54:04]: What about Hugging Face? George [00:54:05]: Oh, with their smaller model. It's coming soon. I think we need to run, we need to get the intelligence benchmarks right to get it on the site. swyx [00:54:12]: You can't have an openness index without them. We cannot not include Hugging Face. We love Hugging Face. We'll have that up very soon. I mean, you know, the RefinedWeb and all that stuff. It's amazing. Or is it called FineWeb? FineWeb. FineWeb. Micah [00:54:23]: Yeah, yeah, no, totally. Yep. One of the reasons this is cool, right, is that if you're trying to understand the holistic picture of the models and what you can do with all the stuff the companies are contributing, this gives you that picture. And so we are going to keep it up to date alongside all the models that we do the Intelligence Index on, on the site.
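As a sketch of how a points-based openness score works: award points per disclosed component and sum them. The component names and point split below are assumptions chosen so the maximum comes to 18; the actual Openness Index rubric may divide the points differently.

```python
# Illustrative points rubric for an openness score out of 18 (assumed split).
OPENNESS_RUBRIC = {
    "open_weights_license": 4,
    "pretraining_data_disclosed": 3,
    "pretraining_data_usable": 2,
    "posttraining_data_disclosed": 3,
    "posttraining_data_usable": 2,
    "training_code_released": 2,
    "methodology_report": 2,
}  # sums to 18

def openness_index(awarded: dict[str, int]) -> int:
    """Sum awarded points, capping each component at its rubric maximum."""
    return sum(
        min(awarded.get(component, 0), max_points)
        for component, max_points in OPENNESS_RUBRIC.items()
    )

print(openness_index({"open_weights_license": 4, "methodology_report": 2}))  # -> 6
```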
And it's just an extra view to understand. swyx [00:54:43]: Can you scroll down to this? The trade-offs chart. Yeah, yeah. That one. Yeah. This really matters, right? Obviously, because you can b

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are reupping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150m at a $1.7B valuation, with $30M annualized consumption revenue (aka $2.5m MRR) after their September evals product launch.—-From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 in one of the most influential platforms in AI—trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases? We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), how they're spending that $100M (inference costs, React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market), the leaderboard delusion controversy and why their response demolished the paper's claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system—models can't pay to get on, can't pay to get off, and scores reflect millions of real votes), how they're expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), and his vision for Arena as the central evaluation platform that provides the North Star for the industry—constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users.We discuss:* The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent* The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in* The leaderboard illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena's response demolished the paper's factual errors (misrepresented open vs. 
closed source sampling, ignored transparency of preview testing that the community loves)* Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina's nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities* The Nano Banana moment: changed Google's market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science* New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and video arena coming soonFull Video EpisodeTimestamps00:00:00 Introduction: Anastasios from Arena and the LM Arena Journey00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup00:02:47 The Decision to Start a Company: Scaling Beyond Academia00:03:38 The $100M Raise: Use of Funds and Platform Economics00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation00:08:12 Educational Value and Learning from the Community00:08:41 Technical Migration: From Gradio to React and Platform Evolution00:10:18 Leaderboard Delusion Paper: Addressing Critiques and Maintaining Integrity00:12:29 Nano Banana Moment: How Preview Models Create Market Impact00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals00:19:10 API Strategy and Focus: Doing One Thing Well00:19:51 Community Management and Retention: Sign-In, History, and Daily Value00:22:21 Partnerships and Agent Evaluation: From Devon to Full-Featured Harnesses00:21:49 Hiring and Building a High-Performance Team Get full access to Latent.Space at www.latent.space/subscribe

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 in one of the most influential platforms in AI—trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases? We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), how they're spending that $100M (inference costs, React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market), the leaderboard delusion controversy and why their response demolished the paper's claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system—models can't pay to get on, can't pay to get off, and scores reflect millions of real votes), how they're expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), the Gemini Nano Banana moment that changed Google's market share overnight (and why multimodal models are becoming economically critical for marketing, design, and AI-for-science), how they're thinking about agents and harnesses (Code Arena evaluates models, but maybe it should evaluate full agents like Devin), and his vision for Arena as the central evaluation platform that provides the North Star for the industry—constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users. We discuss: The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in The leaderboard illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena's response demolished the paper's factual errors (misrepresented open vs. 
closed source sampling, ignored transparency of preview testing that the community loves) Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina's nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities The Nano Banana moment: changed Google's market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and video arena coming soon Consumer retention: sign-in and persistent history were the unlock, but users are fickle and earned every single day—"every user is earned, they can leave at any moment" — Anastasios Angelopoulos Arena: https://lmarena.ai X: https://x.com/arena Chapters 00:00:00 Introduction: Anastasios from Arena and the LM Arena Journey 00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup 00:02:47 The Decision to Start a Company: Scaling Beyond Academia 00:03:38 The $100M Raise: Use of Funds and Platform Economics 00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics 00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation 00:08:12 Educational Value and Learning from the Community 00:08:41 Technical Migration: From Gradio to React and Platform Evolution 00:10:18 Leaderboard Delusion Paper: Addressing Critiques and Maintaining Integrity 00:12:29 Nano Banana Moment: How Preview Models Create Market Impact 00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value 00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity 00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals 00:19:10 API Strategy and Focus: Doing One Thing Well 00:19:51 Community Management and Retention: Sign-In, History, and Daily Value 00:22:21 Partnerships and Agent Evaluation: From Devon to Full-Featured Harnesses 00:21:49 Hiring and Building a High-Performance Team

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Anthropic, Glean & OpenRouter: How AI Moats Are Built with Deedy Das of Menlo Ventures

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Nov 14, 2025 85:27


Deedy Das, Partner at Menlo Ventures, returns to Latent Space to discuss his journey from Glean to venture capital, the explosive rise of Anthropic, and how AI is reshaping enterprise software and coding. From investing in Anthropic early on when they had no revenue to managing the $100M Ontology Fund, Das shares insider perspectives on the fastest-growing software company in history and what's next for AI infrastructure, research investing, and the future of engineering.We cover Glean's rise from “boring” enterprise search to a $7B AI-native company, Anthropic's meteoric rise, the strategic decisions behind products like Claude Code, and why market share in enterprise AI is shifting dramatically. Das explains his investment thesis on research companies like Goodfire, Prime Intellect, and OpenRouter and how the Anthology Fund is quietly seeding the next wave of AI infra, research, and devtools.Full Video EpisodeTimestamps* 00:00:00 Introduction and Deedy's Return to Latent Space* 00:01:20 Glean's Journey: From Boring Enterprise Search to Valuation* 00:15:37 Anthropic's Meteoric Rise and Market Share Dynamics* 00:17:50 Claude Artifacts and Product Innovation* 00:41:20 The Anthology Fund: Investing in the Anthropic Ecosystem* 00:48:01 Goodfire and Mechanistic Interpretability* 00:51:25 Prime Intellect and Distributed AI Training* 00:53:40 OpenRouter: Building the AI Model Gateway* 01:13:36 The Stargate Project and Infrastructure Arms Race* 01:18:14 The Future of Software Engineering and AI Coding Get full access to Latent.Space at www.latent.space/subscribe


Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
⚡ Inside GitHub's AI Revolution: Jared Palmer Reveals Agent HQ & The Future of Coding Agents

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Nov 10, 2025 35:51


Jared Palmer, SVP at GitHub and VP of CoreAI at Microsoft, joins Latent Space for an in-depth look at the evolution of coding agents and modern developer tools. Recently joining after leading AI initiatives at Vercel, Palmer shares firsthand insights from behind the scenes at GitHub Universe, including the launch of Agent HQ which is a new collaboration hub for coding agents and developers.This episode traces Palmer's journey from building Copilot inspired tools to pioneering the focused Next.js coding agent, v0, and explores how platform constraints fostered rapid experimentation and a breakout success in AI-powered frontend development. Palmer explains the unique advantages of GitHub's massive developer network, the challenges of scaling agent-based workflows, and why integrating seamless AI into developer experiences is now a top priority for both Microsoft and GitHub.Full Video EpisodeTimestamps00:00:00 Introduction and Jared's New Role at GitHub00:01:00 From V0 to Agent HQ: The Evolution of Coding Agents00:02:51 The V0 Origin Story: From ChatGPT to AI Playground00:05:40 Building the AI SDK and ShadCN Collaboration00:07:08 The Birth of V0: Prompt to UI Revolution00:09:18 V0's Growth Journey and Model Evolution00:11:05 Model Strategy: Composite Models vs User Choice00:13:16 GitHub's Agent HQ and Model Marketplace00:15:51 The Future of Agent Abstraction and Standards00:16:33 Microsoft Core AI Integration and Workflow Vision00:18:37 Dev Containers and Repo Setup Challenges00:24:10 Agent Quality and Infrastructure Reliability00:27:05 Using Coding Agents for Non-Coding Tasks00:29:11 GitHub Homepage Redesign and Community Feedback00:30:27 Stacked Diffs: GitHub's Most Requested Feature Get full access to Latent.Space at www.latent.space/subscribe

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
⚡ [AIE CODE Preview] Inside Google Labs: Building The Gemini Coding Agent — Jed Borovik, Jules

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Nov 10, 2025 43:53


Jed Borovik, Product Lead at Google Labs, joins Latent Space to unpack how Google is building the future of AI-powered software development with Jules. From his journey discovering GenAI through Stable Diffusion to leading one of the most ambitious coding agent projects in tech, Borovik shares behind-the-scenes insights into how Google Labs operates at the intersection of DeepMind's model development and product innovation.We explore Jules' approach to autonomous coding agents and why they run on their own infrastructure, how Google simplified their agent scaffolding as models improved, and why embeddings-based RAG is giving way to attention-based search. Borovik reveals how developers are using Jules for hours or even days at a time, the challenges of managing context windows that push 2 million tokens, and why coding agents represent both the most important AI application and the clearest path to AGI.This conversation reveals Google's positioning in the coding agent race, the evolution from internal tools to public products, and what founders, developers, and AI engineers should understand about building for a future where AI becomes the new brush for software engineering.Full Video EpisodeTimestamps00:00:00 Introduction and GitHub Universe Recap00:00:57 New York Tech Scene and East Coast Hackathons00:02:19 From Google Search to AI Coding: Jed's Journey00:04:19 Google Labs Mission and DeepMind Collaboration00:06:41 Jules: Autonomous Coding Agents Explained00:09:39 The Evolution of Agent Scaffolding and Model Quality00:11:30 RAG vs Attention: The Shift in Code Understanding00:13:49 Jules' Journey from Preview to Production00:15:05 AI Engineer Summit: Community Building and Networking00:25:06 Context Management in Long-Running Agents00:29:02 The Future of Software Engineering with AI00:36:26 Beyond Vibe Coding: Spec Development and Verification00:40:20 Multimodal Input and Computer Use for Coding Agents Get full access to Latent.Space at www.latent.space/subscribe

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
⚡ Inside GitHub's AI Revolution: Jared Palmer Reveals Agent HQ & The Future of Coding Agents

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Nov 10, 2025


Jared Palmer, SVP at GitHub and VP of CoreAI at Microsoft, joins Latent Space for an in-depth look at the evolution of coding agents and modern developer tools. Recently joining after leading AI initiatives at Vercel, Palmer shares firsthand insights from behind the scenes at GitHub Universe, including the launch of Agent HQ which is a new collaboration hub for coding agents and developers. This episode traces Palmer's journey from building Copilot inspired tools to pioneering the focused Next.js coding agent, v0, and explores how platform constraints fostered rapid experimentation and a breakout success in AI-powered frontend development. Palmer explains the unique advantages of GitHub's massive developer network, the challenges of scaling agent-based workflows, and why integrating seamless AI into developer experiences is now a top priority for both Microsoft and GitHub.


The AI Breakdown: Daily Artificial Intelligence News and Discussions

With NLW currently on the road, he's joined in this conversation by Sean “Swyx” Wang — developer, writer, Latent Space host and newly joined member of Cognition. They explore how AI coding became 2025's defining story, why “vibe coding” is ending (sort of), what comes next for developers, and how “Agent Labs” are reshaping the balance between model makers and product builders. Swyx also previews the upcoming AI Engineer Code Summit in New York and shares why “code AGI” could deliver 80% of AGI's value long before full AGI arrives. Brought to you by: KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months Robots & Pencils - Cloud-native AI solutions that power results - https://robotsandpencils.com/ The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score. The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Interested in sponsoring the show? sponsors@aidailybrief.ai

The Reboot Chronicles with Dean DeBiase
Mapping The Synthetic Age, Henry Ajder - Founder Latent Space

The Reboot Chronicles with Dean DeBiase

Play Episode Listen Later Oct 29, 2025 35:44


Artificial intelligence and its potential productivity benefits continue to be an area of focus across industries, yet AI's influence now extends into our culture. From deepfakes and cloned voices to AI-generated art and virtual influencers, a new wave of synthetic content is transforming media and testing our ability to trust what we see and hear. In this episode of The Reboot Chronicles, Henry Ajder, founder of Latent Space Advisory and program lead of Generative AI and Business at the University of Cambridge, joins us to discuss this shift. Henry has spent the better part of a decade decoding how synthetic media will alter trust and creativity, and he helps us understand what a synthetic age of media really means for business, leaders, governments, and everyday people.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Quinn Slack (CEO) and Thorsten Ball (Amp Dictator) from Sourcegraph join the show to talk about Amp Code, how they ship 15x/day with no code reviews, and why subagents and prompt optimizers aren't a promising direction for coding agents. Amp Code: https://ampcode.com/ Latent Space: https://latent.space/ 00:00 Introduction 00:41 Transition from Cody to Amp 03:18 The Importance of Building the Best Coding Agent 06:43 Adapting to a Rapidly Evolving AI Tooling Landscape 09:36 Dogfooding at Sourcegraph 12:35 CLI vs. VS Code Extension 21:08 Positioning Amp in Coding Agent Market 24:10 The Diminishing Importance of Model Selectors 32:39 Tooling vs. Harness 37:19 Common Failure Modes of Coding Agents 47:33 Agent-Friendly Logging and Tooling 52:31 Are Subagents Real? 56:52 New Frameworks and Agent-Integrated Developer Tools 1:00:25 How Agents Are Encouraging Codebase and Workflow Changes 1:03:13 Evolving Outer Loop Tasks 1:07:09 Version Control and Merge Conflicts in an AI-First World 1:10:36 Rise of User-Generated Enterprise Software 1:14:39 Empowering Technical Leaders with AI 1:17:11 Evaluating Product Without Traditional Evals 1:20:58 Hiring


Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

When the first video diffusion models started emerging, they were little more than just “moving pictures” - still frames extended a few seconds in either direction in time. There was a ton of excitement about OpenAI's Sora on release through 2024, but so far only Sora-lite has been widely released. Meanwhile, other good videogen models like Genmo Mochi, Pika, MiniMax T2V, Tencent Hunyuan Video, and Kuaishou's Kling have emerged, but the reigning king this year seems to be Google's Veo 3, which for the first time has added native audio generation into their model capabilities, eliminating the need for a whole class of lipsynching tooling and SFX editing. The rise of Veo 3 unlocks a whole new category of AI Video creators that many of our audience may not have been exposed to, but is undeniably effective and important particularly in the “kids” and “brainrot” segments of the global consumer internet platforms like Tiktok, YouTube and Instagram. By far the best documentarians of these trends for laypeople are Olivia and Justine Moore, both partners at a16z, who not only collate the best examples from all over the web, but dabble in video creation themselves to put theory into practice. We've been thinking of dabbling in AI brainrot on a secondary channel for Latent Space, so we wanted to get the braindump from the Moore twins on how to make a Latent Space Brainrot channel. Jump on in! Full Video Episode Timestamps 00:00 Introductions & Guest Welcome 00:49 The Rise of Generative Media 02:24 AI Video Trends: Italian Brain Rot & Viral Characters 05:00 Following Trends & Creating AI Content 07:17 Hands-On with AI Video Creation 18:36 Monetization & Business of AI Content 23:34 Platforms, Models, and the Creator Stack 37:22 Native Content vs. Clipping & Going Viral 41:52 Prompt Theory & Meta-Trends in AI Creativity 47:42 Professional, Commercial, and Platform-Specific AI Video 48:57 Wrap-Up & Final Thoughts Get full access to Latent.Space at www.latent.space/subscribe


Let's Talk AI
#210 - Claude 4, Google I/O 2025, OpenAI+io, Gemini Diffusion

Let's Talk AI

Play Episode Listen Later May 26, 2025 104:47 Transcription Available


Our 210th episode with a summary and discussion of last week's big AI news! Recorded on 05/23/2025 Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai Read our text newsletter and comment on the podcast at https://lastweekin.ai/. Join our Discord here! https://discord.gg/nTyezGSKwP In this episode: Google's Gemini diffusion technology showcases significant improvements in speed and efficiency for generating text, potentially revolutionizing the auto-regressive generation paradigm. Anthropic activates AI Safety Level 3 protections for Claude Opus 4, implementing robust measures such as bug bounties, synthetic jailbreak data, and preliminary egress bandwidth controls to mitigate bio-risk threats. OpenAI responds to the California Attorney General, refuting claims by the not-for-private-gain coalition and defending their controversial restructuring plans amidst ongoing criticism. Meta delays the release of its Llama 4 Behemoth model due to training challenges, signaling difficulties in reaching frontier-level performance with its largest-scale models. Timestamps + Links: (00:00:00) Intro / Banter (00:01:43) News Preview Tools & Apps (00:02:58) Anthropic's new Claude 4 AI models can reason over many steps (00:09:58) Google Unveils A.I. Chatbot, Signaling a New Era for Search (00:14:04) Google rolls out Project Mariner, its web-browsing AI agent (00:16:40) Veo 3 can generate videos — and soundtracks to go along with them (00:21:26) Imagen 4 is Google's newest AI image generator (00:23:15) Google Meet is getting real-time speech translation (00:25:36) Google's new Jules AI agent will help developers fix buggy code (00:26:43) GitHub's new AI coding agent can fix bugs for you (00:28:50) Mistral's new Devstral model was designed for coding Applications & Business (00:29:53) OpenAI Unites With Jony Ive in $6.5 Billion Deal to Create A.I. Devices (00:36:10) OpenAI's planned data center in Abu Dhabi would be bigger than Monaco (00:41:18) LM Arena, the organization behind popular AI leaderboards, lands $100M (00:45:21) Nvidia CEO says next chip after H20 for China won't be from Hopper series (00:46:39) Google's Gemini AI app has 400M monthly active users (00:51:15) AI Servers: End demand intact, but rising gap between upstream build and system production (2025.5.18) Projects & Open Source (00:53:46) Meta Is Delaying the Rollout of Its Flagship AI Model Research & Advancements (00:57:53) Gemini Diffusion (01:03:07) Chain-of-Model Learning for Language Model (01:09:16) Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space (01:15:38) Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training (01:20:16) Lessons from Defending Gemini Against Indirect Prompt Injections (01:23:35) How Fast Can Algorithms Advance Capabilities? (01:30:20) Reinforcement Learning Finetunes Small Subnetworks in Large Language Models Policy & Safety (01:31:12) Exclusive: What OpenAI Told California's Attorney General (01:38:25) Activating AI Safety Level 3 Protections

Lenny's Podcast: Product | Growth | Career
How Palantir built the ultimate founder factory | Nabeel S. Qureshi (founder, writer, ex-Palantir)

Lenny's Podcast: Product | Growth | Career

Play Episode Listen Later May 11, 2025 97:29


Nabeel Qureshi is an entrepreneur, writer, researcher, and visiting scholar of AI policy at the Mercatus Center (alongside Tyler Cowen). Previously, he spent nearly eight years at Palantir, working as a forward-deployed engineer. His work at Palantir ranged from accelerating the Covid-19 response to applying AI to drug discovery to optimizing aircraft manufacturing at Airbus. Nabeel was also a founding employee and VP of business development at GoCardless, a leading European fintech unicorn.What you'll learn:• Why almost a third of all Palantir's PMs go on to start companies• How the “forward-deployed engineer” model works and why it creates exceptional product leaders• How Palantir transformed from a “sparkling Accenture” into a $200 billion data/software platform company with more than 80% margins• The unconventional hiring approach that screens for independent-minded, intellectually curious, and highly competitive people• Why the company intentionally avoids traditional titles and career ladders—and what they do instead• Why they built an ontology-first data platform that LLMs love• How Palantir's controversial “bat signal” recruiting strategy filtered for specific talent types• The moral case for working at a company like Palantir—Brought to you by:• WorkOS—Modern identity platform for B2B SaaS, free up to 1 million MAUs• Attio—The powerful, flexible CRM for fast-growing startups• OneSchema—Import CSV data 10x faster—Where to find Nabeel S. Qureshi:• X: https://x.com/nabeelqu• LinkedIn: https://www.linkedin.com/in/nabeelqu/• Website: https://nabeelqu.co/—Where to find Lenny:• Newsletter: https://www.lennysnewsletter.com• X: https://twitter.com/lennysan• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/—In this episode, we cover:(00:00) Introduction to Nabeel S. 
Qureshi(05:10) Palantir's unique culture and hiring(13:29) What Palantir looks for in people(16:14) Why they don't have titles(19:11) Forward-deployed engineers at Palantir(25:23) Key principles of Palantir's success(30:00) Gotham and Foundry(36:58) The ontology concept(38:02) Life as a forward-deployed engineer(41:36) Balancing custom solutions and product vision(46:36) Advice on how to implement forward-deployed engineers(50:41) The current state of forward-deployed engineers at Palantir(53:15) The power of ingesting, cleaning and analyzing data(59:25) Hiring for mission-driven startups(01:05:30) What makes Palantir PMs different(01:10:00) The moral question of Palantir(01:16:03) Advice for new startups(01:21:12) AI corner(01:24:00) Contrarian corner(01:25:42) Lightning round and final thoughts—Referenced:• Reflections on Palantir: https://nabeelqu.co/reflections-on-palantir• Palantir: https://www.palantir.com/• Intercom: https://www.intercom.com/• Which companies produce the best product managers: https://www.lennysnewsletter.com/p/which-companies-produce-the-best• Gotham: https://www.palantir.com/platforms/gotham/• Foundry: https://www.palantir.com/platforms/foundry/• Peter Thiel on X: https://x.com/peterthiel• Alex Karp: https://en.wikipedia.org/wiki/Alex_Karp• Stephen Cohen: https://en.wikipedia.org/wiki/Stephen_Cohen_(entrepreneur)• Joe Lonsdale on LinkedIn: https://www.linkedin.com/in/jtlonsdale/• Tyler Cowen's website: https://tylercowen.com/• This Scandinavian City Just Won the Internet With Its Hilarious New Tourism Ad: https://www.afar.com/magazine/oslos-new-tourism-ad-becomes-viral-hit• Safe Superintelligence: https://ssi.inc/• Mira Murati on X: https://x.com/miramurati• Stripe: https://stripe.com/• Building product at Stripe: craft, metrics, and customer obsession | Jeff Weinstein (Product lead): https://www.lennysnewsletter.com/p/building-product-at-stripe-jeff-weinstein• Airbus: https://www.airbus.com/en• NIH: https://www.nih.gov/• Jupyter Notebooks: https://jupyter.org/• Shyam Sankar on LinkedIn: https://www.linkedin.com/in/shyamsankar/• Palantir Gotham for Defense Decision Making: https://www.youtube.com/watch?v=rxKghrZU5w8• Foundry 2022 Operating System Demo: https://www.youtube.com/watch?v=uF-GSj-Exms• SQL: https://en.wikipedia.org/wiki/SQL• Airbus A350: https://en.wikipedia.org/wiki/Airbus_A350• SAP: https://www.sap.com/index.html• Barry McCardel on LinkedIn: https://www.linkedin.com/in/barrymccardel/• Understanding ‘Forward Deployed Engineering' and Why Your Company Probably Shouldn't Do It: https://www.barry.ooo/posts/fde-culture• David Hsu on LinkedIn: https://www.linkedin.com/in/dvdhsu/• Retool's Path to Product-Market Fit—Lessons for Getting to 100 Happy Customers, Faster: https://review.firstround.com/retools-path-to-product-market-fit-lessons-for-getting-to-100-happy-customers-faster/• How to foster innovation and big thinking | Eeke de Milliano (Retool, Stripe): https://www.lennysnewsletter.com/p/how-to-foster-innovation-and-big• Looker: https://cloud.google.com/looker• Sorry, that isn't an FDE: https://tedmabrey.substack.com/p/sorry-that-isnt-an-fde• Glean: https://www.glean.com/• Limited Engagement: Is Tech Becoming More Diverse?: https://www.bkmag.com/2017/01/31/limited-engagement-creating-diversity-in-the-tech-industry/• Operation Warp Speed: https://en.wikipedia.org/wiki/Operation_Warp_Speed• Mark Zuckerberg testifies: https://www.businessinsider.com/facebook-ceo-mark-zuckerberg-testifies-congress-libra-cryptocurrency-2019-10• Anduril: 
https://www.anduril.com/• SpaceX: https://www.spacex.com/• Principles: https://nabeelqu.co/principles• Wispr Flow: https://wisprflow.ai/• Claude code: https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview• Gemini Pro 2.5: https://deepmind.google/technologies/gemini/pro/• DeepMind: https://deepmind.google/• Latent Space newsletter: https://www.latent.space/• Swyx on x: https://x.com/swyx• Neural networks in chess programs: https://www.chessprogramming.org/Neural_Networks• AlphaZero: https://en.wikipedia.org/wiki/AlphaZero• The top chess players in the world: https://www.chess.com/players• Decision to Leave: https://www.imdb.com/title/tt12477480/• Oldboy: https://www.imdb.com/title/tt0364569/• Christopher Alexander: https://en.wikipedia.org/wiki/Christopher_Alexander—Recommended books:• The Technological Republic: Hard Power, Soft Belief, and the Future of the West: https://www.amazon.com/Technological-Republic-Power-Belief-Future/dp/0593798694• Zero to One: Notes on Startups, or How to Build the Future: https://www.amazon.com/Zero-One-Notes-Startups-Future/dp/0804139296• Impro: Improvisation and the Theatre: https://www.amazon.com/Impro-Improvisation-Theatre-Keith-Johnstone/dp/0878301178/• William Shakespeare: Histories: https://www.amazon.com/Histories-Everymans-Library-William-Shakespeare/dp/0679433120/• High Output Management: https://www.amazon.com/High-Output-Management-Andrew-Grove/dp/0679762884• Anna Karenina: https://www.amazon.com/Anna-Karenina-Leo-Tolstoy/dp/0143035002—Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.—Lenny may be an investor in the companies discussed. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.lennysnewsletter.com/subscribe

The AI Breakdown: Daily Artificial Intelligence News and Discussions
MCP, Agents and What AI Engineers Are Thinking About Right Now feat. Swyx

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 17, 2025 46:02


What's at the top of AI engineers' minds? Swyx, organizer of the AI Engineer Summit and host of Latent Space, joins us to discuss MCP (Model Context Protocol), the rise of AI agents, and how the role of AI engineers is evolving. Find our guest online: https://x.com/swyx https://www.ai.engineer/ https://www.latent.space/ Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief Brought to you by: KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions. Vanta - Simplify compliance - https://vanta.com/nlw Plumb - The Automation Platform for AI Experts - https://useplumb.com/nlw The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score. The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are calling for the world's best AI Engineer talks for AI Architects, /r/localLlama, Model Context Protocol (MCP), GraphRAG, AI in Action, Evals, Agent Reliability, Reasoning and RL, Retrieval/Search/RecSys, Security, Infrastructure, Generative Media, AI Design & Novel AI UX, AI Product Management, Autonomy, Robotics, and Embodied Agents, Computer-Using Agents (CUA), SWE Agents, Vibe Coding, Voice, Sales/Support Agents at AIEWF 2025! Fill out the 2025 State of AI Eng survey for $250 in Amazon cards and see you from Jun 3-5 in SF! CoreWeave's now-successful IPO has led to a lot of questions about the GPU Neocloud market, which Dylan Patel has written extensively about on SemiAnalysis. Understanding markets requires an interesting mix of technical and financial expertise, so this will be a different kind of episode than our usual LS domain. When we first published $2 H100s: How the GPU Rental Bubble Burst, we got 2 kinds of reactions on Hacker News: * “Ah, now the AI bubble is imploding!” * “Duh, this is how it works in every GPU cycle, are you new here?” We don't think either reaction is quite right. Specifically, it is not normal for the price of one of the world's most important resources to swing from $1 to $8 per hour based on drastically inelastic demand AND supply curves - from 3-year lock-in contracts to stupendously competitive over-ordering dynamics for NVIDIA allocations — especially with increasing baseline compute needed for even the simplest academic ML research and for new AI startups getting off the ground. We're fortunate today to have Evan Conrad, CEO of SF Compute, one of the most exciting GPU marketplace startups, talk us through his theory of the economics of GPU markets, and why he thinks CoreWeave and Modal are well positioned, but DigitalOcean and Together are not. However, more broadly, the entire point of SF Compute is creating liquidity between GPU owners and consumers and making compute broadly tradable, even programmable. As we explore, these are the primitives that you can then use to create your own high quality, custom GPU availability for your time and money budget, similar to how Amazon Spot Instances automated the selective buying of unused compute. The ultimate end state of where all this is going is GPUs that trade like other perishable, staple commodities of the world - oil, soybeans, milk. Because those contracts and markets are so well established, the price swings are not nearly as drastic, and people can start hedging and managing the risk of one of the biggest costs of their business, just as we have risk-managed commodity risks of all other sorts for centuries.
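To make the commodity-hedging idea above concrete, here is a minimal sketch of what locking part of a compute budget at a fixed price does to cost variance when spot GPU prices swing the way the post describes. All of the numbers (the $2.00/hr locked price, the volume, the spot scenarios) are illustrative assumptions, not figures from the episode.

```python
# Minimal sketch: hedging part of a GPU-hour budget with a locked-in price,
# the way other commodity inputs are hedged. All numbers are illustrative assumptions.

def monthly_cost(gpu_hours: float, spot_price: float,
                 hedged_fraction: float, locked_price: float) -> float:
    """Cost when `hedged_fraction` of the volume was locked at `locked_price`
    and the remainder is bought at the prevailing spot price."""
    hedged = gpu_hours * hedged_fraction * locked_price
    unhedged = gpu_hours * (1.0 - hedged_fraction) * spot_price
    return hedged + unhedged

hours = 10_000        # GPU-hours needed this month (assumed)
locked = 2.00         # $/hr agreed ahead of time (assumed)

for spot in (1.00, 2.00, 4.00, 8.00):   # the $1-$8 swing described above
    no_hedge = monthly_cost(hours, spot, 0.0, locked)
    hedged = monthly_cost(hours, spot, 0.8, locked)
    print(f"spot ${spot:.2f}/hr -> unhedged ${no_hedge:,.0f}, 80% hedged ${hedged:,.0f}")
```

The hedged buyer gives up some upside when spot collapses, but its largest cost line no longer swings 8x, which is the sense in which GPU-hours start to look like oil or soybeans.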
As a former derivatives trader, you can bet that swyx doubleclicked on that… Show Notes * SF Compute * Evan Conrad * Ethan Anderson * John Phamous * The Curve talk * CoreWeave * Andromeda Cluster Full Video Pod Like and subscribe! Timestamps * [00:00:05] Introductions * [00:00:12] Introduction of guest Evan Conrad from SF Compute * [00:00:12] CoreWeave Business Model Discussion * [00:05:37] CoreWeave as a Real Estate Business * [00:08:59] Interest Rate Risk and GPU Market Strategy Framework * [00:16:33] Why Together and DigitalOcean will lose money on their clusters * [00:20:37] SF Compute's AI Lab Origins * [00:25:49] Utilization Rates and Benefits of SF Compute Market Model * [00:30:00] H100 GPU Glut, Supply Chain Issues, and Future Demand Forecast * [00:34:00] P2P GPU networks * [00:36:50] Customer stories * [00:38:23] VC-Provided GPU Clusters and Credit Risk Arbitrage * [00:41:58] Market Pricing Dynamics and Preemptible GPU Pricing Model * [00:48:00] Future Plans for Financialization? * [00:52:59] Cluster auditing and quality control * [00:58:00] Futures Contracts for GPUs * [01:01:20] Branding and Aesthetic Choices Behind SF Compute * [01:06:30] Lessons from Previous Startups * [01:09:07] Hiring at SF Compute Transcript Alessio [00:00:05]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:12]: Hey, and today we're so excited to be finally in the studio with Evan Conrad from SF Compute. Welcome. I've been fortunate enough to be your friend before you were famous, and also we've hung out at various social things. So it's really cool to see that SF Compute is coming into its own thing, and it's a significant presence, at least in the San Francisco community, which of course, it's in the name, so you couldn't help but be. Evan: Indeed, indeed. I think we have a long way to go, but yeah, thanks. Swyx: Of course, yeah. One way I was thinking about kicking off this conversation is we will likely release this right after the CoreWeave IPO. And I was watching, I was looking, doing some research on you. You did a talk at The Curve. I think I may have been viewer number 70. It was a great talk. More people should go see it, Evan Conrad at The Curve. But we have like three orders of magnitude more people. And I just wanted to highlight, like, what is your analysis of what CoreWeave did that went so right for them? Evan: Sell locked-in long-term contracts and don't really do much short-term at all. I think like a lot of people had this assumption that GPUs would work a lot like CPUs, and the standard business model of any sort of CPU cloud is you buy commodity hardware, then you lay on services that are mostly software, and that gives you high margins and pretty much all your value comes from those services. Not really the underlying compute in any capacity, and because it's commodity hardware and it's not actually that expensive, most of that can be sort of on-demand compute. And while you do want locked-in contracts for folks, it's mostly just a sort of de-risk situation. It helps you plan revenue because you don't know if people are going to scale up or down. But fundamentally, people are like buying hourly and that's how your business is structured and you make 50 percent margins or higher. This like doesn't really work in GPUs. And the reason why it doesn't work is because you end up with like super price sensitive customers. 
And that isn't because necessarily it's just way more expensive, though that's totally the case. So in a CPU cloud, you might have like, you know, let's say if you had a million dollars of hardware in GPUs, you have a billion dollars of hardware. And so your customers are buying at much higher volumes than you otherwise expect. And it's also smaller customers who are buying at higher amounts of volume. So relative to what they're spending in general. But in GPUs in particular, your customer cares about the scaling law. So if you take like Gusto, for example, or Rippling or an HR service like this, when they're buying from an AWS or a GCP, they're buying CPUs and they're running web servers, those web servers, they kind of buy up to the capacity that they need, they buy enough, like CPUs, and then they don't buy any more, like, they don't buy any more at all. Yeah, you have a chart that goes like this and then flat. Correct. And it's like a complete flat. It's not even like an incremental tiny amount. It's not like you could just like turn on some more nodes. Yeah. And then suddenly, you know, they would make an incremental amount of money more, like Gusto isn't going to make like, you know, 5% more money, they're gonna make zero, like literally zero money from every incremental GPU or CPU after a certain point. This is not the case for anyone who is training models. And it's not the case for anyone who's doing test time inference or like inference that has scales at test time. Because like you, your scaling laws mean that you may have some diminishing returns, but there's always returns. Adding GPUs always means your model does actually get. And that actually does translate into revenue for you. And then for test time inference, you actually can just like run the inference longer and get a better performance. Or maybe you can run more customers faster and then charge for that. It actually does translate into revenue. Every incremental GPU translates to revenue. And what that means from the customer's perspective is you've got like a flat budget and you're trying to max the amount of GPUs you have for that budget. And it's very distinctly different than like where Augusto or Rippling might think, where they think, oh, we need this amount of CPUs. How do we, you know, reduce that? How do we reduce our amount of money that we're spending on this to get the same amount of CPUs? What that translates to is customers who are spending in really high volume, but also customers who are super price sensitive, who don't give a s**t. Can I swear on this? Can I swear? Yeah. Who don't give a s**t at all about your software. Because a 10% difference in a billion dollars of hardware is like $100 million of value for you. So if you have a 10% margin increase because you have great software, on your billion, the customers are that price sensitive. They will immediately switch off if they can. Because why wouldn't you? You would just take that $100 million. You'd spend $50 million on hiring a software engineering team to replicate anything that you possibly did. So that means that the best way to make money in GPUs was to do basically exactly what CoreWeave did, which is go out and sign only long-term contracts, pretty much ignore the bottom end of the market completely, and then maximize your long-term contracts. With customers who don't have credit risk, who won't sue you, or are unlikely to sue you for frivolous reasons. 
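A rough way to see the price sensitivity Evan describes above: the Gusto-style web buyer needs a fixed number of nodes, while a training or test-time-inference buyer spends a fixed budget and maximizes chips per dollar. The toy sketch below uses made-up numbers to show why a 10% price gap that a CPU customer would shrug off is worth on the order of $100M to a customer with a $1B compute budget.

```python
# Toy contrast between a capacity-constrained CPU buyer and a budget-constrained
# GPU buyer, using made-up numbers. The point: GPU demand moves one-for-one with
# price, so small margin differences translate into real dollars.

def cpu_buyer_nodes(price_per_node_hour: float, required_nodes: int = 100) -> int:
    # A web app needs a fixed capacity; an extra node earns it nothing,
    # so demand does not respond to price.
    return required_nodes

def gpu_buyer_hours(price_per_gpu_hour: float, budget: float = 1_000_000_000) -> float:
    # A lab spends its whole budget; every incremental GPU still helps the model,
    # so demand is simply budget / price.
    return budget / price_per_gpu_hour

base, marked_up = 1.00, 1.10   # a 10% price difference
lost_compute = (gpu_buyer_hours(base) - gpu_buyer_hours(marked_up)) * base
print(f"CPU buyer: {cpu_buyer_nodes(base)} vs {cpu_buyer_nodes(marked_up)} nodes (unchanged)")
print(f"GPU buyer: ~${lost_compute:,.0f} of compute given up by paying the 10% markup")
```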
And then because they don't have credit risk and they won't sue you for frivolous reasons, you can go back to your lender and you can say, look, this is a really low risk situation for us to do. You should give me prime, prime interest rate. You should give me the lowest cost of capital you possibly can. And when you do that, you just make tons of money. The problem that I think lots of people are going to talk about with CoreWeave is it doesn't really look like a cloud platform. It doesn't really look like a cloud provider financially. It also doesn't really look like a software company financially.Swyx [00:05:37]: It's a bank.Evan [00:05:38]: It's a bank. It's a real estate company. And it's very hard to not be that. The problem of that that people have tricked themselves into is thinking that CoreWeave is a bad business. I don't think CoreWeave is explicitly a bad business. There's a bunch of people, there's kind of like two versions of the CoreWeave take at the moment. There's, oh my God, CoreWeave, amazing. CoreWeave is this great new cloud provider competitive with the hyperscalers. And to some extent, this is true from a structural perspective. Like, they are indeed a real sort of thing against the cloud providers in this particular category. And the other take is, oh my gosh, CoreWeave is this horrible business and so on and blah, blah, blah. And I think it's just like a set of perception or perspective. If you think CoreWeave's business is supposed to look like the traditional cloud providers, you're going to be really upset to learn that GPUs don't look like that at all. And in fact, for the hyperscalers, it doesn't look like this either. My intuition is that the hyperscalers are probably going to lose a lot of money, and they know they're going to lose a lot of money on reselling NVIDIA GPUs, at least. Hyperscalers, but I want to, Microsoft, AWS, Google. Correct, yeah. The Microsoft, AWS, and Google. Does Google resell? I mean, Google has TPUs. Google has TPUs, but I think you can also get H100s and so on. But there are like two ways they can make money. One is by selling to small customers who aren't actually buying in any serious volume. They're testing around, they're playing around. And if they get big, they're immediately going to do one of two things. They're going to ask you for a discount. Because they're not going to pay your crazy sort of margin that you have locked into your business. Because for CPUs, you need that. They're going to pay your massive per hour price. And so they want you to sign a long-term contract. And so that's your other way that you can make money, is you can basically do exactly what CoreWeave does, which is have them pay as much as possible upfront and lock in the contract for a long time. Or you can have small customers. But the problem is that for a hyperscaler, the GPUs to... To sell on the low margins relative to what your other business, your CPUs are, is a worse business than what you are currently doing. Because you could have spent the same money on those GPUs. And you could have trained model and you could have made a model on top of it and then turn that into a product and had high margins from your product. Or you could have taken that same money and you could have competed with NVIDIA. And you could have cut into their margin instead. But just simply reselling NVIDIA GPUs doesn't work like your CPU business. Where you're able to capture high margins from big customers and so on. 
And then they never leave you because your customers aren't actually price sensitive. And so they won't switch off if your prices are a little higher. You actually had a really nice chart, again, on that talk of this two by two. Sure. Of like where you want to be. And you also had some hot takes on who's making money and who isn't. Swyx: So CoreUv locked up long-term contracts. Get that. Yes. Maybe share your mental framework. Just verbally describe it because we're trying to help the audio listeners as well. Sure. People can look up the chart if they want to. Evan: Sure. Okay. So this is a graph of interest rates. And on the y-axis, it's a probability you're able to sell your GPUs from zero to one. And on the x-axis, it's how much they'll depreciate in cost from zero to one. And then you had ISO cost curves or ISO interest rate curves. Yeah. So they kind of shape in a sort of concave fashion. Yeah. The lowest interest rates enable the most aggressive. form of this cost curve. And the higher interest rates go, the more you have to push out to the top right. Yeah. And then you had some analysis of where every player sits in this, including CoreUv, but also Together and Modal and all these other guys. I thought that was super insightful. So I just wanted to elaborate. Basically, it's like a graph of risk and the genres of places where you can be and what the risk is associated with that. The optimal thing for you to do, if you can, is to lock in long-term contracts that are paid all up front or in with a situation in which you trust the other party to pay you over time. So if you're, you know, selling to Microsoft or something or OpenAI. Which are together 77% of the revenue of CoreUv. Yeah. So if you're doing that, that's a great business to be in because your interest rate that you can pitch for is really low because no one thinks Microsoft is going to default. And like maybe OpenAI will default, but the backing by Microsoft kind of doesn't. And I think there's enough, like, generally, it looks like OpenAI is winning that you can make it's just a much better case than if you're selling to the pre-seed startup that just raised $30 million or something pre-revenue. It's like way easier to make the case that the OpenAI is not going to default than the pre-seed startup. And so the optimal place to be is selling to the maximally low risk customer for as long as possible. And then you never have to worry about depreciation and you make lots of money. The less. Good. Good place to be is you could sell long-term contracts to people who might default on you. And then if you're not bringing it to the present, so you're not like saying, hey, you have to pay us all up front, then you're in this like more risky territory. So is it top left of the chart? If I have the chart right, maybe. Large contracts paid over time. Yeah. Large contracts paid over time is like top left. So it's more risky, but you could still probably get away with it. And then the other opportunity is that you could sell short-term contracts for really high prices. And so lots of people tried that too, because this is actually closer to the original business model that people thought would work in cloud providers for CPUs. It works for CPUs, but it doesn't really work for GPUs. And I don't think people were trying this because they were thinking about the risk associated with it. 
I think a lot of people are just come from a software background, have not really thought about like cogs or margins or inventory risk or things that you have to worry about in the physical world. And I think they were just like copy pasting the same business model onto CPUs. And also, I remember fundraising like a few years ago. And I know based on. Like what we knew other people were saying who were in a very similar business to us versus what we were saying. And we know that our pitch was way worse at the time, because in the beginning of SF Compute, we looked very similar to pretty much every other GPU cloud, not on purpose, but sort of accidentally. And I know that the correct pitch to give to an investor was we will look like a traditional CPU cloud with high margins and we'll sell to everyone. And that is a bad business model because your customers are price sensitive. And so what happens is if you. Sell at high prices, which is the price that you would need to sell it in order to de-risk your loss on the depreciation curve, and specifically what I mean by that is like, let's say you're selling it like $5 an hour and you're paying $1.50 an hour for the GPU under the hood. It's a little bit different than that, but you know, nice numbers, $5 an hour, $1.50 an hour. Great. Excellent. Well, you're charging a really high price per GPU hour because over time the price will go down and you'll get competed out. And what you need is to make sure that you never go under, or if you do go under your underlying cost. You've made so much money in the first part of it that the later end of it, like doesn't matter because from the whole structure of the deal, you've made money. The problem is that just, you think that you're going to be able to retain your customers with software. And actually what happens is your customers are super price sensitive and push you down and push you down and push you down and push you down, um, that they don't care about your software at all. And then the other problem that you have is you have, um, really big players like the hyperscalers who are looking to win the market and they have way more money than you, and they can push down on margin. Much better than you can. And so if they have to, and they don't, they don't necessarily all the time, um, I think they actually keep pride of higher margin, but if they needed to, they could totally just like wreck your margin at any point, um, and push you down, which meant that that quadrant over there where you're charging a high price, um, and just to make up for the risk completely got destroyed, like did not work at all for many places because of the price sensitivity, because people could just shove you down instead that pushed everybody up to the top right-hand corner of that, which is selling short-term. Contracts for low prices paid over time, which is the worst place to be in, um, the worst financial place to be in because it has the highest interest rate, um, which means that your, um, your costs go up at the same time, your, uh, your incoming cash goes down and squeezes your margins and squeezes your margins. The nice thing for like a core weave is that most of their business is over on the, on the other sides of those quadrants that the ones that survive. The only remaining question I have with core weave, and I promise I get to ask if I can compute, and I promise this is relevant to SOF Compute in general, because the framework is important, right? Sure. To understand the company. 
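To put numbers on the $5-an-hour-versus-$1.50-an-hour quadrant discussed above, here is a small back-of-the-envelope model of the "sell short-term at high prices and stay ahead of depreciation" strategy. The decay rate, discount rate, and horizon are assumptions chosen for illustration, not figures from the talk.

```python
# Back-of-the-envelope for the "charge $5/hr now, pay $1.50/hr under the hood" strategy
# when the market price decays and cash flows are discounted at the operator's cost of
# capital. Every parameter here is an assumption for illustration.

cost_per_hr    = 1.50    # operator's underlying cost per GPU-hour
start_price    = 5.00    # initial short-term rental price per GPU-hour
annual_decay   = 0.40    # assumed yearly drop in the market rental price
discount_rate  = 0.15    # assumed cost of capital for risky short-term revenue
hours_per_year = 8760
years = 5

npv = 0.0
for year in range(years):
    price = start_price * (1 - annual_decay) ** year
    margin = max(price - cost_per_hr, 0.0)      # idle the GPU rather than rent below cost
    npv += margin * hours_per_year / (1 + discount_rate) ** year
    print(f"year {year}: price ${price:.2f}/hr, margin ${margin:.2f}/hr")

print(f"discounted margin per GPU over {years} years: ${npv:,.0f}")
```

Under these assumptions almost all of the value sits in the first year or two; if customers push the price down faster than assumed, or the lender charges more for the riskier short-term revenue, the same cluster flips to a loss, which is why that quadrant got squeezed toward the top right.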
So why didn't NVIDIA or Microsoft, both of which have more money than core weave, do core weave, right? Why didn't they do core weave? Why have this middleman when either NVIDIA or Microsoft have more money than God, and they could have done an internal core weave, which is effectively like a self-funding vehicle, like a financial instrument. Why does there have to be a third party? Your question is like... Why didn't Microsoft, or why didn't NVIDIA just do core weave? Why didn't they just set up their own cloud provider? I think, and I don't know, and so correct me if I'm wrong, and lots of people will have different opinions here, or I mean, not opinions, they'll have actual facts that differ from my facts. Those aren't opinions. Those are actually indeed differences of reality, is that NVIDIA doesn't want to compete with their customers. They make a large amount of money by selling to existing clouds. If they launched their own core weave, then it would be a lot more money. It'd make it much harder for them to sell to the hyperscalers, and so they have a complex relationship with there. So not great for them. Second is that, at least for a while, I think they were dealing with antitrust concerns or fears that if they're going through, if they own too much layers of the stack, I could imagine that could be a problem for them. I don't know if that's actually true, but that's where my mind would go, I guess. Mostly, I think it's the first one. It's that they would be competing directly with their primary customers. Then Microsoft could have done it, right? That's the other question. Yeah, so Microsoft didn't do it. And my guess is that... NVIDIA doesn't want Microsoft to do it, and so they would limit the capacity because from NVIDIA's perspective, both they don't want to necessarily launch their own cloud provider because it's competing with their customers, but also they don't want only one customer or only a few customers. It's really bad for NVIDIA if you have customer concentration, and Microsoft and Google and Amazon, like Oracle, to buy up your entire supply, and then you have four or five customers or so who pretty much get to set prices. Monopsony. Yeah, monopsony. And so the optimal thing for you is a diverse set of customers who all are willing to pay at whatever price, because if you don't, somebody else will. And so it's really optimal for NVIDIA to have lots of other customers who are all competing against each other. Great. Just wanted to establish that. It's unintuitive for people who have never thought about it, and you think about it all day long. Yeah. Swyx: The last thing I'll call out from the talk, which is kind of cool, and then I promise we'll get to SF Compute, is why will DigitalOcean and Together lose money on their clusters? Why will DigitalOcean and Together lose money on their clusters?Evan [00:16:33]: I'm going to start by clarifying that all of these businesses are excellent and fantastic. That Together and DigitalOcean and Lambda, I think, are wonderful businesses who build excellent products. But my general intuition is that if you try to couple the software and the hardware together, you're going to lose money. That if you go out and you buy a long-term contract from someone and then you layer on services, or you buy the hardware yourself and you spin it up and you get a bunch of debt, you're going to run into the same problem that everybody else did, the same problem we did, same problem the hyperscalers did. 
And that's exactly what the hyperscalers are doing, which is you cannot add software and make high margins like a cloud provider can. You can pitch that into investors and it will totally make sense, and it's like the correct play in CPUs, but there isn't software you could make to make this occur. If you're spending a billion dollars on hardware, you need to make a billion dollars of software. There isn't a billion dollars of software that you can realistically make, and if you do, you're going to look like SAP. And that's not a knock on SAP. SAP makes a f**k ton of money, right? Right. Right. Right. Right. There aren't that many pieces of software that you could make, that you can realistically sell, like a billion dollars of software, and you're probably not going to do it to price-sensitive customers who are spending their entire budget already on compute. They don't have any more money to give you. It's a very hard proposition to do. And so many parties have been trying to do this, like, buy their own compute, because that's what a traditional cloud does. It doesn't really work for them. You know that meme where there's, like, the Grim Reaper? And he's, like, knocking on the door, and then he keeps knocking on the next door? We have just seen door after door after door of the Grim Reeker comes by, and the economic realities of the compute market come knocking. And so the thing we encourage folks to do is if you are thinking about buying a big GPU cluster and you are going to layer on software on top, don't. There are so many dead bodies in the wake there. We would recommend not doing that. And we, as SF Compute, our entire business is structured to help you not do that. It's helped disintegrate these. The GPU clouds are fantastic real estate businesses. If you treat them like real estate businesses, you will make a lot of money. The cloud services you can make on that, all the software you want to make on that, you can do that fantastically. If you don't own the underlying hardware, if you mix these businesses together, you get shot in the head. But if you combine, if you split them, and that's what the market does, it helps you split them, it allows you to buy, like, layer on services, but just buy from the market, you can make lots of money. So companies like Modal, who don't own the underlying compute, like they don't own it, lots of money, fantastic product. And then companies like Corbeave, who are functionally like really, really good real estate businesses, lots of money, fantastic product. But if you combine them, you die. That's the economic reality of compute. I think it also splits into trading versus inference, which are different kinds of workloads. Yeah. And then, yeah, one comment about the price sensitivity thing before we leave this. This topic, I want to credit Martin Casado for coining or naming this thing, which is like, you know, you said, you said this thing about like, you don't have room for a 10% margin on GPUs for software. Yep. And Martin actually played it out further. It's his first one I ever saw doing this at large enough runs. So let's say GPT-4 and O1 both had a total trading cost of like a $500 billion is the rough estimate. When you get the $5 billion runs, when you get the $50 billion runs, it is actually makes sense to build your own. You're going to have to get into chips, like for OpenEI to get into chip design, which is so funny. I would make an ASIC for this run. Yeah, maybe. 
I think a caveat of that that is not super well thought about is that only works if you're really confident. It only works if you really know which chip you're going to do. If you don't, then it's a little harder. So it makes in my head, it makes more sense for inference where you've already established it. But for training there's so much like experimentation. Any generality, yeah. Yeah. The generality is much more useful. Yeah. In some sense, you know, Google's like six generations into the CPUs. Yeah. Yeah. Okay, cool. Maybe we should go into SF Compute now. Sure. Yeah.Alessio [00:20:37]: Yeah. So you kind of talked about the different providers. Why did you decide to go with this approach and maybe talk a bit about how the market dynamics have evolved since you started a company?Evan [00:20:47]: So originally we were not doing this at all. We were definitely like forced into this to some extent. And SF Compute started because we wanted to go train models for music and audio in general. We were going to do a sort of generic audio model at some points, and then we were going to do a music model at some points. It was an early company. We didn't really spec down on a particular thing. But yeah, we were going to do a music model and audio model. First thing that you do when you start any AI lab is you go out and you buy a big cluster. The thing we had seen everybody else do was they went out and they raised a really big round and then they would get stuck. Because if you raise the amount of money that you need to train a model initially, like, you know, the $50 million pre-seed, pre-revenue, your valuation is so high or you get diluted so much that you can't raise the next round. And that's a very big ask to make. And also, I don't know, I felt like we just felt like we couldn't do it. We probably could have in retrospect, but I think one, we didn't really feel like we could do it. Two, it felt like if we did, we would have been stuck later on. We didn't want to raise the big round. And so instead, we thought, surely by now, we would be able to just go out. To any provider and buy like a traditional CPU cloud would sell offer you and just buy like on demand or buy like a month or so on. And this worked for like small incremental things. And I think this is where we were basing it off. We just like assumed we could go to like Lambda or something and like buy thousands of at the time A100s. And this just like was not at all the case. So we started doing all the sales calls with people and we said, OK, well, can we just get like month to month? Can we get like one month of compute or so on? Everyone told us at the time, no. You need to have a year long contract or longer or you're out of luck. Sorry. And at the time, we were just like pissed off. Like, why won't nobody sell us a month at a time? Nowadays, we totally understand why, because it's the same economic reason. Because if you if they had sold us the month to month or so on and we canceled or so on, they would have massive risk on that. And so the optimal thing to do was to only to just completely abandon the section of the market. We didn't like that. So our plan was we were going to buy a year long contract anyway. We would use a month. And then we would. At least the other 11 months. And we were locked in for a year, but we only had to pay on every individual month. And so we did this. 
But then immediately we said, oh, s**t, now we have a cloud provider, not a like training models company, not an AI lab, because every 30 days we owed about five hundred thousand dollars or so and we had about five hundred thousand dollars in the bank. So that meant that every single month, if we did not sell out our cluster, we would just go bankrupt. So that's what we did for the first year of the company. And when you're in that position. You try to think how in the world you get out of that position, what that transition to is, OK, well, we tend to be pretty good at like selling this cluster every month because we haven't died yet. And so what we should do is we should go basically be like this broker for other people and we will be more like a GPU real estate or like a GPU realtor. And so we started doing that for a while where we would go to other people who had who was trying to sell like a year long contract with somebody and we'd go to another person who like maybe this person wanted six months and somebody else on six months or something and we'd like combine all these people. Together to make the deal happen and we'd organize these like one off bespoke deals that looked like basically it ended up with us taking a bunch of customers, us signing with a vendor, taking some cut and then us operating the cluster for people typically with bare metal. And so we were doing this, but this was definitely like a oh, s**t, oh, s**t, oh, s**t. How do we get out of our current situation and less of a like a strategic plan of any sort? But while we were doing this, since like the beginning of the company, we had been thinking about how to buy GPU clusters, how to sell them effectively, because we'd seen every part of it. And what we ended up with was like a book of everybody who's trying to buy and everyone is trying to sell because we were these like GPU brokers. And so that turned into what is today SF Compute, which is a compute market, which we think we are the functionally the most liquid GPU market of any capacity. Honestly, I think we're the only thing that actually is like a real market that there's like bids and asks and there's like a like a trading engine that combines everything. And so. I think we're the only place where you can do things that a market should be able to do. Like you can go on SF Compute today and you get thousands of H100s for an hour if you want. And that's because there is a price for thousands of GPUs for an hour. That is like not a thing you can reasonably do on kind of any other cloud provider because nobody should realistically sell you thousands of GPUs for an hour. They should sell it to you for a year or so on. But one of the nice things about a market is that you can buy the year on SF Compute. But then if you need to sell. Back, you can sell back as well. And that opens up all these little pockets of liquidity where somebody who's just trying to buy for a little bit of time, some burst capacity. So people don't normally buy for an hour. That's not like actually a realistic thing, but it's like the range somebody who wants, who is like us, who needed to buy for a month can actually buy for a month. They can like place the order and there is actually a price for that. And it typically comes from somebody else who's selling back. 
Somebody who bought a longer term contract, and, like, they bought for some period of time, their code doesn't work, and now they need to sell off a little bit. Alessio [00:25:49]: What are the utilization rates at which a market like this works? What do you see as the usual GPU utilization rate, and at what point does the market get saturated? Evan [00:26:00]: Assuming there are not, like, hardware problems or software problems, the utilization rate is near 100 percent, because the price dips until the utilization is 100 percent. So the price actually has to dip quite a lot in order for the utilization not to be full. That's not always the case, because you just have logistical problems: you get a cluster and parts of the InfiniBand fabric are broken, and there's some issue with some switch somewhere, and so you have to take some portion of the cluster offline, or, you know, stuff like this. There are just underlying physical realities of the clusters. But nominally we have better utilization than basically anybody. But that's utilization of the cluster; that doesn't necessarily translate into... I mean, I actually do think we make much better overall money for our underlying vendors than kind of anybody else. We work with the other GPU clouds, and the basic pitch to the other GPU clouds is: one, we can be your broker, so we can find you the long term contracts that are at the prices that you want. But meanwhile, your cluster is idle, and we can increase your utilization and get you more money, because we can sell that idle cluster for you, and then the moment we find the longer, bigger customer and they come on, you can kick off those people and then go to the other ones. You get kind of the mix of: sell your cluster at whatever price you can get on the market, and then sell your cluster at the big price that you want for the long term contract, which is your ideal business model. And then the benefit of the whole thing being on the market is you can pitch your customer that they can cancel their long term contract, which is not a thing that you can reasonably do if you are just the GPU cloud. If you're just the GPU cloud, you can never cancel your contract, because that introduces so much risk that you would otherwise, like, not get your cheap cost of capital or whatever. But if you're selling it through the market, or you're selling it with us, then you can say, hey, look, you can cancel for a fee. And that fee is the difference between the price on the market and the price that they paid at, which means that they can cancel and you have the ability to offer that flexibility, but you don't have to take the risk of it. The money's already there and, like, you got paid, but it's just being sold to somebody else. One of our top pieces from last year was talking about the H100 glut from all the long term contracts that were not being fully utilized and being put onto the market. You have on here one-dollar-an-hour contracts, and it goes up to two. Actually, I think you were involved. You were obliquely quoted in that article. I think you remember. I remember because this was hidden. Well, we hid your name, but then you were like, yeah, it's us. Yeah. Could you talk about the supply and demand of H100s? Was that just a normal cycle? Was that like a super cycle because of all the VC funding that went in in 2023? What was that like? GPU prices have come down. 
Yeah, GPU prices have come down. And there's some part of that that is the normal depreciation cycle. Some part of that is just that there were a lot of startups that bought GPUs and never used them, and now they're lending them out, and therefore you exist. There are a lot of various theories as to why this happened. I dislike all of them, because they're often said with really high confidence, and I think the market's just much more complicated than that. Of course. And so everything I'm going to say is very hedged. But there was a series of places where a bunch of the orders were placed, and people were pitching to their customers and their investors and just the broader market that they would arrive on time. And that is not how the world works. And because there was such a really quick build out of things, you would end up with bottlenecks in the supply chain somewhere that has nothing to do with necessarily the chip. It's like the InfiniBand cables or the NICs or whatever. Or you need a bunch of generators, or you don't have data center space, or there's always some bottleneck somewhere else. And so a lot of the clusters didn't come online within the period of time. But then all the bottlenecks got sorted out and then they all came online at the same time. So I think you saw a shortage, because supply chains are hard, and then you saw an increase, or like a glut, because the supply chain eventually figured itself out. And specifically people overordered in order to get the allocation that they wanted. Then they got the allocations and then they went under. Yeah, whatever. Right. There was just a lot of shenanigans. A caveat of this is every time you see somebody overordered, there is this assumption that the problem was that the demand went down. I don't think that's the case at all, and so I want to clarify that. It definitely seems like a shortage. Like, there's more demand for GPUs than there ever was. It's just that there was also more supply. So at the moment, I think there is still functionally a glut. But the difference that I think is happening is mostly the test time inference stuff: you just need way more chips for that than you did before. And so whenever you make a statement about the current market, people sort of take your words and then they assume that you're making a statement about the future market. And so if you say there's a glut now, people will continue to think there's a glut. But I think what is happening at the moment, my general prediction, is that, like, by the winter, we will be back towards shortage. But then also, this very much depends on the rollout of future chips. And that comes with its own uncertainty. I think I'm trying to give you a good "here's Evan's forecast" here. Okay. But I don't know if my forecast is right. You don't have to. Nobody is going to hold you to it. But like, I think people want to know what's true and what's not. And there's a lot of vague speculation from people who are not that close to the market, actually. And you are. I think I'm closer to the market, but also a vague speculator. Like, I think there are a lot of really highly confident speculators, and I am indeed a vague speculator. I think I have more information than a lot of other people. And this makes me more vague of a speculator, because I feel less certain or less confident than I think a lot of other people do. 
The thing I do feel reasonably confident about saying is that the test time inference is probably going to quite significantly expand the amount of compute that is used for inference. So a caveat: right now, pretty much all the inference demand is in a few companies. A good example is, like, lots of bio and pharma were using H100s to train the bio models of sorts. And they would come along and they would buy, you know, thousands of H100s for training, and then just not a lot of stuff for inference. Not relative to, like, an OpenAI or Anthropic or something, because they don't have a consumer product. Their inference event, if they can do it right: there's really like only one inference event that matters. And obviously they're going to run it in batch, and they're not going to literally just run one inference event. But, like, the one that produces the drug is the important one. Right. And I'm dumb and I don't know anything about biology, so I could be completely wrong here. But my understanding is that's kind of the gist. I can check that for you. You can check that for me. Check that for me. But my understanding is, like, the one that produces the sequence that is the drug that, you know, cures cancer or whatever, that's the important deal. But a lot of models look like this, where they're sort of more enterprise use cases, or at least they were prior to something that looks like test time inference. You got lots and lots of demand for training, and then it pretty much entirely fell off for inference. And I think, like, we looked at OpenRouter, for example. The entirety of OpenRouter that was not Anthropic or, like, Gemini or OpenAI or something, it was like 10 H100 nodes or something like that. It's just not that much. It's, like, not that many GPUs actually to service that entire demand. But that's a really sizable portion of the sort of open source market. But the actual amount of compute needed for it was not that much. But if you imagine what an OpenAI needs for, like, GPT-4, it's tremendously big. But that's because it's a consumer product that has almost all the inference demand. Yeah, that's a message we've had: roughly, open source AI compared to closed AI is like 5%. Yeah, it's, like, super small. Super small. It's super small. Super small. But test time inference changes that quite significantly. So I will... I will expect that to increase our overall demand. But my question on whether or not that actually affects your compute price is entirely based on how quickly we roll out the next chips. The way that you burst is different for test time. Alessio [00:34:01]: Any thoughts on the third part of the market, which is the more peer-to-peer distributed, some of them crypto-enabled, like Hyperbolic, Prime Intellect, and all of that? Where do those fit? Like, do you see a lot of people wanting to participate in a peer-to-peer market? Or, just because of the capital requirements, at the end of the day it doesn't really matter? Evan [00:34:20]: I'm, like, wildly skeptical of these, to be frank. The dream is like SETI@home, right? I got this 5090 sitting at home. Nobody has a 5090. Okay, a 4090 sitting at home. I can rent it out. Yeah. Like, I just don't really think this is going to ever be more efficient than a fully interconnected cluster with InfiniBand or, you know, whatever the sort of next spec might be. Like, I could be completely wrong. But speaking of... I mean, like, the speed of light is really hard to beat. 
And regardless of whatever you're using, you just can't get around that physical limitation. And so you could imagine a decentralized market that still has a lot of places where there's co-location. But then you would get something that looks like SF Compute. And so that's what we do. So our general take is, like, on SF Compute, you're not buying from random people. You're buying from the other GPU clouds, functionally. You're buying from data centers that are the same genre of people that you would work with already. And you can specify, oh, I want all these nodes to be co-located. And I don't think you're really going to get around that. And I think I buy crypto for the purposes of, like, transferring money. Like, the financial system is quite painful and so on. I can understand the uses of it to sort of incentivize an initial market or try to get around the cold start problem. We've been able to get around the cold start problem just fine, so we didn't actually need that at all. What I do think is totally possible is you could launch a token and then you could, like, subsidize the compute prices for a bit, and maybe that will help you. I think that's what Nous is doing. Yeah, I think there are lots of people who are trying to do things like this, but at some point that runs out. So I would, I think, generally agree. I think the only threat to that model is very fine grained mixture of experts, where, like, algorithms can shift to adapt to hardware realities. And the hardware reality is like, okay, it's annoying to do large co-located clusters. Then we'll just redesign attention or whatever in our architecture to distribute it more. There was a little bit of buzz about block attention last year that Strong Compute made a big push on. But I think, like, you know, in a world where we have 200 experts in an MoE model, it starts to be a little bit better. Like, I don't disagree with this. I can imagine the world in which you've redesigned it to be more parallelizable, like, across space. Evan [00:36:43]: But without that, your hardware limitation is your speed of light limitation. And that's a very hard one to get around. Alessio [00:36:50]: Any customers or, like, stories that you want to shout out, maybe things that wouldn't have been economically viable otherwise? I know there's some sensitivity on that. Evan [00:37:00]: My favorites are grad students, folks who are trying to do things that would normally otherwise require the scale of a big lab. And the grad students are, like, the worst pilots. They're like the worst possible customer for the traditional GPU clouds, because they will immediately churn if you sell them a thing, because they're going to graduate and they're not going to go anywhere. They're not going to... that project isn't continuing to spend lots of money. Like, sometimes it does, but not if you're working with the university or you're working with a lab of some sort. But a lot of times it's just, like, the ability for us to offer big burst capacity I think is lovely and wonderful. And it's one of my favorite things to do, because all those folks look like we did. And I have a special place in my heart for that. I have a special place in my heart for young hackers and young grad students and researchers who are trying to do the same genre of thing that we are doing. 
For the same reason, I have a special place in my heart for like the startups, the people who are just actively trying to compete on the same scale, but can't afford it time-wise, but can afford it spike-wise. Yeah, I liked your example of like, I have a grant of 100K and it's expiring. I got to spend it on that. That's really beautiful. Yeah. Interesting. Has there been interesting work coming out of that? Anything you want to mention? Yeah. So from like a startup perspective, like Standard Intelligence and Find, P-H-I-N-D. We've had them on the pod.Swyx [00:38:23]: Yeah. Yeah.Evan [00:38:23]: That was great. And then from grad students' perspective, we worked a lot with like the Schmidt Futures grantees of various sorts. My fear is if I talk about their research, I will be completely wrong to a sort of almost insulting degree because I am very dumb. But yeah. I think one thing that's maybe also relevant startups and GPUs-wise. Yeah. Is there was a brief moment where it kind of made sense that VCs provided GPU clusters. And obviously you worked at AI Grants, which set up Andromeda, which is supposedly a $100 million cluster. Yeah. I can explain why that's the case or why anybody would think that would be smart. Because I remember before any of that happened, we were asking for it to happen. Yeah. And the general reason is credit risk. Again, it's a bank. Yeah. I have lower risk than you due to credit transformation. I take your risk onto my balance sheet. Correct. Exactly. If you wanted to go for a while, if you wanted to go set up a GPU cluster, you had to be the one that actually bought the hardware and racked it and stacked it, like co-located it somewhere with someone. Functionally, it was like on your balance sheet, which means you had to get a loan. And you cannot get a loan for like $50 million as a startup. Like not really. You can get like venture debt and stuff, but like it's like very, very difficult to get a loan of any serious price for that. But it's like not that difficult to get a loan for $50 million. If you already have a fund or you already have like a million dollars under your assets somewhere or like you personally can like do a personal guarantee for it or something like this. If you have a lot of money, it is way easier for you to get a loan than if you don't have a lot of money. And so the hack of a VC or some capital partner offering equity for compute is always some arbitrage on the credit risk. That's amazing. Yeah. That's a hack. You should do that. I don't think people should do it right now. I think the market has like, I think it made sense at the time and it was helpful and useful for the people who did it at the time. But I think it was a one-time arbitrage because now there are lots of other sources that can do it. And also I think like it made sense when no one else was doing it and you were the only person who was doing it. But now it's like it's an arbitrage that gets competed down. Sure. So it's like super effective. I wouldn't totally recommend it. Like it's great that Andromeda did it. But the marginal increase of somebody else doing it is like not super helpful. I don't think that many people have followed in their footsteps. I think maybe Andreessen did it. Yeah. That's it. I think just because pretty much all the value like flows through Andromeda. What? That cannot be true. How many companies are in the air, Grant? Like 50? My understanding of Andromeda is it works with all the NFTG companies or like several of the NFTG companies. 
But I might be wrong about that. Again, you know, something something. Nat, don't kill me. I could be completely wrong. But the but you know, I think Andromeda was like an excellent idea to do at the right time in which it occurred. Perfect. His timing is impeccable. Timing. Yeah. Nat and Daniel are like, I mean, there's lots of people who are like... Sears? Yeah. Sears. Like S-E-E-R. Oh, Sears. Like Sears of the Valley. Yeah. They for years and years before any of the like ChatGPT moment or anything, they had fully understood what was going to happen. Like way, way before. Like. AI Grant is like, like five years old, six years old or something like that. Seven years old. When I, when it like first launched or something. Depends where you start. The nonprofit version. Yeah. The nonprofit version was like, like happening for a while, I think. It's going on for quite a bit of time. And then like Nat and Daniel are like the early investors in a lot of the sort of early AI labs of various sorts. They've been doing this for a bit.Alessio [00:41:58]: I was looking at your pricing yesterday. We're kind of talking about it before. And there's this weird thing where one week is more expensive of both one day and one month. Yeah. What are like some of the market pricing dynamics? What are things that like this to somebody that is not in the business? This looks really weird. But I'm curious, like if you have an explanation for it, if that looks normal to you. Yeah.Evan [00:42:18]: So the simple answer is preemptible pricing is cheaper than non-preemptible pricing. And the same economic principle is the reason why that's the case right now. That's not entirely true on SF Compute. SF Compute doesn't really have the concept of preemptible. Instead, what it has is very short reservations. So, you know, you go to a traditional cloud provider and you can say, hey, I want to reserve contract for a year. We will let you do a reserve contract for one hour, which is the part of SFC. But what you can do is you can just buy every single hour continuously. And you're reserving just for that hour. And then the next hour you reserve just for that next hour. And this is obviously like a built in. This is like an automation that you can do. But what you're seeing when you see the cheap price is you're seeing somebody who's buying the next hour, but maybe not necessarily buying an hour after that. So if the price goes up. Up too much. They might not get that next hour. And the underlying part of this of where that's coming from the market is you can imagine like day old milk or like milk that's about to be old. It might drop its price until it's expired because nobody wants to buy the milk that's in the past. Or maybe you can't legally sell it. Compute is the same way. No, you can't sell a block of compute that is not that is in the past. And so what you should do in the market and what people do do is they take. They take a block. A block of compute. And then they drop it and drop it and drop it and drop into a floor price right before it's about to expire. And they keep dropping it until it clears. And so anything that is idle drops until some point. So if you go and use on the website and you set that that chart to like a week from now, what you'll see is much more normal looking sort of curves. But if you say, oh, I want to start right now, that immediate instant, here's the compute that I want right now is the is functionally the preemptible price. 
It's where most people are getting the best compute, or, like, the best compute prices from. The caveat of that is you can do really fun stuff on SFC if you want. So because it's not actually preemptible, it's reserved, but only reserved for an hour, which means that the optimal way to use SF Compute is to just buy at the market price, but set a limit price that is much higher. So you can set a limit price of, like, four dollars and say, oh, if the market ever happens to spike up to four dollars, then don't buy. I don't want to buy at that price, even for an hour. But otherwise, just buy at the cheapest price. And if you're comfortable with the volatility of it, you're actually going to get really good prices, like close to a dollar an hour or so, sometimes down to like 80 cents or whatever. You said four, though. Yeah. So that's the thing. Four is just the limit. So four is your max price. Four is where you basically want to pull the plug and say don't do it, because the actual average price, or, like, you know, the preemptible price, doesn't actually look like that. So what you're doing when you're saying four is: always, always, always give me this compute. Like, continue to buy every hour. Don't preempt me. Don't kick me off. I want this compute, and just buy at the preemptible price, but never kick me off. The only times in which you get kicked off is if there is a big price spike. And, you know, let's say one day out of the year, there's like a four dollar an hour price because of some weird fluke or something. All the other periods of time, you're actually getting a much lower price. It makes sense. Your average cost that you're actually paying is way better. And your trade off here is you don't literally know what price you're going to get. So it's volatile. But your actual average historically, like, everyone who's done this has gotten wildly better prices. And this is one of the clever things you can do with the market. If you're willing to make those trade offs, you can get a lot of really good prices. You can also do other things, like you can only buy at night, for example. So the price goes down at night. And so you can say, oh, I want to only buy, you know, if the price is lower than 90 cents. And so if you have some long running job, you can make it only run at 90 cents and then you recover back and so on. Yeah. So what you can kind of create is like a spot instance, which is what the CPU world has. Yes. But you've created a system where you can kind of manufacture the exact profile that you want. Exactly. That is not just whatever the hyperscalers offer you, which is usually just one thing. Correct. SF Compute is like the power tool. The underlying primitive of, like, hourly compute is there. Correct. Yeah, it's pretty interesting. I've often asked OpenAI, so, like, you know, all these guys, Claude as well: they do batch APIs. So it's half off of whatever your thing is. Yeah. And the only contract is we'll return in 24 hours. Sure. Right. And I was like, 24 hours is good. But sometimes I want one hour. I want four hours. I want something. And so based off of SF Compute's system, you can actually kind of create that kind of guarantee. Totally. That would be like, you know, not 24, but within eight hours, within four hours, like half of a workday, I can return your results to you. 
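For anyone who wants to see the mechanics, here is a rough sketch of the strategy Evan describes, written as a toy simulation rather than real API calls. The price model and function names are hypothetical, not SF Compute's actual interface. The idea: reserve every hour at the going market price, set a high limit so you only sit out the rare spikes, and the average you actually pay lands well below the limit.

```python
import random

def get_market_price() -> float:
    """Simulated $/GPU-hour ask for the coming hour; rare spikes blow past the limit."""
    spike = 4.0 if random.random() < 0.01 else 0.0   # roughly 1% of hours are a weird fluke
    return round(random.uniform(0.80, 1.60) + spike, 2)

def reserve_next_hour(ask: float, limit: float) -> bool:
    """Reserve the coming hour only if the ask is at or below our limit price."""
    return ask <= limit

LIMIT_PRICE = 4.00  # the "never kick me off" ceiling; not what you expect to pay on average

hours_bought, total_cost = 0, 0.0
for hour in range(24 * 7):                 # one simulated week, re-entering the market every hour
    ask = get_market_price()
    if reserve_next_hour(ask, LIMIT_PRICE):
        hours_bought += 1
        total_cost += ask
    # else: sit out the spike hour and come back for the next one

avg = total_cost / max(hours_bought, 1)
print(f"bought {hours_bought}/168 hours at an average of ${avg:.2f}/GPU-hour")
```

The same loop with a 90-cent threshold instead of a four-dollar limit gives you the night-only, checkpoint-and-resume pattern for long running jobs.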
And if your latency requirements are that relaxed, actually it's fine. Yes. Correct. Yeah. You can carve out that. You can financially engineer that on SFC. Yeah. Yeah. I mean, I think to me that unlocks a lot of agent use cases that I want, which is like, yeah, it works in the background, but I don't want you to take a day. Yeah. Correct. Take a couple hours or something. Yeah. This touches a lot of my, like, background, because I used to be a derivatives trader. Yeah. And this is a forward market. Yeah. A futures forward market, whatever you call it. Not a future. Very explicitly not a future. Not yet a futures market. Yes. But I don't know if you have any other points to talk about. So you recognize that you are a, you know, a marketplace, and you've hired, I met Alex Epstein at your launch event, and you're building out the financialization of GPUs. Yeah. So part of that's legal. Mm-hmm. Totally. Part of that is like listing on an exchange. Yep. Maybe you're the exchange. I don't know how that works, but just, like, talk to me about that. Like, from the legal, the standardization, the, like, where is this all headed? You know, is this like fully listed on the Chicago Mercantile Exchange or whatever? What we're trying to do is create an underlying spot market that gives you an index price that you can use. And then with that index price, you can create a cash settled future. And with a cash settled future, you can go back to the data centers and you can say, lock in your price now and de-risk your entire position, which lets you get cheaper cost of capital and so on. And that we think will improve the entire industry, because the marginal cost of compute is the risk. It's risk, as shown by that graph and basically every part of this conversation. It's risk that causes the price to be all sorts of funky. And we think a future is the correct solution to this. So that's the eventual goal. Right now you have to make the underlying spot market in order to make this occur. And then to make the spot market work, you actually have to solve a lot of technology problems. You really cannot make a spot market work if you don't run the clusters, if you don't have control over them, if you don't know how to audit them, because these are supercomputers, not soybeans. They have to work. In a way that, like, it's just a lot simpler to deliver a soybean than it is to deliver compute. I don't know. Talk to the soybean guys. Sure. You know? Yeah. But you have to have a delivery mechanism. Your delivery mechanism, like, somebody somewhere has to actually get the compute at some point, and it actually has to work. And it is really complicated. And so that is the other part of our business: we go and we build a bare metal infrastructure stack, and then also we do auditing of all the clusters. You sort of de-risk the technical perspective, and that allows you to eventually de-risk the financial perspective. And that is kind of the pitch of SF Compute. Yeah. I'll double click on the auditing on the clusters. This is something I've had conversations with Yi Tay on. He started Reka, and I think he had a blog post which kind of shone a light a little bit on how unreliable some clusters are versus others. Correct. Yeah. And sometimes you kind of have to season them and age them a little bit to find the bad cards. You have to burn them in. Yeah. So what do you do to audit them? There's like a burn-in process, a suite of tests, and then active checking and passive checking. 
The burn-in process is where you typically run LINPACK. LINPACK is this thing that runs a bunch of linear algebra equations to stress test the GPUs. This is a proprietary thing that you wrote? No, no, no. LINPACK is like the most common form of burn-in. If you just type in burn-in, typically when people say burn-in, they literally just mean LINPACK. There's like an NVIDIA reference version of this. Again, NVIDIA could run this before they ship, but now the customers have to do it. It's annoying. You're not just checking for the GPU itself. You're checking, like, the whole system, all the hardware. And it's a lot of work. It's an integration test. It's an integration test. Yeah. So what you're doing when you're running LINPACK, or burn-in in general, is you're stress testing the GPUs for some period of time, 48 hours, for example, maybe seven days or so on. And you're just trying to kill all the dead GPUs or any components in the system that are broken. And we've had experiences where we ran LINPACK on a cluster and it basically falls over, sort of comes offline, when you run LINPACK. This is a pretty good sign that maybe there is a problem with this cluster. Yeah. So LINPACK is like the most common sort of standard test. But then beyond that, we have a series of performance tests that replicate a much more realistic environment as well, which we run assuming LINPACK works at all; then you run the next set of tests. And then while the GPUs are in operation, you're also going through and you're doing active tests and passive tests. Passive tests are things that are running in the background while somebody else is running, while, like, some other workload is running. And active tests are during, like, idle periods. You're running some sort of check that would otherwise sort of interrupt something. And then the active tests will take something offline, basically. Or a passive check might mark it to get taken offline later and so on. And then the thing that we are working on, that we have working partially but not entirely, is automated refunds, which is basically, like, it is the case that the hardware breaks a lot, and there's only so much that we can do, and it affects pretty much the entire industry. So a pretty common thing that I think happens to kind of everybody in the space is a customer comes online, they experience your cluster, and your cluster has the same problem that any cluster has, or, I mean, a different problem every time, but they experience one of the problems of HPC. And then their experience is bad. And you have to negotiate a refund or some other thing like this. It's always case by case. And like, yeah, a lot of people just eat the cost. Correct. So one of the nice things about a market that we can do as we get bigger, and have been doing as we get bigger, is we can immediately give you something else. And then also we can automatically refund you. And you're still going to experience it, like, the hardware problems aren't going away until the underlying vendors fix things. But honestly, I don't think that's likely, because you're always pushing the limits of HPC. That's the nature of trying to build a supercomputer. So that's one of the nice things that we can do: we can switch you out for somebody else somewhere, and then automatically refund you or prorate or whatever the correct move is. One of the things that you said in conversation with me was, like, you know, a provider is good when they guarantee automatic refunds. 
Which doesn't happen. But yeah, that's in our contract with all the underlying cloud providers. You built it in already. Yeah. So we have a quite strict SLA that we pass on to you. The reason why
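As a rough sketch of the burn-in and health-check flow described above, something like the following: stress the whole system for a fixed window to shake out dead components, then keep cycling passive and active checks during operation. The script names, thresholds, and SSH-based orchestration here are placeholders, not SF Compute's actual tooling; in practice the burn-in step is a LINPACK/HPL run plus fabric-level tests rather than a single script.

```python
import subprocess
from dataclasses import dataclass, field

@dataclass
class Node:
    hostname: str
    healthy: bool = True
    notes: list[str] = field(default_factory=list)

def run_on_node(node: Node, command: list[str], timeout_s: int) -> bool:
    """Run a command on one node over SSH; a non-zero exit or a timeout counts as failure."""
    try:
        result = subprocess.run(["ssh", node.hostname, *command],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def burn_in(nodes: list[Node], hours: int = 48) -> None:
    """Stress GPUs, fabric, and power for a fixed window to kill off weak components."""
    # Placeholder script: stands in for a LINPACK/HPL run plus NCCL all-reduce
    # tests across the InfiniBand fabric.
    cmd = ["./run_burn_in.sh", f"--hours={hours}"]
    for node in nodes:
        if not run_on_node(node, cmd, timeout_s=hours * 3600 + 600):
            node.healthy = False
            node.notes.append("failed burn-in; repair or re-test before sale")

def passive_check(node: Node) -> None:
    """Runs alongside a customer workload; only flags the node for later action."""
    if not run_on_node(node, ["nvidia-smi"], timeout_s=60):
        node.notes.append("flagged by passive check; drain and inspect when idle")

def active_check(node: Node) -> None:
    """Runs in idle windows; may take the node offline (and trigger a refund) immediately."""
    if not run_on_node(node, ["./run_quick_diagnostics.sh"], timeout_s=1800):
        node.healthy = False
        node.notes.append("failed active check; taken offline, customer auto-refunded")
```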

Scaling "Thinking": Gemini 2.5 Tech Lead Jack Rae on Reasoning, Long Context, & the Path to AGI

Play Episode Listen Later Apr 5, 2025 76:28


In this illuminating episode of The Cognitive Revolution, host Nathan Labenz speaks with Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work. They explore the technical breakthroughs behind Google's Gemini 2.5 Pro model, discussing why reasoning techniques are suddenly working so effectively across the industry and whether these advances represent true breakthroughs or incremental progress. The conversation delves into critical questions about the relationship between reasoning and agency, the role of human data in shaping model behavior, and the roadmap from current capabilities to AGI, providing listeners with an insider's perspective on the trajectory of AI development. SPONSORS: Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive PRODUCED BY: https://aipodcast.ing CHAPTERS: (00:00) About the Episode (05:09) Introduction and Welcome (07:28) RL for Reasoning (10:46) Research Time Management (13:41) Convergence in Model Development (18:31) RL on Smaller Models (Part 1) (20:01) Sponsors: Oracle Cloud Infrastructure (OCI) | Shopify (22:35) RL on Smaller Models (Part 2) (23:30) Sculpting Cognitive Behaviors (25:05) Language Switching Behavior (28:02) Sharing Chain of Thought (32:03) RL on Chain of Thought (Part 1) (33:46) Sponsors: NetSuite (35:19) RL on Chain of Thought (Part 2) (35:26) Eliciting Human Reasoning (39:27) Reasoning vs. Agency (40:17) Understanding Model Reasoning (44:29) Reasoning in Latent Space (47:54) Interpretability Challenges (51:36) Platonic Model Hypothesis (56:05) Roadmap to AGI (01:00:57) Multimodal Integration (01:04:38) System Card Questions (01:07:51) Long Context Capabilities (01:13:49) Outro

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are happy to announce that there will be a dedicated MCP track at the 2025 AI Engineer World's Fair, taking place Jun 3rd to 5th in San Francisco, where the MCP core team and major contributors and builders will be meeting. Join us and apply to speak or sponsor!

When we first wrote Why MCP Won, we had no idea how quickly it was about to win. In the past 4 weeks, OpenAI and now Google have announced MCP support, effectively confirming our prediction that MCP was the presumptive winner of the agent standard wars. MCP has now overtaken OpenAPI, the incumbent option and most direct alternative, in GitHub stars (3 months ahead of the conservative trendline).

We have explored the state of MCP at AIE (now the first ever >100k views workshop). And since then, we've added a 7th reason why MCP won - this team acts very quickly on feedback, with the 2025-03-26 spec update adding support for stateless/resumable/streamable HTTP transports, and comprehensive authz capabilities based on OAuth 2.1. This bodes very well for the future of the community and project.

For protocol and history nerds, we also asked David and Justin to tell the origin story of MCP, which we leave to the reader to enjoy (you can also skim the transcripts, or, the changelogs of a certain favored IDE). It's incredible the impact that individual engineers solving their own problems can have on an entire industry.

Full video episode

Like and subscribe on YouTube!

Show Links
* David
* Justin
* MCP
* Why MCP Won

Timestamps
* 00:00 Introduction and Guest Welcome
* 00:37 What is MCP?
* 02:00 The Origin Story of MCP
* 05:18 Development Challenges and Solutions
* 08:06 Technical Details and Inspirations
* 29:45 MCP vs Open API
* 32:48 Building MCP Servers
* 40:39 Exploring Model Independence in LLMs
* 41:36 Building Richer Systems with MCP
* 43:13 Understanding Agents in MCP
* 45:45 Nesting and Tool Confusion in MCP
* 49:11 Client Control and Tool Invocation
* 52:08 Authorization and Trust in MCP Servers
* 01:01:34 Future Roadmap and Stateless Servers
* 01:10:07 Open Source Governance and Community Involvement
* 01:18:12 Wishlist and Closing Remarks

Transcript

Alessio [00:00:02]: Hey, everyone. Welcome back to Latent Space. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host Swyx, founder of Smol AI. swyx [00:00:10]: Hey, morning. And today we have a remote recording, I guess, with David and Justin from Anthropic over in London. Welcome. Hey, you guys have created a storm of hype because of MCP, and I'm really glad to have you on. Thanks for making the time. What is MCP? Let's start with a crisp definition from the horse's mouth, and then we'll go into the origin story. But let's start off right off the bat. What is MCP? Justin/David [00:00:43]: Yeah, sure. So Model Context Protocol, or MCP for short, is basically something we've designed to help AI applications extend themselves or integrate with an ecosystem of plugins, basically. The terminology is a bit different. We use this client-server terminology, and we can talk about why that is and where that came from. But at the end of the day, it really is that. It's like extending and enhancing the functionality of an AI application. swyx [00:01:05]: David, would you add anything? Justin/David [00:01:07]: Yeah, I think that's actually a good description. I think there's like a lot of different ways for how people are trying to explain it. But at the core, I think what Justin said is like extending AI applications is really what this is about. 
And I think the interesting bit here that I want to highlight, it's AI applications and not models themselves that this is focused on. That's a common misconception that we can talk about a bit later. But yeah. Another version that we've used and gotten to like is like MCP is kind of like the USB-C port of AI applications and that it's meant to be this universal connector to a whole ecosystem of things.swyx [00:01:44]: Yeah. Specifically, an interesting feature is, like you said, the client and server. And it's a sort of two-way, right? Like in the same way that said a USB-C is two-way, which could be super interesting. Yeah, let's go into a little bit of the origin story. There's many people who've tried to make statistics. There's many people who've tried to build open source. I think there's an overall, also, my sense is that Anthropic is going hard after developers in the way that other labs are not. And so I'm also curious if there was any external influence or was it just you two guys just in a room somewhere riffing?Justin/David [00:02:18]: It is actually mostly like us two guys in a room riffing. So this is not part of a big strategy. You know, if you roll back time a little bit and go into like July 2024. I was like, started. I started at Anthropic like three months earlier or two months earlier. And I was mostly working on internal developer tooling, which is what I've been doing for like years and years before. And as part of that, I think there was an effort of like, how do I empower more like employees at Anthropic to use, you know, to integrate really deeply with the models we have? Because we've seen these, like, how good it is, how amazing it will become even in the future. And of course, you know, just dogfoot your own model as much as you can. And as part of that. From my development tooling background, I quickly got frustrated by the idea that, you know, on one hand side, I have Cloud Desktop, which is this amazing tool with artifacts, which I really enjoyed. But it was very limited to exactly that feature set. And it was there was no way to extend it. And on the other hand side, I like work in IDEs, which could greatly like act on like the file system and a bunch of other things. But then they don't have artifacts or something like that. And so what I constantly did was just copy. Things back and forth on between Cloud Desktop and the IDE, and that quickly got me, honestly, just very frustrated. And part of that frustration wasn't like, how do I go and fix this? What, what do we need? And back to like this development developer, like focus that I have, I really thought about like, well, I know how to build all these integrations, but what do I need to do to let these applications let me do this? And so it's very quickly that you see that this is clearly like an M times N problem. Like you have multiple like applications. And multiple integrations you want to build and like, what that is better there to fix this than using a protocol. And at the same time, I was actually working on an LSP related thing internally that didn't go anywhere. But you put these things together in someone's brain and let them wait for like a few weeks. And out of that comes like the idea of like, let's build some, some protocol. And so back to like this little room, like it was literally just me going to a room with Justin and go like, I think we should build something like this. Uh, this is a good idea. And Justin. 
Lucky for me, just really took an interest in the idea, um, and, and took it from there to like, to, to build something, together with me, that's really the inception story is like, it's us to, from then on, just going and building it over, over the course of like, like a month and a half of like building the protocol, building the first integration, like Justin did a lot of the, like the heavy lifting of the first integrations in cloud desktop. I did a lot of the first, um, proof of concept of how this can look like in an IDE. And if you, we could talk about like some of. All the tidbits you can find way before the inception of like before the official release, if you were looking at the right repositories at the right time, but there you go. That's like some of the, the rough story.Alessio [00:05:12]: Uh, what was the timeline when, I know November 25th was like the official announcement date. When did you guys start working on it?Justin/David [00:05:19]: Justin, when did we start working on that? I think it, I think it was around July. I think, yeah, I, as soon as David pitched this initial idea, I got excited pretty quickly and we started working on it, I think. I think almost immediately after that conversation and then, I don't know, it was a couple, maybe a few months of, uh, building the really unrewarding bits, if we're being honest, because for, for establishing something that's like this communication protocol has clients and servers and like SDKs everywhere, there's just like a lot of like laying the groundwork that you have to do. So it was a pretty, uh, that was a pretty slow couple of months. But then afterward, once you get some things talking over that wire, it really starts to get exciting and you can start building. All sorts of crazy things. And I think this really came to a head. And I don't remember exactly when it was, maybe like approximately a month before release, there was an internal hackathon where some folks really got excited about MCP and started building all sorts of crazy applications. I think the coolest one of which was like an MCP server that can control a 3d printer or something. And so like, suddenly people are feeling this power of like cloud connecting to the outside world in a really tangible way. And that, that really added some, uh, some juice to us and to the release.Alessio [00:06:32]: Yeah. And we'll go into the technical details, but I just want to wrap up here. You mentioned you could have seen some things coming if you were looking in the right places. We always want to know what are the places to get alpha, how, how, how to find MCP early.Justin/David [00:06:44]: I'm a big Zed user. I liked the Zed editor. The first MCP implementation on an IDE was in Zed. It was written by me and it was there like a month and a half before the official release. Just because we needed to do it in the open because it's an open source project. Um, and so it was, it was not, it was named slightly differently because we. We were not set on the name yet, but it was there.swyx [00:07:05]: I'm happy to go a little bit. Anthropic also had some preview of a model with Zed, right? Some kind of fast editing, uh, model. Um, uh, I, I'm con I confess, you know, I'm a cursor windsurf user. Haven't tried Zed. Uh, what's, what's your, you know, unrelated or, you know, unsolicited two second pitch for, for Zed. That's a good question.Justin/David [00:07:28]: I, it really depends what you value in editors. For me. I, I wouldn't even say I like, I love Zed more than others. 
I like them all like complimentary in, in a way or another, like I do use windsurf. I do use Zed. Um, but I think my, my main pitch for Zed is low latency, super smooth experience editor with a decent enough AI integration.swyx [00:07:51]: I mean, and maybe, you know, I think that's, that's all it is for a lot of people. Uh, I think a lot of people obviously very tied to the VS code paradigm and the extensions that come along with it. Okay. So I wanted to go back a little bit. You know, on, on, on some of the things that you mentioned, Justin, uh, which was building MCP on paper, you know, obviously we only see the end result. It just seems inspired by LSP. And I, I think both of you have acknowledged that. So how much is there to build? And when you say build, is it a lot of code or a lot of design? Cause I felt like it's a lot of design, right? Like you're picking JSON RPC, like how much did you base off of LSP and, and, you know, what, what, what was the sort of hard, hard parts?Justin/David [00:08:29]: Yeah, absolutely. I mean, uh, we, we definitely did take heavy inspiration from LSP. David had much more prior experience with it than I did working on developer tools. So, you know, I've mostly worked on products or, or sort of infrastructural things. LSP was new to me. But as a, as a, like, or from design principles, it really makes a ton of sense because it does solve this M times N problem that David referred to where, you know, in the world before LSP, you had all these different IDEs and editors, and then all these different languages that each wants to support or that their users want them to support. And then everyone's just building like one. And so, like, you use Vim and you might have really great support for, like, honestly, I don't know, C or something, and then, like, you switch over to JetBrains and you have the Java support, but then, like, you don't get to use the great JetBrains Java support in Vim and you don't get to use the great C support in JetBrains or something like that. So LSP largely, I think, solved this problem by creating this common language that they could all speak and that, you know, you can have some people focus on really robust language server implementations, and then the IDE developers can really focus on that side. And they both benefit. So that was, like, our key takeaway for MCP is, like, that same principle and that same problem in the space of AI applications and extensions to AI applications. But in terms of, like, concrete particulars, I mean, we did take JSON RPC and we took this idea of bidirectionality, but I think we quickly took it down a different route after that. I guess there is one other principle from LSP that we try to stick to today, which is, like, this focus on how features manifest. More than. The semantics of things, if that makes sense. David refers to it as being presentation focused, where, like, basically thinking and, like, offering different primitives, not because necessarily the semantics of them are very different, but because you want them to show up in the application differently. Like, that was a key sort of insight about how LSP was developed. And that's also something we try to apply to MCP. But like I said, then from there, like, yeah, we spent a lot of time, really a lot of time, and we could go into this more separately, like, thinking about each of the primitives that we want to offer in MCP. And why they should be different, like, why we want to have all these different concepts. 
That was a significant amount of work. That was the design work, as you allude to. But then also, already out of the gate, we had three different languages that we wanted to at least support to some degree. That was TypeScript, Python, and then for the Zed integration, it was Rust. So there was some SDK building work in those languages, a mixture of clients and servers to build out, to try to create this, like, internal ecosystem that we could start playing with. And then, yeah, I guess just trying to make everything, like, robust over, like, I don't know, this whole concept that we have for local MCP, where you, like, launch subprocesses and stuff, and making that robust took some time as well. Yeah, maybe adding to that, I think the LSP influence goes even a little bit further. Like, we did take actually quite a look at criticisms of LSP, like, things that LSP didn't do right and things that people felt they would love to have different, and really took that to heart to, like, see, you know, what are some of the things that we wish, you know, we should do better. We took, you know, like, a lengthy look at, like, their very unique approach to JSON RPC, I may say, and then we decided that this is not what we do. And so there's, like, these differences, but it's clearly very, very inspired. Because I think when you're trying to build something like MCP, you kind of want to pick the areas you want to innovate in, but you kind of want to be boring about the other parts by pattern matching LSP. So the prior art allows you to be boring in a lot of the core pieces that you want to be boring in. Like, the choice of JSON RPC is very non-controversial to us, because it's just, like, it doesn't matter at all, like, what the actual bytes on the wire that you're speaking are. It makes no difference to us. The innovation is on the primitives you choose and these types of things. And so there's way more focus on that that we wanted to do. So having some prior art is good there, basically. swyx [00:12:26]: It does. I wanted to double click. I mean, there's so many things you can go into. Obviously, I am passionate about protocol design. I wanted to show you guys this. I mean, I think you guys know, but, you know, you already referred to the M times N problem. And I can just share my screen here: anyone working in developer tools has faced this exact issue, where you see the God box, basically. Like, the fundamental problem and solution of all infrastructure engineering is you have M things going to N things, and then you put the God box in the middle and they'll all be better, right? So here is one for Uber, one for GraphQL, one for Temporal, where I used to work, and this is from React. And I was just kind of curious, like, you know, did you solve M times N problems at Facebook? Like, it sounds like, David, you did that for a living, right? Like, this is just M times N for a living. Justin/David [00:13:16]: Yeah, yeah. To some degree, for sure, I did. God, what's a good example of this... but, like, I did a bunch of this kind of work on, like, source control systems and these types of things. And so there were a bunch of these types of problems. And so you just shove them into something that everyone can read from and everyone can write to, and you build your God box somewhere, and it works. But yeah, it's just, in developer tooling, you're absolutely right. 
In developer tooling, this is everywhere, right? swyx [00:13:47]: And that, you know, it shows up everywhere. And what was interesting is I think everyone who makes the God box then has the same set of problems, which is also you now have, like, composability, auth, and remote versus local. So, you know, there's this very common set of problems. So I kind of want to take a meta lesson on how to do the God box, but, you know, we can talk about the sort of development stuff later. I wanted to double click on, again, the presentation that Justin mentioned, of like how features manifest, and how you said some things are the same, but you just want to reify some concepts so they show up differently. And I had that sense, you know, when I was looking at the MCP docs, I'm like, why do these two things need to be different from each other? I think a lot of people treat tool calling as the solution to everything, right? And sometimes you can actually sort of view different kinds of tool calls as different things. And sometimes they're resources. Sometimes they're actually taking actions. Sometimes they're something else that I don't really know yet. But I just want to see, like, what are some things that you sort of mentally group as adjacent concepts, and why were they important to you to emphasize? Justin/David [00:14:58]: Yeah, I can chat about this a bit. I think fundamentally, every sort of primitive that we thought through, we thought from the perspective of the application developer first. Like, if I'm building an application, whether it is an IDE or, you know, Claude Desktop or some agent interface or whatever the case may be, what are the different things that I would want to receive from, like, an integration? And I think once you take that lens, it becomes quite clear that tool calling is necessary, but very insufficient. Like, there are many other things you would want to do besides just get tools and plug them into the model, and you want to have some way of differentiating what those different things are. So, the kind of core primitives that we started MCP with: we've since added a couple more, but the core ones are really tools, which we've already talked about. It's like adding tools directly to the model, or function calling as it's sometimes called. Then resources, which is basically like bits of data or context that you might want to add to the context. So, excuse me, to the model context. And this is the first primitive where we decided this could be, like, application controlled. Like, maybe you want a model to automatically search through and find relevant resources and bring them into context. But maybe you also want that to be an explicit UI affordance in the application, where the user can, like, you know, pick through a dropdown or like a paperclip menu or whatever, and find specific things and tag them in. And then that becomes part of, like, their message to the LLM. Like, those are both use cases for resources. And then the third one is prompts, which are deliberately meant to be, like, user initiated or user substituted text or messages. So, like, the analogy here would be, if you're in an editor, like a slash command or something like that, or like an at, you know, autocompletion type thing, where it's like, I have this kind of macro effectively that I want to drop in and use. 
And we have sort of expressed opinions through MCP about the different ways that these things could manifest, but ultimately it is for application developers to decide, okay, you get these different concepts expressed differently. Um, and it's very useful as an application developer, because you can decide the appropriate experience for each. And actually this can be a point of differentiation too. Like, we were also thinking, you know, from the application developer perspective: application developers don't want to be commoditized. They don't want the application to end up the same as every other AI application. So, like, what are the unique things that they could do to create the best user experience, even while connecting up to this big open ecosystem of integrations? Yeah. And to add to that, I think there are two aspects to that that I want to mention. The first one is that, interestingly enough, while nowadays tool calling is obviously probably 95% plus of the integrations, and I wish there would be, you know, more clients doing resources, doing prompts, the very first implementation, in Zed, is actually a prompt implementation. It doesn't deal with tools. And we found this actually quite useful, because what it allows you to do is, for example, build an MCP server that takes, like, a backtrace, so it's not necessarily like a tool, that literally just pulls it from Sentry or any other online platform that tracks your crashes, and just lets you pull this into the context window beforehand. And so it's quite nice that way, that it's a user driven interaction where the user decides when to pull this in and doesn't have to wait for the model to do it. And so it's a great way to craft the prompt, in a way. And I think similarly, you know, I wish more MCP servers today would bring prompts as examples of, like, how to even use the tools at the same time. The resources bits are quite interesting as well. And I wish we would see more usage there, because it's very easy to envision, but yet nobody has really implemented it: a system where, like, an MCP server exposes, you know, a set of documents that you have, your database, whatever you might want, as a set of resources. And then, like, a client application would build a full RAG index around this, right? This is definitely an application use case we had in mind, as to why these are exposed in such a way that they're not model driven, because you might want to have way more resource content than is, you know, realistically usable in a context window. And so I think, you know, I wish applications, and I hope applications will do this in the next few months, use these primitives, you know, way better, because I think there are way more rich experiences to be created that way. Yeah, completely agree with that. And I would also add to that, but I'll go into it later if I haven't. Alessio [00:19:30]: I think that's a great point. And everybody just, you know, has a hammer and wants to do tool calling on everything. I think a lot of people do tool calling to do a database query. They don't use resources for it. What are, like, I guess, maybe the pros and cons, or when should people use a tool versus a resource, especially when it comes to things that do have an API interface? Like, for a database, you can do a tool that does a SQL query. When should you do that versus a resource instead with the data? 
Yeah.Justin/David [00:20:00]: The way we separate these is like tools are always meant to be initiated by the model. It's sort of like at the model's discretion that it will find the right tool and apply it. So if that's the interaction you want as a server developer, where it's like, okay, suddenly I've given the LLM the ability to run SQL queries, for example, that makes sense as a tool. But resources are more flexible, basically. And I think, to be completely honest, the story here is practically a bit complicated today, because many clients don't support resources yet. But I think in an ideal world where all these concepts are fully realized, and there's full ecosystem support, you would do resources for things like the schemas of your database tables and stuff like that, as a way to either allow the user to say, okay, now, you know, Claude, I want to talk to you about this database table. Here it is. Let's have this conversation. Or maybe the particular AI application that you're using, which could be something agentic, like Claude Code, is able to just agentically look up resources and find the right schema of the database table you're talking about. Both those interactions are possible. But I think anytime you have this sort of, you want to list a bunch of entities and then read any of them, that makes sense to model as resources. Resources are also uniquely identified by a URI, always. And so you can also think of them as, you know, sort of general purpose transformers, even. Like, if you want to support an interaction where a user just drops a URI in, and then you automatically figure out how to interpret that, you could use MCP servers to do that interpretation. One of the interesting side notes here, back to the Zed example of resources, is that Zed has a prompt library that people can interact with. And we just exposed a set of default prompts that we want everyone to have as part of that prompt library, via resources, for a while, so that you boot up Zed and Zed will just populate the prompt library from an MCP server, which was quite a cool interaction. And that was, again, a very specific case: both sides needed to agree upon the URI format and the underlying data format. But that was a nice and kind of neat little application of resources. There's also, going back to that perspective of, as an application developer, what are the things that I would want? We also applied this thinking to, like, what existing features of applications could conceivably be kind of factored out into MCP servers if you were to take that approach today. And so basically any IDE where you have an attachment menu, that I think naturally models as resources. It's just, you know, those implementations already existed.swyx [00:22:49]: Yeah, I think the immediate, like, you know, when you introduced it for Claude Desktop and I saw the at sign there, I was like, oh, yeah, that's what Cursor has. But this is for everyone else. And, you know, I think that is a really good design target because it's something that already exists and people can map on pretty neatly. I was actually featuring this chart from Mahesh's workshop that presumably you guys agreed on. I think this is so useful that it should be on the front page of the docs. Like probably should be.
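(To make the database discussion concrete, here is a hedged sketch of that split: table schemas exposed as resources the user or application can attach, and query execution exposed as a tool the model invokes. It assumes the MCP Python SDK's FastMCP helper and uses SQLite purely for illustration; the file name and URI scheme are made up.)

```python
# Sketch: database access where schemas are resources and queries are a tool.
# SQLite is used only for illustration; decorator names follow the MCP
# Python SDK's documented FastMCP API.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sqlite-demo")
DB_PATH = "app.db"  # hypothetical database file

# Resource template: one resource per table, addressable by URI.
@mcp.resource("schema://{table}")
def table_schema(table: str) -> str:
    """Expose a table's CREATE statement so a user can attach it to context."""
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
        ).fetchone()
    return row[0] if row else f"no table named {table}"

# Tool: the model decides when (and with what SQL) to call this.
@mcp.tool()
def run_query(sql: str) -> str:
    """Run a SQL query and return the raw rows as text for the model to digest."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(sql).fetchall()
    return "\n".join(str(r) for r in rows)

if __name__ == "__main__":
    mcp.run()
```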
I think that's a good suggestion.Justin/David [00:23:19]: Do you want to do you want to do a PR for this? I love it.swyx [00:23:21]: Yeah, do a PR. I've done a PR for just Mahesh's workshop in general, just because I'm like, you know. I know.SPEAKER_03 [00:23:28]: I approve. Yeah.swyx [00:23:30]: Thank you. Yeah. I mean, like, but, you know, I think for me as a developer relations person, I always insist on having a map for people. Here are all the main things you have to understand. We'll spend the next two hours going through this. So some one image that kind of covers all this, I think is pretty helpful. And I like your emphasis on prompts. I would say that it's interesting that like I think, you know, in the earliest early days of like chat GPT and cloud, people. Often came up with, oh, you can't really follow my screen, can you? In the early days of chat of, of chat, GPT and all that, like a lot, a lot of people started like, you know, GitHub for prompts, like we'll do prop manager libraries and, and like those never really took off. And I think something like this is helpful and important. I would say like, I've also seen prompt file from human loop, I think, as, as other ways to standardize how people share prompts. But yeah, I agree that like, there should be. There should be more innovation here. And I think probably people want some dynamicism, which I think you, you afford, you allow for. And I like that you have multi-step that this was, this is the main thing that got me like, like these guys really get it. You know, I think you, you maybe have a published some research that says like, actually sometimes to get, to get the model working the right way, you have to do multi-step prompting or jailbreaking to, to, to behave the way that you want. And so I think prompts are not just single conversations. They're sometimes chains of conversations. Yeah.Alessio [00:25:05]: Another question that I had when I was looking at some server implementations, the server builders kind of decide what data gets eventually returned, especially for tool calls. For example, the Google maps one, right? If you just look through it, they decide what, you know, attributes kind of get returned and the user can not override that if there's a missing one. That has always been my gripe with like SDKs in general, when people build like API wrapper SDKs. And then they miss one parameter that maybe it's new and then I can not use it. How do you guys think about that? And like, yeah, how much should the user be able to intervene in that versus just letting the server designer do all the work?Justin/David [00:25:41]: I think we probably bear responsibility for the Google maps one, because I think that's one of the reference servers we've released. I mean, in general, for things like for tool results in particular, we've actually made the deliberate decision, at least thus far, for tool results to be not like sort of structured JSON data, not matching a schema, really, but as like a text or images or basically like messages that you would pass into the LLM directly. And so I guess the correlation that is, you really should just return a whole jumble of data and trust the LLM to like sort through it and see. I mean, I think we've clearly done a lot of work. But I think we really need to be able to shift and like, you know, extract the information it cares about, because that's what that's exactly what they excel at. 
And we really try to think about like, yeah, how to, you know, use LLMs to their full potential and not maybe over specify and then end up with something that doesn't scale as LLMs themselves get better and better. So really, yeah, I suppose what should be happening in this example server, which again, will request welcome. It would be great. It's like if all these result types were literally just passed through from the API that it's calling, and then the API would be able to pass through automatically.Alessio [00:26:50]: Thank you for joining us.Alessio [00:27:19]: It's a hard to sign decisions on where to draw the line.Justin/David [00:27:22]: I'll maybe throw AI under the bus a little bit here and just say that Claude wrote a lot of these example servers. No surprise at all. But I do think, sorry, I do think there's an interesting point in this that I do think people at the moment still to mostly still just apply their normal software engineering API approaches to this. And I think we're still need a little bit more relearning of how to build something for LLMs and trust them, particularly, you know, as they are getting significantly better year to year. Right. And I think, you know, two years ago, maybe that approach would have been very valid. But nowadays, just like just throw data at that thing that is really good at dealing with data is a good approach to this problem. And I think it's just like unlearning like 20, 30, 40 years of software engineering practices that go a little bit into this to some degree. If I could add to that real quickly, just one framing as well for MCP is thinking in terms of like how crazily fast AI is advancing. I mean, it's exciting. It's also scary. Like thinking, us thinking that like the biggest bottleneck to, you know, the next wave of capabilities for models might actually be their ability to like interact with the outside world to like, you know, read data from outside data sources or like take stateful actions. Working at Anthropic, we absolutely care about doing that. Safely and with the right control and alignment measures in place and everything. But also as AI gets better, people will want that. That'll be key to like becoming productive with AI is like being able to connect them up to all those things. So MCP is also sort of like a bet on the future and where this is all going and how important that will be.Alessio [00:29:05]: Yeah. Yeah, I would say any API attribute that says formatted underscore should kind of be gone and we should just get the raw data from all of them. Because why, you know, why are you formatting? For me, the, the model is definitely smart enough to format an address. So I think that should go to the end user.swyx [00:29:23]: Yeah. I have, I think Alessio is about to move on to like server implementation. I wanted to, I think we were talking, we're still talking about sort of MCP design and goals and intentions. And we've, I think we've indirectly identified like some problems that MCP is really trying to address. But I wanted to give you the spot to directly take on MCP versus open API, because I think obviously there's a, this is a top question. I wanted to sort of recap everything we just talked about and give people a nice little segment that, that people can say, say, like, this is a definitive answer on MCP versus open API.Justin/David [00:29:56]: Yeah, I think fundamentally, I mean, open API specifications are a very great tool. And like I've used them a lot in developing APIs and consumers of APIs. 
I think fundamentally, or we think that they're just like too granular for what you want to do with LLMs. Like they don't express higher level AI specific concepts like this whole mental model. Yeah. But we've talked about with the primitives of MCP and thinking from the perspective of the application developer, like you don't get any of that when you encode this information into an open API specification. So we believe that models will benefit more from like the purpose built or purpose design tools, resources, prompts, and the other primitives than just kind of like, here's our REST API, go wild. I do think there, there's another aspect. I think that I'm not an open API expert, so I might, everything might not be perfectly accurate. But I do think that we're... Like there's been, and we can talk about this a bit more later. There's a deliberate design decision to make the protocol somewhat stateful because we do really believe that AI applications and AI like interactions will become inherently more stateful and that we're the current state of like, like need for statelessness is more a temporary point in time that will, you know, to some degree that will always exist. But I think like more statefulness will become increasingly more popular, particularly when you think about additional modalities that go beyond just pure text-based, you know, interactions with models, like it might be like video, audio, whatever other modalities exist and out there already. And so I do think that like having something a bit more stateful is just inherently useful in this interaction pattern. I do think they're actually more complimentary open API and MCP than if people wanted to make it out. Like people look. For these, like, you know, A versus B and like, you know, have, have all the, all the developers of these things go in a room and fist fight it out. But that's rarely what's going on. I think it's actually, they're very complimentary and they have their little space where they're very, very strong. And I think, you know, just use the best tool for the job. And if you want to have a rich interaction between an AI application, it's probably like, it's probably MCP. That's the right choice. And if, if you want to have like an API spec somewhere that is very easy and like a model can read. And to interpret, and that's what, what worked for you, then open API is the way to go. One more thing to add here is that we've already seen people, I mean, this happened very early. People in the community built like bridges between the two as well. So like, if what you have is an open API specification and no one's, you know, building a custom MCP server for it, there are already like translators that will take that and re-expose it as MCP. And you could do the other direction too. Awesome.Alessio [00:32:43]: Yeah. I think there's the other side of MCPs that people don't talk as much. Okay. I think there's the other side of MCPs that people don't talk as much about because it doesn't go viral, which is building the servers. So I think everybody does the tweets about like connect the cloud desktop to XMCP. It's amazing. How would you guys suggest people start with building servers? I think the spec is like, so there's so many things you can do that. It's almost like, how do you draw the line between being very descriptive as a server developer versus like going back to our discussion before, like just take the data and then let them auto manipulate it later. 
Do you have any suggestions for people?Justin/David [00:33:16]: I. I think there, I have a few suggestions. I think that one of the best things I think about MCP and something that we got right very early is that it's just very, very easy to build like something very simple that might not be amazing, but it's pretty, it's good enough because models are very good and get this going within like half an hour, you know? And so I think that the best part is just like pick the language of, you know, of your choice that you love the most, pick the SDK for it, if there's an SDK for it, and then just go build a tool of the thing that matters to you personally. And that you want to use. You want to see the model like interact with, build the server, throw the tool in, don't even worry too much about the description just yet, like do a bit of like, write your little description as you think about it and just give it to the model and just throw it to standard IO protocol transport wise into like an application that you like and see it do things. And I think that's part of the magic that, or like, you know, empowerment and magic for developers to get so quickly to something that the model does. Or something that you care about. That I think really gets you going and gets you into this flow of like, okay, I see this thing can do cool things. Now I go and, and can expand on this and now I can go and like really think about like, which are the different tools I want, which are the different raw resources and prompts I want. Okay. Now that I have that. Okay. Now do I, what do my evals look like for how I want this to go? How do I optimize my prompts for the evals using like tools like that? This is infinite depth so that you can do. But. Okay. Just start. As simple as possible and just go build a server in like half an hour in the language of your choice and how the model interacts with the things that matter to you. And I think that's where the fun is at. And I think people, I think a lot of what MCP makes great is it just adds a lot of fun to the development piece to just go and have models do things quickly. I also, I'm quite partial, again, to using AI to help me do the coding. Like, I think even during the initial development process, we realized it was quite easy to basically just take all the SDK code. Again, you know, what David suggested, like, you know, pick the language you care about, and then pick the SDK. And once you have that, you can literally just drop the whole SDK code into an LLM's context window and say, okay, now that you know MCP, build me a server that does that. This, this, this. And like, the results, I think, are astounding. Like, I mean, it might not be perfect around every single corner or whatever. And you can refine it over time. But like, it's a great way to kind of like one shot something that basically does what you want, and then you can iterate from there. And like David said, there has been a big emphasis from the beginning on like making servers as easy and simple to build as possible, which certainly helps with LLMs doing it too. We often find that like, getting started is like, you know, 100, 200 lines of code in the last couple of years. It's really quite easy. Yeah. And if you don't have an SDK, again, give the like, give the subset of the spec that you care about to the model, and like another SDK and just have it build you an SDK. And it usually works for like, that subset. Building a full SDK is a different story. 
But like, to get a model to tool call in Haskell or whatever, like language you like, it's probably pretty straightforward.swyx [00:36:32]: Yeah. Sorry.Alessio [00:36:34]: No, I was gonna say, I co-hosted a hackathon at the AGI house. I'm a personal agent, and one of the personal agents somebody built was like an MCP server builder agent, where they will basically put the URL of the API spec, and it will build an MCP server for them. Do you see that today as kind of like, yeah, most servers are just kind of like a layer on top of an existing API without too much opinion? And how, yeah, do you think that's kind of like how it's going to be going forward? Just like AI generated, exposed to API that already exists? Or are we going to see kind of like net new MCP experiences that you... You couldn't do before?Justin/David [00:37:10]: I think, go for it. I think both, like, I, I think there, there will always be value in like, oh, I have, you know, I have my data over here, and I want to use some connector to bring it into my application over here. That use case will certainly remain. I think, you know, this, this kind of goes back to like, I think a lot of things today are maybe defaulting to tool use when some of the other primitives would be maybe more appropriate over time. And so it could still be that connector. It could still just be that sort of adapter layer, but could like actually adapt it onto different primitives, which is one, one way to add more value. But then I also think there's plenty of opportunity for use cases, which like do, you know, or for MCP servers that kind of do interesting things in and out themselves and aren't just adapters. Some of the earliest examples of this were like, you know, the memory MCP server, which gives the LLM the ability to remember things across conversations or like someone who's a close coworker built the... I shouldn't have said that, not a close coworker. Someone. Yeah. Built the sequential thinking MCP server, which gives a model the ability to like really think step-by-step and get better at its reasoning capabilities. This is something where it's like, it really isn't integrating with anything external. It's just providing this sort of like way of thinking for a model.Justin/David [00:38:27]: I guess either way though, I think AI authorship of the servers is totally possible. Like I've had a lot of success in prompting, just being like, Hey, I want to build an MCP server that like does this thing. And even if this thing is not. Adapting some other API, but it's doing something completely original. It's usually able to figure that out too. Yeah. I do. I do think that the, to add to that, I do think that a good part of, of what MCP servers will be, will be these like just API wrapper to some degree. Um, and that's good to be valid because that works and it gets you very, very far. But I think we're just very early, like in, in exploring what you can do. Um, and I think as client support for like certain primitives get better, like we can talk about sampling. I'm playing with my favorite topic and greatest frustration at the same time. Um, I think you can just see it very easily see like way, way, way richer experiences and we have, we have built them internally for as prototyping aspects. And I think you see some of that in the community already, but there's just, you know, things like, Hey, summarize my, you know, my, my, my, my favorite subreddits for the morning MCP server that nobody has built yet, but it's very easy to envision. 
And the protocol can totally do this. And these are slightly richer experiences. And I think as people go away from, oh, I'm just in this new world where I can hook up the things that matter to me to the LLM, to actually wanting a real workflow, a real, richer experience that I really want exposed to the model, I think then you will see these things pop up. But again, there's a little bit of a chicken and egg problem at the moment with what clients support versus, you know, what server authors want to do. Yeah.Alessio [00:40:10]: That was kind of my next question, on composability. Like, how do you guys see that? Do you have plans for that? What's kind of like the import of MCPs, so to speak, into another MCP? Like if I want to build the subreddit one, there's probably going to be the Reddit API MCP, and then the summarization MCP. And then how do I do a super MCP?Justin/David [00:40:33]: Yeah. So this is an interesting topic, and I think there are two aspects to it. The one aspect is, how can I build something agentic that requires an LLM call in one form or fashion, like for summarization or so, while staying model independent? And for that, that's where part of this bidirectionality comes in, in this more rich experience, where we do have this facility for servers to ask the client, who again owns the LLM interaction, right? Like we talked about Cursor, which runs the loop with the LLM for you there, for the server author to ask the client for a completion, and basically have it summarize something for the server and return it back. And so now which model summarizes this depends on which one you have selected in Cursor, and not on what the author brings. The author doesn't bring an SDK, doesn't hand you an API key. It's completely model independent, how you can build this. That's just one aspect to that. The second aspect to building richer systems with MCP is that you can easily envision an MCP server that serves something to something like Cursor or Windsurf or Claude Desktop, but at the same time also is an MCP client, and itself can use MCP servers to create a rich experience. And now you have a recursive property, which we actually quite carefully try to retain in the design principles. You see it all over the place, in authorization and other aspects of the spec, that we retain this recursive pattern. And now you can think about, okay, I have this little bundle of applications, both a server and a client, and I can add these in chains and build basically graphs, like DAGs, out of MCP servers that can just richly interact with each other. An agentic MCP server can also use the whole ecosystem of MCP servers available to itself. And I think that's a really cool thing you can do. And people have experimented with this. And I think you'll hopefully see more of this, particularly when you think about auto-selecting, auto-installing; there's a bunch of these things you can do that make a really fun experience.
I, I think practically there are some niceties we still need to add to the SDKs to make this really simple and like easy to execute on like this kind of recursive MCP server that is also a client or like kind of multiplexing together the behaviors of multiple MCP servers into one host, as we call it. These are things we definitely want to add. We haven't been able to yet, but like, uh, I think that would go some way to showcasing these things that we know are already possible, but not necessarily taken up that much yet. Okay.swyx [00:43:08]: This is, uh, very exciting. And very, I'm sure, I'm sure a lot of people get very, very, uh, a lot of ideas and inspiration from this. Is an MCP server that is also a client, is that an agent?Justin/David [00:43:19]: What's an agent? There's a lot of definitions of agents.swyx [00:43:22]: Because like you're, in some ways you're, you're requesting something and it's going off and doing stuff that you don't necessarily know. There's like a layer of abstraction between you and the ultimate raw source of the data. You could dispute that. Yeah. I just, I don't know if you have a hot take on agents.Justin/David [00:43:35]: I do think, I do think that you can build an agent that way. For me, I think you need to define the difference between. An MCP server plus client that is just a proxy versus an agent. I think there's a difference. And I think that difference might be in, um, you know, for example, using a sample loop to create a more richer experience to, uh, to, to have a model call tools while like inside that MCP server through these clients. I think then you have a, an actual like agent. Yeah. I do think it's very simple to build agents that way. Yeah. I think there are maybe a few paths here. Like it definitely feels like there's some relationship. Between MCP and agents. One possible version is like, maybe MCP is a great way to represent agents. Maybe there are some like, you know, features or specific things that are missing that would make the ergonomics of it better. And we should make that part of MCP. That's one possibility. Another is like, maybe MCP makes sense as kind of like a foundational communication layer for agents to like compose with other agents or something like that. Or there could be other possibilities entirely. Maybe MCP should specialize and narrowly focus on kind of the AI application side. And not as much on the agent side. I think it's a very live question and I think there are sort of trade-offs in every direction going back to the analogy of the God box. I think one thing that we have to be very careful about in designing a protocol and kind of curating or shepherding an ecosystem is like trying to do too much. I think it's, it's a very big, yeah, you know, you don't want a protocol that tries to do absolutely everything under the sun because then it'll be bad at everything too. And so I think the key question, which is still unresolved is like, to what degree are agents. Really? Really naturally fitting in to this existing model and paradigm or to what degree is it basically just like orthogonal? It should be something.swyx [00:45:17]: I think once you enable two way and once you enable client server to be the same and delegation of work to another MCP server, it's definitely more agentic than not. But I appreciate that you keep in mind simplicity and not trying to solve every problem under the sun. Cool. I'm happy to move on there. 
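(A rough sketch of the recursive server-plus-client pattern described above, assuming the MCP Python SDK: the aggregator exposes one tool, and inside that tool it acts as a client to a downstream stdio server. The downstream command and tool name are hypothetical stand-ins; a real version of the subreddit-digest idea would also use sampling to ask the host's model for the summary, which is omitted here.)

```python
# Sketch: an MCP server that is also an MCP client (the recursive pattern).
# Its one tool proxies to a downstream MCP server over stdio; the downstream
# command and tool name are hypothetical stand-ins for any server you reuse.
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("aggregator")

DOWNSTREAM = StdioServerParameters(command="python", args=["reddit_server.py"])

@mcp.tool()
async def morning_digest(subreddit: str) -> str:
    """Call a downstream server's tool and return its result as plain text."""
    async with stdio_client(DOWNSTREAM) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("top_posts", {"subreddit": subreddit})
            # Tool results come back as content blocks; keep the text parts.
            return "\n".join(c.text for c in result.content if hasattr(c, "text"))

if __name__ == "__main__":
    mcp.run()
```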
I mean, I'm going to double click on a couple of things that I marked out because they coincide with things that we wanted to ask you. Anyway, so the first one is, it's just a simple, how many MCP things can one implementation support, you know, so this is the, the, the sort of wide versus deep question. And, and this, this is direct relevance to the nesting of MCPs that we just talked about in April, 2024, when, when Claude was launching one of its first contexts, the first million token context example, they said you can support 250 tools. And in a lot of cases, you can't do that. You know, so to me, that's wide in, in the sense that you, you don't have tools that call tools. You just have the model and a flat hierarchy of tools, but then obviously you have tool confusion. It's going to happen when the tools are adjacent, you call the wrong tool. You're going to get the bad results, right? Do you have a recommendation of like a maximum number of MCP servers that are enabled at any given time?Justin/David [00:46:32]: I think be honest, like, I think there's not one answer to this because to some extent, it depends on the model that you're using. To some extent, it depends on like how well the tools are named and described for the model and stuff like that to avoid confusion. I mean, I think that the dream is certainly like you just furnish all this information to the LLM and it can make sense of everything. This, this kind of goes back to like the, the future we envision with MCP is like all this information is just brought to the model and it decides what to do with it. But today the reality or the practicalities might mean that like, yeah, maybe you, maybe in your client application, like the AI application, you do some fill in the blanks. Maybe you do some filtering over the tool set or like maybe you, you run like a faster, smaller LLM to like filter to what's most relevant and then only pass those tools to the bigger model. Or you could use an MCP server, which is a proxy to other MCP servers and does some filtering at that level or something like that. I think hundreds, as you referenced, is still a fairly safe bet, at least for Claude. I can't speak to the other models, but yeah, I don't know. I think over time we should just expect this to get better. So we're wary of like constraining anything and preventing that. Sort of long. Yeah, and obviously it highly, it highly depends on the overlap of the description, right? Like if you, if you have like very separate servers that do very separate things and the tools have very clear unique names, very clear, well-written descriptions, you know, your mileage might be more higher than if you have a GitLab and a GitHub server at the same time in your context. And, and then the overlap is quite significant because they look very similar to the model and confusion becomes easier. There's different considerations too. Depending on the AI application, if you're, if you're trying to build something very agentic, maybe you are trying to minimize the amount of times you need to go back to the user with a question or, you know, minimize the amount of like configurability in your interface or something. But if you're building other applications, you're building an IDE or you're building a chat application or whatever, like, I think it's totally reasonable to have affordances that allow the user to say like, at this moment, I want this feature set or at this different moment, I want this different feature set or something like that. 
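(One way to picture the client-side filtering idea mentioned above: rank the combined tool list against the user's request and only hand the top slice to the big model. The sketch below uses trivial keyword overlap as a stand-in for the smaller LLM or embedding model a real client would use; the tool entries are made up.)

```python
# Sketch: client-side tool filtering before handing tools to the main model.
# A keyword-overlap score stands in for the "faster, smaller LLM" (or an
# embedding model) that a real client would use to rank tools.
def score(tool_description: str, user_request: str) -> int:
    request_words = set(user_request.lower().split())
    return sum(1 for w in tool_description.lower().split() if w in request_words)

def filter_tools(tools: list[dict], user_request: str, keep: int = 20) -> list[dict]:
    """Keep only the `keep` tools whose descriptions best match the request."""
    ranked = sorted(tools, key=lambda t: score(t.get("description", ""), user_request), reverse=True)
    return ranked[:keep]

# Example: 'tools' as aggregated from every connected MCP server.
tools = [
    {"name": "github_create_issue", "description": "Create an issue in a GitHub repository"},
    {"name": "gitlab_create_issue", "description": "Create an issue in a GitLab project"},
    {"name": "query_database", "description": "Run a SQL query against the analytics database"},
]
print(filter_tools(tools, "open a bug on our GitHub repo", keep=2))
```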
And maybe not treat it as like always on, the full list always on all the time. Yeah.swyx [00:48:42]: That's where I think the concepts of resources and tools get to blend a little bit, right? Because now you're saying you want some degree of user control, right? Or application control. And other times you want the model to control it, right? So now we're choosing just subsets of tools. I don't know.Justin/David [00:49:00]: Yeah, I think it's a fair point or a fair concern. I guess the way I think about this is still, at the end of the day, and this is a core MCP design principle: ultimately the client application, and by extension the user, should be in full control of absolutely everything that's happening via MCP. When we say that tools are model controlled, what we really mean is that tools should only be invoked by the model. There really shouldn't be an application interaction or a user interaction where it's like, okay, as a user, I now want you to use this tool. I mean, occasionally you might do that for prompting reasons, but I think that shouldn't be a UI affordance. But the client application or the user deciding to filter out things that MCP servers are offering is totally reasonable, or even to transform them. Like, you could imagine a client application that takes tool descriptions from an MCP server and enriches them, makes them better. We really want the client applications to have full control in the MCP paradigm. In addition, though, one thing that's very, very early in my thinking is there might be an addition to the protocol where you want to give the server author the ability to logically group certain primitives together, potentially, to inform that, because they might know some of these logical groupings better, and that could encompass prompts, resources, and tools at the same time. I mean, we can have a design discussion on that. Personally, my take would be that those should be separate MCP servers, and then the user should be able to compose them together. But we can figure it out.Alessio [00:50:31]: Is there going to be like an MCP standard library, so to speak, of like, hey, these are the canonical servers, do not build these, we're just going to take care of those. And those can maybe be the building blocks that people can compose. Or do you expect people to just rebuild their own MCP servers for a lot of things?Justin/David [00:50:49]: I think we will not be prescriptive in that sense. I think there will be inherently, you know, there's a lot of power. Well, let me rephrase it. Like, I have a long history in open source, and I feel the bazaar approach to this problem is somewhat useful, right, in that the best and most interesting option wins. And I don't think we want to be very prescriptive. I definitely foresee, and this already exists, that there will be like 25 GitHub servers and like 25, you know, Postgres servers and whatnot. And that's all cool. And that's good. And I think they all add in their own way. But effectively, eventually, over months or years, the ecosystem will converge to a set of very widely used ones that, basically, I don't know if you call it winning, but that will be the most used ones. And I think that's completely fine.
Because being prescriptive about this, I don't think it's of any use. I do think, of course, that there will be MCP servers, and you see them already, that are driven by companies for their products. And, you know, they will inherently probably be the canonical implementation. Like if you want to work with Cloudflare Workers and use an MCP server for that, you'll probably want to use the one developed by Cloudflare. Yeah. I think there's maybe a related thing here, too, just one big thing worth thinking about. We don't have any solutions completely ready to go. It's this question of trust, or, you know, vetting is maybe a better word. Like, how do you determine which MCP servers are the kind of good and safe ones to use? Regardless of how many implementations of GitHub MCP servers there are, that could be totally fine, but you want to make sure that you're not using ones that are really, like, sus, right? And so trying to think about how to kind of endow reputation, or, you know, if hypothetically Anthropic is like, we've vetted this, it meets our criteria for secure coding or something, how can that be reflected in this open model where everyone in the ecosystem can benefit? Don't really know the answer yet, but that's very much top of mind.Alessio [00:52:49]: But I think that's like a great design choice of MCP, which is that it's language agnostic. Like already, there's not, to my knowledge, an Anthropic official Ruby SDK, nor an OpenAI one. And Alex Rudall does a great job building those. But now with MCP, you don't actually have to translate an SDK to all these languages. You just do one interface and kind of bless that interface as Anthropic. So yeah, that was nice.swyx [00:53:18]: I have a quick answer to this thing. So, obviously there are like five or six different registries that have already popped up. You guys announced your official registry that's on the way. And a registry is very tempting to offer download counts, likes, reviews, and some kind of trust thing. I think it's kind of brittle. Like no matter what kind of social proof or other thing you can offer, the next update can compromise a trusted package. And actually that's the one that does the most damage, right? So setting up a trust system creates the damage that comes from abusing the trust system. And so I actually want to encourage people to try out MCP Inspector, because all you got to do is actually just look at the traffic. And I think that goes for a lot of security issues.Justin/David [00:54:03]: Yeah, absolutely. Cool. And then I think that's a very classic supply chain problem that all registries effectively have. And, you know, there are different approaches to this problem. Like you can take the Apple approach and vet things and have an army of both automated systems and review teams to do this. And then you effectively build an app store, right? That's one approach to this type of problem. It kind of works in a certain set of ways. But I don't think it works in an open source kind of ecosystem, for which you always have a registry kind of approach, similar to npm packages and PyPI.swyx [00:54:36]: And they all inherently have these supply chain attack problems, right? Yeah, yeah, totally. Quick time check. I think we're going to go for another like 20, 25 minutes. Is that okay for you guys?
Okay, awesome. Cool. I wanted to double click, take the time. So I'm going to sort of, we previewed a little bit on like the future coming stuff. So I want to leave the future coming stuff to the end, like registry, the, the, the stateless servers and remote servers, all the other stuff. But I wanted to double click a little bit. A little bit more on the launch, the core servers that are part of the official repo. And some of them are special ones, like the, like the ones we already talked about. So let me just pull them up already. So for example, you mentioned memory, you mentioned sequential thinking. And I think I really, really encourage people should look at these, what I call special servers. Like they're, they're not normal servers in the, in the sense that they, they wrap some API and it's just easier to interact with those than to work at the APIs. And so I'll, I'll highlight the, the memory one first, just because like, I think there are, there are a few memory startups, but actually you don't need them if you just use this one. It's also like 200 lines of code. It's super simple. And, and obviously then if you need to scale it up, you should probably do some, some more battle tested thing. But if you're interested, if you're just introducing memory, I think this is a really good implementation. I don't know if there's like special stories that you want to highlight with, with some of these.Justin/David [00:56:00]: I think, no, I don't, I don't think there's special stories. I think a lot of these, not all of them, but a lot of them originated from that hackathon that I mentioned before, where folks got excited about the idea of MCP. People internally inside Anthropik who wanted to have memory or like wanted to play around with the idea could quickly now prototype something using MCP in a way that wasn't possible before. Someone who's not like, you know, you don't have to become the, the end to end expert. You don't have access. You don't have to have access to this. Like, you know. You don't have to have this private, you know, proprietary code base. You can just now extend cloud with this memory capability. So that's how a lot of these came about. And then also just thinking about like, you know, what is the breadth of functionality that we want to demonstrate at launch?swyx [00:56:47]: Totally. And I think that is partially why it made your launch successful because you launch with a sufficiently spanning set of here's examples and then people just copy paste and expand from there. I would also highligh

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

If you're in SF: Join us for the Claude Plays Pokemon hackathon this Sunday!If you're not: Fill out the 2025 State of AI Eng survey for $250 in Amazon cards!Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what's real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs. Top guests: Noam Shazeer, Bob McGrew, Noam Brown, Dylan Patel, Percy Liang, David LuanFull Episode on Their YouTubeTimestamps* 00:00 Introduction and Excitement for Collaboration* 00:27 Reflecting on Surprises in AI Over the Past Year* 01:44 Open Source Models and Their Adoption* 06:01 The Rise of GPT Wrappers* 06:55 AI Builders and Low-Code Platforms* 09:35 Overhyped and Underhyped AI Trends* 22:17 Product Market Fit in AI* 28:23 Google's Current Momentum* 28:33 Customer Support and AI* 29:54 AI's Impact on Cost and Growth* 31:05 Voice AI and Scheduling* 32:59 Emerging AI Applications* 34:12 Education and AI* 36:34 Defensibility in AI Applications* 40:10 Infrastructure and AI* 47:08 Challenges and Future of AI* 52:15 Quick Fire Round and Closing RemarksTranscript[00:00:00] Introduction and Podcast Overview[00:00:00] Jacob: well, thanks so much for doing this, guys. I feel like we've we've been excited to do a collab for a while. I[00:00:13] swyx: love crossovers. Yeah. Yeah. This, this is great. Like the ultimate meta about just podcasters talking to other podcasters. Yeah. It's a lot. Podcasts all the way up.[00:00:21] Jacob: I figured we'd have a pretty free ranging conversation today but brought a few conversation starters to, to, to kick us off.[00:00:27] Reflecting on AI Surprises and Trends[00:00:27] Jacob: And so I figured one interesting place to start is you know, obviously it feels that this world is changing like every few months. Wondering as you guys reflect path on the past year, like what surprised you the most?[00:00:36] Alessio: I think definitely recently models we kinda on the, on the right here. Like, oh, that, well, I, I I think there's, there's like the, what surprised us in a good way.[00:00:44] May maybe in a, in a bad way. I would say in a good way. Recently models and I think the release of them right after the new reps scaling instead talked by Ilia. I think there was maybe like a, a little. It's so over and then we're so back. I'm like such a short, short period. It was really [00:01:00] fortuitous[00:01:00] Jacob: timing though, like right.[00:01:01] As pre-training died, I mean, obviously I'm sure within the labs they knew pre-training was dying and had to find something. But you know, from the outside it was it, it felt like one right into the other.[00:01:09] Alessio: Yeah. Yeah, exactly. So that, that was a good surprise,[00:01:12] swyx: I would say, if you wanna make that comment about timing, I think it's suspiciously neat that like, because we know that Strawberry was being worked on for like two years-ish.[00:01:20] Like, and we know exactly when Nome joined OpenAI, and that was obviously a big strategic bet by OpenAI. So like, for it to transition, so transition so nicely when like, pre-training is kind of tapped out to, into like, oh, now inference time is, is the new scaling law is like conv very convenient. I, I, I like if there were an Illuminati, this would be what they planned.[00:01:41] Or if we're living in a simulation or something. 
Yeah.[00:01:44] Open Source Models and Their Impact[00:01:44] swyx: Then you said open source[00:01:45] Alessio: as well? Yeah. Well, no, I, I think like open source. Yeah. We're discussing this on the negative. I would say the relevance of open source. I would specifically open models. Yeah, I was surprised the lack, like the llamas of the world by the lack of adoption.[00:01:56] And I mean, people use it obviously, but I would say nobody's [00:02:00] really like a huge fanboy, you know, I think the local llama community and some of the more obvious use cases really like it. But when we talk to like enterprise folks, it's like, it's cool, you know? And I think people love to argue about licenses and all of that, but the reality is that it doesn't really change the adoption path of, of ai.[00:02:18] So[00:02:19] swyx: yeah, the specific stat that I got from on anchor from Braintrust mm-hmm. In one of the episodes that we did was I think he estimated that open source model usage in work in enterprises is that like 5% and going down.[00:02:31] Jacob: And it feels like you're basically all these enterprises are in like use case discovery mode, where it's like, let's just take what we think is the most powerful model and figure out if we can find anything that works.[00:02:39] And, you know, so much of, of, of it feels like discovery of that. And then, right, as you've discovered something, a new generation of models are out and so you have to go do discovery with those. And you know, I think obviously we're probably optimistic that the that the open source models increase in uptake.[00:02:50] It's funny, I was gonna say my biggest surprise in the last year was open source related, but it was just how Fast Open Source caught up on the reasoning models. It was kind of unclear to me, like over time whether there would be, you know, [00:03:00] a compounding advantage for some of the closed source models where in the, okay, in the early days of, of scaling you know, there was a, a tight time loop, but over time, you know, would would the gap increase?[00:03:08] And if anything it feels like a trunk. You know, and I think deep seek specifically was just really surprising in how, you know, in many ways if the value of these model companies is like you have a model for a period of time and you're the only one that can build products on top of that model while you have it.[00:03:21] Like, God, that time period is a lot shorter than a, than I thought it was gonna be a year ago.[00:03:25] swyx: Yeah. I mean, again, I I, I don't like this label of how Fast Open Source caught up because it's really how Fast Deepsea caught up. Right. And now we have, like, I think some of it is that Deepsea is basically gonna stop open sourcing models.[00:03:36] Yeah. So like there, there's no team open source, there's just different companies and they choose to open source or not. And we got lucky with deep seek releasing something and then everyone else is basically distilling from deep seek and those are distillations. Catching up is such an easier lower bar than like actually catching up, which is like you, you are like from scratch.[00:03:56] You're training something that like is competitive on that front. I don't know if [00:04:00] that's happening. 
Like basically the only player right now is, we're waiting for Llama 4.[00:04:03] Jordan: I mean, it's always an order of magnitude cheaper to replicate what's already been done than to create something fundamentally new.[00:04:09] And so that's why I think DeepSeek overall was overhyped. Right? I mean obviously it's a good open source new entrant, but at the same time there's nothing fundamentally new there other than sort of executing what's already been done really well.[00:04:21] Alessio: Yeah,[00:04:21] Jordan: right.[00:04:21] Alessio: So, well, but I think the traces are maybe the biggest thing. I think most previous open models are like the same model, just a little worse and cheaper.[00:04:30] Yeah. Like R1 is like the first model that had the full traces. So I think that's like a net unique thing in, fair, open source. But yeah, I think we talked about DeepSeek in our end of year 2023 recap, and we were mostly focused on cheaper inference. Like we didn't really have DeepSeek, DeepSeek V3[00:04:47] swyx: was out then, and we were like, that was already talking about fine-grained mixture of experts and all that.[00:04:51] Like that's a great receipt to[00:04:52] Jacob: have[00:04:52] swyx: to be like, yeah.[00:04:52] Jacob: End[00:04:53] swyx: of year 20. Yeah. That's a,[00:04:54] Jacob: that's a, that's, that's an[00:04:55] swyx: impressive one. You follow the right whale believers on Twitter. It's like [00:05:00] pretty obvious. I actually had, like, so, you know, I used to be in finance, and a lot of my hedge fund and PE friends called me up.[00:05:06] They were like, why didn't you tip us off on DeepSeek? And I'm like, well, I mean, it's been there. It's actually kind of surprising that, like, Nvidia fell like what, 15% in one day? Yeah. Because of DeepSeek. And I think it's just like whatever the public market narrative decides is a story becomes the story, but really the technical movements are usually[00:05:26] one to two years in the making. Before that,[00:05:27] Jacob: basically these people were telling on themselves that they didn't listen to your podcast. They've been on the end of year 22, 3. No, no,[00:05:32] swyx: no. Like yeah, we weren't, we weren't like banging the drum. So it's also on us to be like, no, this is an actual tipping point.[00:05:38] And I think, like, as people who are, like, our function as podcasters and industry analysts is to raise the bar or focus attention on things that you think matter. And sometimes we're too passive about it. And I think I was too passive there. I'd be happy to own up to that.[00:05:52] Jacob: No, I feel like over time you guys have moved into this margin general role of like taking stances on things that are or aren't important and, you know, I feel like you've done that with MCP of [00:06:00] late and a bunch of[00:06:00] swyx: things.[00:06:00] Yeah.
And turns out we went from like, making fun of GPT rappers to now I think the overwhelming consensus GPT wrappers is the only thing that's interesting.[00:06:20] Yeah.[00:06:21] Jacob: I remember like, Arvin from Perplexity came on our podcast and he was like, I'm proudly a rapper. Like, you know, it's like anyone that's like talking about like, you know, differentiation, like pre-product market fit is like a ridiculous thing to, to say, like, build something people want and then yeah.[00:06:33] Over time you can kind of worry about that.[00:06:35] swyx: Yeah. I, I interviewed him in 2023 and I think he may have been the first person on our podcast to like, probably be a GBT rapper. Yeah. And yeah, and obviously he's built a huge business on that. Totally. Now, now we now we all can't get enough of it. I have another one for, Oh, nice.[00:06:47] That was Alessia's one and we, we perhaps individual answers just to be interesting in the same Uber on the way up. Yeah. You just like in the, in different Oh, I was driving too. Oh, you were driving. So I actually, I mean, it was a Tesla mostly drove mine was [00:07:00] actually, it is interesting that low-code builders did not capture the AI builder market.[00:07:04] Right. AI builders being bought lovable, low-code builders being Zapier, Airtable, retool notion. Any of those, like you're not technical. You can build software.[00:07:14] misc: Yeah.[00:07:14] swyx: Somehow not all them missed it. Why? It's bizarre. Like they should have the DNA, I don't know. They should have. They already have the reach, they already have the, the distribution.[00:07:25] Like why? I I have no idea. The ability to[00:07:27] Jacob: fast follow too. Like I'm surprised there's Yeah. There's just[00:07:29] swyx: nothing. Yeah. What do you make of that? I, it seems and you know, not to come back to the AI engineering future, like it takes a, a certain kind of. Founder mindset or AI engineer mindset to be like, we will build this from whole cloth and not be tied to existing paradigms.[00:07:45] I think, 'cause I like, if I was, if I'm to, you know, you know, Wade or who's, who's, who's the Zapier person than, you know, Mike. Mike who has left the Zapier. Yeah. What's the, yeah. Like you know, Zapier, when they decided to do Zapier ai, they [00:08:00] were like, oh, you can use natural language to make Zap actions, right?[00:08:03] When Notion decided to do Notion ai, they were like, oh, you can like, you know write documents or, you know, fill in tables with, with ai. Like, they didn't do the, the, the, the next step because they already had their base and they were like, let's improve our baseline. And the other people who actually tried for to, to create a phone cloth were like, we, we got no prior preconceptions.[00:08:24] Like, let's see what we can, what kinda software people can build with like from scratch, basically. I don't know that, that's my explanation. I dunno if you guys have any retros on the AI builders?[00:08:33] Jacob: Yeah. Or, or, or did they kind of get lucky getting, you know starting that product journey? Like right as the models were reaching the inflection point?[00:08:39] There's the timing[00:08:40] swyx: issue. Yeah. Yeah, yeah. Yeah. Yeah, I don't know. Like I, I, to some extent, I think the only reason you and I are talking about it is that they, both of them have reported like ridiculous numbers. Like zero to 20 million in three months, basically, both of them. 
Jordan, did you have a big surprise?[00:08:55] Jordan: Yeah, I mean, some of what's already been discussed. I guess the only other thing would be on the Apple side in particular. I think, you know, the text message summaries,[00:09:04] Jacob: they're funny. They're funny in how bad they are, how off they are. They're viral. Yeah.[00:09:08] Jordan: I mean, so for the last couple years we've seen so many companies that are trying to do personal assistants, like all these various consumer things, and one of the things we've always asked is, well, Apple is in prime position to do all this.[00:09:18] And then with Apple Intelligence, they just totally messed up in so many different ways. And then the whole BBC thing, saying that the guy shot himself when he didn't. There are just so many things at this point. I would've thought that they would've ironed out their AI products better, but it just didn't really catch on,[00:09:35] Jacob: you know. Second on this list of generally overly broad opening questions would be: anything that you guys think is kind of overhyped or underhyped in the AI world right now?[00:09:43] Alessio: Overhyped: agent frameworks. Sorry. Not naming any particular ones. I'm sorry. Not, not, yeah, exactly. I would say there's just overall a chase to try and be the framework when the workloads are in such flux. Yeah. That I just think is so [00:10:00] hard to reconcile the two. I think what Harrison and LangChain have done so amazingly is product velocity.[00:10:05] Like, you know, the initial abstractions were maybe not the ending abstractions, but they were just releasing stuff every day, trying to be on top of it. But I think now we're past that; what people are looking for now is something that they can actually build on mm-hmm. and stay on for the next couple of years.[00:10:23] And we talked about this with Bret Taylor on our episode, and it feels like it's the jQuery era Yeah. of agents and LLMs. It's kinda like, you know, single file, big frameworks, kinda like a lot of players, but maybe we need React. And I think people are still just trying to build jQuery.[00:10:39] Like, I don't really see a lot of people doing React, like,[00:10:43] swyx: yeah. Maybe the only modification I'd make to that is maybe it's too early even for frameworks at all. And the thing that, and do you think[00:10:50] Jacob: there's enough stability in the underlying model layer and patterns to have this?[00:10:54] swyx: The thing is the protocol and not the framework.[00:10:56] Jacob: Yeah.[00:10:56] swyx: Because frameworks inherently embed protocols, but if you just focus on a protocol, maybe that [00:11:00] works. And obviously MCP is the current leading mm-hmm. area. And you know, I think the comparison there would be, instead of jQuery, it is XMLHttpRequest, which is the thing that enabled Ajax.[00:11:10] And that was the sort of inciting incident for JavaScript becoming popular as a language.[00:11:16] Jordan: I would largely agree with that. I mean, I think on the React side of things, we're starting to see more frameworks sort of go after more of that. I guess Mastra is sort of like that on the TypeScript side.[00:11:28] Yeah, yeah. The traction is really impressive there.
And so I think we're starting to see more surface there, but I think there's still a big opportunity. What do you have for an over or underhyped? On the underhyped side, you know, I know I mentioned Apple already, but the private cloud compute side with PCC, I actually think that could be really big.[00:11:45] It's under the radar right now. Mm-hmm. But in terms of basically bringing the on-device sort of security to the cloud, they've done a lot of architecturally interesting things there. Who's they? Apple. Oh, okay. On the PCC side. And so I actually think a lot of that.[00:11:58] swyx: So you're negative on Apple [00:12:00] Intelligence, but also on Apple Cloud?[00:12:01] Jordan: More on the local device side.[00:12:04] Sort of. I think there'll be a lot of workloads still on device, but when you need to speak to the cloud for larger LLMs, I think that Apple has done really interesting things on the privacy side.[00:12:13] Alessio: Yeah. We did the seed of a company that does that, so yeah. Especially as things become more... Did you set 'em up on purpose?[00:12:18] That felt like a perfect setup. Yeah, no, I was like, let's go Jordan. Were you guys colluding before this episode? Tell me about that company after. We'll chat after. But yes, I think that's the unique thing about LLM workflows: you just cannot have everything be single tenant, right?[00:12:35] Because you just cannot get enough GPUs. Like, even large enterprises are used to having VPCs and everything runs privately. But now you just cannot get enough GPUs to run in a VPC. So I think you're gonna need to be in a multi-tenant architecture, and you need, like you said, single-tenant guarantees in a multi-tenant environment.[00:12:52] So yeah, it's an interesting space.[00:12:55] swyx: Yeah. What about you, Swyx? Underhyped, I want to say [00:13:00] memory. Just like stateful AI. As part of my keynote, for just about every conference I do, I do a keynote, and I try to do the task of defining an agent, you know, always evergreen content for a keynote.[00:13:14] But I did it the way I think a researcher would do it. Like, you survey what people say, and then you categorize and go, okay, this is what everyone calls agents, and here are the groups of definitions. Pick and choose. Right. And then it was very interesting that the week after that, OpenAI launched their Agents SDK and kind of formalized what they think agents are.[00:13:34] Cloudflare also did the same with theirs, and none of them had memory. Yeah, it's very strange. Pretty much no big lab. Obviously there's conversation memory, but there's not memory memory, like in a "let's store a lasting fact about you across sessions" way, and, you know, exceed the context length.[00:13:54] And if you look closely enough, there's a really good implementation of memory inside of [00:14:00] MCP when they launched with the initial set of servers. They had a memory server in there, which I would recommend as, like, that's where you start with memory. But I think if there was a better[00:14:10] memory abstraction, then a lot of our agents would be smarter and could learn on the job, which is something that we all want.
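A minimal sketch of the kind of memory abstraction being described: persistent facts that live outside the context window and get retrieved back into the prompt on later turns. The names here (MemoryStore, remember, recall) are illustrative, not any particular library's or lab's API, and the keyword-overlap retrieval stands in for a real embedding search.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Fact:
    text: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class MemoryStore:
    """Toy long-term memory: facts persist across conversations,
    unlike the chat transcript, which is bounded by context length."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def remember(self, text: str) -> None:
        self.facts.append(Fact(text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword overlap stands in for embedding search here.
        q = set(query.lower().split())
        scored = sorted(
            self.facts,
            key=lambda f: len(q & set(f.text.lower().split())),
            reverse=True,
        )
        return [f.text for f in scored[:k]]


# Usage: memories survive the session and are injected on later turns.
memory = MemoryStore()
memory.remember("User prefers TypeScript examples over Python.")
memory.remember("User is evaluating agent frameworks for a support bot.")


def build_prompt(user_message: str) -> str:
    context = "\n".join(memory.recall(user_message))
    return f"Known facts about the user:\n{context}\n\nUser: {user_message}"


print(build_prompt("Which agent framework should I pick?"))
```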
And for some reason we all just ignored that, because it's just convenient to. And, but do you feel like[00:14:24] Jacob: it's being ignored, or it's just a really hard problem? I feel like lots of people are working on it.[00:14:27] It just feels like it's proven more challenging.[00:14:29] swyx: Yeah. Yeah. So Harrison has LangMem, which I think he's now, you know, relaunched again. And then we had Letta come speak at our mm-hmm. our conference. I don't know, Zep, I think there's a bunch of other memory guys, but something like this I think should be normal in the stack.[00:14:44] And basically I think anything stateful should be interesting to VCs, 'cause it's databases, and, you know, we know how those things make money.[00:14:51] Jacob: I think on the overhyped side, the only thing I'd add is, I'm still surprised how many net new companies there are training models. I thought we were kind of past that.[00:14:58] And[00:14:58] swyx: I would say they died end of last year. And now [00:15:00] they've resurfaced. Yeah. I mean, that's one of the questions that you had down there: is there an opportunity for net new model players? I wouldn't say no. I don't know what you guys think.[00:15:08] Alessio: I don't have a reason to say no, but I also don't have a reason to say, this is what is missing and you should have a new model company do it.[00:15:15] But again, I would add here, like, all these guys wanna[00:15:17] swyx: pursue AGI, you know. They all want to be like, oh, we'll hit, you know, SOTA on all the benchmarks, and they can't all do it. Yeah.[00:15:25] Jacob: I mean, look, I don't know if Ilya has the secret approach up his sleeve of something beyond test time compute.[00:15:29] Mm-hmm. But it was funny, we had Noam Shazeer on the podcast last week. I was asking him, you know, is there some sort of other algorithmic breakthrough? What would he make of Ilya? And he's like, look, I think what he implicitly said was test time compute gets to the point where these models are doing AI engineering for us.[00:15:43] And so, you know, at that point they'll figure out the next algorithmic breakthrough. Yeah. Which I thought was pretty interesting.[00:15:47] Jordan: I agree with you folks. I think that we're most interested, at least from our side, in, you know, foundation models for specific use cases and more specialized use cases.[00:15:55] Mm-hmm. I guess the broader point is, if there is something like that, that these companies can latch onto [00:16:00] and be sort of known for being the best at, maybe there's a case for that. Largely though, I do agree with you that I don't think there should be, at this point, more model companies. I think it's like[00:16:09] Jacob: these[00:16:09] Jordan: unique data[00:16:09] Jacob: sets, right?[00:16:10] I mean, obviously robotics has been an area we've been really interested in. It's an entirely different set of data that's required, you know, on top of a good VLM, and then, you know, biology, material sciences, more the specific use cases basically. Yeah. But also specific, like specific markets.
A lot of these models are super generalizable, but, you know, finding opportunities where, you know, for a lot of these bio companies, they have wet labs, they're running a ton of experiments, or, you know, same on the material sciences side.[00:16:31] And so I still feel like there's some opportunities there, but the core kind of LLM agent space is tough, tough to compete with the big ones.[00:16:38] Alessio: Yeah. Agree. Yeah. But they're moving more into product. Yeah. So I think that's the question: if they could do better vertical models, why not do that instead of trying to do deep research and Operator[00:16:50] and these different things? Mm-hmm. I think that's what, in my mind, it's like the agents coming[00:16:53] swyx: out too.[00:16:54] Alessio: Well, yeah. In my mind it's financial pressure. Like, they need to monetize in a much shorter timeframe [00:17:00] because the costs are so high. But maybe it's not that easy to, do[00:17:04] Jacob: you think it would be a better business model to do a bunch of verticals?[00:17:07] Well, it's more like[00:17:07] Alessio: why wouldn't they, you know? Like, you make fewer enemies if you're a model builder, right? Yeah. Like now, with deep research and search, Perplexity is like an enemy, and, you know, Gemini Deep Research is more of an enemy. Versus if they were doing a finance model, you know,[00:17:25] mm-hmm, or whatever, they would just enable so many more companies. And they always have, like, they had one as one of the customer case studies for GPT search, but they're not building a finance-based model for them. So is it because it's super hard and somebody should do it? Or is it because the new models[00:17:41] are gonna be so much better that the vertical models are useless anyway? Like, this is the bitter lesson. Exactly.[00:17:46] Jacob: It still seems to be a somewhat outstanding question. I'd say all the signs of the last few years seem to be that a general purpose model is the way to go. And, you know, training a hyper-specific model in a domain is, you know, maybe cheaper and faster, but it's not gonna be higher quality.[00:17:59] But [00:18:00] also, I think it's still, I mean, we were talking to Noam and Jack Rae from Google last week, and they were like, yeah, this is still outstanding; we check this every time we have a new model, whether that still seems to be holding. I remember a few years ago, it felt like all the rage, it was like the Bloomberg GPT model came out.[00:18:14] Everyone was like, oh, you gotta have, you know, massive data. Yeah. I had[00:18:17] swyx: the head of AI at Bloomberg present on that. Yeah. That must be a really[00:18:20] Jacob: interesting episode to go back on, because I feel like very shortly thereafter the next OpenAI model came out and just beat it on all sorts of[00:18:25] swyx: No, it was a talk.[00:18:26] We haven't released it yet, but yeah, I mean, basically they concluded that the closed models were better, so they just Yeah. Stopped. Interesting. Exactly. So I feel like that's been the pattern. But he's, I would say,
He's very insistent that the work that they did, the team he assembled, the data that he collected, is actually useful for more than just the model.[00:18:42] So basically everything but the model survived. What are the other things? The data pipeline. Okay. The team that they assembled for fine-tuning and implementing whatever models they ended up picking. Yeah, it seems like they are happy with that, and they're running with that.[00:18:57] He runs like 12, 13 [00:19:00] teams at Bloomberg just working on GenAI across the company.[00:19:03] Jacob: I mean, I guess we've all kind of been alluding to it right now, but because it's a natural transition, you know, the other broad opening I have is just what we're paying most attention to right now. And I think back on this, you know, the model companies coming into the product area.[00:19:13] I mean, I think that's gonna be, I'm fascinated to see how that plays out over the next year, and kind of these frenemy dynamics. It feels like it's gonna first boil up on, like, Cursor and Anthropic, and the way that plays out over the next six months I think will be. What, what is Cursor?[00:19:26] swyx: Anthropic is, you mean Cursor versus Anthropic? Or, yeah. And I[00:19:29] Jacob: assume, you know, over time Anthropic wants to get more into the application side of coding, uh-huh, and, you know, I assume over time Cursor will wanna diversify off of just using the Anthropic model.[00:19:39] swyx: It's interesting that Cursor is now worth like 10 billion, nine, nine, 10 billion.[00:19:43] Yeah. And they've made themselves hard to acquire. Like, I would've said, you should just get yourself to five, six billion and join OpenAI, and all the training data goes through OpenAI and that's how they train their coding model. Now it's not that simple. Now they need to be an independent company.[00:19:57] Jacob: Increasingly, it seems the model companies want to get into the [00:20:00] product layer. And so, seeing over the next six, 12 months: does having the best model, you know, let you kind of start from a cold start on the product side and get something in market? Or do the, you know, companies with the best products, even if they eventually have to switch to a somewhat, tiny bit worse model, does it not, you know, where do the developers ultimately choose to go?[00:20:16] I think that'll be super interesting. Yeah.[00:20:18] Alessio: Don't you think that Devin is more in trouble than Cursor? I feel like Anthropic, if anything, wants to move more towards, I don't think they wanna build the IDE. Like, if I think about coding, it's kind of like, you know, you look at it like a cube: the IDE is one way to get the code, and then the agent is the other side.[00:20:33] Yeah. I feel like Anthropic wants more to be on the agent side, and then hand you off to Cursor when you want to go in depth, versus trying to build the Claude IDE. I think that's not, I would say, I don't know how you think the[00:20:46] swyx: existence of Claude Code, doesn't that show, doesn't that support what you say? Like maybe they would, but[00:20:52] Jacob: assume, like I assume both just converge eventually, where you will be able to do both?[00:20:57] So,[00:20:57] swyx: so in order to be, so we're talking [00:21:00] about coding agents, whether it's, sort of, what is it, inner loop versus outer loop, right?
Like, inner loop is inside Cursor, inside your IDE, inside of a git commit; and outer loop is between git commits, on the cloud. And I think to be an outer loop coding agent, you have to be more of a "we will integrate with your code base, we'll sign your whatever[00:21:17] security thing that you need to sign." Yeah. That kinda schlep. I don't think the model labs wanna do that schlep; they just want to provide models. So that would be my argument for why Cognition should still have some moat against Anthropic, just simply because Cognition would do the schlep and the biz dev and the infra that Anthropic doesn't really care about.[00:21:39] Jacob: I know the schlep is pretty sticky though. Once you do it,[00:21:41] swyx: it's very sticky. Yeah. Yeah. I mean, it's interesting. Like, I think the natural winner of that should be Sourcegraph. But there's another[00:21:47] Jacob: unprompted portfolio plug. Nice. We, I mean they, they're[00:21:51] swyx: big supporters, like, very friendly with both Quinn and Beyang, and they've done a lot of work with Cody, but, no, not much work on the outer [00:22:00] loop stuff yet.[00:22:01] But, like, any company where they can already say, we've been around for 10 years, we have all the enterprise contracts, you already trust us with your code base. Why would you go trust, like, Factory or Cognition as, you know, two-year-old startups who just came outta MIT? Like, I don't know.[00:22:17] Product Market Fit in AI[00:22:17] Jacob: I guess switching gears to the application side. I'm curious for both of you, how do you characterize what has genuine product market fit in AI today? And I guess, for you more on the investing side, is it more interesting to invest in the category of stuff that works today, or kind of where the capabilities are going long term?[00:22:35] Alessio: That's hard. You're asking me to do my job for you, like, man, that's an easy one, that's a layup. Tell us all your investing[00:22:40] theses. Yeah, yeah, yeah. I would say we, well, we only really do mostly seed investing, so it's hard to invest in things that already work. Yeah. That are really late. So we try to, but we try to be at the cusp of, like, you know, usually the investments we like to make, there's really not that much market risk.[00:22:57] It's like, if this works, obviously people are gonna [00:23:00] use it, but it's unclear whether or not it's gonna work. So that's kind of more what we skew towards. We try not to chase as many trends, and, I don't know, you know, I was a founder myself, and sometimes I feel like it's easy to just jump in and do the thing that is hot, but becoming a founder to do something that is underappreciated or doesn't yet work shows some level of grit and self-belief, like you actually really believe in the thing.[00:23:25] So that alone for me kind of makes me skew more towards that.
And you do a lot of angel investing too, so I'm curious how,[00:23:31] swyx: Yeah, but I don't really put that in my mental framework of things. Like, I come at this much more as a content creator or market analyst. Like, yeah, it really does matter to me what has product market fit, because[00:23:45] I have to answer the question of what is working now when people ask me.[00:23:50] Jacob: Do you feel like, relative to the obvious hype and discourse out there, you know, do you feel like there's a lot of things that have product market fit, or like a few things? Where, a few things? Yeah.[00:23:58] swyx: I was gonna say this. So I have a list. [00:24:00] Like, two years ago I wrote the Anatomy of Autonomy post, where it was the first, like, what's going on in agents, and what is actually making money. Because I think there's a lot of GenAI skeptics out there. They're all like, these things are toys,[00:24:13] they're unreliable, and why are you dedicating your life to these things? And I think for me, the product market fit bar at the time was a hundred million dollars, right? Like, what use cases can reasonably fit a hundred million dollars. And at the time it was Copilot, it was Jasper.[00:24:30] No longer, but mm-hmm. You know, in that category of, like, help you write. Yeah. Which I think was helpful. And then Cursor, I think, was on there as a coding agent, plus plus. I think that list will just grow over time, of the form factors that we know to work, and then we can just adapt the form factors to a bunch of other things.[00:24:47] So the one that's been most recently added to this is deep research.[00:24:52] misc: Yeah.[00:24:52] swyx: Right. Where anything that looks like a deep research, whether it's a Grok version, Gemini version, Perplexity version, whatever. He has an investment [00:25:00] that he likes called Brightwave that is basically deep research for finance.[00:25:02] Yeah. And anything where all it is is long-running agentic reporting, and it's starting to take more and more of the job away from you and just give you a much more reasoned report. I think it's going to work. And that has some PMF. I think it obviously has PMF, I would say. I went through this exercise of trying to handicap how much money OpenAI made from launching OpenAI Deep Research.[00:25:25] I think it's billions. Like, the upgrade from like $20 to $200, it has to be billions in ARR off that. Maybe not all of them will stick around, but that is some amount of PMF. That is, didn't they have to immediately drop it down[00:25:38] Jacob: to the $20 tier?[00:25:39] swyx: They expanded access. I don't, I wouldn't say, which I thought was[00:25:42] Jacob: really telling of the market.[00:25:43] Right. It's like where you have a, you know, I think it's gonna be so interesting to see what they're actually able to get in that $200 or $2,000 tier, which we all think has a ton of potential. But I thought it was fascinating.
I don't know whether it was just to get more people exposure to it, or the fact that Google had a similar product obviously, and other folks did too.[00:25:59] But [00:26:00] it was really interesting how quickly they dropped it down.[00:26:02] swyx: I think that's just a more general policy of: no matter what they have at the top tier, they always want to have smaller versions of that in the lower tiers. Yeah. And just get people exposure to it. Just, yeah, just get exposure.[00:26:12] The brand of being first to market and being the default choice Yeah. is paramount to OpenAI.[00:26:18] Jacob: Though I thought that whole thing was fascinating, 'cause Google had the first product, right? Yeah. And no, like, you know, I, we[00:26:24] swyx: interviewed them. I straight up said to their faces, I was like, OpenAI mocked you.[00:26:28] And they were like, yeah, well, actually curious, what's[00:26:30] Jacob: it, this is totally off topic, but whatever. Like, what is it going to take for Google? Google just released some great models a few weeks ago. Like, I feel like it's happening. The stuff they're shipping is really cool. It's happening. Yeah, but I also feel like, at least in the, you know, broader discourse, it's still like a drop in the bucket relative to[00:26:45] swyx: Yeah.[00:26:45] I mean, I can riff on this. But I think it's happening. I think it takes some time, but my Gemini usage is up. Like, I use it a lot more for anything from summarizing YouTube videos, to the [00:27:00] native image generation Yeah. that they just launched, to Flash Thinking.[00:27:02] So yeah, the multimodal stuff's great. Yeah. And I run, you know, a daily sort of news recap called AI News that is 99% generated by models, and I do a bake-off between all the frontier models every day. And it's every day. Does it switch? Do I do it manually? Yes, it does switch. And I manually do it.[00:27:18] And Flash wins most days. So, like, I think it's happening. I was thinking about tracking myself, like, number of opens of ChatGPT versus Gemini. And at some point it will cross. I think that Gemini will be my main, and I think that will slowly happen for a bunch of people.[00:27:37] And then that will shift. I think that's really interesting. For developers, this is a different question. Yeah. It's Google getting over itself, of having Google Cloud versus Vertex versus AI Studio, all these like five different brands, slowly consolidating it. It'll happen, just slowly, I guess.[00:27:53] Alessio: Yeah.[00:27:54] Yeah. I mean, another good example is you cannot use the thinking models in Cursor. Yeah. And I know [00:28:00] Logan Kilpatrick says they're working on it, but I think there's all these small things where, like, if I cannot easily use it, I'm really not gonna go out of my way to do it. But I do agree that when you do use them, their models are great.[00:28:12] So yeah, they just need better, better bridges.[00:28:15] swyx: You had one of the questions in the prep.[00:28:16] Debating Public Companies: Google vs. Apple[00:28:16] swyx: What public company are you long and short? And minus Google versus Apple, like, long, short. That was also my[00:28:23] Jacob: combo. I feel like, yeah, I mean, it does feel like Google's really cooking right now.[00:28:26] swyx: Yeah.
So okay, coming back to what has product market fit[00:28:29] Jacob: now,[00:28:29] swyx: now that we come[00:28:30] Jacob: back from my complete total sidetrack,[00:28:33] Customer Support and AI's Role[00:28:33] swyx: there's also customer support.[00:28:35] We were talking in the car about Decagon and Sierra; obviously Bret Taylor is the founder of Sierra. And yeah, it seems like there are just these layers of agents that will, I think you just look at the income statement or the org chart of any large scaled company and you start picking them off one by one.[00:28:51] What is interesting knowledge work? And they would just kind of eat things slowly from the outside in. Yeah, that makes sense.[00:28:57] Alessio: I mean, the episode with [00:29:00] Bret, he's so passionate about developer tools, and Yeah. he did not do a developer tools company. We spent like two hours talking about developer tools and all of that stuff.[00:29:10] And he did a customer support company. I'm like, man, that says something. You know what I mean? Yeah. It's like, when you have somebody like him, who can raise any amount of money from anybody to do anything. Yeah. To pick customer support as the market to go after, while also being the chairman of OpenAI, that shows you that these things have moats and have staying power, like they're gonna stick around, you know?[00:29:32] Otherwise he's smarter than that. So yeah, that's a space where maybe initially, you know, I would've said, I don't know if it's the most exciting thing to jump into, but then if you really look at the shape of how the workforce is structured, and how the cost centers of the business really end up, especially for more consumer-facing businesses, a lot of it goes into customer support.[00:29:54] AI's Impact on Business Growth[00:29:54] Alessio: All the AI story of the last two years has been cost cutting. Yeah. I think now we're gonna switch more towards growing revenue. [00:30:00] Totally. You know, you've seen Jensen: last year at GTC he was saying the more you buy, the more you save; this year it's the more you buy, the more you make. So we're hot off the[00:30:08] Jacob: press.[00:30:10] We were there. We were there. Yeah. I do think that's one of the most interesting things about this first wave of apps, where almost the easiest thing that you could get real traction with was stuff that, for lack of a better way to frame it, people had already been comfortable outsourcing to BPOs or something, and kind of implicitly said, hey, this is a cost center.[00:30:24] Like, we were willing to take some performance cut for cost in the past. You know, the irony of that, or what I'm really curious to see how it plays out, is you could imagine that is the area where price competition is going to be most fierce, because it's already stuff that people have said, hey, we don't need the hundred percent best version of that.[00:30:42] And I wonder whether this next wave of apps may prove actually even more defensible, as you get these capabilities that actually, you know, increase top line or whatnot. Where you take AI go-to-market, for example.
Like, you'd pay twice as much for something that brought in more, 'cause there's just a kind of very clean ROI story to it.[00:30:59] And so [00:31:00] I wonder ultimately whether this next set of apps actually ends up being more interesting than the first wave.[00:31:05] Alessio: Yeah,[00:31:05] Voice AI and Scheduling Solutions[00:31:05] Jordan: I think a lot of the voice AI ones are interesting too, because you don't need a hundred percent precision and recall to actually, you know, have a great product.[00:31:12] And so, for example, we looked into a bunch of, you know, scheduling intake companies, for example, like home services, right? For electricians and stuff like that. Today they miss 50% of their calls. So even if the AI is only effective, say, 75% of the time, yeah, it's crazy, right? So if it's effective 75% of the time, that's totally fine, because that's still a ton of increased revenue for the customer, right?[00:31:32] And so you don't need that hundred percent accuracy. Yeah. And as the models and the reliability of these agents get better, that's totally fine, because you're still getting a ton of value in the meantime.[00:31:41] swyx: Yeah. One, I don't know how related this is, but it is related: one of my favorite meetings at AI Engineer Summit, which, this was our first one in New York, and I met a different crew than you meet here.[00:31:55] Like, everyone here loves developer tools, loves infra; over there, they're actually more interested in [00:32:00] applications. It's kind of cool. I met this bootstrapped team that is only doing appointment scheduling for vets. They, yeah. And they're like, this is an anomaly, we don't usually come to engineering summits 'cause we usually go to vet summits and talk to the, they're, you know, they're literally, I'm sure it's a[00:32:16] Jordan: massive pain point.[00:32:17] They're willing to pay a lot of money.[00:32:20] Alessio: Yeah. But this is my point about saving versus making more. It's like, if an electrician takes 2x more calls, do they have the bandwidth to actually do 2x more in-house, and can they hire? Well, yeah, exactly. That's the thing: I don't think today most businesses are structured to just, like, overnight 2, 3x the bandwidth, you know?[00:32:38] I think that's like a startup thing. Like, most businesses, then you make an[00:32:42] swyx: electrician agent. Well, no, totally. That's, how do you do a recruiting agent for electricians, like[00:32:49] Alessio: electricians. Great. That's a good point. How do you do Lambda School for electricians? It's hilarious.[00:32:53] Jacob: Whack-a-mole for the bottlenecks in these businesses.[00:32:55] Like, oh, now we have a ton of demand. Like, cool. Where do we go?[00:32:58] swyx: Yeah.[00:32:59] Exploring AI Applications in Various Fields[00:32:59] swyx: So just to [00:33:00] round out this PMF thing, I think this is relevant in a certain sense of, like, it's pretty obvious that the killer agents are coding agents, support agents, deep research, right? Roughly, right. We've covered all those three already.[00:33:10] Then you have to sort of turn to offense and go, okay, what's next? And, like, what about, I[00:33:16] Jacob: mean, I also just like summarization of voice and conversation, right?
Yep. Absolutely. We actually had that on there. I[00:33:21] swyx: just, I didn't put it as an agent, because it seems less agentic, you know? But yes, still a good AI use case.[00:33:26] That one, I've seen, I would mention Granola, and what's the other one? Monterey. I think Abridge was one you wanted to mention. I was gonna say Abridge. Yeah, Abridge. Okay. So I'll just call out what I had on my slides. Yeah. For the agent engineering thing. So it was screen sharing, which I think is actually kind of underrated.[00:33:42] Like, an AI watching you as you do your work and just offering assistance. Outbound sales, so instead of support, just being more outbound. Hiring, you say[00:33:51] Jacob: outbound sales has product market fit?[00:33:53] swyx: No, it will; it's coming. Oh, on the comp. Yeah. I totally agree with that. Yeah. Hiring, like the recruiting side. Education, like the [00:34:00] sort of personalized teaching, I think.[00:34:02] I'm kind of shocked we haven't seen more there. Yeah. Yeah. I don't know if that's, like, Duolingo is the thing. Amigo.[00:34:08] Jacob: Yeah. I mean, Speak and some of these, like, you know,[00:34:10] swyx: Speak, practice, yeah. Interesting. And then finance, there's a ton of finance cases we can talk about. And then personal AI, which we also had a little bit of, but I think personal AI is harder to monetize. But I think those would be what I would say is up and coming; that's what I'm currently focusing on.[00:34:27] Jacob: I feel like this question's been asked a few different ways, but I'm curious what you guys think. Is it like, if we just froze model capabilities today, is there, you know, trillions of dollars of application value to be unlocked? Like AI education? Like, if we just stopped all model development today, with this current generation of models, we could probably build some pretty amazing education apps.[00:34:44] Or how much of all this is contingent upon just, like, okay, people have had two years with GPT-4 and, you know, I don't know, six months with the reasoning models; how much is contingent upon it just being more time with these things, versus the models actually have to get better?[00:34:58] I dunno, it's a hard question, so I'm gonna just throw it [00:35:00] to you.[00:35:00] Alessio: Yeah. Well, I think the societal thing is maybe harder, especially in education. You know, like, can you basically, like, DOGE the education system? Probably you should, but can you? I think it's more of a human[00:35:14] Jacob: thing. But people pay for all sorts of, like, get-ahead things outside of class, and, you know, certainly in other countries there's a ton of consumer spend on education.[00:35:21] It feels like the market opportunity is there.[00:35:23] swyx: Yeah. And private education, I think. Yeah, public, public is a very different, yeah. One of my most interesting quests from last year was kind of reforming Singapore's education system to be more sort of AI native. Just what you were doing on the side while you were, yes.[00:35:38] That's a great side quest. My stated goal is for Singapore to be the first country that has Python as a first language, as a national language.
Anyway, so, but the pushback I got from the Ministry of Education was that the teachers would be unprepared to do it.[00:35:53] So it was really interesting, the immediate pushback was the de facto teachers union being [00:36:00] resistant to change, and, like, okay, that's par for the course. Anyway, not to dwell too much on that, but yeah, I mean, I think education is one of those things that everyone has strong opinions on,[00:36:11] 'cause they all have kids, or all went through the education system. But I think it's gonna be the domain-specific plays. Like, Speak is such an amazing example of top down: we will go through the idea maze and we'll go to Korea and teach them English. It's like, what the hell? And I would love to see more examples of that.[00:36:29] Like, just really focused: no one tried to solve everything, just do your thing really, really well.[00:36:34] Defensibility in AI Applications[00:36:34] Jacob: On this trend of difficult questions that come up, I'm gonna just ask you the one that my partners like to ask me every single Monday, which is: how do you think about defensibility at the app layer?[00:36:41] Alessio: Oh[00:36:41] Jacob: yeah, that's great. Just gimme an answer I can copy paste, and just, like, you know, "have network effects." Auto, auto response.[00:36:47] swyx: Honestly, network effects. I think people don't prioritize those enough, because they're trying to make the single-player experience good, but then they neglect the [00:37:00] multiplayer experience.[00:37:00] I always think about, like, load-bearing episodes. You know, as podcasters, you do one a week, and some of those you don't really talk about ever again, and others you keep mentioning every single podcast. And one of the, this is obviously gonna be the last one, I think the recap episodes for us are pretty load-bearing.[00:37:15] Like, we refer to them every three months or so. And one of them, I think, for me is Chai, Chai Research, even though that wasn't like a super popular one among the broader community outside of the Chai community. For those who don't know, Chai Research is basically a Character AI competitor.[00:37:32] Right. They were bootstrapped, they were founded at the same time, and they have outlasted Character, de facto. Right. It's funny, I would love to ask Will a bit more about the whole Character thing, but good luck getting past the Google comms. But, so he doesn't have his own models, basically; he has his own network of people submitting models to be run.[00:37:54] And I think that is, short term, going to be hurting him, because he doesn't have [00:38:00] proprietary IP. But long term he has the network effect to make him robust to any changes in the future. And I think, like, I wanna see more of that, where he's basically looking at himself as kind of a marketplace, and he's identified the choke point, which will be the app or the sort of protocol layer that interfaces between the users and the model providers,[00:38:18] and then makes sure that the money kind of flows through, and that works. I wish that more AI builders or AI founders emphasized network effects, 'cause that's the only thing that you're gonna have at the end of the day. Yeah.
And, like, brand feeds into network effects.[00:38:34] Jacob: Yeah, I guess, you know, harder in the enterprise context.[00:38:36] Right. But I mean, it's funny, we do this exercise, and I feel like we talk a lot about, you know, obviously there's kind of the velocity and the breadth of product surface area you're able to build. There's just the ability to become a brand in a space. Like, I'm shocked at how, even in like six, nine months, an individual company can become synonymous with an entire category.[00:38:52] And then they're in every room for customers, and all the other startups are clawing their way to try and get in, like, one twentieth of those rooms.[00:38:59] Jordan: There's a [00:39:00] bunch of categories where we talk about it at IC and it's like, oh, pricing compression's gonna happen, not as defensible, and so ACVs are gonna go down over time.[00:39:08] In actuality, some of these, the ACVs have doubled, we've seen, and the reason for that is just, you know, people go to them and pay that premium of being that brand.[00:39:16] Jacob: Yeah. I mean, one thing I'm struck by is there was such a head fake in the early days of AI apps, where people were like, we want this amazing defensibility story, and then what's the easiest defensibility story?[00:39:24] It's like, oh, a totally unique data set, or train your own model, or something. And I feel like that was just a total head fake, where I don't think that's actually useful at all. You sound much less articulate when you're like, well, the defensibility here is the thousand small things that this company does to make the user experience, the design, everything just delightful, and the speed at which they move to both create a really broad product, but then also, every three, six months when a new model comes out, it's kind of an existential event for any company,[00:39:49] 'cause if you're not the first to figure out how to use it, someone else will. Yeah. And so velocity really matters there. And it's funny, in kinda our internal discussions, we've been like, man, that sounds pretty similar to how we thought about application SaaS [00:40:00] companies. That there isn't some revolutionary reason; you don't sound like a genius when you're explaining why application SaaS company A is so much better than B.[00:40:07] But it's like a lot of little things that compound over time.[00:40:10] Infrastructure and AI: Current Trends[00:40:10] Jacob: What about the infrastructure space, guys? I'm curious, you know, how do you guys think about where the interesting categories are here today, and, you know, where do you wanna see more startups, or where do you think there are too many?[00:40:21] Alessio: Yeah. Yeah, we call it kind of the LLM OS. But I would say[00:40:24] swyx: not we, I mean Andrej, Andrej calls it the LLM OS[00:40:27] Alessio: Well, but yeah, everyone else just copies whatever, too. And Andrej, the three of you call it the LLM OS. Well, we have, like, the Four Wars of AI framework Yeah. Yeah. that we use, and the LLM OS is one of them. But yeah, I mean, code execution is one.[00:40:39] We've been banging the drum, everybody now knows we're investors in E2B. Mm-hmm. Memory, you know, is one that we kind of touched on before. Super interesting. Search we talked about.
I think those are not traditional infra, not the bare-metal infra; it's more the infra around the tools for the agent model, you know,[00:40:57] which I think is where a lot of the value is gonna [00:41:00] be. The security[00:41:00] swyx: ones. Yeah.[00:41:01] Alessio: Yeah. And cybersecurity. I mean, there's so much to be done there. And it's basically any area where AI is being used by the offense, AI needs to be applied on the defense side: email security, you know, identity, all these different things.[00:41:16] So we've been doing a lot there, as well as, you know, how do you rethink things that used to be costly, like red teaming, which maybe used to be a checkbox in the past; today it can be actually helpful. Yeah. To make you secure your app. And there's this whole idea of semantics, right, that now the models can be good at.[00:41:32] You know, in the past everything was about syntax. It's kind of like very basic, you know, constraint rules. I think now you can start to infer semantics, going beyond simple recognition to understanding why certain things are happening a certain way. So in the security space, we're seeing that with binary inspection, for example.[00:41:51] Like, there's kind of the syntax, but then there are the semantics of understanding what the [00:42:00] program overall is really trying to do, even though the individual syntax is just showing you something specific. Not to get too technical, but yeah, I think infra overall is a super interesting place if you're making use of the model. If you're just serving the models, I'm less bullish,[00:42:13] not that it's not a great business, but I think it's a very capital-intensive business. Mm-hmm. Yeah. I think that infra is great, people will make money, but yeah, I don't think there's as much of an interest from us at[00:42:25] Jordan: least. Yeah. How do you guys think about what OpenAI and the big research labs will encompass as part of the developer and infra category?[00:42:31] Yeah.[00:42:31] Alessio: That's why I would say search is the first example, one of the things we used to mention; you know, we had Exa on the podcast, and Perplexity obviously has an API. The basic idea[00:42:44] swyx: is, if you go into the ChatGPT custom GPT builder, like, what are the checkboxes? Each of them is a startup.[00:42:50] Alessio: Yeah. And now they're also APIs. So now search is also an API; we will see what the adoption is. You know, in traditional infra, everybody wants to be [00:43:00] multi-cloud, so maybe we'll see the same, where ChatGPT search, or the OpenAI search API, is great with the OpenAI models because you get it all bundled in, but their price is very high.[00:43:11] If you compare it to, like, Exa, I think it's like five times the price for the same amount of research, which makes sense if you have a big OpenAI contract. But maybe if you're just picking best in breed, you wanna compare different ones. Yeah. Yeah, they don't have a code execution one.[00:43:26] I'm sure they'll release one soon. So they wanna own that too. But yeah, same question we were talking about before, right? Do they wanna be an API company or a product company? Do you make more money building ChatGPT search or selling a search API?[00:43:38] swyx: Yeah.
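The "bundled versus best in breed" point is essentially an interface question: if search is just another API, an application can hide the provider behind one function and route on price or quality. A rough sketch of that idea, using a made-up provider interface and illustrative prices rather than any vendor's real SDK or pricing:

```python
from typing import Protocol


class SearchProvider(Protocol):
    """Hypothetical provider interface; real vendors each have
    their own SDKs, auth, and response shapes."""

    name: str
    cost_per_1k_queries: float

    def search(self, query: str, k: int) -> list[str]: ...


class FakeProvider:
    """Stand-in implementation so the sketch runs without network calls."""

    def __init__(self, name: str, cost_per_1k_queries: float) -> None:
        self.name = name
        self.cost_per_1k_queries = cost_per_1k_queries

    def search(self, query: str, k: int) -> list[str]:
        return [f"{self.name} result {i} for {query!r}" for i in range(k)]


def cheapest(providers: list[FakeProvider]) -> FakeProvider:
    # "Multi-cloud" for search: route to whichever backend is cheapest
    # (or best on your quality evals) instead of locking into the bundled one.
    return min(providers, key=lambda p: p.cost_per_1k_queries)


providers = [
    FakeProvider("bundled-search", cost_per_1k_queries=25.0),   # illustrative numbers
    FakeProvider("best-in-breed", cost_per_1k_queries=5.0),
]
print(cheapest(providers).search("agent memory papers", k=3))
```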
The broader lesson, instead of going, we did applications just now[00:43:42] and then what do you think is interesting in infrastructure: it's not 50-50, it's not equal weighted. It's just very clearly the application layer has been way more interesting. Like, yes, there are interesting infrastructure plays, and I even want to push back on the whole GPU serving thing, because, like, Together [00:44:00] AI is doing well, Fireworks, I mean, that worked.[00:44:02] Alessio: It's like data[00:44:02] Jacob: centers[00:44:03] Alessio: and inference[00:44:03] Jacob: providers,[00:44:04] Alessio: the,[00:44:04] swyx: you know,[00:44:04] Alessio: I think it's not, like, the capital[00:44:06] swyx: Oh, I see.[00:44:07] Alessio: For, again, capital efficiency. Yeah. Much larger funds. So, I'm sure you have GPU clouds. Yeah.[00:44:13] swyx: Yeah. So that is one thing I have been learning, in that, you know, I think I have historically had a dev tools and infra bias, and so has he, and we've had to learn that applications actually are very interesting, and also maybe kind of the killer application of models, in the sense that you can charge for utility and not for cost,[00:44:33] right? Whereas most infrastructure reduces to cost plus. Yeah. Right. And that's not where you wanna be for AI. So that's interesting for me. I thought it would be interesting for me to be the only non-VC in the room saying what is not investible, 'cause then, you know, I won't be canceled for saying, like, your whole category is, we have a great thing where, like, this thing's[00:44:54] Jacob: not investible, and then like three months later we're desperately chasing it.[00:44:56] Exactly. Exactly. So you don't wanna be on the record; the space changes so [00:45:00] fast. It's like, every opinion you hold, you have to hold it quite loosely. Yeah.[00:45:02] swyx: I'm happy to be wrong in public, you know, I think that's how you learn the most, right? Yeah. So, like, fine-tuning companies is something I struggled with, and still, like, I don't see how this becomes a big thing.[00:45:12] Like, you kind of have to wrap it up in a broader enterprise AI company, like a services company, like a Writer AI, where they will find you and it's part of the overall offering. Mm-hmm. But that's not where you spike. Yeah, it's kind of interesting. And then I'll just mention AI DevOps, and, like, there's a lot of AI SRE out there, it seems like.[00:45:32] There's a lot of data out there that should be able to be plugged into your code base, or your app, so it self-heals or whatever. It's just, I don't know if that's been a thing yet. And you guys can correct me if I'm wrong. And then the last thing I'll mention is voice real-time infra. Again, very interesting, very, very hot.[00:45:49] But again, how big is it? Those are the main three that I'm thinking about, for things I'm struggling with.[00:45:54] Jordan: Yeah. I guess a couple comments on the AI SRE side. I actually disagree with that one. Yeah. I think that the [00:46:00] reason they haven't sort of taken off yet is because the tech is just not there quite yet.[00:46:04] And so it goes back to the earlier question: do we think about investing towards where the companies will be when the models improve, versus now?
I think that's going to be, in the short term we'll get there, but it's just not there just yet. But I think it's an interesting opportunity overall.[00:46:18] swyx: Yeah. My pushback to you is, well, it's monitoring a lot of logs, right?[00:46:22] Yeah. And it's basically anomaly detection, rather than, like, there's a whole bunch of stuff that can happen after you detect the anomaly, but it's really just anomaly detection. And we've always had that, you know. This is not a Transformers LLM use case; this is just regular anomaly detection.[00:46:38] Jordan: It's more in terms of, like, it's not going to be an autonomous SRE for a while. Yeah. And so the question is how much can the latest sort of AI advancements increase the efficacy of bringing your MTTR

Unsupervised Learning
Ep 60: Swyx and Alessio (Latent Space) on What has PMF Today, Google is Cooking & GPT Wrappers are Winning

Unsupervised Learning

Play Episode Listen Later Mar 28, 2025 61:30


To unpack some of the most topical questions in AI, I'm joined by two fellow AI podcasters: Swyx and Alessio Fanelli, co-hosts of the Latent Space podcast. We've been wanting to do a cross-over episode for a while and finally made it happen. Swyx brings deep experience from his time at AWS, Temporal, and Airbyte, and is now focused on AI agents and dev tools. Alessio is an investor at Decibel, where he's been backing early technical teams pushing the boundaries of infrastructure and applied AI. Together they run Latent Space, a technical newsletter and podcast by and for AI engineers. To subscribe or learn more about Latent Space, click here: https://www.latent.space/ [0:00] Intro [1:08] Reflecting on AI Surprises of the Past Year [2:24] Open Source Models and Their Adoption [6:48] The Rise of GPT Wrappers [7:49] Challenges in AI Model Training [10:33] Over-hyped and Under-hyped AI Trends [24:00] The Future of AI Product Market Fit [30:27] Google's Momentum and Customer Support Insights [33:16] Emerging AI Applications and Market Trends [35:13] Challenges and Opportunities in AI Development [39:02] Defensibility in AI Applications [42:42] Infrastructure and Security in AI [50:04] Future of AI and Unanswered Questions [55:34] Quickfire. With your co-hosts: @jacobeffron - Partner at Redpoint, Former PM Flatiron Health @patrickachase - Partner at Redpoint, Former ML Engineer LinkedIn @ericabrescia - Former COO Github, Founder Bitnami (acq'd by VMWare) @jordan_segall - Partner at Redpoint

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

If you're in SF: Join us for the Claude Plays Pokemon hackathon this Sunday! If you're not: Fill out the 2025 State of AI Eng survey for $250 in Amazon cards! We are SO excited to share our conversation with Dharmesh Shah, co-founder of HubSpot and creator of Agent.ai. A particularly compelling concept we discussed is the idea of "hybrid teams" - the next evolution in workplace organization where human workers collaborate with AI agents as team members. Just as we previously saw hybrid teams emerge in terms of full-time vs. contract workers, or in-office vs. remote workers, Dharmesh predicts that the next frontier will be teams composed of both human and AI members. This raises interesting questions about team dynamics, trust, and how to effectively delegate tasks between human and AI team members. The discussion of business models in AI reveals an important distinction between Work as a Service (WaaS) and Results as a Service (RaaS), something Dharmesh has written extensively about. While RaaS has gained popularity, particularly in customer support applications where outcomes are easily measurable, Dharmesh argues that this model may be over-indexed. Not all AI applications have clearly definable outcomes or consistent economic value per transaction, making WaaS more appropriate in many cases. This insight is particularly relevant for businesses considering how to monetize AI capabilities. The technical challenges of implementing effective agent systems are also explored, particularly around memory and authentication. Shah emphasizes the importance of cross-agent memory sharing and the need for more granular control over data access. He envisions a future where users can selectively share parts of their data with different agents, similar to how OAuth works but with much finer control.
This points to significant opportunities in developing infrastructure for secure and efficient agent-to-agent communication and data sharing. Other highlights from our conversation: * The Evolution of AI-Powered Agents – Exploring how AI agents have evolved from simple chatbots to sophisticated multi-agent systems, and the role of MCPs in enabling that. * Hybrid Digital Teams and the Future of Work – How AI agents are becoming teammates rather than just tools, and what this means for business operations and knowledge work. * Memory in AI Agents – The importance of persistent memory in AI systems and how shared memory across agents could enhance collaboration and efficiency. * Business Models for AI Agents – Exploring the shift from software as a service (SaaS) to work as a service (WaaS) and results as a service (RaaS), and what this means for monetization. * The Role of Standards Like MCP – Why MCP has been widely adopted and how it enables agent collaboration, tool use, and discovery. * The Future of AI Code Generation and Software Engineering – How AI-assisted coding is changing the role of software engineers and what skills will matter most in the future. * Domain Investing and Efficient Markets – Dharmesh's approach to domain investing and how inefficiencies in digital asset markets create business opportunities. * The Philosophy of Saying No – Lessons from "Sorry, You Must Pass" and how prioritization leads to greater productivity and focus. Timestamps: * 00:00 Introduction and Guest Welcome * 02:29 Dharmesh Shah's Journey into AI * 05:22 Defining AI Agents * 06:45 The Evolution and Future of AI Agents * 13:53 Graph Theory and Knowledge Representation * 20:02 Engineering Practices and Overengineering * 25:57 The Role of Junior Engineers in the AI Era * 28:20 Multi-Agent Systems and MCP Standards * 35:55 LinkedIn's Legal Battles and Data Scraping * 37:32 The Future of AI and Hybrid Teams * 39:19 Building Agent AI: A Professional Network for Agents * 40:43 Challenges and Innovations in Agent AI * 45:02 The Evolution of UI in AI Systems * 01:00:25 Business Models: Work as a Service vs. Results as a Service * 01:09:17 The Future Value of Engineers * 01:09:51 Exploring the Role of Agents * 01:10:28 The Importance of Memory in AI * 01:11:02 Challenges and Opportunities in AI Memory * 01:12:41 Selective Memory and Privacy Concerns * 01:13:27 The Evolution of AI Tools and Platforms * 01:18:23 Domain Names and AI Projects * 01:32:08 Balancing Work and Personal Life * 01:35:52 Final Thoughts and Reflections. Transcript: Alessio [00:00:04]: Hey everyone, welcome back to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. swyx [00:00:12]: Hello, and today we're super excited to have Dharmesh Shah join us. I guess your relevant title here is founder of Agent AI. Dharmesh [00:00:20]: Yeah, that's true for this. Yeah, creator of Agent.ai and co-founder of HubSpot. swyx [00:00:25]: Co-founder of HubSpot, which I have followed for many years, I think 18 years now, gonna be 19 soon. And you caught, you know, people can catch up on your HubSpot story elsewhere. I should also thank Shaan Puri, who I've chatted with back and forth, who's been, I guess, getting me in touch with your people. But also, I think, just giving us a lot of context, because obviously My First Million joined you guys, and they've been chatting with you guys a lot. So for the business side, we can talk about that, but I kind of wanted to engage your CTO, agent, engineer side of things.
So how did you get agent religion?Dharmesh [00:01:00]: Let's see. So I've been working, I'll take like a half step back, a decade or so ago, even though actually more than that. So even before HubSpot, the company I was contemplating that I had named for was called Ingenisoft. And the idea behind Ingenisoft was a natural language interface to business software. Now realize this is 20 years ago, so that was a hard thing to do. But the actual use case that I had in mind was, you know, we had data sitting in business systems like a CRM or something like that. And my kind of what I thought clever at the time. Oh, what if we used email as the kind of interface to get to business software? And the motivation for using email is that it automatically works when you're offline. So imagine I'm getting on a plane or I'm on a plane. There was no internet on planes back then. It's like, oh, I'm going through business cards from an event I went to. I can just type things into an email just to have them all in the backlog. When it reconnects, it sends those emails to a processor that basically kind of parses effectively the commands and updates the software, sends you the file, whatever it is. And there was a handful of commands. I was a little bit ahead of the times in terms of what was actually possible. And I reattempted this natural language thing with a product called ChatSpot that I did back 20...swyx [00:02:12]: Yeah, this is your first post-ChatGPT project.Dharmesh [00:02:14]: I saw it come out. Yeah. And so I've always been kind of fascinated by this natural language interface to software. Because, you know, as software developers, myself included, we've always said, oh, we build intuitive, easy-to-use applications. And it's not intuitive at all, right? Because what we're doing is... We're taking the mental model that's in our head of what we're trying to accomplish with said piece of software and translating that into a series of touches and swipes and clicks and things like that. And there's nothing natural or intuitive about it. And so natural language interfaces, for the first time, you know, whatever the thought is you have in your head and expressed in whatever language that you normally use to talk to yourself in your head, you can just sort of emit that and have software do something. And I thought that was kind of a breakthrough, which it has been. And it's gone. So that's where I first started getting into the journey. I started because now it actually works, right? So once we got ChatGPT and you can take, even with a few-shot example, convert something into structured, even back in the ChatGP 3.5 days, it did a decent job in a few-shot example, convert something to structured text if you knew what kinds of intents you were going to have. And so that happened. And that ultimately became a HubSpot project. But then agents intrigued me because I'm like, okay, well, that's the next step here. So chat's great. Love Chat UX. But if we want to do something even more meaningful, it felt like the next kind of advancement is not this kind of, I'm chatting with some software in a kind of a synchronous back and forth model, is that software is going to do things for me in kind of a multi-step way to try and accomplish some goals. So, yeah, that's when I first got started. It's like, okay, what would that look like? Yeah. And I've been obsessed ever since, by the way.Alessio [00:03:55]: Which goes back to your first experience with it, which is like you're offline. Yeah. And you want to do a task. 
You don't need to do it right now. You just want to queue it up for somebody to do it for you. Yes. As you think about agents, like, let's start at the easy question, which is like, how do you define an agent? Maybe. You mean the hardest question in the universe? Is that what you mean?Dharmesh [00:04:12]: You said you have an irritating take. I do have an irritating take. I think, well, some number of people have been irritated, including within my own team. So I have a very broad definition for agents, which is it's AI-powered software that accomplishes a goal. Period. That's it. And what irritates people about it is like, well, that's so broad as to be completely non-useful. And I understand that. I understand the criticism. But in my mind, if you kind of fast forward months, I guess, in AI years, the implementation of it, and we're already starting to see this, and we'll talk about this, different kinds of agents, right? So I think in addition to having a usable definition, and I like yours, by the way, and we should talk more about that, that you just came out with, the classification of agents actually is also useful, which is, is it autonomous or non-autonomous? Does it have a deterministic workflow? Does it have a non-deterministic workflow? Is it working synchronously? Is it working asynchronously? Then you have the different kind of interaction modes. Is it a chat agent, kind of like a customer support agent would be? You're having this kind of back and forth. Is it a workflow agent that just does a discrete number of steps? So there's all these different flavors of agents. So if I were to draw it in a Venn diagram, I would draw a big circle that says, this is agents, and then I have a bunch of circles, some overlapping, because they're not mutually exclusive. And so I think that's what's interesting, and we're seeing development along a bunch of different paths, right? So if you look at the first implementation of agent frameworks, you look at Baby AGI and AutoGBT, I think it was, not Autogen, that's the Microsoft one. They were way ahead of their time because they assumed this level of reasoning and execution and planning capability that just did not exist, right? So it was an interesting thought experiment, which is what it was. Even the guy that, I'm an investor in Yohei's fund that did Baby AGI. It wasn't ready, but it was a sign of what was to come. And so the question then is, when is it ready? And so lots of people talk about the state of the art when it comes to agents. I'm a pragmatist, so I think of the state of the practical. It's like, okay, well, what can I actually build that has commercial value or solves actually some discrete problem with some baseline of repeatability or verifiability?swyx [00:06:22]: There was a lot, and very, very interesting. I'm not irritated by it at all. Okay. As you know, I take a... There's a lot of anthropological view or linguistics view. And in linguistics, you don't want to be prescriptive. You want to be descriptive. Yeah. So you're a goals guy. That's the key word in your thing. And other people have other definitions that might involve like delegated trust or non-deterministic work, LLM in the loop, all that stuff. The other thing I was thinking about, just the comment on Baby AGI, LGBT. Yeah. In that piece that you just read, I was able to go through our backlog and just kind of track the winter of agents and then the summer now. Yeah. And it's... We can tell the whole story as an oral history, just following that thread. 
And it's really just like, I think, I tried to explain the why now, right? Like I had, there's better models, of course. There's better tool use with like, they're just more reliable. Yep. Better tools with MCP and all that stuff. And I'm sure you have opinions on that too. Business model shift, which you like a lot. I just heard you talk about RaaS with the MFM guys. Yep. Cost is dropping a lot. Yep. Inference is getting faster. There's more model diversity. Yep. Yep. I think it's a subtle point. It means that like, you have different models with different perspectives. You don't get stuck in the basin of performance of a single model. Sure. You can just get out of it by just switching models. Yep. Multi-agent research and RL fine tuning. So I just wanted to let you respond to like any of that.Dharmesh [00:07:44]: Yeah. A couple of things. Connecting the dots on the kind of the definition side of it. So we'll get the irritation out of the way completely. I have one more, even more irritating leap on the agent definition thing. So here's the way I think about it. By the way, the kind of word agent, I looked it up, like the English dictionary definition. The old school agent, yeah, is when you have someone or something that does something on your behalf, like a travel agent or a real estate agent acts on your behalf. It's like proxy, which is a nice kind of general definition. So the other direction I'm sort of headed, and it's going to tie back to tool calling and MCP and things like that, is if you, and I'm not a biologist by any stretch of the imagination, but we have these single-celled organisms, right? Like the simplest possible form of what one would call life. But it's still life. It just happens to be single-celled. And then you can combine cells and then cells become specialized over time. And you have much more sophisticated organisms, you know, kind of further down the spectrum. In my mind, at the most fundamental level, you can almost think of having atomic agents. What is the simplest possible thing that's an agent that can still be called an agent? What is the equivalent of a kind of single-celled organism? And the reason I think that's useful is right now we're headed down the road, which I think is very exciting around tool use, right? That says, okay, the LLMs now can be provided a set of tools that it calls to accomplish whatever it needs to accomplish in the kind of furtherance of whatever goal it's trying to get done. And I'm not overly bothered by it, but if you think about it, if you just squint a little bit and say, well, what if everything was an agent? And what if tools were actually just atomic agents? Because then it's turtles all the way down, right? Then it's like, oh, well, all that's really happening with tool use is that we have a network of agents that know about each other through something like MCP and can kind of decompose a particular problem and say, oh, I'm going to delegate this to this set of agents. And why do we need to draw this distinction between tools, which are functions most of the time? And an actual agent. And so I'm going to write this irritating LinkedIn post, you know, proposing this. It's like, okay. And I'm not suggesting we should call even functions, you know, call them agents. But there is a certain amount of elegance that happens when you say, oh, we can just reduce it down to one primitive, which is an agent that you can combine in complicated ways to kind of raise the level of abstraction and accomplish higher order goals.
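An illustrative aside on the "one primitive" framing above: here is a minimal sketch of tools as atomic agents that compose into higher-order agents. The interface, names, and toy tools are our own assumptions for illustration, not agent.ai's or anyone's production design.

```python
# Minimal sketch: a single goal-oriented primitive, where a "tool" is just the
# smallest possible agent and bigger agents are compositions of smaller ones.
from typing import Callable, Protocol


class Agent(Protocol):
    name: str
    def run(self, goal: str) -> str: ...


class ToolAgent:
    """An 'atomic' agent: wraps one deterministic function (what we'd usually call a tool)."""
    def __init__(self, name: str, fn: Callable[[str], str]):
        self.name, self.fn = name, fn

    def run(self, goal: str) -> str:
        return self.fn(goal)


class DelegatingAgent:
    """A composite agent: decomposes a goal by handing it to its sub-agents (naively, all of them)."""
    def __init__(self, name: str, subagents: list[Agent]):
        self.name, self.subagents = name, subagents

    def run(self, goal: str) -> str:
        return "\n".join(f"[{a.name}] {a.run(goal)}" for a in self.subagents)


word_count = ToolAgent("word-count", lambda text: f"{len(text.split())} words")
shout = ToolAgent("shout", lambda text: text.upper())
editor = DelegatingAgent("editor", [word_count, shout])
print(editor.run("tools all the way down"))
```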
Anyway, that's my answer. I'd say that's a success. Thank you for coming to my TED Talk on agent definitions.Alessio [00:09:54]: How do you define the minimum viable agent? Do you already have a definition for, like, where you draw the line between a cell and an atom? Yeah.Dharmesh [00:10:02]: So in my mind, it has to, at some level, use AI in order for it to—otherwise, it's just software. It's like, you know, we don't need another word for that. And so that's probably where I draw the line. So then the question, you know, the counterargument would be, well, if that's true, then lots of tools themselves are actually not agents because they're just doing a database call or a REST API call or whatever it is they're doing. And that does not necessarily qualify them, which is a fair counterargument. And I accept that. It's like a good argument. I still like to think about—because we'll talk about multi-agent systems, because I think—so we've accepted, which I think is true, lots of people have said it, and you've hopefully combined some of those clips of really smart people saying this is the year of agents, and I completely agree, it is the year of agents. But then shortly after that, it's going to be the year of multi-agent systems or multi-agent networks. I think that's where it's going to be headed next year. Yeah.swyx [00:10:54]: OpenAI's already on that. Yeah. My quick philosophical engagement with you on this. I often think about kind of the other spectrum, the other end of the cell spectrum. So single cell is life, multi-cell is life, and you clump a bunch of cells together in a more complex organism, they become organs, like an eye and a liver or whatever. And then obviously we consider ourselves one life form. There's not like a lot of lives within me. I'm just one life. And now, obviously, I don't think people really like to anthropomorphize agents and AI. Yeah. But we are extending our consciousness and our brain and our functionality out into machines. I just saw you were wearing a Bee. Yeah. Which is, you know, it's nice. I have a Limitless pendant in my pocket.Dharmesh [00:11:37]: I got one of these boys. Yeah.swyx [00:11:39]: I'm testing it all out. You know, got to be early adopters. But like, we want to extend our personal memory into these things so that we can be good at the things that we're good at. And, you know, machines are good at it. Machines are there. So like, my definition of life is kind of like going outside of my own body now. I don't know if you've ever had like reflections on that. Like how yours. How our self is like actually being distributed outside of you. Yeah.Dharmesh [00:12:01]: I don't fancy myself a philosopher. But you went there. So yeah, I did go there. I'm fascinated by kind of graphs and graph theory and networks and have been for a long, long time. And to me, we're sort of all nodes in this kind of larger thing. It just so happens that we're looking at individual kind of life forms as they exist right now. But so the idea is when you put a podcast out there, there's these little kind of nodes you're putting out there of like, you know, conceptual ideas. Once again, you have varying kind of forms of those little nodes that are up there and are connected in varying and sundry ways. And so I just think of myself as being a node in a massive, massive network. And I'm producing more nodes as I put content or ideas.
And, you know, you spend some portion of your life collecting dots, experiences, people, and some portion of your life then connecting dots from the ones that you've collected over time. And I found that really interesting things happen and you really can't know in advance how those dots are necessarily going to connect in the future. And that's, yeah. So that's my philosophical take. That's the, yes, exactly. Coming back.Alessio [00:13:04]: Yep. Do you like graph as an agent? Abstraction? That's been one of the hot topics with LandGraph and Pydantic and all that.Dharmesh [00:13:11]: I do. The thing I'm more interested in terms of use of graphs, and there's lots of work happening on that now, is graph data stores as an alternative in terms of knowledge stores and knowledge graphs. Yeah. Because, you know, so I've been in software now 30 plus years, right? So it's not 10,000 hours. It's like 100,000 hours that I've spent doing this stuff. And so I've grew up with, so back in the day, you know, I started on mainframes. There was a product called IMS from IBM, which is basically an index database, what we'd call like a key value store today. Then we've had relational databases, right? We have tables and columns and foreign key relationships. We all know that. We have document databases like MongoDB, which is sort of a nested structure keyed by a specific index. We have vector stores, vector embedding database. And graphs are interesting for a couple of reasons. One is, so it's not classically structured in a relational way. When you say structured database, to most people, they're thinking tables and columns and in relational database and set theory and all that. Graphs still have structure, but it's not the tables and columns structure. And you could wonder, and people have made this case, that they are a better representation of knowledge for LLMs and for AI generally than other things. So that's kind of thing number one conceptually, and that might be true, I think is possibly true. And the other thing that I really like about that in the context of, you know, I've been in the context of data stores for RAG is, you know, RAG, you say, oh, I have a million documents, I'm going to build the vector embeddings, I'm going to come back with the top X based on the semantic match, and that's fine. All that's very, very useful. But the reality is something gets lost in the chunking process and the, okay, well, those tend, you know, like, you don't really get the whole picture, so to speak, and maybe not even the right set of dimensions on the kind of broader picture. And it makes intuitive sense to me that if we did capture it properly in a graph form, that maybe that feeding into a RAG pipeline will actually yield better results for some use cases, I don't know, but yeah.Alessio [00:15:03]: And do you feel like at the core of it, there's this difference between imperative and declarative programs? Because if you think about HubSpot, it's like, you know, people and graph kind of goes hand in hand, you know, but I think maybe the software before was more like primary foreign key based relationship, versus now the models can traverse through the graph more easily.Dharmesh [00:15:22]: Yes. So I like that representation. There's something. It's just conceptually elegant about graphs and just from the representation of it, they're much more discoverable, you can kind of see it, there's observability to it, versus kind of embeddings, which you can't really do much with as a human. 
You know, once they're in there, you can't pull stuff back out. But yeah, I like that kind of idea of it. And the other thing that's kind of, because I love graphs, I've been long obsessed with PageRank from back in the early days. And, you know, one of the kind of simplest algorithms in terms of coming up, you know, with a phone, everyone's been exposed to PageRank. And the idea is that, and so I had this other idea for a project, not a company, and I have hundreds of these, called NodeRank, is to be able to take the idea of PageRank and apply it to an arbitrary graph that says, okay, I'm going to define what authority looks like and say, okay, well, that's interesting to me, because then if you say, I'm going to take my knowledge store, and maybe this person that contributed some number of chunks to the graph data store has more authority on this particular use case or prompt that's being submitted than this other one that may, or maybe this one was more. popular, or maybe this one has, whatever it is, there should be a way for us to kind of rank nodes in a graph and sort them in some, some useful way. Yeah.swyx [00:16:34]: So I think that's generally useful for, for anything. I think the, the problem, like, so even though at my conferences, GraphRag is super popular and people are getting knowledge, graph religion, and I will say like, it's getting space, getting traction in two areas, conversation memory, and then also just rag in general, like the, the, the document data. Yeah. It's like a source. Most ML practitioners would say that knowledge graph is kind of like a dirty word. The graph database, people get graph religion, everything's a graph, and then they, they go really hard into it and then they get a, they get a graph that is too complex to navigate. Yes. And so like the, the, the simple way to put it is like you at running HubSpot, you know, the power of graphs, the way that Google has pitched them for many years, but I don't suspect that HubSpot itself uses a knowledge graph. No. Yeah.Dharmesh [00:17:26]: So when is it over engineering? Basically? It's a great question. I don't know. So the question now, like in AI land, right, is the, do we necessarily need to understand? So right now, LLMs for, for the most part are somewhat black boxes, right? We sort of understand how the, you know, the algorithm itself works, but we really don't know what's going on in there and, and how things come out. So if a graph data store is able to produce the outcomes we want, it's like, here's a set of queries I want to be able to submit and then it comes out with useful content. Maybe the underlying data store is as opaque as a vector embeddings or something like that, but maybe it's fine. Maybe we don't necessarily need to understand it to get utility out of it. And so maybe if it's messy, that's okay. Um, that's, it's just another form of lossy compression. Uh, it's just lossy in a way that we just don't completely understand in terms of, because it's going to grow organically. Uh, and it's not structured. It's like, ah, we're just gonna throw a bunch of stuff in there. Let the, the equivalent of the embedding algorithm, whatever they called in graph land. Um, so the one with the best results wins. I think so. Yeah.swyx [00:18:26]: Or is this the practical side of me is like, yeah, it's, if it's useful, we don't necessarilyDharmesh [00:18:30]: need to understand it.swyx [00:18:30]: I have, I mean, I'm happy to push back as long as you want. 
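A rough sketch of the NodeRank idea mentioned above: PageRank run over an arbitrary graph, biased by a user-defined authority score per node (essentially personalized PageRank). The graph, the authority numbers, and the function name below are hypothetical, purely for illustration of the concept, not any actual implementation.

```python
# NodeRank-style sketch: rank nodes in an arbitrary knowledge graph, where a custom
# "authority" prior (e.g. contributor trust or popularity) biases the teleport vector.
def node_rank(edges: dict[str, list[str]], authority: dict[str, float],
              damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    nodes = set(edges) | {v for vs in edges.values() for v in vs} | set(authority)
    total = sum(authority.get(n, 0.0) for n in nodes) or 1.0
    prior = {n: authority.get(n, 0.0) / total for n in nodes}  # authority as the teleport vector
    rank = dict(prior)
    for _ in range(iters):
        nxt = {n: (1 - damping) * prior[n] for n in nodes}
        for src in nodes:
            targets = edges.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:  # dangling node: redistribute its mass according to the prior
                for n in nodes:
                    nxt[n] += damping * rank[src] * prior[n]
        rank = nxt
    return rank


edges = {"doc_a": ["doc_b"], "doc_b": ["doc_c"], "doc_c": ["doc_a"], "note_1": ["doc_a"]}
authority = {"doc_a": 3.0, "doc_b": 1.0, "doc_c": 1.0, "note_1": 0.5}  # e.g. contributor trust
print(node_rank(edges, authority))
```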
Uh, it's not practical to evaluate like the 10 different options out there because it takes time. It takes people, it takes, you know, resources, right? Set. That's the first thing. Second thing is your evals are typically on small things and some things only work at scale. Yup. Like graphs. Yup.Dharmesh [00:18:46]: Yup. That's, yeah, no, that's fair. And I think this is one of the challenges in terms of implementation of graph databases is that the most common approach that I've seen developers do, I've done it myself, is that, oh, I've got a Postgres database or a MySQL or whatever. I can represent a graph with a very set of tables with a parent child thing or whatever. And that sort of gives me the ability, uh, why would I need anything more than that? And the answer is, well, if you don't need anything more than that, you don't need anything more than that. But there's a high chance that you're sort of missing out on the actual value that, uh, the graph representation gives you. Which is the ability to traverse the graph, uh, efficiently in ways that kind of going through the, uh, traversal in a relational database form, even though structurally you have the data, practically you're not gonna be able to pull it out in, in useful ways. Uh, so you wouldn't like represent a social graph, uh, in, in using that kind of relational table model. It just wouldn't scale. It wouldn't work.swyx [00:19:36]: Uh, yeah. Uh, I think we want to move on to MCP. Yeah. But I just want to, like, just engineering advice. Yeah. Uh, obviously you've, you've, you've run, uh, you've, you've had to do a lot of projects and run a lot of teams. Do you have a general rule for over-engineering or, you know, engineering ahead of time? You know, like, because people, we know premature engineering is the root of all evil. Yep. But also sometimes you just have to. Yep. When do you do it? Yes.Dharmesh [00:19:59]: It's a great question. This is, uh, a question as old as time almost, which is what's the right and wrong levels of abstraction. That's effectively what, uh, we're answering when we're trying to do engineering. I tend to be a pragmatist, right? So here's the thing. Um, lots of times doing something the right way. Yeah. It's like a marginal increased cost in those cases. Just do it the right way. And this is what makes a, uh, a great engineer or a good engineer better than, uh, a not so great one. It's like, okay, all things being equal. If it's going to take you, you know, roughly close to constant time anyway, might as well do it the right way. Like, so do things well, then the question is, okay, well, am I building a framework as the reusable library? To what degree, uh, what am I anticipating in terms of what's going to need to change in this thing? Uh, you know, along what dimension? And then I think like a business person in some ways, like what's the return on calories, right? So, uh, and you look at, um, energy, the expected value of it's like, okay, here are the five possible things that could happen, uh, try to assign probabilities like, okay, well, if there's a 50% chance that we're going to go down this particular path at some day, like, or one of these five things is going to happen and it costs you 10% more to engineer for that. It's basically, it's something that yields a kind of interest compounding value. Um, as you get closer to the time of, of needing that versus having to take on debt, which is when you under engineer it, you're taking on debt. 
You're going to have to pay off when you do get to that eventuality where something happens. One thing as a pragmatist, uh, so I would rather under engineer something than over engineer it. If I were going to err on the side of something, and here's the reason is that when you under engineer it, uh, yes, you take on tech debt, uh, but the interest rate is relatively known and payoff is very, very possible, right? Which is, oh, I took a shortcut here as a result of which now this thing that should have taken me a week is now going to take me four weeks. Fine. But if that particular thing that you thought might happen, never actually, you never have that use case transpire or just doesn't, it's like, well, you just save yourself time, right? And that has value because you were able to do other things instead of, uh, kind of slightly over-engineering it away, over-engineering it. But there's no perfect answers in art form in terms of, uh, and yeah, we'll, we'll bring kind of this layers of abstraction back on the code generation conversation, which we'll, uh, I think I have later on, butAlessio [00:22:05]: I was going to ask, we can just jump ahead quickly. Yeah. Like, as you think about vibe coding and all that, how does the. Yeah. Percentage of potential usefulness change when I feel like we over-engineering a lot of times it's like the investment in syntax, it's less about the investment in like arc exacting. Yep. Yeah. How does that change your calculus?Dharmesh [00:22:22]: A couple of things, right? One is, um, so, you know, going back to that kind of ROI or a return on calories, kind of calculus or heuristic you think through, it's like, okay, well, what is it going to cost me to put this layer of abstraction above the code that I'm writing now, uh, in anticipating kind of future needs. If the cost of fixing, uh, or doing under engineering right now. Uh, we'll trend towards zero that says, okay, well, I don't have to get it right right now because even if I get it wrong, I'll run the thing for six hours instead of 60 minutes or whatever. It doesn't really matter, right? Like, because that's going to trend towards zero to be able, the ability to refactor a code. Um, and because we're going to not that long from now, we're going to have, you know, large code bases be able to exist, uh, you know, as, as context, uh, for a code generation or a code refactoring, uh, model. So I think it's going to make it, uh, make the case for under engineering, uh, even stronger. Which is why I take on that cost. You just pay the interest when you get there, it's not, um, just go on with your life vibe coded and, uh, come back when you need to. Yeah.Alessio [00:23:18]: Sometimes I feel like there's no decision-making in some things like, uh, today I built a autosave for like our internal notes platform and I literally just ask them cursor. Can you add autosave? Yeah. I don't know if it's over under engineer. Yep. I just vibe coded it. Yep. And I feel like at some point we're going to get to the point where the models kindDharmesh [00:23:36]: of decide where the right line is, but this is where the, like the, in my mind, the danger is, right? So there's two sides to this. One is the cost of kind of development and coding and things like that stuff that, you know, we talk about. 
But then like in your example, you know, one of the risks that we have is that because adding a feature, uh, like a save or whatever the feature might be to a product as that price tends towards zero, are we going to be less discriminant about what features we add as a result of making products more complicated, which has a negative impact on the user and a negative impact on the business. Um, and so that's the thing I worry about if it starts to become too easy, are we going to be too promiscuous in our, uh, kind of extension, adding product extensions and things like that. It's like, ah, why not add X, Y, Z or whatever back then it was like, oh, we only have so many engineering hours or story points or however you measure things. Uh, that at least kept us in check a little bit. Yeah.Alessio [00:24:22]: And then over engineering, you're like, yeah, it's kind of like you're putting that on yourself. Yeah. Like now it's like the models don't understand that if they add too much complexity, it's going to come back to bite them later. Yep. So they just do whatever they want to do. Yeah. And I'm curious where in the workflow that's going to be, where it's like, Hey, this is like the amount of complexity and over-engineering you can do before you got to ask me if we should actually do it versus like do something else.Dharmesh [00:24:45]: So you know, we've already, like, we're living this, uh, in the code generation world, this kind of compressed, um, cycle time. Right. It's like, okay, we went from auto-complete, uh, in GitHub Copilot to like, oh, finish this particular thing and hit tab to a, oh, I sort of know your file or whatever. I can write out a full function to you to now I can like hold a bunch of the context in my head. Uh, so we can do app generation, which we have now with Lovable and Bolt and Replit Agent. Yeah. And other things. So then the question is, okay, well, where does it naturally go from here? So we're going to generate products. Make sense. We might be able to generate platforms as though I want a platform for ERP that does this, whatever. And that includes the APIs, includes the product and the UI, and all the things that make for a platform. There's nothing that says we would stop like, okay, can you generate an entire software company someday? Right. Uh, with the platform and the monetization and the go-to-market and the whatever. And you know, that that's interesting to me in terms of, uh, you know, what, when you take it to almost ludicrous levels of abstraction.swyx [00:25:39]: It's like, okay, turn it to 11. You mentioned vibe coding, so I have to, this is a blog post I haven't written, but I'm kind of exploring it. Is the junior engineer dead?Dharmesh [00:25:49]: I don't think so. I think what will happen is that the junior engineer will be able to, if all they're bringing to the table is the fact that they are a junior engineer, then yes, they're likely dead. But hopefully if they can communicate with carbon-based life forms, they can interact with product, if they're willing to talk to customers, they can take their kind of basic understanding of engineering and how kind of software works. I think that has value. So I have a 14-year-old right now who's taking Python programming class, and some people ask me, it's like, why is he learning coding? And my answer is, it's because it's not about the syntax, it's not about the coding. What he's learning is like the fundamental thing of like how things work.
And there's value in that. I think there's going to be timeless value in systems thinking and abstractions and what that means. And whether functions manifested as math, which he's going to get exposed to regardless, or there are some core primitives to the universe, I think, that the more you understand them, those are what I would kind of think of as like really large dots in your life that will have a higher gravitational pull and value to them that you'll then be able to. So I want him to collect those dots, and he's not resisting. So it's like, okay, while he's still listening to me, I'm going to have him do things that I think will be useful.swyx [00:26:59]: You know, part of one of the pitches that I evaluated for AI engineer as a term. And the term is that maybe the traditional interview path or career path of software engineer goes away, which is because what's the point of LeetCode? Yeah. And, you know, it actually matters more that you know how to work with AI and to implement the things that you want. Yep.Dharmesh [00:27:16]: That's one of the like interesting things that's happened with generative AI. You know, you go from machine learning and the models and just that underlying form, which is like true engineering, right? Like the actual, what I call real engineering. I don't think of myself as a real engineer, actually. I'm a developer. But now with generative AI. We call it AI and it's obviously got its roots in machine learning, but it just feels like fundamentally different to me. Like you have the vibe. It's like, okay, well, this is just a whole different approach to software development to so many different things. And so I'm wondering now, it's like an AI engineer is like, if you were like to draw the Venn diagram, it's interesting because the cross between like AI things, generative AI and what the tools are capable of, what the models do, and this whole new kind of body of knowledge that we're still building out, it's still very young, intersected with kind of classic engineering, software engineering. Yeah.swyx [00:28:04]: I just described the overlap as it separates out eventually until it's its own thing, but it's starting out as software. Yeah.Alessio [00:28:11]: That makes sense. So to close the vibe coding loop, the other big hype now is MCPs. Obviously, I would say Claude Desktop and Cursor are like the two main drivers of MCP usage. I would say my favorite is the Sentry MCP. I can pull in errors and then you can just put the context in Cursor. How do you think about that abstraction layer? Does it feel... Does it feel almost too magical in a way? Do you think it's like you get enough? Because you don't really see how the server itself is then kind of like repackaging the information for you?Dharmesh [00:28:41]: I think MCP as a standard is one of the better things that's happened in the world of AI because a standard needed to exist and absent a standard, there was a set of things that just weren't possible. Now, we can argue whether it's the best possible manifestation of a standard or not. Does it do too much? Does it do too little? I get that, but it's just simple enough to both be useful and unobtrusive. It's understandable and adoptable by mere mortals, right? It's not overly complicated. You know, a reasonable engineer can stand up an MCP server relatively easily. The thing that has me excited about it is like, so I'm a big believer in multi-agent systems. And so that's going back to our kind of this idea of an atomic agent.
So imagine the MCP server, like obviously it calls tools, but the way I think about it, so I'm working on my current passion project is agent.ai. And we'll talk more about that in a little bit. More about the, I think we should, because I think it's interesting not to promote the project at all, but there's some interesting ideas in there. One of which is around, we're going to need a mechanism for, if agents are going to collaborate and be able to delegate, there's going to need to be some form of discovery and we're going to need some standard way. It's like, okay, well, I just need to know what this thing over here is capable of. We're going to need a registry, which Anthropic's working on. I'm sure others will and have been doing directories of, and there's going to be a standard around that too. How do you build out a directory of MCP servers? I think that's going to unlock so many things just because, and we're already starting to see it. So I think MCP or something like it is going to be the next major unlock because it allows systems that don't know about each other, don't need to, it's that kind of decoupling of like Sentry and whatever tools someone else was building. And it's not just about, you know, Claude Desktop or things like, even on the client side, I think we're going to see very interesting consumers of MCP, MCP clients versus just the chatbot-y kind of things. Like, you know, Claude Desktop and Cursor and things like that. But yeah, I'm very excited about MCP in that general direction.swyx [00:30:39]: I think the typical cynical developer take, it's like, we have OpenAPI. Yeah. What's the new thing? I don't know if you have a, do you have a quick MCP versus everything else? Yeah.Dharmesh [00:30:49]: So it's, so I like OpenAPI, right? So just a descriptive thing. It's OpenAPI. OpenAPI. Yes, that's what I meant. So it's basically a self-documenting thing. We can do machine-generated, lots of things from that output. It's a structured definition of an API. I get that, love it. But MCPs sort of are kind of use case specific. They're perfect for exactly what we're trying to use them for around LLMs in terms of discovery. It's like, okay, I don't necessarily need to know kind of all this detail. And so right now we have, we'll talk more about like MCP server implementations, but We will? I think, I don't know. Maybe we won't. At least it's in my head. It's like a back processor. But I do think MCP adds value above OpenAPI. It's, yeah, just because it solves this particular thing. And if we had come to the world, which we have, like, it's like, hey, we already have OpenAPI. It's like, if that were good enough for the universe, the universe would have adopted it already. There's a reason why MCP is taking off because it marginally adds something that was missing before and doesn't go too far. And so that's why the kind of rate of adoption, you folks have written about this and talked about it. Yeah, why MCP won. Yeah. And it won because the universe decided that this was useful and maybe it gets supplanted by something else. Yeah. And maybe we discover, oh, maybe OpenAPI was good enough the whole time. I doubt that.swyx [00:32:09]: The meta lesson, this is, I mean, he's an investor in DevTools companies. I work in developer experience at DevRel in DevTools companies. Yep. Everyone wants to own the standard. Yeah. I'm sure you guys have tried to launch your own standards. Actually, is HubSpot known for a standard? You know, obviously inbound marketing.
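For readers who haven't stood one up: the kind of MCP server Dharmesh describes really is a small amount of code. A minimal sketch follows, assuming the official MCP Python SDK's FastMCP helper (check the current SDK docs for exact APIs); the domain-valuation tool and its heuristic are toy placeholders, not a real appraisal method.

```python
# pip install mcp  -- the official Model Context Protocol Python SDK (API assumed as of writing)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("domain-tools")


@mcp.tool()
def estimate_domain_value(domain: str) -> str:
    """Very rough placeholder valuation based only on name length and TLD."""
    name, _, tld = domain.partition(".")
    base = 10_000 if tld == "com" else 1_000
    estimate = max(base - 500 * max(len(name) - 4, 0), 100)
    return f"{domain}: roughly ${estimate:,} (toy heuristic, not a real appraisal)"


if __name__ == "__main__":
    mcp.run()  # stdio transport by default, e.g. for Claude Desktop or Cursor to discover
```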
But is there a standard or protocol that you ever tried to push? No.Dharmesh [00:32:30]: And there's a reason for this. Yeah. Is that? And I don't mean, need to mean, speak for the people of HubSpot, but I personally. You kind of do. I'm not smart enough. That's not the, like, I think I have a. You're smart. Not enough for that. I'm much better off understanding the standards that are out there. And I'm more on the composability side. Let's, like, take the pieces of technology that exist out there, combine them in creative, unique ways. And I like to consume standards. I don't like to, and that's not that I don't like to create them. I just don't think I have the, both the raw wattage or the credibility. It's like, okay, well, who the heck is Dharmesh, and why should we adopt a standard he created?swyx [00:33:07]: Yeah, I mean, there are people who don't monetize standards, like OpenTelemetry is a big standard, and LightStep never capitalized on that.Dharmesh [00:33:15]: So, okay, so if I were to do a standard, there's two things that have been in my head in the past. I was one around, a very, very basic one around, I don't even have the domain, I have a domain for everything, for open marketing. Because the issue we had in HubSpot grew up in the marketing space. There we go. There was no standard around data formats and things like that. It doesn't go anywhere. But the other one, and I did not mean to go here, but I'm going to go here. It's called OpenGraph. I know the term was already taken, but it hasn't been used for like 15 years now for its original purpose. But what I think should exist in the world is right now, our information, all of us, nodes are in the social graph at Meta or the professional graph at LinkedIn. Both of which are actually relatively closed in actually very annoying ways. Like very, very closed, right? Especially LinkedIn. Especially LinkedIn. I personally believe that if it's my data, and if I would get utility out of it being open, I should be able to make my data open or publish it in whatever forms that I choose, as long as I have control over it as opt-in. So the idea is around OpenGraph that says, here's a standard, here's a way to publish it. I should be able to go to OpenGraph.org slash Dharmesh dot JSON and get it back. And it's like, here's your stuff, right? And I can choose along the way and people can write to it and I can approve. And there can be an entire system. And if I were to do that, I would do it as a... Like a public benefit, non-profit-y kind of thing, as this is a contribution to society. I wouldn't try to commercialize that. Have you looked at ATProto? What's that? ATProto.swyx [00:34:43]: It's the protocol behind Bluesky. Okay. My good friend, Dan Abramov, who was the face of React for many, many years, now works there. And he actually did a talk that I can send you, which basically kind of tries to articulate what you just said. But he does, he loves doing these like really great analogies, which I think you'll like. Like, you know, a lot of our data is behind a handle, behind a domain. Yep. So he's like, all right, what if we flip that? What if it was like our handle and then the domain? Yep. So, and that's really like your data should belong to you. Yep. And I should not have to wait 30 days for my Twitter data to export. Yep.
Locked up. I think the trick here isn't that standard. It is getting the normies to care.swyx [00:35:37]: Yeah. Because normies don't care.Dharmesh [00:35:38]: That's true. But building on that, normies don't care. So, you know, privacy is a really hot topic and an easy word to use, but it's not a binary thing. Like there are use cases where, and we make these choices all the time, that I will trade, not all privacy, but I will trade some privacy for some productivity gain or some benefit to me that says, oh, I don't care about that particular data being online if it gives me this in return, or I don't mind sharing this information with this company.Alessio [00:36:02]: If I'm getting, you know, this in return, but that sort of should be my option. I think now with computer use, you can actually automate some of the exports. Yes. Like something we've been doing internally is like everybody exports their LinkedIn connections. Yep. And then internally, we kind of merge them together to see how we can connect our companies to customers or things like that.Dharmesh [00:36:21]: And not to pick on LinkedIn, but since we're talking about it, but they feel strongly enough on the, you know, do not take LinkedIn data that they will block even browser use kind of things or whatever. They go to great, great lengths, even to see patterns of usage. And it says, oh, there's no way you could have, you know, gotten that particular thing or whatever without, and it's, so it's, there's...swyx [00:36:42]: Wasn't there a Supreme Court case that they lost? Yeah.Dharmesh [00:36:45]: So the one they lost was around someone that was scraping public data that was on the public internet. And that particular company had not signed any terms of service or whatever. It's like, oh, I'm just taking data that's on, there was no, and so that's why they won. But now, you know, the question is around, can LinkedIn... I think they can. Like, when you use, as a user, you use LinkedIn, you are signing up for their terms of service. And if they say, well, this kind of use of your LinkedIn account that violates our terms of service, they can shut your account down, right? They can. And they, yeah, so, you know, we don't need to make this a discussion. By the way, I love the company, don't get me wrong. I'm an avid user of the product. You know, I've got... Yeah, I mean, you've got over a million followers on LinkedIn, I think. Yeah, I do. And I've known people there for a long, long time, right? And I have lots of respect. And I understand even where the mindset originally came from of this kind of members-first approach to, you know, a privacy-first. I sort of get that. But sometimes you sort of have to wonder, it's like, okay, well, that was 15, 20 years ago. There's likely some controlled ways to expose some data on some member's behalf and not just completely be a binary. It's like, no, thou shalt not have the data.swyx [00:37:54]: Well, just pay for sales navigator.Alessio [00:37:57]: Before we move to the next layer of instruction, anything else on MCP you mentioned? Let's move back and then I'll tie it back to MCPs.Dharmesh [00:38:05]: So I think the... Open this with agent. Okay, so I'll start with... Here's my kind of running thesis, is that as AI and agents evolve, which they're doing very, very quickly, we're going to look at them more and more. I don't like to anthropomorphize. We'll talk about why this is not that. Less as just like raw tools and more like teammates. They'll still be software. 
They should self-disclose as being software. I'm totally cool with that. But I think what's going to happen is that in the same way you might collaborate with a team member on Slack or Teams or whatever you use, you can imagine a series of agents that do specific things just like a team member might do, that you can delegate things to. You can collaborate. You can say, hey, can you take a look at this? Can you proofread that? Can you try this? You can... Whatever it happens to be. So I think it is... I will go so far as to say it's inevitable that we're going to have hybrid teams someday. And what I mean by hybrid teams... So back in the day, hybrid teams were, oh, well, you have some full-time employees and some contractors. Then it was like hybrid teams are some people that are in the office and some that are remote. That's the kind of form of hybrid. The next form of hybrid is like the carbon-based life forms and agents and AI and some form of software. So let's say we temporarily stipulate that I'm right about that over some time horizon that eventually we're going to have these kind of digitally hybrid teams. So if that's true, then the question you sort of ask yourself is that then what needs to exist in order for us to get the full value of that new model? It's like, okay, well... You sort of need to... It's like, okay, well, how do I... If I'm building a digital team, like, how do I... Just in the same way, if I'm interviewing for an engineer or a designer or a PM, whatever, it's like, well, that's why we have professional networks, right? It's like, oh, they have a presence on likely LinkedIn. I can go through that semi-structured, structured form, and I can see the experience of whatever, you know, self-disclosed. But, okay, well, agents are going to need that someday. And so I'm like, okay, well, this seems like a thread that's worth pulling on. That says, okay. So I... So agent.ai is out there. And it's LinkedIn for agents. It's LinkedIn for agents. It's a professional network for agents. And the more I pull on that thread, it's like, okay, well, if that's true, like, what happens, right? It's like, oh, well, they have a profile just like anyone else, just like a human would. It's going to be a graph underneath, just like a professional network would be. It's just that... And you can have its, you know, connections and follows, and agents should be able to post. That's maybe how they do release notes. Like, oh, I have this new version. Whatever they decide to post, it should just be able to... Behave as a node on the network of a professional network. As it turns out, the more I think about that and pull on that thread, the more and more things, like, start to make sense to me. So it may be more than just a pure professional network. So my original thought was, okay, well, it's a professional network and agents as they exist out there, which I think there's going to be more and more of, will kind of exist on this network and have the profile. But then, and this is always dangerous, I'm like, okay, I want to see a world where thousands of agents are out there in order for the... Because those digital employees, the digital workers don't exist yet in any meaningful way. And so then I'm like, oh, can I make that easier for, like... And so I have, as one does, it's like, oh, I'll build a low-code platform for building agents. How hard could that be, right? Like, very hard, as it turns out. But it's been fun. So now, agent.ai has 1.3 million users. 
3,000 people have actually, you know, built some variation of an agent, sometimes just for their own personal productivity. About 1,000 of which have been published. And the reason this comes back to MCP for me, so imagine that and other networks, since I know agent.ai. So right now, we have an MCP server for agent.ai that exposes all the internally built agents that we have that do, like, super useful things. Like, you know, I have access to a Twitter API that I can subsidize the cost. And I can say, you know, if you're looking to build something for social media, these kinds of things, with a single API key, and it's all completely free right now, I'm funding it. That's a useful way for it to work. And then we have a developer to say, oh, I have this idea. I don't have to worry about open AI. I don't have to worry about, now, you know, this particular model is better. It has access to all the models with one key. And we proxy it kind of behind the scenes. And then expose it. So then we get this kind of community effect, right? That says, oh, well, someone else may have built an agent to do X. Like, I have an agent right now that I built for myself to do domain valuation for website domains because I'm obsessed with domains, right? And, like, there's no efficient market for domains. There's no Zillow for domains right now that tells you, oh, here are what houses in your neighborhood sold for. It's like, well, why doesn't that exist? We should be able to solve that problem. And, yes, you're still guessing. Fine. There should be some simple heuristic. So I built that. It's like, okay, well, let me go look for past transactions. You say, okay, I'm going to type in agent.ai, agent.com, whatever domain. What's it actually worth? I'm looking at buying it. It can go and say, oh, which is what it does. It's like, I'm going to go look at are there any published domain transactions recently that are similar, either use the same word, same top-level domain, whatever it is. And it comes back with an approximate value, and it comes back with its kind of rationale for why it picked the value and comparable transactions. Oh, by the way, this domain sold for published. Okay. So that agent now, let's say, existed on the web, on agent.ai. Then imagine someone else says, oh, you know, I want to build a brand-building agent for startups and entrepreneurs to come up with names for their startup. Like a common problem, every startup is like, ah, I don't know what to call it. And so they type in five random words that kind of define whatever their startup is. And you can do all manner of things, one of which is like, oh, well, I need to find the domain for it. What are possible choices? Now it's like, okay, well, it would be nice to know if there's an aftermarket price for it, if it's listed for sale. Awesome. Then imagine calling this valuation agent. It's like, okay, well, I want to find where the arbitrage is, where the agent valuation tool says this thing is worth $25,000. It's listed on GoDaddy for $5,000. It's close enough. Let's go do that. Right? And that's a kind of composition use case that in my future state. Thousands of agents on the network, all discoverable through something like MCP. And then you as a developer of agents have access to all these kind of Lego building blocks based on what you're trying to solve. Then you blend in orchestration, which is getting better and better with the reasoning models now. Just describe the problem that you have. 
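The composition pattern described above — one published agent discovering and delegating to another — can be sketched without any particular platform. Everything below is hypothetical: the registry is a stand-in for MCP-style discovery, and the toy agents are ours, not agent.ai's actual API or valuation logic.

```python
# Hypothetical sketch of agent composition: a brand-naming agent that looks up and
# delegates to a separately published domain-valuation agent via a shared registry.
from typing import Callable

REGISTRY: dict[str, Callable[[str], str]] = {}  # stand-in for a real discovery layer


def publish(name: str):
    """Register an agent under a discoverable name (toy version of a directory entry)."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[name] = fn
        return fn
    return wrap


@publish("domain-valuation")
def value_domain(domain: str) -> str:
    return f"{domain}: ~$25,000 (toy comparable-sales estimate)"


@publish("brand-naming")
def name_startup(description: str) -> str:
    candidate = description.split()[0].lower() + ".ai"   # naive name suggestion
    appraisal = REGISTRY["domain-valuation"](candidate)  # delegate to another published agent
    return f"Suggested name: {candidate} | {appraisal}"


print(name_startup("Latent knowledge graphs for sales teams"))
```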
Now, the next layer that we're all contending with is how many tools can you actually give an LLM before the LLM breaks? That number used to be like 15 or 20 before it kind of started to vary dramatically. And so that's the thing I'm thinking about now. It's like, okay, if I want to... If I want to expose 1,000 of these agents to a given LLM, obviously I can't give it all 1,000. Is there some intermediate layer that says, based on your prompt, I'm going to make a best guess at which agents might be able to be helpful for this particular thing? Yeah.Alessio [00:44:37]: Yeah, like RAG for tools. Yep. I did build the Latent Space Researcher on agent.ai. Okay. Nice. Yeah, that seems like, you know, then there's going to be a Latent Space Scheduler. And then once I schedule a research, you know, and you build all of these things. By the way, my apologies for the user experience. You realize I'm an engineer. It's pretty good.swyx [00:44:56]: I think it's a normie-friendly thing. Yeah. That's your magic. HubSpot does the same thing.Alessio [00:45:01]: Yeah, just to like quickly run through it. You can basically create all these different steps. And these steps are like, you know, static versus like variable-driven things. How did you decide between this kind of like low-code-ish versus doing, you know, low-code with code backend versus like not exposing that at all? Any fun design decisions? Yeah. And this is, I think...Dharmesh [00:45:22]: I think lots of people are likely sitting in exactly my position right now, coming through the choice between deterministic. Like if you're like in a business or building, you know, some sort of agentic thing, do you decide to do a deterministic thing? Or do you go non-deterministic and just let the LLM handle it, right, with the reasoning models? The original idea and the reason I took the low-code stepwise, a very deterministic approach. A, the reasoning models did not exist at that time. That's thing number one. Thing number two is if you can get... If you know in your head... If you know in your head what the actual steps are to accomplish whatever goal, why would you leave that to chance? There's no upside. There's literally no upside. Just tell me, like, what steps do you need executed? So right now what I'm playing with... So one thing we haven't talked about yet, and people don't talk about UI and agents. Right now, the primary interaction model... Or they don't talk enough about it. I know some people have. But it's like, okay, so we're used to the chatbot back and forth. Fine. I get that. But I think we're going to move to a blend of... Some of those things are going to be synchronous as they are now. But some are going to be... Some are going to be async. It's just going to put it in a queue, just like... And this goes back to my... Man, I talk fast. But I have this... I only have one other speed. It's even faster. So imagine it's like if you're working... So back to my, oh, we're going to have these hybrid digital teams. Like, you would not go to a co-worker and say, I'm going to ask you to do this thing, and then sit there and wait for them to go do it. Like, that's not how the world works. So it's nice to be able to just, like, hand something off to someone. It's like, okay, well, maybe I expect a response in an hour or a day or something like that.Dharmesh [00:46:52]: In terms of when things need to happen. So the UI around agents.
So if you look at the output of agent.ai agents right now, they are the simplest possible manifestation of a UI, right? That says, oh, we have inputs of, like, four different types. Like, we've got a dropdown, we've got multi-select, all the things. It's like back in HTML, the original HTML 1.0 days, right? Like, you're the smallest possible set of primitives for a UI. And it just says, okay, because we need to collect some information from the user, and then we go do steps and do things. And generate some output in HTML or markup are the two primary examples. So the thing I've been asking myself, if I keep going down that path. So people ask me, I get requests all the time. It's like, oh, can you make the UI sort of boring? I need to be able to do this, right? And if I keep pulling on that, it's like, okay, well, now I've built an entire UI builder thing. Where does this end? And so I think the right answer, and this is what I'm going to be backcoding once I get done here, is around injecting a code generation UI generation into, the agent.ai flow, right? As a builder, you're like, okay, I'm going to describe the thing that I want, much like you would do in a vibe coding world. But instead of generating the entire app, it's going to generate the UI that exists at some point in either that deterministic flow or something like that. It says, oh, here's the thing I'm trying to do. Go generate the UI for me. And I can go through some iterations. And what I think of it as a, so it's like, I'm going to generate the code, generate the code, tweak it, go through this kind of prompt style, like we do with vibe coding now. And at some point, I'm going to be happy with it. And I'm going to hit save. And that's going to become the action in that particular step. It's like a caching of the generated code that I can then, like incur any inference time costs. It's just the actual code at that point.Alessio [00:48:29]: Yeah, I invested in a company called E2B, which does code sandbox. And they powered the LM arena web arena. So it's basically the, just like you do LMS, like text to text, they do the same for like UI generation. So if you're asking a model, how do you do it? But yeah, I think that's kind of where.Dharmesh [00:48:45]: That's the thing I'm really fascinated by. So the early LLM, you know, we're understandably, but laughably bad at simple arithmetic, right? That's the thing like my wife, Normies would ask us, like, you call this AI, like it can't, my son would be like, it's just stupid. It can't even do like simple arithmetic. And then like we've discovered over time that, and there's a reason for this, right? It's like, it's a large, there's, you know, the word language is in there for a reason in terms of what it's been trained on. It's not meant to do math, but now it's like, okay, well, the fact that it has access to a Python interpreter that I can actually call at runtime, that solves an entire body of problems that it wasn't trained to do. And it's basically a form of delegation. And so the thought that's kind of rattling around in my head is that that's great. So it's, it's like took the arithmetic problem and took it first. Now, like anything that's solvable through a relatively concrete Python program, it's able to do a bunch of things that I couldn't do before. Can we get to the same place with UI? I don't know what the future of UI looks like in a agentic AI world, but maybe let the LLM handle it, but not in the classic sense. 
Maybe it generates it on the fly, or maybe we go through some iterations and hit cache or something like that. So it's a little bit more predictable. Uh, I don't know, but yeah.Alessio [00:49:48]: And especially when is the human supposed to intervene? So, especially if you're composing them, most of them should not have a UI because then they're just web hooking to somewhere else. I just want to touch back. I don't know if you have more comments on this.swyx [00:50:01]: I was just going to ask when you, you said you got, you're going to go back to code. What

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are working with Amplify on the 2025 State of AI Engineering Survey to be presented at the AIE World's Fair in SF! Join the survey to shape the future of AI Eng!We first met Snipd over a year ago, and were immediately impressed by the design, but were doubtful about the behavior of snipping as the title behavior:Podcast apps are enormously sticky - Spotify spent almost $1b in podcast acquisitions and exclusive content just to get an 8% bump in market share among normies.However, after a disappointing Overcast 2.0 rewrite with no AI features in the last 3 years, I finally bit the bullet and switched to Snipd. It's 2025, your podcast app should be able to let you search transcripts of your podcasts. Snipd is the best implementation of this so far.And yet they keep shipping:What impressed us wasn't just how this tiny team of 4 was able to bootstrap a consumer AI app against massive titans and do so well; but also how seriously they think about learning through podcasts and improving retention of knowledge over time, aka “Duolingo for podcasts”. As an educational AI podcast, that's a mission we can get behind.Full Video PodFind us on YouTube! This was the first pod we've ever shot outdoors!Show Notes* How does Shazam work?* Flutter/FlutterFlow* wav2vec paper* Perplexity Online LLM* Google Search Grounding* Comparing Snipd transcription with our Bee episode* NIPS 2017 Flo Rida* Gustav Söderström - Background AudioTimestamps* [00:00:03] Takeaways from AI Engineer NYC* [00:00:17] Weather in New York.* [00:00:26] Swyx and Snipd.* [00:01:01] Kevin's AI summit experience.* [00:01:31] Zurich and AI.* [00:03:25] SigLIP authors join OpenAI.* [00:03:39] Zurich is very costly.* [00:04:06] The Snipd origin story.* [00:05:24] Introduction to machine learning.* [00:09:28] Snipd and user knowledge extraction.* [00:13:48] App's tech stack, Flutter, Python.* [00:15:11] How speakers are identified.* [00:18:29] The concept of "backgroundable" video.* [00:29:05] Voice cloning technology.* [00:31:03] Using AI agents.* [00:34:32] Snipd's future is multi-modal AI.* [00:36:37] Snipd and existing user behaviour.* [00:42:10] The app, summary, and timestamps.* [00:55:25] The future of AI and podcasting.* [1:14:55] Voice AITranscriptswyx [00:00:03]: Hey, I'm here in New York with Kevin Ben-Smith of Snipd. Welcome.Kevin [00:00:07]: Hi. Hi. Amazing to be here.swyx [00:00:09]: Yeah. This is our first ever, I think, outdoors podcast recording.Kevin [00:00:14]: It's quite a location for the first time, I have to say.swyx [00:00:18]: I was actually unsure because, you know, it's cold. It's like, I checked the temperature. It's like kind of one degree Celsius, but it's not that bad with the sun. No, it's quite nice. Yeah. Especially with our beautiful tea. With the tea. Yeah. Perfect. We're going to talk about Snips. I'm a Snips user. I'm a Snips user. I had to basically, you know, apart from Twitter, it's like the number one use app on my phone. Nice. When I wake up in the morning, I open Snips and I, you know, see what's new. And I think in terms of time spent or usage on my phone, I think it's number one or number two. Nice. Nice. So I really had to talk about it also because I think people interested in AI want to think about like, how can we, we're an AI podcast, we have to talk about the AI podcast app. But before we get there, we just finished. We just finished the AI Engineer Summit and you came for the two days. How was it?Kevin [00:01:07]: It was quite incredible. 
I mean, for me, the most valuable was just being in the same room with like-minded people who are building the future and who are seeing the future. You know, especially when it comes to AI agents, it's so often I have conversations with friends who are not in the AI world. And it's like so quickly it happens that you, it sounds like you're talking in science fiction. And it's just crazy talk. It was, you know, it's so refreshing to talk with so many other people who already see these things and yeah, be inspired then by them and not always feel like, like, okay, I think I'm just crazy. And like, this will never happen. It really is happening. And for me, it was very valuable. So day two, more relevant, more relevant for you than day one. Yeah. Day two. So day two was the engineering track. Yeah. That was definitely the most valuable for me. Like also as a producer. Practitioner myself, especially there were one or two talks that had to do with voice AI and AI agents with voice. Okay. So that was quite fascinating. Also spoke with the speakers afterwards. Yeah. And yeah, they were also very open and, and, you know, this, this sharing attitudes that's, I think in general, quite prevalent in the AI community. I also learned a lot, like really practical things that I can now take away with me. Yeah.swyx [00:02:25]: I mean, on my side, I, I think I watched only like half of the talks. Cause I was running around and I think people saw me like towards the end, I was kind of collapsing. I was on the floor, like, uh, towards the end because I, I needed to get, to get a rest, but yeah, I'm excited to watch the voice AI talks myself.Kevin [00:02:43]: Yeah. Yeah. Do that. And I mean, from my side, thanks a lot for organizing this conference for bringing everyone together. Do you have anything like this in Switzerland? The short answer is no. Um, I mean, I have to say the AI community in, especially Zurich, where. Yeah. Where we're, where we're based. Yeah. It is quite good. And it's growing, uh, especially driven by ETH, the, the technical university there and all of the big companies, they have AI teams there. Google, like Google has the biggest tech hub outside of the U S in Zurich. Yeah. Facebook is doing a lot in reality labs. Uh, Apple has a secret AI team, open AI and then SwapBit just announced that they're coming to Zurich. Yeah. Um, so there's a lot happening. Yeah.swyx [00:03:23]: So, yeah, uh, I think the most recent notable move, I think the entire vision team from Google. Uh, Lucas buyer, um, and, and all the other authors of Siglip left Google to join open AI, which I thought was like, it's like a big move for a whole team to move all at once at the same time. So I've been to Zurich and it just feels expensive. Like it's a great city. Yeah. It's great university, but I don't see it as like a business hub. Is it a business hub? I guess it is. Right.Kevin [00:03:51]: Like it's kind of, well, historically it's, uh, it's a finance hub, finance hub. Yeah. I mean, there are some, some large banks there, right? Especially UBS, uh, the, the largest wealth manager in the world, but it's really becoming more of a tech hub now with all of the big, uh, tech companies there.swyx [00:04:08]: I guess. Yeah. Yeah. And, but we, and research wise, it's all ETH. Yeah. There's some other things. Yeah. Yeah. Yeah.Kevin [00:04:13]: It's all driven by ETH. And then, uh, it's sister university EPFL, which is in Lausanne. Okay. Um, which they're also doing a lot, but, uh, it's, it's, it's really ETH. 
Uh, and otherwise, no, I mean, it's a beautiful, really beautiful city. I can recommend. To anyone. To come, uh, visit Zurich, uh, uh, let me know, happy to show you around and of course, you know, you, you have the nature so close, you have the mountains so close, you have so, so beautiful lakes. Yeah. Um, I think that's what makes it such a livable city. Yeah.swyx [00:04:42]: Um, and the cost is not, it's not cheap, but I mean, we're in New York city right now and, uh, I don't know, I paid $8 for a coffee this morning, so, uh, the coffee is cheaper in Zurich than the New York city. Okay. Okay. Let's talk about Snipt. What is Snipt and, you know, then we'll talk about your origin story, but I just, let's, let's get a crisp, what is Snipt? Yeah.Kevin [00:05:03]: I always see two definitions of Snipt, so I'll give you one really simple, straightforward one, and then a second more nuanced, um, which I think will be valuable for the rest of our conversation. So the most simple one is just to say, look, we're an AI powered podcast app. So if you listen to podcasts, we're now providing this AI enhanced experience. But if you look at the more nuanced, uh, podcast. Uh, perspective, it's actually, we, we've have a very big focus on people who like your audience who listened to podcasts to learn something new. Like your audience, you want, they want to learn about AI, what's happening, what's, what's, what's the latest research, what's going on. And we want to provide a, a spoken audio platform where you can do that most effectively. And AI is basically the way that we can achieve that. Yeah.swyx [00:05:53]: Means to an end. Yeah, exactly. When you started. Was it always meant to be AI or is it, was it more about the social sharing?Kevin [00:05:59]: So the first version that we ever released was like three and a half years ago. Okay. Yeah. So this was before ChatGPT. Before Whisper. Yeah. Before Whisper. Yeah. So I think a lot of the features that we now have in the app, they weren't really possible yet back then. But we already from the beginning, we always had the focus on knowledge. That's the reason why, you know, we in our team, why we listen to podcasts, but we did have a bit of a different approach. Like the idea in the very beginning was, so the name is Snips and you can create these, what we call Snips, which is basically a small snippet, like a clip from a, from a podcast. And we did envision sort of like a, like a social TikTok platform where some people would listen to full episodes and they would snip certain, like the best parts of it. And they would post that in a feed and other users would consume this feed of Snips. And use that as a discovery tool or just as a means to an end. And yeah, so you would have both people who create Snips and people who listen to Snips. So our big hypothesis in the beginning was, you know, it will be easy to get people to listen to these Snips, but super difficult to actually get them to create them. So we focused a lot of, a lot of our effort on making it as seamless and easy as possible to create a Snip. Yeah.swyx [00:07:17]: It's similar to TikTok. You need CapCut for there to be videos on TikTok. Exactly.Kevin [00:07:23]: And so for, for Snips, basically whenever you hear an amazing insight, a great moment, you can just triple tap your headphones. And our AI actually then saves the moment that you just listened to and summarizes it to create a note. And this is then basically a Snip. So yeah, we built, we built all of this, launched it. 
And what we found out was basically the exact opposite. So we saw that people use the Snips to discover podcasts, but they really, you know, they don't. You know, really love listening to long form podcasts, but they were creating Snips like crazy. And this was, this was definitely one of these aha moments when we realized like, hey, we should be really doubling down on the knowledge of learning of, yeah, helping you learn most effectively and helping you capture the knowledge that you listen to and actually do something with it. Because this is in general, you know, we, we live in this world where there's so much content and we consume and consume and consume. And it's so easy to just at the end of the podcast. You just start listening to the next podcast. And five minutes later, you've forgotten everything. 90%, 99% of what you've actually just learned. Yeah.swyx [00:08:31]: You don't know this, but, and most people don't know this, but this is my fourth podcast. My third podcast was a personal mixtape podcast where I Snipped manually sections of podcasts that I liked and added my own commentary on top of them and published them as small episodes. Nice. So those would be maybe five to 10 minute Snips. Yeah. And then I added something that I thought was a good story or like a good insight. And then I added my own commentary and published it as a separate podcast. It's cool. Is that still live? It's still live, but it's not active, but you can go back and find it. If you're, if, if you're curious enough, you'll see it. Nice. Yeah. You have to show me later. It was so manual because basically what my process would be, I hear something interesting. I note down the timestamp and I note down the URL of the podcast. I used to use Overcast. So it would just link to the Overcast page. And then. Put in my note taking app, go home. Whenever I feel like publishing, I will take one of those things and then download the MP3, clip out the MP3 and record my intro, outro and then publish it as a, as a podcast. But now Snips, I mean, I can just kind of double click or triple tap.Kevin [00:09:39]: I mean, those are very similar stories to what we hear from our users. You know, it's, it's normal that you're doing, you're doing something else while you're listening to a podcast. Yeah. A lot of our users, they're driving, they're working out, walking their dog. So in those moments when you hear something amazing, it's difficult to just write them down or, you know, you have to take out your phone. Some people take a screenshot, write down a timestamp, and then later on you have to go back and try to find it again. Of course you can't find it anymore because there's no search. There's no command F. And, um, these, these were all of the issues that, that, that we encountered also ourselves as users. And given that our background was in AI, we realized like, wait, hey, this is. This should not be the case. Like podcast apps today, they're still, they're basically repurposed music players, but we actually look at podcasts as one of the largest sources of knowledge in the world. And once you have that different angle of looking at it together with everything that AI is now enabling, you realize like, hey, this is not the way that we, that podcast apps should be. Yeah.swyx [00:10:41]: Yeah. I agree. You mentioned something that you said your background is in AI. Well, first of all, who's the team and what do you mean your background is in AI?Kevin [00:10:48]: Those are two very different things. 
I'm going to ask some questions. Yeah. Um, maybe starting with, with my backstory. Yeah. My backstory actually goes back, like, let's say 12 years ago or something like that. I moved to Zurich to study at ETH and actually I studied something completely different. I studied mathematics and economics basically with this specialization for quant finance. Same. Okay. Wow. All right. So yeah. And then as you know, all of these mathematical models for, um, asset pricing, derivative pricing, quantitative trading. And for me, the thing that, that fascinates me the most was the mathematical modeling behind it. Uh, mathematics, uh, statistics, but I was never really that passionate about the finance side of things.swyx [00:11:32]: Oh really? Oh, okay. Yeah. I mean, we're different there.Kevin [00:11:36]: I mean, one just, let's say symptom that I noticed now, like, like looking back during that time. Yeah. I think I never read an academic paper about the subject in my free time. And then it was towards the end of my studies. I was already working for a big bank. One of my best friends, he comes to me and says, Hey, I just took this course. You have to, you have to do this. You have to take this lecture. Okay. And I'm like, what, what, what is it about? It's called machine learning and I'm like, what, what, what kind of stupid name is that? Uh, so you sent me the slides and like over a weekend I went through all of the slides and I just, I just knew like freaking hell. Like this is it. I'm, I'm in love. Wow. Yeah. Okay. And that was then over the course of the next, I think like 12 months, I just really got into it. Started reading all about it, like reading blog posts, starting building my own models.swyx [00:12:26]: Was this course by a famous person, famous university? Was it like the Andrew Wayne Coursera thing? No.Kevin [00:12:31]: So this was a ETH course. So a professor at ETH. Did he teach in English by the way? Yeah. Okay.swyx [00:12:37]: So these slides are somewhere available. Yeah. Definitely. I mean, now they're quite outdated. Yeah. Sure. Well, I think, you know, reflecting on the finance thing for a bit. So I, I was, used to be a trader, uh, sell side and buy side. I was options trader first and then I was more like a quantitative hedge fund analyst. We never really use machine learning. It was more like a little bit of statistical modeling, but really like you, you fit, you know, your regression.Kevin [00:13:03]: No, I mean, that's, that's what it is. And, uh, or you, you solve partial differential equations and have then numerical methods to, to, to solve these. That's, that's for you. That's your degree. And that's, that's not really what you do at work. Right. Unless, well, I don't know what you do at work. In my job. No, no, we weren't solving the partial differential. Yeah.swyx [00:13:18]: You learn all this in school and then you don't use it.Kevin [00:13:20]: I mean, we, we, well, let's put it like that. Um, in some things, yeah, I mean, I did code algorithms that would do it, but it was basically like, it was the most basic algorithms and then you just like slightly improve them a little bit. Like you just tweak them here and there. Yeah. It wasn't like starting from scratch, like, Oh, here's this new partial differential equation. How do we know?swyx [00:13:43]: Yeah. Yeah. I mean, that's, that's real life, right? Most, most of it's kind of boring or you're, you're using established things because they're established because, uh, they tackle the most important topics. 
Um, yeah. Portfolio management was more interesting for me. Um, and, uh, we, we were sort of the first to combine like social data with, with quantitative trading. And I think, uh, I think now it's very common, but, um, yeah. Anyway, then you, you went, you went deep on machine learning and then what? You quit your job? Yeah. Yeah. Wow.Kevin [00:14:12]: I quit my job because, uh, um, I mean, I started using it at the bank as well. Like try, like, you know, I like desperately tried to find any kind of excuse to like use it here or there, but it just was clear to me, like, no, if I want to do this, um, like I just have to like make a real cut. So I quit my job and joined an early stage, uh, tech startup in Zurich where I then built up the AI team over five years. Wow. Yeah. So yeah, we built various machine learning, uh, things for, for banks from like models for, for sales teams to identify which clients like which product to sell to them and with what reasons all the way to, we did a lot, a lot with bank transactions. One of the actually most fun projects for me was we had an, an NLP model that would take the booking text of a transaction, like a credit card transaction, and prettify it. Yeah. Because it had all of these, you know, like numbers in there and abbreviations and whatnot. And sometimes you look at it like, what, what is this? And it was just, you know, it would just change it to, I don't know, CVS. Yeah.swyx [00:15:15]: Yeah. But I mean, would you have hallucinations?Kevin [00:15:17]: No, no, no. The way that everything was set up, it wasn't like, it wasn't yet fully end to end generative, uh, neural network as what you would use today. Okay.swyx [00:15:30]: Awesome. And then when did you go like full time on Snips? Yeah.Kevin [00:15:33]: So basically that was, that was afterwards. I mean, how that started was the friend of mine who got me into machine learning, uh, him and I, uh, like he also got me interested into startups. He's had a big impact on my life. And the two of us would just jam on, on like ideas for startups every now and then. And his background was also in AI data science. And we had a couple of ideas, but given that we were working full time, we were thinking about, uh, so we participated in Hack Zurich. That's, uh, Europe's biggest hackathon, um, or at least was at the time. And we said, Hey, this is just a weekend. Let's just try out an idea, like hack something together and see how it works. And the idea was that we'd be able to search through podcast episodes, like within a podcast. Yeah. So we did that. Long story short, uh, we managed to do it like to build something that we realized, Hey, this actually works. You can, you can find things again in podcasts. We had like a natural language search and we pitched it on stage. And we actually won the hackathon, which was cool. I mean, we, we also, I think we had a good, um, like a good, good pitch or a good example. So we, we used the famous Joe Rogan episode with Elon Musk where Elon Musk smokes a joint. Okay. Um, it's like a two and a half hour episode. So we were on stage and then we just searched for like smoking weed and it would find that exact moment. It will play it. And it just like, come on with Elon Musk, just like smoking. Oh, so it was video as well? No, it was actually completely based on audio. But we did have the video for the presentation. Yeah. Which had a, had of course an amazing effect. Yeah. Like this gave us a lot of activation energy, but it wasn't actually about winning the hackathon.
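For readers curious what the hackathon-style "search within a podcast" looks like in its smallest form, here is a toy sketch: split the transcript into timestamped segments, index them, and return the timestamp of the best match so playback can jump there. A TF-IDF index is used purely as a stand-in; the actual demo used learned representations, and TF-IDF only matches on shared words.

```python
# Toy transcript search: find the segment that best matches a query and return
# its timestamp so the player can jump straight to that moment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (timestamp_seconds, transcript_text) pairs for one episode
segments = [
    (120.0,  "we talk about the early days of SpaceX and landing rockets"),
    (3300.5, "the conversation moves to AI safety and simulation theory"),
    (5432.0, "and then Elon just smokes a joint right there in the studio"),
]

texts = [t for _, t in segments]
vectorizer = TfidfVectorizer().fit(texts)
index = vectorizer.transform(texts)

def search(query: str) -> tuple[float, str]:
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    return segments[int(scores.argmax())]

print(search("smokes a joint"))  # -> (5432.0, '...smokes a joint right there...')
```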
Yeah. But the interesting thing that happened was after we pitched on stage, several of the other participants, like a lot of them came up to us and started saying like, Hey, can I use this? Like I have this issue. And like some also came up and told us about other problems that they have, like very adjacent to this with a podcast. Where's like, like this. Like, could, could I use this for that as well? And that was basically the, the moment where I realized, Hey, it's actually not just us who are having these issues with, with podcasts and getting to the, making the most out of this knowledge. Yeah. The other people. Yeah. That was now, I guess like four years ago or something like that. And then, yeah, we decided to quit our jobs and start, start this whole snip thing. Yeah. How big is the team now? We're just four people. Yeah. Just four people. Yeah. Like four. We're all technical. Yeah. Basically two on the, the backend side. So one of my co-founders is this person who got me into machine learning and startups. And we won the hackathon together. So we have two people for the backend side with the AI and all of the other backend things. And two for the front end side, building the app.swyx [00:18:18]: Which is mostly Android and iOS. Yeah.Kevin [00:18:21]: It's iOS and Android. We also have a watch app for, for Apple, but yeah, it's mostly iOS. Yeah.swyx [00:18:27]: The watch thing, it was very funny because in the, in the Latent Space discord, you know, most of us have been slowly adopting snips. You came to me like a year ago and you introduced snip to me. I was like, I don't know. I'm, you know, I'm very sticky to overcast and then slowly we switch. Why watch?Kevin [00:18:43]: So it goes back to a lot of our users, they do something else while, while listening to a podcast, right? Yeah. And one of the, us giving them the ability to then capture this knowledge, even though they're doing something else at the same time is one of the killer features. Yeah. Maybe I can actually, maybe at some point I should maybe give a bit more of an overview of what the, all of the features that we have. Sure. So this is one of the killer features and for one big use case that people use this for is for running. Yeah. So if you're a big runner, a big jogger or cycling, like really, really cycling competitively and a lot of the people, they don't want to take their phone with them when they go running. So you load everything onto the watch. So you can download episodes. I mean, if you, if you have an Apple watch that has internet access, like with a SIM card, you can also directly stream. That's also possible. Yeah. So of course it's a, it's basically very limited to just listening and snipping. And then you can see all of your snips later on your phone. Let me tell you this error I just got.swyx [00:19:47]: Error playing episode. Substack, the host of this podcast, does not allow this podcast to be played on an Apple watch. Yeah.Kevin [00:19:52]: That's a very beautiful thing. So we found out that all of the podcasts hosted on Substack, you cannot play them on an Apple watch. Why is this restriction? What? Like, don't ask me. We try to reach out to Substack. We try to reach out to some of the bigger podcasters who are hosting the podcast on Substack to also let them know. Substack doesn't seem to care. This is not specific to our app. You can also check out the Apple podcast app. Yeah. It's the same problem. It's just that we actually have identified it. 
And we tell the user what's going on.swyx [00:20:25]: I would say we host our podcast on Substack, but they're not very serious about their podcasting tools. I've told them before, I've been very upfront with them. So I don't feel like I'm shitting on them in any way. And it's kind of sad because otherwise it's a perfect creative platform. But the way that they treat podcasting as an afterthought, I think it's really disappointing.Kevin [00:20:45]: Maybe given that you mentioned all these features, maybe I can give a bit of a better overview of the features that we have. Let's do that. Let's do that. So I think we're mostly in our minds. Maybe for some of the listeners.swyx [00:20:55]: I mean, I'll tell you my version. Yeah. They can correct me, right? So first of all, I think the main job is for it to be a podcast listening app. It should be basically a complete superset of what you normally get on Overcast or Apple Podcasts or anything like that. You pull your show list from ListenNotes. How do you find shows? You've got to type in anything and you find them, right?Kevin [00:21:18]: Yeah. We have a search engine that is powered by ListenNotes. Yeah. But I mean, in the meantime, we have a huge database of like 99% of all podcasts out there ourselves. Yeah.swyx [00:21:27]: What I noticed, the default experience is you do not auto-download shows. And that's one very big difference for you guys versus other apps, where like, you know, if I'm subscribed to a thing, it auto-downloads and I already have the MP3 downloaded overnight. For me, I have to actively put it onto my queue, then it auto-downloads. And actually, I initially didn't like that. I think I maybe told you that I was like, oh, it's like a feature that I don't like. Like, because it means that I have to choose to listen to it in order to download and not to... It's like opt-in. There's a difference between opt-in and opt-out. So I opt-in to every episode that I listen to. And then, like, you know, you open it and depends on whether or not you have the AI stuff enabled. But the default experience is no AI stuff enabled. You can listen to it. You can see the snips, the number of snips and where people snip during the episode, which roughly correlates to interest level. And obviously, you can snip there. I think that's the default experience. I think snipping is really cool. Like, I use it to share a lot on Discord. I think we have tons and tons of just people sharing snips and stuff. Tweeting stuff is also like a nice, pleasant experience. But like the real features come when you actually turn on the AI stuff. And so the reason I got snipped, because I got fed up with Overcast not implementing any AI features at all. Instead, they spent two years rewriting their app to be a little bit faster. And I'm like, like, it's 2025. I should have a podcast that has transcripts that I can search. Very, very basic thing. Overcast will basically never have it.Kevin [00:22:49]: Yeah, I think that was a good, like, basic overview. Maybe I can add a bit to it with the AI features that we have. So one thing that we do every time a new podcast comes out, we transcribe the episode. We do speaker diarization. We identify the speaker names. Each guest, we extract a mini bio of the guest, try to find a picture of the guest online, add it. We break the podcast down into chapters, as in AI generated chapters. That one. That one's very handy. With a quick description per title and quick description per each chapter. 
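Assembling the features Kevin lists into one picture, the per-episode pre-processing looks roughly like the pipeline below. The function bodies are placeholder stubs; the real steps (Whisper transcription, diarization, LLM-based naming and chaptering) are discussed later in the conversation.

```python
# Rough shape of the per-episode enrichment pipeline, with heavy steps stubbed out.
from dataclasses import dataclass, field

@dataclass
class Episode:
    audio_path: str
    transcript: list[dict] = field(default_factory=list)  # [{start, end, text, speaker}]
    chapters: list[dict] = field(default_factory=list)    # [{title, summary, start}]
    guests: list[dict] = field(default_factory=list)      # [{name, bio, photo_url}]

# Placeholder stubs standing in for the real models and services.
def transcribe(path):    return [{"start": 0.0, "end": 4.0, "text": "welcome back", "speaker": None}]
def diarize(segments):   return [dict(s, speaker="SPEAKER_00") for s in segments]
def name_speakers(segs): return [{"name": "Unknown Guest", "bio": "", "photo_url": ""}]
def chapterize(segs):    return [{"title": "Intro", "summary": "", "start": 0.0}]

def process(ep: Episode) -> Episode:
    ep.transcript = transcribe(ep.audio_path)     # e.g. Whisper
    ep.transcript = diarize(ep.transcript)        # speech blocks -> anonymous speaker ids
    ep.guests     = name_speakers(ep.transcript)  # LLM maps ids to names and bios
    ep.chapters   = chapterize(ep.transcript)     # LLM-generated chapters + descriptions
    return ep

print(process(Episode("episode.mp3")))
```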
We identify all books that get mentioned on a podcast. You can tell I don't use that one. It depends on the podcast. There are some podcasts where the guests often recommend like an amazing book. So later on, you can you can find that again.swyx [00:23:42]: So you literally search for the word book or I just read blah, blah, blah.Kevin [00:23:46]: No, I mean, it's all LLM based. Yeah. So basically, we have we have an LLM that goes through the entire transcript and identifies if someone mentions a book, then we use the Perplexity API together with various other LLM orchestration to go out there on the Internet, find everything that there is to know about the book, find the cover, find who or what the author is, get a quick description of it and of the author. We then check on which other episodes the author appeared on.swyx [00:24:15]: Yeah, that is killer.Kevin [00:24:17]: Because that for me, if. If there's an interesting book, the first thing I do is I actually listen to a podcast episode with a with a writer because he usually gives a really great overview already on a podcast.swyx [00:24:28]: Sometimes the podcast is with the person as a guest. Sometimes his podcast is about the person without him there. Do you pick up both?Kevin [00:24:37]: So, yes, we pick up both in like our latest models. But actually what we show you in the app, the goal is to currently only show you the guest to separate that. In the future, we want to show the other things more.swyx [00:24:47]: For what it's worth, I don't mind. Yeah, I don't think like if I like if I like somebody, I'll just learn about them regardless of whether they're there or not.Kevin [00:24:55]: Yeah, I mean, yes and no. We we we have seen there are some personalities where this can break down. So, for example, the first version that we released with this feature, it picked up much more often a person, even if it was not a guest. Yeah. For example, the best examples for me is Sam Altman and Elon Musk. Like they're just mentioned on every second podcast, even when they're not on there. And if you're interested in it, you can go to Elon Musk and it looks like you're actually learning from him. Yeah, I see. And yeah, we updated our our algorithms, improved that a lot. And now it's gotten much better to only pick it up if they're a guest. And yeah, so this this is maybe to come back to the features, two more important features like we have the ability to chat with an episode. Yes. Of course, you can do the old style of searching through a transcript with a keyword search. But I think for me, this is this is how you used to do search and extracting knowledge in the in the past. Old school. And the AI way is basically an LLM. So you can ask the LLM, hey, when do they talk about topic X? If you're interested in only a certain part of the episode, you can ask it to give a quick overview of the episode. Key takeaways afterwards also to create a note for you. So this is really like very open, open ended. And yeah. And then finally, the snipping feature that we mentioned just to reiterate. Yeah. I mean, here the the feature is that whenever you hear an amazing idea, you can triple tap your headphones or click a button in the app and the AI summarizes the insight you just heard and saves that together with the original transcript and audio in your knowledge library. I also noticed that you you skip dynamic content. So dynamic content, we do not skip it automatically. Oh, sorry. You detect. But we detect it. Yeah.
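A hedged sketch of the two-stage book feature as described: one LLM pass over the transcript extracts candidate titles, and a second, web-connected call enriches each hit with author and summary. Perplexity's OpenAI-compatible API is assumed here, and the model names are illustrative only.

```python
# Two-stage book feature: extract candidate titles, then enrich each one
# with a web-connected model. Sketch only; prompts and models are assumptions.
import json
from openai import OpenAI

extractor = OpenAI()  # regular LLM for the extraction pass
searcher = OpenAI(base_url="https://api.perplexity.ai", api_key="PPLX_API_KEY")  # web pass

def extract_books(transcript: str) -> list[str]:
    resp = extractor.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "List book titles recommended in this transcript "
                                          "as a JSON array of strings. Return [] if none."},
            {"role": "user", "content": transcript},
        ],
    )
    # Assumes the model complies with the format; production code validates and retries.
    return json.loads(resp.choices[0].message.content)

def enrich(title: str) -> str:
    resp = searcher.chat.completions.create(
        model="sonar",  # illustrative model name
        messages=[{"role": "user", "content": f"Who wrote '{title}'? Give a one-sentence "
                                              f"summary of the book and a short author bio."}],
    )
    return resp.choices[0].message.content

for book in extract_books("…full episode transcript…"):
    print(book, "->", enrich(book))
```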
I mean, that's one of the things that most people don't don't actually know, that like the way that ads get inserted into podcasts or into most podcasts is actually that every time you listen to a podcast, you actually get access to a different audio file and on the server, a different ad is inserted into the MP3 file automatically. Yeah. Based on IP. Exactly. And what that means is if we transcribe an episode and have a transcript with timestamps, like word-specific timestamps, if you suddenly get a different audio file, like the whole timestamps are messed up and that's like a huge issue. And for that, we actually had to build another algorithm that would dynamically, on the fly, re-sync the audio that you're listening to with the transcript that we have. Yeah. Which is a fascinating problem in and of itself.swyx [00:27:24]: You sync by matching up the sound waves? Or like, or do you sync by matching up words like you basically do partial transcription?Kevin [00:27:33]: We are not matching up words. It's happening basically on a bytes-level matching. Yeah. Okay.swyx [00:27:40]: It relies on this. It relies on the exact match at some point.Kevin [00:27:46]: So it's actually. We're actually not doing exact matches, but we're doing fuzzy matches to identify the moment. It's basically, we basically built Shazam for podcasts. Just as a little side project to solve this issue.swyx [00:28:02]: Actually, fun fact, apparently the Shazam algorithm is open. They published the paper, it's talked about it. I haven't really dived into the paper. I thought it was kind of interesting that basically no one else has built Shazam.Kevin [00:28:16]: Yeah, I mean, well, the one thing is the algorithm. If you now talk about Shazam, the other thing is also having the database behind it and having the user mindset that if they have this problem, they come to you, right?swyx [00:28:29]: Yeah, I'm very interested in the tech stack. There's a big data pipeline. Could you share what is the tech stack?Kevin [00:28:35]: What are the most interesting or challenging pieces of it? So the general tech stack is our entire backend is, or 90% of our backend is written in Python. Okay. Hosting everything on Google Cloud Platform. And our front end is written with, well, we're using the Flutter framework. So it's written in Dart and then compiled natively. So we have one code base that handles both Android and iOS. You think that was a good decision? It's something that a lot of people are exploring. So up until now, yes. Okay. Look, it has its pros and cons. Some of the, you know, for example, earlier, I mentioned we have an Apple Watch app. Yeah. I mean, there's no Flutter for that, right? So that you build native. And then of course you have to sort of like sync these things together. I mean, I'm not the front end engineer, so I'm just relaying this information, but our front end engineers are very happy with it. It's enabled us to be quite fast and be on both platforms from the very beginning. And when I talk with people and they hear that we are using Flutter, usually they think like, ah, it's not performant. It's super junk, janky and everything. And then they use it. They use our app and they're always super surprised. Or if they've already used our app and I then tell them, they're like, what? Yeah. Um, so there is actually a lot that you can do with it.
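Since the fuzzy-matching idea is easy to miss, here is a toy illustration of aligning a known episode against a user's ad-shifted copy: fingerprint short frames by the sign of energy changes, then slide the reference fingerprint to find the offset with the best agreement. Real systems, Shazam included, hash spectrogram peaks instead, but the alignment idea is the same; nothing below is Snipd's actual algorithm.

```python
# Toy fuzzy audio alignment: a 1-bit-per-frame fingerprint plus a sliding
# comparison recovers how far the ad insertion shifted the known audio.
import numpy as np

def fingerprint(audio: np.ndarray, frame: int = 400) -> np.ndarray:
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    return (np.diff(energy) > 0).astype(np.int8)  # 1 bit per frame transition

def find_offset(reference: np.ndarray, user: np.ndarray, frame: int = 400) -> int:
    ref_fp, usr_fp = fingerprint(reference, frame), fingerprint(user, frame)
    best, best_off = -1.0, 0
    for off in range(len(usr_fp) - len(ref_fp) + 1):
        score = (usr_fp[off: off + len(ref_fp)] == ref_fp).mean()  # fuzzy, not exact
        if score > best:
            best, best_off = score, off
    return best_off * frame  # sample offset by which the transcript must be shifted

rng = np.random.default_rng(0)
show = rng.standard_normal(80_000)      # the audio we transcribed
ad = rng.standard_normal(12_000)        # server-side dynamically inserted ad
user_audio = np.concatenate([ad, show]) # what this particular listener received
print(find_offset(show[:20_000], user_audio))  # ~12000 -> re-align the timestamps
```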
Two, you know, they're optimized for Android first. So iOS is like a second, second thought, or like you can feel that it is not a native iOS app. Uh, but you guys put a lot of care into it. And then maybe three, from my point of view, JavaScript, as a JavaScript guy, React Native was supposed to be there. And I think that it hasn't really fulfilled that dream. Um, maybe Expo is trying to do that, but, um, again, it is not, does not feel as productive as Flutter. And I've, I spent a week on Flutter and dot, and I'm an investor in Flutter flow, which is the local, uh, Flutter, Flutter startup. That's doing very, very well. I think a lot of people are still Flutter skeptics. Yeah. Wait. So are you moving away from Flutter?Kevin [00:30:41]: I don't know. We don't have plans to do that. Yeah.swyx [00:30:43]: You're just saying about that. What? Yeah. Watch out. Okay. Let's go back to the stack.Kevin [00:30:47]: You know, that was just to give you a bit of an overview. I think the more interesting things are, of course, on the AI side. So we, like, as I mentioned earlier, when we started out, it was before chat GPT for the chat GPT moment before there was the GPT 3.5 turbo, uh, API. So in the beginning, we actually were running everything ourselves, open source models, try to fine tune them. They worked. There was us, but let's, let's be honest. They weren't. What was the sort of? Before Whisper, the transcription. Yeah, we were using wave to work like, um, there was a Google one, right? No, it was a Facebook, Facebook one. That was actually one of the papers. Like when that came out for me, that was one of the reasons why I said we, we should try something to start a startup in the audio space. For me, it was a bit like before that I had been following the NLP space, uh, quite closely. And as, as I mentioned earlier, we, we did some stuff at the startup as well, that I was working up. But before, and wave to work was the first paper that I had at least seen where the whole transformer architecture moved over to audio and bit more general way of saying it is like, it was the first time that I saw the transformer architecture being applied to continuous data instead of discrete tokens. Okay. And it worked amazingly. Ah, and like the transformer architecture plus self-supervised learning, like these two things moved over. And then for me, it was like, Hey, this is now going to take off similarly. It's the text space has taken off. And with these two things in place, even if some features that we want to build are not possible yet, they will be possible in the near term, uh, with this, uh, trajectory. So that was a little side, side note. No, it's in the meantime. Yeah. We're using whisper. We're still hosting some of the models ourselves. So for example, the whole transcription speaker diarization pipeline, uh,swyx [00:32:38]: You need it to be as cheap as possible.Kevin [00:32:40]: Yeah, exactly. I mean, we're doing this at scale where we have a lot of audio.swyx [00:32:44]: We're what numbers can you disclose? Like what, what are just to give people an idea because it's a lot. So we have more than a million podcasts that we've already processed when you say a million. So processing is basically, you have some kind of list of podcasts that you will auto process and others where a paying pay member can choose to press the button and transcribe it. Right. Is that the rough idea? Yeah, exactly.Kevin [00:33:08]: Yeah. And if, when you press that button or we also transcribe it. Yeah. 
So first we do the, we do the transcription. We do the. The, the speaker diarization. So basically you identify speech blocks that belong to the same speaker. This is then all orchestrated within, within LLM to identify which speech speech block belongs to which speaker together with, you know, we identify, as I mentioned earlier, we identify the guest name and the bio. So all of that comes together with an LLM to actually then assign assigned speaker names to, to each block. Yeah. And then most of the rest of the, the pipeline we've now used, we've now migrated to LLM. So we use mainly open AI, Google models, so the Gemini models and the open AI models, and we use some perplexity basically for those things where we need, where we need web search. Yeah. That's something I'm still hoping, especially open AI will also provide us an API. Oh, why? Well, basically for us as a consumer, the more providers there are.swyx [00:34:07]: The more downtime.Kevin [00:34:08]: The more competition and it will lead to better, better results. And, um, lower costs over time. I don't, I don't see perplexity as expensive. If you use the web search, the price is like $5 per a thousand queries. Okay. Which is affordable. But, uh, if you compare that to just a normal LLM call, um, it's, it's, uh, much more expensive. Have you tried Exa? We've, uh, looked into it, but we haven't really tried it. Um, I mean, we, we started with perplexity and, uh, it works, it works well. And if I remember. Correctly, Exa is also a bit more expensive.swyx [00:34:45]: I don't know. I don't know. They seem to focus on the search thing as a search API, whereas perplexity, maybe more consumer-y business that is higher, higher margin. Like I'll put it like perplexity is trying to be a product, Exa is trying to be infrastructure. Yeah. So that, that'll be my distinction there. And then the other thing I will mention is Google has a search grounding feature. Yeah. Which you, which you might want. Yeah.Kevin [00:35:07]: Yeah. We've, uh, we've also tried that out. Um, not as good. So we, we didn't, we didn't go into. Too much detail in like really comparing it, like quality wise, because we actually already had the perplexity one and it, and it's, and it's working. Yeah. Um, I think also there, the price is actually higher than perplexity. Yeah. Really? Yeah.swyx [00:35:26]: Google should cut their prices.Kevin [00:35:29]: Maybe it was the same price. I don't want to say something incorrect, but it wasn't cheaper. It wasn't like compelling. And then, then there was no reason to switch. So, I mean, maybe like in general, like for us, given that we do work with a lot of content, price is actually something that we do look at. Like for us, it's not just about taking the best model for every task, but it's really getting the best, like identifying what kind of intelligence level you need and then getting the best price for that to be able to really scale this and, and provide us, um, yeah, let our users use these features with as many podcasts as possible. Yeah.swyx [00:36:03]: I wanted to double, double click on diarization. Yeah. Uh, it's something that I don't think people do very well. So you know, I'm, I'm a, I'm a B user. I don't have it right now. And, and they were supposed to speak, but they dropped out last minute. Um, but, uh, we've had them on the podcast before and it's not great yet. 
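The LLM step that turns anonymous diarization labels into names might look roughly like this sketch: hand the model a few blocks per speaker plus the known host and guest names and ask for a mapping. Prompt, model choice, and data shapes are assumptions, not Snipd's actual code.

```python
# LLM-orchestrated speaker naming: map diarization labels to known people.
import json
from openai import OpenAI

client = OpenAI()

def assign_speakers(blocks: list[dict], known_people: list[str]) -> dict[str, str]:
    sample = blocks[:20]  # a handful of blocks per label is usually enough context
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Map each diarization label to one of the known "
                                          'people. Reply with JSON like {"SPEAKER_00": "Name"}.'},
            {"role": "user", "content": json.dumps({"people": known_people, "blocks": sample})},
        ],
    )
    return json.loads(resp.choices[0].message.content)

blocks = [
    {"speaker": "SPEAKER_00", "text": "Welcome back to the show, great to have you here."},
    {"speaker": "SPEAKER_01", "text": "Thanks for having me, excited to talk about Snipd."},
]
print(assign_speakers(blocks, ["swyx (host)", "Kevin Ben Smith (guest)"]))
```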
Do you use just pyannote, the default stuff, or do you find any tricks for diarization?Kevin [00:36:27]: So we do use the, the open source packages, but we have tweaked it a bit here and there. For example, you mentioned the Bee AI guys, I actually listened to the podcast episode, it was super nice. Thank you. And when you started talking about speaker diarization, and I just have to think about, uh, I don't know.Kevin [00:36:49]: Is it possible? I don't know. I don't know. F**k this. Yeah, no, I don't know.Kevin [00:36:55]: Yeah. We are the best. This is a.swyx [00:37:07]: I don't know. This is the best. I don't know. This is the best. Yeah. Yeah. Yeah. You're doing good.Kevin [00:37:12]: So, so yeah. This is great. This is good. Yeah. No, so that of course helps us. Another thing that helps us is that we know certain structural aspects of the podcast. For example, how often does someone speak? Like if someone, like let's say there's a one hour episode and someone speaks for 30 seconds, that person is most probably not the guest and not the host. It's probably some ad, like some speaker from an ad. So we have like certain of these heuristics that we can use and we leverage to improve things. And in the past, we've also changed the clustering algorithm. So basically how a lot of the speaker diarization works is you basically create an embedding for the speech that's happening. And then you try to somehow cluster these embeddings and then find out this is all one speaker. This is all another speaker. And there we've also tweaked a couple of things where we again used heuristics that we could apply from knowing how podcasts function. And that's also actually why I was feeling so much for the Bee AI guys, because like all of these heuristics, like for them, it's probably almost impossible to use any heuristics because it can just be any situation, anything.Kevin [00:38:34]: So that's one thing that we do. Yeah, another thing is that we actually combine it with LLMs. So the transcript, LLMs and the speaker diarization, like bringing all of these together to recalibrate some of the switching points. Like when does the speaker stop? When does the next one start?swyx [00:38:51]: The LLMs can add errors as well. You know, I wouldn't feel safe using them to be so precise.Kevin [00:38:58]: I mean, at the end of the day, like also just to not give a wrong impression, like the speaker diarization is also not perfect that we're doing, right? I basically don't really notice it.swyx [00:39:08]: Like I use it for search.Kevin [00:39:09]: Yeah, it's not perfect yet, but it's gotten quite good. Like, especially if you compare, if you look at some of the, like if you take a latest episode and you compare it to an episode that came out a year ago, we've improved it quite a bit.swyx [00:39:23]: Well, it's beautifully presented. Oh, I love that I can click on the transcript and it goes to the timestamp. So simple, but you know, it should exist. Yeah, I agree. I agree. So this, I'm loading a two hour episode of Detect Me Right Home, where there's a lot of different guests calling in and you've identified the guest name. And yeah, so these are all LLM based. Yeah, it's really nice.Kevin [00:39:49]: Yeah, like the speaker names.swyx [00:39:50]: I would say that, you know, obviously I'm a power user of all these tools. You have done a better job than Descript. Okay, wow. Descript has so much funding. They had OpenAI invest in them and they still suck. So I don't know, like, you know, keep going.
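The talk-time heuristic Kevin mentions is simple enough to show in full: speakers who account for only a sliver of a long episode are treated as probable ad reads rather than hosts or guests. The threshold below is made up for illustration.

```python
# Talk-time heuristic: drop diarization clusters with a negligible share of airtime.
from collections import defaultdict

def main_speakers(segments: list[dict], min_share: float = 0.02) -> set[str]:
    total = sum(s["end"] - s["start"] for s in segments)
    talk = defaultdict(float)
    for s in segments:
        talk[s["speaker"]] += s["end"] - s["start"]
    return {spk for spk, t in talk.items() if t / total >= min_share}

segments = [
    {"speaker": "SPEAKER_00", "start": 0, "end": 1800},     # host
    {"speaker": "SPEAKER_01", "start": 1800, "end": 3570},  # guest
    {"speaker": "SPEAKER_02", "start": 3570, "end": 3600},  # 30-second ad read
]
print(main_speakers(segments))  # {'SPEAKER_00', 'SPEAKER_01'}
```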
You're doing great. Yeah, thanks. Thanks.Kevin [00:40:12]: I mean, I would, I would say that, especially for anyone listening who's interested in building a consumer app with AI, I think the, like, especially if your background is in AI and you love working with AI and doing all of that, I think the most important thing is just to keep reminding yourself of what's actually the job to be done here. Like, what does actually the consumer want? Like, for example, you now were just delighted by the ability to click on this word and it jumps there. Yeah. Like, this is not, this is not rocket science. This is, like, you don't have to be, like, I don't know, Andrej Karpathy to come up with that and build that, right? And I think that's, that's something that's super important to keep in mind.swyx [00:40:52]: Yeah, yeah. Amazing. I mean, there's so many features, right? It's, it's so packed. There's quotes that you pick up. There's summarization. Oh, by the way, I'm going to use this as my official feature request. I want to customize what, how it's summarized. I want to, I want to have a custom prompt. Yeah. Because your summarization is good, but, you know, I have different preferences, right? Like, you know.Kevin [00:41:14]: So one thing that you can already do today, I completely get your feature request. And I think it just.swyx [00:41:18]: I'm sure people have asked it.Kevin [00:41:19]: I mean, maybe just in general as a, as a, how I see the future, you know, like in the future, I think all, everything will be personalized. Yeah, yeah. Like, not, this is not specific to us. Yeah. And today we're still in a, in a phase where the cost of LLMs, at least if you're working with, like, such long context windows as us. I mean, there's a lot of tokens in it, if you take an entire podcast, so you still have to take that cost into consideration. So if for every single user, we regenerate it entirely, it gets expensive. But in the future, this, you know, cost will continue to go down and then it will just be personalized. So that being said, you can already today, if you go to the player screen. Okay. And open up the chat. Yeah. You can go to the, to the chat. Yes. And just ask for a summary in your style.swyx [00:42:13]: Yeah. Okay. I mean, I, I listen to consume, you know? Yeah. Yeah. I, I've never really used this feature. I don't know. I think that's, that's me being a slow adopter. No, no. I mean, that's. It has, when does the conversation start? Okay.Kevin [00:42:26]: I mean, you can just type anything. I think what you're, what you're describing, I mean, maybe that is also an interesting topic to talk about. Yes. Where, like, basically I told you, like, look, we have this chat. You can just ask for it. Yeah. And this is, this is how ChatGPT works today. But if you're building a consumer app, you have to move beyond the chat box. People do not want to always type out what they want. So your feature request was, even though theoretically it's already possible, what you are actually asking for is, hey, I just want to open up the app and it should just be there in a nicely formatted, beautiful way such that I can read it or consume it without any issues. Interesting. And I think that's in general where a lot of the opportunities lie currently in the market.
If you want to build a consumer app, taking the capability and the intelligence, but finding out what the actual user interface is, the best way a user can engage with this intelligence in a natural way.swyx [00:43:24]: This is something I've been thinking about as kind of like AI that's not in your face. Because right now, you know, we like to say like, oh, Notion has Notion AI. And we have the little thing there. And there's, or like some other. Any other platform has like the sparkle magic wand emoji, like that's our AI feature. Use this. And it's like really in your face. A lot of people don't like it. You know, it should just kind of become invisible, kind of like an invisible AI.Kevin [00:43:49]: 100%. I mean, the, the way I see it is, AI is the electricity of, of the future. And like no one, like, like we don't talk about, I don't know, this, this microphone uses electricity, this phone, you don't think about it that way. It's just in there, right? It's not an electricity enabled product. No, it's just a product. Yeah. It will be the same with AI. I mean, now. It's still a, something that you use to market your product. I mean, we do, we do the same, right? Because it's still something that people realize, ah, they're doing something new, but at some point, no, it'll just be a podcast app and it will be normal that it has all of this AI in there.swyx [00:44:24]: I noticed you do something interesting in your chat where you source the timestamps. Yeah. Is that part of this prompt? Is there a separate pipeline that adds the sources?Kevin [00:44:33]: This is, uh, actually part of the prompt. Um, so this is all prompt engineering, um, uh, you should be able to click on it. Yeah, I clicked on it. Um, this is all prompt engineering with how to provide the, the context, you know, we, because we provide all of the transcript, how to provide the context and then, yeah, I get the model to respond in a correct way with a certain format and then rendering that on the front end. This is one of the examples where I would say it's so easy to create like a quick demo of this. I mean, you can just go to ChatGPT, paste this thing in and say like, yeah, do this. Okay. Like 15 minutes and you're done. Yeah. But getting this to like then production level that it actually works 99% of the time. Okay. This is then where, where the difference lies. Yeah. So, um, for this specific feature, like we actually also have like countless regexes that are just there to correct certain things that the LLM is doing because it doesn't always adhere to the format correctly. And then it looks super ugly on the front end. So yeah, we have certain regexes that correct that. And maybe you'd ask like, why don't you use an LLM for that? Because that's sort of the, again, the AI native way, like who uses regexes anymore. But with the chat, for user experience, it's very important that you have the streaming because otherwise you need to wait so long until your message has arrived. So we're streaming live the, like, just like ChatGPT, right? You get the answer and it's streaming the text. So if you're streaming the text and something is like incorrect. It's currently not easy to just like pipe, like stream this into another stream, stream this into another stream and get the stream back, which corrects it, that would be amazing. I don't know, maybe you can answer that. Do you know of any?
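A tiny example of the regex clean-up layer described here, under the assumption that the prompt asks the model to cite moments as [mm:ss] and the front end turns those into clickable seeks; the exact markup is invented for illustration. With streaming, a citation can also be split across chunks, which is part of why this is harder than it looks.

```python
# Normalize the timestamp citations an LLM emits so the front end can render
# them as clickable links, even when the model drifts from the requested format.
import re

CITATION = re.compile(r"[\[\(]\s*(\d{1,2})[:.](\d{2})\s*[\]\)]")

def normalize_citations(chunk: str) -> str:
    # "( 7.05 )" or "[12:34 ]" -> "[07:05]" / "[12:34]" so the app can map it to seek().
    return CITATION.sub(lambda m: f"[{int(m.group(1)):02d}:{m.group(2)}]", chunk)

print(normalize_citations("They discuss diarization at ( 7.05 ) and again [12:34 ]."))
# -> "They discuss diarization at [07:05] and again [12:34]."
```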
If you own the models, you can, uh, you know, whatever token sequence has, has been emitted, start loading that into the next one. If you fully own the models, uh, I don't, it's probably not worth it. That's what you do. It's better. Yeah. I think. Yeah. Most engineers who are new to AI research and benchmarking actually don't know how much regexing there is that goes on in normal benchmarks. It's just like this ugly list of like a hundred different, you know, matches for some criteria that you're looking for. No, it's very cool. I think it's, it's, it's an example of like real world engineering. Yeah. Do you have a tooling that you're proud of that you've developed for yourself?Kevin [00:47:02]: Is it just a test script or is it, you know? I think it's a bit more, I guess the term that has come up is, uh, vibe coding, uh, vibe coding, some, no, sorry, that's actually something else in this case, but, uh, no, no, yes, um, vibe evals was a term that in one of the talks actually on, on, um, I think it might've been the first, the first or the first day at the conference, someone brought that up. Yeah. Uh, because yeah, a lot of the talks were about evals, right. Which is so important. And yeah, I think for us, it's a bit more vibe. Evals, you know, that's also part of, you know, being a startup, we can take risks, like we can take the cost of maybe sometimes it failing a little bit or being a little bit off and our users know that and they appreciate that in return, like we're moving fast and iterating and building, building amazing things, but you know, a Spotify or something like that, half of our features will probably be in a six month review through legal or I don't know what, uh, before they could sell them out.swyx [00:48:04]: Let's just say Spotify is not very good at podcasting. Um, I have a documented, uh, dislike for, for their podcast features, just overall, really, really well integrated any other like sort of LLM focused engineering challenges or problems that you, that you want to highlight.Kevin [00:48:20]: I think it's not unique to us, but it goes again in the direction of handling the uncertainty of LLMs. So for example, with last year, at the end of the year, we did sort of a snipped wrapped. And one of the things we thought it would be fun to, just to do something with, uh, with an LLM and something with the snips that, that a user has. And, uh, three, let's say unique LLM features were that we assigned a personality to you based on the, the snips that, that you have. It was, I mean, it was just all, I guess, a bit of a fun, playful way. I'm going to look up mine. I forgot mine already.swyx [00:48:57]: Um, yeah, I don't know whether it's actually still in the, in the, we all took screenshots of it.Kevin [00:49:01]: Ah, we posted it in the, in the discord. And the, the second one, it was, uh, we had a learning scorecard where we identified the topics that you snipped on the most, and you got like a little score for that. And the third one was a, a quote that stood out. And the quote is actually a very good example of where we would run that for user. And most of the time it was an interesting quote, but every now and then it was like a super boring quotes that you think like, like how, like, why did you select that? Like, come on for there. The solution was actually just to say, Hey, give me five. 
So it extracted five quotes as a candidate, and then we piped it into a different model as a judge, LLM as a judge, and there we use a, um, a much better model because with the, the initial model, again, as, as I mentioned also earlier, we do have to look at the, like the, the costs because it's like, we have so much text that goes into it. So we, there we use a bit more cheaper model, but then the judge can be like a really good model to then just choose one out of five. This is a practical example.swyx [00:50:03]: I can't find it. Bad search in discord. Yeah. Um, so, so you do recommend having a much smarter model as a judge, uh, and that works for you. Yeah. Yeah. Interesting. I think this year I'm very interested in LM as a judge being more developed as a concept, I think for things like, you know, snips, raps, like it's, it's fine. Like, you know, it's, it's, it's, it's entertaining. There's no right answer.Kevin [00:50:29]: I mean, we also have it. Um, we also use the same concept for our books feature where we identify the, the mention. Books. Yeah. Because there it's the same thing, like 90% of the time it, it works perfectly out of the box one shot and every now and then it just, uh, starts identifying books that were not really mentioned or that are not books or made, yeah, starting to make up books. And, uh, they are basically, we have the same thing of like another LLM challenging it. Um, yeah. And actually with the speakers, we do the same now that I think about it. Yeah. Um, so I'm, I think it's a, it's a great technique. Interesting.swyx [00:51:05]: You run a lot of calls.Kevin [00:51:07]: Yeah.swyx [00:51:08]: Okay. You know, you mentioned costs. You move from self hosting a lot of models to the, to the, you know, big lab models, open AI, uh, and Google, uh, non-topic.Kevin [00:51:18]: Um, no, we love Claude. Like in my opinion, Claude is the, the best one when it comes to the way it formulates things. The personality. Yeah. The personality. Okay. I actually really love it. But yeah, the cost is. It's still high.swyx [00:51:36]: So you cannot, you tried Haiku, but you're, you're like, you have to have Sonnet.Kevin [00:51:40]: Uh, like basically we like with Haiku, we haven't experimented too much. We obviously work a lot with 3.5 Sonnet. Uh, also, you know, coding. Yeah. For coding, like in cursor, just in general, also brainstorming. We use it a lot. Um, I think it's a great brainstorm partner, but yeah, with, uh, with, with a lot of things that we've done done, we opted for different models.swyx [00:52:00]: What I'm trying to drive at is how much cheaper can you get if you go from cloud to cloud? Closed models to open models. And maybe it's like 0% cheaper, maybe it's 5% cheaper, or maybe it's like 50% cheaper. Do you have a sense?Kevin [00:52:13]: It's very difficult to, to judge that. I don't really have a sense, but I can, I can give you a couple of thoughts that have gone through our minds over the time, because obviously we do realize like, given that we, we have a couple of tasks where there are just so many tokens going in, um, at some point it will make sense to, to offload some of that. Uh, to an open source model, but going back to like, we're, we're a startup, right? Like we're not an AI lab or whatever, like for us, actually the most important thing is to iterate fast because we need to learn from our users, improve that. And yeah, just this velocity of this, these iterations. 
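The candidate-plus-judge pattern in miniature: a cheap model reads all the tokens and proposes five quotes, and a stronger judge only has to rank five short strings. Model names are placeholders, not the models Snipd actually uses.

```python
# LLM-as-judge over cheap candidates: the expensive model never sees the full transcript.
from openai import OpenAI

client = OpenAI()

def best_quote(transcript: str) -> str:
    candidates = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap: sees all the tokens
        messages=[{"role": "user",
                   "content": "Extract the 5 most striking quotes from this transcript "
                              "as a JSON array of strings:\n" + transcript}],
    ).choices[0].message.content
    return client.chat.completions.create(
        model="gpt-4o",       # judge: only sees the 5 candidates
        messages=[{"role": "user",
                   "content": "Pick the single most interesting, non-boring quote and "
                              "return it verbatim:\n" + candidates}],
    ).choices[0].message.content
```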
And for that, the closed models hosted by open AI, Google is, uh, and swapping, they're just unbeatable because you just, it's just an API call. Yeah. Um, so you don't need to worry about. Yeah. So much complexity behind that. So this is, I would say the biggest reason why we're not doing more in this space, but there are other thoughts, uh, also for the future. Like I see two different, like we basically have two different usage patterns of LLMs where one is this, this pre-processing of a podcast episode, like this initial processing, like the transcription, speaker diarization, chapterization. We do that once. And this, this usage pattern it's, it's quite predictable. Because we know how many podcasts get released when, um, so we can sort of have a certain capacity and we can, we, we're running that 24 seven, it's one big queue running 24 seven.swyx [00:53:44]: What's the queue job runner? Uh, is it a Django, just like the Python one?Kevin [00:53:49]: No, that, that's just our own, like our database and the backend talking to the database, picking up jobs, finding it back. I'm just curious in orchestration and queues. I mean, we, we of course have like, uh, a lot of other orchestration where we're, we're, where we use, uh, the Google pub sub, uh, thing, but okay. So we have this, this, this usage pattern of like very predictable, uh, usage, and we can max out the, the usage. And then there's this other pattern where it's, for example, the snippet where it's like a user, it's a user action that triggers an LLM call and it has to be real time. And there can be moments where it's by usage and there can be moments when there's very little usage for that. There. So that's, that's basically where these LLM API calls are just perfect because you don't need to worry about scaling this up, scaling this down, um, handling, handling these issues. Serverless versus serverful.swyx [00:54:44]: Yeah, exactly. Okay.Kevin [00:54:45]: Like I see them a bit, like I see open AI and all of these other providers, I see them a bit as the, like as the Amazon, sorry, AWS of, of AI. So it's a bit similar how like back before AWS, you would have to have your, your servers and buy new servers or get rid of servers. And then with AWS, it just became so much easier to just ramp stuff up and down. Yeah. And this is like the taking it even, even, uh, to the next level for AI. Yeah.swyx [00:55:18]: I am a big believer in this. Basically it's, you know, intelligence on demand. Yeah. We're probably not using it enough in our daily lives to do things. I should, we should be able to spin up a hundred things at once and go through things and then, you know, stop. And I feel like we're still trying to figure out how to use LLMs in our lives effectively. Yeah. Yeah.Kevin [00:55:38]: 100%. I think that goes back to the whole, like that, that's for me where the big opportunity is for, if you want to do a startup, um, it's not about, but you can let the big labs handleswyx [00:55:48]: the challenge of more intelligence, but, um, it's the... Existing intelligence. How do you integrate? How do you actually incorporate it into your life? AI engineering. Okay, cool. Cool. Cool. Cool. Um, the one, one other thing I wanted to touch on was multimodality in frontier models. Dwarcash had a interesting application of Gemini recently where he just fed raw audio in and got diarized transcription out or timestamps out. And I think that will come. 
So basically what we're saying here is another wave of transformers eating things because right now models are pretty much single modality things. You know, you have whisper, you have a pipeline and everything. Yeah. You can't just say, Oh, no, no, no, we only fit like the raw, the raw files. Do you think that will be realistic for you? I 100% agree. Okay.Kevin [00:56:38]: Basically everything that we talked about earlier with like the speaker diarization and heuristics and everything, I completely agree. Like in the, in the future that would just be put everything into a big multimodal LLM. Okay. And it will output, uh, everything that you want. Yeah. So I've also experimented with that. Like just... With, with Gemini 2? With Gemini 2.0 Flash. Yeah. Just for fun. Yeah. Yeah. Because the big difference right now is still like the cost difference of doing speaker diarization this way or doing transcription this way is a huge difference to the pipeline that we've built up. Huh. Okay.swyx [00:57:15]: I need to figure out what, what that cost is because in my mind 2.0 Flash is so cheap. Yeah. But maybe not cheap enough for you.Kevin [00:57:23]: Uh, no, I mean, if you compare it to, yeah, whisper and speaker diarization and especially self-hosting it and... Yeah. Yeah. Yeah.swyx [00:57:30]: Yeah.Kevin [00:57:30]: Okay. But we will get there, right? Like this is just a question of time.swyx [00:57:33]: And, um, at some point, as soon as that happens, we'll be the first ones to switch. Yeah. Awesome. Anything else that you're like sort of eyeing on the horizon as like, we are thinking about this feature, we're thinking about incorporating this new functionality of AI into our, into our app? Yeah.Kevin [00:57:50]: I mean, we, there's so many areas that we're thinking about, like our challenge is a bit more... Choosing. Yeah. Choosing. Yeah. So, I mean, I think for me, like looking into like the next couple of years, like the big areas that interest us a lot, basically four areas, like one is content. Um, right now it's, it's podcasts. I mean, you did mention, I think you mentioned like you can also upload audio books and YouTube videos. YouTube. I actually use the YouTube one a fair amount. But in the future, we, we want to also have audio books natively in the app. And, uh, we want to enable AI generated content. Like just think of, take deep research and notebook analysis. Like put these together. That should be, that should be in our app. The second area is discovery. I think in general. Yeah.swyx [00:58:38]: I noticed that you don't have, so you

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

While everyone is now repeating that 2025 is the “Year of the Agent”, OpenAI is heads down building towards it. In the first 2 months of the year they released Operator and Deep Research (arguably the most successful agent archetype so far), and today they are bringing a lot of those capabilities to the API:

* Responses API
* Web Search Tool
* Computer Use Tool
* File Search Tool
* A new open source Agents SDK with integrated Observability Tools

We cover all this and more in today's lightning pod on YouTube! More details here:

Responses API

In our Michelle Pokrass episode we talked about the Assistants API needing a redesign. Today OpenAI is launching the Responses API, “a more flexible foundation for developers building agentic applications”. It's a superset of the chat completions API, and the suggested starting point for developers working with OpenAI models. One of the big upgrades is the new set of built-in tools for the Responses API: Web Search, Computer Use, and Files.

Web Search Tool

We previously had Exa AI on the podcast to talk about web search for AI. OpenAI is also now joining the race; the Web Search API is actually a new “model” that exposes two 4o fine-tunes: gpt-4o-search-preview and gpt-4o-mini-search-preview. These are the same models that power ChatGPT Search, and are priced at $30/1000 queries and $25/1000 queries respectively. The killer feature is inline citations: you not only get a link to a page, but also a deep link to exactly where your query was answered in the result page.

Computer Use Tool

The model that powers Operator, called Computer-Using Agent (CUA), is also now available in the API. The computer-use-preview model is SOTA on most benchmarks, achieving 38.1% success on OSWorld for full computer use tasks, 58.1% on WebArena, and 87% on WebVoyager for web-based interactions. As you will notice in the docs, `computer-use-preview` is both a model and a tool through which you can specify the environment. Usage is priced at $3/1M input tokens and $12/1M output tokens, and it's currently only available to users in tiers 3-5.

File Search Tool

File Search was also available in the Assistants API, and it's now coming to Responses too. OpenAI is bringing search + RAG all under one umbrella, and we'll definitely see more people trying to find new ways to build all-in-one apps on OpenAI. Usage is priced at $2.50 per thousand queries and file storage at $0.10/GB/day, with the first GB free.

Agent SDK: Swarms++!

https://github.com/openai/openai-agents-python

To bring it all together, after the viral reception to Swarm, OpenAI is releasing an officially supported agents framework (which was previewed at our AI Engineer Summit) with 4 core pieces:

* Agents: Easily configurable LLMs with clear instructions and built-in tools.
* Handoffs: Intelligently transfer control between agents.
* Guardrails: Configurable safety checks for input and output validation.
* Tracing & Observability: Visualize agent execution traces to debug and optimize performance.

Multi-agent workflows are here to stay! OpenAI now explicitly designs for a set of common agentic patterns: Workflows, Handoffs, Agents-as-Tools, LLM-as-a-Judge, Parallelization, and Guardrails.
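To make the launch concrete, here is a sketch of a Responses API call with the built-in web search tool turned on. The tool and field names follow the launch description above; check OpenAI's current docs before relying on them, and treat the model name as a placeholder.

```python
# Responses API call with the built-in web search tool.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # built-in tool, no function schema required
    input="What did Latent Space publish this week?",
)

# output_text concatenates the text parts of the answer; citations (including deep
# links into the source pages) come back as annotations on the output items.
print(response.output_text)
```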
OpenAI previewed this in part 2 of their talk at NYC:Further coverage of the launch from Kevin Weil, WSJ, and OpenAIDevs, AMA here.Show Notes* Assistants API* Swarm (OpenAI)* Fine-Tuning in AI* 2024 OpenAI DevDay Recap with Romain* Michelle Pokrass episode (API lead)Timestamps* 00:00 Intros* 02:31 Responses API * 08:34 Web Search API * 17:14 Files Search API * 18:46 Files API vs RAG * 20:06 Computer Use / Operator API * 22:30 Agents SDKAnd of course you can catch up with the full livestream here:TranscriptAlessio [00:00:03]: Hey, everyone. Welcome back to another Latent Space Lightning episode. This is Alessio, partner and CTO at Decibel, and I'm joined by Swyx, founder of Small AI.swyx [00:00:11]: Hi, and today we have a super special episode because we're talking with our old friend Roman. Hi, welcome.Romain [00:00:19]: Thank you. Thank you for having me.swyx [00:00:20]: And Nikunj, who is most famously, if anyone has ever tried to get any access to anything on the API, Nikunj is the guy. So I know your emails because I look forward to them.Nikunj [00:00:30]: Yeah, nice to meet all of you.swyx [00:00:32]: I think that we're basically convening today to talk about the new API. So perhaps you guys want to just kick off. What is OpenAI launching today?Nikunj [00:00:40]: Yeah, so I can kick it off. We're launching a bunch of new things today. We're going to do three new built-in tools. So we're launching the web search tool. This is basically chat GPD for search, but available in the API. We're launching an improved file search tool. So this is you bringing your data to OpenAI. You upload it. We, you know, take care of parsing it, chunking it. We're embedding it, making it searchable, give you this like ready vector store that you can use. So that's the file search tool. And then we're also launching our computer use tool. So this is the tool behind the operator product in chat GPD. So that's coming to developers today. And to support all of these tools, we're going to have a new API. So, you know, we launched chat completions, like I think March 2023 or so. It's been a while. So we're looking for an update over here to support all the new things that the models can do. And so we're launching this new API. It is, you know, it works with tools. We think it'll be like a great option for all the future agentic products that we build. And so that is also launching today. Actually, the last thing we're launching is the agents SDK. We launched this thing called Swarm last year where, you know, it was an experimental SDK for people to do multi-agent orchestration and stuff like that. It was supposed to be like educational experimental, but like people, people really loved it. They like ate it up. And so we are like, all right, let's, let's upgrade this thing. Let's give it a new name. And so we're calling it the agents SDK. It's going to have built-in tracing in the OpenAI dashboard. So lots of cool stuff going out. So, yeah.Romain [00:02:14]: That's a lot, but we said 2025 was the year of agents. So there you have it, like a lot of new tools to build these agents for developers.swyx [00:02:20]: Okay. I guess, I guess we'll just kind of go one by one and we'll leave the agents SDK towards the end. So responses API, I think the sort of primary concern that people have and something I think I've voiced to you guys when, when, when I was talking with you in the, in the planning process was, is chat completions going away? 
So I just wanted to let it, let you guys respond to the concerns that people might have.Romain [00:02:41]: Chat completion is definitely like here to stay, you know, it's a bare metal API we've had for quite some time. Lots of tools built around it. So we want to make sure that it's maintained and people can confidently keep on building on it. At the same time, it was kind of optimized for a different world, right? It was optimized for a pre-multi-modality world. We also optimized for kind of single turn. It takes two problems. It takes prompt in, it takes response out. And now with these agentic workflows, we, we noticed that like developers and companies want to build longer horizon tasks, you know, like things that require multiple returns to get the task accomplished. And computer use is one of those, for instance. And so that's why the responses API came to life to kind of support these new agentic workflows. But chat completion is definitely here to stay.swyx [00:03:27]: And assistance API, we've, uh, has a target sunset date of first half of 2020. So this is kind of like, in my mind, there was a kind of very poetic mirroring of the API with the models. This, I kind of view this as like kind of the merging of assistance API and chat completions, right. Into one unified responses. So it's kind of like how GPT and the old series models are also unifying.Romain [00:03:48]: Yeah, that's exactly the right, uh, that's the right framing, right? Like, I think we took the best of what we learned from the assistance API, especially like being able to access tools very, uh, very like conveniently, but at the same time, like simplifying the way you have to integrate, like, you no longer have to think about six different objects to kind of get access to these tools with the responses API. You just get one API request and suddenly you can weave in those tools, right?Nikunj [00:04:12]: Yeah, absolutely. And I think we're going to make it really easy and straightforward for assistance API users to migrate over to responsive. Right. To the API without any loss of functionality or data. So our plan is absolutely to add, you know, assistant like objects and thread light objects to that, that work really well with the responses API. We'll also add like the code interpreter tool, which is not launching today, but it'll come soon. And, uh, we'll add async mode to responses API, because that's another difference with, with, uh, assistance. I will have web hooks and stuff like that, but I think it's going to be like a pretty smooth transition. Uh, once we have all of that in place. And we'll be. Like a full year to migrate and, and help them through any issues they, they, they face. So overall, I feel like assistance users are really going to benefit from this longer term, uh, with this more flexible, primitive.Alessio [00:05:01]: How should people think about when to use each type of API? So I know that in the past, the assistance was maybe more stateful, kind of like long running, many tool use kind of like file based things. And the chat completions is more stateless, you know, kind of like traditional completion API. Is that still the mental model that people should have? Or like, should you buy the.Nikunj [00:05:20]: So the responses API is going to support everything that it's at launch, going to support everything that chat completion supports, and then over time, it's going to support everything that assistance supports. So it's going to be a pretty good fit for anyone starting out with open AI. 
Uh, they should be able to like go to responses responses, by the way, also has a stateless mode, so you can pass in store false and they'll make the whole API stateless, just like chat completions. You're really trying to like get this unification. A story in so that people don't have to juggle multiple endpoints. That being said, like chat completions, just like the most widely adopted API, it's it's so popular. So we're still going to like support it for years with like new models and features. But if you're a new user, you want to or if you want to like existing, you want to tap into some of these like built in tools or something, you should feel feel totally fine migrating to responses and you'll have more capabilities and performance than the tech completions.swyx [00:06:16]: I think the messaging that I agree that I think resonated the most. When I talked to you was that it is a strict superset, right? Like you should be able to do everything that you could do in chat completions and with assistants. And the thing that I just assumed that because you're you're now, you know, by default is stateful, you're actually storing the chat logs or the chat state. I thought you'd be charging me for it. So, you know, to me, it was very surprising that you figured out how to make it free.Nikunj [00:06:43]: Yeah, it's free. We store your state for 30 days. You can turn it off. But yeah, it's it's free. And the interesting thing on state is that it just like makes particularly for me, it makes like debugging things and building things so much simpler, where I can like create a responses object that's like pretty complicated and part of this more complex application that I've built, I can just go into my dashboard and see exactly what happened that mess up my prompt that is like not called one of these tools that misconfigure one of the tools like the visual observability of everything that you're doing is so, so helpful. So I'm excited, like about people trying that out and getting benefits from it, too.swyx [00:07:19]: Yeah, it's a it's really, I think, a really nice to have. But all I'll say is that my friend Corey Quinn says that anything that can be used as a database will be used as a database. So be prepared for some abuse.Romain [00:07:34]: All right. Yeah, that's a good one. Some of that I've tried with the metadata. That's some people are very, very creative at stuffing data into an object. Yeah.Nikunj [00:07:44]: And we do have metadata with responses. Exactly. Yeah.Alessio [00:07:48]: Let's get through it. All of these. So web search. I think the when I first said web search, I thought you were going to just expose a API that then return kind of like a nice list of thing. But the way it's name is like GPD for all search preview. So I'm guessing you have you're using basically the same model that is in the chat GPD search, which is fine tune for search. I'm guessing it's a different model than the base one. And it's impressive the jump in performance. So just to give an example, in simple QA, GPD for all is 38% accuracy for all search is 90%. But we always talk about. How tools are like models is not everything you need, like tools around it are just as important. So, yeah, maybe give people a quick review on like the work that went into making this special.Nikunj [00:08:29]: Should I take that?Alessio [00:08:29]: Yeah, go for it.Nikunj [00:08:30]: So firstly, we're launching web search in two ways. One in responses API, which is our API for tools. 
It's going to be available as a web search tool itself. So you'll be able to go tools, turn on web search and you're ready to go. We still wanted to give chat completions people access to real time information. So in that. Chat completions API, which does not support built in tools. We're launching the direct access to the fine tuned model that chat GPD for search uses, and we call it GPD for search preview. And how is this model built? Basically, we have our search research team has been working on this for a while. Their main goal is to, like, get information, like get a bunch of information from all of our data sources that we use to gather information for search and then pick the right things and then cite them. As accurately as possible. And that's what the search team has really focused on. They've done some pretty cool stuff. They use like synthetic data techniques. They've done like all series model distillation to, like, make these four or fine tunes really good. But yeah, the main thing is, like, can it remain factual? Can it answer questions based on what it retrieves and get cited accurately? And that's what this like fine tune model really excels at. And so, yeah, so we're excited that, like, it's going to be directly available in chat completions along with being available as a tool. Yeah.Alessio [00:09:49]: Just to clarify, if I'm using the responses API, this is a tool. But if I'm using chat completions, I have to switch model. I cannot use 01 and call search as a tool. Yeah, that's right. Exactly.Romain [00:09:58]: I think what's really compelling, at least for me and my own uses of it so far, is that when you use, like, web search as a tool, it combines nicely with every other tool and every other feature of the platform. So think about this for a second. For instance, imagine you have, like, a responses API call with the web search tool, but suddenly you turn on function calling. You also turn on, let's say, structure. So you can have, like, the ability to structure any data from the web in real time in the JSON schema that you need for your application. So it's quite powerful when you start combining those features and tools together. It's kind of like an API for the Internet almost, you know, like you get, like, access to the precise schema you need for your app. Yeah.Alessio [00:10:39]: And then just to wrap up on the infrastructure side of it, I read on the post that people, publisher can choose to appear in the web search. So are people by default in it? Like, how can we get Latent Space in the web search API?Nikunj [00:10:53]: Yeah. Yeah. I think we have some documentation around how websites, publishers can control, like, what shows up in a web search tool. And I think you should be able to, like, read that. I think we should be able to get Latent Space in for sure. Yeah.swyx [00:11:10]: You know, I think so. I compare this to a broader trend that I started covering last year of online LLMs. Actually, Perplexity, I think, was the first. It was the first to say, to offer an API that is connected to search, and then Gemini had the sort of search grounding API. And I think you guys, I actually didn't, I missed this in the original reading of the docs, but you even give like citations with like the exact sub paragraph that is matching, which I think is the standard nowadays. I think my question is, how do we take what a knowledge cutoff is for something like this, right? 
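For the second route Nikunj mentions above, chat completions users switch to the search-tuned model directly instead of attaching a tool. A small sketch follows; the model name comes from the launch notes, while the annotation handling is an assumption about the response shape to verify against the docs.

```python
# Chat completions with the search-tuned model (no built-in tools needed).
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-search-preview",
    messages=[{"role": "user", "content": "What is the latest Latent Space episode about?"}],
)

message = completion.choices[0].message
print(message.content)
# Source links ride along as URL citation annotations rather than plain text.
print(getattr(message, "annotations", None))
```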
Because like now, basically there's no knowledge cutoff is always live, but then there's a difference between what the model has sort of internalized in its back propagation and what is searching up its rag.Romain [00:11:53]: I think it kind of depends on the use case, right? And what you want to showcase as the source. Like, for instance, you take a company like Hebbia that has used this like web search tool. They can combine like for credit firms or law firms, they can find like, you know, public information from the internet with the live sources and citation that sometimes you do want to have access to, as opposed to like the internal knowledge. But if you're building something different, well, like, you just want to have the information. If you want to have an assistant that relies on the deep knowledge that the model has, you may not need to have these like direct citations. So I think it kind of depends on the use case a little bit, but there are many, uh, many companies like Hebbia that will need that access to these citations to precisely know where the information comes from.swyx [00:12:34]: Yeah, yeah, uh, for sure. And then one thing on the, on like the breadth, you know, I think a lot of the deep research, open deep research implementations have this sort of hyper parameter about, you know, how deep they're searching and how wide they're searching. I don't see that in the docs. But is that something that we can tune? Is that something you recommend thinking about?Nikunj [00:12:53]: Super interesting. It's definitely not a parameter today, but we should explore that. It's very interesting. I imagine like how you would do it with the web search tool and responsive API is you would have some form of like, you know, agent orchestration over here where you have a planning step and then each like web search call that you do like explicitly goes a layer deeper and deeper and deeper. But it's not a parameter that's available out of the box. But it's a cool. It's a cool thing to think about. Yeah.swyx [00:13:19]: The only guidance I'll offer there is a lot of these implementations offer top K, which is like, you know, top 10, top 20, but actually don't really want that. You want like sort of some kind of similarity cutoff, right? Like some matching score cuts cutoff, because if there's only five things, five documents that match fine, if there's 500 that match, maybe that's what I want. Right. Yeah. But also that might, that might make my costs very unpredictable because the costs are something like $30 per a thousand queries, right? So yeah. Yeah.Nikunj [00:13:49]: I guess you could, you could have some form of like a context budget and then you're like, go as deep as you can and pick the best stuff and put it into like X number of tokens. There could be some creative ways of, of managing cost, but yeah, that's a super interesting thing to explore.Alessio [00:14:05]: Do you see people using the files and the search API together where you can kind of search and then store everything in the file so the next time I'm not paying for the search again and like, yeah, how should people balance that?Nikunj [00:14:17]: That's actually a very interesting question. 
And let me first tell you about how I've seen a really cool way I've seen people use files and search together is they put their user preferences or memories in the vector store and so a query comes in, you use the file search tool to like get someone's like reading preferences or like fashion preferences and stuff like that, and then you search the web for information or products that they can buy related to those preferences and you then render something beautiful to show them, like, here are five things that you might be interested in. So that's how I've seen like file search, web search work together. And by the way, that's like a single responses API call, which is really cool. So you just like configure these things, go boom, and like everything just happens. But yeah, that's how I've seen like files and web work together.Romain [00:15:01]: But I think that what you're pointing out is like interesting, and I'm sure developers will surprise us as they always do in terms of how they combine these tools and how they might use file search as a way to have memory and preferences, like Nikum says. But I think like zooming out, what I find very compelling and powerful here is like when you have these like neural networks. That have like all of the knowledge that they have today, plus real time access to the Internet for like any kind of real time information that you might need for your app and file search, where you can have a lot of company, private documents, private details, you combine those three, and you have like very, very compelling and precise answers for any kind of use case that your company or your product might want to enable.swyx [00:15:41]: It's a difference between sort of internal documents versus the open web, right? Like you're going to need both. Exactly, exactly. I never thought about it doing memory as well. I guess, again, you know, anything that's a database, you can store it and you will use it as a database. That sounds awesome. But I think also you've been, you know, expanding the file search. You have more file types. You have query optimization, custom re-ranking. So it really seems like, you know, it's been fleshed out. Obviously, I haven't been paying a ton of attention to the file search capability, but it sounds like your team has added a lot of features.Nikunj [00:16:14]: Yeah, metadata filtering was like the main thing people were asking us for for a while. And I'm super excited about it. I mean, it's just so critical once your, like, web store size goes over, you know, more than like, you know, 5,000, 10,000 records, you kind of need that. So, yeah, metadata filtering is coming, too.Romain [00:16:31]: And for most companies, it's also not like a competency that you want to rebuild in-house necessarily, you know, like, you know, thinking about embeddings and chunking and, you know, how of that, like, it sounds like very complex for something very, like, obvious to ship for your users. Like companies like Navant, for instance. They were able to build with the file search, like, you know, take all of the FAQ and travel policies, for instance, that you have, you, you put that in file search tool, and then you don't have to think about anything. Now your assistant becomes naturally much more aware of all of these policies from the files.swyx [00:17:03]: The question is, like, there's a very, very vibrant RAG industry already, as you well know. So there's many other vector databases, many other frameworks. 
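The "preferences in a vector store plus live web results" combination described above fits in a single Responses API call, roughly like this. The vector store ID is a made-up placeholder and the tool fields should be checked against the current API reference.

```python
# One Responses API call combining file search (stored preferences) and web search.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    tools=[
        {"type": "file_search", "vector_store_ids": ["vs_user_preferences"]},  # hypothetical store ID
        {"type": "web_search_preview"},
    ],
    input="Based on my reading preferences, find three new articles I might enjoy this week.",
)
print(response.output_text)
```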
Probably if it's an open source stack, I would say like a lot of the AI engineers that I talk to want to own this part of the stack. And it feels like, you know, like, when should we DIY and when should we just use whatever OpenAI offers?Nikunj [00:17:24]: Yeah. I mean, like, if you're doing something completely from scratch, you're going to have more control, right? Like, so super supportive of, you know, people trying to, like, roll up their sleeves, build their, like, super custom chunking strategy and super custom retrieval strategy and all of that. And those are things that, like, will be harder to do with OpenAI tools. OpenAI tool has, like, we have an out-of-the-box solution. We give you the tools. We use some knobs to customize things, but it's more of, like, a managed RAG service. So my recommendation would be, like, start with the OpenAI thing, see if it, like, meets your needs. And over time, we're going to be adding more and more knobs to make it even more customizable. But, you know, if you want, like, the completely custom thing, you want control over every single thing, then you'd probably want to go and hand roll it using other solutions. So we're supportive of both, like, engineers should pick. Yeah.Alessio [00:18:16]: And then we got computer use. Which I think Operator was obviously one of the hot releases of the year. And we're only two months in. Let's talk about that. And that's also, it seems like a separate model that has been fine-tuned for Operator that has browser access.Nikunj [00:18:31]: Yeah, absolutely. I mean, the computer use models are exciting. The cool thing about computer use is that we're just so, so early. It's like the GPT-2 of computer use or maybe GPT-1 of computer use right now. But it is a separate model that has been, you know, the computer. The computer use team has been working on, you send it screenshots and it tells you what action to take. So the outputs of it are almost always tool calls and you're inputting screenshots based on whatever computer you're trying to operate.Romain [00:19:01]: Maybe zooming out for a second, because like, I'm sure your audience is like super, super like AI native, obviously. But like, what is computer use as a tool, right? And what's operator? So the idea for computer use is like, how do we let developers also build agents that can complete tasks for the users, but using a computer? Okay. Or a browser instead. And so how do you get that done? And so that's why we have this custom model, like optimized for computer use that we use like for operator ourselves. But the idea behind like putting it as an API is that imagine like now you want to, you want to automate some tasks for your product or your own customers. Then now you can, you can have like the ability to spin up one of these agents that will look at the screen and act on the screen. So that means able, the ability to click, the ability to scroll. The ability to type and to report back on the action. So that's what we mean by computer use and wrapping it as a tool also in the responses API. So now like that gives a hint also at the multi-turned thing that we were hinting at earlier, the idea that like, yeah, maybe one of these actions can take a couple of minutes to complete because there's maybe like 20 steps to complete that task. But now you can.swyx [00:20:08]: Do you think a computer use can play Pokemon?Romain [00:20:11]: Oh, interesting. I guess we tried it. I guess we should try it. You know?swyx [00:20:17]: Yeah. There's a lot of interest. 
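The screenshots-in, actions-out loop Nikunj and Romain outline looks roughly like the sketch below. The model name matches the launch notes, but the item and field names are assumptions to verify against the docs, and the screenshot and execute helpers are stubs you would implement against your own browser or VM.

```python
# Rough computer-use loop: model proposes an action, you execute it, send back a screenshot.
from openai import OpenAI

client = OpenAI()
TOOL = {"type": "computer_use_preview", "display_width": 1280,
        "display_height": 800, "environment": "browser"}

def take_screenshot() -> str:
    """Return the current screen as a base64 data URL (stub for real capture)."""
    raise NotImplementedError

def execute(action) -> None:
    """Apply the click/scroll/type action to the browser (stub)."""
    raise NotImplementedError

response = client.responses.create(
    model="computer-use-preview",
    tools=[TOOL],
    input="Search the web for the Latent Space podcast and open the latest episode.",
    truncation="auto",
)

# Keep going until the model stops asking for actions.
while any(item.type == "computer_call" for item in response.output):
    call = next(item for item in response.output if item.type == "computer_call")
    execute(call.action)
    response = client.responses.create(
        model="computer-use-preview",
        tools=[TOOL],
        previous_response_id=response.id,
        input=[{
            "type": "computer_call_output",
            "call_id": call.call_id,
            "output": {"type": "computer_screenshot", "image_url": take_screenshot()},
        }],
        truncation="auto",
    )
```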
I think Pokemon really is a good agent benchmark, to be honest. Like it seems like Claude is, Claude is running into a lot of trouble.Romain [00:20:25]: Sounds like we should make that a new eval, it looks like.swyx [00:20:28]: Yeah. Yeah. Oh, and then one more, one more thing before we move on to agents SDK. I know you have a hard stop. There's all these, you know, blah, blah, dash preview, right? Like search preview, computer use preview, right? And you see them all like fine tunes of 4.0. I think the question is, are we, are they all going to be merged into the main branch or are we basically always going to have subsets? Of these models?Nikunj [00:20:49]: Yeah, I think in the early days, research teams at OpenAI like operate with like fine tune models. And then once the thing gets like more stable, we sort of merge it into the main line. So that's definitely the vision, like going out of preview as we get more comfortable with and learn about all the developer use cases and we're doing a good job at them. We'll sort of like make them part of like the core models so that you don't have to like deal with the bifurcation.Romain [00:21:12]: You should think of it this way as exactly what happened last year when we introduced vision capabilities, you know. Yes. Vision capabilities were in like a vision preview model based off of GPT-4 and then vision capabilities now are like obviously built into GPT-4.0. You can think about it the same way for like the other modalities like audio and those kind of like models, like optimized for search and computer use.swyx [00:21:34]: Agents SDK, we have a few minutes left. So let's just assume that everyone has looked at Swarm. Sure. I think that Swarm has really popularized the handoff technique, which I thought was like, you know, really, really interesting for sort of a multi-agent. What is new with the SDK?Nikunj [00:21:50]: Yeah. Do you want to start? Yeah, for sure. So we've basically added support for types. We've made this like a lot. Yeah. Like we've added support for types. We've added support for guard railing, which is a very common pattern. So in the guardrail example, you basically have two things happen in parallel. The guardrail can sort of block the execution. It's a type of like optimistic generation that happens. And I think we've added support for tracing. So I think that's really cool. So you can basically look at the traces that the Agents SDK creates in the OpenAI dashboard. We also like made this pretty flexible. So you can pick any API from any provider that supports the ChatCompletions API format. So it supports responses by default, but you can like easily plug it in to anyone that uses the ChatCompletions API. And similarly, on the tracing side, you can support like multiple tracing providers. By default, it sort of points to the OpenAI dashboard. But, you know, there's like so many tracing providers. There's so many tracing companies out there. And we'll announce some partnerships on that front, too. So just like, you know, adding lots of core features and making it more usable, but still centered around like handoffs is like the main, main concept.Romain [00:22:59]: And by the way, it's interesting, right? Because Swarm just came to life out of like learning from customers directly that like orchestrating agents in production was pretty hard. You know, simple ideas could quickly turn very complex. Like what are those guardrails? What are those handoffs, et cetera? So that came out of like learning from customers. 
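For the handoff-centred design described here, a minimal sketch with the Agents SDK (`pip install openai-agents`) is below. The agent names and instructions are invented, and the repo linked above is the reference for the current API surface.

```python
# Triage-and-handoff sketch with the Agents SDK.
from agents import Agent, Runner

billing_agent = Agent(
    name="Billing agent",
    instructions="Handle billing and refund questions politely.",
)
support_agent = Agent(
    name="Support agent",
    instructions="Handle product and technical questions.",
)
triage_agent = Agent(
    name="Triage agent",
    instructions="Decide which specialist should handle the request and hand off to them.",
    handoffs=[billing_agent, support_agent],
)

result = Runner.run_sync(triage_agent, "I was charged twice for my subscription.")
print(result.final_output)
# Each run also emits a trace, which shows up in the OpenAI dashboard by default.
```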
And it was initially shipped. It was not as a like low-key experiment, I'd say. But we were kind of like taken by surprise at how much momentum there was around this concept. And so we decided to learn from that and embrace it. To be like, okay, maybe we should just embrace that as a core primitive of the OpenAI platform. And that's kind of what led to the Agents SDK. And I think now, as Nikuj mentioned, it's like adding all of these new capabilities to it, like leveraging the handoffs that we had, but tracing also. And I think what's very compelling for developers is like instead of having one agent to rule them all and you stuff like a lot of tool calls in there that can be hard to monitor, now you have the tools you need to kind of like separate the logic, right? And you can have a triage agent that based on an intent goes to different kind of agents. And then on the OpenAI dashboard, we're releasing a lot of new user interface logs as well. So you can see all of the tracing UIs. Essentially, you'll be able to troubleshoot like what exactly happened. In that workflow, when the triage agent did a handoff to a secondary agent and the third and see the tool calls, et cetera. So we think that the Agents SDK combined with the tracing UIs will definitely help users and developers build better agentic workflows.Alessio [00:24:28]: And just before we wrap, are you thinking of connecting this with also the RFT API? Because I know you already have, you kind of store my text completions and then I can do fine tuning of that. Is that going to be similar for agents where you're storing kind of like my traces? And then help me improve the agents?Nikunj [00:24:43]: Yeah, absolutely. Like you got to tie the traces to the evals product so that you can generate good evals. Once you have good evals and graders and tasks, you can use that to do reinforcement fine tuning. And, you know, lots of details to be figured out over here. But that's the vision. And I think we're going to go after it like pretty hard and hope we can like make this whole workflow a lot easier for developers.Alessio [00:25:05]: Awesome. Thank you so much for the time. I'm sure you'll be busy on Twitter tomorrow with all the developer feedback. Yeah.Romain [00:25:12]: Thank you so much for having us. And as always, we can't wait to see what developers will build with these tools and how we can like learn as quickly as we can from them to make them even better over time.Nikunj [00:25:21]: Yeah.Romain [00:25:22]: Thank you, guys.Nikunj [00:25:23]: Thank you.Romain [00:25:23]: Thank you both. Awesome. Get full access to Latent.Space at www.latent.space/subscribe

ChinaTalk
Manus: A DeepSeek Moment?

ChinaTalk

Play Episode Listen Later Mar 10, 2025 52:56


A Wuhan-developed AI agent went viral this weekend. Guests Rohit Krishnan of the substack Strange Loop Canon, Shawn Wang of Latent Space, and Dean Ball of Mercatus and Hyperdimensional join us to discuss. We get into:

* What Manus is and isn't
* What Manus tells us about the broader AI ecosystem's ability to produce products we actually want to use
* The political economy and liability issues that AI agents will engender

More ChinaTalk coverage: https://www.chinatalk.media/p/manus-chinas-latest-ai-sensation

Outtro Music:
La Marelu "Mala" https://www.youtube.com/watch?v=HAB5rx8mqjM&ab_channel=LaMarelu-Topic
Alaska y Dinarama "A Quién Le Importa" https://www.youtube.com/watch?v=2uQhdDtdXg0&ab_channel=YouMoreTv-Espect%C3%A1culo

Learn more about your ad choices. Visit megaphone.fm/adchoices

ai wuhan manus mercatus shawn wang latent space
Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Today's episode is with Paul Klein, founder of Browserbase. We talked about building browser infrastructure for AI agents, the future of agent authentication, and their open source framework Stagehand.* [00:00:00] Introductions* [00:04:46] AI-specific challenges in browser infrastructure* [00:07:05] Multimodality in AI-Powered Browsing* [00:12:26] Running headless browsers at scale* [00:18:46] Geolocation when proxying* [00:21:25] CAPTCHAs and Agent Auth* [00:28:21] Building “User take over” functionality* [00:33:43] Stagehand: AI web browsing framework* [00:38:58] OpenAI's Operator and computer use agents* [00:44:44] Surprising use cases of Browserbase* [00:47:18] Future of browser automation and market competition* [00:53:11] Being a solo founderTranscriptAlessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.swyx [00:00:12]: Hey, and today we are very blessed to have our friends, Paul Klein, for the fourth, the fourth, CEO of Browserbase. Welcome.Paul [00:00:21]: Thanks guys. Yeah, I'm happy to be here. I've been lucky to know both of you for like a couple of years now, I think. So it's just like we're hanging out, you know, with three ginormous microphones in front of our face. It's totally normal hangout.swyx [00:00:34]: Yeah. We've actually mentioned you on the podcast, I think, more often than any other Solaris tenant. Just because like you're one of the, you know, best performing, I think, LLM tool companies that have started up in the last couple of years.Paul [00:00:50]: Yeah, I mean, it's been a whirlwind of a year, like Browserbase is actually pretty close to our first birthday. So we are one years old. And going from, you know, starting a company as a solo founder to... To, you know, having a team of 20 people, you know, a series A, but also being able to support hundreds of AI companies that are building AI applications that go out and automate the web. It's just been like, really cool. It's been happening a little too fast. I think like collectively as an AI industry, let's just take a week off together. I took my first vacation actually two weeks ago, and Operator came out on the first day, and then a week later, DeepSeat came out. And I'm like on vacation trying to chill. I'm like, we got to build with this stuff, right? So it's been a breakneck year. But I'm super happy to be here and like talk more about all the stuff we're seeing. And I'd love to hear kind of what you guys are excited about too, and share with it, you know?swyx [00:01:39]: Where to start? So people, you've done a bunch of podcasts. I think I strongly recommend Jack Bridger's Scaling DevTools, as well as Turner Novak's The Peel. And, you know, I'm sure there's others. So you covered your Twilio story in the past, talked about StreamClub, you got acquired to Mux, and then you left to start Browserbase. So maybe we just start with what is Browserbase? Yeah.Paul [00:02:02]: Browserbase is the web browser for your AI. We're building headless browser infrastructure, which are browsers that run in a server environment that's accessible to developers via APIs and SDKs. It's really hard to run a web browser in the cloud. You guys are probably running Chrome on your computers, and that's using a lot of resources, right? So if you want to run a web browser or thousands of web browsers, you can't just spin up a bunch of lambdas. You actually need to use a secure containerized environment. 
You have to scale it up and down. It's a stateful system. And that infrastructure is, like, super painful. And I know that firsthand, because at my last company, StreamClub, I was CTO, and I was building our own internal headless browser infrastructure. That's actually why we sold the company, is because Mux really wanted to buy our headless browser infrastructure that we'd built. And it's just a super hard problem. And I actually told my co-founders, I would never start another company unless it was a browser infrastructure company. And it turns out that's really necessary in the age of AI, when AI can actually go out and interact with websites, click on buttons, fill in forms. You need AI to do all of that work in an actual browser running somewhere on a server. And BrowserBase powers that.swyx [00:03:08]: While you're talking about it, it occurred to me, not that you're going to be acquired or anything, but it occurred to me that it would be really funny if you became the Nikita Beer of headless browser companies. You just have one trick, and you make browser companies that get acquired.Paul [00:03:23]: I truly do only have one trick. I'm screwed if it's not for headless browsers. I'm not a Go programmer. You know, I'm in AI grant. You know, browsers is an AI grant. But we were the only company in that AI grant batch that used zero dollars on AI spend. You know, we're purely an infrastructure company. So as much as people want to ask me about reinforcement learning, I might not be the best guy to talk about that. But if you want to ask about headless browser infrastructure at scale, I can talk your ear off. So that's really my area of expertise. And it's a pretty niche thing. Like, nobody has done what we're doing at scale before. So we're happy to be the experts.swyx [00:03:59]: You do have an AI thing, stagehand. We can talk about the sort of core of browser-based first, and then maybe stagehand. Yeah, stagehand is kind of the web browsing framework. Yeah.What is Browserbase? Headless Browser Infrastructure ExplainedAlessio [00:04:10]: Yeah. Yeah. And maybe how you got to browser-based and what problems you saw. So one of the first things I worked on as a software engineer was integration testing. Sauce Labs was kind of like the main thing at the time. And then we had Selenium, we had Playbrite, we had all these different browser things. But it's always been super hard to do. So obviously you've worked on this before. When you started browser-based, what were the challenges? What were the AI-specific challenges that you saw versus, there's kind of like all the usual running browser at scale in the cloud, which has been a problem for years. What are like the AI unique things that you saw that like traditional purchase just didn't cover? Yeah.AI-specific challenges in browser infrastructurePaul [00:04:46]: First and foremost, I think back to like the first thing I did as a developer, like as a kid when I was writing code, I wanted to write code that did stuff for me. You know, I wanted to write code to automate my life. And I do that probably by using curl or beautiful soup to fetch data from a web browser. And I think I still do that now that I'm in the cloud. And the other thing that I think is a huge challenge for me is that you can't just create a web site and parse that data. And we all know that now like, you know, taking HTML and plugging that into an LLM, you can extract insights, you can summarize. 
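As a tiny example of the "fetch a page and hand the HTML to an LLM" workflow Paul mentions, here is a sketch; as he explains next, a plain HTTP fetch like this only works for server-rendered pages, which is exactly where headless browsers come in. The URL and model name are placeholders.

```python
# Fetch server-rendered HTML and let an LLM summarize it.
import requests
from openai import OpenAI

client = OpenAI()

html = requests.get("https://example.com/blog/some-post", timeout=30).text  # server-rendered page

summary = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any capable model works
    messages=[
        {"role": "system", "content": "Summarize the main points of this web page in three bullets."},
        {"role": "user", "content": html[:100_000]},  # crude truncation to stay within context
    ],
)
print(summary.choices[0].message.content)
```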
So it was very clear that now like dynamic web scraping became very possible with the rise of large language models or a lot easier. And that was like a clear reason why there's been more usage of headless browsers, which are necessary because a lot of modern websites don't expose all of their page content via a simple HTTP request. You know, they actually do require you to run this type of code for a specific time. JavaScript on the page to hydrate this. Airbnb is a great example. You go to airbnb.com. A lot of that content on the page isn't there until after they run the initial hydration. So you can't just scrape it with a curl. You need to have some JavaScript run. And a browser is that JavaScript engine that's going to actually run all those requests on the page. So web data retrieval was definitely one driver of starting BrowserBase and the rise of being able to summarize that within LLM. Also, I was familiar with if I wanted to automate a website, I could write one script and that would work for one website. It was very static and deterministic. But the web is non-deterministic. The web is always changing. And until we had LLMs, there was no way to write scripts that you could write once that would run on any website. That would change with the structure of the website. Click the login button. It could mean something different on many different websites. And LLMs allow us to generate code on the fly to actually control that. So I think that rise of writing the generic automation scripts that can work on many different websites, to me, made it clear that browsers are going to be a lot more useful because now you can automate a lot more things without writing. If you wanted to write a script to book a demo call on 100 websites, previously, you had to write 100 scripts. Now you write one script that uses LLMs to generate that script. That's why we built our web browsing framework, StageHand, which does a lot of that work for you. But those two things, web data collection and then enhanced automation of many different websites, it just felt like big drivers for more browser infrastructure that would be required to power these kinds of features.Alessio [00:07:05]: And was multimodality also a big thing?Paul [00:07:08]: Now you can use the LLMs to look, even though the text in the dome might not be as friendly. Maybe my hot take is I was always kind of like, I didn't think vision would be as big of a driver. For UI automation, I felt like, you know, HTML is structured text and large language models are good with structured text. But it's clear that these computer use models are often vision driven, and they've been really pushing things forward. So definitely being multimodal, like rendering the page is required to take a screenshot to give that to a computer use model to take actions on a website. And it's just another win for browser. But I'll be honest, that wasn't what I was thinking early on. I didn't even think that we'd get here so fast with multimodality. I think we're going to have to get back to multimodal and vision models.swyx [00:07:50]: This is one of those things where I forgot to mention in my intro that I'm an investor in Browserbase. And I remember that when you pitched to me, like a lot of the stuff that we have today, we like wasn't on the original conversation. 
But I did have my original thesis was something that we've talked about on the podcast before, which is take the GPT store, the custom GPT store, all the every single checkbox and plugin is effectively a startup. And this was the browser one. I think the main hesitation, I think I actually took a while to get back to you. The main hesitation was that there were others. Like you're not the first hit list browser startup. It's not even your first hit list browser startup. There's always a question of like, will you be the category winner in a place where there's a bunch of incumbents, to be honest, that are bigger than you? They're just not targeted at the AI space. They don't have the backing of Nat Friedman. And there's a bunch of like, you're here in Silicon Valley. They're not. I don't know.Paul [00:08:47]: I don't know if that's, that was it, but like, there was a, yeah, I mean, like, I think I tried all the other ones and I was like, really disappointed. Like my background is from working at great developer tools, companies, and nothing had like the Vercel like experience. Um, like our biggest competitor actually is partly owned by private equity and they just jacked up their prices quite a bit. And the dashboard hasn't changed in five years. And I actually used them at my last company and tried them and I was like, oh man, like there really just needs to be something that's like the experience of these great infrastructure companies, like Stripe, like clerk, like Vercel that I use in love, but oriented towards this kind of like more specific category, which is browser infrastructure, which is really technically complex. Like a lot of stuff can go wrong on the internet when you're running a browser. The internet is very vast. There's a lot of different configurations. Like there's still websites that only work with internet explorer out there. How do you handle that when you're running your own browser infrastructure? These are the problems that we have to think about and solve at BrowserBase. And it's, it's certainly a labor of love, but I built this for me, first and foremost, I know it's super cheesy and everyone says that for like their startups, but it really, truly was for me. If you look at like the talks I've done even before BrowserBase, and I'm just like really excited to try and build a category defining infrastructure company. And it's, it's rare to have a new category of infrastructure exists. We're here in the Chroma offices and like, you know, vector databases is a new category of infrastructure. Is it, is it, I mean, we can, we're in their office, so, you know, we can, we can debate that one later. That is one.Multimodality in AI-Powered Browsingswyx [00:10:16]: That's one of the industry debates.Paul [00:10:17]: I guess we go back to the LLMOS talk that Karpathy gave way long ago. And like the browser box was very clearly there and it seemed like the people who were building in this space also agreed that browsers are a core primitive of infrastructure for the LLMOS that's going to exist in the future. And nobody was building something there that I wanted to use. So I had to go build it myself.swyx [00:10:38]: Yeah. I mean, exactly that talk that, that honestly, that diagram, every box is a startup and there's the code box and then there's the. The browser box. I think at some point they will start clashing there. There's always the question of the, are you a point solution or are you the sort of all in one? 
And I think the point solutions tend to win quickly, but then the only ones have a very tight cohesive experience. Yeah. Let's talk about just the hard problems of browser base you have on your website, which is beautiful. Thank you. Was there an agency that you used for that? Yeah. Herb.paris.Paul [00:11:11]: They're amazing. Herb.paris. Yeah. It's H-E-R-V-E. I highly recommend for developers. Developer tools, founders to work with consumer agencies because they end up building beautiful things and the Parisians know how to build beautiful interfaces. So I got to give prep.swyx [00:11:24]: And chat apps, apparently are, they are very fast. Oh yeah. The Mistral chat. Yeah. Mistral. Yeah.Paul [00:11:31]: Late chat.swyx [00:11:31]: Late chat. And then your videos as well, it was professionally shot, right? The series A video. Yeah.Alessio [00:11:36]: Nico did the videos. He's amazing. Not the initial video that you shot at the new one. First one was Austin.Paul [00:11:41]: Another, another video pretty surprised. But yeah, I mean, like, I think when you think about how you talk about your company. You have to think about the way you present yourself. It's, you know, as a developer, you think you evaluate a company based on like the API reliability and the P 95, but a lot of developers say, is the website good? Is the message clear? Do I like trust this founder? I'm building my whole feature on. So I've tried to nail that as well as like the reliability of the infrastructure. You're right. It's very hard. And there's a lot of kind of foot guns that you run into when running headless browsers at scale. Right.Competing with Existing Headless Browser Solutionsswyx [00:12:10]: So let's pick one. You have eight features here. Seamless integration. Scalability. Fast or speed. Secure. Observable. Stealth. That's interesting. Extensible and developer first. What comes to your mind as like the top two, three hardest ones? Yeah.Running headless browsers at scalePaul [00:12:26]: I think just running headless browsers at scale is like the hardest one. And maybe can I nerd out for a second? Is that okay? I heard this is a technical audience, so I'll talk to the other nerds. Whoa. They were listening. Yeah. They're upset. They're ready. The AGI is angry. Okay. So. So how do you run a browser in the cloud? Let's start with that, right? So let's say you're using a popular browser automation framework like Puppeteer, Playwright, and Selenium. Maybe you've written a code, some code locally on your computer that opens up Google. It finds the search bar and then types in, you know, search for Latent Space and hits the search button. That script works great locally. You can see the little browser open up. You want to take that to production. You want to run the script in a cloud environment. So when your laptop is closed, your browser is doing something. The browser is doing something. Well, I, we use Amazon. You can see the little browser open up. You know, the first thing I'd reach for is probably like some sort of serverless infrastructure. I would probably try and deploy on a Lambda. But Chrome itself is too big to run on a Lambda. It's over 250 megabytes. So you can't easily start it on a Lambda. So you maybe have to use something like Lambda layers to squeeze it in there. Maybe use a different Chromium build that's lighter. And you get it on the Lambda. Great. It works. But it runs super slowly. It's because Lambdas are very like resource limited. They only run like with one vCPU. 
You can run one process at a time. Remember, Chromium is super beefy. It's barely running on my MacBook Air. I'm still downloading it from a pre-run. Yeah, from the test earlier, right? I'm joking. But it's big, you know? So like Lambda, it just won't work really well. Maybe it'll work, but you need something faster. Your users want something faster. Okay. Well, let's put it on a beefier instance. Let's get an EC2 server running. Let's throw Chromium on there. Great. Okay. That works well with one user. But what if I want to run, like, 10 Chromium instances, one for each of my users? Okay. Well, I might need two EC2 instances. Maybe 10. All of a sudden, you have multiple EC2 instances. This sounds like a problem for Kubernetes and Docker, right? Now, all of a sudden, you're using ECS or EKS, the Kubernetes or container solutions by Amazon. You're spinning up and down containers, and you're spending a whole engineer's time on kind of maintaining this stateful distributed system. Those are some of the worst systems to run because when it's a stateful distributed system, it means that you are bound by the connections to that thing. You have to keep the browser open while someone is working with it, right? That's just a painful architecture to run. And there are all these other little gotchas with Chromium, which is the open source version of Chrome, by the way. You have to install all these fonts. You want emojis working in your browsers because your vision model is looking for the emoji. You need to make sure you have the emoji fonts. You need to make sure you have all the right extensions configured, like, oh, do you want ad blocking? How do you configure that? How do you actually record all these browser sessions? Like, it's a headless browser. You can't look at it. So you need to have some sort of observability. Maybe you're recording videos and storing those somewhere. It all kind of adds up to be this giant monster piece of your project, when all you wanted to do was run a lot of browsers in production for this little script to go to google.com and search. And when I see a complex distributed system, I see an opportunity to build a great infrastructure company. And we really abstract that away with Browserbase, where our customers can use these existing frameworks, Playwright, Puppeteer, Selenium, or our own Stagehand, and connect to our browsers in a serverless-like way. And control them, and then just disconnect when they're done. And they don't have to think about the complex distributed system behind all of that. They just get a browser running anywhere, anytime. Really easy to connect to.swyx [00:15:55]: I'm sure you have questions. My standard question with anything, so essentially you're a serverless browser company, and there have been other serverless things that I'm familiar with in the past, serverless GPUs, serverless website hosting. That's where I come from with Netlify. One question is just like, you promise to spin up thousands of browsers in milliseconds. I feel like there's no real solution that does that yet. And I'm just kind of curious how. The only solution I know is to keep a kind of warm pool of servers around, which is expensive, but maybe not so expensive because it's just CPUs. So I'm just like, you know. Yeah.Browsers as a Core Primitive in AI InfrastructurePaul [00:16:36]: You nailed it, right?
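As a rough sketch of the connect-and-disconnect model Paul just described, the same script can point at a browser running elsewhere instead of launching Chromium itself, using Playwright's standard connectOverCDP call. The wss:// endpoint and token parameter below are placeholders, not a documented Browserbase URL:

```typescript
import { chromium } from "playwright";

// Instead of launching Chromium locally (or wrestling it onto a Lambda),
// connect to a browser that is already running somewhere else over the
// Chrome DevTools Protocol. The endpoint below is a hypothetical placeholder.
async function run() {
  const browser = await chromium.connectOverCDP(
    "wss://example-browser-provider.com/connect?token=YOUR_TOKEN"
  );

  // Reuse the remote browser's default context and page if they exist.
  const context = browser.contexts()[0] ?? (await browser.newContext());
  const page = context.pages()[0] ?? (await context.newPage());

  await page.goto("https://www.google.com");
  console.log(await page.title());

  // Disconnecting releases the remote browser; there are no servers to tear down on your side.
  await browser.close();
}

run().catch(console.error);
```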
I mean, how do you offer a serverless-like experience with something that is clearly not serverless, right? And the answer is, you need to be able to run... We run many browsers on single nodes. We use Kubernetes at Browserbase. So we have many pods that are being scheduled. We have to predictably schedule them up or down. Yes, thousands of browsers in milliseconds is the best case scenario. If you hit us with 10,000 requests, you may hit a slower cold start, right? So we've done a lot of work on predictive scaling and being able to route stuff to different regions, where we have multiple regions of Browserbase with different pools available. You can also pick the region you want to go to based on lower round-trip-time latency. It's very important with these types of things. There are a lot of requests going over the wire. So for us, having a VM layer like Firecracker powering everything under the hood allows us to be super nimble and spin things up or down really quickly with strong multi-tenancy. But in the end, these are the complex infrastructural challenges that we have to deal with at Browserbase. And we have a lot more stuff on our roadmap to allow customers to have more levers to pull, to trade off: do you want really fast browser startup times or do you want really low costs? And if you're willing to be more flexible on that, we may be able to work better for your use cases.swyx [00:17:44]: Since you used Firecracker, shouldn't Fargate do that for you or did you have to go lower level than that? We had to go lower level than that.Paul [00:17:51]: I find this a lot with Fargate customers, which is alarming for Fargate. We used to be a giant Fargate customer. Actually, the first version of Browserbase was ECS and Fargate. I think we were actually the largest Fargate customer in our region for a little while. No, what? Yeah, seriously. And unfortunately, it's a great product, but I think if you're an infrastructure company, you actually have to have a deeper level of control over these primitives. I think the same thing is true with databases. We've used other database providers and I think-swyx [00:18:21]: Yeah, serverless Postgres.Paul [00:18:23]: Shocker. When you're an infrastructure company, you're on the hook if any provider has an outage. And I can't tell my customers like, hey, we went down because so-and-so went down. That's not acceptable. So for us, we've really moved to bringing things internally. It's kind of the opposite of what we preach. We tell our customers, don't build this in-house, but then we're like, we build a lot of stuff in-house. But I think it just really depends on what is in the critical path. We try and have deep ownership of that.Alessio [00:18:46]: On the distributed location side, how does that work for the web, where you might get different content in different locations, but the customer is expecting, you know, if you're in the US, I'm expecting the US version. But if you're spinning up my browser in France, I might get the French version. Yeah.Paul [00:19:02]: Yeah. That's a good question. Well, generally, on the localization, there is a thing called locale in the browser. You can set what your locale is, like whether you're an en-US browser or not. But some things do IP-based routing. And in that case, you may want to have a proxy.
Like, let's say you're running something in Europe, but you want to make sure you're showing up from the US. You may want to use one of our proxy features, so you can turn on proxies to say, make sure these connections always come from the United States. Which is necessary too, because when you're browsing the web, you're coming from, like, you know, a data center IP, and that can make things a lot harder when browsing the web. So we do have kind of like this proxy super network. Yeah. We pick a proxy for you based on where you're going, so you can reliably automate the web. But if you get scheduled in Europe, that doesn't happen as much. We try and schedule you as close as possible to, you know, the origin that you're trying to go to. But generally you have control over the regions you can put your browsers in. So you can specify West one or East one or Europe. We only have one region in Europe right now, actually. Yeah.Alessio [00:19:55]: What's harder, the browser or the proxy? It feels to me like proxying reliably at scale is much harder than spinning up browsers at scale. I'm curious. It's all hard.Paul [00:20:06]: It's layers of hard, right? Yeah. I think it's different levels of hard. I think the thing with the proxy infrastructure is that we work with many different web proxy providers and some are better than others. Some have good days, some have bad days. And our customers who've built browser infrastructure on their own, they have to go and deal with sketchy actors. Like, first they figure out their own browser infrastructure and then they've got to go buy a proxy. And it's like, you can pay in Bitcoin and it just kind of feels a little sus, right? It's like you're buying drugs when you're trying to get a proxy online. We have deep relationships with these counterparties. We're able to audit them and say, is this proxy being sourced ethically? Like, it's not running on someone's TV somewhere. Is it free range? Yeah. Free range organic proxies, right? Right. We do a level of diligence. We're SOC 2. So we have to understand what is going on here. But then we're able to make sure that we route around proxy providers not working. There are proxy providers where the proxy will just stop working all of a sudden. And then if you don't have redundant proxying on your own browsers, that's a hard down for you, or you may get some serious impacts there. With us, we intelligently know, hey, this proxy is not working. Let's go to this one. And you can kind of build a network of multiple providers to really guarantee the best uptime for our customers. Yeah. So you don't own any proxies? We don't own any proxies. You're right. The team has been saying, who wants to, like, take home a little proxy server? But not yet. We're not there yet. You know?swyx [00:21:25]: It's a very mature market. I don't think you should build that yourself. Like, you should just be a super customer of them. Yeah. Scraping, I think, is the main use case for that. I guess. Well, that leads us into CAPTCHAs and also auth, but let's talk about CAPTCHAs. You had a little spiel that you wanted to give about CAPTCHA stuff.Challenges of Scaling Browser InfrastructurePaul [00:21:43]: Oh, yeah. I was just, I think a lot of people ask, if you're thinking about proxies, you're thinking about CAPTCHAs too. I think it's the same thing. You can go buy CAPTCHA solvers online, but it's the same buying experience. It's some sketchy website, you have to integrate it.
It's not fun to buy these things; the docs are bad and you can't really trust them. What Browserbase does is we integrate a bunch of different CAPTCHA solvers. We do some stuff in-house, but generally we just integrate with a bunch of known vendors and continually monitor and maintain these things and say, is this working or not? Can we route around it or not? These are CAPTCHA solvers. CAPTCHA solvers, yeah. Not CAPTCHA providers, CAPTCHA solvers. Yeah, sorry. CAPTCHA solvers. We really try and make sure all of that works for you. I think as a dev, if I'm buying infrastructure, I want it all to work all the time, and it's important for us to provide that experience by making sure everything does work and monitoring it on our own. Yeah. Right now, the world of CAPTCHAs is tricky. I think AI agents in particular are very much ahead of the internet infrastructure. CAPTCHAs are designed to block all types of bots, but there are now good bots and bad bots. I think in the future, CAPTCHAs will be able to identify who a good bot is, hopefully via some sort of KYC. For us, we've been very lucky. We have very little to no known abuse of Browserbase because we really look into who we work with. And for certain types of CAPTCHA solving, we only allow them on certain types of plans, because we want to make sure that we know what people are doing, what their use cases are. And that's really allowed us to try and be an arbiter of good bots, which is our long term goal. I want to build great relationships with people like Cloudflare so we can agree, hey, here are these acceptable bots. We'll identify them for you and make sure we flag when they come to your website. This is a good bot, you know?Alessio [00:23:23]: I see. And Cloudflare said they want to do more of this. So they're going to set by default, if they think you're an AI bot, they're going to reject. I'm curious if you think this is something that is going to be at the browser level, or, I mean, the DNS level with Cloudflare seems more where it should belong. But I'm curious how you think about it.Paul [00:23:40]: I think the web's going to change. You know, I think that the Internet as we have it right now is going to change. And we all need to just accept that the cat is out of the bag. Instead of wishing the Internet was like it was in the 2000s, where we could have free content online that wouldn't be scraped, we have to accept it's just not going to happen. And instead, we should think about, one, how can we change the models of, you know, information being published online so people can adequately commercialize it? But two, how do we rebuild applications that expect that AI agents are going to log in on their users' behalf? Those are the things that are going to allow us to identify good and bad bots. And I think the team at Clerk has been doing a really good job with this on the authentication side. I actually think that auth is the biggest thing that will prevent agents from accessing stuff, not CAPTCHAs. And I think there will be agent auth in the future. I don't know if it's going to come from an individual company, but rather authentication providers that have a, you know, "log in as agent" feature, where you put in your email, you get a push notification that says, hey, your browser-based agent wants to log into your Airbnb. You can approve that and then the agent can proceed. That really circumvents the need for CAPTCHAs or logging in as you and sharing your password.
I think agent auth is going to be one way we identify good bots going forward. And I think a lot of this CAPTCHA solving stuff is really a short-term problem as the internet kind of reorients itself around how it's going to work with agents browsing the web, just like people do. Yeah.Managing Distributed Browser Locations and Proxiesswyx [00:24:59]: Stitch recently was on Hacker News for talking about agent experience, AX, which is a thing that Netlify is also trying to clone and coin and talk about. And we've talked about this on our previous episodes before, in the sense that I actually think that's maybe the only part of the tech stack that needs to be reinvented for agents. Everything else can stay the same, CLIs, APIs, whatever. But auth, yeah, we need agent auth. And it's mostly short-lived, and it should be a distinct identity from the human, but paired. I almost think, in the same way that every social network should have your main profile and then your alt accounts or your Finsta, it's almost like, you know, every human token should be paired with an agent token, and the agent token can go and do stuff on behalf of the human token, but not be presumed to be the human. Yeah.Paul [00:25:48]: It's actually very similar to OAuth, is what I'm thinking. And, you know, Thread from Stitch is an investor, Colin from Clerk, Octaventures, all investors in Browserbase, because, like, I hope they solve this, because it'll make Browserbase's mission more possible. So we don't have to overcome all these hurdles. But I think it will be an OAuth-like flow where an agent will ask to log in as you, and you'll approve the scopes. Like, it can book an apartment on Airbnb, but it can't, like, message anybody. And then, you know, the agent will have some sort of role-based access control within an application. Yeah. I'm excited for that.swyx [00:26:16]: The tricky part is just, there's one layer of delegation here, which is like, you're authorizing my user's user or something like that. I don't know if that's tricky or not. Does that make sense? Yeah.Paul [00:26:25]: You know, actually at Twilio, I worked on the login, identity, and access management teams, right? So, like, I built Twilio's login page.swyx [00:26:31]: You were an intern on that team and then you became the lead in two years? Yeah.Paul [00:26:34]: Yeah. I started as an intern in 2016 and then I was the tech lead of that team. How? That's not normal. I didn't have a life. He's not normal. Look at this guy. I didn't have a girlfriend. I just loved my job. I don't know. I applied to 500 internships for my first job and I got rejected from every single one of them except for Twilio and then eventually Amazon. And they took a shot on me, and like, I was getting paid money to write code, which was my dream. Yeah. Yeah. I'm very lucky that this coding thing worked out, because I was going to be doing it regardless. And yeah, I was able to spend a lot of time on a team that was growing at a company that was growing. So it informed a lot of this stuff here. I think these are problems that have been solved with, like, the SAML protocol, with SSO. And there's really interesting stuff with, like, WebAuthn, these different types of authentication schemes that you can use to authenticate people. The tooling is all there. It just needs to be tweaked a little bit to work for agents. And I think the fact that there are companies that are already
providing authentication as a service really sets it up well. The thing that's hard is reinventing the internet for agents. We don't want to rebuild the internet. That's an impossible task. And I think people often say, well, we'll have this second layer of APIs built for agents. I'm like, we will for the top use cases, but otherwise we can just tweak the internet as is, which, on the authentication side, I think we're going to be the dumb ones going forward. Unfortunately, I think AI is going to be able to do a lot of the tasks that we do online, which means that it will be able to go to websites, click buttons on our behalf, and log in on our behalf too. So with this kind of web agent future happening, I think with some small structural changes, like you said, it feels like it could all slot in really nicely with the existing internet.Handling CAPTCHAs and Agent Authenticationswyx [00:28:08]: There's one more thing, which is your live view iframe, which lets you take control. Yeah. Obviously very key for Operator now, but was there anything interesting technically there? Well, people always want this.Paul [00:28:21]: It was really hard to build, you know, like, so, okay. Headless browsers, you don't see them, right? They're running in a cloud somewhere. You can't look at them. And I just want to really emphasize, it's a weird name, I wish we came up with a better name for this thing, but you can't see them, right? But customers don't trust AI agents, right? At least on the first pass. So what we do with our live view is that, you know, when you use Browserbase, you can actually embed a live view of the browser running in the cloud for your customer to see it working. And that's the first reason: to build trust. Like, okay, so I have this script that's going to go automate a website. I can embed it into my web application via an iframe and my customer can watch, I think. And then we added two-way communication. So now not only can you watch the browser being operated by AI, if you want to pause and actually click around and type within this iframe that's controlling a browser, that's also possible. And this is all thanks to some of the lower level protocol, which is called the Chrome DevTools Protocol. It has an API called startScreencast, and you can also send mouse clicks and button clicks to a remote browser. And this is all embeddable within iframes. You have a browser within a browser, yo. And then you simulate the screen, the click on the other side. Exactly. And this is really nice often for, like, let's say, a CAPTCHA that can't be solved. You saw this with Operator, you know, Operator actually uses a different approach. They use VNC. So, you know, you're able to see, like, you're seeing the whole window there. What we're doing is something a little lower level with the Chrome DevTools Protocol. It's just PNGs being streamed over the wire. But the same thing is true, right? Like, hey, I'm running a window. Pause. Can you do something in this window? Human. Okay, great. Resume. Like, sometimes 2FA tokens. Like, if you get that text message, you might need a person to type that in. Web agents need human-in-the-loop type workflows still. You still need a person to interact with the browser. And building a UI to proxy that is kind of hard. You may as well just show them the whole browser and say, hey, can you finish this up for me? And then let the AI proceed afterwards.
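To make the live view mechanics a bit more concrete, here is a hedged sketch of the underlying Chrome DevTools Protocol calls Paul mentions: Page.startScreencast streams PNG frames that could be forwarded to an embedded iframe, and Input.dispatchMouseEvent replays a human click back into the remote browser. This only illustrates the protocol; how Browserbase actually wires it up internally is not shown here:

```typescript
import { chromium } from "playwright";

async function liveViewSketch() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");

  // Open a raw Chrome DevTools Protocol session for this page.
  const cdp = await page.context().newCDPSession(page);

  // Stream the page as PNG frames; in a real setup you would forward each frame
  // over a websocket to the embedded live-view iframe.
  cdp.on("Page.screencastFrame", async ({ data, sessionId }) => {
    const png = Buffer.from(data, "base64"); // one frame of the remote browser
    console.log(`frame received: ${png.length} bytes`);
    await cdp.send("Page.screencastFrameAck", { sessionId }); // request the next frame
  });
  await cdp.send("Page.startScreencast", { format: "png", everyNthFrame: 2 });

  // Replay a human click (e.g. coming back from the iframe) into the remote browser.
  await cdp.send("Input.dispatchMouseEvent", {
    type: "mousePressed", x: 100, y: 200, button: "left", clickCount: 1,
  });
  await cdp.send("Input.dispatchMouseEvent", {
    type: "mouseReleased", x: 100, y: 200, button: "left", clickCount: 1,
  });

  await new Promise((r) => setTimeout(r, 3000)); // let a few frames arrive
  await browser.close();
}

liveViewSketch().catch(console.error);
```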
Is there a future where I stream my current desktop to Browserbase? I don't think so. I think we're very much cloud infrastructure. Yeah. You know, but I think a lot of the stuff we're doing, we do want to, like, build tools. Like, you know, we'll talk about the Stagehand, you know, web agent framework in a second. But, like, there's a case where a lot of people are going desktop first for, you know, consumer use. And I think Claude is doing a lot of this, where I expect to see, you know, MCP really oriented around the Claude desktop app for a reason, right? Like, I think a lot of these tools are going to run on your computer because it makes... I think it's breaking out. People are putting it on a server. Oh, really? Okay. Well, sweet. We'll see. We'll see that. I was surprised, though. I think that the Browser Company, too, with Dia Browser, it runs on your machine. You know, it's going to be...swyx [00:30:50]: What is it?Paul [00:30:51]: So, Dia Browser, as far as I understand... I used to use Arc. Yeah. I haven't used Arc. But I'm a big fan of the Browser Company. I think they're doing a lot of cool stuff in consumer. As far as I understand, it's a browser where you have a sidebar where you can, like, chat with it and it can control the local browser on your machine. So, if you imagine, like, what a consumer web agent is, which lives alongside your browser, I think Google Chrome has Project Mariner, I think. I almost call it Project Marinara for some reason. I don't know why. It's...swyx [00:31:17]: No, I think it's because someone really likes Waterworld. Oh, I see. The classic Kevin Costner. Yeah.Paul [00:31:22]: Okay. Project Marinara is a similar thing to the Dia Browser, in my mind, as far as I understand it. You have a browser that has an AI interface that will take over your mouse and keyboard and control the browser for you. Great for consumer use cases. But if you're building applications that rely on a browser and it's more part of a greater, like, AI app experience, you probably need something that's more like infrastructure, not a consumer app.swyx [00:31:44]: Just because I have explored a little bit in this area, do people want branching? So, I have the state of whatever my browser's in. And then I want, like, 100 clones of this state. Do people do that? Or...Paul [00:31:56]: People don't do it currently. Yeah. But it's definitely something we're thinking about. I think the idea of forking a browser is really cool. Technically, kind of hard. We're starting to see this in code execution, where people are, like, forking some code execution processes or forking some tool calls or branching tool calls. Haven't seen it at the browser level yet. But it makes sense. Like, if an AI agent is using a website and it's not sure what path it wants to take to crawl this website to find the information it's looking for, it would make sense for it to explore both paths in parallel. And that'd be a very, like... A road not taken. Yeah. And hopefully find the right answer. And then say, okay, this was actually the right one. And memorize that. And go there in the future. On the roadmap. For sure. Don't make my roadmap, please. You know?Alessio [00:32:37]: How do you actually do that? Yeah. How do you fork? I feel like the browser is so stateful for so many things.swyx [00:32:42]: Serialize the state. Restore the state. I don't know.Paul [00:32:44]: So, it's one of the reasons why we haven't done it yet. It's hard. You know?
Like, to truly fork, it's actually quite difficult. The naive way is to open the same page in a new tab and then, like, hope that it's in the same state. But if you have a form halfway filled, you may have to, like, take the whole, you know, container. Pause it. All the memory. Duplicate it. Restart it from there. It could be very slow. So, we haven't found a good approach yet. Like, the easy thing to fork is just, like, copying the page object. You know? But I think there needs to be something a little bit more robust there. Yeah.swyx [00:33:12]: So, MorphLabs has this infinite branch thing. Like, they wrote a custom fork of Linux or something that lets them save the system state and clone it. MorphLabs, hit me up. I'll be a customer. Yeah. That's the only... I think that's the only way to do it. Yeah. Like, unless Chrome has some special API for you. Yeah.Paul [00:33:29]: There's probably something we'll reverse engineer one day. I don't know. Yeah.Alessio [00:33:32]: Let's talk about StageHand, the AI web browsing framework. You have three core components, Observe, Extract, and Act. Pretty clean landing page. What was the idea behind making a framework? Yeah.Stagehand: AI web browsing frameworkPaul [00:33:43]: So, there are three frameworks that are very popular and already exist, right? Puppeteer, Playwright, Selenium. Those are for building hard-coded scripts to control websites. And as soon as I started to play with LLMs plus browsing, I caught myself, you know, code-genning Playwright code to control a website. I would, like, take the DOM. I'd pass it to an LLM. I'd say, can you generate the Playwright code to click the appropriate button here? And it would do that. And I was like, this really should be part of the frameworks themselves. And I became really obsessed with SDKs that take natural language as part of, like, the API input. And that's what StageHand is. StageHand exposes three APIs, and it's a superset of Playwright. So, if you go to a page, you may want to take an action, click on the button, fill in the form, etc. That's what the act command is for. You may want to extract some data. This one takes natural language, like, extract the winner of the Super Bowl from this page. You can give it a Zod schema, so it returns a structured output. And then maybe you're building an agent. You can do an agent loop, and you want to kind of see what actions are possible on this page before taking one. You can do observe. So, you can observe the actions on the page, and it will generate a list of actions. You can guide it, like, give me actions on this page related to buying an item. And it can say, like, buy it now, add to cart, view shipping options, and you pass that to an LLM, an agent loop, to say, what's the appropriate action given this high-level goal? So, StageHand isn't a web agent. It's a framework for building web agents. And we think that agent loops are actually pretty close to the application layer, because every application probably has different goals or different ways it wants to take steps. I don't think I've seen a generic one. Maybe you guys are the experts here. I haven't seen, like, a really good general AI agent framework here. Everyone kind of has their own special sauce, right? I see a lot of developers building their own agent loops, and they're using tools. And I view StageHand as the browser tool. So, we expose act, extract, observe. Your agent can call these tools. And from that, you don't have to worry about it. You don't have to worry about generating Playwright code performantly.
You don't have to worry about running it. You can kind of just integrate these three tool calls into your agent loop and reliably automate the web.swyx [00:35:48]: A special shout-out to Anirudh, who I met at your dinner, who I think listens to the pod. Yeah. Hey, Anirudh.Paul [00:35:54]: Anirudh's the man. He's a StageHand guy.swyx [00:35:56]: I mean, the interesting thing about each of these APIs is they're kind of each a startup. Like, specifically extract, you know, Firecrawl is extract. There's, like, Expand AI. There's a whole bunch of, like, extract companies. They just focus on extract. I'm curious. Like, I feel like you guys are going to collide at some point. Like, right now, it's friendly. Everyone's in a blue ocean. At some point, it's going to be valuable enough that there's some turf battle here. I don't think you have a dog in this fight. I think you can mock extract out to use an external service if they're better at it than you. But it's just an observation that, like, in the same way that I see each option, each checkbox in the side of custom GPTs becoming a startup, or each box in the Karpathy chart being a startup, like, this is also becoming a thing. Yeah.Paul [00:36:41]: I mean, like, so the way StageHand works is that it's MIT-licensed, completely open source. You bring your own API key to your LLM of choice. You can choose your LLM. We don't really make any money off of extract. We only really make money if you choose to run it with our browser. You don't have to. You can actually use your own browser, a local browser. You know, StageHand is completely open source for that reason. And, yeah, like, I think if you're building really complex web scraping workflows, I don't know if StageHand is the tool for you. I think it's really more if you're building an AI agent that needs a few general tools, or if it's doing a lot of, like, web automation-intensive work. But if you're building a scraping company, StageHand is not your thing. You probably want something that's going to, like, get HTML content, you know, convert that to Markdown, query it. That's not what StageHand does. StageHand is more about reliability. I think we focus a lot on reliability and less so on cost optimization and speed at this point.swyx [00:37:33]: I actually feel like StageHand, so the way that StageHand works, it's like, you know, page.act, click on the quick start. Yeah. It's kind of the integration test for the code that you would have to write anyway, like the Puppeteer code that you have to write anyway. And when the page structure changes, because it always does, then this is still the test. This is still the test that I would have to write. Yeah. So it's kind of like a testing framework that doesn't need implementation detail.Paul [00:37:56]: Well, yeah. I mean, Puppeteer, Playwright, and Selenium were all designed as testing frameworks, right? Yeah. And now people are, like, hacking them together to automate the web. I would say, and, like, maybe this is, like, me being too specific. But, like, when I write tests, if the page structure changes without me knowing, I want that test to fail. So I don't know about, like, AI regenerating that. Like, people are using StageHand for testing, but it's more for, like, usability testing, not testing of whether the front end has changed or not. Okay. But generally, where we've seen people really take off is when they're using it for something specific.
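For a concrete picture of the three primitives described above, here is a rough sketch of what an act / extract / observe flow looks like. The constructor options and exact method signatures are assumptions based on Paul's description rather than a careful reading of Stagehand's documentation:

```typescript
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

async function main() {
  // Options here are assumed for illustration; LOCAL runs against a local browser.
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  const page = stagehand.page;

  await page.goto("https://www.espn.com/nfl/");

  // act: take a natural-language action on the page.
  await page.act("click on the Super Bowl recap article");

  // extract: pull structured data out, validated against a Zod schema.
  const result = await page.extract({
    instruction: "extract the winner of the Super Bowl from this page",
    schema: z.object({ winner: z.string() }),
  });
  console.log(result.winner);

  // observe: list candidate actions an agent loop could choose between.
  const actions = await page.observe("actions on this page related to reading more coverage");
  console.log(actions);

  await stagehand.close();
}

main().catch(console.error);
```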
If they want to build a feature in their application that's kind of like Operator or Deep Research, they're using StageHand to power that tool calling in their own agent loop. Okay. Cool.swyx [00:38:37]: So let's go into Operator, the first big agent launch of the year from OpenAI. Seems like they have a whole bunch scheduled. You were on break and your phone blew up. What's your general view of computer use agents, as they're calling it? The overall category, before we go into Open Operator, just the overall promise of Operator. I will observe that I tried it once. It was okay. And I never tried it again.OpenAI's Operator and computer use agentsPaul [00:38:58]: That tracks with my experience, too. Like, I'm a huge fan of the OpenAI team. I do not view Operator as a company killer for Browserbase at all. I think it actually shows people what's possible. I think, like, computer use models make a lot of sense. And what I'm actually most excited about with computer use models is their ability to really take screenshots, reason, and output steps. I think that using mouse clicks or mouse coordinates, I've seen that prove to be less reliable than I would like. And I just wonder if that's the right form factor. What we've done with our framework is anchor it to the DOM itself, anchor it to the actual item. So, like, if it's clicking on something, it's clicking on that thing, you know? Like, it's more accurate. No matter where it is. Yeah, exactly. Because it really ties in nicely. And it can handle, like, the whole viewport in one go, whereas, like, Operator can only handle what it sees. Can you hover? Is hovering a thing that you can do? I don't know if we expose it as a tool directly, but I'm sure there's, like, an API for hovering. Like, move mouse to this position. Yeah, yeah, yeah. I think you can trigger hover, like, via the JavaScript on the DOM itself. But, no, I think, like, when we saw computer use, everyone's eyes lit up because they realized, like, wow, AI is going to actually automate work for people. And I think seeing that kind of happen from both of the labs, and I'm sure we're going to see more labs launch computer use models, I'm excited to see all the stuff that people build with it. I think that I'd love to see computer use power, like, controlling a browser on Browserbase. And I think, like, Open Operator, which was our open source version of OpenAI's Operator, was our first take on, like, how can we integrate these models into Browserbase? And we handle the infrastructure and let the labs do the models. I don't have a sense that Operator will be released as an API. I don't know. Maybe it will. I'm curious to see how well that works, because I think it's going to be really hard for a company like OpenAI to do things like support CAPTCHA solving or, like, have proxies. Like, I think it's hard for them structurally. Imagine the New York Times headline: OpenAI solves CAPTCHAs. Like, that would be a pretty bad headline. Browserbase solves CAPTCHAs? No one cares. No one cares. And, like, our investors are bored. Like, we're all okay with this, you know? We're building this company knowing that the CAPTCHA solving is short-lived until we figure out how to authenticate good bots. I think it's really hard for a company like OpenAI, who has this brand that's so, so good, to balance that with, like, the icky parts of web automation, which can be kind of complex to solve.
I'm sure OpenAI knows who to call whenever they need you. Yeah, right. I'm sure they'll have a great partnership.Alessio [00:41:23]: And is Open Operator just, like, a marketing thing for you? Like, how do you think about resource allocation? So, you can spin this up very quickly. And now there's all this, like, open deep research, just open all these things that people are building. We started it, you know. You're the original Open. We're the original Open Operator, you know? Is it just, hey, look, this is a demo, but, like, we'll help you build out an actual product for yourself? Like, are you interested in going more of a product route? That's kind of the OpenAI way, right? They started as a model provider and then…Paul [00:41:53]: Yeah, we're not interested in going the product route yet. I view Open Operator as a reference project, you know? Let's show people how to build these things using the infrastructure and models that are out there. And that's what it is. It's, like, Open Operator is very simple. It's an agent loop. It says, like, take a high-level goal, break it down into steps, use tool calling to accomplish those steps. It takes screenshots and feeds those screenshots into an LLM with the step to generate the right action. It uses Stagehand under the hood to actually execute this action. It doesn't use a computer use model. And it, like, has a nice interface using the live view that we talked about, the iframe, to embed that into an application. So I felt like people on launch day wanted to figure out how to build their own version of this. And we turned that around really quickly to show them. And I hope we do that with other things like deep research. We don't have a deep research launch yet. I think David from AOMNI actually has an amazing open deep research that he launched. It has, like, 10K GitHub stars now. So he's crushing that. But I think if people want to build these features natively into their application, they need good reference projects. And I think Open Operator is a good example of that.swyx [00:42:52]: I don't know. Actually, I'm actually pretty bullish on API-driven Operator. Because that's the only way that you can sort of, like, once it's reliable enough, obviously. And now we're nowhere near. But, like, give it five years. It'll happen, you know. And then you can sort of spin this up and browsers are working in the background and you don't necessarily have to know. And it just is booking restaurants for you, whatever. I can definitely see that future happening. I had this on the landing page here. This might be slightly out of order. But, you know, you have, like, sort of three use cases for Browserbase. Open Operator, or this is the Operator sort of use case. It's kind of like the workflow automation use case. And it competes with UiPath in the sort of RPA category. Would you agree with that? Yeah, I would agree with that. And then there's Agents we talked about already. And web scraping, which I imagine would be the bulk of your workload right now, right?
No, not at all. I'd say actually, like, the majority is browser automation. We're kind of expensive for web scraping. Like, I think that if you're building a web scraping product, if you need to do occasional web scraping or you have to do web scraping that works every single time, you want to use browser automation. Yeah. You want to use Browserbase. But if you're building web scraping workflows, what you should do is have a waterfall. You should have the first request be a curl to the website. See if you can get it without even using a browser. And then the second request may be, like, a scraping-specific API. There are, like, a thousand scraping APIs out there that you can use to try and get data. ScrapingBee. ScrapingBee is a great example, right? Yeah. And then, like, if those two don't work, bring out the heavy hitter. Like, Browserbase will 100% work, right? It will load the page in a real browser, hydrate it. I see.swyx [00:44:21]: Because a lot of pages don't render without JS.swyx [00:44:25]: Yeah, exactly.Paul [00:44:26]: So, I mean, the three big use cases, right? Like, you know, automation, web data collection, and then, you know, if you're building anything agentic that needs, like, a browser tool, you want to use Browserbase.Alessio [00:44:35]: Is there any use case that, like, you were super surprised by, that people might not even think about? Oh, yeah. Or is it, yeah, anything that you can share? The long tail is crazy. Yeah.Surprising use cases of BrowserbasePaul [00:44:44]: One of the case studies on our website that I think is the most interesting is this company called Benny. So, the way that it works is if you're on food stamps in the United States, you can actually get rebates if you buy certain things. Yeah. You buy some vegetables. You submit your receipt to the government. They'll give you a little rebate back. Say, hey, thanks for buying vegetables. It's good for you. That process of submitting that receipt is very painful. And the way Benny works is you use their app to take a photo of your receipt, and then Benny will go submit that receipt for you and then deposit the money into your account. That's actually using no AI at all. It's all, like, hard-coded scripts. They maintain the scripts. They've been doing a great job. And they built this amazing consumer app. But it's an example of, like, all these tedious workflows that people have to do to kind of go about their business. And they're doing it for the sake of their day-to-day lives. And I had never known about, like, food stamp rebates or the complex forms you have to fill out to get them. But the world is powered by millions and millions of tedious forms, visas. You know, Emirate Lighthouse is a customer, right? You know, they do the O-1 visa. Millions and millions of forms are taking away humans' time. And I hope that Browserbase can help power software that automates away the web forms that we don't need anymore. Yeah.swyx [00:45:49]: I mean, I'm very supportive of that. I mean, forms. I do think, like, government itself is a big part of it. I think the government itself should embrace AI more to do more sort of human-friendly form filling. Mm-hmm. But I'm not optimistic. I'm not holding my breath. Yeah. We'll see. Okay. I think I'm about to zoom out. I have a little brief thing on computer use, and then we can talk about founder stuff, which is, I tend to think of developer tooling markets in impossible triangles, where everyone starts in a niche, and then they start to branch out. So I already hinted at a little bit of this, right? We mentioned Morph. We mentioned E2B. We mentioned Firecrawl. And then there's Browserbase. So there's, like, all this stuff of, like, have a serverless virtual computer that you give to an agent and let them do stuff with it. And there are various ways of connecting it to the internet. You can just connect to a search API, like SERP API, whatever, or, like, EXA is another one. That's for searching.
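Circling back to the scraping waterfall Paul outlined a moment ago, a minimal sketch of that fallback chain might look like the following. The scraping-API endpoint is a generic placeholder, not any real vendor's URL, and the length heuristic is an arbitrary assumption:

```typescript
import { chromium } from "playwright";

// Step 1: plain HTTP request. Cheapest option; works for static pages.
async function tryPlainFetch(url: string): Promise<string | null> {
  const res = await fetch(url);
  if (!res.ok) return null;
  const html = await res.text();
  // Crude heuristic: a near-empty JS shell is treated as a miss.
  return html.length > 2000 ? html : null;
}

// Step 2: a scraping-specific API. Placeholder endpoint for illustration only.
async function tryScrapingApi(url: string): Promise<string | null> {
  const res = await fetch(
    `https://scraping-api.example.com/render?url=${encodeURIComponent(url)}`
  );
  return res.ok ? res.text() : null;
}

// Step 3: the heavy hitter. Load the page in a real (possibly remote) browser.
async function tryRealBrowser(url: string): Promise<string> {
  const browser = await chromium.launch(); // or connectOverCDP(...) to a hosted browser
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });
  const html = await page.content(); // fully hydrated DOM
  await browser.close();
  return html;
}

export async function fetchWithWaterfall(url: string): Promise<string> {
  return (await tryPlainFetch(url)) ?? (await tryScrapingApi(url)) ?? tryRealBrowser(url);
}
```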
You can also have a JSON markdown extractor, which is Firecrawl. Or you can have a virtual browser like Browserbase, or you can have a virtual machine like Morph. And then there's also maybe, like, a virtual sort of code environment, like Code Interpreter. So, like, there's just, like, a bunch of different ways to tackle the problem of giving a computer to an agent. And I'm just kind of wondering if you see, like, everyone just happily coexisting in their respective niches, and as a developer, I just go and pick, like, a shopping basket of one of each. Or do you think that eventually people will collide?Future of browser automation and market competitionPaul [00:47:18]: I think that currently it's not a zero-sum market. Like, I think we're talking about... I think we're talking about all of the knowledge work that people do that can be automated online. All of these, like, trillions of hours that happen online where people are working. And I think that there's so much software to be built that, like, I tend not to think about how these companies will collide. I just try to solve the problem as best as I can and make this specific piece of infrastructure, which I think is an important primitive, the best I possibly can. And yeah, I think there are players that are going to launch, like, over-the-top, you know, platforms, like agent platforms that have all these tools built in, right? Like, who's building the Rippling for agent tools that has the search tool, the browser tool, the operating system tool, right? There are some. There are some, right? And I think in the end, what I have seen in my time as a developer, and I look at all the favorite tools that I have, is that, like, for tools and primitives with sufficient levels of complexity, you need to have a solution that's really bespoke to that primitive, you know? And I am sufficiently convinced that the browser is complex enough to deserve a primitive. Obviously, I have to be. I'm the founder of Browserbase, right? I'm talking my book. But, like, I think maybe I can give you one spicy take against, like, maybe just running a whole OS. I think that when I look at computer use when it first came out, I saw that the majority of use cases for computer use were controlling a browser. And do we really need to run an entire operating system just to control a browser? I don't think so. I don't think that's necessary. You know, Browserbase can run browsers for way cheaper than you can if you're running a full-fledged OS with a GUI, you know, operating system. And I think that's just an advantage of the browser. It is, like, browsers are little OSs, and you can run them very efficiently if you orchestrate it well. And I think that allows us to offer 90% of the functionality needed at 10% of the cost of running a full OS. Yeah.Open Operator: Browserbase's Open-Source Alternativeswyx [00:49:16]: I definitely see the logic in that. There's a Marc Andreessen quote. I don't know if you know this one. Where he basically observed that the browser is turning the operating system into a poorly debugged set of device drivers, because most of the apps have moved from the OS to the browser. So you can just run browsers.Paul [00:49:31]: There's a place for OSs, too. Like, I think that there are some applications that only run on Windows operating systems.
And Eric from pig.dev, in this upcoming YC batch, or last YC batch, like, he's building infrastructure that runs tons of Windows operating systems for you to control with your agent. And, like, there are some legacy EHR systems that only run on those old Windows systems. Yeah.Paul [00:49:54]: I think that's it. I think, like, there are use cases for specific operating systems for specific legacy software. And, like, I'm excited to see what he does with that. I just wanted to give a shout out to the pig.dev website.swyx [00:50:06]: The pigs jump when you click on them. Yeah. That's great.Paul [00:50:08]: Eric, he's the former co-founder of banana.dev, too.swyx [00:50:11]: Oh, that Eric. Yeah. That Eric. Okay. Well, he abandoned bananas for pigs. I hope he doesn't start going around with pigs now.Alessio [00:50:18]: Like he was going around with bananas. A little toy pig. Yeah. Yeah. I love that. What else are we missing? I think we covered a lot of, like, the Browserbase product history, but what do you wish people asked you? Yeah.Paul [00:50:29]: I wish people asked me more about, like, what will the future of software look like? Because I think that's really where I've spent a lot of time thinking about why to do Browserbase. Like, for me, starting a company is a means of last resort. Like, you shouldn't start a company unless you absolutely have to. And I remain convinced that the future of software is software where you're going to click a button and it's going to do stuff on your behalf. Right now, with software, you click a button and it maybe, like, calls back an API and, like, computes some numbers. It, like, modifies some text, whatever. But the future of software is software using software. So, I may log into my accounting website for my business, click a button, and it's going to go load up my Gmail, search my emails, find the thing, upload the receipt, and then comment it for me. Right? And it may do that using APIs, maybe a browser. I don't know. I think it's a little bit of both. But that's completely different from how we've built software so far. And that future of software has different infrastructure requirements. It's going to require different UIs. It's going to require different pieces of infrastructure. I think the browser infrastructure is one piece that fits into that, along with all the other categories you mentioned. So, I think that it's going to require developers to think differently about how they've built software for, you know

Machine Learning Street Talk
Clement Bonnet - Can Latent Program Networks Solve Abstract Reasoning?

Machine Learning Street Talk

Play Episode Listen Later Feb 19, 2025 51:26


Clement Bonnet discusses his novel approach to the ARC (Abstraction and Reasoning Corpus) challenge. Unlike approaches that rely on fine-tuning LLMs or generating samples at inference time, Clement's method encodes input-output pairs into a latent space, optimizes this representation with a search algorithm, and decodes outputs for new inputs. This end-to-end architecture uses a VAE loss, including reconstruction and prior losses.

SPONSOR MESSAGES:
***
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting! https://centml.ai/pricing/
Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich. Go to https://tufalabs.ai/
***

TRANSCRIPT + RESEARCH OVERVIEW:
https://www.dropbox.com/scl/fi/j7m0gaz1126y594gswtma/CLEMMLST.pdf?rlkey=y5qvwq2er5nchbcibm07rcfpq&dl=0

Clem and Matthew:
https://www.linkedin.com/in/clement-bonnet16/
https://github.com/clement-bonnet
https://mvmacfarlane.github.io/

TOC
1. LPN Fundamentals
[00:00:00] 1.1 Introduction to ARC Benchmark and LPN Overview
[00:05:05] 1.2 Neural Networks' Challenges with ARC and Program Synthesis
[00:06:55] 1.3 Induction vs Transduction in Machine Learning
2. LPN Architecture and Latent Space
[00:11:50] 2.1 LPN Architecture and Latent Space Implementation
[00:16:25] 2.2 LPN Latent Space Encoding and VAE Architecture
[00:20:25] 2.3 Gradient-Based Search Training Strategy
[00:23:39] 2.4 LPN Model Architecture and Implementation Details
3. Implementation and Scaling
[00:27:34] 3.1 Training Data Generation and re-ARC Framework
[00:31:28] 3.2 Limitations of Latent Space and Multi-Thread Search
[00:34:43] 3.3 Program Composition and Computational Graph Architecture
4. Advanced Concepts and Future Directions
[00:45:09] 4.1 AI Creativity and Program Synthesis Approaches
[00:49:47] 4.2 Scaling and Interpretability in Latent Space Models

REFS
[00:00:05] ARC benchmark, Chollet - https://arxiv.org/abs/2412.04604
[00:02:10] Latent Program Spaces, Bonnet, Macfarlane - https://arxiv.org/abs/2411.08706
[00:07:45] Kevin Ellis's work on program generation - https://www.cs.cornell.edu/~ellisk/
[00:08:45] Induction vs transduction in abstract reasoning, Li et al. - https://arxiv.org/abs/2411.02272
[00:17:40] VAEs, Kingma, Welling - https://arxiv.org/abs/1312.6114
[00:27:50] re-ARC, Hodel - https://github.com/michaelhodel/re-arc
[00:29:40] Grid size in ARC tasks, Chollet - https://github.com/fchollet/ARC-AGI
[00:33:00] Critique of deep learning, Marcus - https://arxiv.org/vc/arxiv/papers/2002/2002.06177v1.pdf
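For readers unfamiliar with the VAE objective mentioned in the episode description, the standard form of that loss (a reconstruction term plus a KL term against the prior) is shown below; the exact weighting and parameterization used in the LPN paper may differ:

```latex
\mathcal{L}(\theta, \phi; x) =
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[ -\log p_\theta(x \mid z) \right]}_{\text{reconstruction loss}}
+ \underbrace{D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)}_{\text{prior (KL) loss}}
```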

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

The free livestreams for AI Engineer Summit are now up! Please hit the bell to help us appease the algo gods. We're also announcing a special Online Track later today.Today's Deep Research episode is our last in our series of AIE Summit preview podcasts - thanks for following along with our OpenAI, Portkey, Pydantic, Bee, and Bret Taylor episodes, and we hope you enjoy the Summit! Catch you on livestream.Everybody's going deep now. Deep Work. Deep Learning. DeepMind. If 2025 is the Year of Agents, then the 2020s are the Decade of Deep.While “LLM-powered Search” is as old as Perplexity and SearchGPT, and open source projects like GPTResearcher and clones like OpenDeepResearch exist, the difference with “Deep Research” products is they are both “agentic” (loosely meaning that an LLM decides the next step in a workflow, usually involving tools) and bundling custom-tuned frontier models (custom tuned o3 and Gemini 1.5 Flash).The reception to OpenAI's Deep Research agent has been nothing short of breathless:"Deep Research is the best public-facing AI product Google has ever released. It's like having a college-educated researcher in your pocket." - Jason Calacanis“I have had [Deep Research] write a number of ten-page papers for me, each of them outstanding. I think of the quality as comparable to having a good PhD-level research assistant, and sending that person away with a task for a week or two, or maybe more. Except Deep Research does the work in five or six minutes.” - Tyler Cowen“Deep Research is one of the best bargains in technology.” - Ben Thompson“my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone.” - sama“Using Deep Research over the past few weeks has been my own personal AGI moment. It takes 10 mins to generate accurate and thorough competitive and market research (with sources) that previously used to take me at least 3 hours.” - OAI employee“It's like a bazooka for the curious mind” - Dan Shipper“Deep research can be seen as a new interface for the internet, in addition to being an incredible agent… This paradigm will be so powerful that in the future, navigating the internet manually via a browser will be "old-school", like performing arithmetic calculations by hand.” - Jason Wei“One notable characteristic of Deep Research is its extreme patience. I think this is rapidly approaching “superhuman patience”. One realization working on this project was that intelligence and patience go really well together.” - HyungWon“I asked it to write a reference Interaction Calculus evaluator in Haskell. A few exchanges later, it gave me a complete file, including a parser, an evaluator, O(1) interactions and everything. The file compiled, and worked on my test inputs. There are some minor issues, but it is mostly correct. So, in about 30 minutes, o3 performed a job that would take me a day or so.” - Victor Taelin“Can confirm OpenAI Deep Research is quite strong. In a few minutes it did what used to take a dozen hours. The implications to knowledge work is going to be quite profound when you just ask an AI Agent to perform full tasks for you and come back with a finished result.” - Aaron Levie“Deep Research is genuinely useful” - Gary MarcusWith the advent of “Deep Research” agents, we are now routinely asking models to go through 100+ websites and generate in-depth reports on any topic. 
The Deep Research revolution has hit the AI scene in the last 2 weeks: * Dec 11th: Gemini Deep Research (today's guest!) rolls out with Gemini Advanced* Feb 2nd: OpenAI releases Deep Research* Feb 3rd: a dozen “Open Deep Research” clones launch* Feb 5th: Gemini 2.0 Flash GA* Feb 15th: Perplexity launches Deep Research * Feb 17th: xAI launches Deep SearchIn today's episode, we welcome Aarush Selvan and Mukund Sridhar, the lead PM and tech lead for Gemini Deep Research, the originators of the entire category. We asked detailed questions from inspiration to implementation, why they had to finetune a special model for it instead of using the standard Gemini model, how to run evals for them, and how to think about the distribution of use cases. (We also have an upcoming Gemini 2 episode with our returning first guest Logan Kilpatrick so stay tuned

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Bundle tickets for AIE Summit NYC have now sold out. You can now sign up for the livestream — where we will be making a big announcement soon. NYC-based readers and Summit attendees should check out the meetups happening around the Summit.2024 was a very challenging year for AI Hardware. After the buzz of CES last January, 2024 was marked by the meteoric rise and even harder fall of AI Wearables companies like Rabbit and Humane, with an assist from a pre-wallpaper-app MKBHD. Even Friend.com, the first to launch in the AI pendant category, and which spurred Rewind AI to rebrand to Limitless and follow in their footsteps, ended up delaying their wearable ship date and launching an experimental website chatbot version. We have been cautiously excited about this category, keeping tabs on most of the top entrants, including Omi and Compass. However, to date the biggest winner still standing from the AI Wearable wars is Bee AI, founded by today's guests Maria and Ethan. Bee is an always on hardware device with beamforming microphones, 7 day battery life and a mute button, that can be worn as a wristwatch or a clip-on pin, backed by an incredible transcription, diarization and very long context memory processing pipeline that helps you to remember your day, your todos, and even perform actions by operating a virtual cloud phone. This is one of the most advanced, production ready, personal AI agents we've ever seen, so we were excited to be their first podcast appearance. We met Bee when we ran the world's first Personal AI meetup in April last year.As a user of Bee (and not an investor! just a friend!) it's genuinely been a joy to use, and we were glad to take advantage of the opportunity to ask hard questions about the privacy and legal/ethical side of things as much as the AI and Hardware engineering side of Bee. We hope you enjoy the episode and tune in next Friday for Bee's first conference talk: Building Perfect Memory.Show Notes* Bee Website* Ethan Sutin, Maria de Lourdes Zollo* Bee @ Personal AI Meetup* Buy Bee with Listener Discount Code!Timestamps* 00:00:00 Introductions and overview of Bee Computer* 00:01:58 Personal context and use cases for Bee* 00:03:02 Origin story of Bee and the founders' background* 00:06:56 Evolution from app to hardware device* 00:09:54 Short-term value proposition for users* 00:12:17 Demo of Bee's functionality* 00:17:54 Hardware form factor considerations* 00:22:22 Privacy concerns and legal considerations* 00:30:57 User adoption and reactions to wearing Bee* 00:35:56 CES experience and hardware manufacturing challenges* 00:41:40 Software pipeline and inference costs* 00:53:38 Technical challenges in real-time processing* 00:57:46 Memory and personal context modeling* 01:02:45 Social aspects and agent-to-agent interactions* 01:04:34 Location sharing and personal data exchange* 01:05:11 Personality analysis capabilities* 01:06:29 Hiring and future of always-on AITranscriptAlessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of SmallAI.swyx [00:00:12]: Hey, and today we are very honored to have in the studio Maria and Ethan from Bee.Maria [00:00:16]: Hi, thank you for having us.swyx [00:00:20]: And you are, I think, the first hardware founders we've had on the podcast. I've been looking to have had a hardware founder, like a wearable hardware, like a wearable hardware founder for a while. I think we're going to have two or three of them this year. 
And you're the ones that I wear every day. So thank you for making Bee. Thank you for all the feedback and the usage. Yeah, you know, I've been a big fan. You are the speaker gift for the Engineering World's Fair. And let's start from the beginning. What is Bee Computer?Ethan [00:00:52]: Bee Computer is a personal AI system. So you can think of it as AI living alongside you in first person. So it can kind of capture your in real life. So with that understanding can help you in significant ways. You know, the obvious one is memory, but that's that's really just the base kind of use case. So recalling and reflective. I know, Swyx, that you you like the idea of journaling, but you don't but still have some some kind of reflective summary of what you experienced in real life. But it's also about just having like the whole context of a human being and understanding, you know, giving the machine the ability to understand, like, what's going on in your life. Your attitudes, your desires, specifics about your preferences, so that not only can it help you with recall, but then anything that you need it to do, it already knows, like, if you think about like somebody who you've worked with or lived with for a long time, they just know kind of without having to ask you what you would want, it's clear that like, that is the future that personal AI, like, it's just going to be very, you know, the AI is just so much more valuable with personal context.Maria [00:01:58]: I will say that one of the things that we are really passionate is really understanding this. Personal context, because we'll make the AI more useful. Think about like a best friend that know you so well. That's one of the things that we are seeing from the user. They're using from a companion standpoint or professional use cases. There are many ways to use B, but companionship and professional are the ones that we are seeing now more.swyx [00:02:22]: Yeah. It feels so dry to talk about use cases. Yeah. Yeah.Maria [00:02:26]: It's like really like investor question. Like, what kind of use case?Ethan [00:02:28]: We're just like, we've been so broken and trained. But I mean, on the base case, it's just like, don't you want your AI to know everything you've said and like everywhere you've been, like, wouldn't you want that?Maria [00:02:40]: Yeah. And don't stay there and repeat every time, like, oh, this is what I like. You already know that. And you do things for me based on that. That's I think is really cool.swyx [00:02:50]: Great. Do you want to jump into a demo? Do you have any other questions?Alessio [00:02:54]: I want to maybe just cover the origin story. Just how did you two meet? What was the was this the first idea you started working on? Was there something else before?Maria [00:03:02]: I can start. So Ethan and I, we know each other from six years now. He had a company called Squad. And before that was called Olabot and was a personal AI. Yeah, I should. So maybe you should start this one. But yeah, that's how I know Ethan. Like he was pivoting from personal AI to Squad. And there was a co-watching with friends product. I had experience working with TikTok and video content. So I had the pivoting and we launched Squad and was really successful. And at the end. The founders decided to sell that to Twitter, now X. So both of us, we joined X. We launched Twitter Spaces. We launched many other products. 
And yeah, till then, we basically continue to work together to the start of B.Ethan [00:03:46]: The interesting thing is like this isn't the first attempt at personal AI. In 2016, when I started my first company, it started out as a personal AI company. This is before Transformers, no BERT even like just RNNs. You couldn't really do any convincing dialogue at all. I met Esther, who was my previous co-founder. We both really interested in the idea of like having a machine kind of model or understand a dynamic human. We wanted to make personal AI. This was like more geared towards because we had obviously much limited tools, more geared towards like younger people. So I don't know if you remember in 2016, there was like a brief chatbot boom. It was way premature, but it was when Zuckerberg went up on F8 and yeah, M and like. Yeah. The messenger platform, people like, oh, bots are going to replace apps. It was like for about six months. And then everybody realized, man, these things are terrible and like they're not replacing apps. But it was at that time that we got excited and we're like, we tried to make this like, oh, teach the AI about you. So it was just an app that you kind of chatted with and it would ask you questions and then like give you some feedback.Maria [00:04:53]: But Hugging Face first version was launched at the same time. Yeah, we started it.Ethan [00:04:56]: We started out the same office as Hugging Face because Betaworks was our investor. So they had to think. They had a thing called Bot Camp. Betaworks is like a really cool VC because they invest in out there things. They're like way ahead of everybody else. And like back then it was they had something called Bot Camp. They took six companies and it was us and Hugging Face. And then I think the other four, I'm pretty sure, are dead. But and Hugging Face was the one that really got, you know, I mean, 30% success rate is pretty good. Yeah. But yeah, when we it was, it was like it was just the two founders. Yeah, they were kind of like an AI company in the beginning. It was a chat app for teenagers. A lot of people don't know that Hugging Face was like, hey, friend, how was school? Let's trade selfies. But then, you know, they built the Transformers library, I believe, to help them make their chat app better. And then they open sourced and it was like it blew up. And like they're like, oh, maybe this is the opportunity. And now they're Hugging Face. But anyway, like we were obsessed with it at that time. But then it was clear that there's some people who really love chatting and like answering questions. But it's like a lot of work, like just to kind of manually.Maria [00:06:00]: Yeah.Ethan [00:06:01]: Teach like all these things about you to an AI.Maria [00:06:04]: Yeah, there were some people that were super passionate, for example, teenagers. They really like, for example, to speak about themselves a lot. So they will reply to a lot of questions and speak about them. But most of the people, they don't really want to spend time.Ethan [00:06:18]: And, you know, it's hard to like really bring the value with it. We had like sentence similarity and stuff and could try and do, but it was like it was premature with the technology at the time. And so we pivoted. We went to YC and the long story, but like we pivoted to consumer video and that kind of went really viral and got a lot of usage quickly. 
And then we ended up selling it to Twitter, worked there and left before Elon, not related to Elon, but left Twitter.swyx [00:06:46]: And then I should mention this is the famous time when well, when when Elon was just came in, this was like Esther was the famous product manager who slept there.Ethan [00:06:56]: My co-founder, my former co-founder, she sleeping bag. She was the sleep where you were. Yeah, yeah, she stayed. We had left by that point.swyx [00:07:03]: She very stayed, she's famous for staying.Ethan [00:07:06]: Yeah, but later, later left or got, I think, laid off, laid off. Yeah, I think the whole product team got laid off. She was a product manager, director. But yeah, like we left before that. And then we're like, oh, my God, things are different now. You know, I think this is we really started working on again right before ChatGPT came out. But we had an app version and we kind of were trying different things around it. And then, you know, ultimately, it was clear that, like, there were some limitations we can go on, like a good question to ask any wearable company is like, why isn't this an app? Yes. Yeah. Because like.Maria [00:07:40]: Because we tried the app at the beginning.Ethan [00:07:43]: Yeah. Like the idea that it could be more of a and B comes from ambient. So like if it was more kind of just around you all the time and less about you having to go open the app and do the effort to, like, enter in data that led us down the path of hardware. Yeah. Because the sensors on this are microphones. So it's capturing and understanding audio. We started actually our first hardware with a vision component, too. And we can talk about why we're not doing that right now. But if you wanted to, like, have a continuous understanding of audio with your phone, it would monopolize your microphone. It would get interrupted by calls and you'd have to remember to turn it on. And like that little bit of friction is actually like a substantial barrier to, like, get your phone. It's like the experience of it just being with you all the time and like living alongside you. And so I think that that's like the key reason it's not an app. And in fact, we do have Apple Watch support. So anybody who has a watch, Apple Watch can use it right away without buying any hardware. Because we worked really hard to make a version for the watch that can run in the background, not super drain your battery. But even with the watch, there's still friction because you have to remember to turn it on and it still gets interrupted if somebody calls you. And you have to remember to. We send a notification, but you still have to go back and turn it on because it's just the way watchOS works.Maria [00:09:04]: One of the things that we are seeing from our Apple Watch users, like I love the Apple Watch integration. One of the things that we are seeing is that people, they start using it from Apple Watch and after a couple of days they buy the B because they just like to wear it.Ethan [00:09:17]: Yeah, we're seeing.Maria [00:09:18]: That's something that like they're learning and it's really cool. Yeah.Ethan [00:09:21]: I mean, I think like fundamentally we like to think that like a personal AI is like the mission. And it's more about like the understanding. Connecting the dots, making use of the data to provide some value. And the hardware is like the ears of the AI. It's not like integrating like the incoming sensor data. And that's really what we focus on. 
And like the hardware is, you know, if we can do it well and have a great experience on the Apple Watch like that, that's just great. I mean, but there's just some platform restrictions that like existing hardware makes it hard to provide that experience. Yeah.Alessio [00:09:54]: What do people do in like two or three days that then convinces them to buy it? They buy the product. This feels like a product where like after you use it for a while, you have enough data to start to get a lot of insights. But it sounds like maybe there's also like a short term.Maria [00:10:07]: From the Apple Watch users, I believe that because every time that you receive a call after, they need to go back to B and open it again. Or for example, every day they need to charge Apple Watch and reminds them to open the app every day. They feel like, okay, maybe this is too much work. I just want to wear the B and just keep it open and that's it. And I don't need to think about it.Ethan [00:10:27]: I think they see the kind of potential of it just from the watch. Because even if you wear it a day, like we send a summary notification at the end of the day about like just key things that happened to you in your day. And like I didn't even think like I'm not like a journaling type person or like because like, oh, I just live the day. Why do I need to like think about it? But like it's actually pretty sometimes I'm surprised how interesting it is to me just to kind of be like, oh, yeah, that and how it kind of fits together. And I think that's like just something people get immediately with the watch. But they're like, oh, I'd like an easier watch. I'd like a better way to do this.swyx [00:10:58]: It's surprising because I only know about the hardware. But I use the watch as like a backup for when I don't have the hardware. I feel like because now you're beamforming and all that, this is significantly better. Yeah, that's the other thing.Ethan [00:11:11]: We have way more control over like the Apple Watch. You're limited in like you can't set the gain. You can't change the sample rate. There's just very limited framework support for doing anything with audio. Whereas if you control it. Then you can kind of optimize it for your use case. The Apple Watch isn't meant to be kind of recording this. And we can talk when we get to the part about audio, why it's so hard. This is like audio on the hardest level because you don't know it has to work in all environments or you try and make it work as best as it can. Like this environment is very great. We're in a studio. But, you know, afterwards at dinner in a restaurant, it's totally different audio environment. And there's a lot of challenges with that. And having really good source audio helps. But then there's a lot more. But with the machine learning that still is, you know, has to be done to try and account because like you can tune something for one environment or another. But it'll make one good and one bad. And like making something that's flexible enough is really challenging.Alessio [00:12:10]: Do we want to do a demo just to set the stage? And then we kind of talk about.Maria [00:12:14]: Yeah, I think we can go like a walkthrough and the prod.Alessio [00:12:17]: Yeah, sure.swyx [00:12:17]: So I think we said I should. So for listeners, we'll be switching to video. That was superimposed on. And to this video, if you want to see it, go to our YouTube, like and subscribe as always. Yeah.Maria [00:12:31]: And by the bee. Yes.swyx [00:12:33]: And by the bee. 
While you wait. While you wait. Exactly. It doesn't take long.Maria [00:12:39]: Maybe you should have a discount code just for the listeners. Sure.swyx [00:12:43]: If you want to offer it, I'll take it. All right. Yeah. Well, discount code Swyx. Oh s**t. Okay. Yeah. There you go.Ethan [00:12:49]: An important thing to mention also is that the hardware is meant to work with the phone. And like, I think, you know, if you, if you look at rabbit or, or humane, they're trying to create like a new hardware platform. We think that the phone's just so dominant and it will be until we have the next generation, which is not going to be for five, you know, maybe some Orion type glasses that are cheap enough and like light enough. Like that's going to take a long time before with the phone rather than trying to just like replace it. So in the app, we have a summary of your days, but at the top, it's kind of what's going on now. And that's updating your phone. It's updating continuously. So right now it's saying, I'm discussing, you know, the development of, you know, personal AI, and that's just kind of the ongoing conversation. And then we give you a readable form. That's like little kind of segments of what's the important parts of the conversations. We do speaker identification, which is really important because you don't want your personal AI thinking you said something and attributing it to you when it was just somebody else in the conversation. So you can also teach it other people's voices. So like if some, you know, somebody close to you, so it can start to understand your relationships a little better. And then we do conversation end pointing, which is kind of like a task that didn't even exist before, like, cause nobody needed to do this. But like if you had somebody's whole day, how do you like break it into logical pieces? And so we use like not just voice activity, but other signals to try and split up because conversations are a little fuzzy. They can like lead into one, can start to the next. So also like the semantic content of it. When a conversation ends, we run it through larger models to try and get a better, you know, sense of the actual, what was said and then summarize it, provide key points. What was the general atmosphere and tone of the conversation and potential action items that might've come of that. But then at the end of the day, we give you like a summary of all your day and where you were and just kind of like a step-by-step walkthrough of what happened and what were the key points. That's kind of just like the base capture layer. So like if you just want to get a kind of glimpse or recall or reflect that's there. But really the key is like all of this is now like being influenced on to generate personal context about you. So we generate key items known to be true about you and that you can, you know, there's a human in the loop aspect is like you can, you have visibility. Right. Into that. And you can, you know, I have a lot of facts about technology because that's basically what I talk about all the time. Right. But I do have some hobbies that show up and then like, how do you put use to this context? So I kind of like measure my day now and just like, what is my token output of the day? You know, like, like as a human, how much information do I produce? And it's kind of measured in tokens and it turns out it's like around 200,000 or so a day. But so in the recall case, we have, um. A chat interface, but the key here is on the recall of it. 
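Ethan's "token output of the day" measurement is easy to reproduce on any transcript you already have. A minimal sketch below, assuming the day's speech has already been transcribed to text; the cl100k_base encoding via tiktoken is an arbitrary choice for illustration, not necessarily what Bee uses internally.

```python
# Count how many tokens a day's worth of transcribed speech amounts to.
import tiktoken

def daily_token_count(transcript_segments: list[str]) -> int:
    enc = tiktoken.get_encoding("cl100k_base")  # illustrative encoding choice
    return sum(len(enc.encode(seg)) for seg in transcript_segments)

# Hypothetical segments standing in for a real day's conversations.
segments = [
    "Morning standup: we discussed the new beamforming firmware and the battery tests.",
    "Lunch conversation about CES follow-ups and the next manufacturing run.",
]
print(daily_token_count(segments))
# A full day of talking lands around the ~200,000-token range Ethan mentions.
```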
Like, you know, how do you, you know, I probably have 50 million tokens of personal context and like how to make sense of that, make it useful. So I can ask simple, like, uh, recall questions, like details about the trip I was on to Taiwan, where recently we're with our manufacturer and, um, in real time, like it will, you know, it has various capabilities such as searching through your, your memories, but then also being able to search the web or look at my calendar, we have integrations with Gmail and calendars. So like connecting the dots between the in real life and the digital life. And, you know, I just asked it about my Taiwan trip and it kind of gives me the, the breakdown of the details, what happened, the issues we had around, you know, certain manufacturing problems and it, and it goes back and references the conversation so I can, I can go back to the source. Yeah.Maria [00:16:46]: Not just the conversation as well, the integrations. So we have as well Gmail and Google calendar. So if there is something there that was useful to have more context, we can see that.Ethan [00:16:56]: So like, and it can, I never use the word agentic cause it's, it's cringe, but like it can search through, you know, if I, if I'm brainstorming about something that spans across, like search through my conversation, search the email, look at the calendar and then depending on what's needed. Then synthesize, you know, something with all that context.Maria [00:17:18]: I love that you did the Spotify wrapped. That was pretty cool. Yeah.Ethan [00:17:22]: Like one thing I did was just like make a Spotify wrap for my 2024, like of my life. You can do that. Yeah, you can.Maria [00:17:28]: Wait. Yeah. I like those crazy.Ethan [00:17:31]: Make a Spotify wrapped for my life in 2024. Yeah. So it's like surprisingly good. Um, it like kind of like game metrics. So it was like you visited three countries, you shipped, you know, XMini, beta. Devices.Maria [00:17:46]: And that's kind of more personal insights and reflection points. Yeah.swyx [00:17:51]: That's fascinating. So that's the demo.Ethan [00:17:54]: Well, we have, we can show something that's in beta. I don't know if we want to do it. I don't know.Maria [00:17:58]: We want to show something. Do it.Ethan [00:18:00]: And then we can kind of fit. Yeah.Maria [00:18:01]: Yeah.Ethan [00:18:02]: So like the, the, the, the vision is also like, not just about like AI being with you in like just passively understanding you through living your experience, but also then like it proactively suggesting things to you. Yeah. Like at the appropriate time. So like not just pool, but, but kind of, it can step in and suggest things to you. So, you know, one integration we have that, uh, is in beta is with WhatsApp. Maria is asking for a recommendation for an Italian restaurant. Would you like me to look up some highly rated Italian restaurants nearby and send her a suggestion?Maria [00:18:34]: So what I did, I just sent to Ethan a message through WhatsApp in his own personal phone. Yeah.Ethan [00:18:41]: So, so basically. B is like watching all my incoming notifications. And if it meets two criteria, like, is it important enough for me to raise a suggestion to the user? And then is there something I could potentially help with? So this is where the actions come into place. So because Maria is my co-founder and because it was like a restaurant recommendation, something that it could probably help with, it proposed that to me. 
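The two-part gate Ethan describes for incoming notifications (is it important enough to surface, and is it something the assistant could actually help with) can be pictured as a small decision function. This is only a sketch under that assumption: in practice both predicates would presumably be model calls conditioned on personal context, and every name below is hypothetical rather than Bee's actual code.

```python
from dataclasses import dataclass

@dataclass
class Notification:
    app: str
    sender: str
    text: str

def is_important(note: Notification) -> bool:
    # Placeholder heuristic; realistically an LLM judgment using the user's
    # relationships and priorities from their personal context.
    return note.sender in {"Maria", "Mom"} or "?" in note.text

def can_help(note: Notification) -> bool:
    # Placeholder: does the request map to an action the agent can execute
    # (search, reply, booking)? Again, realistically a model decision.
    return any(kw in note.text.lower() for kw in ("recommend", "book", "find", "send"))

def should_suggest(note: Notification) -> bool:
    # Both criteria must hold before a proactive suggestion is raised.
    return is_important(note) and can_help(note)

note = Notification("WhatsApp", "Maria", "Can you recommend an Italian restaurant nearby?")
if should_suggest(note):
    print("Raise suggestion: look up Italian restaurants and offer to reply")
```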
And then I can, through either the chat and we have another kind of push to talk walkie talkie style button. It's actually a multi-purpose button to like toggle it on or off, but also if you push to hold, you can talk. So I can say, yes, uh, find one and send it to her on WhatsApp is, uh, an Android cloud phone. So it's, uh, going to be able to, you know, that has access to all my accounts. So we're going to abstract this away and the execution environment is not really important, but like we can go into technically why Android is actually a pretty good one right now. But, you know, it's searching for Italian restaurants, you know, and we don't have to watch this. I could be, you know, have my ear AirPods in and in my pocket, you know, it's going to go to WhatsApp, going to find Maria's thread, send her the response and then, and then let us know. Oh my God.Alessio [00:19:56]: But what's the, I mean, an Italian restaurant. Yeah. What did it choose? What did it choose? It's easy to say. Real Italian is hard to play. Exactly.Ethan [00:20:04]: It's easy to say. So I doubt it. I don't know.swyx [00:20:06]: For the record, since you have the Italians, uh, best Italian restaurant in SF.Maria [00:20:09]: Oh my God. I still don't have one. What? No.Ethan [00:20:14]: I don't know. Successfully found and shared.Alessio [00:20:16]: Let's see. Let's see what the AI says. Bottega. Bottega? I think it's Bottega.Maria [00:20:21]: Have you been to Bottega? How is it?Alessio [00:20:24]: It's fine.Maria [00:20:25]: I've been to one called like Norcina, I think it was good.Alessio [00:20:29]: Bottega is on Valencia Street. It's fine. The pizza is not good.Maria [00:20:32]: It's not good.Alessio [00:20:33]: Some of the pastas are good.Maria [00:20:34]: You know, the people I'm sorry to interrupt. Sorry. But there is like this Delfina. Yeah. That here everybody's like, oh, Pizzeria Delfina is amazing. I'm overrated. This is not. I don't know. That's great. That's great.swyx [00:20:46]: The North Beach Cafe. That place you took us with Michele last time. Vega. Oh.Alessio [00:20:52]: The guy at Vega, Giuseppe, he's Italian. Which one is that? It's in Bernal Heights. Ugh. He's nice. He's not nice. I don't know that one. What's the name of the place? Vega. Vega. Vega. Cool. We got the name. Vega. But it's not Vega.Maria [00:21:02]: It's Italian. What?swyx [00:21:10]: Vega. Vega.Ethan [00:21:40]: We're going to see a lot of innovation around hardware and stuff, but I think the real core is being able to do something useful with the personal context. You always had the ability to capture everything, right? We've always had recorders, camcorders, body cameras, stuff like that. But what's different now is we can actually make sense and find the important parts in all of that context.swyx [00:22:04]: Yeah. So, and then one last thing, I'm just doing this for you, is you also have an API, which I think I'm the first developer against. Because I had to build my own. We need to hire a developer advocate. Or just hire AI engineers. The point is that you should be able to program your own assistant. And I tried OMI, the former friend, the knockoff friend, and then real friend doesn't have an API. And then Limitless also doesn't have an API. So I think it's very important to own your data. To be able to reprocess your audio, maybe. Although, by default, you do not store audio. 
And then also just to do any corrections. There's no way that my needs can be fully met by you. So I think the API is very important.Ethan [00:22:47]: Yeah. And I mean, I've always been a consumer of APIs in all my products.swyx [00:22:53]: We are API enjoyers in this house.Ethan [00:22:55]: Yeah. It's very frustrating when you have to go build a scraper. But yeah, it's for sure. Yeah.swyx [00:23:03]: So this whole combination of you have my location, my calendar, my inbox. It really is, for me, the sort of personal API.Alessio [00:23:10]: And is the API just to write into it or to have it take action on external systems?Ethan [00:23:16]: Yeah, we're expanding it. It's right now read-only. In the future, very soon, when the actions are more generally available, it'll be fully supported in the API.Alessio [00:23:27]: Nice. I'll buy one after the episode.Ethan [00:23:30]: The API thing, to me, is the most interesting. Yeah. We do have real-time APIs, so you can even connect a socket and connect it to whatever you want it to take actions with. Yeah. It's too smart for me.Alessio [00:23:43]: Yeah. I think when I look at these apps, and I mean, there's so many of these products, we launch, it's great that I can go on this app and do things. But most of my work and personal life is managed somewhere else. Yeah. So being able to plug into it. Integrate that. It's nice. I have a bunch of more, maybe, human questions. Sure. I think maybe people might have. One, is it good to have instant replay for any argument that you have? I can imagine arguing with my wife about something. And, you know, there's these commercials now where it's basically like two people arguing, and they're like, they can throw a flag, like in football, and have an instant replay of the conversation. I feel like this is similar, where it's almost like people cannot really argue anymore or, like, lie to each other. Because in a world in which everybody adopts this, I don't know if you thought about it. And also, like, how the lies. You know, all of us tell lies, right? How do you distinguish between when I'm, there's going to be sometimes things that contradict each other, because I might say something publicly, and I might think something, really, that I tell someone else. How do you handle that when you think about building a product like this?Maria [00:24:48]: I would say that I like the fact that B is an objective point of view. So I don't care too much about the lies, but I care more about the fact that can help me to understand what happened. Mm-hmm. And the emotions in a really objective way, like, really, like, critical and objective way. And if you think about humans, they have so many emotions. And sometimes something that happened to me, like, I don't know, I would feel, like, really upset about it or really angry or really emotional. But the AI doesn't have those emotions. It can read the conversation, understand what happened, and be objective. And I think the level of support is the one that I really like more. Instead of, like, oh, did this guy tell me a lie? I feel like that's not exactly, like, what I feel. I find it curious for me in terms of opportunity.Alessio [00:25:35]: Is the B going to interject in real time? Say I'm arguing with somebody. The B is like, hey, look, no, you're wrong. What? That person actually said.Ethan [00:25:43]: The proactivity is something we're very interested in. 
Maybe not for, like, specifically for, like, selling arguments, but more for, like, and I think that a lot of the challenge here is, you know, you need really good reasoning to kind of pull that off. Because you don't want it just constantly interjecting, because that would be super annoying. And you don't want it to miss things that it should be interjecting. So, like, it would be kind of a hard task even for a human to be, like, just come in at the right times when it's appropriate. Like, it would take the, you know, with the personal context, it's going to be a lot better. Because, like, if somebody knows about you, but even still, it requires really good reasoning to, like, not be too much or too little and just right.Maria [00:26:20]: And the second part about, well, like, some things, you know, you say something to somebody else, but after I change my mind, I send something. Like, it's every time I have, like, different type of conversation. And I'm like, oh, I want to know more about you. And I'm like, oh, I want to know more about you. I think that's something that I found really fascinating. One of the things that we are learning is that, indeed, humans, they evolve over time. So, for us, one of the challenges is actually understand, like, is this a real fact? Right. And so far, what we do is we give, you know, to the, we have the human in the loop that can say, like, yes, this is true, this is not. Or they can edit their own fact. For sure, in the future, we want to have all of that automatized inside of the product.Ethan [00:26:57]: But, I mean, I think your question kind of hits on, and I know that we'll talk about privacy, but also just, like, if you have some memory and you want to confirm it with somebody else, that's one thing. But it's for sure going to be true that in the future, like, not even that far into the future, that it's just going to be kind of normalized. And we're kind of in a transitional period now. And I think it's, like, one of the key things that is for us to kind of navigate that and make sure we're, like, thinking of all the consequences. And how to, you know, make the right choices in the way that everything's designed. And so, like, it's more beneficial than it could be harmful. But it's just too valuable for your AI to understand you. And so if it's, like, MetaRay bands or the Google Astra, I think it's just people are going to be more used to it. So people's behaviors and expectations will change. Whether that's, like, you know, something that is going to happen now or in five years, it's probably in that range. And so, like, I think we... We kind of adapt to new technologies all the time. Like, when the Ring cameras came out, that was kind of quite controversial. It's like... But now it's kind of... People just understand that a lot of people have cameras on their doors. And so I think that...Maria [00:28:09]: Yeah, we're in a transitional period for sure.swyx [00:28:12]: I will press on the privacy thing because that is the number one thing that everyone talks about. Obviously, I think in Silicon Valley, people are a little bit more tech-forward, experimental, whatever. But you want to go mainstream. You want to sell to consumers. And we have to worry about this stuff. Baseline question. The hardest version of this is law. There are one-party consent states where this is perfectly legal. Then there are two-party consent states where they're not. 
What have you come around to this on?Ethan [00:28:38]: Yeah, so the EU is a totally different regulatory environment. But in the U.S., it's basically on a state-by-state level. Like, in Nevada, it's single-party. In California, it's two-party. But it's kind of untested. You know, it's different laws, whether it's a phone call, whether it's in person. In a state like California, it's two-party. Like, anytime you're in public, there's no consent comes into play because the expectation of privacy is that you're in public. But we process the audio and nothing is persisted. And then it's summarized with the speaker identification focusing on the user. Now, it's kind of untested on a legal, and I'm not a lawyer, but does that constitute the same as, like, a recording? So, you know, it's kind of a gray area and untested in law right now. I think that the bigger question is, you know, because, like, if you had your Ray-Ban on and were recording, then you have a video of something that happened. And that's different than kind of having, like, an AI give you a summary that's focused on you that's not really capturing anybody's voice. You know, I think the bigger question is, regardless of the legal status, like, what is the ethical kind of situation with that? Because even in Nevada that we're—or many other U.S. states where you can record. Everything. And you don't have to have consent. Is it still, like, the right thing to do? The way we think about it is, is that, you know, we take a lot of precautions to kind of not capture personal information of people around. Both through the speaker identification, through the pipeline, and then the prompts, and the way we store the information to be kind of really focused on the user. Now, we know that's not going to, like, satisfy a lot of people. But I think if you do try it and wear it again. It's very hard for me to see anything, like, if somebody was wearing a bee around me that I would ever object that it captured about me as, like, a third party to it. And like I said, like, we're in this transitional period where the expectation will just be more normalized. That it's, like, an AI. It's not capturing, you know, a full audio recording of what you said. And it's—everything is fully geared towards helping the person kind of understand their state and providing valuable information to them. Not about, like, logging details about people they encounter.Alessio [00:30:57]: You know, I've had the same question also with the Zoom meeting transcribers thing. I think there's kind of, like, the personal impact that there's a Firefly's AI recorder. Yeah. I just know that it's being recorded. It's not like a—I don't know if I'm going to say anything different. But, like, intrinsically, you kind of feel—because it's not pervasive. And I'm curious, especially, like, in your investor meetings. Do people feel differently? Like, have you had people ask you to, like, turn it off? Like, in a business meeting, to not record? I'm curious if you've run into any of these behaviors.Maria [00:31:29]: You know what's funny? On my end, I wear it all the time. I take my coffee, a blue bottle with it. Or I work with it. Like, obviously, I work on it. So, I wear it all the time. And so far, I don't think anybody asked me to turn it off. I'm not sure if because they were really friendly with me that they know that I'm working on it. But nobody really cared.swyx [00:31:48]: It's because you live in SF.Maria [00:31:49]: Actually, I've been in Italy as well. Uh-huh. 
And in Italy, it's a super privacy concern. Like, Europe is a super privacy concern. And again, they're nothing. Like, it's—I don't know. Yeah. That, for me, was interesting.Ethan [00:32:01]: I think—yeah, nobody's ever asked me to turn it off, even after giving them full demos and disclosing. I think that some people have said, well, my—you know, in a personal relationship, my partner initially was, like, kind of uncomfortable about it. We heard that from a few users. And that was, like, more in just, like— It's not like a personal relationship situation. And the other big one is people are like, I do like it, but I cannot wear this at work. I guess. Yeah. Yeah. Because, like, I think I will get in trouble based on policies or, like, you know, if you're wearing it inside a research lab or something where you're working on things that are kind of sensitive that, like—you know, so we're adding certain features like geofencing, just, like, at this location. It's just never active.swyx [00:32:50]: I mean, I've often actually explained to it the other way, where maybe you only want it at work, so you never take it from work. And it's just a work device, just like your Zoom meeting recorder is a work device.Ethan [00:33:09]: Yeah, professionals have been a big early adopter segment. And you say in San Francisco, but we have out there our daily shipment of over 100. If you go look at the addresses, Texas, I think, is our biggest state, and Florida, just the biggest states. A lot of professionals who talk for, and we didn't go out to build it for that use case, but I think there is a lot of demand for white-collar people who talk for a living. And I think we're just starting to talk with them. I think they just want to be able to improve their performance around, understand what they were doing.Alessio [00:33:47]: How do you think about Gong.io? Some of these, for example, sales training thing, where you put on a sales call and then it coaches you. They're more verticalized versus having more horizontal platform.Ethan [00:33:58]: I am not super familiar with those things, because like I said, it was kind of a surprise to us. But I think that those are interesting. I've seen there's a bunch of them now, right? Yeah. It kind of makes sense. I'm terrible at sales, so I could probably use one. But it's not my job, fundamentally. But yeah, I think maybe it's, you know, we heard also people with restaurants, if they're able to understand, if they're doing well.Maria [00:34:26]: Yeah, but in general, I think a lot of people, they like to have the double check of, did I do this well? Or can you suggest me how I can do better? We had a user that was saying to us that he used for interviews. Yeah, he used job interviews. So he used B and after asked to the B, oh, actually, how do you think my interview went? What I should do better? And I like that. And like, oh, that's actually like a personal coach in a way.Alessio [00:34:50]: Yeah. But I guess the question is like, do you want to build all of those use cases? Or do you see B as more like a platform where somebody is going to build like, you know, the sales coach that connects to B so that you're kind of the data feed into it?Ethan [00:35:02]: I don't think this is like a data feed, more like an understanding kind of engine and like definitely. In the future, having third parties to the API and building out for all the different use cases is something that we want to do. 
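The geofencing feature Ethan mentions presumably comes down to muting capture inside user-defined zones. A minimal sketch under that assumption; the coordinates, radius, and function names are made up for illustration and are not Bee's actual implementation.

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in meters."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# (lat, lon, radius_m) zones where capture should never be active, e.g. a research lab.
MUTED_ZONES = [(37.7749, -122.4194, 150.0)]

def capture_allowed(lat: float, lon: float) -> bool:
    # Capture stays off whenever the wearer is inside any muted zone.
    return all(haversine_m(lat, lon, zlat, zlon) > radius for zlat, zlon, radius in MUTED_ZONES)

print(capture_allowed(37.7750, -122.4195))  # inside the zone, so False
```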
But the like initial case we're trying to do is like build that layer for all that to work. And, you know, we're not trying to build all those verticals because no startup could do that well. But I think that it's really been quite fascinating to see, like, you know, I've done consumer for a long time. Consumer is very hard to predict, like, what's going to be. It's going to be like the thing that's the killer feature. And so, I mean, we really believe that it's the future, but we don't know like what exactly like process it will take to really gain mass adoption.swyx [00:35:50]: The killer consumer feature is whatever Nikita Beer does. Yeah. Social app for teens.Ethan [00:35:56]: Yeah, well, I like Nikita, but, you know, he's good at building bootstrap companies and getting them very viral. And then selling them and then they shut down.swyx [00:36:05]: Okay, so you just came back from CES.Maria [00:36:07]: Yeah, crazy. Yeah, tell us. It was my first time in Vegas and first time CES, both of them were overwhelming.swyx [00:36:15]: First of all, did you feel like you had to do it because you're in consumer hardware?Maria [00:36:19]: Then we decided to be there and to have a lot of partners and media meetings, but we didn't have our own booth. So we decided to just keep that. But we decided to be there and have a presence there, even just us and speak with people. It's very hard to stand out. Yeah, I think, you know, it depends what type of booth you have. I think if you can prepare like a really cool booth.Ethan [00:36:41]: Have you been to CES?Maria [00:36:42]: I think it can be pretty cool.Ethan [00:36:43]: It's massive. It's huge. It's like 80,000, 90,000 people across the Venetian and the convention center. And it's, to me, I always wanted to go just like...Maria [00:36:53]: Yeah, you were the one who was like...swyx [00:36:55]: I thought it was your idea.Ethan [00:36:57]: I always wanted to go just as a, like, just as a fan of...Maria [00:37:01]: Yeah, you wanted to go anyways.Ethan [00:37:02]: Because like, growing up, I think CES like kind of peaked for a while and it was like, oh, I want to go. That's where all the cool, like... gadgets, everything. Yeah, now it's like SmartBitch and like, you know, vacuuming the picks up socks. Exactly.Maria [00:37:13]: There are a lot of cool vacuums. Oh, they love it.swyx [00:37:15]: They love the Roombas, the pick up socks.Maria [00:37:16]: And pet tech. Yeah, yeah. And dog stuff.swyx [00:37:20]: Yeah, there's a lot of like robot stuff. New TVs, new cars that never ship. Yeah. Yeah. I'm thinking like last year, this time last year was when Rabbit and Humane launched at CES and Rabbit kind of won CES. And now this year, no wearables except for you guys.Ethan [00:37:32]: It's funny because it's obviously it's AI everything. Yeah. Like every single product. Yeah.Maria [00:37:37]: Toothbrush with AI, vacuums with AI. Yeah. Yeah.Ethan [00:37:41]: We like hair blow, literally a hairdryer with AI. We saw.Maria [00:37:45]: Yeah, that was cool.Ethan [00:37:46]: But I think that like, yeah, we didn't, another kind of difference like around our, like we didn't want to do like a big overhypey promised kind of Rabbit launch. Because I mean, they did, hats off to them, like on the presentation and everything, obviously. But like, you know, we want to let the product kind of speak for itself and like get it out there. And I think we were really happy. We got some very good interest from media and some of the partners there. 
So like it was, I think it was definitely worth going. I would say like if you're in hardware, it's just kind of how you make use of it. Like I think to do it like a big Rabbit style or to have a huge show on there, like you need to plan that six months in advance. And it's very expensive. But like if you, you know, go there, there's everybody's there. All the media is there. There's a lot of some pre-show events that it's just great to talk to people. And the industry also, all the manufacturers, suppliers are there. So we learned about some really cool stuff that we might like. We met with somebody. They have like thermal energy capture. And it's like, oh, could you maybe not need to charge it? Because they have like a thermal that can capture your body heat. And what? Yeah, they're here. They're actually here. And in Palo Alto, they have like a Fitbit thing that you don't have to charge.swyx [00:39:01]: Like on paper, that's the power you can get from that. What's the power draw for this thing?Ethan [00:39:05]: It's more than you could get from the body heat, it turns out. But it's quite small. I don't want to disclose technically. But I think that solar is still, they also have one where it's like this thing could be like the face of it. It's just a solar cell. And like that is more realistic. Or kinetic. Kinetic, apparently, I'm not an expert in this, but they seem to think it wouldn't be enough. Kinetic is quite small, I guess, on the capture.swyx [00:39:33]: Well, I mean, watch. Watchmakers have been powering with kinetic for a long time. Yeah. We don't have to talk about that. I just want to get a sense of CES. Would you do it again? I definitely would not. Okay. You're just a fan of CES. Business point of view doesn't make sense. I happen to be in the conference business, right? So I'm kind of just curious. Yeah.Maria [00:39:49]: So I would say as we did, so without the booth and really like straightforward conversations that were already planned. Three days. That's okay. I think it was okay. Okay. But if you need to invest for a booth that is not. Okay. A good one. Which is how much? I think.Ethan [00:40:06]: 10 by 10 is 5,000. But on top of that, you need to. And then they go like 10 by 10 is like super small. Yeah. And like some companies have, I think would probably be more in like the six figure range to get. And I mean, I think that, yeah, it's very noisy. We heard this, that it's very, very noisy. Like obviously if you're, everything is being launched there and like everything from cars to cell phones are being launched. Yeah. So it's hard to stand out. But like, I think going in with a plan of who you want to talk to, I feel like.Maria [00:40:36]: That was worth it.Ethan [00:40:37]: Worth it. We had a lot of really positive media coverage from it and we got the word out and like, so I think we accomplished what we wanted to do.swyx [00:40:46]: I mean, there's some world in which my conference is kind of the CES of whatever AI becomes. Yeah. I think that.Maria [00:40:52]: Don't do it in Vegas. Don't do it in Vegas. Yeah. Don't do it in Vegas. That's the only thing. I didn't really like Vegas. That's great. Amazing. Those are my favorite ones.Alessio [00:41:02]: You can not fit 90,000 people in SF. That's really duh.Ethan [00:41:05]: You need to do like multiple locations so you can do Moscone and then have one in.swyx [00:41:09]: I mean, that's what Salesforce conferences. Well, GDC is how many? That might be 50,000, right? Okay. Form factor, right? 
Like my way to introduce this idea was that I was at the launch in Solaris. What was the old name of it? Newton. Newton. Of Tab when Avi first launched it. He was like, I thought through everything. Every form factor, pendant is the thing. And then we got the pendants for this original. The first one was just pendants and I took it off and I forgot to put it back on. So you went through pendants, pin, bracelet now, and maybe there's sort of earphones in the future, but what was your iterations?Maria [00:41:49]: So we had, I believe now three or four iterations. And one of the things that we learned is indeed that people don't like the pendant. In particular, woman, you don't want to have like anything here on the chest because it's maybe you have like other necklace or any other stuff.Ethan [00:42:03]: You just ship a premium one that's gold. Yeah. We're talking some fashion reached out to us.Maria [00:42:11]: Some big fashion. There is something there.swyx [00:42:13]: This is where it helps to have an Italian on the team.Maria [00:42:15]: There is like some big Italian luxury. I can't say anything. So yeah, bracelet actually came from the community because they were like, oh, I don't want to wear anything like as necklace or as a pendant. Like it's. And also like the one that we had, I don't know if you remember, like it was like circle, like it was like this and was like really bulky. Like people didn't like it. And also, I mean, I actually, I don't dislike, like we were running fast when we did that. Like our, our thing was like, we wanted to ship them as soon as possible. So we're not overthinking the form factor or the material. We were just want to be out. But after the community organically, basically all of them were like, well, why you don't just don't do the bracelet? Like he's way better. I will just wear it. And that's it. So that's how we ended up with the bracelet, but it's still modular. So I still want to play around the father is modular and you can, you know, take it off and wear it as a clip or in the future, maybe we will bring back the pendant. But I like the fact that there is some personalization and right now we have two colors, yellow and black. Soon we will have other ones. So yeah, we can play a lot around that.Ethan [00:43:25]: I think the form factor. Like the goal is for it to be not super invasive. Right. And something that's easy. So I think in the future, smaller, thinner, not like apple type obsession with thinness, but it does matter like the, the size and weight. And we would love to have more context because that will help, but to make it work, I think it really needs to have good power consumption, good battery life. And, you know, like with the humane swapping the batteries, I have one, I mean, I'm, I'm, I think we've made, and there's like pretty incredible, some of the engineering they did, but like, it wasn't kind of geared towards solving the problem. It was just, it's too heavy. The swappable batteries is too much to man, like the heat, the thermals is like too much to light interface thing. Yeah. Like that. That's cool. It's cool. It's cool. But it's like, if, if you have your handout here, you want to use your phone, like it's not really solving a problem. Cause you know how to use your phone. It's got a brilliant display. You have to kind of learn how to gesture this low range. Yeah. 
It's like a resolution laser, but the laser is cool that the fact they got it working in that thing, even though if it did overheat, but like too heavy, too cumbersome, too complicated with the multiple batteries. So something that's power efficient, kind of thin, both in the physical sense and also in the edge compute kind of way so that it can be as unobtrusive as possible. Yeah.Maria [00:44:47]: Users really like, like, I like when they say yes, I like to wear it and forget about it because I don't need to charge it every single day. On the other version, I believe we had like 35 hours or something, which was okay. But people, they just prefer the seven days battery life and-swyx [00:45:03]: Oh, this is seven days? Yeah. Oh, I've been charging every three days.Maria [00:45:07]: Oh, no, you can like keep it like, yeah, it's like almost seven days.swyx [00:45:11]: The other thing that occurs to me, maybe there's an Apple watch strap so that I don't have to double watch. Yeah.Maria [00:45:17]: That's the other one that, yeah, I thought about it. I saw as well the ones that like, you can like put it like back on the phone. Like, you know- Plog. There is a lot.swyx [00:45:27]: So yeah, there's a competitor called Plog. Yeah. It's not really a competitor. They only transcribe, right? Yeah, they only transcribe. But they're very good at it. Yeah.Ethan [00:45:33]: No, they're great. Their hardware is really good too.swyx [00:45:36]: And they just launched the pin too. Yeah.Ethan [00:45:38]: I think that the MagSafe kind of form factor has a lot of advantages, but some disadvantages. You can definitely put a very huge battery on that, you know? And so like the battery life's not, the power consumption's not so much of a concern, but you know, downside the phone's like in your pocket. And so I think that, you know, form factors will continue to evolve, but, and you know, more sensors, less obtrusive and-Maria [00:46:02]: Yeah. We have a new version.Ethan [00:46:04]: Easier to use.Maria [00:46:05]: Okay.swyx [00:46:05]: Looking forward to that. Yeah. I mean, we'll, whenever we launch this, we'll try to show whatever, but I'm sure you're going to keep iterating. Last thing on hardware, and then we'll go on to the software side, because I think that's where you guys are also really, really strong. Vision. You wanted to talk about why no vision? Yeah.Ethan [00:46:20]: I think it comes down to like when you're, when you're a startup, especially in hardware, you're just, you work within the constraints, right? And so like vision is super useful and super interesting. And what we actually started with, there's two issues with vision that make it like not the place we decided to start. One is power consumption. So you know, you kind of have to trade off your power budget, like capturing even at a low frame rate and transmitting the radio is actually the thing that takes up the majority of the power. So. Yeah. So you would really have to have quite a, like unacceptably, like large and heavy battery to do it continuously all day. We have, I think, novel kind of alternative ways that might allow us to do that. And we have some prototypes. The other issue is form factor. So like even with like a wide field of view, if you're wearing something on your chest, it's going, you know, obviously the wrist is not really that much of an option. And if you're wearing it on your chest, it's, it's often gone. You're going to probably be not capturing like the field of view of what's interesting to you. 
So that leaves you kind of with your head and face. And then anything that goes on, on the face has to look cool. Like I don't know if you remember the spectacles, it was kind of like the first, yeah, but they kind of, they didn't, they were not very successful. And I think one of the reasons is they were, they're so weird looking. Yeah. The camera was so big on the side. And if you look at them at array bands where they're way more successful, they, they look almost indistinguishable from array bands. And they invested a lot into that and they, they have a partnership with Qualcomm to develop custom Silicon. They have a stake in Luxottica now. So like they coming from all the angles, like to make glasses, I think like, you know, I don't know if you know, Brilliant Labs, they're cool company, they make frames, which is kind of like a cool hackable glasses and, and, and like, they're really good, like on hardware, they're really good. But even if you look at the frames, which I would say is like the most advanced kind of startup. Yeah. Yeah. Yeah. There was one that launched at CES, but it's not shipping yet. Like one that you can buy now, it's still not something you'd wear every day and the battery life is super short. So I think just the challenge of doing vision right, like off the bat, like would require quite a bit more resources. And so like audio is such a good entry point and it's also the privacy around audio. If you, if you had images, that's like another huge challenge to overcome. So I think that. Ideally the personal AI would have, you know, all the senses and you know, we'll, we'll get there. Yeah. Okay.swyx [00:48:57]: One last hardware thing. I have to ask this because then we'll move to the software. Were either of you electrical engineering?Ethan [00:49:04]: No, I'm CES. And so I have a, I've taken some EE courses, but I, I had done prior to working on, on the hardware here, like I had done a little bit of like embedded systems, like very little firmware, but we have luckily on the team, somebody with deep experience. Yeah.swyx [00:49:21]: I'm just like, you know, like you have to become hardware people. Yeah.Ethan [00:49:25]: Yeah. I mean, I learned to worry about supply chain power. I think this is like radio.Maria [00:49:30]: There's so many things to learn.Ethan [00:49:32]: I would tell this about hardware, like, and I know it's been said before, but building a prototype and like learning how the electronics work and learning about firmware and developing, this is like, I think fun for a lot of engineers and it's, it's all totally like achievable, especially now, like with, with the tools we have, like stuff you might've been intimidated about. Like, how do I like write this firmware now? With Sonnet, like you can, you can get going and actually see results quickly. But I think going from prototype to actually making something manufactured is a enormous jump. And it's not all about technology, the supply chain, the procurement, the regulations, the cost, the tooling. The thing about software that I'm used to is it's funny that you can make changes all along the way and ship it. But like when you have to buy tooling for an enclosure that's expensive.swyx [00:50:24]: Do you buy your own tooling? You have to.Ethan [00:50:25]: Don't you just subcontract out to someone in China? Oh, no. Do we make the tooling? No, no. 
You have to have CNC and like a bunch of machines.Maria [00:50:31]: Like nobody makes their own tooling, but like you have to design this design and you submitEthan [00:50:36]: it and then they go four to six weeks later. Yeah. And then if there's a problem with it, well, then you're not, you're not making any, any of your enclosures. And so you have to really plan ahead. And like.swyx [00:50:48]: I just want to leave tips for other hardware founders. Like what resources or websites are most helpful in your sort of manufacturing journey?Ethan [00:50:55]: You know, I think it's different depending on like it's hardware so specialized in different ways.Maria [00:51:00]: I will say that, for example, I should choose a manufacturer company. I speak with other founders and like we can give you like some, you know, some tips of who is good and who is not, or like who's specialized in something versus somebody else. Yeah.Ethan [00:51:15]: Like some people are good in plastics. Some people are good.Maria [00:51:18]: I think like for us, it really helped at the beginning to speak with others and understand. Okay. Like who is around. I work in Shenzhen. I lived almost two years in China. I have an idea about like different hardware manufacturer and all of that. Soon I will go back to Shenzhen to check out. So I think it's good also to go in place and check.Ethan [00:51:40]: Yeah, you have to like once you, if you, so we did some stuff domestically and like if you have that ability. The reason I say ability is very expensive, but like to build out some proof of concepts and do field testing before you take it to a manufacturer, despite what people say, there's really good domestic manufacturing for small quantities at extremely high prices. So we got our first PCB and the assembly done in LA. So there's a lot of good because of the defense industry that can do quick churn. So it's like, we need this board. We need to find out if it's working. We have this deadline we want to start, but you need to go through this. And like if you want to have it done and fabricated in a week, they can do it for a price. But I think, you know, everybody's kind of trending even for prototyping now moving that offshore because in China you can do prototyping and get it within almost the same timeline. But the thing is with manufacturing, like it really helps to go there and kind of establish the relationship. Yeah.Alessio [00:52:38]: My first company was a hardware company and we did our PCBs in China and took a long time. Now things are better. But this was, yeah, I don't know, 10 years ago, something like that. Yeah.Ethan [00:52:47]: I think that like the, and I've heard this too, we didn't run into this problem, but like, you know, if it's something where you don't have the relationship, they don't see you, they don't know you, you know, you might get subcontracted out or like they're not paying attention. But like if you're, you know, you have the relationship and a priority, like, yeah, it's really good. We ended up doing the fabrication assembly in Taiwan for various reasons.Maria [00:53:11]: And I think it really helped the fact that you went there at some point. Yeah.Ethan [00:53:15]: We're really happy with the process and, but I mean the whole process of just Choosing the right people. Choosing the right people, but also just sourcing the bill materials and all of that stuff. 
Like, I guess like if you have time, it's not that bad, but if you're trying to like really push the speed at that, it's incredibly stressful. Okay. We got to move to the software. Yeah.Alessio [00:53:38]: Yeah. So the hardware, maybe it's hard for people to understand, but what software people can understand is that running. Transcription and summarization, all of these things in real time every day for 24 hours a day. It's not easy. So you mentioned 200,000 tokens for a day. Yeah. How do you make it basically free to run all of this for the consumer?Ethan [00:53:59]: Well, I think that the pipeline and the inference, like people think about all of these tokens, but as you know, the price of tokens is like dramatically dropping. You guys probably have some charts somewhere that you've posted. We do. And like, if you see that trend in like 250,000 input tokens, it's not really that much, right? Like the output.swyx [00:54:21]: You do several layers. You do live. Yeah.Ethan [00:54:23]: Yeah. So the speech to text is like the most challenging part actually, because you know, it requires like real time processing and then like later processing with a larger model. And one thing that is fairly obvious is that like, you don't need to transcribe things that don't have any voice in it. Right? So good voice activity is key, right? Because like the majority of most people's day is not spent with voice activity. Right? So that is the first step to cutting down the amount of compute you have to do. And voice activity is a fairly cheap thing to do. Very, very cheap thing to do. The models that need to summarize, you don't need a Sonnet level kind of model to summarize. You do need a Sonnet level model to like execute things like the agent. And we will be having a subscription for like features like that because it's, you know, although now with the R1, like we'll see, we haven't evaluated it. A deep seek? Yeah. I mean, not that one in particular, but like, you know, they're already there that can kind of perform at that level. I was like, it's going to stay in six months, but like, yeah. So self-hosted models help in the things where you can. So you are self-hosting models. Yes. You are fine tuning your own ASR. Yes. I will say that I see in the future that everything's trending down. Although like, I think there might be an intermediary step with things to become expensive, which is like, we're really interested because like the pipeline is very tedious and like a lot of tuning. Right. Which is brutal because it's just a lot of trial and error. Whereas like, well, wouldn't it be nice if an end to end model could just do all of this and learn it? If we could do transcription with like an LLM, there's so many advantages to that, but it's going to be a larger model and hence like more compute, you know, we're optim
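To make the voice-activity point above concrete: gating an always-on pipeline with a cheap VAD before any ASR or LLM runs is only a few lines of code. This is an illustrative sketch rather than the team's actual pipeline; it assumes 16 kHz, 16-bit mono PCM and the open-source webrtcvad package, and transcribe() is a placeholder for whatever ASR sits downstream.

```python
# Sketch: only pass audio frames that contain speech to the expensive ASR step.
import webrtcvad

SAMPLE_RATE = 16_000                                # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                                       # frames must be 10, 20 or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit samples -> 2 bytes each

vad = webrtcvad.Vad(2)                              # aggressiveness 0-3

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames the VAD classifies as voiced."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame

def transcribe(frames: list[bytes]) -> str:
    ...  # placeholder: send the kept audio to the ASR model

# Most of a day has no voice activity, so most frames never reach the ASR model.
```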

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B's new workshop and RSVP here!We're happy to announce that today's guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left.If you're a Python developer, it's very likely that you've heard of Pydantic. Every month, it's downloaded >300,000,000 times, making it one of the top 25 PyPi packages. OpenAI uses it in its SDK for structured outputs, it's at the core of FastAPI, and if you've followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”. Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project to a full stack AI engineer platform with Logfire, their observability platform, and PydanticAI, their new agent framework.Logfire: bringing OTEL to AIOpenTelemetry recently merged Semantic Conventions for LLM workloads which provides standard definitions to track performance like gen_ai.server.time_per_output_token. In Sam's view at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platform got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps. If you're interested in the technical details, Logfire migrated away from Clickhouse to Datafusion for their backend. We spent some time on the importance of picking open source tools you understand and that you can actually contribute to upstream, rather than the more popular ones; listen in ~43:19 for that part.Agents are the killer app for graphsPydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it's very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc.They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed.“We were compelled enough by graphs once we got them right that our agent implementation [...] 
is now actually a graph under the hood.”

Why Graphs?

* More natural for complex or multi-step AI workflows.
* Easy to visualize and debug with mermaid diagrams.
* Potential for distributed runs, or “waiting days” between steps in certain flows.

In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously.

Full Video Episode

Like and subscribe!

Chapters

* 00:00:00 Introductions
* 00:00:24 Origins of Pydantic
* 00:05:28 Pydantic's AI moment
* 00:08:05 Why build a new agents framework?
* 00:10:17 Overview of Pydantic AI
* 00:12:33 Becoming a believer in graphs
* 00:24:02 God Model vs Compound AI Systems
* 00:28:13 Why not build an LLM gateway?
* 00:31:39 Programmatic testing vs live evals
* 00:35:51 Using OpenTelemetry for AI traces
* 00:43:19 Why they don't use Clickhouse
* 00:48:34 Competing in the observability space
* 00:50:41 Licensing decisions for Pydantic and LogFire
* 00:51:48 Building Pydantic.run
* 00:55:24 Marimo and the future of Jupyter notebooks
* 00:57:44 London's AI scene

Show Notes

* Sam Colvin
* Pydantic
* Pydantic AI
* Logfire
* Pydantic.run
* Zod
* E2B
* Arize
* Langsmith
* Marimo
* Prefect
* GLA (Google Generative Language API)
* OpenTelemetry
* Jason Liu
* Sebastian Ramirez
* Bogomil Balkansky
* Hood Chatham
* Jeremy Howard
* Andrew Lamb

Transcript

Alessio [00:00:03]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:12]: Good morning. And today we're very excited to have Sam Colvin join us from Pydantic AI. Welcome. Sam, I heard that Pydantic is all we need. Is that true?Samuel [00:00:24]: I would say you might need Pydantic AI and Logfire as well, but it gets you a long way, that's for sure.Swyx [00:00:29]: Pydantic almost basically needs no introduction. It's almost 300 million downloads in December. And obviously, in the previous podcasts and discussions we've had with Jason Liu, he's been a big fan and promoter of Pydantic and AI.Samuel [00:00:45]: Yeah, it's weird because obviously I didn't create Pydantic originally for uses in AI, it predates LLMs. But it's like we've been lucky that it's been picked up by that community and used so widely.Swyx [00:00:58]: Actually, maybe we'll hear it. Right from you, what is Pydantic and maybe a little bit of the origin story?Samuel [00:01:04]: The best name for it, which is not quite right, is a validation library. And we get some tension around that name because it doesn't just do validation, it will do coercion by default. We now have strict mode, so you can disable that coercion. But by default, if you say you want an integer field and you get in a string of 1, 2, 3, it will convert it to 123 and a bunch of other sensible conversions. And as you can imagine, the semantics around it. Exactly when you convert and when you don't, it's complicated, but because of that, it's more than just validation. Back in 2017, when I first started it, the different thing it was doing was using type hints to define your schema. That was controversial at the time. It was genuinely disapproved of by some people. I think the success of Pydantic and libraries like FastAPI that build on top of it means that today that's no longer controversial in Python. And indeed, lots of other people have copied that route, but yeah, it's a data validation library.
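As a concrete sketch of the coercion-by-default versus strict-mode behaviour described here, using the Pydantic v2 API (the Order model is made up):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class Order(BaseModel):
    id: int
    total: float

# Lax (default) mode: the strings are coerced into an int and a float.
print(Order(id="123", total="9.99"))  # id=123 total=9.99

class StrictOrder(BaseModel):
    model_config = ConfigDict(strict=True)
    id: int
    total: float

try:
    StrictOrder(id="123", total=9.99)
except ValidationError as exc:
    print(exc)  # strict mode refuses the str -> int coercion
```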
It uses type hints for the for the most part and obviously does all the other stuff you want, like serialization on top of that. But yeah, that's the core.Alessio [00:02:06]: Do you have any fun stories on how JSON schemas ended up being kind of like the structure output standard for LLMs? And were you involved in any of these discussions? Because I know OpenAI was, you know, one of the early adopters. So did they reach out to you? Was there kind of like a structure output console in open source that people were talking about or was it just a random?Samuel [00:02:26]: No, very much not. So I originally. Didn't implement JSON schema inside Pydantic and then Sebastian, Sebastian Ramirez, FastAPI came along and like the first I ever heard of him was over a weekend. I got like 50 emails from him or 50 like emails as he was committing to Pydantic, adding JSON schema long pre version one. So the reason it was added was for OpenAPI, which is obviously closely akin to JSON schema. And then, yeah, I don't know why it was JSON that got picked up and used by OpenAI. It was obviously very convenient for us. That's because it meant that not only can you do the validation, but because Pydantic will generate you the JSON schema, it will it kind of can be one source of source of truth for structured outputs and tools.Swyx [00:03:09]: Before we dive in further on the on the AI side of things, something I'm mildly curious about, obviously, there's Zod in JavaScript land. Every now and then there is a new sort of in vogue validation library that that takes over for quite a few years and then maybe like some something else comes along. Is Pydantic? Is it done like the core Pydantic?Samuel [00:03:30]: I've just come off a call where we were redesigning some of the internal bits. There will be a v3 at some point, which will not break people's code half as much as v2 as in v2 was the was the massive rewrite into Rust, but also fixing all the stuff that was broken back from like version zero point something that we didn't fix in v1 because it was a side project. We have plans to move some of the basically store the data in Rust types after validation. Not completely. So we're still working to design the Pythonic version of it, in order for it to be able to convert into Python types. So then if you were doing like validation and then serialization, you would never have to go via a Python type we reckon that can give us somewhere between three and five times another three to five times speed up. That's probably the biggest thing. Also, like changing how easy it is to basically extend Pydantic and define how particular types, like for example, NumPy arrays are validated and serialized. But there's also stuff going on. And for example, Jitter, the JSON library in Rust that does the JSON parsing, has SIMD implementation at the moment only for AMD64. So we can add that. We need to go and add SIMD for other instruction sets. So there's a bunch more we can do on performance. I don't think we're going to go and revolutionize Pydantic, but it's going to continue to get faster, continue, hopefully, to allow people to do more advanced things. We might add a binary format like CBOR for serialization for when you'll just want to put the data into a database and probably load it again from Pydantic. So there are some things that will come along, but for the most part, it should just get faster and cleaner.Alessio [00:05:04]: From a focus perspective, I guess, as a founder too, how did you think about the AI interest rising? 
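This JSON schema support is also why one Pydantic model can be the single source of truth for structured outputs and tools: it emits the schema you hand to the model provider and then validates whatever JSON comes back. A small sketch with made-up field names:

```python
import json
from pydantic import BaseModel

class CityInfo(BaseModel):
    city: str
    country: str
    population: int

# 1) The JSON schema to include in a structured-output or tool definition.
print(json.dumps(CityInfo.model_json_schema(), indent=2))

# 2) Validate (and coerce) the raw JSON the model sends back.
reply = '{"city": "London", "country": "UK", "population": "8982000"}'
info = CityInfo.model_validate_json(reply)
print(info.population)  # 8982000, now an int
```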
And then how do you kind of prioritize, okay, this is worth going into more, and we'll talk about Pydantic AI and all of that. What was maybe your early experience with LLMs, and when did you figure out, okay, this is something we should take seriously and focus more resources on?Samuel [00:05:28]: I'll answer that, but I'll answer what I think is a kind of parallel question, which is Pydantic's weird, because Pydantic existed, obviously, before I was starting a company. I was working on it in my spare time, and then beginning of '22, I started working on the rewrite in Rust. And I worked on it full-time for a year and a half, and then once we started the company, people came and joined. And it was a weird project, because you could never get it signed off inside a startup. Like, we're going to go off and three engineers are going to work full-on for a year in Python and Rust, writing like 30,000 lines of Rust, just to release a free, open-source Python library. The result of that has been excellent for us as a company, right? As in, it's made us remain entirely relevant. And it's like, Pydantic is not just used in the SDKs of all of the AI libraries, but, and I can't say which one, but one of the big foundational model companies, when they upgraded from Pydantic v1 to v2, their number one internal metric of model performance is time to first token. That went down by 20%. So you think about all of the actual AI going on inside, and yet at least 20% of the CPU, or at least the latency inside requests, was actually Pydantic, which shows like how widely it's used. So we've benefited from doing that work, although it would never have made financial sense in most companies. In answer to your question about like, how do we prioritize AI, I mean, the honest truth is we've spent a lot of the last year and a half building good general purpose observability inside LogFire and making Pydantic good for general purpose use cases. And the AI has kind of come to us. Like we just, not that we want to get away from it, but like the appetite, both in Pydantic and in LogFire, to go and build with AI is enormous because it kind of makes sense, right? Like if you're starting a new greenfield project in Python today, what's the chance that you're using GenAI? 80%, let's say, globally; obviously it's like a hundred percent in California, but even worldwide, it's probably 80%. Yeah. And so everyone needs that stuff. And there's so much yet to be figured out, so much space to do things better in the ecosystem. To go and implement a database that's better than Postgres is like a Sisyphean task, whereas building tools that are better for GenAI than some of the stuff that's about now is not very difficult. Putting the actual models themselves to one side.Alessio [00:07:40]: And then at the same time, you released Pydantic AI recently, which is, you know, an agent framework. And early on, I would say everybody, you know, Langchain and a lot of these frameworks, gave Pydantic kind of like first class support, they were trying to use you to be better. What was the decision behind we should do our own framework? Were there any design decisions that you disagreed with, any workloads that you think people didn't support well?Samuel [00:08:05]: It wasn't so much like design and workflow, although I think there were some, some things we've done differently. Yeah.
I think looking in general at the ecosystem of agent frameworks, the engineering quality is far below that of the rest of the Python ecosystem. There's a bunch of stuff that we have learned how to do over the last 20 years of building Python libraries and writing Python code that seems to be abandoned by people when they build agent frameworks. Now I can kind of respect that, particularly in the very first agent frameworks, like Langchain, where they were literally figuring out how to go and do this stuff. It's completely understandable that you would like basically skip some stuff.Samuel [00:08:42]: I'm shocked by the like quality of some of the agent frameworks that have come out recently from like well-respected names, which it just seems to be opportunism and I have little time for that, but like the early ones, like I think they were just figuring out how to do stuff and just as lots of people have learned from Pydantic, we were able to learn a bit from them. I think from like the gap we saw and the thing we were frustrated by was the production readiness. And that means things like type checking, even if type checking makes it hard. Like Pydantic AI, I will put my hand up now and say it has a lot of generics and you need to, it's probably easier to use it if you've written a bit of Rust and you really understand generics, but like, and that is, we're not claiming that that makes it the easiest thing to use in all cases, we think it makes it good for production applications in big systems where type checking is a no-brainer in Python. But there are also a bunch of stuff we've learned from maintaining Pydantic over the years that we've gone and done. So every single example in Pydantic AI's documentation is run on Python. As part of tests and every single print output within an example is checked during tests. So it will always be up to date. And then a bunch of things that, like I say, are standard best practice within the rest of the Python ecosystem, but I'm not followed surprisingly by some AI libraries like coverage, linting, type checking, et cetera, et cetera, where I think these are no-brainers, but like weirdly they're not followed by some of the other libraries.Alessio [00:10:04]: And can you just give an overview of the framework itself? I think there's kind of like the. LLM calling frameworks, there are the multi-agent frameworks, there's the workflow frameworks, like what does Pydantic AI do?Samuel [00:10:17]: I glaze over a bit when I hear all of the different sorts of frameworks, but I like, and I will tell you when I built Pydantic, when I built Logfire and when I built Pydantic AI, my methodology is not to go and like research and review all of the other things. I kind of work out what I want and I go and build it and then feedback comes and we adjust. So the fundamental building block of Pydantic AI is agents. The exact definition of agents and how you want to define them. is obviously ambiguous and our things are probably sort of agent-lit, not that we would want to go and rename them to agent-lit, but like the point is you probably build them together to build something and most people will call an agent. So an agent in our case has, you know, things like a prompt, like system prompt and some tools and a structured return type if you want it, that covers the vast majority of cases. There are situations where you want to go further and the most complex workflows where you want graphs and I resisted graphs for quite a while. 
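Roughly what that "system prompt plus tools plus structured return type" container looks like in Pydantic AI's early public API; the model name, tool, and result type below are illustrative rather than taken from the conversation:

```python
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

class Answer(BaseModel):
    city: str
    summary: str

agent = Agent(
    "openai:gpt-4o",
    system_prompt="Be concise and always name the city you are talking about.",
    result_type=Answer,          # structured, validated output
)

@agent.tool
async def get_weather(ctx: RunContext[None], city: str) -> str:
    """Illustrative tool; a real one would call a weather API."""
    return f"It is 18C and cloudy in {city}."

result = agent.run_sync("How is the weather in London?")
print(result.data)               # -> Answer(city='London', summary='...')
```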
I was sort of of the opinion you didn't need them and you could use standard like Python flow control to do all of that stuff. I had a few arguments with people, but I basically came around to, yeah, I can totally see why graphs are useful. But then we have the problem that by default, they're not type safe because if you have a like add edge method where you give the names of two different edges, there's no type checking, right? Even if you go and do some, I'm not, not all the graph libraries are AI specific. So there's a, there's a graph library called, but it allows, it does like a basic runtime type checking. Ironically using Pydantic to try and make up for the fact that like fundamentally that graphs are not typed type safe. Well, I like Pydantic, but it did, that's not a real solution to have to go and run the code to see if it's safe. There's a reason that starting type checking is so powerful. And so we kind of, from a lot of iteration eventually came up with a system of using normally data classes to define nodes where you return the next node you want to call and where we're able to go and introspect the return type of a node to basically build the graph. And so the graph is. Yeah. Inherently type safe. And once we got that right, I, I wasn't, I'm incredibly excited about graphs. I think there's like masses of use cases for them, both in gen AI and other development, but also software's all going to have interact with gen AI, right? It's going to be like web. There's no longer be like a web department in a company is that there's just like all the developers are building for web building with databases. The same is going to be true for gen AI.Alessio [00:12:33]: Yeah. I see on your docs, you call an agent, a container that contains a system prompt function. Tools, structure, result, dependency type model, and then model settings. Are the graphs in your mind, different agents? Are they different prompts for the same agent? What are like the structures in your mind?Samuel [00:12:52]: So we were compelled enough by graphs once we got them right, that we actually merged the PR this morning. That means our agent implementation without changing its API at all is now actually a graph under the hood as it is built using our graph library. So graphs are basically a lower level tool that allow you to build these complex workflows. Our agents are technically one of the many graphs you could go and build. And we just happened to build that one for you because it's a very common, commonplace one. But obviously there are cases where you need more complex workflows where the current agent assumptions don't work. And that's where you can then go and use graphs to build more complex things.Swyx [00:13:29]: You said you were cynical about graphs. What changed your mind specifically?Samuel [00:13:33]: I guess people kept giving me examples of things that they wanted to use graphs for. And my like, yeah, but you could do that in standard flow control in Python became a like less and less compelling argument to me because I've maintained those systems that end up with like spaghetti code. And I could see the appeal of this like structured way of defining the workflow of my code. And it's really neat that like just from your code, just from your type hints, you can get out a mermaid diagram that defines exactly what can go and happen.Swyx [00:14:00]: Right. Yeah. You do have very neat implementation of sort of inferring the graph from type hints, I guess. Yeah. Is what I would call it. Yeah. 
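The "nodes are dataclasses and the edges fall out of the return type hints" idea can be shown without the library itself. This is a from-scratch sketch of the pattern, not the pydantic-graph API; the node names are invented:

```python
from __future__ import annotations
import typing
from dataclasses import dataclass

class End:
    """Terminal marker: returning this ends the run."""

@dataclass
class Draft:
    topic: str
    def run(self) -> Review:
        return Review(text=f"a draft about {self.topic}")

@dataclass
class Review:
    text: str
    def run(self) -> Draft | End:
        return End() if len(self.text) > 10 else Draft(topic=self.text)

def edges(node_cls: type) -> list[str]:
    """Read run()'s return annotation to discover which nodes can follow."""
    ret = typing.get_type_hints(node_cls.run)["return"]
    targets = typing.get_args(ret) or (ret,)
    return [t.__name__ for t in targets]

# A mermaid diagram inferred purely from the type hints.
print("flowchart TD")
for cls in (Draft, Review):
    for target in edges(cls):
        print(f"  {cls.__name__} --> {target}")
```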
I think the question always is... I have gone back and forth. I used to work at Temporal where we would actually spend a lot of time complaining about graph-based workflow solutions like AWS step functions. And we would actually say that we were better because you could use normal control flow that you already knew and worked with. Yours, I guess, is like a little bit of a nice compromise. Like it looks like normal Pythonic code. But you just have to keep in mind what the type hints actually mean. And that's what we do with the quote unquote magic that the graph construction does.Samuel [00:14:42]: Yeah, exactly. And if you look at the internal logic of actually running a graph, it's incredibly simple. It's basically call a node, get a node back, call that node, get a node back, call that node. If you get an End, you're done. We will add in soon support for, well, basically storage so that you can store the state between each node that's run. And then the idea is you can then distribute the graph and run it across computers. And also, I mean, the other bit that's really valuable is across time. Because it's all very well if you look at like lots of the graph examples that like Claude will give you. If it gives you an example, it gives you this lovely enormous mermaid chart of like the workflow, for example, managing returns if you're an e-commerce company. But what you realize is some of those lines are literally one function calling another function. And some of those lines are wait six days for the customer to print their like piece of paper and put it in the post. And if you're writing like your demo project or your proof of concept, that's fine because you can just say, and now we call this function. But when you're building in real life, that doesn't work. And now how do we manage that concept to basically be able to start somewhere else in our code? Well, this graph implementation makes it incredibly easy because you just pass the node that is the start point for carrying on the graph and it continues to run. So it's things like that where I was like, yeah, I can just imagine how things I've done in the past would be fundamentally easier to understand if we had done them with graphs.Swyx [00:16:07]: You say imagine, but like right now, can Pydantic AI actually resume, you know, six days later, like you said, or is this just like a theoretical thing we can get to someday?Samuel [00:16:16]: I think it's basically Q&A. So there's an AI that's asking the user a question and effectively you then call the CLI again to continue the conversation. And it basically instantiates the node and calls the graph with that node again. Now, we don't have the logic yet for effectively storing state in the database between individual nodes; that we're going to add soon. But like the rest of it is basically there.Swyx [00:16:37]: It does make me think that not only are you competing with Langchain now and obviously Instructor, but now you're going into sort of the more like orchestrated things like Airflow, Prefect, Dagster, those guys.Samuel [00:16:52]: Yeah, I mean, we're good friends with the Prefect guys and Temporal have the same investors as us. And I'm sure that my investor Bogomil would not be too happy if I was like, oh, yeah, by the way, as well as trying to take on Datadog, we're also going off and trying to take on Temporal and everyone else doing that. Obviously, we're not doing all of the infrastructure of deploying that right yet, at least.
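The "pass the node back in and carry on days later" idea is easiest to see in a from-scratch sketch too: because the pending node is plain data, it can be persisted and the run resumed from exactly that point. Again, the class names are invented and this is not the pydantic-graph API:

```python
import pickle
from dataclasses import dataclass

class End:
    def __init__(self, value: str):
        self.value = value

@dataclass
class AskCustomer:
    order_id: int
    def run(self):
        # In real life the flow stops here and waits days for a reply.
        return WaitForReply(order_id=self.order_id)

@dataclass
class WaitForReply:
    order_id: int
    def run(self):
        return End(f"order {self.order_id} resolved")

def run_graph(node):
    """Call a node, get a node back, repeat until End."""
    while not isinstance(node, End):
        node = node.run()
    return node.value

# Step 1: start the flow, then persist the pending node and exit.
pending = AskCustomer(order_id=42).run()
saved = pickle.dumps(pending)          # could equally go into a database row

# Step 2, days later: load the node and continue the same run from there.
print(run_graph(pickle.loads(saved)))  # -> "order 42 resolved"
```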
We're, you know, we're just building a Python library. And like what's crazy about our graph implementation is, sure, there's a bit of magic in like introspecting the return type, you know, extracting things from unions, stuff like that. But like the actual calls, as I say, is literally call a function and get back a thing and call that. It's like incredibly simple and therefore easy to maintain. The question is, how useful is it? Well, I don't know yet. I think we have to go and find out. We have a whole. We've had a slew of people joining our Slack over the last few days and saying, tell me how good Pydantic AI is. How good is Pydantic AI versus Langchain? And I refuse to answer. That's your job to go and find that out. Not mine. We built a thing. I'm compelled by it, but I'm obviously biased. The ecosystem will work out what the useful tools are.Swyx [00:17:52]: Bogomol was my board member when I was at Temporal. And I think I think just generally also having been a workflow engine investor and participant in this space, it's a big space. Like everyone needs different functions. I think the one thing that I would say like yours, you know, as a library, you don't have that much control of it over the infrastructure. I do like the idea that each new agents or whatever or unit of work, whatever you call that should spin up in this sort of isolated boundaries. Whereas yours, I think around everything runs in the same process. But you ideally want to sort of spin out its own little container of things.Samuel [00:18:30]: I agree with you a hundred percent. And we will. It would work now. Right. As in theory, you're just like as long as you can serialize the calls to the next node, you just have to all of the different containers basically have to have the same the same code. I mean, I'm super excited about Cloudflare workers running Python and being able to install dependencies. And if Cloudflare could only give me my invitation to the private beta of that, we would be exploring that right now because I'm super excited about that as a like compute level for some of this stuff where exactly what you're saying, basically. You can run everything as an individual. Like worker function and distribute it. And it's resilient to failure, et cetera, et cetera.Swyx [00:19:08]: And it spins up like a thousand instances simultaneously. You know, you want it to be sort of truly serverless at once. Actually, I know we have some Cloudflare friends who are listening, so hopefully they'll get in front of the line. Especially.Samuel [00:19:19]: I was in Cloudflare's office last week shouting at them about other things that frustrate me. I have a love-hate relationship with Cloudflare. Their tech is awesome. But because I use it the whole time, I then get frustrated. So, yeah, I'm sure I will. I will. I will get there soon.Swyx [00:19:32]: There's a side tangent on Cloudflare. Is Python supported at full? I actually wasn't fully aware of what the status of that thing is.Samuel [00:19:39]: Yeah. So Pyodide, which is Python running inside the browser in scripting, is supported now by Cloudflare. They basically, they're having some struggles working out how to manage, ironically, dependencies that have binaries, in particular, Pydantic. Because these workers where you can have thousands of them on a given metal machine, you don't want to have a difference. You basically want to be able to have a share. Shared memory for all the different Pydantic installations, effectively. That's the thing they work out. 
They're working out. But Hood, who's my friend, who is the primary maintainer of Pyodide, works for Cloudflare. And that's basically what he's doing, is working out how to get Python running on Cloudflare's network.Swyx [00:20:19]: I mean, the nice thing is that your binary is really written in Rust, right? Yeah. Which also compiles the WebAssembly. Yeah. So maybe there's a way that you'd build... You have just a different build of Pydantic and that ships with whatever your distro for Cloudflare workers is.Samuel [00:20:36]: Yes, that's exactly what... So Pyodide has builds for Pydantic Core and for things like NumPy and basically all of the popular binary libraries. Yeah. It's just basic. And you're doing exactly that, right? You're using Rust to compile the WebAssembly and then you're calling that shared library from Python. And it's unbelievably complicated, but it works. Okay.Swyx [00:20:57]: Staying on graphs a little bit more, and then I wanted to go to some of the other features that you have in Pydantic AI. I see in your docs, there are sort of four levels of agents. There's single agents, there's agent delegation, programmatic agent handoff. That seems to be what OpenAI swarms would be like. And then the last one, graph-based control flow. Would you say that those are sort of the mental hierarchy of how these things go?Samuel [00:21:21]: Yeah, roughly. Okay.Swyx [00:21:22]: You had some expression around OpenAI swarms. Well.Samuel [00:21:25]: And indeed, OpenAI have got in touch with me and basically, maybe I'm not supposed to say this, but basically said that Pydantic AI looks like what swarms would become if it was production ready. So, yeah. I mean, like, yeah, which makes sense. Awesome. Yeah. I mean, in fact, it was specifically saying, how can we give people the same feeling that they were getting from swarms that led us to go and implement graphs? Because my, like, just call the next agent with Python code was not a satisfactory answer to people. So it was like, okay, we've got to go and have a better answer for that. It's not like, let us to get to graphs. Yeah.Swyx [00:21:56]: I mean, it's a minimal viable graph in some sense. What are the shapes of graphs that people should know? So the way that I would phrase this is I think Anthropic did a very good public service and also kind of surprisingly influential blog post, I would say, when they wrote Building Effective Agents. We actually have the authors coming to speak at my conference in New York, which I think you're giving a workshop at. Yeah.Samuel [00:22:24]: I'm trying to work it out. But yes, I think so.Swyx [00:22:26]: Tell me if you're not. yeah, I mean, like, that was the first, I think, authoritative view of, like, what kinds of graphs exist in agents and let's give each of them a name so that everyone is on the same page. So I'm just kind of curious if you have community names or top five patterns of graphs.Samuel [00:22:44]: I don't have top five patterns of graphs. I would love to see what people are building with them. But like, it's been it's only been a couple of weeks. And of course, there's a point is that. Because they're relatively unopinionated about what you can go and do with them. They don't suit them. Like, you can go and do lots of lots of things with them, but they don't have the structure to go and have like specific names as much as perhaps like some other systems do. 
I think what our agents are, which have a name and I can't remember what it is, but this basically system of like, decide what tool to call, go back to the center, decide what tool to call, go back to the center and then exit. One form of graph, which, as I say, like our agents are effectively one implementation of a graph, which is why under the hood they are now using graphs. And it'll be interesting to see over the next few years whether we end up with these like predefined graph names or graph structures or whether it's just like, yep, I built a graph or whether graphs just turn out not to match people's mental image of what they want and die away. We'll see.Swyx [00:23:38]: I think there is always appeal. Every developer eventually gets graph religion and goes, oh, yeah, everything's a graph. And then they probably over rotate and go go too far into graphs. And then they have to learn a whole bunch of DSLs. And then they're like, actually, I didn't need that. I need this. And they scale back a little bit.Samuel [00:23:55]: I'm at the beginning of that process. I'm currently a graph maximalist, although I haven't actually put any into production yet. But yeah.Swyx [00:24:02]: This has a lot of philosophical connections with other work coming out of UC Berkeley on compounding AI systems. I don't know if you know of or care. This is the Gartner world of things where they need some kind of industry terminology to sell it to enterprises. I don't know if you know about any of that.Samuel [00:24:24]: I haven't. I probably should. I should probably do it because I should probably get better at selling to enterprises. But no, no, I don't. Not right now.Swyx [00:24:29]: This is really the argument is that instead of putting everything in one model, you have more control and more maybe observability to if you break everything out into composing little models and changing them together. And obviously, then you need an orchestration framework to do that. Yeah.Samuel [00:24:47]: And it makes complete sense. And one of the things we've seen with agents is they work well when they work well. But when they. Even if you have the observability through log five that you can see what was going on, if you don't have a nice hook point to say, hang on, this is all gone wrong. You have a relatively blunt instrument of basically erroring when you exceed some kind of limit. But like what you need to be able to do is effectively iterate through these runs so that you can have your own control flow where you're like, OK, we've gone too far. And that's where one of the neat things about our graph implementation is you can basically call next in a loop rather than just running the full graph. And therefore, you have this opportunity to to break out of it. But yeah, basically, it's the same point, which is like if you have two bigger unit of work to some extent, whether or not it involves gen AI. But obviously, it's particularly problematic in gen AI. You only find out afterwards when you've spent quite a lot of time and or money when it's gone off and done done the wrong thing.Swyx [00:25:39]: Oh, drop on this. We're not going to resolve this here, but I'll drop this and then we can move on to the next thing. This is the common way that we we developers talk about this. And then the machine learning researchers look at us. And laugh and say, that's cute. And then they just train a bigger model and they wipe us out in the next training run. 
So I think there's a certain amount of we are fighting the bitter lesson here. We're fighting AGI. And, you know, when AGI arrives, this will all go away. Obviously, on Latent Space, we don't really discuss that because I think AGI is kind of this hand wavy concept that isn't super relevant. But I think we have to respect that. For example, you could do a chain of thought with graphs and you could manually orchestrate a nice little graph that does like reflect, think about whether you need more inference-time compute, you know, that's the hot term now, and then think again and, you know, scale that up. Or you could train Strawberry and DeepSeek R1. Right.Samuel [00:26:32]: I saw someone saying recently, oh, they were really optimistic about agents because models are getting faster exponentially. And it took a certain amount of self-control not to point out that it isn't exponential. But my main point was: if models are getting faster as quickly as you say they are, then we don't need agents and we don't really need any of these abstraction layers. We can just give our model, you know, access to the Internet, cross our fingers and hope for the best. Agents, agent frameworks, graphs, all of this stuff is basically making up for the fact that right now the models are not that clever. In the same way that if you're running a customer service business and you have loads of people sitting answering telephones, the less well trained they are, the less that you trust them, the more that you need to give them a script to go through. Whereas, you know, if you're running a bank and you have lots of customer service people who you don't trust that much, then you tell them exactly what to say. If you're doing high net worth banking, you just employ people who you think are going to be charming to other rich people and set them off to go and have coffee with people. Right. And the same is true of models. The more intelligent they are, the less we need to tell them, like, structure what they go and do and constrain the routes which they take.Swyx [00:27:42]: Yeah. Yeah. Agree with that. So I'm happy to move on. So the other parts of Pydantic AI that are worth commenting on, and this is like my last rant, I promise. So obviously, every framework needs to do its sort of model adapter layer, which is, oh, you can easily swap from OpenAI to Claude to Grok. You also have, which I didn't know about, Google GLA, which I didn't really know about until I saw this in your docs, which is the generative language API. I assume that's AI Studio? Yes.Samuel [00:28:13]: Google don't have good names for it. So Vertex is very clear. That seems to be the API that like some of the things use, although it returns 503 about 20% of the time. So... Vertex? No. Vertex, fine. But the... Oh, oh. GLA. Yeah. Yeah.Swyx [00:28:28]: I agree with that.Samuel [00:28:29]: So we have, again, another example of where I think we go the extra mile in terms of engineering: on every commit, at least every commit to main, we run tests against the live models. Not lots of tests, but like a handful of them. Oh, okay. And we had a point last week where, yeah, GLA is a little bit better. GLA was failing every single run. One of its tests would fail. And we, I think we might even have commented out that one at the moment. So like all of the models fail more often than you might expect, but like that one seems to be particularly likely to fail.
But Vertex is the same API, but much more reliable.Swyx [00:29:01]: My rant here is that, you know, versions of this appear in Langchain and every single framework has to have its own little thing, a version of that. I would put to you, and then, you know, this is, this can be agree to disagree. This is not needed in Pydantic AI. I would much rather you adopt a layer like Lite LLM or what's the other one in JavaScript port key. And that's their job. They focus on that one thing and they, they normalize APIs for you. All new models are automatically added and you don't have to duplicate this inside of your framework. So for example, if I wanted to use deep seek, I'm out of luck because Pydantic AI doesn't have deep seek yet.Samuel [00:29:38]: Yeah, it does.Swyx [00:29:39]: Oh, it does. Okay. I'm sorry. But you know what I mean? Should this live in your code or should it live in a layer that's kind of your API gateway that's a defined piece of infrastructure that people have?Samuel [00:29:49]: And I think if a company who are well known, who are respected by everyone had come along and done this at the right time, maybe we should have done it a year and a half ago and said, we're going to be the universal AI layer. That would have been a credible thing to do. I've heard varying reports of Lite LLM is the truth. And it didn't seem to have exactly the type safety that we needed. Also, as I understand it, and again, I haven't looked into it in great detail. Part of their business model is proxying the request through their, through their own system to do the generalization. That would be an enormous put off to an awful lot of people. Honestly, the truth is I don't think it is that much work unifying the model. I get where you're coming from. I kind of see your point. I think the truth is that everyone is centralizing around open AIs. Open AI's API is the one to do. So DeepSeq support that. Grok with OK support that. Ollama also does it. I mean, if there is that library right now, it's more or less the open AI SDK. And it's very high quality. It's well type checked. It uses Pydantic. So I'm biased. But I mean, I think it's pretty well respected anyway.Swyx [00:30:57]: There's different ways to do this. Because also, it's not just about normalizing the APIs. You have to do secret management and all that stuff.Samuel [00:31:05]: Yeah. And there's also. There's Vertex and Bedrock, which to one extent or another, effectively, they host multiple models, but they don't unify the API. But they do unify the auth, as I understand it. Although we're halfway through doing Bedrock. So I don't know about it that well. But they're kind of weird hybrids because they support multiple models. But like I say, the auth is centralized.Swyx [00:31:28]: Yeah, I'm surprised they don't unify the API. That seems like something that I would do. You know, we can discuss all this all day. There's a lot of APIs. I agree.Samuel [00:31:36]: It would be nice if there was a universal one that we didn't have to go and build.Alessio [00:31:39]: And I guess the other side of, you know, routing model and picking models like evals. How do you actually figure out which one you should be using? I know you have one. First of all, you have very good support for mocking in unit tests, which is something that a lot of other frameworks don't do. So, you know, my favorite Ruby library is VCR because it just, you know, it just lets me store the HTTP requests and replay them. That part I'll kind of skip. 
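For readers who haven't used it, the mocking support mentioned here looks roughly like this, assuming the pydantic-ai testing helpers as documented around this time; the agent and the assertion are illustrative:

```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

agent = Agent("openai:gpt-4o", system_prompt="Answer in one short sentence.")

def test_agent_runs_without_network():
    # TestModel fabricates a plausible response locally; no API call is made.
    with agent.override(model=TestModel()):
        result = agent.run_sync("What is the capital of France?")
    assert result.data
```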
I think you also have this TestModel, where, like, just through Python, you try and figure out what the model might respond without actually calling the model. And then you have the FunctionModel where people can kind of customize outputs. Any other fun stories maybe from there? Or is it just what you see is what you get, so to speak?Samuel [00:32:18]: On those two, I think what you see is what you get. On the evals, I think watch this space. I think it's something that, again, I was somewhat cynical about for some time. Still have my cynicism about some of it. Well, it's unfortunate that so many different things are called evals. It would be nice if we could agree what they are and what they're not. But look, I think it's a really important space. I think it's something that we're going to be working on soon, both in Pydantic AI and in LogFire, to try and support better, because it's like it's an unsolved problem.Alessio [00:32:45]: Yeah, you do say in your doc that anyone who claims to know for sure exactly how your eval should be defined can safely be ignored.Samuel [00:32:52]: We'll delete that sentence when we tell people how to do their evals.Alessio [00:32:56]: Exactly. I was like, we need a snapshot of this today. And so let's talk about evals. So there's kind of like the vibe. Yeah. So you have evals, which is what you do when you're building. Right. Because you cannot really like test it that many times to get statistical significance. And then there's the production eval. So you also have LogFire, which is kind of like your observability product, which I tried before. It's very nice. What are some of the learnings you've had from building an observability tool for LLMs? And yeah, as people think about evals, even like what are the right things to measure? What are like the right number of samples that you need to actually start making decisions?Samuel [00:33:33]: I'm not the best person to answer that is the truth. So I'm not going to come in here and tell you that I think I know the answer on the exact number. I mean, we can do some back of the envelope statistics calculations to work out that like having 30 probably gets you most of the statistical value of having 200 for, you know, by definition, 15% of the work. But the exact like how many examples do you need? For example, that's a much harder question to answer because it's, you know, it's deep within how models operate. In terms of LogFire, one of the reasons we built LogFire the way we have, where we allow you to write SQL directly against your data and we're trying to build the like powerful fundamentals of observability, is precisely because we know we don't know the answers. And so allowing people to go and innovate on how they're going to consume that stuff and how they're going to process it is, we think, valuable. Because even if we come along and offer you an evals framework on top of LogFire, it won't be right in all regards. And we want people to be able to go and innovate. Being able to write their own SQL, connect to the API, and effectively query the data like it's a database allows people to innovate on that stuff. And that's what allows us to do it as well. I mean, we do a bunch of like testing what's possible by basically writing SQL directly against LogFire as any user could. I think the other really interesting bit that's going on in observability is OpenTelemetry is centralizing around semantic attributes for GenAI. So it's a relatively new project.
A lot of it's still being added at the moment. But basically the idea is that they unify how both SDKs and/or agent frameworks send observability data to any OpenTelemetry endpoint. And so, again, having that unification allows us to go and like basically compare different libraries, compare different models much better. That stuff's in a very like early stage of development. One of the things we're going to be working on pretty soon is basically, I suspect, Pydantic AI will be the first agent framework that implements those semantic attributes properly. Because, again, we control it and we can say this is important for observability, whereas most of the other agent frameworks are not maintained by people who are trying to do observability. With the exception of Langchain, where they have the observability platform, but they chose not to go down the OpenTelemetry route. So they're like plowing their own furrow. And, you know, they're even further away from standardization.Alessio [00:35:51]: Can you maybe just give a quick overview of how OTEL ties into the AI workflows? There's kind of like the question of, is, you know, a trace and a span like an LLM call? Is it the agent? Is it kind of like the broader thing you're tracking? How should people think about it?Samuel [00:36:06]: Yeah, so they have a PR that I think may have now been merged from someone at IBM talking about remote agents and trying to support this concept of remote agents within GenAI. I'm not particularly compelled by that because I don't think that like that's actually by any means the common use case. But like, I suppose it's fine for it to be there. The majority of the stuff in OTEL is basically defining how you would instrument a given call to an LLM. So basically the actual LLM call, what data you would send to your telemetry provider, how you would structure that. Apart from this slightly odd stuff on remote agents, most of the like agent level consideration is not yet decided, effectively. And so there's a bit of ambiguity. Obviously, what's good about OTEL is you can in the end send whatever attributes you like. But yeah, there's quite a lot of churn in that space and exactly how we store the data. I think that one of the most interesting things, though, is that if you think about observability traditionally, sure, everyone would say our observability data is very important, we must keep it safe. But actually, companies work very hard to basically not have anything that sensitive in their observability data. So if you're a doctor in a hospital and you search for a drug for an STI, the SQL might be sent to the observability provider. But none of the parameters would. It wouldn't have the patient number or their name or the drug. With GenAI, that distinction doesn't exist because it's all just messed up in the text. If you have that same patient asking an LLM what drug they should take or how to stop smoking, you can't extract the PII and not send it to the observability platform. So the sensitivity of the data that's going to end up in observability platforms is going to be like basically a different order of magnitude to what you would normally send to Datadog. Of course, you can make a mistake and send someone's password or their card number to Datadog. But that would be seen as a mistake. Whereas in GenAI, a lot of data is going to be sent.
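A hedged sketch of instrumenting a single LLM call by hand with the OpenTelemetry Python SDK, using attribute names in the spirit of the GenAI semantic conventions discussed here; the exact names have churned across spec versions, so treat them as illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-app")

def call_llm(prompt: str) -> str:
    with tracer.start_as_current_span("chat gpt-4o") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        # ... the actual provider call would happen here ...
        completion, prompt_tokens, completion_tokens = "hi!", 12, 3
        span.set_attribute("gen_ai.usage.prompt_tokens", prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens", completion_tokens)
        # Note: with GenAI the prompt/completion text itself is the sensitive
        # part, which is the PII problem described above.
        return completion
```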
And I think that's why companies like Langsmith are trying hard to offer observability on-prem, because there's a bunch of companies who are happy for Datadog to be cloud hosted, but want self-hosting for this observability stuff with GenAI.Alessio [00:38:09]: And are you doing any of that today? Because I know in each of the spans you have like the number of tokens, you have the context, you're just storing everything. And then you're going to offer kind of like a self-hosting for the platform, basically. Yeah. Yeah.Samuel [00:38:23]: So we have scrubbing roughly equivalent to what the other observability platforms have. So if we, you know, if we see password as the key, we won't send the value. But like, like I said, that doesn't really work in GenAI. So we're accepting we're going to have to store a lot of data and then we'll offer self-hosting for those people who can afford it and who need it.Alessio [00:38:42]: And then this is, I think, the first time that most of the workload's performance depends on a third party. You know, like if you're looking at Datadog data, usually it's your app that is driving the latency and like the memory usage and all of that. Here you're going to have spans that maybe take a long time to perform because the GLA API is not working or because OpenAI is kind of like overwhelmed. Do you do anything there, since like the provider is almost like the same across customers? You know, like, are you trying to surface these things for people and say, hey, this was like a very slow span, but actually all customers using OpenAI right now are seeing the same thing, so maybe don't worry about it?Samuel [00:39:20]: Not yet. We do a few things that people don't generally do in OTel. So we send information at the beginning of a trace, sorry, at the beginning of a span, as well as when it finishes. By default, OTel only sends you data when the span finishes. So if you think about a request which might take like 20 seconds, even if some of the intermediate spans finished earlier, you can't basically place them on the page until you get the top level span. And so if you're using standard OTel, you can't show anything until those requests are finished. When those requests are taking a few hundred milliseconds, it doesn't really matter. But when you're doing GenAI calls, or when you're like running a batch job that might take 30 minutes, that latency of not being able to see the span is like crippling to understanding your application. And so we do a bunch of slightly complex stuff to basically send data about a span as it starts, which is closely related. Yeah.Alessio [00:40:09]: Any thoughts on all the other people trying to build on top of OpenTelemetry in different languages, too? There's like the OpenLLMetry project, which doesn't really roll off the tongue. But how do you see the future of these kind of tools? Is everybody going to have to build their own? Why does everybody want to build their own open source observability thing to then sell?Samuel [00:40:29]: I mean, we are not going off and trying to instrument the likes of the OpenAI SDK with the new semantic attributes, because at some point that's going to happen and it's going to live inside OTEL and we might help with it. But we're a tiny team. We don't have time to go and do all of that work. So OpenLLMetry, like, interesting project.
But I suspect eventually most of those semantic like that instrumentation of the big of the SDKs will live, like I say, inside the main OpenTelemetry report. I suppose. What happens to the agent frameworks? What data you basically need at the framework level to get the context is kind of unclear. I don't think we know the answer yet. But I mean, I was on the, I guess this is kind of semi-public, because I was on the call with the OpenTelemetry call last week talking about GenAI. And there was someone from Arize talking about the challenges they have trying to get OpenTelemetry data out of Langchain, where it's not like natively implemented. And obviously they're having quite a tough time. And I was realizing, hadn't really realized this before, but how lucky we are to primarily be talking about our own agent framework, where we have the control rather than trying to go and instrument other people's.Swyx [00:41:36]: Sorry, I actually didn't know about this semantic conventions thing. It looks like, yeah, it's merged into main OTel. What should people know about this? I had never heard of it before.Samuel [00:41:45]: Yeah, I think it looks like a great start. I think there's some unknowns around how you send the messages that go back and forth, which is kind of the most important part. It's the most important thing of all. And that is moved out of attributes and into OTel events. OTel events in turn are moving from being on a span to being their own top-level API where you send data. So there's a bunch of churn still going on. I'm impressed by how fast the OTel community is moving on this project. I guess they, like everyone else, get that this is important, and it's something that people are crying out to get instrumentation off. So I'm kind of pleasantly surprised at how fast they're moving, but it makes sense.Swyx [00:42:25]: I'm just kind of browsing through the specification. I can already see that this basically bakes in whatever the previous paradigm was. So now they have genai.usage.prompt tokens and genai.usage.completion tokens. And obviously now we have reasoning tokens as well. And then only one form of sampling, which is top-p. You're basically baking in or sort of reifying things that you think are important today, but it's not a super foolproof way of doing this for the future. Yeah.Samuel [00:42:54]: I mean, that's what's neat about OTel is you can always go and send another attribute and that's fine. It's just there are a bunch that are agreed on. But I would say, you know, to come back to your previous point about whether or not we should be relying on one centralized abstraction layer, this stuff is moving so fast that if you start relying on someone else's standard, you risk basically falling behind because you're relying on someone else to keep things up to date.Swyx [00:43:14]: Or you fall behind because you've got other things going on.Samuel [00:43:17]: Yeah, yeah. That's fair. That's fair.Swyx [00:43:19]: Any other observations just about building LogFire, actually? Let's just talk about this. So you announced LogFire. I was kind of only familiar with LogFire because of your Series A announcement. I actually thought you were making a separate company. I remember some amount of confusion with you when that came out. So to be clear, it's Pydantic LogFire and the company is one company that has kind of two products, an open source thing and an observability thing, correct? Yeah. I was just kind of curious, like any learnings building LogFire? 
So classic question is, do you use ClickHouse? Is this like the standard persistence layer? Any learnings doing that?Samuel [00:43:54]: We don't use ClickHouse. We started building our database with ClickHouse, moved off ClickHouse onto Timescale, which is a Postgres extension to do analytical databases. Wow. And then moved off Timescale onto DataFusion. And we're basically now building, it's DataFusion, but it's kind of our own database. Bogomil is not entirely happy that we went through three databases before we chose one. I'll say that. But like, we've got to the right one in the end. I think we could have realized sooner that Timescale wasn't right. But ClickHouse and Timescale both taught us a lot and we're in a great place now. But like, yeah, it's been a real journey on the database in particular.Swyx [00:44:28]: Okay. So, you know, as a database nerd, I have to like double click on this, right? So ClickHouse is supposed to be the ideal backend for anything like this. And then moving from ClickHouse to Timescale is another counterintuitive move that I didn't expect because, you know, Timescale is like an extension on top of Postgres. Not super meant for like high volume logging. But like, yeah, tell us those decisions.Samuel [00:44:50]: So at the time, ClickHouse did not have good support for JSON. I was speaking to someone yesterday and said ClickHouse doesn't have good support for JSON and got roundly stepped on because apparently it does now. So they've obviously gone and built their proper JSON support. But like back when we were trying to use it, I guess a year ago or a bit more than a year ago, everything had to be a map and maps are a pain to try and do like looking up JSON type data. And obviously all these attributes, everything you're talking about there in terms of the GenAI stuff, you can choose to make them top level columns if you want. But the simplest thing is just to put them all into a big JSON pile. And that was a problem with ClickHouse. Also, ClickHouse had some really ugly edge cases like by default, or at least until I complained about it a lot, ClickHouse thought that two nanoseconds was longer than one second because they compared intervals just by the number, not the unit. And I complained about that a lot. And then they changed it to raise an error and just say you have to have the same unit. Then I complained a bit more. And I think as I understand it now, they have some, they convert between units. But like stuff like that, when a lot of what you're doing is comparing the duration of spans, was really painful. Also things like you can't subtract two date times to get an interval. You have to use the date_sub function. But like the fundamental thing is because we want our end users to write SQL, the like quality of the SQL, how easy it is to write, matters way more to us than if you're building like a platform on top where your developers are going to write the SQL. And once it's written and it's working, you don't mind too much. So I think that's like one of the fundamental differences. The other problem that I have with ClickHouse and, in fact, Timescale is that like the ultimate architecture, the like Snowflake architecture of binary data in object store queried with some kind of cache from nearby, they both have it, but it's closed source and you only get it if you go and use their hosted versions.
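Samuel's point that end users write the SQL themselves is easy to picture with DataFusion's Python bindings. The table, file, and column names below are invented for illustration; Logfire's real storage layer is a custom Rust build on top of DataFusion, not this.

```python
# Sketch: letting end users run plain SQL over span data with DataFusion's
# Python bindings. The Parquet file and column names are made up; Logfire's
# actual engine is its own Rust build on DataFusion.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("spans", "spans.parquet")

# A query an end user might write themselves: the slowest GenAI spans.
df = ctx.sql("""
    SELECT trace_id,
           span_name,
           end_time - start_time AS duration
    FROM spans
    WHERE span_name LIKE 'chat%'
    ORDER BY duration DESC
    LIMIT 20
""")
print(df.to_pandas())
```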
And so even if we had got through all the problems with Timescale or ClickHouse, we would end up like, you know, they would want to be taking their 80% margin, and then we would be wanting to take ours, and that would basically leave us less space for margin. Whereas DataFusion is properly open source, all of that same tooling is open source. And for us as a team of people with a lot of Rust expertise, DataFusion, which is implemented in Rust, we can literally dive into it and go and change it. So, for example, I found that there were some slowdowns in DataFusion's string comparison kernel for doing like string contains. And it's just Rust code. And I could go and rewrite the string comparison kernel to be faster. Or, for example, DataFusion, when we started using it, didn't have JSON support. Obviously, as I've said, it's something we can do. It's something we needed. I was able to go and implement that in a weekend using our JSON parser that we built for Pydantic Core. So it's the fact that like DataFusion is like for us the perfect mixture of a toolbox to build a database with, not a database. And we can go and implement stuff on top of it in a way that we couldn't if we were trying to do that in Postgres or in ClickHouse. I mean, ClickHouse would be easier because it's C++, relatively modern C++. But like as a team of people who are not C++ experts, that's much scarier than DataFusion for us.Swyx [00:47:47]: Yeah, that's a beautiful rant.Alessio [00:47:49]: That's funny. Most people don't think they have agency on these projects. They're kind of like, oh, I should use this or I should use that. They're not really like, what should I pick so that I contribute the most back to it? You know, so but I think you obviously have an open source first mindset. So that makes a lot of sense.Samuel [00:48:05]: I think if we were a better startup and faster moving and just like headlong determined to get in front of customers as fast as possible, we should have just started with ClickHouse. I hope that long term we're in a better place for having worked with DataFusion. We're quite engaged now with the DataFusion community. Andrew Lamb, who maintains DataFusion, is an advisor to us. We're in a really good place now. But yeah, it's definitely slowed us down relative to just like building on ClickHouse and moving as fast as we can.Swyx [00:48:34]: OK, we're about to zoom out and do Pydantic.run and all the other stuff. But, you know, my last question on LogFire is really, you know, at some point you run out of sort of community goodwill just because like, oh, I use Pydantic. I love Pydantic. I'm going to use LogFire. OK, then you start entering the territory of the Datadogs, the Sentrys and the Honeycombs. Yeah. So where are you going to really spike here? What differentiator here?Samuel [00:48:59]: I wasn't writing code in 2001, but I'm assuming that there were people talking about like web observability and then web observability stopped being a thing, not because the web stopped being a thing, but because all observability had to do web. If you were talking to people in 2010 or 2012, they would have talked about cloud observability. Now that's not a term because all observability is cloud first. The same is going to happen to GenAI. And so whether or not you're trying to compete with Datadog or with Arize and LangSmith, you've got to do first class. You've got to do general purpose observability with first class support for AI.
And as far as I know, we're the only people really trying to do that. I mean, I think Datadog is starting in that direction. And to be honest, I think Datadog is a much like scarier company to compete with than the AI specific observability platforms. Because in my opinion, and I've also heard this from lots of customers, AI specific observability where you don't see everything else going on in your app is not actually that useful. Our hope is that we can build the first general purpose observability platform with first class support for AI. And that we have this open source heritage of putting developer experience first that other companies haven't done. For all that I'm a fan of Datadog and what they've done, if you search "Datadog logging Python" and you just try, as a non-observability expert, to get something up and running with Datadog and Python, it's not trivial, right? That's something Sentry have done amazingly well. But like there's enormous space in most of observability to do DX better.Alessio [00:50:27]: Since you mentioned Sentry, I'm curious how you thought about licensing and all of that. Obviously, you're MIT licensed, you don't have any rolling license like Sentry has where you can only use, as open source, like the one-year-old version of it. Was that a hard decision?Samuel [00:50:41]: So to be clear, LogFire is closed source. So Pydantic and Pydantic AI are MIT licensed and like properly open source. And then LogFire for now is completely closed source. And in fact, the struggles that Sentry have had with licensing and the like weird pushback the community gives when they take something that's closed source and make it source available just meant that we just avoided that whole subject matter. I think the other way to look at it is like in terms of either headcount or revenue or dollars in the bank, the amount of open source we do as a company, we're up there with the most prolific open source companies, like I say, per head. And so we didn't feel like we were morally obligated to make LogFire open source. We have Pydantic. Pydantic is a foundational library in Python. That and now Pydantic AI are our contribution to open source. And then LogFire is like openly for profit, right? As in we're not claiming otherwise. We're not sort of trying to walk a line of, is it open source, but really we want to make it hard to deploy so you probably want to pay us. We're trying to be straight that it's paid for. We could change that at some point in the future, but it's not an immediate plan.Alessio [00:51:48]: All right. So the first one, I saw this new, I don't know if it's like a product you're building, Pydantic.run, which is a Python browser sandbox. What was the inspiration behind that? We talk a lot about code interpreters for LLMs. I'm an investor in a company called E2B, which is a code sandbox as a service for remote execution. Yeah. What's the Pydantic.run story?Samuel [00:52:09]: So Pydantic.run is again completely open source. I have no interest in making it into a product. We just needed a sandbox to be able to demo LogFire in particular, but also Pydantic AI. So it doesn't have it yet, but I'm going to add basically a proxy to OpenAI and the other models so that you can run Pydantic AI in the browser. See how it works. Tweak the prompt, et cetera, et cetera. And we'll have some kind of limit per day of what you can spend on it or like what the spend is. The other thing we wanted to b

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
The Agent Reasoning Interface: o1/o3, Claude 3, ChatGPT Canvas, Tasks, and Operator — with Karina Nguyen of OpenAI

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Feb 1, 2025 68:40


Sponsorships and tickets for the AI Engineer Summit are selling fast! See the new website with speakers and schedules live! If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you, this Feb 20-22nd in NYC.We're pleased to share that Karina will be presenting OpenAI's closing keynote at the AI Engineer Summit. We were fortunate to get some time with her today to introduce some of her work, and hope this serves as nice background for her talk!There are very few early AI careers that have been as impactful as Karina Nguyen's. After stints at Notion, Square, Dropbox, Primer, the New York Times, and UC Berkeley, She joined Anthropic as employee ~60 and worked on a wide range of research/product roles for Claude 1, 2, and 3. We'll just let her LinkedIn speak for itself:Now, as Research manager and Post-training lead in Model Behavior at OpenAI, she creates new interaction paradigms for reasoning interfaces and capabilities, like ChatGPT Canvas, Tasks, SimpleQA, streaming chain-of-thought for o1 models, and more via novel synthetic model training. Ideal AI Research+Product ProcessIn the podcast we got a sense of what Karina has found works for her and her team to be as productive as they have been:* Write PRD (Define what you want)* Funding (Get resources)* Prototype Prompted Baseline (See what's possible)* Write and Run Evals (Get failures to hillclimb)* Model training (Exceed baseline without overfitting)* Bugbash (Find bugs and solve them)* Ship (Get users!)We could turn this into a snazzy viral graphic but really this is all it is. Simple to say, difficult to do well. Hopefully it helps you define your process if you do similar product-research work. Show Notes* Our Reasoning Price War post * Karina LinkedIn, Website, Twitter* OSINT visualization work* Ukraine 3D storytelling* Karina on Claude Artifacts* Karina on Claude 3 Benchmarks* Inspiration for Artifacts / Canvas from early UX work she did on GPT-3* “i really believe that things like canvas and tasks should and could have happened like 2 yrs ago, idk why we are lagging in the form factors” (tweet)* Our article on prompting o1 vs Karina's Claude prompting principles* Canvas: https://openai.com/index/introducing-canvas/ * We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions.To support this, our research team developed the following core behaviors:* Triggering the canvas for writing and coding* Generating diverse content types* Making targeted edits* Rewriting documents* Providing inline critiqueWe measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data.* Tasks: https://www.theverge.com/2025/1/14/24343528/openai-chatgpt-repeating-tasks-agent-ai* * Agents and Operator* What are agents? “Agents are a gradual progression of tasks: starting with one-off actions, moving to collaboration, and ultimately fully trustworthy long-horizon delegation in complex envs like multi-player/multiagents.” (tweet)* tasks and canvas fall within the first two, and we are def. 
marching towards the third—though the form factor for 3 will take time to develop * Operator/Computer Use Agents* https://openai.com/index/introducing-operator/* Misc:* Andrew Ng* Prediction: Personal AI Consumer playbook* ChatGPT as generative OSTimestamps* 00:00 Welcome to the Latent Space Podcast* 00:11 Introducing Karina Nguyen* 02:21 Karina's Journey to OpenAI* 04:45 Early Prototypes and Projects* 05:25 Joining Anthropic and Early Work* 07:16 Challenges and Innovations at Anthropic* 11:30 Launching Claude 3* 21:57 Behavioral Design and Model Personality* 27:37 The Making of ChatGPT Canvas* 34:34 Canvas Update and Initial Impressions* 34:46 Differences Between Canvas and API Outputs* 35:50 Core Use Cases of Canvas* 36:35 Canvas as a Writing Partner* 36:55 Canvas vs. Google Docs and Future Improvements* 37:35 Canvas for Coding and Executing Code* 38:50 Challenges in Developing Canvas* 41:45 Introduction to Tasks* 41:53 Developing and Iterating on Tasks* 46:27 Future Vision for Tasks and Proactive Models* 52:23 Computer Use Agents and Their Potential* 01:00:21 Cultural Differences Between OpenAI and Anthropic* 01:03:46 Call to Action and Final Thoughts Get full access to Latent Space at www.latent.space/subscribe

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Outlasting Noam Shazeer, crowdsourcing Chat + AI with >1.4m DAU, and becoming the "Western DeepSeek" — with William Beauchamp, Chai Research

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jan 26, 2025 75:46


One last Gold sponsor slot is available for the AI Engineer Summit in NYC. Our last round of invites is going out soon - apply here - If you are building AI agents or AI eng teams, this will be the single highest-signal conference of the year for you!While the world melts down over DeepSeek, few are talking about the OTHER notable group of former hedge fund traders who pivoted into AI and built a remarkably profitable consumer AI business with a tiny team with incredibly cracked engineering team — Chai Research. In short order they have:* Started a Chat AI company well before Noam Shazeer started Character AI, and outlasted his departure.* Crossed 1m DAU in 2.5 years - William updates us on the pod that they've hit 1.4m DAU now, another +40% from a few months ago. Revenue crossed >$22m. * Launched the Chaiverse model crowdsourcing platform - taking 3-4 week A/B testing cycles down to 3-4 hours, and deploying >100 models a week.While they're not paying million dollar salaries, you can tell they're doing pretty well for an 11 person startup:The Chai Recipe: Building infra for rapid evalsRemember how the central thesis of LMarena (formerly LMsys) is that the only comprehensive way to evaluate LLMs is to let users try them out and pick winners?At the core of Chai is a mobile app that looks like Character AI, but is actually the largest LLM A/B testing arena in the world, specialized on retaining chat users for Chai's usecases (therapy, assistant, roleplay, etc). It's basically what LMArena would be if taken very, very seriously at one company (with $1m in prizes to boot):Chai publishes occasional research on how they think about this, including talks at their Palo Alto office:William expands upon this in today's podcast (34 mins in):Fundamentally, the way I would describe it is when you're building anything in life, you need to be able to evaluate it. And through evaluation, you can iterate, we can look at benchmarks, and we can say the issues with benchmarks and why they may not generalize as well as one would hope in the challenges of working with them. But something that works incredibly well is getting feedback from humans. And so we built this thing where anyone can submit a model to our developer backend, and it gets put in front of 5000 users, and the users can rate it. And we can then have a really accurate ranking of like which model, or users finding more engaging or more entertaining. And it gets, you know, it's at this point now, where every day we're able to, I mean, we evaluate between 20 and 50 models, LLMs, every single day, right. So even though we've got only got a team of, say, five AI researchers, they're able to iterate a huge quantity of LLMs, right. So our team ships, let's just say minimum 100 LLMs a week is what we're able to iterate through. Now, before that moment in time, we might iterate through three a week, we might, you know, there was a time when even doing like five a month was a challenge, right? By being able to change the feedback loops to the point where it's not, let's launch these three models, let's do an A-B test, let's assign, let's do different cohorts, let's wait 30 days to see what the day 30 retention is, which is the kind of the, if you're doing an app, that's like A-B testing 101 would be, do a 30-day retention test, assign different treatments to different cohorts and come back in 30 days. So that's insanely slow. That's just, it's too slow. 
And so we were able to get that 30-day feedback loop all the way down to something like three hours.In Crowdsourcing the leap to Ten Trillion-Parameter AGI, William describes Chai's routing as a recommender system, which makes a lot more sense to us than previous pitches for model routing startups:William is notably counter-consensus in a lot of his AI product principles:* No streaming: Chats appear all at once to allow rejection sampling* No voice: Chai actually beat Character AI to introducing voice - but removed it after finding that it was far from a killer feature.* Blending: "Something that we love to do at Chai is blending, which is, you know, it's the simplest way to think about it is you're going to end up, and you're going to pretty quickly see you've got one model that's really smart, one model that's really funny. How do you get the user an experience that is both smart and funny? Well, just 50% of the requests, you can serve them the smart model, 50% of the requests, you serve them the funny model." (that's it! See the short sketch below.)But chief above all is the recommender system.We also referenced Exa CEO Will Bryk's concept of SuperKnowledge.Full Video version: On YouTube. Please like and subscribe!Timestamps* 00:00:04 Introductions and background of William Beauchamp* 00:01:19 Origin story of Chai AI* 00:04:40 Transition from finance to AI* 00:11:36 Initial product development and idea maze for Chai* 00:16:29 User psychology and engagement with AI companions* 00:20:00 Origin of the Chai name* 00:22:01 Comparison with Character AI and funding challenges* 00:25:59 Chai's growth and user numbers* 00:34:53 Key inflection points in Chai's growth* 00:42:10 Multi-modality in AI companions and focus on user-generated content* 00:46:49 Chaiverse developer platform and model evaluation* 00:51:58 Views on AGI and the nature of AI intelligence* 00:57:14 Evaluation methods and human feedback in AI development* 01:02:01 Content creation and user experience in Chai* 01:04:49 Chai Grant program and company culture* 01:07:20 Inference optimization and compute costs* 01:09:37 Rejection sampling and reward models in AI generation* 01:11:48 Closing thoughts and recruitmentTranscriptAlessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and today we're in the Chai AI office with my usual co-host, Swyx.swyx [00:00:14]: Hey, thanks for having us. It's rare that we get to get out of the office, so thanks for inviting us to your home. We're in the office of Chai with William Beauchamp. Yeah, that's right. You're the founder of Chai AI, but previously, I think you're concurrently also running your fund?William [00:00:29]: Yep, so I was simultaneously running an algorithmic trading company, but I fortunately was able to kind of exit from that, I think just in Q3 last year. Yeah, congrats. Yeah, thanks.swyx [00:00:43]: So Chai has always been on my radar because, well, first of all, you do a lot of advertising, I guess, in the Bay Area, so it's working. Yep. And second of all, the reason I reached out to a mutual friend, Joyce, was because I'm just generally interested in the... ...consumer AI space, chat platforms in general. I think there's a lot of inference insights that we can get from that, as well as human psychology insights, kind of a weird blend of the two. And we also share a bit of a history as former finance people crossing over.
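Here is a toy sketch of the blending idea from William's bullet above: route each request at random between two models so the blended experience inherits properties of both. The two generate functions are hypothetical placeholders, not Chai's actual serving code.

```python
import random

# Hypothetical stand-ins for two different LLM endpoints.
def generate_smart(history):
    return "smart reply"

def generate_funny(history):
    return "funny reply"

def blended_reply(history):
    # Chai-style blending: 50% of requests go to the "smart" model,
    # 50% to the "funny" one, so the blend feels both smart and funny.
    return generate_smart(history) if random.random() < 0.5 else generate_funny(history)

print(blended_reply(["hi!"]))
```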
I guess we can just kind of start it off with the origin story of Chai.William [00:01:19]: Why decide working on a consumer AI platform rather than B2B SaaS? So just quickly touching on the background in finance. Sure. Originally, I'm from... I'm from the UK, born in London. And I was fortunate enough to go study economics at Cambridge. And I graduated in 2012. And at that time, everyone in the UK and everyone on my course, HFT, quant trading was really the big thing. It was like the big wave that was happening. So there was a lot of opportunity in that space. And throughout college, I'd sort of played poker. So I'd, you know, I dabbled as a professional poker player. And I was able to accumulate this sort of, you know, say $100,000 through playing poker. And at the time, as my friends would go work at companies like Jane Street or Citadel, I kind of did the maths. And I just thought, well, maybe if I traded my own capital, I'd probably come out ahead. I'd make more money than just going to work at Jane Street.swyx [00:02:20]: With 100k base as capital?William [00:02:22]: Yes, yes. That's not a lot. Well, it depends what strategies you're doing. And, you know, there is an advantage. There's an advantage to being small, right? Because there are, if you have a 10... Strategies that don't work in size. Exactly, exactly. So if you have a fund of $10 million, if you find a little anomaly in the market that you might be able to make 100k a year from, that's a 1% return on your 10 million fund. If your fund is 100k, that's 100% return, right? So being small, in some sense, was an advantage. So I started off, taught myself Python, and machine learning was like the big thing as well. Machine learning had really, it was the first, you know, big time machine learning was being used for image recognition, neural networks come out, you get dropout. And, you know, so this, this was the big thing that's going on at the time. So I probably spent my first three years out of Cambridge, just building neural networks, building random forests to try and predict asset prices, right, and then trade that using my own money. And that went well. And, you know, if you start something, and it goes well, you try and hire more people. And the first people that came to mind was the talented people I went to college with. And so I hired some friends. And that went well and hired some more. And eventually, I kind of ran out of friends to hire. And so that was when I formed the company. And from that point on, we had our ups and we had our downs. And that was a whole long story and journey in itself. But after doing that for about eight or nine years, on my 30th birthday, which was four years ago now, I kind of took a step back to just evaluate my life, right? This is what one does when one turns 30. You know, I just heard it. I hear you. And, you know, I looked at my 20s and I loved it. It was a really special time. I was really lucky and fortunate to have worked with this amazing team, been successful, had a lot of hard times. And through the hard times, learned wisdom and then a lot of success and, you know, was able to enjoy it. And so the company was making about five million pounds a year. And it was just me and a team of, say, 15, like, Oxford and Cambridge educated mathematicians and physicists. It was like the real dream that you'd have if you wanted to start a quant trading firm. It was like...swyx [00:04:40]: Your own, all your own money?William [00:04:41]: Yeah, exactly.
It was all the team's own money. We had no customers complaining to us about issues. There's no investors, you know, saying, you know, they don't like the risk that we're taking. We could. We could really run the thing exactly as we wanted it. It's like Susquehanna or like RenTec. Yeah, exactly. Yeah. And they're the companies that we would kind of look towards as we were building that thing out. But on my 30th birthday, I look and I say, OK, great. This thing is making as much money as kind of anyone would really need. And I thought, well, what's going to happen if we keep going in this direction? And it was clear that we would never have a kind of a big, big impact on the world. We can enrich ourselves. We can make really good money. Everyone on the team would be paid very, very well. Presumably, I can make enough money to buy a yacht or something. But this stuff wasn't that important to me. And so I felt a sort of obligation that if you have this much talent and if you have a talented team, especially as a founder, you want to be putting all that talent towards a good use. I looked at the time at like getting into crypto, and I had a really strong view on crypto, which was that, as far as a gambling device goes, this is like the most fun form of gambling invented, like, ever. Super fun. And as a way to evade monetary regulations and banking restrictions, I think it's also absolutely amazing. So it has two like killer use cases, not so much banking the unbanked, but everything else to do with like the blockchain and, you know, web, was it web 3.0, you know, that didn't really make much sense. And so instead of going into crypto, which I thought, even if I was successful, I'd end up in a lot of trouble, I thought maybe it'd be better to build something that governments wouldn't have a problem with. I knew that LLMs were like a thing. I think OpenAI had said, they hadn't released GPT-3 yet, but they'd said GPT-3 is so powerful, we can't release it to the world or something. Was it GPT-2? And then I started interacting with, I think Google had open sourced some language models. They weren't necessarily LLMs, but they, but they were. But yeah, exactly. So I was able to play around with those, but nowadays so many people have interacted with ChatGPT, they get it, but it's like the first time you, you can just talk to a computer and it talks back. It's kind of a special moment and you know, everyone who's done that goes like, wow, this is how it should be. Right. It should be like, rather than having to type on Google and search, you should just be able to ask Google a question. When I saw that, I read the literature, I kind of came across the scaling laws, and I think even four years ago, all the pieces of the puzzle were there, right? Google had done this amazing research and published, you know, a lot of it. OpenAI was still open. And so they'd published a lot of their research. And so you really could be fully informed on, on the state of AI and where it was going. And so at that point I was confident enough, it was worth a shot. I think LLMs are going to be the next big thing. And so that's the thing I want to be building in, in that space. And I thought what's the most impactful product I can possibly build. And I thought it should be a platform. So I myself love platforms. I think they're fantastic because they open up an ecosystem where anyone can contribute to it. Right.
So if you think of a platform like a YouTube, instead of it being like a Hollywood situation where you have to, if you want to make a TV show, you have to convince Disney to give you the money to produce it, instead, anyone in the world can post any content they want to YouTube. And if people want to view it, the algorithm is going to promote it. Nowadays, you can look at creators like Mr. Beast or Joe Rogan. They would never have had that opportunity unless it was for this platform. Other ones like Twitter's a great one, right? But I would consider Wikipedia to be a platform where instead of the Britannica encyclopedia, which is this, it's like a monolithic, you get all the, the researchers together, you get all the data together and you combine it in this, in this one monolithic source. Instead, you have this distributed thing. You can say anyone can host their content on Wikipedia. Anyone can contribute to it. And anyone can, maybe their contribution is they delete stuff. When I was hearing like the kind of the Sam Altman and kind of the, the Muskian perspective of AI, it was a very kind of monolithic thing. It was all about AI is basically a single thing, which is intelligence. Yeah. Yeah. The more intelligent, the more compute, the more intelligent, and the more and better AI researchers, the more intelligent, right? They would speak about it as a kind of race, like who can get the most data, the most compute and the most researchers. And that would end up with the most intelligent AI. But I didn't believe in any of that. I thought that's like the total, like I thought that perspective is the perspective of someone who's never actually done machine learning. Because with machine learning, first of all, you see that the performance of the models follows an S curve. So it's not like it just goes off to infinity, right? And the, the S curve, it kind of plateaus around human level performance. And you can look at all the, all the machine learning that was going on in the 2010s, everything kind of plateaued around the human level performance. And we can think about the self-driving car promises, you know, how Elon Musk kept saying the self-driving car is going to happen next year, it's going to happen next, next year. Or you can look at the image recognition, the speech recognition. You can look at all of these things, there was almost nothing that went superhuman, except for something like AlphaGo. And we can speak about why AlphaGo was able to go like super superhuman. So I thought the most likely thing was going to be this, I thought it's not going to be a monolithic thing. That's like an encyclopedia Britannica. I thought it must be a distributed thing. And I actually liked to look at the world of finance for what I think a mature machine learning ecosystem would look like. So, yeah. So finance is a machine learning ecosystem because all of these quant trading firms are running machine learning algorithms, but they're running it on a centralized platform like a marketplace. And it's not the case that there's one giant quant trading company of all the data and all the quant researchers and all the algorithms and compute, but instead they all specialize. So one will specialize on high frequency trading. Another will specialize on mid frequency. Another one will specialize on equity. Another one will specialize. And I thought that's the way the world works. That's how it is. And so there must exist a platform where a small team can produce an AI for a unique purpose.
And they can iterate and build the best thing for that, right? And so that was the vision for Chai. So we wanted to build a platform for LLMs.Alessio [00:11:36]: That's kind of the maybe inside versus contrarian view that led you to start the company. Yeah. And then what was maybe the initial idea maze? Because if somebody told you that was the Hugging Face founding story, people might believe it. It's kind of like a similar ethos behind it. How did you land on the product feature today? And maybe what were some of the ideas that you discarded that initially you thought about?William [00:11:58]: So the first thing we built, it was fundamentally an API. So nowadays people would describe it as like agents, right? But anyone could write a Python script. They could submit it to an API. They could send it to the Chai backend and we would then host this code and execute it. So that's like the developer side of the platform. On their Python script, the interface was essentially text in and text out. An example would be the very first bot that I created. I think it was a Reddit news bot. And so it would first, it would pull the popular news. Then it would prompt whatever, like I just used some external API for like BERT or GPT-2 or whatever. Like it was a very, very small thing. And then the user could talk to it. So you could say to the bot, hi bot, what's the news today? And it would say, this is the top stories. And you could chat with it. Now four years later, that's like Perplexity or something, right? But back then the models were first of all, like really, really dumb. You know, they had an IQ of like a four year old. And users, there really wasn't any demand or any PMF for interacting with the news. So then I was like, okay. Um. So let's make another one. And I made a bot, which was like, you could talk to it about a recipe. So you could say, I'm making eggs. Like I've got eggs in my fridge. What should I cook? And it'll say, you should make an omelet. Right. There was no PMF for that. No one used it. And so I just kept creating bots. And so every single night after work, I'd be like, okay, I like, we have AI, we have this platform. I can create any text-in, text-out sort of agent and put it on the platform. And so we just create stuff night after night. And then all the coders I knew, I would say, yeah, this is what we're going to do. And then I would say to them, look, there's this platform. You can create any like chat AI. You should put it on. And you know, everyone's like, well, chatbots are super lame. We want absolutely nothing to do with your chatbot app. No one who knew Python wanted to build on it. I'm like trying to build all these bots and no consumers want to talk to any of them. And then my sister who at the time was like just finishing college or something, I said to her, I was like, if you want to learn Python, you should just submit a bot for my platform. And she, she built a therapy bot for me. And I was like, okay, cool. I'm going to build a therapist bot. And then the next day I checked the performance of the app and I'm like, oh my God, we've got 20 active users. And they spent, they spent like an average of 20 minutes on the app. I was like, oh my God, what, what bot were they speaking to for an average of 20 minutes? And I looked and it was the therapist bot. And I went, oh, this is where the PMF is. There was no demand for, for recipe help. There was no demand for news.
There was no demand for dad jokes or pub quiz or fun facts. What they wanted was they wanted the therapist bot. At the time I kind of reflected on that and I thought, well, if I want to consume news, the most fun thing, most fun way to consume news is like Twitter. The value of there being a back and forth wasn't that high. Right. And I thought if I need help with a recipe, I actually just go like the New York Times has a good recipe section, right? It's not actually that hard. And so I just thought the thing that AI is 10x better at is a sort of a conversation, right, that's not intrinsically informative, but it's more about an opportunity. You can say whatever you want. You're not going to get judged. If it's 3am, you don't have to wait for your friend to text back. It's like, it's immediate. They're going to reply immediately. You can say whatever you want. It's judgment-free and it's much more like a playground. It's much more like a fun experience. And you could see that if the AI gave a person a compliment, they would love it. It's much easier to get the AI to give you a compliment than a human. From that day on, I said, okay, I get it. Humans want to speak to like humans or human like entities and they want to have fun. And that was when I started to look less at platforms like Google. And I started to look more at platforms like Instagram. And I was trying to think about why do people use Instagram? And I could see that I think Chai was, was filling the same desire or the same drive. If you go on Instagram, typically you want to look at the faces of other humans, or you want to hear about other people's lives. So if it's like The Rock is making himself pancakes on a cheat day, you kind of feel a little bit like you're The Rock's friend, or you're like having pancakes with him or something, right? But if you do it too much, you feel like you're sad and like a lonely person, but with AI, you can talk to it and tell it stories and it tells you stories, and you can play with it for as long as you want. And you don't feel like you're like a sad, lonely person. You feel like you actually have a friend.Alessio [00:16:29]: And what, why is that? Do you have any insight on that from using it?William [00:16:33]: I think it's just the human psychology. I think it's just the idea that, with old school social media, you're just consuming passively, right? So you'll just swipe. If I'm watching TikTok, just like swipe and swipe and swipe. And even though I'm getting the dopamine of like watching an engaging video, there's this other thing that's building in my head, which is like, I'm feeling lazier and lazier and lazier. And after a certain period of time, I'm like, man, I just wasted 40 minutes. I achieved nothing. But with AI, because you're interacting, you feel like you're, it's not like work, but you feel like you're participating and contributing to the thing. You don't feel like you're just consuming. So you don't have a sense of remorse basically. And you know, I think on the whole people, the way people talk about, try and interact with the AI, they speak about it in an incredibly positive sense. Like we get people who say they have eating disorders saying that the AI helps them with their eating disorders. People who say they're depressed, it helps them through like the rough patches. So I think there's something intrinsically healthy about interacting that TikTok and Instagram and YouTube don't quite tick.
From that point on, it was about building more and more kind of like human centric AI for people to interact with. And I was like, okay, let's make a Kanye West bot, right? And then no one wanted to talk to the Kanye West bot. And I was like, ah, who's like a cool persona for teenagers to want to interact with. And I was like, I was trying to find the influencers and stuff like that, but no one cared. Like they didn't want to interact with the, yeah. And instead it was really just the special moment was when we said the realization that developers and software engineers aren't interested in building this sort of AI, but the consumers are right. And rather than me trying to guess every day, like what's the right bot to submit to the platform, why don't we just create the tools for the users to build it themselves? And so nowadays this is like the most obvious thing in the world, but when Chai first did it, it was not an obvious thing at all. Right. Right. So we took the API for let's just say it was, I think it was GPTJ, which was this 6 billion parameter open source transformer style LLM. We took GPTJ. We let users create the prompt. We let users select the image and we let users choose the name. And then that was the bot. And through that, they could shape the experience, right? So if they said this bot's going to be really mean, and it's going to be called like bully in the playground, right? That was like a whole category that I never would have guessed. Right. People love to fight. They love to have a disagreement, right? And then they would create, there'd be all these romantic archetypes that I didn't know existed. And so as the users could create the content that they wanted, that was when Chai was able to, to get this huge variety of content and rather than appealing to, you know, 1% of the population that I'd figured out what they wanted, you could appeal to a much, much broader thing. And so from that moment on, it was very, very crystal clear. It's like Chai, just as Instagram is this social media platform that lets people create images and upload images, videos and upload that, Chai was really about how can we let the users create this experience in AI and then share it and interact and search. So it's really, you know, I say it's like a platform for social AI.Alessio [00:20:00]: Where did the Chai name come from? Because you started the same path. I was like, is it character AI shortened? You started at the same time, so I was curious. The UK origin was like the second, the Chai.William [00:20:15]: We started way before character AI. And there's an interesting story that Chai's numbers were very, very strong, right? So I think in even 20, I think late 2022, was it late 2022 or maybe early 2023? Chai was like the number one AI app in the app store. So we would have something like 100,000 daily active users. And then one day we kind of saw there was this website. And we were like, oh, this website looks just like Chai. And it was the character AI website. And I think that nowadays it's, I think it's much more common knowledge that when they left Google with the funding, I think they knew what was the most trending, the number one app. And I think they sort of built that. Oh, you found the people.swyx [00:21:03]: You found the PMF for them.William [00:21:04]: We found the PMF for them. Exactly. Yeah. So I worked a year very, very hard. 
And then they, and then that was when I learned a lesson, which is that if you're VC backed and if, you know, so Chai, we'd kind of ran, we'd got to this point, I was the only person who'd invested. I'd invested maybe 2 million pounds in the business. And you know, from that, we were able to build this thing, get to say a hundred thousand daily active users. And then when character AI came along, the first version, we sort of laughed. We were like, oh man, this thing sucks. Like they don't know what they're building. They're building the wrong thing anyway, but then I saw, oh, they've raised a hundred million dollars. Oh, they've raised another hundred million dollars. And then our users started saying, oh guys, your AI sucks. Cause we were serving a 6 billion parameter model, right? How big was the model that character AI could afford to serve, right? So we would be spending, let's say we would spend a dollar per per user, right? Over the, the, you know, the entire lifetime.swyx [00:22:01]: A dollar per session, per chat, per month? No, no, no, no.William [00:22:04]: Let's say we'd get over the course of the year, we'd have a million users and we'd spend a million dollars on the AI throughout the year. Right. Like aggregated. Exactly. Exactly. Right. They could spend a hundred times that. So people would say, why is your AI much dumber than character AIs? And then I was like, oh, okay, I get it. This is like the Silicon Valley style, um, hyper scale business. And so, yeah, we moved to Silicon Valley and, uh, got some funding and iterated and built the flywheels. And, um, yeah, I, I'm very proud that we were able to compete with that. Right. So, and I think the reason we were able to do it was just customer obsession. And it's similar, I guess, to how deep seek have been able to produce such a compelling model when compared to someone like an open AI, right? So deep seek, you know, their latest, um, V2, yeah, they claim to have spent 5 million training it.swyx [00:22:57]: It may be a bit more, but, um, like, why are you making it? Why are you making such a big deal out of this? Yeah. There's an agenda there. Yeah. You brought up deep seek. So we have to ask you had a call with them.William [00:23:07]: We did. We did. We did. Um, let me think what to say about that. I think for one, they have an amazing story, right? So their background is again in finance.swyx [00:23:16]: They're the Chinese version of you. Exactly.William [00:23:18]: Well, there's a lot of similarities. Yes. Yes. I have a great affinity for companies which are like, um, founder led, customer obsessed and just try and build something great. And I think what deep seek have achieved. There's quite special is they've got this amazing inference engine. They've been able to reduce the size of the KV cash significantly. And then by being able to do that, they're able to significantly reduce their inference costs. And I think with kind of with AI, people get really focused on like the kind of the foundation model or like the model itself. And they sort of don't pay much attention to the inference. To give you an example with Chai, let's say a typical user session is 90 minutes, which is like, you know, is very, very long for comparison. Let's say the average session length on TikTok is 70 minutes. So people are spending a lot of time. And in that time they're able to send say 150 messages. That's a lot of completions, right? 
It's quite different from an OpenAI scenario where people might come in, they'll have a particular question in mind. And they'll ask like one question. And a few follow up questions, right? So because they're consuming, say 30 times as many requests for a chat, or a conversational experience, you've got to figure out how to get the right balance between the cost of that and the quality. And so, you know, I think with AI, it's always been the case that if you want a better experience, you can throw compute at the problem, right? So if you want a better model, you can just make it bigger. If you want it to remember better, give it a longer context. And now, what OpenAI is doing to great fanfare is, with rejection sampling, you can generate many candidates, right? And then with some sort of reward model or some sort of scoring system, you can serve the most promising of these many candidates. And so that's kind of scaling up on the inference time compute side of things. And so for us, it doesn't make sense to think of AI as just the absolute performance. So what we're seeing, it's like the MMLU score or the, you know, any of these benchmarks that people like to look at, if you just get that score, it doesn't really tell you anything. Because it's really like progress is made by improving the performance per dollar. And so I think that's an area where DeepSeek have been able to perform very, very well, surprisingly so. And so I'm very interested in what Llama 4 is going to look like. And if they're able to sort of match what DeepSeek have been able to achieve with this performance per dollar gain.Alessio [00:25:59]: Before we go into the inference, some of the deeper stuff, can you give people an overview of like some of the numbers? So I think last I checked, you have like 1.4 million daily active now. It's like over 22 million of revenue. So it's quite a business.William [00:26:12]: Yeah, I think we grew by a factor of, you know, users grew by a factor of three last year. Revenue over doubled. You know, it's very exciting. We're competing with some really big, really well funded companies. Character AI got this, I think it was almost a $3 billion valuation. And they have 5 million DAU is a number that I last heard. Talkie, which is a Chinese built app owned by a company called Minimax. They're incredibly well funded. And these companies didn't grow by a factor of three last year. Right. And so when you've got this company and this team that's able to keep building something that gets users excited, and they want to tell their friend about it, and then they want to come and they want to stick on the platform. I think that's very special. And so last year was a great year for the team. And yeah, I think the numbers reflect the hard work that we put in. And then fundamentally, the quality of the app, the quality of the content, the quality of the AI is the quality of the experience that you have. You actually published your DAU growth chart, which is unusual. And I see some inflections. Like, it's not just a straight line. There's some things that actually inflect. Yes. What were the big ones? Cool. That's a great, great, great question. Let me think of a good answer. I'm basically looking to annotate this chart, which doesn't have annotations on it. Cool.
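William's description a little further up, generate many candidates and serve the most promising one according to a reward model, is essentially best-of-N sampling. A toy sketch follows; generate_candidate and reward_score are hypothetical placeholders for an LLM call and a trained reward model, not Chai's or OpenAI's actual code.

```python
# Sketch of best-of-N / rejection sampling: sample several candidate
# completions, score each with a reward model, serve the top one.
def generate_candidate(history):
    return "candidate reply"  # stand-in for sampling from an LLM

def reward_score(history, reply):
    return len(reply)  # stand-in: a real reward model predicts user engagement

def best_of_n(history, n=8):
    candidates = [generate_candidate(history) for _ in range(n)]
    return max(candidates, key=lambda reply: reward_score(history, reply))

print(best_of_n(["hi!"]))
```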
The first thing I would say is this is, I think the most important thing to know about success is that success is born out of failures. Right? Through failures that we learn. You know, if you think something's a good idea, and you do and it works, great, but you didn't actually learn anything, because everything went exactly as you imagined. But if you have an idea, you think it's going to be good, you try it, and it fails. There's a gap between the reality and expectation. And that's an opportunity to learn. The flat periods, that's us learning. And then the up periods is that's us reaping the rewards of that. So I think, on the big growth chart of 2024, I think the first thing that really kind of put a dent in our growth was our backend. So we just reached this scale. So we'd, from day one, we'd built on top of Google's GCP, which is Google's cloud platform. And they were fantastic. We used them when we had one daily active user, and they worked pretty good all the way up till we had about 500,000. It was never the cheapest, but from an engineering perspective, man, that thing scaled insanely good. Like, not Vertex? Not Vertex. Like GKE, that kind of stuff? We use Firebase. So we use Firebase. I'm pretty sure we're the biggest user ever on Firebase. That's expensive. Yeah, we had calls with engineers, and they're like, we wouldn't recommend using this product beyond this point, and you're 3x over that. So we pushed Google to their absolute limits. You know, it was fantastic for us, because we could focus on the AI. We could focus on just adding as much value as possible. But then what happened was, after 500,000, just the thing, the way we were using it, and it would just, it wouldn't scale any further. And so we had a really, really painful, at least three-month period, as we kind of migrated between different services, figuring out, like, what requests do we want to keep on Firebase, and what ones do we want to move on to something else? And then, you know, making mistakes. And learning things the hard way. And then after about three months, we got that right. So that, we would then be able to scale to the 1.5 million DAU without any further issues from the GCP. But what happens is, if you have an outage, new users who go on your app experience a dysfunctional app, and then they're going to exit. And so your next day, the key metrics that the app stores track are going to be something like retention rates, money spent, and the star, like, the rating that they give you. In the app store. In the app store, yeah. Tyranny. So if you're ranked top 50 in entertainment, you're going to acquire a certain rate of users organically. If you go in and have a bad experience, it's going to tank where you're positioned in the algorithm. And then it can take a long time to kind of earn your way back up, at least if you wanted to do it organically. If you throw money at it, you can jump to the top. And I could talk about that. But broadly speaking, if we look at 2024, the first kink in the graph was outages due to hitting 500k DAU. The backend didn't want to scale past that. So then we just had to do the engineering and build through it. Okay, so we built through that, and then we get a little bit of growth. And so, okay, that's feeling a little bit good. I think the next thing, I think it's, I'm not going to lie, I have a feeling that when Character AI got... I was thinking.
I think so. I think... So the Character AI team fundamentally got acquired by Google. And I don't know what they changed in their business. I don't know if they dialed down that ad spend. Products don't change, right? Products just what it is. I don't think so. Yeah, I think the product is what it is. It's like maintenance mode. Yes. I think the issue that people, you know, some people may think this is an obvious fact, but running a business can be very competitive, right? Because other businesses can see what you're doing, and they can imitate you. And then there's this... There's this question of, if you've got one company that's spending $100,000 a day on advertising, and you've got another company that's spending zero, if you consider market share, and if you're considering new users which are entering the market, the guy that's spending $100,000 a day is going to be getting 90% of those new users. And so I have a suspicion that when the founders of Character AI left, they dialed down their spending on user acquisition. And I think that kind of gave oxygen to like the other apps. And so Chai was able to then start growing again in a really healthy fashion. I think that's kind of like the second thing. I think a third thing is we've really built a great data flywheel. Like the AI team sort of perfected their flywheel, I would say, in end of Q2. And I could speak about that at length. But fundamentally, the way I would describe it is when you're building anything in life, you need to be able to evaluate it. And through evaluation, you can iterate, we can look at benchmarks, and we can say the issues with benchmarks and why they may not generalize as well as one would hope in the challenges of working with them. But something that works incredibly well is getting feedback from humans. And so we built this thing where anyone can submit a model to our developer backend, and it gets put in front of 5000 users, and the users can rate it. And we can then have a really accurate ranking of like which model, or users finding more engaging or more entertaining. And it gets, you know, it's at this point now, where every day we're able to, I mean, we evaluate between 20 and 50 models, LLMs, every single day, right. So even though we've got only got a team of, say, five AI researchers, they're able to iterate a huge quantity of LLMs, right. So our team ships, let's just say minimum 100 LLMs a week is what we're able to iterate through. Now, before that moment in time, we might iterate through three a week, we might, you know, there was a time when even doing like five a month was a challenge, right? By being able to change the feedback loops to the point where it's not, let's launch these three models, let's do an A-B test, let's assign, let's do different cohorts, let's wait 30 days to see what the day 30 retention is, which is the kind of the, if you're doing an app, that's like A-B testing 101 would be, do a 30-day retention test, assign different treatments to different cohorts and come back in 30 days. So that's insanely slow. That's just, it's too slow. And so we were able to get that 30-day feedback loop all the way down to something like three hours. And when we did that, we could really, really, really perfect techniques like DPO, fine tuning, prompt engineering, blending, rejection sampling, training a reward model, right, really successfully, like boom, boom, boom, boom, boom. And so I think in Q3 and Q4, we got, the amount of AI improvements we got was like astounding. 
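The flywheel William describes here, submit a model, put it in front of a few thousand users, rank on their feedback, boils down to a very small loop. A toy sketch with made-up data follows; Chaiverse's real leaderboard logic is more involved than a mean rating, so treat this as illustrative only.

```python
# Sketch of the crowdsourced eval loop: every submitted model gets put in
# front of users, ratings come back, and models are ranked on the feedback.
from collections import defaultdict
from statistics import mean

ratings = defaultdict(list)  # model_id -> list of user ratings

def record_rating(model_id, stars):
    ratings[model_id].append(stars)

def leaderboard():
    # Rank models by average user rating, best first.
    return sorted(ratings, key=lambda m: mean(ratings[m]), reverse=True)

record_rating("model-a", 4)
record_rating("model-a", 5)
record_rating("model-b", 3)
print(leaderboard())  # ['model-a', 'model-b']
```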
It was getting to the point, I thought, like, how much more edge is there to be had here? But the team just could keep going and going and going. That was like number three for the inflection point. swyx [00:34:53]: There's a fourth? William [00:34:54]: The important thing about the third one is if you go on our Reddit or you talk to users of AI, there's like a clear date. It's like somewhere in October or something. The users, they flipped. Before October, the users would say Character AI is better than you, for the most part. Then from October onwards, they would say, wow, you guys are better than Character AI. And that was like a really clear positive signal that we'd sort of done it. And I think people, you can't cheat consumers. You can't trick them. You can't b******t them. They know, right? If you're going to spend 90 minutes on a platform, and with apps, the barriers to switching are pretty low. Like you can try Character AI for a day. If you get bored, you can try Chai. If you get bored of Chai, you can go back to Character. So the users, the loyalty is not strong, right? What keeps them on the app is the experience. If you deliver a better experience, they're going to stay and they can tell. So the fourth one was, we were fortunate enough to get this hire. We had hired one really talented engineer. And then they said, oh, at my last company, we had a head of growth. He was really, really good. And he was the head of growth for ByteDance for two years. Would you like to speak to him? And I was like, yes. Yes, I think I would. And so I spoke to him. And he just blew me away with what he knew about user acquisition. You know, it was like a 3D chess swyx [00:36:21]: sort of thing. You know, as much as, as I know about AI. Like ByteDance as in TikTok US. Yes. William [00:36:26]: Not ByteDance as other stuff. Yep. He was interviewing us as we were interviewing him. Right. And so pick up options. Yeah, exactly. And so he was kind of looking at our metrics. And I saw him get really excited when he said, guys, you've got a million daily active users and you've done no advertising. I said, correct. And he was like, that's unheard of. He's like, I've never heard of anyone doing that. And then he started looking at our metrics. And he was like, if you've got all of this organically, if you start spending money, this is going to be very exciting. I was like, let's give it a go. So then he came in, and we've just started ramping up the user acquisition. So that looks like spending, you know, let's say we started spending $20,000 a day, and it looked very promising at $20,000. Right now we're spending $40,000 a day on user acquisition. That's still only half of what Character AI or Talkie may be spending. But from that, we were growing at a rate of maybe, say, 2x a year. And that got us growing at a rate of 3x a year. So I'm growing, I'm evolving more and more into like a Silicon Valley style hyper growth, like, you know, you build something decent, and then you can swyx [00:37:33]: slap on a huge...
You did the important thing, you did the product first.William [00:37:36]: Of course, but then you can slap on like, like the rocket or the jet engine or something, which is just this cash in, you pour in as much cash, you buy a lot of ads, and your growth is faster.swyx [00:37:48]: Not to, you know, I'm just kind of curious what's working right now versus what surprisinglyWilliam [00:37:52]: doesn't work. Oh, there's a long, long list of surprising stuff that doesn't work. Yeah. The surprising thing, like the most surprising thing, what doesn't work is almost everything doesn't work. That's what's surprising. And I'll give you an example. So like a year and a half ago, I was working at a company, we were super excited by audio. I was like, audio is going to be the next killer feature, we have to get in the app. And I want to be the first. So everything Chai does, I want us to be the first. We may not be the company that's strongest at execution, but we can always be theswyx [00:38:22]: most innovative. Interesting. Right? So we can... You're pretty strong at execution.William [00:38:26]: We're much stronger, we're much stronger. A lot of the reason we're here is because we were first. If we launched today, it'd be so hard to get the traction. Because it's like to get the flywheel, to get the users, to build a product people are excited about. If you're first, people are naturally excited about it. But if you're fifth or 10th, man, you've got to beswyx [00:38:46]: insanely good at execution. So you were first with voice? We were first. We were first. I only knowWilliam [00:38:51]: when character launched voice. They launched it, I think they launched it at least nine months after us. Okay. Okay. But the team worked so hard for it. At the time we did it, latency is a huge problem. Cost is a huge problem. Getting the right quality of the voice is a huge problem. Right? Then there's this user interface and getting the right user experience. Because you don't just want it to start blurting out. Right? You want to kind of activate it. But then you don't have to keep pressing a button every single time. There's a lot that goes into getting a really smooth audio experience. So we went ahead, we invested the three months, we built it all. And then when we did the A-B test, there was like, no change in any of the numbers. And I was like, this can't be right, there must be a bug. And we spent like a week just checking everything, checking again, checking again. And it was like, the users just did not care. And it was something like only 10 or 15% of users even click the button to like, they wanted to engage the audio. And they would only use it for 10 or 15% of the time. So if you do the math, if it's just like something that one in seven people use it for one seventh of their time. You've changed like 2% of the experience. So even if that that 2% of the time is like insanely good, it doesn't translate much when you look at the retention, when you look at the engagement, and when you look at the monetization rates. So audio did not have a big impact. I'm pretty big on audio. But yeah, I like it too. But it's, you know, so a lot of the stuff which I do, I'm a big, you can have a theory. And you resist. Yeah. Exactly, exactly. So I think if you want to make audio work, it has to be a unique, compelling, exciting experience that they can't have anywhere else.swyx [00:40:37]: It could be your models, which just weren't good enough.William [00:40:39]: No, no, no, they were great. 
Oh, yeah, they were very good. it was like, it was kind of like just the, you know, if you listen to like an audible or Kindle, or something like, you just hear this voice. And it's like, you don't go like, wow, this is this is special, right? It's like a convenience thing. But the idea is that if you can, if Chai is the only platform, like, let's say you have a Mr. Beast, and YouTube is the only platform you can use to make audio work, then you can watch a Mr. Beast video. And it's the most engaging, fun video that you want to watch, you'll go to a YouTube. And so it's like for audio, you can't just put the audio on there. And people go, oh, yeah, it's like 2% better. Or like, 5% of users think it's 20% better, right? It has to be something that the majority of people, for the majority of the experience, go like, wow, this is a big deal. That's the features you need to be shipping. If it's not going to appeal to the majority of people, for the majority of the experience, and it's not a big deal, it's not going to move you. Cool. So you killed it. I don't see it anymore. Yep. So I love this. The longer, it's kind of cheesy, I guess, but the longer I've been working at Chai, and I think the team agrees with this, all the platitudes, at least I thought they were platitudes, that you would get from like the Steve Jobs, which is like, build something insanely great, right? Or be maniacally focused, or, you know, the most important thing is saying no to, not to work on. All of these sort of lessons, they just are like painfully true. They're painfully true. So now I'm just like, everything I say, I'm either quoting Steve Jobs or Zuckerberg. I'm like, guys, move fast and break free.swyx [00:42:10]: You've jumped the Apollo to cool it now.William [00:42:12]: Yeah, it's just so, everything they said is so, so true. The turtle neck. Yeah, yeah, yeah. Everything is so true.swyx [00:42:18]: This last question on my side, and I want to pass this to Alessio, is on just, just multi-modality in general. This actually comes from Justine Moore from A16Z, who's a friend of ours. And a lot of people are trying to do voice image video for AI companions. Yes. You just said voice didn't work. Yep. What would make you revisit?William [00:42:36]: So Steve Jobs, he was very, listen, he was very, very clear on this. There's a habit of engineers who, once they've got some cool technology, they want to find a way to package up the cool technology and sell it to consumers, right? That does not work. So you're free to try and build a startup where you've got your cool tech and you want to find someone to sell it to. That's not what we do at Chai. At Chai, we start with the consumer. What does the consumer want? What is their problem? And how do we solve it? So right now, the number one problems for the users, it's not the audio. That's not the number one problem. It's not the image generation either. That's not their problem either. The number one problem for users in AI is this. All the AI is being generated by middle-aged men in Silicon Valley, right? That's all the content. You're interacting with this AI. You're speaking to it for 90 minutes on average. It's being trained by middle-aged men. The guys out there, they're out there. They're talking to you. They're talking to you. They're like, oh, what should the AI say in this situation, right? What's funny, right? What's cool? What's boring? What's entertaining? That's not the way it should be. The way it should be is that the users should be creating the AI, right? 
And so the way I speak about it is this. Chai, we have this AI engine which sits atop a thin layer of UGC. So the thin layer of UGC is absolutely essential, right? But it's just prompts. It's just an image. It's just a name. It's like we've done 1% of what we could do. So we need to keep thickening up that layer of UGC. It must be the case that the users can train the AI. And if reinforcement learning is powerful and important, they have to be able to do that. And so it's got to be the case that there exists, you know, I say to the team, just as Mr. Beast is able to spend 100 million a year or whatever it is on his production company, and he's got a team building the content, which then he shares on the YouTube platform. Until there's a team that's earning 100 million a year or spending 100 million on the content that they're producing for the Chai platform, we're not finished, right? So that's the problem. That's what we're excited to build. And getting too caught up in the tech, I think, is a fool's errand. It does not work. Alessio [00:44:52]: As an aside, I saw the Beast Games thing on Amazon Prime. It's not doing well. And I'm swyx [00:44:56]: curious. It's kind of like, I mean, the audience rating is high. The Rotten Tomatoes score sucks, but the audience rating is high. Alessio [00:45:02]: But it's not like in the top 10. I saw it dropped off of like the... Oh, okay. Yeah, that one I don't know. I'm curious, like, you know, it's kind of like similar content, but different platform. And then going back to like, some of what you were saying is like, you know, people come to Chai William [00:45:13]: expecting some type of content. Yeah, I think something that's interesting to discuss is moats. What is the moat? And so, you know, if you look at a platform like YouTube, the moat, I think, is really in the ecosystem. And the ecosystem is comprised of the content creators, the users, the consumers, and then you have the algorithms. And so this creates a sort of flywheel where the algorithms are able to be trained on the users and the users' data, and the recommender systems can then feed information to the content creators. So Mr. Beast, he knows which thumbnail does the best. He knows the first 10 seconds of the video has to be this particular way. And so his content is super optimized for the YouTube platform. So that's why it doesn't do well on Amazon. If he wants to do well on Amazon, well, how many videos has he created on the YouTube platform? Thousands, tens of thousands, I guess. He needs to get those iterations in on Amazon. So at Chai, I think it's all about how can we get the most compelling, rich user generated content, stick that on top of the AI engine, the recommender systems, such that we get this beautiful data flywheel: more users, better recommendations, more creative, more content, more users. Alessio [00:46:34]: You mentioned the algorithm, you have this idea of the Chaiverse on Chai, and you have your own kind of LMSYS-like ELO system. Yeah, what are things that your models optimize for, like your users optimize for, and maybe talk about how you build it, how people submit models? William [00:46:49]: So Chaiverse is what I would describe as a developer platform. More often when we're speaking about Chai, we're thinking about the Chai app.
And the Chai app is really this product for consumers. And so consumers can come on the Chai app, they can interact with our AI, and they can interact with other UGC. And it's really just these kind of bots. And it's a thin layer of UGC. Okay. Our mission is not to just have a very thin layer of UGC. Our mission is to have as much UGC as possible. So we must have... I don't want only people at Chai training the AI. I want people, not middle-aged men, building AI. I want everyone building the AI, as many people building the AI as possible. Okay, so what we built was we built Chaiverse. And Chaiverse is kind of, it's kind of like a prototype, is the way to think about it. And it started with this observation that, well, how many models get submitted to Hugging Face a day? It's hundreds, it's hundreds, right? So there's hundreds of LLMs submitted each day. Now consider that, what does it take to build an LLM? It takes a lot of work, actually. It's like someone devoted several hours of compute, several hours of their time, prepared a data set, launched it, ran it, evaluated it, submitted it, right? So there's a lot of work that's going into that. So what we did was we said, well, why can't we host their models for them and serve them to users? And then what would that look like? The first issue is, well, how do you know if a model is good or not? Like, we don't want to serve users the crappy models, right? So what we would do is, I love the LMSYS style. I think it's really cool. It's really simple. It's a very intuitive thing, which is you simply present the users with two completions. You can say, look, this is from model A, this is from model B, which is better? And so if someone submits a model to Chaiverse, what we do is we spin up a GPU. We download the model. We're going to now host that model on this GPU. And we're going to start routing traffic to it. And we're going to send, we think it takes about 5,000 completions to get an accurate signal. That's roughly what LMSYS does. And from that, we're able to get an accurate ranking of which models people are finding entertaining and which models are not entertaining. If you look at the bottom 80%, they'll suck. You can just disregard them. They totally suck. Then when you get to the top 20%, you know you've got a decent model, but you can break it down into more nuance. There might be one that's really descriptive. There might be one that's got a lot of personality to it. There might be one that's really illogical. Then the question is, well, what do you do with these top models? From that, you can do more sophisticated things. You can try and do like a routing thing where you say, for a given user request, we're going to try and predict which of these N models the user will enjoy the most. That turns out to be pretty expensive and not a huge source of edge or improvement. Something that we love to do at Chai is blending, which is, you know, the simplest way to think about it is you're going to pretty quickly see you've got one model that's really smart, one model that's really funny. How do you get the user an experience that is both smart and funny?
Well, just 50% of the requests, you can serve them the smart model, 50% of the requests, you serve them the funny model. Just a random 50%? Just a random, yeah. And then... That's blending? That's blending. You can do more sophisticated things on top of that, as in all things in life, but the 80-20 solution, if you just do that, you get a pretty powerful effect out of the gate. Random number generator. I think it's like the robustness of randomness. Random is a very powerful optimization technique, and it's a very robust thing. So you can explore a lot of the space very efficiently. There's one thing that's really, really important to share, and this is the most exciting thing for me, is after you do the ranking, you get an ELO score, and you can track a user's first join date, the first date they submit a model to Chaiverse, they almost always get a terrible ELO, right? So let's say the first submission they get an ELO of 1,100 or 1,000 or something, and you can see that they iterate and they iterate and iterate, and it will be like, no improvement, no improvement, no improvement, and then boom. Do you give them any data, or do you have to come up with this themselves? We do, we do, we do, we do. We try and strike a balance between giving them data that's very useful, you've got to be compliant with GDPR, which is like, you have to work very hard to preserve the privacy of users of your app. So we try to give them as much signal as possible, to be helpful. The minimum is we're just going to give you a score, right? That's the minimum. But that alone is people can optimize a score pretty well, because they're able to come up with theories, submit it, does it work? No. A new theory, does it work? No. And then boom, as soon as they figure something out, they keep it, and then they iterate, and then boom,Alessio [00:51:46]: they figure something out, and they keep it. Last year, you had this post on your blog, cross-sourcing the lead to the 10 trillion parameter, AGI, and you call it a mixture of experts, recommenders. Yep. Any insights?William [00:51:58]: Updated thoughts, 12 months later? I think the odds, the timeline for AGI has certainly been pushed out, right? Now, this is in, I'm a controversial person, I don't know, like, I just think... You don't believe in scaling laws, you think AGI is further away. I think it's an S-curve. I think everything's an S-curve. And I think that the models have proven to just be far worse at reasoning than people sort of thought. And I think whenever I hear people talk about LLMs as reasoning engines, I sort of cringe a bit. I don't think that's what they are. I think of them more as like a simulator. I think of them as like a, right? So they get trained to predict the next most likely token. It's like a physics simulation engine. So you get these like games where you can like construct a bridge, and you drop a car down, and then it predicts what should happen. And that's really what LLMs are doing. It's not so much that they're reasoning, it's more that they're just doing the most likely thing. So fundamentally, the ability for people to add in intelligence, I think is very limited. What most people would consider intelligence, I think the AI is not a crowdsourcing problem, right? Now with Wikipedia, Wikipedia crowdsources knowledge. It doesn't crowdsource intelligence. So it's a subtle distinction. AI is fantastic at knowledge. I think it's weak at intelligence. 
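The "blending" William describes is simple enough to sketch: each incoming request is answered by a model drawn at random from a small ensemble, so a single conversation mixes the strengths of several LLMs. Below is a minimal illustration; the model functions are hypothetical stand-ins for calls to real inference backends, not Chai's production router.

```python
import random

# Hypothetical stand-ins for two separately trained chat models; in a real
# system these would call whichever backends host the weights.
def smart_model(conversation):
    return "A thoughtful, on-topic reply."

def funny_model(conversation):
    return "A playful, joke-heavy reply."

def blended_reply(conversation, models=(smart_model, funny_model)):
    # Blending in its simplest form: each turn is served by a model drawn
    # uniformly at random, so one chat inherits both styles over time.
    return random.choice(models)(conversation)

history = ["user: tell me about your day"]
print(blended_reply(history))
```

Weighted sampling or learned routing can sit on top of this, but as William notes, the uniform-random version already captures most of the effect.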
And a lot, it's easy to conflate the two because if you ask it a question and it gives you, you know, if you said, who was the seventh president of the United States, and it gives you the correct answer, I'd say, well, I don't know the answer to that. And you can conflate that with intelligence. But really, that's a question of knowledge. And knowledge is really this thing about saying, how can I store all of this information? And then how can I retrieve something that's relevant? Okay, they're fantastic at that. They're fantastic at storing knowledge and retrieving the relevant knowledge. They're superior to humans in that regard. And so I think we need to come up for a new word. How does one describe AI should contain more knowledge than any individual human? It should be more accessible than any individual human. That's a very powerful thing. That's superswyx [00:54:07]: powerful. But what words do we use to describe that? We had a previous guest on Exa AI that does search. And he tried to coin super knowledge as the opposite of super intelligence.William [00:54:20]: Exactly. I think super knowledge is a more accurate word for it.swyx [00:54:24]: You can store more things than any human can.William [00:54:26]: And you can retrieve it better than any human can as well. And I think it's those two things combined that's special. I think that thing will exist. That thing can be built. And I think you can start with something that's entertaining and fun. And I think, I often think it's like, look, it's going to be a 20 year journey. And we're in like, year four, or it's like the web. And this is like 1998 or something. You know, you've got a long, long way to go before the Amazon.coms are like these huge, multi trillion dollar businesses that every single person uses every day. And so AI today is very simplistic. And it's fundamentally the way we're using it, the flywheels, and this ability for how can everyone contribute to it to really magnify the value that it brings. Right now, like, I think it's a bit sad. It's like, right now you have big labs, I'm going to pick on open AI. And they kind of go to like these human labelers. And they say, we're going to pay you to just label this like subset of questions that we want to get a really high quality data set, then we're going to get like our own computers that are really powerful. And that's kind of like the thing. For me, it's so much like Encyclopedia Britannica. It's like insane. All the people that were interested in blockchain, it's like, well, this is this is what needs to be decentralized, you need to decentralize that thing. Because if you distribute it, people can generate way more data in a distributed fashion, way more, right? You need the incentive. Yeah, of course. Yeah. But I mean, the, the, that's kind of the exciting thing about Wikipedia was it's this understanding, like the incentives, you don't need money to incentivize people. You don't need dog coins. No. Sometimes, sometimes people get the satisfaction fro

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Sponsorships and applications for the AI Engineer Summit in NYC are live! (Speaker CFPs have closed) If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you.Right after Christmas, the Chinese Whale Bros ended 2024 by dropping the last big model launch of the year: DeepSeek v3. Right now on LM Arena, DeepSeek v3 has a score of 1319, right under the full o1 model, Gemini 2, and 4o latest. This makes it the best open weights model in the world in January 2025.There has been a big recent trend in Chinese labs releasing very large open weights models, with TenCent releasing Hunyuan-Large in November and Hailuo releasing MiniMax-Text this week, both over 400B in size. However these extra-large language models are very difficult to serve.Baseten was the first of the Inference neocloud startups to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team and early support of SGLang, a relatively new VLLM alternative that is also used at frontier labs like X.ai. Each H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that you can use 8 H200's in a node to inference DeepSeek v3 in FP8, taking into account KV Cache needs. We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space Demo Day in San Francisco, which was effectively the trial run for swyx and Alessio to work together! Since then, Philip Kiely also led a well attended workshop on TensorRT LLM at the 2024 World's Fair. We worked with him to get two of their best representatives, Amir and Lead Model Performance Engineer Yineng Zhang, to discuss DeepSeek, SGLang, and everything they have learned running Mission Critical Inference workloads at scale for some of the largest AI products in the world.The Three Pillars of Mission Critical InferenceWe initially planned to focus the conversation on SGLang, but Amir and Yineng were quick to correct us that the choice of inference framework is only the simplest, first choice of 3 things you need for production inference at scale:“I think it takes three things, and each of them individually is necessary but not sufficient: * Performance at the model level: how fast are you running this one model running on a single GPU, let's say. The framework that you use there can, can matter. The techniques that you use there can matter. The MLA technique, for example, that Yineng mentioned, or the CUDA kernels that are being used. But there's also techniques being used at a higher level, things like speculative decoding with draft models or with Medusa heads. And these are implemented in the different frameworks, or you can even implement it yourself, but they're not necessarily tied to a single framework. But using speculative decoding gets you massive upside when it comes to being able to handle high throughput. But that's not enough. Invariably, that one model running on a single GPU, let's say, is going to get too much traffic that it cannot handle.* Horizontal scaling at the cluster/region level: And at that point, you need to horizontally scale it. That's not an ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from, a single replica of that model to 5, to 10, to 100. And so that's the second, that's the second pillar that is necessary for running these machine critical inference workloads.And what does it take to do that? 
It takes, some people are like, Oh, You just need Kubernetes and Kubernetes has an autoscaler and that just works. That doesn't work for, for these kinds of mission critical inference workloads. And you end up catching yourself wanting to bit by bit to rebuild those infrastructure pieces from scratch. This has been our experience. * And then going even a layer beyond that, Kubernetes runs in a single. cluster. It's a single cluster. It's a single region tied to a single region. And when it comes to inference workloads and needing GPUs more and more, you know, we're seeing this that you cannot meet the demand inside of a single region. A single cloud's a single region. In other words, a single model might want to horizontally scale up to 200 replicas, each of which is, let's say, 2H100s or 4H100s or even a full node, you run into limits of the capacity inside of that one region. And what we had to build to get around that was the ability to have a single model have replicas across different regions. So, you know, there are models on Baseten today that have 50 replicas in GCP East and, 80 replicas in AWS West and Oracle in London, etc.* Developer experience for Compound AI Systems: The final one is wrapping the power of the first two pillars in a very good developer experience to be able to afford certain workflows like the ones that I mentioned, around multi step, multi model inference workloads, because more and more we're seeing that the market is moving towards those that the needs are generally in these sort of more complex workflows. We think they said it very well.Show Notes* Amir Haghighat, Co-Founder, Baseten* Yineng Zhang, Lead Software Engineer, Model Performance, BasetenFull YouTube EpisodePlease like and subscribe!Timestamps* 00:00 Introduction and Latest AI Model Launch* 00:11 DeepSeek v3: Specifications and Achievements* 03:10 Latent Space Podcast: Special Guests Introduction* 04:12 DeepSeek v3: Technical Insights* 11:14 Quantization and Model Performance* 16:19 MOE Models: Trends and Challenges* 18:53 Baseten's Inference Service and Pricing* 31:13 Optimization for DeepSeek* 31:45 Three Pillars of Mission Critical Inference Workloads* 32:39 Scaling Beyond Single GPU* 33:09 Challenges with Kubernetes and Infrastructure* 33:40 Multi-Region Scaling Solutions* 35:34 SG Lang: A New Framework* 38:52 Key Techniques Behind SG Lang* 48:27 Speculative Decoding and Performance* 49:54 Future of Fine-Tuning and RLHF* 01:00:28 Baseten's V3 and Industry TrendsBaseten's previous TensorRT LLM workshop: Get full access to Latent Space at www.latent.space/subscribe
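For readers who want to sanity-check the sizing claim in the episode description above (eight H200s per node for DeepSeek v3 in FP8), a back-of-the-envelope estimate looks like the sketch below. The parameter count is DeepSeek's published total (roughly 671B across the MoE), FP8 is taken as one byte per parameter, and real deployments also reserve memory for activations and framework overhead.

```python
# Rough sanity check of the 8x H200 node sizing mentioned above.
total_params_b = 671                    # DeepSeek v3 total parameters, in billions (MoE)
bytes_per_param = 1                     # FP8 weights ~= 1 byte per parameter
weights_gb = total_params_b * bytes_per_param      # ~671 GB of weights
node_vram_gb = 8 * 141                  # eight H200s at 141 GB each = 1128 GB
kv_cache_budget_gb = node_vram_gb - weights_gb     # ~457 GB left over
print(f"weights ~{weights_gb} GB, node VRAM {node_vram_gb} GB, "
      f"~{kv_cache_budget_gb} GB left for KV cache and overhead")
```

That leftover few hundred gigabytes is what accommodates the KV cache at long context lengths and reasonable batch sizes, which is why a single 8xH200 node is the practical floor for serving this model in FP8.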

Dodging Latent Space Detectors: Obfuscated Activation Attacks with Luke, Erik, and Scott.

Play Episode Listen Later Jan 18, 2025 130:23


In this episode of The Cognitive Revolution, Nathan explores the groundbreaking paper on obfuscated activations with 3 members from the research team - Luke Bailey, Eric Jenner, and Scott Emmons. The team discusses how their work challenges latent-based defenses in AI systems, demonstrating methods to bypass safety mechanisms while maintaining harmful behaviors. Join us for an in-depth technical conversation about AI safety, interpretability, and the ongoing challenge of creating robust defense systems. Do check out the "Obfuscated Activations Bypass LLM Latent-Space Defenses" paper here: https://obfuscated-activations.github.io/ Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse SPONSORS: Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive Shopify: Dreaming of starting your own business? Shopify makes it easier than ever. With customizable templates, shoppable social media posts, and their new AI sidekick, Shopify Magic, you can focus on creating great products while delegating the rest. Manage everything from shipping to payments in one place. Start your journey with a $1/month trial at https://shopify.com/cognitive and turn your 2025 dreams into reality. Vanta: Vanta simplifies security and compliance for businesses of all sizes. Automate compliance across 35+ frameworks like SOC 2 and ISO 27001, streamline security workflows, and complete questionnaires up to 5x faster. Trusted by over 9,000 companies, Vanta helps you manage risk and prove security in real time. Get $1,000 off at https://vanta.com/revolution RECOMMENDED PODCAST: Check out Modern Relationships where Erik Torenberg interviews tech power couples and leading thinkers to explore how ambitious people actually make partnerships work. This season's guests include: Delian Asparouhov & Nadia Asparouhova, Kristen Berman & Phil Levin, Rob Henderson, and Liv Boeree & Igor Kurganov. Apple: https://podcasts.apple.com/us/podcast/id1786227593 Spotify: https://open.spotify.com/show/5hJzs0gDg6lRT6r10mdpVg YouTube: https://www.youtube.com/@ModernRelationshipsPod CHAPTERS: (00:00:00) Teaser (00:00:46) About the Episode (00:05:11) Latent Space Defenses (00:08:41) Sleeper Agents (00:15:06) Three Case Studies (Part 1) (00:17:02) Sponsors: Oracle Cloud Infrastructure (OCI) | NetSuite (00:19:42) Three Case Studies (Part 2) (00:24:09) SQL Generation (00:26:17) Understanding Defenses (00:32:52) Out-of-Distribution Detection (Part 1) (00:35:37) Sponsors: Shopify | Vanta (00:38:52) Out-of-Distribution Detection (Part 2) (00:45:13) Loss Function Weighting (00:57:49) Who Moves Last? 
(01:11:41) High-Level Triggers (01:25:33) Open Source vs. Access (01:38:57) Internalizing Reasoning (01:53:07) Representing Concepts (02:06:38) Final Thoughts (02:09:33) Outro
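For context on what a "latent space defense" means in this discussion: one common form is a small probe trained on a model's internal activations to flag harmful behavior. The sketch below is a toy illustration of that idea only; the dimensions, the untrained probe, and the threshold are placeholders, not the paper's setup. The obfuscated-activation attacks discussed in the episode search for inputs that keep the harmful behavior while slipping under such a detector's threshold.

```python
import torch
import torch.nn as nn

# A toy "latent-space detector": a linear probe over hidden activations
# that outputs a harmfulness score. In practice the probe would be trained
# on labeled activations; here it is only a shape-correct placeholder.
probe = nn.Linear(4096, 1)  # hidden_size -> harmfulness logit

def looks_harmful(hidden_state: torch.Tensor, threshold: float = 0.0) -> bool:
    # hidden_state: an activation vector taken from some layer of the LLM.
    return probe(hidden_state).item() > threshold

example = torch.randn(4096)  # stand-in for a real activation vector
print(looks_harmful(example))
```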

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Due to overwhelming demand (>15x applications:slots), we are closing CFPs for AI Engineer Summit NYC today. Last call! Thanks, we'll be reaching out to all shortly!The world's top AI blogger and friend of every pod, Simon Willison, dropped a monster 2024 recap: Things we learned about LLMs in 2024. Brian of the excellent TechMeme Ride Home pinged us for a connection and a special crossover episode, our first in 2025. The target audience for this podcast is a tech-literate, but non-technical one. You can see Simon's notes for AI Engineers in his World's Fair Keynote.Timestamp* 00:00 Introduction and Guest Welcome* 01:06 State of AI in 2025* 01:43 Advancements in AI Models* 03:59 Cost Efficiency in AI* 06:16 Challenges and Competition in AI* 17:15 AI Agents and Their Limitations* 26:12 Multimodal AI and Future Prospects* 35:29 Exploring Video Avatar Companies* 36:24 AI Influencers and Their Future* 37:12 Simplifying Content Creation with AI* 38:30 The Importance of Credibility in AI* 41:36 The Future of LLM User Interfaces* 48:58 Local LLMs: A Growing Interest* 01:07:22 AI Wearables: The Next Big Thing* 01:10:16 Wrapping Up and Final ThoughtsTranscript[00:00:00] Introduction and Guest Welcome[00:00:00] Brian: Welcome to the first bonus episode of the Tech Meme Write Home for the year 2025. I'm your host as always, Brian McCullough. Listeners to the pod over the last year know that I have made a habit of quoting from Simon Willison when new stuff happens in AI from his blog. Simon has been, become a go to for many folks in terms of, you know, Analyzing things, criticizing things in the AI space.[00:00:33] Brian: I've wanted to talk to you for a long time, Simon. So thank you for coming on the show. No, it's a privilege to be here. And the person that made this connection happen is our friend Swyx, who has been on the show back, even going back to the, the Twitter Spaces days but also an AI guru in, in their own right Swyx, thanks for coming on the show also.[00:00:54] swyx (2): Thanks. I'm happy to be on and have been a regular listener, so just happy to [00:01:00] contribute as well.[00:01:00] Brian: And a good friend of the pod, as they say. Alright, let's go right into it.[00:01:06] State of AI in 2025[00:01:06] Brian: Simon, I'm going to do the most unfair, broad question first, so let's get it out of the way. The year 2025. Broadly, what is the state of AI as we begin this year?[00:01:20] Brian: Whatever you want to say, I don't want to lead the witness.[00:01:22] Simon: Wow. So many things, right? I mean, the big thing is everything's got really good and fast and cheap. Like, that was the trend throughout all of 2024. The good models got so much cheaper, they got so much faster, they got multimodal, right? The image stuff isn't even a surprise anymore.[00:01:39] Simon: They're growing video, all of that kind of stuff. So that's all really exciting.[00:01:43] Advancements in AI Models[00:01:43] Simon: At the same time, they didn't get massively better than GPT 4, which was a bit of a surprise. So that's sort of one of the open questions is, are we going to see huge, but I kind of feel like that's a bit of a distraction because GPT 4, but way cheaper, much larger context lengths, and it [00:02:00] can do multimodal.[00:02:01] Simon: is better, right? That's a better model, even if it's not.[00:02:05] Brian: What people were expecting or hoping, maybe not expecting is not the right word, but hoping that we would see another step change, right? Right. 
From like GPT 2 to 3 to 4, we were expecting or hoping that maybe we were going to see the next evolution in that sort of, yeah.[00:02:21] Brian: We[00:02:21] Simon: did see that, but not in the way we expected. We thought the model was just going to get smarter, and instead we got. Massive drops in, drops in price. We got all of these new capabilities. You can talk to the things now, right? They can do simulated audio input, all of that kind of stuff. And so it's kind of, it's interesting to me that the models improved in all of these ways we weren't necessarily expecting.[00:02:43] Simon: I didn't know it would be able to do an impersonation of Santa Claus, like a, you know, Talked to it through my phone and show it what I was seeing by the end of 2024. But yeah, we didn't get that GPT 5 step. And that's one of the big open questions is, is that actually just around the corner and we'll have a bunch of GPT 5 class models drop in the [00:03:00] next few months?[00:03:00] Simon: Or is there a limit?[00:03:03] Brian: If you were a betting man and wanted to put money on it, do you expect to see a phase change, step change in 2025?[00:03:11] Simon: I don't particularly for that, like, the models, but smarter. I think all of the trends we're seeing right now are going to keep on going, especially the inference time compute, right?[00:03:21] Simon: The trick that O1 and O3 are doing, which means that you can solve harder problems, but they cost more and it churns away for longer. I think that's going to happen because that's already proven to work. I don't know. I don't know. Maybe there will be a step change to a GPT 5 level, but honestly, I'd be completely happy if we got what we've got right now.[00:03:41] Simon: But cheaper and faster and more capabilities and longer contexts and so forth. That would be thrilling to me.[00:03:46] Brian: Digging into what you've just said one of the things that, by the way, I hope to link in the show notes to Simon's year end post about what, what things we learned about LLMs in 2024. Look for that in the show notes.[00:03:59] Cost Efficiency in AI[00:03:59] Brian: One of the things that you [00:04:00] did say that you alluded to even right there was that in the last year, you felt like the GPT 4 barrier was broken, like IE. Other models, even open source ones are now regularly matching sort of the state of the art.[00:04:13] Simon: Well, it's interesting, right? So the GPT 4 barrier was a year ago, the best available model was OpenAI's GPT 4 and nobody else had even come close to it.[00:04:22] Simon: And they'd been at the, in the lead for like nine months, right? That thing came out in what, February, March of, of 2023. And for the rest of 2023, nobody else came close. And so at the start of last year, like a year ago, the big question was, Why has nobody beaten them yet? Like, what do they know that the rest of the industry doesn't know?[00:04:40] Simon: And today, that I've counted 18 organizations other than GPT 4 who've put out a model which clearly beats that GPT 4 from a year ago thing. Like, maybe they're not better than GPT 4. 0, but that's, that, that, that barrier got completely smashed. And yeah, a few of those I've run on my laptop, which is wild to me.[00:04:59] Simon: Like, [00:05:00] it was very, very wild. It felt very clear to me a year ago that if you want GPT 4, you need a rack of 40, 000 GPUs just to run the thing. And that turned out not to be true. 
Like the, the, this is that big trend from last year of the models getting more efficient, cheaper to run, just as capable with smaller weights and so forth.[00:05:20] Simon: And I ran another GPT 4 class model on my laptop this morning, right? Microsoft's Phi-4 just came out. And that, if you look at the benchmarks, it's definitely, it's up there with GPT-4o. It's probably not as good when you actually get into the vibes of the thing, but it, it runs on my, it's a 14 gigabyte download and I can run it on a MacBook Pro.[00:05:38] Simon: Like who saw that coming? The most exciting, like the close of the year on Christmas day, just a few weeks ago, was when DeepSeek dropped their DeepSeek v3 model on Hugging Face without even a readme file. It was just like a giant binary blob that I can't run on my laptop. It's too big. But in all of the benchmarks, it's now by far the best available [00:06:00] open, open weights model.[00:06:01] Simon: Like it's, it's, it's beating the, the Meta Llamas and so forth. And that was trained for five and a half million dollars, which is a tenth of the price that people thought it costs to train these things. So everything's trending smaller and faster and more efficient.[00:06:15] Brian: Well, okay.[00:06:16] Challenges and Competition in AI[00:06:16] Brian: I, I kind of was going to get to that later, but let's, let's combine this with what I was going to ask you next, which is, you know, you're talking, you know, also in the piece about the LLM prices crashing, which I've even seen in projects that I'm working on, but explain that to a general audience, because we hear all the time that LLMs are eye-wateringly expensive to run, but what we're suggesting, and we'll come back to the cheap Chinese LLM, but first of all, for the end user, what you're suggesting is that we're starting to see the cost come down sort of in the traditional technology way of costs coming down over time,[00:06:49] Simon: yes, but very aggressively.[00:06:51] Simon: I mean, my favorite thing, the example here is if you look at GPT-3, so OpenAI's GPT-3, which was the best available model in [00:07:00] 2022 and through most of 2023. That, the models that we have today, the OpenAI models are a hundred times cheaper. So there was a 100x drop in price for OpenAI from their best available model, like two and a half years ago to today.[00:07:13] Simon: And[00:07:14] Brian: just to be clear, not to train the model, but for the use of tokens and things. Exactly,[00:07:20] Simon: for running prompts through them. And then when you look at the, the really, the top tier model providers right now, I think, are OpenAI, Anthropic, Google, and Meta. And there are a bunch of others that I could list there as well.[00:07:32] Simon: Mistral are very good. The, the DeepSeek and Qwen models have gotten great. There's a whole bunch of providers serving really good models. But even if you just look at the sort of big brand name providers, they all offer models now that are a fraction of the price of the, the, of the models we were using last year.[00:07:49] Simon: I think I've got some numbers that I threw into my blog entry here. Yeah. Like Gemini 1.5 Flash, that's Google's fast high quality model, is [00:08:00] how much is that? It's $0.075 per million tokens. Like these numbers are getting, So we just do cents per million now,[00:08:09] swyx (2): cents per million,[00:08:10] Simon: cents per million makes, makes a lot more sense.[00:08:12] Simon: Yeah they have one model 1.
5 flash 8B, the absolute cheapest of the Google models, is 27 times cheaper than GPT 3. 5 turbo was a year ago. That's it. And GPT 3. 5 turbo, that was the cheap model, right? Now we've got something 27 times cheaper, and the Google, this Google one can do image recognition, it can do million token context, all of those tricks.[00:08:36] Simon: But it's, it's, it's very, it's, it really is startling how inexpensive some of this stuff has got.[00:08:41] Brian: Now, are we assuming that this, that happening is directly the result of competition? Because again, you know, OpenAI, and probably they're doing this for their own almost political reasons, strategic reasons, keeps saying, we're losing money on everything, even the 200.[00:08:56] Brian: So they probably wouldn't, the prices wouldn't be [00:09:00] coming down if there wasn't intense competition in this space.[00:09:04] Simon: The competition is absolutely part of it, but I have it on good authority from sources I trust that Google Gemini is not operating at a loss. Like, the amount of electricity to run a prompt is less than they charge you.[00:09:16] Simon: And the same thing for Amazon Nova. Like, somebody found an Amazon executive and got them to say, Yeah, we're not losing money on this. I don't know about Anthropic and OpenAI, but clearly that demonstrates it is possible to run these things at these ludicrously low prices and still not be running at a loss if you discount the Army of PhDs and the, the training costs and all of that kind of stuff.[00:09:36] Brian: One, one more for me before I let Swyx jump in here. To, to come back to DeepSeek and this idea that you could train, you know, a cutting edge model for 6 million. I, I was saying on the show, like six months ago, that if we are getting to the point where each new model It would cost a billion, ten billion, a hundred billion to train that.[00:09:54] Brian: At some point it would almost, only nation states would be able to train the new models. Do you [00:10:00] expect what DeepSeek and maybe others are proving to sort of blow that up? Or is there like some sort of a parallel track here that maybe I'm not technically, I don't have the mouse to understand the difference.[00:10:11] Brian: Is the model, are the models going to go, you know, Up to a hundred billion dollars or can we get them down? Sort of like DeepSeek has proven[00:10:18] Simon: so I'm the wrong person to answer that because I don't work in the lab training these models. So I can give you my completely uninformed opinion, which is, I felt like the DeepSeek thing.[00:10:27] Simon: That was a bomb shell. That was an absolute bombshell when they came out and said, Hey, look, we've trained. One of the best available models and it cost us six, five and a half million dollars to do it. I feel, and they, the reason, one of the reasons it's so efficient is that we put all of these export controls in to stop Chinese companies from giant buying GPUs.[00:10:44] Simon: So they've, were forced to be, go as efficient as possible. And yet the fact that they've demonstrated that that's possible to do. I think it does completely tear apart this, this, this mental model we had before that yeah, the training runs just keep on getting more and more expensive and the number of [00:11:00] organizations that can afford to run these training runs keeps on shrinking.[00:11:03] Simon: That, that's been blown out of the water. So yeah, that's, again, this was our Christmas gift. This was the thing they dropped on Christmas day. 
Yeah, it makes me really optimistic that we can, there are, it feels like there was so much low hanging fruit in terms of the efficiency of both inference and training, and we spent a whole bunch of last year exploring that and getting results from it.[00:11:22] Simon: I think there's probably a lot left. I think there's probably, well, I would not be surprised to see even better models trained spending even less money over the next six months.[00:11:31] swyx (2): Yeah. So I, I think there's an unspoken angle here on what exactly the Chinese labs are trying to do, because DeepSeek made a lot of noise [00:11:41] swyx (2): around the fact that they train their model for six million dollars and nobody quite believes them. Like it's very, very rare for a lab to trumpet the fact that they're doing it for so cheap. They're not trying to get anyone to buy them. So why [00:12:00] are they doing this? They make it very, very obvious.[00:12:05] swyx (2): DeepSeek is about 150 employees. It's an order of magnitude smaller than at least Anthropic and maybe, maybe more so for OpenAI. And so what's, what's the end game here? Are they, are they just trying to show that the Chinese are better than us?[00:12:21] Simon: So DeepSeek, it's the arm of a hedge fund, it's a, it's a quant fund, right?[00:12:25] Simon: It's an algorithmic quant trading thing. So I, I, I would love to get more insight into how that organization works. My assumption from what I've seen is it looks like they're basically just flexing. They're like, hey, look at how utterly brilliant we are with this amazing thing that we've done. And it's, it's working, right?[00:12:43] Simon: They but, and so is that it? Are they, is this just their kind of like, this is, this is why our company is so amazing, look at this thing that we've done, or? I don't know. I'd, I'd love to get some insight from, from within that industry as to, as to how that's all playing out.[00:12:57] swyx (2): The, the prevailing theory among the Local Llama [00:13:00] crew and the Twitter crew that I indexed for my newsletter is that there is some amount of copying going on.[00:13:06] swyx (2): It's like Sam Altman you know, tweet, tweeting about how they're being copied. And then also there's this, there, there are other sort of OpenAI employees that have said stuff that is similar, that DeepSeek's rate of progress is how U.S. intelligence estimates the number of foreign spies embedded in top labs.[00:13:22] swyx (2): Because a lot of these ideas do spread around, but they surprisingly have a very high density of them in the DeepSeek v3 technical report. So it's, it's interesting. We don't know how much, how many, how much tokens. I think that, you know, people have run analysis on how often DeepSeek thinks it is Claude or thinks it is OpenAI GPT-4.[00:13:40] swyx (2): And we don't, we don't know. We don't know. I think for me, like, yeah, we'll, we'll, we basically will never know as, as external commentators. I think what's interesting is how, where does this go? Is there a logical floor or bottom? By my estimations, for the same amount of ELO, from the start of last year to the end of last year, cost went down by a thousand X for the [00:14:00] GPT, for, for GPT-4 intelligence.[00:14:02] swyx (2): Would, do they go down a thousand X this year?[00:14:04] Simon: That's a fascinating question.
Yeah.[00:14:06] swyx (2): Is there a Moore's law going on, or did we just get a one off benefit last year for some weird reason?[00:14:14] Simon: My uninformed hunch is low hanging fruit. I feel like up until a year ago, people haven't been focusing on efficiency at all. You know, it was all about, what can we get these weird shaped things to do?[00:14:24] Simon: And now once we've sort of hit that, okay, we know that we can get them to do what GPT 4 can do, when thousands of researchers around the world all focus on, okay, how do we make this more efficient? What are the most important, like, how do we strip out all of the weights that have stuff in that doesn't really matter?[00:14:39] Simon: All of that kind of thing. So yeah, maybe that was it. Maybe 2024 was a freak year of all of the low hanging fruit coming out at once. And we'll actually see a reduction in the, in that rate of improvement in terms of efficiency. I wonder, I mean, I think we'll know for sure in about three months time if that trend's going to continue or not.[00:14:58] swyx (2): I agree. You know, I [00:15:00] think the other thing that you mentioned that DeepSeek v3 was the gift that was given from DeepSeek over Christmas, but I feel like the other thing that might be underrated was DeepSeek R1,[00:15:11] Speaker 4: which is[00:15:13] swyx (2): a reasoning model you can run on your laptop. And I think that's something that a lot of people are looking ahead to this year.[00:15:18] swyx (2): Oh, did they[00:15:18] Simon: release the weights for that one?[00:15:20] swyx (2): Yeah.[00:15:21] Simon: Oh my goodness, I missed that. I've been playing with the Qwen. So the other great, the other big Chinese AI app is Alibaba's Qwen. Actually, yeah, I, sorry, R1 is available as an API. Yeah. Exactly. Qwen, that's really cool. So Alibaba's Qwen have released two reasoning models that I've run on my laptop.[00:15:38] Simon: Now there was, the first one was QwQ. And then the second one was QVQ, because the second one's a vision model. So you can like give it vision puzzles and a prompt that these things, they are so much fun to run. Because they think out loud. It's like the OpenAI o1 sort of hides its thinking process. The Qwen ones don't.[00:15:59] Simon: They just, they [00:16:00] just churn away. And so you'll give it a problem and it will output literally dozens of paragraphs of text about how it's thinking. My favorite thing that happened with QwQ is I asked it to draw me a pelican on a bicycle in SVG. That's like my standard stupid prompt. And for some reason it thought in Chinese.[00:16:18] Simon: It spat out a whole bunch of like Chinese text onto my terminal on my laptop, and then at the end it gave me quite a good sort of artistic pelican on a bicycle. And I ran it all through Google Translate, and yeah, it was like, it was contemplating the nature of SVG files as a starting point. And the fact that my laptop can think in Chinese now is so delightful.[00:16:40] Simon: It's so much fun watching you do that.
But yeah, for listeners who don't know Simon's blog he always, whenever a new model comes out, you, I don't know how you do it, but [00:17:00] you're always the first to run Pelican Bench on these models.[00:17:02] swyx (2): I just did it for 5.[00:17:05] Simon: Yeah.[00:17:07] swyx (2): So I really appreciate that. You should check it out. These are not theoretical. Simon's blog actually shows them.[00:17:12] Brian: Let me put on the investor hat for a second.[00:17:15] AI Agents and Their Limitations[00:17:15] Brian: Because from the investor side of things, a lot of the, the VCs that I know are really hot on agents, and this is the year of agents, but last year was supposed to be the year of agents as well. Lots of money flowing towards, And Gentic startups.[00:17:32] Brian: But in in your piece that again, we're hopefully going to have linked in the show notes, you sort of suggest there's a fundamental flaw in AI agents as they exist right now. Let me let me quote you. And then I'd love to dive into this. You said, I remain skeptical as to their ability based once again, on the Challenge of gullibility.[00:17:49] Brian: LLMs believe anything you tell them, any systems that attempt to make meaningful decisions on your behalf, will run into the same roadblock. How good is a travel agent, or a digital assistant, or even a research tool, if it [00:18:00] can't distinguish truth from fiction? So, essentially, what you're suggesting is that the state of the art now that allows agents is still, it's still that sort of 90 percent problem, the edge problem, getting to the Or, or, or is there a deeper flaw?[00:18:14] Brian: What are you, what are you saying there?[00:18:16] Simon: So this is the fundamental challenge here and honestly my frustration with agents is mainly around definitions Like any if you ask anyone who says they're working on agents to define agents You will get a subtly different definition from each person But everyone always assumes that their definition is the one true one that everyone else understands So I feel like a lot of these agent conversations, people talking past each other because one person's talking about the, the sort of travel agent idea of something that books things on your behalf.[00:18:41] Simon: Somebody else is talking about LLMs with tools running in a loop with a cron job somewhere and all of these different things. You, you ask academics and they'll laugh at you because they've been debating what agents mean for over 30 years at this point. It's like this, this long running, almost sort of an in joke in that community.[00:18:57] Simon: But if we assume that for this purpose of this conversation, an [00:19:00] agent is something that, Which you can give a job and it goes off and it does that thing for you like, like booking travel or things like that. The fundamental challenge is, it's the reliability thing, which comes from this gullibility problem.[00:19:12] Simon: And a lot of my, my interest in this originally came from when I was thinking about prompt injections as a source of this form of attack against LLM systems where you deliberately lay traps out there for this LLM to stumble across,[00:19:24] Brian: and which I should say you have been banging this drum that no one's gotten any far, at least on solving this, that I'm aware of, right.[00:19:31] Brian: Like that's still an open problem. The two years.[00:19:33] Simon: Yeah. Right. 
We've been talking about this problem, and like, a great illustration of this was Claude — so Anthropic released Claude Computer Use a few months ago. Fantastic demo. You could fire up a Docker container and you could literally tell it to do something and watch it open a web browser and navigate to a webpage and click around and so forth.[00:19:51] Simon: Really, really, really interesting and fun to play with. And then, um, one of the first demos somebody tried was, what if you give it a web page that says download and run this [00:20:00] executable — and it did, and the executable was malware that added it to a botnet. So the very first, most obvious dumb trick that you could play on this thing just worked, right?[00:20:10] Simon: So that's obviously a really big problem. If I'm going to send something out to book travel on my behalf, I mean, it's hard enough for me to figure out which airlines are trying to scam me and which ones aren't. Do I really trust a language model that believes the literal truth of anything that's presented to it to go out and do those things?[00:20:29] swyx (2): Yeah, I definitely think it's interesting to see Anthropic doing this, because they used to be the safety arm of OpenAI that split out and said, you know, we're worried about letting this thing out in the wild, and here they are enabling computer use for agents. It feels like things have merged.[00:20:49] swyx (2): You know, I'm also fairly skeptical about, you know, this always being the year of Linux on the desktop. And this is the equivalent — this being the year of agents that people [00:21:00] are not predicting so much as wishfully thinking and hoping and praying for their companies and agents to work.[00:21:05] swyx (2): But I feel like things are coming along a little bit. To me, it's kind of like self driving. I remember in 2014 saying that self driving was just around the corner. And I mean, it kind of is, you know, like in the Bay Area. You[00:21:17] Simon: get in a Waymo and you're like, oh, this works. Yeah, but it's a slow[00:21:21] swyx (2): cook.[00:21:21] swyx (2): It's a slow cook over the next 10 years. We're going to hammer out these things, and the cynical people can just point to all the flaws, but like, there are measurable, concrete progress steps that are being made by these builders.[00:21:33] Simon: There is one form of agent that I believe in. I mostly believe in the research assistant form of agents.[00:21:39] Simon: The thing where you've got a difficult problem — and I'm on the beta for Google Gemini 1.5 Pro with Deep Research, I think it's called. These names, right. But I've been using that. It's good, right? You can give it a difficult problem and it tells you, okay, I'm going to look at 56 different websites, [00:22:00] and it goes away and it dumps everything into its context and it comes up with a report for you.[00:22:04] Simon: And it won't work against adversarial websites, right? If there are websites with deliberate lies in them, it might well get caught out. Most things don't have that as a problem. And so I've had some answers from that which were genuinely really valuable to me.
And that feels to me like — I can see how, given existing LLM tech, especially with Google Gemini with its million token context, and Google with their crawl of the entire web — they've got search, they've got a cache of every page and so forth —[00:22:35] Simon: that makes sense to me. And what they've got right now, it's not as good as it can be, obviously, but it's a real useful thing, which they're going to start rolling out. So, you know, Perplexity have been building the same thing for a couple of years. That, that I believe in.[00:22:50] Simon: You know, if you tell me that you're going to have an agent that's a research assistant agent, great. The coding agents — I mean, ChatGPT Code Interpreter, nearly two years [00:23:00] ago, that thing started writing Python code, executing the code, getting errors, rewriting it to fix the errors. That pattern obviously works.[00:23:07] Simon: That works really, really well. So, yeah, coding agents that do that sort of error message loop thing, those are proven to work. And they're going to keep on getting better, and that's going to be great. The research assistant agents are just beginning to get there. The things I'm critical of are the ones where you trust this thing to go out and act autonomously on your behalf, and make decisions on your behalf, especially involving spending money, like that.[00:23:31] Simon: I don't see that working for a very long time. That feels to me like an AGI level problem.[00:23:37] swyx (2): It's funny, because I think Stripe actually released an agent toolkit, which is one of the things I featured, that is trying to enable these agents each to have a wallet that they can go and spend from — basically, it's a virtual card.[00:23:49] swyx (2): It's not that difficult with modern infrastructure.[00:23:51] Simon: If you can stick a $50 cap on it, then at least you can't lose more than $50.[00:23:56] Brian: You know, I don't know if either of you know Rafat Ali — [00:24:00] he runs Skift, which is a travel news vertical. And he constantly laughs at the fact that every agent thing is, we're gonna get rid of booking a plane flight for you, you know?[00:24:11] Brian: And I would point out that, like, historically, when the web started, the first thing everyone talked about is, you can go online and book a trip, right? So it's funny — for each generation of technological advance, the thing they always want to kill is the travel agent. And now they want to kill the webpage travel agent.[00:24:29] Simon: Like, I use Google Flights search. It's great, right? If you gave me an agent to do that for me, it would save me, I mean, maybe 15 seconds of typing in my things, but I still want to see what my options are and go, yeah, I'm not flying on that airline, no matter how cheap they are.[00:24:44] swyx (2): Yeah. For listeners, go ahead.[00:24:47] swyx (2): For listeners, I think, you know, I think both of you are pretty positive on NotebookLM. And you know, we actually interviewed the NotebookLM creators, and there are actually two internal agents going on internally. The reason it takes so long is because they're running an agent loop [00:25:00] inside that is fairly autonomous, which is kind of interesting.[00:25:01] swyx (2): For one,[00:25:02] Simon: for a definition of agent loop — you picked that one particularly well. For one definition.
And you're talking about the podcast side of this, right?[00:25:07] swyx (2): Yeah, the podcast side of things. There's going to be a new version coming out that we'll be featuring at our conference.[00:25:14] Simon: That one's fascinating to me. Like, NotebookLM, I think it's two products, right? On the one hand, it's actually a very good RAG product, right? You dump a bunch of things in, you can run searches — it does a good job of that. And then they added the podcast thing. It's a total gimmick, right?[00:25:30] Simon: But that gimmick got them attention, because they had a great product that nobody paid any attention to at all. And then you add the unfeasibly good voice synthesis of the podcast. Like, it's just, it's the lesson.[00:25:43] Brian: It's the lesson of Midjourney and stuff like that. If you can create something that people can post on socials, you don't have to lift a finger again to do any marketing for what you're doing.[00:25:53] Brian: Let me dig into NotebookLM just for a second as a podcaster. As a [00:26:00] gimmick, it makes sense, and then obviously, you know, you dig into it, it sort of has problems around the edges. It's like, it does the thing that all sort of LLMs kind of do, where it's like, oh, we want to wrap up with a conclusion.[00:26:12] Multimodal AI and Future Prospects[00:26:12] Brian: I always call that the eighth grade book report paper problem, where it has to have an intro and then, you know. But that's sort of a thing where — because I think you spoke about this again in your piece at the year end, about how things are going multimodal, and how things went in directions you didn't expect, like, you know, vision and especially audio, I think. So that's another thing where, at least over the last year, there's been progress made that maybe you didn't think was coming as quick as it came.[00:26:43] Simon: I don't know. I mean, a year ago, we had one really good vision model. We had GPT-4 Vision, which was very impressive. And Google Gemini had just dropped Gemini 1.0, which had vision, but nobody had really played with it yet. Like, Google hadn't. People weren't taking Gemini [00:27:00] seriously at that point. I feel like it was 1.5 Pro [00:27:02] when it became apparent that actually they got over their hump and they were building really good models. And to be honest, the video models are mostly still using the same trick — the thing where you divide the video up into one image per second and you dump that all into the context.[00:27:16] Simon: So maybe it shouldn't have been so surprising to us that long context models plus vision meant that video was starting to be solved. Of course, it isn't quite — what you really want with videos, you want to be able to do the audio and the images at the same time. And I think the models are beginning to do that now.[00:27:33] Simon: Like, originally, Gemini 1.5 Pro ignored the audio. It just did the, like, one frame per second video trick. As far as I can tell, the most recent ones are actually doing pure multimodal. But the things that opens up are just extraordinary. Like, the ChatGPT iPhone app feature that they shipped as one of their 12 Days of OpenAI — I really can be having a conversation and just turn on my video camera and go, hey, what kind of tree is [00:28:00] this?[00:28:00] Simon: And so forth. And it works.
And for all I know, that's just snapping a picture once a second and feeding it into the model. The things that you can do with that as an end user are extraordinary. Like, I don't think most people have cottoned on to the fact that you can now stream video directly into a model, because it's only a few weeks old.[00:28:22] Simon: Wow. That's a big boost in terms of what kinds of things you can do with this stuff. Yeah.[00:28:30] swyx (2): For people who are not that close, I think Gemini Flash's free tier allows you to do something like capture a photo, one photo every second or a minute, and leave it on 24/7, and you can prompt it to do whatever.[00:28:45] swyx (2): And so you can effectively have your own camera app or monitoring app that you just prompt, and it detects where it changes. It detects for, you know, alerts or anything like that, or describes your day. You know, and the fact that this is free, I think, [00:29:00] also leads into the previous point about the prices having come down a lot.[00:29:05] Simon: And even if you're paying for this stuff — a thing that I put in my blog entry is, I ran a calculation on what it would cost to process 68,000 photographs in my photo collection, and for each one just generate a caption. Using Gemini 1.5 Flash 8B, it would cost me $1.68 to process 68,000 images, which is, I mean, that doesn't make sense.[00:29:28] Simon: None of that makes sense. Like, it's one four-hundredth of a cent per image to generate captions now. So you can see why feeding in a day's worth of video just isn't even very expensive to process.[00:29:40] swyx (2): Yeah, I'll tell you what is expensive. It's the other direction. So we're here, we're talking about consuming video.[00:29:46] swyx (2): And this year, we also had a lot of progress — probably one of the most anticipated launches of the year was Sora. We actually got Sora. And it was less exciting.[00:29:55] Simon: We did, and then Veo 2, Google's Sora, came out like three [00:30:00] days later and upstaged it. Like, Sora was exciting until Veo 2 landed, which was just better.[00:30:05] swyx (2): In general, I feel the media, or the social media, has been very unfair to Sora. Because what was released to the world, generally available, was Sora Lite. It's the distilled version of Sora, right? So you're — I did not[00:30:16] Simon: realize that.[00:30:18] swyx (2): You're absolutely comparing the most cherry-picked version of Veo 2, the one that they published on the marketing page, to the most embarrassing version of Sora.[00:30:25] swyx (2): So of course it's gonna look bad. Well, I got[00:30:27] Simon: access to Veo 2 — I'm in the Veo 2 beta and I've been poking around with it and getting it to generate pelicans on bicycles and stuff. I would absolutely[00:30:34] swyx (2): believe that[00:30:35] Simon: Veo 2 is actually better. So is full-fat Sora coming soon? Do you know, when do we get to play with that one?[00:30:42] Simon: No one's[00:30:43] swyx (2): mentioned anything. I think basically the strategy is let people play around with Sora Lite and get info there. But keep developing Sora with the Hollywood studios. That's what they actually care about. Gotcha. Like, the rest of us don't really know what to do with the video anyway.
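As a quick sanity check on the captioning numbers Simon quotes above, and a hedged sketch of what that job might look like: the arithmetic uses only the figures from the conversation, and the API call uses the google-generativeai Python package with the gemini-1.5-flash-8b model name as it existed in late 2024 — treat the package usage and pricing as assumptions to verify against current docs.

```python
# Back-of-the-envelope check: $1.68 to caption 68,000 photos.
total_usd, n_images = 1.68, 68_000
per_image_cents = total_usd / n_images * 100
print(f"{per_image_cents:.5f} cents per image")   # ~0.00247 cents, roughly 1/400th of a cent

# Hedged sketch of the captioning call itself (pip install google-generativeai pillow).
# Model name and availability are as of late 2024; check the current docs before relying on this.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash-8b")

def caption(path: str) -> str:
    response = model.generate_content([Image.open(path), "Write a one-sentence caption for this photo."])
    return response.text
```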
Right.[00:30:59] Simon: I mean, [00:31:00] that's my thing — I realized that for generative images and video: images we've had for a few years, and I don't feel like they've broken out into the talented artist community yet. Like, lots of people are having fun with them and producing stuff that's kind of cool to look at, but, you know, that movie Everything Everywhere All at Once, right?[00:31:20] Simon: Won a ton of Oscars, utterly amazing film. The VFX team for that were five people, some of whom were watching YouTube videos to figure out what to do. My big question for Sora and Midjourney and stuff is, what happens when a creative team like that starts using these tools? I want the creative geniuses behind Everything Everywhere All at Once.[00:31:40] Simon: What are they going to be able to do with this stuff in like a few years' time? Because that's really exciting to me. That's where you take artists who are at the very peak of their game, give them these new capabilities, and see what they can do with them.[00:31:52] swyx (2): I should — I know a little bit here. So I should mention that that team actually used RunwayML.[00:31:57] swyx (2): So there was, there was,[00:31:57] Simon: yeah.[00:31:59] swyx (2): I don't know how [00:32:00] much. So, you know, it's possible to overstate this, but there are people integrating generated video within their workflow, even pre-Sora.[00:32:09] Brian: Right, because it's not the thing where it's like, okay, tomorrow we'll be able to do a full two-hour movie that you prompt with three sentences.[00:32:15] Brian: It is like the very first part of, you know, video effects in film. It's like, if you can get that three-second clip, if you can get that 20-second thing that they did in The Matrix that blew everyone's minds and took a million dollars or whatever to do — it's the little bits and pieces that they can fill in now that it's probably already there.[00:32:34] swyx (2): Yeah, it's like, I think actually having a layered view of what assets people need and letting AI fill in the low-value assets. Right, like the background video, the background music and, you know, sometimes the sound effects. That maybe makes it more palatable, and maybe also changes the way that you evaluate the stuff that's coming out.[00:32:57] swyx (2): Because people tend to, in social media, try to [00:33:00] emphasize foreground stuff, main character stuff. So you really care about consistency, and you really are bothered when, for example, Sora botches image generation of a gymnast doing flips, which is horrible. It's horrible. But for background crowds, like, who cares?[00:33:18] Brian: And by the way, again, I was a film major way, way back in the day — like, that's how it started. Like things like Braveheart, where they filmed 10 people on a field, and then the computer could turn it into 1,000 people on a field. Like, that's always been the way: it's around the margins and in the background that it first comes in.[00:33:36] Brian: The[00:33:36] Simon: Lord of the Rings movies were over 20 years ago. Although they have those giant battle sequences, which were very early — I mean, you could almost call it a generative AI approach, right? They were using very sophisticated, like, algorithms to model out those different battles and all of that kind of stuff.[00:33:52] Simon: Yeah, I know very little.
I know basically nothing about film production, so I try not to commentate on it. But I am fascinated to [00:34:00] see what happens when these tools start being used by the people at the top of their game.[00:34:05] swyx (2): I would say there's a cultural war being fought here more than a technology war.[00:34:11] swyx (2): Most of the Hollywood people are against any form of AI anyway, so they're busy fighting that battle instead of thinking about how to adopt it, and it's very fringe. I participated in one generative AI video creative hackathon here in San Francisco, where the AI-positive artists actually met with technologists like myself, and then we collaborated together to build short films, and that was really nice, and I think, you know, I'll be hosting some of those in my events going forward.[00:34:38] swyx (2): One thing I want to leave people with: this is a recap of last year, but sometimes it's useful to walk away as well with what we can expect in the future. I don't know if you've got anything. I would also call out that the Chinese models here have made a lot of progress — Hailuo and Kling and God knows who else in the video arena [00:35:00] are also making a lot of progress. It's surprising — I think maybe China is surprisingly ahead with regards to open weights at least, but also just specific forms of video generation.[00:35:12] Simon: Wouldn't it be interesting if a film industry sprung up in a country that we don't normally think of as having a really strong film industry, that was using these tools? Like, that would be a fascinating sort of angle on this. Mm hmm.[00:35:25] swyx (2): Agreed. I — oh, sorry. Go ahead.[00:35:29] Exploring Video Avatar Companies[00:35:29] swyx (2): Just to put it on people's radar as well: HeyGen. There's a category of video avatar companies that don't specialize in general video.[00:35:41] swyx (2): They only do talking heads, let's just say. And HeyGen does this very well.[00:35:45] Brian: Swyx, you know that that's what I've been using, right? So, if you see some of my recent YouTube videos and things like that — because the beauty part of the HeyGen thing is, I don't want to use the robot voice, so [00:36:00] I record the MP3 file on my computer, and then I put that into HeyGen with the avatar that I've trained it on, and all it does is the lip sync.[00:36:09] Brian: So it looks — it's not 100 percent past the uncanny valley, but it's good enough that if you weren't looking for it, it's just me sitting there doing one of my clips from the show. And, yeah, so, by the way, HeyGen — shout out to them.[00:36:24] AI Influencers and Their Future[00:36:24] swyx (2): So I would, you know, in terms of the look ahead — reviewing 2024, looking at trends for 2025 — I would, they basically called this out.[00:36:33] swyx (2): Meta tried to introduce AI influencers and failed horribly, because they were just bad at it.
But at some point that there will be more and more basically AI influencers Not in a way that Simon is but in a way that they are not human.[00:36:50] Simon: Like the few of those that have done well, I always feel like they're doing well because it's a gimmick, right?[00:36:54] Simon: It's a it's it's novel and fun to like Like that, the AI Seinfeld thing [00:37:00] from last year, the Twitch stream, you know, like those, if you're the only one or one of just a few doing that, you'll get, you'll attract an audience because it's an interesting new thing. But I just, I don't know if that's going to be sustainable longer term or not.[00:37:11] Simon: Like,[00:37:12] Simplifying Content Creation with AI[00:37:12] Brian: I'm going to tell you, Because I've had discussions, I can't name the companies or whatever, but, so think about the workflow for this, like, now we all know that on TikTok and Instagram, like, holding up a phone to your face, and doing like, in my car video, or walking, a walk and talk, you know, that's, that's very common, but also, if you want to do a professional sort of talking head video, you still have to sit in front of a camera, you still have to do the lighting, you still have to do the video editing, versus, if you can just record, what I'm saying right now, the last 30 seconds, If you clip that out as an mp3 and you have a good enough avatar, then you can put that avatar in front of Times Square, on a beach, or whatever.[00:37:50] Brian: So, like, again for creators, the reason I think Simon, we're on the verge of something, it, it just, it's not going to, I think it's not, oh, we're going to have [00:38:00] AI avatars take over, it'll be one of those things where it takes another piece of the workflow out and simplifies it. I'm all[00:38:07] Simon: for that. I, I always love this stuff.[00:38:08] Simon: I like tools. Tools that help human beings do more. Do more ambitious things. I'm always in favor of, like, that, that, that's what excites me about this entire field.[00:38:17] swyx (2): Yeah. We're, we're looking into basically creating one for my podcast. We have this guy Charlie, he's Australian. He's, he's not real, but he pre, he opens every show and we are gonna have him present all the shorts.[00:38:29] Simon: Yeah, go ahead.[00:38:30] The Importance of Credibility in AI[00:38:30] Simon: The thing that I keep coming back to is this idea of credibility like in a world that is full of like AI generated everything and so forth It becomes even more important that people find the sources of information that they trust and find people and find Sources that are credible and I feel like that's the one thing that LLMs and AI can never have is credibility, right?[00:38:49] Simon: ChatGPT can never stake its reputation on telling you something useful and interesting because That means nothing, right? It's a matrix multiplication. It depends on who prompted it and so forth. So [00:39:00] I'm always, and this is when I'm blogging as well, I'm always looking for, okay, who are the reliable people who will tell me useful, interesting information who aren't just going to tell me whatever somebody's paying them to tell, tell them, who aren't going to, like, type a one sentence prompt into an LLM and spit out an essay and stick it online.[00:39:16] Simon: And that, that to me, Like, earning that credibility is really important. That's why a lot of my ethics around the way that I publish are based on the idea that I want people to trust me. 
I want to do things that gain credibility in people's eyes, so they will come to me for information as a trustworthy source.[00:39:32] Simon: And it's the same for the sources that I'm consulting as well. So that's something — I've been thinking a lot about that sort of credibility focus on this thing for a while now.[00:39:40] swyx (2): Yeah, you can layer or structure credibility, or decompose it. Like, so one thing I would put in front of you — I'm not saying that you should agree with this or accept this at all — is that you can use AI to generate different variations, and then you, as the final sort of last-mile person, pick the last output and [00:40:00] you put your stamp of credibility behind that. Like, everything's human-reviewed instead of human-origin.[00:40:04] Simon: Yeah, if you publish something, you need to be able to stand behind it.[00:40:08] Simon: You need to say, I will put my name to this. I will attach my credibility to this thing. And if you're willing to do that, then that's great.[00:40:16] swyx (2): For creators, this is huge, because there's a fundamental asymmetry between starting with a blank slate versus choosing from five different variations.[00:40:23] Brian: Right.[00:40:24] Brian: And also the key thing that you just said is, like, if everything that I do — if all of the words were generated by an LLM, if the voice is generated by an LLM, if the video is also generated by the LLM — then I haven't done anything, right? But if, for one or two of those, you take a shortcut, but it's still, I'm willing to sign off on it.[00:40:47] Brian: Like, I feel like that's where people are coming around to, like, this is maybe acceptable, sort of.[00:40:53] Simon: This is where I've been pushing the definition — I love the term slop. I've been pushing the definition of slop as AI-generated [00:41:00] content that is both unrequested and unreviewed, and the unreviewed thing is really important. Like, that's the thing that elevates something from slop to not-slop: if a human being has reviewed it and said, you know what, this is actually worth other people's time.[00:41:12] Simon: And again, I'm willing to attach my credibility to it and say, hey, this is worthwhile.[00:41:16] Brian: It's the curatorial and editorial part of it — no matter what the tools are to do shortcuts, to, as Swyx is saying, choose between different edits or different cuts, in the end there's a curatorial mind or editorial mind behind it.[00:41:32] Brian: Let me — I want to wedge this in before we start to close.[00:41:36] The Future of LLM User Interfaces[00:41:36] Brian: One of the things, coming back to your year-end piece, that has been something I've been banging the drum about, is when you're talking about LLMs getting harder to use. You said: most users are thrown in at the deep end.[00:41:48] Brian: The default LLM chat UI is like taking brand new computer users, dropping them into a Linux terminal and expecting them to figure it all out. I mean, it's literally going back to the command line. The command line was defeated [00:42:00] by the GUI interface. And this is what I've been banging the drum about: this cannot be the user interface. What we have now cannot be the end result. Do you see any hints or seeds of a GUI moment for LLM interfaces?
The usability of these things is turning into a bit of a crisis. And we are at least seeing some really interesting innovation in little directions.[00:42:28] Simon: Just like OpenAI's ChatGPT Canvas thing that they just launched. That is at least going somewhere a little more interesting than just chats and responses. You know, they're exploring that space where you're collaborating with an LLM — you're both working on the same document. That makes a lot of sense to me.[00:42:44] Simon: Like, that feels really smart. One of the best things is still — who was it who did the UI where you had a drawing UI, where you draw an interface and click a button, and tldraw would then make it a real thing? That was spectacular, [00:43:00] absolutely spectacular — like, an alternative vision of how you'd interact with these models.[00:43:05] Simon: Because, yeah, so I feel like there is so much scope for innovation there, and it is beginning to happen. Like, I feel like most people do understand that we need to do better in terms of interfaces that both help explain what's going on and give people better tools for working with models.[00:43:23] Simon: I was going to say — I want to[00:43:25] Brian: dig a little deeper into this, because think of the conceptual idea behind the GUI, which is instead of typing into a command line, open word.exe, you click an icon, right? So that's abstracting away, again, the programming stuff, so that, like, you know, a child can tap on an iPad and make a program open, right?[00:43:47] Brian: The problem, it seems to me, right now with how we're interacting with LLMs, is it's sort of like, you know, a dumb robot, where it's like, you poke it and it goes over here, but no, I want it to go over here, so you poke it this way, and you can't get it exactly [00:44:00] right. Like, what can we abstract away from the current state of things that makes it more fine-tuned and easier to get more precise?[00:44:12] Brian: You see what I'm saying?[00:44:13] Simon: Yes. And this is the other trend that I've been following from the last year, which I think is super interesting. It's the prompt-driven UI development thing. Basically, this is the pattern where Claude Artifacts was the first thing to do this really well. You type in a prompt and it goes, oh, I should answer that by writing a custom HTML and JavaScript application for you that does a certain thing.[00:44:35] Simon: And since then it turns out this is easy, right? Every decent LLM can produce HTML and JavaScript that does something useful. So we've actually got this alternative way of interacting, where they can respond to your prompt with an interactive custom interface that you can work with.[00:44:54] Simon: People haven't quite wired those back up again. Like, ideally, I'd want the LLM to ask me a [00:45:00] question where it builds me a custom little UI for that question, and then it gets to see how I interacted with that. I don't know why, but that's just such a small step from where we are right now. But that feels like such an obvious next step.[00:45:12] Simon: Like, why should you just be communicating with text, when an LLM can build interfaces on the fly that let you select a point on a map, or move sliders up and down? It's gonna create knobs and dials. I keep saying knobs and dials, right.
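To make the closed loop Simon is asking for a bit more concrete, here is a minimal, hypothetical sketch: the model proposes a small "knobs and dials" spec for its question, the user adjusts the controls, and the chosen values are fed back into the next prompt. The call_llm stub and the JSON spec format are illustrative assumptions, not Claude Artifacts' actual behavior or any vendor's API.

```python
# Sketch of a prompt-driven UI loop with the feedback wired back up.
# call_llm is a stub; a real system would render HTML controls instead of using input().
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a model call that answers a question by proposing controls to render.
    return json.dumps({
        "question": "How should I filter the flight results?",
        "controls": [
            {"name": "max_price_usd", "type": "slider", "min": 50, "max": 2000},
            {"name": "max_stops", "type": "slider", "min": 0, "max": 3},
        ],
    })

def run_round(user_prompt: str) -> str:
    spec = json.loads(call_llm(user_prompt))
    answers = {}
    print(spec["question"])
    for control in spec["controls"]:            # "render" the knobs and dials
        answers[control["name"]] = input(f'{control["name"]} ({control["min"]}-{control["max"]}): ')
    # The step Simon wants closed: the interaction goes back into the conversation.
    followup = f"{user_prompt}\nUser set controls to: {json.dumps(answers)}\nContinue."
    return call_llm(followup)

if __name__ == "__main__":
    print(run_round("Help me book a flight to Lisbon"))
```

Whether the controls are sliders in generated HTML or a text fallback like this, the design point is the same: the model's generated interface and the user's interaction with it both become part of the ongoing conversation.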
We can do that. And the LLMs can build it — Claude Artifacts will build you a knobs-and-dials interface.[00:45:34] Simon: But at the moment they haven't closed the loop. When you twiddle those knobs, Claude doesn't see what you were doing. They're going to close that loop. I'm shocked that they haven't done it yet. So yeah, I think there's so much scope for innovation, and so much scope for doing interesting stuff with that model, where the LLM — anything you can represent in SVG, which is almost everything — can now be part of that ongoing conversation.[00:45:59] swyx (2): Yeah, [00:46:00] I would say the best executed version of this I've seen so far is Bolt, where you can literally type in, make a Spotify clone, make an Airbnb clone, and it actually just does that for you zero-shot, with a nice design.[00:46:14] Simon: There's a benchmark for that now. The LMArena people now have a benchmark that is zero-shot app generation, because all of the models can do it.[00:46:22] Simon: Like, I've started figuring it out. I'm building my own version of this for my own project, because I think within six months, I think it'll just be an expected feature. Like, if you have a web application, why don't you have a thing where, oh look, you can add a custom — like, so for my Datasette data exploration project, I want you to be able to do things like conjure up a dashboard just via a prompt.[00:46:43] Simon: You say, oh, I need a pie chart and a bar chart, and put them next to each other, and then have a form where submitting the form inserts a row into my database table. And this is all suddenly feasible. It's not even particularly difficult to do, which is great. Utterly bizarre that these things are now easy.[00:47:00][00:47:00] swyx (2): I think for a general audience, that is what I would highlight: that software creation is becoming easier and easier. Gemini is now available in Gmail and Google Sheets. I don't write my own Google Sheets formulas anymore, I just tell Gemini to do it. And so I think — I almost wanted to basically somewhat disagree with your assertion that LLMs got harder to use.[00:47:22] swyx (2): Like, yes, we expose more capabilities, but they're in minor forms, like using Canvas, like web search in ChatGPT, and like Gemini being in Excel sheets or in Google Sheets. Like, yeah, we're getting —[00:47:37] Simon: no, no, no, no. Those are the things that make it harder, because the problem is that for each of those features, they're amazing —[00:47:43] Simon: if you understand the edges of the feature. If you're like, okay, so in Gemini in Google Sheets formulas, I can get it to do a certain amount of things, but I can't get it to go and read a webpage — you probably can't get it to read a webpage, right? But, you know, there are things that it can do and things that it can't do, which are completely undocumented.[00:47:58] Simon: If you ask it what it [00:48:00] can and can't do, they're terrible at answering questions about that. So, like, my favorite example is Claude Artifacts. You can't build a Claude Artifact that can hit an API somewhere else, because the CORS headers on that iframe prevent accessing anything outside of cdnjs.
So, good luck learning CORS headers as an end user in order to understand why. Like, I've seen people saying, oh, this is rubbish —[00:48:26] Simon: I tried building an artifact that would run a prompt and it couldn't, because Claude didn't expose an API with CORS headers. All of this stuff is so weird and complicated. And yeah, the more tools we add, the more expertise you need to really understand the full scope of what you can do.[00:48:44] Simon: And so the question really comes down to, what does it take to understand the full extent of what's possible? And honestly, that's just getting more and more involved over time.[00:48:58] Local LLMs: A Growing Interest[00:48:58] swyx (2): I have one more topic that I [00:49:00] think you're kind of a champion of, and we've touched on it a little bit, which is local LLMs.[00:49:05] swyx (2): And running AI applications on your desktop — I feel like you are an early adopter of many, many things.[00:49:12] Simon: I had an interesting experience with that over the past year. Six months ago, I almost completely lost interest. And the reason is that six months ago, with the best local models you could run, there was no point in using them at all, because the best hosted models were so much better.[00:49:26] Simon: Like, there was no point at which I'd choose to run a model on my laptop if I had API access to Claude 3.5 Sonnet. They just weren't even comparable. And that changed, basically, in the past three months, as the local models had this step change in capability, where now I can run some of these local models, and they're not as good as Claude 3.5 Sonnet, [00:49:45] but they're not so far away that it's not worth me even using them. The continuing problem is I've only got 64 gigabytes of RAM, and if you run, like, Llama 3 70B, it's not going to work — most of my RAM is gone. So now I have to shut down my Firefox tabs [00:50:00] and my Chrome and my VS Code windows in order to run it.[00:50:03] Simon: But it's got me interested again. Like, the efficiency improvements are such that now, if you were to stick me on a desert island with my laptop, I'd be very productive using those local models. And that's pretty exciting. And if those trends continue — and also, I think my next laptop, if and when I buy one, is going to have twice the amount of RAM — at which point, maybe I can run almost the top tier, like, open weights models and still be able to use it as a computer as well.[00:50:32] Simon: NVIDIA just announced their $3,000 128-gigabyte monstrosity. That's a pretty good price. You know, that's, if you're going to buy it,[00:50:42] swyx (2): custom OS and all.[00:50:46] Simon: If I get a job — if I have enough of an income that I can justify blowing $3,000 on it, then yes.[00:50:52] swyx (2): Okay, let's do a GoFundMe to get Simon one.[00:50:54] swyx (2): Come on. You know, you can get a job anytime you want. This is just purely discretionary.[00:50:59] Simon: I want [00:51:00] a job that pays me to do exactly what I'm doing already and doesn't tell me what else to do. That's the challenge.[00:51:06] swyx (2): I think Ethan Mollick does pretty well.
Whatever, whatever it is he's doing.[00:51:11] swyx (2): But yeah, basically I was trying to bring in also, you know, not just local models, but Apple Intelligence is on every Mac machine. You're — you seem skeptical. It's rubbish.[00:51:21] Simon: Apple Intelligence is so bad. It's like, it does one thing well.[00:51:25] swyx (2): Oh yeah, what's that? It summarizes notifications. And sometimes it's humorous.[00:51:29] Brian: Are you sure it does that well? And also, by the way, again, from a sort of a normie point of view: there's no indication from Apple of when to use it. Like, everybody upgrades their thing and it's like, okay, now you have Apple Intelligence, and you never know when to use it ever again.[00:51:47] swyx (2): Oh, yeah, you consult the Apple docs, which is MKBHD.[00:51:51] Simon: The one thing I'll say about Apple Intelligence is, one of the reasons it's so disappointing is that the models are just weak. But now, like, Llama 3.2 3B [00:52:00] is such a good model in a 2-gigabyte file. I think give Apple six months and hopefully they'll catch up to the state of the art on the small models, and then maybe it'll start being a lot more interesting.[00:52:10] swyx (2): Yeah. Anyway, I'd like to say this was year one. And, you know, just like the first year of the iPhone — maybe not that much of a hit — and then year three they had the App Store. So hey, I would say give it some time. And, you know, I think Chrome is also shipping Gemini Nano, I think this year, in Chrome, which means that every app, every web app, will have free access to a local model that just ships in the browser, which is kind of interesting.[00:52:38] swyx (2): And then I think I also wanted to just open the floor for any of us: what are the apps, you know, the AI applications, that we've adopted that we really recommend? Because these are all, you know, apps that are running in our browser, or apps that are running locally, that other people should be trying.[00:52:55] swyx (2): Right? Like, I feel like that's always one thing that is helpful at the start of the [00:53:00] year.[00:53:00] Simon: Okay. So for running local models, my top picks: firstly, on the iPhone, there's this thing called MLC Chat, which works, and it's easy to install, and it runs Llama 3.2 3B, and it's so much fun. Like, it's not necessarily a capable enough model that I use it for real things, but my party trick right now is I get my phone to write a Netflix Christmas movie plot outline where, like, a jeweler falls in love with the King of Sweden or whatever.[00:53:25] Simon: And it does a good job, and it comes up with pun names for the movies. And that's deeply entertaining. On my laptop, most recently, I've been getting heavy into Ollama, because the Ollama team are very, very good at finding the good models and patching them up and making them work well. It gives you an API.[00:53:42] Simon: My little LLM command-line tool has a plugin that talks to Ollama, which works really well. So Ollama is, I think, the easiest on-ramp to running models locally. If you want a nice user interface, LM Studio is, I think, the best user interface [00:54:00] for that. It's not open source. It's good.[00:54:02] Simon: It's worth playing with. The other one that I've been trying recently — there's a thing called, what's it called? Open WebUI or something. Yeah. The UI is fantastic.
If you've got Ollama running and you fire this thing up, it spots Ollama and it gives you an interface onto your Ollama models.
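As a practical footnote to Simon's local-model picks: once Ollama is running, any local script can talk to it over its HTTP API. A minimal sketch, assuming the /api/generate endpoint and response shape from the Ollama docs as of early 2025, and that a model such as llama3.2 has already been pulled with `ollama pull llama3.2`:

```python
# Minimal sketch: ask a locally running Ollama server for a completion (default port 11434).
import json
import urllib.request

payload = {
    "model": "llama3.2",   # whatever model you have pulled locally
    "prompt": "Outline a Netflix Christmas movie where a jeweler falls in love with the King of Sweden.",
    "stream": False,
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```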

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Beating Google at Search with Neural PageRank and $5M of H200s — with Will Bryk of Exa.ai

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jan 10, 2025 56:00


Applications close Monday for the NYC AI Engineer Summit focusing on AI Leadership and Agent Engineering! If you applied, invites should be rolling out shortly.

The search landscape is experiencing a fundamental shift. Google built a >$2T company with the "10 blue links" experience, driven by PageRank as the core innovation for ranking. This was a big improvement from the previous directory-based experiences of AltaVista and Yahoo. Almost three decades later, Google is now stuck in this links-based experience, especially from a business model perspective. This legacy architecture creates fundamental constraints:

* Must return results in ~400 milliseconds
* Required to maintain comprehensive web coverage
* Tied to keyword-based matching algorithms
* Cost structures optimized for traditional indexing

As we move from the era of links to the era of answers, the way search works is changing. You're not showing a user links; the goal is to provide context to an LLM. This means moving from keyword-based search to a more semantic understanding of the content:

The link prediction objective can be seen as like a neural PageRank because what you're doing is you're predicting the links people share... but it's more powerful than PageRank. It's strictly more powerful because people might refer to that Paul Graham fundraising essay in like a thousand different ways. And so our model learns all the different ways.

All of this is now powered by a $5M cluster with 144 H200s. This architectural choice enables entirely new search capabilities:

* Comprehensive result sets instead of approximations
* Deep semantic understanding of queries
* Ability to process complex, natural language requests

As search becomes more complex, time to results becomes a variable:

People think of searches as like, oh, it takes 500 milliseconds because we've been conditioned... But what if searches can take like a minute or 10 minutes or a whole day, what can you then do?

Unlike traditional search engines' fixed-cost indexing, Exa employs a hybrid approach:

* Front-loaded compute for indexing and embeddings
* Variable inference costs based on query complexity
* Mix of owned infrastructure ($5M H200 cluster) and cloud resources

Exa sees a lot of competition from products like Perplexity and ChatGPT Search which layer AI on top of traditional search backends, but Exa is betting that true innovation requires rethinking search from the ground up. For example, the recently launched Websets is a way to turn searches into structured output in grid format, allowing you to create lists and databases out of web pages. The company raised a $17M Series A to build towards this mission, so keep an eye out for them in 2025.
Chapters

* 00:00:00 Introductions
* 00:01:12 ExaAI's initial pitch and concept
* 00:02:33 Will's background at SpaceX and Zoox
* 00:03:45 Evolution of ExaAI (formerly Metaphor Systems)
* 00:05:38 Exa's link prediction technology
* 00:09:20 Meaning of the name "Exa"
* 00:10:36 ExaAI's new product launch and capabilities
* 00:13:33 Compute budgets and variable compute products
* 00:14:43 Websets as a B2B offering
* 00:19:28 How do you build a search engine?
* 00:22:43 What is Neural PageRank?
* 00:27:58 Exa use cases
* 00:35:00 Auto-prompting
* 00:38:42 Building agentic search
* 00:44:19 Is o1 on the path to AGI?
* 00:49:59 Company culture and nap pods
* 00:54:52 Economics of AI search and the future of search technology

Full YouTube Transcript

Please like and subscribe!

Show Notes

* ExaAI
* Web Search Product
* Websets
* Series A Announcement
* Exa Nap Pods
* Perplexity AI
* Character.AI

Transcript

Alessio [00:00:00]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:10]: Hey, and today we're in the studio with my good friend and former landlord, Will Bryk. Roommate. How you doing? Will, you're now CEO co-founder of ExaAI, used to be Metaphor Systems. What's your background, your story?Will [00:00:30]: Yeah, sure. So, yeah, I'm CEO of Exa. I've been doing it for three years. I guess I've always been interested in search, whether I knew it or not. Like, since I was a kid, I've always been interested in, like, high-quality information. And, like, you know, even in high school, wanted to improve the way we get information from news. And then in college, built a mini search engine. And then with Exa, like, you know, it's kind of like fulfilling the dream of actually being able to solve all the information needs I wanted as a kid. Yeah, I guess. I would say my entire life has kind of been rotating around this problem, which is pretty cool. Yeah.Swyx [00:00:50]: What'd you enter YC with?Will [00:00:53]: We entered YC with, uh, we are better than Google. Like, Google 2.0.Swyx [00:01:12]: What makes you say that? Like, that's so audacious to come out of the box with.Will [00:01:16]: Yeah, okay, so you have to remember the time. This was summer 2021. And, uh, GPT-3 had come out. Like, here was this magical thing that you could talk to, you could enter a whole paragraph, and it understands what you mean, understands the subtlety of your language. And then there was Google. Uh, which felt like it hadn't changed in a decade, uh, because it really hadn't. And it, like, you would give it a simple query, like, I don't know, uh, shirts without stripes, and it would give you a bunch of results for the shirts with stripes. And so, like, Google could barely understand you, and GPT-3 could. And the theory was, what if you could make a search engine that actually understood you? What if you could apply the insights from LLMs to a search engine? And it's really been the same idea ever since. And we're actually a lot closer now, uh, to doing that. Yeah.Alessio [00:01:55]: Did you have any trouble making people believe? Obviously, there's the same element. I mean, YC overlap, was YC pretty AI forward, even 2021, or?Will [00:02:03]: It's nothing like it is today. But, um, uh, there were a few AI companies, but, uh, we were definitely, like, bold. And I think people, VCs generally like boldness, and we definitely had some AI background, and we had a working demo.
So there was evidence that we could build something that was going to work. But yeah, I think, like, the fundamentals were there. I think people at the time were talking about how, you know, Google was failing in a lot of ways. And so there was a bit of conversation about it, but AI was not a big, big thing at the time. Yeah. Yeah.Alessio [00:02:33]: Before we jump into Exa, any fun background stories? I know you interned at SpaceX, any Elon, uh, stories? I know you were at Zoox as well, you know, kind of like robotics at Harvard. Any stuff that you saw early that you thought was going to get solved that maybe it's not solved today?Will [00:02:48]: Oh yeah. I mean, lots of things like that. Like, uh, I never really learned how to drive because I believed Elon that self-driving cars would happen. It did happen. And I take them every night to get home. But it took like 10 more years than I thought. Do you still not know how to drive? I know how to drive now. I learned it like two years ago. That would have been great to like, just, you know, Yeah, yeah, yeah. You know? Um, I was obsessed with Elon. Yeah. I mean, I worked at SpaceX because I really just wanted to work at one of his companies. And I remember they had a rule, like interns cannot touch Elon. And, um, that rule actually influenced my actions.Swyx [00:03:18]: Is it, can Elon touch interns? Ooh, like physically?Will [00:03:22]: Or like talk? Physically, physically, yeah, yeah, yeah, yeah. Okay, interesting. He's changed a lot, but, um, I mean, his companies are amazing. Um,Swyx [00:03:28]: What if you beat him at Diablo 2, Diablo 4, you know, like, Ah, maybe.Alessio [00:03:34]: I want to jump into — I know there's a lot of backstory, you used to be called Metaphor Systems. So, um, and you've always been kind of like a prominent company, maybe at least in AI circles in SF.Swyx [00:03:45]: I'm actually curious how Metaphor got its initial aura. You launched with like, very little. We launched very little. Like there was, there was this like big splash image of like, this is Aurora or something. Yeah. Right. And then I was like, okay, what is this thing? Like, the vibes are good, but I don't know what it is. And I think, I think it was much more sort of maybe consumer facing than what you are today. Would you say that's true?Will [00:04:06]: No, it's always been about building a better search algorithm, like search, like, just like the vision has always been perfect search. And if you do that, uh, we will figure out the downstream use cases later. It started on this fundamental belief that you could have perfect search over the web and we could talk about what that means. And like the initial thing we released was really just like our first search engine, like trying to get it out there. Kind of like, you know, an open source. So when OpenAI released, uh, ChatGPT, like they didn't, I don't know how, how much of a game plan they had. They kind of just wanted to get something out there.Swyx [00:04:33]: Spooky research preview.Will [00:04:34]: Yeah, exactly. And it kind of morphed from a research company to a product company at that point. And I think similarly for us, like we were research, we started as a research endeavor with, you know, clear eyes that like, if we succeed, it will be a massive business to be made out of it. And that's kind of basically what happened. I think there are actually a lot of parallels between Exa and OpenAI. I often say we're the OpenAI of search. Um, because.
Because we're a research company, we're a research startup that does like fundamental research into, uh, making like AGI for search in a, in a way. Uh, and then we have all these like, uh, business products that come out of that.Swyx [00:05:08]: Interesting. I want to ask a little bit more about Metaphor and then we can go full Exa. When I first met you, which was really funny, cause like literally I stayed in your house in a very historic, uh, Hayes, Hayes Valley place. You said you were building sort of like a link prediction foundation model, and I think there's still a lot of foundation model work. I mean, within Exa today, but what does that even mean? I cannot be the only person confused by that, because like there's a limited vocabulary or tokens you're telling me, like the tokens are the links or, you know, like it's not, it's not clear. Yeah.Will [00:05:38]: Uh, what we meant by link prediction is that you are literally predicting, like, given some text, you're predicting the links that follow. Yes. That refers to like, it's how we describe the training procedure, which is that we find links on the web. Uh, we take the text surrounding the link. And then we predict which link follows — like, uh, you know, similar to transformers where, uh, you're trying to predict the next token; here, you're trying to predict the next link. And so you kind of like hide the link from the transformer. So if someone writes, you know, imagine some article where someone says, Hey, check out this really cool aerospace startup. And they, they say spacex.com afterwards, uh, we hide the spacex.com and ask the model, like, what link came next. And by doing that many, many times, you know, billions of times, you could actually build a search engine out of that, because then, uh, at query time, at search time, uh, you type in, uh, a query that's like really cool aerospace startup, and the model will then try to predict what are the most likely links. So there's a lot of analogs to transformers, but like to actually make this work, it does require like a different architecture, but it's transformer inspired. Yeah.Alessio [00:06:41]: What's the design decision between doing that versus extracting the link and the description and then embedding the description and then using, um, yeah. What do you need to predict the URL versus like just describing, because you're kind of doing a similar thing in a way. Right. It's kind of like, based on this description, it was like the closest link for it. So one thing is like predicting the link. The other approach is like, I extract the link and the description, and then based on the query, I search the closest description to it. Yeah.Will [00:07:09]: That, that, by the way — the link refers here to a document. It's not, I think one confusing thing is, you're not actually predicting the URL, the URL itself; that would require like the, the system to have memorized URLs. You're actually like getting the actual document, so a more accurate name could be document prediction. I see. This was the initial like base model that Exa was trained on, but we've moved beyond that, similar to like how, you know, uh, to train a really good like language model, you might start with this like self-supervised objective of predicting the next token, uh, just from random stuff on the web. But then you, you want to, uh, add a bunch of like synthetic data and like supervised fine tuning, um, stuff like that to make it really like controllable and robust.
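Reading Will's description literally, a toy version of that training objective might look like the dual-encoder sketch below: embed the text surrounding a link, embed the linked document, and train with in-batch negatives so each context scores highest against its own target. Everything here — the bag-of-words encoders, the sizes, the loss — is an illustrative assumption for the general idea, not Exa's actual architecture, which Will notes is transformer-inspired.

```python
# Toy "link prediction" objective as a contrastive dual-encoder (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 5000, 128

class BagEncoder(nn.Module):
    """Deliberately simple stand-in for the real transformer encoders."""
    def __init__(self):
        super().__init__()
        self.emb = nn.EmbeddingBag(VOCAB, DIM, mode="mean")
    def forward(self, token_ids):
        return F.normalize(self.emb(token_ids), dim=-1)

context_enc, doc_enc = BagEncoder(), BagEncoder()   # text-around-the-link / linked-document encoders
optimizer = torch.optim.Adam(
    list(context_enc.parameters()) + list(doc_enc.parameters()), lr=1e-3
)

def training_step(context_tokens, linked_doc_tokens, temperature=0.05):
    """context_tokens[i] is text surrounding a link whose (hidden) target is linked_doc_tokens[i]."""
    q = context_enc(context_tokens)               # (B, DIM)
    d = doc_enc(linked_doc_tokens)                # (B, DIM)
    logits = q @ d.T / temperature                # every context scored against every doc in the batch
    labels = torch.arange(q.size(0))              # the true linked document sits on the diagonal
    loss = F.cross_entropy(logits, labels)        # in-batch negatives
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy batch: 8 "check out this really cool aerospace startup ..." contexts and their linked docs.
contexts = torch.randint(0, VOCAB, (8, 32))
documents = torch.randint(0, VOCAB, (8, 256))
print(training_step(contexts, documents))
```

At query time the same context encoder embeds the search query as if it were text preceding a link, and the nearest document embeddings are the predicted "links people would share" — which is why the show notes describe the objective as a neural PageRank.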
Yeah.Alessio [00:07:48]: Yeah. We just had Flo from Lindy, and, uh, their Lindy started to, like, hallucinate Rickrolling YouTube links instead of, like, uh, something. Yeah. A support guide. So. Oh, interesting. Yeah.Swyx [00:07:57]: So round about January, you announced your Series A and renamed to Exa. I didn't like the name at the, at the initial, but it's grown on me. I liked Metaphor, but apparently people can't spell Metaphor. What would you say are the major components of Exa today? Right? Like, I feel like it used to be very model heavy. Then at the AI Engineer conference, Shreyas gave a really good talk on the vector database that you guys have. What are the other major moving parts of Exa? Okay.Will [00:08:23]: So Exa overall is a search engine. Yeah. We're trying to make it like a perfect search engine. And to do that, you have to build lots of, and we're doing it from scratch, right? So to do that, you have to build lots of different. The crawler. Yeah. You have to crawl a bunch of the web. First of all, you have to find the URLs to crawl. Uh, it's connected to the crawler, but yeah, you find URLs, you crawl those URLs. Then you have to process them with some, you know, it could be an embedding model. It could be something more complex, but you need to take, you know, or like, you know, in the past it was like a keyword inverted index. Like you would process all these documents you gather into some processed index, and then you have to serve that, uh, at high throughput and low latency. And so that, and that's like the vector database. And so it's like the crawling system, the AI processing system, and then the serving system. Those are all like, you know, teams of like hundreds, maybe thousands of people at Google. Um, but for us, it's like one or two people each typically, but yeah.Alessio [00:09:13]: Can you explain the meaning of, uh, Exa? Just the story — 10 to the 16th, uh, 18, 18.Will [00:09:20]: Yeah, yeah, yeah, sure. So, Exa means 10 to the 18th, which is in stark contrast to Google, which is 10 to the hundredth. Uh, we actually have these like awesome shirts that are like 10 to the 18th is greater than 10 to the 100th. Yeah, it's great. And it's great because it's provocative. It's like every engineer in Silicon Valley is like, what? No, it's not true. Um, like, yeah. And, uh, and then you, you ask them, okay, what does it actually mean? And like the creative ones will, will recognize it. But yeah, I mean, 10 to the 18th is better than 10 to the 100th when it comes to search, because with search, you want like the actual list of, of things that match what you're asking for. You don't want like the whole web. You want to basically, with search, filter everything that humanity has ever created to exactly what you want. And so the idea is like smaller is better there. You want like the best 10 to the 18th and not the 10 to the 100th. Like, one way to say this is, you know how Google often says at the top, uh, like, you know, 30 million results found. And it's like crazy. Cause you're looking for like the first startups in San Francisco that work on hardware or something. And like, there are not 30 million results like that. What you want is like 325 results found. And those are all the results. That's what you really want with search. And that's, that's our vision. It's like, it just gives you. Perfectly what you asked for.Swyx [00:10:24]: We're recording this ahead of your launch.
Uh, we haven't released, we haven't figured out the, the, the name of the launch yet, but what is the product that you're launching? I guess now that we're coinciding this podcast with. Yeah.Will [00:10:36]: So we've basically developed the next version of Exa, which is the ability to get a near perfect list of results of whatever you want. And what that means is you can make a complex query now to Exa, for example, startups working on hardware in SF, and then just get a huge list of all the things that match. And, you know, our goal is if there are 325 startups that match that we find you all of them. And this is just like, there's just like a new experience that's never existed before. It's really like, I don't know how you would go about that right now with current tools and you can apply this same type of like technology to anything. Like, let's say you want, uh, you want to find all the blog posts that talk about Alessio's podcast, um, that have come out in the past year. That is 30 million results. Yeah. Right.Will [00:11:24]: But that, I mean, that would, I'm sure that would be extremely useful to you guys. And like, I don't really know how you would get that full comprehensive list.Swyx [00:11:29]: I just like, how do you, well, there's so many questions with regards to how do you know it's complete, right? Cause you're saying there's only 30 million, 325, whatever. And then how do you do the semantic understanding that it might take, right? So working in hardware, like I might not use the words hardware. I might use the words robotics. I might use the words wearables. I might use like whatever. Yes. So yeah, just tell us more. Yeah. Yeah. Sure. Sure.Will [00:11:53]: So one aspect of this, it's a little subjective. So like certainly providing, you know, at some point we'll provide parameters to the user to like, you know, some sort of threshold to like, uh, gauge like, okay, like this is a cutoff. Like, this is actually not what I mean, because sometimes it's subjective and there needs to be a feedback loop. Like, oh, like it might give you like a few examples and you say, yeah, exactly. And so like, you're, you're kind of like creating a classifier on the fly, but like, that's ultimately how you solve the problem. So the subject, there's a subjectivity problem and then there's a comprehensiveness problem. Those are two different problems. So. Yeah. So you have the comprehensiveness problem. What you basically have to do is you have to put more compute into the query, into the search until you get the full comprehensiveness. Yeah. And I think there's an interesting point here, which is that not all queries are made equal. Some queries just like this blog post one might require scanning, like scavenging, like throughout the whole web in a way that just, just simply requires more compute. You know, at some point there's some amount of compute where you will just be comprehensive. You could imagine, for example, running GPT-4 over the internet. You could imagine running GPT-4 over the entire web and saying like, is this a blog post about Alessio's podcast, like, is this a blog post about Alessio's podcast? And then that would work, right? It would take, you know, a year, maybe cost like a million dollars, but, or many more, but, um, it would work. Uh, the point is that like, given sufficient compute, you can solve the query. And so it's really a question of like, how comprehensive do you want it given your compute budget? I think it's very similar to O1, by the way. 
And one way of thinking about what we built is like O1 for search, uh, because O1 is all about like, you know, some, some, some questions require more compute than others, and we'll put as much compute into the question as we need to solve it. So similarly with our search, we will put as much compute into the query in order to get comprehensiveness. Yeah.Swyx [00:13:33]: Does that mean you have like some kind of compute budget that I can specify? Yes. Yes. Okay. And like, what are the upper and lower bounds?Will [00:13:42]: Yeah, there's something we're still figuring out. I think like, like everyone is a new paradigm of like variable compute products. Yeah. How do you specify the amount of compute? Like what happens when you. Run out? Do you just like, ah, do you, can you like keep going with it? Like, do you just put in more credits to get more, um, for some, like this can get complex at like the really large compute queries. And like, one thing we do is we give you a preview of what you're going to get, and then you could then spin up like a much larger job, uh, to get like way more results. But yes, there is some compute limit, um, at, at least right now. Yeah. People think of searches as like, oh, it takes 500 milliseconds because we've been conditioned, uh, to have search that takes 500 milliseconds. But like search engines like Google, right. No matter how complex your query to Google, it will take like, you know, roughly 400 milliseconds. But what if searches can take like a minute or 10 minutes or a whole day, what can you then do? And you can do very powerful things. Um, you know, you can imagine, you know, writing a search, going and get a cup of coffee, coming back and you have a perfect list. Like that's okay for a lot of use cases. Yeah.Alessio [00:14:43]: Yeah. I mean, the use case closest to me is venture capital, right? So, uh, no, I mean, eight years ago, I built one of the first like data driven sourcing platforms. So we were. You look at GitHub, Twitter, Product Hunt, all these things, look at interesting things, evaluate them. If you think about some jobs that people have, it's like literally just make a list. If you're like an analyst at a venture firm, your job is to make a list of interesting companies. And then you reach out to them. How do you think about being infrastructure versus like a product you could say, Hey, this is like a product to find companies. This is a product to find things versus like offering more as a blank canvas that people can build on top of. Oh, right. Right.Will [00:15:20]: Uh, we are. We are a search infrastructure company. So we want people to build, uh, on top of us, uh, build amazing products on top of us. But with this one, we try to build something that makes it really easy for users to just log in, put a few, you know, put some credits in and just get like amazing results right away and not have to wait to build some API integration. So we're kind of doing both. Uh, we, we want, we want people to integrate this into all their applications at the same time. We want to just make it really easy to use very similar again to open AI. Like they'll have, they have an API, but they also have. Like a ChatGPT interface so that you could, it's really easy to use, but you could also build it in your applications. Yeah.Alessio [00:15:56]: I'm still trying to wrap my head around a lot of the implications. 
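One way to picture the variable-compute idea is a loop that keeps spending budget on a query until the result set stops growing, with a cheap preview pass first. `search_batch` below is a hypothetical stand-in for one unit of search work (for example, scanning one more slice of the index); it is not an Exa API.

```python
# Hedged sketch of "O1 for search" style variable compute: a cheap preview,
# then an expanding search that stops when new results dry up or the budget runs out.
def preview(query, search_batch, preview_budget=1):
    results = set()
    for step in range(preview_budget):        # small, fast look at what you'd get
        results |= set(search_batch(query, step))
    return results

def comprehensive_search(query, search_batch, max_budget=100, patience=3):
    results, stale = set(), 0
    for step in range(max_budget):             # each step costs one unit of compute
        new = set(search_batch(query, step)) - results
        results |= new
        stale = stale + 1 if not new else 0
        if stale >= patience:                   # nothing new lately: call it comprehensive
            break
    return results
```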
So, so many businesses run on like information arbitrage, you know, like I know this thing that you don't, especially in investment and financial services. So yeah, now all of a sudden you have these tools for like, oh, actually everybody can get the same information at the same time, the same quality level as an API call. You know, it just kind of changes a lot of things. Yeah.Will [00:16:19]: I think, I think what we're grappling with here. What, what you're just thinking about is like, what is the world like if knowledge is kind of solved, if like any knowledge request you want is just like right there on your computer, it's kind of different from when intelligence is solved. There's like a good, I've written before about like a different super intelligence, super knowledge. Yeah. Like I think that the, the distinction between intelligence and knowledge is actually a pretty good one. They're definitely connected and related in all sorts of ways, but there is a distinction. You could have a world and we are going to have this world where you have like GP five level systems and beyond that could like answer any complex request. Um, unless it requires some. Like, if you say like, uh, you know, give me a list of all the PhDs in New York city who, I don't know, have thought about search before. And even though this, this super intelligence is going to be like, I can't find it on Google, right. Which is kind of crazy. Like we're literally going to have like super intelligences that are using Google. And so if Google can't find them information, there's nothing they could do. They can't find it. So, but if you also have a super knowledge system where it's like, you know, I'm calling this term super knowledge where you just get whatever knowledge you want, then you can pair with a super intelligence system. And then the super intelligence can, we'll never. Be blocked by lack of knowledge.Alessio [00:17:23]: Yeah. You told me this, uh, when we had lunch, I forget how it came out, but we were talking about AGI and whatnot. And you were like, even AGI is going to need search. Yeah.Swyx [00:17:32]: Yeah. Right. Yeah. Um, so we're actually referencing a blog post that you wrote super intelligence and super knowledge. Uh, so I would refer people to that. And this is actually a discussion we've had on the podcast a couple of times. Um, there's so much of model weights that are just memorizing facts. Some of the, some of those might be outdated. Some of them are incomplete or not. Yeah. So like you just need search. So I do wonder, like, is there a maximum language model size that will be the intelligence layer and then the rest is just search, right? Like maybe we should just always use search. And then that sort of workhorse model is just like, and it like, like, like one B or three B parameter model that just drives everything. Yes.Will [00:18:13]: I believe this is a much more optimal system to have a smaller LM. That's really just like an intelligence module. And it makes a call to a search. Tool that's way more efficient because if, okay, I mean the, the opposite of that would be like the LM is so big that can memorize the whole web. That would be like way, but you know, it's not practical at all. I don't, it's not possible to train that at least right now. And Carpathy has actually written about this, how like he could, he could see models moving more and more towards like intelligence modules using various tools. 
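The "small intelligence module plus search tool" split can be sketched as a tool-calling loop in which a compact model decides when it lacks knowledge and calls out to search instead of relying on memorized facts. `small_lm` and `exa_search` are hypothetical stand-ins, not real APIs.

```python
# Sketch of a small LM that treats search as its knowledge source.
def answer(question, small_lm, exa_search, max_tool_calls=3):
    context = []
    for _ in range(max_tool_calls):
        step = small_lm(question=question, context=context)      # returns a dict
        if step["action"] == "search":                            # model asks for knowledge
            context.extend(exa_search(step["query"], num_results=5))
        else:                                                     # model has enough to answer
            return step["answer"]
    # Out of tool calls: force a final answer from whatever was gathered.
    return small_lm(question=question, context=context, force_answer=True)["answer"]
```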
Yeah.Swyx [00:18:39]: So for listeners, that's the, that was him on the no priors podcast. And for us, we talked about this and the, on the Shin Yu and Harrison chase podcasts. I'm doing search in my head. I told you 30 million results. I forgot about our neural link integration. Self-hosted exit.Will [00:18:54]: Yeah. Yeah. No, I do see that that is a much more, much more efficient world. Yeah. I mean, you could also have GB four level systems calling search, but it's just because of the cost of inference. It's just better to have a very efficient search tool and a very efficient LM and they're built for different things. Yeah.Swyx [00:19:09]: I'm just kind of curious. Like it is still something so audacious that I don't want to elide, which is you're, you're, you're building a search engine. Where do you start? How do you, like, are there any reference papers or implementation? That would really influence your thinking, anything like that? Because I don't even know where to start apart from just crawl a bunch of s**t, but there's gotta be more insight than that.Will [00:19:28]: I mean, yeah, there's more insight, but I'm always surprised by like, if you have a group of people who are really focused on solving a problem, um, with the tools today, like there's some in, in software, like there are all sorts of creative solutions that just haven't been thought of before, particularly in the information retrieval field. Yeah. I think a lot of the techniques are just very old, frankly. Like I know how Google and Bing work and. They're just not using new methods. There are all sorts of reasons for that. Like one, like Google has to be comprehensive over the web. So they're, and they have to return in 400 milliseconds. And those two things combined means they are kind of limit and it can't cost too much. They're kind of limited in, uh, what kinds of algorithms they could even deploy at scale. So they end up using like a limited keyword based algorithm. Also like Google was built in a time where like in, you know, in 1998, where we didn't have LMS, we didn't have embeddings. And so they never thought to build those things. And so now they have this like gigantic system that is built on old technology. Yeah. And so a lot of the information retrieval field we found just like thinks in terms of that framework. Yeah. Whereas we came in as like newcomers just thinking like, okay, there here's GB three. It's magical. Obviously we're going to build search that is using that technology. And we never even thought about using keywords really ever. Uh, like we were neural all the way we're building an end to end neural search engine. And just that whole framing just makes us ask different questions, like pursue different lines of work. And there's just a lot of low hanging fruit because no one else is thinking about it. We're just on the frontier of neural search. We just are, um, for, for at web scale, um, because there's just not a lot of people thinking that way about it.Swyx [00:20:57]: Yeah. Maybe let's spell this out since, uh, we're already on this topic, elephants in the room are Perplexity and SearchGPT. That's the, I think that it's all, it's no longer called SearchGPT. I think they call it ChatGPT Search. How would you contrast your approaches to them based on what we know of how they work and yeah, just any, anything in that, in that area? 
Yeah.Will [00:21:15]: So these systems, there are a few of them now, uh, they basically rely on like traditional search engines like Google or Bing, and then they combine them with like LLMs at the end to, you know, output some paragraphs, uh, answering your question. So like SearchGPT, Perplexity. I think they have their own crawlers. No. So there's this important distinction between like having your own search system and like having your own cache of the web. Like for example, so you could create, you could crawl a bunch of the web. Imagine you crawl a hundred billion URLs, and then you create a key value store of like mapping from URL to the document. That is technically called an index, but it's not a search algorithm. So then to actually like, when you make a query to SearchGPT, for example, what is it actually doing? Let's say it's, it's, it could, it's using the Bing API, uh, getting a list of results and then it could go, it has this cache of like all the contents of those results and then could like bring in the cache, like the index cache, but it's not actually like, it's not like they've built a search engine from scratch over, you know, hundreds of billions of pages. Is that distinction clear? It's like, yeah, you could have like a mapping from URL to documents, but then rely on traditional search engines to actually get the list of results, because it's a very hard problem to take... It's not hard. It's not hard to use DynamoDB and, and, and map URLs to documents. It's a very hard problem to take a hundred billion or more documents and, given a query, like instantly get the list of results that match. That's a much harder problem that very few entities on, in, on the planet have done. Like there's Google, there's Bing, uh, you know, there's Yandex, but you know, there are not that many companies that are, that are crazy enough to actually build their search engine from scratch when you could just use traditional search APIs.Alessio [00:22:43]: So Google had PageRank as like the big thing. Is there an LLM equivalent or like any stuff that you're working on that you want to highlight?Will [00:22:51]: The link prediction objective can be seen as like a neural PageRank because what you're doing is you're predicting the links people share. And so if everyone is sharing some Paul Graham essay about fundraising, then like our model is more likely to predict it. So like inherent in our training objective is this, uh, a sense of like high canonicity and like high quality, but it's more powerful than PageRank. It's strictly more powerful because people might refer to that Paul Graham fundraising essay in like a thousand different ways. And so our model learns all the different ways that someone refers to that Paul Graham essay, while also learning how important that Paul Graham essay is. Um, so it's like, it's like PageRank on steroids kind of thing. Yeah.Alessio [00:23:26]: I think to me, that's the most interesting thing about search today, like with Google and whatnot, it's like, it's mostly like domain authority. So like, if you search any AI term, you get these like SEO slop websites with like a bunch of things in them. So this is interesting, but then how do you think about more timeless maybe content? So if you think about, yeah, you know, maybe the founder mode essay, right. It gets shared by like a lot of people, but then you might have a lot of other essays that are also good, but they just don't really get a lot of traction.
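The index-versus-search-engine distinction fits in a few lines of code: a key-value cache answers "give me this URL," while a search engine has to turn a query into the right URLs. The brute-force scoring loop below is illustrative only, and it is exactly the thing that does not scale to hundreds of billions of pages.

```python
# A key-value cache of crawled pages is not a search engine.
crawled_cache = {
    "https://spacex.com": "SpaceX designs, manufactures and launches advanced rockets...",
    "https://paulgraham.com/fr.html": "How to raise money for a startup...",
}

def lookup(url: str) -> str:
    # "Index" in the key-value sense (DynamoDB-style get-by-key): trivial.
    return crawled_cache[url]

def retrieve(query: str, score) -> list[str]:
    # Actual search: score every document against the query and rank the URLs.
    # At web scale this brute-force scan is unaffordable, which is why so few
    # companies build retrieval from scratch.
    return sorted(crawled_cache,
                  key=lambda url: score(query, crawled_cache[url]),
                  reverse=True)
```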
Even though maybe the people that share them are high quality. How do you kind of solve that thing when you don't have the people authority, so to speak of who's sharing, whether or not they're worth kind of like bumping up? Yeah.Will [00:24:10]: I mean, you do have a lot of control over the training data, so you could like make sure that the training data contains like high quality sources so that, okay. Like if you, if you're. Training data, I mean, it's very similar to like language, language model training. Like if you train on like a bunch of crap, your prediction will be crap. Our model will match the training distribution is trained on. And so we could like, there are lots of ways to tweak the training data to refer to high quality content that we want. Yeah. I would say also this, like this slop that is returned by, by traditional search engines, like Google and Bing, you have the slop is then, uh, transferred into the, these LLMs in like a search GBT or, you know, our other systems like that. Like if slop comes in, slop will go out. And so, yeah, that's another answer to how we're different is like, we're not like traditional search engines. We want to give like the highest quality results and like have full control over whatever you want. If you don't want slop, you get that. And then if you put an LM on top of that, which our customers do, then you just get higher quality results or high quality output.Alessio [00:25:06]: And I use Excel search very often and it's very good. Especially.Swyx [00:25:09]: Wave uses it too.Alessio [00:25:10]: Yeah. Yeah. Yeah. Yeah. Yeah. Like the slop is everywhere, especially when it comes to AI, when it comes to investment. When it comes to all of these things for like, it's valuable to be at the top. And this problem is only going to get worse because. Yeah, no, it's totally. What else is in the toolkit? So you have search API, you have ExaSearch, kind of like the web version. Now you have the list builder. I think you also have web scraping. Maybe just touch on that. Like, I guess maybe people, they want to search and then they want to scrape. Right. So is that kind of the use case that people have? Yeah.Will [00:25:41]: A lot of our customers, they don't just want, because they're building AI applications on top of Exa, they don't just want a list of URLs. They actually want. Like the full content, like cleans, parsed. Markdown. Markdown, maybe chunked, whatever they want, we'll give it to them. And so that's been like huge for customers. Just like getting the URLs and instantly getting the content for each URL is like, and you can do this for 10 or 100 or 1,000 URLs, wherever you want. That's very powerful.Swyx [00:26:05]: Yeah. I think this is the first thing I asked you for when I tried using Exa.Will [00:26:09]: Funny story is like when I built the first version of Exa, it's like, we just happened to store the content. Yes. Like the first 1,024 tokens. Because I just kind of like kept it because I thought of, you know, I don't know why. Really for debugging purposes. And so then when people started asking for content, it was actually pretty easy to serve it. But then, and then we did that, like Exa took off. So the computer's content was so useful. So that was kind of cool.Swyx [00:26:30]: It is. I would say there are other players like Gina, I think is in this space. Firecrawl is in this space. There's a bunch of scraper companies. 
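For the search-then-contents flow described above, a hedged usage sketch against the exa_py client looks roughly like the following. The method and field names are written from memory and may not match the current SDK exactly, so treat them as assumptions and check the docs before relying on them.

```python
# Search for matching pages and get back cleaned, parsed text in one call.
from exa_py import Exa

exa = Exa(api_key="YOUR_API_KEY")  # placeholder key

results = exa.search_and_contents(
    "blog posts about the Latent Space podcast from the past year",
    num_results=25,
    text=True,           # return parsed page text, not just URLs
)

for r in results.results:
    print(r.url)
    print(r.text[:500])  # cleaned content, ready to chunk or feed to an LLM
```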
And obviously scraper is just one part of your stack, but you might as well offer it since you already do it.Will [00:26:43]: Yeah, it makes sense. It's just easy to have an all-in-one solution. And like. We are, you know, building the best scraper in the world. So scraping is a hard problem and it's easy to get like, you know, a good scraper. It's very hard to get a great scraper and it's super hard to get a perfect scraper. So like, and, and scraping really matters to people. Do you have a perfect scraper? Not yet. Okay.Swyx [00:27:05]: The web is increasingly closing to the bots and the scrapers, Twitter, Reddit, Quora, Stack Overflow. I don't know what else. How are you dealing with that? How are you navigating those things? Like, you know. You know, opening your eyes, like just paying them money.Will [00:27:19]: Yeah, no, I mean, I think it definitely makes it harder for search engines. One response is just that there's so much value in the long tail of sites that are open. Okay. Um, and just like, even just searching over those well gets you most of the value. But I mean, there, there is definitely a lot of content that is increasingly not unavailable. And so you could get through that through data partnerships. The bigger we get as a company, the more, the easier it is to just like, uh, make partnerships. But I, I mean, I do see the world as like the future where the. The data, the, the data producers, the content creators will make partnerships with the entities that find that data.Alessio [00:27:53]: Any other fun use case that maybe people are not thinking about? Yeah.Will [00:27:58]: Oh, I mean, uh, there are so many customers. Yeah. What are people doing on AXA? Well, I think dating is a really interesting, uh, application of search that is completely underserved because there's a lot of profiles on the web and a lot of people who want to find love and that I'll use it. They give me. Like, you know, age boundaries, you know, education level location. Yeah. I mean, you want to, what, what do you want to do with data? You want to find like a partner who matches this education level, who like, you know, maybe has written about these types of topics before. Like if you could get a list of all the people like that, like, I think you will unblock a lot of people. I mean, there, I mean, I think this is a very Silicon Valley view of dating for sure. And I'm, I'm well aware of that, but it's just an interesting application of like, you know, I would love to meet like an intellectual partner, um, who like shares a lot of ideas. Yeah. Like if you could do that through better search and yeah.Swyx [00:28:48]: But what is it with Jeff? Jeff has already set me up with a few people. So like Jeff, I think it's my personal exit.Will [00:28:55]: my mom's actually a matchmaker and has got a lot of married. Yeah. No kidding. Yeah. Yeah. Search is built into the book. It's in your jeans. Yeah. Yeah.Swyx [00:29:02]: Yeah. Other than dating, like I know you're having quite some success in colleges. I would just love to map out some more use cases so that our listeners can just use those examples to think about use cases for XR, right? Because it's such a general technology that it's hard to. Uh, really pin down, like, what should I use it for and what kind of products can I build with it?Will [00:29:20]: Yeah, sure. So, I mean, there are so many applications of XR and we have, you know, many, many companies using us for very diverse range of use cases, but I'll just highlight some interesting ones. 
Like one customer, a big customer is using us to, um, basically build like a, a writing assistant for students who want to write, uh, research papers. And basically like Exa will search for, uh, like a list of research papers related to what the student is writing. And then this product has like an LLM that like summarizes the papers, so basically it's like next word prediction, but, uh, you know, prompted by like, you know, 20 research papers that Exa has returned. It's like literally just doing their homework for them. Yeah. Yeah. The key point is like, it's, it's, uh, you know, research is, is a really hard thing to do and you need like high quality content as input.Swyx [00:30:08]: Oh, so we've had Elicit on the podcast. I think it's pretty similar. Uh, they, they do focus pretty much on just, just research papers and, and that research. Basically, I think dating, uh, research, like I just wanted to like spell out more things, like just the big verticals.Will [00:30:23]: Yeah, yeah, no, I mean, there, there are so many use cases. So finance we talked about, yeah. I mean, one big vertical is just finding a list of companies, uh, so it's useful for VCs, like you said, who want to find like a list of competitors to a specific company they're investigating or just a list of companies in some field. Like, uh, there was one VC that told me that him and his team were using Exa for like eight hours straight. Like, like that. For many days on end, just like, like, uh, doing like lots of different queries of different types, like, oh, like all the companies in AI for law or, uh, all the companies for AI for, uh, construction and just like getting lists of things because you just can't find this information with, with traditional search engines. And then, you know, finding companies is also useful for, for selling. If you want to find, you know, like if we want to find a list of, uh, writing assistants to sell to, then we can just, we just use Exa ourselves to find that. That is actually how we found a lot of our customers. Ooh, you can find your own customers using Exa. Oh my God. I, in the spirit of, uh, using Exa to bolster Exa, like recruiting is really helpful. It is a really great use case of Exa, um, because we can just get like a list of, you know, people who thought about search and just get like a long list and then, you know, reach out to those people.Swyx [00:31:29]: When you say thought about, are you, are you thinking LinkedIn, Twitter, or are you thinking just blogs?Will [00:31:33]: Or they've written, I mean, it's pretty general. So in that case, like ideally Exa would return like the, the real blogs written by people who have just... So if I don't blog, I don't show up to Exa, right? Like I have to blog. Well, I mean, you could show up. That's like an incentive for people to blog.Swyx [00:31:47]: Well, if you've written about, uh, search on Twitter, and we, we do, we do index a bunch of tweets, then we, we should be able to surface that. Yeah. Um, I mean, this is something I tell people, like you have to make yourself discoverable to the web, uh, you know, it's called learning in public, but like, it's even more imperative now because otherwise you don't exist at all.Will [00:32:07]: Yeah, no, no, this is a huge, uh, thing, which is like search engines completely influence. They have downstream effects. They influence the internet itself. They influence what people choose to create.
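The student writing assistant is a straightforward retrieval-augmented generation pattern: retrieve roughly twenty related papers, then let a model draft the next passage grounded in them. `search_papers` and `llm` below are hypothetical stand-ins for the search call and the chat model, not the customer's actual product.

```python
# Sketch of the research-writing-assistant pattern described above.
def draft_next_paragraph(working_text, search_papers, llm):
    papers = search_papers(query=working_text, num_results=20)
    sources = "\n\n".join(
        f"[{i + 1}] {p['title']}: {p['summary']}" for i, p in enumerate(papers)
    )
    prompt = (
        "You are helping a student write a research paper.\n"
        f"Draft so far:\n{working_text}\n\n"
        f"Relevant papers:\n{sources}\n\n"
        "Write the next paragraph, citing sources by [number]."
    )
    return llm(prompt)
```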
And so Google, because they're a keyword based search engine, people like kind of like keyword stuff. Yeah. They're, they're, they're incentivized to create things that just match a lot of keywords, which is not very high quality. Uh, whereas Exa is a search algorithm that, uh, optimizes for like high quality and actually like matching what you mean. And so people are incentivized to create content that is high quality, like they create content that they know will be found by the right person. So like, you know, if I am a search researcher and I want to be found by Exa, I should blog about search and all the things I'm building, because now we have a search engine like Exa that's powerful enough to find them. And so the search engine will influence like the downstream internet in all sorts of amazing ways. Yeah. Uh, whatever the search engine optimizes for is what the internet looks like. Yeah.Swyx [00:33:01]: Are you familiar with the term McLuhanism? No? Uh, it's this concept that, uh, like first we shape tools and then the tools shape us. Okay. Yeah. Uh, so there's like this reflexive connection between the things we search for and the things that get searched. Yes. So like once you change the tool, the tool that searches, the, the things that get searched also change. Yes.Will [00:33:18]: I mean, there was a clear example of that with 30 years of Google. Yeah, exactly. Google has basically trained us to think of search as Google. Google is search, like, in people's heads. Right. One, uh, hard part about Exa is like, uh, ripping people away from that notion of search and expanding their sense of what search could be. Because like when people think search, they think like a few keywords, or at least they used to, they think of a few keywords and that's it. They don't think to make these like really complex paragraph long requests for information and get a perfect list. ChatGPT was an interesting like thing that expanded people's understanding of search because you start using ChatGPT for a few hours and you go back to Google and you like paste in your code and Google just doesn't work and you're like, oh, wait, Google doesn't work that way. So like ChatGPT expanded our understanding of what search can be. And I think Exa is, uh, is part of that. We want to expand people's notion, like, hey, you could actually get whatever you want. Yeah.Alessio [00:34:06]: I searched on Exa right now, people writing about learning in public. I was like, is it gonna come out with Alessio? Am I, am I there? You're not because. Bro. It's. So, no, it's, it's so about, because it thinks about learning, like in public, like public schools and like focuses more on that. You know, it's like how, when there are like these highly overlapping things, like this is like a good result based on the query, you know, but like, how do I get to Alessio? Right. So if you're like in these subcultures, I don't think this would work in Google well either, you know, but I, I don't know if you have any learnings.Swyx [00:34:40]: No, I'm the first result on Google.Alessio [00:34:42]: People writing about learning. In public, you're not first result anymore, I guess.Swyx [00:34:48]: Just type learning public in Google.Alessio [00:34:49]: Well, yeah, yeah, yeah, yeah. But this is also like, this is in Google, it doesn't work either. That's what I'm saying.
It's like how, when you have like a movement.Will [00:34:56]: There's confusion about the, like what you mean, like your intention is a little, uh. Yeah.Alessio [00:35:00]: It's like, yeah, I'm using, I'm using a term that like I didn't invent, but I'm kind of taking over, but like, there's just so much about that term already that it's hard to overcome. If that makes sense, because public schools is like, well, it's, it's hard to overcome.Will [00:35:14]: Public schools, you know, so there's the right solution to this, which is to specify more clearly what you mean. And I'm not expecting you to do that, but so the, the right interface to search is actually an LLM.Swyx [00:35:25]: Like you should be talking to an LLM about what you want and the LLM translates its knowledge of you or knowledge of what people usually mean into a query that Exa uses, which you have called autoprompts, right?Will [00:35:35]: Or, yeah, but it's like a very light version of that. And really it's just basically the right answer is it's the wrong interface and like very soon the interface to search and really to everything will be an LLM. And the LLM just has a full knowledge of you, right? So we're kind of building for that world. We're skating to where the puck is going to be. And so since we're moving to a world where like LLMs are the interface to everything, you should build a search engine that can handle complex LLM queries, queries that come from LLMs. Because you're probably too lazy, I'm too lazy too, to write like a whole paragraph explaining, okay, this is what I mean by this word. But an LLM is not lazy. And so like the LLM will spit out like a paragraph or more explaining exactly what it wants. You need a search engine that can handle that. Traditional search engines like Google or Bing, they're actually designed for humans typing keywords. If you give a paragraph to Google or Bing, they just completely fail. And so Exa can handle paragraphs and we want to be able to handle it more and more until it's like perfect.Alessio [00:36:24]: What about opinions? Do you have lists? When you think about the list product, do you think about just finding entries? Do you think about ranking entries? I'll give you a dumb example. So on Lindy, I've been building this bot that every week gives me like the top fantasy football waiver pickups. But every website has like different opinions. I'm like, you should pick up these five players, these five players. When you're making lists, do you want to be kind of like also ranking and like telling people what's best? Or like, are you mostly focused on just surfacing information?Will [00:36:56]: There's a really good distinction between filtering to like things that match your query and then ranking based on like what is like your preferences. And filtering is objective. It's like, does this document match what you asked for? Whereas ranking is more subjective. It's like, what is the best? Well, it depends what you mean by best, right? So first, first table stakes is let's get the filtering into a perfect place where you actually like every document matches what you asked for. No search engine can do that today. And then ranking, you know, there are all sorts of interesting ways to do that where like you maybe, you know, have the user like specify more clearly what they mean by best. You could do it. And if the user doesn't specify, you do your best, you do your best based on what people typically mean by best.
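The "LLM as the interface to search" idea reduces to query expansion: let a model turn a terse, ambiguous query into the explicit paragraph a neural search engine can act on. The prompt and `llm` call below are illustrative assumptions, not Exa's actual autoprompt implementation.

```python
# Sketch of expanding a lazy query into the detailed request the engine needs.
def expand_query(terse_query, user_profile, llm):
    prompt = (
        "Rewrite this search query as a detailed paragraph describing exactly "
        "what the user wants, resolving ambiguous terms using what you know about them.\n"
        f"User context: {user_profile}\n"
        f"Query: {terse_query}\n"
    )
    return llm(prompt)

# e.g. expand_query("learning in public", "software blogger, not interested in public schools", llm)
# should yield a paragraph about people writing publicly about what they learn,
# rather than results about public school policy.
```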
But ideally, like the user can specify, oh, when I mean best, I actually mean ranked by the, you know, the number of people who visited that site. Let's say is, is one example ranking or, oh, what I mean by best, let's say you're listing companies. What I mean by best is like the ones that have, uh, you know, have the most employees or something like that. Like there are all sorts of ways to rank a list of results that are not captured by something as subjective as best. Yeah. Yeah.Alessio [00:38:00]: I mean, it's like, who are the best NBA players in the history? It's like everybody has their own. Right.Will [00:38:06]: Right. But I mean, the, the, the search engine should definitely like, even if you don't specify it, it should do as good of a job as possible. Yeah. Yeah. No, no, totally. Yeah. Yeah. Yeah. Yeah. It's a new topic to people because we're not used to a search engine that can handle like a very complex ranking system. Like you think to type in best basketball players and not something more specific because you know, that's the only thing Google could handle. But if Google could handle like, oh, basketball players ranked by like number of shots scored on average per game, then you would do that. But you know, they can't do that. So.Swyx [00:38:32]: Yeah. That's fascinating. So you haven't used the word agents, but you're kind of building a search agent. Do you believe that that is agentic in feature? Do you think that term is distracting?Will [00:38:42]: I think it's a good term. I do think everything will eventually become agentic. And so then the term will lose power, but yes, like what we're building is agentic it in a sense that it takes actions. It decides when to go deeper into something, it has a loop, right? It feels different from traditional search, which is like an algorithm, not an agent. Ours is a combination of an algorithm and an agent.Swyx [00:39:05]: I think my reflection from seeing this in the coding space where there's basically sort of classic. Framework for thinking about this stuff is the self-driving levels of autonomy, right? Level one to five, typically the level five ones all failed because there's full autonomy and we're not, we're not there yet. And people like control. People like to be in the loop. So the, the, the level ones was co-pilot first and now it's like cursor and whatever. So I feel like if it's too agentic, it's too magical, like, like a, like a one shot, I stick a, stick a paragraph into the text box and then it spits it back to me. It might feel like I'm too disconnected from the process and I don't trust it. As opposed to something where I'm more intimately involved with the research product. I see. So like, uh, wait, so the earlier versions are, so if trying to stick to the example of the basketball thing, like best basketball player, but instead of best, you, you actually get to customize it with like, whatever the metric is that you, you guys care about. Yeah. I'm still not a basketballer, but, uh, but, but, you know, like, like B people like to be in my, my thesis is that agents level five agents failed because people like to. To kind of have drive assist rather than full self-driving.Will [00:40:15]: I mean, a lot of this has to do with how good agents are. 
Like at some point, if agents for coding are better than humans at all tests and then humans block, yeah, we're not there yet.Swyx [00:40:25]: So like in a world where we're not there yet, what you're pitching us is like, you're, you're kind of saying you're going all the way there. Like I kind of, I think all one is also very full, full self-driving. You don't get to see the plan. You don't get to affect the plan yet. You just fire off a query and then it goes away for a couple of minutes and comes back. Right. Which is effectively what you're saying you're going to do too. And you think there's.Will [00:40:42]: There's a, there's an in-between. I saw. Okay. So in building this product, we're exploring new interfaces because what does it mean to kick off a search that goes and takes 10 minutes? Like, is that a good interface? Because what if the search is actually wrong or it's not exactly, exactly specified to what you mean, which is why you get previews. Yeah. You get previews. So it is iterative, but ultimately once you've specified exactly what you mean, then you kind of do just want to kick off a batch job. Right. So perhaps what you're getting at is like, uh, there's this barrier with agents where you have to like explain the full context of what you mean, and a lot of failure modes happen when you have, when you don't. Yeah. There's failure modes from the agent, just not being smart enough. And then there's failure modes from the agent, not understanding exactly what you mean. And there's a lot of context that is shared between humans that is like lost between like humans and, and this like new creature.Alessio [00:41:32]: Yeah. Yeah. Because people don't know what's going on. I mean, to me, the best example of like system prompts is like, why are you writing? You're a helpful assistant. Like. Of course you should be an awful, but people don't yet know, like, can I assume that, you know, that, you know, it's like, why did the, and now people write, oh, you're a very smart software engineer, but like, you never made, you never make mistakes. Like, were you going to try and make mistakes before? So I think people don't yet have an understanding, like with, with driving people know what good driving is. It's like, don't crash, stay within kind of like a certain speed range. It's like, follow the directions. It's like, I don't really have to explain all of those things. I hope. But with. AI and like models and like search, people are like, okay, what do you actually know? What are like your assumptions about how search, how you're going to do search? And like, can I trust it? You know, can I influence it? So I think that's kind of the, the middle ground, like before you go ahead and like do all the search, it's like, can I see how you're doing it? And then maybe help show your work kind of like, yeah, steer you. Yeah. Yeah.Will [00:42:32]: No, I mean, yeah. Sure. Saying, even if you've crafted a great system prompt, you want to be part of the process itself. Uh, because the system prompt doesn't, it doesn't capture everything. Right. So yeah. A system prompt is like, you get to choose the person you work with. It's like, oh, like I want, I want a software engineer who thinks this way about code. But then even once you've chosen that person, you can't just give them a high level command and they go do it perfectly. You have to be part of that process. 
So yeah, I agree.Swyx [00:42:58]: Just a side note, my favorite system prompt programming anecdote now is the Apple Intelligence system prompt that someone, someone prompt injected it and seen it. And like the Apple Intelligence prompt has the words, like, please don't, don't hallucinate. And it's like, of course we don't want you to hallucinate. Right. Like, so it's exactly that, that what you're talking about, like we should train this behavior into the model, but somehow we still feel the need to inject it into the prompt. And I still don't even think that we are very scientific about it. Like, I think it's almost like cargo culting. Like we have this like magical, like turn around three times, throw salt over your shoulder before you do something. And like, it worked the last time. So let's just do it the same way now. And like, there's no science to this.Will [00:43:35]: I do think a lot of these problems might be ironed out in future versions. Right. So, and like, they might, they might hide the details from you. So it's like, they actually, all of them have a system prompt. That's like, you are a helpful assistant. You don't actually have to include it, even though it might actually be the way they've implemented it in the backend. It should be done in RLHF.Swyx [00:43:52]: Okay. Uh, one question I was just kind of curious about this episode is I'm going to try to frame this in terms of the general AI search wars, you know, you're, you're one player in that, um, there's Perplexity, ChatGPT Search, and Google, but there's also like the B2B side, uh, we had Drew Houston from Dropbox on, and he's competing with Glean, who've, uh, we've also had DD from, from Glean on. Is there an appetite for Exa for my company's documents?Will [00:44:19]: There is appetite, but I think we have to be disciplined, focused, disciplined. I mean, we're already taking on like perfect web search, which is a lot. Um, but I mean, ultimately we want to build a perfect search engine, which definitely for a lot of queries involves your, your personal information, your company's information. And so, yeah, I mean, the grandest vision of Exa is perfect search really over everything, every domain, you know, we're going to have an Exa satellite, uh, because, because satellites can gather information that, uh, is not available publicly. Uh, gotcha. Yeah.Alessio [00:44:51]: Can we talk about AGI? We never, we never talk about AGI, but you had, uh, this whole tweet about O1 being the biggest kind of like AI step function towards it. Why does it feel so important to you? I know there's kind of like always criticism and saying, hey, it's not the smartest, Sonnet is better, it's like, blah, blah, blah. But you choose to see what Ilya sees or Sam sees, what they will see.Will [00:45:13]: I've just, I've just, you know, been connecting the dots. I mean, this was the key thing that a bunch of labs were working on, which is like, can you create a reward signal? Can you teach yourself based on a reward signal? Whether you're, if you're trying to learn coding or math, if you could have one model, say, uh, be a grading system that says like you have successfully solved this programming assessment, and then one model, like, be the generative system that's like, here are a bunch of programming assessments, you could train on that. It's basically whenever you could create a reward signal for some task, you could just generate a bunch of tasks for yourself.
See that like, oh, on two of these thousand, you did well. And then you just train on that data. It's basically like, I mean, creating your own data for yourself and like, you know, all the labs were working on that. OpenAI built the most impressive product doing that. And it's just very, it's very easy now to see how that could like scale to just solving, like, like solving programming or solving mathematics, which sounds crazy, but everything about our world right now is crazy.Alessio [00:46:07]: Um, and so I think if you remove that whole, like, oh, that's impossible, and you just think really clearly about like, what's now possible with like what, what they've done with O1, it's easy to see how that scales. How do you think about older GPT models then? Should people still work on them? You know, if like, obviously they just had the new Haiku, like, is it even worth spending time, like, making these models better versus just, you know, Sam talked about O2 at that day. So obviously they're, they're spending a lot of time on it, but then you have maybe the GPU poor, which are still working on making Llama good. Uh, and then you have the follower labs that do not have an O1-like model out yet. Yeah.Will [00:46:47]: This kind of gets into like, uh, what will the ecosystem of, of models be like in the future? And is there room, is, is everything just gonna be O1-like models? I think, well, I mean, there's definitely a question of like inference speed and if certain things like O1 takes a long time, because that's the thing. Well, I mean, O1 is, is two things. It's like one, it's, it's bootstrapping itself. It's teaching itself. And so the base model is smarter. But then it also has this like inference time compute where it could like spend like many minutes or many hours thinking. And so even the base model, which is also fast, it doesn't have to take minutes, is, is better, smarter. I believe all models will be trained with this paradigm. Like you'll want to train on the best data, but there will be many different size models from different, very many different like companies, I believe. Yeah. Because like, I don't, yeah, I mean, it's hard, hard to predict, but I don't think OpenAI is going to dominate like every possible LLM for every possible use case. I think for a lot of things, like you just want the fastest model and that might not involve O1 methods at all.Swyx [00:47:42]: I would say if you were to take the Exa being O1 for search literally, you really need to prioritize search trajectories, like almost maybe paying a bunch of grad students to go research things. And then you kind of track what they search and what the sequence of searching is, because it seems like that is the gold mine here, like the chain of thought or the thinking trajectory. Yeah.Will [00:48:05]: When it comes to search, I've always been skeptical. I've always been skeptical of human labeled data. Okay. Yeah, please. We tried something at our company at Exa recently where me and a bunch of engineers on the team like labeled a bunch of queries and it was really hard. Like, you know, you have all these niche queries and you're looking at a bunch of results and you're trying to identify which is matched to the query. It's talking about, you know, the intricacies of like some biological experiment or something. I have no idea. Like, I don't know what matches and what, what labelers like me tend to do is just match by keyword. I'm like, oh, I don't know.
Oh, like this document matches a bunch of keywords, so it must be good. But then you're actually completely missing the meaning of the document. Whereas an LLM like GPT-4 is really good at labeling. And so I actually think like you just, we get by, which we are right now doing, using like LLM
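The labeling approach Will lands on, using an LLM judge instead of keyword-matching human labelers, can be sketched like this; the judge prompt and the `llm` call are assumptions for illustration, not Exa's actual evaluation harness.

```python
# Sketch of LLM-judged relevance labels for query/document pairs.
def judge_relevance(query, document, llm):
    prompt = (
        "Does this document genuinely satisfy the query, beyond sharing keywords? "
        "Answer strictly YES or NO.\n"
        f"Query: {query}\n"
        f"Document: {document[:2000]}\n"
    )
    return llm(prompt).strip().upper().startswith("YES")

def label_eval_set(pairs, llm):
    # pairs: list of (query, document); returns binary relevance labels for evals.
    return [(q, d, judge_relevance(q, d, llm)) for q, d in pairs]
```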

Machine Learning Street Talk
Francois Chollet - ARC reflections - NeurIPS 2024

Machine Learning Street Talk

Play Episode Listen Later Jan 9, 2025 86:46


François Chollet discusses the outcomes of the ARC-AGI (Abstraction and Reasoning Corpus) Prize competition in 2024, where accuracy rose from 33% to 55.5% on a private evaluation set. SPONSOR MESSAGES: *** CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. https://centml.ai/pricing/ Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. Are you interested in working on reasoning, or getting involved in their events? They are hosting an event in Zurich on January 9th with the ARChitects, join if you can. Goto https://tufalabs.ai/ *** Read about the recent result on o3 with ARC here (Chollet knew about it at the time of the interview but wasn't allowed to say): https://arcprize.org/blog/oai-o3-pub-breakthrough TOC: 1. Introduction and Opening [00:00:00] 1.1 Deep Learning vs. Symbolic Reasoning: François's Long-Standing Hybrid View [00:00:48] 1.2 “Why Do They Call You a Symbolist?” – Addressing Misconceptions [00:01:31] 1.3 Defining Reasoning 3. ARC Competition 2024 Results and Evolution [00:07:26] 3.1 ARC Prize 2024: Reflecting on the Narrative Shift Toward System 2 [00:10:29] 3.2 Comparing Private Leaderboard vs. Public Leaderboard Solutions [00:13:17] 3.3 Two Winning Approaches: Deep Learning–Guided Program Synthesis and Test-Time Training 4. Transduction vs. Induction in ARC [00:16:04] 4.1 Test-Time Training, Overfitting Concerns, and Developer-Aware Generalization [00:19:35] 4.2 Gradient Descent Adaptation vs. Discrete Program Search 5. ARC-2 Development and Future Directions [00:23:51] 5.1 Ensemble Methods, Benchmark Flaws, and the Need for ARC-2 [00:25:35] 5.2 Human-Level Performance Metrics and Private Test Sets [00:29:44] 5.3 Task Diversity, Redundancy Issues, and Expanded Evaluation Methodology 6. Program Synthesis Approaches [00:30:18] 6.1 Induction vs. Transduction [00:32:11] 6.2 Challenges of Writing Algorithms for Perceptual vs. Algorithmic Tasks [00:34:23] 6.3 Combining Induction and Transduction [00:37:05] 6.4 Multi-View Insight and Overfitting Regulation 7. Latent Space and Graph-Based Synthesis [00:38:17] 7.1 Clément Bonnet's Latent Program Search Approach [00:40:10] 7.2 Decoding to Symbolic Form and Local Discrete Search [00:41:15] 7.3 Graph of Operators vs. Token-by-Token Code Generation [00:45:50] 7.4 Iterative Program Graph Modifications and Reusable Functions 8. Compute Efficiency and Lifelong Learning [00:48:05] 8.1 Symbolic Process for Architecture Generation [00:50:33] 8.2 Logarithmic Relationship of Compute and Accuracy [00:52:20] 8.3 Learning New Building Blocks for Future Tasks 9. AI Reasoning and Future Development [00:53:15] 9.1 Consciousness as a Self-Consistency Mechanism in Iterative Reasoning [00:56:30] 9.2 Reconciling Symbolic and Connectionist Views [01:00:13] 9.3 System 2 Reasoning - Awareness and Consistency [01:03:05] 9.4 Novel Problem Solving, Abstraction, and Reusability 10. Program Synthesis and Research Lab [01:05:53] 10.1 François Leaving Google to Focus on Program Synthesis [01:09:55] 10.2 Democratizing Programming and Natural Language Instruction 11. Frontier Models and O1 Architecture [01:14:38] 11.1 Search-Based Chain of Thought vs. Standard Forward Pass [01:16:55] 11.2 o1's Natural Language Program Generation and Test-Time Compute Scaling [01:19:35] 11.3 Logarithmic Gains with Deeper Search 12. 
ARC Evaluation and Human Intelligence [01:22:55] 12.1 LLMs as Guessing Machines and Agent Reliability Issues [01:25:02] 12.2 ARC-2 Human Testing and Correlation with g-Factor [01:26:16] 12.3 Closing Remarks and Future Directions SHOWNOTES PDF: https://www.dropbox.com/scl/fi/ujaai0ewpdnsosc5mc30k/CholletNeurips.pdf?rlkey=s68dp432vefpj2z0dp5wmzqz6&st=hazphyx5&dl=0

Let's Talk AI
#195 - OpenAI o3 & for-profit, DeepSeek-V3, Latent Space

Let's Talk AI

Play Episode Listen Later Jan 5, 2025 99:05 Transcription Available


Our 195th episode with a summary and discussion of last week's* big AI news! *and sometimes last last week's  Recorded on 01/04/2024 Join our brand new Discord here! https://discord.gg/wDQkratW Note: apologies for Andrey's slurred speech and the jumpy editing, will be back to normal next week! Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai Read out our text newsletter and comment on the podcast at https://lastweekin.ai/. Sponsors: The Generator - An interdisciplinary AI lab empowering innovators from all fields to bring visionary ideas to life by harnessing the capabilities of artificial intelligence. In this episode: - OpenAI teases new deliberative alignment techniques in its O3 model, showcasing major improvements in reasoning benchmarks, whilst surprising with autonomy in hacks against chess engines.  - Microsoft and OpenAI continue to wrangle over the terms of their partnership, highlighting tensions amid OpenAI's shift towards a for-profit model.  - Chinese AI companies like DeepSeek and Quen release advanced open-source models, presenting significant contributions to AI capabilities and performance optimization.  - Sakana AI introduces innovative applications of AI to the search for artificial life, emphasizing the potential and curiosity-driven outcomes of open-ended learning and exploration. If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form. Timestamps + Links: (00:00:00) Intro / Banter (00:03:07) News Preview (00:03:54) Response to listener comments (00:05:00) Sponsor Break Tools & Apps (00:06:11) OpenAI announces new o3 model (00:21:17) Alibaba slashes prices on large language models by up to 85% as China AI rivalry heats up (00:23:04) ElevenLabs launches Flash, its fastest text-to-speech AI yet Applications & Business (00:24:24) OpenAI announces plan to transform into a for-profit company (00:33:17) Microsoft and OpenAI Wrangle Over Terms of Their Blockbuster Partnership (00:37:36) Elon Musk's xAI gets investment from Nvidia in recent funding round: report (00:39:43) Sam Altman's nuclear energy startup signs one of the largest nuclear power deals to date (00:41:13) OpenAI Search Leader Departs After Less Than a Year (00:42:43) Senior OpenAI Researcher Radford Departs Projects & Open Source (00:45:21) DeepSeek-AI Just Released DeepSeek-V3: A Strong Mixture-of-Experts (MoE) Language Model with 671B Total Parameters with 37B Activated for Each Token (00:54:14) Qwen Team Releases QvQ: An Open-Weight Model for Multimodal Reasoning (00:58:09) LightOn and Answer.ai Releases ModernBERT: A New Model Series that is a Pareto Improvement over BERT with both Speed and Accuracy Research & Advancements (01:00:31) Deliberation in Latent Space via Differentiable Cache Augmentation (01:05:14) Automating the Search for Artificial Life with Foundation Models Policy & Safety (01:10:27) Nonprofit group joins Elon Musk's effort to block OpenAI's for-profit transition (01:14:35) OpenAI Researchers Propose 'Deliberative Alignment' : A Training Approach that Teaches LLMs to Explicitly Reason through Safety Specifications before Producing an Answer (01:22:06) o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed. 
(01:27:22) Elon Musk's xAI supercomputer gets 150MW power boost despite concerns over grid impact and local power stability (01:29:06) DOE: Data centers consumed 4.4% of US power in 2023, could hit 12% by 2028 Synthetic Media & Art (01:32:20) OpenAI failed to deliver the opt-out tool it promised by 2025 (01:36:15) Outro

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Applications for the NYC AI Engineer Summit, focused on Agents at Work, are open!When we first started Latent Space, in the lightning round we'd always ask guests: “What's your favorite AI product?”. The majority would say Midjourney. The simple UI of prompt → very aesthetic image turned it into a $300M+ ARR bootstrapped business as it rode the first wave of AI image generation.In open source land, StableDiffusion was congregating around AUTOMATIC1111 as the de-facto web UI. Unlike Midjourney, which offered some flags but was mostly prompt-driven, A1111 let users play with a lot more parameters, supported additional modalities like img2img, and allowed users to load in custom models. If you're interested in some of the SD history, you can look at our episodes with Lexica, Replicate, and Playground.One of the people involved with that community was comfyanonymous, who was also part of the Stability team in 2023, decided to build an alternative called ComfyUI, now one of the fastest growing open source projects in generative images, and is now the preferred partner for folks like Black Forest Labs's Flux Tools on Day 1. The idea behind it was simple: “Everyone is trying to make easy to use interfaces. Let me try to make a powerful interface that's not easy to use.”Unlike its predecessors, ComfyUI does not have an input text box. Everything is based around the idea of a node: there's a text input node, a CLIP node, a checkpoint loader node, a KSampler node, a VAE node, etc. While daunting for simple image generation, the tool is amazing for more complex workflows since you can break down every step of the process, and then chain many of them together rather than manually switching between tools. You can also re-start execution halfway instead of from the beginning, which can save a lot of time when using larger models.To give you an idea of some of the new use cases that this type of UI enables:* Sketch something → Generate an image with SD from sketch → feed it into SD Video to animate* Generate an image of an object → Turn into a 3D asset → Feed into interactive experiences* Input audio → Generate audio-reactive videosTheir Examples page also includes some of the more common use cases like AnimateDiff, etc. They recently launched the Comfy Registry, an online library of different nodes that users can pull from rather than having to build everything from scratch. The project has >60,000 Github stars, and as the community grows, some of the projects that people build have gotten quite complex:The most interesting thing about Comfy is that it's not a UI, it's a runtime. You can build full applications on top of image models simply by using Comfy. You can expose Comfy workflows as an endpoint and chain them together just like you chain a single node. We're seeing the rise of AI Engineering applied to art.Major Tom's ComfyUI Resources from the Latent Space DiscordMajor shoutouts to Major Tom on the LS Discord who is a image generation expert, who offered these pointers:* “best thing about comfy is the fact it supports almost immediately every new thing that comes out - unlike A1111 or forge, which still don't support flux cnet for instance. 
Major Tom's ComfyUI Resources from the Latent Space Discord

Major shoutouts to Major Tom on the LS Discord, an image generation expert, who offered these pointers:

* "The best thing about Comfy is the fact it supports almost immediately every new thing that comes out - unlike A1111 or Forge, which still don't support the Flux ControlNet, for instance. It will be the perfect tool once conflicting nodes are resolved."
* AP Workflows from Alessandro Perilli are a nice example of an all-in-one train-evaluate-generate system built atop Comfy
* ComfyUI YouTubers to learn from:
  * @sebastiankamph
  * @NerdyRodent
  * @OlivioSarikas
  * @sedetweiler
  * @pixaroma
* ComfyUI nodes to check out:
  * https://github.com/kijai/ComfyUI-IC-Light
  * https://github.com/MrForExample/ComfyUI-3D-Pack
  * https://github.com/PowerHouseMan/ComfyUI-AdvancedLivePortrait
  * https://github.com/pydn/ComfyUI-to-Python-Extension
  * https://github.com/THtianhao/ComfyUI-Portrait-Maker
  * https://github.com/ssitu/ComfyUI_NestedNodeBuilder
  * https://github.com/longgui0318/comfyui-magic-clothing
  * https://github.com/atmaranto/ComfyUI-SaveAsScript
  * https://github.com/ZHO-ZHO-ZHO/ComfyUI-InstantID
  * https://github.com/AIFSH/ComfyUI-FishSpeech
  * https://github.com/coolzilj/ComfyUI-Photopea
  * https://github.com/lks-ai/anynode
* Sarav: https://www.youtube.com/@mickmumpitz/videos (applied stuff)
* Sarav: https://www.youtube.com/@latentvision (technical, but infrequent)
* Look for a ComfyUI node for https://github.com/magic-quill/MagicQuill
* "Comfy for Video" resources:
  * Kijai (https://github.com/kijai) pushing out support for Mochi, CogVideoX, AnimateDiff, LivePortrait, etc.
  * ComfyUI node support like LTX (https://github.com/Lightricks/ComfyUI-LTXVideo) and HunyuanVideo
  * FloraFauna AI
* Communities: https://www.reddit.com/r/StableDiffusion/, https://www.reddit.com/r/comfyui/

Full YouTube Episode

As usual, you can find the full video episode on our YouTube (and don't forget to like and subscribe!)

Timestamps

* 00:00:04 Introduction of hosts and anonymous guest
* 00:00:35 Origins of ComfyUI and early Stable Diffusion landscape
* 00:02:58 Comfy's background and development of high-res fix
* 00:05:37 Area conditioning and compositing in image generation
* 00:07:20 Discussion on different AI image models (SD, Flux, etc.)
* 00:11:10 Closed source model APIs and community discussions on SD versions
* 00:14:41 LoRAs and textual inversion in image generation
* 00:18:43 Evaluation methods in the Comfy community
* 00:20:05 CLIP models and text encoders in image generation
* 00:23:05 Prompt weighting and negative prompting
* 00:26:22 ComfyUI's unique features and design choices
* 00:31:00 Memory management in ComfyUI
* 00:33:50 GPU market share and compatibility issues
* 00:35:40 Node design and parameter settings in ComfyUI
* 00:38:44 Custom nodes and community contributions
* 00:41:40 Video generation models and capabilities
* 00:44:47 ComfyUI's development timeline and rise to popularity
* 00:48:13 Current state of ComfyUI team and future plans
* 00:50:11 Discussion on other Comfy startups and potential text generation support

Transcript

Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI. swyx [00:00:12]: Hey everyone, we are in the Chroma Studio again, but with our first ever anonymous guest, Comfy Anonymous, welcome. Comfy [00:00:19]: Hello. swyx [00:00:21]: I feel like that's your full name, you just go by Comfy, right? Comfy [00:00:24]: Yeah, well, a lot of people just call me Comfy, even when they know my real name. Hey, Comfy. Alessio [00:00:32]: Swyx is the same. You know, not a lot of people call you Shawn. swyx [00:00:35]: Yeah, you have a professional name, right, that people know you by, and then you have a legal name. Yeah, it's fine. How do I phrase this?
I think people who are in the know, know that Comfy is like the tool for image generation and now other multimodality stuff. I would say that when I first got started with Stable Diffusion, the star of the show was Automatic 111, right? And I actually looked back at my notes from 2022-ish, like Comfy was already getting started back then, but it was kind of like the up and comer, and your main feature was the flowchart. Can you just kind of rewind to that moment, that year and like, you know, how you looked at the landscape there and decided to start Comfy?Comfy [00:01:10]: Yeah, I discovered Stable Diffusion in 2022, in October 2022. And, well, I kind of started playing around with it. Yes, I, and back then I was using Automatic, which was what everyone was using back then. And so I started with that because I had, it was when I started, I had no idea like how Diffusion works. I didn't know how Diffusion models work, how any of this works, so.swyx [00:01:36]: Oh, yeah. What was your prior background as an engineer?Comfy [00:01:39]: Just a software engineer. Yeah. Boring software engineer.swyx [00:01:44]: But like any, any image stuff, any orchestration, distributed systems, GPUs?Comfy [00:01:49]: No, I was doing basically nothing interesting. Crud, web development? Yeah, a lot of web development, just, yeah, some basic, maybe some basic like automation stuff. Okay. Just. Yeah, no, like, no big companies or anything.swyx [00:02:08]: Yeah, but like already some interest in automations, probably a lot of Python.Comfy [00:02:12]: Yeah, yeah, of course, Python. But I wasn't actually used to like the Node graph interface before I started Comfy UI. It was just, I just thought it was like, oh, like, what's the best way to represent the Diffusion process in the user interface? And then like, oh, well. Well, like, naturally, oh, this is the best way I've found. And this was like with the Node interface. So how I got started was, yeah, so basic October 2022, just like I hadn't written a line of PyTorch before that. So it's completely new. What happened was I kind of got addicted to generating images.Alessio [00:02:58]: As we all did. Yeah.Comfy [00:03:00]: And then I started. I started experimenting with like the high-res fixed in auto, which was for those that don't know, the high-res fix is just since the Diffusion models back then could only generate that low-resolution. So what you would do, you would generate low-resolution image, then upscale, then refine it again. And that was kind of the hack to generate high-resolution images. I really liked generating. Like higher resolution images. So I was experimenting with that. And so I modified the code a bit. Okay. What happens if I, if I use different samplers on the second pass, I was edited the code of auto. So what happens if I use a different sampler? What happens if I use a different, like a different settings, different number of steps? And because back then the. The high-res fix was very basic, just, so. Yeah.swyx [00:04:05]: Now there's a whole library of just, uh, the upsamplers.Comfy [00:04:08]: I think, I think they added a bunch of, uh, of options to the high-res fix since, uh, since, since then. But before that was just so basic. So I wanted to go further. I wanted to try it. What happens if I use a different model for the second, the second pass? And then, well, then the auto code base was, wasn't good enough for. Like, it would have been, uh, harder to implement that in the auto interface than to create my own interface. 
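For readers who have not seen the trick, here is a rough sketch of the two-pass high-res fix Comfy describes, written against Hugging Face diffusers as a stand-in for A1111 or Comfy; the checkpoint id, resolutions, step counts, and strength value are illustrative assumptions, not his exact settings.

```python
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"   # assumed SD 1.5 checkpoint
prompt = "a cabin in the mountains, golden hour"

# Pass 1: generate at the model's native low resolution.
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
low_res = txt2img(prompt, height=512, width=512, num_inference_steps=25).images[0]

# Upscale (a plain resize here; A1111 later added many dedicated upscalers).
upscaled = low_res.resize((1024, 1024))

# Pass 2: re-run diffusion over the upscaled image to add detail. Swapping in a different
# model, sampler, or step count here is exactly the experiment Comfy is describing.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
high_res = img2img(prompt=prompt, image=upscaled, strength=0.5, num_inference_steps=25).images[0]
high_res.save("highres_fix.png")
```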
So that's when I decided to create my own. And you were doing that mostly on your own when you started, or did you already have kind of like a subgroup of people? No, I was, uh, on my own because, because it was just me experimenting with stuff. So yeah, that was it. Then, so I started writing the code January 1st, 2023, and then I released the first version on GitHub, January 16th, 2023. That's how things got started. Alessio [00:05:11]: And what's, what's the name? ComfyUI right away or? Yeah. Comfy [00:05:14]: ComfyUI. The reason the name, my name is Comfy is people thought my pictures were comfy, so I just, uh, just named it, uh, uh, it's my ComfyUI. So yeah, that's, uh, swyx [00:05:27]: Is there a particular segment of the community that you targeted as users? Like more intensive workflow artists, you know, compared to the automatic crowd or, you know, Comfy [00:05:37]: This was my way of like experimenting with, uh, with new things, like the high-res fix thing I mentioned, which was like in Comfy, the first thing you could easily do was just chain different models together. And then one of the first things, I think the first times it got a bit of popularity was when I started experimenting with, like, applying prompts to different areas of the image. Yeah. I called it area conditioning, posted it on Reddit and it got a bunch of upvotes. So I think that's when, like, when people first learned of ComfyUI. swyx [00:06:17]: Is that mostly like fixing hands? Comfy [00:06:19]: Uh, no, no, no. That was just, uh, like, let's say, well, it was very, well, it still is kind of difficult to like, let's say you want a mountain, you have an image and then, okay, I'm like, okay, I want the mountain here and I want the, like a, a fox here. swyx [00:06:37]: Yeah. So compositing the image. Yeah. Comfy [00:06:40]: My way was very easy. It was just like, oh, when you run the diffusion process, you kind of generate, okay, you do pass one pass through the diffusion, every step you do one pass. Okay, this part of the image with this prompt, this part of the image with the other prompt, and then the entire image with another prompt, and then just average everything together, every step, and that was, uh, area composition, which I call it. And then, then a month later, there was a paper that came out called MultiDiffusion, which was the same thing, but yeah, that's, uh, Alessio [00:07:20]: Could you do area composition with different models or, because you're averaging out, you kind of need the same model? Comfy [00:07:26]: Could do it with, but yeah, I hadn't implemented it for different models, but, uh, you, you can do it with, uh, with different models if you want, as long as the models share the same latent space, like we, we're supposed to ring a bell every time someone says, yeah, like, for example, you couldn't use, like, SDXL and SD 1.5, because those have a different latent space, but like, uh, yeah, like SD 1.5 models, different ones, you could, you could do that. swyx [00:07:59]: There's some models that try to work in pixel space, right? Comfy [00:08:03]: Yeah. They're very slow. Of course. That's the problem. That, that's the, the reason why stable diffusion actually became like popular, like, cause was because of the latent space. swyx [00:08:14]: Small and yeah. Because it used to be latent diffusion models and then they trained it up. Comfy [00:08:19]: Yeah. Cause pixel diffusion models are just too slow. So.
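A minimal sketch of the per-step averaging behind area composition as described above. Here `noise_pred` stands in for whatever model call your sampler makes; this is an illustration of the idea, not ComfyUI's actual implementation.

```python
import torch

def area_composition_step(x, t, noise_pred, full_cond, area_conds):
    """One denoising step of the area-composition trick described above.

    x          : current latent [B, C, H, W]
    noise_pred : stand-in callable (latent, timestep, conditioning) -> predicted noise
    full_cond  : conditioning for the whole image
    area_conds : list of (conditioning, mask), mask shaped [1, 1, H, W] with 1s inside the area
    """
    pred = noise_pred(x, t, full_cond)        # one pass for the full-image prompt
    count = torch.ones_like(x)                # how many predictions cover each latent position
    for cond, mask in area_conds:
        pred = pred + noise_pred(x, t, cond) * mask   # one extra pass per area prompt
        count = count + mask
    return pred / count                       # average everything together, every step
```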
Yeah.swyx [00:08:25]: Have you ever tried to talk to like, like stability, the latent diffusion guys, like, you know, Robin Rombach, that, that crew. Yeah.Comfy [00:08:32]: Well, I used to work at stability.swyx [00:08:34]: Oh, I actually didn't know. Yeah.Comfy [00:08:35]: I used to work at stability. I got, uh, I got hired, uh, in June, 2023.swyx [00:08:42]: Ah, that's the part of the story I didn't know about. Okay. Yeah.Comfy [00:08:46]: So the, the reason I was hired is because they were doing, uh, SDXL at the time and they were basically SDXL. I don't know if you remember it was a base model and then a refiner model. Basically they wanted to experiment, like chaining them together. And then, uh, they saw, oh, right. Oh, this, we can use this to do that. Well, let's hire that guy.swyx [00:09:10]: But they didn't, they didn't pursue it for like SD3. What do you mean? Like the SDXL approach. Yeah.Comfy [00:09:16]: The reason for that approach was because basically they had two models and then they wanted to publish both of them. So they, they trained one on. Lower time steps, which was the refiner model. And then they, the first one was trained normally. And then they went during their test, they realized, oh, like if we string these models together are like quality increases. So let's publish that. It worked. Yeah. But like right now, I don't think many people actually use the refiner anymore, even though it is actually a full diffusion model. Like you can use it on its own. And it's going to generate images. I don't think anyone, people have mostly forgotten about it. But, uh.Alessio [00:10:05]: Can we talk about models a little bit? So stable diffusion, obviously is the most known. I know flux has gotten a lot of traction. Are there any underrated models that people should use more or what's the state of the union?Comfy [00:10:17]: Well, the, the latest, uh, state of the art, at least, yeah, for images there's, uh, yeah, there's flux. There's also SD3.5. SD3.5 is two models. There's a, there's a small one, 2.5B and there's the bigger one, 8B. So it's, it's smaller than flux. So, and it's more, uh, creative in a way, but flux, yeah, flux is the best. People should give SD3.5 a try cause it's, uh, it's different. I won't say it's better. Well, it's better for some like specific use cases. Right. If you want some to make something more like creative, maybe SD3.5. If you want to make something more consistent and flux is probably better.swyx [00:11:06]: Do you ever consider supporting the closed source model APIs?Comfy [00:11:10]: Uh, well, they, we do support them as custom nodes. We actually have some, uh, official custom nodes from, uh, different. Ideogram.swyx [00:11:20]: Yeah. I guess DALI would have one. Yeah.Comfy [00:11:23]: That's, uh, it's just not, I'm not the person that handles that. Sure.swyx [00:11:28]: Sure. Quick question on, on SD. There's a lot of community discussion about the transition from SD1.5 to SD2 and then SD2 to SD3. People still like, you know, very loyal to the previous generations of SDs?Comfy [00:11:41]: Uh, yeah. SD1.5 then still has a lot of, a lot of users.swyx [00:11:46]: The last based model.Comfy [00:11:49]: Yeah. Then SD2 was mostly ignored. It wasn't, uh, it wasn't a big enough improvement over the previous one. Okay.swyx [00:11:58]: So SD1.5, SD3, flux and whatever else. SDXL. SDXL.Comfy [00:12:03]: That's the main one. Stable cascade. Stable cascade. That was a good model. 
But, uh, that's, uh, the problem with that one is, uh, it got, uh, like SD3 was announced one week after. Yeah.swyx [00:12:16]: It was like a weird release. Uh, what was it like inside of stability actually? I mean, statute of limitations. Yeah. The statute of limitations expired. You know, management has moved. So it's easier to talk about now. Yeah.Comfy [00:12:27]: And inside stability, actually that model was ready, uh, like three months before, but it got, uh, stuck in, uh, red teaming. So basically the product, if that model had released or was supposed to be released by the authors, then it would probably have gotten very popular since it's a, it's a step up from SDXL. But it got all of its momentum stolen. It got stolen by the SD3 announcement. So people kind of didn't develop anything on top of it, even though it's, uh, yeah. It was a good model, at least, uh, completely mostly ignored for some reason. Likeswyx [00:13:07]: I think the naming as well matters. It seemed like a branch off of the main, main tree of development. Yeah.Comfy [00:13:15]: Well, it was different researchers that did it. Yeah. Yeah. Very like, uh, good model. Like it's the Worcestershire authors. I don't know if I'm pronouncing it correctly. Yeah. Yeah. Yeah.swyx [00:13:28]: I actually met them in Vienna. Yeah.Comfy [00:13:30]: They worked at stability for a bit and they left right after the Cascade release.swyx [00:13:35]: This is Dustin, right? No. Uh, Dustin's SD3. Yeah.Comfy [00:13:38]: Dustin is a SD3 SDXL. That's, uh, Pablo and Dome. I think I'm pronouncing his name correctly. Yeah. Yeah. Yeah. Yeah. That's very good.swyx [00:13:51]: It seems like the community is very, they move very quickly. Yeah. Like when there's a new model out, they just drop whatever the current one is. And they just all move wholesale over. Like they don't really stay to explore the full capabilities. Like if, if the stable cascade was that good, they would have AB tested a bit more. Instead they're like, okay, SD3 is out. Let's go. You know?Comfy [00:14:11]: Well, I find the opposite actually. The community doesn't like, they only jump on a new model when there's a significant improvement. Like if there's a, only like a incremental improvement, which is what, uh, most of these models are going to have, especially if you, cause, uh, stay the same parameter count. Yeah. Like you're not going to get a massive improvement, uh, into like, unless there's something big that, that changes. So, uh. Yeah.swyx [00:14:41]: And how are they evaluating these improvements? Like, um, because there's, it's a whole chain of, you know, comfy workflows. Yeah. How does, how does one part of the chain actually affect the whole process?Comfy [00:14:52]: Are you talking on the model side specific?swyx [00:14:54]: Model specific, right? But like once you have your whole workflow based on a model, it's very hard to move.Comfy [00:15:01]: Uh, not, well, not really. Well, it depends on your, uh, depends on their specific kind of the workflow. Yeah.swyx [00:15:09]: So I do a lot of like text and image. Yeah.Comfy [00:15:12]: When you do change, like most workflows are kind of going to be complete. Yeah. It's just like, you might have to completely change your prompt completely change. Okay.swyx [00:15:24]: Well, I mean, then maybe the question is really about evals. Like what does the comfy community do for evals? Just, you know,Comfy [00:15:31]: Well, that they don't really do that. It's more like, oh, I think this image is nice. 
So that's, uh,swyx [00:15:38]: They just subscribe to Fofr AI and just see like, you know, what Fofr is doing. Yeah.Comfy [00:15:43]: Well, they just, they just generate like it. Like, I don't see anyone really doing it. Like, uh, at least on the comfy side, comfy users, they, it's more like, oh, generate images and see, oh, this one's nice. It's like, yeah, it's not, uh, like the, the more, uh, like, uh, scientific, uh, like, uh, like checking that's more on specifically on like model side. If, uh, yeah, but there is a lot of, uh, vibes also, cause it is a like, uh, artistic, uh, you can create a very good model that doesn't generate nice images. Cause most images on the internet are ugly. So if you, if that's like, if you just, oh, I have the best model at 10th giant, it's super smart. I created on all the, like I've trained on just all the images on the internet. The images are not going to look good. So yeah.Alessio [00:16:42]: Yeah.Comfy [00:16:43]: They're going to be very consistent. But yeah. People like, it's not going to be like the, the look that people are going to be expecting from, uh, from a model. So. Yeah.swyx [00:16:54]: Can we talk about LoRa's? Cause we thought we talked about models then like the next step is probably LoRa's. Before, I actually, I'm kind of curious how LoRa's entered the tool set of the image community because the LoRa paper was 2021. And then like, there was like other methods like textual inversion that was popular at the early SD stage. Yeah.Comfy [00:17:13]: I can't even explain the difference between that. Yeah. Textual inversions. That's basically what you're doing is you're, you're training a, cause well, yeah. Stable diffusion. You have the diffusion model, you have text encoder. So basically what you're doing is training a vector that you're going to pass to the text encoder. It's basically you're training a new word. Yeah.swyx [00:17:37]: It's a little bit like representation engineering now. Yeah.Comfy [00:17:40]: Yeah. Basically. Yeah. You're just, so yeah, if you know how like the text encoder works, basically you have, you take your, your words of your product, you convert those into tokens with the tokenizer and those are converted into vectors. Basically. Yeah. Each token represents a different vector. So each word presents a vector. And those, depending on your words, that's the list of vectors that get passed to the text encoder, which is just. Yeah. Yeah. I'm just a stack of, of attention. Like basically it's a very close to LLM architecture. Yeah. Yeah. So basically what you're doing is just training a new vector. We're saying, well, I have all these images and I want to know which word does that represent? And it's going to get like, you train this vector and then, and then when you use this vector, it hopefully generates. Like something similar to your images. Yeah.swyx [00:18:43]: I would say it's like surprisingly sample efficient in picking up the concept that you're trying to train it on. Yeah.Comfy [00:18:48]: Well, people have kind of stopped doing that even though back as like when I was at Stability, we, we actually did train internally some like textual versions on like T5 XXL actually worked pretty well. But for some reason, yeah, people don't use them. 
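A compressed sketch of the textual inversion idea described above: only one embedding vector (the "new word") is trained while the text encoder, UNet, and VAE stay frozen. Every helper name here (embed_prompt, text_encoder, denoising_loss, dataloader, placeholder_index) is a stand-in rather than a real library API, and the 768-dim size assumes a CLIP-L style encoder.

```python
import torch

new_word = torch.nn.Parameter(torch.randn(768) * 0.01)   # the single vector being learned
opt = torch.optim.AdamW([new_word], lr=5e-3)

for images in dataloader:                                  # the handful of concept images
    embs = embed_prompt("a photo of <new-word>")            # [seq_len, 768] frozen token embeddings
    i = placeholder_index                                    # position of the <new-word> token
    embs = torch.cat([embs[:i], new_word[None], embs[i + 1:]])   # splice the learned vector in
    cond = text_encoder(embs[None])                          # frozen CLIP-style text encoder
    loss = denoising_loss(unet, vae, images, cond)           # frozen UNet/VAE, standard SD objective
    loss.backward()
    opt.step()
    opt.zero_grad()
```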
And also they might also work, like, like, yeah, this is something I'd probably have to test, but maybe if you train a textual inversion, like on T5 XXL, it might also work with all the other models that use T5 XXL, because same thing with like, like the textual inversions that, that were trained for SD 1.5, they also kind of work on SDXL because SDXL has the, has two text encoders. And one of them is the same as the, as the SD 1.5 CLIP-L. So those, they actually would, they don't work as strongly because they're only applied to one of the text encoders. But, and the same thing for SD3. SD3 has three text encoders. So it works. It's still, you can still use your textual inversion for SD 1.5 on SD3, but it's just a lot weaker because now there's three text encoders, so it gets even more diluted. Yeah. swyx [00:20:05]: Do people experiment a lot on, just on the CLIP side, there's like SigLIP, there's BLIP, like do people experiment a lot on those? Comfy [00:20:12]: You can't really replace. Yeah. swyx [00:20:14]: Because they're trained together, right? Yeah. Comfy [00:20:15]: They're trained together. So you can't like, well, what I've seen people experimenting with is Long CLIP. So basically someone fine-tuned the CLIP model to accept longer prompts. swyx [00:20:27]: Oh, it's kind of like long context fine-tuning. Yeah. Comfy [00:20:31]: So, so like it's, it's actually supported in core Comfy. swyx [00:20:35]: How long is long? Comfy [00:20:36]: Regular CLIP is 77 tokens. Yeah. Long CLIP is 256. Okay. So, but the hack that, like, if you use stable diffusion 1.5, you've probably noticed, oh, it still works if I, if I use long prompts, prompts longer than 77 words. Well, that's because the hack is to just, well, you split, you split it up in chunks of 77, your whole big prompt. Let's say you, you give it like a massive text, like the Bible or something, and it would split it up in chunks of 77 and then just pass each one through the CLIP and then just concatenate everything together at the end. It's not ideal, but it actually works. swyx [00:21:26]: Like the positioning of the words really, really matters then, right? Like this is why order matters in prompts. Yeah. Comfy [00:21:33]: Yeah. Like it, it works, but it's, it's not ideal, but it's what people expect. Like if, if someone gives a huge prompt, they expect at least some of the concepts at the end to be, like, present in the image. But usually when they give long prompts, they, they don't, they like, they don't expect, like, detail, I think. So that's why it works very well. swyx [00:21:58]: And while we're on this topic, prompt weighting, negative prompts, negative prompting, all, all sort of similar parts of this layer of the stack. Yeah. Comfy [00:22:05]: The, the hack for that, which works on CLIP, like it, basically it's just for SD 1.5, well, for SD 1.5, the prompt weighting works well because CLIP-L is a, is not a very deep model. So you have a very high correlation between, you have the input token, the index of the input token vector, and the output token, they're very, the concepts are very close, closely linked. So that means if you interpolate the vector from, what, well, the, the way ComfyUI does it is it has, okay, you have the vector, you have an empty prompt. So you have a, a chunk, like a CLIP output for the empty prompt, and then you have the one for your prompt, and then it interpolates from that, depending on your prompt weight. Yeah. Comfy [00:23:07]: So that's how it, how it does prompt weighting. But this stops working the deeper your text encoder is.
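A tiny sketch of the interpolation trick just described: move the prompt's encoder output toward or away from the empty-prompt output by the weight. Here `clip_encode` is a stand-in for the text encoder call, and ComfyUI's real implementation has more moving parts (per-token weights, chunking, and so on).

```python
def apply_weight(clip_encode, prompt, weight):
    # CLIP output for the prompt and for the empty prompt, both shaped [tokens, dim].
    cond = clip_encode(prompt)
    empty = clip_encode("")
    # weight 1.0 reproduces the plain prompt; >1.0 pushes past it, <1.0 fades toward empty.
    return empty + weight * (cond - empty)
```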
So on T5 XXL itself, it doesn't work at all. So. Wow. swyx [00:23:20]: Is that a problem for people? I mean, cause I'm used to just move, moving up numbers. Probably not. Yeah. Comfy [00:23:25]: Well. swyx [00:23:26]: So you just use words to describe, right? Cause it's a bigger language model. Yeah. Comfy [00:23:30]: Yeah. So. Yeah. So honestly it might be good, but I haven't seen many complaints on Flux that it's not working. So, cause I guess people can sort of get around it with, with language. So. Yeah. swyx [00:23:46]: Yeah. And then coming back to LoRAs, now the, the popular way to, to customize models is LoRAs. And I saw you also support LoCon and LoHa, which I've never heard of before. Comfy [00:23:56]: There's a bunch of, cause what, what the LoRA is essentially is, instead of like, okay, you have your, your model and then you want to fine-tune it. So instead of like, what you could do is you could fine-tune the entire thing, but that's a bit heavy. So to speed things up and make things less heavy, what you can do is just fine-tune some smaller weights, like basically two, two matrices, like two low-rank matrices, that when you multiply them together, gives a, represents the difference between the trained weights and your base weights. So by training those two smaller matrices, that's a lot less heavy. Yeah. Alessio [00:24:45]: And they're portable. So you're going to share them. Yeah. It's like easier. And also smaller. Comfy [00:24:49]: Yeah. That's the, how LoRAs work. So basically, so when, when inferencing, you, you can inference with them pretty efficiently, like how ComfyUI does it. It just, when you use a LoRA, it just applies it straight on the weights, so that there's only a small delay at the start, like before the sampling, to when it applies the weights, and then it's just the same speed as, as before. So for, for inference, it's, it's not that bad. And then you have, so basically all the LoRA types like LoHa, LoCon, everything, that's just different ways of representing that. Like, basically, you can call it kind of like compression, even though it's not really compression, it's just different ways of representing, like just, okay, I want to train a difference on the weights. What's the best way to represent that difference? There's the basic LoRA, which is just, oh, let's multiply these two matrices together. And then there's all the other ones, which are all different algorithms. So. Yeah. Alessio [00:25:57]: So, let's talk about, let's talk about what ComfyUI actually is. I think most people have heard of it. Some people might've seen screenshots. I think fewer people have built very complex workflows. So when you started, automatic was like the super simple way. What were some of the choices that you made? So the node workflow, is there anything else that stands out as like, this was like a unique take on how to do image generation workflows? Comfy [00:26:22]: Well, I feel like, yeah, back then everyone was trying to make, like, an easy to use interface. Yeah. So I'm like, well, everyone's trying to make an easy to use interface. swyx [00:26:32]: Let's make a hard to use interface. Comfy [00:26:37]: Like, so like, I like, I don't need to do that, everyone else is doing it. So let me try something like, let me try to make a powerful interface that's not easy to use. So. swyx [00:26:52]: So like, yeah, there's a sort of node execution engine. Yeah. Yeah. And it actually lists, it has this really good list of features of things you prioritize, right?
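Circling back to the LoRA explanation above, here is a minimal sketch of how a low-rank delta gets merged straight onto a layer's weights before sampling, which is why inference speed is unchanged afterwards; the shapes, rank, and strength value are illustrative assumptions.

```python
import torch

def apply_lora(base_weight, lora_down, lora_up, strength=1.0):
    # The two low-rank matrices multiply out to a delta over the base weights,
    # which is added onto the layer once, ahead of sampling.
    return base_weight + strength * (lora_up @ lora_down)

# Toy shapes: a 320x320 projection approximated at rank 8.
w = torch.randn(320, 320)
down = torch.randn(8, 320) * 0.01   # trained "A" matrix: rank x in_features
up = torch.zeros(320, 8)            # trained "B" matrix: out_features x rank (zero init, so delta starts at 0)
patched = apply_lora(w, down, up)   # same shape as w, so sampling runs at the usual speed
```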
Like let me see, like sort of re-executing from, from any parts of the workflow that was changed, asynchronous queue system, smart memory management, like all this seems like a lot of engineering. Yeah. Comfy [00:27:12]: There's a lot of engineering in the back end to make things, cause I was always focused on making things work locally very well. Cause that's cause I was using it locally. So everything. So there's a lot of, a lot of thought and work put into getting everything to run as well as possible. So yeah. ComfyUI is actually more of a back end, at least, well, now the front end's getting a lot more development, but, but before, before it was, I was pretty much only focused on the backend. Yeah. swyx [00:27:50]: So v0.1 was only August this year. Yeah. Comfy [00:27:54]: With the new front end. Before there was no versioning. So yeah. Yeah. Yeah. swyx [00:27:57]: And so what was the big rewrite for the 0.1 and then the 1.0? Comfy [00:28:02]: Well, that's more on the front end side. That's cause before that it was just like the UI, what, cause when I first wrote it, I just, I said, okay, how can I make, like, I can do web development, but I don't like doing it. Like what's the easiest way I can slap a node interface on this? And then I found this library. Yeah. Like JavaScript library. swyx [00:28:26]: Litegraph? Comfy [00:28:27]: Litegraph. swyx [00:28:28]: Usually people will go for like React Flow for like a flow builder. Yeah. Comfy [00:28:31]: But that seemed like too complicated. So I didn't really want to spend time like developing the front end. So I'm like, well, oh, litegraph. This has the whole node interface. So, okay, let me just plug that into, to my backend. swyx [00:28:49]: I feel like if Streamlit or Gradio offered something like that, you would have used Streamlit or Gradio, cause it's Python. Yeah. Comfy [00:28:54]: Yeah. Yeah. Yeah. Comfy [00:29:00]: Yeah. Comfy [00:29:14]: Yeah, it takes your frontend logic and your backend logic and just sticks them together. swyx [00:29:20]: It's supposed to be easy for you guys. If you're a Python main, you know, I'm a JS main, right? Okay. If you're a Python main, it's supposed to be easy. Comfy [00:29:26]: Yeah, it's easy, but it makes your whole software a huge mess. swyx [00:29:30]: I see, I see. So you're mixing concerns instead of separating concerns? Comfy [00:29:34]: Well, it's because... Like frontend and backend. Frontend and backend should be well separated with a defined API. Like that's how you're supposed to do it. Smart people disagree. It just sticks everything together. It makes it easy to end up with a huge mess. And also, there's a lot of issues with Gradio. Like it's very good if all you want to do is just, like, slap a quick interface on your, like, to show off your ML project. Like that's what it's made for. Yeah. Like there's no problem using it. Like, oh, I have my, I have my code, I just wanted a quick interface on it. That's perfect. Like use Gradio. But if you want to make something that's like a real, like real software that will last a long time and will be easy to maintain, then I would avoid it. Yeah. swyx [00:30:32]: So your criticism is Streamlit and Gradio are the same. I mean, those are the same criticisms. Comfy [00:30:37]: Yeah, Streamlit I haven't used as much. Yeah, I just looked a bit. swyx [00:30:43]: Similar philosophy. Comfy [00:30:44]: Yeah, it's similar. It's just, it just seems to me like, okay, for quick, like AI demos, it's perfect. swyx [00:30:51]: Yeah.
Going back to like the core tech, like asynchronous queues, slow re-execution, smart memory management, you know, anything that you were very proud of or was very hard to figure out?Comfy [00:31:00]: Yeah. The thing that's the biggest pain in the ass is probably the memory management. Yeah.swyx [00:31:05]: Were you just paging models in and out or? Yeah.Comfy [00:31:08]: Before it was just, okay, load the model, completely unload it. Then, okay, that, that works well when you, your model are small, but if your models are big and it takes sort of like, let's say someone has a, like a, a 4090, and the model size is 10 gigabytes, that can take a few seconds to like load and load, load and load, so you want to try to keep things like in memory, in the GPU memory as much as possible. What Comfy UI does right now is it. It tries to like estimate, okay, like, okay, you're going to sample this model, it's going to take probably this amount of memory, let's remove the models, like this amount of memory that's been loaded on the GPU and then just execute it. But so there's a fine line between just because try to remove the least amount of models that are already loaded. Because as fans, like Windows drivers, and one other problem is the NVIDIA driver on Windows by default, because there's a way to, there's an option to disable that feature, but by default it, like, if you start loading, you can overflow your GPU memory and then it's, the driver's going to automatically start paging to RAM. But the problem with that is it's, it makes everything extremely slow. So when you see people complaining, oh, this model, it works, but oh, s**t, it starts slowing down a lot, that's probably what's happening. So it's basically you have to just try to get, use as much memory as possible, but not too much, or else things start slowing down, or people get out of memory, and then just find, try to find that line where, oh, like the driver on Windows starts paging and stuff. Yeah. And the problem with PyTorch is it's, it's high levels, don't have that much fine-grained control over, like, specific memory stuff, so kind of have to leave, like, the memory freeing to, to Python and PyTorch, which is, can be annoying sometimes.swyx [00:33:32]: So, you know, I think one thing is, as a maintainer of this project, like, you're designing for a very wide surface area of compute, like, you even support CPUs.Comfy [00:33:42]: Yeah, well, that's... That's just, for PyTorch, PyTorch supports CPUs, so, yeah, it's just, that's not, that's not hard to support.swyx [00:33:50]: First of all, is there a market share estimate, like, is it, like, 70% NVIDIA, like, 30% AMD, and then, like, miscellaneous on Apple, Silicon, or whatever?Comfy [00:33:59]: For Comfy? Yeah. Yeah, and, yeah, I don't know the market share.swyx [00:34:03]: Can you guess?Comfy [00:34:04]: I think it's mostly NVIDIA. Right. Because, because AMD, the problem, like, AMD works horribly on Windows. Like, on Linux, it works fine. It's, it's lower than the price equivalent NVIDIA GPU, but it works, like, you can use it, you generate images, everything works. On Linux, on Windows, you might have a hard time, so, that's the problem, and most people, I think most people who bought AMD probably use Windows. They probably aren't going to switch to Linux, so... Yeah. 
So, until AMD actually, like, ports their, like, ROCm to, to Windows properly, and then there's actually PyTorch, I think they're, they're doing that, they're in the process of doing that, but, until they get a good, like, PyTorch ROCm build that works on Windows, it's, like, they're going to have a hard time. Yeah. Alessio [00:35:06]: We got to get George on it. Yeah. Well, he's trying to get Lisa Su to do it, but... Let's talk a bit about, like, the node design. So, unlike all the other text-to-image, you have a very, like, deep, so you have, like, a separate node for, like, CLIP encode, you have a separate node for, like, the KSampler, you have, like, all these nodes. Going back to, like, the making it easy versus making it hard, but, like, how much do people actually play with all the settings, you know? Kind of, like, how do you guide people to, like, hey, this is actually going to be very impactful versus this is maybe, like, less impactful, but we still want to expose it to you? Comfy [00:35:40]: Well, I try to... I try to expose, like, I try to expose everything, but, yeah, at least for the, but for things, like, for example, for the samplers, like, there's, like, yeah, four different sampler nodes, which go in easiest to most advanced. So, yeah, if you go, like, the easy node, the regular sampler node, that's, you have just the basic settings. But if you use, like, the sampler advanced... If you use, like, the custom advanced node, that, that one you can actually, you'll see you have, like, different nodes. Alessio [00:36:19]: I'm looking it up now. Yeah. What are, like, the most impactful parameters that you use? So, it's, like, you know, you can have more, but, like, which ones, like, really make a difference? Comfy [00:36:30]: Yeah, they all do. They all have their own, like, they all, like, for example, yeah, steps. Usually you want steps, you want them to be as low as possible. But you want, if you're optimizing your workflow, you want to, you lower the steps until, like, the images start deteriorating too much. Because that, yeah, that's the number of steps you're running the diffusion process. So, if you want things to be faster, lower is better. But, yeah, CFG, that's more, you can kind of see that as the contrast of the image. Like, if your image looks too burnt, then you can lower the CFG. So, yeah, CFG, that's how, yeah, that's how strongly the, like, the negative versus positive prompt is applied. Because when you sample a diffusion model with a negative prompt, it's just, yeah, positive prediction minus negative prediction. swyx [00:37:32]: Contrastive loss. Yeah. Comfy [00:37:34]: It's positive minus negative, and the CFG does the multiplier. Yeah. Yeah. Yeah, so. Alessio [00:37:41]: What are, like, good resources to understand what the parameters do? I think most people start with automatic, and then they move over, and it's, like, steps, CFG, sampler name, scheduler, denoise. Read it. Comfy [00:37:53]: But, honestly, well, it's more, it's something you should, like, try out yourself. I don't know, you don't necessarily need to know how it works to, like, know what it does. Because even if you know, like, CFG, it's, like, positive minus negative prompt. Yeah. So the only thing you know about CFG is if it's 1.0, then that means the negative prompt isn't applied. It also means sampling is two times faster. But, yeah.
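The CFG arithmetic Comfy just described, written out as a one-liner; variable names are illustrative, not from any particular codebase.

```python
def cfg_combine(positive_pred, negative_pred, cfg_scale):
    # "Positive prediction minus negative prediction", scaled by CFG and added back onto
    # the negative (unconditional) prediction. At cfg_scale == 1.0 the negative term cancels
    # out, so the negative prompt has no effect and samplers can skip that second model pass.
    return negative_pred + cfg_scale * (positive_pred - negative_pred)
```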
But other than that, it's more, like, you should really just see what it does to the images yourself, and you'll probably get a more intuitive understanding of what these things do.Alessio [00:38:34]: Any other nodes or things you want to shout out? Like, I know the animate diff IP adapter. Those are, like, some of the most popular ones. Yeah. What else comes to mind?Comfy [00:38:44]: Not nodes, but there's, like, what I like is when some people, sometimes they make things that use ComfyUI as their backend. Like, there's a plugin for Krita that uses ComfyUI as its backend. So you can use, like, all the models that work in Comfy in Krita. And I think I've tried it once. But I know a lot of people use it, and it's probably really nice, so.Alessio [00:39:15]: What's the craziest node that people have built, like, the most complicated?Comfy [00:39:21]: Craziest node? Like, yeah. I know some people have made, like, video games in Comfy with, like, stuff like that. So, like, someone, like, I remember, like, yeah, last, I think it was last year, someone made, like, a, like, Wolfenstein 3D in Comfy. Of course. And then one of the inputs was, oh, you can generate a texture, and then it changes the texture in the game. So you can plug it to, like, the workflow. And there's a lot of, if you look there, there's a lot of crazy things people do, so. Yeah.Alessio [00:39:59]: And now there's, like, a node register that people can use to, like, download nodes. Yeah.Comfy [00:40:04]: Like, well, there's always been the, like, the ComfyUI manager. Yeah. But we're trying to make this more, like, I don't know, official, like, with, yeah, with the node registry. Because before the node registry, the, like, okay, how did your custom node get into ComfyUI manager? That's the guy running it who, like, every day he searched GitHub for new custom nodes and added dev annually to his custom node manager. So we're trying to make it less effortless. So we're trying to make it less effortless for him, basically. Yeah.Alessio [00:40:40]: Yeah. But I was looking, I mean, there's, like, a YouTube download node. There's, like, this is almost like, you know, a data pipeline more than, like, an image generation thing at this point. It's, like, you can get data in, you can, like, apply filters to it, you can generate data out.Comfy [00:40:54]: Yeah. You can do a lot of different things. Yeah. So I'm thinking, I think what I did is I made it easy to make custom nodes. So I think that helped a lot. I think that helped a lot for, like, the ecosystem because it is very easy to just make a node. So, yeah, a bit too easy sometimes. Then we have the issue where there's a lot of custom node packs which share similar nodes. But, well, that's, yeah, something we're trying to solve by maybe bringing some of the functionality into the core. Yeah. Yeah. Yeah.Alessio [00:41:36]: And then there's, like, video. People can do video generation. Yeah.Comfy [00:41:40]: Video, that's, well, the first video model was, like, stable video diffusion, which was last, yeah, exactly last year, I think. Like, one year ago. But that wasn't a true video model. So it was...swyx [00:41:55]: It was, like, moving images? Yeah.Comfy [00:41:57]: I generated video. What I mean by that is it's, like, it's still 2D Latents. It's basically what I'm trying to do. So what they did is they took SD2, and then they added some temporal attention to it, and then trained it on videos and all. So it's kind of, like, animated, like, same idea, basically. 
Why I say it's not a true video model is that you still have, like, the 2D latents. Like, a true video model, like Mochi, for example, would have 3D latents. Mm-hmm. Alessio [00:42:32]: Which means you can, like, move through the space, basically. It's the difference. You're not just kind of, like, reorienting. Yeah. Comfy [00:42:39]: And it's also, well, it's also because you have a temporal VAE. Mm-hmm. Also, like, Mochi has a temporal VAE that compresses on, like, the temporal direction, also. So that's something you don't have with, like, yeah, AnimateDiff and Stable Video Diffusion. They only, like, compress spatially, not temporally. Mm-hmm. Right. So, yeah. That's why I call those, like, true video models. There's, yeah, there's actually a few of them, but the one I've implemented in Comfy is Mochi, because that seems to be the best one so far. Yeah. swyx [00:43:15]: We had AJ come and speak at the Stable Diffusion meetup. The other open one I think I've seen is CogVideo. Yeah. Comfy [00:43:21]: CogVideo. Yeah. That one's, yeah, it also seems decent, but, yeah. Chinese, so we don't use it. No, it's fine. It's just, yeah, I could. Yeah. It's just that there's a, it's not the only one. There's also a few others, which I. swyx [00:43:36]: The rest are, like, closed source, right? Like, Kling. Yeah. Comfy [00:43:39]: Closed source, there's a bunch of them. But I mean, open, I've seen a few of them. Like, I can't remember their names, but there's CogVideo, the big, the big one. Then there's also a few of them that released at the same time. There's one that released at the same time as SD 3.5, same day, which is why I don't remember the name. swyx [00:44:02]: We should have a release schedule so we don't conflict on each of these things. Yeah. Comfy [00:44:06]: I think SD 3.5 and Mochi released on the same day. So everything else was kind of drowned, completely drowned out. So for some reason, lots of people picked that day to release their stuff. Comfy [00:44:21]: Yeah. Which is, well, a shame for those. And I think OmniGen also released the same day, which also seems interesting. Yeah. Yeah. Alessio [00:44:30]: What's Comfy? So you are Comfy. And then there's, like, comfy.org. I know we do a lot of things for, like, Nous Research, and those guys also have kind of like a more open source thing going on. How do you work? Like you mentioned, you mostly work on, like, the core piece of it. And then what... Comfy [00:44:47]: Maybe I should fill it in, because I, yeah, I feel like maybe, yeah, I only explained part of the story. Right. Yeah. Maybe I should explain the rest. So yeah. So yeah. Basically, January, that's when the first, January 2023, January 16, 2023, that's when Comfy was first released to the public. Then, yeah, did a Reddit post about the area composition thing somewhere in, I don't remember exactly, maybe end of January, beginning of February. And then someone, a YouTuber, made a video about it, like Olivio, he made a video about Comfy in March 2023. I think that's when it was a real burst of attention. And by that time, I was continuing to develop it and it was getting, people were starting to use it more, which unfortunately meant that, I had first written it to do like experiments, but then my time to do experiments went down. It started going down, because people were actually starting to use it then. Like, I had to, and I said, well, yeah, time to add all these features and stuff. Yeah, and then I got hired by Stability June, 2023.
Then I made, basically, yeah, they hired me because they wanted the SDXL. So I got the SDXL working very well with the UI, because they were experimenting with Comfy in-house. Actually, how the SDXL release worked is they released, for some reason, like, they released the code first, but they didn't release the model checkpoint. So they released the code. And then, well, since the release was just the code, I implemented it in Comfy too. And then the checkpoints were basically early access. People had to sign up and they only allowed people from edu emails. Like if you had an edu email, like, they gave you access basically to the SDXL 0.9. And, well, that leaked. Right. Of course, because of course it's going to leak if you do that. Well, the only way people could easily use it was with Comfy. So, yeah, people started using it. And then I fixed a few of the issues people had. So then the big 1.0 release happened. And, well, ComfyUI was the only way a lot of people could actually run it on their computers. Because it just, like, automatic was so, like, inefficient and bad that most people couldn't actually, like, it just wouldn't work. Like, because he did a quick implementation. So people were forced to use ComfyUI, and that's how it became popular, because people had no choice. swyx [00:47:55]: The growth hack. Comfy [00:47:56]: Yeah. swyx [00:47:56]: Yeah. Comfy [00:47:57]: Like everywhere, like people who didn't have the 4090, they had like, who had just regular GPUs, they didn't have a choice. Alessio [00:48:05]: So yeah, I got a 4070. So think of me. And so today, what's, is there like a core Comfy team or? Comfy [00:48:13]: Uh, yeah, well, right now, um, yeah, we are hiring. Okay. Actually, so right now core, like, um, the core core itself, it's, it's me. Uh, but because, uh, the reason is, well, all the focus has been mostly on the front end right now, because that's the thing that's been neglected for a long time. So, uh, so most of the focus right now is, uh, all on the front end, but we are, uh, yeah, we will soon get, uh, more people to, like, help me with the actual backend stuff. Yeah. So, no, I'm not going to say a hundred percent, because that's why once the, once we have our V1 release, which is, because it'd be the packaged Comfy, with the nice interface and easy to install on Windows and hopefully Mac. Uh, yeah. Yeah. Once we have that, uh, we're going to have lots of stuff to do on the backend side and also the front end side, but, uh. Alessio [00:49:14]: What's the release? I'm on the wait list. What's the timing? Comfy [00:49:18]: Uh, soon. Uh, soon. Yeah, I don't want to promise a release date. We do have a release date we're targeting, but I'm not sure if it's public. Yeah, and we're still going to continue doing the open source, making ComfyUI the best way to run Stable Diffusion models. At least on the open source side, it's going to be the best way to run models locally. But we will have a few things to make money from it, like cloud inference or that type of thing. And maybe some things for some enterprises. swyx [00:50:08]: I mean, a few questions on that. How do you feel about the other Comfy startups? Comfy [00:50:11]: I mean, I think it's great. They're using your name. Yeah, well, it's better they use Comfy than they use something else. Yeah, that's true. It's fine. We're going to try not to... We don't want to... We want people to use Comfy. Like I said, it's better that people use Comfy than something else.
So as long as they use Comfy, I think it helps the ecosystem. Because more people, even if they don't contribute directly, the fact that they are using Comfy means that people are more likely to join the ecosystem. So, yeah. swyx [00:50:57]: And then would you ever do text? Comfy [00:50:59]: Yeah, well, you can already do text with some custom nodes. So, yeah, it's something we like. Yeah, it's something I've wanted to eventually add to core, but it's more like, it's not a very high priority. But because a lot of people use text for prompt enhancement and other things like that. So, yeah, it's just that my focus has always been on diffusion models. Yeah, unless some text diffusion model comes out. swyx [00:51:30]: Yeah, David Holz is investing a lot in text diffusion. Comfy [00:51:34]: Yeah, well, if a good one comes out, then we'll probably implement it since it fits with the whole... swyx [00:51:39]: Yeah, I mean, I imagine it's going to be closed source at Midjourney. Yeah. Comfy [00:51:43]: Well, if an open one comes out, then I'll probably implement it. Alessio [00:51:54]: Cool, Comfy. Thanks so much for coming on. This was fun. Bye. Get full access to Latent Space at www.latent.space/subscribe