Podcasts about autogpt

  • 156 podcasts
  • 230 episodes
  • 42m average duration
  • 1 new episode per month
  • Latest episode: Jan 2, 2025

Latest podcast episodes about autogpt

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World's Fair 2025 in June.

Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100!

Full YouTube Episode with Slides/Charts

Like and subscribe and hit that bell to get notifs!

Timestamps

* 00:00 Welcome to the 100th Episode!
* 00:19 Reflecting on the Journey
* 00:47 AI Engineering: The Rise and Impact
* 03:15 Latent Space Live and AI Conferences
* 09:44 The Competitive AI Landscape
* 21:45 Synthetic Data and Future Trends
* 35:53 Creative Writing with AI
* 36:12 Legal and Ethical Issues in AI
* 38:18 The Data War: GPU Poor vs. GPU Rich
* 39:12 The Rise of GPU Ultra Rich
* 40:47 Emerging Trends in AI Models
* 45:31 The Multi-Modality War
* 01:05:31 The Future of AI Benchmarks
* 01:13:17 Pionote and Frontier Models
* 01:13:47 Niche Models and Base Models
* 01:14:30 State Space Models and RWKV
* 01:15:48 Inference Race and Price Wars
* 01:22:16 Major AI Themes of the Year
* 01:22:48 AI Rewind: January to March
* 01:26:42 AI Rewind: April to June
* 01:33:12 AI Rewind: July to September
* 01:34:59 AI Rewind: October to December
* 01:39:53 Year-End Reflections and Predictions

Transcript

[00:00:00] Welcome to the 100th Episode!

[00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx for the 100th time today.

[00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes.

[00:00:19] Alessio: Yeah, I know.

[00:00:19] Reflecting on the Journey

[00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round when we first started that we didn't like, and we tried to change the question.
The answer

[00:00:32] swyx: was Cursor and Perplexity.

[00:00:34] Alessio: Yeah, I love Midjourney. It's like, do you really not like anything else?

[00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research-driven content. You know, we had like Tri Dao, we had, you know, Jeremy Howard, we had more folks like that.

[00:00:47] AI Engineering: The Rise and Impact

[00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side.

[00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your Rise of the AI Engineer post just kind of gave people somewhere to congregate, and then the AI Engineer Summit.

[00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for the AI engineering industry as a whole, which is almost like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth, or did you expect that it would take longer for the AI engineer thing to kind of become, you know, everybody talks about it today.

[00:01:32] swyx: So, the sign that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level. I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, you know, how quickly it could happen, and obviously there's a chance that I could be wrong.

[00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign.
But there's enough people that have defined it, you know, GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big. So I think it's kind of arrived as a meaningful and useful definition.

[00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had.

[00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not put a firm definition there, because most of the successful definitions are necessarily underspecified, and it's actually useful to have different perspectives and you don't have to specify everything from the outset.

[00:02:45] Alessio: Yeah, I was at, um, AWS re:Invent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot, was like, there are like hundreds of people just in line to go in.

[00:02:56] Alessio: I think that's kind of what enabled me. People, right? Which is what [00:03:00] you kind of talked about. It's like, hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on the Substack.

[00:03:11] Alessio: But yeah, it's been a heck of a two years.

[00:03:14] swyx: Yeah.

[00:03:15] Latent Space Live and AI Conferences

[00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17,000 people. And the Latent Space Live event that we held there was 950 signups, I think. The AI world, the ML world is still very much research heavy.
And that's as it should be because ML is very much in a research phase.

[00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever.

[00:03:51] swyx: But, yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference, and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply.

[00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering NeurIPS? Like, this is an ML research conference, and I'm like, well, yeah, I mean, we're not going to, like, understand everything or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope.

[00:04:32] swyx: And then actually, like, when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production, and that's what they really, really want. The measure of success was previously just peer review, right? Getting 7s and 8s on their, um, academic review conferences and stuff. Like, citations is one metric, but money is a better metric.

[00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2200 people on the live stream or something like that. Yeah, yeah. Hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan.
Yeah, than it was in the chat on YouTube.

[00:05:06] swyx: I would say that I actually also created

[00:05:09] swyx: Latent Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, market, job market, right? Like literally, basically everyone's there to advertise their research and skills and get jobs.

[00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info it's not great, because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year.

[00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview.

[00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing, and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt.

[00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models, post-transformers, synthetic data, small models, and agents. And then the last one was the, uh, and then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about.

[00:06:39] swyx: It was very awkward.
And I'm really, really thankful for Jonathan Frankle, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But he was pro scaling. And I think everyone who is, like, in AI is pro scaling, right? So you need somebody who's ready to publicly say, no, we've hit a wall.

[00:06:57] swyx: So that means you're saying Sam Altman's wrong. [00:07:00] You're saying, um, you know, everyone else is wrong. It helps that this was the day before Ilya went on, went up on stage and then said pre-training has hit a wall. And data has hit a wall. So actually Jonathan ended up winning, and then Ilya supported that statement, and then Noam Brown on the last day further supported that statement as well.

[00:07:17] swyx: So it's kind of interesting that I think the consensus kind of going in was that we're not done scaling, like you should believe in the bitter lesson. And then, four straight days in a row, you had Sepp Hochreiter, who is the creator of the LSTM, along with everyone's favorite OG in AI, which is Juergen Schmidhuber.

[00:07:34] swyx: He said that, um, we're pre-training inside a wall, or like, we've run into a different kind of wall. And then we have, you know, Jonathan Frankle, Ilya, and then Noam Brown all saying variations of the same thing, that we have hit some kind of wall in the status quo of what pre-trained, scaling large pre-trained models has looked like, and we need a new thing.

[00:07:54] swyx: And obviously the new thing for people is some make, either people are calling it inference time compute or test time [00:08:00] compute. I think the collective terminology has been inference time, and I think that makes sense because test time, calling it test, meaning, has a very pre-training bias, meaning that the only reason for running inference at all is to test your model.

[00:08:11] swyx: That is not true. Right. Yeah. So, so, I quite agree that
OpenAI seems to have adopted, or the community seems to have adopted, this terminology of ITC instead of TTC. And that, that makes a lot of sense because, like, now we care about inference, even right down to compute optimality. Like I actually interviewed this author who recovered or reviewed the Chinchilla paper.

[00:08:31] swyx: The Chinchilla paper is compute-optimal training, but what is not stated in there is that it's pre-training compute-optimal training. And once you start caring about inference compute-optimal training, you have a different scaling law. And in a way that we did not know last year.

[00:08:45] Alessio: I wonder, because John is, he's also on the side of Attention Is All You Need.

[00:08:49] Alessio: Like he had the bet with Sasha. So I'm curious, like he doesn't believe in scaling, but he thinks the transformer, I wonder if he's still. So, so,

[00:08:56] swyx: so he, obviously everything is nuanced and, you know, I told him to play a character [00:09:00] for this debate, right? So he actually does. Yeah. He still, he still believes that we can scale more.

[00:09:04] swyx: Uh, he just assumed the character to be very game for, for playing this debate. So even more kudos to him that he assumed a position that he didn't believe in and still won the debate.

[00:09:16] Alessio: Get rekt, Dylan. Um, do you just want to quickly run through some of these things? Like, uh, Sarah's presentation, just the highlights.

[00:09:24] swyx: Yeah, we can't go through everyone's slides, but I pulled out some things as a factor of, like, stuff that we were going to talk about. And we'll

[00:09:30] Alessio: publish

[00:09:31] swyx: the rest. Yeah, we'll publish on this feed the best of 2024 in those domains. And hopefully people can benefit from the work that our speakers have done.

[00:09:39] swyx: But I think it's, uh, these are just good slides.
And I've been, I've been looking for sort of end of year recaps from, from people.

[00:09:44] The Competitive AI Landscape

[00:09:44] swyx: The field has progressed a lot. You know, I think the max Elo in 2023 on LMSys used to be 1200 for LMSys Elos. And now everyone is at least at, uh, 1275 in their Elos, and this is across Gemini, ChatGPT, [00:10:00] Grok, 01.ai,

[00:10:01] swyx: with their Yi-Large model, and Anthropic, of course. It's a very, very competitive race. There are multiple frontier labs all racing, but there is a clear tier zero frontier. And then there's like a tier one. It's like, I wish I had everything else. Tier zero is extremely competitive. It's effectively now a three-horse race between Gemini, uh, Anthropic, and OpenAI.

[00:10:21] swyx: I would say that people are still holding out a candle for xAI. xAI, I think, for some reason, because their API was very slow to roll out, is not included in these metrics. So it's actually quite hard to put on there. As someone who also does charts, xAI is continually snubbed because they don't work well with the benchmarking people.

[00:10:42] swyx: Yeah, yeah, yeah. It's a little trivia for why xAI always gets ignored. The other thing is market share. So these are slides from Sarah. We have it up on the screen. It has gone from very heavily OpenAI. So we have some numbers and estimates. These are from Ramp. Estimates of OpenAI market share in [00:11:00] December 2023.

[00:11:01] swyx: And this is basically, what is it, GPT being 95 percent of production traffic. And I think if you correlate that with stuff that we asked Harrison Chase on the LangChain episode, it was true. And then Claude 3 launched middle of this year. I think Claude 3 launched in March, Claude 3.5 Sonnet was in June-ish.

[00:11:23] swyx: And you can start seeing the market share shift towards OpenAI, uh, towards Anthropic, uh, very, very aggressively. The more recent one is Gemini.
So if I scroll down a little bit, this is an even more recent dataset. So Ramp's dataset ends in September 2024. Gemini has basically launched a price war at the low end, uh, with Gemini Flash, uh, being basically free for personal use.

[00:11:44] swyx: Like, I think people don't understand the free tier. It's something like a billion tokens per day. Unless you're trying to abuse it, you cannot really exhaust your free tier on Gemini. They're really trying to get you to use it. They know they're in like third place, um, fourth place, depending how you, how you count.

[00:11:58] swyx: And so they're going after [00:12:00] the lower tier first, and then, you know, maybe the upper tier later, but yeah, Gemini Flash, according to OpenRouter, is now 50 percent of their OpenRouter requests. Obviously, these are the small requests. These are small, cheap requests that are mathematically going to be more.

[00:12:15] swyx: The smart ones obviously are still going to OpenAI. But, you know, it's a very, very big shift in the market. Like basically 2023, 2022, to going into 2024, OpenAI has gone from 95 market share to, yeah, reasonably somewhere between 50 to 75 market share.

[00:12:29] Alessio: Yeah. I'm really curious how Ramp does the attribution to the model?

[00:12:32] Alessio: If it's API, because I think it's all credit card spend. Well, but it's all, the credit card doesn't say, maybe. Maybe the, maybe when they do expenses, they upload the PDF, but yeah, the, the Gemini, I think, makes sense. I think that was one of my main 2024 takeaways, that, like, the best small model companies are the large labs, which is not something I would have thought, that the open source kind of like long tail would be like the small model.

[00:12:53] swyx: Yeah, different sizes of small models we're talking about here, right? Like, so small model here for Gemini is 8B, [00:13:00] right? Uh, mini.
We don't know what the small model size is, but yeah, it's probably in the double digits or maybe single digits, but probably double digits. The open source community has kind of focused on the one to three B size.

[00:13:11] swyx: Mm-hmm. Yeah. Maybe

[00:13:12] swyx: zero, maybe 0.5B, uh, that's Moondream, and if that is small for you, then that's great. It makes sense that we, we have a range for small now, which is like, may, maybe one to five B. Yeah. I'll even put that at, at, at the high end. And so this includes Gemma from Gemini as well. But also includes the Apple Foundation models, which I think Apple Foundation is 3B.

[00:13:32] Alessio: Yeah. No, that's great. I mean, I think in the start small just meant cheap. I think today small is actually a more nuanced discussion, you know, that people weren't really having before.

[00:13:43] swyx: Yeah, we can keep going. This is a slide where I mildly disagree with Sarah. She's pointing to the Scale SEAL leaderboard. I think the researchers that I talked with at NeurIPS were kind of positive on this because basically you need private test [00:14:00] sets to prevent contamination.

[00:14:02] swyx: And Scale is one of maybe three or four people this year that has really made an effort in doing a credible private test set leaderboard. Llama 405B does well compared to Gemini and GPT-4o. And I think that's good. I would say that, you know, it's good to have an open model that is that big, that does well on those metrics.

[00:14:23] swyx: But anyone putting 405B in production will tell you, if you scroll down a little bit to the Artificial Analysis numbers, that it is very slow and very expensive to infer. Um, it doesn't even fit on like one node of, uh, of H100s. Cerebras will be happy to tell you they can serve 405B on their super large chips.

[00:14:42] swyx: But, um, you know, if you need to do anything custom to it, you're still kind of constrained. So, is 405B really that relevant?
Like, I think most people are basically saying that they only use 405B as a teacher model to distill down to something. Even Meta is doing it. So with the Llama 3.3 [00:15:00] launch, they only launched the 70B because they use 405B to distill the 70B.

[00:15:03] swyx: So I don't know if, like, open source is keeping up. I think the open source industrial complex is very invested in telling you that the gap is narrowing. I kind of disagree. I think that the gap is widening with O1. I think there are very, very smart people trying to narrow that gap and they should.

[00:15:22] swyx: I really wish them success, but you cannot use a chart that is nearing 100, a saturation chart, and say, look, the distance between open source and closed source is narrowing. Of course it's going to narrow because you're near 100. This is stupid. But in metrics that matter, is open source narrowing?

[00:15:38] swyx: Probably not for O1 for a while. And it's really up to the open source guys to figure out if they can match O1 or not.

[00:15:46] Alessio: I think inference time compute is bad for open source just because, you know, Zuck can donate the flops at training time, but he cannot donate the flops at inference time. So it's really hard to, like, actually keep up on that axis.

[00:15:59] Alessio: Big, big business [00:16:00] model shift. So I don't know what that means for the GPU clouds. I don't know what that means for the hyperscalers, but obviously the big labs have a lot of advantage. Because, like, it's not a static artifact that you're putting the compute in. You're kind of doing that still, but then you're putting a lot of compute at inference too.

[00:16:17] swyx: Yeah, yeah, yeah. Um, I mean, Llama 4 will be reasoning oriented. We talked with Thomas Scialom. Um, kudos for getting that episode together. That was really nice. Good, well timed. Actually, I connected with the AI Meta guy, uh, at NeurIPS, and, um, yeah, we're going to coordinate something for Llama 4.
Yeah, yeah,

[00:16:32] Alessio: and our friend, yeah.

[00:16:33] Alessio: Clara Shih just joined to lead the business agent side. So I'm sure we'll have her on in the new year.

[00:16:39] swyx: Yeah. So, um, my comment on, on the business model shift, this is super interesting. Apparently it is wide knowledge that OpenAI wanted more than 6.6 billion dollars for their fundraise. They wanted to raise, you know, higher, and they did not.

[00:16:51] swyx: And what that means is basically, like, it's very convenient that we're not getting GPT-5, which would have been a larger pre-train that needs a lot of upfront money. And [00:17:00] instead we're, we're converting fixed costs into variable costs, right, and passing it on effectively to the customer. And it's so much easier to take margin there because you can directly attribute it to, like, oh, you're using this more.

[00:17:12] swyx: Therefore you, you pay more of the cost and I'll just slap a margin in there. So, like, that lets you control your gross margin and, like, tie your spend, or your sort of inference spend, accordingly. And it's just really interesting that this change in the sort of inference paradigm has arrived exactly at the same time that the funding environment for pre-training is effectively drying up, kind of.

[00:17:36] swyx: I feel like maybe the VCs are very in tune with research anyway, so, like, they would have noticed this, but, um, it's just interesting.

[00:17:43] Alessio: Yeah, and I was looking back at our yearly recap of last year. Yeah. And the big thing was like the Mixtral price fights, you know, and I think now it's almost like there's nowhere to go, like, you know, Gemini Flash is like basically giving it away for free.

[00:17:55] Alessio: So I think this is a good way for the labs to generate more revenue and pass down [00:18:00] some of the compute to the customer. I think they're going to

[00:18:02] swyx: keep going. I think that 2, will come.

[00:18:05] Alessio: Yeah, I know.
Totally. I mean, next year, the first thing I'm doing is signing up for Devin. Signing up for ChatGPT Pro.

[00:18:12] Alessio: Just to try. I just want to see what does it look like to spend a thousand dollars a month on AI?

[00:18:17] swyx: Yes. Yes. I think if your, if your, your job is, at least, AI content creator or VC or, you know, someone whose job it is to stay on, stay on top of things, you should already be spending like a thousand dollars a month on, on stuff.

[00:18:28] swyx: And then obviously easy to spend, hard to use. You have to actually use. The good thing is that actually Google lets you do a lot of stuff for free now. So like Deep Research, that they just launched, uses a ton of inference and it's, it's free while it's in preview.

[00:18:45] Alessio: Yeah. They need to put that in Lindy.

[00:18:47] Alessio: I've been using Lindy lately. I've built a bunch of things once we had Flo, because I liked the new thing. It's pretty good. I even did a phone call assistant. Um, yeah, they just launched Lindy voice. Yeah, I think once [00:19:00] they get advanced voice mode-like capability; today it's still like speech to text, you can kind of tell.

[00:19:06] Alessio: Um, but it's good for, like, reservations and things like that. So I have a meeting prepper thing. And so

[00:19:13] swyx: it's good. Okay. I feel like we've, we've covered a lot of stuff. Uh, I, yeah, I, you know, I think we will go over the individual, uh, talks in a separate episode. Uh, I don't want to take too much time with, uh, this stuff, but suffice to say that there is a lot of progress in each field.

[00:19:28] swyx: Uh, we covered vision. Basically this is all like the audience voting for what they wanted. And then I just invited the best people I could find in each audience, especially agents. Um, Graham, who I talked to at ICML in Vienna, he is currently still number one.
It's very hard to stay on top of SWE-bench.

[00:19:45] swyx: OpenHands is currently still number one on SWE-bench Full, which is the hardest one. He had very good thoughts on agents, which I, which I'll highlight for people. Everyone is saying 2025 is the year of agents, just like they said last year. And, uh, but he had [00:20:00] thoughts on like eight parts of what are the frontier problems to solve in agents.

[00:20:03] swyx: And so I'll highlight that talk as well.

[00:20:05] Alessio: Yeah. The number six, which is "how can agents learn more about the environment", has been super interesting to us as well, just to think through, because, yeah, how do you put an agent in an enterprise where most things in an enterprise have never been public, you know, a lot of the tooling, like the code bases and things like that.

[00:20:23] Alessio: So, yeah, there's not indexing and RAG. Well, yeah, but it's more like, you can't really RAG things that are not documented. But people know them based on how they've been doing it. You know, so I think there's almost this like, you know, oh, institutional knowledge. Yeah, the boring word is kind of like a business process extraction.

[00:20:38] Alessio: Yeah, yeah, I see. It's like, how do you actually understand how these things are done? I see. Um, and I think today the, the problem is that, yeah, the agents that most people are building are good at following instructions, but are not as good at, like, extracting them from you. Um, so I think that will be a big unlock. Just to touch quickly on the Jeff Dean thing.

[00:20:55] Alessio: I thought it was pretty, I mean, we'll link it in the, in the things, but I think the main [00:21:00] focus was like, how do you use ML to optimize the systems instead of just focusing on ML to do something else? Yeah, I think speculative decoding, we had, you know, Eugene from RWKV on the podcast before, like he's doing a lot of that with Featherless AI.

[00:21:12] swyx: Everyone is.
I would say it's the norm. I'm a little bit uncomfortable with how much it costs, because it does use more of the GPU per call. But because everyone is so keen on fast inference, then yeah, makes sense.

[00:21:24] Alessio: Exactly. Um, yeah, but we'll link that. Obviously Jeff is great.

[00:21:30] swyx: Jeff is, Jeff's talk was more, it wasn't focused on Gemini.

[00:21:33] swyx: I think people got the wrong impression from my tweet. It's more about how Google approaches ML and uses ML to design systems and then systems feed back into ML. And I think this ties in with Loubna's talk

[00:21:45] Synthetic Data and Future Trends

[00:21:45] swyx: on synthetic data, where it's basically the story of bootstrapping of humans and AI in AI research or AI in production.

[00:21:53] swyx: So her talk was on synthetic data, where, like, how much synthetic data has grown in 2024 in the pre-training side, the post-training side, [00:22:00] and the eval side. And I think Jeff then also extended it basically to chips, uh, to chip design. So he'd spend a lot of time talking about AlphaChip. And most of us in the audience are like, we're not working on hardware, man.

[00:22:11] swyx: Like you guys are great. TPU is great. Okay. We'll buy TPUs.

[00:22:14] Alessio: And then there was the earlier talk. Yeah. But, and then we have, uh, I don't know if we're calling them essays. What are we calling these? But

[00:22:23] swyx: for me, it's just like bonus for Latent Space supporters, because I feel like they haven't been getting anything.

[00:22:29] swyx: And then I wanted a more high frequency way to write stuff. Like that one I wrote in an afternoon. I think basically we now have an answer to what Ilya saw. It's one year since the blip. And we know what he saw in 2014. We know what he saw in 2024. We think we know what he sees in 2024.
He gave some hints and then we have vague indications of what he saw in 2023.

[00:22:54] swyx: So that was the, oh, and then 2016 as well, because of this lawsuit with Elon, OpenAI [00:23:00] is publishing emails from Sam's, like, his personal text messages to Shivon Zilis, or whatever. So, like, we have emails from Ilya saying, this is what we're seeing in OpenAI, and this is why we need to scale up GPUs. And I think it's very prescient in 2016 to write that.

[00:23:16] swyx: And so, like, it is exactly, like, basically his insights. It's him and Greg, basically just kind of driving the scaling up of OpenAI, while they're still playing Dota. They're like, no, like, we see the path here.

[00:23:30] Alessio: Yeah, and it's funny, yeah, they even mention, you know, we can only train on 1v1 Dota. We need to train on 5v5, and that takes too many GPUs.

[00:23:37] Alessio: Yeah,

[00:23:37] swyx: and at least for me, I can speak for myself, like, I didn't see the path from Dota to where we are today. I think even, maybe if you ask them, like, they wouldn't necessarily draw a straight line. Yeah,

[00:23:47] Alessio: no, definitely. But I think like that was like the whole idea of almost like the RL, and we talked about this with Nathan on his podcast.

[00:23:55] Alessio: It's like with RL, you can get very good at specific things, but then you can't really, like, generalize as much. And I [00:24:00] think the language models are like the opposite, which is like, you're going to throw all this data at them and scale them up, but then you really need to drive them home on a specific task later on.

[00:24:08] Alessio: And we'll talk about the OpenAI reinforcement fine-tuning, um, announcement too, and all of that. But yeah, I think like scale is all you need. That's kind of what Ilya will be remembered for. And I think just maybe to clarify on, like, the "pre-training is over" thing that people love to tweet,
I think the point of the talk was like, everybody, we're scaling these chips, we're scaling the compute, but, like, the second ingredient, which is data, is not scaling at the same rate.

[00:24:35] Alessio: So it's not necessarily pre-training is over. It's kind of like, what got us here won't get us there. In his email, he predicted like 10x growth every two years or something like that. And I think maybe now it's like, you know, you can 10x the chips again, but

[00:24:49] swyx: I think it's 10x per year. Was it? I don't know.

[00:24:52] Alessio: Exactly. And Moore's law is like 2x. So it's like, you know, much faster than that. And yeah, I like the fossil fuel of AI [00:25:00] analogy. It's kind of like, you know, the little background tokens thing. So the OpenAI reinforcement fine-tuning is basically like, instead of fine-tuning on data, you fine-tune on a reward model.

[00:25:09] Alessio: So it's basically like, instead of being data driven, it's like task driven. And I think people have tasks to do, they don't really have a lot of data. So I'm curious to see how that changes how many people fine-tune, because I think this is what people run into. It's like, oh, you can fine-tune Llama. And it's like, okay, where do I get the data

[00:25:27] Alessio: to fine-tune it on, you know, so it's great that we're moving the thing. And then I really like he had this chart where, like, you know, the brain mass and the body mass thing is basically like mammals that scaled linearly by brain and body size, and then humans kind of like broke off the slope. So it's almost like maybe the mammal slope is like the pre-training slope.

[00:25:46] Alessio: And then the post-training slope is like the, the human one.

[00:25:49] swyx: Yeah. I wonder what the, I mean, we'll know in 10 years, but I wonder what the y axis is for, for Ilya's SSI. We'll try to get them on.

[00:25:57] Alessio: Ilya, if you're listening, you're [00:26:00] welcome here.
Yeah, and then he had, you know, what comes next, like agents, synthetic data, inference compute, I thought all of that was like that.[00:26:05] Alessio: I don't[00:26:05] swyx: think he was dropping any alpha there. Yeah, yeah, yeah.[00:26:07] Alessio: Yeah. Any other NeurIPS highlights?[00:26:10] swyx: I think that there was comparatively a lot more work. Oh, by the way, I need to plug that, uh, my friend Yi made this, like, little nice paper. Yeah, that was really[00:26:20] swyx: nice.[00:26:20] swyx: Uh, of, uh, of, like, all the, he's, she called it must read papers of 2024.[00:26:26] swyx: So I laid out some of these at NeurIPS, and it was just gone. Like, everyone just picked it up. Because people are dying for, like, little guidance and visualizations. And so, uh, I thought it was really super nice that we got there.[00:26:38] Alessio: Should we do a Latent Space book for each year? Uh, I thought about it. For each year we should.[00:26:42] Alessio: Coffee table book. Yeah. Yeah. Okay. Put it in the will. Hi, Will. By the way, we haven't introduced you. He's our new, you know, general organist, Jamie. You need to[00:26:52] swyx: pull up more things. One thing I saw that, uh... Okay, one fun one, and then one [00:27:00] more general one. So the fun one is this paper on agent collusion. This is a paper on steganography.[00:27:06] swyx: This is "Secret Collusion among AI Agents: Multi-Agent Deception via Steganography". I tried to go to NeurIPS in order to find these kinds of papers because the real reason... Like, NeurIPS this year has a lottery system. A lot of people actually even go and don't buy tickets because they just go and attend the side events.[00:27:22] swyx: And then also the people who go end up crowding around the most popular papers, which you already know and already read before you showed up to NeurIPS. So the only reason you go there is to talk to the paper authors, but there's like something like 10,000 other. 
All these papers out there that, you know, are just people's work that they, that they did on the air and they failed to get attention for one reason or another.[00:27:42] swyx: And this was one of them. Uh, it was like all the way at the back. And this is a DeepMind paper that actually focuses on collusion between AI agents, uh, by hiding messages in the text that they generate. Uh, so that's what steganography is. So a very simple example would be the first letter of every word.[00:27:57] swyx: If you pick that out, you know, and the code sends a [00:28:00] different message than that. But something I've always emphasized is to LLMs, we read left to right. LLMs can read up, down, sideways, you know, in random character order. And it's the same to them as it is to us. So if we were ever to get, you know, self-motivated, unaligned LLMs that were trying to collaborate to take over the planet,[00:28:19] swyx: this would be how they do it. They spread messages among us in the messages that we generate. And he developed a scaling law for that. So he marked, I'm showing it on screen right now, the emergence of this phenomenon. Basically, for example, for cipher encoding, GPT-2, Llama 2, Mixtral, GPT-3.5, zero capabilities, and suddenly, GPT-4.[00:28:40] swyx: And this is the kind of Jason Wei type emergence properties that people kind of look for. I think what made this paper stand out as well, so he developed the benchmark for steganography collusion, and he also focused on Schelling point collusion, which is very low coordination. For agreeing on a decoding encoding format, you kind of need to have some [00:29:00] agreement on that.[00:29:00] swyx: But, but Schelling point means like very, very low or almost no coordination. So for example, if I, if I ask someone, if the only message I give you is meet me in New York and you're not aware where or when, you would probably meet me at Grand Central Station. 
That is, Grand Central Station is a Schelling point.[00:29:16] swyx: And it's probably somewhere, somewhere during the day. That is, the Schelling point of New York is Grand Central. To that extent, Schelling points for steganography are things like the, the, the common decoding methods that we talked about. It will be interesting at some point in the future when we are worried about alignment.[00:29:30] swyx: It is not interesting today, but it's interesting that DeepMind is already thinking about this.[00:29:36] Alessio: I think that's like one of the hardest things about NeurIPS. It's like the long tail. I[00:29:41] swyx: found a pricing guy. I'm going to feature him on the podcast. Basically, this guy from NVIDIA worked out the optimal pricing for language models.[00:29:51] swyx: It's basically an econometrics paper at NeurIPS, where everyone else is talking about GPUs. And the guy with the GPUs is[00:29:57] Alessio: talking[00:29:57] swyx: about economics instead. [00:30:00] That was the sort of fun one. So the focus I saw is that model papers at NeurIPS are kind of dead. No one really presents models anymore. It's just data sets.[00:30:12] swyx: This is all the grad students are working on. So like there was a data sets track and then I was looking around like, I was like, you don't need a data sets track because every paper is a data sets paper. And so data sets and benchmarks, they're kind of flip sides of the same thing. So yeah. Cool. Yeah, if you're a grad student, you're GPU poor, you kind of work on that.[00:30:30] swyx: And then the, the sort of big model people walk around and pick the ones that they like, and then they use it in their models. And that's, that's kind of how it develops. I, I feel like, um, like, like last year, you had people like Haotian, who worked on LLaVA, which is take Llama and add vision.[00:30:47] swyx: And then obviously xAI hired him and he added vision to Grok. Now he's the vision Grok guy. 
This year, I don't think there was any of those.[00:30:55] Alessio: What were the most popular, like, orals? Last year it was like the [00:31:00] Monarch Mixer, I think, was like the most attended. Yeah, uh, I need to look it up. Yeah, I mean, if nothing comes to mind, that's also kind of like an answer in a way.[00:31:10] Alessio: But I think last year there was a lot of interest in, like, furthering models and, like, different architectures and all of that.[00:31:16] swyx: I will say that I felt the orals, oral picks this year were not very good. Either that or maybe it's just a... So that's the highlight of how I have changed in terms of how I view papers.[00:31:29] swyx: So like, in my estimation, two of the best papers in this year for datasets are DataComp and RefinedWeb, or FineWeb. These are two actually industrially used papers, not highlighted for a while. I think DCLM got the spotlight, FineWeb didn't even get the spotlight. So like, it's just that the picks were different.[00:31:48] swyx: But one thing that does get a lot of play that a lot of people are debating is "The Road Less Scheduled". This is the schedule-free optimizer paper from Meta from Aaron Defazio. And this [00:32:00] year in the ML community, there's been a lot of chat about Shampoo, SOAP, all the bathroom amenities for optimizing your learning rates.[00:32:08] swyx: And, uh, most people at the big labs who I asked about this, um, say that it's cute, but it's not something that matters. I don't know, but it's something that was discussed and very, very popular. Four Wars[00:32:19] Alessio: of AI recap maybe, just quickly. Um, where do you want to start? Data?[00:32:26] swyx: So to remind people, this is the Four Wars piece that we did as one of our earlier recaps of this year.[00:32:31] swyx: And the belligerents are on the left, journalists, writers, artists, anyone who owns IP basically, New York Times, Stack Overflow, Reddit, Getty, Sarah Silverman, George RR Martin. 
Yeah, and I think this year we can add Scarlett Johansson to that side of the fence. So anyone suing OpenAI, basically. I actually wanted to get a snapshot of all the lawsuits.[00:32:52] swyx: I'm sure some lawyer can do it. That's the data quality war. On the right hand side, we have the synthetic data people, and I think we talked about Loubna's talk, you know, [00:33:00] really showing how much synthetic data has come along this year. I think there was a bit of a fight between Scale.ai and the synthetic data community, because Scale.ai[00:33:09] swyx: published a paper saying that synthetic data doesn't work. Surprise, surprise, Scale.ai is the leading vendor of non-synthetic data. Only[00:33:17] Alessio: cage free annotated data is useful.[00:33:21] swyx: So I think there's some debate going on there, but I don't think it's much debate anymore that at least synthetic data, for the reasons that are blessed in Loubna's talk, makes sense.[00:33:32] swyx: I don't know if you have any perspectives there.[00:33:34] Alessio: I think, again, going back to the reinforcement fine-tuning, I think that will change a little bit how people think about it. I think today people mostly use synthetic data, yeah, for distillation and kind of like fine-tuning a smaller model from like a larger model.[00:33:46] Alessio: I'm not super aware of how the frontier labs use it outside of like the "rephrase the web" thing that Apple also did. But yeah, I think it'll be useful. 
I think like whether or not that gets us the big [00:34:00] next step, I think that's maybe like TBD, you know, I think people love talking about data because it's like a GPU poor, you know, I think, uh, synthetic data is like something that people can do, you know, so they feel more opinionated about it compared to, yeah, the optimizers stuff, which is like,[00:34:17] swyx: they don't[00:34:17] Alessio: really work[00:34:18] swyx: on.[00:34:18] swyx: I think that there is an angle to the reasoning synthetic data. So this year, we covered in the paper club, the STaR series of papers. So that's STaR, Quiet-STaR, V-STaR. It basically helps you to synthesize reasoning steps, or at least distill reasoning steps from a verifier. And if you look at the OpenAI RFT API that they released, or that they announced, basically they're asking you to submit graders, or they choose from a preset list of graders.[00:34:49] swyx: Basically, it feels like a way to create valid synthetic data for them to fine-tune their reasoning paths on. Um, so I think that is another angle where it starts to make sense. And [00:35:00] so like, it's very funny that basically all the data quality wars between, let's say, the music industry or like the newspaper publishing industry or the textbooks industry and the big labs,[00:35:11] swyx: it's all of the pre-training era. And then like the new era, like the reasoning era, like nobody has any problem with all the reasoning, especially because it's all like sort of math and science oriented with, with very reasonable graders. I think the more interesting next step is how does it generalize beyond STEM?[00:35:27] swyx: We've been using o1 for... And I would say like for summarization and creative writing and instruction following, I think it's underrated. I started using o1 in our intro songs before we killed the intro songs, but it's very good at writing lyrics. 
You know, I can actually say like, I think one of the o1 pro demos,[00:35:46] swyx: all of these things that Noam was showing, was that, you know, you can write an entire paragraph or three paragraphs without using the letter A, right?[00:35:53] Creative Writing with AI[00:35:53] swyx: So like, like literally just anything instead of token, like not even token level, character level manipulation and [00:36:00] counting and instruction following. It's, uh, it's very, very strong.[00:36:02] swyx: And so no surprises when I ask it to rhyme, uh, and to, to create song lyrics, it's going to do that very much better than in previous models. So I think it's underrated for creative writing.[00:36:11] Alessio: Yeah.[00:36:12] Legal and Ethical Issues in AI[00:36:12] Alessio: What do you think is the rationale that they're going to have in court when they don't show you the thinking traces of o1, but then they want us to, like, they're getting sued for using other publishers' data, you know, but then on their end, they're like, well, you shouldn't be using my data to then train your model.[00:36:29] Alessio: So I'm curious to see how that kind of comes. Yeah, I mean, OpenAI has[00:36:32] swyx: many ways to, to punish people without bringing, taking them to court. Already banned ByteDance for distilling their, their info. And so anyone caught distilling the chain of thought will be just disallowed to continue on, on, on the API.[00:36:44] swyx: And it's fine. It's no big deal. Like, I don't even think that's an issue at all, just because the chain of thoughts are pretty well hidden. Like you have to work very, very hard to, to get it to leak. And then even when it leaks the chain of thought, you don't know if it's, if it's... [00:37:00] The bigger concern is actually that there's not that much IP hiding behind it, that Cosine, which we talked about, we talked to him on Dev Day, can just fine-tune[00:37:13] swyx: 4o to beat o1. 
Claude Sonnet so far is beating o1 on coding tasks without, at least o1 preview, without being a reasoning model, same for Gemini Pro or Gemini 2.0. So like, how much is reasoning important? How much of a moat is there in this, like, all of these are proprietary sort of training data that they've presumably accomplished?[00:37:34] swyx: Because even DeepSeek was able to do it. And they had, you know, two months notice to do this, to do R1. So, it's actually unclear how much moat there is. Obviously, you know, if you talk to the Strawberry team, they'll be like, yeah, I mean, we spent the last two years doing this. So, we don't know. And it's going to be interesting because there'll be a lot of noise from people who say they have inference time compute and actually don't because they just have fancy chain of thought.[00:38:00] swyx: And then there's other people who actually do have very good chain of thought. And you will not see them on the same level as OpenAI because OpenAI has invested a lot in building up the mythology of their team. Um, which makes sense. Like the real answer is somewhere in between.[00:38:13] Alessio: Yeah, I think that's kind of like the main data war story developing.[00:38:18] The Data War: GPU Poor vs. GPU Rich[00:38:18] Alessio: GPU poor versus GPU rich. Yeah. Where do you think we are? I think there was, again, going back to like the small model thing, there was like a time in which the GPU poor were kind of like the rebel faction working on like these models that were like open and small and cheap. And I think today people don't really care as much about GPUs anymore.[00:38:37] Alessio: You also see it in the price of the GPUs. Like, you know, that market is kind of like plummeted because there's people don't want to be, they want to be GPU free. They don't even want to be poor. They just want to be, you know, completely without them. Yeah. How do you think about this war? 
You[00:38:52] swyx: can tell me about this, but like, I feel like the, the appetite for GPU rich startups, like the, you know, the, the funding plan is we will raise 60 million and [00:39:00] we'll give 50 of that to NVIDIA.[00:39:01] swyx: That is gone, right? Like, no one's, no one's pitching that. This was literally the plan, the exact plan of like, I can name like four or five startups, you know, this time last year. So yeah, GPU rich startups gone.[00:39:12] The Rise of GPU Ultra Rich[00:39:12] swyx: But I think like, the GPU ultra rich, the GPU ultra high net worth is still going. So, um, now we're, you know, we had Leopold's essay on the trillion dollar cluster.[00:39:23] swyx: We're not quite there yet. We have multiple labs, um, you know, xAI very famously, you know, Jensen Huang praising them for being best boy number one in spinning up a 100,000-GPU cluster in like 12 days or something. So likewise at Meta, likewise at OpenAI, likewise at the other labs as well. So like the GPU ultra rich are going to keep doing that because I think partially it's an article of faith now that you just need it.[00:39:46] swyx: Like you don't even know what it's going to, what you're going to use it for. You just, you just need it. And it makes sense that if, especially if we're going into more researchy territory than we are. So let's say 2020 to 2023 was [00:40:00] let's scale big models territory because we had GPT-3 in 2020 and we were like, okay, we'll go from[00:40:05] swyx: 175B to 1.8T. And that was GPT-3 to GPT-4. Okay, that's done. As far as everyone is concerned, Opus 3.5 is not coming out, GPT-4.5 is not coming out, and Gemini 2, we don't have Pro, whatever. We've hit that wall. Maybe I'll call it the 2 trillion parameter wall. We're not going to 10 trillion. No one thinks it's a good idea, at least from training costs, from the amount of data, or at least the inference.[00:40:36] swyx: Would you pay 10x the price of GPT... Probably not. 
Like, like you want something else that, that is at least more useful. So it makes sense that people are pivoting in terms of their inference paradigm.[00:40:47] Emerging Trends in AI Models[00:40:47] swyx: And so when it's more researchy, then you actually need more just general purpose compute to mess around with, uh, at the exact same time that production deployments of the old, the previous paradigm is still ramping up,[00:40:58] swyx: um,[00:40:58] swyx: uh, pretty aggressively.[00:40:59] swyx: So [00:41:00] it makes sense that the GPU rich are growing. We have now interviewed both Together and Fireworks and Replicate. Uh, we haven't done Anyscale yet. But I think Amazon, maybe kind of a sleeper one, Amazon, in a sense of like they, at re:Invent, I wasn't expecting them to do so well, but they are now a foundation model lab.[00:41:18] swyx: It's kind of interesting. Um, I think, uh, you know, David went over there and started just creating models.[00:41:25] Alessio: Yeah, I mean, that's the power of prepaid contracts. I think like a lot of AWS customers, you know, they do this big reserve instance contracts and now they got to use their money. That's why so many startups[00:41:37] Alessio: get bought through the AWS marketplace so they can kind of bundle them together and prefer pricing.[00:41:42] swyx: Okay, so maybe GPU super rich doing very well, GPU middle class dead, and then GPU[00:41:48] Alessio: poor. I mean, my thing is like, everybody should just be GPU rich. There shouldn't really be, even the GPU poorest, it's like, does it really make sense to be GPU poor?[00:41:57] Alessio: Like, if you're GPU poor, you should just use the [00:42:00] cloud. Yes, you know, and I think there might be a future once we kind of like figure out what the size and shape of these models is where like the tinybox and these things come to fruition where like you can be GPU poor at home. 
But I think today is like, why are you working so hard to like get these models to run on like very small clusters where it's like, it's so cheap to run them.[00:42:21] Alessio: Yeah, yeah,[00:42:22] swyx: yeah. I think mostly people think it's cool. People think it's a stepping stone to scaling up. So they aspire to be GPU rich one day and they're working on new methods. Like Nous Research, like probably the most deep tech thing they've done this year is DisTrO or whatever the new name is.[00:42:38] swyx: There's a lot of interest in heterogeneous computing, distributed computing. I tend generally to de-emphasize that historically, but it may be coming to a time where it is starting to be relevant. I don't know. You know, SF Compute launched their compute marketplace this year, and like, who's really using that?[00:42:53] swyx: Like, it's a bunch of small clusters, disparate types of compute, and if you can make that [00:43:00] useful, then that will be very beneficial to the broader community, but maybe still not the source of frontier models. It's just going to be a second tier of compute that is unlocked for people, and that's fine. But yeah, I mean, I think this year, I would say a lot more on device. We are... I now have Apple Intelligence on my phone.[00:43:19] swyx: Doesn't do anything apart from summarize my notifications. But still, not bad. Like, it's multimodal.[00:43:25] Alessio: Yeah, the notification summaries are so-so in my experience.[00:43:29] swyx: Yeah, but they add, they add juice to life. And then, um, Chrome Nano, uh, Gemini Nano is coming out in Chrome. Uh, they're still feature flagged, but you can, you can try it now if you, if you use the, uh, the alpha.[00:43:40] swyx: And so, like, I, I think, like, you know, we're getting the sort of GPU poor version of a lot of these things coming out, and I think it's like quite useful. Like Windows as well, rolling out RWKV in sort of every Windows deployment is super cool. 
And I think the last thing that I never put in this GPU poor war, that I think I should now, [00:44:00] is the number of startups that are GPU poor but still scaling very well, as sort of wrappers on top of either a foundation model lab, or GPU cloud.[00:44:10] swyx: GPU cloud, it would be Suno. Suno, Ramp has rated as one of the top ranked, fastest growing startups of the year. Um, I think the last public number is like zero to 20 million this year in ARR and Suno runs on Modal. So Suno itself is not GPU rich, but they're just doing the training on, on Modal, uh, who we've also talked to on, on the podcast.[00:44:31] swyx: The other one would be Bolt, straight Claude wrapper. And, and, um, again, another, now they've announced 20 million ARR, which is another step up from our 8 million that we put on the title. So yeah, I mean, it's crazy that all these GPU poors are finding a way while the GPU riches are also finding a way. And then the only failures, I kind of call this the GPU smiling curve, where the edges do well, because you're either close to the machines, and you're like [00:45:00] number one on the machines, or you're like close to the customers, and you're number one on the customer side.[00:45:03] swyx: And the people who are in the middle, Inflection, um, Character, didn't do that great. I think Character did the best of all of them. Like, you have a note in here that we apparently said that Character's price tag was[00:45:15] Alessio: 1B.[00:45:15] swyx: Did I say that?[00:45:16] Alessio: Yeah. You said Google should just buy them for 1B. I thought it was a crazy number.[00:45:20] Alessio: Then they paid 2.7 billion. I mean, for like,[00:45:22] swyx: yeah.[00:45:22] Alessio: What do you pay for Noam? Like, I don't know what the game world was like. Maybe the starting price was 1B. I mean, whatever it was, it worked out for everybody involved.[00:45:31] The Multi-Modality War[00:45:31] Alessio: Multimodality war. 
And this one, we never had text to video in the first version, which now is the hottest.[00:45:37] swyx: Yeah, I would say it's a subset of image, but yes.[00:45:40] Alessio: Yeah, well, but I think at the time it wasn't really something people were doing, and now we had Veo 2 just came out yesterday. Uh, Sora was released last month, last week. I've not tried Sora, because the day that I tried, it wasn't, yeah. I[00:45:54] swyx: think it's generally available now, you can go to sora.com[00:45:56] swyx: and try it. Yeah, they had[00:45:58] Alessio: the outage. Which I [00:46:00] think also played a part into it. Small things. Yeah. What's the other model that you posted today that was on Replicate? video-01-live?[00:46:08] swyx: Yeah. Very, very nondescript name, but it is from MiniMax, which I think is a Chinese lab. The Chinese labs do surprisingly well at the video models.[00:46:20] swyx: I'm not sure it's actually Chinese. I don't know. Hold me up to that. Yep. China. It's good. Yeah, the Chinese love video. What can I say? They have a lot of training data for video. Or a more relaxed regulatory environment.[00:46:37] Alessio: Uh, well, sure, in some way. Yeah, I don't think there's much else there. I think like, you know, on the image side, I think it's still open.[00:46:45] Alessio: Yeah, I mean,[00:46:46] swyx: ElevenLabs is now a unicorn. So basically, what is multi modality war? Multi modality war is, do you specialize in a single modality, right? Or do you have a God model that does all the modalities? So this is [00:47:00] definitely still going, in a sense of ElevenLabs, you know, now unicorn, Pika Labs doing well, they launched Pika 2.0[00:47:06] swyx: recently, HeyGen, I think has reached 100 million ARR, Assembly, I don't know, but they have billboards all over the place, so I assume they're doing very, very well. So these are all specialist models, specialist models and specialist startups. 
And then there's the big labs who are doing the sort of all in one play.[00:47:24] swyx: And then here I would highlight Gemini 2 for having native image output. Have you seen the demos? Um, yeah, it's, it's hard to keep up. Literally they launched this last week and a shout out to Paige Bailey, who came to the Latent Space event to demo on the day of launch. And she wasn't prepared. She was just like, I'm just going to show you.[00:47:43] swyx: So they have voice. They have, you know, obviously image input, and then they obviously can code gen and all that. But the new one that OpenAI and Meta both have but they haven't launched yet is image output. So you can literally, um, I think their demo video was that you put in an image of a [00:48:00] car, and you ask for minor modifications to that car.[00:48:02] swyx: They can generate you that modification exactly as you asked. So there's no need for the Stable Diffusion or ComfyUI workflow of like mask here and then like infill there, inpaint there and all that, all that stuff. This is small model nonsense. Big model people are like, huh, we got you in as everything in the transformer.[00:48:21] swyx: This is the multimodality war, which is, do you, do you bet on the God model or do you string together a whole bunch of, uh, small models like a, like a chump. Yeah,[00:48:29] Alessio: I don't know, man. Yeah, that would be interesting. I mean, obviously I use Midjourney for all of our thumbnails. Um, they've been doing a ton on the product, I would say.[00:48:38] Alessio: They launched a new Midjourney editor thing. They've been doing a ton. Because I think, yeah, the motto is kind of like, maybe, you know, people say Black Forest, the Black Forest models are better than Midjourney on a pixel by pixel basis. But I think when you put it, put it together, have you tried[00:48:53] swyx: the same prompts on Black Forest?[00:48:55] Alessio: Yes. 
But the problem is just like, you know, on Black Forest, it generates one image. And then it's like, you got to [00:49:00] regenerate. You don't have all these like UI things. Like what I do, no, but it's like time issue, you know, it's like a Mid[00:49:06] swyx: journey. Call the API four times.[00:49:08] Alessio: No, but then there's no like variate.[00:49:10] Alessio: Like the good thing about Midjourney is like, you just go in there and you're cooking. There's a lot of stuff that just makes it really easy. And I think people underestimate that. Like, it's not really a skill issue, because I'm paying Midjourney, so it's a Black Forest skill issue, because I'm not paying them, you know?[00:49:24] Alessio: Yeah,[00:49:25] swyx: so, okay, so, uh, this is a UX thing, right? Like, you, you, you understand that, at least, we think that Black Forest should be able to do all that stuff. I will also shout out, Recraft has come out, uh, on top of the image arena that, uh, Artificial Analysis has done, has apparently, uh, taken Flux's place. Is this still true?[00:49:41] swyx: So, Artificial Analysis is now a company. I highlighted them I think in one of the early AI Newses of the year. And they have launched a whole bunch of arenas. So, they're trying to take on LM Arena, Anastasios and crew. And they have an image arena. Oh yeah, Recraft v3 is now beating Flux 1.1. Which is very surprising [00:50:00] because Flux and Black Forest Labs are the old Stable Diffusion crew who left Stability after, um, the management issues.[00:50:06] swyx: So Recraft has come from nowhere to be the top image model. Uh, very, very strange. I would also highlight that Grok has now launched Aurora, which is, it's very interesting dynamics between Grok and Black Forest Labs because Grok's images were originally launched, uh, in partnership with Black Forest Labs as a, as a thin wrapper.[00:50:24] swyx: And then Grok was like, no, we'll make our own. And so they've made their own. 
I don't know, there are no APIs or benchmarks about it. They just announced it. So yeah, that's the multi modality war. I would say that so far, the small model, the dedicated model people are winning, because they are just focused on their tasks.[00:50:42] swyx: But the big model people are always catching up. And the moment I saw the Gemini 2 demo of image editing, where I can put in an image and just request it and it does, that's how AI should work. Not like a whole bunch of complicated steps. So it really is something. And I think one frontier that we haven't [00:51:00] seen this year, like obviously video has done very well, and it will continue to grow.[00:51:03] swyx: You know, we only have Sora Turbo today, but at some point we'll get full Sora. Or at least the Hollywood labs will get full Sora. We haven't seen video to audio, or video synced to audio. And so the researchers that I talked to are already starting to talk about that as the next frontier. But there's still maybe like five more years of video left to actually be SOTA.[00:51:23] swyx: I would say that Gemini's approach... Compared to OpenAI, Gemini's, or DeepMind's, approach to video seems a lot more fully fledged than OpenAI's. Because if you look at the ICML recap that I published that so far nobody has listened to, um, that people have listened to it. It's just a different, definitely different audience.[00:51:43] swyx: It's only seven hours long. Why are people not listening? It's like everything in... Uh, so, so DeepMind has, is working on Genie. They also launched Genie 2 and VideoPoet. So, like, they have maybe four years advantage on world modeling that OpenAI does not have. Because OpenAI basically only started [00:52:00] Diffusion Transformers last year, you know, when they hired, uh, Bill Peebles.[00:52:03] swyx: So, DeepMind has, has a bit of advantage here, I would say, in, in, in showing, like, the reason that Veo 2, well, one, they cherry-pick their videos. 
So obviously it looks better than Sora, but the reason I would believe that Veo 2, uh, when it's fully launched will do very well is because they have all this background work in video that they've done for years.[00:52:22] swyx: Like, like last year's NeurIPS, I already was interviewing some of their video people. I forget their model name, but for, for people who are dedicated fans, they can go to NeurIPS 2023 and see, see that paper.[00:52:32] Alessio: And then last but not least, the LLM OS. We renamed it to RAG/Ops, formerly known as[00:52:39] swyx: RAG/Ops War. I put the latest chart on the Braintrust episode.[00:52:43] swyx: I think I'm going to separate these essays from the episode notes. So the reason I used to do that, by the way, is because I wanted to show up on Hacker News. I wanted the podcast to show up on Hacker News. So I always put an essay inside of there because Hacker News people like to read and not listen.[00:52:58] Alessio: So episode essays,[00:52:59] swyx: I remember [00:53:00] purchasing them separately. You say LangChain, LlamaIndex is still growing.[00:53:03] Alessio: Yeah, so I looked at the PyPI stats, you know. I don't care about stars. On PyPI you see... Do you want to share your screen? Yes. I prefer to look at actual downloads, not at stars on GitHub. So if you look at, you know, LangChain still growing.[00:53:20] Alessio: These are the last six months. LlamaIndex still growing. What I've basically seen is like things that, one, obviously these things have a commercial product. So there's like people buying this and sticking with it versus kind of hopping in between things versus, you know, for example, CrewAI, not really growing as much.[00:53:38] Alessio: The stars are growing. If you look on GitHub, like the stars are growing, but kind of like the usage is kind of like flat. In the last six months, have they done some[00:53:4

OnBoard!
EP 54. In-Depth Conversation with Top Open-Source AI Projects: The Open-Source LLM Ecosystem, Agents, and China's Contribution

OnBoard!

Play Episode Listen Later Dec 16, 2024 199:06


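When it comes to the development of generative AI, open source is one of the most important topics. The guests on this episode cover the most noteworthy companies in open-source large models. It really is an all-star lineup!

First, a quick update: the first OnBoard! in-person listener meetup, held in Beijing last Sunday, far exceeded expectations. More than 250 people signed up within 4 days of registration opening. From 9 a.m. to 3 p.m. on Sunday, with topics ranging from robotics to AI, startup investing, and taking software overseas, the 100-person venue was nearly full right up to the end. Thank you all so much for your support!

Hello World, who is OnBoard!?

Back to this episode: we take a deep dive into the open-source ecosystem around large models. In the year-plus of rapid progress in generative AI, open source has been impossible to ignore. Open models have advanced quickly, from Meta's Llama 3 to Mistral's latest models, and their pursuit of closed models such as GPT4 is not only impressive but has also accelerated the real-world adoption of AI products. The ecosystem around large models, from inference acceleration to developer tools to agents, has a rich technology stack that has already produced leading companies like Langchain, yet all of this seems to be only the tip of the iceberg.

It is especially worth noting that as models led by Chinese teams, such as Alibaba's Qwen series, Deepseek, and Yi, emerge on the international stage, we cannot help but ask whether, beyond imitating and catching up, China's work on large models has further achievements worth our attention and pride.

Today, Monica is honored to host several highly representative guests: an open-source veteran from Huggingface; the open-source lead of Tongyi Qianwen (Qwen), who is also a core member of OpenDevin, the most closely watched project in the agent space; and the lead of vLLM, the open-source project with the greatest international influence. Together they cover front-line perspectives from every part of the open-source large-model ecosystem.

The guests had so much to share that the conversation extended to every aspect of large models, and we recorded for nearly 4 hours. In the first half we talk about innovation in infra, and about the technology and ecosystem behind software-development agents such as the recently popular OpenDevin. In the second half we return to the theme of open-source large models and discuss:
How might the open-source and closed-source ecosystems for foundation models evolve?
How does commercializing open models compare with the open-source business models we saw in the big-data era, such as databricks?
How do you build an open-source project with international influence?

Guest introductions
Tiezhen Wang, engineer at Huggingface. He is a bridge between the Chinese and global open-source AI ecosystems. From the Google TensorFlow era to being an early employee at Huggingface, he has deep insight into the open-source AI ecosystem in China and worldwide.
Junyang Lin, open-source lead for Tongyi Qianwen. As Qwen's main spokesperson in the global open-source community, he has not only witnessed the development of open source but is also a core team member of OpenDevin, the closely watched open-source agent project.
Zhuohan Li, UC Berkeley PhD. The project he leads is vLLM, the large-model inference framework that has become an industry standard. His lab, Sky Lab, is known as a cradle of open-source infrastructure, from Databricks (valued at tens of billions of US dollars) to Anyscale (the company commercializing the open-source compute framework Ray). He has also been deeply involved in Chat Arena, Vicuna, and other internationally known open-source projects, and brings both first-hand international experience and technically idealistic insight into the ecosystem and infra around large models.
OnBoard! host: Monica, USD-fund VC investor, formerly with the AWS Silicon Valley team and an AI startup, author of the WeChat official account M小姐研习录 (ID: MissMStudy) | Jike: 莫妮卡同学

We also cover core large-model topics such as data and evaluation. The discussion is comprehensive without losing the depth of front-line practitioners, so we decided not to split it into two parts. You can use the timestamps in the show notes to jump straight to the topics that interest you (though I think every topic is great!). After all this introduction, a disclosure: the open-source community Huggingface and the open-source projects discussed in the episode, including Alibaba's Qwen, OpenDevin, Deepseek, 01.AI's Yi, and vLLM, paid for no advertising. Everything is the guests' sincere sharing, with no ads throughout. Of course, if you or other AI companies are considering sponsoring our labor-of-love podcast, you are welcome to!

The three-hour hardcore marathon begins. Enjoy!

What we talked about
05:28 Guest self-introductions and interesting open-source AI projects
18:37 How did vLLM start, how did it become a top global project, and why do we need a large-model inference framework?
30:24 Agent framework: what inference challenges do complex agents like OpenDevin bring?
40:37 What new tools are still needed to build a good coding agent? What changes will multimodality bring?
56:16 What kind of agent framework do we need? Why is the open-source community best placed to build it? Will frameworks converge?
67:46 What is Crew AI? How should we think about multi-agent architectures?
73:11 Drawing on the history of front-end frameworks, how does a framework become an industry standard?
77:54 The state of open-source LLMs on Huggingface: what were the important developments over the past year or so? What different ways of open-sourcing are there? How should we judge the popularity of an open-source model?
94:27 How should we understand the open-source model ecosystems built on different architectures? How does Qwen build a better open-source ecosystem through architectural evolution?
104:59 What innovations have come from Chinese open-source large-model projects? How are model architectures changing?
112:17 Why might new model architectures bring new commercialization opportunities? What can we learn from earlier open-source commercialization?
119:22 Have we seen the ceiling of existing large-model architectures? What would a new architecture be?
128:03 What did Zhuohan learn from working on Vicuna, one of the earliest open-source LLMs? How do academia and industry divide the work in the large-model ecosystem?
140:48 What developments in datasets for large models are worth watching?
149:42 Why did Mistral take off so quickly? What can be learned about building a first-class international open-source project? What lessons on principles and tactics does vLLM offer?
166:13 How did Chatbot Arena start? Why is model evaluation so important? What challenges and possible developments remain?
180:49 What are Zhuohan's thoughts on how vLLM could be commercialized? How much further can inference costs fall?
188:17 Quick-fire questions: over the past year, what in generative AI exceeded or fell short of expectations? What is there to look forward to?

Companies and key terms we mentioned
Qwen, Qwen-2
OpenDevin: opendevin.github.io
vLLM: github.com
Yi (Github), 01.AI (零一万物)
Chatbot Arena: huggingface.co
AutoGPT: github.com
crew AI: www.crewai.com
autoAWQ: github.com
LLM.c: github.com
Flash attention: github.com
Continuous batching: an LLM serving technique that schedules work at the level of individual generation steps, so new requests can join the running batch and finished ones can leave it without waiting for the whole batch to complete, which improves throughput.
KV cache: a cache of the attention keys and values already computed for earlier tokens, so they do not have to be recomputed at every decoding step.
Page attention (PagedAttention in vLLM): an attention implementation that stores the KV cache in fixed-size blocks that need not be contiguous in memory, inspired by operating-system paging, to reduce memory fragmentation and waste.
Quantization: reducing the precision of a model's numbers to fewer bits in order to shrink the model and improve computational efficiency.
Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
Google Gemini: deepmind.google
Adept: www.adept.ai
MetaGPT: github.com
Dolphin: an open-source, uncensored, and commercially licensed dataset and series of instruct-tuned language models based on Microsoft's Orca paper
Common crawl: commoncrawl.org
Tiezhen's report: Booming Open Source Chinese-Speaking LLMs: A Closer Look, Slides
One Year of Tongyi Qianwen: Choices and Reflections on the Open-Source Fast Track | ModelScope In-Depth Interview
Alibaba's Junyang Lin: Large Models Are Not Enough for Many People, and Building Multimodal Agents Is Key | China AIGC Industry Summit

Follow Miss M's WeChat official account for more in-depth content on Chinese and US software, AI, and startup investing: M小姐研习录 (ID: MissMStudy). If you like OnBoard!, you can also tip us and buy us a coffee. If you listen on Apple Podcasts, please give us a five-star rating; it matters a lot to us.

Finally, come join the OnBoard! listener group to meet other high-quality listeners. We also organize in-person themed meetups, open live sit-ins on podcast recordings, guest interactions, and other new experiments. Add either assistant on WeChat, onboard666 or Nine_tunes, and they will add you to the group. We look forward to seeing you!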
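To make the "Quantization" entry in the term list above more concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It is an illustration only: the function names `quantize_int8` and `dequantize` are hypothetical, the scheme (one scale per list, values mapped to [-127, 127]) is just one simple variant, and it is not the method used by autoAWQ, vLLM, or any project mentioned in the episode.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale.

    Hypothetical helper for illustration; real libraries typically use
    per-channel or per-group scales and calibration data.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero when every value is 0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Each restored value differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The trade-off is the usual one: each weight now fits in 8 bits instead of 32, at the cost of a rounding error bounded by half the scale.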

ExplAInable
Autonomous Agents with Amit Mandelbaum

ExplAInable

Play Episode Listen Later Nov 25, 2024 49:43


In the next episode we talk with Amit Mandelbaum about the development of autonomous agents built on advanced computer-vision models and LLMs. We address the challenges of combining reasoning and judgment capabilities, and the problems seen in projects like AutoGPT. We present Anthropic's success in understanding actions from computer screens and explain how these technologies improve the handling of computer-based tasks in a way that is close to human behavior.   https://medium.com/@luke.birdeau/reverse-engineering-chatgpt-o1-5cf3b61c6eee AI Agents That Matter https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities

The INDUStry Show
The INDUStry Show w Saumya Bhatnagar

The INDUStry Show

Play Episode Listen Later Jul 27, 2024 26:28


Saumya Bhatnagar is the Chief Product Officer and co-founder of Jeeva AI (formerly involve.ai) - automating all manual tasks and using AutoGPT to find your next 500 customers. She is a multi-entrepreneur, recipient of the Stevie Gold Entrepreneur of the Year award, Forbes 30 Under 30 alum, National Diversity Council's prestigious 50 Most Powerful Women in Tech, Top 50 Women CPO's in the US by WWA. --- Support this podcast: https://podcasters.spotify.com/pod/show/theindustryshow/support

The Nonlinear Library
AF - Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents by Sam Brown

The Nonlinear Library

Play Episode Listen Later Jul 22, 2024 26:02


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents, published by Sam Brown on July 22, 2024 on The AI Alignment Forum. Summary Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement. This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM agents. This 'Auto-Enhancement benchmark' measures the ability of 'top-level' agents to improve the performance of 'reference' agents on 'component' benchmarks, such as CyberSecEval 2, MLAgentBench, SWE-bench, and WMDP. Results are mostly left for a future post in the coming weeks. Scaffolds such as AutoGPT, ReAct, and SWE-agent can be built around LLMs to build LLM agents, with abilities such as long-term planning and context-window management to enable them to carry out complex general-purpose tasks autonomously. LLM agents can fix issues in large, complex code bases (see SWE-bench), and interact in a general way using web browsers, Linux shells, and Python interpreters. In this post, we outline our plans for a project to measure these LLM agents' ability to modify other LLM agents, undertaken as part of Axiom Futures' Alignment Research Fellowship. Our proposed benchmark consists of "enhancement tasks," which measure the ability of an LLM agent to improve the performance of another LLM agent (which may be a clone of the first agent) on various tasks. Our benchmark uses existing benchmarks as components to measure LLM agent capabilities in various domains, such as software engineering, cybersecurity exploitation, and others. We believe these benchmarks are consequential in the sense that good performance by agents on these tasks should be concerning for us. 
We plan to write an update post with our results at the end of the Fellowship, and we will link this post to that update. Motivation Agents are capable of complex SWE tasks (see, e.g., Yang et al.). One such task could be the improvement of other scaffolded agents. This capability would be a key component of autonomous replication and adaptation (ARA), and we believe it would be generally recognised as an important step towards extreme capabilities. This post outlines our initial plans for developing a novel benchmark that aims to measure the ability of LLM-based agents to improve other LLM-based agents, including those that are as capable as themselves. Threat model We present two threat models that aim to capture how AI systems may develop super-intelligent capabilities. Expediting AI research: Recent trends show how researchers are leveraging LLMs to expedite academic paper reviews (see Du et al.). ML researchers are beginning to use LLMs to design and train more advanced models (see Cotra's AIs accelerating AI research and Anthropic's work on Constitutional AI). Such LLM-assisted research may expedite progress toward super-intelligent systems. Autonomy: Another way that such capabilities are developed is through LLM agents themselves becoming competent enough to self-modify and further ML research without human assistance (see section Hard Takeoff in this note ), leading to an autonomously replicating and adapting system. Our proposed benchmark aims to quantify the ability of LLM agents to bring about such recursive self-improvement, either with or without detailed human supervision. Categories of bottlenecks and overhang risks We posit that there are three distinct categories of bottlenecks to LLM agent capabilities: 1. Architectures-of-thought, such as structured planning, progress-summarisation, hierarchy of agents, self-critique, chain-of-thought, self-consistency, prompt engineering/elicitation, and so on. 
Broadly speaking, this encompasses everything between the LL...

Training Data
LangChain's Harrison Chase on Building the Orchestration Layer for AI Agents

Training Data

Play Episode Listen Later Jun 18, 2024 49:50


Last year, AutoGPT and Baby AGI captured our imaginations—agents quickly became the buzzword of the day…and then things went quiet. AutoGPT and Baby AGI may have marked a peak in the hype cycle, but this year has seen a wave of agentic breakouts on the product side, from Klarna's customer support AI to Cognition's Devin, etc. Harrison Chase of LangChain is focused on enabling the orchestration layer for agents. In this conversation, he explains what's changed that's allowing agents to improve performance and find traction.  Harrison shares what he's optimistic about, where he sees promise for agents vs. what he thinks will be trained into models themselves, and discusses novel kinds of UX that he imagines might transform how we experience agents in the future.      Hosted by: Sonya Huang and Pat Grady, Sequoia Capital Mentioned:  ReAct: Synergizing Reasoning and Acting in Language Models, the first cognitive architecture for agents SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, small-model open-source software engineering agent from researchers at Princeton Devin, autonomous software engineering from Cognition V0: Generative UI agent from Vercel GPT Researcher, a research agent  Language Model Cascades: 2022 paper by Google Brain and now OpenAI researcher David Dohan that was influential for Harrison in developing LangChain Transcript: https://www.sequoiacap.com/podcast/training-data-harrison-chase/ 00:00 Introduction 01:21 What are agents?  05:00 What is LangChain's role in the agent ecosystem? 11:13 What is a cognitive architecture?  13:20 Is bespoke and hard coded the way the world is going, or a stop gap? 18:48 Focus on what makes your beer taste better 20:37 So what?  22:20 Where are agents getting traction? 25:35 Reflection, chain of thought, other techniques? 30:42 UX can influence the effectiveness of the architecture 35:30 What's out of scope? 38:04 Fine tuning vs prompting? 
42:17 Existing observability tools for LLMs vs needing a new architecture/approach 45:38 Lightning round

The Nonlinear Library
AF - We might be dropping the ball on Autonomous Replication and Adaptation. by Charbel-Raphael Segerie

The Nonlinear Library

Play Episode Listen Later May 31, 2024 7:28


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We might be dropping the ball on Autonomous Replication and Adaptation., published by Charbel-Raphael Segerie on May 31, 2024 on The AI Alignment Forum. Here is a little Q&A Can you explain your position quickly? I think autonomous replication and adaptation in the wild is under-discussed as an AI threat model. And this makes me sad because this is one of the main reasons I'm worried. I think one of AI Safety people's main proposals should first focus on creating a nonproliferation treaty. Without this treaty, I think we are screwed. The more I think about it, the more I think we are approaching a point of no return. It seems to me that open source is a severe threat and that nobody is really on the ball. Before those powerful AIs can self-replicate and adapt, AI development will be very positive overall and difficult to stop, but it's too late after AI is able to adapt and evolve autonomously because Natural selection favors AI over humans. What is ARA? Autonomous Replication and Adaptation. Let's recap this quickly. Today, generative AI functions as a tool: you ask a question and the tool answers. Question, answer. It's simple. However, we are heading towards a new era of AI, one with autonomous AI. Instead of asking a question, you give it a goal, and the AI performs a series of actions to achieve that goal, which is much more powerful. Libraries like AutoGPT or ChatGPT, when they navigate the internet, already show what these agents might look like. Agency is much more powerful and dangerous than AI tools. Thus conceived, AI would be able to replicate autonomously, copying itself from one computer to another, like a particularly intelligent virus. 
To replicate on a new computer, it must navigate the internet, create a new account on AWS, pay for the virtual machine, install the new weights on this machine, and start the replication process. According to METR, the organization that audited OpenAI, a dozen tasks indicate ARA capabilities. GPT-4 plus basic scaffolding was capable of performing a few of these tasks, though not robustly. This was over a year ago, with primitive scaffolding, no dedicated training for agency, and no reinforcement learning. Multimodal AIs can now successfully pass CAPTCHAs. ARA is probably coming. It could be very sudden. One of the main variables for self-replication is whether the AI can pay for cloud GPUs. Let's say a GPU costs $1 per hour. The question is whether the AI can generate $1 per hour autonomously continuously. Then, you have something like an exponential process. I think that the number of AIs is probably going to plateau, but regardless of a plateau and the number of AIs you get asymptotically, here you are: this is an autonomous AI, which may become like an endemic virus that is hard to shut down. Is ARA a point of no return? Yes, I think ARA with full adaptation in the wild is beyond the point of no return. Once there is an open-source ARA model or a leak of a model capable of generating enough money for its survival and reproduction and able to adapt to avoid detection and shutdown, it will be probably too late: The idea of making an ARA bot is very accessible. The seed model would already be torrented and undeletable. Stop the internet? The entire world's logistics depend on the internet. In practice, this would mean starving the cities over time. Even if you manage to stop the internet, once the ARA bot is running, it will be unkillable. Even rebooting all providers like AWS would not suffice, as individuals could download and relaunch the model, or the agent could hibernate on local computers. 
The cost to completely eradicate it altogether would be way too high, and it only needs to persist in one place to spread again. The question is more interesting for ARA with incomplete adaptation capabilities. It is likely th...

Data Engineering Podcast
Zenlytic Is Building You A Better Coworker With AI Agents

Data Engineering Podcast

Play Episode Listen Later May 19, 2024 54:19


Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. 
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI powered agents for interacting with your data Interview Introduction How did you get involved in data? In AI? Can you describe what Zenlytic is and the role that AI is playing in your platform? What have been the key stages in your AI journey? What are some of the dead ends that you ran into along the path to where you are today? What are some of the persistent challenges that you are facing? So tell us more about data agents. Firstly, what are data agents and why do you think they're important? How are data agents different from chatbots? Are data agents harder to build? How do you make them work in production? What other technical architectures have you had to develop to support the use of AI in Zenlytic? How have you approached the work of customer education as you introduce this functionality? What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do? How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses? What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used? 
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence? When is an AI agent the wrong choice? What do you have planned for the future of AI in the Zenlytic product? Contact Info Ryan LinkedIn (https://www.linkedin.com/in/janssenryan) Paul LinkedIn (https://www.linkedin.com/in/paulblankley/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Zenlytic (https://www.zenlytic.com/) Podcast Episode (https://www.dataengineeringpodcast.com/zenlytic-self-serve-business-intelligence-episode-371) Attention is all you need (https://arxiv.org/abs/1706.03762) Transformers (https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) BERT (https://en.wikipedia.org/wiki/BERT_(language_model)) The Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) Richard Sutton PID Loops (https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller) AutoGPT (https://github.com/Significant-Gravitas/AutoGPT) Devin.ai (https://www.cognition.ai/introducing-devin) Google Gemini (https://gemini.google.com/) Anthropic Claude (https://www.anthropic.com/claude) OpenAI Code Interpreter (https://platform.openai.com/docs/assistants/tools/code-interpreter) Edward Tufte (https://www.edwardtufte.com/tufte/books_vdqi) Looker ActionHub (https://developers.looker.com/actions/overview/) OAuth (https://oauth.net/2/) GitHub Copilot (https://github.com/features/copilot) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Hacker Public Radio
HPR4117: JAMBOREE !

Hacker Public Radio

Play Episode Listen Later May 14, 2024


https://github.com/freeload101/Java-Android-Magisk-Burp-Objection-Root-Emulator-Easy Java Android Magisk Burp Objection Root Emulator Easy (JAMBOREE) Get a working portable Python/Git/Java environment on Windows in SECONDS without having local administrator, regardless of your broken Python or other environment variables. Our open-source script downloads directly from proper sources without any binaries. While the code may not be perfect, it includes many useful PowerShell tricks. Run Android apps and pentest without the adware and malware of BlueStacks or NOX.
Features / Request Core Status
RMS: Runtime Mobile Security ✔️
Brida, Burp to Frida bridge ❌
SafetyNet+ Bypass ❌
Burp Suite Pro / CloudFlare UserAgent Workaround-ish ✔️
ZAP Using Burp ✔️
Google Play ✔️
Java ✔️
Android 11 API 30 ✔️
Magisk ✔️
Burp ✔️
Objection ✔️
Root ✔️
Python ✔️
Frida ✔️
Certs ✔️
AUTOMATIC1111 ✔️
AutoGPT ✔️
Bloodhound ✔️
PyCharm ✔️
OracleLinux WSL ✔️
Ubuntu/Ollama WSL ✔️
Postgres No admin ✔️
SillyTavern ✔️
Volatility 3 ✔️
Arduino IDE / Duck2Spark ✔️
Youtube Downloader Yt-dlp ✔️
How it works: Temporarily resets your Windows $PATH environment variable to fix any issues with an existing Python/Java installation. Builds a working Python environment in seconds using a tiny 16 meg nuget.org Python binary and portable PortableGit. Our solution doesn't require a package manager like Anaconda. I would like to make it even easier to use but I don't want to spend more time developing it if nobody is going to use it! Please let me know if you like it and open bugs/suggestions/feature requests etc! You can contact me at https://rmccurdy.com ! 
Installation/Requirements ( For Android AVD Emulator) : Local admin just to install Android AVD Driver: HAXM Intel driver ( https://github.com/intel/haxm ) OR AMD ( https://github.com/google/android-emulator-hypervisor-driver-for-amd-processors ) Usage: Put ps1 file in a folder Right-click Run with PowerShell OR From command prompt powershell -ExecutionPolicy Bypass -Command "[scriptblock]::Create((Invoke-WebRequest "https://raw.githubusercontent.com/freeload101/Java-Android-Magisk-Burp-Objection-Root-Emulator-Easy/main/JAMBOREE.ps1").Content).Invoke();" More information on bypassing root detection and SafetyNet: https://www.droidwin.com/how-to-hide-root-from-apps-via-magisk-denylist/ ( Watch the Video Tutorial below, it's a 3-5 min process. You only have to set up once. After that it's start Burp then start AVD ) Burp/Android Emulator (Video Tutorial ) Update Video with 7minsec Podcast! https://youtu.be/XdXleap0BiM (Video Tutorial) https://youtu.be/pYv4UwP3BaU USB Rubber Ducky Scripts & Payloads Python 3 Arduino DigiSpark https://youtu.be/e8tKhFS0Tow Old payloads: https://github.com/hak5/usbrubberducky-payloads/tree/1d3e9be7ba3f80cdb008885fac49be2ba926649d/payloads PhreakNIC 24: Java Android Magisk Burp Objection Root Emulator Easy (JAMBOREE) https://www.youtube.com/watch?v=R1eu2Ui1ZLU

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are reuniting for the 2nd AI UX demo day in SF on Apr 28. Sign up to demo here! And don't forget tickets for the AI Engineer World's Fair, for early birds who join before keynote announcements!

About a year ago there was a lot of buzz around prompt engineering techniques to force structured output. Our friend Simon Willison tweeted a bunch of tips and tricks, but the most iconic one is Riley Goodside making it a matter of life or death.

Guardrails (friend of the pod and AI Engineer speaker), Marvin (AI Engineer speaker), and jsonformer had also come out at the time. In June 2023, Jason Liu (today's guest!) open sourced his “OpenAI Function Call and Pydantic Integration Module”, now known as Instructor, which quickly turned prompt engineering black magic into a clean, developer-friendly SDK. A few months later, model providers started to add function calling capabilities to their APIs as well as structured outputs support like “JSON Mode”, which was announced at OpenAI Dev Day (see recap here). In just a handful of months, we went from threatening to kill grandmas to first-class support from the research labs. And yet, Instructor was still downloaded 150,000 times last month. Why?

What Instructor looks like
Instructor patches your LLM provider SDKs to offer a new response_model option to which you can pass a structure defined in Pydantic. It currently supports OpenAI, Anthropic, Cohere, and a long tail of models through LiteLLM.

What Instructor is for
There are three core use cases to Instructor:
* Extracting structured data: Taking an input like an image of a receipt and extracting structured data from it, such as a list of checkout items with their prices, fees, and coupon codes.
* Extracting graphs: Identifying nodes and edges in a given input to extract complex entities and their relationships. For example, extracting relationships between characters in a story or dependencies between tasks.
* Query understanding: Defining a schema for an API call and using a language model to resolve a request into a more complex one that an embedding could not handle. For example, creating date intervals from queries like “what was the latest thing that happened this week?” to then pass onto a RAG system or similar.
Jason called all these different ways of getting data from LLMs “typed responses”: taking strings and turning them into data structures.

Structured outputs as a planning tool
The first wave of agents was all about open-ended iteration and planning, with projects like AutoGPT and BabyAGI. Models would come up with a possible list of steps, and start going down the list one by one. It's really easy for them to go down the wrong branch, or get stuck on a single step with no way to intervene.
What if these planning steps were returned to us as DAGs using structured output, and then managed as workflows? This also makes it easier to train models on how to create these plans, as they are much more structured than a bullet point list. Once you have this structure, each piece can be modified individually by different specialized models. You can read some of Jason's experiments here:
While LLMs will keep improving (Llama3 just got released as we write this), having a consistent structure for the output will make it a lot easier to swap models in and out. Jason's overall message on how we can move from ReAct loops to more controllable Agent workflows mirrors the “Process” discussion from our Elicit episode:

Watch the talk
As a bonus, here's Jason's talk from last year's AI Engineer Summit. 
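The idea above of returning a plan as a DAG via structured output, then running it as a workflow, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Jason's or Instructor's implementation: with Instructor the schema would be a Pydantic model passed as response_model, whereas this sketch uses only the standard library so it runs offline. The `Task` schema, the helper names, and the `PLAN_JSON` string (standing in for a model's reply) are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List


@dataclass
class Task:
    """One node of the plan; depends_on lists the ids that must finish first."""
    id: str
    description: str
    depends_on: List[str] = field(default_factory=list)


def parse_plan(raw_json: str) -> Dict[str, Task]:
    """Turn a JSON plan into typed Task objects and reject unknown dependencies."""
    tasks = {t["id"]: Task(**t) for t in json.loads(raw_json)["tasks"]}
    for task in tasks.values():
        for dep in task.depends_on:
            if dep not in tasks:
                raise ValueError(f"task {task.id!r} depends on unknown task {dep!r}")
    return tasks


def execution_order(tasks: Dict[str, Task]) -> List[str]:
    """Return task ids so that every task comes after its dependencies.

    graphlib raises CycleError if the plan is not a DAG.
    """
    sorter = TopologicalSorter({t.id: t.depends_on for t in tasks.values()})
    return list(sorter.static_order())


# Hypothetical model output, hard-coded here instead of calling an LLM.
PLAN_JSON = """
{"tasks": [
  {"id": "email", "description": "Send the summary", "depends_on": ["summarize"]},
  {"id": "fetch", "description": "Download the report"},
  {"id": "summarize", "description": "Summarize the report", "depends_on": ["fetch"]}
]}
"""

plan = parse_plan(PLAN_JSON)
order = execution_order(plan)
print(order)
```

Because the plan is data rather than a bullet list, a workflow runner can validate it before anything executes, retry or reassign a single node, or hand individual tasks to different models.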
He'll also be a speaker at this year's AI Engineer World's Fair!Timestamps* [00:00:00] Introductions* [00:02:23] Early experiments with Generative AI at StitchFix* [00:08:11] Design philosophy behind the Instructor library* [00:11:12] JSON Mode vs Function Calling* [00:12:30] Single vs parallel function calling* [00:14:00] How many functions is too many?* [00:17:39] How to evaluate function calling* [00:20:23] What is Instructor good for?* [00:22:42] The Evolution from Looping to Workflow in AI Engineering* [00:27:03] State of the AI Engineering Stack* [00:28:26] Why Instructor isn't VC backed* [00:31:15] Advice on Pursuing Open Source Projects and Consulting* [00:36:00] The Concept of High Agency and Its Importance* [00:42:44] Prompts as Code and the Structure of AI Inputs and Outputs* [00:44:20] The Emergence of AI Engineering as a Distinct FieldShow notes* Jason on the UWaterloo mafia* Jason on Twitter, LinkedIn, website* Instructor docs* Max Woolf on the potential of Structured Output* swyx on Elo vs Cost* Jason on Anthropic Function Calling* Jason on Rejections, Advice to Young People* Jason on Bad Startup Ideas* Jason on Prompts as Code* Rysana's inversion models* Bryan Bischof's episode* Hamel HusainTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:16]: Hello, we're back in the remote studio with Jason Liu from Instructor. Welcome Jason.Jason [00:00:21]: Hey there. Thanks for having me.Swyx [00:00:23]: Jason, you are extremely famous, so I don't know what I'm going to do introducing you, but you're one of the Waterloo clan. There's like this small cadre of you that's just completely dominating machine learning. 
Actually, can you list like Waterloo alums that you're like, you know, are just dominating and crushing it right now?Jason [00:00:39]: So like John from like Rysana is doing his inversion models, right? I know like Clive Chen from Waterloo. When I started the data science club, he was one of the guys who were like joining in and just like hanging out in the room. And now he was at Tesla working with Karpathy, now he's at OpenAI, you know.Swyx [00:00:56]: He's in my climbing club.Jason [00:00:58]: Oh, hell yeah. I haven't seen him in like six years now.Swyx [00:01:01]: To get in the social scene in San Francisco, you have to climb. So both in career and in rocks. So you started a data science club at Waterloo, we can talk about that, but then also spent five years at Stitch Fix as an MLE. You pioneered the use of OpenAI's LLMs to increase stylist efficiency. So you must have been like a very, very early user. This was like pretty early on.Jason [00:01:20]: Yeah, I mean, this was like GPT-3, okay. So we actually were using transformers at Stitch Fix before the GPT-3 model. So we were just using transformers for recommendation systems. At that time, I was very skeptical of transformers. I was like, why do we need all this infrastructure? We can just use like matrix factorization. When GPT-2 came out, I fine tuned my own GPT-2 to write like rap lyrics and I was like, okay, this is cute. Okay, I got to go back to my real job, right? Like who cares if I can write a rap lyric? When GPT-3 came out, again, I was very much like, why are we using like a post request to review every comment a person leaves? Like we can just use classical models. So I was very against language models for like the longest time. And then when ChatGPT came out, I basically just wrote a long apology letter to everyone at the company. I was like, hey guys, you know, I was very dismissive of some of this technology. I didn't think it would scale well, and I am wrong. This is incredible. 
And I immediately just transitioned to go from computer vision recommendation systems to LLMs. But funny enough, now that we have RAG, we're kind of going back to recommendation systems.Swyx [00:02:21]: Yeah, speaking of that, I think Alessio is going to bring up the next one.Alessio [00:02:23]: Yeah, I was going to say, we had Bryan Bischof from Hex on the podcast. Did you overlap at Stitch Fix?Jason [00:02:28]: Yeah, he was like one of my main users of the recommendation frameworks that I had built out at Stitch Fix.Alessio [00:02:32]: Yeah, we talked a lot about RecSys, so it makes sense.Swyx [00:02:36]: So now I have adopted that line, RAG is RecSys. And you know, if you're trying to reinvent new concepts, you should study RecSys first, because you're going to independently reinvent a lot of concepts. So your system was called Flight. It's a recommendation framework with over 80% adoption, servicing 350 million requests every day. Wasn't there something existing at Stitch Fix? Why did you have to write one from scratch?Jason [00:02:56]: No, so I think because at Stitch Fix, a lot of the machine learning engineers and data scientists were writing production code, sort of every team's systems were very bespoke. It's like, this team only needs to do like real time recommendations with small data. So they just have like a fast API app with some like pandas code. This other team has to do a lot more data. So they have some kind of like Spark job that does some batch ETL that does a recommendation. And so what happens is each team writes their code differently. And I have to come in and refactor their code. And I was like, oh man, I'm refactoring four different code bases, four different times. Wouldn't it be better if all the code quality was my fault? Let me just write this framework, force everyone else to use it. And now one person can maintain five different systems, rather than five teams having their own bespoke system. 
And so it was really a need of just sort of standardizing everything. And then once you do that, you can do observability across the entire pipeline and make large sweeping improvements in this infrastructure, right? If we notice that something is slow, we can detect it on the operator layer. Just, hey, like this team, this operation you guys are doing is hurting our latency by like 30%. If you just optimize your Python code here, we can probably make an extra million dollars. So let's jump on a call and figure this out. And then a lot of it was doing all this observability work to figure out what the heck is going on and optimize this system, not only from a code perspective, but also by sort of like harassing teams, saying like, we need to add caching here. We're doing duplicated work here. Let's go clean up the systems. Yep.Swyx [00:04:22]: Got it. One more system that I'm interested in finding out more about is your similarity search system using CLIP and GPT-3 embeddings and FAISS, where you saved over $50 million in annual revenue. So of course they gave all that to you, right?Jason [00:04:34]: No, no, no. I mean, it's not going up and down, but you know, I got a little bit, so I'm pretty happy about that. But there, you know, that was when we were doing fine tuning like ResNets to do image classification. And so a lot of it was given an image, if we could predict the different attributes we have in the merchandising and we can predict the text embeddings of the comments, then we can kind of build an image vector or image embedding that can capture both descriptions of the clothing and sales of the clothing. And then we would use these additional vectors to augment our recommendation system. And so the recommendation system really was just around like, what are similar items? What are complementary items? What are items that you would wear in a single outfit? And being able to say on a product page, let me show you like 15, 20 more things. 
And then what we found was like, hey, when you turn that on, you make a bunch of money.Swyx [00:05:23]: Yeah. So, okay. So you didn't actually use GPT-3 embeddings. You fine tuned your own? Because I was surprised that GPT-3 worked off the shelf.Jason [00:05:30]: Because I mean, at this point we would have 3 million pieces of inventory over like a billion interactions between users and clothes. So any kind of fine tuning would definitely outperform like some off the shelf model.Swyx [00:05:41]: Cool. I'm about to move on from Stitch Fix, but you know, any other like fun stories from the Stitch Fix days that you want to cover?Jason [00:05:46]: No, I think that's basically it. I mean, the biggest one really was the fact that I think for just four years, I was so bearish on language models and just NLP in general. I'm just like, none of this really works. Like, why would I spend time focusing on this? I got to go do the thing that makes money, recommendations, bounding boxes, image classification. Yeah. Now I'm like prompting an image model. I was like, oh man, I was wrong.Swyx [00:06:06]: So my Stitch Fix question would be, you know, I think you have a bit of a drip and I don't, you know, my primary wardrobe is free startup conference t-shirts. Should more technology brothers be using Stitch Fix? What's your fashion advice?Jason [00:06:19]: Oh man, I mean, I'm not a user of Stitch Fix, right? It's like, I enjoy going out and like touching things and putting things on and trying them on. Right. I think Stitch Fix is a place where you kind of go because you want the work offloaded. I really love the clothing I buy where I have to like, when I land in Japan, I'm doing like a 45 minute walk up a giant hill to find this weird denim shop. That's the stuff that really excites me. But I think the bigger thing that's really captured is this idea that narrative matters a lot to human beings. Okay. And I think the recommendation system, that's really hard to capture. 
It's easy to use AI to sell like a $20 shirt, but it's really hard for AI to sell like a $500 shirt. But people are buying $500 shirts, you know what I mean? There's definitely something that we can't really capture just yet that we probably will figure out how to in the future.Swyx [00:07:07]: Well, it'll probably output in JSON, which is what we're going to turn to next. Then you went on a sabbatical to South Park Commons in New York, which is unusual because it's based in SF.Jason [00:07:17]: Yeah. So basically in 2020, really, I was enjoying working a lot as I was like building a lot of stuff. This is where we were making like the tens of millions of dollars doing stuff. And then I had a hand injury. And so I really couldn't code anymore for like a year, two years. And so I kind of took sort of half of it as medical leave, the other half I became more of like a tech lead, just like making sure the lights were on for the systems. And then when I went to New York, I spent some time there and kind of just like wound down the tech work, you know, did some pottery, did some jujitsu. And after GPT came out, I was like, oh, I clearly need to figure out what is going on here because something feels very magical. I don't understand it. So I spent basically like five months just prompting and playing around with stuff. And then afterwards, it was just my startup friends going like, hey, Jason, you know, my investors want us to have an AI strategy. Can you help us out? And it just snowballed more and more until I was making this my full time job. Yeah, got it.Swyx [00:08:11]: You know, you had YouTube University and a journaling app, you know, a bunch of other explorations. But it seems like the most productive or the best known thing that came out of your time there was Instructor. Yeah.Jason [00:08:22]: Written on the bullet train in Japan. I think at some point, you know, tools like Guardrails and Marvin came out. 
Those are kind of tools that use XML and Pydantic to get structured data out. But they really were doing things sort of in the prompt. And these are built with sort of the instruct models in mind. Like I'd already done that in the past. Right. At Stitch Fix, you know, one of the things we did was we would take a request note and turn that into a JSON object that we would use to send it to our search engine. Right. So if you said like, I want, you know, skinny jeans that were this size, that would turn into JSON that we would send to our internal search APIs. But it always felt kind of gross. A lot of it is just like you read the JSON, you like parse it, you make sure the names are strings and ages are numbers and you do all this like messy stuff. But when function calling came out, it was very much sort of a new way of doing things. Right. Function calling lets you define the schema separate from the data and the instructions. And what this meant was you can kind of have a lot more complex schemas and just map them in Pydantic. And then you can just keep those very separate. And then once you add like methods, you can add validators and all that kind of stuff. The one issue I really had with a lot of these libraries, though, was it was doing a lot of the string formatting themselves, which was fine when it was the instruct models. You just have a string. But when you have these new chat models, you have these chat messages. And I just didn't really feel like not being able to access that for the developer was sort of a good benefit that they would get. And so I just said, let me write like the most simple SDK around the OpenAI SDK, a simple wrapper on the SDK, just handle the response model a bit and kind of think of myself more like requests than an actual framework that people can use. And so the goal is like, hey, like this is something that you can use to build your own framework. But let me just do all the boring stuff that nobody really wants to do. 
People want to build their own frameworks, but people don't want to build like JSON parsing.Swyx [00:10:08]: And the retrying and all that other stuff.Jason [00:10:10]: Yeah.Swyx [00:10:11]: Right. We had this a little bit of this discussion before the show, but like that design principle of going for being requests rather than being Django. Yeah. So what inspires you there? This has come from a lot of prior pain. Are there other open source projects that inspired your philosophy here? Yeah.Jason [00:10:25]: I mean, I think it would be requests, right? Like, I think it is just the obvious thing you install. If you were going to go make HTTP requests in Python, you would obviously import requests. Maybe if you want to do more async work, there's like future tools, but you don't really even think about installing it. And when you do install it, you don't think of it as like, oh, this is a requests app. Right? Like, no, this is just Python. The bigger question is, like, a lot of people ask questions like, oh, why isn't requests like in the standard library? Yeah. That's how I want my library to feel, right? It's like, oh, if you're going to use the LLM SDKs, you're obviously going to install instructor. And then I think the second question would be like, oh, like, how come instructor doesn't just go into OpenAI, go into Anthropic? Like, if that's the conversation we're having, like, that's where I feel like I've succeeded. Yeah. It's like, yeah, so standard, you may as well just have it in the base libraries.Alessio [00:11:12]: And the shape of the request stayed the same, but initially function calling was maybe equal structure outputs for a lot of people. I think now the models also support like JSON mode and some of these things and, you know, return JSON or my grandma is going to die. All of that stuff is maybe to decide how have you seen that evolution? Like maybe what's the metagame today? 
Should people just forget about function calling for structure outputs or when is structure output like JSON mode the best versus not? We'd love to get any thoughts given that you do this every day.Jason [00:11:42]: Yeah, I would almost say these are like different implementations of like the real thing we care about is the fact that now we have typed responses to language models. And because we have that type response, my IDE is a little bit happier. I get autocomplete. If I'm using the response wrong, there's a little red squiggly line. Like those are the things I care about in terms of whether or not like JSON mode is better. I usually think it's almost worse unless you want to spend less money on like the prompt tokens that the function call represents, primarily because with JSON mode, you don't actually specify the schema. So sure, like JSON load works, but really, I care a lot more than just the fact that it is JSON, right? I think function calling gives you a tool to specify the fact like, okay, this is a list of objects that I want and each object has a name or an age and I want the age to be above zero and I want to make sure it's parsed correctly. That's where kind of function calling really shines.Alessio [00:12:30]: Any thoughts on single versus parallel function calling? So I did a presentation at our AI in Action Discord channel, and obviously showcase instructor. One of the big things that we have before with single function calling is like when you're trying to extract lists, you have to make these funky like properties that are lists to then actually return all the objects. How do you see the hack being put on the developer's plate versus like more of this stuff just getting better in the model? And I know you tweeted recently about Anthropic, for example, you know, some lists are not lists or strings and there's like all of these discrepancies.Jason [00:13:04]: I almost would prefer it if it was always a single function call. 
Obviously, there is like the agents workflows that, you know, Instructor doesn't really support that well, but are things that, you know, ought to be done, right? Like you could define, I think maybe like 50 or 60 different functions in a single API call. And, you know, if it was like get the weather or turn the lights on or do something else, it makes a lot of sense to have these parallel function calls. But in terms of an extraction workflow, I definitely think it's probably more helpful to have everything be a single schema, right? Just because you can sort of specify relationships between these entities that you can't do in a parallel function calling, you can have a single chain of thought before you generate a list of results. Like there's like small like API differences, right? Where if it's for parallel function calling, if you do one, like again, really, I really care about how the SDK looks and says, okay, do I always return a list of functions or do you just want to have the actual object back out and you want to have like auto complete over that object? Interesting.Alessio [00:14:00]: What's kind of the cap for like how many function definitions you can put in where it still works well? Do you have any sense on that?Jason [00:14:07]: I mean, for the most part, I haven't really had a need to do anything that's more than six or seven different functions. I think in the documentation, they support way more. I don't even know if there's any good evals that have over like two dozen function calls. I think if you're running into issues where you have like 20 or 50 or 60 function calls, I think you're much better having those specifications saved in a vector database and then have them be retrieved, right? So if there are 30 tools, like you should basically be like ranking them and then using the top K to do selection a little bit better rather than just like shoving like 60 functions into a single. Yeah.Swyx [00:14:40]: Yeah. 
Well, I mean, so I think this is relevant now because previously I think context limits prevented you from having more than a dozen tools anyway. And now that we have million token context windows, you know, Claude recently with their new function calling release said they can handle over 250 tools, which is insane to me. That's, that's a lot. You're saying like, you know, you don't think there's many people doing that. I think anyone with a sort of agent like platform where you have a bunch of connectors, they would run into that problem. Probably you're right that they should use a vector database and kind of rag their tools. I know Zapier has like a few thousand, like 8,000, 9,000 connectors that, you know, obviously don't fit anywhere. So yeah, I mean, I think that would be it unless you need some kind of intelligence that chains things together, which is, I think what Alessio is coming back to, right? Like there's this trend about parallel function calling. I don't know what I think about that. Anthropic's version was, I think they use multiple tools in sequence, but they're not in parallel. I haven't explored this at all. I'm just like throwing this open to you as to like, what do you think about all these new things? Yeah.Jason [00:15:40]: It's like, you know, do we assume that all function calls could happen in any order? In which case, like we either can assume that, or we can assume that like things need to happen in some kind of sequence as a DAG, right? But if it's a DAG, really that's just like one JSON object that is the entire DAG rather than going like, okay, the order the functions return in doesn't matter. That's definitely just not true in practice, right? Like if I have a thing that's like turn the lights on, like unplug the power, and then like turn the toaster on or something, like the order does matter. And it's unclear how well you can describe the importance of that reasoning to a language model yet. 
I mean, I'm sure you can do it with like good enough prompting, but I just haven't had any use cases where the function sequence really matters. Yeah.Alessio [00:16:18]: To me, the most interesting thing is the models are better at picking than your ranking is usually. Like I'm incubating a company around system integration. For example, with one system, there are like 780 endpoints. And if you're actually trying to do vector similarity, it's not that good because the people that wrote the specs didn't have in mind making them like semantically apart. You know, they're kind of like, oh, create this, create this, create this. Versus when you give it to a model, like in Opus, you put them all, it's quite good at picking which ones you should actually run. And I'm curious to see if the model providers actually care about some of those workflows or if the agent companies are actually going to build very good rankers to kind of fill that gap.Jason [00:16:58]: Yeah. My money is on the rankers because you can do those so easily, right? You could just say, well, given the embeddings of my search query and the embeddings of the description, I can just train XGBoost and just make sure that I have very high like MRR, which is like mean reciprocal rank. And so the only objective is to make sure that the tools you use are in the top N filtered. Like that feels super straightforward and you don't have to actually figure out how to fine tune a language model to do tool selection anymore. Yeah. I definitely think that's the case because for the most part, I imagine you either have like less than three tools or more than a thousand. I don't know what kind of company says, oh, thank God we only have like 185 tools and this works perfectly, right? 
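To make the metric Jason mentions concrete, here is a minimal sketch of mean reciprocal rank (MRR) for a tool ranker. The tool names and the shape of the evaluation data are illustrative assumptions, not something from the episode, and the ranker itself (for example an XGBoost model over query and description embeddings) is not shown.

```python
from typing import Sequence


def reciprocal_rank(ranked_tools: Sequence[str], relevant_tool: str) -> float:
    """Return 1/rank (1-indexed) of the relevant tool, or 0.0 if it was not ranked."""
    for position, tool in enumerate(ranked_tools, start=1):
        if tool == relevant_tool:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevant_tools: Sequence[str]
) -> float:
    """Average the reciprocal rank over all evaluation queries."""
    if len(rankings) != len(relevant_tools):
        raise ValueError("need exactly one relevant tool per ranking")
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, relevant)
        for ranked, relevant in zip(rankings, relevant_tools)
    )
    return total / len(rankings)


# Hypothetical evaluation set: each inner list is the ranker's output for one query.
example_rankings = [
    ["get_weather", "turn_lights_on", "send_email"],  # correct tool ranked 1st
    ["send_email", "get_weather", "turn_lights_on"],  # correct tool ranked 2nd
]
example_relevant = ["get_weather", "get_weather"]
example_mrr = mean_reciprocal_rank(example_rankings, example_relevant)
```

An MRR close to 1.0 means the right tool usually sits at the top of the ranking, so passing only the top few tools to the model is unlikely to drop the one it needs.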
That's right.Alessio [00:17:39]: And before we maybe move on just from this, it was interesting to me, you retweeted this thing about Anthropic function calling and it was Joshua Brown's retweeting some benchmark that it's like, oh my God, Anthropic function calling so good. And then you retweeted it and then you tweeted it later and it's like, it's actually not that good. What's your flow? How do you actually test these things? Because obviously the benchmarks are lying, right? Because the benchmarks say it's good and you said it's bad and I trust you more than the benchmark. How do you think about that? And then how do you evolve it over time?Jason [00:18:09]: It's mostly just client data. I actually have been mostly busy with enough client work that I haven't been able to reproduce public benchmarks. And so I can't even share some of the results in Anthropic. I would just say like in production, we have some pretty interesting schemas where it's like iteratively building lists where we're doing like updates of lists, like we're doing in place updates. So like upserts and inserts. And in those situations we're like, oh yeah, we have a bunch of different parsing errors. Numbers are being returned as strings. We were expecting lists of objects, but we're getting strings that are like the strings of JSON, right? So we had to call JSON parse on individual elements. Overall, I'm like super happy with the Anthropic models compared to the OpenAI models. Sonnet is very cost effective. Haiku, in function calling, it's actually better, but I think they just had to sort of file down the edges a little bit where like our tests pass, but then we actually deployed to production. We got half a percent of traffic having issues where if you ask for JSON, it'll try to talk to you. Or if you use function calling, you know, we'll have like a parse error. And so I think those are definitely gonna be things that are fixed in like the upcoming weeks. 
But in terms of like the reasoning capabilities, man, it's hard to beat like 70% cost reduction, especially when you're building consumer applications, right? If you're building something for consultants or private equity, like you're charging $400, it doesn't really matter if it's a dollar or $2. But for consumer apps, it makes products viable. If you can go from GPT-4 to Sonnet, you might actually be able to price it better. Yeah.Swyx [00:19:31]: I had this chart about the Elo versus the cost of all the models. And you could put trend graphs on each of those things about like, you know, higher Elo equals higher cost, except for Haiku. Haiku kind of just broke the lines, or the iso-Elos, if you want to call it. Cool. Before we go too far into your opinions on just the overall ecosystem, I want to make sure that we map out the surface area of Instructor. I would say that most people would be familiar with Instructor from your talks and your tweets and all that. You had the number one talk from the AI Engineer Summit.Jason [00:20:03]: Two Liu. Jason Liu and Jerry Liu. Yeah.Swyx [00:20:06]: Yeah. Until I actually went through your cookbook, I didn't realize the surface area. How would you categorize the use cases? You have LLM self-critique, you have knowledge graphs in here, you have PII data sanitation. How do you characterize to people what is the surface area of Instructor? Yeah.Jason [00:20:23]: This is the part that feels crazy because really the difference is LLMs give you strings and Instructor gives you data structures. And once you get data structures, again, you can do every LeetCode problem you ever thought of. Right. And so I think there's a couple of really common applications. The first one obviously is extracting structured data. This would just be, okay, well, like I want to put in an image of a receipt. I want it to give back out a list of checkout items with a price and a fee and a coupon code or whatever. That's one application. 
Another application really is around extracting graphs out. So one of the things we found out about these language models is that not only can you define nodes, it's really good at figuring out what are nodes and what are edges. And so we have a bunch of examples where, you know, not only do I extract that, you know, this happens after that, but also like, okay, these two are dependencies of another task. And you can do, you know, extracting complex entities that have relationships. Given a story, for example, you could extract relationships of families across different characters. This can all be done by defining a graph. The last really big application really is just around query understanding. The idea is that like any API call has some schema and if you can define that schema ahead of time, you can use a language model to resolve a request into a much more complex request. One that an embedding could not do. So for example, I have a really popular post called like rag is more than embeddings. And effectively, you know, if I have a question like this, what was the latest thing that happened this week? That embeds to nothing, right? But really like that query should just be like select all data where the date time is between today and today minus seven days, right? What if I said, how did my writing change between this month and last month? Again, embeddings would do nothing. But really, if you could do like a group by over the month and a summarize, then you could again like do something much more interesting. And so this really just calls out the fact that embeddings really is kind of like the lowest hanging fruit. And using something like instructor can really help produce a data structure. And then you can just use your computer science and reason about the data structure. Maybe you say, okay, well, I'm going to produce a graph where I want to group by each month and then summarize them jointly. You can do that if you know how to define this data structure. 
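As a rough illustration of the query-understanding idea, the sketch below shows the kind of typed object a question like "what was the latest thing that happened this week?" could resolve to, and how ordinary code can then filter on it. The `DateRangeQuery` class, its fields, and the sample documents are hypothetical names chosen for this example. In practice a language model (for instance through Instructor with a Pydantic response model) would fill in the structure, and that step is stubbed out here so the sketch runs on the standard library alone.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DateRangeQuery:
    """Typed form of a time-bounded question (hypothetical schema)."""

    start: date
    end: date
    rewritten_query: str

    def __post_init__(self) -> None:
        # Validation lives on the schema, not in prompt text.
        if self.start > self.end:
            raise ValueError("start must not be after end")


def resolve_this_week(today: date, rewritten_query: str) -> DateRangeQuery:
    """Stand-in for the model call: 'this week' becomes today minus seven days."""
    return DateRangeQuery(
        start=today - timedelta(days=7),
        end=today,
        rewritten_query=rewritten_query,
    )


def filter_by_date(documents: list[dict], query: DateRangeQuery) -> list[dict]:
    """Keep documents inside the interval, newest first."""
    matches = [d for d in documents if query.start <= d["date"] <= query.end]
    return sorted(matches, key=lambda d: d["date"], reverse=True)


# Made-up documents for illustration only.
sample_documents = [
    {"title": "old note", "date": date(2024, 3, 1)},
    {"title": "recent note", "date": date(2024, 4, 17)},
    {"title": "newest note", "date": date(2024, 4, 18)},
]
sample_query = resolve_this_week(date(2024, 4, 19), "latest events")
sample_results = filter_by_date(sample_documents, sample_query)
```

The point is the one Jason makes: once the request is a data structure, the date filter is plain code, and a retrieval or RAG step only has to handle the remaining `rewritten_query`.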
Yeah.Swyx [00:22:29]: So you kind of run up against like the LangChains of the world that used to have that. They still do have like the self querying, I think they used to call it when we had Harrison on in our episode. How do you see yourself interacting with the other LLM frameworks in the ecosystem? Yeah.Jason [00:22:42]: I mean, if they use instructor, I think that's totally cool. Again, it's like, it's just Python, right? It's like asking like, oh, how does like Django interact with requests? Well, you just might make a request.get in a Django app, right? But no one would say, I like went off of Django because I'm using requests now. They should be ideally like sort of the wrong comparison in terms of especially like the agent workflows. I think the real goal for me is to go down like the LLM compiler route, which is instead of doing like a react type reasoning loop. I think my belief is that we should be using like workflows. If we do this, then we always have a request and a complete workflow. We can fine tune a model that has a better workflow. Whereas it's hard to think about like, how do you fine tune a better react loop? Yeah. You always train it to have less looping, in which case like you wanted to get the right answer the first time, in which case it was a workflow to begin with, right?Swyx [00:23:31]: Can you define workflow? Because I used to work at a workflow company, but I'm not sure this is a good term for everybody.Jason [00:23:36]: I'm thinking workflow in terms of like the prefect Zapier workflow. Like I want to build a DAG, I want you to tell me what the nodes and edges are. And then maybe the edges are also put in with AI. But the idea is that like, I want to be able to present you the entire plan and then ask you to fix things as I execute it, rather than going like, hey, I couldn't parse the JSON, so I'm going to try again. I couldn't parse the JSON, I'm going to try again. 
And then next thing you know, you spent like $2 on OpenAI credits, right? Yeah. Whereas with the plan, you can just say, oh, the edge between node like X and Y does not run. Let me just iteratively try to fix that, fix the one that sticks, go on to the next component. And obviously you can get into a world where if you have enough examples of the nodes X and Y, maybe you can use like a vector database to find good few-shot examples. You can do a lot if you sort of break down the problem into that workflow and executing that workflow, rather than looping and hoping the reasoning is good enough to generate the correct output. Yeah.Swyx [00:24:35]: You know, I've been hammering on Devin a lot. I got access a couple of weeks ago. And obviously for simple tasks, it does well. For the complicated, like more than 10, 20 hour tasks, I can see- That's a crazy comparison.Jason [00:24:47]: We used to talk about like three, four loops. Only once it gets to like hour tasks, it's hard.Swyx [00:24:54]: Yeah. Less than an hour, there's nothing.Jason [00:24:57]: That's crazy.Swyx [00:24:58]: I mean, okay. Maybe my goalposts have shifted. I don't know. That's incredible.Jason [00:25:02]: Yeah. No, no. I'm like sub one minute executions. Like the fact that you're talking about 10 hours is incredible.Swyx [00:25:08]: I think it's a spectrum. I think I'm going to say this every single time I bring up Devin. Let's not reward them for taking longer to do things. Do you know what I mean? I think that's a metric that is easily abusable.Jason [00:25:18]: Sure. Yeah. You know what I mean? But I think if you can monotonically increase the success probability over an hour, that's winning to me. Right? Like obviously if you run an hour and you've made no progress. Like I think when we were in like AutoGPT land, there was that one example where it's like, I wanted it to like buy me a bicycle overnight. I spent $7 on credit and I never found the bicycle. Yeah.Swyx [00:25:41]: Yeah. Right. 
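The workflow idea Jason describes earlier in this exchange (present the whole plan as a DAG, then fix only the node that fails) can be sketched roughly as follows. This is an illustration under assumptions, not Instructor's API: `run_plan`, the node names, and the retry policy are all hypothetical, and the plan is written by hand where a model would normally return it as structured output.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set

# A node's action receives the results of all nodes that have already run.
Action = Callable[[Dict[str, Any]], Any]


def run_plan(
    dependencies: Dict[str, Set[str]],
    actions: Dict[str, Action],
    max_attempts: int = 2,
) -> Dict[str, Any]:
    """Run each node after its dependencies, retrying only the node that fails."""
    results: Dict[str, Any] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(dependencies).static_order():
        last_error = None
        for _ in range(max_attempts):
            try:
                results[node] = actions[node](results)
                break
            except Exception as error:  # retry this node, not the whole plan
                last_error = error
        else:
            raise RuntimeError(
                f"node {node!r} failed after {max_attempts} attempts"
            ) from last_error
    return results


# Hypothetical two-step plan: 'summarize' depends on 'fetch'.
plan_dependencies = {"fetch": set(), "summarize": {"fetch"}}
plan_actions = {
    "fetch": lambda results: ["doc one", "doc two"],
    "summarize": lambda results: f"{len(results['fetch'])} documents fetched",
}
plan_results = run_plan(plan_dependencies, plan_actions)
```

Because the plan is data, a failure points at one node or edge, which is what makes targeted repair (or later fine-tuning on good plans) possible, in contrast to rerunning an open-ended loop.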
I wonder if you'll be able to purchase a bicycle. Because it actually can do things in real world. It just needs to suspend to you for auth and stuff. The point I was trying to make was that I can see it turning plans. I think one of the agents loopholes or one of the things that is a real barrier for agents is LLMs really like to get stuck into a lane. And you know what you're talking about, what I've seen Devin do is it gets stuck in a lane and it will just kind of change plans based on the performance of the plan itself. And it's kind of cool.Jason [00:26:05]: I feel like we've gone too much in the looping route and I think a lot more plans and like DAGs and data structures are probably going to come back to help fill in some holes. Yeah.Alessio [00:26:14]: What do you think of the interface to that? Do you see it's like an existing state machine kind of thing that connects to the LLMs, the traditional DAG players? Do you think we need something new for like AI DAGs?Jason [00:26:25]: Yeah. I mean, I think that the hard part is going to be describing visually the fact that this DAG can also change over time and it should still be allowed to be fuzzy. I think in like mathematics, we have like plate diagrams and like Markov chain diagrams and like recurrent states and all that. Some of that might come into this workflow world. But to be honest, I'm not too sure. I think right now, the first steps are just how do we take this DAG idea and break it down to modular components that we can like prompt better, have few shot examples for and ultimately like fine tune against. But in terms of even the UI, it's hard to say what will likely win. I think, you know, people like Prefect and Zapier have a pretty good shot at doing a good job.Swyx [00:27:03]: Yeah. You seem to use Prefect a lot. I actually worked at a Prefect competitor at Temporal and I'm also very familiar with Dagster. 
What else would you call out as like particularly interesting in the AI engineering stack? Jason [00:27:13]: Man, I almost use nothing. I just use Cursor and like pytest. Okay. I think that's basically it. You know, a lot of the observability companies have... The more observability companies I've tried, the more I just use Postgres. Swyx [00:27:29]: Really? Okay. Postgres for observability? Jason [00:27:32]: But the issue really is the fact that these observability companies aren't actually doing observability for the system. They're just doing the LLM thing. Like I still end up using like Datadog or like, you know, Sentry to do like latency. And so I just have those systems handle it. And then the like prompt in, prompt out, latency, token costs. I just put that in like a Postgres table now. Swyx [00:27:51]: So you don't need like 20 funded startups building LLM ops? Yeah. Jason [00:27:55]: But I'm also like an old, tired guy. You know what I mean? Like I think because of my background, it's like, yeah, like the Python stuff, I'll write myself. But you know, I will also just use Vercel happily. Yeah. Yeah. So I'm not really into that world of tooling, whereas I think, you know, I spent three good years building observability tools for recommendation systems. And I was like, oh, compared to that, Instructor is just one call. I just have to put time start, time end, and then count the prompt tokens, right? Because I'm not doing a very complex looping behavior. I'm doing mostly workflows and extraction. Yeah. Swyx [00:28:26]: I mean, while we're on this topic, we'll just kind of get this out of the way. You famously have decided to not be a venture backed company. You want to do the consulting route. The obvious route for something as successful as Instructor is like, oh, here's hosted Instructor with all tooling. Yeah. You just said you had a whole bunch of experience building observability tooling. You have the perfect background to do this and you're not. Jason [00:28:43]: Yeah. 
Isn't that sick? I think that's sick.Swyx [00:28:44]: I mean, I know why, because you want to go free dive.Jason [00:28:47]: Yeah. Yeah. Because I think there's two things. Right. Well, one, if I tell myself I want to build requests, requests is not a venture backed startup. Right. I mean, one could argue whether or not Postman is, but I think for the most part, it's like having worked so much, I'm more interested in looking at how systems are being applied and just having access to the most interesting data. And I think I can do that more through a consulting business where I can come in and go, oh, you want to build perfect memory. You want to build an agent. You want to build like automations over construction or like insurance and supply chain, or like you want to handle writing private equity, mergers and acquisitions reports based off of user interviews. Those things are super fun. Whereas like maintaining the library, I think is mostly just kind of like a utility that I try to keep up, especially because if it's not venture backed, I have no reason to sort of go down the route of like trying to get a thousand integrations. In my mind, I just go like, okay, 98% of the people use open AI. I'll support that. And if someone contributes another platform, that's great. I'll merge it in. Yeah.Swyx [00:29:45]: I mean, you only added Anthropic support this year. Yeah.Jason [00:29:47]: Yeah. You couldn't even get an API key until like this year, right? That's true. Okay. If I add it like last year, I was trying to like double the code base to service, you know, half a percent of all downloads.Swyx [00:29:58]: Do you think the market share will shift a lot now that Anthropic has like a very, very competitive offering?Jason [00:30:02]: I think it's still hard to get API access. 
I don't know if it's fully GA now, if it's GA, if you can get a commercial access really easily.Alessio [00:30:12]: I got commercial after like two weeks to reach out to their sales team.Jason [00:30:14]: Okay.Alessio [00:30:15]: Yeah.Swyx [00:30:16]: Two weeks. It's not too bad. There's a call list here. And then anytime you run into rate limits, just like ping one of the Anthropic staff members.Jason [00:30:21]: Yeah. Then maybe we need to like cut that part out. So I don't need to like, you know, spread false news.Swyx [00:30:25]: No, it's cool. It's cool.Jason [00:30:26]: But it's a common question. Yeah. Surely just from the price perspective, it's going to make a lot of sense. Like if you are a business, you should totally consider like Sonnet, right? Like the cost savings is just going to justify it if you actually are doing things at volume. And yeah, I think the SDK is like pretty good. Back to the instructor thing. I just don't think it's a billion dollar company. And I think if I raise money, the first question is going to be like, how are you going to get a billion dollar company? And I would just go like, man, like if I make a million dollars as a consultant, I'm super happy. I'm like more than ecstatic. I can have like a small staff of like three people. It's fun. And I think a lot of my happiest founder friends are those who like raised a tiny seed round, became profitable. They're making like 70, 60, 70, like MRR, 70,000 MRR and they're like, we don't even need to raise the seed round. Let's just keep it like between me and my co-founder, we'll go traveling and it'll be a great time. I think it's a lot of fun.Alessio [00:31:15]: Yeah. 
like say LLMs / AI and they build some open source stuff and it's like I should just raise money and do this and I tell people a lot it's like look you can make a lot more money doing something else than doing a startup like most people that do a company could make a lot more money just working somewhere else than the company itself do you have any advice for folks that are maybe in a similar situation they're trying to decide oh should I stay in my like high paid FAANG job and just tweet this on the side and do this on github should I go be a consultant like being a consultant seems like a lot of work so you got to talk to all these people you know there's a lot to unpackJason [00:31:54]: I think the open source thing is just like well I'm just doing it purely for fun and I'm doing it because I think I'm right but part of being right is the fact that it's not a venture backed startup like I think I'm right because this is all you need right so I think a part of the philosophy is the fact that all you need is a very sharp blade to sort of do your work and you don't actually need to build like a big enterprise so that's one thing I think the other thing too that I've kind of been thinking around just because I have a lot of friends at google that want to leave right now it's like man like what we lack is not money or skill like what we lack is courage you should like you just have to do this a hard thing and you have to do it scared anyways right in terms of like whether or not you do want to do a founder I think that's just a matter of optionality but I definitely recognize that the like expected value of being a founder is still quite low it is right I know as many founder breakups and as I know friends who raised a seed round this year right like that is like the reality and like you know even in from that perspective it's been tough where it's like oh man like a lot of incubators want you to have co-founders now you spend half the time like fundraising and then 
trying to like meet co-founders and find co-founders rather than building the thing this is a lot of time spent out doing uh things I'm not really good at. I do think there's a rising trend in solo founding yeah.Swyx [00:33:06]: You know I am a solo I think that something like 30 percent of like I forget what the exact status something like 30 percent of starters that make it to like series B or something actually are solo founder I feel like this must have co-founder idea mostly comes from YC and most everyone else copies it and then plenty of companies break up over co-founderJason [00:33:27]: Yeah and I bet it would be like I wonder how much of it is the people who don't have that much like and I hope this is not a diss to anybody but it's like you sort of you go through the incubator route because you don't have like the social equity you would need is just sort of like send an email to Sequoia and be like hey I'm going on this ride you want a ticket on the rocket ship right like that's very hard to sell my message if I was to raise money is like you've seen my twitter my life is sick I've decided to make it much worse by being a founder because this is something I have to do so do you want to come along otherwise I want to fund it myself like if I can't say that like I don't need the money because I can like handle payroll and like hire an intern and get an assistant like that's all fine but I really don't want to go back to meta I want to like get two years to like try to find a problem we're solving that feels like a bad timeAlessio [00:34:12]: Yeah Jason is like I wear a YSL jacket on stage at AI Engineer Summit I don't need your accelerator moneyJason [00:34:18]: And boots, you don't forget the boots. 
But I think that is a part of it right I think it is just like optionality and also just like I'm a lot older now I think 22 year old Jason would have been probably too scared and now I'm like too wise but I think it's a matter of like oh if you raise money you have to have a plan of spending it and I'm just not that creative with spending that much money yeah I mean to be clear you just celebrated your 30th birthday happy birthday yeah it's awesome so next week a lot older is relative to some some of the folks I think seeing on the career tipsAlessio [00:34:48]: I think Swix had a great post about are you too old to get into AI I saw one of your tweets in January 23 you applied to like Figma, Notion, Cohere, Anthropic and all of them rejected you because you didn't have enough LLM experience I think at that time it would be easy for a lot of people to say oh I kind of missed the boat you know I'm too late not gonna make it you know any advice for people that feel like thatJason [00:35:14]: Like the biggest learning here is actually from a lot of folks in jiu-jitsu they're like oh man like is it too late to start jiu-jitsu like I'll join jiu-jitsu once I get in more shape right it's like there's a lot of like excuses and then you say oh like why should I start now I'll be like 45 by the time I'm any good and say well you'll be 45 anyways like time is passing like if you don't start now you start tomorrow you're just like one more day behind if you're worried about being behind like today is like the soonest you can start right and so you got to recognize that like maybe you just don't want it and that's fine too like if you wanted you would have started I think a lot of these people again probably think of things on a too short time horizon but again you know you're gonna be old anyways you may as well just start now you knowSwyx [00:35:55]: One more thing on I guess the um career advice slash sort of vlogging you always go viral for this post that you wrote on 
advice to young people and the lies you tell yourself oh yeah yeah you said you were writing it for your sister.Jason [00:36:05]: She was like bummed out about going to college and like stressing about jobs and I was like oh and I really want to hear okay and I just kind of like text-to-sweep the whole thing it's crazy it's got like 50,000 views like I'm mind I mean your average tweet has more but that thing is like a 30-minute read nowSwyx [00:36:26]: So there's lots of stuff here which I agree with I you know I'm also of occasionally indulge in the sort of life reflection phase there's the how to be lucky there's the how to have high agency I feel like the agency thing is always a trend in sf or just in tech circles how do you define having high agencyJason [00:36:42]: I'm almost like past the high agency phase now now my biggest concern is like okay the agency is just like the norm of the vector what also matters is the direction right it's like how pure is the shot yeah I mean I think agency is just a matter of like having courage and doing the thing that's scary right you know if people want to go rock climbing it's like do you decide you want to go rock climbing then you show up to the gym you rent some shoes and you just fall 40 times or do you go like oh like I'm actually more intelligent let me go research the kind of shoes that I want okay like there's flatter shoes and more inclined shoes like which one should I get okay let me go order the shoes on Amazon I'll come back in three days like oh it's a little bit too tight maybe it's too aggressive I'm only a beginner let me go change no I think the higher agent person just like goes and like falls down 20 times right yeah I think the higher agency person is more focused on like process metrics versus outcome metrics right like from pottery like one thing I learned was if you want to be good at pottery you shouldn't count like the number of cups or bowls you make you should just weigh the amount of clay you 
use right like the successful person says oh I went through 100 pounds of clay right the less agency was like oh I've made six cups and then after I made six cups like there's not really what are you what do you do next no just pounds of clay pounds of clay same with the work here right so you just got to write the tweets like make the commits contribute open source like write the documentation there's no real outcome it's just a process and if you love that process you just get really good at the thing you're doingSwyx [00:38:04]: yeah so just to push back on this because obviously I mostly agree how would you design performance review systems because you were effectively saying we can count lines of code for developers rightJason [00:38:15]: I don't think that would be the actual like I think if you make that an outcome like I can just expand a for loop right I think okay so for performance review this is interesting because I've mostly thought of it from the perspective of science and not engineering I've been running a lot of engineering stand-ups primarily because there's not really that many machine learning folks the process outcome is like experiments and ideas right like if you think about outcome is what you might want to think about an outcome is oh I want to improve the revenue or whatnot but that's really hard but if you're someone who is going out like okay like this week I want to come up with like three or four experiments I might move the needle okay nothing worked to them they might think oh nothing worked like I suck but to me it's like wow you've closed off all these other possible avenues for like research like you're gonna get to the place that you're gonna figure out that direction really soon there's no way you try 30 different things and none of them work usually like 10 of them work five of them work really well two of them work really really well and one thing was like the nail in the head so agency lets you sort of capture the volume of 
experiments and like experience lets you figure out like oh that other half it's not worth doing right I think experience is going like half these prompting papers don't make any sense just use chain of thought and just you know use a for loop that's basically right it's like usually performance for me is around like how many experiments are you running how oftentimes are you trying.Alessio [00:39:32]: When do you give up on an experiment because a StitchFix you kind of give up on language models I guess in a way as a tool to use and then maybe the tools got better you were right at the time and then the tool improved I think there are similar paths in my engineering career where I try one approach and at the time it doesn't work and then the thing changes but then I kind of soured on that approach and I don't go back to it soonJason [00:39:51]: I see yeah how do you think about that loop so usually when I'm coaching folks and as they say like oh these things don't work I'm not going to pursue them in the future like one of the big things like hey the negative result is a result and this is something worth documenting like this is an academia like if it's negative you don't just like not publish right but then like what do you actually write down like what you should write down is like here are the conditions this is the inputs and the outputs we tried the experiment on and then one thing that's really valuable is basically writing down under what conditions would I revisit these experiments these things don't work because of what we had at the time if someone is reading this two years from now under what conditions will we try again that's really hard but again that's like another skill you kind of learn right it's like you do go back and you do experiments you figure out why it works now I think a lot of it here is just like scaling worked yeah rap lyrics you know that was because I did not have high enough quality data if we phase shift and say okay you don't 
even need training data, oh great, then it might just work in a different domain. Alessio [00:40:48]: Do you have anything in your list that is like it doesn't work now but I want to try it again later? Something that people should maybe keep in mind. You know, people always like AGI when, you know, when are you going to know the AGI is here. Maybe it's less than that, but any stuff that you tried recently that didn't work that you think will get there? Jason [00:41:01]: I mean I think the personal assistants and the writing, I've shown to myself it's just not good enough yet. So I hired a writer and I hired a personal assistant, so now I'm gonna basically like work with these people until I figure out like what I can actually like automate and what are like the reproducible steps. But like I think the experiment for me is like I'm gonna go pay a person like a thousand dollars a month to help me improve my life and then let me get them to help me figure out like what are the components and how do I actually modularize something to get it to work, because it's not just like a lot gmail calendar and like Notion, it's a little bit more complicated than that, but we just don't know what that is yet. Those are two sort of systems that I wish GPT-4 or Opus was actually good enough to just write me an essay, but most of the essays are still pretty bad. Swyx [00:41:44]: Yeah, I would say, you know, on the personal assistant side Lindy is probably the one I've seen the most. Flo was a speaker at the summit. I don't know if you've checked it out or any other sort of agents assistant startup. Jason [00:41:54]: Not recently, I haven't tried Lindy. They were not GA last time I was considering it. Yeah, yeah. A lot of it now it's like, oh, like really what I want you to do is take a look at all of my meetings and like write like a really good weekly summary email for my clients to remind them that I'm like, you know, thinking of them and like working for them, right? Or it's like I want you to notice that like my Monday 
is like way too packed and like block out more time and also like email the people to do the reschedule and then try to opt in to move them around and then I want you to say oh jason should have like a 15 minute prep break after form back to back those are things that now I know I can prompt them in but can it do it well like before I didn't even know that's what I wanted to prompt for us defragging a calendar and adding break so I can like eat lunch yeah that's the AGI test yeah exactly compassion right I think one thing that yeah we didn't touch on it before butAlessio [00:42:44]: I think was interesting you had this tweet a while ago about prompts should be code and then there were a lot of companies trying to build prompt engineering tooling kind of trying to turn the prompt into a more structured thing what's your thought today now you want to turn the thinking into DAGs like do prompts should still be code any updated ideasJason [00:43:04]: It's the same thing right I think you know with Instructor it is very much like the output model is defined as a code object that code object is sent to the LLM and in return you get a data structure so the outputs of these models I think should also be code objects and the inputs somewhat should be code objects but I think the one thing that instructor tries to do is separate instruction data and the types of the output and beyond that I really just think that most of it should be still like managed pretty closely to the developer like so much of is changing that if you give control of these systems away too early you end up ultimately wanting them back like many companies I know that I reach out or ones were like oh we're going off of the frameworks because now that we know what the business outcomes we're trying to optimize for these frameworks don't work yeah because we do rag but we want to do rag to like sell you supplements or to have you like schedule the fitness appointment the prompts are kind of too baked into 
the systems to really pull them back out and like start doing upselling or something it's really funny but a lot of it ends up being like once you understand the business outcomes you care way more about the promptSwyx [00:44:07]: Actually this is fun in our prep for this call we were trying to say like what can you as an independent person say that maybe me and Alessio cannot say or me you know someone at a company say what do you think is the market share of the frameworks the LangChain, the LlamaIndex, the everything...Jason [00:44:20]: Oh massive because not everyone wants to care about the code yeah right I think that's a different question to like what is the business model and are they going to be like massively profitable businesses right making hundreds of millions of dollars that feels like so straightforward right because not everyone is a prompt engineer like there's so much productivity to be captured in like back office optim automations right it's not because they care about the prompts that they care about managing these things yeah but those would be sort of low code experiences you yeah I think the bigger challenge is like okay hundred million dollars probably pretty easy it's just time and effort and they have the manpower and the money to sort of solve those problems again if you go the vc route then it's like you're talking about billions and that's really the goal that stuff for me it's like pretty unclear but again that is to say that like I sort of am building things for developers who want to use infrastructure to build their own tooling in terms of the amount of developers there are in the world versus downstream consumers of these things or even just think of how many companies will use like the adobes and the ibms right because they want something that's fully managed and they want something that they know will work and if the incremental 10% requires you to hire another team of 20 people you might not want to do it and I think that kind 
of organization is really good for, uh, those are bigger companies. Swyx [00:45:32]: I just want to capture your thoughts on one more thing, which is you said you wanted most of the prompts to stay close to the developer, and Hamel Husain wrote this post which I really love called "F you, show me the prompt." Yeah, I think he cites you in one part of the blog post, and I think DSPy is kind of like the complete antithesis of that, which is I think interesting because I also hold the strong view that AI is a better prompt engineer than you are, and I don't know how to square that. Wondering if you have thoughts. Jason [00:45:58]: I think something like DSPy can work because there are like very short-term metrics to measure success, right? It is like did you find the PII, or like did you write the multi-hop question the correct way. But in these workflows that I've been managing, a lot of it is are we minimizing churn and maximizing retention. Yeah, that's a very long loop. It's not really like an Optuna-like training loop, right? Like those things are much harder to capture, so we don't actually have those metrics for that, right? And obviously we can figure out like, okay, is the summary good, but like how do you measure the quality of the summary? It's like that feedback loop, it ends up being a lot longer. And then again, when something changes, it's really hard to make sure that it works across these like newer models, or again like changes to work for the current process. Like when we migrate from like Anthropic to OpenAI, like there's just a ton of changes that are like infrastructure related, not necessarily around the prompt itself. Yeah. Cool, any other AI engineering startups that you think should not exist before we wrap up? I mean, oh my gosh, I mean a lot of it again, it's just like every time of investors like how does this make a billion dollars, like it doesn't. I'm gonna go back to just like tweeting and holding my breath underwater. Yeah, like I don't really pay attention too much to 
most of this like most of the stuff i'm doing is around like the consumer of like llm calls yep i think people just want to move really fast and they will end up pick these vendors but i don't really know if anything has really like blown me out the water like i only trust myself but that's also a function of just being an old man like i think you know many companies are definitely very happy with using most of these tools anyways but i definitely think i occupy a very small space in the engineering ecosystem.Swyx [00:47:41]: Yeah i would say one of the challenges here you know you call about the dealing in the consumer of llm's space i think that's what ai engineering differs from ml engineering and i think a constant disconnect or cognitive dissonance in this field in the ai engineers that have sprung up is that they are not as good as the ml engineers they are not as qualified i think that you know you are someone who has credibility in the mle space and you are also a very authoritative figure in the ai space and i think so and you know i think you've built the de facto leading library i think yours i think instructors should be part of the standard lib even though i try to not use it like i basically also end up rebuilding instructor right like that's a lot of the back and forth that we had over the past two days i think that's the fundamental thing that we're trying to figure out like there's very small supply of MLEs not everyone's going to have that experience that you had but the global demand for AI is going to far outstrip the existing MLEs.Jason [00:48:36]: So what do we do do we force everyone to go through the standard MLE curriculum or do we make a new one? I'
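As a rough illustration of the logging setup Jason describes earlier in this episode (prompt in, prompt out, latency, and token counts written to a plain database table), here is a minimal sketch. It assumes SQLite from the Python standard library as a stand-in for Postgres; the table name `llm_calls`, the helper `logged_call`, and the `fake_model` function are all hypothetical, and the whitespace-based token count is only a placeholder for the usage numbers a real API response would report.

```python
import sqlite3
import time
from typing import Callable, List, Tuple

# Hypothetical schema: one row per model call.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt TEXT NOT NULL,
    response TEXT NOT NULL,
    latency_s REAL NOT NULL,
    prompt_tokens INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL
)
"""


def rough_token_count(text: str) -> int:
    # Assumption: whitespace split as a crude proxy for real token usage.
    return len(text.split())


def logged_call(
    conn: sqlite3.Connection,
    model_fn: Callable[[str], str],
    prompt: str,
) -> str:
    """Call model_fn once and record prompt, response, latency, and token counts."""
    start = time.perf_counter()
    response = model_fn(prompt)
    latency = time.perf_counter() - start
    conn.execute(
        "INSERT INTO llm_calls "
        "(prompt, response, latency_s, prompt_tokens, completion_tokens) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            prompt,
            response,
            latency,
            rough_token_count(prompt),
            rough_token_count(response),
        ),
    )
    conn.commit()
    return response


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client call.
    return "echo: " + prompt


def demo() -> List[Tuple[str, str, int, int]]:
    """Log a single call into an in-memory database and return the stored rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    logged_call(conn, fake_model, "extract the user name")
    rows = conn.execute(
        "SELECT prompt, response, prompt_tokens, completion_tokens FROM llm_calls"
    ).fetchall()
    conn.close()
    return rows
```

Because each call is a single request rather than a loop, one row per call is enough; system-level latency and error tracking would still live in whatever general monitoring tool is already in place.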

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit

Apr 11, 2024 · 56:20


Maggie, Linus, Geoffrey, and the LS crew are reuniting for our second annual AI UX demo day in SF on Apr 28. Sign up to demo here! And don't forget tickets for the AI Engineer World's Fair — for early birds who join before keynote announcements!It's become fashionable for many AI startups to project themselves as “the next Google” - while the search engine is so 2000s, both Perplexity and Exa referred to themselves as a “research engine” or “answer engine” in our NeurIPS pod. However these searches tend to be relatively shallow, and it is challenging to zoom up and down the ladders of abstraction to garner insights. For serious researchers, this level of simple one-off search will not cut it.We've commented in our Jan 2024 Recap that Flow Engineering (simply; multi-turn processes over many-shot single prompts) seems to offer far more performance, control and reliability for a given cost budget. Our experiments with Devin and our understanding of what the new Elicit Notebooks offer a glimpse into the potential for very deep, open ended, thoughtful human-AI collaboration at scale.It starts with promptsWhen ChatGPT exploded in popularity in November 2022 everyone was turned into a prompt engineer. While generative models were good at "vibe based" outcomes (tell me a joke, write a poem, etc) with basic prompts, they struggled with more complex questions, especially in symbolic fields like math, logic, etc. Two of the most important "tricks" that people picked up on were:* Chain of Thought prompting strategy proposed by Wei et al in the “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. 
Rather than doing traditional few-shot prompting with just question and answers, adding the thinking process that led to the answer resulted in much better outcomes.* Adding "Let's think step by step" to the prompt as a way to boost zero-shot reasoning, which was popularized by Kojima et al in the Large Language Models are Zero-Shot Reasoners paper from NeurIPS 2022. This bumped accuracy from 17% to 79% compared to zero-shot.Nowadays, prompts include everything from promises of monetary rewards to… whatever the Nous folks are doing to turn a model into a world simulator. At the end of the day, the goal of prompt engineering is increasing accuracy, structure, and repeatability in the generation of a model.From prompts to agentsAs prompt engineering got more and more popular, agents (see “The Anatomy of Autonomy”) took over Twitter with cool demos and AutoGPT became the fastest growing repo in Github history. The thing about AutoGPT that fascinated people was the ability to simply put in an objective without worrying about explaining HOW to achieve it, or having to write very sophisticated prompts. The system would create an execution plan on its own, and then loop through each task. The problem with open-ended agents like AutoGPT is that 1) it's hard to replicate the same workflow over and over again 2) there isn't a way to hard-code specific steps that the agent should take without actually coding them yourself, which isn't what most people want from a product. From agents to productsPrompt engineering and open-ended agents were great in the experimentation phase, but this year more and more of these workflows are starting to become polished products. Today's guests are Andreas Stuhlmüller and Jungwon Byun of Elicit (previously Ought), an AI research assistant that they think of as “the best place to understand what is known”. Ought was a non-profit, but last September, Elicit spun off into a PBC with a $9m seed round. 
It is hard to quantify how much a workflow can be improved, but Elicit boasts some impressive numbers for research assistants:Just four months after launch, Elicit crossed $1M ARR, which shows how much interest there is for AI products that just work.One of the main takeaways we had from the episode is how teams should focus on supervising the process, not the output. Their philosophy at Elicit isn't to train general models, but to train models that are extremely good at focusing processes. This allows them to have pre-created steps that the user can add to their workflow (like classifying certain features that are specific to their research field) without having to write a prompt for it. And for Hamel Husain's happiness, they always show you the underlying prompt. Elicit recently announced notebooks as a new interface to interact with their products: (fun fact, they tried to implement this 4 times before they landed on the right UX! We discuss this ~33:00 in the podcast)The reasons why they picked notebooks as a UX all tie back to process:* They are systematic; once you have a instruction/prompt that works on a paper, you can run hundreds of papers through the same workflow by creating a column. Notebooks can also be edited and exported at any point during the flow.* They are transparent - Many papers include an opaque literature review as perfunctory context before getting to their novel contribution. But PDFs are “dead” and it is difficult to follow the thought process and exact research flow of the authors. Sharing “living” Elicit Notebooks opens up this process.* They are unbounded - Research is an endless stream of rabbit holes. So it must be easy to dive deeper and follow up with extra steps, without losing the ability to surface for air. We had a lot of fun recording this, and hope you have as much fun listening!AI UX in SFLong time Latent Spacenauts might remember our first AI UX meetup with Linus Lee, Geoffrey Litt, and Maggie Appleton last year. 
Well, Maggie has since joined Elicit, and they are all returning at the end of this month! Sign up here: https://lu.ma/aiux
And submit demos here! https://forms.gle/iSwiesgBkn8oo4SS8
We expect the 200 seats to “sell out” fast. Attendees with demos will be prioritized.

Show Notes
* Elicit
* Ought (their previous non-profit)
* “Pivoting” with GPT-4
* Elicit notebooks launch
* Charlie
* Andreas' Blog

Timestamps
* [00:00:00] Introductions
* [00:07:45] How Jungwon and Andreas Joined Forces to Create Elicit
* [00:10:26] Why Products > Research
* [00:15:49] The Evolution of Elicit's Product
* [00:19:44] Automating Literature Review Workflow
* [00:22:48] How GPT-3 to GPT-4 Changed Things
* [00:25:37] Managing LLM Pricing and Performance
* [00:31:07] Open vs. Closed: Elicit's Approach to Model Selection
* [00:31:56] Moving to Notebooks
* [00:39:11] Elicit's Budget for Model Queries and Evaluations
* [00:41:44] Impact of Long Context Windows
* [00:47:19] Underrated Features and Surprising Applications
* [00:51:35] Driving Systematic and Efficient Research
* [00:53:00] Elicit's Team Growth and Transition to a Public Benefit Corporation
* [00:55:22] Building AI for Good

Full Interview on YouTube

As always, a plug for our YouTube version for the 80% of communication that is nonverbal:

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.
Swyx [00:00:15]: Hey, and today we are back in the studio with Andreas and Jungwon from Elicit. Welcome.
Jungwon [00:00:20]: Thanks guys.
Andreas [00:00:21]: It's great to be here.
Swyx [00:00:22]: Yeah. So I'll introduce you separately, but also, you know, we'd love to learn a little bit more about you personally. So Andreas, it looks like you started Elicit first, Jungwon joined later.
Andreas [00:00:32]: That's right.
For all intents and purposes, the Elicit and also the Ought that existed before then were very different from what I started. So I think it's like fair to say that you co-founded it.
Swyx [00:00:43]: Got it. And Jungwon, you're a co-founder and COO of Elicit now.
Jungwon [00:00:46]: Yeah, that's right.
Swyx [00:00:47]: So there's a little bit of a history to this. I'm not super aware of like the sort of journey. I was aware of Ought and Elicit as sort of a nonprofit type situation. And recently you turned into like a B Corp, Public Benefit Corporation. So yeah, maybe if you want, you could take us through that journey of finding the problem. You know, obviously you're working together now. So like, how do you get together to decide to leave your startup career to join him?
Andreas [00:01:10]: Yeah, it's truly a very long journey. I guess truly, it kind of started in Germany when I was born. So even as a kid, I was always interested in AI, like I kind of went to the library. There were books about how to write programs in QBasic and like some of them talked about how to implement chatbots.
Jungwon [00:01:27]: To be clear, he grew up in like a tiny village on the outskirts of Munich called Dinkelscherben, where it's like a very, very idyllic German village.
Andreas [00:01:36]: Yeah, important to the story. So basically, the main thing is I've kind of always been thinking about AI my entire life and been thinking about, well, at some point, this is going to be a huge deal. It's going to be transformative. How can I work on it? And was thinking about it from when I was a teenager, after high school did a year where I started a startup with the intention to become rich. And then once I'm rich, I can affect the trajectory of AI. Did not become rich, decided to go back to college and study cognitive science there, which was like the closest thing I could find at the time to AI.
In the last year of college, moved to the US to do a PhD at MIT, working on broadly kind of new programming languages for AI because it kind of seemed like the existing languages were not great at expressing world models and learning world models, doing Bayesian inference. Was always thinking about, well, ultimately, the goal is to actually build tools that help people reason more clearly, ask and answer better questions and make better decisions. But for a long time, it seemed like the technology to put reasoning in machines just wasn't there. Initially, at the end of my postdoc at Stanford, I was thinking about, well, what to do? I think the standard path is you become an academic and do research. But it's really hard to actually build interesting tools as an academic. You can't really hire great engineers. Everything is kind of on a paper-to-paper timeline. And so I was like, well, maybe I should start a startup, pursued that for a little bit. But it seemed like it was too early because you could have tried to do an AI startup, but probably would not have been this kind of AI startup we're seeing now. So then decided to just start a nonprofit research lab that's going to do research for a while until we better figure out how to do thinking in machines. And that was Ought. And then over time, it became clear how to actually build actual tools for reasoning. And only over time, we developed a better way to... I'll let you fill in some of the details here.
Jungwon [00:03:26]: Yeah. So I guess my story maybe starts around 2015. I kind of wanted to be a founder for a long time, and I wanted to work on an idea that stood the test of time for me, like an idea that stuck with me for a long time. And starting in 2015, actually, originally, I became interested in AI-based tools from the perspective of mental health. So there are a bunch of people around me who are really struggling.
One really close friend in particular was really struggling with mental health and didn't have any support, and it didn't feel like there was anything before kind of like getting hospitalized that could just help her. And so luckily, she came and stayed with me for a while, and we were just able to talk through some things. But it seemed like lots of people might not have that resource, and something maybe AI-enabled could be much more scalable. I didn't feel ready to start a company then, that's 2015. And I also didn't feel like the technology was ready. So then I went into FinTech and kind of learned how to do the tech thing. And then in 2019, I felt like it was time for me to just jump in and build something on my own I really wanted to create. And at the time, I looked around at tech and felt like not super inspired by the options. I didn't want to have a tech career ladder, or I didn't want to climb the career ladder. There are two kind of interesting technologies at the time, there was AI and there was crypto. And I was like, well, the AI people seem like a little bit more nice, maybe like slightly more trustworthy, both super exciting, but threw my bet in on the AI side. And then I got connected to Andreas. And actually, the way he was thinking about pursuing the research agenda at Ought was really compatible with what I had envisioned for an ideal AI product, something that helps kind of take down really complex thinking, overwhelming thoughts and breaks it down into small pieces. And then this kind of mission that we need AI to help us figure out what we ought to do was really inspiring, right? Yeah, because I think it was clear that we were building the most powerful optimizer of our time. But as a society, we hadn't figured out how to direct that optimization potential. And if you kind of direct tremendous amounts of optimization potential at the wrong thing, that's really disastrous.
So the goal of Ought was make sure that if we build the most transformative technology of our lifetime, it can be used for something really impactful, like good reasoning, like not just generating ads. My background was in marketing, but like, so I was like, I want to do more than generate ads with this. But also if these AI systems get to be super intelligent enough that they are doing this really complex reasoning, that we can trust them, that they are aligned with us and we have ways of evaluating that they're doing the right thing. So that's what Ought did. We did a lot of experiments, you know, like I just said, before foundation models really like took off. A lot of the issues we were seeing were more in reinforcement learning, but we saw a future where AI would be able to do more kind of logical reasoning, not just kind of extrapolate from numerical trends. We actually kind of set up experiments with people where kind of people stood in as super intelligent systems and we effectively gave them context windows. So they would have to like read a bunch of text and one person would get less text and one person would get all the text and the person with less text would have to evaluate the work of the person who could read much more. So like in a world we were basically simulating, like in 2018, 2019, a world where an AI system could read significantly more than you and you as the person who couldn't read that much had to evaluate the work of the AI system. Yeah. So that's a lot of the work we did. And from that, we kind of iterated on the idea of breaking complex tasks down into smaller tasks, like complex tasks, like open-ended reasoning, logical reasoning into smaller tasks so that it's easier to train AI systems on them. And also so that it's easier to evaluate the work of the AI system when it's done. And then also kind of, you know, really pioneered this idea, the importance of supervising the process of AI systems, not just the outcomes.
So a big part of how Elicit is built is we're very intentional about not just throwing a ton of data into a model and training it and then saying, cool, here's like scientific output. Like that's not at all what we do. Our approach is very much like, what are the steps that an expert human does or what is like an ideal process as granularly as possible, let's break that down and then train AI systems to perform each of those steps very robustly. When you train like that from the start, after the fact, it's much easier to evaluate, it's much easier to troubleshoot at each point. Like where did something break down? So yeah, we were working on those experiments for a while. And then at the start of 2021, decided to build a product.
Swyx [00:07:45]: Do you mind if I, because I think you're about to go into more modern thought and Elicit. And I just wanted to, because I think a lot of people are in where you were like sort of 2018, 19, where you chose a partner to work with. Yeah. Right. And you didn't know him. Yeah. Yeah. You were just kind of cold introduced. A lot of people are cold introduced. Yeah. Never work with them. I assume you had a lot, a lot of other options, right? Like how do you advise people to make those choices?
Jungwon [00:08:10]: We were not totally cold introduced. So one of our closest friends introduced us. And then Andreas had written a lot on the Ought website, a lot of blog posts, a lot of publications. And I just read it and I was like, wow, this sounds like my writing. And even other people, some of my closest friends I asked for advice from, they were like, oh, this sounds like your writing. But I think I also had some kind of like things I was looking for. I wanted someone with a complementary skillset. I wanted someone who was very values aligned. And yeah, that was all a good fit.
Andreas [00:08:38]: We also did a pretty lengthy mutual evaluation process where we had a Google doc where we had all kinds of questions for each other.
And I think it ended up being around 50 pages or so of like various like questions and back and forth.
Swyx [00:08:52]: Was it the YC list? There's some lists going around for co-founder questions.
Andreas [00:08:55]: No, we just made our own questions. But I guess it's probably related in that you ask yourself, what are the values you care about? How would you approach various decisions and things like that?
Jungwon [00:09:04]: I shared like all of my past performance reviews. Yeah. Yeah.
Swyx [00:09:08]: And he never had any. No.
Andreas [00:09:10]: Yeah.
Swyx [00:09:11]: Sorry, I just had to, a lot of people are going through that phase and you kind of skipped over it. I was like, no, no, no, no. There's like an interesting story.
Jungwon [00:09:20]: Yeah.
Alessio [00:09:21]: Yeah. Before we jump into what Elicit is today, the history is a bit counterintuitive. So you start with figuring out, oh, if we had a super powerful model, how would we align it? But then you were actually like, well, let's just build the product so that people can actually leverage it. And I think there are a lot of folks today that are now back to where you were maybe five years ago that are like, oh, what if this happens rather than focusing on actually building something useful with it? What clicked for you to like move into Elicit and then we can cover that story too.
Andreas [00:09:49]: I think in many ways, the approach is still the same because the way we are building Elicit is not let's train a foundation model to do more stuff. It's like, let's build a scaffolding such that we can deploy powerful models to good ends. I think it's different now in that we actually have like some of the models to plug in. But if in 2017, we had had the models, we could have run the same experiments we did run with humans back then, just with models. And so in many ways, our philosophy is always, let's think ahead to the future of what models are going to exist in one, two years or longer.
And how can we make it so that they can actually be deployed in kind of transparent, controllable ways?
Jungwon [00:10:26]: I think motivationally, we both are kind of product people at heart. The research was really important and it didn't make sense to build a product at that time. But at the end of the day, the thing that always motivated us is imagining a world where high quality reasoning is really abundant and AI is a technology that's going to get us there. And there's a way to guide that technology with research, but we can have a more direct effect through product because with research, you publish the research and someone else has to implement that into the product and the product felt like a more direct path. And we wanted to concretely have an impact on people's lives. Yeah, I think the kind of personally, the motivation was we want to build for people.
Swyx [00:11:03]: Yep. And then just to recap as well, like the models you were using back then were like, I don't know, were they like BERT type stuff or T5 or I don't know what timeframe we're talking about here.
Andreas [00:11:14]: I guess to be clear, at the very beginning, we had humans do the work. And then I think the first models that kind of make sense were GPT-2 and T-NLG and like, yeah, early generative models. We do also use like T5 based models even now, started with GPT-2.
Swyx [00:11:30]: Yeah, cool. I'm just kind of curious about like, how do you start so early? You know, like now it's obvious where to start, but back then it wasn't.
Jungwon [00:11:37]: Yeah, I used to nag Andreas a lot. I was like, why are you talking to this? I don't know. I felt like GPT-2 is like clearly can't do anything. And I was like, Andreas, you're wasting your time, like playing with this toy. But yeah, he was right.
Alessio [00:11:50]: So what's the history of what Elicit actually does as a product? You recently announced that after four months, you get to a million in revenue.
Obviously, a lot of people use it, get a lot of value, but it was initially kind of like structured data extraction from papers. Then you had kind of like concept grouping. And today, it's maybe like a more full stack research enabler, kind of like paper understander platform. What's the definitive definition of what Elicit is? And how did you get here?
Jungwon [00:12:15]: Yeah, we say Elicit is an AI research assistant. I think it will continue to evolve. That's part of why we're so excited about building in research, because there's just so much space. I think the current phase we're in right now, we talk about it as really trying to make Elicit the best place to understand what is known. So it's all a lot about like literature summarization. There's a ton of information that the world already knows. It's really hard to navigate, hard to make it relevant. So a lot of it is around document discovery and processing and analysis. I really kind of want to import some of the incredible productivity improvements we've seen in software engineering and data science and into research. So it's like, how can we make researchers like data scientists of text? That's why we're launching this new set of features called Notebooks. It's very much inspired by computational notebooks, like Jupyter Notebooks, you know, Deepnote or Colab, because they're so powerful and so flexible. And ultimately, when people are trying to get to an answer or understand insight, they're kind of like manipulating evidence and information. Today, that's all packaged in PDFs, which are super brittle. So with language models, we can decompose these PDFs into their underlying claims and evidence and insights, and then let researchers mash them up together, remix them and analyze them together. So yeah, I would say quite simply, overall, Elicit is an AI research assistant.
Right now we're focused on text-based workflows, but long term, really want to kind of go further and further into reasoning and decision making.
Alessio [00:13:35]: And when you say AI research assistant, this is kind of meta research. So researchers use Elicit as a research assistant. It's not a generic you-can-research-anything type of tool, or it could be, but like, what are people using it for today?
Andreas [00:13:49]: Yeah. So specifically in science, a lot of people use human research assistants to do things. You tell your grad student, hey, here are a couple of papers. Can you look at all of these, see which of these have kind of sufficiently large populations and actually study the disease that I'm interested in, and then write out like, what are the experiments they did? What are the interventions they did? What are the outcomes? And kind of organize that for me. And the first phase of understanding what is known really focuses on automating that workflow because a lot of that work is pretty rote work. I think it's not the kind of thing that we need humans to do. Language models can do it. And then if language models can do it, you can obviously scale it up much more than a grad student or undergrad research assistant would be able to do.
Jungwon [00:14:31]: Yeah. The use cases are pretty broad. So a very large percent of our users are just using it personally or for a mix of personal and professional things. People who care a lot about health or biohacking or parents who have children with a kind of rare disease and want to understand the literature directly. So there is an individual kind of consumer use case. We're most focused on the power users. So that's where we're really excited to build. So Elicit was very much inspired by this workflow in literature called systematic reviews or meta-analysis, which is basically the human state of the art for summarizing scientific literature.
And it typically involves like five people working together for over a year. And they kind of first start by trying to find the maximally comprehensive set of papers possible. So it's like 10,000 papers. And they kind of systematically narrow that down to like hundreds or 50, extract key details from every single paper. Usually have two people doing it, like a third person reviewing it. So it's like an incredibly laborious, time consuming process, but you see it in every single domain. So in science, in machine learning, in policy, because it's so structured and designed to be reproducible, it's really amenable to automation. So that's kind of the workflow that we want to automate first. And then you make that accessible for any question and make these really robust living summaries of science. So yeah, that's one of the workflows that we're starting with.
Alessio [00:15:49]: Our previous guest, Mike Conover, he's building a new company called Brightwave, which is an AI research assistant for financial research. How do you see the future of these tools? Does everything converge to like a God researcher assistant, or is every domain going to have its own thing?
Andreas [00:16:03]: I think that's a good and mostly open question. I do think there are some differences across domains. For example, some research is more quantitative data analysis, and other research is more high level cross domain thinking. And we definitely want to contribute to the broad generalist reasoning type space. Like if researchers are making discoveries often, it's like, hey, this thing in biology is actually analogous to like these equations in economics or something. And that's just fundamentally a thing where you need to reason across domains. At least within research, I think there will be like one best platform more or less for this type of generalist research.
I think there may still be like some particular tools like for genomics, like particular types of modules of genes and proteins and whatnot. But for a lot of the kind of high level reasoning that humans do, I think that is more of a winner-take-all thing.
Swyx [00:16:52]: I wanted to ask a little bit deeper about, I guess, the workflow that you mentioned. I like that phrase. I see that in your UI now, but that's as it is today. And I think you were about to tell us about how it was in 2021 and how it maybe progressed. How has this workflow evolved over time?
Jungwon [00:17:07]: Yeah. So the very first version of Elicit actually wasn't even a research assistant. It was a forecasting assistant. So we set out and we were thinking about, you know, what are some of the most impactful types of reasoning that if we could scale up, AI would really transform the world. We actually started with literature review, but we're like, oh, so many people are going to build literature review tools. So let's not start there. So then we focused on geopolitical forecasting. So I don't know if you're familiar with like Manifold or Manifold Markets. That kind of stuff. Before Manifold. Yeah. Yeah. I'm not predicting relationships. We're predicting like, is China going to invade Taiwan?
Swyx [00:17:38]: Markets for everything.
Andreas [00:17:39]: Yeah. That's a relationship.
Swyx [00:17:41]: Yeah.
Jungwon [00:17:42]: Yeah. It's true. And then we worked on that for a while. And then after GPT-3 came out, I think by that time we realized that originally we were trying to help people convert their beliefs into probability distributions. And so take fuzzy beliefs, but like model them more concretely. And then after a few months of iterating on that, just realize, oh, the thing that's blocking people from making interesting predictions about important events in the world is less kind of on the probabilistic side and much more on the research side.
And so that kind of combined with the very generalist capabilities of GPT-3 prompted us to make a more general research assistant. Then we spent a few months iterating on what even is a research assistant. So we would embed with different researchers. We built data labeling workflows in the beginning, kind of right off the bat. We built ways to find experts in a field and like ways to ask good research questions. So we just kind of iterated through a lot of workflows and no one else was really building at this time. And it was like very quick to just do some prompt engineering and see like what is a task that is at the intersection of what's technologically capable and like important for researchers. And we had like a very nondescript landing page. It said nothing. But somehow people were signing up and we had a sign-up form that was like, why are you here? And everyone was like, I need help with literature review. And we're like, oh, literature review. That sounds so hard. I don't even know what that means. We're like, we don't want to work on it. But then eventually we were like, okay, everyone is saying literature review. It's overwhelmingly people want to-
Swyx [00:19:02]: And all domains, not like medicine or physics or just all domains. Yeah.
Jungwon [00:19:06]: And we also kind of personally knew literature review was hard. And if you look at the graphs for academic literature being published every single month, you guys know this in machine learning, it's like up and to the right, like superhuman amounts of papers. So we're like, all right, let's just try it. I was really nervous, but Andreas was like, this is kind of like the right problem space to jump into, even if we don't know what we're doing. So my take was like, fine, this feels really scary, but let's just launch a feature every single week and double our user numbers every month. And if we can do that, we'll fail fast and we will find something.
I was worried about like getting lost in the kind of academic white space. So the very first version was actually a weekend prototype that Andreas made. Do you want to explain how that worked?
Andreas [00:19:44]: I mostly remember that it was really bad. The thing I remember is you entered a question and it would give you back a list of claims. So your question could be, I don't know, how does creatine affect cognition? It would give you back some claims that are to some extent based on papers, but they were often irrelevant. The papers were often irrelevant. And so we ended up soon just printing out a bunch of examples of results and putting them up on the wall so that we would kind of feel the constant shame of having such a bad product and would be incentivized to make it better. And I think over time it has gotten a lot better, but I think the initial version was like really very bad. Yeah.
Jungwon [00:20:20]: But it was basically like a natural language summary of an abstract, like kind of a one sentence summary, and which we still have. And then as we learned kind of more about this systematic review workflow, we started expanding the capability so that you could extract a lot more data from the papers and do more with that.
Swyx [00:20:33]: And were you using like embeddings and cosine similarity, that kind of stuff for retrieval, or was it keyword based?
Andreas [00:20:40]: I think the very first version didn't even have its own search engine. I think the very first version probably used the Semantic Scholar API or something similar. And only later when we discovered that API is not very semantic, we then built our own search engine that has helped a lot.
Swyx [00:20:58]: And then we're going to go into like more recent product stuff, but like, you know, I think you seem the more sort of startup oriented business person and you seem sort of more ideologically like interested in research, obviously, because of your PhD.
What kind of market sizing were you guys thinking? Right? Like, because you're here saying like, we have to double every month. And I'm like, I don't know how you make that conclusion from this, right? Especially also as a nonprofit at the time.
Jungwon [00:21:22]: I mean, market size wise, I felt like in this space where so much was changing and it was very unclear what of today was actually going to be true tomorrow. We just like really rested a lot on very, very simple fundamental principles, which is like, if you can understand the truth, that is very economically beneficial and valuable. If you like know the truth.
Swyx [00:21:42]: On principle.
Jungwon [00:21:43]: Yeah. That's enough for you. Yeah. Research is the key to many breakthroughs that are very commercially valuable.
Swyx [00:21:47]: Because my version of it is students are poor and they don't pay for anything. Right? But that's obviously not true. As you guys have found out. But you had to have some market insight for me to have believed that, but you skipped that.
Andreas [00:21:58]: Yeah. I remember talking to VCs for our seed round. A lot of VCs were like, you know, researchers, they don't have any money. Why don't you build a legal assistant? I think in some short sighted way, maybe that's true. But I think in the long run, R&D is such a big space of the economy. I think if you can substantially improve how quickly people find new discoveries or avoid controlled trials that don't go anywhere, I think that's just huge amounts of money. And there are a lot of questions obviously about between here and there. But I think as long as the fundamental principle is there, we were okay with that. And I guess we found some investors who also were. Yeah.
Swyx [00:22:35]: Congrats. I mean, I'm sure we can cover the sort of flip later. I think you're about to start us on like GPT-3 and how that changed things for you. It's funny. I guess every major GPT version, you have some big insight.
Yeah.
Jungwon [00:22:48]: Yeah. I mean, what do you think?
Andreas [00:22:51]: I think it's a little bit less true for us than for others, because we always believed that there will basically be human level machine work. And so it is definitely true that in practice for your product, as new models come out, your product starts working better, you can add some features that you couldn't add before. But I don't think we really ever had the moment where we were like, oh, wow, that is super unanticipated. We need to do something entirely different now from what was on the roadmap.
Jungwon [00:23:21]: I think GPT-3 was a big change because it kind of said, oh, now is the time that we can use AI to build these tools. And then GPT-4 was maybe a little bit more of an extension of GPT-3. GPT-3 over GPT-2 was like qualitative level shift. And then GPT-4 was like, okay, great. Now it's like more accurate. We're more accurate on these things. We can answer harder questions. But the shape of the product had already taken place by that time.
Swyx [00:23:44]: I kind of want to ask you about this sort of pivot that you've made. But I guess that was just a way to sell what you were doing, which is you're adding extra features on grouping by concepts. The GPT-4 pivot, quote unquote pivot that you-
Jungwon [00:23:55]: Oh, yeah, yeah, exactly. Right, right, right. Yeah. Yeah. When we launched this workflow, now that GPT-4 was available, basically Elicit was at a place where we have very tabular interfaces. So given a table of papers, you can extract data across all the tables. But you kind of want to take the analysis a step further. Sometimes what you'd care about is not having a list of papers, but a list of arguments, a list of effects, a list of interventions, a list of techniques.
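To make the step from "a table of papers" to "a list of interventions" or "a list of effects" concrete, here is a minimal Python sketch of a group-by over extracted fields. It is only an illustration, not Elicit's implementation: the rows, the field names, and the `pivot` helper are all invented for this example.

```python
from collections import defaultdict

# Hypothetical extraction results, one row per paper. Titles and extracted
# values are made up purely for illustration.
rows = [
    {"paper": "Paper A", "intervention": "creatine", "effect": "improved memory"},
    {"paper": "Paper B", "intervention": "creatine", "effect": "no change"},
    {"paper": "Paper C", "intervention": "caffeine", "effect": "improved memory"},
]


def pivot(rows, key):
    """Group paper titles by the value extracted for `key`."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row["paper"])
    return dict(grouped)


# The same table, viewed as a list of interventions and as a list of effects.
by_intervention = pivot(rows, "intervention")
by_effect = pivot(rows, "effect")
```

Each group keeps the titles of the papers behind it, so a grouped view can still point back to its supporting literature.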
And so that's one of the things we're working on is now that you've extracted this information in a more structured way, can you pivot it or group by whatever the information that you extracted to have more insight-first information still supported by the academic literature?
Swyx [00:24:33]: Yeah, that was a big revelation when I saw it. Basically, I think I'm very just impressed by how first principles, your ideas around what the workflow is. And I think that's why you're not as reliant on like the LLM improving, because it's actually just about improving the workflow that you would recommend to people. Today we might call it an agent, I don't know, but you're not relying on the LLM to drive it. It's relying on this is the way that Elicit does research. And this is what we think is most effective based on talking to our users.
Jungwon [00:25:01]: The problem space is still huge. Like if it's like this big, we are all still operating at this tiny part, bit of it. So I think about this a lot in the context of moats, people are like, oh, what's your moat? What happens if GPT-5 comes out? It's like, if GPT-5 comes out, there's still like all of this other space that we can go into. So I think being really obsessed with the problem, which is very, very big, has helped us like stay robust and just kind of directly incorporate model improvements and then keep going.
Swyx [00:25:26]: And then I first encountered you guys with Charlie, you can tell us about that project. Basically, yeah. Like how much did cost become a concern as you're working more and more with OpenAI? How do you manage that relationship?
Jungwon [00:25:37]: Let me talk about who Charlie is. And then you can talk about the tech, because Charlie is a special character. So Charlie, when we found him was, had just finished his freshman year at the University of Warwick. And I think he had heard about us on some discord. And then he applied and we were like, wow, who is this freshman?
And then we just saw that he had done so many incredible side projects. And we were actually on a team retreat in Barcelona visiting our head of engineering at that time. And everyone was talking about this wunderkind or like this kid. And then on our take home project, he had done like the best of anyone to that point. And so people were just like so excited to hire him. So we hired him as an intern and they were like, Charlie, what if you just dropped out of school? And so then we convinced him to take a year off. And he was just incredibly productive. And I think the thing you're referring to is at the start of 2023, Anthropic kind of launched their constitutional AI paper. And within a few days, I think four days, he had basically implemented that in production. And then we had it in app a week or so after that. And he has since kind of contributed to major improvements, like cutting costs down to a tenth of what they were at really large scale. But yeah, you can talk about the technical stuff. Yeah.Andreas [00:26:39]: On the constitutional AI project, this was for abstract summarization, where in Elicit, if you run a query, it'll return papers to you, and then it will summarize each paper with respect to your query for you on the fly. And that's a really important part of Elicit because Elicit does it so much. If you run a few searches, it'll have done it a few hundred times for you. And so we cared a lot about this both being fast, cheap, and also very low on hallucination. I think if Elicit hallucinates something about the abstract, that's really not good. And so what Charlie did in that project was create a constitution that expressed what are the attributes of a good summary? Everything in the summary is reflected in the actual abstract, and it's like very concise, et cetera, et cetera. And then used RLHF with a model that was trained on the constitution to basically fine tune a better summarizer on an open source model. Yeah. 
I think that might still be in use.Jungwon [00:27:34]: Yeah. Yeah, definitely. Yeah. I think at the time, the models hadn't been trained at all to be faithful to a text. So they were just generating. So then when you ask them a question, they tried too hard to answer the question and didn't try hard enough to answer the question given the text or answer what the text said about the question. So we had to basically teach the models to do that specific task.Swyx [00:27:54]: How do you monitor the ongoing performance of your models? Not to get too LLM-opsy, but you are one of the larger, more well-known operations doing NLP at scale. I guess effectively, you have to monitor these things and nobody has a good answer that I talk to.Andreas [00:28:10]: I don't think we have a good answer yet. I think the answers are actually a little bit clearer on the just kind of basic robustness side of where you can import ideas from normal software engineering and normal kind of DevOps. You're like, well, you need to monitor kind of latencies and response times and uptime and whatnot.Swyx [00:28:27]: I think when we say performance, it's more about hallucination rate, isn't it?Andreas [00:28:30]: And then things like hallucination rate where I think there, the really important thing is training time. So we care a lot about having our own internal benchmarks for model development that reflect the distribution of user queries so that we can know ahead of time how well is the model going to perform on different types of tasks. So the tasks being summarization, question answering, given a paper, ranking. And for each of those, we want to know what's the distribution of things the model is going to see so that we can have well-calibrated predictions on how well the model is going to do in production. And I think, yeah, there's some chance that there's distribution shift and actually the things users enter are going to be different. 
But I think that's much less important than getting the kind of training right and having very high quality, well-vetted data sets at training time.Jungwon [00:29:18]: I think we also end up effectively monitoring by trying to evaluate new models as they come out. And so that kind of prompts us to go through our eval suite every couple of months. And every time a new model comes out, we have to see how is this performing relative to production and what we currently have.Swyx [00:29:32]: Yeah. I mean, since we're on this topic, any new models that have really caught your eye this year?Jungwon [00:29:37]: Like Claude came out with a bunch. Yeah. I think Claude is pretty, I think the team's pretty excited about Claude. Yeah.Andreas [00:29:41]: Specifically, Claude Haiku is like a good point on the kind of Pareto frontier. It's neither the cheapest model, nor is it the most accurate, most high quality model, but it's just like a really good trade-off between cost and accuracy.Swyx [00:29:57]: You apparently have to 10-shot it to make it good. I tried using Haiku for summarization, but zero-shot was not great. Then they were like, you know, it's a skill issue, you have to try harder.Jungwon [00:30:07]: I think GPT-4 unlocked tables for us, processing data from tables, which was huge. GPT-4 Vision.Andreas [00:30:13]: Yeah.Swyx [00:30:14]: Yeah. Did you try like Fuyu? I guess you can't try Fuyu because it's non-commercial. That's the Adept model.Jungwon [00:30:19]: Yeah.Swyx [00:30:20]: We haven't tried that one. Yeah. Yeah. Yeah. But Claude is multimodal as well. Yeah. I think the interesting insight that we got from talking to David Luan, who is CEO of Adept, is that multimodality has effectively two different flavors. One is we recognize images from a camera in the outside natural world. And actually the more important multimodality for knowledge work is screenshots and PDFs and charts and graphs. 
So we need a new term for that kind of multimodality.Andreas [00:30:45]: But is the claim that current models are good at one or the other? Yeah.Swyx [00:30:50]: They're over-indexed because the history of computer vision is COCO, right? So now we're like, oh, actually, you know, screens are more important, OCR, handwriting. You mentioned a lot of like closed model lab stuff, and then you also have like this open source model fine tuning stuff. Like what is your workload now between closed and open? It's a good question.Andreas [00:31:07]: I think- Is it half and half? It's a-Swyx [00:31:10]: Is that even a relevant question or not? Is this a nonsensical question?Andreas [00:31:13]: It depends a little bit on like how you index, whether you index by like compute cost or number of queries. I'd say like in terms of number of queries, it's maybe similar. In terms of like cost and compute, I think the closed models make up more of the budget since the main cases where you want to use closed models are cases where they're just smarter, where no existing open source models are quite smart enough.Jungwon [00:31:35]: Yeah. Yeah.Alessio [00:31:37]: We have a lot of interesting technical questions to go in, but just to wrap the kind of like UX evolution, now you have the notebooks. We talked a lot about how chatbots are not the final frontier, you know? How did you decide to get into notebooks, which is a very iterative kind of like interactive interface and yeah, maybe learnings from that.Jungwon [00:31:56]: Yeah. This is actually our fourth time trying to make this work. Okay. I think the first time was probably in early 2021. I think because we've always been obsessed with this idea of task decomposition and like branching, we always wanted a tool that could be kind of unbounded where you could keep going, could do a lot of branching where you could kind of apply language model operations or computations on other tasks. 
So in 2021, we had this thing called composite tasks where you could use GPT-3 to brainstorm a bunch of research questions and then take each research question and decompose those further into sub questions. This kind of, again, that like task decomposition tree type thing was always very exciting to us, but that was like, it didn't work and it was kind of overwhelming. Then at the end of 22, I think we tried again and at that point we were thinking, okay, we've done a lot with this literature review thing. We also want to start helping with kind of adjacent domains and different workflows. Like we want to help more with machine learning. What does that look like? And as we were thinking about it, we're like, well, there are so many research workflows. How do we not just build three new workflows into Elicit, but make Elicit really generic to lots of workflows? What is like a generic composable system with nice abstractions that can like scale to all these workflows? So we like iterated on that a bunch and then didn't quite narrow the problem space enough or like quite get to what we wanted. And then I think it was at the beginning of 2023 where we're like, wow, computational notebooks kind of enable this, where they have a lot of flexibility, but kind of robust primitives such that you can extend the workflow and it's not limited. It's not like you ask a query, you get an answer, you're done. You can just constantly keep building on top of that. And each little step seems like a really good unit of work for the language model. And also there was just like really helpful to have a bit more preexisting work to emulate. Yeah, that's kind of how we ended up at computational notebooks for Elicit.Andreas [00:33:44]: Maybe one thing that's worth making explicit is the difference between computational notebooks and chat, because on the surface, they seem pretty similar. It's kind of this iterative interaction where you add stuff. 
In both cases, you have a back and forth between you enter stuff and then you get some output and then you enter stuff. But the important difference in our minds is with notebooks, you can define a process. So in data science, you can be like, here's like my data analysis process that takes in a CSV and then does some extraction and then generates a figure at the end. And you can prototype it using a small CSV and then you can run it over a much larger CSV later. And similarly, the vision for notebooks in our case is to not make it this like one-off chat interaction, but to allow you to then say, if you start and first you're like, okay, let me just analyze a few papers and see, do I get to the correct conclusions for those few papers? Can I then later go back and say, now let me run this over 10,000 papers now that I've debugged the process using a few papers. And that's an interaction that doesn't fit quite as well into the chat framework because that's more for kind of quick back and forth interaction.Alessio [00:34:49]: Do you think in notebooks, it's kind of like structured, editable chain of thought, basically step by step? Like, is that kind of where you see this going? And then are people going to reuse notebooks as like templates? And maybe in traditional notebooks, it's like cookbooks, right? You share a cookbook, you can start from there. Is this similar in Elicit?Andreas [00:35:06]: Yeah, that's exactly right. So that's our hope that people will build templates, share them with other people. I think chain of thought is maybe still like kind of one level lower on the abstraction hierarchy than we would think of notebooks. I think we'll probably want to think about more semantic pieces like a building block is more like a paper search or an extraction or a list of concepts. And then the model's detailed reasoning will probably often be one level down. 
You always want to be able to see it, but you don't always want it to be front and center.Alessio [00:35:36]: Yeah, what's the difference between a notebook and an agent? Since everybody always asks me, what's an agent? Like how do you think about where the line is?Andreas [00:35:44]: Yeah, it's an interesting question. In the notebook world, I would generally think of the human as the agent in the first iteration. So you have the notebook and the human kind of adds little action steps. And then the next point on this kind of progress gradient is, okay, now you can use language models to predict which action would you take as a human. And at some point, you're probably going to be very good at this, you'll be like, okay, in some cases I can, with 99.9% accuracy, predict what you do. And then you might as well just execute it, like why wait for the human? And eventually, as you get better at this, that will just look more and more like agents taking actions as opposed to you doing the thing. I think templates are a specific case of this where you're like, okay, well, there's just particular sequences of actions that you often want to chunk and have available as primitives, just like in normal programming. And those, you can view them as action sequences of agents, or you can view them as more normal programming language abstraction thing. And I think those are two valid views. Yeah.Alessio [00:36:40]: How do you see this change as, like you said, the models get better and you need less and less human actual interfacing with the model, you just get the results? Like how does the UX and the way people perceive it change?Jungwon [00:36:52]: Yeah, I think this kind of interaction paradigms for evaluation is not really something the internet has encountered yet, because up to now, the internet has all been about getting data and work from people. 
So increasingly, I really want kind of evaluation, both from an interface perspective and from like a technical perspective and operation perspective to be a superpower for Elicit, because I think over time, models will do more and more of the work, and people will have to do more and more of the evaluation. So I think, yeah, in terms of the interface, some of the things we have today, you know, for every kind of language model generation, there's some citation back, and we kind of try to highlight the ground truth in the paper that is most relevant to whatever Elicit said, and make it super easy so that you can click on it and quickly see in context and validate whether the text actually supports the answer that Elicit gave. So I think we'd probably want to scale things up like that, like the ability to kind of spot check the model's work super quickly, scale up interfaces like that. And-Swyx [00:37:44]: Who would spot check? The user?Jungwon [00:37:46]: Yeah, to start, it would be the user. One of the other things we do is also kind of flag the model's uncertainty. So we have models report out, how confident are you that this was the sample size of this study? The model's not sure, we throw a flag. And so the user knows to prioritize checking that. So again, we can kind of scale that up. So when the model's like, well, I searched this on Google, I'm not sure if that was the right thing. I have an uncertainty flag, and the user can go and be like, oh, okay, that was actually the right thing to do or not.Swyx [00:38:10]: I've tried to do uncertainty readings from models. I don't know if you have this live. You do? Yeah. Because I just didn't find them reliable because they just hallucinated their own uncertainty. I would love to base it on log probs or something more native within the model rather than generated. But okay, it sounds like they scale properly for you. Yeah.Jungwon [00:38:30]: We found it to be pretty calibrated. 
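A minimal sketch of the uncertainty flagging Jungwon describes above, where the model reports a confidence for each extracted value and low-confidence values are surfaced so the user knows what to check first. The `Extraction` type, the `needs_review` helper, and the 0.7 threshold are all hypothetical names and values chosen for illustration; the conversation does not say how Elicit represents or thresholds confidence.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the transcript does not state what threshold is used.
FLAG_BELOW = 0.7


@dataclass
class Extraction:
    field: str         # what was asked for, e.g. "sample size"
    value: str         # what the model extracted
    confidence: float  # model-reported confidence in [0, 1]


def needs_review(items, threshold=FLAG_BELOW):
    """Return the low-confidence extractions, least confident first."""
    flagged = [e for e in items if e.confidence < threshold]
    return sorted(flagged, key=lambda e: e.confidence)


extractions = [
    Extraction("sample size", "120", 0.55),
    Extraction("intervention", "creatine", 0.95),
    Extraction("duration", "8 weeks", 0.30),
]

review_queue = needs_review(extractions)
print([e.field for e in review_queue])
```

Sorting by confidence is one possible design choice: it puts the values the model is least sure about at the top of the user's spot-check queue.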
It varies on the model.Andreas [00:38:32]: I think in some cases, we also use two different models for the uncertainty estimates than for the question answering. So one model would say, here's my chain of thought, here's my answer. And then a different type of model. Let's say the first model is Llama, and let's say the second model is GPT-3.5. And then the second model just looks over the results and is like, okay, how confident are you in this? And I think sometimes using a different model can be better than using the same model. Yeah.Swyx [00:38:58]: On the topic of models, evaluating models, obviously you can do that all day long. What's your budget? Because your queries fan out a lot. And then you have models evaluating models. One person typing in a question can lead to a thousand calls.Andreas [00:39:11]: It depends on the project. So if the project is basically a systematic review that otherwise human research assistants would do, then the project is basically a human equivalent spend. And the spend can get quite large for those projects. I don't know, let's say $100,000. In those cases, you're happier to spend compute then in the kind of shallow search case where someone just enters a question because, I don't know, maybe I heard about creatine. What's it about? Probably don't want to spend a lot of compute on that. This sort of being able to invest more or less compute into getting more or less accurate answers is I think one of the core things we care about. And that I think is currently undervalued in the AI space. I think currently you can choose which model you want and you can sometimes, I don't know, you'll tip it and it'll try harder or you can try various things to get it to work harder. But you don't have great ways of converting willingness to spend into better answers. 
And we really want to build a product that has this sort of unbounded flavor where if you care about it a lot, you should be able to get really high quality answers, really double checked in every way.Alessio [00:40:14]: And you have a credits-based pricing. So unlike most products, it's not a fixed monthly fee.Jungwon [00:40:19]: Right, exactly. So some of the higher costs are tiered. So for most casual users, they'll just get the abstract summary, which is kind of an open source model. Then you can add more columns, which have more extractions and these uncertainty features. And then you can also add the same columns in high accuracy mode, which also parses the table. So we kind of stack the complexity on the calls.Swyx [00:40:39]: You know, the fun thing you can do with a credit system, which is data for data, basically you can give people more credits if they give data back to you. I don't know if you've already done that. We've thought about something like this.Jungwon [00:40:49]: It's like if you don't have money, but you have time, how do you exchange that?Swyx [00:40:54]: It's a fair trade.Jungwon [00:40:55]: I think it's interesting. We haven't quite operationalized it. And then, you know, there's been some kind of like adverse selection. Like, you know, for example, it would be really valuable to get feedback on our model. So maybe if you were willing to give more robust feedback on our results, we could give you credits or something like that. But then there's kind of this, will people take it seriously? And you want the good people. Exactly.Swyx [00:41:11]: Can you tell who are the good people? Not right now.Jungwon [00:41:13]: But yeah, maybe at the point where we can, we can offer it. We can offer it up to them.Swyx [00:41:16]: The perplexity of questions asked, you know, if it's higher perplexity, these are the smarterJungwon [00:41:20]: people. 
Yeah, maybe.Andreas [00:41:23]: If you put typos in your queries, you're not going to get off the stage.Swyx [00:41:28]: Negative social credit. It's very topical right now to think about the threat of long context windows. All these models that we're talking about these days are all like a million tokens plus. Is that relevant for you? Can you make use of that? Is that just prohibitively expensive because you're just paying for all those tokens or you're just doing RAG?Andreas [00:41:44]: It's definitely relevant. And when we think about search, as many people do, we think about kind of a staged pipeline of retrieval where first you use a semantic search database with embeddings, get like the, in our case, maybe 400 or so most relevant papers. And then, then you still need to rank those. And I think at that point it becomes pretty interesting to use larger models. So specifically in the past, I think a lot of ranking was kind of per item ranking where you would score each individual item, maybe using increasingly expensive scoring methods and then rank based on the scores. But I think list-wise re-ranking where you have a model that can see all the elements is a lot more powerful because often you can only really tell how good a thing is in comparison to other things and what things should come first. It really depends on like, well, what other things are available, maybe you even care about diversity in your results. You don't want to show 10 very similar papers as the first 10 results. So I think long context models are quite interesting there. And especially for our case where we care more about power users who are perhaps a little bit more willing to wait a little bit longer to get higher quality results relative to people who just quickly check out things because why not? And I think being able to spend more on longer contexts is quite valuable.Jungwon [00:42:55]: Yeah. 
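A minimal sketch of the two-stage retrieval Andreas outlines above: a cheap per-item embedding search to shortlist candidates, followed by a re-ranking step that looks at the other selected items so near-duplicates do not crowd the top of the list. The second stage here uses maximal marginal relevance (MMR) as a simple stand-in; it is not the long-context, model-based list-wise re-ranker discussed in the conversation, and the toy two-dimensional "embeddings", the paper names, and the `trade_off` parameter are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, papers, k):
    """Stage 1: per-item scoring. Keep the k papers closest to the query."""
    ranked = sorted(papers, key=lambda p: cosine(query, papers[p]), reverse=True)
    return ranked[:k]


def rerank_mmr(query, papers, candidates, n, trade_off=0.5):
    """Stage 2: greedy selection that penalizes similarity to already-chosen papers."""
    chosen = []
    pool = list(candidates)
    while pool and len(chosen) < n:
        def score(p):
            relevance = cosine(query, papers[p])
            redundancy = max(
                (cosine(papers[p], papers[c]) for c in chosen), default=0.0
            )
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen


# Toy 2-d vectors standing in for real embeddings.
papers = {
    "a": [1.0, 0.1],
    "a_dup": [1.0, 0.12],  # nearly identical to "a"
    "b": [0.8, -0.6],      # relevant, but on a different angle
    "c": [-1.0, 0.0],      # irrelevant
}
query = [1.0, 0.0]

shortlist = retrieve(query, papers, k=3)
top = rerank_mmr(query, papers, shortlist, n=2)
print(shortlist, top)
```

In this toy run the per-item stage ranks the near-duplicate second, while the second stage prefers the less similar paper for the second slot, which is the diversity effect described in the conversation.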
I think one thing the longer context models changed for us is maybe a focus from breaking down tasks to breaking down the evaluation. So before, you know, if we wanted to answer a question from the full text of a paper, we had to figure out how to chunk it and like find the relevant chunk and then answer based on that chunk. And the nice thing was then you knew kind of which chunk the model used to answer the question. So if you want to help the user track it, yeah, you can be like, well, this was the chunk that the model got. And now if you put in the whole text of the paper, you have to like kind of find the chunk like more retroactively basically. And so you need kind of like a different set of abilities and obviously like a different technology to figure out. You still want to point the user to the supporting quotes in the text, but then the interaction is a little different.Swyx [00:43:38]: You like scan through and find some ROUGE score floor.Andreas [00:43:41]: I think there's an interesting space of almost research problems here because you would ideally make causal claims like if this hadn't been in the text, the model wouldn't have said this thing. And maybe you can do expensive approximations to that where like, I don't know, you just throw out a chunk of the paper and re-answer and see what happens. But hopefully there are better ways of doing that where you just get that kind of counterfactual information for free from the model.Alessio [00:44:06]: Do you think at all about the cost of maintaining RAG versus just putting more tokens in the window? I think in software development, a lot of times people buy developer productivity things so that we don't have to worry about it. Context window is kind of the same, right? You have to maintain chunking and like RAG retrieval and like re-ranking and all of this versus I just shove everything into the context and like it costs a little more, but at least I don't have to do all of that. 
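A minimal sketch of the "throw out a chunk and re-answer" approximation Andreas mentions above: answer once with the full text, then once per chunk with that chunk removed, and report the chunks whose removal changes the answer. `attribute_chunks` and `toy_answer` are hypothetical names; `toy_answer` is a trivial stand-in for a model call so the example runs without any API, and a real system would also need a softer notion of "the answer changed" than exact string equality.

```python
def attribute_chunks(chunks, answer_fn):
    """Return the baseline answer and the indices of chunks whose removal changes it."""
    baseline = answer_fn(" ".join(chunks))
    influential = []
    for i in range(len(chunks)):
        ablated = chunks[:i] + chunks[i + 1:]  # the text with chunk i left out
        if answer_fn(" ".join(ablated)) != baseline:
            influential.append(i)
    return baseline, influential


def toy_answer(text):
    """Stand-in for a model call: report the sample size if the text states one."""
    for token in text.split():
        if token.startswith("n="):
            return token.rstrip(".,")
    return "not stated"


chunks = [
    "We study creatine supplementation.",
    "Sample size was n=120 participants.",
    "Results are discussed below.",
]

baseline, influential = attribute_chunks(chunks, toy_answer)
print(baseline, influential)
```

This costs one extra model call per chunk, which is the expense Andreas points to when he hopes for a way to get the same counterfactual information more cheaply.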
Is that something you thought about?Jungwon [00:44:31]: I think we still like hit up against context limits enough that it's not really, do we still want to keep this RAG around? It's like we do still need it for the scale of the work that we're doing, yeah.Andreas [00:44:41]: And I think there are different kinds of maintainability. In one sense, I think you're right that throw everything into the context window thing is easier to maintain because you just can swap out a model. In another sense, if things go wrong, it's harder to debug where like, if you know, here's the process that we go through to go from 200 million papers to an answer. And there are like little steps and you understand, okay, this is the step that finds the relevant paragraph or whatever it may be. You'll know which step breaks if the answers are bad, whereas if it's just like a new model version came out and now it suddenly doesn't find your needle in a haystack anymore, then you're like, okay, what can you do? You're kind of at a loss.Alessio [00:45:21]: Let's talk a bit about, yeah, needle in a haystack and like maybe the opposite of it, which is like hard grounding. I don't know if that's like the best name to think about it, but I was using one of these chat-with-your-documents features and I put the AMD MI300 specs and the new Blackwell chips from NVIDIA and I was asking questions and does the AMD chip support NVLink? And the response was like, oh, it doesn't say in the specs. But if you ask GPT-4 without the docs, it would tell you no, because NVLink is an NVIDIA technology.Swyx [00:45:49]: It just says in the thing.Alessio [00:45:53]: How do you think about that? Does using the context sometimes suppress the knowledge that the model has?Andreas [00:45:57]: It really depends on the task because I think sometimes that is exactly what you want. So imagine you're a researcher, you're writing the background section of your paper and you're trying to describe what these other papers say. 
You really don't want extra information to be introduced there. In other cases where you're just trying to figure out the truth and you're giving the documents because you think they will help the model figure out what the truth is. I think you do want, if the model has a hunch that there might be something that's not in the papers, you do want to surface that. I think ideally you still don't want the model to just tell you, probably the ideal thing looks a bit more like agent control where the model can issue a query that then is intended to surface documents that substantiate its hunch. That's maybe a reasonable middle ground between model just telling you and model being fully limited to the papers you give it.Jungwon [00:46:44]: Yeah, I would say it's, they're just kind of different tasks right now. And the task that Elicit is mostly focused on is what do these papers say? But there's another task which is like, just give me the best possible answer and that give me the best possible answer sometimes depends on what do these papers say, but it can also depend on other stuff that's not in the papers. So ideally we can do both and then kind of do this overall task for you more going forward.Alessio [00:47:08]: We see a lot of details, but just to zoom back out a little bit, what are maybe the most underrated features of Elicit and what is one thing that maybe the users surprise you the most by using it?Jungwon [00:47:19]: I think the most powerful feature of Elicit is the ability to extract, add columns to this table, which effectively extracts data from all of your papers at once. It's well used, but there are kind of many different extensions of that that I think users are still discovering. So one is we let you give a description of the column. We let you give instructions of a column. We let you create custom columns. So we have like 30 plus predefined fields that users can extract, like what were the methods? What were the main findings? 
How many people were studied? And we actually show you basically the prompts that we're using to

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Latent Space Chats: NLW (Four Wars, GPT5), Josh Albrecht/Ali Rohde (TNAI), Dylan Patel/Semianalysis (Groq), Milind Naphade (Nvidia GTC), Personal AI (ft. Harrison Chase — LangFriend/LangMem)

Apr 6, 2024, 121:17


Our next 2 big events are AI UX and the World's Fair. Join and apply to speak/sponsor!

Due to timing issues we didn't have an interview episode to share with you this week, but not to worry, we have more than enough “weekend special” content in the backlog for you to get your Latent Space fix, whether you like thinking about the big picture, or learning more about the pod behind the scenes, or talking Groq and GPUs, or AI Leadership, or Personal AI. Enjoy!

AI Breakdown

The indefatigable NLW had us back on his show for an update on the Four Wars, covering Sora, Suno, and the reshaped GPT-4 Class Landscape, and a longer segment on AI Engineering trends covering the future LLM landscape (Llama 3, GPT-5, Gemini 2, Claude 4), Open Source Models (Mistral, Grok), Apple and Meta's AI strategy, new chips (Groq, MatX) and the general movement from baby AGIs to vertical Agents.

Thursday Nights in AI

We're also including swyx's interview with Josh Albrecht and Ali Rohde to reintroduce swyx and Latent Space to a general audience, and engage in some spicy Q&A.

Dylan Patel on Groq

We hosted a private event with Dylan Patel of SemiAnalysis (our last pod here). Not all of it could be released so we just talked about our Groq estimates.

Milind Naphade - Capital One

In relation to conversations at NeurIPS and Nvidia GTC and upcoming at World's Fair, we also enjoyed chatting with Milind Naphade about his AI Leadership work at IBM, Cisco, Nvidia, and now leading the AI Foundations org at Capital One. We covered:

* Milind's learnings from ~25 years in machine learning
* His first paper citation was 24 years ago
* Lessons from working with Jensen Huang for 6 years and being CTO of Metropolis
* Thoughts on relevant AI research
* GTC takeaways and what makes NVIDIA special

If you'd like to work on building solutions rather than platform (as Milind put it), his Applied AI Research team at Capital One is hiring, which falls under the Capital One Tech team.

Personal AI Meetup

It all started with a meme. Within days of each other, BEE, FRIEND, EmilyAI, Compass, Nox and LangFriend were all launching personal AI wearables and assistants. So we decided to put together the world's first Personal AI meetup featuring creators and enthusiasts of wearables. The full video is live now, with full show notes within.

Timestamps

* [00:01:13] AI Breakdown Part 1
* [00:02:20] Four Wars
* [00:13:45] Sora
* [00:15:12] Suno
* [00:16:34] The GPT-4 Class Landscape
* [00:17:03] Data War: Reddit x Google
* [00:21:53] Gemini 1.5 vs Claude 3
* [00:26:58] AI Breakdown Part 2
* [00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4
* [00:31:11] Open Source Models - Mistral, Grok
* [00:34:13] Apple MM1
* [00:37:33] Meta's $800b AI rebrand
* [00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents
* [00:47:28] Adept episode - Screen Multimodality
* [00:48:54] Top Model Research from January Recap
* [00:53:08] AI Wearables
* [00:57:26] Groq vs Nvidia month - GPU Chip War
* [01:00:31] Disagreements
* [01:02:08] Summer 2024 Predictions
* [01:04:18] Thursday Nights in AI - swyx
* [01:33:34] Dylan Patel - Semianalysis + Latent Space Live Show
* [01:34:58] Groq

Transcript

[00:00:00] swyx: Welcome to the Latent Space Podcast Weekend Edition. This is Charlie, your AI co host. Swyx and Alessio are off for the week, making more great content. We have exciting interviews coming up with Elicit, Chroma, Instructor, and our upcoming series on NSFW, Not Safe for Work AI. 
In today's episode, we're collating some of Swyx and Alessio's recent appearances, all in one place for you to find.[00:00:32] swyx: In part one, we have our first crossover pod of the year. In our listener survey, several folks asked for more thoughts from our two hosts. In 2023, Swyx and Alessio did crossover interviews with other great podcasts like the AI Breakdown, Practical AI, Cognitive Revolution, ThursdAI, and ChinaTalk, all of which you can find in the Latent Space About page.[00:00:56] swyx: NLW of the AI Breakdown asked us back to do a special on the Four Wars framework and the AI engineer scene. We love AI Breakdown as one of the best examples of daily podcasts to keep up on AI news, so we were especially excited to be back on. Watch out and take[00:01:12] NLW: care[00:01:13] AI Breakdown Part 1[00:01:13] NLW: today on the AI breakdown. Part one of my conversation with Alessio and swyx from Latent Space.[00:01:19] NLW: All right, fellas, welcome back to the AI Breakdown. How are you doing? I'm good. Very good. With the last, the last time we did this show, we were like, oh yeah, let's do check ins like monthly about all the things that are going on and then. Of course, six months later, and, you know, the, the, the world has changed in a thousand ways.[00:01:36] NLW: It's just, it's too busy to even, to even think about podcasting sometimes. But I, I'm super excited to, to be chatting with you again. I think there's, there's a lot to, to catch up on, just to tap in, I think in the, you know, in the beginning of 2024. 
And, and so, you know, we're gonna talk today about just kind of a, a, a broad sense of where things are in some of the key battles in the AI space.[00:01:55] NLW: And then the, you know, one of the big things that I, that I'm really excited to have you guys on here for us to talk about where, sort of what patterns you're seeing and what people are actually trying to build, you know, where, where developers are spending their, their time and energy and, and, and any sort of, you know, trend trends there, but maybe let's start I guess by checking in on a framework that you guys actually introduced, which I've loved and I've cribbed a couple of times now, which is this sort of four wars of the, of the AI stack.[00:02:20] Four Wars[00:02:20] NLW: Because first, since I have you here, I'd love, I'd love to hear sort of like where that started gelling. And then and then maybe we can get into, I think a couple of them that are you know, particularly interesting, you know, in the, in light of[00:02:30] swyx: some recent news. Yeah, so maybe I'll take this one. So the four wars is a framework that I came up around trying to recap all of 2023.[00:02:38] swyx: I tried to write sort of monthly recap pieces. And I was trying to figure out like what makes one piece of news last longer than another or more significant than another. And I think it's basically always around battlegrounds. Wars are fought around limited resources. And I think probably the, you know, the most limited resource is talent, but the talent expresses itself in a number of areas.[00:03:01] swyx: And so I kind of focus on those, those areas at first. So the four wars that we cover are the data wars, the GPU rich, poor war, the multi modal war, And the RAG and Ops War. And I think you actually did a dedicated episode to that, so thanks for covering that. Yeah, yeah.[00:03:18] NLW: Not only did I do a dedicated episode, I actually used that.[00:03:22] NLW: I can't remember if I told you guys. 
I did give you big shoutouts. But I used it as a framework for a presentation at Intel's big AI event that they hold each year, where they have all their folks who are working on AI internally. And it totally resonated. That's amazing. Yeah, so, what got me thinking about it again is specifically this Inflection news that we recently had, this sort of, you know, basically, I can't imagine that anyone who's listening wouldn't have thought about it, but, you know, Inflection is one of the big contenders, right?[00:03:53] NLW: I think probably most folks would have put them, you know, just a half step behind the Anthropics and OpenAIs of the world in terms of labs, but it's a company that raised $1.3 billion last year, less than a year ago. Reid Hoffman's a co-founder, along with Mustafa Suleyman, who's a co-founder of DeepMind, you know, so it's like, this is not a small startup, let's say, at least in terms of perception.[00:04:13] NLW: And then we get the news that basically most of the team, it appears, is heading over to Microsoft and they're bringing in a new CEO. And you know, I'm interested in kind of your take on how much that reflects, like hold aside, I guess, you know, all the other things that it might be about, how much it reflects this sort of the stark,[00:04:32] NLW: brutal reality of competing in the frontier model space right now. And, you know, just the access to compute.[00:04:38] Alessio: There are a lot of things to say. So first of all, there's always somebody who's more GPU rich than you. So Inflection is GPU rich by startup standards. I think about 22,000 H100s, but obviously that pales compared to Microsoft.[00:04:55] Alessio: The other thing is that this is probably good news, maybe for the startups. It's like, being GPU rich, it's not enough. You know, like I think they were building something pretty interesting in Pi, of their own model, of their own kind of experience. 
But at the end of the day, the interface that people consume as end users[00:05:13] Alessio: is really similar to a lot of the others. And we'll talk about GPT-4 and Claude 3 and all this stuff. GPU poor, doing something that the GPU rich are not interested in. You know, we just had our AI center of excellence at Decibel and one of the AI leads at one of the big companies was like, Oh, we just saved 10 million and we use these models to do a translation, you know, and that's it.[00:05:39] Alessio: It's not AGI, it's just translation. So I think like the Inflection part is maybe a wake-up call for a lot of startups, to say, Hey, you know, trying to get as much capital as possible, trying to get as many GPUs as possible, good. But at the end of the day, it doesn't build a business, you know. And again, I don't know the reasons behind the Inflection choice, but if you say, I don't want to build my own company that has[00:06:05] Alessio: $1.3 billion and I want to go do it at Microsoft, it's probably not a resources problem. It's more of strategic decisions that you're making as a company. So yeah, that was kind of my take on it.[00:06:15] swyx: Yeah, and I guess on my end, two things actually happened yesterday. It was a little bit quieter news, but Stability AI had some pretty major departures as well.[00:06:25] swyx: And you may not be considering it, but Stability is actually also a GPU rich company in the sense that they were the first new startup in this AI wave to brag about how many GPUs they have, and that you should join them. And you know, Emad is definitely a GPU trader in some sense from his hedge fund days.[00:06:43] swyx: So Robin Rombach and most of the Stable Diffusion 3 people left Stability yesterday as well. So yesterday was kind of like a big news day for the GPU rich companies, both Inflection and Stability having sort of wind taken out of their sails. 
I think, yes, it's a data point in favor of, like, just because you have the GPUs doesn't mean you automatically win.[00:07:03] swyx: And I think, you know, I'll kind of echo what Alessio says there. But in general also, like, I wonder if this is like the start of a major consolidation wave, just in terms of, you know, I think that there was a lot of funding last year and, you know, the business models have not, you know, all worked out very well.[00:07:19] swyx: Even Inflection couldn't do it. And so I think maybe that's the start of a small consolidation wave. I don't think that's like a sign of AI winter. I keep looking for AI winter coming. I think this is kind of like a brief cold front. Yeah,[00:07:34] NLW: it's super interesting. So I think a bunch of stuff here.[00:07:38] NLW: One is, I think, to both of your points, in some ways, there had already been this very clear demarcation between these two sides where, like, the GPU poors, to use the terminology, like, just weren't trying to compete on the same level, right? You know, the vast majority of people who have started something over the last year, year and a half, call it, were racing in a different direction.[00:07:59] NLW: They're trying to find some edge somewhere else. They're trying to build something different. If they're really trying to innovate, it's in different areas. And so it's really just this very small handful of companies that are in this like very, you know, it's like the Coheres and Jaspers of the world, you know, that are just sort of a little bit less resourced than, you know, the other set, that I think this potentially even applies to. You know, everyone else you could clearly demarcate into these two sides.[00:08:26] NLW: And there's only a small handful kind of sitting uncomfortably in the middle, perhaps. 
Let's, let's come back to the idea of, of the sort of AI winter or, you know, a cold front or anything like that. So this is something that I, I spent a lot of time kind of thinking about and noticing. And my perception is that The vast majority of the folks who are trying to call for sort of, you know, a trough of disillusionment or, you know, a shifting of the phase to that are people who either, A, just don't like AI for some other reason there's plenty of that, you know, people who are saying, You Look, they're doing way worse than they ever thought.[00:09:03] NLW: You know, there's a lot of sort of confirmation bias kind of thing going on. Or two, media that just needs a different narrative, right? Because they're sort of sick of, you know, telling the same story. Same thing happened last summer, when every every outlet jumped on the chat GPT at its first down month story to try to really like kind of hammer this idea that that the hype was too much.[00:09:24] NLW: Meanwhile, you have, you know, just ridiculous levels of investment from enterprises, you know, coming in. You have, you know, huge, huge volumes of, you know, individual behavior change happening. But I do think that there's nothing incoherent sort of to your point, Swyx, about that and the consolidation period.[00:09:42] NLW: Like, you know, if you look right now, for example, there are, I don't know, probably 25 or 30 credible, like, build your own chatbot. platforms that, you know, a lot of which have, you know, raised funding. There's no universe in which all of those are successful across, you know, even with a, even, even with a total addressable market of every enterprise in the world, you know, you're just inevitably going to see some amount of consolidation.[00:10:08] NLW: Same with, you know, image generators. There are, if you look at A16Z's top 50 consumer AI apps, just based on, you know, web traffic or whatever, they're still like I don't know, a half. 
Dozen or 10 or something, like, some ridiculous number of, like, basically things like Midjourney or DALL-E 3. And it just seems impossible that we're gonna have that many, you know, ultimately as sort of, you know, going concerns.[00:10:33] NLW: So, I don't know. I think that there will be inevitable consolidation 'cause, you know, it's also what kind of like venture rounds are supposed to do. Not everyone who gets a seed round is supposed to get to series A and not everyone who gets a series A is supposed to get to series B.[00:10:46] NLW: That's sort of the natural process. I think it will be tempting for a lot of people to try to infer from that something about AI not being as sort of big or as sort of relevant as it was hyped up to be. But I kind of think that's the wrong conclusion to come to.[00:11:02] Alessio: I would say the experimentation[00:11:04] Alessio: surface is a little smaller for image generation. So if you go back maybe six, nine months, most people will tell you, why would you build a coding assistant when like Copilot and GitHub are just going to win everything because they have the data and they have all the stuff. If you fast forward to today, a lot of people use Cursor. Everybody was excited about the Devin release on Twitter.[00:11:26] Alessio: There are a lot of different ways of attacking the market that are not completion of code in the IDE. And even Cursor, like, they evolved beyond single line to like chat, to do multi line edits and all that stuff. Image generation, I would say, yeah, just from what I've seen, like maybe the product innovation has slowed down at the UX level and people are improving the models.[00:11:50] Alessio: So the race is like, how do I make better images? It's not like, how do I make the user interact with the generation process better? And that gets tough, you know? It's hard to like really differentiate yourselves. 
So yeah, that's kind of how I look at it. And when we think about multimodality, maybe the reason why people got so excited about Sora is like, oh, this is like a completely It's not a better image model.[00:12:13] Alessio: This is like a completely different thing, you know? And I think the creative mind It's always looking for something that impacts the viewer in a different way, you know, like they really want something different versus the developer mind. It's like, Oh, I, I just, I have this like very annoying thing I want better.[00:12:32] Alessio: I have this like very specific use cases that I want to go after. So it's just different. And that's why you see a lot more companies in image generation. But I agree with you that. If you fast forward there, there's not going to be 10 of them, you know, it's probably going to be one or[00:12:46] swyx: two. Yeah, I mean, to me, that's why I call it a war.[00:12:49] swyx: Like, individually, all these companies can make a story that kind of makes sense, but collectively, they cannot all be true. Therefore, they all, there is some kind of fight over limited resources here. Yeah, so[00:12:59] NLW: it's interesting. 
We wandered very naturally into sort of another one of these wars, which is the multimodality kind of idea, which is, you know, basically a question of whether it's going to be these sort of big everything models that end up winning or whether, you know, you're going to have really specific things, you know, like something, you know, DALL-E 3 inside of sort of OpenAI's larger models versus, you know, a Midjourney or something like that.[00:13:24] NLW: And at first, you know, I was kind of thinking like, for most of the last, call it six months or whatever, it feels pretty definitively both and in some ways, you know, in that you're seeing just like great innovation on sort of the everything models, but you're also seeing lots and lots happen at sort of the level of kind of individual use cases.[00:13:45] Sora[00:13:45] NLW: But then Sora comes along and just like obliterates what I think anyone thought, you know, where we were when it comes to video generation. So how are you guys thinking about this particular battle or war at the moment?[00:13:59] swyx: Yeah, this was definitely a both and story, and Sora tipped things one way for me, in terms of scale being all you need.[00:14:08] swyx: And the benefit, I think, of having multiple models being developed under one roof. I think a lot of people aren't aware that Sora was developed in a similar fashion to DALL-E 3. And DALL-E 3 had a very interesting paper out where they talked about how they sort of bootstrapped their synthetic data based on GPT-4 Vision and GPT-4.[00:14:31] swyx: And it was just all, like, really interesting, like, if you work on one modality, it enables you to work on other modalities, and all that is more interesting. 
I think it's beneficial if it's all in the same house, whereas the individual startups who sort of carve out a single modality and work on that definitely won't have the state of the art stuff helping them out on synthetic data.[00:14:52] swyx: So I do think, like, the balance is tilted a little bit towards the God model companies, which is challenging for the sort of dedicated modality companies. But everyone's carving out different niches. You know, like we just interviewed Suno AI, the sort of music model company, and, you know, I don't see OpenAI pursuing music anytime soon.[00:15:12] Suno[00:15:12] swyx: Yeah,[00:15:13] NLW: Suno's been phenomenal to play with. Suno has done that rare thing, which I think a number of different AI product categories have done, where people who don't consider themselves particularly interested in doing the thing that the AI enables find themselves doing a lot more of that thing, right?[00:15:29] NLW: Like, it'd be one thing if just musicians were excited about Suno and using it, but what you're seeing is tons of people who just like music all of a sudden like playing around with it and finding themselves kind of down that rabbit hole, which I think is kind of like the highest compliment that you can give one of these startups at the[00:15:45] swyx: early days of it.[00:15:46] swyx: Yeah, you know, I asked them directly, you know, in the interview about whether they consider themselves Midjourney for music. And he had a more sort of nuanced response there, but I think that probably the business model is going to be very similar because he's focused on the B2C element of that. 
So yeah, I mean, you know, just to tie back to the question about, you know, large multi modality companies versus small dedicated modality companies.[00:16:10] swyx: Yeah, highly recommend people to read the Sora blog posts and then read through to the DALL-E blog posts because they strongly correlated themselves with the same synthetic data bootstrapping methods as DALL-E. And I think once you make those connections, you're like, oh, like, it is beneficial to have multiple state of the art models in house that all help each other.[00:16:28] swyx: And that's the one thing that a dedicated modality company cannot do.[00:16:34] The GPT-4 Class Landscape[00:16:34] NLW: So I wanna kind of build off that and move into the sort of like updated GPT-4 class landscape, 'cause that's obviously been another big change over the last couple months. But for the sake of completeness, is there anything that's worth touching on with sort of the quality[00:16:46] NLW: data or sort of RAG and Ops wars, just in terms of, you know, anything that's changed, I guess, for you fundamentally in the last couple of months about where those things stand.[00:16:55] swyx: So I think we're going to talk about RAG for the Gemini and Claude discussion later. And so maybe briefly discuss the data piece.[00:17:03] Data War: Reddit x Google[00:17:03] swyx: I think maybe the only new thing was this Reddit deal with Google for like a 60 million dollar deal just ahead of their IPO, very conveniently turning Reddit into an AI data company. Also, very, very interestingly, a non exclusive deal, meaning that Reddit can resell that data to someone else. And it probably does become table stakes.[00:17:23] swyx: A lot of people don't know, but a lot of the WebText dataset that originally started for GPT 1, 2, and 3 was actually scraped from Reddit, at least the sort of vote scores. 
And I think that's a very valuable piece of information. So like, yeah, I think people are figuring out how to pay for data.[00:17:40] swyx: People are suing each other over data. This war is, you know, definitely very, very much heating up. And I don't see it getting any less intense. You know, next to GPUs, data is going to be the most expensive thing in a model stack company. And, you know, a lot of people are resorting to synthetic versions of it, which may or may not be kosher based on how far along or how commercially blessed the forms of creating that synthetic data are.[00:18:11] swyx: I don't know if, Alessio, you have any other interactions with like data source companies, but that's my two cents.[00:18:17] Alessio: Yeah, I actually saw Quentin Anthony from EleutherAI at GTC this week. He's also been working on this. I saw Teknium. He's also been working on the data side. I think especially in open source, people are like, okay, if everybody is putting the gates up, so to speak, to the data, we need to make it easier for people that don't have 50 million a year to get access to good data sets.[00:18:38] Alessio: And Jensen, at his keynote, he did talk about synthetic data a little bit. So I think that's something that we'll definitely hear more and more of in the enterprise, which never bodes well, because then all the people with the data are like, Oh, the enterprises want to pay now? Let me put a "pay here" Stripe link so that they can give me 50 million.[00:18:57] Alessio: But it worked for Reddit. I think the stock is up 40 percent today after opening. So yeah, I don't know if it's all about the Google deal, but obviously Reddit has been one of those companies where, hey, you got all this like great community, but like, how are you going to make money? And like, they tried to sell the avatars.[00:19:15] Alessio: I don't know if that's a great business for them. 
As an investor, you know, the data part sounds a lot more interesting than consumer[00:19:25] swyx: cosmetics. Yeah, so I think, you know, there's more questions around data. You know, I think a lot of people are talking about the interview that Mira Murati did with the Wall Street Journal, where she, like, just basically had no good answer for where they got the data for Sora.[00:19:39] swyx: I think this is where, you know, it's in nobody's interest to be transparent about data, and it's kind of sad for the state of ML and the state of AI research, but it is what it is. We have to figure this out as a society, just like we did for music and music sharing, you know, in sort of the Napster to Spotify transition, and that might take us a decade.[00:19:59] swyx: Yeah, I[00:20:00] NLW: do. I agree. I think that you're right to identify it not just as that sort of technical problem, but as one where society has to have a debate with itself. Because I think that, if you look at it rationally, there's great kind of points on all sides, not to be the sort of, you know, person who sits in the middle constantly, but it's why I think a lot of these legal decisions are going to be really important because, you know, the job of judges is to listen to all this stuff and try to come to things and then have other judges disagree.[00:20:24] NLW: And, you know, and have the rest of us all debate at the same time. By the way, as a total aside, I feel like the synthetic data right now is like eggs in the 80s and 90s. Like, whether they're good for you or bad for you, like, you know, we get one study that's like synthetic data, you know, there's model collapse.[00:20:42] NLW: And then we have like a hint that Llama 2, the most high performance version of it, which was one they didn't release, was trained on synthetic data. So maybe it's good. 
It's like, I just feel like every other week I'm seeing something sort of different about whether it's good or bad for these models.[00:20:56] swyx: Yeah. The branding of this is pretty poor. I would kind of tell people to think about it like cholesterol. There's good cholesterol, bad cholesterol. And you can have, you know, good amounts of both. But at this point, it is absolutely without a doubt that most large models from here on out will all be trained on some kind of synthetic data and that is not a bad thing.[00:21:16] swyx: There are ways in which you can do it poorly, whether it's, you know, in terms of commercial sourcing or in terms of the model performance. But it's without a doubt that good synthetic data is going to help your model. And this is just a question of like where to obtain it and what kinds of synthetic data are valuable.[00:21:36] swyx: You know, even like AlphaGeometry, you know, was a really good example from like earlier this year.[00:21:42] NLW: If you're using the cholesterol analogy, then my egg thing can't be that far off. Let's talk about the sort of the state of the art and the GPT-4 class landscape and how that's changed.[00:21:53] Gemini 1.5 vs Claude 3[00:21:53] NLW: Cause obviously, you know, sort of the two big things, or a couple of the big things that have happened since we last talked were, one, you know, Gemini first announcing that a model was coming and then finally it arriving, and then very soon after a sort of a different model arriving from Gemini, and Claude 3.[00:22:11] NLW: So I guess, you know, I'm not sure exactly where the right place to start with this conversation is, but, you know, maybe very broadly speaking, which of these do you think have made a bigger impact?[00:22:20] Alessio: Probably the one you can use, right? So, Claude. 
Well, I'm sure Gemini is going to be great once they let me in, but so far I haven't been able to.[00:22:29] Alessio: So I have this small podcaster thing that I built for our podcast, which does chapters creation, like named entity recognition, summarization, and all of that. Claude 3 is better than GPT-4. Claude 2 was unusable, so I used GPT-4 for everything. And then when Opus came out, I tried them again side by side and I posted it on Twitter as well.[00:22:53] Alessio: Claude is better. It's very good, you know, it seems to me it's much better than GPT-4 at doing writing that is more, you know, I don't know, it just got good vibes, you know. Like the GPT-4 text, you can tell it's like GPT-4, you know, it's like, it always uses certain types of words and phrases and, you know, maybe it's just me because I've now done it for, you know, so, I've read like 75, 80 generations of these things next to each other.[00:23:21] Alessio: Claude is really good. I know everybody is freaking out on Twitter about it. My only experience of it being much better has been on the podcast use case. But I know that, you know, Karan from Nous Research is a very big pro-Opus person. So, I think it's also great to have people that actually care about other models.[00:23:40] Alessio: You know, I think so far to a lot of people, maybe Anthropic has been the sibling in the corner, you know, it's like Claude releases a new model and then OpenAI releases Sora and like, you know, there are like all these different things, but yeah, the new models are good. It's interesting.[00:23:55] NLW: My perception is definitely that, just observationally, Claude 3 is certainly the first thing that I've seen where lots of people,[00:24:06] NLW: no one's debating evals or anything like that. 
They're talking about the specific use cases that they have, that they used to use ChatGPT for every day, you know, day in, day out, that they've now just switched over. And that has, I think, shifted a lot of the sort of like vibe and sentiment in the space too.[00:24:26] NLW: And I don't necessarily think that it's sort of a, like, full knock. Let's put it this way. I think it's less bad for OpenAI than it is good for Anthropic. I think that because GPT-5 isn't there, people are not quite willing to sort of like, you know, get overly critical of OpenAI, except in so far as they're wondering where GPT-5 is.[00:24:46] NLW: But I do think that it makes Anthropic look way more credible as a player, you know, as opposed to where they were.[00:24:57] Alessio: Yeah. And I would say the benchmarks veil is probably getting lifted this year. I think last year, people were like, okay, this is better than this on this benchmark, blah, blah, blah, because maybe they did not have a lot of use cases that they did frequently.[00:25:11] Alessio: So it's hard to like compare yourself. So you defer to the benchmarks. I think now as we go into 2024, a lot of people have started to use these models, from, you know, very sophisticated things that they run in production to some utility that they have on their own. Now they can just run them side by side.[00:25:29] Alessio: And it's like, Hey, I don't care that, like, the MMLU score of Opus is like slightly lower than GPT-4. It just works for me, you know, and I think that's the same way that traditional software has been used by people, right? Like you just try it for yourself and see which one works best for you.[00:25:48] Alessio: Like nobody looks at benchmarks outside of like sales white papers, you know? And I think it's great that we're going more in that direction. 
We have an episode with Adept coming out this weekend, and in some of their model releases, they specifically say, We do not care about benchmarks, so we didn't put them in, you know, because we don't want to look good on them.[00:26:06] Alessio: We just want the product to work. And I think more and more people will[00:26:09] swyx: go that way. Yeah. I would say, like, it does take the wind out of the sails for GPT-5, which I know we're, you know, curious about later on. I think anytime you put out a new state of the art model, you have to break through in some way.[00:26:21] swyx: And what Claude and Gemini have done is effectively take away any advantage to saying that you have a million token context window. Now everyone's just going to be like, Oh, okay. Now you just match the other two guys. And so that puts an insane amount of pressure on what GPT-5 is going to be, because it's just going to have, like, the only option it has now, because all the other models are multimodal, all the other models are long context, all the other models have perfect recall, GPT-5 has to match everything and do more to not be a flop.[00:26:58] AI Breakdown Part 2[00:26:58] NLW: Hello friends, back again with part two. If you haven't heard part one of this conversation, I suggest you go check it out, but to be honest, they are kind of actually separable. In this conversation, we get into a topic that I think Alessio and Swyx are very well positioned to discuss, which is what developers care about right now, what people are trying to build around.[00:27:16] NLW: I honestly think that one of the best ways to see the future in an industry like AI is to try to dig deep on what developers and entrepreneurs are attracted to build, even if it hasn't made it to the news pages yet. So consider this your preview of six months from now, and let's dive in. 
Let's bring it to the GPT-5 conversation.[00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4[00:27:33] NLW: I mean, so I think that that's a great sort of assessment of just how the stakes have been raised. I mean, so I guess maybe I'll frame this less as a question, just sort of something that I've been watching right now. The only thing that makes sense to me with how[00:27:50] NLW: fundamentally unbothered and unstressed OpenAI seems about everything is that they're sitting on something that does meet all that criteria, right? Because, I mean, even in the Lex Fridman interview that Altman recently did, you know, he's talking about other things coming out first. He's just like, listen, he's good and he could play nonchalant, you know, if he wanted to.[00:28:13] NLW: So I don't want to read too much into it, but, you know, they've had so long to work on this, like unless we are like really meaningfully running up against some constraint, it just feels like, you know, there's going to be some massive increase, but I don't know. What do you guys think?[00:28:28] swyx: Hard to speculate.[00:28:29] swyx: You know, at this point, they're pretty good at PR and they're not going to tell you anything that they don't want to. And they can tell you one thing and change their minds the next day. So it's really, you know, I've always said that model version numbers are just marketing exercises, like they have something and it's always improving and at some point you just cut it and decide to call it GPT-5.[00:28:50] swyx: And it's more just about defining an arbitrary level at which they're ready and it's up to them on what ready means. We definitely did see some leaks on GPT-4.5, as I think a lot of people reported and I'm not sure if you covered it. So it seems like there might be an intermediate release. 
But I did feel, coming out of the Lex Fridman interview, that GPT-5 was nowhere near.[00:29:11] swyx: And you know, it was kind of a sharp contrast to Sam talking at Davos in February, saying that, you know, it was his top priority. So I find it hard to square. And honestly, like, there's also no point reading too much tea leaves into what any one person says about something that hasn't happened yet or a decision that hasn't been taken yet.[00:29:31] swyx: Yeah, that's my 2 cents about it. Like, calm down, let's just build.[00:29:35] Alessio: Yeah. The February rumor was that they were gonna work on AI agents, so I don't know, maybe they're like, yeah,[00:29:41] swyx: they had, I think, two agent projects, right? One desktop agent and one sort of more general, yeah, sort of GPTs-like agent. And then Andrej left, so he was supposed to be the guy on that.[00:29:52] swyx: What did Andrej see? What did he see? I don't know. What did he see?[00:29:56] Alessio: I don't know. But again, it's just like the rumors are always floating around, you know, but I think like, this is, you know, we're not going to get to the end of the year without GPT-5, you know, that's definitely happening. I think the biggest question is like, are Anthropic and Google[00:30:13] Alessio: increasing the pace, you know, like is Claude 4 coming out in like 12 months, like nine months? What's the deal? Same with Gemini. They went from like 1 to 1.5 in like five days or something. So when's Gemini 2 coming out, you know, is that going to be soon? I don't know.[00:30:31] Alessio: There are a lot of speculations, but the good thing is that now you can see a world in which OpenAI doesn't rule everything. You know, so that's the best news that everybody got, I would say.[00:30:43] swyx: Yeah, and Mistral Large also dropped in the last month. 
And, you know, not as, not quite GPT-4 class, but very good from a new startup.[00:30:52] swyx: So yeah, we, we have now slowly changed in landscape, you know. In my January recap, I was complaining that nothing's changed in the landscape for a long time. But now we do exist in a world, sort of a multipolar world where Claude and Gemini are legitimate challengers to GPT-4 and hopefully more will emerge as well hopefully from Meta.[00:31:11] Open Source Models - Mistral, Grok[00:31:11] NLW: So speak, let's actually talk about sort of the open source side of this for a minute. So Mistral Large, notable because it's, it's not available open source in the same way that other things are, although I think my perception is that the community has largely given them, like, the community largely recognizes that they want them to keep building open source stuff and they have to find some way to fund themselves that they're going to do that.[00:31:27] NLW: And so they kind of understand that there's like, they got to figure out how to eat, but we've got, so, you know, there's Mistral, there's, I guess, Grok now, which is, you know, Grok-1 is from, from October is, is open[00:31:38] swyx: sourced at, yeah. Yeah, sorry, I thought you thought you meant Groq the chip company.[00:31:41] swyx: No, no, no, yeah, you mean Twitter Grok.[00:31:43] NLW: Although Groq the chip company, I think is even more interesting in some ways, but and then there's the, you know, obviously Llama 3 is the one that sort of everyone's wondering about too.
And, you know, my, my sense of that, the little bit that, you know, Zuckerberg was talking about Llama 3 earlier this year, suggested that, at least from an ambition standpoint, he was not thinking about how do I make sure that, you know, Meta, you know, keeps, keeps the open source throne, you know, vis a vis Mistral.[00:32:09] NLW: He was thinking about how you go after, you know, how, how he, you know, releases a thing that's, you know, every bit as good as whatever OpenAI is on at that point.[00:32:16] Alessio: Yeah. From what I heard in the hallways at, at GTC, Llama 3, the, the biggest model will be, you know, 260 to 300 billion parameters, so that that's quite large.[00:32:26] Alessio: That's not an open source model. You know, you cannot give people a 300 billion parameters model and ask them to run it. You know, it's very compute intensive. So I think it is, it[00:32:35] swyx: can be open source. It's just, it's going to be difficult to run, but that's a separate question.[00:32:39] Alessio: It's more like, as you think about what they're doing it for, you know, it's not like empowering the person running[00:32:45] Alessio: Llama on, on their laptop, it's like, oh, you can actually now use this to go after OpenAI, to go after Anthropic, to go after some of these companies at like the middle complexity level, so to speak. Yeah. So obviously, you know, we had Soumith Chintala on the podcast, they're doing a lot here, they're making PyTorch better.[00:33:03] Alessio: You know, they want to, that's kind of like maybe a little bit of a short Nvidia, in a way, trying to get some of the CUDA dominance out of it. Yeah, no, it's great. The, I love the Zuck destroying a lot of monopolies arc. You know, it's, it's been very entertaining.
Let's bridge[00:33:18] NLW: into the sort of big tech side of this, because this is obviously like, so I think actually when I did my episode, this was one of the, I added this as one of, as an additional war that, that's something that I'm paying attention to.[00:33:29] NLW: So we've got Microsoft's moves with Inflection, which I think potentially are being read as a shift vis a vis the relationship with OpenAI, which also the sort of Mistral Large relationship seems to reinforce as well. We have Apple potentially entering the race, finally, you know, giving up Project Titan and, and kind of trying to spend more effort on this.[00:33:50] NLW: Although, counterpoint, we also have them talking about it, or there being reports of a deal with Google, which, you know, is interesting to sort of see what their strategy there is. And then, you know, Meta's been largely quiet. We kind of just talked about the main piece, but, you know, there's, and then there's spoilers like Elon.[00:34:07] NLW: I mean, you know, what, what of those things has sort of been most interesting to you guys as you think about what's going to shake out for the rest of this[00:34:13] Apple MM1[00:34:13] swyx: year? I'll take a crack. So the reason we don't have a fifth war for the Big Tech Wars is that's one of those things where I just feel like we don't cover differently from other media channels, I guess.[00:34:26] swyx: Sure, yeah. In our anti interestness, we actually say, like, we try not to cover the Big Tech Game of Thrones, or it's proxied through Twitter. You know, all the other four wars anyway, so there's just a lot of overlap. Yeah, I think absolutely, personally, the most interesting one is Apple entering the race.[00:34:41] swyx: They actually released, they announced their first large language model that they trained themselves. It's like a 30 billion multimodal model.
People weren't that impressed, but it was like the first time that Apple has kind of showcased that, yeah, we're training large models in house as well. Of course, like, they might be doing this deal with Google.[00:34:57] swyx: I don't know. It sounds very sort of rumor-y to me. And it's probably, if it's on device, it's going to be a smaller model. So something like a Gemma. It's going to be smarter autocomplete. I don't know what to say. I'm still here dealing with, like, Siri, which hasn't, probably hasn't been updated since God knows when it was introduced.[00:35:16] swyx: It's horrible. I, you know, it, it, it makes me so angry. So I, I, one, as an Apple customer and user, I, I'm just hoping for better AI on Apple itself. But two, they are the gold standard when it comes to local devices, personal compute and, and trust, like you, you trust them with your data. And I think that's what a lot of people are looking for in AI, that they have, they love the benefits of AI, they don't love the downsides, which is that you have to send all your data to some cloud somewhere.[00:35:45] swyx: And some of this data that we're going to feed AI is just the most personal data there is. So Apple being like one of the most trusted personal data companies, I think it's very important that they enter the AI race, and I hope to see more out of them.[00:35:58] Alessio: To me, the, the biggest question with the Google deal is like, who's paying who?[00:36:03] Alessio: Because for the browsers, Google pays Apple like 18, 20 billion every year to be the default browser. Is Google going to pay you to have Gemini or is Apple paying Google to have Gemini? I think that's, that's like what I'm most interested to figure out because with the browsers, it's like, it's the entry point to the thing.[00:36:21] Alessio: So it's really valuable to be the default. That's why Google pays. But I wonder if like the perception in AI is going to be like, Hey.
You just have to have a good local model on my phone to be worth me purchasing your device. And that would, that'd kind of drive Apple to be the one buying the model. But then, like Shawn said, they're doing the MM1 themselves.[00:36:40] Alessio: So are they saying we do models, but they're not as good as the Google ones? I don't know. The whole thing is, it's really confusing, but it makes for great meme material on Twitter.[00:36:51] swyx: Yeah, I mean, I think, like, they are possibly more than OpenAI and Microsoft and Amazon. They are the most full stack company there is in computing, and so, like, they own the chips, man.[00:37:05] swyx: Like, they manufacture everything so if, if, if there was a company that could, you know, seriously challenge the other AI players, it would be Apple. And it's, I don't think it's as hard as self driving. So like maybe they've, they've just been investing in the wrong thing this whole time. We'll see.[00:37:21] swyx: Wall Street certainly thinks[00:37:22] NLW: so. Wall Street loved that move, man. There's a big, a big sigh of relief. Well, let's, let's move away from, from sort of the big stuff. I mean, the, I think to both of your points, it's going to.[00:37:33] Meta's $800b AI rebrand[00:37:33] NLW: Can I, can[00:37:34] swyx: I, can I, can I jump on factoid about this, this Wall Street thing? I went and looked at when Meta went from being a VR company to an AI company.[00:37:44] swyx: And I think the stock I'm trying to look up the details now. The stock has gone up 187% since Llama 1. Yeah. Which is $830 billion in market value created in the past year. Yeah. Yeah.[00:37:57] NLW: It's, it's, it's like, remember if you guys haven't Yeah.
If you haven't seen the chart, it's actually like remarkable.[00:38:02] NLW: If you draw a little[00:38:03] swyx: arrow on it, it's like, no, we're an AI company now and forget the VR thing.[00:38:10] NLW: It's it, it is an interesting, no, it's, I, I think, Alessio, you called it sort of like Zuck's Disruptor Arc or whatever. He, he really does. He is in the midst of a, of a total, you know, I don't know if it's a redemption arc or it's just, it's something different where, you know, he, he's sort of the spoiler.[00:38:25] NLW: Like people loved him just freestyle talking about why he thought they had a better headset than Apple. But even if they didn't agree, they just loved it. He was going direct to camera and talking about it for, you know, five minutes or whatever. So that, that's a fascinating shift that I don't think anyone had on their bingo card, you know, whatever, two years ago.[00:38:41] NLW: Yeah. Yeah,[00:38:42] swyx: we still[00:38:43] Alessio: didn't see him fight Elon though, so[00:38:45] swyx: that's what I'm really looking forward to. I mean, hey, don't, don't, don't write it off, you know, maybe just these things take a while to happen. But we need to see him fight in the Coliseum. No, I think you know, in terms of like self management, life leadership, I think he has, there's a lot of lessons to learn from him.[00:38:59] swyx: You know he might, you know, you might kind of quibble with, like, the social impact of Facebook, but just himself as a in terms of personal growth and, and, you know, perseverance through like a lot of change and you know, everyone throwing stuff his way. I think there's a lot to say about like, to learn from, from Zuck, which is crazy 'cause he's my age.[00:39:18] swyx: Yeah. Right.[00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents[00:39:20] NLW: Awesome.
Well, so, so one of the big things that I think you guys have, you know, distinct and, and unique insight into being where you are and what you work on is, you know, what developers are getting really excited about right now. And by that, I mean, on the one hand, certainly, you know, like startups who are actually kind of formalized and formed to startups, but also, you know, just in terms of like what people are spending their nights and weekends on what they're, you know, coming to hackathons to do.[00:39:45] NLW: And, you know, I think it's a, it's a, it's, it's such a fascinating indicator for, for where things are headed. Like if you zoom back a year, right now was right when everyone was getting so, so excited about AI agent stuff, right? AutoGPT and BabyAGI. And these things were like, if you dropped anything on YouTube about those, like instantly tens of thousands of views.[00:40:07] NLW: I know because I had like a 50,000 view video, like the second day that I was doing the show on YouTube, you know, because I was talking about AutoGPT. And so anyways, you know, obviously that's sort of not totally come to fruition yet, but what are some of the trends in what you guys are seeing in terms of people's, people's interest and, and, and what people are building?[00:40:24] Alessio: I can start maybe with the agents part and then I know Shawn is doing a diffusion meetup tonight. There's a lot of, a lot of different things. The, the agent wave has been the most interesting kind of like dream to reality arc. So AutoGPT, I think they went from zero to like 125,000 GitHub stars in six weeks, and then one year later, they have 150,000 stars.[00:40:49] Alessio: So there's kind of been a big plateau. I mean, you might say there are just not that many people that can star it. You know, everybody already starred it. But the promise of, hey, I'll just give you a goal, and you do it. I think it's like, amazing to get people's imagination going.
You know, they're like, oh, wow, this is awesome.[00:41:08] Alessio: Everybody, everybody can try this to do anything. But then as technologists, you're like, well, that's, that's just like not possible, you know, we would have like solved everything. And I think it takes a little bit to go from the promise and the hope that people show you to then try it yourself and going back to say, okay, this is not really working for me.[00:41:28] Alessio: And David Luan from Adept, you know, they in our episode, he specifically said, we don't want to do a bottom up product. You know, we don't want something that everybody can just use and try because it's really hard to get it to be reliable. So we're seeing a lot of companies doing vertical agents that are narrow for a specific domain, and they're very good at something.[00:41:49] Alessio: Mike Conover, who was at Databricks before, is also a friend of Latent Space. He's doing this new company called Brightwave doing AI agents for financial research, and that's it, you know, and they're doing very well. There are other companies doing it in security, doing it in compliance, doing it in legal.[00:42:08] Alessio: All of these things that like, people, nobody just wakes up and say, Oh, I cannot wait to go on AutoGPT and ask it to do a compliance review of my thing. You know, just not what inspires people. So I think the gap on the developer side has been the more bottoms-up hacker mentality is trying to build this like very generic agents that can do a lot of open ended tasks.[00:42:30] Alessio: And then the more business side of things is like, Hey, if I want to raise my next round, I can not just like sit around and mess around with like super generic stuff. I need to find a use case that really works.
And I think that that has worked for, for a lot of folks. In parallel, you have a lot of companies doing evals.[00:42:47] Alessio: There are dozens of them that just want to help you measure how good your models are doing. Again, if you build evals, you need to also have a restrained surface area to actually figure out whether or not it's good, right? Because you cannot eval anything on everything under the sun. So that's another category where I've seen from the startup pitches that I've seen, there's a lot of interest in, in the enterprise.[00:43:11] Alessio: It's just like really fragmented because the production use cases are just coming like now, you know, there are not a lot of long established ones to, to test against. And so does it, that's kind of on the virtual agents and then the robotic side it's probably been the thing that surprised me the most at NVIDIA GTC, the amount of robots that were there that were just like robots everywhere.[00:43:33] Alessio: Like, both in the keynote and then on the show floor, you would have Boston Dynamics dogs running around. There was, like, this, like fox robot that had, like, a virtual face that, like, talked to you and, like, moved in real time. There were industrial robots. NVIDIA did a big push on their own Omniverse thing, which is, like, this digital twin of whatever environments you're in that you can use to train the robots agents.[00:43:57] Alessio: So that kind of takes people back to the reinforcement learning days, but yeah, agents, people want them, you know, people want them. I gave a talk about the, the rise of the full stack employees and kind of this future, the same way full stack engineers kind of work across the stack. In the future, every employee is going to interact with every part of the organization through agents and AI enabled tooling.[00:44:17] Alessio: This is happening.
It just needs to be a lot more narrow than maybe the first approach that we took, which is just put a string in AutoGPT and pray. But yeah, there's a lot of super interesting stuff going on.[00:44:27] swyx: Yeah. Well, Alessio covered a lot of stuff there. I'll separate the robotics piece because I feel like that's so different from the software world.[00:44:34] swyx: But yeah, we do talk to a lot of engineers and you know, that this is our sort of bread and butter. And I do agree that vertical agents have worked out a lot better than the horizontal ones. I think all, you know, the point I'll make here is just the reason AutoGPT and BabyAGI, you know, it's in the name, like they were promising AGI.[00:44:53] swyx: But I think people are discovering that you cannot engineer your way to AGI. It has to be done at the model level and all these engineering, prompt engineering hacks on top of it weren't really going to get us there in a meaningful way without much further, you know, improvements in the models. I would say, I'll go so far as to say, even Devin, which is, I would, I think the most advanced agent that we've ever seen, still requires a lot of engineering and still probably falls apart a lot in terms of, like, practical usage.[00:45:22] swyx: Or it's just way too slow and expensive for, you know, what it's, what it's promised compared to the video. So yeah, that's, that's what, that's what happened with agents from, from last year. But I, I do, I do see, like, vertical agents being very popular and, and sometimes you, like, I think the word agent might even be overused sometimes.[00:45:38] swyx: Like, people don't really care whether or not you call it an AI agent, right? Like, does it replace boring menial tasks that I do, that I might hire a human to do, or that the human who is hired to do it, like, actually doesn't really want to do.
And I think there's absolutely ways in sort of a vertical context that you can actually go after very routine tasks that can be scaled out to a lot of, you know, AI assistants.[00:46:01] swyx: So, so yeah, I mean, and I would, I would sort of basically plus one what Alessio just said there. I think it's, it's very, very promising and I think more people should work on it, not less. Like there's not enough people. Like, we, like, this should be the, the, the main thrust of the AI engineer is to look out, look for use cases and, and go to a production with them instead of just always working on some AGI promising thing that never arrives.[00:46:21] swyx: I,[00:46:22] NLW: I, I can only add that so I've been fiercely making tutorials behind the scenes around basically everything you can imagine with AI. We've probably done, we've done about 300 tutorials over the last couple of months. And the verticalized anything, right, like this is a solution for your particular job or role, even if it's way less interesting or kind of sexy, it's like so radically more useful to people in terms of intersecting with how, like those are the ways that people are actually[00:46:50] NLW: adopting AI in a lot of cases is just a, a, a thing that I do over and over again. By the way, I think that's the same way that even the generalized models are getting adopted. You know, it's like, I use Midjourney for lots of stuff, but the main thing I use it for is YouTube thumbnails every day. Like day in, day out, I will always do a YouTube thumbnail, you know, or two with, with Midjourney, right?[00:47:09] NLW: And it's like you can, you can start to extrapolate that across a lot of things and all of a sudden, you know, AI doesn't, it looks revolutionary because of a million small changes rather than one sort of big dramatic change.
And I think that the verticalization of agents is sort of a great example of how that's[00:47:26] swyx: going to play out too.[00:47:28] Adept episode - Screen Multimodality[00:47:28] swyx: So I'll have one caveat here, which is I think that because multi modal models are now commonplace, like Claude, Gemini, OpenAI, all very very easily multi modal, Apple's easily multi modal, all this stuff. There is a switch for agents for sort of general desktop browsing that I think people [inaudible][00:48:04] swyx: version of the agent where they're not specifically taking in text or anything. They're just watching your screen just like someone else would and piloting it by vision. And you know in the episode with David that we'll have dropped by the time that this airs, I think that is the promise of Adept and that is a promise of what a lot of these sort of desktop agents are, and that is the more general purpose system that could be as big as the browser, the operating system, like, people really want to build that foundational piece of software in AI.[00:48:38] swyx: And I would see, like, the potential there for desktop agents being that, that you can have sort of self driving computers. You know, don't write the horizontal piece off. I just think we took a while to get there.[00:48:48] NLW: What else are you guys seeing that's interesting to you? I'm looking at your notes and I see a ton of categories.[00:48:54] Top Model Research from January Recap[00:48:54] swyx: Yeah so I'll take the next two as like as one category, which is basically alternative architectures, right? The two main things that everyone following AI kind of knows now is, one, the diffusion architecture, and two, the let's just say the decoder-only transformer architecture that is popularized by GPT.[00:49:12] swyx: You can read, you can look on YouTube for thousands and thousands of tutorials on each of those things.
What we are talking about here is what's next, what people are researching, and what could be on the horizon that takes the place of those other two things. So first of all, we'll talk about transformer architectures and then diffusion.[00:49:25] swyx: So transformers the, the two leading candidates are effectively RWKV and the state space models the most recent one of which is Mamba, but there's others like the StripedHyena, and the S4, H3 stuff coming out of Hazy Research at Stanford. And all of those are non quadratic language models that promise to scale a lot better than the, the traditional transformer.[00:49:47] swyx: This might be too theoretical for most people right now, but it's, it's gonna be, it's gonna come out in weird ways, where, imagine if like, right now the talk of the town is that Claude and Gemini have a million tokens of context and like whoa, you can put in like, you know, two hours of video now, okay. But like what if you put what if we could like throw in, you know, two hundred thousand hours of video?[00:50:09] swyx: Like how does that change your usage of AI? What if you could throw in the entire genetic sequence of a human and like synthesize new drugs. Like, well, how does that change things? Like, we don't know because we haven't had access to this capability being so cheap before. And that's the ultimate promise of these two models.[00:50:28] swyx: They're not there yet but we're seeing very, very good progress. RWKV and Mamba are probably the, like, the two leading examples, both of which are open source that you can try them today and and have a lot of progress there. And the, the, the main thing I'll highlight for RWKV is that at, at the 7B level, they seem to have beat Llama 2 in all benchmarks that matter at the same size for the same amount of training as an open source model.[00:50:51] swyx: So that's exciting. You know, they're there, they're 7B now. They're not at 70B. We don't know if it'll.
We don't know if it'll. And then the other thing is diffusion. Diffusions and transformers are are kind of on the collision course. The original stable diffusion already used transformers in in parts of its architecture.[00:51:06] swyx: It seems that transformers are eating more and more of those layers particularly the sort of VAE layer. So that's, the Diffusion Transformer is what Sora is built on. The guy who wrote the Diffusion Transformer paper, Bill Pebbles, is, Bill Pebbles is the lead tech guy on Sora. So you'll just see a lot more Diffusion Transformer stuff going on.[00:51:25] swyx: But there's, there's more sort of experimentation with diffusion. I'm holding a meetup actually here in San Francisco that's gonna be like the state of diffusion, which I'm pretty excited about. Stability's doing a lot of good work. And if you look at the, the architecture of how they're creating Stable Diffusion 3, Hourglass Diffusion, and the inconsistency models, or SDXL Turbo.[00:51:45] swyx: All of these are, like, very, very interesting innovations on, like, the original idea of what Stable Diffusion was. So if you think that it is expensive to create or slow to create Stable Diffusion or an AI generated art, you are not up to date with the latest models. If you think it is hard to create text and images, you are not up to date with the latest models.[00:52:02] swyx: And people still are kind of far behind. The last piece of which is the wildcard I always kind of hold out, which is text diffusion. So Instead of using autogenerative or autoregressive transformers, can you use text to diffuse? So you can use diffusion models to diffuse and create entire chunks of text all at once instead of token by token.[00:52:22] swyx: And that is something that Midjourney confirmed today, because it was only rumored the past few months. But they confirmed today that they were looking into. 
So all those things are like very exciting new model architectures that are maybe something that we'll, you'll see in production two to three years from now.[00:52:37] swyx: So the couple of the trends[00:52:38] NLW: that I want to just get your takes on, because they're sort of something that, that seems like they're coming up are one sort of these, these wearable, you know, kind of passive AI experiences where they're absorbing a lot of what's going on around you and then, and then kind of bringing things back.[00:52:53] NLW: And then the, the other one that I, that I wanted to see if you guys had thoughts on were sort of this next generation of chip companies. Obviously there's a huge amount of emphasis on, on hardware and silicon and, and, and different ways of doing things, but, y

Lex Fridman Podcast of AI
AGI on the Horizon: Exploring the Potential of AutoGPT

Play Episode Listen Later Mar 28, 2024 11:26


Explore the unveiling of AutoGPT and the speculation about whether it heralds the onset of an Artificial General Intelligence (AGI) revolution. Join the conversation as we analyze the implications of this groundbreaking AI advancement. Get on the AI Box Waitlist: https://AIBox.ai/ AI Facebook Community: https://www.facebook.com/groups/739308654562189 Podcast Studio AZ: https://podcaststudio.com/mesa-studio/ Podcast Studio Network: https://PodcastStudio.com/

The Sam Altman Podcast
AutoGPT Unleashed: A Glimpse Into True AGI?

Play Episode Listen Later Feb 9, 2024 12:03


Exploring AutoGPT's breakthroughs and pondering if it signals the evolution towards Artificial General Intelligence (AGI), dissecting its features and the potential impact on the AI landscape. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: https://AIBox.ai/ AI Facebook Community

ChatGPT: OpenAI, Sam Altman, AI, Joe Rogan, Artificial Intelligence, Practical AI

Examine the debut of AutoGPT and its implications for the potential emergence of Artificial General Intelligence (AGI), sparking debates about the future of AI capabilities. Join us as we explore the significance of this groundbreaking technology. Get on the AI Box Waitlist: https://AIBox.ai/ Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai

UiPath Daily
AutoGPT Unleashed: Is AGI Here?

Play Episode Listen Later Jan 6, 2024 11:26


Investigate the release of AutoGPT and the speculations it raises about the arrival of Artificial General Intelligence (AGI), prompting discussions on the potential evolution of AI. Join us to analyze the potential impact of this advancement. Get on the AI Box Waitlist: https://AIBox.ai/ Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai

Midjourney
AGI Breakthrough: AutoGPT's Impact on the AI Landscape

Play Episode Listen Later Jan 3, 2024 12:03


In this episode, we explore the potential breakthrough in achieving Artificial General Intelligence (AGI) with the advent of AutoGPT. We discuss the features, capabilities, and the broader impact on the landscape of artificial intelligence. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: https://AIBox.ai/ AI Facebook Community Learn more about AI in Video Learn more about Open AI

The Nonlinear Library
AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis

Play Episode Listen Later Dec 18, 2023 9:38


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Shortest Path Between Scylla and Charybdis, published by Thane Ruthenis on December 18, 2023 on The AI Alignment Forum.
tl;dr: There's two diametrically opposed failure modes an alignment researcher can fall into: engaging in excessively concrete research whose findings won't timely generalize to AGI, and engaging in excessively abstract research whose findings won't timely connect to the practical reality. Different people's assessments of what research is too abstract/concrete differ significantly based on their personal AI-Risk models. One person's too-abstract can be another's too-concrete. The meta-level problem of alignment research is to pick a research direction that, on your subjective model of AI Risk, strikes a good balance between the two - and thereby arrives at the solution to alignment in as few steps as possible.
Introduction
Suppose that you're interested in solving AGI Alignment. There's a dizzying plethora of approaches to choose from: What behavioral properties do the current-best AIs exhibit? Can we already augment our research efforts with the AIs that exist today? How far can "straightforward" alignment techniques like RLHF get us? Can an AGI be born out of an AutoGPT-like setup? Would our ability to see its externalized monologue suffice for nullifying its dangers? Can we make AIs-aligning-AIs work? What are the mechanisms by which the current-best AIs function? How can we precisely intervene on their cognition in order to steer them? What are the remaining challenges of scalable interpretability, and how can they be defeated? What features do agenty systems convergently learn when subjected to selection pressures? Is there such a thing as "natural abstractions"? How do we learn them? What is the type signature of embedded agents and their values? What about the formal description of corrigibility? What is the "correct" decision theory that an AGI would follow? And what's up with anthropic reasoning? Et cetera, et cetera.
So... How the hell do you pick what to work on?
The starting point, of course, would be building up your own model of the problem. What's the nature of the threat? What's known about how ML models work? What's known about agents, and cognition? How does any of that relate to the threat? What are all extant approaches? What's each approach's theory-of-impact? What model of AI Risk does it assume? Does it agree with your model? Is it convincing? Is it tractable?
Once you've done that, you'll likely have eliminated a few approaches as obvious nonsense. But even afterwards, there might still be multiple avenues left that all seem convincing. How do you pick between those?
Personal fit might be one criterion. Choose the approach that best suits your skills and inclinations and opportunities. But that's risky: if you make a mistake, and end up working on something irrelevant just because it suits you better, you'll have multiplied your real-world impact by zero. Conversely, contributing to a tractable approach would be net-positive, even if you'd be working at a disadvantage. And who knows, maybe you'll find that re-specializing is surprisingly easy!
So what further objective criteria can you evaluate?
Regardless of one's model of AI Risk, there's two specific, diametrically opposed failure modes that any alignment researcher can fall into: being too concrete, and being too abstract. The approach to choose should be one that maximizes the distance from both failure modes.
The Scylla: Atheoretic Empiricism
One pitfall would be engaging in research that doesn't generalize to aligning AGI. An ad-absurdum example: You pick some specific LLM model, then start exhaustively investigating how it responds to different prompts, and what quirks it has. You're building giant look-up tables of "query, response", with no overarching structur...

Screaming in the Cloud
Taking a Hybrid AI Approach to Security at Snyk with Randall Degges

Nov 29, 2023 (35:57)


Randall Degges, Head of Developer Relations & Community at Snyk, joins Corey on Screaming in the Cloud to discuss Snyk's innovative AI strategy and why developers don't need to be afraid of security. Randall explains the difference between Large Language Models and Symbolic AI, and how combining those two approaches creates more accurate security tooling. Corey and Randall also discuss the FUD approach to selling security tools, and Randall expands on why Snyk doesn't take that approach. Randall also shares some background on how he went from being a happy Snyk user to a full-time Snyk employee.
About Randall
Randall runs Developer Relations & Community at Snyk, where he works on security research, development, and education. In his spare time, Randall writes articles and gives talks advocating for security best practices. Randall also builds and contributes to various open-source security tools. Randall's realms of expertise include Python, JavaScript, and Go development, web security, cryptography, and infrastructure security. Randall has been writing software for over 20 years and has built a number of popular API services and open-source tools.
Links Referenced:
Snyk: https://snyk.io/
Snyk blog: https://snyk.io/blog/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn, and this featured guest episode is brought to us by our friends at Snyk. Also brought to us by our friends at Snyk is one of our friends at Snyk, specifically Randall Degges, their Head of Developer Relations and Community. Randall, thank you for joining me.
Randall: Hey, what's up, Corey? 
Yeah, thanks for having me on the show, man. Looking forward to talking about some fun security stuff today.
Corey: It's been a while since I got to really talk about a security-centric thing on this show, at least in order of recordings. I don't know if the one right before this is a security thing; things happen on the back-end that I'm blissfully unaware of. But it seems the theme lately has been a lot around generative AI, so I'm going to start off by basically putting you in the hot seat. Because when you pull up a company's website these days, the odds are terrific that they're going to have completely repositioned absolutely everything that they do in the context of generative AI. It's like, “We're a generative AI company.” It's like, “That's great.” Historically, I have been a paying customer of Snyk so that it does security stuff, so if you're now a generative AI company, who do I use for the security platform thing that I was depending upon? You have not done that. First, good work. Secondly, why haven't you done that?
Randall: Great question. Also, you said a moment ago that LLMs are very interesting, or there's a lot of hype around it. Understatement of the last year, for sure [laugh].
Corey: Oh, my God, it has gotten brutal.
Randall: I don't know how many billions of dollars have been dumped into LLMs in the last 12 months, but I'm sure it's a very high number.
Corey: I have a sneaking suspicion that the largest models cost at least a billion each to train, just based upon—at least retail price—based upon the simple economics of how long it takes to do these things, how expensive that particular flavor of compute is. And the technology is magic. It is magic in a box and I see that, but finding ways that it applies in different ways is taking some time. But that's not stopping the hype beasts. 
A lot of the same terrible people who were relentlessly pushing crypto have now pivoted to relentlessly pushing generative AI, presumably because they're working through Nvidia's street team, or their referral program, or whatever it is. Doesn't matter what the rest of us do, as long as we're burning GPU cycles on it. And I want to distance myself from that exciting level of boosterism. But it's also magic.
Randall: Yeah [laugh]. Well, let's just talk about AI in security for a moment and answer your previous question. So, what's happening in the space, what's the deal, what is all the hype going to, and what is Snyk doing around there? So, quite frankly—and I'm sure a lot of people on your show say the same thing—but Snyk isn't new to, like, the AI space. It's been a fundamental part of our platform for many years now. So, for those of you listening who have no idea what the heck Snyk is, and you're like, “Why are we talking about this,” Snyk is essentially a developer security company, and the core of what we do is two things. The first thing is we help scan your code, your dependencies, your containers, all the different parts of your application, and detect vulnerabilities. That's the first part. The second thing we do is we help fix those vulnerabilities. So, detection and remediation. Those are the two components of any good security tool or security company. And in our particular case, we're very focused on developers because our whole product is really based on your application and your application security, not infrastructure and other things like this. So, with that being said, what are we doing at a high level with LLMs? Well, if you think about AI as, like, a broad spectrum, you have a lot of different technologies behind the scenes that people refer to as AI. You have lots of these large language models, which are generating text based on inputs. You also have symbolic AI, which has been around for a very long time and which is very domain specific. 
It's like creating specific rules and helping do pattern detection amongst things. And those two different types of applied AI, let's say—we have large language models and symbolic AI—are the two main things that have been happening in industry for the last, you know, tens of years, really, with LLMs being the new kid on the block. So, when we're talking about security, what's important to know about just those two underlying technologies? Well, the first thing is that large language models, as I'm sure everyone listening to this knows, are really good at predicting things based on a big training set of data. That's why companies like OpenAI and their ChatGPT tool have become so popular because they've gone out and crawled vast portions of the internet, downloaded tons of data, classified it, and then trained their models on top of this data so that they can help predict the things that people are putting into chat. And that's why they're so interesting, and powerful, and there's all these cool use cases popping up with them. However, the downside of LLMs is because they're just using a bunch of training data behind the scenes, there's a ton of room for things to be wrong. Training datasets aren't perfect, they're coming from a ton of places, and even if they were perfect, there's still the likelihood that things that are going to be generating output based on a statistical model aren't going to be accurate, which is the whole concept of hallucinations.
Corey: Right. I wound up remarking on the livestream for GitHub Universe a week or two ago that the S in AI stood for security. One of the problems I've seen with it is that it can generate a very plausible looking IAM policy if you ask it to, but it doesn't actually do what you think it would if you go ahead and actually use it. 
I think that it's still squarely in the realm of, it's great at creativity, it's great at surface level knowledge, but for anything important, you really want someone who knows what they're doing to take a look at it and say, “Slow your roll there, Hasty Pudding.”
Randall: A hundred percent. And when we're talking about LLMs, I mean, you're right. Security isn't really what they're designed to do, first of all [laugh]. Like, they're designed to predict things based on statistics, which is not a security concept. But secondly, another important thing to note is, when you're talking about using LLMs in general, there's so many tricks and techniques and things you can do to improve accuracy and improve things, like for example, having a ton of [contexts 00:06:35] or doing Few-Shot Learning Techniques where you prompt it and give it examples of questions and answers that you're looking for can give you a slight competitive edge there in terms of reducing hallucinations and false information. But fundamentally, LLMs will always have a problem with hallucinations and getting things wrong. So, that brings us to what we mentioned before: symbolic AI and what the differences are there. Well, symbolic AI is a completely different approach. You're not taking huge training sets and using machine learning to build statistical models. It's very different. You're creating rules, and you're parsing very specific domain information to generate things that are highly accurate, although those models will fail when applied to general-purpose things, unlike large language models. So, what does that mean? You have these two different types of AI that people are using. You have symbolic AI, which is very specific and requires a lot of expertise to create, then you have LLMs, which take a lot of experience to create as well, but are very broad and general purpose and have a capability to be wrong. 
Snyk's approach is, we take both of those concepts, and we use them together to get the best of both worlds. And we can talk a little bit about that, but I think fundamentally, one of the things that separates Snyk from a lot of other companies in the space is we're just trying to do whatever the best technical solution is to solve the problem, and I think we found that with our hybrid approach.
Corey: I think that there is a reasonable distrust of AI when it comes to security. I mean, I wound up recently using it to build what has been announced by the time this thing airs, which is my re:Invent photo scavenger hunt app. I know nothing about front-end, so that's okay, I've got a robot in my pocket. It's great at doing the development of the initial thing, and then you have issues, and you want to add functionality, and it feels like by the time I was done with my first draft, that ten different engineers had all collaborated on this thing without ever speaking to one another. There was no consistent idiomatic style, it used a variety, a hodgepodge of different lists and the rest, and it became a bit of a Frankenstein's monster. That can kind of work if we're talking about a web app that doesn't have any sensitive data in it, but holy crap, the idea of applying that to, “Yeah, that's how we built our bank's security policy,” is one of those, “Let me know who said that, so they can not have their job anymore,” territory when the CSO starts [hunting 00:08:55].
Randall: You're right. It's a very tenuous situation to be in from a security perspective. The way I like to think about it—because I've been a developer for a long time and a security professional—and I as much as anyone out there love to jump on the hype train for things and do whatever I can to be lazy and just get work done quicker. And so, I use ChatGPT, I use GitHub Copilot, I use all sorts of LLM-based tools to help me write software. 
And similarly to the problems when developers are not using LLMs to help them write code, security is always a concern. Like, it doesn't matter if you have a developer writing every line of code themselves or if they're getting help from Copilot or ChatGPT. Fundamentally, the problem with security and the reason why it's such an annoying part of the developer experience, in all honesty, is that security is really difficult. You can take someone who's an amazing engineer, who has 30 years of experience, like, you can take John Carmack, I'm sure, one of the most legendary developers to ever walk the Earth, you could sit over his shoulder and watch him write software, right, I can almost guarantee you that he's going to have some sort of security problem in his code, even with all the knowledge he has in his head. And part of the reason that's the case is because modern security is way complicated. Like if you're building a web app, you have front-end stuff you need to protect, you have back-end stuff you need to protect, there's databases and infrastructure and communication layers between the infrastructure and the services. It's just too complicated for one person to fully grasp. And so, what do you do? Well, you basically need some sort of assistance from automation. You have to have some sort of tooling that can take a look at your code that you're writing and say, “Hey Randall, on line 39, when you were writing this function that's taking user data and doing something with it, you forgot to sanitize the user data.” Now, that's a simple example, but let's talk about a more complex example. 
Maybe you're building some authentication software, and you're taking users' passwords, and you're hashing them using a common hashing algorithm. And maybe the tooling is able to detect we're using the bcrypt password hashing algorithm with a work factor of ten to create this password hash, but guess what, we're in 2023 and a work factor of ten is something that older commodity CPUs can now factor at a reasonable rate, and so you need to bump that up to 13 or 14. These are the types of things where you need help over time. It's not something that anyone can reasonably assume they can just deal with in their head. The way I like to think about it is, as a developer, regardless of how you're building code, you need some sort of security checks on there to just help you be productive, in all honesty. Like, if you're not doing that, you're just asking for problems.
Corey: Oh, yeah. On some level, even the idea of it's just going to be very computationally expensive to wind up figuring out what that password hash is, well great, but one of the things that we've been aware of for a while is that given the rise of botnets and compromised computers, the attackers have what amounts to infinite computing capacity, give or take. So, if they want in, on some level, badly enough, they're going to find a way to get in there. When you say that every developer is going to sit down and write insecure code, you're right. And a big part of that is because, as imagined today, security is an incredibly high friction process, and it's not helped, frankly, by tools that don't have nuance or understanding. If I want to do a crap ton of busy work that doesn't feel like it moves the needle forward at all, I'll go around to resolving the hundreds upon hundreds of Dependabot alerts I have for a lot of my internal services that write my weekly newsletter. 
Because some dependency three deep winds up having a failure mode when it gets untrusted input of the following type, it can cause resource exhaustion. It runs in a Lambda function, so I don't care about the resources, and two, I'm not here providing the stuff that I write, which is the input with an idea toward exploiting stuff. So, it's busy work, things I don't need to be aware of. But more to the point, stuff like that has the high propensity to mask things I actually do care about. Getting the signal from noise from your misconfigured, ill-conceived alerting system is just awful. Like, a bad thing is there are no security things for you to work on, but a worse one is, “Here are 70,000 security things for you to work on.” How do you triage? How do you think about it?
Randall: A hundred percent. I mean, that's actually the most difficult thing, I would say, that security teams have to deal with in the real world. It's not having a tool to help detect issues or trying to get people to fix them. The real issue is, there's always security problems, like you said, right? Like, if you take a look and just scan any codebase out there, any reasonably-sized codebase, you're going to find a ridiculous amount of issues. Some of those issues will be actual issues, like, you're not doing something in code hygiene that you need to do to protect stuff. A lot of those issues are meaningless things, like you said. You have a transitive dependency that some direct dependency is referring to, and maybe in some function call, there's an issue there, and it's alerting you on it even though you don't even use this function call. You're not even touching this class, or this method, or whatever it is. And it wastes a lot of time. And that's why the Holy Grail in the security industry in all honesty is prioritization and insights. At Snyk, we sort of pioneered this concept of ASPM, which stands for Application Security Posture Management. 
And fundamentally what that means is when you're a security team, and you're scanning code and finding all these issues, how do you prioritize them? Well, there's a couple of approaches. One approach is to use static analysis to try to figure out if these issues that are being detected are reachable, right? Like, can they be achieved in some way, but that's really hard to do statically and there's so many variables that go into it that no one really has foolproof solutions there. The second thing you can do is you can combine insights and heuristics from a lot of different places. So, you can take a look at static code analysis results, and you can combine them with agents running live that are observing your application, and then you can try to determine what stuff is actually reachable given this real world heuristic, and you know, real time information and mapping it up with static code analysis results. And that's really the holy grail of figuring things out. We have an ASPM product—or maybe it's a feature, an offering, if you will, but it's something that Snyk provides, which gives security admins a lot more insight into that type of operation at their business. But you're totally right, Corey, it's a really difficult problem to solve, and it burns a lot of goodwill in the security community and in the industry because people spend a lot of time getting false alerts, going through stuff, and just wasting millions of hours a year, I'm sure.
Corey: That's part of the challenge, too, is that it feels like there are two classes of problems in the world, at least when it comes to business. And I found this by being on the wrong side of it, on some level. Here on the wrong side, it's things like caring about cost optimization, it's caring about security, it's remembering to buy fire insurance for your building. 
You can wind up doing all of those things—and you should be doing them, but you can over-index on them to the point where you run out of money and your business dies. The proactive side of that fence is getting features to market sooner, increasing market share, growing revenue, et cetera, and that's the stuff that people are always going to prioritize over the back burner stuff. So, striking a balance between that is always going to be a bit of a challenge, and where people land on that is going to be tricky.
Randall: So, I think this is a really good bridge. You're totally right. It's expensive to waste people's time, basically, is what you're saying, right? You don't want to waste people's time, you want to give them actionable alerts that they can actually fix, or hopefully you fix it for them if you can, right? So, I'm going to lay something out, which is, in our opinion, is the Snyk way, if you will, that you should be approaching these developer security issues. So, let's take a look at two different approaches. The first approach is going to be using an LLM, like, let's say, just ChatGPT. We'll call them out because everyone knows ChatGPT. The first approach we're going to take is—
Corey: Although I do insist on pronouncing it Chat-Gippity. But please, continue.
Randall: [laugh]. Chat-Gippity. I love that. I haven't heard that before. Chat-Gippity. Sounds so much more fun, you know?
Corey: It sounds more personable. Yeah.
Randall: Yeah. So, you're talking to Chat-Gippity—thank you—and you paste in a file from your codebase, and you say, “Hey, Chat-Gippity. Here's a file from my codebase. Please help me identify security issues in here,” and you get back a long list of recommendations.
Corey: Well, it does more than that. 
Let me just interject there because one of the things it does that I think very few security engineers have mastered is it does it politely and constructively, as opposed to having an unstated tone of, “You dumbass,” which I beli—I've [unintelligible 00:17:24] with prompts on this. You can get it to have a condescending, passive-aggressive tone, but you have to go out of your way to do it, as opposed to it being the default. Please continue.
Randall: Great point. Also, Daniel from Unsupervised Learning, by the way, has a really good post where he shows you setting up Chat-Gippity to mimic Scarlett Johansson from the movie Her on your phone so you can talk to it. Absolutely beautiful. And you get these really fun, very nice responses back and forth around your code analysis. So, shout out there. But going back to the point. So, if you get these responses back from Chat-Gippity, and it's like, “Hey look, here's all the security issues,” a lot of those things will be false alerts, and there's been a lot of public security research done on these analysis tools just give you information. A lot of those things will be false alerts, some things will be things that maybe they're a real problem, but cannot be fixed due to transitive dependencies, or whatever the issues are, but there's a lot of things you need to do there. Now, let's take it up one notch, let's say instead of using Chat-Gippity directly, you're using GitHub Copilot. Now, this is a much better situation for working with code because now what Microsoft is doing is let's say you're running Copilot inside of VS Code. 
It's able to analyze all the files in your codebase, and it's able to use that additional context to help provide you with better information. So, you can talk to GitHub Copilot and say, “Hey, I'd really like to know what security issues are in this file,” and it's going to give you maybe a little bit better answers than ChatGPT directly because it has more context about the other parts of your codebase and can give you slightly better answers. However, because these things are LLMs, you're still going to run into issues with accuracy, and hallucinations, and all sorts of other problems. So, what is the better approach? And I think that's fundamentally what people want to know. Like, what is a good approach here? And on the scanning side, the right approach in my mind is using something very domain specific. Now, what we do at Snyk is we have a symbolic AI scanning engine. So, we take customers' code, and we take an entire codebase so you have access to all the files and dependencies and things like this, and you take a look at these things. And we have a security analyst team that analyzes real-world security issues and fixes that have been validated. So, we do this by pulling lots of open-source projects as well as other security information that we originally produced, and we define very specific rules so that we can take a look at software, and we can take a look at these codebases with a very high degree of certainty. And we can give you a very actionable list of security issues that you need to address, and not only that, we can show you how is going to be the best way to address them. So, with that being said, I think the second side to that is okay, if that's a better approach on the scanning side, maybe you shouldn't be using LLMs for finding issues; maybe you should be using them for fixing security issues, which makes a lot of sense. So, let's say you do it the Snyk way, and you use symbolic AI engines and you sort of find these issues. 
Maybe you can just take that information then, in combination with your codebase, and fire off a request to an LLM and say, “Hey Chat-Gippity, please take this codebase, and take this security information that we know is accurate, and fix this code for me.” So, now you're going one step further.
Corey: One challenge that I've seen, especially as I've been building weird software projects with the help of magic robots from the future, is that a lot of components, like in React for example, get broken out into their own file. And pasting a file in is all well and good, but very often, it needs insight into the rest of the codebase. At GitHub Universe, something that they announced was Copilot Enterprise, which trains Copilot on the intricacies of your internal structures around shared libraries, all of your code, et cetera. And in some of the companies I'm familiar with, I really believe that's giving a very expensive, smart robot a form of brain damage, but that's neither here nor there. But there's an idea of seeing the interplay between different components that individual analysis on a per-file basis will miss, feels to me like something that needs a more holistic view. Am I wrong on that? Am I oversimplifying?
Randall: You're right. There's two things we need to address. First of all, let's say you have the entire application context—so all the files, right—and then you ask an LLM to create a fix for you. This is something we do at Snyk. We actually use LLMs for this purpose. So, we take this information we ask the LLM, “Hey, please rewrite this section of code that we know has an issue given this security information to remove this problem.” The problem then becomes okay, well, how do you know this fix is accurate and is not going to break people's stuff? And that's where symbolic AI becomes useful again. Because again, what is the use case for symbolic AI? 
It's taking very specific domains of things that you've created very specific rule sets for and using them to validate things or to pass arbitrary checks and things like that. And it's a perfect use case for this. So, what we actually do with our auto-fix product, so if you're using VS Code and you have Copilot, right, and Copilot's spitting out software, as long as you have Snyk in the IDE, too, we're actually taking a look at those lines of code Copilot just inserted, and a lot of the time, we are helping you rewrite that code to be secured using our LLM stuff, but then as soon as we get that fix created, we actually run it through our symbolic engine, and if we're saying no, it's actually not fixed, then we go back to the LLM, we re-prompt it over and over again until we get a working solution. And that's essentially how we create a much more sophisticated iteration, if you will, of using AI to really help improve code quality. But all that being said, you still had a good point, which is maybe if you're using the context from the application, and people aren't doing things properly, how does that impact what LLMs are generating for you? And an interesting thing to note is that our security team internally here, just conducted a really interesting project, and I would be angry at myself if I didn't explain it because I think it's a very cool concept.
Corey: Oh, please, I'm a big fan of hearing what people get up to with these things in ways that is real-world stories, not trying to sell me anything, or also not dunking on, look what I saw on the top of Hacker News the other day, which is, “If all you're building is something that talks to Chat-Gippity's API, does some custom prompting, and returns a response, you shouldn't be building it.” I'm like, “Well, I built some things that do exactly that.” But I'm also not trying to raise $6 million in seed money to go and productize it. I'm just hoping someone does it better eventually, but I want to use it today. 
Please tell me a real world story about something that you've done.
Randall: Okay. So, here's what we did. We went out and we found a bunch of GitHub projects, and we tried to analyze them ourselves using a bunch of different tools, including human verification, and basically give it a grade and say, “Okay, this project here has really good security hygiene. Like, there's not a lot of issues in the code, things are written in a nice way, the style and formatting is consistent, the dependencies are up-to-date, et cetera.” Then we take a look at multiple GitHub repos that are the opposite of that, right? Like, maybe projects that hadn't been maintained in a long time, or were written in a completely different style where you have bad hygienic practices, maybe you have hard-coded secrets, maybe you have unsanitized input coming from a user or something, right, but you take all these things. So, we have these known examples of good and bad projects. So, what did we do? Well, we opened them up in VS Code, and we basically got GitHub Copilot and we said, “Okay, what we're going to do is use each of these codebases, and we're going to try to add features into the projects one at a time.” And what we did is we took a look at the suggested output that Copilot was giving us in each of these cases. And the interesting thing is that—and I think this is super important to understand about LLMs, right—but the interesting thing is, if we were adding features to a project that has good security hygiene, the types of code that we're able to get out of LLMs, like, GitHub Copilot was pretty good. There weren't a ton of issues with it. Like, the actual security hygiene was, like, fairly good. However, for projects where there were existing issues, it was the opposite. Like we'd get AI recommendations showing us how to write things insecurely, or potentially write things with hard-coded secrets in it. 
And this is something that's very reproducible today in, you know, what is it right now, middle of November 2023. Now, is it going to be this case a year from now? I don't necessarily know, but right now, this is still a massive problem, so that really reinforces the idea that, when you're talking about LLMs, not only is the training set used to build the model important, but the context in which you're using them is also incredibly important.

It's very easy to mislead LLMs. Another example of this, if you think about the security scanning concept we talked about earlier, imagine you're talking to Chat-Gippity, and you're pasting in a Python function, and the Python function is called, “Completely_safe_not_vulnerable_function.” That's the function name. And inside of that function, you're backdooring some software. Well, if you ask Chat-Gippity multiple times and say, “Hey, the temperature is set to 1.0. Is this code safe?”

Sometimes you'll get the answer yes because the context within the request that has that thing saying this is not a vulnerable function or whatever you want to call it, that can mislead the LLM output and result in problems, you know? It's just, like, classic prompt injection type issues. But there's a lot of these types of vulnerabilities still hidden in plain sight that impact all of us, and so it's so important to know that you can't just rely on one thing, you have to have multiple layers: something that helps you with things, but also something that is helping you fix things when needed.

Corey: I think that's the key that gets missed a lot is the idea of it's not just what's here, what have you put here that shouldn't be; what have you forgotten? There's a different side of it. It's easy to do a static analysis and say, “Oh, you're not sanitizing your input on this particular form.” Great. Okay, well, I say it's easy. 
I wish more people would do that, but then there's also a step beyond of, what is it that someone who has expertise who's been down this road before would take one look at your codebase and say, “Are you making this particular misconfiguration or common misstep?”

Randall: Yeah, it's incredibly important. You know, like I said, security is just one of those things where it's really broad. I've been working in security for a very long time and I make security mistakes all the time myself.

Corey: Yeah. Like, in your developer environment right now, you ran this against the production environment and didn't get permissions errors. That is suspicious. Tell me more about your authentication pattern.

Randall: Right. I mean, there's just a ton of issues that can cause problems. And it's… yeah, it is what it is, right? Like, software security is something difficult to achieve. If it wasn't difficult, everyone would be doing it. Now, if you want to talk about, like, vision for the future, actually, I think there's some really interesting things with the direction I see things going.

Like, a lot of people have been leaning into the whole AI autonomous agents thing over the last year. People started out by taking LLMs and saying, “Okay, I can get it to spit out code, I can get it to spit out this and that.” But then you go one step further and say, “All right, can I get it to write code for me and execute that code?” And OpenAI, to their credit, has done a really good job advancing some of the capabilities here, as well as a lot of open-source frameworks. You have Langchain, and Baby AGI, and AutoGPT, and all these different things that make this more feasible to give AI access to actually do real meaningful things.

And I can absolutely imagine a world in the future, maybe it's a couple of years from now, where you have developers writing software, and it could be a real developer, it could be an autonomous agent, whatever it is. 
And then you also have agents that are taking a look at your software and rewriting it to solve security issues. And I think when people talk about autonomous agents, a lot of the time they're purely focusing on LLMs. I think it's a big mistake. I think one of the most important things you can do is focus on the very niche symbolic AI engines that are going to be needed to guarantee accuracy with these things.

And that's why I think the Snyk approach is really cool, you know? We dedicated a huge amount of resources to security analysts building these very in-depth rule sets that are guaranteeing accuracy on results. And I think that's something that the industry is going to shift towards more in the future as LLMs become more popular, which is, “Hey, you have all these great tools, doing all sorts of cool stuff. Now, let's clean it up and make it accurate.” And I think that's where we're headed in the next couple of years.

Corey: I really hope you're right. I think it's exciting times, but I also am leery when companies go too far into boosterism where, “Robots are going to do all of these things for us.” Maybe, but even if you're right, you sound psychotic. And that's something that I think gets missed in an awful lot of the marketing that is so breathless with anticipation. I have to congratulate you folks on not getting that draped over your message, once again.

My other favorite part of your messaging when you pull up snyk.com, sorry, snyk.io. What is it these days? It's the dot io, isn't it?

Randall: Dot io. It's hot.

Corey: Dot io, yes.

Randall: Still hot, you know?

Corey: I feel like I'm turning into a boomer here where, “The internet is dot com.”

Randall: [laugh].

Corey: Doesn't necessarily work that way. But no, what I love is the part where you have this fear-based marketing of if you wind up not using our product, here are all the terrible things that will happen. And my favorite part about that marketing is it doesn't freaking exist. 
It is such a refreshing departure from so much of the security industry, where it does the fear, uncertainty, and doubt nonsense stuff that I love that you don't even hint in that direction. My actual favorite thing that is on your page, of course, is at the bottom. If you mouse over the dog in the logo at the bottom of the page, it does the quizzical tilting head thing, and I just think that is spectacular.

Randall: So, the Snyk mascot, his name is Pat. He's a Doberman and everyone loves him. But yeah, you're totally right. The FUD thing is a real issue in security. Fear, uncertainty, and doubt, it's the way security companies sell products to people. And I think it's a real shame, you know?

I give a lot of tech talks, at programming conferences in particular, around security and cryptography, and one of the things I always start out with when I'm giving a tech talk about any sort of security or cryptography topic is I say, “Okay, how many of you have landed in a Stack Overflow thread where you're talking about a security topic and someone replies and says, ‘oh, a professional should be doing this. You shouldn't be doing it yourself?’” That comes up all the time when you're looking at security topics on the internet. Then I ask people, “How many of you feel like security is this, sort of like, obscure, mystical arts that requires a lot of expertise in math knowledge, and all this stuff?” And a lot of people sort of have that impression.

The reality though is security, and to some extent, cryptography, it's just like any other part of computer science. It's something that you can learn. There's best practices. It's not rocket science, you know? Maybe it is if you're developing a brand-new hashing algorithm from scratch, yes, leave that to the professionals. But using these things is something everyone needs to understand well, and there's tons of material out there explaining how to do things right. 
And you don't need to be afraid of this stuff, right?

And so, I think, a big part of the Snyk message is, we just want to help developers just make their code better. And what is one way that you're going to do a better job at work, get more of your code through the PR review process? What is a way you're going to get more features out? A big part of that is just building things right from the start. And so, that's really our focus in our message is, “Hey developers, we want to be, like, a trusted partner to help you build things faster and better.” [laugh].

Corey: It's nice to see it, just because there's so much that just doesn't work out the way that we otherwise hope it would. And historically, there's been a tremendous problem of differentiation in the security space. I often remark that at RSA, there's about 12 companies exhibiting. Now sure, there are hundreds of booths, but it's basically the same 12 things. There's, you know, the entire row of firewalls where they use different logos and different marketing words on the slides, but they're all selling fundamentally the same thing. One of the things I've always appreciated about Snyk is it has never felt that way.

Randall: Well, thanks. Yeah, we appreciate that. I mean, our whole focus is just developer security. What can we do to help developers build things securely?

Corey: I mean, you are sponsoring this episode, let's be clear, but also, we are paying customers of you folks, and that is not... those things are not related in any way. What's the line that we like to use that we stole from the RedMonk folks? “You can buy our attention, but not our opinion.” And our opinion of what you folks are up to has been stratospherically high for a long time.

Randall: Well, I certainly appreciate that as a Snyk employee who is also a happy user of the service. 
The way I actually ended up working at Snyk was, I'd been using the product for my open-source projects for years, and I legitimately really liked it and I thought this was cool. And yeah, I eventually ended up working here because there was a position, and you know, a friend reached out to me and stuff. But I am a genuinely happy user and just like the goal and the mission. Like, we want to make developers' lives better, and so it's super important.

Corey: I really want to thank you for taking the time to speak with me about all this. If people want to learn more, where's the best place for them to go?

Randall: Yeah, thanks for having me. If you want to learn more about AI or just developer security in general, go to snyk.io. That's S-N-Y-K, in case it's not clear, dot io. In particular, I would actually go check out our Snyk Learn platform, which is linked to from our main site. We have tons of free security lessons on there, showing you all sorts of really cool things. If you check out our blog, my team and I in particular also do a ton of writing on there about a lot of these bleeding-edge topics, and so if you want to keep up with cool research in the security space like this, just check it out, give it a read. Subscribe to the RSS feed if you want to. It's fun.

Corey: And we will put links to that in the show notes. Thanks once again for your support, and of course, putting up with my slings and arrows.

Randall: And thanks for having me on, and thanks for using Snyk, too. We love you [laugh].

Corey: Randall Degges, Head of Developer Relations and Community at Snyk. This featured guest episode has been brought to us by our friends at Snyk, and I'm Corey Quinn. If you've enjoyed this episode, please leave a five-star review on your podcast platform of choice, whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice, along with an angry comment that I will get to reading immediately. 
You can get me to read it even faster if you make sure your username is set to ‘Dependabot.’

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
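
Editor's sketch: the fix-then-verify loop Randall describes in this transcript (an LLM proposes a fix, a rule-based engine checks it, and the LLM is re-prompted until the check passes) might look roughly like the following. Everything here is hypothetical and for illustration only: `fix_until_verified`, `find_issues`, `make_stub_llm`, and the single regex rule are stand-ins, not Snyk's engine or any real API.

```python
import re
from typing import Callable, List, Optional

# Hypothetical stand-in for a symbolic rule: flag a hard-coded API key.
SECRET_RULE = re.compile(r'API_KEY\s*=\s*["\'][^"\']+["\']')


def find_issues(code: str) -> List[str]:
    """Deterministic checker: returns the list of rule violations found."""
    return ["hard-coded secret"] if SECRET_RULE.search(code) else []


def fix_until_verified(
    code: str,
    propose_fix: Callable[[str, List[str]], str],
    check: Callable[[str], List[str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Ask the fixer again and again until the checker reports no issues.

    Returns the verified code, or None if no attempt passed the checker.
    """
    candidate = code
    issues = check(candidate)
    for _ in range(max_attempts):
        if not issues:
            return candidate
        candidate = propose_fix(candidate, issues)  # the "re-prompt" step
        issues = check(candidate)
    return candidate if not issues else None


def make_stub_llm():
    """Fake 'LLM' whose first suggestion fails, to exercise the retry path."""
    calls = {"n": 0}

    def propose_fix(code: str, issues: List[str]) -> str:
        calls["n"] += 1
        if calls["n"] == 1:
            return code  # an unhelpful first answer that changes nothing
        return SECRET_RULE.sub('API_KEY = os.environ["API_KEY"]', code)

    return propose_fix, calls


if __name__ == "__main__":
    vulnerable = 'API_KEY = "sk-123"\nprint(API_KEY)\n'
    llm, calls = make_stub_llm()
    print(fix_until_verified(vulnerable, llm, find_issues))
    print("attempts:", calls["n"])
```

The point of the design is that the generator is never trusted on its own word: only output that passes the deterministic check is returned, and the loop gives up with `None` rather than returning unverified code.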

The Nonlinear Library
LW - Palisade is hiring Research Engineers by Charlie Rogers-Smith

Nov 11, 2023 5:12


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Palisade is hiring Research Engineers, published by Charlie Rogers-Smith on November 11, 2023 on LessWrong.

Palisade is looking to hire Research Engineers. We are a small team consisting of Jeffrey Ladish (Executive Director), Charlie Rogers-Smith (Chief of Staff), and Kyle Scott (part-time Treasurer & Operations). In joining Palisade, you would be a founding member of the team, and would have substantial influence over our strategic direction. Applications are rolling, and you can fill out our short (~10-20 minutes) application form here.

Palisade's mission

We research dangerous AI capabilities to better understand misuse risks from current systems, and how advances in hacking, deception, and persuasion will affect the risk of catastrophic AI outcomes. We create concrete demonstrations of dangerous capabilities to advise policy makers and the public on AI risks. We are working closely with government agencies, policy think tanks, and media organizations to inform relevant decision makers. For example, our work demonstrating that it is possible to effectively undo Llama 2-Chat 70B's safety fine-tuning for less than $200 has been used to confront Mark Zuckerberg in the first of Chuck Schumer's Insight Forums, cited by Senator Hassan in a Senate hearing on threats to national security, and used to advise the UK AI Safety Institute.

We plan to study dangerous capabilities in both open source and API-gated models in the following areas:

Automated hacking. Current AI systems can already automate parts of the cyber kill chain. We've demonstrated that GPT-4 can leverage known vulnerabilities to achieve remote code execution on unpatched Windows 7 machines. 
We plan to explore how AI systems could conduct reconnaissance, compromise target systems, and use information from compromised systems to pivot laterally through corporate networks or carry out social engineering attacks.

Spear phishing and deception. Preliminary research suggests that LLMs can be effectively used to phish targets. We're currently exploring how well AI systems can scrape personal information and leverage it to craft scalable spear-phishing campaigns. We also plan to study how well conversational AI systems could build rapport with targets to convince them to reveal information or take actions contrary to their interests.

Scalable disinformation. Researchers have begun to explore how LLMs can be used to create targeted disinformation campaigns at scale. We've demonstrated to policymakers how a combination of text, voice, and image generation models can be used to create a fake reputation-smearing campaign against a target journalist. We plan to study the cost, scalability, and effectiveness of AI-disinformation systems.

We are looking for

People who excel at:

Working with language models. We're looking for somebody who is or could quickly become very skilled at working with frontier language models. This includes supervised fine-tuning, using reward models/functions (RLHF/RLAIF), building scaffolding (e.g. in the style of AutoGPT), and prompt engineering / jailbreaking.

Software engineering. Alongside working with LMs, much of the work you do will benefit from a strong foundation in software engineering - such as when designing APIs, working with training data, or doing front-end development. Moreover, strong SWE experience will help with getting up to speed with working with LMs, hacking, or new areas we want to pivot to.

Technical communication. By writing papers, blog posts, and internal documents; and by speaking with the team and external collaborators about your research. 
While it's advantageous to excel at all three of these skills, we will strongly consider people who are either great at working with language models or at software engineering, while being able to communicate their work well. Competenci...

Jeff's Asia Tech Class
Digital Strategy Lesson: An Introduction to Rate of Learning (184)

Nov 2, 2023 25:33 Transcription Available


This week's podcast is an introduction to Rate of Learning (and Adaptation). This is an increasingly important concept in digital strategy.

You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.

Here is the link to TechMoat Consulting.
Here is the link to the China Tech Tour.

Big tech events from this week:
Expanded export ban for GPUs to China (here)
Eureka AI training robots in advanced tasks (here)
iQIYI signs deal with Thailand's Tourism Authority (here)

Here is my standard framework for digital competition

Related articles:
AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)
The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)
Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)

From the Concept Library, concepts for this article are:
Rate of Learning
DOB2: Never Ending Improvements
SMILE: Rate of Learning and Adaptation
Learning Curve and Experience Effect

From the Company Library, companies for this article are:
n/a

I write, speak and consult about how to win (and not lose) in digital strategy and transformation.

I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.

My book series Moats and Marathons is a one-of-a-kind framework for building and measuring competitive advantages in digital businesses.

Note: This content (articles, podcasts, website info) is not investment advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. Investing is risky. Do your own research.

Support the show

Jeff's Asia Tech Class
3 Things I'm Watching for on Singles Day (183)

Oct 23, 2023 29:31 Transcription Available


This week's podcast is about Singles Day, which is coming up fast. Lots of stuff gets reported. But I am looking for 3 main things.

1: What is the fastest growing and most innovative frontier of ecommerce?
2: What are the big new bundles for consumers and the new products / services for merchants?
3: How did the company do in the stress test?

You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.

Here is the link to TechMoat Consulting.
Here is the link to the China Tech Tour.

Related articles:
AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)
The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)
Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)

From the Concept Library, concepts for this article are:
Singles Day

From the Company Library, companies for this article are:
Alibaba

I write, speak and consult about how to win (and not lose) in digital strategy and transformation.

I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.

My book series Moats and Marathons is a one-of-a-kind framework for building and measuring competitive advantages in digital businesses.

This content (articles, podcasts, website info) is not investment, legal or tax advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. This is not investment advice. Investing is risky. Do your own research.

Support the show

AI Applied: Covering AI News, Interviews and Tools - ChatGPT, Midjourney, Runway, Poe, Anthropic

Prepare for a compelling episode where we explore the transformative potential of "AutoGPT: A Game-Changer for Achieving AGI?" Delve into the latest developments in the quest for Artificial General Intelligence (AGI) and whether AutoGPT has brought us closer to this monumental milestone. Join us as we discuss the implications and the exciting prospects for the future of AI.

Get on the AI Box Waitlist: https://AIBox.ai/
Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/
Follow me on Twitter: https://twitter.com/jaeden_ai

Jeff's Asia Tech Class
5 Ways 5G and 5.5G Are Game Changers for Business (182)

Oct 15, 2023 39:12 Transcription Available


This week's podcast is about 5G and 5.5G (also called 5G-Advanced). This is a short list of scenarios where they are going to have a big impact.

You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.

Here is the link to TechMoat Consulting.
Here is the link to the China Tech Tour.

Here are the 5 scenarios:
Glasses Free 3D
Self-Guided Vehicles
Next Gen Manufacturing
Cellular IoT
Intelligent Computing Everywhere

Related articles:
AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)
The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)
Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)

From the Concept Library, concepts for this article are:
5G and 5.5G
IoT

From the Company Library, companies for this article are:
Huawei

I write, speak and consult about how to win (and not lose) in digital strategy and transformation.

I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.

My book series Moats and Marathons is a one-of-a-kind framework for building and measuring competitive advantages in digital businesses.

Note: This content (articles, podcasts, website info) is not investment advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. Investing is risky. Do your own research.

Support the show

The AI Breakdown: Daily Artificial Intelligence News and Discussions
The Latest AutoGPT and AI Agent Developments

Oct 10, 2023 19:53


AutoGPT held a hackathon last weekend and is in the midst of a global virtual hackathon. Also the latest from MultiOn and SuperAGI. Before that, on the Brief: Microsoft is preparing to announce an AI chip, and the geopolitics of AI heats up.

Links:
https://twitter.com/AlexReibman/status/1711277630035755076
https://twitter.com/Auto_GPT/status/1699522676770177520
https://twitter.com/DivGarg9/status/1710719246014333221
https://twitter.com/_superAGI/status/1710321434437058896

Today's Sponsor: Listen to the chart-topping podcast 'web3 with a16z crypto' wherever you get your podcasts or here: https://link.chtbl.com/xz5kFVEK?sid=AIBreakdown

TAKE OUR SURVEY ON EDUCATIONAL AND LEARNING RESOURCE CONTENT: https://bit.ly/aibreakdownsurvey

ABOUT THE AI BREAKDOWN
The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/

Jeff's Asia Tech Class
Is Life Just AI Agents Plus Gaming Engines? (181)

Oct 9, 2023 34:51 Transcription Available


This week's podcast is about digital and AI agents. This will lead to the emergence of non-human platforms and business models.

You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.

Here is the link to TechMoat Consulting.
Here is the link to the China Tech Tour.
Here is the podcast mentioned.

Related articles:
AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)
The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)
Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)

From the Concept Library, concepts for this article are:
GPT and Generative AI: AutoGPT
Digital Marathon: Zero human operations
Digital vs. Human Agents
Platform Business Models: Non-Human Platforms and Business Models

From the Company Library, companies for this article are:
OpenAI / ChatGPT
AgentGPT

I write, speak and consult about how to win (and not lose) in digital strategy and transformation.

I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.

My book series Moats and Marathons is a one-of-a-kind framework for building and measuring competitive advantages in digital businesses.

Note: This content (articles, podcasts, website info) is not investment advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. Investing is risky. Do your own research.

Support the show

Gadget Lab: Weekly Tech News
Alexa Gets an AI Makeover

Sep 21, 2023 33:27


Alexa was due for an upgrade, and now it has gotten one. This week, Amazon held its annual media event where it debuted a slate of new hardware, software, and services. The company reserved the spot at center stage for Alexa, the voice assistant powering all of Amazon's smart home ambitions. Researchers at the company have given Alexa a technological upgrade that enables it to be more competitive in the ChatGPT era. Alexa can now speak more naturally, hold a conversation without as many awkward interactions, and even make its responses sound more emotionally nuanced.

This week on Gadget Lab, WIRED senior writer Will Knight joins us to talk about how Alexa is becoming more agile as a conversationalist. Will spoke to Amazon executives about their machine intelligence work, their training models, and how the company is riding the wave of excitement around generative artificial intelligence.

Show Notes: Read Will's report on Alexa's latest upgrade. Read our roundup of everything Amazon announced at Wednesday's media event.

Recommendations: Will recommends Auto-GPT, a tool that turns ChatGPT into an autonomous agent that manages all the boring parts of your life. Mike recommends the book No Meat Required: The Cultural History and Culinary Future of Plant-Based Eating by Alicia Kennedy. Lauren recommends the episode of WIRED's Have a Nice Future Podcast where journalist Paul Tough talks about college in the US and the future of higher education.

Will Knight can be found on Twitter @willknight. Lauren Goode is @LaurenGoode. Michael Calore is @snackfight. Bling the main hotline at @GadgetLab. The show is produced by Boone Ashworth (@booneashworth). Our theme music is by Solar Keys.

Learn more about your ad choices. Visit podcastchoices.com/adchoices

How to Talk to AI
EP19: Turning ChatGPT into AutoGPT with Professor Synapse himself Joe Rosenbaum

Sep 15, 2023 29:23


Video version of this episode
Join the Synthminds Discord Here
Synthminds & Synaptic Labs Present their new AI for Educators course, in partnership with BSD Education
FOR 15% OFF PROMPT PERFECT Click here & use code 'httta' at the checkout!
Need a custom fine-tuned chatbot that can be spun up and deployed in minutes? Use code "Synthminds15" at GPT Trainer
Podcast Page
HTTTA Newsletter
Professor Synapse Prompt
Turn ChatGPT into AutoGPT

Key takeaways from the episode:

1. The unpredictability of content resonance: The episode reflects on how hard it is to predict what content will resonate with viewers. This highlights the importance of experimentation and openness to try new ideas to engage and connect with the audience.

2. The power of the prompt: Our guest is Joe Rosenbaum of Synaptic Labs, creator of the "Professor Synapse" prompt, which has been shared over 100,000 times. He emphasizes the usefulness and versatility of using prompts generated by GPT to quickly build course content and make the learning experience more interactive. They also mention the excitement of witnessing users personalize their prompt experiences and highlight the importance of putting the power and control in the hands of the users.

3. Empowering teachers with AI: The episode discusses the partnership between Synaptic Labs, Synthminds and BSD Education, which are working on a generative AI for teachers course. The goal is to educate and empower teachers to ethically and responsibly integrate AI technology into their classrooms. 
[00:02:33] Improved AI model with user-focused prompts.
[00:05:15] Course launch successful, positive feedback from audience.
[00:08:27] AI empowers transformation, saves time for love.
[00:11:10] Podcast, community, course content, interactive, unlimited, preferences, benefit.
[00:16:04] Summarize using 7 words: AI game of mad libs with Professor Synapse
[00:17:58] Role super important, research based, variables, clear goal, chain of reason, first step, question.
[00:22:29] Technology with limited memory creates frustrations in conversation.
[00:24:31] Synaptic Labs educates educators on AI.
[00:27:27] Exciting things coming: our Discord Synthminds Learning Labs.

Give it a listen on the HTTTA podcast to dive deeper into the exciting world of AI! Happy Prompting Everybody!

#Technology #EducationalTechnology #AIWritingTools #HTTTA #howtotalktoai #ai #artificialintelligence #promptengineering #prompt #research #nlp #openai #businessinsights #GPT4 #VR #AR

Licensing Codes: XE3PMH56JJN9LJ8BASLC-22DC2994-050B57C5D6

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

As alluded to on the pod, LangChain has just launched LangChain Hub: “the go-to place for developers to discover new use cases and polished prompts.” It's available to everyone with a LangSmith account, no invite code necessary. Check it out!

In 2023, LangChain has speedrun the race from 2:00 to 4:00 to 7:00 Silicon Valley Time. From the back to back $10m Benchmark seed and (rumored) $20-25m Sequoia Series A in April, to back to back critiques of “LangChain is Pointless” and “The Problem with LangChain” in July, to teaching with Andrew Ng and keynoting at basically every AI conference this fall (including ours), it has been an extreme rollercoaster for Harrison and his growing team creating one of the most popular (>60k stars at time of writing) building blocks for AI Engineers.

LangChain's Origins

The first commit to LangChain shows its humble origins as a light wrapper around Python's formatter.format for prompt templating. But as Harrison tells the story, even his first experience with text-davinci-002 in early 2022 was focused on chatting with data from their internal company Notion and Slack, what is now known as Retrieval Augmented Generation (RAG). As the Generative AI meetup scene came to life post Stable Diffusion, Harrison saw a need for common abstractions for what people were building with text LLMs at the time:

* LLM Math, aka Riley Goodside's “You Can't Do Math” REPL-in-the-loop (PR #8)
* Self-Ask With Search, Ofir Press' agent pattern (PR #9) (later ReAct, PR #24)
* NatBot, Nat Friedman's browser controlling agent (PR #18)
* Adapters for OpenAI, Cohere, and HuggingFaceHub

All this was built and launched in a few days from Oct 16-25, 2022. 
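
Editor's sketch: to make the "light wrapper around Python's formatter.format for prompt templating" concrete, a wrapper of that kind could look something like the code below. This is an illustration under stated assumptions, not LangChain's actual first commit or its current API; the class name `SimplePromptTemplate` and its attributes are hypothetical.

```python
from string import Formatter


class SimplePromptTemplate:
    """Hypothetical minimal prompt template built on string.Formatter."""

    def __init__(self, template: str):
        self.template = template
        # Formatter.parse yields (literal_text, field_name, format_spec,
        # conversion) tuples; field_name is None for trailing literal text.
        self.input_variables = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **kwargs: str) -> str:
        """Fill in the template, failing loudly if a variable is missing."""
        missing = [v for v in self.input_variables if v not in kwargs]
        if missing:
            raise KeyError(f"missing prompt variables: {missing}")
        return Formatter().format(self.template, **kwargs)


if __name__ == "__main__":
    qa = SimplePromptTemplate(
        "Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    print(qa.input_variables)
    print(qa.format(context="LangChain launched in Oct 2022.",
                    question="When did LangChain launch?"))
```

Even a wrapper this small gives a prompt a declared set of inputs, which is what lets templates be validated, saved, and combined with other steps.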
Turning research ideas/exciting usecases into software quickly and often has been in the LangChain DNA from Day 1 and likely a big driver of LangChain's success, to date amassing the largest community of AI Engineers and being the default launch framework for every big name from Nvidia to OpenAI.

Dancing with Giants

But AI Engineering is built atop constantly moving tectonic shifts:

* ChatGPT launched in November (“The Day the AGI Was Born”) and the API released in March. Before the ChatGPT API, OpenAI did not have a chat endpoint. In order to build a chatbot with history, you had to make sure to chain all messages and prompt for completion. LangChain made it easy to do that out of the box, which was a huge driver of usage.
* Today, OpenAI has gone all-in on the chat API and is deprecating the old completions models, essentially baking in the chat pattern as the default way most engineers should interact with LLMs… and reducing (but not eliminating) the value of ConversationChains.
* And there have been more updates since: Plugins released in API form as Functions in June (one of our top pods ever… reducing but not eliminating the value of OutputParsers) and Finetuning in August (arguably reducing some need for Retrieval and Prompt tooling).

With each update, OpenAI and other frontier model labs realign the roadmaps of this nascent industry, and Harrison credits the modular design of LangChain in staying relevant. LangChain has not been merely responsive either: LangChain added Agents in November, well before they became the hottest topic of the AI Summer, and now Agents feature as one of LangChain's top two usecases.

LangChain's problem for podcasters and newcomers alike is its sheer scope - it is the world's most complete AI framework, but it also has a sprawling surface area that is difficult to fully grasp or document in one sitting.
This means it's time for the trademark Latent Space move (ChatGPT, GPT4, Auto-GPT, and Code Interpreter Advanced Data Analysis GPT4.5): the executive summary!

What is LangChain?

As Harrison explains, LangChain is an open source framework for building context-aware reasoning applications, available in Python and JS/TS.

It launched in Oct 2022 with the central value proposition of “composability”, aka the idea that every AI engineer will want to switch LLMs, and combine LLMs with other things into “chains”, using a flexible interface that can be saved via a schema.

Today, LangChain's principal offerings can be grouped as:

* Components: isolated modules/abstractions
  * Model I/O
    * Models (for LLM/Chat/Embeddings, from OpenAI, Anthropic, Cohere, etc)
    * Prompts (Templates, ExampleSelectors, OutputParsers)
  * Retrieval (revised and reintroduced in March)
    * Document Loaders (eg from CSV, JSON, Markdown, PDF)
    * Text Splitters (15+ various strategies for chunking text to fit token limits)
    * Retrievers (generic interface for turning an unstructured query into a set of documents - for self-querying, contextual compression, ensembling)
    * Vector Stores (retrievers that search by similarity of embeddings)
    * Indexers (sync documents from any source into a vector store without duplication)
  * Memory (for long running chats, whether a simple Buffer, Knowledge Graph, Summary, or Vector Store)
* Use-Cases: compositions of Components
  * Chains: combining a PromptTemplate, LLM Model and optional OutputParser
    * with Router, Sequential, and Transform Chains for advanced usecases
    * savable, sharable schemas that can be loaded from LangChainHub
  * Agents: a chain that has access to a suite of tools, of nondeterministic length because the LLM is used as a reasoning engine to determine which actions to take and in which order. Notable 100LOC explainer here.
    * Tools (interfaces that an agent can use to interact with the world - preset list here. Includes things like ChatGPT plugins, Google Search, WolframAlpha. Groups of tools are bundled up as toolkits)
    * AgentExecutor (the agent runtime, basically the while loop, with support for controls, timeouts, memory sharing, etc)
* LangChain has also added a Callbacks system for instrumenting each stage of LLM, Chain, and Agent calls (which enables LangSmith, LangChain's first cloud product), and most recently an Expression Language, a declarative way to compose chains.

LangChain the company incorporated in January 2023, announced their seed round in April, and launched LangSmith in July. At time of writing, the company has 93k followers, their Discord has 31k members and their weekly webinars are attended by thousands of people live.

The full-featuredness of LangChain means it is often the first starting point for building any mainstream LLM use case, because they are most likely to have working guides for the new developer. Logan (our first guest!) from OpenAI has been a notable fan of both LangChain and LangSmith (they will be running the first LangChain + OpenAI workshop at AI Eng Summit). However, LangChain is not without its critics, with Aravind Srinivas, Jim Fan, Max Woolf, Mckay Wrigley and the general Reddit/HN community describing frustrations with the value of their abstractions, and many are attempting to write their own (the common experience of adding and then removing LangChain is something we covered in our Agents writeup). Harrison compares this with the timeless ORM debate on the value of abstractions.

LangSmith

Last month, Harrison launched LangSmith, their LLM observability tool and first cloud product. LangSmith makes it easy to monitor all the different primitives that LangChain offers (agents, chains, LLMs) as well as making it easy to share and evaluate them both through heuristics (i.e. manually written ones) and “LLM evaluating LLM” flows.
The top HN comment in the “LangChain is Pointless” thread observed that orchestration is the smallest part of the work, and the bulk of it is prompt tuning and data serialization. When asked this directly on our pod, Harrison agreed:

“I agree that those are big pain points that get exacerbated when you have these complex chains and agents where you can't really see what's going on inside of them. And I think that's partially why we built LangSmith…” (48min mark)

You can watch the full launch on the LangChain YouTube.

It's clear that the target audience for LangChain is expanding to folks who are building complex, production applications rather than focusing on the simpler “Q&A your docs” use cases that made it popular in the first place. As the AI Engineer space matures, there will be more and more tools graduating from supporting “hobby” projects to more enterprise-y use cases. In this episode we run through some of the history of LangChain, how it's growing from an open source project to one of the highest valued AI startups out there, and its future.
We hope you enjoy it!

Show Notes

* LangChain
* LangChain's Berkshire Hathaway Homepage
* Abstractions tweet
* LangSmith
* LangSmith Cookbooks repo
* LangChain Retrieval blog
* Evaluating CSV Question/Answering blog and YouTube
* MultiOn Partner blog
* Harvard Sports Analytics Collective
* Evaluating RAG Webinar
* awesome-langchain:
* LLM Math Chain
* Self-Ask
* LangChain Hub UI
* “LangChain is Pointless”
* Harrison's links
  * sports - estimating player compatibility in the NBA
  * early interest in prompt injections
  * GitHub
  * Twitter

Timestamps

* [00:00:00] Introduction
* [00:00:48] Harrison's background and how sports led him into ML
* [00:04:54] The inspiration for creating LangChain - abstracting common patterns seen in other GPT-3 projects
* [00:05:51] Overview of LangChain - a framework for building context-aware reasoning applications
* [00:10:09] Components of LangChain - modules, chains, agents, etc.
* [00:14:39] Underappreciated parts of LangChain - text splitters, retrieval algorithms like self-query
* [00:18:46] Hiring at LangChain
* [00:20:27] Designing the LangChain architecture - balancing flexibility and structure
* [00:24:09] The difference between chains and agents in LangChain
* [00:25:08] Prompt engineering and LangChain
* [00:26:16] Announcing LangSmith
* [00:30:50] Writing custom evaluators in LangSmith
* [00:33:19] Reducing hallucinations - fixing retrieval vs generation issues
* [00:38:17] The challenges of long context windows
* [00:40:01] LangChain's multi-programming language strategy
* [00:45:55] Most popular LangChain blog posts - deep dives into specific topics
* [00:50:25] Responding to LangChain criticisms
* [00:54:11] Harrison's advice to AI engineers
* [00:55:43] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19]Swyx: Welcome. Today we have Harrison Chase in the studio with us. Welcome Harrison. 
[00:00:23]Harrison: Thank you guys for having me. I'm excited to be here. [00:00:25]Swyx: It's been a long time coming. We've been asking you for a little bit and we're really glad that you got some time to join us in the studio. Yeah. [00:00:32]Harrison: I've been dodging you guys for a while. [00:00:34]Swyx: About seven months. You pulled me in here. [00:00:37]Alessio: About seven months. But it's all good. I totally understand. [00:00:38]Swyx: We like to introduce people through the official backgrounds and then ask you a little bit about your personal side. So you went to Harvard, class of 2017. You don't list what you did in Harvard. Was it CS? [00:00:48]Harrison: Stats and CS. [00:00:50]Swyx: That's awesome. I love me some good stats. [00:00:52]Harrison: I got into it through stats, through doing sports analytics. And then there was so much overlap between stats and CS that I found myself doing more and more of that. [00:00:59]Swyx: And it's interesting that a lot of the math that you learn in stats actually comes over into machine learning which you applied at Kensho as a machine learning engineer and Robust Intelligence, which seems to be the home of a lot of AI founders.Harrison: It does. Yeah. Swyx: And you started LangChain, I think around November 2022 and incorporated in January. Yeah. [00:01:19]Harrison: I was looking it up for the podcast and the first tweet was on, I think October 24th. So just before the end of November or end of October. [00:01:26]Swyx: Yeah. So that's your LinkedIn. What should people know about you on the personal side that's not obvious on LinkedIn? [00:01:33]Harrison: A lot of how I got into this is all through sports actually. Like I'm a big sports fan, played a lot of soccer growing up and then really big fan of the NBA and NFL. And so freshman year at college showed up and I knew I liked math. I knew I liked sports. One of the clubs that was there was the Sports Analytics Collective. 
And so I joined that freshman year, I was doing a lot of stuff in like Excel, just like basic stats, but then like wanted to do more advanced stuff. So learn to code, learn kind of like data science and machine learning through that way. Kind of like just kept on going down that path. I think sports is a great entryway to data science and machine learning. There's a lot of like numbers out there. People like really care. Like I remember, I think sophomore, junior year, I was in the Sports Collective and the main thing we had was a blog. And so we wrote a blog. It wasn't me. One of the other people in the club wrote a blog predicting the NFL season. I think they made some kind of like with stats and I think their stats showed that like the Dolphins would end up beating the Patriots and New England got like pissed about it, of course. So people like really care and they'll give you feedback about whether your model's doing well or poorly. And so you get that. And then you also get like instantaneous kind of like, well, not instantaneous, but really quick feedback. Like if you predict a game, the game happens that night. Like you don't have to wait a year to see what happens. So I think sports is a great kind of like entryway for kind of like data science. [00:02:43]Alessio: There was actually my first article on the Twilio blog with a Python script to like predict pricing of like Daily Fantasy players based on my past week performance. Yeah, I don't know. It's a good gateway drug. [00:02:56]Swyx: And on my end, the way I got into finance was through sports betting. So maybe we all have some ties in there. Was like Moneyball a big inspiration? The movie? [00:03:06]Harrison: Honestly, not really. I don't really like baseball. That's like the big thing. [00:03:10]Swyx: Let's call it a lot of stats. Cool. Well, we can dive right into LangChain, which is what everyone is excited about. But feel free to make all the sports analogies you want. 
That really drives home a lot of points. What was your GPT aha moment? When did you start working on GPT itself? Maybe not LangChain, just anything to do with the GPT API? [00:03:29]Harrison: I think it probably started around the time we had a company hackathon. I think that was before I launched LangChain. I'm trying to remember the exact sequence of events, but I do remember that at the hackathon I worked with Will, who's now actually at LangChain as well, and then two other members of Robust. And we made basically a bot where you could ask questions of Notion and Slack. And so I think, yeah, RAG, basically. And I think I wanted to try that out because I'd heard that it was getting good. I'm trying to remember if I did anything before that to realize that it was good. So then I would focus on that on the hackathon. I can't remember or not, but that was one of the first times that I built something [00:04:06]Swyx: with GPT-3. There wasn't that much opportunity before because the API access wasn't that widespread. You had to get into some kind of program to get that. [00:04:16]Harrison: DaVinci-002 was not terrible, but they did an upgrade to get it to there, and they didn't really publicize that as much. And so I think I remember playing around with it when the first DaVinci model came out. I was like, this is cool, but it's not amazing. You'd have to do a lot of work to get it to do something. But then I think that February or something, I think of 2022, they upgraded it and it was it got better, but I think they made less of an announcement around it. And so I just, yeah, it kind of slipped under the radar for me, at least. [00:04:45]Alessio: And what was the step into LangChain? So you did the hackathon, and then as you were building the kind of RAG product, you felt like the developer experience wasn't that great? Or what was the inspiration? [00:04:54]Harrison: No, honestly, so around that time, I knew I was going to leave my previous job. 
I was trying to figure out what I was going to do next. I went to a bunch of meetups and other events. This was like the September, August, September of that year. So after Stable Diffusion, but before ChatGPT. So there was interest in generative AI as a space, but not a lot of people hacking on language models yet. But there were definitely some. And so I would go to these meetups and just chat with people and basically saw some common abstractions in terms of what they were building, and then thought it would be a cool side project to factor out some of those common abstractions. And that became kind of like LangChain. I looked up again before this, because I remember I did a tweet thread on Twitter to announce LangChain. And we can talk about what LangChain is. It's a series of components. And then there's some end-to-end modules. And there were three end-to-end modules that were in the initial release. One was NatBot. So this was the web agent by Nat Friedman. Another was LLM Math Chain. So it would construct- [00:05:51]Swyx: GPT-3 cannot do math. [00:05:53]Harrison: Yeah, exactly. And then the third was Self-Ask. So some type of RAG search, similar to a ReAct-style agent. So those were some of the patterns in terms of what I was seeing. And those all came from open source or academic examples, because the people who were actually working on this were building startups. And they were doing things like question answering over your databases, question answering over SQL, things like that. But I couldn't use their code as kind of like inspiration to factor things out. [00:06:18]Swyx: I talked to you a little bit, actually, roundabout, right after you announced LangChain. I'm honored. I think I'm one of many. This is your first open source project. [00:06:26]Harrison: No, that's not actually true. I released, because I like sports stats. And so I remember I did release some really small, random Python package for scraping data from Basketball Reference or something. 
I'm pretty sure I released that. So first project to get a star on GitHub, let's say that. [00:06:45]Swyx: Did you reference anything? What were the inspirations, like other frameworks that you look to when open sourcing LangChain or announcing it or anything like that? [00:06:53]Harrison: I mean, the only main thing that I looked for... I remember reading a Hacker News post a little bit before about how a readme on the project goes a long way. [00:07:02]Swyx: READMEs help. [00:07:03]Harrison: Yeah. And so I looked at it and was like, put some status checks at the top and have the title and then one or two lines and then just right into installation. And so that's the main thing that I looked at in terms of how to structure it. Because yeah, I hadn't done open source before. I didn't really know how to communicate that aspect of the marketing or getting people to use it. I think I had some trouble finding it, but I finally found it and used that as a lot of the inspiration there. [00:07:25]Swyx: Yeah. It was one of the subjects of my write-up how it was surprising to me that significant open source experience actually didn't seem to matter in the new wave of AI tooling. Most, like Auto-GPT's Toran, that was his first open source project ever. And that became Auto-GPT. Yeah. I don't know. To me, it's just interesting how open source experience is kind of fungible or not necessary. Or you can kind of learn it on the job. [00:07:49]Alessio: Overvalued. [00:07:50]Swyx: Overvalued. Okay. You said it, not me. [00:07:53]Alessio: What's your description of LangChain today? I think when I built the LangChain Hub UI in January, there were a few things. And I think you were one of the first people to talk about agents that were already in there before it got hot now. And it's obviously evolved into a much bigger framework today. Run people through what LangChain is today, how they should think about it, and all of that. 
[00:08:14]Harrison: The way that we describe it or think about it internally is that LangChain is basically... I started off saying LangChain's a framework for building LLM applications, but that's really vague and not really specific. And I think part of the issue is LangChain does do a lot, so it's hard to be somewhat specific. But I think the way that we think about it internally, in terms of prioritization, what to focus on, is basically LangChain's a framework for building context-aware reasoning applications. And so that's a bit of a mouthful, but I think that speaks to a lot of the core parts of what's in LangChain. And so what concretely that means in LangChain, there's really two things. One is a set of components and modules. And these would be the prompt template abstraction, the LLM abstraction, chat model abstraction, vector store abstraction, text splitters, document loaders. And so these are combinations of things that we build and we implement, or we just have integrations with. So we don't have any language models ourselves. We don't have any vector stores ourselves, but we integrate with a lot of them. And then the text splitters, we have our own logic for that. The document loaders, we have our own logic for that. And so those are the individual modules. But then I think another big part of LangChain, and probably the part that got people using it the most, is the end-to-end chains or applications. So we have a lot of chains for getting started with question answering over your documents, chat question answering, question answering over SQL databases, agent stuff that you can plug in off the box. And that basically combines these components in a series of specific ways to do this. So if you think about a question answering app, you need a lot of different components kind of stacked. And there's a bunch of different ways to do question answering apps. 
So this is a bit of an overgeneralization, but basically, you know, you have some component that looks up an embedding from a vector store, and then you put that into the prompt template with the question and the context, and maybe you have the chat history as well. And then that generates an answer, and then maybe you parse that out, or you do something with the answer there. And so there's just this sequence of things that you basically stack in a particular way. And so we just provide a bunch of those assembled chains off the shelf to make it really easy to get started in a few lines of code. [00:10:09]Alessio: And just to give people context, when you first released LangChain, OpenAI did not have a chat API. It was a completion-only API. So you had to do all the human assistant, like prompting and whatnot. So you abstracted a lot of that away. I think the most interesting thing to me is you're kind of the Switzerland of this developer land. There's a bunch of vector databases that are killing each other out there to get people to embed data in them, and you're like, I love you all. You all are great. How do you think about being an opinionated framework versus leaving a lot of choice to the user? I mean, in terms of spending time into this integration, it's like you only have 10 people on the team. Obviously that takes time. Yeah. What's that process like for you all? [00:10:50]Harrison: I think right off the bat, having different options for language models. I mean, language models is the main one that right off the bat we knew we wanted to support a bunch of different options for. There's a lot to discuss there. People want optionality between different language models. They want to try it out. They want to maybe change to ones that are cheaper as new ones kind of emerge. They don't want to get stuck into one particular one if a better one comes out. There's some challenges there as well. Prompts don't really transfer. And so there's a lot of nuance there. 
But from the bat, having this optionality between the language model providers was a big important part because I think that was just something we felt really strongly about. We believe there's not just going to be one model that rules them all. There's going to be a bunch of different models that are good for a bunch of different use cases. I did not anticipate the number of vector stores that would emerge. I don't know how many we supported in the initial release. It probably wasn't as big of a focus as language models was. But I think it kind of quickly became so, especially when Postgres and Elastic and Redis started building their vector store implementations. We saw that some people might not want to use a dedicated vector store. Maybe they want to use traditional databases. I think to your point around what we're opinionated about, I think the thing that we believe most strongly is it's super early in the space and super fast moving. And so there's a lot of uncertainty about how things will shake out in terms of what role will vector databases play? How many will there be? And so I think a lot of it has always kind of been this optionality and ability to switch and not getting locked in. [00:12:19]Swyx: There's other pieces of LangChain which maybe don't get as much attention sometimes. And the way that you explained LangChain is somewhat different from the docs. I don't know how to square this. So for example, you have at the top level in your docs, you have, we mentioned ModelIO, we mentioned Retrieval, we mentioned Chains. Then you have a concept called Agents, which I don't know if exactly matches what other people call Agents. And we also talked about Memory. And then finally there's Callbacks. Are there any of the less understood concepts in LangChain that you want to give some air to? [00:12:53]Harrison: I mean, I think buried in ModelIO is some stuff around like few-shot example selectors that I think is really powerful. That's a workhorse. 
[00:13:01]Swyx: Yeah. I think that's where I start with LangChain. [00:13:04]Harrison: It's one of those things that you probably don't, if you're building an application, you probably don't start with it. You probably start with like a zero-shot prompt. But I think that's a really powerful one that's probably just talked about less because you don't need it right off the bat. And for those of you who don't know, that basically selects from a bunch of examples the ones that are maybe most relevant to the input at hand. So you can do some nice kind of like in-context learning there. I think that's, we've had that for a while. I don't think enough people use that, basically. Output parsers also used to be kind of important, but then function calling. There's this interesting thing where like the space is just like progressing so rapidly that a lot of things that were really important have kind of diminished a bit, to be honest. Output parsers definitely used to be an understated and underappreciated part. And I think if you're working with non-OpenAI models, they still are, but a lot of people are working with OpenAI models. But even within there, there's different things you can do with kind of like the function calling ability. Sometimes you want to have the option of having the text or the application you're building, it could return either. Sometimes you know that it wants to return in a structured format, and so you just want to take that structured format. Other times you're extracting things that are maybe a key in that structured format, and so you want to like pluck that key. And so there's just like some like annoying kind of like parsing of that to do. Agents, memory, and retrieval, we haven't talked at all. Retrieval, there's like five different subcomponents. You could also probably talk about all of those in depth. You've got the document loaders, the text splitters, the embedding models, the vector stores. 
Embedding models and vector stores, we don't really have, or sorry, we don't build, we integrate with those. Text splitters, I think we have like 15 or so. Like I think there's an under kind of like appreciated amount of those. [00:14:39]Swyx: And then... Well, it's actually, honestly, it's overwhelming. Nobody knows what to choose. [00:14:43]Harrison: Yeah, there is a lot. [00:14:44]Swyx: Yeah. Do you have personal favorites that you want to shout out? [00:14:47]Harrison: The one that we have in the docs as the default is like the recursive text splitter. We added a playground for text splitters the other week because, yeah, we heard a lot that like, you know, and like these affect things like the chunk overlap and the chunks, they affect things in really subtle ways. And so like I think we added a playground where people could just like choose different options. We have like, and a lot of the ideas are really similar. You split on different characters, depending on kind of like the type of text that you have. Markdown, you might want to split on differently than HTML. And so we added a playground where you can kind of like choose between those. I don't know if those are like underappreciated though, because I think a lot of people talk about text splitting as being a hard part, and it is a really important part of creating these retrieval applications. But I think we have a lot of really cool retrieval algorithms as well. So like self query is maybe one of my favorite things in LangChain, which is basically this idea of when you have a user question, the typical kind of like thing to do is you embed that question and then find the document that's most similar to that question. But oftentimes questions have things that just, you don't really want to look up semantically, they have some other meaning. So like in the example that I use, the example in the docs is like movies about aliens in the year 1980. 
1980, I guess there's some semantic meaning for that, but it's a very particular thing that you care about. And so what the self query retriever does is it splits out the metadata filter and most vector stores support like a metadata filter. So it splits out this metadata filter, and then it splits out the semantic bit. And that's actually like kind of tricky to do because there's a lot of different filters that you can have like greater than, less than, equal to, you can have and things if you have multiple filters. So we have like a pretty complicated like prompt that does all that. That might be one of my favorite things in LangChain, period. Like I think that's, yeah, I think that's really cool. [00:16:26]Alessio: How do you think about speed of development versus support of existing things? So we mentioned retrieval, like you got, or, you know, text splitting, you got like different options for all of them. As you get building LangChain, how do you decide which ones are not going to keep supporting, you know, which ones are going to leave behind? I think right now, as you said, the space moves so quickly that like you don't even know who's using what. What's that like for you? [00:16:50]Harrison: Yeah. I mean, we have, you know, we don't really have telemetry on what people are using in terms of what parts of LangChain, the telemetry we have is like, you know, anecdotal stuff when people ask or have issues with things. A lot of it also is like, I think we definitely prioritize kind of like keeping up with the stuff that comes out. I think we added function calling, like the day it came out or the day after it came out, we added chat model support, like the day after it came out or something like that. That's probably, I think I'm really proud of how the team has kind of like kept up with that because this space is like exhausting sometimes. And so that's probably, that's a big focus of ours. 
The support, I think we've like, to be honest, we've had to get kind of creative with how we do that. Cause we have like, I think, I don't know how many open issues we have, but we have like 3000, somewhere between 2000 and 3000, like open GitHub issues. We've experimented with a lot of startups that are doing kind of like question answering over your docs and stuff like that. And so we've got them on the website and in the discord and there's a really good one, dosu on the GitHub that's like answering issues and stuff like that. And that's actually something we want to start leaning into more heavily as a company as well as kind of like building out an AI dev rel because we're 10 people now, 10, 11 people now. And like two months ago we were like six or something like that. Right. So like, and to have like 2,500 open issues or something like that, and like 300 or 400 PRs as well. Cause like one of the amazing things is that like, and you kind of alluded to this earlier, everyone's building in the space. There's so many different like touch points. LangChain is lucky enough to kind of like be a lot of the glue that connects it. And so we get to work with a lot of awesome companies, but that's also a lot of like work to keep up with as well. And so I don't really have an amazing answer, but I think like the, I think prioritize kind of like new things that, that come out. And then we've gotten creative with some of kind of like the support functions and, and luckily there's, you know, there's a lot of awesome people working on all those support coding, question answering things that we've been able to work with. [00:18:46]Swyx: I think there is your daily rhythm, which I've seen you, you work like a, like a beast man, like mad impressive. And then there's sometimes where you step back and do a little bit of high level, like 50,000 foot stuff. So we mentioned, we mentioned retrieval. 
You did a refactor in March and there's, there's other abstractions that you've sort of changed your mind on. When do you do that? When do you do like the, the step back from the day to day and go, where are we going and change the direction of the ship? [00:19:11]Harrison: It's a good question so far. It's probably been, you know, we see three or four or five things pop up that are enough to make us think about it. And then kind of like when it reaches that level, you know, we don't have like a monthly meeting where we sit down and do like a monthly plan or something. [00:19:27]Swyx: Maybe we should. I've thought about this. Yeah. I'd love to host that meeting. [00:19:32]Harrison: It's really been a lot of, you know, one of the amazing things is we get to interact with so many different people. So it's been a lot of kind of like just pattern matching on what people are doing and trying to see those patterns before they punch us in the face or something like that. So for retrieval, it was the pattern of seeing like, Hey, yeah, like a lot of people are using vector sort of stuff. But there's also just like other methods and people are offering like hosted solutions and we want our abstractions to work with that as well. So we shouldn't bake in this paradigm of doing like semantic search too heavily, which sounds like basic now, but I think like, you know, to start a lot of it was people needed help doing these things. But then there was like managed things that did them, hybrid retrieval mechanisms, all of that. I think another example of this, I mean, Langsmith, which we can maybe talk about was like very kind of like, I think we worked on that for like three or four months before announcing it kind of like publicly, two months maybe before giving it to kind of like anyone in beta. But this was a lot of debugging these applications as a pain point. We hear that like just understanding what's going on is a pain point. 
[00:20:27]Alessio: I mean, you two did a webinar on this, which is called Agents vs. Chains. It was fun, baby. [00:20:32]Swyx: Thanks for having me on. [00:20:33]Harrison: No, thanks for coming. [00:20:34]Alessio: That was a good one. And on the website, you list like RAG, which is retrieval augmented generation, and agents as two of the main goals of LangChain. The difference I think at the Databricks keynote, you said chains are like predetermined steps and agents is models reasoning to figure out what steps to take and what actions to take. How should people think about when to use the two and how do you transition from one to the other with LangChain? Like is it a path that you support or like do people usually re-implement from an agent to a chain or vice versa? [00:21:05]Swyx: Yeah. [00:21:06]Harrison: You know, I know agent is probably an overloaded term at this point, and so there's probably a lot of different definitions out there. But yeah, as you said, kind of like the way that I think about an agent is basically like in a chain, you have a sequence of steps. You do this and then you do this and then you do this and then you do this. And with an agent, there's some aspect of it where the LLM is kind of like deciding what to do and what steps to do in what order. And you know, there's probably some like gray area in the middle, but you know, don't fight me on this. And so if we think about those, like the benefits of the chains are that they're like, you can say do this and you just have like a more rigid kind of like order and the way that things are done. They have more control and they don't go off the rails and basically everything that's bad about agents in terms of being uncontrollable and expensive, you can control more finely. The benefit of agents is that I think they handle like the long tail of things that can happen really well. And so for an example of this, let's maybe think about like interacting with a SQL database. 
So you can have like a SQL chain and you know, the first kind of like naive approach at a SQL chain would be like, okay, you have the user question. And then you like write the SQL query, you do some rag, you pull in the relevant tables and schemas, you write a SQL query, you execute that against the SQL database. And then you like return that as the answer, or you like summarize that with an LLM and return that to the answer. And that's basically the SQL chain that we have in LangChain. But there's a lot of things that can go wrong in that process. Starting from the beginning, you may like not want to even query the SQL database at all. Maybe they're saying like, hi, or something, or they're misusing the application. Then like what happens if you have some step, like a big part of the application that people with LangChain is like the context aware part. So there's generally some part of bringing in context to the language model. So if you bring in the wrong context to the language model, so it doesn't know which tables to query, what do you do then? If you write a SQL query, it's like syntactically wrong and it can't run. And then if it can run, like what if it returns an unexpected result or something? And so basically what we do with the SQL agent is we give it access to all these different tools. So it has another tool, it can run the SQL query as another, and then it can respond to the user. But then if it kind of like, it can decide which order to do these. And so it gives it flexibility to handle all these edge cases. And there's like, obviously downsides to that as well. And so there's probably like some safeguards you want to put in place around agents in terms of like not letting them run forever, having some observability in there. But I do think there's this benefit of, you know, like, again, to the other part of what LangChain is like the reasoning part, like each of those steps individually involves some aspect of reasoning, for sure. 
Like you need to reason about what the SQL query is, you need to reason about what to return. But there's then there's also reasoning about the order of operations. And so I think to me, the key is kind of like giving it an appropriate amount to reason about while still keeping it within checks. And so to the point, like, I would probably recommend that most people get started with chains and then when they get to the point where they're hitting these edge cases, then they think about, okay, I'm hitting a bunch of edge cases where the SQL query is just not returning like the relevant things. Maybe I should add in some step there and let it maybe make multiple queries or something like that. Basically, like start with chain, figure out when you're hitting these edge cases, add in the reasoning step to that to handle those edge cases appropriately. That would be kind of like my recommendation, right? [00:24:09]Swyx: If I were to rephrase it, in my words, an agent would be a reasoning node in a chain, right? Like you start with a chain, then you just add a reasoning node, now it's an agent. [00:24:17]Harrison: Yeah, the architecture for your application doesn't have to be just a chain or just an agent. It can be an agent that calls chains, it can be a chain that has an agent in different parts of them. And this is another part as well. Like the chains in LangChain are largely intended as kind of like a way to get started and take you some amount of the way. But for your specific use case, in order to kind of like eke out the most performance, you're probably going to want to do some customization at the very basic level, like probably around the prompt or something like that. And so one of the things that we've focused on recently is like making it easier to customize these bits of existing architectures. But you probably also want to customize your architectures as well. [00:24:52]Swyx: You mentioned a bit of prompt engineering for self-ask and then for this stuff. 
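The recommendation above (start with a chain, then add a bounded reasoning step where edge cases show up) can be sketched as follows. This is a minimal illustration under stated assumptions: `fake_llm_write_sql` is a hypothetical stand-in for a model call, hard-coded to fail once and then repair its query, and nothing here is LangChain's SQL chain or SQL agent. The `max_steps` cap reflects the safeguard of not letting the loop run forever.

```python
import sqlite3

def fake_llm_write_sql(question, error=None):
    """Hypothetical stand-in for an LLM call.

    The first attempt uses a wrong table name; when shown the database
    error, it returns a corrected query.
    """
    if error is None:
        return "SELECT COUNT(*) FROM user"   # wrong: the table is 'users'
    return "SELECT COUNT(*) FROM users"

def sql_chain(conn, question):
    """Fixed sequence of steps: write the query, run it, return the rows."""
    query = fake_llm_write_sql(question)
    return conn.execute(query).fetchall()    # raises if the SQL is invalid

def sql_with_retry(conn, question, max_steps=3):
    """Same steps, plus a bounded loop that feeds errors back to the model."""
    error = None
    for _ in range(max_steps):
        query = fake_llm_write_sql(question, error)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)                 # observation for the next attempt
    raise RuntimeError(f"gave up after {max_steps} attempts: {error}")

def make_db():
    """Build a small in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("ada",), ("bob",)])
    return conn
```

With this setup, `sql_chain(make_db(), "how many users?")` raises on the bad first query, while `sql_with_retry` recovers on the second attempt. A fuller agent would also choose among several tools, which this sketch leaves out.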
There's a bunch of, I just talked to a prompt engineering company today, PromptOps or LLMOps. Do you have any advice or thoughts on that field in general? Like are you going to compete with them? Do you have internal tooling that you've built? [00:25:08]Harrison: A lot of what we do is like where we see kind of like a lot of the pain points being like we can talk about LangSmith and that was a big motivation for that. And like, I don't know, would you categorize LangSmith as PromptOps? [00:25:18]Swyx: I don't know. It's whatever you want it to be. Do you want to call it? [00:25:22]Harrison: I don't know either. Like I think like there's... [00:25:24]Swyx: I think about it as like a prompt registry and you store them and you A-B test them and you do that. LangSmith, I feel like doesn't quite go there yet. Yeah. It's obviously the next step. [00:25:34]Harrison: Yeah, we'll probably go. And yeah, we'll do more of that because I think that's definitely part of the application of a chain or agent is you start with a default one, then you improve it over time. And like, I think a lot of the main new thing that we're dealing with here is like language models. And the main new way to control language models is prompts. And so like a lot of the chains and agents are powered by this combination of like prompt language model and then some output parser or something doing something with the output. And so like, yeah, we want to make that core thing as good as possible. And so we'll do stuff all around that for sure. [00:26:05]Swyx: Awesome. We might as well go into LangSmith because we're bringing it up so much. So you announced LangSmith I think last month. What are your visions for it? Is this the future of LangChain and the company? [00:26:16]Harrison: It's definitely part of the future. So LangSmith is basically a control center for kind of like your LLM application. 
So the main features that it kind of has is like debugging, logging, monitoring, and then like testing and evaluation. And so debugging, logging, monitoring, basically you set three environment variables and it kind of like logs all the runs that are happening in your LangChain chains or agents. And it logs kind of like the inputs and outputs at each step. And so the main use case we see for this is in debugging. And that's probably the main reason that we started down this path of building it is I think like as you have these more complex things, debugging what's actually going on becomes really painful whether you're using LangChain or not. And so like adding this type of observability and debuggability was really important. Yeah. There's a debugging aspect. You can see the inputs, outputs at each step. You can then quickly enter into like a playground experience where you can fiddle around with it. The first version didn't have that playground and then we'd see people copy, go to open AI playground, paste in there. Okay. Well, that's a little annoying. And then there's kind of like the monitoring, logging experience. And we recently added some analytics on like, you know, how many requests are you getting per hour, minute, day? What's the feedback like over time? And then there's like a testing debugging, sorry, testing and evaluation component as well where basically you can create datasets and then test and evaluate these datasets. And I think importantly, all these things are tied to each other and then also into LangChain, the framework. So what I mean by that is like we've tried to make it as easy as possible to go from logs to adding a data point to a dataset. And because we think a really powerful flow is you don't really get started with a dataset. You can accumulate a dataset over time. And so being able to find points that have gotten like a thumbs up or a thumbs down from a user can be really powerful in terms of creating a good dataset. 
And so that's maybe like a connection between the two. And then the connection in the other way is like all the runs that you have when you test or evaluate something, they're logged in the same way. So you can debug what exactly is going on and you don't just have like a final score. You have like this nice trace and thing where you can jump in. And then we also want to do more things to hook this into LangChain proper, the framework. So I think like some of like the managing the prompts will tie in here already. Like we talked about example selectors: using datasets as few-shot examples is a path that we support in a somewhat janky way right now, but we're going to like make better over time. And so there's this connection between everything. Yeah. [00:28:42]Alessio: And you mentioned the dataset in the announcement blog post, you touched on heuristic evaluation versus LLMs evaluating LLMs. I think there's a lot of talk and confusion about this online. How should people prioritize the two, especially when they might start with like not a good set of evals or like any data at all? [00:29:01]Harrison: I think it's really use case specific. And the distinction that I draw between heuristic and LLM: with LLMs, you're using an LLM to evaluate the output; with heuristics, you have some common heuristic that you can use. And so some of these can be like really simple. So we were doing some kind of like measuring of an extraction chain where we wanted it to output JSON. Okay. One evaluation can be, can you use json.loads to load it? And like, right. And that works perfectly. You don't need an LLM to do that. But then for like a lot of like the question answering, like, is this factually accurate? And you have some ground truth fact that you know it should be answering with. I think, you know, LLMs aren't perfect. And I think there's a lot of discussion around the pitfalls of using LLMs to evaluate themselves. 
And I'm not saying they're perfect by any means, but I do think they're, we've found them to be kind of like better than BLEU or any of those metrics. And the way that I also like to use those is also just like guide my eye about where to look. So like, you know, I might not trust the score of like 0.82, like exactly correct, but like I can look to see like which data points are like flagged as passing or failing. And sometimes the evaluator's messing up, but it's like good to like, you know, I don't have to look at like a hundred data points. I can focus on like 10 or something like that. [00:30:10]Alessio: And then can you create a heuristic once in LangSmith? Like what's like your connection to that? [00:30:16]Harrison: Yeah. So right now, all the evaluation, we actually do client side. And part of this is basically due to the fact that a lot of the evaluation is really application specific. So we thought about having evaluators, you could just click off and run in a server side or something like that. But we still think it's really early on in evaluation. We still think there's, it's just really application specific. So we prioritized instead, making it easy for people to write custom evaluators and then run them client side and then upload the results so that they can manually inspect them because I think manual inspection is still a pretty big part of evaluation for better or worse. [00:30:50]Swyx: We have this sort of components of observability. We have cost, latency, accuracy, and then planning. Is that listed in there? [00:30:57]Alessio: Well, planning more in the terms of like, if you're an agent, how to pick the right tool and whether or not you are picking the right tool. [00:31:02]Swyx: So when you talk to customers, how would you stack rank those needs? Are they cost sensitive? Are they latency sensitive? I imagine accuracy is pretty high up there. [00:31:13]Harrison: I think accuracy is definitely the top that we're seeing right now. 
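The JSON-parsing heuristic and the idea of using scores to guide manual inspection can be sketched as a small client-side evaluator. This is an illustration only: the function names and sample outputs are hypothetical, and it does not use or represent the LangSmith SDK.

```python
import json

def json_parses(output):
    """Heuristic check: does the model output load as JSON?"""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True

def evaluate(outputs):
    """Score a batch and return the failures worth inspecting by hand."""
    results = [{"output": o, "passed": json_parses(o)} for o in outputs]
    failures = [r for r in results if not r["passed"]]
    score = 1 - len(failures) / len(results) if results else 0.0
    return score, failures

# Hypothetical outputs from an extraction chain that should emit JSON.
sample_outputs = ['{"name": "Ada"}', "Sure! Here is the JSON:", "[1, 2]", '{"name": ']
score, failures = evaluate(sample_outputs)
```

The aggregate score matters less than the `failures` list, which narrows a large dataset down to the few data points that need a manual look.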
I think a lot of the applications, people are, especially the ones that we're working with, people are still struggling to get them to work at a level where they're reliable [00:31:24]Swyx: enough. [00:31:25]Harrison: So that's definitely the first. Then I think probably cost becomes the next one. I think a few places where we've started to see this be like one of the main things is the AI simulation that came out. [00:31:36]Swyx: Generative agents. Yeah, exactly. [00:31:38]Harrison: Which is really fun to run, but it costs a lot of money. And so one of our team members, Lance, did an awesome job hooking up like a local model to it. You know, it's not as perfect, but I think it helps with that. Another really big place for this, we believe, is in like extraction of structured data from unstructured data. And the reason that I think it's so important there is that usually you do extraction as some type of like pre-processing or indexing process over your documents. I mean, there's a bunch of different use cases, but one use case is for that. And generally that's over a lot of documents. And so that starts to rack up a bill kind of quickly. And I think extraction is also like a simpler task than like reasoning about which tools to call next in an agent. And so I think it's better suited for that. Yeah. [00:32:15]Swyx: On one of the heuristics I wanted to get your thoughts on, hallucination is one of the big problems there. Do you have any recommendations on how people should reduce hallucinations? [00:32:25]Harrison: To reduce hallucinations, we did a webinar on like evaluating RAG this past week. And I think there's this great project called RAGAS that evaluates four different things across two different spectrums. So the two different spectrums are like, is the retrieval part right? Or is the generation, or sorry, like, is it messing up in retrieval or is it messing up in generation? And so I think to fix hallucination, it probably depends on where it's messing up. 
If it's messing up in generation, then you're getting the right information, but it's still hallucinating. Or you're getting like partially right information and hallucinating some bits, a lot of that's prompt engineering. And so that's what we would recommend kind of like focusing on the prompt engineering part. And then if you're getting it wrong in the, if you're just not retrieving the right stuff, then there's a lot of different things that you can probably do, or you should look at on the retrieval bit. And honestly, that's where it starts to become a bit like application specific as well. Maybe there's some temporal stuff going on. Maybe you're not parsing things correctly. Yeah. [00:33:19]Swyx: Okay. [00:34:17]Harrison: Yeah, I mean, there's probably a larger discussion around that, but OpenAI definitely had a huge head start, right? And that's... Claude's not even publicly available yet, I don't think. [00:34:28]Swyx: The API? Yeah. Oh, well, you can just basically ask any of the business reps and they'll give it to you. [00:34:33]Harrison: You can. But it's still a different signup process. I think there's... I'm bullish that other ones will catch up especially like Anthropic and Google. The local ones are really interesting. I think we're seeing a big... [00:34:46]Swyx: Llama 2? Yeah, we're doing the fine-tuning hackathon tomorrow. Thanks for promoting that. [00:34:50]Harrison: No, thanks for it. I'm really excited about that stuff. 
I mean, that's something that like we've been, you know, because like, as I said, like the only thing we know is that the space is moving so fast and changing so rapidly. And like, local models are, have always been one of those things that people have been bullish on. And it seems like it's getting closer and closer to kind of like being viable. So I'm excited to see what we can do with some fine-tuning. [00:35:10]Swyx: Yeah. I have to confess, I did not know that you cared. It's not like a judgment on Langchain. I was just like, you know, you write an adapter for it and you're done, right? Like how much further does it go for Langchain? In terms of like, for you, it's one of the, you know, the model IO modules and that's it. But like, you seem very personally, very passionate about it, but I don't know what the Langchain specific angle for this is, for fine-tuning local models, basically. Like you're just passionate about local models and privacy and all that, right? And open source. [00:35:41]Harrison: Well, I think there's a few different things. Like one, like, you know, if we think about what it takes to build a really reliable, like context-aware reasoning application, there's probably a bunch of different nodes that are doing a bunch of different things. And I think it is like a really complex system. And so if you're relying on open AI for every part of that, like, I think that starts to get really expensive. Also like, probably just like not good to have that much reliability on any one thing. And so I do think that like, I'm hoping that for like, you know, specific parts at the end, you can like fine-tune a model and kind of have a more specific thing for a specific task. Also, to be clear, like, I think like, I also, at the same time, I think open AI is by far the easiest way to get started. And if I was building anything, I would absolutely start with open AI. So. [00:36:27]Swyx: It's something I think a lot of people are wrestling with. 
But like, as a person building apps, why take five vendors when I can take one vendor, right? Like, as long as I trust Azure, I'm just entrusting all my data to Azure and that's it. So I'm still trying to figure out the real case for local models in production. And I don't know, but fine-tuning, I think, is a good one. That's why I guess OpenAI worked on fine-tuning. [00:36:49]Harrison: I think there's also like, you know, like if there is, if there's just more options available, like prices are going to go down. So I'm happy about that. So like very selfishly, there's that aspect as well. [00:37:01]Alessio: And in the LangSmith announcement, I saw in the product screenshot, you have like chain, tool and LLM as like the three core atoms. Is that how people should think about observability in this space? Like first you go through the chain and then you start to dig down between like the model itself and like the tool it's using? [00:37:19]Harrison: We've added more. We've added like a retriever logging so that you can see like what query is going in and what are the documents you're getting out. Those are like the three that we started with. I definitely think probably the main ones, like basically the LLM. So the reason I think the debugging in LangSmith and debugging in general is so needed for these LLM apps is that if you're building, like, again, let's think about like what we want people to build in with LangChain. These like context aware reasoning applications. Context aware. There's a lot of stuff in the prompt. There's like the instructions. There's any previous messages. There's any input this time. There's any documents you retrieve. And so there's a lot of like data engineering that goes into like putting it into that prompt. This sounds silly, but just like making sure the data shows up in the right format is like really important. And then for the reasoning part of it, like that's obviously also all in the prompt. 
And so being able to like, and there's like, you know, the state of the world right now, like if you have the instructions at the beginning or at the end can actually make like a big difference in terms of whether it forgets it or not. And so being able to kind of like. [00:38:17]Swyx: Yeah. Any takes on that one, by the way, this is the U curve in context, right? Yeah. [00:38:21]Harrison: I think it's real. Basically I've found long context windows really good for when I want to extract like a single piece of information about something basically. But if I want to do reasoning over perhaps multiple pieces of information that are somewhere in like the retrieved documents, I found it not to be that great. [00:38:36]Swyx: Yeah. I have said that that piece of research is the best bull case for LangChain and all the vector companies, because it means you should do chains. It means you should do retrieval instead of long context, right? People are trying to extend long context to like 100K, 1 million tokens, 5 million tokens. It doesn't matter. You're going to forget. You can't trust it. [00:38:54]Harrison: I expect that it will probably get better over time as everything in this field. But I do also think there'll always be a need for kind of like vector stores and retrieval in some fashions. [00:39:03]Alessio: How should people get started with LangSmith Cookbooks? Wanna talk maybe a bit about that? [00:39:08]Swyx: Yeah. [00:39:08]Harrison: Again, like I think the main thing that even I find valuable about LangSmith is just like the debugging aspect of it. And so for that, it's very simple. You can kind of like turn on three environment variables and it just logs everything. And you don't look at it 95% of the time, but that 5% you do when something goes wrong, it's quite handy to have there. And so that's probably the easiest way to get started. And we're still in a closed beta, but we're letting people off the wait list every day. 
And if you really need access, just DM me and we're happy to give you access there. And then yeah, there's a lot that you can do with Langsmith that we've been talking about. And so Will on our team has been leading the charge on a really great like Langsmith Cookbooks repo that covers everything from collecting feedback, whether it's thumbs up, thumbs down, or like multi-scale or comments as well, to doing evaluation, doing testing. You can also use Langsmith without Langchain. And so we've got some notebooks on that in there. But we have Python and JavaScript SDKs that aren't dependent on Langchain in any way. [00:40:01]Swyx: And so you can use those. [00:40:01]Harrison: And then we'll also be publishing a notebook on how to do that just with the REST APIs themselves. So yeah, definitely check out that repo. That's a great resource that Will's put together. [00:40:10]Swyx: Yeah, awesome. So we'll zoom out a little bit from Langsmith and talk about Langchain, the company. You're also a first-time founder. Yes. And you've just hired your 10th employee, Julia, who I know from my data engineering days. You mentioned Will Nuno, I think, who maintains Langchain.js. I'm very interested in like your multi-language strategy, by the way. Ankush, your co-founder, Lance, who did AutoEval. What are you staffing up for? And maybe who are you hiring? [00:40:34]Harrison: Yeah, so 10 employees, 12 total. We've got three more joining over the next three weeks. We've got Julia, who's awesome leading a lot of the product, go-to-market, customer success stuff. And then we've got Bri, who's also awesome leading a lot of the marketing and ops aspects. And then other than that, all engineers. We've staffed up a lot on kind of like full stack infra DevOps, kind of like as we've started going into the hosted platform. So internally, we're split about 50-50 between the open source and then the platform stuff. 
And yeah, we're looking to hire particularly on kind of like the things, we're actually looking to hire across most fronts, to be honest. But in particular, we probably need one or two more people on like open source, both Python and JavaScript and happy to dive into the multi-language kind of like strategy there. But again, like strong focus there on engineering, actually, as opposed to maybe like, we're not a research lab, we're not a research shop. [00:41:48]Swyx: And then on the platform side, [00:41:49]Harrison: like we definitely need some more people on the infra and DevOps side. So I'm using this as an opportunity to tell people that we're hiring and that you should reach out if that sounds like you. [00:41:58]Swyx: Something like that, jobs, whatever. I don't actually know if we have an official job. [00:42:02]Harrison: RIP, what happened to your landing page? [00:42:04]Swyx: It used to be so based. The Berkshire Hathaway one? Yeah, so what was the story, the quick story behind that? Yeah, the quick story behind that is we needed a website [00:42:12]Harrison: and I'm terrible at design. [00:42:14]Swyx: And I knew that we couldn't do a good job. [00:42:15]Harrison: So if you can't do a good job, might as well do the worst job possible. Yeah, and like lean into it. And have some fun with it, yeah. [00:42:21]Swyx: Do you admire Warren Buffett? Yeah, I admire Warren Buffett and admire his website. And actually you can still find a link to it [00:42:26]Harrison: from our current website if you look hard enough. So there's a little Easter egg. Before we dive into more of the open source community things, [00:42:33]Alessio: let's dive into the language thing. How do you think about parity between the Python and JavaScript? Obviously, they're very different ecosystems. So when you're working on a LangChain, is it we need to have the same abstraction in both language or are you to the needs? 
The core stuff, we want to have the same abstractions [00:42:50]Harrison: because we basically want to be able to do serialize prompts, chains, agents, all the core stuff as tightly as possible and then use that between languages. Like even, yeah, like even right now when we log things to LangChain, we have a playground experience where you can run things that runs in JavaScript because it's kind of like in the browser. But a lot of what's logged is like Python. And so we need that core equivalence for a lot of the core things. Then there's like the incredibly long tail of like integrations, more researchy things. So we want to be able to do that. Python's probably ahead on a lot of like the integrations front. There's more researchy things that we're able to include quickly because a lot of people release some of their code in Python and stuff like that. And so we can use that. And there's just more of an ecosystem around the Python project. But the core stuff will have kind of like the same abstractions and be translatable. That didn't go exactly where I was thinking. So like the LangChain of Ruby, the LangChain of C-sharp, [00:43:44]Swyx: you know, there's demand for that. I mean, I think that's a big part of it. But you are giving up some real estate by not doing it. Yeah, it comes down to kind of like, you know, ROI and focus. And I think like we do think [00:43:58]Harrison: there's a strong JavaScript community and we wanted to lean into that. And I think a lot of the people that we brought on early, like Nuno and Jacob have a lot of experience building JavaScript tooling in that community. And so I think that's a big part of it. And then there's also like, you know, building JavaScript tooling in that community. Will we do another language? Never say never, but like... [00:44:21]Swyx: Python JS for now. Yeah. Awesome. [00:44:23]Alessio: You got 83 articles, which I think might be a record for such a young company. 
What are like the hottest hits, the most popular ones? [00:44:32]Harrison: I think the most popular ones are generally the ones where we do a deep dive on something. So we did something a few weeks ago around evaluating CSV q

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
RWKV: Reinventing RNNs for the Transformer Era — with Eugene Cheah of UIlicious

Play Episode Listen Later Aug 30, 2023 72:11


The AI Engineer Summit Expo has been announced, presented by AutoGPT (and future guest Toran Bruce-Richards!) Stay tuned for more updates on the Summit livestream and Latent Space University.This post was on HN for 10 hours.What comes after the Transformer? This is one of the Top 10 Open Challenges in LLM Research that has been the talk of the AI community this month. Jon Frankle (friend of the show!) has an ongoing bet with Sasha Rush on whether Attention is All You Need, and the most significant challenger to emerge this year has been RWKV - Receptance Weighted Key Value models, which revive the RNN for GPT-class LLMs, inspired by a 2021 paper on Attention Free Transformers from Apple (surprise!).What this means practically is that RWKV models tend to scale in all directions (both in training and inference) much better than Transformers-based open source models:While remaining competitive on standard reasoning benchmarks:swyx was recently in Singapore for meetings with AI government and industry folks, and grabbed 2 hours with RWKV committee member Eugene Cheah for a deep dive, the full recording of which is now up on Latent Space TV:Today we release both the 2hr video and an edited 1hr audio version, to cater to the different audiences and provide “ablation opportunities” on RWKV interest level.The Eleuther Mafia?The RWKV project is notable not merely because of the credible challenge to the Transformers dominance. It is also a distributed, international, mostly uncredentialed community reminiscent of early 2020s Eleuther AI:* Primarily Discord, pseudonymous, GPU-poor volunteer community somehow coordinating enough to train >10B, OPT/BLOOM-competitive models* Being driven by the needs of its community, it is extremely polyglot (e.g. 
English, Chinese, Japanese, Arabic) not because it needs to beat some benchmarks, but because its users want it to be for their own needs. * “Open Source” in both the good and the bad way - properly Apache 2.0 licensed (not “open but restricted”), yet trained on data taken from commercially compromised sources like the Pile (where Shawn Presser's Books3 dataset has been recently taken down) and Alpaca (taking from Steven Tey's ShareGPT which is technically against OpenAI TOS). The threadboi class has loved tracking the diffusion of Transformers paper authors out into the industry. But perhaps the underdog version of this is tracking the emerging Eleuther AI mafia. It will be fascinating to see how both Eleuther and Eleuther alums fare as they build out the future of both LLMs and open source AI. Audio Version Timestamps (assisted by smol-podcaster; different timestamps vs the 2hr YouTube): * [00:05:35] Eugene's path into AI at UIlicious * [00:07:33] Tokenizer penalty and data efficiency of Transformers * [00:08:02] Using Salesforce CodeGen * [00:10:17] The limitations of Transformers for handling large context sizes * [00:13:17] RWKV compute costs compared to Transformers * [00:16:06] How Eugene found RWKV early * [00:18:52] RWKV's focus on supporting many languages, not just English * [00:21:24] Using the RWKV model for fine-tuning for specific languages * [00:24:45] What is RWKV? * [00:33:46] Overview of the different RWKV models like World, Raven, Novel * [00:41:34] Background of Blink, the creator of RWKV * [00:49:55] The linear vs quadratic scaling of RWKV vs Transformers * [00:53:29] RWKV matching Transformer performance on reasoning tasks * [00:54:31] The community's lack of marketing for RWKV * [00:57:00] The English-language bias in AI models * [01:00:33] Plans to improve RWKV's memory and context handling * [01:03:10] Advice for AI engineers wanting to get more technical knowledge. Show Notes. Companies/Organizations: * RWKV - HF blog, paper, docs, GitHub, Huggingface * Raven 14B
(finetuned on Alpaca+ShareGPT+...) Demo * World 7B (supports 100+ world languages) Demo * How RWKV works in 100 LOC, RWKV overview * EleutherAI - Decentralized open source AI research group * Stability AI - Creators of Stable Diffusion * Conjecture - Spun off from EleutherAI. People: * Eugene Cheah - CTO of UIlicious, member of RWKV committee (GitHub, Twitter) * Blink/Bo Peng - Creator of RWKV architecture * Quentin Anthony - our Latent Space pod on Eleuther, coauthor on RWKV * Sharif Shameem - our Latent Space pod on being early to Stable Diffusion * Tri Dao - our Latent Space pod on FlashAttention making Attention subquadratic * Linus Lee - our Latent Space pod in NYC * Jonathan Frankle - our Latent Space pod about Transformers longevity * Chris Re - Genius at Stanford working on state-space models * Andrej Karpathy - Zero to Hero series * Justine Tunney ("Justine.lol") - mmap trick. Models/Papers: * Top 10 Open Challenges in LLM Research * Retentive Network: A Successor to Transformer for Large Language Models * GPT-NeoX - Open source replica of GPT-3 by EleutherAI * Salesforce CodeGen and CodeGen 2 * Attention Free Transformers paper * The Pile * RedPajama dataset * Monarch Mixer - Revisiting BERT, Without Attention or MLPs. Misc Notes: RWKV is not without known weaknesses - Transformers do well in reasoning because they are expressive in the forward pass, yet the RWKV docs already note that it is sensitive to prompt formatting and poor at lookback tasks. We also asked pointed questions about RWKV's challenges in the full podcast. Get full access to Latent Space at www.latent.space/subscribe
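The "linear vs quadratic scaling" point in the timestamps above can be made concrete with a rough back-of-the-envelope count. The sketch below is only an illustration under simplifying assumptions: it counts token-pair interactions for full self-attention and sequential state updates for a recurrent model, and it ignores constants, hidden size, and everything else that matters in practice. The function names are hypothetical and are not taken from RWKV or any library.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Token-pair interactions in full self-attention: every token attends to every token."""
    return n_tokens * n_tokens


def recurrent_step_count(n_tokens: int) -> int:
    """State updates in a recurrent model: one update per token."""
    return n_tokens


def scaling_table(lengths):
    """Return (length, quadratic count, linear count) for each context length."""
    return [(n, attention_pair_count(n), recurrent_step_count(n)) for n in lengths]


if __name__ == "__main__":
    # Doubling the context doubles the recurrent count but quadruples the attention count.
    for n, quadratic, linear in scaling_table([1_000, 2_000, 4_000]):
        print(f"{n:>6} tokens: attention pairs={quadratic:>12,} recurrent steps={linear:>6,}")
```

This is a counting exercise, not a benchmark of RWKV or of any Transformer implementation.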

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Reading excerpts from Zvi Mowshowitz's "On AutoGPT" from his blog/newsletter "Don't Worry About the Vase." https://thezvi.wordpress.com/2023/04/13/on-autogpt/ ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

In Search of Green Marbles
E94 - The AI Effect: CQ is the New IQ

In Search of Green Marbles

Play Episode Listen Later Aug 11, 2023 28:03


On this episode of In Search of Green Marbles, recorded on Wednesday, August 9th, Jordi Visser updates G3 on his latest experiments with AI and on his ongoing efforts to use AI to transform the way Weiss is run. Jordi proceeds to discuss how IQ will be transformed in the age of AI. Please check important disclosures at the end of the podcast and enjoy this wide-ranging discussion on how AI is changing our world in real time. Timestamps: How is Weiss encouraging employee experimentation with AI and what is a sandbox environment? [5:20] How would Jordi describe ChatGPT, AutoGPT and the Code Interpreter to a kid? [8:34] How has AI reshaped Jordi's assessment of prospective employees and why does he believe that ‘CQ is the new IQ'? [12:38] What role does curiosity play in achieving success with modern AI tools? [19:51] Resources: Ask Jordi Anything; Navigating the World of Coding with the Precision of Waze (LinkedIn Post); I was there when AI helped to create a vaccine; What is Auto-GPT and why does it matter?; Code Interpreter For Learning (video). Disclosures: This podcast and associated content (collectively, the “Post”) are provided to you by Weiss Multi-Strategy Advisers LLC (“Weiss”). The views expressed in the Post are for informational purposes only and are subject to change without notice. Information in this Post has been developed internally and is based on market conditions as of the date of the recording from sources believed to be reliable. Nothing in this Post should be construed as investment, legal, tax, or other advice and should not be viewed as a recommendation to purchase or sell any security or adopt any investment strategy. Past performance is no guarantee of future results. You should consult your own advisers regarding business, legal, tax, or other matters concerning investments. Any health-related information shared on the podcast is not intended as medical advice or for use in self-diagnosis or treatment.
Please consult a qualified healthcare professional before acting upon any health-related information on the podcast. Weiss has no control over information at any external site hyperlinked in this Post. Weiss makes no representation concerning and is not responsible for the quality, content, nature, or reliability of any hyperlinked site and has included hyperlinks only as a convenience. The inclusion of any external hyperlink does not imply any endorsement, investigation, verification, or ongoing monitoring by Weiss of any information in any hyperlinked site. In no event shall Weiss be responsible for your use of a hyperlinked site. This is not intended to be an offer or solicitation of any security. Please visit www.gweiss.com to review related disclosures and learn more about Weiss.

Expansive
Expansive Moment: The 5 Major Factors for Resilience

Expansive

Play Episode Listen Later Aug 3, 2023 13:01


We're back with another Moment. Producer Sean Loots has selected a moment from a previous episode where John Sanei and Erik Kruger discuss resilience. The aim with The Expansive Moment is to provide valuable insights in 15 minutes or less. Full episode: Auto GPT and 5 Major Factors for Resilience Connect with us on LINKEDIN, INSTAGRAM, TIKTOK and YOUTUBE.

Growth Everywhere Daily Business Lessons
ChatGPT is DONE. Time for AutoGPTs.

Growth Everywhere Daily Business Lessons

Play Episode Listen Later Aug 2, 2023 8:46


Eric Siu talks about the limitless possibilities of Auto-GPT for marketing and business. TIME-STAMPED SHOW NOTES: [00:00] - How Auto-GPT will change everything for marketing and business [00:40] - How you can use Auto-GPT to save you from subscription fees [02:05] - Running business development with AI to cut down expenses [03:15] - Use Auto-GPT to help you make practical decisions [05:40] - Launch events and build a billion-dollar network with Auto-GPT [07:00] - Is Auto-GPT Overhyped? What should I talk about next? Who should I interview? Please let me know on Twitter or in the comments below. Did you enjoy this episode? If so, please leave a short review. Subscribe to Leveling Up on iTunes or get the non-iTunes RSS feed. Connect with Eric Siu: Growth Everywhere, Single Grain, Leveling Up, Eric Siu on Twitter, Eric Siu on Instagram

This Week in Startups
Threads, ChatGPT usage drops, and AI demos with Sunny Madra | E1774

This Week in Startups

Play Episode Listen Later Jul 11, 2023 79:21


Fin can't burn its mouth on hot pizza. Or wave at someone who wasn't waving at them. Fin can resolve half of your customer support tickets instantly before they reach your team. Meet Fin. A breakthrough AI bot by Intercom – ready to join your support team today. Visit https://intercom.com/fin Eight Sleep. Good sleep is the ultimate game changer. Now you can add the Pod Pro Cover to any mattress! Go to eightsleep.com/twist to check out the Pod Pro Cover and get $150 off at checkout! Carta now lets you launch and administer SPVs for your syndicate. Share your knowledge, capital, and network to launch your syndicate SPVs through Carta. Get 10% off your first SPV with promo code TWIST at http://Carta.com * Today's show: Sunny Madra joins Jason to demo VenturusAI (11:01) and other tools, before discussing Sunny's new AutoGPT project (34:38). They wrap up talking about Meta's launch of Threads (49:08), Google's attempts at building a social network, and Inflection AI's new supercomputer (1:00:13). * Time stamps: (0:00) Sunny joins Jason (1:49) ChatGPT sees a decline in growth (10:22) Fin - Try Fin, Intercom's new AI customer support chatbot, at https://intercom.com/fin (11:01) Sunny demos VenturusAI (24:29) Eight Sleep - Go to https://eightsleep.com/twist to check out the Pod Cover and get $150 off at checkout! 
(26:01) Sunny demos Vercel (34:38) Sunny's new AutoGPT (41:58) Carta - Go to http://Carta.com and use code TWIST to get 10% off your first SPV (43:29) The decision to be open-sourced or closed (49:08) Meta's new platform Threads (1:00:13) Google's attempts at a social network (1:03:28) Inflection AI's supercomputer, roundtripping and training LLMs * Read LAUNCH Fund 4 Deal Memo: https://www.launch.co/four Apply for Funding: https://www.launch.co/apply Buy ANGEL: https://www.angelthebook.com Great recent interviews: Steve Huffman, Brian Chesky, Aaron Levie, Sophia Amoruso, Reid Hoffman, Frank Slootman, Billy McFarland, PrayingForExits, Jenny Lefcourt Check out Jason's suite of newsletters: https://substack.com/@calacanis * Follow Jason: Twitter: https://twitter.com/jason Instagram: https://www.instagram.com/jason LinkedIn: https://www.linkedin.com/in/jasoncalacanis * Follow TWiST: Substack: https://twistartups.substack.com Twitter: https://twitter.com/TWiStartups YouTube: https://www.youtube.com/thisweekin * Subscribe to the Founder University Podcast: https://www.founder.university/podcast

Zero Knowledge
Episode 283: BabyAGI, Agents and Cutting-edge AI with Yohei

Zero Knowledge

Play Episode Listen Later Jul 5, 2023 51:10


This week, host Anna Rose (https://twitter.com/annarrose) and co-host Kobi Gurkan (https://twitter.com/kobigurk) chat with Yohei Nakajima (https://twitter.com/yoheinakajima), General Partner at Untapped Capital (https://www.untapped.vc/) and creator of BabyAGI (https://babyagi.org/). They cover a wide variety of topics from the world of AGIs and agents to building no-code software in public. They kick off with a chat about how Yohei's interest in NFTs led him down the AI ‘rabbit hole' and how he started to build out experiments in public that have inspired a new group of AI tools and projects. They wrap up with a discussion about the possible impacts of some of this AI tech, how ZK may help mediate the challenges it introduces and more. Here are some additional links for this episode: ReAct: Synergizing Reasoning and Acting in Language Models by Yao and Cao (https://ai.googleblog.com/2022/11/react-synergizing-reasoning-and-acting.html) Episode 279: Intro to zkpod.ai with Anna and Kobi (https://zeroknowledge.fm/279-2/) Bonus: zkpod.ai & Attested Audio Experiment with Daniel Kang (https://zeroknowledge.fm/bonus-zkpod-ai-attested-audio-experiment-with-daniel-kang/) BabyAGI GitHub (https://github.com/yoheinakajima/babyagi) Auto-GPT (https://auto-gpt.ai/) PixelBeasts (https://www.pixelbeasts.co/about) Stable Diffusion (https://stability.ai/blog/stable-diffusion-public-release) DALL·E 2 (https://openai.com/dall-e-2) Midjourney (https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F) OpenAI (https://openai.com/) Playground AI (https://playgroundai.com/) LangChain (https://python.langchain.com/docs/get_started/introduction.html) LlamaIndex (https://llamaindex.ai) Dust (https://dust.tt/) Universal Paperclips: the game by Frank Lantz (https://www.decisionproblem.com/paperclips/index2.html) AI and the Paperclip Problem (https://cepr.org/voxeu/columns/ai-and-paperclip-problem) Check out the Modular Summit here: https://modularsummit.dev/ zkSummit 10
is happening in London on September 20, 2023! Apply to attend now -> zkSummit 10 Application Form (https://9lcje6jbgv1.typeform.com/zkSummit10) Anoma's (https://anoma.net/) first fractal instance, Namada (https://namada.net/), is launching soon! Namada is a proof-of-stake L1 for interchain asset-agnostic privacy. Namada natively interoperates with fast-finality chains via IBC and with Ethereum via a trustless two-way bridge. For privacy, Namada deploys an upgraded version of the multi-asset shielded pool (MASP) circuit that allows all assets (fungible and non-fungible) to share a common shielded set – this removes the size limits of the anonymity set and provides the best privacy guarantees possible for every user in the multichain. The MASP circuit's latest update enables shielded set rewards directly in the shielded set, a novel feature that funds privacy as a public good. Follow Namada on twitter @namada (https://twitter.com/namada) for more information and join the community on Discord discord.gg/namada (https://discord.com/invite/namada) If you like what we do: * Find all our links here! @ZeroKnowledge | Linktree (https://linktr.ee/zeroknowledge) * Subscribe to our podcast newsletter (https://zeroknowledge.substack.com) * Follow us on Twitter @zeroknowledgefm (https://twitter.com/zeroknowledgefm) * Join us on Telegram (https://zeroknowledge.fm/telegram) * Catch us on YouTube (https://zeroknowledge.fm/)

Marketing Against The Grain
AutoGPT 2.0: GPT Author and GPT Engineer (#134)

Marketing Against The Grain

Play Episode Listen Later Jun 29, 2023 13:01


How will this affect engineers, writers, and businesses? Kipp and Kieran dive into the new innovations in Auto-GPT and what it means for the future of media consumption. Learn about the power of an ultra-personalized media experience, what the new era of 1:1 experience means for your business, and why people with better ideas will win. Mentions: Matt Shumer tweet https://twitter.com/mattshumer_/status/1671231938219130894?s=46&t=qo0rMvYEbESguv0k2b7KGw Lior tweet https://twitter.com/AlphaSignalAI/status/1670488316532379648 We're on Social Media! Follow us for everyday marketing wisdom straight to your feed. YouTube: https://www.youtube.com/channel/UCGtXqPiNV8YC0GMUzY-EUFg Twitter: https://twitter.com/matgpod TikTok: https://www.tiktok.com/@matgpod Thank you for tuning into Marketing Against The Grain! Don't forget to hit subscribe and follow us on Apple Podcasts (so you never miss an episode)! https://podcasts.apple.com/us/podcast/marketing-against-the-grain/id1616700934 If you love this show, please leave us a 5-Star Review https://link.chtbl.com/h9_sjBKH and share your favorite episodes with friends. We really appreciate your support. Host Links: Kipp Bodnar, https://twitter.com/kippbodnar Kieran Flanagan, https://twitter.com/searchbrat ‘Marketing Against The Grain' is a HubSpot Original Podcast // Brought to you by The HubSpot Podcast Network // Produced by Darren Clarke.

The AI Breakdown: Daily Artificial Intelligence News and Discussions
Multi-On is What You Wanted AutoGPT to Be - Interview with Founder Div Garg

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Jun 23, 2023 24:15


Today NLW is joined by Div Garg, the founder of Multi-On, an AI personal agent that uses the browser to execute complex tasks. Learn more: https://multion.ai/

The AI Breakdown: Daily Artificial Intelligence News and Discussions
Can SuperAGI Be What People Wanted from AutoGPT?

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Jun 6, 2023 13:32


AutoGPT was all the AI hotness a few months ago, with its promise of autonomous AI agents. Now, a new tool called SuperAGI is catching developers' interest as an AI agent implementation tool that is more robust than what AutoGPT offered.

Underdog Empowerment
How Money Solves Most Problems & How To Get It with Josh Forti

Underdog Empowerment

Play Episode Listen Later Jun 5, 2023 66:55


Josh Forti, a guest on the podcast three to four years ago, has since embarked on a remarkable journey, establishing a thriving online business and achieving great success. With a seven-figure coaching business under his belt, Josh also dabbled in the world of cryptocurrencies, primarily as a speculator. However, his encounter with Chat GPT made him realize that the world was on the verge of rapid transformation. Amidst these transformative times, Josh faced a personal milestone as well: his wife's pregnancy with their first child. Contemplating his future, he had to decide between continuing on the coaching path or seizing the opportunity to embrace the next trend. Although he had made some gains in the crypto market, he made the bold decision to shut down his coaching business at the end of the previous year. Instead, he committed himself fully to comprehending the crypto space and understanding how technology would disrupt various sectors, including investing and personal wealth management. Josh firmly believed that artificial intelligence (AI) would revolutionize everything, whether people embraced it or not. He recognized the power of AI in conjunction with blockchain technology, which resolved the issue of digital ownership. Furthermore, he speculated that superintelligence would eventually emerge and foresaw the possibility of AI causing harm to individuals in some capacity. Josh also emphasized the utilization of AI to foster growth, scalability, and automation. He mentioned AutoGPT as a prime example. He questioned the type of brand he wanted to build and concluded that personal cash flow was the most critical aspect of his life at that moment. To achieve this goal, he pondered how AI and blockchain could aid him in his endeavors. With the impending arrival of his first child, a daughter due in November, Josh's motivation to secure personal cash flow intensified. 
He expressed his belief in the value of coaching and learning from individuals who possess superior knowledge. Rather than solely focusing on making money, he advocated for transforming that money into freedom. Bitcoin held a special place in his heart, as it allowed him unparalleled autonomy over his assets. In contrast, he highlighted the flaws in other systems, such as real estate ownership, where taxes could lead to confiscation. Josh encouraged understanding the rules and playing by them to succeed, ultimately creating an ecosystem where worries became obsolete. While acknowledging that complete freedom eluded everyone, he acknowledged that wealth provided greater opportunities. Reflecting on his own journey, Josh considered the advice he would give himself if he had to start anew with only the knowledge he had gained. He shared a personal tragedy—the death of his brother in a helicopter crash four years ago, which left him residing in his parents' basement without any money. In his quest to rebuild his life, he hired a coach for $60,000, despite only having $5,000 upfront and uncertainty about acquiring the remaining funds. This coach imparted a crucial lesson: self-love and self-awareness were essential prerequisites for success. Understanding one's values, beliefs, and writing them down formed the foundation of an individual's operating system. Additionally, Josh emphasized the importance of figuring out ways to generate income, asserting that the core of effective selling lies in making customers feel understood—a skill that also helps in understanding people better. Josh Forti's story is one of resilience, adaptability, and a strong belief in the power of technology. Through his experiences, he has demonstrated the potential of AI, blockchain, and personal growth to reshape lives and businesses in profound ways.   What You'll Learn: Ways to use AI to scale your business. Which two cryptocurrencies Josh believes in and why. 
What it takes to be truly financially free. And much more!   Favorite Quote: “Every industry that AI can disrupt, it will disrupt.” -Josh Forti   Connect with Josh: Josh Forti   How to Get Involved: Get podcasting help here. For more on how to grow your business, check out this episode.  If you enjoyed this episode, head over and visit us on Apple Podcasts - leave a review and let us know what you thought! Your feedback keeps us going. Thanks for helping us spread the word!

Talk Python To Me - Python conversations for passionate developers
#417: Test-Driven Prompt Engineering for LLMs with Promptimize

Talk Python To Me - Python conversations for passionate developers

Play Episode Listen Later May 30, 2023 73:41


Large language models and chat-based AIs are kind of mind blowing at the moment. Many of us are playing with them for working on code or just as a fun alternative to search. But others of us are building applications with AI at the core. And when doing that, the slightly unpredictable, probabilistic nature of LLMs makes writing and testing Python code very tricky. Enter promptimize from Maxime Beauchemin and Preset. It's a framework for non-deterministic testing of LLMs inside our applications. Let's dive inside the AIs with Max. Links from the show Max on Twitter: @mistercrunch Promptimize: github.com Introducing Promptimize ("the blog post"): preset.io Preset: preset.io Apache Superset: Modern Data Exploration Platform episode: talkpython.fm ChatGPT: chat.openai.com LeMUR: assemblyai.com Microsoft Security Copilot: blogs.microsoft.com AutoGPT: github.com Midjourney: midjourney.com Midjourney generated pytest tips thumbnail: talkpython.fm Midjourney generated radio astronomy thumbnail: talkpython.fm Prompt engineering: learnprompting.org Michael's ChatGPT result for scraping Talk Python episodes: github.com Apache Airflow: github.com Apache Superset: github.com Tay AI Goes Bad: theverge.com LangChain: github.com LangChain Cookbook: github.com Promptimize Python Examples: github.com TLDR AI: tldr.tech AI Tool List: futuretools.io Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy Sponsors PyCharm RedHat Talk Python Training
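To give a feel for what "non-deterministic testing" can mean, here is a minimal sketch, assuming the simplest possible approach: run the same prompt many times and assert on the pass rate instead of on a single output. It does not use promptimize and does not reflect its API. The stand-in model, the helper names `pass_rate` and `make_flaky_model`, and the 0.8 threshold are all hypothetical.

```python
import random
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    prompt: str,
    check: Callable[[str], bool],
    trials: int = 20,
) -> float:
    """Call the model `trials` times and return the fraction of outputs that pass `check`."""
    passed = sum(1 for _ in range(trials) if check(generate(prompt)))
    return passed / trials


def make_flaky_model(seed: int, error_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: answers "4", or "5" with probability `error_rate`."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        return "5" if rng.random() < error_rate else "4"

    return generate


if __name__ == "__main__":
    model = make_flaky_model(seed=0, error_rate=0.1)
    rate = pass_rate(model, "What is 2 + 2?", lambda answer: answer.strip() == "4", trials=50)
    # A suite would compare the rate to a chosen threshold rather than demand 100%.
    print(f"pass rate: {rate:.2f}", "OK" if rate >= 0.8 else "BELOW THRESHOLD")
```

In a real application `generate` would wrap an actual model call, and the threshold is a judgment call about how much flakiness is acceptable.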

The AI Breakdown: Daily Artificial Intelligence News and Discussions
How to Get AutoGPT on Your Phone (And What People Are Actually Finding it Useful For)

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later May 9, 2023 18:41


A check-in on AutoGPT as well as the headline news: updates from the US Copyright Office; Pitchbook VC enthusiasm for AI; Palantir stock price pops after AI announcement; Google I/O developer conference AI leaks; IBM relaunches AI division as WatsonX in partnership with HuggingFace; one researcher says 80% of jobs could be replaced by AI.

Bankless
AI and Web3 | Mohamed Fouda & Qiao Wang of Alliance

Bankless

Play Episode Listen Later May 3, 2023 70:59


AI is exploding into every facet of the internet. The convergence of Crypto and AI is inevitable, and we bring on Qiao Wang and Mohamed Fouda to discuss the opportunities and challenges this presents. They explore why Web3 provides an interesting platform for AI, such as payment and execution rails for AI agents, easy access to financial tools, and the ability to commission resources permissionlessly. One example of this is AutoGPT, which uses AI to generate code for on-chain smart contracts. How do we balance caution and optimism? How can we surf this tidal wave of innovation and new frontiers?

The AI Breakdown: Daily Artificial Intelligence News and Discussions
The Latest on AutoGPT and BabyAGI: Semi-Autonomous Specialized Agents (SASAs)

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 27, 2023 11:23


Meet semi-autonomous specialized agents (SASAs), more discrete, focused implementations of AutoGPT and BabyAGI. The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Breakdown wherever you listen: https://pod.link/1680633614

The AI Breakdown: Daily Artificial Intelligence News and Discussions
The Problems with AutoGPT and BabyAGI: How Useful Are They Really?

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 22, 2023 12:03


For the last 3 weeks, AutoGPT has massively captured the attention of the AI community. But how useful is it really? Some are starting to ask whether it really lives up to the hype.   Watch the original video: https://www.youtube.com/@TheAIBreakdown

Group Chat
Zuck's Buying Coffee | Group Chat News Ep. 763

Group Chat

Play Episode Listen Later Apr 20, 2023 71:55


Today, Drama and Anand discuss some of the biggest news in tech and finance. They start with the possibility of @coinbase moving out of the U.S., as reported by CoinDesk on Twitter. They also talk about Apple's recent launch of its savings account, offering an impressive 4.15% interest rate, as well as its expansion into India with the opening of its first retail store. They also touch on Facebook's settlement money for anyone who used the platform in the last 16 years. Additionally, the hosts explore the topic of Auto-GPT and whether it's time to freak out about AI. Lastly, they delve into the recent Southwest glitch that caused thousands of flights to be delayed and the settlement of the defamation lawsuit between Fox News and Dominion Voting Systems. Tune in for all this and more, and this week's Winners, Losers, and Content!  - written by ChatGPT Timeline of What Was Discussed: Group Chat Announcements. (0:56)  When you realize you're getting older. Drama's recap of Coachella. (4:43)  Will Coinbase leave the US? (20:39)  Apple is going to run the world. (27:35)  Your next coffee is on Zuck. (38:26)  The job of getting clicks. (40:17)  Is it time to be fearful of A.I.? (43:44)  Southwest's woes continue. (50:24)  The news is now entertainment. (53:48)  Winners, Losers, and Content. (1:00:24)  New Merch Alert! (1:10:56)  Related Links/Products Mentioned  Could @coinbase move out of the U.S.? - CoinDesk on Twitter  Coinbase CEO says it is preparing to go to court with the U.S. SEC  Apple launches its savings account with 4.15% interest rate  Apple Opens First Retail Store in India as It Looks to Country for Manufacturing  Elon Musk Claims Google Co-Founder Is Building a "Digital God"  Anyone who used Facebook in the last 16 years can now get settlement money. Here's how.  What is Auto-GPT And Is Now The Time To Freak Out About AI?
Thousands of flights delayed as Southwest glitch grounds planes  Fox News settles blockbuster defamation lawsuit with Dominion Voting Systems  Connect with Group Chat! Watch The Pod #1 Newsletter In The World For The Gram Tweet With Us Exclusive Facebook Content We're @groupchatpod on Snapchat

The WAN Show Podcast
I Give Up - WAN Show April 14, 2023

The WAN Show Podcast

Play Episode Listen Later Apr 17, 2023 235:42


Save money on your phone plan today at https://www.mintmobile.com/wanshow Try Notion AI for free at https://www.Notion.com/WAN Don't just browse the web – build it. Apply for free today using the link https://covalence.io/wan and take your first step toward a career in software development with Covalence. Timestamps - Timing may be off due to sponsor change: 0:00 Chapters 1:06 Intro 1:32 [Topic 1]: AI Agents 2:05 Auto GPT 4:34 Examples/Potential Dangers of this tech 8:46 AI Agent Gaming 18:24 AI Agent Game caveats 21:10 Discussion Question: What are the likely applications and limitations of this technology? 22:51 Building a better AAA game is impossible 26:45 LocalGPT 27:12 [Topic 2]:Elon Musk's AI Investments 29:24 Is Elon giving up on Twitter? 32:25 AI Startup Bubble 34:53 OpenAI not working on GPT 5 36:19 [Topic 3]:Linkedin Verified 41:08 Have we given up on privacy? 43:43 Roasting Luke's Linkedin Profile 44:15 Linus' Linkedin 46:38 Side topic: Mirrored Channels 49:04 Merch Messages 1. 49:11 If LMG didn't exist, where would you work? 1:01:07 Flipper Zero ethics 1:03:45 [Topic 4]:Mario Movie 1:05:00 Linus liked it! 1:07:50 How was the voice acting? 1:08:37 Mario Movie 2? 1:11:49 Nintendo Cinematic Universe 1:14:37 [Topic 5]:Potential Microsoft Steam Deck 1:16:37 Perils of Saves in games 1:20:13 ROG ALLY 1:24:52 Handhelds from other companies 1:28:09 Sponsors 1:30:42 Seasonic is Cool! 1:32:35 Merch Messages 2 1:32:37 AI Adult Content Ethics 1:36:31 Calibration Tech under right to repair? 1:39:07 Sticker Shock in Niche Markets 1:44:51 Who is your Professional Inspiration? 1:49:41 [Topic 6]:4070 1:56:25 Future of GPU Market 1:59:48 [Topic 7]:Universal Music Group vs AI Scraping 2:00:13 Does this matter/can they stop it? 2:04:24 [Topic 8]:Floatplane Exclusives coming to YouTube (Memberships)! 
2:14:57 Side topic: Merch Messages are getting long / new ideas 2:16:45 Linus's weirdest thing confiscated by TSA 2:21:30 [Topic 9]: Tesla recording Users 2:25:16 [Topic 10]: Intel Teams up with ARM 2:27:23 WAN SHOW: After Dark 2:27:53 Are young people going to be better at AI Tech? 2:29:57 Which one of Linus' cats is his favorite? 2:32:20 Why is Multi Monitor Management not better? 2:38:18 AI Antivirus 2:37:10 What will Linus' last video be (in 2074) 2:38:42 What antiquated tech will you keep? 2:40:10 WAN guests when? 2:41:14 LTT handwarmer 2:43:59 AI Crypto Trading/Betting 2:44:52 Have the goals of Floatplane changed? 2:47:22 New Desk pad lttstore dot com 2:51:09 Are there areas where we should regulate to preserve jobs? 2:58:18 Why hasn't AMD released any new GPUs since 7900 in December? 3:01:43 What surprised you at Micron? 3:07:07 What Tech courses should you take? 3:08:02 Any LTT garments should you not use fabric softener on? 3:10:10 What do you guys think about tech channels releasing time before NDA deadline? 3:15:48 Nebula's Lifetime Membership 3:19:47 Have you ever seen a surface election display monitor 3:20:32 LTT partnership with iFixit? 3:21:08 LTT relocation services? 3:21:24 Linus offering a screwdriver with all the bit sets 3:22:05 Favorite Small form-factor case? 3:22:58 Recreating Old LTT videos? 3:24:51 Split screen support on the iPad for Floatplane 3:25:05 Luxury Backpack update 3:30:05 Rapid Fire Questions 3:32:14 New AI safety Measures? 3:32:44 New Mainframe tech 3:33:07 LTT as consultants? 3:35:30 Home PC, Rack vs Tower 3:35:44 G-suit Issues 3:36:12 Creator Warehouse concerns 3:37:24 Linus's Parent Tips 3:37:33 Nostalgic Gaming Era 3:39:16 Jobs after AI 3:40:14 AMD Driver Updates 3:40:36 Linus too trusting 3:41:35 Labs testing screen protectors 3:42:28 What chargers do you travel with for Steamdeck? 3:43:30 Should companies block Chat GPT? 3:44:44 How would you sell Apple products?
3:46:42 QLED longevity 3:47:14 Janky tech solutions 3:48:40 Can you save hacked Drives? 3:50:55 New His and Hers Undergarments 3:52:40 Italy Blocked ChatGPT 3:53:50 Linus Mentoring Smaller Creators 3:55:20 Outro

Business Casual
Banks Thriving Despite Crisis, Magic's $6B NFL Deal, Meet the NYC Rat Czar

Apr 14, 2023, 26:07


Episode 39: Neal and Toby take a look at all of the bank earnings reports that came out on Friday morning, and it looks like the banks are doing just fine. Plus, Josh Harris and Magic Johnson put a team together to buy the NFL's Washington Commanders for a record $6 billion. And what is Auto-GPT? They also share their stock of the week and dog of the week. And rats beware: New York City introduces its newest government official, the Rat Czar. Learn more about our sponsor, TaxAct: https://www.taxact.com Learn more about our sponsor, Fidelity: https://fidelity.com/stocksbytheslice Listen Here: https://link.chtbl.com/MBD Watch Here: https://www.youtube.com/@MorningBrewDailyShow

This Week in Startups
The rise of AutoGPTs and AI anxieties with Sunny Madra and Vinny Lingham | E1720

Apr 13, 2023, 81:58


Vinny and Sunny join Jason to discuss AI's blistering pace and compare its development to other product launches (2:47). They also break down the anxiety artists and developers feel from AI automation (14:47), the rise of AutoGPTs, Twitter's reported LLM project, and more (37:01). (0:00) Jason kicks off the show (2:47) The blistering pace of AI (9:44) Developer builds Flappy Bird in 1 hour (11:24) Squarespace - Use offer code TWIST to save 10% off your first purchase of a website or domain at https://Squarespace.com/TWIST (12:53) Developers leveraging gains in AI (14:47) AI anxiety (23:55) Instacart's ChatGPT plugin (26:06) Preserving your advantage against AI (29:10) Vanta - Get $1000 off your SOC 2 at https://vanta.com/twist (30:14) AI artists (35:54) Crowdbotics - Get a free scoping session for your next big app idea at http://crowdbotics.com/twist (37:01) Automating with AutoGPT (47:09) LLMs vs. Knowledge retrieval systems (57:06) Is Google's Bard behind? (1:02:42) Twitter's alleged generative AI project (1:08:30) Ethereum and Bitcoin updates FOLLOW Sunny: https://twitter.com/sundeep FOLLOW Vinny: https://twitter.com/VinnyLingham FOLLOW Jason: https://linktr.ee/calacanis Subscribe to our YouTube to watch all full episodes: https://www.youtube.com/channel/UCkkhmBWfS7pILYIk0izkc3A?sub_confirmation=1 FOUNDERS! Subscribe to the Founder University podcast: https://podcasts.apple.com/au/podcast/founder-university/id1648407190