We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show! Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!

In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We've had Harrison from LangChain on twice (as a guest and as a co-host), and we've now finally come full circle as Stanislas from Dust joined us in the studio.

After stints at Oracle and Stripe, Stan joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down."

The History of Dust

Dust's journey can be broken down into three phases:

* Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn't just `print` statements.

* Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4.

* Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments.

The Case for Being Horizontal

The big discussion for early stage companies today is whether to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and become more of “Services as Software”. Dust, on the other hand, is a platform for users to build their own experiences, which has had a few advantages:

* Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team.

* Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions.

* Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures.

The Vertical Challenge

However, this approach comes with trade-offs:

* Harder Go-to-Market: As Stan put it: "We spike at penetration... but it makes our go-to-market much harder. Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'"
* Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales.

* Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions."

The Future of AI Platforms

Stan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance.

This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations.

Full YouTube Episode

Chapters

* 00:00:00 Introductions
* 00:04:33 Joining OpenAI from Paris
* 00:09:54 Research evolution and compute allocation at OpenAI
* 00:13:12 Working with Ilya Sutskever and OpenAI's vision
* 00:15:51 Leaving OpenAI to start Dust
* 00:18:15 Early focus on browser extension and WebGPT-like functionality
* 00:20:20 Dust as the infrastructure for agents
* 00:24:03 Challenges of building with early AI models
* 00:28:17 LLMs and Workflow Automation
* 00:35:28 Building dependency graphs of agents
* 00:37:34 Simulating API endpoints
* 00:40:41 State of AI models
* 00:43:19 Running evals
* 00:46:36 Challenges in building AI agents infra
* 00:49:21 Buy vs. build decisions for infrastructure components
* 00:51:02 Future of SaaS and AI's Impact on Software
* 00:53:07 The single employee $1B company race
* 00:56:32 Horizontal vs. vertical approaches to AI agents

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome.

Stan [00:00:14]: Thank you very much for having me.

Swyx [00:00:16]: Visiting from Paris.

Stan [00:00:17]: Paris.

Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college in both École Polytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll talk, we'll spend a little bit of time about that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders.

Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave.

Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, for back in the day, like we're talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI.
I think the Stripe culture has come into OpenAI quite a bit.

Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei.

Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure.

Stan [00:01:34]: My journey started as anybody else, you're fascinated with computer science and you want to make them think, it's awesome, but it doesn't work. I mean, it was a long time ago, I was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose how old I am, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was hand-crafted features for vision and the A-star algorithm. So it was fun. But it was the early days of deep learning. At the time, I think a few years after, there was the first project at Google, you know, that cat face or the human face trained from many images. I went to, hesitated doing a PhD, more in systems, eventually decided to go into getting a job. Went to Oracle, started a company, made a gazillion mistakes, got acquired by Stripe, worked with Greg Brockman there. And at the end of Stripe, I started getting interested in AI again, felt like it was the time, you had the Atari games, you had the self-driving craziness at the time. And I started exploring projects, it felt like the Atari games were incredible, but they were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things, self-driving cars, cybersecurity and AI, and math and AI. I'm citing them by decreasing order of impact on the world, I guess.

Swyx [00:03:01]: Discovering new math would be very foundational.

Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around.

Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move.

Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up a day and because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We were trying to apply transformers to cut fuzzing. So cut fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs. And we tried to apply a transformer to that and do reinforcement learning with the signal of how much you propagate within the binary. Didn't work at all because the transformers are so slow compared to evolutionary algorithms that it kind of didn't work. Then I started getting interested in math and AI and started working on SAT solving with AI. And at the same time, OpenAI was kind of starting the reasoning team that were tackling that project as well. I was in touch with Greg and eventually got in touch with Ilya and finally found my way to OpenAI. I don't know how much you want to dig into that.
The way to find your way to OpenAI when you're in Paris was kind of an interesting adventure as well.

Swyx [00:04:33]: Please. And I want to note, this was a two-month journey. You did all this in two months.

Stan [00:04:38]: The search.

Swyx [00:04:40]: Your search for your next thing, because you left in July 2019 and then you joined OpenAI in September.

Stan [00:04:45]: I'm going to be ashamed to say that.

Swyx [00:04:47]: You were searching before. I was searching before.

Stan [00:04:49]: I mean, it's normal. No, the truth is that I moved back to Paris through Stripe and I just felt the hardship of being remote from your team nine hours away. And so it kind of freed a bit of time for me to start the exploration before. Sorry, Patrick. Sorry, John.

Swyx [00:05:05]: Hopefully they're listening. So you joined OpenAI from Paris and from like, obviously you had worked with Greg, but not

Stan [00:05:13]: anyone else. No. Yeah. So I had worked with Greg, but not Ilya, but I had started chatting with Ilya and Ilya was kind of excited because he knew that I was a good engineer through Greg, I presume, but I was not a trained researcher, didn't do a PhD, never did research. And I started chatting and he was excited all the way to the point where he was like, hey, come pass interviews, it's going to be fun. I think he didn't care where I was, he just wanted to try working together. So I go to SF, go through the interview process, get an offer. And so I get Bob McGrew on the phone for the first time, he's like, hey, Stan, it's awesome. You've got an offer. When are you coming to SF? I'm like, hey, it's awesome. I'm not coming to the SF. I'm based in Paris and we just moved. He was like, hey, it's awesome. Well, you don't have an offer anymore. Oh, my God. No, it wasn't as hard as that. But that's basically the idea. And it took me like maybe a couple more rounds of chatting and they eventually decided to try a contractor setup. And that's how I kind of started working at OpenAI, officially as a contractor, but in practice really felt like being an employee.

Swyx [00:06:14]: What did you work on?

Stan [00:06:15]: So it was solely focused on math and AI. And in particular in the application, so the study of the large language models' mathematical reasoning capabilities, and in particular in the context of formal mathematics. The motivation was simple, transformers are very creative, but yet they do mistakes. Formal math systems have the ability to verify a proof, and the tactics they can use to solve problems are very mechanical, so you miss the creativity. And so the idea was to try to explore both together. You would get the creativity of the LLMs and the kind of verification capabilities of the formal system. A formal system, just to give a little bit of context, is a system in which a proof is a program and the formal system is a type system, a type system that is so evolved that you can verify the program. If the type checks, it means that the program is correct.

Swyx [00:07:06]: Is the verification much faster than actually executing the program?

Stan [00:07:12]: Verification is instantaneous, basically. So the truth is that what you code involves tactics that may involve computation to search for solutions. So it's not instantaneous. You do have to do the computation to expand the tactics into the actual proof. The verification of the proof at the very low level is instantaneous.
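A minimal Lean 4 sketch of the "proof is a program" idea (not from the episode): the theorem statement is a type, the proof term is a program of that type, and if the term type-checks, the proof has been verified.

```lean
-- The statement `a + b = b + a` is a type; the proof is a program of that type.
-- If this definition type-checks, the formal system has verified the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```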
Swyx [00:07:32]: How quickly do you run into like, you know, halting problem, P versus NP type things, like impossibilities where you're just like that?

Stan [00:07:39]: I mean, you don't run into it at the time. It was really trying to solve very easy problems. So I think the... Can you give an example of easy? Yeah, so that's the MATH benchmark that everybody knows today. The Dan Hendrycks one. The Dan Hendrycks one, yeah. And I think it was the low end part of the MATH benchmark at the time, because that benchmark includes AMC problems, AMC 8, AMC 10, 12. So these are the easy ones. Then AIME problems, somewhat harder, and some IMO problems, like crazy hard.

Swyx [00:08:07]: For our listeners, we covered this in our Benchmarks 101 episode. AMC is literally the grade of like high school, grade 8, grade 10, grade 12. So you can solve this. Just briefly to mention this, because I don't think we'll touch on this again. There's a bit of work with like Lean, and then with, you know, more recently with DeepMind doing like scoring like silver on the IMO. Any commentary on like how math has evolved from your early work to today?

Stan [00:08:34]: I mean, that result is mind blowing. I mean, from my perspective, I spent three years on that. At the same time, Guillaume Lample in Paris, we were both in Paris, actually. He was at FAIR, was working on some problems. We were pushing the boundaries, and the goal was the IMO. And we cracked a few problems here and there. But the idea of getting a medal at an IMO was like just remote. So this is an impressive result. And we can, I think the DeepMind team just did a good job of scaling. I think there's nothing too magical in their approach, even if it hasn't been published. There's a Dan Silver talk from seven days ago where he goes a little bit more into details. It feels like there's nothing magical there. It's really applying reinforcement learning and scaling up the amount of data that they can generate through autoformalization. So we can dig into what autoformalization means if you want.

Alessio [00:09:26]: Let's talk about the tail end, maybe, of the OpenAI. So you joined, and you're like, I'm going to work on math and do all of these things. I saw on one of your blog posts, you mentioned you fine-tuned over 10,000 models at OpenAI using 10 million A100 hours. How did the research evolve from GPT-2, and then getting closer to DaVinci 003? And then you left just before ChatGPT was released, but tell people a bit more about the research path that took you there.

Stan [00:09:54]: I can give you my perspective of it. I think at OpenAI, there's always been a large chunk of the compute that was reserved to train the GPTs, which makes sense. So it was pre-Anthropic split. Most of the compute was going to a product called Nest, which was basically GPT-3. And then you had a bunch of, let's say, remote, not core research teams that were trying to explore maybe more specific problems or maybe the algorithm part of it. The interesting part, I don't know if it was where your question was going, is that in those labs, you're managing researchers. So by definition, you shouldn't be managing them. But in that space, there's a managing tool that is great, which is compute allocation. Basically by managing the compute allocation, you can message the team of where you think the priority should go. And so it was really a question of, you were free as a researcher to work on whatever you wanted.
But if it was not aligned with OpenAI's mission, and that's fair, you wouldn't get the compute allocation. As it happens, solving math was very much aligned with the direction of OpenAI. And so I was lucky to generally get the compute I needed to make good progress.

Swyx [00:11:06]: What do you need to show as incremental results to get funded for further results?

Stan [00:11:12]: It's an imperfect process because there's a bit of a... If you're working on math and AI, obviously there's kind of a prior that it's going to be aligned with the company. So it's much easier than to go into something much more risky, much riskier, I guess. You have to show incremental progress, I guess. It's like you ask for a certain amount of compute and you deliver a few weeks after and you demonstrate that you have progress. Progress might be a positive result. Progress might be a strong negative result. And a strong negative result is actually often much harder to get or much more interesting than a positive result. And then it generally goes into, as any organization, you would have people finding your project or any other project cool and fancy. And so you would have that kind of phase of growing compute allocation for it all the way to a point. And then maybe you reach an apex and then maybe you go back mostly to zero and restart the process because you're going in a different direction or something else. That's how I felt. Explore, exploit. Yeah, exactly. Exactly. Exactly. It's a reinforcement learning approach.

Swyx [00:12:14]: Classic PhD student search process.

Alessio [00:12:17]: And you were reporting to Ilya, like the results you were kind of bringing back to him or like what's the structure? It's almost like when you're doing such cutting edge research, you need to report to somebody who is actually really smart to understand that the direction is right.

Stan [00:12:29]: So we had a reasoning team, which was working on reasoning, obviously, and so math in general. And that team had a manager, but Ilya was extremely involved in the team as an advisor, I guess. Since he brought me into OpenAI, I was lucky to mostly during the first years to have kind of a direct access to him. He would really coach me as a trainee researcher, I guess, with good engineering skills. And Ilya, I think at OpenAI, he was the one showing the North Star, right? It was his job and I think he really enjoyed it and he did it super well, was going through the teams and saying, this is where we should be going and trying to, you know, flock the different teams together towards an objective.

Swyx [00:13:12]: I would say like the public perception of him is that he was the strongest believer in scaling. Oh, yeah. Obviously, he has always pursued the compression thesis. You have worked with him personally, what does the public not know about how he works?

Stan [00:13:26]: I think he's really focused on building the vision and communicating the vision within the company, which was extremely useful. I was personally surprised that he spent so much time, you know, working on communicating that vision and getting the teams to work together versus...

Swyx [00:13:40]: To be specific, vision is AGI? Oh, yeah.

Stan [00:13:42]: Vision is like, yeah, it's the belief in compression and scaling compute. I remember when I started working on the Reasoning team, the excitement was really about scaling the compute around Reasoning and that was really the belief we wanted to ingrain in the team.
And that's what has been useful to the team, and the DeepMind results and the success of GPT-4 show that it was the right approach.

Swyx [00:14:06]: Was it according to the neural scaling laws, the Kaplan paper that was published?

Stan [00:14:12]: I think it was before that, because those ones came with GPT-3, basically at the time of GPT-3 being released or being ready internally. But before that, there really was a strong belief in scale. I think it was just the belief that the transformer was a generic enough architecture that you could learn anything. And that was just a question of scaling.

Alessio [00:14:33]: Any other fun stories you want to tell? Sam Altman, Greg, you know, anything.

Stan [00:14:37]: Weirdly, I didn't work that much with Greg when I was at OpenAI. He had always been mostly focused on training the GPTs and rightfully so. One thing about Sam Altman, he really impressed me because when I joined, he had joined not that long ago and it felt like he was kind of a very high level CEO. And I was mind blown by how deep he was able to go into the subjects within a year or something, all the way to a situation where, when I was having lunch with him by year two at OpenAI, he would just know quite deeply what I was doing. With no ML background. Yeah, with no ML background, but I didn't have any either, so I guess that explains why. But I think it's a question about, you don't necessarily need to understand the very technicalities of how things are done, but you need to understand what's the goal and what's being done and what are the recent results and all of that. And we could have kind of a very productive discussion. And that really impressed me, given the size at the time of OpenAI, which was not negligible.

Swyx [00:15:44]: Yeah. I mean, you've been a, you were a founder before, you're a founder now, and you've seen Sam as a founder. How has he affected you as a founder?

Stan [00:15:51]: I think having that capability of changing the scale of your attention in the company, because most of the time you operate at a very high level, but being able to go deep down and being in the know of what's happening on the ground is something that I feel is really enlightening. That's not a place in which I ever was as a founder, because first company, we went all the way to 10 people. Current company, there's 25 of us. So the high level, the sky and the ground are pretty much at the same place. No, you're being too humble.

Swyx [00:16:21]: I mean, Stripe was also like a huge rocket ship.

Stan [00:16:23]: Stripe, I wasn't a founder. So I was, like at OpenAI, I was really happy being on the ground, pushing the machine, making it work. Yeah.

Swyx [00:16:31]: Last OpenAI question. The Anthropic split you mentioned, you were around for that. Very dramatic. David also left around that time, you left. This year, we've also had a similar management shakeup, let's just call it. Can you compare what it was like going through that split during that time? And then like, does that have any similarities now? Like, are we going to see a new Anthropic emerge from these folks that just left?

Stan [00:16:54]: That I really, really don't know. At the time, the split was pretty surprising because they had been training GPT-3, it was a success. And to be completely transparent, I wasn't in the weeds of the split. What I understood of it is that there was a disagreement about the commercialization of that technology.
I think the focal point of that disagreement was the fact that we started working on the API and wanted to make those models available through an API. Is that really the core disagreement? I don't know.

Swyx [00:17:25]: Was it safety?

Stan [00:17:26]: Was it commercialization?

Swyx [00:17:27]: Or did they just want to start a company?

Stan [00:17:28]: Exactly. Exactly. That I don't know. But I think what I was surprised of is how quickly OpenAI recovered at the time. And I think it's just because we were mostly a research org and the mission was so clear that some divergence in some teams, some people leave, the mission is still there. We have the compute. We have a site. So it just keeps going.

Swyx [00:17:50]: Very deep bench. Like just a lot of talent. Yeah.

Alessio [00:17:53]: So that was the OpenAI part of the history. Exactly. So then you leave OpenAI in September 2022. And I would say in Silicon Valley, the two hottest companies at the time were you and LangChain. What was that start like and why did you decide to start with a more developer focused kind of like an AI engineer tool rather than going back into some more research and something else?

Stan [00:18:15]: Yeah. First, I'm not a trained researcher. So going through OpenAI was really kind of the PhD I always wanted to do. But research is hard. You're digging into a field all day long for weeks and weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, oh, yeah, that was obvious. And you go back to digging. I'm not a trained, like formally trained researcher, and it wasn't necessarily an ambition of mine to have a research career. And I felt the hardness of it. I enjoyed it a ton. But at the time, I decided that I wanted to go back to something more productive. And the other fun motivation was like, I mean, if we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down. And so that was kind of the true motivation for like trying to go there. So that's kind of the core motivation at the beginning of personally. And the motivation for starting a company was pretty simple. I had seen GPT-4 internally at the time, it was September 2022. So it was pre-ChatGPT, but GPT-4 was ready since, I mean, it had been ready for a few months internally. I was like, okay, that's obvious, the capabilities are there to create an insane amount of value to the world. And yet the deployment is not there yet. The revenue of OpenAI at the time was ridiculously small compared to what it is today. So the thesis was, there's probably a lot to be done at the product level to unlock the usage.

Alessio [00:19:49]: Yeah. Let's talk a bit more about the form factor, maybe. I think one of the first successes you had was kind of like the WebGPT-like thing, like using the models to traverse the web and like summarize things. And the browser was really the interface. Why did you start with the browser? Like why was it important? And then you built XP1, which was kind of like the browser extension.

Stan [00:20:09]: So the starting point at the time was, if you wanted to talk about LLMs, it was still a rather small community, a community of mostly researchers and to some extent, very early adopters, very early engineers.
It was almost inconceivable to just build a product and go sell it to the enterprise, though at the time there were a few companies doing that. The one on marketing, I don't remember its name, Jasper. But so the natural first intention, the first, first, first intention was to go to the developers and try to create tooling for them to create products on top of those models. And so that's what Dust was originally. It was quite different than LangChain, and LangChain just beat the s**t out of us, which is great. It's a choice.

Swyx [00:20:53]: You were closed source. They were open source.

Stan [00:20:56]: Yeah. So technically we were open source and we still are open source, but I think that doesn't really matter. I had the strong belief from my research time that you cannot create an LLM-based workflow on just one example. Basically, if you just have one example, you overfit. So as you develop your interaction, your orchestration around the LLM, you need a dozen examples. Obviously, if you're running a dozen examples on a multi-step workflow, you start parallelizing stuff. And if you do that in the console, you just have like a messy stream of tokens going out and it's very hard to observe what's going there. And so the idea was to go with a UI so that you could kind of introspect easily the output of each interaction with the model and dig into there through a UI, which is-

Swyx [00:21:42]: Was that open source? I actually didn't come across it.

Stan [00:21:44]: Oh yeah, it was. I mean, Dust is entirely open source even today. We're not going for an open source-

Swyx [00:21:48]: If it matters, I didn't know that.

Stan [00:21:49]: No, no, no, no, no. The reason why is because we're not open source because we're not doing an open source strategy. It's not an open source go-to-market at all. We're open source because we can and it's fun.

Swyx [00:21:59]: Open source is marketing. You have all the downsides of open source, which is like people can clone you.

Stan [00:22:03]: But I think that downside is a big fallacy. Okay. Yes, anybody can clone Dust today, but the value of Dust is not the current state. The value of Dust is the number of eyeballs and hands of developers that are creating to it in the future. And so yes, anybody can clone it today, but that wouldn't change anything. There is some value in being open source. In a discussion with the security team, you can be extremely transparent and just show the code. When you have discussion with users and there's a bug or a feature missing, you can just point to the issue, show the pull request, show the, show the, exactly, oh, PR welcome. That doesn't happen that much, but you can show the progress if the person that you're chatting with is a little bit technical, they really enjoy seeing the pull request advancing and seeing all the way to deploy. And then the downsides are mostly around security. You never want to do security by obfuscation. But the truth is that your vector of attack is facilitated by you being open source. But at the same time, it's a good thing because if you're doing anything like bug bounties or stuff like that, you just give much more tools to the bug bounty hunters so that their output is much better. So there's many, many, many trade-offs. I don't believe in the value of the code base per se. I think it's really the people that are on the code base that have the value and go to market and the product and all of those things that are around the code base. Obviously, that's not true for every code base.
If you're working on a very secret kernel to accelerate the inference of LLMs, I would buy that you don't want to be open source. But for product stuff, I really think there's very little risk. Yeah.

Alessio [00:23:39]: I signed up for XP1, I was looking, January 2023. I think at the time you were on DaVinci 003. Given that you had seen GPT-4, how did you feel having to push a product out that was using this model that was so inferior? And you're like, please, just use it today. I promise it's going to get better. Just overall, as a founder, how do you build something that maybe doesn't quite work with the model today, but you're just expecting the new model to be better?

Stan [00:24:03]: Yeah, so actually, XP1 was even on a smaller one that was the post-GPT-3 release, small version, so it was... Ada, Babbage... No, no, no, not that far away. But it was the small version of GPT, basically. I don't remember its name. Yes, you have a frustration there. But at the same time, I think XP1 was designed, was an experiment, but was designed as a way to be useful at the current capability of the model. If you just want to extract data from a LinkedIn page, that model was just fine. If you want to summarize an article on a newspaper, that model was just fine. And so it was really a question of trying to find a product that works with the current capability, knowing that you will always have tailwinds as models get better and faster and cheaper. So that was kind of a... There's a bit of a frustration because you know what's out there and you know that you don't have access to it yet. It's also interesting to try to find a product that works with the current capability.

Alessio [00:24:55]: And we highlighted XP1 in our anatomy of autonomy post in April of last year, which was, you know, where are all the agents, right? So now we spent 30 minutes getting to what you're building now. So you basically had a developer framework, then you had a browser extension, then you had all these things, and then you kind of got to where Dust is today. So maybe just give people an overview of what Dust is today and the core thesis behind it. Yeah, of course.

Stan [00:25:20]: So Dust, we really want to build the infrastructure so that companies can deploy agents within their teams. We are horizontal by nature because we strongly believe in the emergence of use cases from the people having access to creating an agent, who don't need to be developers. They have to be thinkers. They have to be curious. But anybody can create an agent that will solve an operational thing that they're doing in their day-to-day job. And to make those agents useful, there are two focuses, which is interesting. The first one is an infrastructure focus. You have to build the pipes so that the agent has access to the data. You have to build the pipes such that the agents can take action, can access the web, et cetera. So that's really an infrastructure play. Maintaining connections to Notion, Slack, GitHub, all of them is a lot of work. It is boring work, boring infrastructure work, but that's something that we know is extremely valuable in the same way that Stripe is extremely valuable because it maintains the pipes. And we have that dual focus because we're also building the product for people to use it. And there it's fascinating because everything started from the conversational interface, obviously, which is a great starting point. But we're only scratching the surface, right? I think we are at the Pong level of LLM productization.
And we haven't invented the C3. We haven't invented Counter-Strike. We haven't invented Cyberpunk 2077. So our mission is really to create the product that lets people equip themselves to just take away all the work that can be automated or assisted by LLMs.

Alessio [00:26:57]: And can you just comment on different takes that people had? So maybe the most open is like auto-GPT. It's just kind of like just trying to do anything. It's like it's all magic. There's no way for you to do anything. Then you had Adept, you know, we had David on the podcast. They're very like super hands-on with each individual customer to build super tailored. How do you decide where to draw the line between this is magic? This is exposed to you, especially in a market where most people don't know how to build with AI at all. So if you expect them to do the thing, they're probably not going to do it. Yeah, exactly.

Stan [00:27:29]: So the auto-GPT approach obviously is extremely exciting, but we know that the agentic capabilities of models are not quite there yet. It just gets lost. So we're starting, we're starting where it works. Same with XP1. And where it works is pretty simple. It's like simple workflows that involve a couple tools where you don't even need to have the model decide which tools to use, in the sense that you just want people to put it in the instructions. It's like take that page, do that search, pick up that document, do the work that I want in the format I want, and give me the results. There's no smartness there, right? In terms of orchestrating the tools, it's mostly using English for people to program a workflow where you don't have the constraint of having compatible API between the two.

Swyx [00:28:17]: That kind of personal automation, would you say it's kind of like an LLM Zapier type of

Stan [00:28:22]: thing?

Swyx [00:28:22]: Like if this, then that, and then, you know, do this, then this. You're programming with English?

Stan [00:28:28]: So you're programming with English. So you're just saying, oh, do this and then that. You can even create some form of APIs. You say, when I give you the command X, do this. When I give you the command Y, do this. And you describe the workflow. But you don't have to create boxes and create the workflow explicitly. It just needs to describe what are the tasks supposed to be and make the tool available to the agent. The tool can be a semantic search. The tool can be querying into a structured database. The tool can be searching on the web. And obviously, the interesting tools that we're only starting to scratch are actually creating external actions like reimbursing something on Stripe, sending an email, clicking on a button in the admin or something like that.

Swyx [00:29:11]: Do you maintain all these integrations?

Stan [00:29:13]: Today, we maintain most of the integrations. We do always have an escape hatch for people to kind of custom integrate. But the reality is that the reality of the market today is that people just want it to work, right? And so it's mostly us maintaining the integration. As an example, a very good source of information that is tricky to productize is Salesforce. Because Salesforce is basically a database and a UI. And they do the f**k they want with it. And so every company has different models and stuff like that. So right now, we don't support it natively. And the type of support or real native support will be slightly more complex than just OAuthing into it, like is the case with Slack as an example. Because it's probably going to be, oh, you want to connect your Salesforce to us? Give us the SOQL, that's the Salesforce query language. Give us the queries you want us to run on it and inject in the context of Dust. So it's interesting how not only integrations are cool, but some of them require a bit of work on the user. And for some of them that are really valuable to our users, but we don't support yet, they can just build them internally and push the data to us.
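For readers, a rough TypeScript sketch of that "bring your own query" pattern: the customer supplies the SOQL they care about, the queries are run, and the rows are flattened into text injected into the agent's context. The `runSoql` helper, `salesforceContext`, and the example query are hypothetical stand-ins for illustration, not Dust's actual connector API.

```typescript
// Hypothetical sketch: customer-provided SOQL queries rendered as agent context.
type Row = Record<string, string | number | null>;

// Assumed helper: whatever Salesforce client you use would implement this.
declare function runSoql(query: string): Promise<Row[]>;

async function salesforceContext(queries: string[]): Promise<string> {
  const sections: string[] = [];
  for (const q of queries) {
    const rows = await runSoql(q);
    // Render each row as a compact "field: value" line the model can read.
    const lines = rows.map((r) =>
      Object.entries(r).map(([k, v]) => `${k}: ${v ?? ""}`).join(", ")
    );
    sections.push(`Query: ${q}\n${lines.join("\n")}`);
  }
  return sections.join("\n\n");
}

// Example usage: the customer decides what matters, e.g.
// salesforceContext(["SELECT Name, StageName, Amount FROM Opportunity WHERE IsClosed = false"]);
```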
Swyx [00:30:18]: I think I understand the Salesforce thing. But let me just clarify, are you using browser automation because there's no API for something?

Stan [00:30:24]: No, no, no, no. In that case, so we do have browser automation for all the use cases that apply to the public web. But for most of the integration with the internal system of the company, it really runs through API.

Swyx [00:30:35]: Haven't you felt the pull to RPA, browser automation, that kind of stuff?

Stan [00:30:39]: I mean, what I've been saying for a long time, maybe I'm wrong, is that if the future is that you're going to stand in front of a computer and looking at an agent clicking on stuff, then I'll hit my computer. And my computer is a big Lenovo. It's black. Doesn't sound good at all compared to a Mac. And if the APIs are there, we should use them. There is going to be a long tail of stuff that don't have APIs, but as the world is moving forward, that's disappearing. So the core API value in the past has really been, oh, this old 90s product doesn't have an API. So I need to use the UI to automate. I think for most of the ICP companies, the companies that are ICP for us, the scale ups that are between 500 and 5,000 people, tech companies, most of the SaaS they use have APIs. Now there's an interesting question for the open web, because there is stuff that you want to do that involves websites that don't necessarily have APIs. And the current state of web integration, from us and OpenAI and Anthropic, I don't even know if they have web navigation, but I don't think so. The current state of affairs is really, really broken because you have what? You have basically search and headless browsing. But headless browsing, I think everybody's doing basically body.innerText and fill that into the model, right?

Swyx [00:31:56]: There's parsers into Markdown and stuff.

Stan [00:31:58]: I'm super excited by the companies that are exploring the capability of rendering a web page into a way that is compatible for a model, being able to maintain the selector. So that's basically the place where to click in the page through that process, expose the actions to the model, have the model select an action in a way that is compatible with the model, which is not a big page of a full DOM that is very noisy, and then being able to decompress that back to the original page and take the action. And that's something that is really exciting and that will kind of change the level of things that agents can do on the web. That I feel is exciting, but I also feel that the bulk of the useful stuff that you can do within the company can be done through API. The data can be retrieved by API. The actions can be taken through API.

Swyx [00:32:44]: For listeners, I'll note that you're basically completely disagreeing with David Luan. Exactly, exactly. We've seen it since this summer. Adept is where it is, and Dust is where it is. So Dust is still standing.

Alessio [00:32:55]: Can we just quickly comment on function calling?
You mentioned you don't need the models to be that smart to actually pick the tools. Have you seen the models not be good enough? Or is it just like, you just don't want to put the complexity in there? Like, is there any room for improvement left in function calling? Or do you feel you usually consistently get always the right response, the right parameters and all of that?

Stan [00:33:15]: So that's a tricky product question. Because if the instructions are good and precise, then you don't have any issue, because it's scripted for you. And the model will just look at the scripts and just follow and say, oh, he's probably talking about that action, and I'm going to use it. And the parameters are kind of deduced from the state of the conversation. I'll just go with it. If you provide a very high level, kind of an auto-GPT-esque level in the instructions and provide 16 different tools to your model, yes, we're seeing the models in that state making mistakes. And there is obviously some progress that can be made on the capabilities. But the interesting part is that there is already so much work that can be assisted, augmented, accelerated by just going with pretty simply scripted agents. What I'm excited about by pushing our users to create rather simple agents is that once you have those working really well, you can create meta agents that use the agents as actions. And all of a sudden, you can kind of have a hierarchy of responsibility that will probably get you almost to the point of the auto-GPT value. It requires the construction of intermediary artifacts, but you're probably going to be able to achieve something great. I'll give you some example. Our incidents are shared in Slack in a specific channel, and what we ship is shared in Slack as well. We have a weekly meeting where we have a table about incidents and shipped stuff. We're not writing that weekly meeting table anymore. We have an assistant that just goes and finds the right data on Slack and creates the table for us. And that assistant works perfectly. It's trivially simple, right? Take one week of data from that channel and just create the table. And then we have in that weekly meeting, obviously some graphs and reporting about our financials and our progress and our ARR. And we've created assistants to generate those graphs directly. And those assistants work great. By creating those assistants that cover those small parts of that weekly meeting, slowly we're getting to a world where we'll have a weekly meeting assistant. We'll just call it. You don't need to prompt it. You don't need to say anything. It's going to run those different assistants and get that Notion page just ready. And by doing that, if you get there, and that's an objective for us, using Dust ourselves, to get there, you're saving an hour of company time every time you run it. Yeah.

Alessio [00:35:28]: That's my pet topic of NPM for agents. How do you build dependency graphs of agents? And how do you share them? Because why do I have to rebuild some of the smaller levels of what you built already?

Swyx [00:35:40]: I have a quick follow-up question on agents managing other agents. It's a topic of a lot of research, both from Microsoft and even in startups. What have you discovered as best practice for, let's say like a manager agent controlling a bunch of small agents. It's two-way communication.
I don't know if there should be a protocol format.

Stan [00:35:59]: To be completely honest, the state we are at right now is creating the simple agents. So we haven't even explored yet the meta agents. We know it's there. We know it's going to be valuable. We know it's going to be awesome. But we're starting there because it's the simplest place to start. And it's also what the market understands. If you go to a company, random SaaS B2B company, not necessarily specialized in AI, and you take an operational team and you tell them, build some tooling for yourself, they'll understand the small agents. If you tell them, build AutoGPT, they'll be like, Auto what?

Swyx [00:36:31]: And I noticed that in your language, you're very much focused on non-technical users. You don't really mention API here. You mention instruction instead of system prompt, right? That's very conscious.

Stan [00:36:41]: Yeah, it's very conscious. It's a mark of our designer, Ed, who kind of pushed us to create a friendly product. I was knee-deep into AI when I started, obviously. And my co-founder, Gabriel, was at Stripe as well. We started a company together that got acquired by Stripe 15 years ago. He was then at Alan, a healthcare company in Paris. After that, he was a little bit less knee-deep in AI, but really focused on product. And I didn't realize how important it is to make that technology not scary to end users. It didn't feel scary to me, but it was really seen by Ed, our designer, that it was feeling scary to the users. And so we were very proactive and very deliberate about creating a brand that feels not too scary and creating a wording and a language, as you say, that really tried to communicate the fact that it's going to be fine. It's going to be easy. You're going to make it.

Alessio [00:37:34]: And another big point that David had about Adept is we need to build an environment for the agents to act. And then if you have the environment, you can simulate what they do. How's that different when you're interacting with APIs and you're kind of touching systems that you cannot really simulate? If you call the Salesforce API, you're just calling it.

Stan [00:37:52]: So I think that goes back to the DNA of the companies that are very different. Adept, I think, was a product company with a very strong research DNA, and they were still doing research. One of their goals was building a model. And that's why they raised a large amount of money, et cetera. We are 100% deliberately a product company. We don't do research. We don't train models. We don't even run GPUs. We're using the models that exist, and we try to push the product boundary as far as possible with the existing models. So that creates an issue. Indeed, so to answer your question, when you're interacting in the real world, well, you cannot simulate, so you cannot improve the models. Even improving your instructions is complicated for a builder. The hope is that you can use models to evaluate the conversations so that you can get at least feedback and you could get contradictory information about the performance of the assistants. But if you take actual traces of interactions of humans with those agents, it is even for us humans extremely hard to decide whether it was a productive interaction or a really bad interaction. You don't know why the person left. You don't know if they left happy or not. So being extremely, extremely, extremely pragmatic here, it becomes a product issue.
We have to build a product that incentivizes the end users to provide feedback so that as a first step, the person that is building the agent can iterate on it. As a second step, maybe later when we start training models and post-training, et cetera, we can optimize around that for each of those companies. Yeah.

Alessio [00:39:17]: Do you see in the future products offering kind of like a simulation environment, the same way all SaaS now kind of offers APIs to build programmatically? Like in cybersecurity, there are a lot of companies working on building simulative environments so that then you can use agents to red team, but I haven't really seen that.

Stan [00:39:34]: Yeah, no, me neither. That's a super interesting question. I think it's really going to depend on how much, because you need to simulate to generate data, you need data to train models. And the question at the end is, are we going to be training models or are we just going to be using frontier models as they are? On that question, I don't have a strong opinion. It might be the case that we'll be training models because in all of those AI first products, the model is so close to the product surface that as you get big and you want to really own your product, you're going to have to own the model as well. Owning the model doesn't mean doing the pre-training, that would be crazy. But at least having an internal post-training realignment loop, it makes a lot of sense. And so if we see many companies going towards that all the time, then there might be incentives for the SaaS's of the world to provide assistance in getting there. But at the same time, there's a tension because those SaaS, they don't want to be interacted with by agents, they want the human to click on the button. Yeah, they got to sell seats. Exactly.

Swyx [00:40:41]: Just a quick question on models. I'm sure you've used many, probably not just OpenAI. Would you characterize some models as better than others? Do you use any open source models? What have been the trends in models over the last two years?

Stan [00:40:53]: We've seen over the past two years kind of a bit of a race in between models. And at times, it's the OpenAI model that is the best. At times, it's the Anthropic models that are the best. Our take on that is that we are agnostic and we let our users pick their model. Oh, they choose? Yeah, so when you create an assistant or an agent, you can just say, oh, I'm going to run it on GPT-4, GPT-4 Turbo, or...

Swyx [00:41:16]: Don't you think for the non-technical user, that is actually an abstraction that you should take away from them?

Stan [00:41:20]: We have a sane default. So we move the default to the latest model that is cool. And we have a sane default, and it's actually not very visible. In our flow to create an agent, you would have to go into advanced settings and go pick your model. So this is something that the technical person will care about. But that's something that obviously is a bit too complicated for the...

Swyx [00:41:40]: And do you care most about function calling or instruction following or something else?

Stan [00:41:44]: I think we care most for function calling because you want to... There's nothing worse than a function call including incorrect parameters or being a bit off, because it just drives the whole interaction off.

Swyx [00:41:56]: Yeah, so you've got the Berkeley function calling leaderboard.

Stan [00:42:00]: These days, it's funny how the comparison between GPT-4o and GPT-4 Turbo is still up in the air on function calling. I personally don't have proof, but I know many people, and I'm probably part of them, to think that GPT-4 Turbo is still better than GPT-4o on function calling. Wow. We'll see what comes out of the O1 class if it ever gets function calling. And Claude 3.5 Sonnet is great as well. They kind of innovated in an interesting way, which was never quite publicized. But it's that they have that kind of chain of thought step whenever you use a Claude model or Sonnet model with function calling. That chain of thought step doesn't exist when you just interact with it just for answering questions. But when you use function calling, you get that step, and it really helps getting better function calling.
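For readers who want to see what a tightly scripted tool looks like in practice, here is an illustrative OpenAI-style function calling sketch in TypeScript. The `search_notion` tool, its schema, and the instructions are made up for illustration; this is a generic sketch of the technique, not Dust's implementation.

```typescript
// Illustrative only: a single, well-described tool plus precise instructions,
// the "simply scripted" setup discussed above. The JSON Schema keeps the
// model's parameters from drifting.
const tools = [
  {
    type: "function",
    function: {
      name: "search_notion", // hypothetical tool name
      description: "Semantic search over the company's Notion pages.",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "What to look for." },
          top_k: { type: "integer", description: "Number of results.", minimum: 1, maximum: 20 },
        },
        required: ["query"],
      },
    },
  },
];

async function main() {
  // Minimal call sketch against the public Chat Completions API.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4-turbo",
      messages: [
        { role: "system", content: "Use search_notion when the user asks about internal docs." },
        { role: "user", content: "What is our incident process?" },
      ],
      tools,
    }),
  });
  const data = await res.json();
  // The model's chosen call (if any) comes back under tool_calls with JSON arguments.
  console.log(data.choices[0].message.tool_calls);
}

main();
```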
Swyx [00:42:43]: Yeah, we actually just recorded a podcast with the Berkeley team that runs that leaderboard this week. So they just released V3.

Stan [00:42:49]: Yeah.

Swyx [00:42:49]: It was V1 like two months ago, and then they V2, V3. Turbo is on top.

Stan [00:42:53]: Turbo is on top. Turbo is above 4o.

Swyx [00:42:54]: And then the third place is XLAM from Salesforce, which is a large action model they've been trying to popularize.

Stan [00:43:01]: Yep.

Swyx [00:43:01]: O1 Mini is actually on here, I think. O1 Mini is number 11.

Stan [00:43:05]: But arguably, O1 Mini hasn't been aligned for that. Yeah.

Alessio [00:43:09]: Do you use leaderboards? Do you have your own evals? I mean, this is kind of intuitive, right? Like using the older model is better. I think most people just upgrade. Yeah. What's the eval process like?

Stan [00:43:19]: It's funny because I've been doing research for three years, and we have bigger stuff to cook. When you're deploying in a company, one thing where we really spike is that when we manage to activate the company, we have a crazy penetration. The highest penetration we have is 88% daily active users within the entire employee base of the company. The kind of average penetration and activation we have in our current enterprise customers is more like 60% to 70% weekly active. So we basically have the entire company interacting with us. And when you're there, there is so much stuff that matters more than getting evals, than getting the best model. Because there are so many places where you can create products or do stuff that will give you the 80% with the work you do. Whereas deciding if it's GPT-4 or GPT-4 Turbo or et cetera, you know, it'll just give you the 5% improvement. But the reality is that you want to focus on the places where you can really change the direction or change the interaction more drastically. But that's something that we'll have to do eventually because we still want to be serious people.

Swyx [00:44:24]: It's funny because in some ways, the model labs are competing for you, right? You don't have to do any effort. You just switch model and then it'll grow. What are you really limited by? Is it additional sources?

Stan [00:44:36]: It's not models, right?

Swyx [00:44:37]: You're not really limited by quality of model.

Stan [00:44:40]: Right now, we are limited by the infrastructure part, which is the ability to connect easily for users to all the data they need to do the job they want to do.

Swyx [00:44:51]: Because you maintain all your own stuff.

Stan [00:44:53]: You know, there are companies out there

Swyx [00:44:54]: that are starting to provide integrations as a service, right? I used to work in an integrations company.
Yeah, I know.

Stan [00:44:59]: It's just that there are some intricacies about how you chunk stuff and how you process information from one platform to the other. If you look at the end of the spectrum, you could think of, you could say, oh, I'm going to support Airbyte and Airbyte has- I used to work at Airbyte.

Swyx [00:45:12]: Oh, really?

Stan [00:45:13]: That makes sense.

Swyx [00:45:14]: They're the French founders as well.

Stan [00:45:15]: I know Jean very well. I'm seeing him today. And the reality is that if you look at Notion, Airbyte does the job of taking Notion and putting it in a structured way. But the way it does it is not really usable to actually make it available to models in a useful way. Because you get all the blocks, details, et cetera, which is useful for many use cases.

Swyx [00:45:35]: It's also for data scientists and not for AI.

Stan [00:45:38]: The reality of Notion is that sometimes you have a- so when you have a page, there's a lot of structure in it and you want to capture the structure and chunk the information in a way that respects that structure. In Notion, you have databases. Sometimes those databases are real tabular data. Sometimes those databases are full of text. You want to get the distinction and understand that this database should be considered like text information, whereas this other one is actually quantitative information. And to really get a very high quality interaction with that piece of information, I haven't found a solution that will work without us owning the connection end-to-end.

Swyx [00:46:15]: That's why I don't invest in, there's Composio, there's All Hands from Graham Neubig. There's all these other companies that are like, we will do the integrations for you. You just, we have the open source community. We'll do off the shelf. But then you are so specific in your needs that you want to own it.

Swyx [00:46:28]: Yeah, exactly.

Stan [00:46:29]: You can talk to Michel about that.

Swyx [00:46:30]: You know, he wants to put the AI in there, but you know. Yeah, I will. I will.

Stan [00:46:35]: Cool. What are we missing?

Alessio [00:46:36]: You know, what are like the things that are like sneakily hard that you're tackling that maybe people don't even realize they're like really hard?

Stan [00:46:43]: The hard part, as we kind of touched on throughout the conversation, is really building the infra that works for those agents, because it's tedious work. It's an evergreen piece of work because you always have an extra integration that will be useful to a non-negligible set of your users. What I'm super excited about is that there are so many interactions that shouldn't be conversational interactions and that could be very useful. Basically, we have the firehose of information of those companies and there's not going to be that many companies that capture the firehose of information. When you have the firehose of information, you can do a ton of stuff with models that are just not accelerating people, but giving them superhuman capability, even with the current model capability because you can just sift through much more information. An example is documentation repair. If I have the firehose of Slack messages and new Notion pages, if somebody says, I own that page, I want to be updated when there is a piece of information that should update that page, this is now possible. You get an email saying, oh, look at that Slack message. It says the opposite of what you have in that paragraph. Maybe you want to update or just ping that person. I think there is a lot to be explored on the product layer in terms of what it means to interact productively with those models. And that's a problem that's extremely hard and extremely exciting.
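A hypothetical TypeScript sketch of what that documentation-repair loop could look like; none of the helper functions here are real Dust or Slack APIs, they are assumed stand-ins for the connectors and model call involved.

```typescript
// Hypothetical sketch: watch the Slack firehose, ask a model whether a new
// message contradicts a page someone owns, and ping the owner if it does.

interface WatchedPage {
  url: string;
  owner: string;
  content: string;
}

// Assumed stand-ins, not real Dust or Slack APIs.
declare function fetchRecentSlackMessages(channel: string): Promise<string[]>;
declare function askModel(prompt: string): Promise<string>;
declare function notify(user: string, text: string): Promise<void>;

async function repairDocs(page: WatchedPage, channel: string): Promise<void> {
  const messages = await fetchRecentSlackMessages(channel);
  for (const msg of messages) {
    // Ask the model for a simple YES/NO contradiction check.
    const verdict = await askModel(
      `Does this Slack message contradict the page below? Answer YES or NO.\n` +
        `Message: ${msg}\nPage: ${page.content}`
    );
    if (verdict.trim().toUpperCase().startsWith("YES")) {
      await notify(page.owner, `Heads up: a Slack message may contradict ${page.url}: "${msg}"`);
    }
  }
}
```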
I think there is a lot to be explored on the product layer in terms of what it means to interact productively with those models. And that's a problem that's extremely hard and extremely exciting.Swyx [00:48:00]: One thing you keep mentioning about infra work, obviously, Dust is building that infra and serving that in a very consumer-friendly way. You always talk about infra being additional sources, additional connectors. That is very important. But I'm also interested in the vertical infra. There is an orchestrator underlying all these things where you're doing asynchronous work. For example, the simplest one is a cron job. You just schedule things. But also, for if this and that, you have to wait for something to be executed and proceed to the next task. I used to work on an orchestrator as well, Temporal.Stan [00:48:31]: We used Temporal. Oh, you used Temporal? Yeah. Oh, how was the experience?Swyx [00:48:34]: I need the NPS.Stan [00:48:36]: We're doing a self-discovery call now.Swyx [00:48:39]: But you can also complain to me because I don't work there anymore.Stan [00:48:42]: No, we love Temporal. There's some edges that are a bit rough, surprisingly rough. And you would say, why is it so complicated?Swyx [00:48:49]: It's always versioning.Stan [00:48:50]: Yeah, stuff like that. But we really love it. And we use it for exactly what you said, like managing the entire set of stuff that needs to happen so that in semi-real time, we get all the updates from Slack or Notion or GitHub into the system. And whenever we see that piece of information goes through, maybe trigger workflows to run agents because they need to provide alerts to users and stuff like that. And Temporal is great. Love it.Swyx [00:49:17]: You haven't evaluated others. You don't want to build your own. You're happy with...Stan [00:49:21]: Oh, no, we're not in the business of replacing Temporal. And Temporal is so... I mean, it is or any other competitive product. They're very general. If it's there, there's an interesting theory about buy versus build. I think in that case, when you're a high-growth company, your buy-build trade-off is very much on the side of buy. Because if you have the capability, you're just going to be saving time, you can focus on your core competency, etc. And it's funny because we're seeing, we're starting to see the post-high-growth company, post-SKF company, going back on that trade-off, interestingly. So that's the Klarna news about removing Zendesk and Salesforce. Do you believe that, by the way?Alessio [00:49:56]: Yeah, I did a podcast with them.Stan [00:49:58]: Oh, yeah?Alessio [00:49:58]: It's true.Swyx [00:49:59]: No, no, I know.Stan [00:50:00]: Of course they say it's true,Swyx [00:50:00]: but also how well is it going to go?Stan [00:50:02]: So I'm not talking about deflecting the customer traffic. I'm talking about building AI on top of Salesforce and Zendesk, basically, if I understand correctly. And all of a sudden, your product surface becomes much smaller because you're interacting with an AI system that will take some actions. And so all of a sudden, you don't need the product layer anymore. And you realize that, oh, those things are just databases that I pay a hundred times the price, right? Because you're a post-SKF company and you have tech capabilities, you are incentivized to reduce your costs and you have the capability to do so. And then it makes sense to just scratch the SaaS away. 
So it's interesting that we might see kind of a bad time for SaaS in post-hyper-growth tech companies. So it's still a big market, but it's not that big because if you're not a tech company, you don't have the capabilities to reduce that cost. If you're a high-growth company, you're always going to be buying because you go faster with that. But that's an interesting new space, new category of companies that might remove some SaaS. Yeah, Alessio's firmSwyx [00:51:02]: has an interesting thesis on the future of SaaS in AI.Alessio [00:51:05]: Service as a software, we call it. It's basically like, well, the most extreme is like, why is there any software at all? You know, ideally, it's all a labor interface where you're asking somebody to do something for you, whether that's a person, an AI agent or whatnot.Stan [00:51:17]: Yeah, yeah, that's interesting. I have to ask.Swyx [00:51:19]: Are you paying for Temporal Cloud or are you self-hosting?Stan [00:51:22]: Oh, no, no, we're paying, we're paying. Oh, okay, interesting.Swyx [00:51:24]: We're paying way too much.Stan [00:51:26]: It's crazy expensive, but it makes us-Swyx [00:51:28]: That's why as a shareholder, I like to hear that. It makes us go faster,Stan [00:51:31]: so we're happy to pay.Swyx [00:51:33]: Other things in the infrastack, I just want a list for other founders to think about. Ops, API gateway, evals, you know, anything interesting there that you build or buy?Stan [00:51:41]: I mean, there's always an interesting question. We've been building a lot around the interface between models and because Dust, the original version, was an orchestration platform and we basically provide a unified interface to every model provider.Swyx [00:51:56]: That's what I call gateway.Stan [00:51:57]: That we had because Dust was that, and so we continued building upon it and we own it. But that's an interesting question: do you want to build that or buy it?Swyx [00:52:06]: Yeah, I always say LiteLLM is the current open source consensus.Stan [00:52:09]: Exactly, yeah. There's an interesting question there.Swyx [00:52:12]: Ops, Datadog, just tracking.Stan [00:52:14]: Oh yeah, so Datadog is an obvious... What are the mistakes that I regret? I started as pure JavaScript, not TypeScript, and I think you want to, if you're wondering, oh, I want to go fast, I'll do a little bit of JavaScript. No, don't, just start with TypeScript. I see, okay.Swyx [00:52:30]: So interesting, you are a research engineer that came out of OpenAI that bet on TypeScript.Stan [00:52:36]: Well, the reality is that if you're building a product, you're going to be doing a lot of JavaScript, right? And Next, we're using Next as an example. It's
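An aside on the gateway point above: swyx names LiteLLM as the current open-source consensus, and a minimal sketch of the idea might look like the snippet below. This is not Dust's own gateway (which they built and own in TypeScript); the model identifiers are purely illustrative.

```python
# Hedged sketch of a unified model gateway using LiteLLM, the open-source
# option mentioned above. Model strings are illustrative, not Dust's config.
from litellm import completion

def ask(model: str, prompt: str) -> str:
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

# Same call path regardless of provider; only the model identifier changes.
print(ask("gpt-4o", "Summarize our onboarding doc in three bullets."))
print(ask("anthropic/claude-3-5-sonnet-20240620", "Summarize our onboarding doc in three bullets."))
```

The value of owning this layer, as Stan notes, is less about the call signature and more about controlling retries, logging, and how each provider's quirks (like Claude's function-calling chain of thought) are handled.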
It's return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December post Databricks acquisition): Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.While Imbue, being an agents company rather than a model provider, are not releasing their models today, they are releasing almost everything else: * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks* An entirely new code-focused reasoning benchmark* A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity* A new dataset of 450,000 human judgments about ambiguity* Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training* Our cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any sizeAs well as EXTREMELY detailed posts on the infrastructure needs, hyperparameter search, and clean versions of the sorry state of industry standard benchmarks. This means for the FIRST TIME (perhaps since Meta's OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.We are busy running the sold-out AI Engineer World's Fair today, and so are unable to do our usual quality writeup, however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes.Video podTimestamps* [00:00:00] Introduction and catch up with guests* [00:01:55] Databricks' text to image model release* [00:03:46] Details about the DBRX model* [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases* [00:09:18] Challenges of training foundation models and getting infrastructure to work* [00:12:03] Details of Imbue's cluster setup* [00:18:53] Process of bringing machines online and common failures* [00:22:52] Health checks and monitoring for the cluster* [00:25:06] Typical timelines and team composition for setting up a cluster* [00:27:24] Monitoring GPU utilization and performance* [00:29:39] Open source tools and libraries used* [00:32:33] Reproducibility and portability of cluster setup* [00:35:57] Infrastructure changes needed for different model architectures* [00:40:49] Imbue's focus on text-only models for coding and reasoning* [00:42:26] CARBS hyperparameter tuner and cost-aware optimization* [00:51:01] Emergence and CARBS* [00:53:18] Evaluation datasets and reproducing them with high quality* [00:58:40] Challenges of evaluating on more realistic tasks* [01:06:01] Abstract reasoning benchmarks like ARC* [01:10:13] Long context evaluation and needle-in-a-haystack tasks* [01:13:50] Function calling and tool use evaluation* [01:19:19] Imbue's future plans for coding and reasoning applications* [01:20:14] Databricks' future plans for useful applications and upcoming blog postsTranscriptSWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. Today, we have sort of like a two-header. 
Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome.JOSH [00:00:12]: Hey, glad to be here.SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT-7B. Remember the days when we trained large models and there was 7B?JONATHAN [00:00:30]: Yeah, back when reproducing LLAMA1-7B was considered a huge accomplishment for the field. Those are the good old days. I miss that.SWYX [00:00:38]: As the things have accelerated a lot. Actually, let's do a quick catch up and Josh, you can chime on in as well. So Databricks got acquired. I talked to you in New York.JONATHAN [00:00:45]: Mosaic got acquired, although sometimes it feels like Mosaic acquired Databricks because, you know, we're having a lot of fun being here. But, you know, yeah.SWYX [00:00:52]: Yeah. I mean, you are chief scientist now of Databricks.JONATHAN [00:00:55]: Chief AI scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me.SWYX [00:01:03]: Got it. And I don't know about like what you would highlight so far as a post-acquisition, but the most recent news is that you guys released DBRX. Is that the thing that most people should be aware of?JONATHAN [00:01:13]: Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this, but it was at our Data and AI Summit last week. So it was announced among like 100,000 other things, is that we finally released our text to image model, which has been a year in the making through a collaboration directly with Shutterstock. There was a lot of work put into finding a dataset that we were comfortable with working on and trying to build a model that honestly, I felt like I could trust and that others might be able to trust to put out in the world. So that model was released last week. It's unfortunately just available via API due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways, but I'm still really excited that there's now a model that is trained on a dataset where the provenance of every single image is known, and it's a damn good model. So I'm really proud of the team on that.SWYX [00:01:55]: Yeah, amazing. Josh, do you have any thoughts on image model questions?JOSH [00:01:59]: That is not my area of expertise, but I was excited to see the release of it last week as well, and very happy that you guys did a nice job on the data side of everything there. So that was cool to see.SWYX [00:02:09]: I think what's unusual is like, I think Shutterstock's doing multiple deals with multiple labs. So what is the Shutterstock model? Like, I guess, is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? Like, what is this?JONATHAN [00:02:22]: The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board. Their dataset is kind of widely known to be the best stock photos dataset in the world, the most comprehensive, the biggest. When you think about like, what dataset am I going to train a multimodal model on? You call Shutterstock. And I, at least I've heard in the news, like OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals. So a lot of models have had Shutterstock data incorporated into them. 
But this is the only model I know of so far where it was, you know, exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combined datasets or anything like that. And so this is, in some sense, the house blend. But the other piece is that it's just a dataset where the provenance of every image is known in public. Where did the data come from? It is the Shutterstock collection. That's it. You know, nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, I've learned about enterprise customers and what they want out of AI. And one of the things they ask for most is just, what can you tell me about the data the model was trained on? And here, especially for text to image models, where images are just tricky subject matter, there's been a lot of kind of legal conversation about images, especially. It's nice to just have something where I can point to it and say, you know, if you want to know where the images came from, these are what they are and this is how they got there.SWYX [00:03:36]: I will talk a little bit about Databricks because it's relevant to the rest of today's episode. So Databricks, sorry, I keep misspeaking. It's DBRX.JONATHAN [00:03:46]: DBRX, actually, there's been a pronunciation update. It is now D-B-Rex. So we have decided to add a dinosaur mascot because what model doesn't like a mascot? So literally, I wish I could pull it up. There is a little plush dinosaur that we had made. It's like the world's cutest dinosaur, but it is the official mascot of D-B-Rex. And there's a little dinosaur logo that, you know, you'll probably see around a little bit more because DBRX is a mouthful, but D-B-Rex, like, you know, it's just kind of...SWYX [00:04:13]: Rolls off the tongue. I love mascots. Like every company should have a mascot. And I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image.JONATHAN [00:04:21]: I probably shouldn't talk at all about, you know, Velociraptor, but, you know, that's a, maybe that's something we can talk about later in the summer. I'll just leave it at that.SWYX [00:04:28]: Okay. That's a hint to names. I feel like your names leak a lot of alpha. So just to quickly cover the headline details, DBRX is a Mixture of Experts model, that's fairly big, 132 billion total parameters, so 36 billion active on any input, pre-trained on 12 trillion tokens of text and code, and did really well on evals to the point where you had to dye your hair blue. That's my high level conclusion.JONATHAN [00:04:53]: Never make a bet with your team two weeks out from model launch, even when, you know, HumanEval is looking quite bad. Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it, apparently money doesn't motivate people anymore. Humiliating their boss motivates people. So Josh, you should really take a hint from this. You know, you cannot pay someone enough money to make up for you dyeing your hair blue.JOSH [00:05:15]: I'll keep that in mind for our next model.SWYX [00:05:17]: It works. So speaking of Imbue's next model, perhaps Josh, you want to actually just say hi to the general sort of latent space audience and talk about what we're releasing today. Yeah.JOSH [00:05:26]: I'm Josh, CTO of Imbue, and we're not releasing the model. 
We're not releasing the weights, but we are releasing a bunch of different things that should make it easier for other people to make their own models. So I think right now, training foundation models from scratch is like a very difficult, time-consuming, expensive, kind of risky endeavor, especially for smaller companies. And the things that we're releasing hopefully make that at least a little bit easier. So the things that we're releasing fall into kind of three different buckets. One is infrastructure and scripts for dealing with the kind of hardware and hardware failures and understanding how well is the actually lowest level of thing actually working so that you can actually do your training at all and at a reasonable speed without having to constantly restart, etc. So infrastructure and training scripts. A second set of things is around the evaluation. So after you've trained it, like how well is this actually working and how do you know how well it's working? We're releasing a whole bunch of different data there, a new benchmark about code, reasoning, understanding, as well as our own private versions of 11 different open source benchmarks. So things like BoolQ or ANLI, where we've gone through and kind of cleaned up the data as much as possible by looking at all the ones that models get wrong or that are flagged for ambiguity and also our own kind of private reproductions of those where we've done like a kind of clean room black box, like, okay, this is what the data set is supposed to be. Here are some examples. Let's make our own version of this to make sure that there is no data contamination, etc. To make sure that we're actually, you know, not testing on train. And then I think a final thing that we're releasing there is around 450,000 human judgments about ambiguity and question quality, which we used in the process of cleaning these evaluations and we also hope will be helpful for other people training kind of similar models. And then the third thing is CARBS, our hyperparameter, our cost-aware hyperparameter optimizer, which was especially helpful for being able to experiment at much smaller scales and then scale those experiments up to the much larger scale kind of on the first try without having to retry it. You don't want to be training, you know, 10, 20 different 70B models. You really want to get these larger modelsSWYX [00:07:30]: right on the first try.JOSH [00:07:30]: And so the ability to kind of tune things very precisely and learn scaling laws, not just for, you know, the like data and flops, but also for learning rate and all the other hyperparameters and see like how should you scale these things up was extremely valuable to us as we were training the larger models. Yeah, that's a lot of stuff.SWYX [00:07:49]: Yeah, exactly. So there's a bunch of stuffJOSH [00:07:50]: we'll have to go through all of it.JONATHAN [00:07:52]: Yeah, I just want to throw in how excited I am about this. This is the stuff that nobody ever talks about. That is the difference between success and failure in this stuff. Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. 
A basic MPI hello world that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There's so many levels of things you have to accomplish. This is the kind of stuff that matters. I think to a point that Josh made earlier, before we got on here, there are plenty of weights out there. Nobody's released this.JOSH [00:08:46]: Yeah, that was part of the motivation actually is that there are lots of other things that are complementary, but I have not seen nearly as much discussion about some of these other things that we think are pretty important. I mean, in some sense,SWYX [00:08:56]: I'm very excited to have Jonathan on because this is a little bit your bread and butter with Mosaic. And I think you've released some part with Composer. And I think it's just really interesting to see like a different take, basically a full stack take that's kind of open source today.JONATHAN [00:09:18]: Yeah, it's really kind of, it's been an ordeal to figure this out. And every time something changes, whether it's a new GPU or even a new driver update, you get new creative errors and new things go wrong. And, you know, we've dealt with the weirdest things from, you know, our InfiniBand cables getting stolen from the data center twice, like in boxes before they arrived at the data center. Like, you know, Porch Pirate basically had stolen our InfiniBand cables back when those were hard to come by. To like, you know, weird recalls of switches to like the strangest stuff has happened. I have my favorite GPU failures I've seen, like ones where the GPU doesn't fail, it has a correctable memory issue and the memory correction causes the GPU to become a straggler and hold up the whole job. Like weird stuff happens and figuring out how to not just identify all of that, but then eventually productize it, is in some sense, the entire story of Mosaic and now Databricks in terms of our ML offering. Really, the thing we offer is we have gone through this suffering and figured out how to even productize that. It has been a pain in the butt.SWYX [00:10:20]: Yeah, it's a lot of work.JOSH [00:10:20]: I think my favorite failure was GPUs just giving wrong math. Like if they give errors, great, because you can see the errors, but if they just give you the wrong math back, not so fun.SWYX [00:10:30]: When did they give you wrong math?JOSH [00:10:32]: Like literally you could just, you know, add two things. For example, the numbers come back. They're not the numbers that they're supposed to be.JONATHAN [00:10:40]: I think it's important to say at this stage, just because like it, I think it goes without saying for Josh and I, but it's worth saying here, this isn't to say that like anything is wrong with us. It's not like NVIDIA did a bad job or, you know, Mellanox did a bad job or the like the server builder, the data center operator, the cloud provider, like the million other parties that are involved in building this. We are running these insane chips that are huge and complicated and built on tiny transistors at insane frequencies with insane heat in data centers that for the most part, were not built remotely for this kind of power or heat and have been retrofitted for this. Like failures happen on a good day with normal CPUs. And this is not a good day and not a normal CPU for the most part. It's fun to joke about all the weird things we see. 
This is not to say anybody's done anything wrong. This is just kind of part and parcel of working on a massive cluster running at multiple megawatts of power at a time.SWYX [00:11:32]: It's crazy. Yeah.JONATHAN [00:11:33]: So optical cables, like all sorts, like everything.SWYX [00:11:37]: I'll take the opportunity to start going to the sort of infra piece. There's just like a description of the infra just to give people a sense of what we talk about when we talk about massive clusters. So I'm just going to read off the blog post here. This post is about one cluster that has 4,092 H100 GPUs spread across 511 computers. They use unified fabric manager nodes, which manage the InfiniBand network. And you talk a little bit about your networking. Is there anything unusual about this setup that you'll call out to people?JOSH [00:12:03]: Yeah, actually this particular cluster is a little bit non-standard. The normal, like vanilla setup for these large clusters as vanilla as it can be is what's normally like a 127 node cluster. So closer to like 1024 GPUs instead of 4,000. Here we have a larger cluster. As you start to get into the larger clusters, the networking becomes a little bit more custom. It's a little bit more, it's a little bit trickier. It's a little bit more difficult to get these things to all be able to talk to each other at the same speed. And so this has, in this particular case, this is a three tier network architecture instead of two tiers, kind of the normal one. So most of the clusters are a little bit smaller. As you get to even larger scales, then this becomes even much more complicated,SWYX [00:12:43]: much more expensive.JOSH [00:12:43]: So we chose this particular scale, kind of knowing our own workloads and kind of what we wanted to do. This was kind of the right size for us. But yeah, I think it's not exactly vanilla already. It's already getting into kind of the custom territory.SWYX [00:12:54]: So my understanding is that there, and is there any part of this that comes with the Voltage Park deal that you guys had? Is that part of the hardware that you got from the deal with them?JOSH [00:13:04]: Yeah, so we worked really closely with Voltage Park to set up all their clusters and infrastructure and everything and kind of decide even like what to order, how should the networking work? Like we were very involved in kind of the construction and bring up of this. And that's what this post is about, is about that process of like bringing up all these, there's like different clusters in different places of different scales. So in this particular post, we're talking about this one 4096 GPU, but there are other clusters that they have as well. And we were very closely involved with figuring out the exact architecture and kind of the trade-offs that go along with picking, you know, those exact components. You really don't want to like place the wrong order because it takes months to get it and it's very expensive. So yeah, we were happy to help out with that.JONATHAN [00:13:43]: And then your bit of good cables get stolen.SWYX [00:13:44]: Yeah, yeah, exactly.JOSH [00:13:47]: We wanted to make sure that we ended up with compute that would work for us and that would also work for their other customers. And so we kind of helped design something so that we would get exactly what we were looking for. 
We knew that these kinds of details would be super important and that getting down to the level of the hardware and like having these good scripts and everything was going to be a core part of like actually getting this to work. I'm very glad that we did that. I don't think that most companies kind of take that full stack approach, but for us, it certainly paid off.SWYX [00:14:12]: Yeah, it's basically sort of built to spec. It's interesting that relationship because you usually, for the rest of us who don't operate at your scale, we take whatever we can get from cloud providers, but you are basically co-designing from the single machine up. And you described that a little bit. Do you want to take us through the process that you described here?JOSH [00:14:27]: Yeah, so for the actual, like the blog post and kind of bringing these machines online.SWYX [00:14:32]: Yeah.JOSH [00:14:32]: So yeah, I think the process, as we have it broken down in the blog post, there's kind of a few different layers. First is like getting the individual machines to work at all and then getting the machines to actually be able to talk to each other. So getting the InfiniBand networking to work and then getting to a point where, you know, not just the machines are working and they can talk to each other, but everything is actually working correctly. There's a big gap between like it's working at all to it's working perfectly correctly. And then after you have all this stuff working perfectly correctly, nice and healthy, then now you get into kind of the software data, like training issues. And then after that, you're still not done. Like now, even once you're training at full speed, things are going to fail over time. Things are going to change. There's going to be new, you know, firmware updates. Like how do you kind of deal with this change and flux over time without going crazySWYX [00:15:16]: and pulling your hair out,JOSH [00:15:16]: trying to like reproduce things or understand why there were regressions. And so there's a lot of work to kind of automate the infrastructure tooling as well. And kind of the first step, like bringing these things online in the first place, you know, you have hundreds of machines at this point. So you don't necessarily want to be like walking around with like a CD-ROM or a USB drive, like plugging it in with your keyboard, like hitting next, next, next on the OS install. That's not how this works. You do that for one machine. And then you use, we use this thing called Metal as a Service to bring up all the other machines. So it's a kind of server that can kind of install the operating system on these other machines. So most like when you're talking about these machines, like each machine is, you know, on the order of hundreds of thousands of dollars. So they usually come with a kind of out-of-band management interface as well. So they don't, they have their InfiniBand networking. They have their normal 100 gigabit per second Ethernet networking. These are like dual, redundant, et cetera. And then you also have this extra out-of-band management network. So you can log in and you can see like the boot screen or you can see the blue screen of death. You can like get in there and actually see what was wrong, which is pretty fun. And it makes it like possible to automate a lot of this work. So the beginning of that, and the blog post goes into much more detail about like exactly how we set these up and kind of the other errors that we ran into. 
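To give a flavor of what a fleet bring-up check can look like, here is a hedged sketch (not one of Imbue's released scripts): after the OS install, you want every node to report the GPU count you paid for before it goes anywhere near a training job. The host names, expected GPU count, and the use of plain SSH plus nvidia-smi are all assumptions for illustration; the real checks would go through the management network and inventory system.

```python
# Hedged sketch of a fleet bring-up check, not Imbue's released tooling.
# Assumes passwordless SSH to each node and nvidia-smi installed there.
import subprocess

HOSTS = [f"node-{i:03d}" for i in range(1, 512)]  # hypothetical inventory
EXPECTED_GPUS = 8

def gpu_count(host: str) -> int:
    """Count the GPUs the driver can see on a freshly provisioned node."""
    result = subprocess.run(
        ["ssh", host, "nvidia-smi --query-gpu=name --format=csv,noheader"],
        capture_output=True, text=True, timeout=60,
    )
    if result.returncode != 0:
        return 0  # unreachable node or driver problem counts as a failure
    return sum(1 for line in result.stdout.splitlines() if line.strip())

for host in HOSTS:
    n = gpu_count(host)
    if n != EXPECTED_GPUS:
        print(f"{host}: FAIL ({n}/{EXPECTED_GPUS} GPUs visible)")
```

The real value, as Josh describes next, is that checks like this run automatically and route each failure back to whoever owns that part of the process.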
When you're bringing these online, you'll definitely have failures. Even if they all worked in the factory, they get shipped, some parts come loose, something fails, something goes wrong. So when you're bringing them online, there'll be some that don't quite work for all sorts of reasons. As you start to be working with machines at this scale, like if something happens one in a thousand times, you're like pretty likely to see it. And so you can get pretty rare, weird things, especially since we had fairly early builds and fairly early versions of this hardware. Like these are some of the like first machines that were ever produced, some of the first GPUs. So you've got some extra special things there. We definitely worked with Dell, for example, on making fixes in the firmware level to be like, okay, like this thing is wrong. Like we need to update this at the firmware to like actually fix this particular thing. So we worked pretty closely with Dell and Nvidia. Yeah, that's what I'm saying. Like this stuff gets complicated. And the thing is like, you know, taking a step back, the whole reason we're doing this, right, is that we knew that this was going to be complicated. There would be these kinds of failures. And if we're just using, you know, AWS or some other cloud provider, these errors are still gonna be there and you're gonna have no way to know and no way to debug this and no way to diagnose what's going wrong. And so we would much rather be able to like call up Dell and say, hey, this isn't working. And they're like, yep, okay, cool. Let's debug it together. Oh, I see. Yeah, cool. We'll ship a firmware update and actually fix this for you. That was a much better experience than like, great, just magically fails. I guess we restart and hope that that machine goes away. Like that's not a very good place to be. So yeah, that's kind of the first place is getting to a place where like GPU training is working on your single node machines. You can observe stuff. We have tons of tooling around like, you know, Prometheus and all sorts of other tools for understanding what's going on in these machines because you don't want to be like logging into each one and looking at the temperature or something you really need to have tooling to collect all these metrics, et cetera. Unfortunately, all of the scripts that we have for this are like for this entire cluster and for all this infrastructure are a little bit like special purpose for our particular thing. So it's not that every script that we have, it's not that you can just like take this and plug this in. Even if we did open source all the tooling that we have, you'd still have to do like a lot of work to open source it. What we are releasing is as many of the things that we can that are going to be useful for other people. You're still going to have to have some way of kind of managing these things, making your own like logging aggregators, et cetera, et cetera. So that's kind of bringing them up to the like, you know, the single nodes that are working. From there, it goes into, I'm happy to keep going if you want. Well, I just want to leave the opportunity for JohnSWYX [00:18:53]: to comment if there's anything that's different from how he runs things.JONATHAN [00:18:57]: Oh, I mean, all I'll say is I'll endorse this and say this s**t is hard. Like this is really, really hard. And, you know, I have special props to, you know, the folks at Imbue because they were building this from the ground up. 
You know, at Databricks and at Mosaic, we typically work with cloud providers because some of this stuff is just, there's too much to handle. It's complicated. There's a lot to deal with. And this doesn't even get into things like physical security, you know, securing power if you're the data center operator. Like this gets infinitely complicated and you have to abstract somewhere. Like, you know, and then you get to the folks who are literally building their own custom chips and like, good God.SWYX [00:19:36]: Like, oh my God, that's, you know,JONATHAN [00:19:38]: if you're one of those folks, you're having, you know, pour one out for the infra people at some of the AI chip startups who are having a really, really interesting time right now. But this stuff is really hard. And I don't think we talk about it much because there's so many other things that are hard. But the other hard things, I think everybody's becoming pretty familiar with at this point. This is something that I don't think there's ever really been a comprehensive discussion of, at least not that I've seen.SWYX [00:20:00]: Yeah, so my impression is that you guys, Mosaic, have your own software for sort of spinning up and down machines, just like Imbue had to build. But Imbue probably, it sounds like Imbue, you guys went fuller stack. I don't know how to describe it. Like Mosaic is not working with Dell on like their firmware.JONATHAN [00:20:21]: No, no, we're typically working with like, you know, pick your cloud provider on their Dell firmware or what have you. Like, it's kind of, I think one of the things, I don't know, Josh, you can correct me on this. It's kind of impossible if you're doing training to not go all the way through the entire stack, regardless of what happens. Like somehow I'm still chatting with cloud providers about power contracts, even though the whole point of dealing with the cloud provider is not to have to think about power contracts. Somehow I'm still asking them about which InfiniBand provider they used this time to see if this is part of the bad batch of cables I encountered on that cloud provider or what have you. Or like, we're still talking about a firmware update from pick your provider. You can't not do this. It's convenient that they have data center staff who are worrying about what to send back to which provider when, and they have people who can go and wait for the InfiniBand cables so they don't get stolen outside. But, you know, it's kind of, it's impossible not to really go full stack if you're thinking about the infrastructure at all. I don't know, Josh, correct me. No, I think that's right.JOSH [00:21:17]: That's what we expected from the beginning as well, is that we would inevitably have to get into the details here. And I'm glad that we kind of just planned for it. I think it made it a lot easier from our perspective to have direct control over this. Instead of having to go to the cloud provider that goes to the data center, that goes to the supplier, we could just go direct to NVIDIA or DellSWYX [00:21:37]: or the data center,JOSH [00:21:37]: whoever was responsible and be like, hey, this thing needs to change. And they're like, oh, okay. Yeah, that is our responsibility. Great, we can fix that. So it was just a lot easier for us to fix these bugs than if we had to go through an extra layer of email.SWYX [00:21:48]: Something we discussed in the pre-show was that you had a rule of thumb for your cluster of reliability. 
You say here in the post, by and large, you expect around 3% of your machines to break every week. So you're basically going to churn through all your machines in a year.JOSH [00:22:04]: As it says in the post. So that would be true if it was a uniform failure like that. But as it says in the post, it's usually these kind of problematic nodes. And to be clear, that is the number that we've heard from other people is like they're having about 3%. I don't think we're experiencing failure rates that are that high. I think ours is actually quite a bit lower than that, probably because we've taken the time to like dig into a large, maybe larger number than we should have of these failures and get to the root cause of it and be like, oh, okay, like that's exactly what's going wrong.SWYX [00:22:33]: How do we fix this?JOSH [00:22:33]: How do we prevent this from happening? How do we make automated checks for this so that if it does happen, it just goes back to whoever owns that particular part of the process and they can fix it immediately.SWYX [00:22:43]: And that's part of what you're also open sourcing, which is the health checks, right? You got the NIC health checks, GPU health check, disk space health check, Docker, dmesg. I don't know what that is.JOSH [00:22:52]: That one is just a lot of stuff.SWYX [00:22:54]: Yeah.JOSH [00:22:55]: That one is one where we realized that actually like when these machines boot, sometimes they wouldn't actually boot cleanly all the way. Or when they rebooted, they had problems that they didn't have when they were working before, which was kind of frustrating. Like usually if you restart your computer,SWYX [00:23:08]: it gets better.JOSH [00:23:08]: Here you restart. It did not get better.SWYX [00:23:10]: It got worse.JOSH [00:23:10]: That was very frustrating. So this health check looks at every particular line we've ever seen from the boot, like in dmesg, like every single log line that your computer emitsSWYX [00:23:21]: and says like,JOSH [00:23:21]: have we ever seen this before?SWYX [00:23:23]: Is this expected?JOSH [00:23:23]: Is this in the right order? Or is there something out of place? If there's anything out of place, let me say, okay, great. Like now it goes into this, like longer, more triage list of like, all right, great. Like, is this acceptable?SWYX [00:23:33]: Should we flag this?JOSH [00:23:33]: Like, should someone take a look at this? So we're looking down at a very, very granular detail level, what's happening on these computers to make sure that nothing is out of place. And that's critical because without that, if you're running your training, as Jonathan said, and this thing is slow, like what are you supposed to do? Right?SWYX [00:23:49]: Like you really,JOSH [00:23:49]: you really want to be very certain that like all 4,000 of these GPUs are working like they're supposed to.SWYX [00:23:54]: We know that.JOSH [00:23:54]: And so if it's slow, it's because like we messed up the config or something else and not because of this earlier thing that's like really hard to detect in software later.JONATHAN [00:24:01]: Yeah. I think the, I'm just curious to ask,SWYX [00:24:03]: like, you know,JONATHAN [00:24:03]: suppose you were to set up another, let's say another H100 cluster and it were at a different data center. And instead of the vendor being Dell, it was Supermicro or what have you. How much of this would be repeatable? And how much of this would you have to redo? 
I, you know, I genuinely don't know.SWYX [00:24:18]: A decent amount.JOSH [00:24:19]: I think it would go a lot faster the second time. I think there's lots of learnings that we had. And also the blog post,SWYX [00:24:24]: you know, yes,JOSH [00:24:24]: we are releasing the health checks, releasing some scripts, but a lot of the valuable stuff is also in the blog post itself, in the details and kind of the, you know, the learnings that we've had and the sort of errors that we run into. We tried to as much as possible surface those to other peopleSWYX [00:24:36]: could learn from thoseJOSH [00:24:36]: and avoid the same mistakes or failures as well. But I think it would go a lot faster.SWYX [00:24:41]: Although, yes,JOSH [00:24:41]: there would certainly be some things that'd be a little bit different. I mean, there'd probably be different CPUsSWYX [00:24:46]: or whatever,JOSH [00:24:46]: but I think a lot of that stuff is less,SWYX [00:24:49]: it's less,JOSH [00:24:49]: that's the like, that's less variable. I think most of it would apply the second time around. Although I'm sure next timeSWYX [00:24:56]: we're building one,JOSH [00:24:56]: it'll probably be, you know, at a scale that's 10x as big with a different chip or something like this.SWYX [00:25:00]: And then who knows?JOSH [00:25:01]: Yeah, with Kinect X8,JONATHAN [00:25:02]: that will have its own fun behavior and all that good stuff. Yeah.SWYX [00:25:06]: Perhaps there's something that people don't discuss about, and you don't even talk about this in the blog, but I always wonder is what is the timeline that's like kind of reasonable for this amount of work, at least the initial stages? And also what does the team composition look like for setting up a cluster, right? Like what are the mix of skills that you typically would require to get all this going?JOSH [00:25:27]: I'm, I can't really speak to typical. One thing I am very proud of is how much we accomplished with such a ridiculously small team. Like our infrastructure team is like, you know, fluctuates from week to week, depending on like how many things are on fire and how much we need to build. But it's like between like three and six people, like it's small. It's not like some huge team of like tons and tons of engineers. But those people are very, very good at what they do. And so that has allowed us to get a lot of mileage out of out of these things. I think it's not that we're building everything, right? It's not that three to six people build this whole thing. I definitely want to like, you know, say thanks very much to Dell and H5 and NVIDIA and the other people that have done a lot of the work, like to bring up this cluster, you know, with 4000 GPUs and three tier networking, networking architecture, you have 12,000 cables. So that's 24,000 things that need to be plugged in. Like that's just a lot of stuff to plug in, right? And you don't want to mess it up. Like each one needs to be done correctly. Like it's a little bit loose. Like it doesn't really work.SWYX [00:26:23]: If you break it,JOSH [00:26:23]: you need to replace it. Like there's a lot of workSWYX [00:26:26]: that goes into this.JOSH [00:26:27]: Yeah.SWYX [00:26:28]: And then, you know,JOSH [00:26:28]: that's just like that's it. That's if you were to do everything right the first time.SWYX [00:26:32]: And if you didn'tJOSH [00:26:32]: have to fix anything. 
But inevitably, you know, you will have to replace something, which means like taking all the wires out, pulling the thing out, taking all the GPUs out, going and fixing some cable, putting it all back correctly, putting it back in, doing this every time. So there were a lot of people at Dell, NVIDIA and at H5 that all helped a ton with this stuff. I don't know the exact size of the Dell team. It also fluctuated over time.SWYX [00:26:55]: Yeah, excellent. And then, you know, you so you have all the hardware set up and now you're firing it up for a single node. There's a long description that you guys have about just like monitoring the MFU, right? And what each situation might look might be indicative of. One of the most interesting things to me that I saw from here is like, you know, if training immediately starts off at 60 to 80% MFU, something's wrong.SWYX [00:27:24]: But like, you know, like what what are like, you know, some anecdotes or, you know, notable scenarios here that you might you might call out as maybe counterintuitive or super interesting.JOSH [00:27:36]: There's just so many of them. I mean, one of them, which I think is probably pretty common, like common knowledge by this point. But like we did have a sort of likeSWYX [00:27:46]: which one was this exactly?JOSH [00:27:47]: I think for the MFU, like gradually getting worse over time. I think that one, when we saw that the first time we were like, what the heck is going on? Like, why does it get just like a little bit worse? This is so strange. Like, what is it getting lazy or tired or something? Like, is it heat? Like what's going on? And in this particular case, it was memory fragmentation. Because you have hundreds of machines, they're doing garbage collection slightly different times. And then they get slightly further apart and slightly more and more jittered until eventually they're all happening kind of at random times. And just like really messing up each one of your steps. So you just turn off garbage collection and call it a day, basically,SWYX [00:28:20]: to be honest.JOSH [00:28:20]: There's other things you can do if you want to be a little bit more sophisticated about it. But you can also just manuallyJONATHAN [00:28:25]: have it all garbage collect on some interval. Like that's what we've done. We just have a garbage collection callback that just runs. But I've seen the exact same thing.JOSH [00:28:33]: Yeah, yeah, exactly. So I thought that one was kind of funny. And we did trace that one down and look and we did find the actual call. Like, again, this goes to like having good tools. So we had really good tools where we could look at a bunch of like actual traces in C and be like, OK, cool. This is the thing that's taking a lot of time. Or like, you know, this is the thing that doesn't quite line up here. Like, oh, I guess it's garbage collection. OK, cool.SWYX [00:28:52]: Interesting.JOSH [00:28:52]: Yeah, let's just try taking it off.SWYX [00:28:54]: OK, great.JOSH [00:28:54]: That's what it was. Now we can fix it. So for each of them, like basically bugs are not hard if you have good tools. But if you don't have good tools, bugs can be very, very hard. So similarly for like heat, another thing that we saw was like, oh, you know, the CPU is getting throttled. OK, well, it's easy to see if you're monitoring the CPU throttling or monitoring the heat. If you're not monitoring that, it's really hard to know why it's just suddenly one of them is going slower. 
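The garbage-collection fix Josh and Jonathan describe above is simple to picture in code. A minimal sketch, assuming a standard PyTorch-style training loop (the `train_step` and `dataloader` names are placeholders): disable Python's automatic collector so ranks stop collecting at slightly different, drifting moments, and collect explicitly on a fixed step interval instead.

```python
import gc

GC_EVERY_N_STEPS = 500  # arbitrary interval; tune for your job

# Disable automatic collection so each rank doesn't pause at its own
# slightly different moment and jitter the whole synchronous step.
gc.disable()

for step, batch in enumerate(dataloader):   # placeholder dataloader
    loss = train_step(batch)                 # placeholder training step
    if step % GC_EVERY_N_STEPS == 0:
        # Every rank reaches this at the same point in the loop, so the
        # pause is synchronized instead of drifting apart over time.
        gc.collect()
```

As Josh says, the hard part isn't the fix, it's having good enough traces and monitoring to notice that slowly decaying MFU was garbage collection in the first place.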
I noticed also in the pieceSWYX [00:29:17]: that you mentioned FSDP with ZeRO-3. Actually, we met, I went to ICLR and Guanhua from the DeepSpeed team was there presenting ZeRO++. I was wondering if you want to make any call outs to, you know, particular open source or open library or open whatever implementation teams that were super helpful in your process. I think we ended up actuallyJOSH [00:29:39]: pulling from a whole bunch of different ones to pull things in into our own particular pipeline. So we use things from NVIDIA's, you know, Megatron stuff. We use stuff from probably DeepSpeed. I think we pulled in a bunch of different pieces from a bunch of different places. So it was really nice to see all these working open source like examples. I think I really appreciate all the effort that has gone into actually tuning these things because you can tune them, but it's a lot of work to like tune this stuff and do all this stuff from scratch. It's really nice to have like a working example. I think those are probably the two biggest ones, DeepSpeed and Megatron alone, but there are probably other ones as well.SWYX [00:30:13]: Is there a particular thing in the ecosystem where you would call out as like, you know, there should be something here that is open source, but like it's not really, it's like everyone kind of builds it on their own. I want to say something with the file system because everyone talks about the file system eventually.JOSH [00:30:28]: The file system actually was,SWYX [00:30:30]: I mean, we did somethingJOSH [00:30:31]: kind of dumb there. Like we have our own sort of local mirror so that we can, you know, like a crappy version of S3SWYX [00:30:38]: that's local,JOSH [00:30:38]: but it's just a pretty simple script, right?SWYX [00:30:41]: Like I think we run likeJOSH [00:30:41]: a little web server that just like serves files and then, you know, it can upload themSWYX [00:30:45]: and download them.JOSH [00:30:45]: Okay, great. And part of the reason we did that is that our internet connectionSWYX [00:30:50]: in the beginningJOSH [00:30:50]: was not the like full speedSWYX [00:30:52]: one that we wouldJOSH [00:30:52]: eventually have. And so we are a little bit more kind of bottlenecked in terms of internet bandwidth. And so we had this. I think we looked at a bunch of services out there like MinIO and some other ones, but a lot of these like come with a lot of extra overhead and maintenance. And since we already have so much infrastructureSWYX [00:31:09]: to deal with,JOSH [00:31:09]: we kind of didn't want to, you know, bring in a whole other like cloud provider, virtualize something, something.SWYX [00:31:14]: We just wanted something simple.JOSH [00:31:14]: So we went with that, which has been quite helpful. Like our toolsSWYX [00:31:19]: are usually quite simple.JOSH [00:31:19]: It's like Bash and Python and SSH and Docker. Like we'd like to keep things simple so that's easier to debug, like less layers of infrastructure, less layers of abstraction, make it a lot easier to work with. Like we don't use Kubernetes,SWYX [00:31:30]: for example,JOSH [00:31:30]: and we just directly launch these things. And it's just been much easier to debug this way. One tool actually that does come into mind that I will call out is Kraken from Uber. That was great. We love that tool. We were a little bit skeptical. What is it?SWYX [00:31:44]: I'm sorry. 
Yeah.JOSH [00:31:45]: So Kraken is this, yeah, it's a distributed like Docker registry, basically, that uses BitTorrent to like transfer things between the machines in a sort of nice optimal way. Like in the very beginning, the naive way is like you have this one Docker registry, which was outside of the cluster. So every time we change an image, you know, there's many gigabytes that each of the 500 machines needs to download.SWYX [00:32:07]: So that just takesJOSH [00:32:07]: a really long time. So what this thing does is like just one of them downloads it and then like they all sort of broadcast all the pieces to each other. And it was just like a really nice, fast way of getting these images down. And it was very robust.SWYX [00:32:19]: Like there's a lotJOSH [00:32:19]: going on under the hood, but I think it's a pretty cool tool that we haven't really had any bugs with it at all. Amazing.SWYX [00:32:26]: Yeah. I mean, that's all my questions, I guess, for the infra piece. I don't know if, John, you had something that you were sort of burning to ask or.JONATHAN [00:32:33]: No, all I can say is just sameSWYX [00:32:36]: in a lot of places, like, you know, been there, done that,JONATHAN [00:32:38]: seeing this plus one. I think the one big difference, you know, perhaps in philosophies is we've tried to basically standardize on as much commodity stuff as possible, just because, you know, I think the reason I asked about trying to do thisSWYX [00:32:50]: on multiple differentJONATHAN [00:32:50]: pieces of infrastructure is like, I think we're running on like six or seven different clouds right now. And everybody has done something slightly different. And my gosh, the little differences add up as you know, you've seen. And so, you know,SWYX [00:33:04]: our philosophy has been like, whatever the hellJONATHAN [00:33:05]: we can standardize, please let's standardize it. Like vanilla off the shelf FSDP.SWYX [00:33:10]: And like, you know,JONATHAN [00:33:10]: we wrote our own data loader, but we've tried to make that as much of a standard as we can across our infrastructure and in Databricks, because things just start getting really complicatedSWYX [00:33:18]: or like we useJONATHAN [00:33:18]: Kubernetes extensively because it at least gives us a uniform set of APIs. Like that's our hardware abstraction layer to a certain extent for everything else. So it's just, you know, a difference in philosophy there. But otherwise, like, yeah, this stuff is really, really hard. And I feel like we take for granted how much of this, you know, is done for us when you go and you just query ChatGPT, for example. Like, oh my God, everything going on underneath that, you know, it's kind of a miracle that the machines boot up, let alone that you can like query a giant language model that's probably doing inference across multiple machines and was trained across thousands of machines. Like, you know, minor miracle.SWYX [00:33:54]: Yeah, it is an awesome amount of power that we invoke with a single API call that we take for granted these days. It's absurd. Yeah, I mean, like Kubernetes, like that point about Kubernetes, I will say as a former AWS employee, like it seems like it would be ideal for Imbue to at some point make it more abstracted or agnostic because you're going to want to, you know, replicate your setup. We do have our ownJOSH [00:34:19]: sort of replacement. It's just a much simpler version of Kubernetes. Kubernetes is really designed for running services, not for running experiments. 
Like that's not its like main architecture. And so for us, like we have everything that's like, cool, you're going to run an experiment. So you want it to run to completion, right?SWYX [00:34:34]: OK, great.JOSH [00:34:34]: Like the primitives are sort of built around a slightly different style. And that makes it a lot easier, like just a lot simpler to fit that the nature of like these machines are going to disappear. They will need to be rebooted for infrastructure upgrades. They will like something will happen to the GPUs. Failure is like baked into this as like a core part of our infrastructure. So it's not that we don't have an abstraction. It's that it's a sort of simpler, more tailored abstraction for the particular work that we're doing.JONATHAN [00:34:58]: Yeah, I think it all depends on what your goals are. And like, I think the challenge in a lot of the deep learning stuff right now is that people are trying to like, people often build things that are more complicated than necessary to get the job done. And the complication is the enemy of everything. You know, don't use a fancier parallelism strategy than you have to. Don't use a fancier set of libraries than you have to.SWYX [00:35:18]: Don't do anythingJONATHAN [00:35:18]: that you don't have to do because it's hard enough as it is. Like, don't overcomplicateSWYX [00:35:23]: your own life.JONATHAN [00:35:23]: Don't try to bring in more tools or more fancy architecture tweaks if you absolutely don't have to.SWYX [00:35:29]: Like getting to the minimumJONATHAN [00:35:30]: necessary to get the job done. And it's really tempting to want to try to use everything. So like, I totally understand that one.SWYX [00:35:37]: I think the last piece I'll maybe call out is that I'm just going to weave this in just because I see the opportunity to do it. Are there any infrastructure shifts that need to be, that need to rise because of changing architecture? So I think, for example,SWYX [00:35:57]: you're announcing a dense model, a 70B dense model, whereas John just worked on DBRX and the text-to-image model, which presumably has different bottlenecks.JONATHAN [00:36:10]: That's correct for us. You know, we train both dense and mixture of expert models. The one we happened to, you know, kind of get permission to open source was a mixture of expert model. And those models are very demanding when it comes to network bandwidth, at least if you're training them in kind of FSDP ZeRO-3 style, where there's just a lot of parameters getting shuffled back and forth. And your ratio of kind of compute to amount of data that you have to shuffle back and forth becomes a lot worse because you're now, you know, you're only using a fraction of the parameters for every token instead of all the parameters. And so we had to really push the envelope on getting all the stuff to the right places on time. And so actually the networking part of DBRX was the single hardest thing, I think, of the entire process. Just getting MoE training working at scale across a big cluster. We still managed to, I think, do it all with commodity parts, which was very exciting. You know, we were using FSDP and we eventually used HSDP so that we could have HSDP as a version of FSDP where you have multiple smaller replicas and you're doing data parallel within those replicas. And that helped a lot with network latency issues that we were running into just because we were transmitting so much data, you know, for every single part of the process. 
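For readers who want to see what the FSDP-to-HSDP move looks like in practice, here is a hedged sketch using PyTorch's built-in sharding strategies. This is generic PyTorch, not the DBRX training code; the toy model and the process-group setup are assumptions for illustration.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes a torchrun multi-node launch has set RANK / WORLD_SIZE etc.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer().cuda()  # toy stand-in for a real model

# FULL_SHARD is the ZeRO-3 style: parameters are sharded across every rank,
# so all-gathers cross the slow inter-node links. HYBRID_SHARD shards only
# within a smaller replica group (typically one node) and runs plain data
# parallelism across groups, cutting the cross-node traffic Jonathan describes.
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```

The trade-off is memory: each replica group has to hold a full copy of the sharded state, which is why HSDP tends to show up once the model fits per node but the network is the bottleneck.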
Jonathan: I think it was actually instructive, personally, for how Google designs their hardware and software together. They're training, as far as I understand, using kind of a ZeRO-3 style of training, and have been for a while. They also train mixture-of-experts models. TPUs have a very different network-bandwidth-to-compute ratio. They have a lot more bandwidth, just objectively. And TPUs, per chip, tend to be a little bit less compute-intensive and have a little bit less memory. It's just a different design choice. So the ratio of flops to bandwidth is very different, and that means it's much easier for Google to be able to pull off some of this stuff. They also have an interesting torus-style network architecture, torus-style as in the literal network architecture, not the model, but the network.

Swyx [00:38:02]: Is this the sort of block attention? I forgot what you call it. So this is just more of the...

Jonathan [00:38:07]: Yeah, this is not the ring attention, but these are the ring all-reduces. You have three different dimensions of rings, because they kind of put you in these three-dimensional toruses, from what I understand. And so Google's infrastructure, in some sense, is kind of, I wouldn't say built for this, but maybe the way that Google trains models is built for the slightly different bit of infrastructure they have. And it's kind of neat to think about that. One thing that I think NVIDIA announced for both the GH200 and the GB200 is this hybrid networking where you'll have blocks of NVLink-networked chips. I think for the GB200, it's groups of 72 GPUs that will all have NVLink to each other, so higher bandwidth, and then you'll have normal networking of some kind, InfiniBand or RoCE or what have you, between these blocks. And that's a change due to the fact that it's hard to build really high-bandwidth networks over very large groups, but it is now blocked networking. And you have to think about how you architect your model and your parallelism differently. You also have to think about fault tolerance differently, because it now matters where you lose a GPU, whereas it didn't before. So it's just all really interesting and really fun, speaking personally, but it's going to mean new nightmares when we all move to that generation and have to think about new versions of these problems.

Josh [00:39:20]: As you go up to larger scales, it gets quite different. Right now, if you're experiencing, let's say, a GPU failure every day, that's fine: just restart. If you make your thing 24 times as big, now it's once an hour. Now it stops being quite as easy to just restart, right? So now you have to bake in this sort of redundancy that you didn't have before. So I think as you go up in scale, you end up running into a lot of really interesting problems that also inform the actual design.

Swyx [00:39:52]: Yeah, I mean, as an orchestration guy, this is why I always emphasize very cheap storage or very fast storage, so you can checkpoint more. But that's probably not the best solution for fast training.

Jonathan [00:40:05]: Which works fine when you're doing language, and then you move to vision or video.
And then you have multi-petabyte datasets, and getting cheap, fast, multi-petabyte storage starts to bite. I've certainly encountered issues where the literal data center where my GPUs were did not have enough object store to fit the datasets that people wanted to bring into that data center, from whichever users were trying to bring them in. And then you get to a whole different world of hurt, where you have to keep your data in a different region because the region is just out of storage. So things get fun really fast.

Swyx [00:40:39]: Speaking of vision, Josh, actually, you know, Imbue is an agents company, but you're announcing a text-only model. Where does the vision side come in?

Josh [00:40:49]: I think we've actually done a lot of work in the past, and people can see our blog posts about self-supervised learning and some other vision-related stuff, so we're very familiar with that area. But I think our main focus right now is, as we say, coding and reasoning. And there's certainly a visual component to some problems, but it's not necessarily required for all problems. And actually, we found that for most of the code-writing and reasoning problems that we care about, the visual part isn't really a hugely important part of it. Sometimes, if you really need to, you can maybe describe the thing. There are other multimodal models that you can use off the shelf to plug in for those particular pieces that you need, right? Like if something is driving a browser or whatever, you can sometimes get away with not having to have that baked into the original model. So our focus, you know, in one sense, we kind of do a lot across the stack. We're working on our own infrastructure and pre-training and RL and fine-tuning and products and everything. But in another sense, we're very narrowly focused on the application side. So all of the stuff across the stack is going toward a very particular purpose, and that particular purpose right now doesn't really need vision. So we think that people are going to make all sorts of really cool image models, like Jonathan, right? And all sorts of interesting multimodal models in the future. We'll let them go do that. That's great. We'll take advantage of that, partner with those people in the future. And right now we're really focused on the core reasoning and coding capabilities and aspects of the model.

Swyx [00:42:14]: I wanted to go into CARBS, since that's kind of the next layer of the stack. We talked about CARBS in the first episode with Kanjun, because you actually had a blog post about it, like, a couple of years ago. Maybe let's introduce it.

Jonathan [00:42:26]: Has that been a couple of years now?

Josh [00:42:28]: No, it must have been at least one year. Hopefully it's not multiple years.

Swyx [00:42:32]: Sorry, I'm counting in AI time. Yeah, yeah.

Jonathan [00:42:35]: Yeah, I was going to say, you're making me feel really old right now.

Swyx [00:42:39]: I count everything before the Generally Intelligent rename as, you know, prehistory. And now sort of modernity, right?
So I actually thought CARBS was more about hyperparameter optimization, in the sense of parameter search, hyperparameter search. Whereas when you introduced it, especially in this blog post, it's more about scaling laws and predictability: are we sort of in the right ballpark before we scale things up? Maybe recount the history of CARBS.

Josh [00:43:10]: Yeah, so it really is a little bit of both. So CARBS, it's maybe a backronym, but it stands for cost-aware Pareto-region Bayesian search. So that's technically how it works, but also, you know, we like pastries and stuff.

Swyx [00:43:26]: So great, why not?

Josh [00:43:29]: But the point is that it's a cost-aware hyperparameter tuner. With most hyperparameter tuners, you kind of say: OK, here's this objective function, I want you to make this number as big as possible or as small as possible, whichever direction you want to go. So it'll try a bunch of different hyperparameters, a bunch of different configurations, to figure out how to tweak your network and architecture, et cetera, to get the best performance it possibly can. That's usually assuming that almost all of these hyperparameter configurations are, let's say, all going to use the same number of GPUs or the same number of nodes, so they're going to run for the same amount of time. You can do that, you can get a number out, and that's great. But what CARBS does is it says: OK, actually, what if we relax that constraint? What if we say, for each of these different points, we're going to model how expensive it will be to sample this configuration? So what if we train with just one hundredth of the data, how well can we do? What if we train with one tenth of the data? What if we train with all the data? That way you can understand, as we get more and more data, as we spend more and more compute, as we make a bigger and bigger network, how does performance change with these things, and how expensive is it to even explore this data point? So by doing that, we can see the scaling laws, not just the scaling laws from, like, the Chinchilla paper, but the scaling laws for all parameters. We can see how the number of layers changes with this, how the learning rate changes, how the various types of regularization change. So you can see these nice scaling laws, and as you're going across costs, how should this be changing as you're scaling up your model? So that, coupled with the kind of metric that we chose, which is a very precise way of measuring performance, allowed us to really home in on parameters that worked really well and to understand how we want to scale those up, especially as we're changing things about the network. Like, one of the things that we did is we used a custom tokenizer. As we change this tokenizer, it changes a bunch of other things about the model. So how should we scale up with this entirely new tokenizer? No one has ever made a model this large with this tokenizer before.
And so how do we want to change all these things? CARBS kind of shows you: look, as you change these parameters, these other ones are kind of dependent on them. These are the relationships between them. So you can better understand, OK, if I'm going to scale this up 10x or 100x, where do I want to be? You can only go so far. And so we did run, I think, maybe a 14B one or something like that to check. And so we had a bunch at, like, 1B or 14B, and then at 70B. I don't think we had, I think we just did one at 14B. So we get to check: oh, is this on the curve? Is this where we expect? It was right there. So then, great, go on to the next one.

Swyx [00:45:56]: Yeah, I mean, that makes a lot of sense. I wonder, so one of the key questions, and correct me if I'm wrong, but usually people do their search or their evals just based on loss. But you actually evaluate based on, you know, the sort of end-state evals that people might expect, like HellaSwag and LAMBADA, whatever. What is the norm here? Is there a norm?

Josh [00:46:20]: Yeah, I don't know if there's one, a hundred percent.

Swyx [00:46:21]: I don't know. I only see loss on most people's reports.

Josh [00:46:25]: I think it's easy to, well, loss is very nice because it's very precise. It will tell you very fine-grained differences between really small changes in your hyperparameters or network architecture. Whereas, especially at the smaller scales, if you're looking at accuracy, it's very noisy. It might be zero or a hundred, or fluctuating by 10 or 20 percentage points, which makes it really hard to tell whether that change actually meant anything. So our loss is sort of a combination of these two. Instead of saying, let's just look at perplexity, we say, let's look at perplexity on the tasks that we care about, for multiple-choice questions, effectively. So we're saying: yes, this is formulated as a multiple-choice question, and we're going to look at the loss, or perplexity, for this particular answer token. And that ends up being something that's both targeted to what you actually care about and also very precise. The nice thing about this, though, is that it's independent of the data that you train on. One thing that's annoying about perplexity, or about loss, is that as you change your dataset, this is really obnoxious, because now it fundamentally changes your loss, right? And so you can't tell how to tweak your dataset. But because we have this held-out evaluation data...
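To make the metric Josh describes concrete, here is a minimal sketch of scoring a multiple-choice question by the loss on the answer tokens only, conditioned on the question. It uses a small Hugging Face causal LM as a stand-in; the model choice, prompt, and helper function are illustrative assumptions, not Imbue's actual evaluation harness:

```python
# Score each candidate answer by the cross-entropy of *its* tokens, conditioned
# on the question. Lower loss means the model "prefers" that answer: precise
# like a loss, but targeted at the task like accuracy.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def answer_loss(question: str, answer: str) -> float:
    """Mean cross-entropy over the answer tokens only."""
    q_len = tok(question, return_tensors="pt").input_ids.shape[1]
    ids = tok(question + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Token t is predicted from the logits at position t-1; keep only answer positions.
    targets = ids[0, q_len:]
    preds = logits[0, q_len - 1 : -1]
    return F.cross_entropy(preds, targets).item()


question = "Q: What color is a clear daytime sky?\nA:"
choices = [" blue", " green", " purple"]
scores = {c: answer_loss(question, c) for c in choices}
print(min(scores, key=scores.get), scores)  # lowest-loss choice is the model's pick
```

Because only the answer tokens are scored, the number stays precise like a loss while being targeted at the task, and it does not shift just because the training data mix changes, which is the independence property Josh is pointing at.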